# PART I: Creating the Study's Datasets

# 0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip; you can use easy_install, too.) Also, consider using virtualenv instead of sudo for a cleaner installation. I also recommend running the code in an IPython Notebook.
* pip install --upgrade turicreate
* pip install --upgrade networkx
* pip install --upgrade pymongo
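
As a quick sanity check after installation, the following minimal sketch reports which of the packages listed above cannot be imported. It uses only the standard library; the helper name `missing_packages` is ours and is not part of the project.

```python
import importlib.util

# Import names of the packages installed with pip above.
REQUIRED_PACKAGES = ["turicreate", "networkx", "pymongo"]


def missing_packages(names):
    """Return the package names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If a package is reported as missing inside a virtualenv, check that the notebook kernel runs in the same environment in which pip was invoked.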



Please download the KDD Cup 2016 data and the project files from our GitHub repository. Throughout this research, we use the various constants that appear in consts.py. Please change DATASETS_AMINER_DIR, DATASETS_BASE_DIR, and SFRAMES_BASE_DIR to local directories where you can download the datasets and save the project's SFrames.
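
A minimal sketch of what such a configuration might look like is given here. The three constant names come from the text above; the directory layout, `STORAGE_ROOT`, and the `ensure_dirs` helper are hypothetical and should be adapted to your machine. The sketch uses `pathlib` because the notebook later combines these constants with the `/` operator.

```python
from pathlib import Path

# Hypothetical root directory; replace with a location on your own machine.
STORAGE_ROOT = Path.home() / "science_dynamics_data"

# Constant names as referenced in the text; the sub-directory names are assumptions.
DATASETS_BASE_DIR = STORAGE_ROOT / "MAG"
DATASETS_AMINER_DIR = STORAGE_ROOT / "AMiner"
SFRAMES_BASE_DIR = STORAGE_ROOT / "sframes"


def ensure_dirs(*dirs):
    """Create each directory (and its parents) if it does not exist yet."""
    for directory in dirs:
        directory.mkdir(parents=True, exist_ok=True)


# Example usage (creates the directories on disk):
# ensure_dirs(DATASETS_BASE_DIR, DATASETS_AMINER_DIR, SFRAMES_BASE_DIR)
```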

**Note: Creating the following SFrame requires considerable computation power for long periods.** 

In [None]:
%load_ext autoreload
%autoreload 2
%aimport
%matplotlib inline

In [None]:
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [None]:
from tqdm import tqdm_notebook as tqdm 

# 1. Creating the SFrames

In this study, we used the following datasets:
* [The Microsoft Academic KDD Cup 2016 dataset](https://kddcup2016.azurewebsites.net/Data) - The Microsoft Academic KDD Cup Graph dataset (referred to as the MAG 2016 dataset) contains data on over 126 million papers. The main advantage of this dataset is that it has undergone several preprocessing iterations of author entity matching (every author is identified by an ID) and paper deduplication. Additionally, the dataset matches papers to their fields of study and includes the hierarchical structure and connections between the various fields of study.  <br/>  Note: the link above is no longer active.

* [AMiner dataset](https://aminer.org/open-academic-graph) - The AMiner dataset contains information on over 154 million papers collected by the AMiner team. The dataset contains papers' abstracts, ISSNs, ISBNs, and details on each paper. <br/>  Note: there is currently also a V2 of the dataset, and it is unclear which version to download.

* [SJR dataset](http://www.scimagojr.com/journalrank.php) -  The SCImago Journal Rank open dataset (referred to as the SJR dataset) contains journal- and country-specific metric data starting from 1999. In this study, we used the SJR dataset to better understand how various journal metrics have changed over time. <br/>  How to download?


## 1.1 The Microsoft Academic Dataset

The first step is to convert the dataset text files into SFrame objects. The conversion code is located under the SFrames creator directory and is run as follows.

In [None]:
from ScienceDynamics.datasets.microsoft_academic_graph import MicrosoftAcademicGraph
from ScienceDynamics.config.configs import DATASETS_BASE_DIR
mag = MicrosoftAcademicGraph(DATASETS_BASE_DIR)

In [None]:
import pandas as pd

In [None]:
paper_author_affiliations = pd.read_csv("/storage/homedir/dima/.scidyn2/MAG/PaperAuthorAffiliations.txt.gz", sep="\t", names=["PaperId", "AuthorId", "AffiliationId", "AuthorSequenceNumber", "OriginalAuthor", "OriginalAffiliation"])

In [None]:
import numpy as np  # pd.np was removed in pandas 2.0
from turicreate import SFrame
sf = SFrame(paper_author_affiliations.replace({np.nan: None}))

In [None]:
sf.save("/storage/homedir/dima/.scidyn2/MAG/sframesPaperAuthorAffiliations.sframe")

In [None]:
!mv  /storage/homedir/dima/.scidyn2/MAG/sframesPaperAuthorAffiliations.sframe /storage/homedir/dima/.scidyn2/MAG/sframes/PaperAuthorAffiliations.sframe

In [None]:
["PaperId", "Rank", "Doi", "DocType", "PaperTitle", "OriginalTitle", "BookTitle", "Year", "Date",
                "Publisher", "JournalId", "ConferenceSeriesId", "ConferenceInstanceId", "Volume", "Issue", "FirstPage",
                "LastPage", "ReferenceCount", "CitationCount", "EstimatedCitation", "OriginalVenue", "FamilyId",
                "CreatedDate"]

In [None]:
mag.fields_of_study_papers_ids()

The above two lines of code will create a set of SFrames with all the dataset data. The SFrames will include data on authors’ papers, keywords, fields of study, and more. Moreover, the code will construct the Extended Papers SFrame, which contains various metadata on each paper in the dataset.

In [None]:
sframe_list = [ self.papers_citation_number_by_year, 
                        self.urls]

In [None]:
p = mag.papers

In [None]:
os.path.exists("/storage/homedir/dima/.scidyn/MAG/sframes/ExtendedPapers.sframe")

In [None]:
mag_sf = mag.extended_papers


In [None]:
col= 'Fields of study parent list (L1)'

In [None]:
new_col_name = "Field ID"
sf = mag_sf.stack(col, new_column_name=new_col_name)
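
To illustrate what the `stack` call in the cell above does, here is a minimal pure-Python sketch that expands a list-valued column into one row per list element. It is only an illustration on toy rows, not Turi Create's implementation; in particular, rows with an empty list are simply dropped here, and the toy column name `Fields` is an assumption.

```python
def stack_rows(rows, column, new_column_name):
    """Expand a list-valued column into one output row per list element."""
    stacked = []
    for row in rows:
        # Keep every other column unchanged.
        rest = {key: value for key, value in row.items() if key != column}
        for value in row[column]:
            stacked.append({**rest, new_column_name: value})
    return stacked


# Toy input: each paper carries a list of field IDs.
papers = [
    {"PaperId": "p1", "Fields": ["f1", "f2"]},
    {"PaperId": "p2", "Fields": ["f3"]},
]
stacked = stack_rows(papers, "Fields", "Field ID")
for row in stacked:
    print(row)
```

After stacking, each paper appears once per field, which makes it straightforward to group or join on the field ID.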

In [None]:
mag.fields_of_study_papers_ids_sframes()

In our study, we also analyzed how various authors' attributes, such as the number of published papers and the number of coauthors, have changed over time. To achieve this, we created an authors features SFrame using the following code:

In [None]:
mag.author_names

In [None]:
import turicreate
turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 2)
turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS', 2)
from ScienceDynamics.datasets.mag_authors import AuthorsFeaturesExtractor
a = AuthorsFeaturesExtractor(mag)
# a.authors_features
# This needs to run on a strong server and can take considerable time to run
# a_sf = a.get_authors_all_features_sframe()
# a_sf #the SFrame can be later loaded using tc.load_sframe(AUTHROS_FEATURES_SFRAME)

In [None]:
a.paper_authors_years

In [None]:
a.paper_author_affiliation_sframe

In [None]:
turicreate.config.get_runtime_config()  # `tc` is only imported further down; `turicreate` is imported above


In [None]:
a.get_co_authors_dict_sframe()

In [None]:
a.get_authors_papers_dict_sframe()

In [None]:
self = a

In [None]:
p_sf = self._p_sf[['PaperId']]  # 22082741
a_sf = self._mag.paper_author_affiliations["AuthorId", "PaperId"]
a_sf = a_sf.join(p_sf, on="PaperId")
a_sf = a_sf[["AuthorId"]].unique()
g = self.get_authors_papers_dict_sframe()
a_sf = a_sf.join(g, on="AuthorId", how="left")  # 22443094 rows
a_sf.__materialize__()
del g
del p_sf
g = self.get_co_authors_dict_sframe()

In [None]:
from turicreate import aggregate as agg

In [None]:
# print("Calculating authors' coauthors by year")
# sf = self.paper_authors_years
# sf = sf.join(sf, on='PaperId')
# sf2 = sf[sf['AuthorId'] != sf['AuthorId.1']]
# sf2 = sf2.remove_column('Year.1')
# sf2.__materialize__()
# g = sf2.groupby(['AuthorId', 'Year'], {'Coauthors List': agg.CONCAT('AuthorId.1')})
# del sf
# g.__materialize__()
# del sf2
g['Coauthors Year'] = g.apply(lambda r: (r['Year'], r['Coauthors List']))
g2 = g.groupby("AuthorId", {'Coauthors list': agg.CONCAT('Coauthors Year')})
g2['Coauthors by Years Dict'] = g2['Coauthors list'].apply(lambda l: {y: coa_list for y, coa_list in l})
g2 = g2.remove_column('Coauthors list')
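
The cell above builds, for every author, a dictionary from year to the list of coauthors in that year. The same idea can be sketched in plain Python on a toy table of (paper, author, year) rows. This is an illustrative sketch, not the project's implementation, and it assumes that each author appears at most once per paper.

```python
from collections import defaultdict
from itertools import permutations


def coauthors_by_year(paper_authors_years):
    """Map each author to {year: [coauthor ids]} from (paper, author, year) rows."""
    authors_per_paper = defaultdict(list)
    year_of_paper = {}
    for paper_id, author_id, year in paper_authors_years:
        authors_per_paper[paper_id].append(author_id)
        year_of_paper[paper_id] = year

    result = defaultdict(lambda: defaultdict(list))
    for paper_id, authors in authors_per_paper.items():
        # Every ordered pair of distinct authors on a paper is a coauthorship.
        for author, coauthor in permutations(authors, 2):
            result[author][year_of_paper[paper_id]].append(coauthor)
    return {author: dict(years) for author, years in result.items()}


# Toy rows: (PaperId, AuthorId, Year)
rows = [
    ("p1", "a1", 2000),
    ("p1", "a2", 2000),
    ("p2", "a1", 2001),
    ("p2", "a3", 2001),
]
coauthors = coauthors_by_year(rows)
print(coauthors["a1"])
```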


In [None]:
import turicreate as tc
tc.__version__

The above SFrame contains various features of each author, constructed by analyzing the author’s papers that have at least 5 references. Note that the authors SFrame also contains a gender prediction for each author. This column was created by obtaining first-name gender statistics from the [SSA Baby Names](http://www.ssa.gov/oact/babynames/names.zip) and [WikiTree](https://www.wikitree.com/wiki/Help:Database_Dumps) datasets, which together include over 115 thousand unique first names (see details in geneder_classifier.py).
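
The general idea of predicting gender from first-name statistics can be sketched as follows. This is a simplified illustration and not the project's classifier: the function names, the `min_ratio` threshold of 0.9, and the toy counts are all assumptions made for this example.

```python
def build_gender_lookup(name_counts, min_ratio=0.9):
    """Build {first_name: 'M' or 'F'} from (first_name, gender, count) records.

    A name is kept only if at least `min_ratio` of its occurrences share
    one gender (the threshold value is an assumption of this sketch).
    """
    totals = {}
    for name, gender, count in name_counts:
        per_name = totals.setdefault(name.lower(), {"M": 0, "F": 0})
        per_name[gender] += count

    lookup = {}
    for name, counts in totals.items():
        total = counts["M"] + counts["F"]
        if total == 0:
            continue
        gender = "M" if counts["M"] >= counts["F"] else "F"
        if counts[gender] / total >= min_ratio:
            lookup[name] = gender
    return lookup


def predict_gender(full_name, lookup):
    """Return 'M', 'F', or None when the first name is unknown or ambiguous."""
    parts = full_name.strip().split()
    if not parts:
        return None
    return lookup.get(parts[0].lower())


# Toy counts for illustration only, not real statistics.
toy_counts = [
    ("Mary", "F", 100), ("Mary", "M", 1),
    ("John", "M", 90),
    ("Taylor", "M", 50), ("Taylor", "F", 50),
]
lookup = build_gender_lookup(toy_counts)
print(predict_gender("John Smith", lookup))
```

Returning `None` for ambiguous or unseen names keeps uncertain cases out of any gender-based analysis instead of guessing.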

## 1.2 The AMiner Dataset

After downloading the dataset from the [AMiner website](https://aminer.org/open-academic-graph), simply load it into an SFrame using the following code:

In [None]:
from ScienceDynamics.datasets.aminer import Aminer
from ScienceDynamics.config.configs import DATASETS_AMINER_DIR

In [None]:
a = Aminer(DATASETS_AMINER_DIR)

In [None]:
a.data
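
For readers who want to inspect the raw files before loading them, the following sketch parses records stored as one JSON object per line. That the AMiner files use this layout is an assumption here, as are the field names in the toy input; the `Aminer` class above handles the actual loading.

```python
import io
import json


def iter_json_lines(lines):
    """Yield one parsed record per non-empty line, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON instead of aborting the whole load.
            continue


# Toy input standing in for an opened dataset file; field names are hypothetical.
toy_file = io.StringIO(
    '{"id": "x1", "title": "A paper", "doi": "10.1000/abc"}\n'
    "\n"
    "not valid json\n"
    '{"id": "x2", "title": "Another paper"}\n'
)
records = list(iter_json_lines(toy_file))
print(len(records), "records parsed")
```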

## 1.3 The SJR Dataset

First, we download all the journal ranking files from [the SJR website](http://www.scimagojr.com/journalrank.php).
Next, we use the following code to create a single SFrame with all the journal data:

In [None]:
from ScienceDynamics.datasets.sjr import SJR
from ScienceDynamics.config.configs import DATASETS_SJR_DIR
sjr = SJR(DATASETS_SJR_DIR)

In [None]:
sjr.data
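
Conceptually, building a single table from the yearly ranking files means reading each file and tagging every row with its year. The sketch below shows this with the standard library on in-memory toy files. The semicolon delimiter, the column names, and the values are assumptions for illustration; the `SJR` class above is what the project actually uses.

```python
import csv
import io


def combine_yearly_tables(tables_by_year, delimiter=";"):
    """Concatenate per-year tables into one list of rows with a 'Year' column.

    tables_by_year maps a year to an open text file (or any iterable of lines).
    """
    rows = []
    for year, handle in sorted(tables_by_year.items()):
        for row in csv.DictReader(handle, delimiter=delimiter):
            row["Year"] = year
            rows.append(row)
    return rows


# Toy files for illustration only; real ranking files have more columns.
toy_tables = {
    2015: io.StringIO("Title;Issn;SJR\nJournal A;12345678;1.5\n"),
    2014: io.StringIO("Title;Issn;SJR\nJournal A;12345678;1.2\nJournal B;87654321;0.7\n"),
}
all_rows = combine_yearly_tables(toy_tables)
print(len(all_rows), "rows in the combined table")
```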

## 1.4 Joint Datasets

The MAG and AMiner datasets have a slightly different set of features. While the MAG dataset contains data on each author with a unique author ID, the AMiner contains additional data on each paper, including the paper's abstract and the paper's ISSN or ISBN. Additionally, the SJR dataset contains data about each journal's ranking.

To combine the data from the author publication record and the journals' rankings, we joined the datasets. First, we joined the MAG and AMiner datasets by matching DOI values, using the following code (see also create_mag_aminer_sframe.py):

In [None]:
from ScienceDynamics.datasets.joined_dataset import JoinedDataset
from ScienceDynamics.config.configs import DATASETS_BASE_DIR, DATASETS_SJR_DIR, DATASETS_AMINER_DIR, STORAGE_PATH

In [None]:
jd = JoinedDataset(STORAGE_PATH, DATASETS_SJR_DIR, DATASETS_AMINER_DIR, mag_path=DATASETS_BASE_DIR/ "MicrosoftAcademicGraph.zip")
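
The DOI-based matching described above can be pictured with a small pure-Python sketch. It is not the code in `JoinedDataset`; the DOI normalization step, the `AMiner ` column prefix, and the AMiner field names are assumptions made for this illustration (the `Doi` column name follows the MAG papers column list shown earlier).

```python
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(doi):
    """Lower-case a DOI and strip common URL prefixes; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def join_by_doi(mag_rows, aminer_rows):
    """Inner-join two lists of dicts on their normalized DOI."""
    aminer_by_doi = {}
    for row in aminer_rows:
        key = normalize_doi(row.get("doi"))
        if key is not None:
            aminer_by_doi.setdefault(key, row)  # keep the first record per DOI

    joined = []
    for row in mag_rows:
        key = normalize_doi(row.get("Doi"))
        if key is not None and key in aminer_by_doi:
            extra = {f"AMiner {k}": v for k, v in aminer_by_doi[key].items()}
            joined.append({**row, **extra})
    return joined


# Toy records for illustration only.
mag_rows = [{"PaperId": "1", "Doi": "10.1000/ABC"}, {"PaperId": "2", "Doi": None}]
aminer_rows = [{"id": "x", "doi": "https://doi.org/10.1000/abc", "abstract": "..."}]
joined = join_by_doi(mag_rows, aminer_rows)
print(joined)
```

Papers without a DOI never match in this sketch, which is one reason a DOI join covers only part of either dataset.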

In [None]:
jd.mag._dataset_dir

In [None]:
from tqdm import tqdm_notebook as tqdm

In [None]:
print(b_lim, u_lim)


In [None]:
data[data["MAG Paper ID"]=='74024986']

In [None]:
sf[5865272]

In [None]:
data.save("data.csv","csv")

In [None]:
from turicreate import SFrame

In [None]:
import pandas as pd

In [None]:
data.to_csv("data2.csv", encoding='utf-8')

In [None]:
# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
data = pd.read_csv("data.csv", encoding='latin1', on_bad_lines='skip')

In [None]:
len(sf),len(jd.aminer_mag_links_by_doi)

In [None]:
data.head(10)["MAG Paper ID"].str.encode('utf-8')

In [None]:
for col in tqdm(data.columns):
    if data[col].dtype == object:
        print(col)
        data[col] = data[col].str.encode('utf-8')

In [None]:
from turicreate import SFrame, load_sframe

In [None]:
sf = SFrame(data_dict)

In [None]:
sf.save("temp.sframe")

In [None]:
sf = load_sframe("temp.sframe")

In [None]:
sf = sf.unpack('X1',column_name_prefix="")

In [None]:
len(sf)

In [None]:
data_dict = data.to_dict(orient='records')

In [None]:
del data_dict

In [None]:
len(data_dict)

In [None]:
b_lim, u_lim

In [None]:
# data[5866103]
data[5865273]

In [None]:
df = data.to_dataframe()

In [None]:
x = jd.aminer_mag_links_by_doi[0:5866103].append(jd.aminer_mag_links_by_doi[5866128:])

In [None]:
te = SFrame()
te["e"] = ["í".encode('latin-1').decode('latin-1').encode('utf-8')]

In [None]:
str("í".encode('latin-1'), 'utf-8')
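
The encoding experiments above come down to one fact: bytes produced by Latin-1 encoding are generally not valid UTF-8, so decoding them as UTF-8 raises an error. The sketch below demonstrates this and shows one possible fallback strategy. The helper `to_text` is hypothetical and is not part of the project.

```python
def to_text(raw_bytes, fallback="latin-1"):
    """Decode bytes as UTF-8, falling back to another encoding on failure."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode(fallback)


latin1_bytes = "í".encode("latin-1")  # a single byte: b'\xed'
utf8_bytes = "í".encode("utf-8")      # two bytes: b'\xc3\xad'

try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False  # a lone 0xED byte is not a complete UTF-8 sequence

print("Latin-1 bytes decodable as UTF-8:", decoded_ok)
print(to_text(latin1_bytes), to_text(utf8_bytes))
```

Because Latin-1 maps every byte to a character, the fallback never fails; it can therefore silently produce wrong characters if the data was actually in a third encoding.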

In [None]:
jd._sframe_dir

In [None]:
sf.save(str(jd._sframe_dir/"PapersAMinerMagJoin.sframe"))

Using the joined dataset, we obtained an SFrame with the joint metadata of 28.9 million papers. We can take this SFrame and join it with the SJR dataset.

In [None]:
jd.aminer_mag_sjr(2015)
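
One way to picture the join with the SJR data for a given year is sketched here in plain Python. Matching on ISSN, the possibility of several comma-separated ISSNs per journal row, and all column names are assumptions of this sketch; the actual logic lives in `aminer_mag_sjr`.

```python
def join_with_journal_ranks(papers, journal_ranks, year):
    """Attach the SJR value for `year` to every paper whose ISSN is ranked that year."""
    rank_by_issn = {}
    for rank in journal_ranks:
        if rank["Year"] != year:
            continue
        # A journal row may list several ISSNs (for example print and electronic).
        for issn in rank["Issn"].split(","):
            rank_by_issn[issn.strip()] = rank

    joined = []
    for paper in papers:
        rank = rank_by_issn.get(paper.get("issn"))
        if rank is not None:
            joined.append({**paper, "SJR": rank["SJR"], "SJR Year": year})
    return joined


# Toy records for illustration only.
toy_papers = [
    {"MAG Paper ID": "1", "issn": "12345678"},
    {"MAG Paper ID": "2", "issn": "00000000"},
]
toy_ranks = [
    {"Title": "Journal A", "Issn": "12345678, 87654321", "SJR": 1.5, "Year": 2015},
    {"Title": "Journal A", "Issn": "12345678, 87654321", "SJR": 1.2, "Year": 2014},
]
ranked_papers = join_with_journal_ranks(toy_papers, toy_ranks, 2015)
print(ranked_papers)
```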

# 2. Loading the Dataset to MongoDB

Using Turicreate and SFrame objects can help us get general data on how academic publication dynamics have changed over time, but it would be challenging to use this data to create more complicated insights, such as the trends of a specific journal. To reveal more complicated insights using the data, we would need to load the dataset to a different framework. In this study, we chose to use MongoDB as our framework for more complicated queries.
We installed MongoDB on Ubuntu 17.10 using the instructions in the following [link](https://medium.com/gatemill/how-to-install-mongodb-3-6-on-ubuntu-17-10-ac0bc225e648). After MongoDB is installed and running, please remember to set the user and password, and update the MONGO_HOST and MONGO_PORT variables in consts.py (the connection can also be adjusted to include user/password authentication).
The next step is to load the SFrames created above into MongoDB collections using mongo_connector.py:

In [None]:
from ScienceDynamics.mongo_connector import load_sframes
load_sframes(mag, sjr, jd)  # this will load the SFrames into MongoDB collections
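
When loading millions of records into a database, it is common to insert them in batches instead of one at a time. The sketch below shows a generic batching helper; it is not taken from `mongo_connector`, and the batch size and the usage shown in the comment are assumptions.

```python
from itertools import islice


def batched(records, batch_size=1000):
    """Yield consecutive lists of at most `batch_size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical usage with a pymongo collection (not executed here):
# for batch in batched(rows_as_dicts, batch_size=10000):
#     collection.insert_many(batch)

batches = list(batched(range(5), batch_size=2))
print(batches)
```

Batching keeps memory use bounded, since only one batch of records has to be held as Python dictionaries at any time.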

At the end of the loading process, six collections will be loaded into the journals database.

In [None]:
from ScienceDynamics.mongo_connector import MongoDBConnector
MD = MongoDBConnector()

In [None]:
MD.client.journals.list_collection_names()  # collection_names() was removed in PyMongo 4

In [None]:
for i, item in enumerate(MD.client.journals.aminer_mag_papers.find()):
    print(item)
    if i >4:
        break

In [None]:
sf = SFrame.read_csv("data2.csv")

In [None]:
MD.client.journals.aminer_mag_papers.count_documents({})  # count() was removed in PyMongo 4

In [None]:
MD.client.journals.aminer_mag_papers.delete_many({})  # deletes all documents; remove() was removed in PyMongo 4

In the second part of the tutorial, we will demonstrate how the above created MongoDB collections can be utilized to calculate various statistics on paper collections, authors, journals, and research domains.