# PART I: Creating the Study's Datasets

# 0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip; you can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo. I also recommend running the code via IPython Notebook.
* pip install --upgrade turicreate
* pip install --upgrade networkx
* pip install --upgrade pymongo



Please download the KDD Cup 2016 data, and also download the project files from our GitHub repository. Throughout this study, we use the various constants that appear in consts.py. Please change DATASETS_AMINER_DIR, DATASETS_BASE_DIR, and SFRAMES_BASE_DIR to local directories where you can download the datasets and save the project's SFrames.
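
As a rough illustration of the directory setup described above, here is a minimal sketch. The folder names and the temporary base directory are assumptions made for the example; the real values are whatever you set in consts.py.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: the real constants are defined in consts.py.
# A temporary directory stands in for the local base directory you choose.
BASE_DIR = Path(tempfile.mkdtemp())

DATASETS_BASE_DIR = BASE_DIR / "MAG"        # where the MAG files are downloaded
DATASETS_AMINER_DIR = BASE_DIR / "AMiner"   # where the AMiner files are downloaded
SFRAMES_BASE_DIR = BASE_DIR / "sframes"     # where the generated SFrames are saved

# Create the directories up front so later download/save steps do not fail.
for directory in (DATASETS_BASE_DIR, DATASETS_AMINER_DIR, SFRAMES_BASE_DIR):
    directory.mkdir(parents=True, exist_ok=True)
```

In practice you would replace `BASE_DIR` with a persistent path that has enough free disk space for the datasets.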

**Note: Creating the following SFrame requires considerable computation power for long periods.** 

In [1]:
%load_ext autoreload
%autoreload 2
%aimport
%matplotlib inline

Modules to reload:
all-except-skipped

Modules to skip:



In [5]:
import os
import sys
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

In [6]:
from tqdm import tqdm_notebook as tqdm 

# 1. Creating the SFrames

In this study, we used the following datasets:
* [The Microsoft Academic KDD Cup 2016 dataset](https://kddcup2016.azurewebsites.net/Data) - The Microsoft Academic KDD Cup Graph dataset (referred to as the MAG 2016 dataset) contains data on over 126 million papers. The main advantage of this dataset is that it has undergone several preprocessing iterations of author entity matching (each author is identified by an ID) and paper deduplication. Additionally, the dataset matches papers to their fields of study and includes the hierarchical structure and connections between the various fields of study.  <br/>  Note: this link is no longer active.

* [AMiner dataset](https://aminer.org/open-academic-graph) - The AMiner dataset contains information on over 154 million papers collected by the AMiner team. The dataset contains papers' abstracts, ISSNs, ISBNs, and details on each paper. <br/>  Note: currently there is also a V2 release; which version to download remains to be specified.

* [SJR dataset](http://www.scimagojr.com/journalrank.php) - The SCImago Journal Rank open dataset (referred to as the SJR dataset) contains journal and country-specific metric data starting from 1999. In this study, we used the SJR dataset to better understand how various journal metrics have changed over time. <br/>  Note: download instructions remain to be added.


## 1.1 The Microsoft Academic Dataset

The first step is to convert the dataset text files into SFrame objects. The code for this is located under the SFrames creator directory and can be run as follows.

In [63]:
from ScienceDynamics.datasets.microsoft_academic_graph import MicrosoftAcademicGraph
from ScienceDynamics.config.configs import DATASETS_BASE_DIR
mag = MicrosoftAcademicGraph(DATASETS_BASE_DIR)

In [14]:
import pandas as pd

In [15]:
paper_author_affiliations = pd.read_csv("/storage/homedir/dima/.scidyn2/MAG/PaperAuthorAffiliations.txt.gz", sep="\t", names=["PaperId", "AuthorId", "AffiliationId", "AuthorSequenceNumber", "OriginalAuthor", "OriginalAffiliation"])

In [18]:
from turicreate import SFrame  # SFrame must be imported before this cell runs
sf = SFrame(paper_author_affiliations.replace({pd.np.nan: None}))

In [19]:
sf.save("/storage/homedir/dima/.scidyn2/MAG/sframesPaperAuthorAffiliations.sframe")

In [21]:
!mv  /storage/homedir/dima/.scidyn2/MAG/sframesPaperAuthorAffiliations.sframe /storage/homedir/dima/.scidyn2/MAG/sframes/PaperAuthorAffiliations.sframe
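
The manual conversion in the cells above reads a headerless, tab-separated, gzip-compressed file and turns missing values into `None`. The following is a minimal standard-library sketch of that idea, not the project's loader: the helper name `read_headerless_tsv` is hypothetical, and the two sample rows only imitate the layout of PaperAuthorAffiliations.txt.gz.

```python
import csv
import gzip
import io

COLUMNS = ["PaperId", "AuthorId", "AffiliationId", "AuthorSequenceNumber",
           "OriginalAuthor", "OriginalAffiliation"]

def read_headerless_tsv(fileobj, columns):
    """Yield one dict per row, mapping empty fields to None."""
    reader = csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {name: (value if value != "" else None)
               for name, value in zip(columns, row)}

# Two sample rows in the layout of PaperAuthorAffiliations.txt.gz,
# compressed in memory so the sketch needs no file on disk.
sample = (
    "37\t2151921381\t30338065\t2\tHelena Elisa Stein\tUniversidade de Caxias do Sul\n"
    "37\t2722638736\t\t3\tLauro Machado Neto\t\n"
)
compressed = gzip.compress(sample.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8", newline="") as f:
    rows = list(read_headerless_tsv(f, COLUMNS))
```

All values stay strings here; casting the ID columns to integers would be a separate step.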

In [None]:
["PaperId", "Rank", "Doi", "DocType", "PaperTitle", "OriginalTitle", "BookTitle", "Year", "Date",
                "Publisher", "JournalId", "ConferenceSeriesId", "ConferenceInstanceId", "Volume", "Issue", "FirstPage",
                "LastPage", "ReferenceCount", "CitationCount", "EstimatedCitation", "OriginalVenue", "FamilyId",
                "CreatedDate"]

In [7]:
mag.fields_of_study_papers_ids()

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,str,str,int,str,str,int,int,int,int,int,int,int,int,int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


RuntimeError: Unable to interpret value of "Journal" as a integer.

The above two lines of code will create a set of SFrames containing all of the dataset's data. The SFrames will include data on authors' papers, keywords, fields of study, and more. Moreover, the code will construct the Extended Papers SFrame, which contains various metadata on each paper in the dataset.

In [None]:
sframe_list = [ self.papers_citation_number_by_year, 
                        self.urls]

In [14]:
p = mag.papers

In [15]:
os.path.exists("/storage/homedir/dima/.scidyn/MAG/sframes/ExtendedPapers.sframe")

True

In [31]:
mag_sf = mag.extended_papers


In [34]:
col = 'Fields of study parent list (L1)'

In [36]:
new_col_name = "Field ID"
sf = mag_sf.stack(col, new_column_name=new_col_name)
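
To make the effect of `stack` concrete, here is a plain-Python sketch of the same reshaping: each element of a list column becomes its own row. The function `stack_list_column` and the paper IDs are hypothetical, and keeping rows with a missing list as a single `None` row is an assumption about the desired behavior, not a statement about the library's internals.

```python
def stack_list_column(rows, column, new_column_name):
    """Return one output row per element of the list stored in `column`."""
    stacked = []
    for row in rows:
        # Copy every column except the list column being expanded.
        rest = {key: value for key, value in row.items() if key != column}
        # Rows with a missing or empty list are kept once, with None.
        values = row.get(column) or [None]
        for value in values:
            stacked.append({**rest, new_column_name: value})
    return stacked

col = "Fields of study parent list (L1)"
papers = [
    {"MAG Paper ID": "P1", col: ["078EB9BF", "00137C13"]},  # hypothetical IDs
    {"MAG Paper ID": "P2", col: ["078EB9BF"]},
    {"MAG Paper ID": "P3", col: None},
]
stacked = stack_list_column(papers, col, "Field ID")
```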

In [55]:
mag.fields_of_study_papers_ids_sframes()

Field of study ID,Paper IDs,Field of study name,Number of Paper,Level
01C396A7,"[7FE89521, 0E23C189, 59E35B91, 587AF2A4, ...",Media studies,24854,1
02005B3E,"[7887D4CE, 7DAF4838, 7919A107, 5E1E0717, ...",Calculus,181633,1
0A778812,"[7EAF768F, 7DC84FFC, 49816720, 7DF2E00F, ...",Natural resource economics ...,1222,1
0BE4BA29,"[7CFE299E, 5E628D73, 80BEF0CE, 7E555D49, ...",Law,1343748,1
014EF258,"[7F9691D6, 7DED9FA0, 7E4483E2, 7DFA3CA8, ...",Combinatorial chemistry,42382,1
023E10AF,"[7FD252F0, 84287EA0, 728CDCF0, 5C37DA5F, ...",Agricultural science,49446,1
02C0117D,"[790C14A8, 5DD62343, 80D484FA, 78CD50CE, ...",Nuclear magnetic resonance ...,486056,1
06E88D7C,"[7CFE299E, 79F82B7D, 7E087EF0, 5C65E96E, ...",Software Engineering,323620,1
073136E6,"[0A8B0683, 77DE62B4, 5ABC5929, 5E494EDA, ...",Optics,888483,1
08ED7E6D,"[7CC75151, 75AE9967, 58AF9956, 5B9DDC2F, ...",Econometrics,41660,1
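
The table above maps each field of study to its list of paper IDs and a paper count. A table of this shape can be sketched by grouping (paper, field) pairs, as below. This is only an illustration with hypothetical paper IDs and a hypothetical helper name; it does not claim to reproduce how `fields_of_study_papers_ids_sframes` is implemented.

```python
from collections import defaultdict

def papers_by_field(stacked_rows):
    """Group paper IDs by field ID and count the papers in each field."""
    grouped = defaultdict(list)
    for row in stacked_rows:
        if row["Field ID"] is not None:
            grouped[row["Field ID"]].append(row["MAG Paper ID"])
    return [
        {"Field of study ID": field_id, "Paper IDs": paper_ids,
         "Number of Paper": len(paper_ids)}
        for field_id, paper_ids in sorted(grouped.items())
    ]

# Hypothetical (paper, field) pairs, one row per pair.
stacked_rows = [
    {"MAG Paper ID": "P1", "Field ID": "078EB9BF"},
    {"MAG Paper ID": "P1", "Field ID": "00137C13"},
    {"MAG Paper ID": "P2", "Field ID": "078EB9BF"},
    {"MAG Paper ID": "P3", "Field ID": None},
]
table = papers_by_field(stacked_rows)
```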


In our study, we also analyzed how various author attributes, such as the number of published papers and the number of coauthors, have changed over time. To achieve this, we created an authors features SFrame using the following code:

In [8]:
mag.author_names

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Author ID,Author name,First name,Last name
00001F05,nancy praill,nancy,praill
000025C5,david banitt,david,banitt
00002AD3,david s rebergen,david,rebergen
000038C0,西原 一幸,西原,一幸
000060E7,francesco saverio intorcia ...,francesco,intorcia
00006A31,b zelazowska,b,zelazowska
00009F6B,bo glimskar,bo,glimskar
0000B5FA,lars goerigk,lars,goerigk
0000D3AB,default admin user,default,user
0000DF94,leila medjkoune,leila,medjkoune


In [6]:
import turicreate
turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 2)
turicreate.config.set_runtime_config('TURI_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS', 2)
from ScienceDynamics.datasets.mag_authors import AuthorsFeaturesExtractor
a = AuthorsFeaturesExtractor(mag)
# a.authors_features
# This needs to run on a strong server and can take considerable time
# a_sf = a.get_authors_all_features_sframe()
# a_sf #the SFrame can be later loaded using tc.load_sframe(AUTHROS_FEATURES_SFRAME)

In [10]:
a.paper_authors_years

AuthorId,PaperId,Year
2151921381,37,2004
2722638736,37,2004
2767506227,37,2004
2126642415,108,2013
1995014452,125,1988
2002579779,125,1988
2250382311,125,1988
2283694448,125,1988
2110272823,147,2008
2165699729,147,2008
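
Rows of this form (AuthorId, PaperId, Year) are enough to derive simple per-author, per-year features such as the number of published papers. The sketch below shows one way to count them in plain Python. The first three rows are copied from the output above, the last two are hypothetical additions, and the function name is an assumption.

```python
from collections import Counter

# (AuthorId, PaperId, Year) rows; the first three are copied from the output
# above, the last two are hypothetical rows that make the counts non-trivial.
paper_authors_years = [
    (2151921381, 37, 2004),
    (2722638736, 37, 2004),
    (2767506227, 37, 2004),
    (2151921381, 900001, 2004),  # hypothetical
    (2151921381, 900002, 2005),  # hypothetical
]

def papers_per_author_year(rows):
    """Count distinct papers for every (author, year) pair."""
    distinct = {(author, paper, year) for author, paper, year in rows}
    return Counter((author, year) for author, _paper, year in distinct)

counts = papers_per_author_year(paper_authors_years)
```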


In [24]:
a.paper_author_affiliation_sframe

PaperId,AuthorId,AffiliationId,AuthorSequenceNumber,OriginalAuthor,OriginalAffiliation,Year
37,2151921381,30338065.0,2,Helena Elisa Stein,Universidade de Caxias do Sul ...,2004
37,2722638736,,3,Lauro Machado Neto,,2004
37,2767506227,30338065.0,1,Ana Paula Tedesco Gabrieli ...,Ortopedia e Traumatologia |Universidade de Caxias ...,2004
108,2126642415,165102784.0,1,Matthew Lovett,Duquesne University,2013
125,1995014452,169199633.0,2,David Hannaford,Aston University and BIS Applied Systems Ltd ...,1988
125,2002579779,169199633.0,1,Jon Bader,Aston University and BIS Applied Systems Ltd ...,1988
125,2250382311,169199633.0,4,John Edwards,Aston University and BIS Applied Systems Ltd ...,1988
125,2283694448,169199633.0,3,Alastair Cochran,Aston University and BIS Applied Systems Ltd ...,1988
147,2110272823,,2,J.M. García Puga,,2008
147,2165699729,,1,J. E. Callejas Pozo,,2008

ConferenceSeriesId,JournalId,OriginalVenue
,205911048.0,Acta Ortopedica Brasileira ...
,205911048.0,Acta Ortopedica Brasileira ...
,205911048.0,Acta Ortopedica Brasileira ...
,2737631889.0,American Society for Aesthetics Graduate ...
,,
,,
,,
,,
,138842040.0,Revista Pediatría de Atención Primaria ...
,138842040.0,Revista Pediatría de Atención Primaria ...


In [11]:
tc.config.get_runtime_config()


{'TURI_CACHE_FILE_HDFS_LOCATION': '',
 'TURI_CACHE_FILE_LOCATIONS': '/storage/homedir/dima/.scidyn2/tmp',
 'TURI_DEFAULT_NUM_GRAPH_LAMBDA_WORKERS': 12,
 'TURI_DEFAULT_NUM_PYLAMBDA_WORKERS': 2,
 'TURI_FAST_COMPACT_BLOCKS_IN_SMALL_SEGMENT': 8,
 'TURI_FILEIO_ALTERNATIVE_SSL_CERT_DIR': '',
 'TURI_FILEIO_ALTERNATIVE_SSL_CERT_FILE': '/storage/homedir/dima/miniconda3/envs/promed/lib/python3.6/site-packages/certifi/cacert.pem',
 'TURI_FILEIO_INSECURE_SSL_CERTIFICATE_CHECKS': 0,
 'TURI_FILEIO_MAXIMUM_CACHE_CAPACITY': 270489230336,
 'TURI_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE': 270489230336,
 'TURI_FILEIO_MAX_REMOTE_FS_CACHE_ENTRIES': 0,
 'TURI_FORCE_IPC_TO_TCP_FALLBACK': 0,
 'TURI_GLOBALS_PYTHON_EXECUTABLE': '/storage/homedir/dima/miniconda3/envs/promed/bin/python',
 'TURI_LAMBDA_WORKER_CONNECTION_TIMEOUT': 60.0,
 'TURI_MIN_SECONDS_BETWEEN_TICK_PRINTS': 3.0,
 'TURI_ML_DATA_STATS_PARALLEL_ACCESS_THRESHOLD': 1048576,
 'TURI_ML_DATA_TARGET_ROW_BYTE_MINIMUM': 262144,
 'TURI_NUM_GPUS': -1,
 'TURI_S

In [None]:
a.get_co_authors_dict_sframe()

2020-04-02 15:34:12,640 [MainThread  ] [INFO ]  Calcualting authors' coauthors by year


In [None]:
a.get_authors_papers_dict_sframe()

In [9]:
self = a

In [10]:
p_sf = self._p_sf[['PaperId']]  # 22082741
a_sf = self._mag.paper_author_affiliations["AuthorId", "PaperId"]
a_sf = a_sf.join(p_sf, on="PaperId")
a_sf = a_sf[["AuthorId"]].unique()
g = self.get_authors_papers_dict_sframe()
a_sf = a_sf.join(g, on="AuthorId", how="left")  # 22443094 rows
a_sf.__materialize__()
del g
del p_sf
g = self.get_co_authors_dict_sframe()

2020-03-02 10:51:11,648 [MainThread  ] [INFO ]  Calcualting authors' papers by year
2020-03-02 11:31:19,056 [MainThread  ] [INFO ]  Calcualting authors' coauthors by year


RuntimeError: vector::_M_default_append

In [13]:
from turicreate import aggregate as agg

In [15]:
# print("Calcualting authors' coauthors by year")
# sf = self.paper_authors_years
# sf = sf.join(sf, on='PaperId')
# sf2 = sf[sf['AuthorId'] != sf['AuthorId.1']]
# sf2 = sf2.remove_column('Year.1')
# sf2.__materialize__()
# g = sf2.groupby(['AuthorId', 'Year'], {'Coauthors List': agg.CONCAT('AuthorId.1')})
# del sf
# g.__materialize__()
# del sf2
g['Coauthors Year'] = g.apply(lambda r: (r['Year'], r['Coauthors List']))
g2 = g.groupby("AuthorId", {'Coauthors list': agg.CONCAT('Coauthors Year')})
g2['Coauthors by Years Dict'] = g2['Coauthors list'].apply(lambda l: {y: coa_list for y, coa_list in l})
g2 = g2.remove_column('Coauthors list')


MemoryError: std::bad_alloc
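
The target structure of the cell above is a per-author dictionary that maps each year to that author's coauthors. The following plain-Python sketch builds it on a tiny hypothetical sample by grouping rows per paper instead of self-joining the table. It only illustrates the intended result and does not address the memory limits hit on the full dataset; unlike the `CONCAT` aggregation, it also removes duplicate coauthors.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def coauthors_by_year(rows):
    """Map author -> {year: sorted coauthor IDs} from (author, paper, year) rows."""
    result = defaultdict(lambda: defaultdict(set))
    rows = sorted(rows, key=itemgetter(1))  # groupby needs rows ordered by paper
    for _paper_id, group in groupby(rows, key=itemgetter(1)):
        group = list(group)
        authors = [author for author, _paper, _year in group]
        year = group[0][2]  # all rows of one paper share the same year
        for author in authors:
            result[author][year].update(a for a in authors if a != author)
    return {author: {year: sorted(coauthors) for year, coauthors in years.items()}
            for author, years in result.items()}

# Hypothetical sample: paper 10 (2004) has three authors, paper 11 (2005) has two.
sample_rows = [(1, 10, 2004), (2, 10, 2004), (3, 10, 2004),
               (1, 11, 2005), (2, 11, 2005)]
coauthors = coauthors_by_year(sample_rows)
```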

In [7]:
import turicreate as tc
tc.__version__

'6.1'

The above SFrame contains various features of each author, constructed by analyzing the author's papers that have at least 5 references. Note that the authors SFrame also contains a gender prediction for each author. This column was created by obtaining first-name gender statistics from the [SSA Baby Names](http://www.ssa.gov/oact/babynames/names.zip) and [WikiTree](https://www.wikitree.com/wiki/Help:Database_Dumps) datasets, which include over 115 thousand unique first names (see details in geneder_classifier.py).
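
To illustrate the general idea of predicting gender from first-name statistics, here is a minimal sketch. It is not the project's classifier: the counts, the function name `predict_gender`, and the `min_confidence` threshold are all hypothetical.

```python
def predict_gender(first_name, name_counts, min_confidence=0.8):
    """Return "Female", "Male", or None when the statistics are inconclusive.

    name_counts maps a lowercase first name to (female_count, male_count).
    """
    counts = name_counts.get(first_name.lower())
    if not counts:
        return None
    female, male = counts
    total = female + male
    if total == 0:
        return None
    p_female = female / total
    if p_female >= min_confidence:
        return "Female"
    if 1 - p_female >= min_confidence:
        return "Male"
    return None  # ambiguous name: neither side reaches the threshold

# Hypothetical counts; real statistics would come from the name datasets.
name_counts = {"nancy": (990, 10), "david": (5, 995), "leslie": (600, 400)}
```

For example, `predict_gender("Nancy", name_counts)` yields `"Female"`, while an ambiguous or unknown name yields `None`.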

## 1.2 The AMiner Dataset

After downloading the dataset from the [AMiner website](https://aminer.org/open-academic-graph), simply load it into an SFrame using the following code:

In [17]:
from ScienceDynamics.datasets.aminer import Aminer
from ScienceDynamics.config.configs import DATASETS_AMINER_DIR

In [18]:
a = Aminer(DATASETS_AMINER_DIR)

In [9]:
a.data

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[dict]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


abstract,authors,doi,id
,"[{'name': 'Ö. Tan'}, {'name': 'A. Sahin ...",10.3989/mc.2007.v57.i288. 67 ...,53e9b2dbb7602d9703d8c652
,"[{'name': 'A. Goldman'}, {'name': 'D. G. ...",10.1364/AO.19.003721,53e9b2dbb7602d9703d8c653
Objective:It summarized nursing measures for ...,"[{'name': 'Zhang Rong', 'org': 'Affiliated ...",,53e9b2dbb7602d9703d8c654
,"[{'name': 'Elisabeth van der Linden'}, {'name': ...",10.1075/eurosla.5.07lin,53e9b2dbb7602d9703d8c655
,"[{'name': 'N. Cordes'}, {'name': 'L. Plasswil ...",10.1016/S0959-8049(01)812 39-9 ...,53e9b2dbb7602d9703d8c656
,"[{'name': 'Shigeru Ikeda'}, {'name': ...",10.1021/cm970221c,53e9b2dbb7602d9703d8c657
Bovine chondrocytes were isolated from the ...,"[{'name': 'Takashi Sato', 'org': 'Biomaterials ...",10.1016/j.msec.2003.12.01 0 ...,53e9b2dbb7602d9703d8c658
"Word count: 4667 words (including abstract, key ...","[{'name': 'Ennio Cascetta'}, {'name': ...",,53e9b2dbb7602d9703d8c659
Abstract—A novel family of low-density parity- ...,"[{'name': 'I. B. Djordjevic'}, {'name': ...",10.1109/LCOMM.2004.833833,53e9b2dbb7602d9703d8c65b
,"[{'name': 'DANIEL M. J OEL'}, {'name': 'V ITALY ...",,53e9b2dbb7602d9703d8c65c

isbn,issn,issue,keywords,lang,n_citation,page_end,page_start,pdf
,,288.0,,en,,,,
,0003-6935,22.0,,en,9.0,3724.0,3721.0,
,,34.0,"[acute aortic dissection, intracavity isolation, ...",zh,,18.0,17.0,
,,1.0,,en,,135.0,103.0,
,European Journal of Cancer ...,,[signal transduction],en,,,,
,,1.0,,en,5.0,77.0,72.0,
,Materials Science & Engineering C ...,3.0,"[chondrocyte, cartilage, tissue engineering, ...",en,,372.0,365.0,
,,,,en,6.0,,,
,,8.0,,en,10.0,540.0,538.0,
,,,"[nicotiana tabacum, pr proteins, orobanche ...",en,,,,

references,title,url,venue
,Optimización de la resistencia a compresión ...,[http://dx.doi.org/10.398 9/mc.2007.v57.i288.67] ...,Materiales De Construccion ...
[53e9ad47b7602d970372b511 ] ...,High resolution IR balloon-borne solar ...,[http://www.ncbi.nlm.nih. gov/pubmed/20234684?r ...,Applied optics
,Nursing care of 31 acute aortic dissection ...,,Chinese General Nursing
,Exploring possession in simultaneous ...,[http://dx.doi.org/10.107 5/eurosla.5.07lin] ...,Eurosla Yearbook
,The cytotoxicity of Ukrain does not involve ...,[http://dx.doi.org/10.101 6/S0959-8049(01)81239-9] ...,European Journal of Cancer ...
,Preparation of K<sub>2</s ub>La<sub>2</sub>Ti<s ...,,Chemistry of Materials
"[53e9ba65b7602d9704683131 , ...",Evaluation of PLLA–collagen hybrid ...,[http://dx.doi.org/10.101 6/j.msec.2003.12.010] ...,Materials Science & Engineering C ...
"[53e9b189b7602d9703c14df5 , ...",Dominance Attributes for Alternatives' Perception ...,,
"[53e9baadb7602d97046d806f , ...",MacNeish&#8211 Mann Theorem Based Iterati ...,,IEEE Communications Letters ...
"[53e9b1c3b7602d9703c54d77 , ...",The Angiospermous Root Parasite Orobanche L. ...,,

volume,year
57.0,2007
19.0,1980
,2010
5.0,2005
37.0,2001
10.0,1998
24.0,2004
,2007
8.0,2004
,1998
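
The inferred type `[dict]` above suggests that each input line is parsed as a single record. Assuming the files contain one JSON object per line, the loading step can be sketched with the standard library as follows. The helper name and the selected field list are assumptions; fields absent from a record become `None`, which matches the empty cells in the table above.

```python
import io
import json

FIELDS = ["id", "title", "authors", "doi", "year", "issn", "isbn", "abstract"]

def read_json_lines(fileobj, fields):
    """Yield one dict per non-empty line, filling absent fields with None."""
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}

# In-memory sample in the assumed one-object-per-line layout.
sample = io.StringIO(
    '{"id": "53e9b2dbb7602d9703d8c653", "authors": [{"name": "A. Goldman"}], '
    '"doi": "10.1364/AO.19.003721", "year": 1980, "issn": "0003-6935"}\n'
    '\n'
    '{"id": "53e9b2dbb7602d9703d8c659", "authors": [{"name": "Ennio Cascetta"}], '
    '"year": 2007}\n'
)
records = list(read_json_lines(sample, FIELDS))
```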


## 1.3 The SJR Dataset

First, we download all the journal ranking files from [the SJR website](http://www.scimagojr.com/journalrank.php).
Next, we use the following code to create a single SFrame with all the journal data:

In [19]:
from ScienceDynamics.datasets.sjr import SJR
from ScienceDynamics.config.configs import DATASETS_SJR_DIR
sjr = SJR(DATASETS_SJR_DIR)

In [4]:
sjr.data

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,str,str,str,str,str,int,int,int,int,int,int,str,str,str,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


Rank,Sourceid,Title,Type,SJR,SJR Best Quartile,H index,Total Docs.,Total Docs. (3years)
1,16801,Annual Review of Biochemistry ...,journal,50518,Q1,268,30,81
1,16801,Annual Review of Biochemistry ...,journal,50518,Q1,268,30,81
2,18434,Cell,journal,43449,Q1,682,354,1359
2,18434,Cell,journal,43449,Q1,682,354,1359
3,20651,Annual Review of Immunology ...,journal,43020,Q1,274,29,81
3,20651,Annual Review of Immunology ...,journal,43020,Q1,274,29,81
4,18395,Annual Review of Cell and Developmental Biology ...,book series,35051,Q1,199,25,61
4,18395,Annual Review of Cell and Developmental Biology ...,book series,35051,Q1,199,25,61
5,14181,Annual Review of Neuroscience ...,book series,25760,Q1,217,21,60
6,22126,Genes and Development,journal,25272,Q1,401,306,915

Total Refs.,Total Cites (3years),Citable Docs. (3years),Cites / Doc. (2years),Ref. / Doc.,Country
5913,3474,80,3565,19710,United States
5913,3474,80,3565,19710,United States
15869,47925,1328,3475,4483,United States
15869,47925,1328,3475,4483,United States
5236,4073,81,4716,18055,United States
5236,4073,81,4716,18055,United States
4134,1793,60,2689,16536,United States
4134,1793,60,2689,16536,United States
3419,1583,60,2237,16281,United States
16623,17272,888,1882,5432,United States

Publisher,Categories,Year,ISSN
Annual Reviews Inc.,Biochemistry (Q1),1999,664154
Annual Reviews Inc.,Biochemistry (Q1),1999,15454509
Cell Press,"Biochemistry, Genetics and Molecular Biology ...",1999,10974172
Cell Press,"Biochemistry, Genetics and Molecular Biology ...",1999,928674
Annual Reviews Inc.,Immunology (Q1); Immunology and Allergy ...,1999,15453278
Annual Reviews Inc.,Immunology (Q1); Immunology and Allergy ...,1999,7320582
Annual Reviews Inc.,Cell Biology (Q1); Developmental Biology ...,1999,10810706
Annual Reviews Inc.,Cell Biology (Q1); Developmental Biology ...,1999,15308995
Annual Reviews Inc.,Neuroscience (miscellaneous) (Q1) ...,1999,15454126
Cold Spring Harbor Laboratory Press ...,Developmental Biology (Q1); Genetics (Q1) ...,1999,8909369
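
Building one table from the per-year ranking files amounts to reading each file and tagging its rows with the year. The sketch below shows this with the standard library. It assumes semicolon-delimited files with a header row; the function name and the 2000 sample rows are hypothetical.

```python
import csv
import io

def combine_yearly_rankings(files_by_year, delimiter=";"):
    """Concatenate per-year ranking files, adding a Year column to each row."""
    combined = []
    for year, fileobj in sorted(files_by_year.items()):
        for row in csv.DictReader(fileobj, delimiter=delimiter):
            row["Year"] = year
            combined.append(row)
    return combined

# In-memory stand-ins for the downloaded files; the 2000 content is hypothetical.
files_by_year = {
    2000: io.StringIO("Rank;Sourceid;Title\n1;18434;Cell\n"),
    1999: io.StringIO(
        "Rank;Sourceid;Title\n1;16801;Annual Review of Biochemistry\n2;18434;Cell\n"
    ),
}
combined = combine_yearly_rankings(files_by_year)
```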


## 1.4 Joint Datasets

The MAG and AMiner datasets have slightly different sets of features. While the MAG dataset contains data on each author with a unique author ID, the AMiner dataset contains additional data on each paper, including the paper's abstract and the paper's ISSN or ISBN. Additionally, the SJR dataset contains data about each journal's ranking.

To combine the data from the authors' publication records and the journals' rankings, we joined the datasets. First, we joined the MAG and AMiner datasets by matching DOI values, using the following code (see also create_mag_aminer_sframe.py):

In [6]:
from ScienceDynamics.datasets.joined_dataset import JoinedDataset
from ScienceDynamics.config.configs import DATASETS_BASE_DIR, DATASETS_SJR_DIR, DATASETS_AMINER_DIR, STORAGE_PATH

In [7]:
jd = JoinedDataset(STORAGE_PATH, DATASETS_SJR_DIR, DATASETS_AMINER_DIR, mag_path=DATASETS_BASE_DIR / "MicrosoftAcademicGraph.zip")
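
Matching by DOI can be pictured as an inner join on a normalized DOI string. The following is a small sketch of that idea, not the code in create_mag_aminer_sframe.py: the helper names are hypothetical, the lowercase/strip normalization is an assumption, and the MAG paper IDs are made up.

```python
def normalize_doi(doi):
    """Lowercase and strip a DOI; return None for missing or empty values."""
    if not doi:
        return None
    doi = doi.strip().lower()
    return doi or None

def join_by_doi(mag_rows, aminer_rows):
    """Return (MAG Paper ID, Aminer Paper ID) links for papers sharing a DOI."""
    aminer_by_doi = {}
    for row in aminer_rows:
        key = normalize_doi(row.get("doi"))
        if key is not None:
            aminer_by_doi.setdefault(key, []).append(row)
    links = []
    for row in mag_rows:
        key = normalize_doi(row.get("Paper Document Object Identifier (DOI)"))
        for match in aminer_by_doi.get(key, []):
            links.append({"MAG Paper ID": row["MAG Paper ID"],
                          "Aminer Paper ID": match["id"]})
    return links

# Hypothetical MAG rows; the AMiner ID and DOI follow the sample output format.
mag_rows = [
    {"MAG Paper ID": "AAAA0001",
     "Paper Document Object Identifier (DOI)": "10.1364/ao.19.003721"},
    {"MAG Paper ID": "AAAA0002", "Paper Document Object Identifier (DOI)": None},
]
aminer_rows = [
    {"id": "53e9b2dbb7602d9703d8c653", "doi": "10.1364/AO.19.003721"},
    {"id": "53e9b2dbb7602d9703d8c654", "doi": None},
]
links = join_by_doi(mag_rows, aminer_rows)
```

Papers without a DOI on either side are never linked, which is why the joined table is smaller than either source.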

In [21]:
jd.mag._dataset_dir

PosixPath('/storage/homedir/dima/.scidyn/MAG')

In [175]:
from tqdm import tqdm_notebook as tqdm

In [20]:
print(b_lim, u_lim)


5865272 5865273


In [1]:
data[data["MAG Paper ID"]=='74024986']

NameError: name 'data' is not defined

In [22]:
sf[5865272]

{'Aminer Paper ID': '53e99d7ab7602d97026364e9',
 'Authors List Sorted': '["7D4BF247","766E9461","12B235CA"]',
 'Authors Number': 3.0,
 'Conference ID mapped to venue name': 'nan',
 'Field of study list': '["078EB9BF","07799FC6","1F9A2DF7","014CFD3E"]',
 'Field of study list names': '["Topology","Encoding",,"Antenna"]',
 'Fields of study parent list (L0)': '["073B64E4","0205A1DB","00F03FC7"]',
 'Fields of study parent list (L1)': '["078EB9BF","00137C13"]',
 'Fields of study parent list (L2)': '["08F300F3"]',
 'Fields of study parent list (L3)': '["07799FC6","014CFD3E"]',
 'Fields of study parent list names (L0)': '["Physics","Mathematics","Psychology"]',
 'Fields of study parent list names (L1)': '["Topology","Astronomy"]',
 'Fields of study parent list names (L2)': '["Radio astronomy"]',
 'Fields of study parent list names (L3)': '["Encoding","Antenna"]',
 'Journal ID mapped to venue name': '0044B422',
 'Keywords List': '["topology","encoding","integrated circuits","antennas"]',
 'MAG 

In [181]:
data.save("data.csv","csv")

In [209]:
from turicreate import SFrame

In [7]:
import pandas as pd

In [211]:
data.to_csv("data2.csv", encoding='utf-8')

In [8]:
data = pd.read_csv("data.csv",encoding='latin1',error_bad_lines=False)


In [13]:
len(sf),len(jd.aminer_mag_links_by_doi)

(29070469, 29070469)

In [258]:
data.head(10)["MAG Paper ID"].str.encode('utf-8')

0    b'01B27BE8'
1    b'027D0030'
2    b'7CFE299E'
3    b'59BEBE1C'
4    b'5873C011'
5    b'0B00AFD8'
6    b'5C66D743'
7    b'040121AE'
8    b'584D8787'
9    b'58B384F6'
Name: MAG Paper ID, dtype: object

In [11]:
for col in tqdm(data.columns):
    if data[col].dtype == object:
        print(col)
        data[col] = data[col].str.encode('utf-8')

HBox(children=(IntProgress(value=0, max=46), HTML(value='')))

MAG Paper ID
Original paper title
Normalized paper title
Paper publish date
Paper Document Object Identifier (DOI)
Original venue name
Normalized venue name
Journal ID mapped to venue name
Conference ID mapped to venue name
Total Citations by Year
Total Citations by Year without Self Citations
Authors List Sorted
Keywords List
Field of study list
Field of study list names
Fields of study parent list (L0)
Fields of study parent list names (L0)
Fields of study parent list (L1)
Fields of study parent list names (L1)
Fields of study parent list (L2)
Fields of study parent list names (L2)
Fields of study parent list (L3)
Fields of study parent list names (L3)
Urls
abstract
authors
Aminer Paper ID
isbn
issn
issue
keywords
lang
page_end
page_start
pdf
references
title
url
venue
volume



In [9]:
from turicreate import SFrame, load_sframe

In [18]:
sf = SFrame(data_dict)

In [21]:
sf.save("temp.sframe")

In [10]:
sf = load_sframe("temp.sframe")

In [11]:
sf = sf.unpack('X1',column_name_prefix="")

In [19]:
len(sf)

29070469

In [12]:
data_dict = data.to_dict(orient='records')

In [20]:
del data_dict

In [14]:
len(data_dict)

29070469

In [168]:
b_lim, u_lim

(5865272, 5865273)

In [170]:
# data[5866103]
data[5865273]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 1247: invalid continuation byte

In [121]:
df = data.to_dataframe()

KeyboardInterrupt: 

In [164]:
x = jd.aminer_mag_links_by_doi[0:5866103].append(jd.aminer_mag_links_by_doi[5866128:])

'Ã±'

In [240]:
te = SFrame()
te["e"] = ["í".encode('latin-1').decode('latin-1').encode('utf-8')]

In [239]:
str("í".encode('latin-1'), 'utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 0: unexpected end of data

b'\xed'
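
The errors above come from bytes written as Latin-1 being read as UTF-8: the single byte `0xed` is "í" in Latin-1 but an incomplete sequence in UTF-8. The reverse mix-up produces strings such as 'Ã±'. The sketch below shows two small helpers for these cases; the function names are hypothetical, and the Latin-1 fallback is one possible policy rather than the one used in the project.

```python
def decode_bytes(raw):
    """Decode as UTF-8, falling back to Latin-1 for invalid byte sequences."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")

def repair_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Latin-1 (e.g. 'Ã±')."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mis-decoded in this way; leave it unchanged.
        return text
```

For instance, `decode_bytes(b'\xed')` returns "í" instead of raising, and `repair_mojibake('Ã±')` returns "ñ".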

In [68]:
jd._sframe_dir

PosixPath('/storage/homedir/dima/.scidyn/sframes')

In [14]:
sf.save(str(jd._sframe_dir/"PapersAMinerMagJoin.sframe"))

Using the joined dataset, we obtained an SFrame with the joint metadata of 28.9 million papers. We can then join this SFrame with the SJR dataset.
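
One plausible way to attach journal rankings to papers is to match on ISSN for a given year. The sketch below assumes this matching key, which the text does not confirm. The helper names and the sample paper are hypothetical; the ranking row follows the SJR output format, where the ISSN appears without its hyphen and leading zeros.

```python
def normalize_issn(issn):
    """Return an 8-character ISSN without hyphen, or None if missing."""
    if not issn:
        return None
    return str(issn).replace("-", "").strip().upper().zfill(8)

def join_with_sjr(papers, sjr_rows, year):
    """Attach the ranking of the given year to papers with a matching ISSN."""
    sjr_by_issn = {normalize_issn(row["ISSN"]): row
                   for row in sjr_rows
                   if row["Year"] == year and row.get("ISSN")}
    joined = []
    for paper in papers:
        ranking = sjr_by_issn.get(normalize_issn(paper.get("issn")))
        if ranking is not None:
            joined.append({**paper,
                           "SJR": ranking["SJR"],
                           "SJR Best Quartile": ranking["SJR Best Quartile"]})
    return joined

sjr_rows = [{"ISSN": 664154, "Year": 1999, "SJR": 50518, "SJR Best Quartile": "Q1"}]
papers = [
    {"MAG Paper ID": "AAAA0001", "issn": "0066-4154"},  # hypothetical paper
    {"MAG Paper ID": "AAAA0002", "issn": None},
]
joined = join_with_sjr(papers, sjr_rows, 1999)
```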

In [24]:
jd.aminer_mag_sjr(2015)

MAG Paper ID,Original paper title,Normalized paper title,Paper publish year,Paper publish date
80F8FCAF,Trust-assisted anomaly detection and ...,trust assisted anomaly detection and ...,2011,2011/06
7E7B8146,Incentive-aware data dissemination in delay- ...,incentive aware data dissemination in delay ...,2011,2011/06
77B641FC,MOVi+: Improving the scalability of mobile ...,movi improving the scalability of mobile ...,2012,2012/06
76E06216,Identifying infection sources in large tree ...,identifying infection sources in large tree ...,2012,2012/06
7DAFE959,Coping with packet replay attacks in wireless ...,coping with packet replay attacks in wireless ...,2011,2011/06
7E428CE1,An optimal distributed malware defense system ...,an optimal distributed malware defense system ...,2011,2011/06
762EDB3C,A privacy-preserving social-aware incentive ...,a privacy preserving social aware incentive ...,2012,2012/06
7E018A37,Adaptive energy-efficient spectrum probing in ...,adaptive energy efficient spectrum probing in ...,2011,2011/06
7DDBB811,Broadcasting in multi channel wireless netw ...,broadcasting in multi channel wireless netw ...,2011,2011/06
815AC7AF,A software-hardware emulator for sensor ...,a software hardware emulator for sensor ...,2011,2011/06

Paper Document Object Identifier (DOI) ...,Original venue name,Normalized venue name,Journal ID mapped to venue name ...,Conference ID mapped to venue name ...
10.1109/SAHCN.2011.598492 2 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598494 0 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SECON.2012.627584 5 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SECON.2012.627578 8 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598491 9 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598491 3 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SECON.2012.627583 2 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598490 0 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598492 0 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32
10.1109/SAHCN.2011.598492 8 ...,sensor mesh and ad hoc communications and ...,secon,,43A23F32

Paper rank,Ref Number,Total Citations by Year,Total Citations by Year without Self Citations ...,Authors List Sorted,Authors Number
19319,20,"{'2014': 2.0, '2015': 4.0} ...","{'2014': 2.0, '2015': 3.0} ...","[7FFFDD08, 0C652565]",2
19229,24,"{'2012': 1.0, '2013': 2.0, '2014': 4.0, '20 ...","{'2012': 1.0, '2013': 2.0, '2014': 4.0, '20 ...","[7D3561C4, 7F2E53B3, 80454A26, 808AA8DF] ...",4
17473,0,,,"[80AFFCF9, 7CE93E83, 8286ED44] ...",3
17535,0,"{'2013': 1.0, '2014': 1.0, '2015': 3.0} ...","{'2013': 1.0, '2014': 1.0, '2015': 3.0} ...","[7D2934CF, 7D374416]",2
19454,16,"{'2014': 1.0, '2015': 1.0} ...","{'2014': 1.0, '2015': 1.0} ...","[80C7FAAC, 80AFEC7E, 1214B9E1, 094A7831, ...",6
19437,21,"{'2013': 1.0, '2014': 2.0, '2015': 2.0} ...","{'2013': 1.0, '2014': 2.0, '2015': 2.0} ...","[7D12A2C4, 7D33EF8B, 7E288A95, 7EF1ED28, ...",5
19555,0,,,"[8623A2A7, 7DB01108, 7E956002, 7D59933A] ...",4
19555,0,,,"[7F297646, 7E9270C8]",2
19387,22,"{'2011': 3.0, '2012': 3.0, '2013': 4.0, '20 ...","{'2011': 3.0, '2012': 3.0, '2013': 4.0, '20 ...","[00FA05B9, 80E6FE49, 80989FF0, 75C29F98] ...",4
19317,11,"{'2014': 2.0, '2015': 2.0} ...","{'2014': 1.0, '2015': 1.0} ...","[812A1C42, 7DAB2459, 814E3A70, 8149DAE0, ...",6

Keywords List,Field of study list,Field of study list names,Fields of study parent list (L0) ...
"[sensor network, energy efficient, wireless ...","[017737EA, 017737EA, 0029D7DC, 0029D7DC, ...","[Wireless sensor network, Wireless sensor network, ...",[0271BC14]
"[data dissemination, routing, games, mobile ...","[06110AD8, 09A2762C, 1E2C5488, 089D907D, ...","[Dissemination, Routing, Games, Mobile computing, ...","[0271BC14, 09ACE10E, 00F03FC7] ..."
"[switches, servers, video quality, mobile ...","[071DF995, 059A455C, 06318DBC, 05B1168F, ...","[Network switch, Server, Video quality, Mobile ...",[0271BC14]
"[computer viruses, estimation, sensor ...","[012823EE, 03C8C972, 017737EA, 06E8E0BF, ...","[Computer virus, Estimation, Wireless ...","[0271BC14, 0205A1DB, 0895A350] ..."
"[throughput, bloom filter, packet header, ...","[09674BAC, 02DE0735, 08C676B7, 0C1A7D17, ...","[Throughput, None, Bloom filter, Public-key ...","[0271BC14, 0796A60A]"
"[mobile network, mobile computer, greedy ...","[09287742, 089D907D, 0316C4AE, 065950AA, ...","[Cellular network, Mobile computing, None, ...","[0271BC14, 0205A1DB]"
"[privacy, word of mouth, consumer behaviour, ...","[04A33560, 08386431, 01623677, 06B7A533, ...","[Privacy, None, Consumer Behaviour, Advertising, ...","[0271BC14, 09ACE10E]"
"[spectrum, cognitive radio, cognitive radio ...","[099F6B15, 08EFFF50, 08EFFF50, 08EFFF50, ...","[Spectrum, Cognitive radio, Cognitive radio, ...","[0271BC14, 073B64E4, 0205A1DB] ..."
"[information retrieval, information exchange, ...","[0160D514, 0B4E7F98, 06708471, 0382EC42, ...","[Information retrieval, Information exchange, ...","[0271BC14, 0205A1DB, 07982D63] ..."
"[hardware, network protocols, data models, ...","[0090990D, 008F4943, 00BB3814, 040130E0, ...","[Data model, Computer hardware, Communications ...","[0271BC14, 073B64E4, 07982D63] ..."

Fields of study parent list names (L0) ...,Fields of study parent list (L1) ...,Fields of study parent list names (L1) ...,Fields of study parent list (L2) ...
[Computer Science],[08EB4114],[Embedded system],"[000B4A2A, 08DDF80C]"
"[Computer Science, Economics, Psychology] ...","[009AB2E6, 02A1BFD4, 01DCF91B, 06ABC255, ...","[Finance, Simulation, Computer network, ...","[0A8EFC34, 04CC92E5, 089D907D] ..."
[Computer Science],"[06ABC255, 0724DFBA]","[Operating system, Machine learning] ...","[089D907D, 04F6CF02]"
"[Computer Science, Mathematics, Sociology] ...","[093C4716, 0229BD39, 064E5072, 024DC8C8, ...","[Artificial intelligence, Social science, ...","[01516333, 06E8E0BF, 07C09FE0, 1F734584] ..."
"[Computer Science, Geology] ...","[0BE20181, 06D79A36, 07108A62, 01DCF91B, ...","[Programming language, Geomorphology, Computer ...","[0B274714, 1F734584, 09ACCB7D] ..."
"[Computer Science, Mathematics] ...","[00AE2819, 06ABC255, 072BDC64, 024DC8C8] ...","[Algorithm, Operating system, Distributed ...","[07C17D18, 01516333, 089D907D, 1DD1FB15] ..."
"[Computer Science, Economics] ...","[09B4F1FA, 020DA09F, 024DC8C8, 06B7A533] ...","[Marketing, Market economy, Computer ...","[01623677, 1F734584, 0A8CD660] ..."
"[Computer Science, Physics, Mathematics] ...","[01DCF91B, 064E5072]","[Computer network, Statistics] ...","[007E3B49, 01F5E149, 1DD1FB15] ..."
"[Computer Science, Mathematics, Engineer ...","[0160D514, 064E5072]","[Information retrieval, Statistics] ...","[065808E3, 0322F3B2]"
"[Computer Science, Physics, Engineering] ...","[07108A62, 04984686, 02A1BFD4, 01DCF91B, ...","[Computer architecture, Database, Simulation, ...","[00197524, 01F5E149, 0814ADB4, 07A90BBE, ..."

Fields of study parent list names (L2) ...,Fields of study parent list (L3) ...,Fields of study parent list names (L3) ...,Urls
"[Anomaly detection, Approximation algorithm] ...",[017737EA],[Wireless sensor network],[http://ieeexplore.ieee.o rg/xpls/abs_all.jsp?a ...
"[Game theory, Corporate finance, Mobile ...","[1E2C5488, 0B168EFB, 09A2762C, 09287742, ...","[Games, Valuation, Routing, Cellular ...",[http://citeseerx.ist.psu .edu/viewdoc/summary? ...
"[Mobile computing, Supercomputer] ...","[059A455C, 06318DBC, 05B1168F, 071DF995] ...","[Server, Video quality, Mobile telephony, Net ...",[http://ieeexplore.ieee.o rg/stamp/stamp.jsp?ar ...
"[Internet security, Knowledge-based systems, ...","[03C8C972, 03887C04, 012823EE, 05242AA7, ...","[Estimation, Database index, Computer virus, ...",[http://ieeexplore.ieee.o rg/xpls/icp.jsp?arnum ...
"[Computer performance, Information security, ...","[08C676B7, 06708471, 0BA80FE2, 0C1A7D17, ...","[Bloom filter, Wireless network, Network ...",[http://citeseerx.ist.psu .edu/viewdoc/summary? ...
"[Distributed algorithm, Internet security, Mo ...","[0B24E4DD, 09287742, 0496747A, 05B1168F, ...","[Mobile device, Cellular network, Mathematical ...",[http://citeseerx.ist.psu .edu/viewdoc/summary? ...
"[Consumer Behaviour, Information security, ...","[0C1A7D17, 04A33560, 00BB3814, 0A35AA56, ...","[Public-key cryptography, Privacy, Communications ...",[http://cis-linux1.temple .edu/~jiewu/research/ ...
"[Stochastic process, Spectroscopy, Computer ...","[099F6B15, 08EFFF50, 0496747A] ...","[Spectrum, Cognitive radio, Mathematical ...",[http://ieeexplore.ieee.o rg/xpl/abstractAuthor ...
"[Education, Information theory] ...","[04E299D3, 06708471, 0B4E7F98, 083E300E, ...","[Broadcasting, Wireless network, Information ...",[http://ieeexplore.ieee.o rg/xpl/freeabs_all.js ...
"[Data management, Spectroscopy, Data ...","[00F1CA12, 0A500C3A, 0090990D, 0C11C5B4, ...","[Data exchange, Microcontroller, Data ...",[http://ieeexplore.ieee.o rg/xpls/abs_all.jsp?a ...

abstract,authors,Aminer Paper ID,isbn
Fast anomaly detection and localization is ...,"[{'name': 'Shanshan Zheng', 'org': 'Dept. of ...",53e9a433b7602d9702d51013,978-1-4577-0094-1
This work centers on data dissemination in delay- ...,"[{'name': 'Ting Ning'}, {'name': 'Zhipeng Yan ...",53e9ba70b7602d9704692e85,978-1-4577-0094-1
"In this demonstration, we show our ongoing work ...","[{'name': 'Hyun Lee'}, {'name': 'Jae-Yong Yo ...",53e9b0fbb7602d9703b742e6,978-1-4673-1903-4
Estimating which nodes in a network are the ...,"[{'name': 'Wuqiong Luo'}, {'name': 'Wee-Peng Ta ...",53e9af0cb7602d97039439e4,978-1-4673-1903-4
"In this paper, we consider a variant of ...","[{'name': 'Zi Feng'}, {'name': 'Jianxia Nin ...",53e9a539b7602d9702e5be9d,978-1-4577-0094-1
As malware attacks become more frequent in mobile ...,"[{'name': 'Yong Li'}, {'name': 'Pan Hui'}, ...",53e9be57b7602d9704b19c6d,978-1-4577-0094-1
The recent penetration of smart mobile devices ...,"[{'name': 'Wei Peng'}, {'name': 'Feng Li'}, ...",53e9a4e4b7602d9702e05a05,978-1-4673-1903-4
"In cognitive radio networks, secondary u ...","[{'name': 'Zesheng Chen'}, {'name': 'Chao ...",53e9bbb5b7602d9704801e12,978-1-4577-0094-1
We propose an analytical framework to study ...,"[{'name': 'Alfred Asterjadhi'}, {'name': ...",53e9b85bb7602d9704422a7a,978-1-4577-0094-1
Simulators are important tools for analyzing and ...,"[{'name': 'Jingyao Zhang'}, {'name': 'Yi ...",53e9a869b7602d97031b2cdd,978-1-4577-0094-1

issn,issue,keywords,lang,n_citation,page_end,page_start,pdf,...
2155-5486,,"[greedy algorithm, energy efficient, wireless ...",en,15,394,386,,...
2155-5486,,"[mobile communication, nash theorem, two-person ...",en,71,547,539,,...
2155-5486 E-ISBN : 978-1-4673-1903-4 ...,,"[video quality, mobile opportunistic video-on- ...",en,4,72,71,,...
2155-5486 E-ISBN : 978-1-4673-1903-4 ...,,"[computer viruses, geometric tree, knowl ...",en,12,289,281,,...
2155-5486,,"[network congestion, packet dissemination, ...",en,13,376,368,,...
2155-5486,,"[content-based signature, distributed algorithms, ...",en,18,322,314,,...
2155-5486 E-ISBN : 978-1-4673-1903-4 ...,,[privacy preserving social aware incentive ...,en,9,604,596,,...
2155-5486,,"[adaptive energy- efficient spectrum, ...",en,6,214,206,,...
2155-5486,,"[radio broadcasting, medium access, data ...",en,8,385,377,,...
2155-5486,,"[gezel, instruction-set simulator, software ...",en,23,448,440,,...


# 2. Loading the Dataset to MongoDB

Turicreate and SFrame objects let us gather general statistics on how academic publication dynamics have changed over time, but they make it harder to derive more complex insights, such as the trends of a specific journal. To support such queries, we need to load the dataset into a different framework. In this study, we chose MongoDB for the more complex queries.
We installed MongoDB on Ubuntu 17.10 by following [these instructions](https://medium.com/gatemill/how-to-install-mongodb-3-6-on-ubuntu-17-10-ac0bc225e648). After MongoDB is installed and running, remember to set the user and password, and update the MONGO_HOST and MONGO_PORT variables in consts.py (you can also adjust the connection to use username and password authentication).
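
As a rough illustration of the authentication option mentioned above, the sketch below builds a MongoDB connection URI from a host, a port, and optional credentials. The function `build_mongo_uri` and the placeholder values are assumptions made for this example. In the project, the host and port would come from consts.py, and the connector code may build its connection differently.

```python
from typing import Optional
from urllib.parse import quote_plus

# Placeholder values; in the project these would come from consts.py.
MONGO_HOST = "localhost"
MONGO_PORT = 27017


def build_mongo_uri(host: str, port: int,
                    user: Optional[str] = None,
                    password: Optional[str] = None) -> str:
    """Return a MongoDB connection URI, with credentials if both are given."""
    if user is not None and password is not None:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/"
    return f"mongodb://{host}:{port}/"


print(build_mongo_uri(MONGO_HOST, MONGO_PORT))
print(build_mongo_uri(MONGO_HOST, MONGO_PORT, "reader", "p@ss:word"))
```

The resulting string can be passed to `pymongo.MongoClient(uri)`.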
The next step is to load the SFrames created above into MongoDB collections using mongo_connector.py:

In [None]:
from ScienceDynamics.mongo_connector import load_sframes
load_sframes(mag, sjr, jd)  # this will load the SFrames to the local MongoDB

2019-05-01 08:43:05,560 [MainThread  ] [INFO ]  Loading authors features
2019-05-01 08:43:05,608 [MainThread  ] [INFO ]  Converting


2019-05-01 08:52:09,766 [MainThread  ] [INFO ]  Inserting rows 0 - 100000 to journals.authors_features
2019-05-01 08:52:39,118 [MainThread  ] [INFO ]  Inserting rows 100000 - 200000 to journals.authors_features
2019-05-01 08:53:11,346 [MainThread  ] [INFO ]  Inserting rows 200000 - 300000 to journals.authors_features
2019-05-01 08:53:41,134 [MainThread  ] [INFO ]  Inserting rows 300000 - 400000 to journals.authors_features
2019-05-01 08:53:56,001 [MainThread  ] [INFO ]  Inserting rows 400000 - 500000 to journals.authors_features
2019-05-01 08:54:21,961 [MainThread  ] [INFO ]  Inserting rows 500000 - 600000 to journals.authors_features
2019-05-01 08:54:59,461 [MainThread  ] [INFO ]  Inserting rows 600000 - 700000 to journals.authors_features
2019-05-01 08:55:28,084 [MainThread  ] [INFO ]  Inserting rows 700000 - 800000 to journals.authors_features
2019-05-01 08:55:47,462 [MainThread  ] [INFO ]  Inserting rows 800000 - 900000 to journals.authors_features
2019-05-01 08:56:07,444 [MainThre

At the end of the loading process, six collections will have been loaded into the journals database.
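
The log output above shows rows being inserted in ranges of 100,000. The following is a hedged sketch of how such batched loading might look. It is not the actual implementation in mongo_connector.py. The names `iter_batches` and `insert_in_batches` are hypothetical, and `collection` is assumed to be a PyMongo collection (or any object with an `insert_many` method).

```python
from typing import Any, Dict, Iterable, Iterator, List


def iter_batches(rows: Iterable[Dict[str, Any]],
                 batch_size: int = 100_000) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive lists of at most batch_size rows."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def insert_in_batches(collection, rows, batch_size: int = 100_000) -> int:
    """Insert rows batch by batch; return the number of rows sent."""
    total = 0
    for batch in iter_batches(rows, batch_size):
        collection.insert_many(batch)  # assumed PyMongo Collection.insert_many
        total += len(batch)
    return total


# Small demonstration of the batching itself (no database needed).
sizes = [len(b) for b in iter_batches(({"i": i} for i in range(7)), batch_size=3)]
print(sizes)  # [3, 3, 1]
```

Batching keeps memory use bounded, because only one batch of row dictionaries is materialized at a time.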

In [4]:
from ScienceDynamics.mongo_connector import MongoDBConnector
MD = MongoDBConnector()

In [6]:
MD.client.journals.collection_names()

[]

In [61]:
for i, item in enumerate(MD.client.journals.aminer_mag_papers.find()):
    print(item)
    if i > 4:
        break

{'_id': ObjectId('5cca685465387acb12698540'), 'Aminer Paper ID': '555045bc45ce0a409eb58f90', 'Authors List Sorted': '["834A11E2","7E8BA14F","852B2668"]', 'Authors Number': 3.0, 'Conference ID mapped to venue name': '42D7146F', 'Field of study list': 'nan', 'Field of study list names': 'nan', 'Fields of study parent list (L0)': 'nan', 'Fields of study parent list (L1)': 'nan', 'Fields of study parent list (L2)': 'nan', 'Fields of study parent list (L3)': 'nan', 'Fields of study parent list names (L0)': 'nan', 'Fields of study parent list names (L1)': 'nan', 'Fields of study parent list names (L2)': 'nan', 'Fields of study parent list names (L3)': 'nan', 'Journal ID mapped to venue name': 'nan', 'Keywords List': 'nan', 'MAG Paper ID': '01B27BE8', 'Normalized paper title': 'evaluating polarity for verbal phraseological units', 'Normalized venue name': 'micai', 'Original paper title': 'Evaluating Polarity for Verbal Phraseological Units', 'Original venue name': 'mexican international confe
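
In the document printed above, list-valued fields such as `Authors List Sorted` are stored as JSON-encoded strings, and missing values appear as the string `'nan'`. The sketch below shows one way to convert such a document back into native Python values after reading it. The helpers `normalize_value` and `normalize_document` are hypothetical and are not part of the project code.

```python
import json
from typing import Any, Dict


def normalize_value(value: Any) -> Any:
    """Turn 'nan' into None and JSON-encoded list strings into lists."""
    if isinstance(value, str):
        if value == "nan":
            return None
        if value.startswith("[") and value.endswith("]"):
            try:
                return json.loads(value)
            except json.JSONDecodeError:
                return value  # not valid JSON after all; keep the raw string
    return value


def normalize_document(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Apply normalize_value to every field of a document."""
    return {key: normalize_value(val) for key, val in doc.items()}


# Field values copied from the document printed above.
sample = {
    "Authors List Sorted": '["834A11E2","7E8BA14F","852B2668"]',
    "Authors Number": 3.0,
    "Keywords List": "nan",
    "Normalized venue name": "micai",
}
print(normalize_document(sample))
```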

In [216]:
sf = SFrame.read_csv("data2.csv")

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,str,int,str,str,str,str,str,str,int,int,dict,dict,list,float,list,list,list,list,list,list,list,list,list,list,list,list,str,list,str,str,str,str,list,str,float,int,int,str,list,str,list,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [27]:
MD.client.journals.aminer_mag_papers.count()

29070469

In [7]:
MD.client.journals.aminer_mag_papers.remove()

{'ok': 1, 'n': 0}
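
The methods `collection_names()`, `count()`, and `remove()` used in the cells above are deprecated in later PyMongo 3.x releases and were removed in PyMongo 4.0. The sketch below shows the equivalent calls in the current API. The wrapper functions `collection_sizes` and `clear_collection` are hypothetical names introduced for this example, and `db` is assumed to be a PyMongo database object such as `MD.client.journals`.

```python
from typing import Dict


def collection_sizes(db) -> Dict[str, int]:
    """Map each collection name in db to its exact document count."""
    return {
        name: db[name].count_documents({})      # replaces Collection.count()
        for name in db.list_collection_names()  # replaces collection_names()
    }


def clear_collection(collection) -> int:
    """Delete every document in the collection; return how many were removed."""
    result = collection.delete_many({})  # replaces Collection.remove()
    return result.deleted_count


# Example usage (requires a running MongoDB server and the pymongo package):
#   db = MD.client.journals          # MD as created in the cells above
#   print(collection_sizes(db))
#   clear_collection(db.aminer_mag_papers)
```

On a collection with tens of millions of documents, `count_documents({})` scans the collection. `estimated_document_count()` is a faster alternative when an approximate count based on collection metadata is sufficient.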

In the second part of the tutorial, we will demonstrate how the above created MongoDB collections can be utilized to calculate various statistics on paper collections, authors, journals, and research domains.