# Indexing Scholarly Research Articles

This notebook is a pipeline for creating and analysing scientific networks. It has five phases: (1) download the publicly available data, (2) extract the data for the provided date range, (3) filter the cited and citing nodes of the seed DOIs, (4) load the data as scientific networks, and (5) index the articles.
The public dumps of both Crossref and COCI are free to use. Besides the downloaded data, a list of seed DOIs is required. The steps are independent of each other, but subsequent phases depend on the data being placed in the correct directories.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
import os

In [4]:
import Extract



In [5]:
import Transform

In [6]:
import Load

In [7]:
import Index

# Extraction Phase

This phase extracts the metadata for an input date range. Only the past decade is considered, both to give early-career researchers an equal chance and to avoid processing the full dataset. The date range should be chosen based on the available processing power and RAM. Large files are processed out-of-core with Dask, which can also scale to multi-node clusters. The steps are independent of each other, but subsequent phases depend on the extracted data fitting in memory.


In [8]:
#Input data directory to use
DataDir = "../DATA/"

# Input DASK configurations
physicalCores = 8
virtualPerCore = 2

#Input year range
start_year = 2010
end_year = 2020

In [9]:
#Directory sub folders path
path_to_crossref = DataDir+"Crossref/crossref_public_data_file_2021_01/*.json.gz"
path_to_crossref_parquet = DataDir+"Crossref/"+str(start_year)+"parquet"+str(end_year)+"/"
metadataFull = DataDir+"Crossref/"+str(start_year)+"metadata"+str(end_year)+".pkl"
metadataDesc = DataDir+"Crossref/"+str(start_year)+"desc"+str(end_year)+".pkl"
citNetFull = DataDir+"COCI/"+str(start_year)+"COCI"+str(end_year)+".pkl"

CrossrefColsDF = ['DOI','Venue','Authors','Year']
descColsDF = ['DOI','Title','Type','Subject']

path_to_COCI = DataDir+"COCI/csv/*.csv"
path_to_COCI_parquet = DataDir+"COCI/"+str(start_year)+"parquet"+str(end_year)+"/"
COCIColsDDF=['citing','cited','creation', 'oci']

#### Extract metadata for the provided date range in parquet format
This step converts the Crossref JSON dump to the Parquet binary file format, which speeds up the subsequent steps. For the date range 2010-2020 this requires around 15 GB of disk space. This is a compute-heavy step and takes approximately five hours to complete. It uses a Dask Bag to read the JSON files.
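The year filter at the heart of this step can be sketched with the standard library alone. This is a minimal illustration, not the code in `Extract`: it assumes each dump file is a gzipped JSON object with an `items` list and that the publication year is the first entry of `issued.date-parts`. The helper names `issued_year` and `records_in_range` and the sample DOIs are hypothetical.

```python
import gzip
import json
import os
import tempfile


def issued_year(record):
    """Return the first year in the record's `issued` date-parts, or None."""
    parts = record.get("issued", {}).get("date-parts", [[None]])
    return parts[0][0] if parts and parts[0] else None


def records_in_range(path, start_year, end_year):
    """Yield records from one gzipped dump file whose issue year is in range."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Assumption: every dump file wraps its records in an "items" list.
        for record in json.load(handle).get("items", []):
            year = issued_year(record)
            if year is not None and start_year <= year <= end_year:
                yield record


# Tiny made-up dump file to exercise the filter.
sample = {"items": [
    {"DOI": "10.1000/a", "issued": {"date-parts": [[2009, 5]]}},
    {"DOI": "10.1000/b", "issued": {"date-parts": [[2015]]}},
    {"DOI": "10.1000/c", "issued": {"date-parts": [[None]]}},
]}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "0.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        json.dump(sample, handle)
    kept_dois = [r["DOI"] for r in records_in_range(sample_path, 2010, 2020)]

print(kept_dois)
```

Records without a usable year are dropped rather than guessed, which keeps the extracted range strict.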

In [10]:
%time Extract.convertCrossrefJSONDumpToParquetDDF(path_to_crossref, path_to_crossref_parquet, physicalCores, virtualPerCore, start_year, end_year)

<Client: 'tcp://127.0.0.1:37166' processes=8 threads=16, memory=63.95 GiB>


  [["('to-parquet-c3f2c5312aa9aa1516b4f9a0fc27867a', ... 10parquet2020']
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and 
keep data on workers

    future = client.submit(func, big_data)    # bad

    big_future = client.scatter(big_data)     # good
    future = client.submit(func, big_future)  # good
  % (format_bytes(len(b)), s)
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied
distributed.nanny - ERROR - Failed to kill worker process: [WinError 5] Access is denied


Wall time: 4h 32min 39s


#### Convert parquet to pickle format
This step extracts the relevant metadata columns to pickle format. To process the metadata within 64 GB of RAM, it is split into two files: the first is used for creating the scientific networks, and the second is used to display details of the publications indexed by the system. It is a memory-intensive step, and each conversion takes just under an hour to complete. It creates two 8 GB binary files that can be used by subsequent phases.

In [11]:
%time Extract.convertParquetToPickleDF (path_to_crossref_parquet, CrossrefColsDF, metadataFull)

Wall time: 55min


In [12]:
%time Extract.convertParquetToPickleDF (path_to_crossref_parquet, descColsDF, metadataDesc)

Wall time: 51min 19s


#### Extract COCI for the provided date range in parquet format
This step converts the COCI dump to the Parquet binary file format, which speeds up the subsequent steps. For the date range 2010-2020 this requires around 45 GB of disk space. This is a compute-heavy step and takes approximately three hours to complete. It uses a Dask DataFrame to read the CSV files.
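As a rough illustration of the date filter, the sketch below keeps only the four columns listed in `COCIColsDDF` and the rows whose `creation` value starts with a year inside the range. It uses the standard `csv` module instead of Dask, and it assumes that the filter is applied to the `creation` column. The function name `citations_in_range` and the sample rows are hypothetical.

```python
import csv
import io


def citations_in_range(rows, start_year, end_year):
    """Yield citation rows whose `creation` year lies in [start_year, end_year]."""
    for row in rows:
        creation = row.get("creation") or ""
        # `creation` may be YYYY, YYYY-MM or YYYY-MM-DD; rows without it are skipped.
        if len(creation) >= 4 and creation[:4].isdigit():
            if start_year <= int(creation[:4]) <= end_year:
                yield {
                    "citing": row["citing"],
                    "cited": row["cited"],
                    "creation": creation,
                    "oci": row["oci"],
                }


# Made-up rows in the column layout assumed for the CSV dump.
sample_csv = io.StringIO(
    "oci,citing,cited,creation,timespan\n"
    "oci-1,10.1000/a,10.1000/b,2008-03-01,P1Y\n"
    "oci-2,10.1000/c,10.1000/a,2015-07,P3Y\n"
    "oci-3,10.1000/d,10.1000/c,,\n"
)
kept = list(citations_in_range(csv.DictReader(sample_csv), 2010, 2020))
print(kept)
```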

In [13]:
%time Extract.convertCOCIDumpToParquetDDF (path_to_COCI, COCIColsDDF, physicalCores, virtualPerCore, path_to_COCI_parquet, start_year, end_year)

<Client: 'tcp://127.0.0.1:37458' processes=8 threads=16>




Wall time: 2h 15min 22s


# Transformation Phase

This phase transforms the acquired data into a format suitable for loading into the network processing library. Each step depends on the preceding steps. This phase takes around 2 hours to complete.

In [14]:
#Input use case name to append 
usecase="IND"

#Input random network Flag
randomFlag = False

#Input citation cascade details / total ego levels 
totalLevels = 3

In [15]:
usecase= "../"+usecase+"/"
DOIFile = usecase+"DOI.txt"
DOIPkl = usecase+"DOIs.pkl"

In [16]:
# Output paths for the filtered citation network and metadata
citNetFull = usecase+str(start_year)+"COCIFull"+str(end_year)+".pkl"
metaData = usecase+str(start_year)+"metadata"+str(end_year)+".pkl"
citNet = usecase+str(start_year)+"COCI"+str(end_year)+".pkl"

#### Convert DOI List to pickle
This step converts the seed DOIs to pickle format. It must run before the data is transformed.

In [17]:
%time Transform.getDOIsFromTxt (DOIFile, DOIPkl)


Wall time: 30 ms


#### Filter COCI using DOIs ego network
This step iteratively adds the references and citations of the DOIs and updates the DOI list.
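The iterative expansion can be pictured with a small in-memory sketch. It is not the Dask-based code in `Transform`: it assumes that at each level every edge touching the current DOI list is kept and that both endpoints join the list for the next level. The function name `expand_ego_network` and the toy edges are hypothetical.

```python
def expand_ego_network(edges, seed_dois, total_levels):
    """Grow the DOI list level by level through references and citations."""
    nodes = set(seed_dois)
    kept_edges = set()
    for _ in range(total_levels):
        # An edge is relevant when either its citing or its cited DOI is already known.
        touching = {
            (citing, cited)
            for citing, cited in edges
            if citing in nodes or cited in nodes
        }
        kept_edges |= touching
        # Update the DOI list with both endpoints of every kept edge.
        nodes |= {doi for edge in touching for doi in edge}
    return nodes, kept_edges


# Toy citation chain a -> b -> c -> d -> e plus an unrelated edge x -> y.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("x", "y")]
ego_nodes, ego_edges = expand_ego_network(toy_edges, {"c"}, total_levels=2)
print(sorted(ego_nodes), len(ego_edges))
```

With two levels, the seed `c` reaches its neighbours and their neighbours, while the unrelated edge never enters the network. The node and edge counts grow quickly with each level, as the output table of the real run shows.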

In [18]:
%time Transform.filterCOCIDump (path_to_COCI_parquet, physicalCores, virtualPerCore,citNetFull, DOIPkl, totalLevels)

<Client: 'tcp://127.0.0.1:37639' processes=8 threads=16, memory=63.95 GiB>
Ego Nodes Edges
1 4649 4921
2 100756 210430
3 7222794 16286012
Wall time: 1h 36min 46s


#### Filter metadata using COCI DOIs
This step keeps only the metadata of the relevant DOIs. It is needed to create the venue citation network and the author citation network that correspond to the publication citation network created by the previous step.

In [19]:
%time Transform.filterMetadata (metadataFull, citNetFull, metaData, citNet)

Wall time: 25min 8s


In [20]:
%time Transform.filterZeroOutDegNodesFromCOCI (metaData, citNet)

Wall time: 3min 3s


# Loading Phase

This phase loads the data into a network processing library. The scripts have been optimised to work with millions of edges within 64 GB of memory. Each unique node is first mapped to an increasing integer that serves as its node label, and edges are then created between the integer labels. This phase takes a few minutes to complete.
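The labelling scheme can be sketched as follows. This is only an illustration of mapping nodes to increasing integers and rewriting edges over those integers; the actual layout of the `.hash` and `.graph` files written by `Load` is not shown in this notebook, and the function name `label_nodes` is hypothetical.

```python
def label_nodes(edges):
    """Map each distinct node to an increasing integer and relabel the edges."""
    labels = {}
    labelled_edges = []
    for source, target in edges:
        for node in (source, target):
            if node not in labels:
                # The next unused integer becomes the label of a new node.
                labels[node] = len(labels)
        labelled_edges.append((labels[source], labels[target]))
    return labels, labelled_edges


doi_edges = [("10.1000/a", "10.1000/b"), ("10.1000/c", "10.1000/a")]
node_labels, int_edges = label_nodes(doi_edges)
print(node_labels)
print(int_edges)
```

Storing edges as pairs of small integers instead of DOI strings is what keeps networks with millions of edges within memory.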


In [21]:
# Initialization for Loading
autCitNet = usecase+"autCOCI.pkl"
autCitNetLst = usecase+"autCOCILst.pkl"

refCutoffPub = 3
refCutoffVen = 5
refCutoffAut = 10

pubF = usecase+"Publication"
venF = usecase+"Venue"
autF = usecase+"Author"

#### Generate publication citation network
This step generates the publication citation network from the COCI data. The COCI data contains a few self-loops and bidirectional loops, which are removed.
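A minimal sketch of the loop removal is given below. It assumes that both directions of a bidirectional pair are dropped; the notebook does not say whether one direction is retained, so treat that choice as an assumption. The function name `drop_loops` is hypothetical.

```python
def drop_loops(edges):
    """Remove self-loops and every edge whose reverse edge is also present."""
    edge_set = set(edges)
    return [
        (citing, cited)
        for citing, cited in edges
        if citing != cited and (cited, citing) not in edge_set
    ]


raw_edges = [("a", "a"), ("a", "b"), ("b", "a"), ("c", "a")]
clean_edges = drop_loops(raw_edges)
print(clean_edges)
```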

In [22]:
%time Load.generatePublicationCitationNetwork(pubF, metaData, citNet, randomFlag, refCutoffPub)

Number of Nodes: 407632, Number of Edges: 657046
Graph created for ../IND/Publication 
Number of Nodes: 5519, Number of Edges: 13776
Wall time: 46.2 s


#### Generate venue citation network
This step generates the venue citation network from the journals of the publications in the filtered publication citation network. Although self-citations at the level of an individual or publisher are sometimes criticised, they are a valid form of recognition and are kept.

In [23]:
%time Load.generateVenueCitationNetwork(venF, citNet, refCutoffVen)

Number of Nodes: 1547, Number of Edges: 13782
Graph created for ../IND/Venue 
Number of Nodes: 404, Number of Edges: 10243
Wall time: 1.49 s



#### Author disambiguation
One of the key challenges in bibliometric analysis is author name disambiguation. Unlike Microsoft Academic Graph \cite{wang_microsoft_2020}, Crossref does not provide any disambiguation. If an ORCID is available in Crossref, it is recorded along with the given name and last name of the author. If a name has multiple ORCIDs associated with it, the ORCID is used as the identifier instead of the name. This way, authors within the field are disambiguated using a simple approach, similar to \cite{milojevic_accuracy_2013}.
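The rule described above can be sketched in a few lines. This is an illustration, not the code in `Load.generateAuthorID`: it assumes author records carry `given`, `family` and an optional `ORCID` field, and that a record without an ORCID falls back to the name even when the name is ambiguous. The function name `assign_author_ids` and the ORCID values are made up.

```python
from collections import defaultdict


def full_name(author):
    """Join the given and family name into one display name."""
    return f"{author['given']} {author['family']}"


def assign_author_ids(authors):
    """Use the ORCID as identifier only when a name maps to several ORCIDs."""
    orcids_by_name = defaultdict(set)
    for author in authors:
        if author.get("ORCID"):
            orcids_by_name[full_name(author)].add(author["ORCID"])

    ids = []
    for author in authors:
        name = full_name(author)
        orcid = author.get("ORCID")
        if orcid and len(orcids_by_name.get(name, ())) > 1:
            ids.append(orcid)   # ambiguous name: ORCID tells the authors apart
        else:
            ids.append(name)    # unambiguous name, or no ORCID to fall back on
    return ids


sample_authors = [
    {"given": "Wei", "family": "Wang", "ORCID": "0000-0001-0000-0001"},
    {"given": "Wei", "family": "Wang", "ORCID": "0000-0002-0000-0002"},
    {"given": "Wei", "family": "Wang"},
    {"given": "Ada", "family": "Lovelace", "ORCID": "0000-0003-0000-0003"},
]
author_ids = assign_author_ids(sample_authors)
print(author_ids)
```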

In [24]:
%time Load.generateAuthorID (metaData, citNet, autCitNet, autCitNetLst)

Wall time: 2min 7s



#### Generate author citation network
This step generates the author citation network from the authors of the publications in the filtered publication citation network. The Cartesian product of the authors of the citing and cited publications is taken. For publications with a large number of authors, the list is capped at 10 authors, including the last two authors. This cap is necessary to keep the network in memory.
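The capping and the Cartesian product can be sketched as below. The sketch assumes that the cap keeps the first eight and the last two authors of a long list; the notebook only states that up to 10 authors are kept including the last two, so the exact split is an assumption. The names `cap_authors` and `author_citation_edges` are hypothetical.

```python
from itertools import product


def cap_authors(authors, limit=10, keep_last=2):
    """Keep at most `limit` authors, always including the last `keep_last`."""
    if len(authors) <= limit:
        return list(authors)
    return list(authors[:limit - keep_last]) + list(authors[-keep_last:])


def author_citation_edges(citing_authors, cited_authors, limit=10):
    """Return one edge per (citing author, cited author) pair."""
    return list(product(cap_authors(citing_authors, limit),
                        cap_authors(cited_authors, limit)))


many_authors = [f"A{i}" for i in range(1, 13)]   # 12 authors, above the cap
capped = cap_authors(many_authors)
author_edges = author_citation_edges(many_authors, ["B1", "B2"])
print(capped)
print(len(author_edges))
```

Without the cap, a single citation between two publications with hundreds of authors each would add tens of thousands of edges.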

In [25]:
%time Load.generateAuthorCitationNetwork(autF, autCitNet, autCitNetLst, refCutoffAut)

Number of Nodes: 9084, Number of Edges: 141947
Graph created for ../IND/Author 
Number of Nodes: 2424, Number of Edges: 62852
Wall time: 38.1 s


# Indexing Phase

This phase creates the index. The steps must be executed sequentially. Percentile ranks are used instead of raw scores. This phase takes a few minutes to complete.
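One common way to turn raw scores into percentile ranks is sketched below. It assumes the definition "percentage of scores that are less than or equal to the given score"; the definition used by `Index` is not stated in this notebook. The function name `percentile_ranks` and the sample scores are hypothetical.

```python
from bisect import bisect_right


def percentile_ranks(scores):
    """Map each key to the percentage of scores that are <= its own score."""
    ordered = sorted(scores.values())
    total = len(ordered)
    return {
        key: 100.0 * bisect_right(ordered, value) / total
        for key, value in scores.items()
    }


raw_scores = {"v1": 0.10, "v2": 0.40, "v3": 0.40, "v4": 0.90}
ranks = percentile_ranks(raw_scores)
print(ranks)
```

Percentile ranks put venue and author scores on the same 0 to 100 scale, so they can be combined even though the underlying networks differ greatly in size.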


In [26]:
# The following files are needed in the use case directory

VenueGraph = usecase+"Venue.graph"
VenueHash = usecase+"Venue.hash"

PublicationGraph = usecase+"Publication.graph"
PublicationHash = usecase+"Publication.hash"

AuthorGraph = usecase+"Author.graph"
AuthorHash = usecase+"Author.hash"

# The following files will be created in the use case directory
AuthorInfoCSV = usecase+"Author.info"
VenueInfoCSV = usecase+"Venue.info"
PublicationInfoCSV = usecase+"Publication.info"


#### Calculate author and venue score using PageRank
This step indexes venues and authors using their PageRank scores. A node cited by a high-scoring node also receives a high score. This recursive mechanism ensures that the index rewards not only highly cited nodes but also nodes cited by other important nodes.
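The recursive scoring can be illustrated with a plain power-iteration PageRank. This is a textbook sketch rather than the routine used by `Index`: the damping factor of 0.85, the uniform redistribution of score from nodes without outgoing edges, and the function name `pagerank` are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Compute PageRank by power iteration over (citing, cited) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Score held by nodes that cite nothing is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for node in nodes:
            if out_links[node]:
                share = damping * rank[node] / len(out_links[node])
                for target in out_links[node]:
                    new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# "c" is cited twice; "d" is cited once, but by the well-cited "c".
toy_ranks = pagerank([("a", "c"), ("b", "c"), ("c", "d")])
print(toy_ranks)
```

In the toy graph, `d` receives only one citation, yet it scores well above the uncited nodes because that citation comes from the well-cited `c`.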

In [27]:
%time Index.generateVenueRank(VenueHash, VenueGraph, VenueInfoCSV)

Venue Ranking Completed
Wall time: 293 ms


In [28]:
%time Index.generateAuthorRank(AuthorHash, AuthorGraph, AuthorInfoCSV)

Author Ranking Completed
Wall time: 2.59 s



#### Calculate publication score using author and venue score
This step indexes publications based on two scores. The first combines the score of the venue where the publication appeared with the cumulative score of its authors. The second is the cumulative score of the venues and authors that cite the publication.
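The two scores can be sketched as follows. The sketch assumes that "cumulative" means a plain sum and returns the two scores separately, because the notebook does not state how they are combined into the final rank. The function name `publication_scores`, the data layout and the sample values are hypothetical.

```python
def publication_scores(publications, citations, venue_score, author_score):
    """Return (own, cited_by) score dictionaries keyed by publication.

    publications: {doi: {"venue": str, "authors": [str, ...]}}
    citations:    iterable of (citing_doi, cited_doi) pairs
    """
    # First score: venue of the publication plus the sum over its authors.
    own = {
        doi: venue_score.get(meta["venue"], 0.0)
        + sum(author_score.get(author, 0.0) for author in meta["authors"])
        for doi, meta in publications.items()
    }
    # Second score: sum of the venue and author scores of the citing publications.
    cited_by = {doi: 0.0 for doi in publications}
    for citing, cited in citations:
        if citing in own and cited in cited_by:
            cited_by[cited] += own[citing]
    return own, cited_by


venue_score = {"V1": 80.0, "V2": 40.0}
author_score = {"A": 90.0, "B": 50.0, "C": 10.0}
publications = {
    "p1": {"venue": "V1", "authors": ["A", "B"]},
    "p2": {"venue": "V2", "authors": ["C"]},
    "p3": {"venue": "V2", "authors": ["B", "C"]},
}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
own_score, cited_by_score = publication_scores(
    publications, citations, venue_score, author_score
)
print(own_score)
print(cited_by_score)
```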

In [29]:
%time Index.generatePublicationRank(VenueInfoCSV, metaData, PublicationGraph, PublicationHash, AuthorInfoCSV, PublicationInfoCSV)

Publication Ranking Completed
Wall time: 17.7 s
