# Ranking of articles using open-access citation-metadata

Scholarly publications have grown exponentially in past decades, whereas the number of research topics is estimated to grow only linearly. This suggests that numerous articles appear every day within any given research field, and researchers invest considerable time and energy in keeping up with the state of the art. Research is a continuous process that builds on past work, which is acknowledged through citations. There are numerous reasons why a research article gets cited, and there are also criticisms of using citations to assess the value of current work. Nevertheless, with the current information overload, it is not easy to keep abreast of all published work. Researchers in the 20th century would dig through all the available literature to find the latest trends, but today's researcher has more to read on a topic than fits in a lifetime. They need access to current research as soon as it appears, but the citation-count metrics currently in practice limit this, because an article must first acquire a reasonable number of citations, a number that varies from field to field. Our main contribution is to use a heterogeneous network of articles, authors and journals to recommend articles in a research field.

# Import

In [1]:
import Ranking # from https://github.com/bilal-dsu/Ranking/

# standard library
import csv
import json
import os
import re
import sys
from itertools import combinations
from os import path

# third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import snap
from matplotlib import pyplot
from scipy.stats import spearmanr

import measures # from https://github.com/dkaterenchuk/ranking_measures


# Initializations Total Citations

The original graph is filtered to nodes from the years 2000 to 2018; the result is termed Total Citations (TC). We further remove any journals with zero out-degree, since they do not participate in the ranking method.
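The zero out-degree filter can be illustrated on a plain edge list. This is a minimal sketch, not the code in `Ranking.generateSubGraph`: the function name `drop_zero_out_degree`, the single-pass behaviour and the toy journal names are all assumptions made for illustration.

```python
from collections import defaultdict


def drop_zero_out_degree(edges):
    """Remove journals that cite nothing, together with edges pointing at them.

    `edges` is a list of (citing_journal, cited_journal) pairs.
    This is a single pass (an assumption): journals that lose all their
    outgoing edges as a result of the removal are not pruned again.
    """
    out_degree = defaultdict(int)
    for citing, _cited in edges:
        out_degree[citing] += 1
    keep = set(out_degree)  # journals with at least one outgoing citation
    # The citing side of every edge is in `keep` by construction,
    # so only the cited side needs checking.
    return [(citing, cited) for citing, cited in edges if cited in keep]


# Hypothetical toy data: journal "D" is cited but never cites anyone.
toy_edges = [("A", "B"), ("B", "A"), ("B", "C"), ("C", "A"), ("A", "D")]
filtered_edges = drop_zero_out_degree(toy_edges)
print(filtered_edges)
```

In the toy data, only the edge into "D" is dropped; every other journal keeps at least one outgoing citation.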

In [2]:
# following files must be present in the CWD
metaDataCSV = "MetaData 2000-2018.csv" 
ArticleGraph = "ArticleGraph.graph"
ArticleHash = "ArticleHash.hash"

# following files will be created in the CWD
JournalCitationTXT = "JournalCitation.txt"
JournalGraph = "JournalGraph.graph"
JournalHash = "JournalHash.hash"
JournalSubGraph = "JournalSubGraph.graph"
SubMetaDataCSV = "SubMetaData.csv"
AuthorCitationTXT = "AuthorCitation.txt"
ArticleCitationTXT = "ArticleCitation.txt"
AuthorGraph = "AuthorGraph.graph"
AuthorHash = "AuthorHash.hash"

AuthorInfoCSV = "AuthorInfo.csv"
JournalInfoCSV = "JournalInfo.csv"
ArticleInfoCSV = "ArticleInfo.csv"

AuthorRankCSV = "AuthorRank.csv"
JournalRankCSV = "JournalRank.csv"
ArticleRankCSV = "ArticleRank.csv"

ArticlesGraphStats="ArticleGraphStats.csv"
JournalGraphStats="JournalGraphStats.csv"
AuthorGraphStats="AuthorGraphStats.csv"
GraphStatsOverall="GraphStatsOverall.csv"

# Generate Total Citations

In [None]:
Ranking.generateJournanalCitationNetworkText(metaDataCSV, JournalCitationTXT)

In [None]:
Ranking.generateJournalCitationGraph(JournalCitationTXT, JournalGraph, JournalHash)

In [None]:
Ranking.generateSubGraph(JournalHash, JournalGraph, JournalSubGraph, metaDataCSV, SubMetaDataCSV) 

In [None]:
Ranking.generateAuthorArticleCitationNetworkText(SubMetaDataCSV, AuthorCitationTXT, ArticleCitationTXT)

In [None]:
Ranking.generateAuthorArticleGraph(AuthorCitationTXT, AuthorGraph, AuthorHash, ArticleCitationTXT, 
                           ArticleGraph, ArticleHash)

# Initializations Early Citations

To evaluate the ranking technique, we take the nodes from the year 2005 and apply a cut-off on citations at 2010; the result is termed Early Citations (EC). The cut-off window is configurable. Only a few past years are considered, to give an equal chance to early-career researchers.
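One way such a window could be applied is sketched below. The record layout (`id`, `year`), the function name `split_temporal` and the rule that both ends of a citation must fall inside the cut-off window are assumptions for illustration; the actual behaviour of `Ranking.generateTemporalNetwork` may differ.

```python
def split_temporal(articles, citations, rank_start, rank_end, cut_start, cut_end):
    """Return (articles to be ranked, citations inside the cut-off window).

    `articles` is a list of dicts with hypothetical keys "id" and "year";
    `citations` is a list of (citing_id, cited_id) pairs.
    """
    year = {a["id"]: a["year"] for a in articles}
    # Articles published in the ranking year(s), e.g. 2005.
    ranked = {i for i, y in year.items() if rank_start <= y <= rank_end}
    # Citations whose citing and cited articles both fall in the window.
    early = [
        (src, dst)
        for src, dst in citations
        if src in year and dst in year
        and cut_start <= year[src] <= cut_end
        and cut_start <= year[dst] <= cut_end
    ]
    return ranked, early


# Hypothetical toy data.
toy_articles = [
    {"id": "a1", "year": 2003},
    {"id": "a2", "year": 2005},
    {"id": "a3", "year": 2008},
    {"id": "a4", "year": 2012},
]
toy_citations = [("a2", "a1"), ("a3", "a2"), ("a4", "a2"), ("a4", "a3")]

ranked_ids, early_citations = split_temporal(
    toy_articles, toy_citations, 2005, 2005, 2000, 2010
)
print(ranked_ids, early_citations)
```

Citations made by the 2012 article fall outside the window, so they count toward TC but not toward EC.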

In [None]:
# Provide values for Early Citations cutoff

RankYearStart = 2005
RankYearEnd = 2005
CutOffStart = 2000
CutOffEnd = 2010

# following files will be created in the CWD

metaDataRankYearCSV = "metaData" + str(RankYearStart)  + "-" + str(RankYearEnd) + ".csv"
metaDataCutOffYearCSV = "metaData" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".csv"

JournalCutOffYearTXT = "Journal" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".txt"
JournalCutOffYearGraph = "Journal" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".graph"
JournalCutOffYearHash = "Journal" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".hash"
JournalCutOffYearInfoCSV = "Journal" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Info.csv"
JournalCutOffYearRankCSV = "Journal" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Rank.csv"

JournalCutOffYearSubGraph = "JournalSubGraph"+ str(CutOffStart)  + "-" + str(CutOffEnd) + ".graph"

ArticleCutOffYearTXT = "Article" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".txt"
ArticleCutOffYearGraph = "Article" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".graph"
ArticleCutOffYearHash = "Article" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".hash"
ArticleCutOffYearInfoCSV = "Article" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Info.csv"
ArticleCutOffYearRankCSV = "Article" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Rank.csv"

AuthorCutOffYearTXT = "Author" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".txt"
AuthorCutOffYearGraph = "Author" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".graph"
AuthorCutOffYearHash = "Author" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".hash"
AuthorCutOffYearInfoCSV = "Author" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Info.csv"
AuthorCutOffYearRankCSV = "Author" + str(CutOffStart)  + "-" + str(CutOffEnd) + "Rank.csv"

AuthorGraphStatsCutOffYear = "AuthorGraphStats" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".csv"
ArticleGraphStatsCutOffYear = "ArticleGraphStats" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".csv"
JournalGraphStatsCutOffYear = "JournalGraphStats" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".csv"
GraphStatsCutOffYear = "GraphStats" + str(CutOffStart)  + "-" + str(CutOffEnd) + ".csv"

# Generate Early Citations

In [None]:
Ranking.generateTemporalNetwork(SubMetaDataCSV, RankYearStart,RankYearEnd, CutOffStart, CutOffEnd, 
                        metaDataRankYearCSV, metaDataCutOffYearCSV, ArticleHash, ArticleGraph)

In [None]:
Ranking.generateJournanalCitationNetworkText(metaDataCutOffYearCSV, JournalCutOffYearTXT)

In [None]:
Ranking.generateJournalCitationGraph(JournalCutOffYearTXT, JournalCutOffYearGraph, JournalCutOffYearHash)

In [None]:
Ranking.generateAuthorArticleCitationNetworkText(metaDataCutOffYearCSV, AuthorCutOffYearTXT, ArticleCutOffYearTXT)

In [None]:
Ranking.generateAuthorArticleGraph(AuthorCutOffYearTXT, AuthorCutOffYearGraph, AuthorCutOffYearHash, 
                           ArticleCutOffYearTXT, ArticleCutOffYearGraph, ArticleCutOffYearHash)

# Calculate Rank

The rank of a journal or author is given by the PageRank measure in Equation 1.
\begin{equation}
\label{eq:Rank}
R(i) = \frac{1-\alpha}{n} + \alpha \sum_{j} R(j) \frac{a_{ij}}{N_i}
\end{equation}
where $n$ is the total number of nodes in the network,

$\alpha \in (0, 1)$ is the damping factor (set to $0.85$),

$a_{ij}$ is 1 if node $i$ cites node $j$ and 0 otherwise,

$N_i$ is the total number of neighbours of node $i$.
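For intuition, PageRank can be computed by power iteration. The sketch below uses the standard formulation, in which each node shares its rank equally among the nodes it cites, and it spreads the rank of nodes with no outgoing citations uniformly over all nodes. The dangling-node handling, the function name `pagerank` and the toy graph are assumptions; this is not the routine that `Ranking.generateAuthorJournalRank` calls.

```python
def pagerank(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a list of (citing, cited) pairs."""
    nodes = sorted({u for edge in edges for u in edge})
    n = len(nodes)
    out = {u: [] for u in nodes}
    for citing, cited in edges:
        out[citing].append(cited)

    rank = {u: 1.0 / n for u in nodes}  # uniform start
    for _ in range(max_iter):
        # Rank held by nodes that cite nothing is redistributed uniformly
        # (an assumption; other conventions exist).
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - alpha) / n + alpha * dangling / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = alpha * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: A and B both cite C, and C cites A.
toy_graph = [("A", "C"), ("B", "C"), ("C", "A")]
ranks = pagerank(toy_graph)
print(ranks)
```

Node B receives no citations, so it keeps only the baseline $(1-\alpha)/n$, while C, cited by both A and B, ends up with the highest rank.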

The rank of the journal and of the authors is transferred to the article as given by Equation 2, thereby raising the rank of an article that was cited by an influential journal or author. The rank of the article $a$ published in journal $b$ by the author(s) $c$ is:
\begin{equation} \label{eq:ArticleRank}
AR(a) = (1-\beta) R(b) + \beta \frac{\sum_{i \in c} R(i)}{|c|}
\end{equation}
where $\beta \in (0, 1)$ weights the author influence against the journal influence (set to $0.5$), the sum runs over the authors $i$ of the article, and $|c|$ is the number of authors.
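As a worked instance of Equation 2, the sketch below blends a journal rank with the mean rank of an article's authors. It reads the division by $c$ as division by the number of authors, which is an interpretation; the function name `article_rank` and the numeric values are hypothetical.

```python
def article_rank(journal_rank, author_ranks, beta=0.5):
    """Blend the journal rank with the mean author rank (Equation 2)."""
    if not author_ranks:
        raise ValueError("an article needs at least one author")
    mean_author_rank = sum(author_ranks) / len(author_ranks)
    return (1.0 - beta) * journal_rank + beta * mean_author_rank


# Hypothetical values: journal rank 0.2, two authors ranked 0.1 and 0.5.
# Mean author rank is 0.3, so AR = 0.5 * 0.2 + 0.5 * 0.3 = 0.25.
example_rank = article_rank(0.2, [0.1, 0.5])
print(example_rank)
```

With $\beta = 0.5$ the journal and the authors contribute equally; moving $\beta$ toward 1 shifts the weight onto the authors.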

In [None]:
Ranking.generateAuthorJournalRank(AuthorHash, AuthorGraph, AuthorInfoCSV, JournalHash, JournalSubGraph, JournalInfoCSV, JournalGraphStats, AuthorGraphStats)

In [None]:
Ranking.generateArticleRank(JournalInfoCSV, SubMetaDataCSV, ArticleGraph, ArticleHash, AuthorInfoCSV, ArticleInfoCSV, ArticlesGraphStats)

In [None]:
Ranking.generateAuthorJournalRank(AuthorCutOffYearHash, AuthorCutOffYearGraph, AuthorCutOffYearInfoCSV, 
             JournalCutOffYearHash, JournalCutOffYearGraph, JournalCutOffYearInfoCSV, JournalGraphStatsCutOffYear,
                           AuthorGraphStatsCutOffYear)

In [None]:
Ranking.generateArticleRank(JournalCutOffYearInfoCSV, metaDataCutOffYearCSV, ArticleCutOffYearGraph, ArticleCutOffYearHash, 
                    AuthorCutOffYearInfoCSV, ArticleCutOffYearInfoCSV, ArticleGraphStatsCutOffYear)

# Analysis

On the temporal citation network, we correlate the EC rank of publications with the rank calculated using TC, which serves as the baseline for evaluating the ranking mechanism. To identify whether our technique captures key articles, that is, articles with a high EC rank that went on to attain a high TC rank, we apply Spearman's rank correlation. Our preliminary analysis suggests that the ranking technique is stable: the rank calculated with EC correlates with the rank calculated with TC. However, there is no significant correlation with citation count, suggesting that the technique does not rely on merely counting citations. In other words, rather than every citation counting equally, a citation coming from a reputable journal carries more weight.
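The EC-versus-TC comparison can be sketched with `scipy.stats.spearmanr`, which the notebook already imports. The helper name `rank_agreement`, the article identifiers and the rank values below are hypothetical toy data, not results from the study.

```python
from scipy.stats import spearmanr


def rank_agreement(scores_a, scores_b):
    """Spearman correlation over the articles present in both score dicts."""
    common = sorted(set(scores_a) & set(scores_b))
    rho, p_value = spearmanr(
        [scores_a[k] for k in common],
        [scores_b[k] for k in common],
    )
    return rho, p_value


# Hypothetical article ranks from early citations (EC) and total citations (TC).
early_rank = {"p1": 0.40, "p2": 0.25, "p3": 0.20, "p4": 0.15}
total_rank = {"p1": 0.35, "p2": 0.30, "p3": 0.15, "p4": 0.20}

rho, p_value = rank_agreement(early_rank, total_rank)
print(rho, p_value)
```

Only the ordering of the scores matters here: the two lowest-ranked articles swap places between EC and TC, which lowers the coefficient below 1 without making it negative.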

In [None]:
Ranking.generateGraphStats(JournalGraphStats, AuthorGraphStats, ArticlesGraphStats, GraphStatsOverall)

In [None]:
Ranking.generateGraphStats(JournalGraphStatsCutOffYear, AuthorGraphStatsCutOffYear, 
                   ArticleGraphStatsCutOffYear , GraphStatsCutOffYear)

In [None]:
Ranking.correlationAnalysis(AuthorInfoCSV, AuthorCutOffYearInfoCSV, JournalInfoCSV, 
            JournalCutOffYearInfoCSV, ArticleInfoCSV, ArticleCutOffYearInfoCSV, metaDataRankYearCSV)
