# STA218 Project
## 0.1 data preparation
### dataset introduction
The Data folder contains two data files:

1. paper_information.csv: the complete list of 5,746 papers from four journals (listed below), with the title, publisher, doi, abstract, keywords, references, and paper_id of each paper.

2. paper_edge_citation.csv: the edges of the citation network, whose nodes are identified by 'paper_id'. There are 23,737 edges. Edges are directed, and each edge records that 'source' cites 'target' once.

Journals are listed as follows:

1. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
2. BIOMETRIKA
3. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY
4. ANNALS OF STATISTICS

We use the following abbreviation for each journal in the analysis below:

1. JASA
2. Biometrika
3. JRSS B
4. Ann. Stat

We use this dataset to examine the publication status and citation network of authors in these four journals.


In [10]:
import pandas as pd
import sys
import os
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [25]:
DATA_CONFIG = {
    'files': {
        'info': 'paper_information.csv',
        'edges': 'paper_edge_citation.csv'
    },
    'journal_mapping': {
        'JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION': 'JASA',
        'BIOMETRIKA': 'Biometrika',
        'JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-STATISTICAL METHODOLOGY': 'JRSS B',
        'ANNALS OF STATISTICS': 'Ann. Stat'
    }
}
info = pd.read_csv(DATA_CONFIG['files']['info'])
edges = pd.read_csv(DATA_CONFIG['files']['edges'])

In [28]:
print(info.shape)
print(edges.shape)

(5746, 7)
(23737, 2)
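The shapes agree with the counts given in the dataset introduction (5,746 papers and 23,737 edges). Two further checks may be worth running before any network analysis: that `paper_id` is unique, and that both endpoints of every edge appear in the paper table. The sketch below is illustrative only. `toy_info` and `toy_edges` are hypothetical stand-ins with made-up values, not rows from the dataset.

```python
import pandas as pd

# Toy stand-ins for `info` and `edges` (hypothetical values, not from the dataset).
toy_info = pd.DataFrame({'paper_id': [1, 2, 3, 5]})
toy_edges = pd.DataFrame({'source': [2, 3, 5], 'target': [1, 1, 4]})

# Each paper should appear exactly once in the paper table.
ids_unique = toy_info['paper_id'].is_unique

# An edge is "internal" when both endpoints are listed in the paper table.
internal = (toy_edges['source'].isin(toy_info['paper_id'])
            & toy_edges['target'].isin(toy_info['paper_id']))
dangling_edges = toy_edges[~internal]

print(ids_unique, len(dangling_edges))
```

On the real data, the same expressions would be applied to `info['paper_id']` and to the `source` and `target` columns of `edges`.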


In [16]:
info.head()

Unnamed: 0,paper_id,title,publisher,doi,abstract,references,keywords
0,1,MERGING AND TESTING OPINIONS,ANNALS OF STATISTICS,10.1214/14-AOS1212,We study the merging and the testing of opinio...,"no title+Al-Najjar, N.; Pomatto, L.; Sandroni,...","Test manipulation,Bayesian learning"
1,2,EFFICIENT ESTIMATION OF INTEGRATED VOLATILITY ...,ANNALS OF STATISTICS,10.1214/14-AOS1213,We propose new nonparametric estimators of the...,ESTIMATING THE DEGREE OF ACTIVITY OF JUMPS IN ...,"Quadratic variation,Ito semimartingale,integra..."
2,3,FURTHER RESULTS ON CONTROLLING THE FALSE DISCO...,ANNALS OF STATISTICS,10.1214/14-AOS1214,The probability of false discovery proportion ...,The control of the false discovery rate in mul...,"gamma-FDP,generalized gamma-FDP,multiple testi..."
3,4,POSTERIOR CONTRACTION IN SPARSE BAYESIAN FACTO...,ANNALS OF STATISTICS,10.1214/14-AOS1215,Sparse Bayesian factor models are routinely im...,On some inequalities for the incomplete gamma ...,"Bayesian estimation,covariance matrix,factor m..."
4,5,A REMARK ON THE RATES OF CONVERGENCE FOR INTEG...,ANNALS OF STATISTICS,10.1214/13-AOS1179,The optimal rate of convergence of estimators ...,Power and bipower variation withstochastic vol...,"Semimartingale,volatility,jumps,infinite activ..."


In [17]:
edges.head()

Unnamed: 0,source,target
0,6318,4952
1,3817,4179
2,3817,3783
3,3817,3249
4,3817,6295
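Because each row of `edges` records that `source` cites `target`, per-paper citation counts follow from counting values in each column. The sketch below rebuilds the five rows displayed above as a small frame (`sample_edges` is a hypothetical name, not part of the notebook) and counts outgoing and incoming citations. It is a minimal illustration, not the analysis itself.

```python
import pandas as pd

# The five edges displayed by edges.head() above, rebuilt by hand for illustration.
sample_edges = pd.DataFrame({
    'source': [6318, 3817, 3817, 3817, 3817],
    'target': [4952, 4179, 3783, 3249, 6295],
})

# Out-degree: how many papers in the dataset each citing paper references.
out_degree = sample_edges['source'].value_counts()

# In-degree: how many times each paper is cited within the dataset.
in_degree = sample_edges['target'].value_counts()

print(out_degree.to_dict())  # {3817: 4, 6318: 1}
print(in_degree.max())       # 1
```

The same two `value_counts` calls would apply unchanged to the full `edges` frame.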


In [None]:
journal_map = DATA_CONFIG['journal_mapping']
# Map full publisher names to abbreviations; keep the original name if unmapped
info['journal'] = info['publisher'].map(journal_map).fillna(info['publisher'])
info.tail()

Unnamed: 0,paper_id,title,publisher,doi,abstract,references,keywords,journal
5741,6502,Parametric and semiparametric models for recap...,JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIE...,10.1111/1467-9868.00302,Capture-recapture processes are biased samplin...,COX REGRESSION-MODEL FOR COUNTING-PROCESSES - ...,"efficiency,identifiability,linear integral ope...",JRSS B
5742,6503,A general method of constructing E(s(2))-optim...,JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIE...,10.1111/1467-9868.00303,There has been much recent interest in supersa...,"SOME SYSTEMATIC SUPERSATURATED DESIGNS+BOOTH, ...","balanced incomplete-block designs,cyclic gener...",JRSS B
5743,6506,Dynamic models for spatiotemporal data,JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIE...,10.1111/1467-9868.00305,We propose a model for non-stationary spatiote...,no title+BOX GEP+TIME SERIES ANAL FOR+1989::Oz...,"Bayesian inference,locally weighted mixture,on...",JRSS B
5744,6507,On expected volumes of multidimensional confid...,JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIE...,10.1111/1467-9868.00306,"We consider a general multiparameter set-up, w...",ADJUSTED VERSIONS OF PROFILE LIKELIHOOD AND DI...,"Bartlett correction,highest posterior density ...",JRSS B
5745,6508,Epidemics in heterogeneous communities: estima...,JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIE...,10.1111/1467-9868.00307,A stochastic multitype model for the spread of...,A GENERALIZED STOCHASTIC-MODEL FOR THE ANALYSI...,"basic reproduction number,consistency,final si...",JRSS B
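The `map(...).fillna(...)` pattern in the cell above abbreviates known publisher names and leaves any unmapped name unchanged rather than turning it into NaN. A minimal sketch on a toy frame shows this fallback. The frame `toy` and the publisher 'SOME OTHER JOURNAL' are hypothetical and do not come from the dataset.

```python
import pandas as pd

# A subset of the notebook's journal mapping, enough for the illustration.
journal_map = {
    'BIOMETRIKA': 'Biometrika',
    'ANNALS OF STATISTICS': 'Ann. Stat',
}

# Hypothetical publishers; the third one is deliberately absent from the mapping.
toy = pd.DataFrame({'publisher': ['BIOMETRIKA', 'ANNALS OF STATISTICS',
                                  'SOME OTHER JOURNAL', 'BIOMETRIKA']})

# map() yields NaN for unmapped names; fillna() restores the original name.
toy['journal'] = toy['publisher'].map(journal_map).fillna(toy['publisher'])

# Number of papers per (abbreviated) journal.
counts = toy['journal'].value_counts()

print(toy['journal'].tolist())
```

On the full data, `info['journal'].value_counts()` would give the number of papers per journal in the same way.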
