## Objective: 
This notebook gives an introductory idea of how to create a citation network from the research publications in the COVID-19 corpus.
Please see the **I - COVID19-NLP-Data-Parsing** notebook first for a better understanding of this one. Here we focus directly on gathering the bib entries for each document and provide a demo of network analysis.

### 1. Getting Data

In [92]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

datafiles = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        ifile = os.path.join(dirname, filename)
        if ifile.split(".")[-1] == "json":
            datafiles.append(ifile)
        #print(ifile)

# Any results you write to the current directory are saved as output.

In [93]:
len(datafiles)

13202

### 2. Creation of Citation Network

Here we collect all references of each document and build the network data with NetworkX.
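As a minimal sketch of the input this step works on, the snippet below builds a toy document with the same keys that the parsing cell reads (`paper_id`, `metadata`, `bib_entries`, `other_ids`) and flattens its references. The sample values and the helper name `extract_bib_entries` are hypothetical and used only for illustration.

```python
# Toy document mirroring the keys read by the parsing code in this notebook.
# All values are made up for illustration.
sample_doc = {
    "paper_id": "doc-1",
    "metadata": {"title": "A toy paper"},
    "bib_entries": {
        "BIBREF0": {"title": "Ref with DOI", "year": 2015, "venue": "J. Toy",
                    "other_ids": {"DOI": ["10.0000/toy.1"]}},
        "BIBREF1": {"title": "Ref without DOI", "year": 2018, "venue": "",
                    "other_ids": {}},
    },
}

def extract_bib_entries(doc):
    """Hypothetical helper: flatten a document's bib_entries into a list of dicts."""
    entries = []
    for refid, value in doc["bib_entries"].items():
        dois = value.get("other_ids", {}).get("DOI", [])
        entries.append({
            "refid": refid,
            "title": value["title"],
            "year": value["year"],
            "venue": value["venue"],
            "DOI": dois[0] if dois else "NA",  # 'NA' marks a missing DOI
        })
    return entries

refs = extract_bib_entries(sample_doc)
print([r["DOI"] for r in refs])
```

Each flattened entry carries the fields that later become node attributes in the graph.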

In [94]:
id2bib = []
for file in datafiles:
    # id and title of a single document
    with open(file, 'r') as f:
        doc = json.load(f)
    paper_id = doc['paper_id']
    doc_title = doc['metadata']['title']

    # collect the bib entries of a single document
    bibEntries = []
    for refid, value in doc['bib_entries'].items():
        try:
            DOI = value['other_ids']['DOI'][0]
        except (KeyError, IndexError, TypeError):
            DOI = 'NA'  # the reference has no DOI

        bibEntries.append({"refid": refid,
                           "title": value['title'],
                           "year": value['year'],
                           "venue": value['venue'],
                           "DOI": DOI})
    # doc_title is kept separate so reference titles do not overwrite the document title
    id2bib.append({"id": paper_id, "bib": bibEntries, "title": doc_title})

In [95]:
import networkx as nx

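G = nx.Graph()
for item in id2bib:
    # one node per document, keyed by its paper id
    G.add_node(item['id'], title=item['title'])
    for ref in item['bib']:
        # one node per cited work, keyed by DOI;
        # note that all references without a DOI share the single node 'NA'
        G.add_node(ref['DOI'], title=ref['title'], year=ref['year'], venue=ref['venue'])
        G.add_edge(item['id'], ref['DOI'], value=ref['refid'])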

In [96]:
'''How many nodes are there in my network?'''
#155339 nodes
len(G.nodes())

155339

### 3. Sample Network Visualization

Drawing a network with more than 0.1 million nodes is not practical. For visualization, we select the first 200 publications and keep only the references that have the term 'virus' in their title.
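The title filter can be sketched on toy data as follows. The `(document, title)` pairs are made up for illustration; the point is that the `in` operator gives a boolean suitable for filtering, while `str.find` returns an index and should not be used directly as a condition.

```python
# Toy (document id, reference title) pairs; values are made up for illustration.
toy_refs = [
    ("doc-1", "virus entry mechanisms"),
    ("doc-1", "Protein folding"),
    ("doc-2", "A novel coronavirus outbreak"),
]

# `in` yields a boolean, so it can be used directly as a filter condition.
kept = [(doc, title) for doc, title in toy_refs if "virus" in title]

# By contrast, str.find returns an index: -1 (truthy) when the term is absent
# and 0 (falsy) when the title starts with the term.
find_results = [title.find("virus") for _, title in toy_refs]

print(kept)
print(find_results)
```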

In [97]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

In [98]:
Gs = nx.Graph()
for item in id2bib[0:200]:
    for ref in item['bib']:
        # str.find returns -1 (truthy) when the term is absent, so test membership instead
        if 'virus' in ref['title']:
            Gs.add_node(item['id'])
            Gs.add_node(ref['DOI'])
            Gs.add_edge(item['id'], ref['DOI'])

In [None]:
plt.figure(figsize=[12, 10])
pos = nx.spring_layout(Gs)  # force-directed layout
# pass the computed layout to nx.draw; otherwise pos is never used
nx.draw(Gs, pos=pos, with_labels=False, node_size=1, node_color='lightblue')
plt.savefig('cite.png')

### 3.1 Network Analysis

In [None]:
Ga = nx.Graph()
for item in id2bib[0:10]:
    Ga.add_node(item['id'],title = item['title'])
    for ref in item['bib']:
        Ga.add_node(ref['DOI'], title = ref['title'], year = ref['year'], venue = ref['venue'])
        Ga.add_edge(item['id'], ref['DOI'], value = ref['refid'])  

#### Q: What data does each network node contain?

In [None]:
for item in Ga.nodes().data():
    print(item)

#### Q: Can we find the document that cites the most references on 'virus'?

In [None]:
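'''Yes, we can compute the degree centrality and select the node with the highest value'''
DegreeCentrality = nx.degree_centrality(Ga)
# max() over a dict compares its keys, so pass key= to compare the centrality values
ID = max(DegreeCentrality, key=DegreeCentrality.get)
ID, DegreeCentrality[ID]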
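
To make the selection step concrete, here is a small sketch on a toy graph. In NetworkX, the degree centrality of a node is its degree divided by n - 1, where n is the number of nodes. The node names below are made up for illustration.

```python
import networkx as nx

# Toy citation graph: two documents and their references (made-up node names).
T = nx.Graph()
T.add_edges_from([
    ("doc-A", "ref-1"), ("doc-A", "ref-2"), ("doc-A", "ref-3"),
    ("doc-B", "ref-3"),
])

centrality = nx.degree_centrality(T)        # degree / (n - 1) for each node
top = max(centrality, key=centrality.get)   # node with the largest centrality value
largest_key = max(centrality)               # largest node name, unrelated to centrality

print(top, centrality[top])
print(largest_key)
```

Here `doc-A` has degree 3 in a graph of 5 nodes, so its centrality is 3/4, while `max(centrality)` without `key=` only returns the lexicographically largest node name.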

In [None]:
'''attributes (including the title) of the node with the highest degree centrality'''
Ga.nodes[ID]

In [None]:
'''the titles of the cited papers'''
for item in Ga.neighbors(ID):
    print(Ga.nodes[item])

#### Q: Find the temporal profile of virus-related publications using the citation network

An update is coming soon.
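
One possible way to approach this question, sketched on a toy graph and not the authors' planned method: count the cited nodes whose title contains 'virus' per publication year, using the `title` and `year` node attributes stored in this notebook. The toy nodes and their values are made up, and the case-insensitive match is an assumption.

```python
from collections import Counter
import networkx as nx

# Toy graph with the same node attributes used in this notebook (made-up values).
T = nx.Graph()
T.add_node("doc-A", title="A toy paper")
T.add_node("ref-1", title="Influenza virus dynamics", year=2009, venue="J. Toy")
T.add_node("ref-2", title="Coronavirus spike protein", year=2015, venue="J. Toy")
T.add_node("ref-3", title="Graph theory basics", year=2015, venue="J. Toy")
T.add_node("ref-4", title="SARS virus genome", year=2015, venue="J. Toy")
T.add_edges_from([("doc-A", r) for r in ("ref-1", "ref-2", "ref-3", "ref-4")])

# Count virus-related cited nodes per publication year,
# skipping nodes that have no year or no title attribute.
year_counts = Counter(
    data["year"]
    for _, data in T.nodes(data=True)
    if data.get("year") is not None
    and "virus" in (data.get("title") or "").lower()
)

for year in sorted(year_counts):
    print(year, year_counts[year])
```

The resulting per-year counts could then be plotted, for example as a bar chart, to show how the cited virus-related literature is distributed over time.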
