# Academic (Sub)Graph

This notebook enriches the data collected in step 0.
Here, the authors' data are read in a streaming fashion, and the relevant information about each author is inserted into our authors data structure.

In particular, for each author:
* We read its ID
  * If the ID belongs to our author IDs, we save the following data:
    * name
    * h-index
    * number of publications
    * number of citations
    * list of organizations
    * fields of research
  * Otherwise we do nothing, and the entry is never kept in main memory thanks to the online algorithm.

After that, we will have all the information that we need in order to:
* Discuss a cut-off in order to reduce the size of the graph
* Choose the library with which we will implement the graph
* Create the graph

Only for Google Colab

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import glob
import json
import os
import pickle
from datetime import datetime
from itertools import islice

## Reading authors data

In the next cell we list all the files containing information about the authors. We don't need every entry of these files, but we do need to scan them once in order to fetch the information related to the selected authors.

In [None]:
paperFiles = glob.glob('/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/*.txt')
for filename in paperFiles:
    # File size in megabytes (1 MB = 10^6 bytes)
    print(filename + "\t\t" + str(os.path.getsize(filename)/1000000) + " MB")

/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_1.txt		854.13124 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_10.txt		688.733723 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_2.txt		1409.760167 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_0.txt		1184.497563 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_3.txt		1201.266825 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_5.txt		1905.002495 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors/aminer_authors_15.txt		2510.91867 MB
/content/drive/My Drive/NetworkScience/AcademicGraph/data-collection-2020_sabiu/data/authors

## Information retrieval
In the next cells we read the data in a streaming fashion. Once again, the chunk size is 10k entries.

One observation is in order at this point: we have to retrieve the information related to m out of n total authors, with n >> m. <br />
In a usual batch algorithm we would implement this task by performing m lookups in the n entries (both sets are unsorted). Here, because of the online paradigm, we have to invert this approach: for each author entry read from the files, we check whether its ID is contained in our dictionary. If it is, its information is fetched.
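
The inverted lookup can be sketched as follows, assuming one JSON object per line as in the author files. The helper name `iter_matching` and the in-memory sample are hypothetical and only stand in for the real files:

```python
import io
import json
from itertools import islice


def iter_matching(lines, wanted_ids, chunk_size=10000):
    """Yield the parsed entries whose "id" is in wanted_ids, reading chunk by chunk."""
    while True:
        chunk = list(islice(lines, chunk_size))  # at most chunk_size lines in memory
        if not chunk:
            break
        for line in chunk:
            entry = json.loads(line)
            if entry["id"] in wanted_ids:  # average constant-time lookup in a dict or set
                yield entry


# Hypothetical sample: n = 4 streamed authors, m = 2 selected ones
sample = io.StringIO(
    '{"id": "a1", "name": "Ada"}\n'
    '{"id": "a2", "name": "Bob"}\n'
    '{"id": "a3", "name": "Cyd"}\n'
    '{"id": "a4", "name": "Dee"}\n'
)
selected = {"a2": {}, "a4": {}}
matches = list(iter_matching(sample, selected, chunk_size=3))
print([m["name"] for m in matches])
```

Each streamed entry costs one dictionary lookup, so a single pass over the n entries is enough regardless of m.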

Reading stored data

In [None]:
def load_obj(name):
    with open('/content/drive/My Drive/NetworkScience/AcademicGraph/data/obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [None]:
auth_dict = load_obj("authors")
links_dict = load_obj("links")

### Scanning all authors

In [None]:
chunk_size = 10000
# Fields copied from each matching author entry
fields = ("name", "orgs", "h_index", "n_pubs", "n_citation", "tags")
start = datetime.now()

for file in paperFiles: # For each aminer_authors_*.txt
    with open(file, "r") as f:
        while True:
            chunk = list(islice(f, chunk_size)) # Loading chunk
            if not chunk:
                break
            # Processing chunk
            for line in chunk: # Author online processing (one JSON entry per line)
                auth_info = json.loads(line)

                auth_id = auth_info["id"]
                if auth_id in auth_dict:
                    # Keeping only the fields that are present and not null
                    info = {}
                    for field in fields:
                        value = auth_info.get(field)
                        if value is not None:
                            info[field] = value

                    auth_dict[auth_id] = info
end = datetime.now()
end - start

datetime.timedelta(0, 979, 820776)

### Checking new features

In [None]:
auth_dict[list(auth_dict.keys())[74557]]

{'h_index': 1,
 'n_citation': 1,
 'n_pubs': 8,
 'name': 'Ron N. Alkalay',
 'orgs': ['Beth Israel Deaconess Medical Center Orthopaedic Biomechanics Laboratory, Harvard Medical School, Boston, MA, USA'],
 'tags': [{'t': 'Spinal Fusion', 'w': 1},
  {'t': 'Bone Graft Substitute', 'w': 1},
  {'t': 'Animal Model', 'w': 1},
  {'t': 'Biomechanical Tests', 'w': 1}]}

### Writing updated nodes

In [None]:
def save_obj(obj, name):
    with open('/content/drive/My Drive/NetworkScience/AcademicGraph/data/obj/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

In [None]:
save_obj(auth_dict, "authors")

## Cutting off
As shown in the next cells, we have more than 373k authors.

In [None]:
auth_dict = load_obj("authors")

In [None]:
len(auth_dict)

### Exploring authors

How many authors have the following features?
* More than 5 publications
* At least one associated organization
* More than 5 publications and a non-null organization field

In [None]:
more_than5 = 0
having_org = 0
both = 0

for info in auth_dict.values():
  auth_n_pubs = info.get("n_pubs")
  orgs = info.get("orgs")

  has_more_than5 = auth_n_pubs is not None and auth_n_pubs > 5
  has_org = orgs is not None

  # More than 5 publications
  if has_more_than5:
    more_than5 += 1

  # Having organization field
  if has_org:
    having_org += 1

  # Having organization field and more than 5 publications
  if has_more_than5 and has_org:
    both += 1

print(more_than5)
print(having_org)
print(both)

Let's store a filtered dictionary containing only the authors that have an organization field and more than 5 publications (including all years up to 2016).

In [None]:
filtered_authors = {}

for auth, info in auth_dict.items():
  auth_n_pubs = info.get("n_pubs")
  orgs = info.get("orgs")

  # Having organization field and more than 5 publications
  if auth_n_pubs is not None and auth_n_pubs > 5 and orgs is not None:
    filtered_authors[auth] = info

len(filtered_authors)

Writing authors

In [None]:
save_obj(filtered_authors, "filtered_authors")

### Cutting links
We could also filter depending on the number of links that involve each author. 
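
A minimal sketch of such a cut-off, assuming the same `"id1,id2"` key format used by `links_dict`. The sample data and the threshold `min_links` are hypothetical:

```python
from collections import Counter

# Hypothetical sample in the same "id1,id2" -> weight format as links_dict
sample_links = {"a1,a2": 3, "a1,a3": 1, "a2,a3": 2, "a3,a4": 1}
min_links = 2  # hypothetical cut-off

# Count how many links involve each author
links_per_author = Counter()
for key in sample_links:
    v1, v2 = key.split(",")
    links_per_author[v1] += 1
    links_per_author[v2] += 1

# Keep only the authors involved in at least min_links links
kept = {a for a, n in links_per_author.items() if n >= min_links}
print(sorted(kept))
```

Note that this counts distinct links per author and ignores the link weights; a weighted variant would add the weight instead of 1.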

But first we create the graph.

## Creating the graph
In this section we are going to use the original version of the authors dictionary. <br />
It is made up of:
* 373263 nodes
* 4511734 links

In [None]:
def load_obj(name):
    with open('/content/drive/My Drive/NetworkScience/AcademicGraph/data/obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

#auth_dict = load_obj("filtered_authors")
auth_dict = load_obj("authors")
links_dict = load_obj("links")

In [None]:
print(f"{len(auth_dict)} nodes")
print(f"{len(links_dict)} links")

In [None]:
!pip install python-igraph

In [None]:
import igraph
from igraph import Graph
print(igraph.__version__)

In [None]:
g = Graph()

In [None]:
g.add_vertices(list(auth_dict.keys()))

Filtering links: obtaining only those related to actual nodes

In [None]:
links = []
weights = []
addToLinks = links.append
addToWeights = weights.append

for l in links_dict.keys():
  nodes = l.split(',')
  v1 = nodes[0]
  v2 = nodes[1]
  w = links_dict[l]

  if(v1 in auth_dict and v2 in auth_dict):
    addToLinks((v1, v2))
    addToWeights(w)

print(links[:3])
print(weights[:3])

Adding links to the graph with their weight. <br />
Attributes can be arbitrary Python objects, but if you are saving graphs to a file, only string and numeric attributes will be kept.
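
Since `orgs` is a list, it falls outside the string and numeric types mentioned above. One possible workaround, if the graph is later exported to a file format with that restriction, is to flatten such values before attaching them as attributes. The helper `to_storable` and the `"; "` separator below are hypothetical:

```python
def to_storable(value, sep="; "):
    """Return strings and numbers unchanged; flatten anything else to a string."""
    if value is None:
        return ""
    if isinstance(value, (str, int, float)):
        return value
    if isinstance(value, (list, tuple)):
        return sep.join(str(v) for v in value)
    return str(value)


orgs_sample = ["Harvard Medical School", "Beth Israel Deaconess Medical Center"]
print(to_storable(orgs_sample))
print(to_storable(8))
```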

In [None]:
print(''.join([str(len(links)), " remaining links"]))

In [None]:
g.add_edges(links)

Adding weights

In [None]:
g.es["weight"] = weights

Adding node attributes

In [None]:
# Initializing attribute lists
ids = []
names = []
hindeces = []
npubs = []
ncits = []
orgs = []
tags = []

# Filling attribute lists (missing features default to an empty string)
for a, f in auth_dict.items():
  ids.append(a)
  names.append(f.get("name", ""))
  hindeces.append(f.get("h_index", ""))
  npubs.append(f.get("n_pubs", ""))
  ncits.append(f.get("n_citation", ""))
  orgs.append(f.get("orgs", ""))

  # Each tag is a dict such as {'t': topic, 'w': weight}; keep only the topics
  if "tags" in f:
    tags.append([t.get("t") for t in f["tags"]])
  else:
    tags.append("")

# Updating vertex attributes
g.vs["id"] = ids
# add_vertices() stored the author IDs in "name"; from here on "name" holds the authors' names
g.vs["name"] = names
g.vs["h_index"] = hindeces
g.vs["n_pubs"] = npubs
g.vs["n_citation"] = ncits
g.vs["orgs"] = orgs
g.vs["tags.t"] = tags


Storing graph

In [None]:
save_obj(g, "graph")

### Degree distribution

Maximum degree

In [None]:
degrees = g.degree()
max_degree = max(degrees)
max_degree

x axis: degree (k)

In [None]:
k = list(range(max_degree + 1)) # k = [0, ..., 5134] 

In [None]:
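from collections import Counter

# Counting each degree in a single pass instead of calling degrees.count() for every k
degree_counts = Counter(degrees)
tot = len(degrees)
p = [degree_counts[i]/tot for i in k]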
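
In formula form, the cell above computes the empirical degree distribution, where $N_k$ is the number of nodes with degree $k$ and $N$ is the total number of nodes:

$$p_k = \frac{N_k}{N}, \qquad \sum_{k=0}^{k_{\max}} p_k = 1$$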

Plotting

In [None]:
import plotly.express as px
fig = px.scatter(x=k, y=p)
fig.update_layout(xaxis_type="log", yaxis_type="log", width=800, height=600)
fig.show()