# Academic (Sub)Graph

This notebook collects **Worldwide Academic Co-authorship** data in order to represent and analyse it as a network. 
The data source is the Open Academic Graph provided by Microsoft at https://www.openacademic.ai/oag/: from it we downloaded both the aminer_papers_\*.zip and aminer_authors_\*.zip sets of files, where \* stands for the index of the part into which the author and paper data are split.

The modules used are imported in a single cell below, while the requirements.txt file lists the dependencies of this notebook. 

The following cell is needed only when running on Google Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import glob
import json
import os
import pickle
from datetime import datetime
from itertools import islice

## Raw Data

The main challenge of this task is the size of the files. As the listings below show, we have to process:
* About 35 GB of author-related information
* About 138 GB of paper-related information

In this project, these files are located in the /data/authors and /data/papers folders respectively.

Given these sizes, we only push a few subsamples of them to the GitHub repository.

In [None]:
paperFiles = glob.glob('/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/*.txt')
for filename in paperFiles:
    print(filename + "\t\t" + str(os.path.getsize(filename)/1e9) + " GB")  # bytes -> GB

/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_0.txt		10.000003049 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_1.txt		10.000001184 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_2.txt		10.000004163 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_3.txt		10.000001016 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_4.txt		10.000004774 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_5.txt		10.000003508 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_6.txt		10.000004818 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_7.txt		10.000003416 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_8.txt		10.000004111 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/papers/aminer_papers_9.t

In [None]:
authorFiles = glob.glob('/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/*.txt')
for filename in authorFiles:
    print(filename + "\t\t" + str(os.path.getsize(filename)/1e9) + " GB")  # bytes -> GB

/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_1.txt		0.85413124 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_10.txt		0.688733723 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_2.txt		1.409760167 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_0.txt		1.184497563 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_3.txt		1.201266825 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_5.txt		1.905002495 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_15.txt		2.51091867 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_11.txt		2.486744147 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_authors_6.txt		2.530414925 GB
/content/drive/My Drive/NetworkScience/AcademicGraph/data/authors/aminer_a

## The Graph

After some exploration of the data, which is documented in the report accompanying this notebook, we decided to fetch only the authors whose collaborations occurred in **2016**. 
We collect the authors' information in the following way:
* First, we iterate over all the available papers and fetch the authors of those papers whose year field is equal to 2016. This step is performed in a streaming fashion, loading chunks of 10k items (i.e. papers) at a time: in this way we avoid overloading memory, and the whole scan of 138 GB of data takes about 40 minutes (2,325 seconds in the run recorded below). 
* Second, we store the network components "on the fly", inserting the relevant objects into two dictionaries:
    * Authors (nodes): for the moment they consist only of the authors' IDs.
    * Collaborations (weighted links): each weight is the number of collaborations between two nodes during the selected year.

Notice that **we postpone the creation of the graph**: the first goal is to generate the sets of nodes and links in order to store them as files. Later, we will have the freedom to use the best analytical tool depending on both our needs and the size of the graph.
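The chunked streaming scan described above can be sketched on a small scale as follows. This is a minimal illustration, not the notebook's actual cell: the helper `iter_chunks` and the in-memory sample records are hypothetical, and the chunk size is set to 2 only to keep the example short.

```python
import io
import json
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Made-up stand-in for one aminer_papers_*.txt file (one JSON record per line)
sample = io.StringIO(
    '{"id": "p1", "year": 2016, "authors": [{"id": "a"}, {"id": "b"}]}\n'
    '{"id": "p2", "year": 2015, "authors": [{"id": "c"}]}\n'
    '{"id": "p3", "authors": [{"id": "d"}]}\n'
)

papers_2016 = []
for chunk in iter_chunks(sample, chunk_size=2):
    for line in chunk:
        record = json.loads(line)
        # Records without a "year" field are skipped, like p3 here
        if record.get("year") == 2016:
            papers_2016.append(record["id"])

print(papers_2016)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the size of the file.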

### Nodes and links
In the following phase (i.e. retrieval of authors), the data structures representing nodes and links are:
* A **dictionary** keyed by author ID (a string), representing the nodes
* A **dictionary** keyed by a string that encodes an (author ID, author ID) pair, representing the links and mapping each pair to its weight

This decision is motivated by the following observations:
* Concerning the nodes, to avoid duplicates we need to check whether a node already exists before inserting it (see the code cells below). This requires n lookups, where n is the size of the input (in our case about 6.3 million): a dictionary is the best data structure here, since it provides both lookup and insertion in constant time on average.
* Concerning the links, the same undirected link can appear as (x, y) or as (y, x), and both must map to the same dictionary entry so that its weight can be retrieved and updated in constant time. A set {x, y} would capture this directly, but Python sets are not hashable and cannot be used as dictionary keys. We therefore sort the two author IDs and join them with a comma, so that each co-authorship has a single string key regardless of the order of its authors. 
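The order-independent link key can be illustrated with a short sketch. The helper name `edge_key` and the toy IDs are hypothetical. The sorted-and-joined key assumes that author IDs never contain a comma. The `frozenset` variant is shown only as a possible alternative: it is a hashable built-in, but it is not what this notebook uses.

```python
def edge_key(u, v):
    """Canonical string key for the undirected pair (u, v)."""
    return ','.join(sorted([u, v]))


pairs = [("x", "y"), ("y", "x"), ("x", "z")]

# Sorted-and-joined string keys: (x, y) and (y, x) share one entry
weights = {}
for u, v in pairs:
    key = edge_key(u, v)
    weights[key] = weights.get(key, 0) + 1

# Alternative: frozenset keys are hashable and order-independent as well
fs_weights = {}
for u, v in pairs:
    key = frozenset((u, v))
    fs_weights[key] = fs_weights.get(key, 0) + 1

print(weights)
print(fs_weights[frozenset(("y", "x"))])
```

String keys are easy to serialise and to split back into the two IDs, which may be convenient when the links are later exported to another tool.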

In [None]:
auth_dict = {}
links_dict = {}

In [None]:
chunk_size = 10000

start = datetime.now()

for file in paperFiles:  # For each aminer_papers_*.txt
    with open(file, "r") as f:
        while True:
            chunk = list(islice(f, chunk_size))  # Load the next chunk of lines
            if not chunk:
                break
            for paper in chunk:  # One JSON record per line
                paper_info = json.loads(paper)
                try:
                    if paper_info["year"] != 2016:
                        continue  # Keep only papers published in 2016
                    authors = paper_info["authors"]  # Fetch paper authors
                    # Add each author as a node if absent
                    for author in authors:
                        auth_dict.setdefault(author["id"], {})
                    # Increment by 1 the weight of every co-author pair,
                    # creating the edge if it does not exist yet
                    n = len(authors)
                    for i in range(n):
                        for j in range(i + 1, n):
                            edge = ','.join(sorted([authors[i]["id"], authors[j]["id"]]))
                            links_dict[edge] = links_dict.get(edge, 0) + 1
                except (KeyError, TypeError):
                    pass  # Missing or malformed year, authors or author id

end = datetime.now()
end-start

datetime.timedelta(seconds=2325, microseconds=158013)

Number of links

In [None]:
nr_links = len(links_dict)
nr_links

4511734

Number of nodes

In [None]:
nr_authors = len(auth_dict.keys())
nr_authors

373263

Average number of links per node (for an undirected graph, the average degree is twice this value)

In [None]:
nr_links/nr_authors

12.08727894272939
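Since the co-authorship graph is undirected, each link contributes to the degree of two nodes, so the average degree is 2L/N rather than L/N. The sketch below checks this on a toy triangle (made-up data) and then applies it to the counts reported above. The variable names other than `nr_links` and `nr_authors` are illustrative.

```python
# Toy check: a triangle has 3 nodes, 3 links, and every node has degree 2
toy_links = {"a,b": 1, "a,c": 1, "b,c": 1}
degree = {}
for key in toy_links:
    u, v = key.split(',')
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
toy_avg_degree = sum(degree.values()) / len(degree)
assert toy_avg_degree == 2 * len(toy_links) / len(degree)

# Counts reported by the cells above
nr_links = 4511734
nr_authors = 373263
links_per_node = nr_links / nr_authors
avg_degree = 2 * nr_links / nr_authors

print(links_per_node, avg_degree)
```

This counts distinct neighbours only. The weighted degree would sum the link weights instead.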

## Writing nodes and links

In [None]:
def save_obj(obj, name):
    with open('data/obj/'+ name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

In [None]:
save_obj(auth_dict, "authors")
save_obj(links_dict, "links")

## Reading nodes and links

In [None]:
def load_obj(name):
    with open('data/obj/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [None]:
read_auth = load_obj("authors")
read_links = load_obj("links")

## Testing

In [None]:
read_links[list(read_links.keys())[1234567]]

1
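Since the creation of the graph is postponed, the stored links will eventually have to be handed to an analysis tool. One possible way, sketched here under assumptions, is to turn the `'id1,id2' -> weight` dictionary into a weighted edge list and write it as CSV. The helper `to_edge_list`, the column names and the sample data are hypothetical, and the output goes to an in-memory buffer rather than a real file.

```python
import csv
import io


def to_edge_list(links):
    """Turn {'id1,id2': weight} into a list of (id1, id2, weight) tuples."""
    edges = []
    for key, weight in links.items():
        u, v = key.split(',')  # Assumes IDs contain no comma
        edges.append((u, v, weight))
    return edges


# Made-up stand-in for the dictionary returned by load_obj("links")
sample_links = {"a,b": 2, "a,c": 1}
edges = to_edge_list(sample_links)

# Write the edge list as CSV to an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["source", "target", "weight"])
writer.writerows(edges)

print(buffer.getvalue())
```

A plain weighted edge list of this kind is accepted by most graph libraries and tools, which keeps the choice of tool open.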