# 1. Data 

In [1]:
import ijson
import heapq
import networkx as nx
import pandas as pd
import itertools
from itertools import combinations
from collections import Counter


## Data pre-processing

#### The dataset is quite large and may not fit in your memory when you try constructing your graph. So, what is the solution? You should focus your investigation on a subgraph. You can work on the most connected component in the graph. However, you must first construct and analyze the connections to identify the connected components.

#### Identify the top 10,000 papers with the highest number of citations.

First, we read a single item from the JSON file to get a feel for the data.

In [2]:
counter = 0
limit = 1 # Number of objects to read
dato=[]

with open("/Users/damianzeller/Desktop/HS23/ADM/Homework 5/dblp.v12.json", 'rb') as f:
    for record in ijson.items(f, 'item'):
        if counter >= limit:
            break
        # Process the record here
        dato.append(record)
        counter += 1
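
The manual counter above can also be written with itertools.islice, which stops pulling records once the limit is reached. This is only a minimal sketch: fake_records is a hypothetical stand-in for ijson.items(f, 'item'), used here so the snippet runs without the dataset.

```python
from itertools import islice

def fake_records():
    """Hypothetical stand-in for ijson.items(f, 'item'): yields records lazily."""
    for i in range(1_000_000):
        yield {"id": i}

limit = 1  # number of records to read
# islice stops consuming the generator after `limit` items,
# so the rest of the stream is never produced
sample = list(islice(fake_records(), limit))
print(sample)
```

With the real file, the generator returned by ijson.items would take the place of fake_records().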

In [3]:
dato


[{'id': 1091,
  'authors': [{'name': 'Makoto Satoh',
    'org': 'Shinshu University',
    'id': 2312688602},
   {'name': 'Ryo Muramatsu', 'org': 'Shinshu University', 'id': 2482909946},
   {'name': 'Mizue Kayama', 'org': 'Shinshu University', 'id': 2128134587},
   {'name': 'Kazunori Itoh', 'org': 'Shinshu University', 'id': 2101782692},
   {'name': 'Masami Hashimoto', 'org': 'Shinshu University', 'id': 2114054191},
   {'name': 'Makoto Otani', 'org': 'Shinshu University', 'id': 1989208940},
   {'name': 'Michio Shimizu',
    'org': 'Nagano Prefectural College',
    'id': 2134989941},
   {'name': 'Masahiko Sugimoto',
    'org': 'Takushoku University, Hokkaido Junior College',
    'id': 2307479915}],
  'title': 'Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map',
  'year': 2013,
  'n_citation': 1,
  'page_start': '89',
  'page_end': '93',
  'doc_type': 'Conference',
  'publisher': 

To get the 10,000 papers with the most citations, we stream through every paper (each one a dictionary) and store only the id of the paper (to recognize it later) and a count in a dictionary. The count used here is the length of the paper's 'references' list, and a paper without that field gets a count of 0.

In [8]:
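with open('/Users/damianzeller/Desktop/HS23/ADM/Homework 5/dblp.v12.json', 'rb') as f:
    # ijson parses the file lazily, so only one paper is in memory at a time
    papers = ijson.items(f, 'item')
    new_dict = {}
    # Avoid the names dict and id, which would shadow Python built-ins
    for paper in papers:
        paper_id = paper["id"]
        # Papers without a 'references' field count as 0
        num_citations = len(paper["references"]) if "references" in paper else 0
        new_dict[paper_id] = num_citations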

Now we want to sort our dictionary. We do this by turning its key/value pairs into tuples and storing them in a list. The list is then sorted by the second element of each tuple (the count) in descending order.

In [11]:
sorted_dict = sorted(new_dict.items(), key=lambda x:x[1], reverse=True)

We only want the 10,000 papers with the most citations, so we keep only the first 10,000 tuples in our list. Papers that have the same count as the last kept paper may be dropped by this cut-off. We deliberately ignore this.

In [12]:
sorted_dict = sorted_dict[0:10000]
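
Sorting the whole dictionary and slicing works, but the same top-n selection can be done with heapq.nlargest, which avoids a full sort. A minimal sketch on toy data, where the ids and counts are made up for illustration:

```python
import heapq

# Toy stand-in for new_dict: paper id -> count (values are invented)
toy_counts = {101: 5, 102: 2, 103: 9, 104: 5, 105: 1}

n = 3  # the notebook uses 10000
# nlargest returns the n items with the largest key, in descending order;
# for ties it behaves like sorted(..., reverse=True)[:n]
top_n = heapq.nlargest(n, toy_counts.items(), key=lambda kv: kv[1])
top_ids = [paper_id for paper_id, _ in top_n]
print(top_n)
```

As with the slice, papers tied with the last kept entry can still be cut off.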

In [13]:
sorted_dict

[(2076024657, 1812),
 (2052326664, 1695),
 (2072748471, 1307),
 (47957325, 1287),
 (2615873723, 1216),
 (2071204548, 983),
 (2620342231, 912),
 (2154930971, 862),
 (1978831484, 853),
 (2614167197, 724),
 (2895896816, 672),
 (1510836926, 666),
 (2403502472, 651),
 (1997797684, 630),
 (2031385260, 592),
 (577451423, 590),
 (2398525104, 577),
 (1981689130, 566),
 (2076063813, 564),
 (2046007336, 563),
 (2910880405, 553),
 (1973788353, 517),
 (2128340703, 497),
 (2962883549, 494),
 (2336121529, 483),
 (2148043549, 480),
 (2166281120, 477),
 (2139317044, 468),
 (2529696250, 460),
 (1584232736, 454),
 (2964248347, 450),
 (2891004411, 436),
 (1660562555, 434),
 (1515422725, 433),
 (1569512051, 430),
 (2024228866, 430),
 (1939596808, 421),
 (2885657717, 421),
 (2140239055, 414),
 (2115167851, 413),
 (2506633516, 408),
 (2517241835, 404),
 (2040340473, 400),
 (2604799547, 397),
 (2944362491, 391),
 (2970434686, 388),
 (2949868354, 385),
 (1982564000, 381),
 (2027417001, 381),
 (2484891765, 381)

Now we extract the ids we are interested in and store them in a list.

In [14]:
relevant_ids= [tup[0] for tup in sorted_dict]


Now we have what we need to extract the 10,000 papers with the most citations from our JSON file. We store them in a list called graph_list, which is a list of dictionaries. Alternatively, we could have extracted only the information that is necessary to construct the graphs.

In [15]:
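# A set makes each membership test O(1) instead of scanning a 10,000-element list
relevant_id_set = set(relevant_ids)
with open('/Users/damianzeller/Desktop/HS23/ADM/Homework 5/dblp.v12.json', 'rb') as f:
    papers = ijson.items(f, 'item')
    graph_list = []
    for paper in papers:
        if paper['id'] in relevant_id_set:
            graph_list.append(paper)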
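
The alternative mentioned above, keeping only the fields the graphs need, could look like the following sketch. The helper slim_record is hypothetical, and the example record is invented; it keeps the same 'id', 'references' and 'authors' layout that the later cells rely on.

```python
def slim_record(paper):
    """Hypothetical helper: keep only the fields used to build the graphs."""
    return {
        "id": paper["id"],
        # Missing 'references' becomes an empty list
        "references": list(paper.get("references", [])),
        # Only the author ids are needed for the collaboration graph
        "authors": [{"id": author["id"]} for author in paper.get("authors", [])],
    }

# Invented example record with extra fields that get dropped
example = {
    "id": 1,
    "title": "Some title",
    "year": 2013,
    "authors": [{"name": "A", "org": "X", "id": 10}, {"name": "B", "id": 20}],
    "references": [2, 3],
}
slim = slim_record(example)
print(slim)
```

Appending slim_record(paper) instead of paper to graph_list would reduce memory use while leaving the later cells unchanged.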
      

## Graphs setup



#### Citation graph: This graph should represent the paper's citation relationships. We want this graph to be unweighted and directed. The citation should represent the citation given from one paper to another. For example, if paper A has cited paper B, we should expect an edge from node A to B. Nodes: You can consider each of the papers as your nodes. Edges: Only consider the citation relationship between these 10,000 papers and ignore the rest


In [16]:
#Citation graph
#Initialize a directed graph (a MultiDiGraph also permits parallel edges)
G= nx.MultiDiGraph()
#Set of the selected ids for fast membership tests
selected_ids = set(relevant_ids)
#Creating the nodes
for paper in graph_list:
    G.add_node(paper['id'])
#Creating the edges
for paper in graph_list:
    #Condition to avoid an error, as a paper may have no 'references' field
    if 'references' in paper:
        for element in paper['references']:
            #Only keep citations between the top 10000 papers
            if element in selected_ids:
                G.add_edge(paper['id'], element)
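
The edge rule (keep a citation only if both papers are among the selected ones) can be checked on a toy input without building a graph. This is a sketch under stated assumptions: citation_edges is a hypothetical helper, and the toy papers are invented.

```python
def citation_edges(papers, selected_ids):
    """Hypothetical helper: list (citing, cited) pairs within selected_ids."""
    selected = set(selected_ids)
    edges = []
    for paper in papers:
        # A paper without 'references' contributes no edges
        for cited in paper.get("references", []):
            if cited in selected:
                edges.append((paper["id"], cited))
    return edges

toy_papers = [
    {"id": 1, "references": [2, 3, 99]},  # 99 is outside the selection
    {"id": 2, "references": [3]},
    {"id": 3},                            # no 'references' field
]
toy_edges = citation_edges(toy_papers, [1, 2, 3])
print(toy_edges)
```

A list like toy_edges can be handed to a networkx graph in one call with add_edges_from.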

#### Collaboration graph: This graph should represent the collaborations of the paper's authors. This graph should be weighted and undirected. Consider an appropriate weighting scheme for your edges to make your graph weighted.  Nodes: The authors of these papers would be your nodes. Edges: Only consider the collaborations between the authors of these 10,000 papers and ignore the rest.

Creating the collaboration graph is a little more complicated, so we split it into several steps.

First we initialize an undirected graph.

In [17]:
#Initialization of the collaboration graph
N= nx.Graph()    

Now we create the nodes. We iterate through our list of dictionaries (our papers) and collect the author ids in a list. This list is then converted to a set so that no node is created more than once.

In [18]:
#Initialization of empty list
author_list=[]
#Extracting the authors
for paper in graph_list:
    for author in paper['authors']:
        author_list.append(author['id'])
#Converting it to a set
nodes=set(author_list)
#Creating the nodes
for node in nodes:
    N.add_node(node)
    
    

Now we have to create the edges. We consider it a collaboration if two authors worked on a paper together, and a stronger collaboration if they worked together on many papers. Therefore our weight is the number of papers (among the selected 10,000) that the two authors wrote together. To build these edges we write two helper functions.

Since many papers have several authors, we want a tuple for every pair of authors on a paper (an edge). Our permutation function does this. Despite its name, it builds unordered pairs (combinations), so each pair appears once per paper.

In [19]:
def permutation(list_authors_pp):
    # Sort the ids so that each pair is always created in the same order
    sorted_list_authors_pp = sorted(list_authors_pp)
    # Create tuples of all possible author pairs and store them in a list
    all_combinations = list(combinations(sorted_list_authors_pp, 2))
    # Append the pairs to the global combination_list (defined before the first call)
    for combo in all_combinations:
        combination_list.append(combo)


Our second function counts the occurrences of each tuple and returns a list of tuples. The first element of each returned tuple is itself a tuple containing the ids of the two authors (the edge). The second element is the number of times the two authors worked together (the weight).

In [20]:
def count_tuples(tuple_list):
    # Count how often each tuple (collaboration) occurs
    counter = Counter(tuple_list)
    # Store the (pair, count) tuples in a list
    result_list = list(counter.items())
    return result_list


Now that we have our two helper functions, we create the weighted edges. We iterate through the individual papers (dictionaries) and collect the author pairs for every single paper. We then count how often each pair occurs, which gives the weight of the edge we create for that pair.

In [21]:
#Initialize list for collaboration combinations
combination_list=[]
#Iterate through the papers
for paper in graph_list:
    #Initialize list for the author id's of a single paper
    list_authors_pp=[]
    for author in paper['authors']:
        list_authors_pp.append(author['id'])
    #Get the author pairs (collaborations) of a single paper
    permutation(list_authors_pp)
#Count all the tuples and create new tuples with weights
tuples_weighted= count_tuples(combination_list)
#Create the edges
for tuplo in tuples_weighted:
    N.add_edge(tuplo[0][0], tuplo[0][1], weight=tuplo[1])
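
The two helper functions and the loop above can also be combined into a single function with no global state. The sketch below is an illustration, not the method used in this notebook: collaboration_weights is a hypothetical name, the toy papers are invented, and, unlike the cells above, it removes duplicate author ids within one paper before pairing.

```python
from collections import Counter
from itertools import combinations

def collaboration_weights(papers):
    """Hypothetical helper: map each sorted author pair to its co-authored paper count."""
    weights = Counter()
    for paper in papers:
        # Assumption: de-duplicate ids within a paper, then sort for a stable pair order
        author_ids = sorted({author["id"] for author in paper.get("authors", [])})
        # Count every unordered pair of co-authors once per paper
        weights.update(combinations(author_ids, 2))
    return weights

toy_papers = [
    {"id": 1, "authors": [{"id": 30}, {"id": 10}, {"id": 20}]},
    {"id": 2, "authors": [{"id": 20}, {"id": 10}]},
    {"id": 3, "authors": [{"id": 30}]},  # single author: no pairs
]
toy_weights = collaboration_weights(toy_papers)
print(toy_weights)
```

The resulting pairs and counts could be passed to the graph as (u, v, weight) triples, for example with networkx's add_weighted_edges_from.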
