# Homework 5 - Visit the Wikipedia hyperlinks graph!

*Group 34: Eleonora Barocco, Mahtab Fotovat, Fabio Montello, Farid Rasulov,*

In this assignment we want to analyse the Wikipedia hyperlink graph. In particular, given extra information about the categories an article belongs to, we want to rank the articles according to some criteria.

<img src="https://raw.githubusercontent.com/fabiomontello/Homework5_Group-34/master/data/complex-graph.png" width="600">

## Retrieve the data

First we retrieve all the data we need. We start from the list of categories and the names of the pages that form the Wikipedia graph, which can be downloaded from the [SNAP group's webpage](https://snap.stanford.edu/data/wiki-topcats.html). Then we download a [reduced version](https://drive.google.com/file/d/1ghPJ4g6XMCUDFQ2JPqAVveLyytG8gBfL/view) of the links between nodes, provided along with the homework instructions.

Before exploring the research questions, we also import all the libraries we will need later on. This keeps the code tidy and, we hope, less confusing for the reader.

In [1]:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
import collections 
from multiprocessing import Pool

## [RQ1] Datamining to build the graph

In this first part we want to build the graph $G=(V, E)$, where $V$ is the set of articles and $E$ the hyperlinks among them, and provide its basic information:

- Whether it is directed or not
- The number of nodes
- The number of edges
- The average node degree. Is the graph dense?

We start by getting the nodes we want to build our graph from. Since we are interested only in the nodes that are connected to each other in some way, we will consider just the nodes that appear in `wiki-topcats-reduced.txt`, the reduced file containing all the links between nodes that we will need to build our rank later on.

In [2]:
data = pd.read_csv('data/wiki-topcats-reduced.txt', sep="\t", header=None) #Read the tab-separated values with pandas
data.columns = ["from", "to"]

Our initial data are represented as a series of edges between nodes.

In [3]:
data.head()

|   | from | to |
|---|------|---------|
| 0 | 52 | 401135 |
| 1 | 52 | 1069112 |
| 2 | 52 | 1163551 |
| 3 | 62 | 12162 |
| 4 | 62 | 167659 |


The first thing we want is a list of nodes: we create two sets (which contain no duplicates) with the nodes in the two columns and then take their union as a single final set. The length of this set is the number of nodes, while the number of rows of the table is the number of edges.

In [4]:
list1 = data["from"].values #Starting points of the links
list2 = data["to"].values #Ending points of the links
set1=set(list1)
set2=set(list2)
fin_set = set1.union(set2)
print('Number of nodes: ',len(fin_set))
print('Number of edges: ',len(data.index))

Number of nodes:  461193
Number of edges:  2645247


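A simple check to know whether the graph is directed is to compare the two sets: if the links were undirected (each one listed in both directions), every node would appear both as a starting and as an ending point, so the two sets would have the same length. If the lengths differ, the graph is directed.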

In [5]:
print(len(set1))
print(len(set2))

428957
352518


Since the lengths are different we can say the graph is **directed**.
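
Comparing the sizes of the two sets is only a heuristic. A stricter check, sketched below on a toy edge list, tests whether every link also appears in the opposite direction. The `edges` data and the `is_symmetric` helper are illustrative assumptions and are not part of the homework code.

```python
def is_symmetric(edges):
    """Return True if every (u, v) link also appears as (v, u)."""
    edge_set = set(edges)
    return all((v, u) in edge_set for u, v in edge_set)

# Toy edge list: 1 -> 2 has no reverse link, so the graph must be treated as directed
edges = [(1, 2), (2, 3), (3, 2)]
symmetric = is_symmetric(edges)
print("undirected" if symmetric else "directed")
```

On the real data the same function could be fed with the rows of the `from`/`to` table, at the cost of holding all the edges in a set.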

We choose the NetworkX library as the base to build our graph, since it is already optimized and has all the tools we need later on to store our data. We read the edge file line by line and add each edge; the nodes are created automatically when an edge that touches them is added:

In [6]:
G = nx.DiGraph() #Create a directed graph
with open("data/wiki-topcats-reduced.txt") as f:
    for line in f:
        src, dst = line.split() #Each line holds the two endpoints of a link
        G.add_edge(int(src), int(dst))

We also want to double check that the number of nodes we computed before is equal to the one the library returns:

In [7]:
a= sorted(list(G.nodes()))
len(a)

461193

Now we gather in one place all the basic information requested in RQ1: the number of nodes, the number of edges, the type of graph and the density of the graph, which for a directed graph is calculated as: <br>
<h1><center>$ D = \frac{\lvert E \rvert} {\lvert V \rvert (\lvert V \rvert - 1)}  \simeq \frac{\lvert E \rvert} {\lvert V \rvert^2}$</center></h1>

In [8]:
nodes = G.number_of_nodes()
edges = G.number_of_edges()
print('The graph is ', "undirected" if(set1 == set2) else "directed")
print('Number of nodes: ' + str(nodes))
print('Number of edges: ' + str(edges))
print('Density: ' + str(edges/(nodes**2)))

The graph is  directed
Number of nodes: 461193
Number of edges: 2645247
Density: 1.243657566949106e-05


Since the value of the density is really close to zero, we can say that the graph is **sparse**.
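
The average node degree asked for in RQ1 follows from the same two counts. The sketch below reuses the numbers printed above; it relies on the fact that in a directed graph every edge adds one out-degree and one in-degree, so the average in-degree and the average out-degree are both $\lvert E \rvert / \lvert V \rvert$, about 5.7 links per article here. The variable names are ours.

```python
nodes = 461193    # number of nodes printed above
edges = 2645247   # number of edges printed above

avg_out_degree = edges / nodes            # also equal to the average in-degree
avg_total_degree = 2 * edges / nodes      # in-degree plus out-degree
density = edges / (nodes * (nodes - 1))   # exact formula, without the |V|^2 approximation

print(round(avg_out_degree, 2), round(avg_total_degree, 2), density)
```

An average of a few links per node, against hundreds of thousands of possible targets, is another way to see that the graph is sparse.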

## [RQ2]  Datamining to find categories

Given a category $C_0 = \{article_1, article_2, \dots \}$ as input we want to rank all of the nodes in $V$ according to a block-ranking, where the blocks are represented by the categories.

First of all, we want to read all the categories from the file containing them. 

In [9]:
categoriesdata = {}
with open('data/wiki-topcats-categories.txt') as f:
    for i, line in enumerate(f):
        tmp = line.split(';')
        # each entry is (category name, list of the article ids in the category)
        categoriesdata[i] = (tmp[0].replace('Category:', ''), [int(node) for node in tmp[1].split()])

In [10]:
len(categoriesdata)

17364

We obtained 17364 raw categories in total. Since we are using a reduced set of nodes, we keep only the nodes that are actually in our graph (the ones linked to each other), and we discard every category whose raw list does not have more than 3500 articles. Because the threshold is applied before the restriction to the graph, a retained category can end up with fewer than 3500 nodes.
To each node we also attach the list of categories it belongs to. This will help us later.

In [11]:
for name, members in categoriesdata.values():
    if len(members) > 3500:
        for node in members:
            if node in fin_set:
                # setdefault creates the list the first time, avoiding a KeyError on 'categories'
                G.nodes[node].setdefault('categories', []).append(name)

In [12]:
categories = dict()
for tpl in categoriesdata:
    if len(categoriesdata[tpl][1]) > 3500:
        categories[categoriesdata[tpl][0]] = [node for node in categoriesdata[tpl][1] if node in fin_set]

In [13]:
print('Number of categories:', len(categories))

Number of categories: 35


Let's print all the categories we will take into account, with their respective lengths:

In [14]:
for i in categories:
    print(i.rjust(70),' ', len(categories[i]))

                                                   English_footballers   7538
                                           The_Football_League_players   7814
                                         Association_football_forwards   5097
                                      Association_football_goalkeepers   3737
                                      Association_football_midfielders   5827
                                        Association_football_defenders   4588
                                                         Living_people   348300
                                                 Year_of_birth_unknown   2536
                                             Harvard_University_alumni   5549
                                        Major_League_Baseball_pitchers   5192
   Members_of_the_United_Kingdom_Parliament_for_English_constituencies   6491
                                                          Indian_films   5568
                                                 Year_of_death

## Block-ranking

In the process of building our block ranking, we need to take into account:

$$distance(C_0, C_i) = median(ShortestPath(C_0, C_i))$$

where $C_0$ is an arbitrary input category (in this case chosen by us) and $C_i$ ranges over all the other categories, one at a time. The lower the distance from $C_0$, the higher the position of $C_i$ in the rank. $ShortestPath(C_0, C_i)$ is the set of the lengths of all the shortest paths from the nodes of $C_0$ to the nodes of $C_i$. The length of a path is the sum of the weights of its edges; since the graph is unweighted, every edge weighs 1 and the length is simply the number of edges in the path.

Basically we want to compute all the shortest distances from the nodes of the category we choose to the nodes of every other category, and then keep the median (central value) of the values obtained for each category.
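
To make the definition concrete, here is a minimal sketch on a toy graph stored as adjacency lists. The helper names (`bfs_distances`, `block_distance`) and the data are hypothetical; the sketch only illustrates the formula, with unreachable pairs counted as infinitely distant, and is far too slow for the full Wikipedia graph.

```python
from collections import deque
from statistics import median

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node (unweighted BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def block_distance(adj, c0, ci):
    """Median shortest-path length over all pairs (u, v) with u in c0 and v in ci."""
    lengths = []
    for u in c0:
        dist = bfs_distances(adj, u)
        # targets that cannot be reached count as infinitely far away
        lengths.extend(dist.get(v, float("inf")) for v in ci)
    return median(lengths)

# Toy directed graph: 1 -> 2 -> 3 -> 4
adj = {1: [2], 2: [3], 3: [4], 4: []}
d_near = block_distance(adj, c0=[1], ci=[2, 3])   # path lengths 1 and 2
d_far = block_distance(adj, c0=[1], ci=[3, 4])    # path lengths 2 and 3
print(d_near, d_far)
```

The category with the smaller value (`d_near`) would be placed higher in the rank.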

The first thing we want to do is to choose a category as input that will let us reduce the number of computations we have to perform, since the graph has a huge number of nodes and edges. So the logical way to go is to pick the category with the least number of elements as $C_0$.

In [15]:
minimum = 10000000
input_category = ''
for i in categories:
    if(int(len(categories[i])) < minimum):
        minimum = len(categories[i])
        input_category = i
        
print(input_category, minimum )

Year_of_birth_unknown 2536


In [16]:
categories['Year_of_birth_unknown']

[3335,
 10527,
 16310,
 22286,
 23468,
 23469,
 23476,
 24212,
 26206,
 28993,
 31093,
 34422,
 34424,
 34425,
 34762,
 34909,
 35263,
 35264,
 39892,
 40716,
 41699,
 41761,
 41778,
 41941,
 42011,
 42090,
 42245,
 42246,
 42269,
 42303,
 42370,
 42400,
 42466,
 42527,
 42539,
 42669,
 42795,
 42842,
 42951,
 43021,
 43047,
 43142,
 43290,
 43292,
 43293,
 43550,
 43595,
 43770,
 43771,
 44204,
 44205,
 44469,
 45522,
 45523,
 45524,
 45897,
 45906,
 45907,
 45948,
 45950,
 45952,
 45975,
 46318,
 46821,
 46826,
 47007,
 47017,
 47040,
 47085,
 48145,
 48192,
 48200,
 48206,
 48247,
 48847,
 51844,
 53151,
 53770,
 54096,
 54127,
 54371,
 54375,
 54923,
 55010,
 55102,
 55434,
 55747,
 55754,
 55895,
 55905,
 56442,
 56524,
 56668,
 56670,
 56673,
 57023,
 59198,
 59203,
 59216,
 59218,
 59230,
 59673,
 59675,
 60061,
 60066,
 60257,
 60273,
 60457,
 60773,
 61064,
 61116,
 61172,
 61176,
 61206,
 61208,
 61261,
 61303,
 61315,
 61316,
 61345,
 61680,
 61704,
 61708,
 61733,
 61743,
 

As we can see the category `Year_of_birth_unknown` is the smallest one, containing only 2536 elements. So this will become our $C_0$.

The next step is to compute every single distance between the nodes of $C_0$ and every other category in our graph. Since this takes some time, we suggest not running the algorithm unless necessary. Once it has been executed, it stores all the values retrieved, so that the following blocks of code can be executed independently, as long as the precomputed files are saved in the subdirectory `data`.
Because of the long running time, we also decided to parallelize the computation of the distances between nodes, so as to use all the CPU power available on the machine and greatly reduce the execution time.

We know from the beginning that every node in the graph is part of a category, so we simply walk through the whole graph starting from a single node and, for each node we encounter, we record its distance from the starting node. We only have to repeat this for the nodes in $C_0$.
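
The idea of one traversal per source can be sketched in memory, without files, as follows. This is an illustrative sketch under our own assumptions: `adj` and `node_categories` are toy dictionaries standing in for the graph and for the `categories` attribute of the nodes.

```python
from collections import defaultdict, deque

def distances_by_category(adj, node_categories, source):
    """One BFS from source; group the hop count of each reached node by category."""
    found = defaultdict(list)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for cat in node_categories.get(node, []):
            found[cat].append(dist[node])
        for nxt in adj.get(node, []):
            if nxt not in dist:   # mark on discovery, so each node is recorded once
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dict(found)

adj = {"a": ["b", "c"], "b": ["c"], "c": ["d"], "d": []}
node_categories = {"a": ["X"], "b": ["Y"], "c": ["X", "Y"], "d": ["Y"]}
result = distances_by_category(adj, node_categories, "a")
print(result)
```

Marking a node as soon as it is discovered matters: node `c` is linked from both `a` and `b`, but it is recorded only once, at its true distance.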

In [16]:
inf = float("inf")

# Start from empty distance files, one per category
for category in categories:
    open("data/" + category, "w").close()

print("Sources:\n")

def compute(source):
    '''Breadth-first visit from source: for every node reached, append its distance
    from source to the file of each category the node belongs to.'''
    print(source)

    level = 0
    current_level = {source}
    visited = {source}

    while current_level:
        next_level = set()

        for node in current_level:
            for category in G.nodes[node]['categories']:
                with open("data/" + category, "a") as file:
                    file.write(str(level) + "\n")

            for link in G.neighbors(node):
                if link not in visited:
                    visited.add(link)  # mark on discovery, so each node is written only once
                    next_level.add(link)

        level += 1
        current_level = next_level

pool = Pool()
pool.map(compute, categories[input_category])
pool.close()
pool.join()

What we have stored now are 35 files, one per category and named after it, where each file holds the distances from every node in $C_0$ to every node of that category that could be reached. In this way we can reopen the files later on and compute the median value for every category. Pairs of nodes with no connecting path are not written to the files, so they have to be counted as infinite distances when taking the median.
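
Since the unreachable pairs are not stored, the median can be taken without materializing them: once the finite values are sorted, the infinities would all sit at the end, so it is enough to check whether the middle positions fall among the finite values. The helper below is a hypothetical sketch of this reasoning, not the code used for the results.

```python
def median_with_missing(finite, total):
    """Median of `total` values, where those not listed in `finite` are infinite."""
    finite = sorted(finite)
    lo, hi = (total - 1) // 2, total // 2   # the two middle positions (equal if total is odd)
    if hi >= len(finite):                   # a middle position falls among the infinities
        return float("inf")
    return (finite[lo] + finite[hi]) / 2

m1 = median_with_missing([3, 1, 2], total=4)      # values: 1, 2, 3, inf
m2 = median_with_missing([1, 2], total=6)         # values: 1, 2, inf, inf, inf, inf
m3 = median_with_missing([1, 2, 3], total=5)      # values: 1, 2, 3, inf, inf
print(m1, m2, m3)
```

Here `total` plays the role of the number of pairs, that is the size of $C_0$ times the size of $C_i$.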

In [17]:
category_dict = dict()

for category in categories:

    if category == input_category:
        median = 0.0

    else:
        with open("data/" + category, "r") as file:
            content = sorted(map(float, file.read().splitlines()))

        # pairs (source, node) with no connecting path were never written:
        # they count as infinite distances at the end of the sorted list
        total = len(categories[input_category]) * len(categories[category])

        # the two middle positions of the full list (they coincide when total is odd)
        median_index1, median_index2 = (total - 1) // 2, total // 2
        if median_index2 < len(content):
            median = (content[median_index1] + content[median_index2]) / 2
        else:
            median = inf

    print(category.rjust(70), median)
    category_dict[category] = median

In [18]:
inf=float("inf")
category_dict = {'English_footballers': 9.0, 'The_Football_League_players': 8.0, 'Association_football_forwards': 9.0, 'Association_football_goalkeepers': 10.0, 'Association_football_midfielders': 10.0, 'Association_football_defenders': 10.0, 'Living_people': 8.0, 'Year_of_birth_unknown': inf, 'Harvard_University_alumni': 7.0, 'Major_League_Baseball_pitchers': 8.0, 'Members_of_the_United_Kingdom_Parliament_for_English_constituencies': 7.0, 'Indian_films': 7.0, 'Year_of_death_missing': inf, 'English_cricketers': 9.0, 'Year_of_birth_missing_(living_people)': 8.0, 'Rivers_of_Romania': 8.0, 'Main_Belt_asteroids': inf, 'Asteroids_named_for_people': inf, 'English-language_albums': 7.0, 'English_television_actors': 6.0, 'British_films': 6.0, 'English-language_films': 6.0, 'American_films': 6.0, 'Fellows_of_the_Royal_Society': 7.0, 'People_from_New_York_City': 7.0, 'American_Jews': 6.0, 'American_television_actors': 6.0, 'American_film_actors': 6.0, 'Debut_albums': 7.0, 'Black-and-white_films': 7.0, 'Year_of_birth_missing': inf, 'Place_of_birth_missing_(living_people)': 8.0, 'Article_Feedback_Pilot': 6.0, 'American_military_personnel_of_World_War_II': 7.0, 'Windows_games': 8.0}
category_dict['Year_of_birth_unknown'] = 0.0

We want to sort the categories in increasing order of the score we computed previously.

In [19]:
sorted_category = sorted(category_dict.items(), key=lambda kv: kv[1])
sorted_category

[('Year_of_birth_unknown', 0.0),
 ('English_television_actors', 6.0),
 ('British_films', 6.0),
 ('English-language_films', 6.0),
 ('American_films', 6.0),
 ('American_Jews', 6.0),
 ('American_television_actors', 6.0),
 ('American_film_actors', 6.0),
 ('Article_Feedback_Pilot', 6.0),
 ('Harvard_University_alumni', 7.0),
 ('Members_of_the_United_Kingdom_Parliament_for_English_constituencies', 7.0),
 ('Indian_films', 7.0),
 ('English-language_albums', 7.0),
 ('Fellows_of_the_Royal_Society', 7.0),
 ('People_from_New_York_City', 7.0),
 ('Debut_albums', 7.0),
 ('Black-and-white_films', 7.0),
 ('American_military_personnel_of_World_War_II', 7.0),
 ('The_Football_League_players', 8.0),
 ('Living_people', 8.0),
 ('Major_League_Baseball_pitchers', 8.0),
 ('Year_of_birth_missing_(living_people)', 8.0),
 ('Rivers_of_Romania', 8.0),
 ('Place_of_birth_missing_(living_people)', 8.0),
 ('Windows_games', 8.0),
 ('English_footballers', 9.0),
 ('Association_football_forwards', 9.0),
 ('English_cricketers

In [24]:
categories[str(sorted_category[1][0])]

[32782,
 40338,
 40566,
 53495,
 72636,
 83294,
 84078,
 85090,
 92536,
 94198,
 94537,
 96274,
 109867,
 113194,
 113259,
 131855,
 131913,
 134497,
 137747,
 142496,
 145391,
 146728,
 146750,
 151481,
 151482,
 151506,
 152932,
 156192,
 157543,
 158111,
 158448,
 164301,
 165140,
 166620,
 167466,
 172559,
 175252,
 184837,
 205296,
 208941,
 210464,
 210755,
 212777,
 220519,
 247467,
 247476,
 247485,
 255982,
 255983,
 256816,
 290953,
 332118,
 340938,
 343202,
 343281,
 343408,
 343409,
 348794,
 350088,
 369362,
 373339,
 376202,
 376209,
 392767,
 400868,
 401893,
 401983,
 402899,
 418183,
 419816,
 423988,
 428273,
 430640,
 430653,
 430662,
 430673,
 430675,
 430686,
 430720,
 430748,
 430749,
 430782,
 430783,
 430787,
 431375,
 431603,
 436006,
 436009,
 438370,
 449340,
 451046,
 466551,
 467110,
 469565,
 480173,
 481038,
 499468,
 499543,
 517246,
 526313,
 526315,
 545329,
 552369,
 552703,
 552712,
 552821,
 555866,
 556088,
 559668,
 559696,
 559701,
 559812,
 559

## Weight nodes and creating the subgraph

Now, based on the rank we retrieved previously, we want to give a rank to the single nodes.

In [20]:
rank = [elem[0] for elem in sorted_category]
rank

['Year_of_birth_unknown',
 'English_television_actors',
 'British_films',
 'English-language_films',
 'American_films',
 'American_Jews',
 'American_television_actors',
 'American_film_actors',
 'Article_Feedback_Pilot',
 'Harvard_University_alumni',
 'Members_of_the_United_Kingdom_Parliament_for_English_constituencies',
 'Indian_films',
 'English-language_albums',
 'Fellows_of_the_Royal_Society',
 'People_from_New_York_City',
 'Debut_albums',
 'Black-and-white_films',
 'American_military_personnel_of_World_War_II',
 'The_Football_League_players',
 'Living_people',
 'Major_League_Baseball_pitchers',
 'Year_of_birth_missing_(living_people)',
 'Rivers_of_Romania',
 'Place_of_birth_missing_(living_people)',
 'Windows_games',
 'English_footballers',
 'Association_football_forwards',
 'English_cricketers',
 'Association_football_goalkeepers',
 'Association_football_midfielders',
 'Association_football_defenders',
 'Year_of_death_missing',
 'Main_Belt_asteroids',
 'Asteroids_named_for_people


### Cleaning the categories

Once we have the rank we want to clean the categories: all the elements of the input category stay in it, but from the second category onwards we remove the elements that already appear in one of the previous categories.
<br>
We store the initial length of the categories in a list 'len_'.

In [21]:
len_=[]
for i in range(len(categories)):
    len_.append(len(categories[rank[i]]))

In [22]:
# clean ranked categories
for i in range(len(rank)):
    if i == 0:
        used_nodes = set(categories[rank[i]])
        categories[rank[i]] = set(categories[rank[i]])
    else:
        categories[rank[i]] = set(categories[rank[i]]).difference(used_nodes)
        used_nodes = used_nodes.union(categories[rank[i]])

In [23]:
for i in range(len(categories)):
    len_[i] = (len_[i],len(categories[rank[i]]))

Now the list 'len_' contains tuples with the old and the new length of the categories.
<br>
We can see how the number of nodes shrinks in the categories further down the rank.

In [24]:
len_

[(2536, 2536),
 (3362, 3361),
 (4422, 4422),
 (22463, 18909),
 (15159, 4598),
 (3411, 3400),
 (11531, 11065),
 (13865, 4766),
 (3485, 3418),
 (5549, 5296),
 (6491, 6473),
 (5568, 5533),
 (4760, 4747),
 (3446, 3169),
 (4614, 3429),
 (7561, 6648),
 (10759, 3489),
 (3720, 3183),
 (7814, 7793),
 (348300, 319681),
 (5192, 1931),
 (28498, 74),
 (7729, 7729),
 (5532, 43),
 (4025, 4013),
 (7538, 1012),
 (5097, 254),
 (3275, 1839),
 (3737, 161),
 (5827, 122),
 (4588, 117),
 (4122, 3527),
 (11660, 11659),
 (4895, 358),
 (4346, 2438)]
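
The cleaning step can be summarised on a toy example: every node stays only in the best-ranked category that contains it. The function and the data below are hypothetical and serve only to show the set difference at work.

```python
def clean_ranked(categories, rank):
    """Keep each node only in the first (best-ranked) category that contains it."""
    used = set()
    cleaned = {}
    for name in rank:
        cleaned[name] = set(categories[name]) - used
        used |= cleaned[name]
    return cleaned

toy_categories = {"A": [1, 2], "B": [2, 3], "C": [1, 3, 4]}
toy_rank = ["A", "B", "C"]
cleaned = clean_ranked(toy_categories, toy_rank)
print(cleaned)
```

After cleaning, the categories are disjoint, which is why the sizes in the second column above can only stay equal or decrease.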

Now we can start creating a subgraph induced by $C_0$:

- For each node in $C_0$ we compute its score as the sum of the weights of its in-edges.
- Then we extend the graph to the nodes of the next category in the rank. Imagine $C_1$ is our second category: we first compute the score of its nodes as before, then we add the contribution of the in-edges coming from the previous category, each of which weighs as much as the score of the node it starts from.
- We repeat the same for all the categories.
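
The steps above can be sketched on a toy graph. This is one possible reading of the procedure and rests on an assumption: in the first step only the in-edges coming from nodes of the same category are counted, as in an induced subgraph. The helper names and the `pred` dictionary (node to set of predecessors) are ours.

```python
def internal_weight(pred, category):
    """Number of in-edges coming from nodes of the same category."""
    return {n: len(pred.get(n, set()) & category) for n in category}

def extended_weight(pred, category, prev_weights):
    """Internal in-edges plus the scores of the predecessors in the previous category."""
    weights = internal_weight(pred, category)
    for n in category:
        weights[n] += sum(prev_weights.get(p, 0) for p in pred.get(n, set()))
    return weights

# pred[n] = set of the nodes that link to n
pred = {1: {2}, 2: {1, 3}, 3: {2}, 4: {2, 5}, 5: {4}}
c0, c1 = {1, 2, 3}, {4, 5}
w0 = internal_weight(pred, c0)
w1 = extended_weight(pred, c1, w0)
print(w0, w1)
```

Node 4 receives one link from inside its own category and one from node 2, whose score is 2, so its final score is 3.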

In [25]:
used_nodes = []

For this purpose we define two functions. The first one computes the weight of each node of a category from its in-edges.

In [26]:
def weight_nodes(category):
    '''Takes a category as a set and returns a list of [node, weight] pairs,
    sorted from the highest to the lowest weight of the nodes (articles)'''
    W = []
    for i in category:
        # the weight is the in-degree of the node in G (all its predecessors are counted)
        pred = set(G.predecessors(i))
        W.append([i, len(pred)])

    W = sorted(W, key=lambda x: x[1], reverse=True)
    return W

The second one assigns to each node its own weight plus the weight of the nodes of the previous category that link to it.

In [34]:
def weight_nodes2(category, prev_wei):
    '''Takes a category as a set and the list of [node, weight] pairs of the previous category.
    Returns a list with the internal weight plus the external one given by the previous category,
    sorted from the highest to the lowest weight of the nodes (articles)'''

    W = weight_nodes(category)
    prev = dict(prev_wei)  # node -> weight lookup for the previous category

    # for each predecessor that is in the previous category,
    # add the weight of the predecessor to the successor
    for entry in W:
        for p in G.predecessors(entry[0]):
            entry[1] += prev.get(p, 0)

    W = sorted(W, key=lambda x: x[1], reverse=True)
    return W

It's time to start calculating the weights: the first category only considers its own nodes.

In [28]:
#first category
weight = []

C = [categories[rank[0]]]

weight.append(weight_nodes(C[0]))

weight[0]


[[958480, 40],
 [1656246, 33],
 [584219, 25],
 [1203095, 25],
 [62684, 23],
 [1379114, 22],
 [1045188, 22],
 [170163, 22],
 [1270167, 21],
 [1656780, 20],
 [173007, 18],
 [1268881, 18],
 [1656777, 17],
 [1656778, 17],
 [1045180, 17],
 [54923, 16],
 [169696, 16],
 [87370, 14],
 [1123122, 13],
 [1656276, 13],
 [60066, 13],
 [1046056, 13],
 [1046335, 13],
 [1122526, 11],
 [666855, 11],
 [830711, 11],
 [1045309, 11],
 [1270073, 10],
 [1656794, 10],
 [53151, 10],
 [168194, 10],
 [1045173, 10],
 [1045317, 10],
 [1046153, 10],
 [1046336, 10],
 [1046408, 10],
 [170578, 10],
 [1203235, 10],
 [171464, 10],
 [666857, 9],
 [1766063, 9],
 [1045365, 9],
 [1045413, 9],
 [1046132, 9],
 [64632, 9],
 [1656779, 8],
 [174582, 8],
 [1084068, 8],
 [1109485, 8],
 [159606, 8],
 [1151309, 8],
 [1045266, 8],
 [1046076, 8],
 [1046473, 8],
 [1202756, 8],
 [1342864, 8],
 [1269463, 8],
 [1344834, 7],
 [215551, 7],
 [1045172, 7],
 [1045174, 7],
 [1045372, 7],
 [204079, 7],
 [1342960, 7],
 [1343014, 7],
 [1344690, 6]

Then for all the other categories we can apply the second function, which also considers the weight coming from the previous category.

In [None]:
# weight of nodes of other categories
#while add the edges between each category and the previous (thanks to function weight_nodes2)
for i in range(1, len(rank)):
    C = categories[rank[i]]
    w = weight_nodes2(C, weight[i-1])
    weight.append(w)


In [42]:
weight[3]

[[1061310, 7458],
 [1058611, 4763],
 [1044631, 4270],
 [479828, 4216],
 [1042784, 4038],
 [1062638, 3574],
 [1061160, 3561],
 [1061285, 3533],
 [1060315, 3467],
 [1063046, 3428],
 [1062453, 3367],
 [1044627, 3302],
 [1062344, 3224],
 [1061187, 3186],
 [1043120, 3112],
 [1061272, 3088],
 [749106, 2956],
 [1062870, 2851],
 [1226630, 2792],
 [1042737, 2771],
 [1060596, 2587],
 [1041818, 2491],
 [1060455, 2453],
 [1061306, 2425],
 [1060105, 2418],
 [1059259, 2367],
 [1253708, 2363],
 [1062409, 2341],
 [1062439, 2304],
 [1043093, 2284],
 [1063416, 2234],
 [1062738, 2226],
 [829969, 2219],
 [829968, 2218],
 [1062639, 2192],
 [1064194, 2167],
 [1060433, 2160],
 [1056682, 2091],
 [1058612, 2076],
 [1063713, 2069],
 [1056783, 2050],
 [1061245, 2031],
 [1059961, 2001],
 [1057020, 1987],
 [901726, 1985],
 [1063361, 1984],
 [1063064, 1972],
 [1063315, 1957],
 [940094, 1955],
 [1046431, 1944],
 [1059826, 1932],
 [1044085, 1931],
 [1060827, 1924],
 [1041861, 1915],
 [940096, 1914],
 [940083, 1912],


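We now have the weight of all the nodes, but we also want to create the subgraph described before and store each weight on its node.

In [None]:
# Assumption: F, not defined so far, is the subgraph of G induced by all the ranked nodes
F = G.subgraph([node for block in weight for node, _ in block]).copy()
for i in range(len(rank)):
    for node, value in weight[i]:
        F.nodes[node]['weight'] = value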