# PageRank Stability and Community Structure on Graphs
### <i>Abdel K. Bokharouss, Bart van Helvert, Joris Rombouts, Remco Surtel - January 2018</i>

# Task 1: PageRank Stability on Evolving Graphs

### Imports and general set-up

In [1]:
import networkx as nx
import plotly.plotly as py
import plotly.figure_factory as ff
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt

In [2]:
np.random.seed(98)
random.seed(99)

In [3]:
from IPython.display import display_html
def display_df_sbs(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

Load edge list and create a graph

In [4]:
fh = open("canvas/hamster.edgelist", 'rb')
G = nx.read_edgelist(fh, create_using=nx.DiGraph())
fh.close()

Next we run the pagerank algorithm with a dampening parameter of 0.85. The dampening parameter represents the likelyhood of clicking a link on the webpage. With a dampening parameter of 0.85 we indicate that there is a 85% of clicking a link on the webpage and 15% of going to a random other node in the graph. We calculate the page rank using the power iteration method.

In [5]:
pr = nx.pagerank(G, alpha=0.85)

In [6]:
df_edge_in = pd.DataFrame(list(G.in_degree), columns=['node', 'in edges'])
df_edge_out = pd.DataFrame(list(G.out_degree), columns=['node', 'out edges'])
df_rank = pd.DataFrame(list(pr.items()), columns=['node', 'score']).sort_values(by=['score'], ascending=False)
df_temp = pd.merge(df_rank, df_edge_in, on='node')
df_total = pd.merge(df_temp, df_edge_out, on='node')
df_total.index = df_total.index + 1
df_total.columns.name = 'rank'

display_df_sbs(df_total.head(10), df_total.iloc[500 : 510], df_total.iloc[1000 : 1010], 
                        df_total.iloc[1500 : 1510], df_total.iloc[2000 : 2010], df_total.tail(10))

rank,node,score,in edges,out edges
1,404,0.042793,10,0
2,195,0.019961,80,1
3,77,0.018628,121,2
4,728,0.01553,10,0
5,36,0.011117,168,5
6,135,0.009544,49,8
7,192,0.009365,57,3
8,281,0.009304,32,0
9,136,0.008853,85,6
10,184,0.008296,80,3

rank,node,score,in edges,out edges
501,469,0.000328,26,48
502,1406,0.000328,13,26
503,610,0.000326,18,16
504,972,0.000326,7,4
505,129,0.000324,5,6
506,834,0.000322,11,2
507,1266,0.000322,3,4
508,27,0.000322,4,6
509,480,0.000321,4,6
510,1335,0.00032,5,1

rank,node,score,in edges,out edges
1001,708,0.000203,4,5
1002,896,0.000203,8,8
1003,270,0.000203,14,20
1004,799,0.000203,9,7
1005,811,0.000202,3,7
1006,56,0.000202,7,6
1007,290,0.000202,1,0
1008,142,0.000202,3,3
1009,944,0.000201,1,0
1010,695,0.000201,3,3

rank,node,score,in edges,out edges
1501,1597,0.000137,1,3
1502,1908,0.000137,1,3
1503,1442,0.000137,1,3
1504,1702,0.000137,1,3
1505,1048,0.000137,1,3
1506,1259,0.000137,1,7
1507,2172,0.000137,1,3
1508,1771,0.000137,1,3
1509,2035,0.000137,1,3
1510,2131,0.000137,1,3

rank,node,score,in edges,out edges
2001,2327,0.000113,0,4
2002,2091,0.000113,0,5
2003,2329,0.000113,0,3
2004,2330,0.000113,0,4
2005,2335,0.000113,0,4
2006,2309,0.000113,0,4
2007,1359,0.000113,0,1
2008,2006,0.000113,0,5
2009,1501,0.000113,0,4
2010,1148,0.000113,0,1

rank,node,score,in edges,out edges
2417,1739,0.000113,0,4
2418,918,0.000113,0,4
2419,1743,0.000113,0,1
2420,1744,0.000113,0,1
2421,1745,0.000113,0,1
2422,1746,0.000113,0,3
2423,1748,0.000113,0,4
2424,1749,0.000113,0,3
2425,1751,0.000113,0,2
2426,2426,0.000113,0,8


As expected, the higher ranked pages have more incoming edges than the lower ranked pages on average. It is important to note that a page being linked by a lot of other pages doesn't imply that it will rank high on the pagerank. The rank of a page is mainly influenced by the quality links directed to the page. A page which is linked on many other pages however is still far more likely to end up higher in the pagerank than a page which is linked less frequently. This is also shown in the data from the pagerank calculation above. The lower the pagerank the fewer incoming edges those pages have. There are however some exceptions in the data. One of them is the number 1 ranked page. The rank of that page far exceeds the other pages having a score of 0.042793 = 4.3% while the second best ranked page only has a score of 0.019961 = 2.0%. We will analyze this page by looking at the quality of the pages that link to it.

In [7]:
def gen_df(node):
    df_pred = pd.DataFrame(list(G.predecessors(node)), columns=['node'])
    scores = {}
    out_edges = {}
    for n in G.predecessors(node):
        out_edges[n] = len(G.out_edges(n))
        scores[n] = pr.get(n)
    df_out_edges = pd.DataFrame(list(out_edges.items()), columns=['node', 'out edges'])     
    df_score = pd.DataFrame(list(scores.items()), columns=['node', 'score']).sort_values(by=['score'], ascending=False)
    df_temp = pd.merge(df_score, df_pred, on='node')
    df_total = pd.merge(df_temp, df_out_edges, on='node')
    df_total.index = df_total.index + 1
    df_total.columns.name = node
    return df_total

def gen_sum_inc(node):
    summation = 0
    for n in G.predecessors(node):
        summation += pr.get(n)
    return summation

In [8]:
display_df_sbs(gen_df('404'), pd.DataFrame(), gen_df('195').head(10), gen_df('195').tail(10))
print("Summation incoming node score for node 404: {sum}".format(sum=gen_sum_inc('404')))
print("Summation incoming node score for node 195: {sum}".format(sum=gen_sum_inc('195')))

404,node,score,out edges
1,195,0.019961,1
2,77,0.018628,2
3,192,0.009365,3
4,126,0.008144,1
5,346,0.005487,3
6,403,0.004875,2
7,24,0.003894,3
8,246,0.002964,2
9,882,0.002327,1
10,775,0.000385,1

195,node,score,out edges
1,77,0.018628,2
2,36,0.011117,5
3,192,0.009365,3
4,181,0.005597,12
5,346,0.005487,3
6,182,0.004905,13
7,116,0.004715,54
8,125,0.003839,4
9,115,0.003329,45
10,101,0.003121,5

195,node,score,out edges
71,2019,0.000119,13
72,618,0.000119,16
73,2195,0.000113,6
74,684,0.000113,54
75,2135,0.000113,7
76,2097,0.000113,1
77,2018,0.000113,15
78,855,0.000113,3
79,911,0.000113,15
80,2352,0.000113,3


Summation incoming node score for node 404: 0.07603158841472238
Summation incoming node score for node 195: 0.1244965540211601


The data clearly shows that there are a lot more links to 195 than 404. Also does the data show that the sum of the score of all the pages that link to 195 is almost twice as high as the sum of the score of all pages that link to 404. Even though this is the case the score of 404 is way higher than the score of 195. The reason for this is that the pages that link to 195 also link to a lot of other pages while this is not the case for 404. The amount of outgoing edges for the pages that link to 404 is lower than for 195. Also do links from low scoring pages not affect the score of a page by a lot. Most of the score that both pages 404 and 195 get is from a few pages with high scores and low amount of links.

We now compare node 404 and 728. They look very similar in terms of both links to and from the page. Both have 10 links going to that particular page and both pages contain no links. Except for them looking the same in terms of connected edges, the score of node 404 is a lot higher than the score of 728. The only explanation for this is that the quality of the incoming edges of 404 must be better than the quality of the incoming edges of 728. We confirm this by looking at the nodes with edges directed to both pages.

In [9]:
display_df_sbs(gen_df('404'), pd.DataFrame(), gen_df('728'))
print("Summation incoming node score for node 404: {sum}".format(sum=gen_sum_inc('404')))
print("Summation incoming node score for node 195: {sum}".format(sum=gen_sum_inc('728')))

404,node,score,out edges
1,195,0.019961,1
2,77,0.018628,2
3,192,0.009365,3
4,126,0.008144,1
5,346,0.005487,3
6,403,0.004875,2
7,24,0.003894,3
8,246,0.002964,2
9,882,0.002327,1
10,775,0.000385,1

728,node,score,out edges
1,697,0.007044,1
2,724,0.004799,4
3,222,0.003467,4
4,727,0.003348,1
5,726,0.002353,2
6,170,0.002058,12
7,221,0.002017,5
8,725,0.001993,1
9,723,0.001598,1
10,220,0.001575,4


Summation incoming node score for node 404: 0.07603158841472238
Summation incoming node score for node 195: 0.030253764353088613


From the summation of the incoming node score we see that 404 scores better. However in the comparison between 404 and 195 it was already shown that this doesn't necessarily imply that 404 will score better than 195. If we take a look at the number of outgoing edges of the incoming nodes we see that 728 has slightly more in total. This also doesn't necessarily mean that the score of 728 should be lower than 404. The impact of an high amount of outgoing edges a node that has an high score is way more influential than when a node with a low score has an high amount of outgoing edges. The total score and amount of summations are a good indicator but not always right. In this case however it is.

## Graph evolution and PageRank values comparison
### Joris & Abdel

Load edge list and create a graph

In [10]:
fh = open("canvas/hamster.edgelist", 'rb')
G = nx.read_edgelist(fh, create_using=nx.DiGraph())
fh.close()

In [11]:
def calc_pagerank(G_in, alpha = 0.85):
    return nx.pagerank(G_in, alpha=0.85)

In [12]:
pr_origin = calc_pagerank(G)

In [13]:
def create_dataframe(pr, G_in):
    df_edge_in = pd.DataFrame(list(G_in.in_degree()), columns=['node', 'in edges'])
    df_edge_out = pd.DataFrame(list(G_in.out_degree()), columns=['node', 'out edges'])
    df_rank = pd.DataFrame(list(pr.items()), columns=['node', 'score']).sort_values(by=['score'], ascending=False)
    df_temp = pd.merge(df_rank, df_edge_in, on='node')
    df_total = pd.merge(df_temp, df_edge_out, on='node')
    df_total.index = df_total.index + 1
    df_total.columns.name = 'rank'
    return df_total

In [14]:
df_origin = create_dataframe(pr_origin, G)
df_origin.head(10)

rank,node,score,in edges,out edges
1,404,0.042793,10,0
2,195,0.019961,80,1
3,77,0.018628,121,2
4,728,0.01553,10,0
5,36,0.011117,168,5
6,135,0.009544,49,8
7,192,0.009365,57,3
8,281,0.009304,32,0
9,136,0.008853,85,6
10,184,0.008296,80,3


## 1b. Graph Evolution and Pagerank values comparison

In this section the effects of graph evolutions are going to be studied in relation to an evaluation of the stability of PageRank. In particular, various methodologies are going to be devised and exploited in which graphical represesentations of a social network are going to be altered by the removal and/or addition of nodes and edges in these graphs. The original graph $G$, represents a social network of friendships and familylinks between users of the website <a>hamsterster.com</a>. Various functions which make it possible to change this graph are going to be given and explained. Some of these functions focus on the addition or removal of edges, while other focus on nodes. Some of these functions are going to do make choices at random, while others are also going to exploit randomness, but proporotional to the node degree and other statistics. The choice is made to analyze the effects of the functions which evolve the graphs on the original graph $G$. So the evaluation of the various functions which add/remove graphs is going to be done starting from the full and original graph $G$ for each of the given functions. 

<i>Note: A social network would naturally be described with an undirected graph. The social network data is, however, treated as a combination of target and source id's which faciliate the usage of this data as a directed graph for the sake of implementing and testing graph evolutions methods to evaluate the stability of PageRank. No implications or conclusions should be directly related to the actual structure of the social networks of the website</i>

### A. Removing and adding edges uniformly at random

For $n$ number of nodes do the following:
* select 1 node uniformly at random
* add or remove an incoming/outgoing at random 

In [15]:
#add/remove edges for all the nodes uniformly at random
def random_edges_uniform_random(G_in, number_of_nodes = 1, choice_given = False, choice = False):
    nr_of_edges_added = 0
    nr_of_edges_removed = 0
    
    list_of_nodes = list(G_in) # all the nodes
    # select uniformly at random nodes of which we are going to add/remove edges
    selected_nodes = list(np.random.choice(list_of_nodes, size = number_of_nodes, replace = False)) # default probability p is an uniform distribution
    
    for node in selected_nodes: 
        successors = list(G_in.successors(str(node))) # find the successors of this nodes
        predecessors = list(G_in.predecessors(str(node))) # find the predecessors of this node
        #find candidates for new edges
        unconnected_to = [n for n in list(G_in.nodes()) if not n in successors] # no outgoing edge to these nodes
        unconnected_from = [n for n in list(G_in.nodes()) if not n in predecessors] # no incoming edge from these nodes
        
        if (choice_given):
            add = choice
        else:
            add = bool(random.getrandbits(1)) # randomly add or remove an edge of this node
        
        incoming =  bool(random.getrandbits(1)) # randomly add an outgoing/incoming edge
        if(add): # add an incoming/outgoing edge to node
            if(incoming): # add incoming edge
                if len(unconnected_from): #only add when unconnected_from is not empty
                    new = random.choice(unconnected_from)
                    G_in.add_edge(new, node)
                    print("\tnew edge:\t {} --> {}".format(new, node))
                    unconnected_from.remove(new)
                    predecessors.append(new)
            else: # add outgoing edge:
                if len(unconnected_to): #only add when unconnected_to is not empty
                    new = random.choice(unconnected_to)
                    G_in.add_edge(node, new)
                    print("\tnew edge:\t {} --> {}".format(node, new))
                    unconnected_to.remove(new)    
                    successors.append(new)
            nr_of_edges_added += 1
        else: # remove
            if(incoming): # remove incoming edge
                if len(predecessors): #only remove when predecessors is not empty
                    new = random.choice(predecessors)
                    G_in.remove_edge(new, node)
                    print("\tremove edge:\t {} --> {}".format(new, node))
                    predecessors.remove(new)
                    unconnected_from.append(new)
            else: # remove outgoing edge:
                if len(successors): #only remove when successors is not empty
                    new = random.choice(successors)
                    G_in.remove_edge(node, new)
                    print("\tremove edge:\t {} --> {}".format(node, new))
                    successors.remove(new)    
                    unconnected_to.append(new)
            nr_of_edges_removed += 1
            
    print("number of edges added: " + str(nr_of_edges_added))
    print("number of edges removed " + str(nr_of_edges_removed))
    
    return G_in

`random_edges_uniform_random` selects one node uniformly at random, and then for that specific node it uniformly at random adds or removes one outgoing or ingoing edge, which is also determined uniformly at random. Because the choice is made to analyze the effects of the functions which evolve the graphs on the original graph $G$, we call the function `random_edges_uniform_random` parameterized with a copy of $G$ and `number_of_nodes`$ = 100$. In other words, `random_edges_uniform_random` will either add or remove either an incoming or outgoing edge for each node of `number_of_nodes`. The result graph is stored in `G_random_edges_uniform_random`.

In [16]:
G_random_edges_uniform_random = random_edges_uniform_random(G.copy(), 100)

	remove edge:	 1612 --> 817
	remove edge:	 301 --> 303
	new edge:	 709 --> 1036
	new edge:	 1653 --> 2179
	new edge:	 2207 --> 1376
	remove edge:	 2412 --> 2413
	remove edge:	 648 --> 845
	new edge:	 889 --> 909
	new edge:	 1532 --> 2107
	new edge:	 541 --> 877
	remove edge:	 2384 --> 2354
	remove edge:	 2248 --> 305
	remove edge:	 522 --> 523
	remove edge:	 1887 --> 1889
	remove edge:	 697 --> 728
	remove edge:	 1801 --> 308
	new edge:	 2192 --> 1863
	remove edge:	 1631 --> 1638
	new edge:	 2328 --> 1903
	new edge:	 1292 --> 1467
	remove edge:	 1300 --> 249
	new edge:	 1810 --> 717
	new edge:	 728 --> 1889
	new edge:	 721 --> 1112
	remove edge:	 1123 --> 421
	remove edge:	 37 --> 60
	new edge:	 1234 --> 333
	new edge:	 189 --> 359
	remove edge:	 1724 --> 1725
	remove edge:	 964 --> 967
	new edge:	 1934 --> 825
	new edge:	 227 --> 460
	new edge:	 2049 --> 1817
	remove edge:	 470 --> 958
	new edge:	 522 --> 736
	remove edge:	 2094 --> 303
	remove edge:	 54 --> 121
	remove edge:	 830 -->

Next, we run `calc_pagerank` to calculate the new pagerank scores of `G_random_edges_uniform_random`. Thereafter a dataframe is created of the pagerank scores, together with for each node the number of incoming edges and outcoming edges. The nodes are sorted on the pagerank score, in descending order.

In [17]:
pr_random_edges_uniform_random = calc_pagerank(G_random_edges_uniform_random)
df_random_edges_uniform_random = create_dataframe(pr_random_edges_uniform_random, G_random_edges_uniform_random)
df_random_edges_uniform_random.head(10)

rank,node,score,in edges,out edges
1,404,0.042988,10,0
2,195,0.020129,80,1
3,77,0.018712,122,2
4,36,0.011565,168,5
5,192,0.009662,57,3
6,135,0.009649,49,8
7,728,0.009423,9,1
8,281,0.009182,32,0
9,136,0.008851,85,6
10,184,0.008165,80,3


Before the results are analyzed, lets first talk about some intuition what could happen when edges are added or removed uniformly at random. The original graph $G$ is a scale-free network, i.e. its degree distribution follows a power law, at least asymptiotically. This means that in this type of network structure, there will be many nodes with very low level of connectivity. And very few or except one node with exceptionally high degree of connectivity. So the nodes are very unequal in terms of how connected and influential the different nodes in the network are. Scale free networks describes a power or exponential relationship between the degree of connectivity a node has and the frequency of its occurence. This results in a highly centralized network. In the social network that is loaded from hamsterster.com, we have some people who have very many links into them, but there are also many people that have very few links into them. The power law distribution is often explained with reference to preferential attachment. Preferential attachment describes how a new node is linked amongst a number of nodes according to how much they already have. So those who already have a lot of links will receive more than those who have litte: the so called "rich get richer model". In the paragraphs D-E-F-G, preferential attachment proporional to some statistical measures is elaborated and analyzed. Paragraphs A-B-C will analyze the effects of adding/removing nodes/adges uniformly at random. In paragraph A, we add or remove only edges uniformly at random. In paragraph B nodes and corresponding edges are added uniformly at random. In paragraph C nodes and corresponding edges are removed uniformly at random. All modifications that will take place in paragraph A-B-C are determined uniformly at random, i.e. all nodes have equal probability to be chosen or to be removed. Scale free networks can be very robust or very fragile, depending on how we remove nodes (randomly or strategically). If we remove nodes uniformly at random, the network will be very robust to failure. This is because the vast majority of nodes have a very low degree of connectivity. Therefore, it is very likely that we will modify one of these insignificant nodes with little effect on the overall network.
So real word networks, like the network from hamsterster.com, are resilient to random attacks.  

When we add or remove only edges, we expect a constant average degree, i.e. the number of edges grows linearly with the number of nodes. Also, we expect that as the network grows, the distances between nodes grow. From the output of the function above, we see that $54$ edges are added, and $46$ edges are removed. When we look at the top ten nodes, we see that the top ten of `G_random_edges_uniform_random` exactly matches the top ten of the original graph $G$. Let's dive deeper into both graphs to look if there changed else.

First, we compare the $density$ between the original graph G and the graph `G_random_edges_uniform_random`. The density for directed graphs is: $d = \frac{E}{V(V-1)}$, where $E$ denotes the total number of edges and $V$ denotes the total number of nodes in the particular graph.

In [18]:
nx.info(G)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16631\nAverage in degree:   6.8553\nAverage out degree:   6.8553'

In [19]:
nx.info(G_random_edges_uniform_random)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16643\nAverage in degree:   6.8603\nAverage out degree:   6.8603'

In [20]:
nx.density(G), nx.density(G_random_edges_uniform_random)

(0.002826935008201528, 0.002828974766490171)

When we determine inside the function whether we add or remove an edge, the top ten of nodes are equal to the top ten nodes. This makes totally sense. The purpose of the function is adding or removing edges uniformaly at random to `number_of_nodes` nodes. These selected nodes all have the same probability to be selected. As explained earlier, there are a lot of nodes in the network that have a low connectivity and only a few nodes with a very high connectivity. It is very likely that we will modify one of these insignificant nodes with little effect on the overall network and therefore we can explain that the top ten ranked nodes are not changed after adding or removing only nodes. 

### only edges added

In [21]:
#number_of_nodes_random = random.randint(1,  int(0.1 * nx.number_of_edges(G.copy()))) #max 10% of edges to add
#print("number of nodes: " + str(number_of_nodes_random))
G_random_add_edges_uniform_random = random_edges_uniform_random(G.copy(), 100, True, True)

	new edge:	 411 --> 2046
	new edge:	 1620 --> 714
	new edge:	 1273 --> 708
	new edge:	 996 --> 1888
	new edge:	 2208 --> 531
	new edge:	 536 --> 2138
	new edge:	 1830 --> 1848
	new edge:	 1594 --> 1670
	new edge:	 1140 --> 2302
	new edge:	 733 --> 829
	new edge:	 1821 --> 1030
	new edge:	 2310 --> 1126
	new edge:	 492 --> 2347
	new edge:	 1568 --> 1933
	new edge:	 755 --> 1978
	new edge:	 1956 --> 1742
	new edge:	 705 --> 374
	new edge:	 311 --> 762
	new edge:	 982 --> 1329
	new edge:	 1885 --> 1525
	new edge:	 67 --> 658
	new edge:	 1553 --> 1353
	new edge:	 2024 --> 2260
	new edge:	 115 --> 1421
	new edge:	 17 --> 951
	new edge:	 2307 --> 118
	new edge:	 1900 --> 50
	new edge:	 1307 --> 2096
	new edge:	 226 --> 1078
	new edge:	 681 --> 568
	new edge:	 195 --> 2231
	new edge:	 67 --> 160
	new edge:	 393 --> 1424
	new edge:	 942 --> 1138
	new edge:	 1217 --> 16
	new edge:	 2423 --> 1800
	new edge:	 729 --> 1713
	new edge:	 373 --> 776
	new edge:	 1574 --> 1999
	new edge:	 1674 --> 1122

In [22]:
pr_random_add_edges_uniform_random = calc_pagerank(G_random_add_edges_uniform_random)
df_random_add_edges_uniform_random = create_dataframe(pr_random_add_edges_uniform_random, G_random_add_edges_uniform_random)
df_random_add_edges_uniform_random.head(10)

rank,node,score,in edges,out edges
1,404,0.035454,10,0
2,195,0.020621,80,2
3,77,0.019237,121,2
4,728,0.015491,10,0
5,36,0.011451,168,5
6,192,0.010375,57,3
7,135,0.009399,49,8
8,281,0.00918,32,0
9,2231,0.008848,1,9
10,136,0.008765,86,6


Above, we see the resulting top ten dataframe after running the function `random_edges_uniform_random`, parameterized with a copy of the original graph  GG , number_of_nodes = 248 and choice = True. In other words, the function adds 248 new edges (incoming or outgoing) to the network. The output is exaclty what we expect: the top ten is not changed that much. The only new node is node 281, which replaced node 184 of the original graph. Indeed, the new nodes that are added uniformly at random don't have a preferential attachment to the highly connected, because the number of incoming edges and outcoming edges for the remaining top nine nodes are exactly the same. 

In [23]:
nx.info(G)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16631\nAverage in degree:   6.8553\nAverage out degree:   6.8553'

In [24]:
nx.info(G_random_add_edges_uniform_random)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16731\nAverage in degree:   6.8965\nAverage out degree:   6.8965'

In [25]:
nx.density(G), nx.density(G_random_add_edges_uniform_random)

(0.002826935008201528, 0.0028439329939402183)

### only edges removed

In [26]:
#number_of_nodes_random = random.randint(1,  int(0.1 * nx.number_of_edges(G.copy()))) #max 10% of edges to remove to select
#print("number of nodes: " + str(number_of_nodes_random))
G_random_remove_edges_uniform_random = random_edges_uniform_random(G.copy(), 100, True, False)

	remove edge:	 2391 --> 328
	remove edge:	 1073 --> 1074
	remove edge:	 12 --> 653
	remove edge:	 1126 --> 1127
	remove edge:	 2300 --> 864
	remove edge:	 2072 --> 908
	remove edge:	 1428 --> 1427
	remove edge:	 394 --> 411
	remove edge:	 1779 --> 371
	remove edge:	 206 --> 207
	remove edge:	 2018 --> 2019
	remove edge:	 2138 --> 2139
	remove edge:	 1861 --> 127
	remove edge:	 1410 --> 1411
	remove edge:	 407 --> 109
	remove edge:	 1866 --> 544
	remove edge:	 689 --> 100
	remove edge:	 1144 --> 1146
	remove edge:	 683 --> 690
	remove edge:	 489 --> 486
	remove edge:	 2242 --> 2243
	remove edge:	 1579 --> 603
	remove edge:	 891 --> 894
	remove edge:	 1666 --> 1672
	remove edge:	 1956 --> 436
	remove edge:	 1874 --> 410
	remove edge:	 1221 --> 1223
	remove edge:	 1869 --> 1870
	remove edge:	 1414 --> 1416
	remove edge:	 2245 --> 412
	remove edge:	 889 --> 890
	remove edge:	 2039 --> 2044
	remove edge:	 83 --> 131
	remove edge:	 1038 --> 1039
	remove edge:	 637 --> 447
	remove edge:	 2188

In [27]:
pr_random_remove_edges_uniform_random = calc_pagerank(G_random_remove_edges_uniform_random)
df_random_remove_edges_uniform_random = create_dataframe(pr_random_remove_edges_uniform_random, G_random_remove_edges_uniform_random)
df_random_remove_edges_uniform_random.head(10)

rank,node,score,in edges,out edges
1,404,0.042517,9,0
2,195,0.028656,80,1
3,77,0.019047,121,1
4,728,0.015237,10,0
5,36,0.012024,167,5
6,192,0.010468,57,3
7,135,0.009435,49,8
8,281,0.009245,31,0
9,136,0.008725,85,6
10,184,0.008287,80,3


Above, we see the resulting top ten dataframe after running the function `random_edges_uniform_random`, parameterized with a copy of the original graph $G$, `number_of_nodes` = 100 and choice = False. In other words, the function only removes edges for 100 nodes in the graph uniformly at random. Also in this case we see that the top ten remains the same. This means that indeed the edges are removed uniformly at random, i.e. most of the edges that are removed are of the insignificant edges. Also we see that the score of the first node, node 404, is slightly decreased, while the second and third nodes (77 and 195) are increased. This means that the function did remove some high ranked links from node 404. Removing nodes will decrease the average degree of the nodes, which is comfirmed by the two info cells below.

In [28]:
nx.info(G)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16631\nAverage in degree:   6.8553\nAverage out degree:   6.8553'

In [29]:
nx.info(G_random_remove_edges_uniform_random)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16545\nAverage in degree:   6.8199\nAverage out degree:   6.8199'

In [30]:
nx.density(G), nx.density(G_random_remove_edges_uniform_random)

(0.002826935008201528, 0.0028123167404662548)

When edges are removed, the density of the network shrinks which is what we expected. 

### B. Adding nodes uniformly at random (copying model)

For $n$ iterations do the following:
* Make a new node instance $n$
* with a uniform random distribution pick $k$ nodes in the original graph
* copy the incoming/outgoing edges of the $k$ nodes for $n$
* choose with an unifrom distribution another node $l$ and add its edges also to $n$

<i>The last step might seem redundant at the moment, but later when the the $k$ nodes are going to be chosen with at random but proportional to a certain statistic, it makes sense to have a step in which you pick another node $l$ that is chosen with the opposite property so that the generation/stability of communities is ensured (i.e. power-law degree). In this implementation. This step is omitted, but it will be thus added in the functions that take statistical measures into consideration when choosing nodes at random</i>

In [31]:
#randomly add and remove nodes
#Edge Copying Model (slide 53 of Week6-SNA-Props)
def random_add_nodes_uniform(G_in, number_of_nodes = 1, k = 5):
    print("number of edges before :"+ str(len(G_in.edges())))
    for _ in range(number_of_nodes):
        #k is number of edges to be added, random integer 1 between 5
        k = random.randint(1, k) #select k random vertices
        #print("k = " + str(k))
        new_node = nx.number_of_nodes(G_in) + 1 #add node to graph
        # print("new node = " + str(new_node))
        
        list_of_nodes = list(G_in)  #create list of nodes
        
        G_in.add_node(str(new_node))   
        k_random_selected_nodes = np.random.choice(list_of_nodes, size = k, replace = False) # k nodes with a uniform distribution
        
        for node in k_random_selected_nodes:
            #print("node in k_random_selected_nodes = " + str(node))
            successors = list(G_in.successors(str(node)))
            #print("succesors are " + str(successors))
            for node_to in successors:
                G_in.add_edge(new_node, node_to) # add outgoing edges
            predecessors = list(G_in.predecessors(str(node)))
            for node_from in predecessors:
                G_in.add_edge(node_from, new_node) # add incoming edges
    print("number of edges after :"+str(len(G_in.edges())))
    return G_in

In [32]:
#number_of_nodes_to_add_random = random.randint(1,  int(0.1 * len(list(G.copy()))))
#print("number of nodes added: " + str(number_of_nodes_to_add_random))
G_random_add_nodes_uniform = random_add_nodes_uniform(G.copy(), 100)

number of edges before :16631
number of edges after :17756


In [33]:
pr_random_add_nodes_uniform = calc_pagerank(G_random_add_nodes_uniform)
df_random_add_nodes_uniform = create_dataframe(pr_random_add_nodes_uniform, G_random_add_nodes_uniform)
df_random_add_nodes_uniform.head(10)

rank,node,score,in edges,out edges
1,404,0.041764,10,0
2,195,0.01959,83,1
3,77,0.018303,128,2
4,728,0.015904,12,0
5,36,0.010739,172,5
6,135,0.009313,52,8
7,281,0.009128,34,0
8,192,0.008882,58,3
9,136,0.008499,87,6
10,184,0.00819,85,3


`random_add_nodes_uniform` add nodes uniformly at random, while the degree distribution of the network still satisfies the power law distribution. We call the function, parameterized with `number_of_nodes` =  maximal 10% of the total number of nodes that are in $G$. What we see is that rank of some nodes in the top ten is changed. We even see a new node in top ten, node 126. This means that randomness can create new strong nodes, by adding new nodes and create new strong communities in the graph. However, since this all happens unformly at random, we can't guarantee that this happens every single run (which we can guarantee when we add proportional to some statistical measure, see D-E-F-G). The added nodes are linked, one by one, to $k$ random selected nodes. Note that every node in $G$ does have the same probability, so again, it is most likely that we select the insignificant nodes as nodes where we link the new nodes to. We see that node 126 is in the top then , but this is because node 184 has a lower pagerank score compared to the original graph $G$. This could be that the some of the new added nodes are linked to node 184, and because this nodes have a low pagerank score, it will have a negative effect on node 184. We can also look at the density of graph  `G_random_add_nodes_uniform`. Because we add nodes, the density should be lower compared to the original graph $G$. 

In [34]:
nx.info(G)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16631\nAverage in degree:   6.8553\nAverage out degree:   6.8553'

In [35]:
nx.info(G_random_add_nodes_uniform)

'Name: \nType: DiGraph\nNumber of nodes: 2622\nNumber of edges: 17756\nAverage in degree:   6.7719\nAverage out degree:   6.7719'

In [36]:
nx.density(G), nx.density(G_random_add_nodes_uniform)

(0.002826935008201528, 0.0025837198872801998)

### C. Removal of nodes uniformly at random

Let $n$ represent the number of nodes that should be removed. If the $number\_of\_nodes$ parameter is givem then $n = number\_of\_nodes$, if this parameter is not specfied by the caller we have $n = \lfloor(0.1 * total\_number\_of\_nodes(G\_in))\rfloor)$

For $n$ iterations do the following:
* Select a node $m$ uniformly at random (iterations are abstracted by $np.random.choice$)
* Remove this node and its respective edges from the graph

In [37]:
def random_removal_nodes_uniform(G_in, number_given = False, number_of_nodes = None):
    if (number_given & number_of_nodes < len(list(G_in))): # check if we do not remove too much nodes
        n = number_of_nodes
    else:
        n = int(0.1 * len(list(G_in)))  # max 10% of nodes
    #remove nodes and corresponding edges
    print("number of nodes before :"+ str(len(list(G_in))))
    list_of_nodes = list(G_in)
    selected_nodes = np.random.choice(list_of_nodes, size = n, replace = False)
    for m_remove in selected_nodes:
        G_in.remove_node(m_remove)
    print("number of nodes after :"+ str(len(list(G_in))))
    return G_in

In [38]:
G_random_removal_nodes_uniform = random_removal_nodes_uniform(G.copy(), True, 100)

number of nodes before :2426
number of nodes after :2326


In [39]:
pr_random_removal_nodes_uniform = calc_pagerank(G_random_removal_nodes_uniform)
df_random_removal_nodes_uniform = create_dataframe(pr_random_removal_nodes_uniform, G_random_removal_nodes_uniform)
df_random_removal_nodes_uniform.head(10)

rank,node,score,in edges,out edges
1,404,0.043358,10,0
2,195,0.020496,79,1
3,77,0.018887,119,2
4,728,0.015455,10,0
5,36,0.011418,167,5
6,135,0.009693,47,8
7,192,0.009582,56,3
8,281,0.009136,30,0
9,136,0.009073,83,6
10,184,0.008431,76,3


In [40]:
nx.info(G)

'Name: \nType: DiGraph\nNumber of nodes: 2426\nNumber of edges: 16631\nAverage in degree:   6.8553\nAverage out degree:   6.8553'

In [41]:
nx.info(G_random_removal_nodes_uniform)

'Name: \nType: DiGraph\nNumber of nodes: 2326\nNumber of edges: 15504\nAverage in degree:   6.6655\nAverage out degree:   6.6655'

In [42]:
nx.density(G), nx.density(G_random_removal_nodes_uniform)

(0.002826935008201528, 0.0028668904113388623)

In [43]:
avg_node_degree_full_graph = df_origin["in edges"].mean()
avg_node_degree_graph_removed_nodes = df_random_removal_nodes_uniform["in edges"].mean()
avg_node_degree_full_graph, avg_node_degree_graph_removed_nodes

(6.855317394888706, 6.665520206362855)

`random_removal_nodes_uniform` removes nodes from the graph $G$ uniformly at random, parameterized with `number_of_nodes` = 100. It's is important to check whether the nodes that are removed are removed uniformly at random, i.e. the nodes that are removed are mostly the insignificant nodes. When we look at the output of the cell above, we see that the average degree of the nodes that are removed is slightly lower than the average degree of the full orignal graph $G$. This is good, because it means that in most of the cases only the insignificant nodes are removed. When we compare the top ten of `G_random_removal_nodes_uniform` to the top ten of $G$, we see that they contain the same items, only the order of some nodes has changed. For example, node 36 has climbed one spot, to rank 4. What happened is that `random_removal_nodes_uniform` removed some nodes that were linked to node 36 with a low pagerank score, and therefore the overall pagerank score of 36 has increased slightly. The most important conclusion that we can draw after this function call is that scale free networks are very robust against removing of nodes uniformly at random. As explained earlier, all nodes that are possible candidates to be removed have an equally probability to be chosen. Because we have only a few nodes with a very large connectivity, it is most likely that we select the nodes with a very low connectivity. Therefore the overall end result does not differ much compared to the original graph $G$.

## Graph evolution methodologies using statistical measures

The following section is going to focus on graph evolution methodologies (e.g. node/edge removal/addition) that use statistical measures obtained from statisical measures obtained the original network and/or nodes. Like in the previous section, the effect is studied in relation to the original graph. In other words, we do not continue with the evolved graph after the evaluation of each function. In stead, we start with a new copy of the original graph for each function which enables us to compare the different type of graph evolution methodologies objectively and fair, and study their effects on the PageRank stability of the (original) network.

### D. Removal of nodes at random but proportional to the degree of the nodes

Let $n$ represent the number of nodes that should be removed. If the $number\_of\_nodes$ parameter is givem then $n = number\_of\_nodes$, if this parameter is not specfied by the caller we have $n = \lfloor(0.1 * total\_number\_of\_nodes(G\_in))\rfloor)$

For $n$ iterations do the following:
* Select a node $m$ at random, but proportional to the in degree's of the nodes (iterations are abstracted by $np.random.choice$)
* Remove this node and its respective edges from the graph

In [44]:
def random_node_removals_proportional_degree(G_in, number_given = False, number_of_nodes = None, in_degree = True):
    if (number_given & number_of_nodes < len(list(G_in))): # check if we do not remove too much nodes
        n = number_of_nodes
    else:
        n = int(0.1 * len(list(G_in)))  # max 10% of nodes
    #remove nodes and corresponding edges
    print("number of nodes before :"+ str(len(list(G_in))))
    list_of_nodes = list(G_in)
    if (in_degree):
        degrees = dict(G_in.in_degree()).values() # in_degrees of all the nodes
    else: # out_degree
        degrees = dict(G_in.out_degree()).values() # in_degrees of all the nodes
    prob_degree = [float(i)/sum(degrees) for i in degrees] # probabilities proportional to degree
    
    selected_nodes = np.random.choice(list_of_nodes, size = n, replace = False, p = prob_degree)
    for m_remove in selected_nodes:
        G_in.remove_node(m_remove)
    print("number of nodes after :"+ str(len(list(G_in))))
    return G_in

In [45]:
G_random_node_removals_proportional_indegree = random_node_removals_proportional_degree(G.copy(), True, 100, True)
pr_random_node_removals_proportional_indegree = calc_pagerank(G_random_node_removals_proportional_indegree)
df_random_node_removals_proportional_indegree = create_dataframe(pr_random_node_removals_proportional_indegree,
                                                                 G_random_node_removals_proportional_indegree)
df_random_node_removals_proportional_indegree.head()

number of nodes before :2426
number of nodes after :2326


rank,node,score,in edges,out edges
1,404,0.027927,8,0
2,728,0.02471,10,0
3,77,0.019085,104,1
4,2,0.012414,33,0
5,697,0.011594,15,1


Let's have a look at the hundred nodes that are actually removed by the algorithm. The assumption is that the average in-degree of these 100 nodes is higher than the average node degree of the original graph $G$ since the nodes are removed at random, but proportional to their in-degree.

In [46]:
avg_node_degree_full_graph = df_origin["in edges"].mean()
avg_node_degree_graph_removed_nodes = df_random_node_removals_proportional_indegree["in edges"].mean()
round(avg_node_degree_full_graph, 6), round(avg_node_degree_graph_removed_nodes, 6)

(6.855317, 5.33448)

Since the average node in-degree dropped significantly, we can indeed conclude that the function removed nodes at random, but proportional to their node degrees since th drop in the average node degree of all the nodes.

In [47]:
nodes_original_graph = set(df_origin.node.values)
nodes_evolved_graph = set(df_random_node_removals_proportional_indegree.node.values)
removed_nodes = pd.DataFrame(list(nodes_original_graph.difference(nodes_evolved_graph)))
removed_nodes = pd.merge(df_origin, removed_nodes, left_on = 'node', right_on = 0)
avg_node_degree_removed_nodes = removed_nodes["in edges"].mean()
avg_node_degree_removed_nodes

33.31

And as already confirmed in the previous cell we can indeed conclude that the nodes that are removed have a significant higher node in-degree (on-average) than the average of the nodes in the original graph. We can thus indeed conclude that the function removed nodes at random, but proportional to their node degrees since th drop in the average node degree of all the nodes.

Let's use our knowledge of the inner-workings of PageRank to state hyptothesis about the changes of PageRank of the (nodes in the) evolved graph, compared to the original graph. We know that the PageRank score of a node is influenced by the number (the set) of incoming edges, the number of outbound links of the source nodes of these edges and the score of the particular source nodes. In addition, the damping factor is used in the calculations, but this is kept the same in this PageRank stability analysis so will be left out of consideration.

Let's have a look at what has happened to the PageRank scores after the removal of nodes at random, but proportional to the in-degree. As we have seen earlier, the nodes that were removed had on average, a much higher in-degree. Considering the PageRank formula, there is a high chance that nodes with a high number of incoming edges have a relatively high score. But this is affected by the quality of the nodes sourcing the incoming edge. In other words, even if a certain node $n_1$ has a high number of incoming edges than a node $n_2$, but the nodes connected to $n_2$ (have an outgoing edge to $n_2$ have relatively less outgoing links and a relatively higher score, it could be that the score of $n_2$ is higher than $n_1$.

Let's assess whether the nodes that were removed with a relaltively high in-degree compared to the rest of the nodes, have had a negative effect on the average score of the nodes in the graph. Considering the elabaroation that is given in the previous paragraph, this is affected by the number of outgoing links of the nodes that had an edge to the removed nodes and the score of these nodes.

In [48]:
avg_score_removed_nodes = removed_nodes.score.mean()
avg_score_full_graph = df_origin.score.mean()
round(avg_score_removed_nodes, 6), round(avg_score_full_graph, 6), round(avg_score_removed_nodes / avg_score_full_graph, 2)

(0.001654, 0.000412, 4.01)

We see that the the nodes that were removed indeed have a higher score (see factor) than the average score in the original graph of all the nodes.The nodes that had an outgoing edge to the nodes that were removed thus did not have such high number of total outgoing links that it affected the score of the removed nodes in such a way that the average score is not higher than that of the average node in the original graph.

Let's have a look at what happens with the PageRank scores when we remove nodes at random, but proportional to their out-degrees. When you look at the formula of PageRank, you see that an edge from a node with a lot of outoging links will make no significant contribution to the score of the particular node (when compared to a node of the same score with less outgoing links). The assumption is, therefore, that unlike in the previous case, the average PageRank score in the graph in which nodes are removed, will hardly differ from the average PageRank score in the original graph.

In [49]:
G_random_node_removals_proportional_outdegree = random_node_removals_proportional_degree(G.copy(), True, 100, False)
pr_random_node_removals_proportional_outdegree = calc_pagerank(G_random_node_removals_proportional_outdegree)
df_random_node_removals_proportional_outdegree = create_dataframe(pr_random_node_removals_proportional_outdegree,
                                                                 G_random_node_removals_proportional_outdegree)
df_random_node_removals_proportional_outdegree.head()

number of nodes before :2426
number of nodes after :2326


rank,node,score,in edges,out edges
1,404,0.042863,10,0
2,195,0.020472,73,1
3,77,0.01907,108,2
4,728,0.015702,10,0
5,36,0.011485,152,5


In [50]:
avg_score_graph_removed_nodes = df_random_node_removals_proportional_outdegree.score.mean()
avg_score_full_graph = df_origin.score.mean()
avg_score_graph_removed_nodes, avg_score_full_graph, round(avg_score_graph_removed_nodes / avg_score_full_graph, 2)

(0.0004299226139294739, 0.00041220115416324565, 1.04)

And indeed, there is no significant differences in the average PageRank scores of the full network when removing nodes at random, but proportional to the out degree of the nodes (i.e. nodes with a higher out-degree have a higher probability to be removed at random)

### E. Removal of nodes at random but proportional to the hubs/authorithy measures (HITS) of nodes

Let $n$ represent the number of nodes that should be removed. If the $number\_of\_nodes$ parameter is givem then $n = number\_of\_nodes$, if this parameter is not specfied by the caller we have $n = \lfloor(0.1 * total\_number\_of\_nodes(G\_in))\rfloor)$

For $n$ iterations do the following:
* Select a node $m$ at random, but proportional to HITS measures (i.e. hub or authority) of the nodes (iterations are abstracted by $np.random.choice$)
* Remove this node and its respective edges from the graph

In [51]:
def random_node_removals_proportional_HITS(G_in, authorithy = False, number_given = False, number_of_nodes = None):
    print("number of nodes before: " + str(len(list(G_in))))
    if (number_given & number_of_nodes < len(list(G_in))): # check if we do not remove too much nodes
        n = number_of_nodes
    else:
        n = int(0.1 * len(list(G_in)))  # max 10% of nodes
    #remove nodes and corresponding edges
    for i in range(0, n):
        list_of_nodes = list(G_in)
        #print(int((i / n) * 100), "%") # progress (nx.hits(Graph) takes some time depending on the graph size)
        if (authorithy):
            p = list(nx.hits(G_in)[1].values()) # probabilities proportional to authority of nodes
        else: # hub
            p = list(nx.hits(G_in)[0].values()) # probabilities proportional to hub of nodes
        node_remove = np.random.choice(list_of_nodes, p = p)
        G_in.remove_node(node_remove)
    print("number of nodes after: " + str(len(list(G_in))))
    return G_in

In [52]:
G_random_node_removals_proportional_authority = random_node_removals_proportional_HITS(G.copy(), True, True, 100)

number of nodes before: 2426
number of nodes after: 2326


In [53]:
pr_random_node_removals_proportional_authority = calc_pagerank(G_random_node_removals_proportional_authority)
df_random_node_removals_proportional_authority = create_dataframe(pr_random_node_removals_proportional_authority,
                                                             G_random_node_removals_proportional_authority)
df_random_node_removals_proportional_authority.head()

rank,node,score,in edges,out edges
1,728,0.021528,10,0
2,404,0.017614,7,0
3,899,0.01564,7,0
4,136,0.014771,62,4
5,183,0.013743,79,3


This will be a rather interesting analysis of the PageRank algorithm and its stability. Note that the previous function removes nodes at random, but proportional to algorithms acquired from the HITS algorithm. In particular, the funtion removes nodes at random proportional to their $hub$ or $authorithy$ values. Why is this interesting one may ask. A good hub reprresents a page (node) that points to many other pages, while a good authority represents a page that was linked by many different hubs. So there is definitely a relation between the HITS measures and the PageRank scores. While we could explain this relation in more detail, it is more fun to show this relation with evaluations of the node removals according to the HITS measurs, considering the avaialability of the data.

The node removal of 100 nodes that was just executed was at random, but proportional to the authorithy values of the nodes. This means that nodes with a higher authority value will have a higher chance of being removed. In other words, nodes that were referred to by many other hubs (i.e. incoming edges) will have had a higher chance of being removed. We have seen earlier on that the in general (dependending on the explained pecularities), the average PageRank scores of nodes (when $n$ is large enough) will be higher than the average PageRank score in the original graph, when the nodes are removed at random but proportional to this measure. Let's see of the function does what it should do and whether this is indeed the case.

In [54]:
avg_node_degree_graph_removed_nodes = df_random_node_removals_proportional_authority["in edges"].mean()
round(avg_node_degree_full_graph, 6), round(avg_node_degree_graph_removed_nodes, 6)

(6.855317, 4.765262)

First, note that it seems that that function indeed removed nodes at random but proportional to the authority values since the average in-degree value has dropped significantly after removal of the nodes. Which makes sense since the authority is an indication of how many pages (i.e. nodes in our context) linked to that particular node.

In [55]:
nodes_evolved_graph = set(df_random_node_removals_proportional_authority.node.values)
removed_nodes = pd.DataFrame(list(nodes_original_graph.difference(nodes_evolved_graph)))
removed_nodes = pd.merge(df_origin, removed_nodes, left_on = 'node', right_on = 0)
avg_node_degree_removed_nodes = removed_nodes["in edges"].mean()
avg_node_degree_removed_nodes

44.96

And note once again that the average in degree is much higher than the average in degree of the original graph. The function does what it should do. There is a strong relation with the previous analyis due to the strong relation between the HITS measures and the in/out-degrees of the nodes. We have shown that some statistics lead to the same conclusion. Let's devise an other method to evaluate the stability of the PageRank values after removal of the nodes according to their hub values. Remember, a good hub reprresents a page (node) that points to many other pages.

In [56]:
G_random_node_removals_proportional_hub = random_node_removals_proportional_HITS(G.copy(), False, True, 100)

number of nodes before: 2426
number of nodes after: 2326


In [57]:
pr_random_node_removals_proportional_hub = calc_pagerank(G_random_node_removals_proportional_hub)
df_random_node_removals_proportional_hub = create_dataframe(pr_random_node_removals_proportional_hub,
                                                             G_random_node_removals_proportional_hub)
df_random_node_removals_proportional_hub.head()

rank,node,score,in edges,out edges
1,404,0.042037,10,0
2,195,0.0194,65,1
3,77,0.018145,95,2
4,728,0.015282,10,0
5,36,0.01093,144,5


Like with the removal of nodes proportional to the out-degree (at random), the assumption is that removal of nodes proportional to their hub values, will have no significant effect on the PageRank values of the nodes in the the graph of the nodes that remain in the graph. We are, however, going to show this with a different method.

Let's evaluate the absolute change of the PageRank scores of the nodes that remain in the graph, once nodes are removed at random but proportional to their hub values. The hypothesis is already stated, let's see whether it holds up. We are going to form  a new DataFrame in which we have the new and old PageRank scores (i.e. before and after removal). And then going to add a new column with the %-change of the PageRank scores.

In [58]:
df_random_node_removals_proportional_hub.rename(columns = {'score': 'new_score', 'in edges': 'new_in_edges',
                                                           'out edges': 'new_out_edges'}, inplace=True)

In [59]:
df_random_node_removals_proportional_hub.head()

rank,node,new_score,new_in_edges,new_out_edges
1,404,0.042037,10,0
2,195,0.0194,65,1
3,77,0.018145,95,2
4,728,0.015282,10,0
5,36,0.01093,144,5


In [60]:
df_compare_scores = pd.merge(df_random_node_removals_proportional_hub, df_origin, on = 'node')
df_compare_scores["percentage_change"] = df_compare_scores.apply(lambda row: ((row.new_score - row.score) / row.score) * 100, axis = 1)
df_compare_scores["abs_percentage_change"] = df_compare_scores.apply(lambda row: abs(row.percentage_change), axis = 1)
print("top 10: ", df_compare_scores.head(10).abs_percentage_change.mean(),
      "\ntop 50: ", df_compare_scores.head(50).abs_percentage_change.mean(), 
      "\ntop 100: ", df_compare_scores.head(100).abs_percentage_change.mean(), 
      "\nwhole graph: ", df_compare_scores.abs_percentage_change.mean())

top 10:  3.2060444853041545 
top 50:  7.4826313130799305 
top 100:  7.844863838326026 
whole graph:  8.378814306453327


First note that the average change of the PageRank scores in the graph is not significant (single digit point percentage change), but what is even more interesting and scientifically interesting and explainable, is the fact that the change of the scores is on average less for nodes that had already a relatively high score to begin with. This can be seen in the previous print statement, we see that the average change of the scores (in point percentage) seems to be correlated with the original score of the node. Note that in addition, the number of total pages is reduced by 100 ($N$ in the PageRank) formula, which also has an effect on the new scores.

To give a simplified explanation of the PageRank explanation (left out some assumptions such as no outbound edges = edge to all other nodes (pages) in the graph) let's have a look what is happening more or less under the hood. We have $PRs(n) = \dfrac{1 - d}{N} + d \cdot \sum_{m \, \in \, predecessors(n)} \dfrac{PRs(m)}{\mid\,successors(m)\,\mid}$, <br>
where $N =$ the number of pages (reduced by 100), $d$ is the damping factor (remained constant) and the notation $\mid\,successors(m)\,\mid$ is used to denote the <i>number</i> of successors of node $m$ (i.e. the number of outgoing edges of node $m$). So even if a node is not directly affected by removal of nodes (i.e. it keeps its edges as the nodes that were removed had no direct connection to it), its score will still change due to the fact that $N$ has changed and the fact that the score, or the number of outgoing edges of its predecessors could have changed. These are the things to take into consideration when looking at the changes and the statement that a single percentage change is not large for removal of 100 nodes.

And now back to the interesting correlation. It can be explained quite simply. Let us have a node $x$ with a relatively high PageRank score. If we would remove one its predecessors $y$ with a high number of outgoing edges (as is the case with the random removal of nodes proportional to the value of the hub measure), its effect on the score of node $x$ would not be significant or as high as the removal of a node $z$ with a relatively low amount of outgoing edges compared to $y$ but with the same score, its effect will be less significant as the number of outgoing edges is the denominator in the summation. And the contribtion of this node to the total sum of all the PageRank scores divided by the number of successors, of the predecessors would be small to begin with. This is a possible explanation (there can be other causes) of the correlation that has been shown. So in removal of nodes proportional to the hub values we have seen some interesting evaulations of the stability of the PageRank scores of the nodes in the graph. 

### F. Addition of nodes at random but proportional to the degree of the nodes

Let $n$ represent the number of nodes that should be added to the network. If the $number\_of\_nodes$ parameter is given then $n = number\_of\_nodes$, if this parameter is not specfied by the caller we have $n = \lfloor(0.1 * total\_number\_of\_nodes(G\_in))\rfloor)$

For $n$ iterations do the following:
* Make a new node instance $n$
* with a probability distribution proportional to the in-degree of nodes pick $k$ nodes in the original graph
* copy the incoming/outgoing edges of the $k$ nodes for $n$
* Pick a node $l$ (at random) in the graph in the graph which had a relatively low probability in the second step and repeat step 3 for this node

In [61]:
type(np.random.choice(list(G), size = 5, replace = False)[0])

numpy.str_

In [62]:
# Inspired by the Edge Copying Model (slide 53 of Week6-SNA-Props)
def random_node_additions_proportional_in_degree(G_in, number_given = False, in_degree = True, number_of_nodes = 1, k = 5):
    if (number_given & number_of_nodes < len(list(G_in))): # check if we do not remove too much nodes
        n = number_of_nodes
    else:
        n = int(0.1 * len(list(G_in)))  # max 10% of nodes
    print("number of edges before :"+ str(len(G_in.edges())))
    for i in range(n):
        #k is number of edges to be added, random integer 1 between 5
        k = random.randint(1, k) #select k random vertices
        list_of_nodes = list(G_in)  #create list of nodes
        if (in_degree):
            degrees = dict(G_in.in_degree()).values() # in_degrees of all the nodes
        else: # out-degree
            degrees = dict(G_in.out_degree()).values() # out_degrees of all the nodes
        prob_degree = [float(i)/sum(degrees) for i in degrees] # probabilities proportional to degree
        k_random_selected_nodes = np.random.choice(list_of_nodes, size = k, p = prob_degree, replace = False) # selecte k nodes proportional to chosen measure
        
        new_node = nx.number_of_nodes(G_in) + 1 #add node to graph
        G_in.add_node(str(new_node))
        
        for node in k_random_selected_nodes:
            successors = list(G_in.successors(str(node)))
            for node_to in successors:
                G_in.add_edge(str(new_node), node_to) # add outgoing edges
            predecessors = list(G_in.predecessors(str(node)))
            for node_from in predecessors:
                G_in.add_edge(node_from, str(new_node)) # add incoming edges
        
        # pick one node that has a low probability (relatively low number of incoming edges)
        non_zero_probs = [i for i in prob_degree if i != 0.0]
        highest_chance_nodes = np.random.choice(list_of_nodes, p = prob_degree, 
                                                size = (len(non_zero_probs) - 1), replace = False)
        
        node_to_add = str(random.sample(set(list_of_nodes).difference(set(highest_chance_nodes)), 1)[0]) # low prob node
        successors = list(G_in.successors(node_to_add)) # successors of the node
        predecessors = list(G_in.predecessors(node_to_add)) # predecessors of the node 
        
        succ_current_node = list(G_in.successors(str(new_node))) # find the successors of the new node 
        pred_current_node = list(G_in.predecessors(str(new_node))) # find the predecessors of the new node
                                 
        # remove nodes to which the new node is already connected from the successors/predecessors list
        successors = [n for n in successors if not n in succ_current_node]
        predecessors = [n for n in predecessors if not n in pred_current_node]
                                 
        for node_to in successors:
            G_in.add_edge(str(new_node), node_to) # add outgoing edges
        for node_from in predecessors:
            G_in.add_edge(node_from, str(new_node)) # add incoming edges
            
    print("number of edges after :"+str(len(G_in.edges())))
    return G_in

In [63]:
G_random_node_additions_proportional_in_degree = random_node_additions_proportional_in_degree(G.copy(), True, True, 100, 5)

number of edges before :16631
number of edges after :23790


In [64]:
pr_random_node_additions_proportional_in_degree = calc_pagerank(G_random_node_additions_proportional_in_degree)
df_random_node_additions_proportional_in_degree = create_dataframe(pr_random_node_additions_proportional_in_degree,
                                                             G_random_node_additions_proportional_in_degree)
df_random_node_additions_proportional_in_degree.head(5)

rank,node,score,in edges,out edges
1,404,0.014269,11,0
2,2475,0.014269,11,8
3,2503,0.014269,11,11
4,77,0.013138,146,4
5,195,0.01169,106,3


This is going to be an interesting analysis where a lot of our knowledge about the workings of PageRank is going to be combined with correlations found in the analysis of previous functions. Let us first evaluate what is happening in the function. An $n$ number of nodes are added by first selecting $k$ nodes at random, but proportional to their in degrees and we copy all its edges. And then we do the opposite and select at random one node that has a relatively low probability (acquired from its low relatively in-degree) and copy its node (inspired by the Edge Copying Model, Kleinberg et al.).

We have seen in the analysis of function $D$ that in general, if we select nodes at random, but proportional to their in-degree, the average score and in-degree of this group is relatively higher than the the rest of the network when $n$ is large enough (guarrantees the insigniciance of cases where the score is affected by the number of outgoing links / score of the nodes).

So if we select $k$ nodes at random, but proportional to their in-degree, there is a high chance that they have a relatively high score. The PageRank score of the new node $n$ is calculated using, among other things, the score of nodes of its incoming edges divided by the number of outgoing links of those nodes. So there is a high chance that the $n$ new nodes have also a relatively high score compared to the average in the original graph (before the addition). Note that in this case, $N$ has dropped, which also has an effect on the PageRank scores since the first term becomes smaller. Let's see whether this hypothesis holds up in the experimental evaluation.

In [65]:
nodes_evolved_graph = set(df_random_node_additions_proportional_in_degree.node.values)
added_nodes = pd.DataFrame(list(nodes_evolved_graph.difference(nodes_original_graph)))
added_nodes = pd.merge(df_random_node_additions_proportional_in_degree, added_nodes, left_on = 'node', right_on = 0)
avg_node_score_added_nodes = added_nodes["score"].mean()
round(avg_node_score_added_nodes, 6)

0.001516

In [66]:
print(round(avg_score_full_graph, 6))
print("ratio: ", round(avg_node_score_added_nodes / avg_score_full_graph, 6))

0.000412
ratio:  3.678615


And we can indeed see that the scores of the average score is on average much higher than the average score in the original graph. Which is a correlation which seems to be in line with our hypothesis. Now another assumption is that the average PageRank has dropped, since we have added nodes that got incoming edges from nodes chosen at rando, but proportional to their in-degrees. This means that the nodes that were pointing to randomly selected nodes get more outgoing links, which affects the score of the selected nodes. Let's see if this is indeed the case.

In [67]:
check_new_score_original_nodes = pd.DataFrame(list(nodes_original_graph)) # omit new nodes, as this effects the results
check_new_score_original_nodes = pd.merge(df_random_node_additions_proportional_in_degree, check_new_score_original_nodes,
                                          left_on = 'node', right_on = 0)
avg_score_orginal_nodes = check_new_score_original_nodes.score.mean()
print("new average score original network: ", avg_score_orginal_nodes)
print("ratio drop: ", avg_score_orginal_nodes / avg_score_full_graph)

new average score original network:  0.0003496978870114116
ratio drop:  0.8483670738896557


While there are also othe factors which affect this drop (e.g. larger $N$ and backwards-propogation of changed scores), the PageRank scores of the nodes in the original network indeed seem to have taken a hit (no pun inteded/reference to next section).

### G. Addition of nodes at random but proportional to the hubs/authorithy measures (HITS) of nodes

Let $n$ represent the number of nodes that should be added to the network. If the $number\_of\_nodes$ parameter is given then $n = number\_of\_nodes$, if this parameter is not specfied by the caller we have $n = \lfloor(0.1 * total\_number\_of\_nodes(G\_in))\rfloor)$

For $n$ iterations do the following:
* Make a new node instance $n$
* with a probability distribution proportional to one of the HITS measures (hub- or authority value) of nodes pick $k$ nodes in the original graph
* copy the incoming/outgoing edges of the $k$ nodes for $n$
* Pick a node $l$ (at random) in the graph in the graph which had a relatively low probability (due to a low value) in the second step and repeat step 3 for this node

In [68]:
#Edge Copying Model (slide 53 of Week6-SNA-Props)
def random_node_additions_proportional_HITS(G_in, authority = False, number_given = False, number_of_nodes = 1, k = 5):
    if (number_given & number_of_nodes < len(list(G_in))): # check if we do not remove too much nodes
        n = number_of_nodes
    else:
        n = int(0.1 * len(list(G_in)))  # max 10% of nodes
    print("number of edges before :"+ str(len(G_in.edges())))
    for _ in range(n):
        #k is number of edges to be added, random integer 1 between 5
        k = random.randint(1, k) #select k random vertices
        list_of_nodes = list(G_in)  #create list of nodes
        if (authority):
            p = list(nx.hits(G_in)[0].values())
        else: # hub
            p = list(nx.hits(G_in)[1].values())
        
        k_random_selected_nodes = np.random.choice(list_of_nodes, size = k, p = p, replace = False) # selecte k nodes proportional to chosen measure
        
        new_node = nx.number_of_nodes(G_in) + 1 #add node to graph
        G_in.add_node(str(new_node))
        
        for node in k_random_selected_nodes:
            successors = list(G_in.successors(str(node)))
            for node_to in successors:
                G_in.add_edge(str(new_node), node_to) # add outgoing edges
            predecessors = list(G_in.predecessors(str(node)))
            for node_from in predecessors:
                G_in.add_edge(node_from, str(new_node)) # add incoming edges
        
        # pick one node that has a low probability (relatively low number of incoming edges)
        non_zero_probs = [i for i in p if i != 0.0]
        highest_chance_nodes = np.random.choice(list_of_nodes, p = p, 
                                                size = (len(non_zero_probs) - 1), replace = False)
        
        node_to_add = random.sample(set(list_of_nodes).difference(set(highest_chance_nodes)), 1)[0] # low prob node
        successors = list(G_in.successors(node_to_add)) # successors of the node
        predecessors = list(G_in.predecessors(node_to_add)) # predecessors of the node 
        
        succ_current_node = list(G_in.successors(str(new_node))) # find the successors of the new node 
        pred_current_node = list(G_in.predecessors(str(new_node))) # find the predecessors of the new node
                                 
        # remove nodes to which the new node is already connected from the successors/predecessors list
        successors = [n for n in successors if not n in succ_current_node]
        predecessors = [n for n in predecessors if not n in pred_current_node]
                                 
        for node_to in successors:
            G_in.add_edge(str(new_node), node_to) # add outgoing edges
        for node_from in predecessors:
            G_in.add_edge(node_from, str(new_node)) # add incoming edges
            
    print("number of edges after :"+str(len(G_in.edges())))
    return G_in

In [69]:
G_random_node_additions_proportional_authority = random_node_additions_proportional_HITS(G.copy(), True, True, 100, 5)

number of edges before :16631
number of edges after :24094


In [70]:
pr_random_node_additions_proportional_authority = calc_pagerank(G_random_node_additions_proportional_authority)
df_random_node_additions_proportional_authority = create_dataframe(pr_random_node_additions_proportional_authority, G_random_node_additions_proportional_authority)
df_random_node_additions_proportional_authority.head()

rank,node,score,in edges,out edges
1,404,0.042265,10,0
2,195,0.019706,98,1
3,77,0.018803,146,2
4,728,0.014965,10,0
5,36,0.011123,191,5


We follow the same scheme as in the previous graph changin method.  An $n$ number of nodes are added by first selecting $k$ nodes at random, but proportional to their authority values (could also pick hub) and we copy all its edges. And then we do the opposite and select at random one node that has a relatively low probability (acquired from its low authority value) and copy its node (inspired by the Edge Copying Model, Kleinberg et al.).

Now if we copy the incoming edges of $n$ at random chosen nodes that were chosen proportionally to their authority values, we know that if we pick $n$ large enough, the average number of incoming edges of these $n$ nodes will be relatively high compared to the rest of the network, since a good authority (i.e. a relatively high authority value) represents a node that was linked by many different hubs. We, however, have to see whether the quality/scores of the nodes linking to the $m$ randomly chosen nodes are relatively high. If this is indeed the case we'll see a relatively high score of the added nodes.

In [71]:
nodes_evolved_graph = set(df_random_node_additions_proportional_authority.node.values)
added_nodes = pd.DataFrame(list(nodes_evolved_graph.difference(nodes_original_graph)))
added_nodes = pd.merge(df_random_node_additions_proportional_in_degree, added_nodes, left_on = 'node', right_on = 0)
avg_node_score_added_nodes = added_nodes["score"].mean()
round(avg_node_score_added_nodes, 6)

0.001516

In [72]:
print(round(avg_score_full_graph, 6))
print("ratio: ", round(avg_node_score_added_nodes / avg_score_full_graph, 6))

0.000412
ratio:  3.678615


And like with the nodes that were added that copied edges from nodes that were chosen at random but proportional to their in-degrees, we see that the usage of authority values lead to the same result.

We have seen various functions that changed the original graph $G$ by the addition and/or removal of nodes or edges. In some of these functions the eelction of the nodes or edges was done uniformly at random, while in other function it was done proportional to the node degree and other statistics. We have stated hypotheses about the nodes that were added/removed and have evaluated whether this actually happened in the experiments done with the functions using $G$. In addition, hypothesis were stated about the stability of the PageRank scores of the original nodes in $G$ when $G$ was changed according to stastical measures. We know have a rough idea of when PageRank scores will change siginificantly, and when not. This knowledge can be tested and exploited in the analysis of data of actual webpages on the Internet.

## More advanced evaluation methods of PageRank stability

We have already seen several evaluation and assesments of the stability of the PageRank scores of the (orginal) graph. This next section is going to cover yet another evaluation method of the PageRank stability

#### rank-based error
In the analyses above the pagerank values of the different graph evolvement are compared, where only the absolute percentual change are chosen as comparison measure. For this paragraph, a more advanced method is implemented to evaluate the different evolvement on the original graph. The first measure to compare PageRank is rank-based error. For rank-based error, the error can be defined as: $$Error_{rank} = \sum_{i=1}^{n} \frac{|rank - rank_{baseline}|}{rank_{baseline}} $$

where $rank$ is generated by the used method. Before this measure can be applied, for each function we create a new dataframe where we merge the rank score calculated on the original graph $G$ and the rank score of respectively function $A,B,C,D,E,F,G$.

Before the results are analyzed, we state some hypothesis about the possible outcomes for different functions. When the difference between the new calculated rank and the original rank is high, the numerator $|rank - rank_{baseline}|$ will be high, and thus also the result of the fraction will be much higher. Therefore, the higher the $Error_{rank}$ is, the more the two graphs differ from eachother. When the highly connected nodes in the graph are have a large rank difference compared to the original graph $G$, this has a stronger effect on the outcome compared to nodes who already have a low score. 

In [73]:
df_random_edges_uniform_random.rename(columns={'score': 'score_random_edges_uniform_random'}, inplace=True)
df_random_edges_uniform_random.insert(1, 'rank_random_edges_uniform_random', range(1, 1+len(df_random_edges_uniform_random)))
df_random_edges_uniform_random.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [74]:
df_origin.rename(columns={'score': 'score_original'}, inplace=True)
df_origin.insert(1, 'rank_original', range(1, 1+len(df_origin))) #add rank score to dataframe, because first column 'rank'can't be accessed (actually is row number)
df_origin.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [75]:
df_comparison_1 = pd.merge(df_random_edges_uniform_random, df_origin, on='node')

In [76]:
def compute_error_based_1(row):
    return abs(row['rank_random_edges_uniform_random'] - row['rank_original']) / row['rank_original'] 

df_comparison_1.apply(compute_error_based_1, axis = 1).sum()

90.9046458019944

$Error_{edges-uniform-random} = 91$

Function $A$ adds or removes edges from the graph uniformly at random. The function selects $k$ nodes uniformly at random, so each node in the graph has an equal probability to be chosen. For these $k$ nodes, edges are added or removed. Because it is most likely that nodes with a very low connectivity are chosen, these ranks will not much differ from the ranks in the original graph. The ranks of these nodes will not change much, because they already have a very low connectivity and adding new or removing edges from that node will not make a very large difference. Therefore the sum of all the errors is relatively low. 

In [77]:
df_random_add_edges_uniform_random.rename(columns={'score': 'score_random_add_edges_uniform_random'}, inplace=True)
df_random_add_edges_uniform_random.insert(1, 'rank_random_add_edges_uniform_random', range(1, 1+len(df_random_add_edges_uniform_random)))
df_random_add_edges_uniform_random.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [78]:
df_comparison_2 = pd.merge(df_origin, df_random_add_edges_uniform_random, on='node')
def compute_error_based_2(row):
    return abs(row['rank_random_add_edges_uniform_random'] - row['rank_original']) / row['rank_original'] 

df_comparison_2.apply(compute_error_based_2, axis = 1).sum()

126.56903995403896

$Error_{edges-add-uniform-random} = 126$

This error represents the error of a call to function $A$, where only edges are added. This score is higher than the score where edges are added and removed. This makes completely sense, because when edges are added (even when they are added uniformly at random), this will for most nodes increase the pagerank score. Therefore, for nodes that where previously had a very low connectivity, the pagerank increases because edges are edded to this node. 

In [79]:
df_random_remove_edges_uniform_random.rename(columns={'score': 'score_random_remove_edges_uniform_random'}, inplace=True)
df_random_remove_edges_uniform_random.insert(1, 'rank_random_remove_edges_uniform_random', range(1, 1+len(df_random_remove_edges_uniform_random)))
df_random_remove_edges_uniform_random.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [80]:
df_comparison_3 = pd.merge(df_origin, df_random_remove_edges_uniform_random, on='node')
def compute_error_based_3(row):
    return abs(row['rank_random_remove_edges_uniform_random'] - row['rank_original']) / row['rank_original'] 

df_comparison_3.apply(compute_error_based_3, axis = 1).sum()

62.02870932081097

$Error_{edges-remove-uniform-random} = 62$

The error of the graph where only edges are removed, uniformly at random, the rank of the nodes will most likely decrease. The result value of 62 completely makes sense, because it is lower than the error value when edges are added or removed and much lower than the error when only edges are added. The $k$ random edges that are selected uniformly at random, are most likely the nodes with a low connectivity. Therefore, the chance that the probability of these nodes will most likely decrease. The error value confirms this. 

In [81]:
df_random_add_nodes_uniform.rename(columns={'score': 'score_random_add_nodes_uniform'}, inplace=True)
df_random_add_nodes_uniform.insert(1, 'rank_random_add_nodes_uniform', range(1, 1+len(df_random_add_nodes_uniform)))
df_random_add_nodes_uniform.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [82]:
df_comparison_4 = pd.merge(df_origin, df_random_add_nodes_uniform, on='node')
def compute_error_based_4(row):
    return abs(row['rank_random_add_nodes_uniform'] - row['rank_original']) / row['rank_original'] 

df_comparison_4.apply(compute_error_based_4, axis = 1).sum()

161.7265425973618

$Error_{nodes-add-uniform-random} = 162$

This function only adds nodes, uniformly at random. This value is somewhat higher than the previous error values. The function copies the incoming/outgoing edges of the  $k$  nodes for each $n$ node that we want to add. The error value is higher compared to the error values above, most likely because we add only nodes.

In [83]:
df_random_removal_nodes_uniform.rename(columns={'score': 'score_random_removal_nodes_uniform'}, inplace=True)
df_random_removal_nodes_uniform.insert(1, 'rank_random_removal_nodes_uniform', range(1, 1+len(df_random_removal_nodes_uniform)))
df_random_removal_nodes_uniform.drop(['in edges', 'out edges'], axis=1, inplace = True)

In [84]:
df_comparison_5 = pd.merge(df_origin, df_random_removal_nodes_uniform, on='node')
def compute_error_based_5(row):
    return abs(row['rank_random_removal_nodes_uniform'] - row['rank_original']) / row['rank_original'] 

df_comparison_5.apply(compute_error_based_5, axis = 1).sum()

162.86777979600367

$Error_{nodes-removal-uniform-random} = 163$

This function removes $k$ selected nodes uniformly at random. Again, each node in the graph has an equal probability to be chosen, so it's very likely that nodes are removed which have a very low connectivity and thus ranking. This value equals the error value when only nodes are added. We expect that this error value is lower than the error values that we will see for the next functions ($D,E,F,G$). 

for function D-E-F-G:

In [85]:
df_random_node_removals_proportional_indegree.rename(columns={'score': 'score_random_node_removals_proportional_indegree'}, inplace=True)
df_random_node_removals_proportional_indegree.insert(1, 'rank_random_node_removals_proportional_indegree', range(1, 1+len(df_random_node_removals_proportional_indegree)))
df_random_node_removals_proportional_indegree.drop(['in edges', 'out edges'], axis=1, inplace = True)
df_comparison_6 = pd.merge(df_origin, df_random_node_removals_proportional_indegree, on='node')

In [86]:
def compute_error_based_6(row):
    return abs(row['rank_random_node_removals_proportional_indegree'] - row['rank_original']) / row['rank_original'] 
df_comparison_6.apply(compute_error_based_6, axis = 1).sum()

271.4626450197574

$Error_{nodes-removal-proportional-indegree} = 271$

Function $D$ removes nodes randomly, but proportional to the indegree. As the analysis at function $D$ already explained, when we remove nodes proportional to the indegree, the pagerank scores of the highly connected nodes are affected. This is because the algorithm will most likely select the nodes with a very high connectivity, because these nodes have a larger probability to be chosen. The $Error_{nodes-removal-proportional-indegree}$ indeed confirms this. Because lots of nodes are removed with a very high indegree, the pagerank of these nodes are also negatively affected. Therefore this error value is much higher than the ones we saw in previous cases. 

In [87]:
df_random_node_removals_proportional_authority.rename(columns={'score': 'score_random_node_removals_proportional_authority'}, inplace=True)
df_random_node_removals_proportional_authority.insert(1, 'rank_random_node_removals_proportional_authority', range(1, 1+len(df_random_node_removals_proportional_authority)))
df_random_node_removals_proportional_authority.drop(['in edges', 'out edges'], axis=1, inplace = True)
df_comparison_7 = pd.merge(df_origin, df_random_node_removals_proportional_authority, on='node')

In [88]:
def compute_error_based_7(row):
    return abs(row['rank_random_node_removals_proportional_authority'] - row['rank_original']) / row['rank_original'] 
df_comparison_7.apply(compute_error_based_7, axis = 1).sum()

272.91082920081425

$Error_{nodes-removal-proportional-authority} = 273$

This function removes nodes at random, but proportional to the authorithy values of the nodes. This means that nodes with a higher authority value will have a higher chance of being removed. In other words, nodes that were referred to by many other hubs (i.e. incoming edges) will have had a higher chance of being removed. We have seen earlier on that the in general (dependending on the explained pecularities), the average PageRank scores of nodes (when  nn  is large enough) will be higher than the average PageRank score in the original graph, when the nodes are removed at random but proportional to this measure. This error value confirms this. The error based value is very high, so this means that a lot of high connected "important" nodes are removed from the original graph.  

In [89]:
df_random_node_additions_proportional_in_degree.rename(columns={'score': 'score_random_node_additions_proportional_in_degree'}, inplace=True)
df_random_node_additions_proportional_in_degree.insert(1, 'rank_random_node_additions_proportional_in_degree', range(1, 1+len(df_random_node_additions_proportional_in_degree)))
df_random_node_additions_proportional_in_degree.drop(['in edges', 'out edges'], axis=1, inplace = True)
df_comparison_8 = pd.merge(df_origin, df_random_node_additions_proportional_in_degree, on='node')

In [90]:
def compute_error_based_8(row):
    return abs(row['rank_random_node_additions_proportional_in_degree'] - row['rank_original']) / row['rank_original'] 
df_comparison_8.apply(compute_error_based_8, axis = 1).sum()

463.95732855882574

$Error_{nodes-addition-proportional-indegree} = 464$

Function $F$ selects $k$ nodes at random, but proportional to their in-degree. Therefore, there is a high chance that they have a relatively high score. The PageRank score of the new node  $n$  is calculated using, the score of nodes of its incoming edges divided by the number of outgoing links of those nodes. Therefore it is very likely that  new nodes have also a relatively high score compared to the average in the original graph (before the addition). We see that this error is very large, compared to the ones that we saw earlier. This is because we add nodes, that were not in the original graph. Also these nodes have a very large value, and therefore sum of the error values will increase enormously.

In [91]:
df_random_node_additions_proportional_authority.rename(columns={'score': 'score_random_node_additions_proportional_authority'}, inplace=True)
df_random_node_additions_proportional_authority.insert(1, 'rank_random_node_additions_proportional_authority', range(1, 1+len(df_random_node_additions_proportional_authority)))
df_random_node_additions_proportional_authority.drop(['in edges', 'out edges'], axis=1, inplace = True)
df_comparison_9 = pd.merge(df_origin, df_random_node_additions_proportional_authority, on='node')

In [92]:
def compute_error_based_9(row):
    return abs(row['rank_random_node_additions_proportional_authority'] - row['rank_original']) / row['rank_original'] 
df_comparison_9.apply(compute_error_based_9, axis = 1).sum()

299.33725054497114

$Error_{nodes-addition-proportional-authority} = 299$

Function $G$ adds nodes at random, but proportional to their authority value. As concluded in the analysis of the results of function $G$, the nodes we add have a relatively high score. This will increase the value based error, because also the score of some other nodes increase. The value of $Error_{nodes-addition-proportional-authority}$ indeed confirms this. 