# PageRank

In [112]:
import networkx as nx
import pandas as pd

Load edge list and create a graph

In [113]:
# Read the edge list as a directed graph; the context manager closes the file for us
with open("canvas/hamster.edgelist", 'rb') as fh:
    G = nx.read_edgelist(fh, create_using=nx.DiGraph())

Next we run the PageRank algorithm with a damping parameter of 0.85. The damping parameter represents the likelihood that a visitor follows a link on the current page. With a damping parameter of 0.85 there is an 85% chance of following a link on the page and a 15% chance of jumping to a random other node in the graph. We calculate the PageRank using the power iteration method.

In [114]:
pr = nx.pagerank(G, alpha=0.85)
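
To make the power iteration method concrete, here is a minimal pure-Python sketch of the idea. It is not the NetworkX implementation: the function name `pagerank_power_iteration`, the toy edge list, and the choice to spread the score of pages without outgoing links uniformly over all nodes are assumptions made for this illustration.

```python
def pagerank_power_iteration(edges, alpha=0.85, tol=1e-10, max_iter=1000):
    """Illustrative PageRank via power iteration on a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)

    # Outgoing links per node; a set drops duplicate edges, as a DiGraph would.
    out_links = {node: set() for node in nodes}
    for src, dst in edges:
        out_links[src].add(dst)

    # Start from the uniform distribution.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Score held by pages without outgoing links (assumed to be
        # redistributed uniformly over all nodes).
        dangling = sum(rank[node] for node in nodes if not out_links[node])

        # Every node receives the random-jump share plus its part of the dangling score.
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = {node: base for node in nodes}

        # Each page splits alpha times its score evenly over its outgoing links.
        for src, targets in out_links.items():
            if targets:
                share = alpha * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        # Stop once the scores change by less than the tolerance (L1 norm).
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break

    return rank


# Hypothetical toy graph: a and b link to c, c links to d, d has no outgoing links.
toy_edges = [("a", "c"), ("b", "c"), ("c", "d")]
toy_rank = pagerank_power_iteration(toy_edges, alpha=0.85)
for node, score in sorted(toy_rank.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

In the toy graph, `d` ends up ahead of `c` although it has only one incoming link, because `c` passes its whole score to a single target. The same effect shows up in the hamster data below.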

In [115]:
df_edge_in = pd.DataFrame(list(G.in_degree), columns=['node', 'in edges'])
df_edge_out = pd.DataFrame(list(G.out_degree), columns=['node', 'out edges'])
df_rank = pd.DataFrame(list(pr.items()), columns=['node', 'score']).sort_values(by=['score'], ascending=False)
df_temp = pd.merge(df_rank, df_edge_in, on='node')
df_total = pd.merge(df_temp, df_edge_out, on='node')
df_total.index = df_total.index + 1
df_total.columns.name = 'rank'
df_total.head(10)

rank,node,score,in edges,out edges
1,404,0.042793,10,0
2,195,0.019961,80,1
3,77,0.018628,121,2
4,728,0.01553,10,0
5,36,0.011117,168,5
6,135,0.009544,49,8
7,192,0.009365,57,3
8,281,0.009304,32,0
9,136,0.008853,85,6
10,184,0.008296,80,3


In [116]:
df_total.iloc[500 : 505]

rank,node,score,in edges,out edges
501,469,0.000328,26,48
502,1406,0.000328,13,26
503,610,0.000326,18,16
504,972,0.000326,7,4
505,129,0.000324,5,6


In [117]:
df_total.iloc[1000 : 1005]

rank,node,score,in edges,out edges
1001,708,0.000203,4,5
1002,896,0.000203,8,8
1003,270,0.000203,14,20
1004,799,0.000203,9,7
1005,811,0.000202,3,7


In [118]:
df_total.iloc[1500 : 1505]

rank,node,score,in edges,out edges
1501,1597,0.000137,1,3
1502,1908,0.000137,1,3
1503,1442,0.000137,1,3
1504,1702,0.000137,1,3
1505,1048,0.000137,1,3


In [119]:
df_total.iloc[2000 : 2005]

rank,node,score,in edges,out edges
2001,2327,0.000113,0,4
2002,2091,0.000113,0,5
2003,2329,0.000113,0,3
2004,2330,0.000113,0,4
2005,2335,0.000113,0,4


In [120]:
df_total.tail()

rank,node,score,in edges,out edges
2422,1746,0.000113,0,3
2423,1748,0.000113,0,4
2424,1749,0.000113,0,3
2425,1751,0.000113,0,2
2426,2426,0.000113,0,8


As expected, higher-ranked pages have on average more incoming edges than lower-ranked pages. It is important to note that being linked from many other pages does not by itself mean a page will rank high. The rank of a page is mainly influenced by the quality of the pages that link to it. A page that is linked from many pages is nevertheless far more likely to end up high in the ranking than a page that is linked less frequently. The data from the PageRank calculation above shows this: the lower the rank, the fewer incoming edges the pages have. There are some exceptions, however. One of them is the top-ranked page, node 404. Its score of 0.042793 (4.3%) far exceeds that of every other page; the second-ranked page, node 195, has a score of only 0.019961 (2.0%), even though it has 80 incoming edges against 10 for node 404. We will analyze this page by looking at the quality of the pages that link to it.
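
As a rough check of the trend described above, the following sketch averages the `in edges` column over the slices printed earlier. The numbers are copied by hand from those outputs, so this covers only the displayed rows, not the full graph; the variable names are chosen for this illustration.

```python
# In-edge counts copied from the DataFrame slices shown above (displayed rows only).
in_edges_by_slice = {
    "1-10": [10, 80, 121, 10, 168, 49, 57, 32, 85, 80],
    "501-505": [26, 13, 18, 7, 5],
    "1001-1005": [4, 8, 14, 9, 3],
    "1501-1505": [1, 1, 1, 1, 1],
    "2001-2005": [0, 0, 0, 0, 0],
}

# Mean number of incoming edges per rank slice.
mean_in_edges = {
    label: sum(values) / len(values)
    for label, values in in_edges_by_slice.items()
}

for label, mean in mean_in_edges.items():
    print(f"ranks {label}: mean in edges = {mean:.1f}")
```

The means fall from 69.2 for the top ten to 13.8, 7.6, 1.0 and 0.0 for the later slices, while the individual rows in the top ten (10 incoming edges for nodes 404 and 728) show that the relation does not hold page by page.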

In [121]:
def gen_df(node):
    """Return the pages linking to `node` with their score and out-degree, best first."""
    rows = [(n, pr[n], G.out_degree(n)) for n in G.predecessors(node)]
    df = pd.DataFrame(rows, columns=['node', 'score', 'out edges'])
    df = df.sort_values(by='score', ascending=False).reset_index(drop=True)
    df.index = df.index + 1  # 1-based position in the sorted list
    return df

def gen_sum_inc(node):
    """Sum the PageRank scores of all pages that link to `node`."""
    return sum(pr[n] for n in G.predecessors(node))

In [122]:
display(gen_df('404'))

,node,score,out edges
1,195,0.019961,1
2,77,0.018628,2
3,192,0.009365,3
4,126,0.008144,1
5,346,0.005487,3
6,403,0.004875,2
7,24,0.003894,3
8,246,0.002964,2
9,882,0.002327,1
10,775,0.000385,1


In [123]:
display(gen_df('195'))

,node,score,out edges
1,77,0.018628,2
2,36,0.011117,5
3,192,0.009365,3
4,181,0.005597,12
5,346,0.005487,3
6,182,0.004905,13
7,116,0.004715,54
8,125,0.003839,4
9,115,0.003329,45
10,101,0.003121,5


In [124]:
display(gen_sum_inc('404'))

0.07603158841472238

In [125]:
display(gen_sum_inc('195'))

0.1244965540211601

The data clearly shows that far more pages link to 195 than to 404 (80 incoming edges versus 10). It also shows that the sum of the scores of all pages that link to 195 is about 1.6 times as high as the sum for 404 (0.1245 versus 0.0760). Nevertheless, the score of 404 is more than twice the score of 195. The reason is that a page divides its score over all of its outgoing links: the pages that link to 195 also link to many other pages, while the pages that link to 404 have only one to three outgoing edges each. In addition, links from low-scoring pages contribute little to the score of a page. Most of the score of both 404 and 195 comes from a few pages with high scores and few outgoing links.
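
The argument can be checked with a little arithmetic. In the PageRank update, each linking page contributes its score divided by its number of outgoing edges, and the total is scaled by the damping parameter. The sketch below applies this to the rows printed by `gen_df` above. The values are copied by hand from those tables, the helper names are made up for this illustration, and for 195 only the ten listed predecessors are included, so its result is a partial sum.

```python
ALPHA = 0.85  # damping parameter used in the notebook

# (score, out edges) of the pages linking to 404, copied from the table above.
preds_404 = [
    (0.019961, 1), (0.018628, 2), (0.009365, 3), (0.008144, 1), (0.005487, 3),
    (0.004875, 2), (0.003894, 3), (0.002964, 2), (0.002327, 1), (0.000385, 1),
]

# Only the ten listed predecessors of 195 (partial data).
preds_195_listed = [
    (0.018628, 2), (0.011117, 5), (0.009365, 3), (0.005597, 12), (0.005487, 3),
    (0.004905, 13), (0.004715, 54), (0.003839, 4), (0.003329, 45), (0.003121, 5),
]

def raw_sum(preds):
    """Plain sum of the linking pages' scores, as gen_sum_inc computes it."""
    return sum(score for score, _ in preds)

def link_term(preds, alpha=ALPHA):
    """Score received through links: alpha * sum(score / out edges)."""
    return alpha * sum(score / out_deg for score, out_deg in preds)

raw_404, link_404 = raw_sum(preds_404), link_term(preds_404)
raw_195, link_195 = raw_sum(preds_195_listed), link_term(preds_195_listed)

print(f"404: raw sum = {raw_404:.6f}, link term = {link_404:.6f}")
print(f"195 (listed rows only): raw sum = {raw_195:.6f}, link term = {link_195:.6f}")
```

For 404 the link term comes out at roughly 0.0428, close to its reported score of 0.042793. The ten listed predecessors of 195 yield only about 0.0162, because their scores are split over many more outgoing edges. The remaining differences from the reported scores come from the random-jump term, rounding in the printed tables and, for 195, the predecessors not listed here.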