# Homework 3, Basic: Part 3, PageRank

### Part 3: PageRank (Worth 50 points)

Recall that PageRank can be modeled using matrix operations as follows.  Let $M$ be a _weight transfer matrix_ in which:

$M[i,j] = \frac{1}{n_j}$ if page $j$ links to page $i$, and

$M[i,j] = 0$ otherwise,

where $n_j$ is the number of outgoing links on page $j$. Define a _damping factor_ $\alpha = 0.85$ and a corresponding $\beta = 1 - \alpha$.  Initialize the PageRank vector

$PR^{(0)}=[1,1,1,\ldots]^T$

(i.e., an $m \times 1$ column vector of ones, where $m$ is the number of pages).  Then the PageRank vector at iteration $k$ is:

$PR^{(k)}= \alpha \cdot M \cdot PR^{(k-1)} + \beta \cdot [1,1,1,\ldots]^T$
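As a small illustration of the update rule above, the sketch below applies a single iteration to a hypothetical three-page graph. The graph, the name `M_toy`, and the other variable names are assumptions made for this example; they are not part of the assignment data.

```python
import numpy as np

# Hypothetical 3-page graph (illustration only):
#   page 0 links to pages 1 and 2   (n_0 = 2)
#   page 1 links to page 2          (n_1 = 1)
#   page 2 links to page 0          (n_2 = 1)
# M_toy[i, j] = 1 / n_j when page j links to page i, and 0 otherwise.
M_toy = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

alpha = 0.85
beta = 1 - alpha

pr0 = np.ones(3)                    # initial PageRank vector of ones
pr1 = alpha * M_toy @ pr0 + beta    # one update step; the scalar beta broadcasts
print(pr1)
```

Working this by hand, $M \cdot PR^{(0)} = [1, 0.5, 1.5]^T$, so the first update is approximately $[1.0, 0.575, 1.425]^T$. Every column of this toy matrix sums to 1, so the total PageRank stays at 3.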

### Step 3.1 Download a Web Graph
The following code retrieves a web graph from https://snap.stanford.edu/data/web-NotreDame.txt.gz, a reasonably sized Web crawl of the University of Notre Dame, and extracts it into `web-NotreDame.txt`.  Run the cell to acquire your Web graph.


In [1]:
# Download and decompress data into your Jupyter environment

import urllib.request
import io
import gzip


for file in ['web-NotreDame.txt']:
    print('Downloading compressed image of', file)
    # Fetch the gzip archive into memory
    with urllib.request.urlopen("https://snap.stanford.edu/data/" + file + ".gz") as source:
        compressedFile = io.BytesIO(source.read())
    decompressedFile = gzip.GzipFile(fileobj=compressedFile)

    # Write the decompressed bytes to disk; the with block closes the file
    with open(file, 'wb') as outfile:
        outfile.write(decompressedFile.read())
    print('Saved', file)


Downloading compressed image of web-NotreDame.txt
Saved web-NotreDame.txt


### Step 3.2 Load the Notre Dame Web Graph into a Matrix

Next, write Python code that reads and parses the rows of `web-NotreDame.txt` into a Pandas DataFrame (not a Spark DataFrame!).  Keep only the edges whose source and destination node IDs are both less than 10,000.

_Hints: If you use_ `read_csv`_, you may need to look at the_ `sep` _and_ `skiprows` _options.  Also take a look at the raw data and make sure you know how many rows don't contain data, and how the items are separated._
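One way to follow the hint is to look at the first few lines before choosing `sep` and `skiprows`. The helpers below are a minimal sketch: `peek` and `count_comment_lines` are hypothetical names, the assumption that non-data rows start with `#` should be checked against the real file, and `sample` is made-up data rather than the file's actual contents.

```python
from itertools import islice

def peek(lines, n=6):
    """Return the first n lines (newline stripped) from any iterable of strings."""
    return [line.rstrip('\n') for line in islice(lines, n)]

def count_comment_lines(lines, prefix='#'):
    """Count the leading lines that start with `prefix` (assumed non-data rows)."""
    count = 0
    for line in lines:
        if not line.startswith(prefix):
            break
        count += 1
    return count

# Made-up sample shaped like a tab-separated edge list with a commented header
sample = ["# some description\n", "# FromNodeId\tToNodeId\n", "0\t1\n", "0\t2\n"]
print(peek(sample, 3))
print(count_comment_lines(sample))
```

An open file object is also an iterable of lines, so `with open('web-NotreDame.txt') as f: print(peek(f))` would show the start of the real file. The tab characters visible in the output then suggest what to pass as `sep`.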

In [2]:
# TODO: In this cell, store the data from web-NotreDame.txt in graph_df
# Worth 10 points

# YOUR CODE HERE
import pandas as pd

graph_df = pd.read_csv('web-NotreDame.txt', sep='\t', skiprows=4, names=('FromNodeId','ToNodeId'))
graph_df = graph_df[graph_df.FromNodeId < 10000]
graph_df = graph_df[graph_df.ToNodeId < 10000]

In [3]:
if graph_df.shape[1] != 2:
    raise ValueError('Incorrect number of columns')

In [4]:
graph_df.shape

(37841, 2)

Create a weight transfer matrix $M$ corresponding to the Web graph, with edge weights scaled as per the PageRank definition of a weight transfer matrix.  This will be an input to your PageRank algorithm.  Note that the node IDs in the dataset are already integers in the range $0,\ldots,m-1$, so you can use them directly as indices into your matrix.  Do not use for loops; instead, use the DataFrame and array functions that Pandas and NumPy provide, as they are much more efficient.

When building $M$, you may need to build some "auxiliary" data structures to speed up performance, e.g., to quickly look up weights associated with node edges.  Note that lookup in an array is typically faster than lookup in a DataFrame.  _Finally, you might want to use the_ `apply` _function for Pandas DataFrames or NumPy arrays, as it is typically faster than iterating through every row yourself; fully vectorized operations are faster still.  However, this is not a requirement; just make sure you aren't using for loops!_
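The loop-free idea described above can be sketched on a hypothetical toy edge list. This is one possible approach under stated assumptions, not the required solution: the `edges` frame, `m`, and `M_toy` are invented for illustration, and `np.bincount` counts every row, so the sketch assumes the edge list contains no duplicate edges.

```python
import numpy as np
import pandas as pd

# Hypothetical toy edge list (illustration only), node IDs 0..3
edges = pd.DataFrame({
    'FromNodeId': [0, 0, 1, 2],
    'ToNodeId':   [1, 2, 2, 0],
})
m = 4  # number of nodes; node 3 has no links at all

src = edges['FromNodeId'].to_numpy()
dst = edges['ToNodeId'].to_numpy()

# Auxiliary array: out_degree[j] = n_j, indexed directly by node ID
out_degree = np.bincount(src, minlength=m)

# One fancy-indexing assignment fills every edge weight at once:
# M_toy[i, j] = 1 / n_j for each edge j -> i
M_toy = np.zeros((m, m))
M_toy[dst, src] = 1.0 / out_degree[src]

print(M_toy.sum(axis=0))
```

A quick sanity check is that each column sums to 1 for a page with outgoing links and to 0 for a page with none, which is what the printed column sums show for this toy graph.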

In [5]:
# TODO: Create the M matrix in this cell
# Worth 20 points

# YOUR CODE HERE
import numpy as np

# weight of each outgoing edge: 1 / (number of distinct pages the source links to)
weights = 1 / graph_df.groupby('FromNodeId')['ToNodeId'].nunique()
graph_df_new = graph_df.join(weights.rename('weight'), on='FromNodeId', how='left')

# build weight transfer matrix: M[to, from] = weight of the edge from -> to
M = np.zeros((10000,10000))
rows = graph_df_new['ToNodeId'].to_numpy(dtype=np.intp)    # row indices (destination pages)
cols = graph_df_new['FromNodeId'].to_numpy(dtype=np.intp)  # column indices (source pages)
M[rows,cols] = graph_df_new['weight'].to_numpy()           # weight value for each edge

In [6]:
M[10:30,10:30]

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  

In [7]:
if M.shape != (10000,10000):
    raise ValueError("Incorrect Matrix dimensions")

### Step 3.3 Compute Matrix-Based PageRank
Implement a function `pagerank(M, alpha, num_iter)` that, when given a square $m \times m$ transition matrix $M$ from Step 3.2, initializes the PageRank vector to $m$ 1’s, sets $\alpha$ = `alpha`, sets $\beta$ appropriately given $\alpha$, and iterates `num_iter` times.  Return an $m$-element vector that consists of the final PageRank scores.

In [8]:
# TODO:
# Write your pagerank function into this cell
# Worth 15 points

# YOUR CODE HERE
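def pagerank(M, alpha, num_iter):
    beta = 1 - alpha
    pr = np.ones(M.shape[0])  # initial PageRank vector: one entry per page, all ones

    for _ in range(num_iter):
        # one update step: alpha * M * pr + beta * ones (the scalar beta broadcasts)
        pr = alpha * np.dot(M, pr) + beta

    return pr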

In [9]:
pr = pagerank(M, 0.85, 15)

pr

array([  2.24702638e+02,   2.79759111e+01,   1.14034108e+01, ...,
         1.57231541e-01,   1.57231541e-01,   1.57231541e-01])
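One way to judge whether a given number of iterations is enough is to track how much the vector changes from one iteration to the next. The sketch below is an illustrative variant, not part of the assignment: the name `pagerank_with_deltas` and the toy matrix `M_toy` are assumptions made for this example.

```python
import numpy as np

def pagerank_with_deltas(M, alpha, num_iter):
    """Run the same update rule, also recording the L1 change at each iteration."""
    beta = 1 - alpha
    pr = np.ones(M.shape[0])
    deltas = []
    for _ in range(num_iter):
        new_pr = alpha * M.dot(pr) + beta
        deltas.append(np.abs(new_pr - pr).sum())  # total absolute change this step
        pr = new_pr
    return pr, deltas

# Hypothetical 3-page graph: 0 -> {1, 2}, 1 -> 2, 2 -> 0
M_toy = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

pr_toy, deltas = pagerank_with_deltas(M_toy, 0.85, 15)
print(deltas)
```

Because every column of a weight transfer matrix sums to at most 1, the L1 change can shrink by no less than a factor of $\alpha$ per iteration, so the recorded values should never increase. If the last few values are still large relative to the scores themselves, more iterations may be worthwhile.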

Output a DataFrame called `best_pages_df` with the schema `(id, pagerank)` containing the original IDs and PageRanks of the 10 nodes with highest PageRank, in descending order.

In [10]:
# TODO:
# Output 10 tuples using this cell.
# Worth 5 points plus validates your pagerank
    
# YOUR CODE HERE
pr = pagerank(M, 0.85, 15)
top_ids = np.argsort(pr)[::-1][:10]  # node IDs of the 10 highest scores, in descending order
best_pages_df = pd.DataFrame({'id': top_ids, 'pagerank': pr[top_ids]})
best_pages_df

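|   | id | pagerank |
|---|------|------------|
| 0 | 0 | 224.702638 |
| 1 | 1973 | 189.250314 |
| 2 | 1790 | 53.438593 |
| 3 | 1828 | 50.954873 |
| 4 | 1 | 27.975911 |
| 5 | 238 | 26.779136 |
| 6 | 140 | 23.520898 |
| 7 | 14 | 22.232264 |
| 8 | 16 | 21.591054 |
| 9 | 162 | 18.283386 |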


In [11]:
if best_pages_df.columns[0] != 'id' or best_pages_df.columns[1] != 'pagerank':
    raise ValueError('Incorrect column names')

In [12]:
if best_pages_df.shape[0] != 10:
    raise ValueError('There should be 10 rows in best_pages_df')

In [13]:
if len(np.where(best_pages_df['id'] == 0)[0]) != 1:
    raise ValueError('Node 0 should appear exactly once in best_pages_df')