# Notebook 11: PageRank Computation
***

In this notebook we'll take a closer look at two issues that can confound a naive calculation of PageRank: dead ends and spider traps. We'll also compute PageRank in a way that solves these issues!

We'll need numpy for this notebook, so let's load it.

In [2]:
import numpy as np

<br>

### Exercise 0: Following along in class

The transition matrix $M$ for the spider trap example from class is:

In [3]:
M = np.array([[1/2, 1/2, 0],
              [1/2, 0  , 0],
              [0  , 1/2, 1]])

And we saw that the modified transition matrix, to account for the possible teleports out of dead ends and spider traps, is:
$$A = \beta M + (1-\beta) \left[\dfrac{1}{N}\right]_{N\times N}$$

where $N$ is the number of pages and $\beta$ is the probability of following an actual link. We can construct $A$ as follows:

In [4]:
beta = 0.8
A = beta*M + (1-beta)*np.ones((3,3))/3
# Check after multiplying by 15 since that's the common denominator in 
# the slides (otherwise they'd be decimals and hard to compare directly)
print(A*15)

[[ 7.  7.  1.]
 [ 7.  1.  1.]
 [ 1.  7. 13.]]


In [5]:
# initialize
r_old = np.repeat(1/3, 3)

# powerfully iterate
for _ in range(10):
    r_new = np.matmul(A, r_old)
    r_old = r_new

# see if we agree with what we see on the slides...
print(np.round(r_new,3))

[0.214 0.153 0.633]


Sick!
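
The loop above runs for a fixed 10 iterations. As a minimal sketch of an alternative, we could stop once the rank vector barely changes between steps, measured by the L1 distance. The names `dist_l1` and `power_iterate`, the tolerance `tol`, and the cap `max_iters` are assumptions made for this illustration, not something from the slides.

```python
import numpy as np

def dist_l1(u, v):
    """L1 distance between two rank vectors (hypothetical helper)."""
    return np.sum(np.abs(u - v))

def power_iterate(A, tol=1e-8, max_iters=1000):
    """Iterate r <- A r from the uniform vector until the L1 change drops below tol."""
    n = A.shape[0]
    r_old = np.repeat(1/n, n)
    for iters in range(1, max_iters + 1):
        r_new = A @ r_old
        if dist_l1(r_new, r_old) < tol:
            break
        r_old = r_new
    return r_new, iters

# Same 3-page spider trap example and beta as in the cells above
M = np.array([[1/2, 1/2, 0],
              [1/2, 0  , 0],
              [0  , 1/2, 1]])
beta = 0.8
A = beta*M + (1-beta)*np.ones((3,3))/3

r, iters = power_iterate(A)
print(iters, np.round(r, 3))
```

The tolerance trades accuracy for run time: a looser `tol` stops sooner, a tighter one takes more steps.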

<br>

### Exercise 1: Dead-ends and spider traps!

Seen here are two graphs: one has a spider trap, and one has a dead end. As we saw in the lecture, both present issues if one attempts to use the "vanilla" PageRank calculation from last time.

<img width=600px src="http://www.cs.colorado.edu/~anwo7157/home/resources/pagerank2.png">

**[1]** Identify which of these graphs has the spider trap and which has the dead end. Which page(s) are associated with each of those problematic structures?

**[2]** Then, form a hypothesis about how the computed PageRanks will reflect the dead end and spider trap problems in these graphs.

In [6]:
# Define the transition matrices for each graph
M1 = np.array([[0, 0, 1/2, 1/3, 1/2],
               [1, 0, 0  , 0  , 1/2],
               [0, 0, 0  , 1/3, 0  ],
               [0, 0, 1/2, 0  , 0  ],
               [0, 1, 0  , 1/3, 0  ]])

M2 = np.array([[0, 0, 1/2, 0, 1/3],
               [1, 0, 0  , 0, 1/3],
               [0, 0, 0  , 0, 1/3],
               [0, 0, 1/2, 0, 0  ],
               [0, 1, 0  , 0, 0  ]])
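
One quick way to test a hypothesis about dead ends: column $j$ of a transition matrix holds the out-link probabilities of page $j$, so a page with no out-links shows up as an all-zero column. The sketch below checks for that. The helper name `find_dead_ends` and its tolerance are assumptions for illustration, and it does not detect spider traps, which need a look at the link structure itself.

```python
import numpy as np

def find_dead_ends(M, atol=1e-12):
    """Return the 1-indexed pages whose column of M sums to zero (no out-links)."""
    col_sums = M.sum(axis=0)
    zero_cols = np.flatnonzero(np.isclose(col_sums, 0.0, atol=atol))
    return [int(j) + 1 for j in zero_cols]

# The two transition matrices from the cell above
M1 = np.array([[0, 0, 1/2, 1/3, 1/2],
               [1, 0, 0  , 0  , 1/2],
               [0, 0, 0  , 1/3, 0  ],
               [0, 0, 1/2, 0  , 0  ],
               [0, 1, 0  , 1/3, 0  ]])

M2 = np.array([[0, 0, 1/2, 0, 1/3],
               [1, 0, 0  , 0, 1/3],
               [0, 0, 0  , 0, 1/3],
               [0, 0, 1/2, 0, 0  ],
               [0, 1, 0  , 0, 0  ]])

print("dead ends in M1:", find_dead_ends(M1))
print("dead ends in M2:", find_dead_ends(M2))
```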

Use 20 iterations of power iteration to obtain estimates for the PageRanks for each graph using the methods from last time. That is, use power iteration on the un-altered $M$ matrices. Were your hypotheses regarding the PageRanks correct?

In [7]:
# Power iteration on the un-altered matrices (no teleports)
n = M1.shape[0]
for name, M_graph in [("M1", M1), ("M2", M2)]:
    r_old = np.repeat(1/n, n)   # initial rank vector guess
    print(name)
    for iters in range(1, 21):  # 20 iterations, as the exercise asks
        r_new = np.matmul(M_graph, r_old)
        print("{:2d}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}".format(iters, *r_new))
        r_old = r_new.copy()

**Consider:** What do you think would happen to the PageRanks in the spider trap graph if there were no connection from Page 5 to Page 2? What if there were no connection from Page 2 to Page 5?

<br>

### Exercise 2: Fixing the problems

Recall the random walker interpretation of PageRank from last time: A page's rank is equal to the long-run proportion of time that the walker spends on that page, if she moves from page to page following any available out-link uniformly at random. Note that in the case of spider traps and dead ends, she gets stuck, which is of course an issue when we want the walker to roam free and explore the web graph.

We fix the issues of dead ends and spider traps by providing some probability $\beta < 1$ that the walker follows an actual link (chosen uniformly at random from those available), and with probability $1-\beta$, she teleports to a page chosen uniformly at random from some *teleport set* of pages. In general, the teleport set is taken to be the set of all pages, so the walker could pop up anywhere.

This leads to the updated transition matrix (from the slides):
$$A = \beta M + (1-\beta) \left[\dfrac{1}{N}\right]_{N\times N}$$

where $N$ is the number of pages, and the term in brackets is an $N \times N$ matrix full of $1/N$s.
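
As a minimal sketch, the formula can be wrapped in a small helper and sanity-checked on the 3-page example from Exercise 0. The function name `teleport_matrix` and the 2-page dead-end matrix `M_dead` are made up for this illustration. The column sums are worth a look: they equal 1 when $M$ has no dead ends, but a dead-end column only sums to $1-\beta$.

```python
import numpy as np

def teleport_matrix(M, beta):
    """Return A = beta*M + (1-beta)*[1/N], teleporting uniformly to all N pages."""
    N = M.shape[0]
    return beta * M + (1 - beta) * np.ones((N, N)) / N

# 3-page spider trap example from Exercise 0
M = np.array([[1/2, 1/2, 0],
              [1/2, 0  , 0],
              [0  , 1/2, 1]])
beta = 0.8
A = teleport_matrix(M, beta)
print(A.sum(axis=0))       # column sums when M has no dead ends

# Hypothetical 2-page graph where page 2 has no out-links (a dead end)
M_dead = np.array([[0.0, 0.0],
                   [1.0, 0.0]])
A_dead = teleport_matrix(M_dead, beta)
print(A_dead.sum(axis=0))  # the dead-end column keeps only the teleport mass
```

Because the dead-end column sums to less than 1, repeated multiplication by `A_dead` shrinks the total rank. One way to handle this is to renormalize the rank vector after each step, though other treatments of dead ends exist.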

For the case of the graph with the spider trap, construct the modified transition matrix $A$, using $\beta=0.85$ (which is a typical choice in real-life applications). Then do 20 steps of power iteration to check on our new and improved PageRank estimates.

In [8]:
# SOLUTION:


Did the addition of teleports fix the issue of accumulation of rank by the pages in the spider trap?

<br>

### Exercise 3: Sparse matrix encoding

Turns out, there are LOTS of web pages. Crazy, right? That means the matrix $M$ is huge, but sparse. So, representing it in memory can be challenging, but not impossible. The transition matrix updated to include the teleports, $A$, on the other hand, is fully dense: all $N^2$ entries are nonzero, because the walker can teleport from any page to any other page. Thus, $A$ may well be impossible to store in memory.

We decomposed the update equation for power iteration so that it would sequentially read in a single page's degree and out-link information, and update each of the out-linked nodes' ranks:
* First, we initialize all entries in $\vec{r}^{new}$ to equal $(1-\beta)/N$
* Then we loop over each page $i$ with out-degree $d_i$:
  * For each destination page that page $i$ links to, we add $\beta r^{old}_i / d_i$ to that destination's entry of $\vec{r}^{new}$

Store the degree and destination page information in a list of lists, where the primary index for the list corresponds to the source node, the first element of each constituent list is a single integer for that node's (out-)degree, and the second element of each constituent list is a list of that source node's destination nodes. 

For example, the row corresponding to Page 4 would be the fourth element of our list:
$$[3, [1, 3, 5]]$$
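
Here is a hedged sketch of how one full update step could look with this encoding. The function name `sparse_pagerank_step` is hypothetical, the link lists are the ones for the five-page graph whose matrix is constructed in the cells that follow, and skipping pages with out-degree 0 is an assumption (this graph has no dead ends, so it never triggers here).

```python
import numpy as np

def sparse_pagerank_step(M_compact, r_old, beta):
    """One power-iteration step using rows of [out-degree, [destination pages]]."""
    N = len(M_compact)
    # every page starts with its share of the teleport mass
    r_new = np.repeat((1 - beta) / N, N)
    for i, (degree, dests) in enumerate(M_compact):
        if degree == 0:
            continue  # assumption: a dead end distributes nothing
        share = beta * r_old[i] / degree  # rank that source page i sends per link
        for dest in dests:
            r_new[dest - 1] += share      # pages are 1-indexed, arrays 0-indexed
    return r_new

# Row i describes Page i+1: [out-degree, [destination pages]]
M_compact = [[1, [2]], [1, [5]], [2, [1, 4]], [3, [1, 3, 5]], [2, [2, 3]]]

beta = 0.85
N = len(M_compact)
r_old = np.repeat(1/N, N)
r_new = sparse_pagerank_step(M_compact, r_old, beta)
print(np.round(r_new, 3), r_new.sum())
```

Only the rank vectors and the link lists are held in memory; the dense matrix $A$ is never formed.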

In [9]:
N = 5
M = np.zeros((N,N))
M[:,0] = [0, 1, 0, 0, 0]
M[:,1] = [0, 0, 0, 0, 1]
M[:,2] = [1/2, 0, 0, 1/2, 0]
M[:,3] = [1/3, 0, 1/3, 0, 1/3]
M[:,4] = [0, 1/2, 1/2, 0, 0]
print(M)

[[0.         0.         0.5        0.33333333 0.        ]
 [1.         0.         0.         0.         0.5       ]
 [0.         0.         0.         0.33333333 0.5       ]
 [0.         0.         0.5        0.         0.        ]
 [0.         1.         0.         0.33333333 0.        ]]


In [12]:
# Build the compact encoding from the columns of M (defined in the cell above):
# row i holds [out-degree, [destination pages]] for Page i+1 (pages are 1-indexed)
M_compact = []
for i in range(N):
    dests = [idx+1 for idx, x in enumerate(M[:,i]) if x > 0]
    M_compact.append([len(dests), dests])
print(M_compact)

[[1, [2]], [1, [5]], [2, [1, 4]], [3, [1, 3, 5]], [2, [2, 3]]]


Now we can initialize our old rank vector as all $1/N$, and the new rank vector that we will be computing to all $(1-\beta)/N$.

In [86]:
beta = 0.85
r_old = np.repeat(1/N, N)         # initializing the entire power iteration
r_new = np.repeat((1-beta)/N, N)  # initializing the output from a single step

Now we loop over the rows of `M_compact` and distribute the importance. The update from the first row is done for you below. Your job is to turn it into a `for` loop over all of the pages.

In [87]:
degree, dests = M_compact[0]  # first row: Page 1's out-degree and destinations
for dest in dests:
    idx = dest-1 # accounting for Python's 0-based indexing
    r_new[idx] += beta*r_old[0]/degree  # distribute the *source* page's rank

In [2]:
# SOLUTION:

Compare this to a single step of the regular power iteration (with the random teleporting). Do they agree?

In [3]:
# SOLUTION:

**Solution:** They totally do agree!