# Notebook 10: Power Iteration and Random Walks
***

In this notebook we'll run through an example using power iteration and have a closer look at the random walk interpretation of PageRank.

We'll need numpy for this notebook, so let's load it.

In [1]:
import numpy as np

<br>

### Exercise 1: Power Iteration

Consider the small web of 5 pages depicted below.

<img width=250px src="http://www.cs.colorado.edu/~anwo7157/home/resources/pagerank1.png">

First, let's form a hypothesis about which page(s) will have the highest page rank, and which will have the lowest. What evidence in the graph suggests those pages will have high/low rank?

**Solution:**

Page 1 has the most in-links, so we would suppose it should have a high rank. Page 2 has an in-link from Page 1, so it might also have a high rank. Page 4 only has 1 in-link, so it probably has low rank.

Now set up the stochastic adjacency matrix $M$ for this graph. Use the natural order 1-5 for the rows/columns. Recall that $M_{ji} = 1/\text{degree}(i)$ if there is a link from $i$ to $j$, where $\text{degree}(i)$ is the number of out-links from page $i$, and $M_{ji} = 0$ otherwise. The first row is done for you; it encodes the in-links to Page 1.

In [5]:
N = 5
M = np.zeros((N,N))
print(M)
M[0,:] = [0, 0, 1/2, 1/3, 1/3]
print(M)
# TODO -- fill in the rest of M
M[1,:] = [1, 0, 0, 0, 1/3]
M[2,:] = [0, 0, 0, 1/3, 1/3]
M[3,:] = [0, 0, 1/2, 0, 0]
M[4,:] = [0, 1, 0, 1/3, 0]
print(M)

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
[[0.         0.         0.5        0.33333333 0.33333333]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.        ]]
[[0.         0.         0.5        0.33333333 0.33333333]
 [1.         0.         0.         0.         0.33333333]
 [0.         0.         0.         0.33333333 0.33333333]
 [0.         0.         0.5        0.         0.        ]
 [0.         1.         0.         0.33333333 0.        ]]


Compute a quantity that verifies $M$ is indeed a ***column-stochastic*** matrix, i.e. that each of its columns sums to 1.

In [7]:
# SOLUTION:
np.sum(M, axis=0)

array([1., 1., 1., 1., 1.])
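
A slightly stricter check is sketched below. Summing the columns catches most mistakes, but a column-stochastic matrix also needs non-negative entries, and floating-point sums are safer to compare with a tolerance. The matrix is rebuilt here so the snippet runs on its own, and the helper name `is_column_stochastic` is hypothetical (it is not part of the notebook).

```python
import numpy as np

# Rebuild the 5-page matrix from above so this snippet is self-contained.
M = np.array([
    [0, 0, 1/2, 1/3, 1/3],
    [1, 0, 0,   0,   1/3],
    [0, 0, 0,   1/3, 1/3],
    [0, 0, 1/2, 0,   0  ],
    [0, 1, 0,   1/3, 0  ],
])

def is_column_stochastic(A, atol=1e-12):
    """Hypothetical helper: True if A is non-negative and every column sums to 1."""
    A = np.asarray(A, dtype=float)
    return bool(np.all(A >= 0) and np.allclose(A.sum(axis=0), 1.0, atol=atol))

print(is_column_stochastic(M))    # columns of M
print(is_column_stochastic(M.T))  # columns of M.T are the rows of M
```

The second call is a useful sanity check on the orientation: the rows of $M$ do not sum to 1 (the first row sums to 7/6), so mixing up rows and columns would be caught here.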

To perform power iteration, we need to initialize our PageRank vector, $r$. Our solution shouldn't rely critically on how we do this, but an easy first option is to evenly distribute the rank among all $N$ pages:

In [8]:
r = np.repeat(1/N, N)

A single iteration of power iteration involves multiplying $M$ by the PageRank vector $r$, which effectively "moves" rank around. Perform a single iteration, but save both the old and the new rank vectors so we can compare them.

In [9]:
r_old = r.copy()
r_new = np.matmul(M, r_old)
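
As a small sanity check on what one step does, the sketch below repeats the multiplication on its own and confirms that the total rank is conserved. Because every column of $M$ sums to 1, multiplying by $M$ only redistributes rank. The matrix and the uniform start vector are rebuilt here so the snippet is self-contained, and the name `r_next` is just an illustrative choice.

```python
import numpy as np

# Same matrix and uniform starting vector as in the notebook.
M = np.array([
    [0, 0, 1/2, 1/3, 1/3],
    [1, 0, 0,   0,   1/3],
    [0, 0, 0,   1/3, 1/3],
    [0, 0, 1/2, 0,   0  ],
    [0, 1, 0,   1/3, 0  ],
])
N = M.shape[0]
r = np.repeat(1 / N, N)

# One power-iteration step; M @ r is the same product as np.matmul(M, r).
r_next = M @ r

print(r_next)        # rank after one step
print(r_next.sum())  # total rank, which should still be 1 up to rounding
```

Each entry of `r_next` is 0.2 times the corresponding row sum of $M$, which is why Page 4 (row sum 1/2) drops to 0.1 after the first step.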

As we iterate many more times to converge our estimate of the rank vector, we will need a ***distance measure*** between the old and new rank vectors to detect when our estimates have converged. We can use the $L_1$ norm:
$$d(\vec{x}, \vec{y}) = \sum_{i=1}^N |x_i - y_i|$$

In [10]:
def dist_L1(x,y):
    return np.sum(np.abs(np.array(x)-np.array(y)))
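
For comparison, the same distance can be computed with NumPy's built-in norm. The sketch below checks that `dist_L1` agrees with `np.linalg.norm(x - y, ord=1)` on two made-up vectors; `x` and `y` are illustrative values, not data from the notebook.

```python
import numpy as np

def dist_L1(x, y):
    # Same definition as in the notebook: sum of absolute differences.
    return np.sum(np.abs(np.array(x) - np.array(y)))

# Two illustrative vectors (hypothetical values, each summing to 1).
x = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
y = np.array([0.5, 0.1, 0.1, 0.1, 0.2])

d_manual = dist_L1(x, y)
d_numpy = np.linalg.norm(x - y, ord=1)  # L1 norm of the difference vector

print(d_manual, d_numpy)
```

Here the differences are 0.3, 0.1, 0.1, 0.1 and 0, so both values should come out to 0.6 up to rounding.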

What is the distance between the old and new rank vectors?

In [11]:
print("d(r_old, r_new) = {:0.4f}".format(dist_L1(r_old, r_new)))

d(r_old, r_new) = 0.3333


We need to pick a tolerance for when to decide our power iteration has converged. We'll use a tolerance of 0.001 here, but note that in reality, we may want to use stricter tolerances, depending on how much is at stake (like, whether a website appears on the first page of our internet search or not).

Use a `while` loop to perform power iteration until it converges. Keep track of how many iterations are performed (including the one above) and print out the estimate of the rank vector each iteration.

In [15]:
tol = 0.001
iters = 0                # number of power-iteration steps performed so far
r_new = r.copy()         # restart from the uniform initial vector
r_old = np.zeros(N)      # dummy "previous" vector so the loop runs at least once
print("{:2.0f}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}".format(iters, r[0],r[1],r[2],r[3],r[4]))
while dist_L1(r_old, r_new) > tol: # TODO -- your code goes here!
    # TODO -- and here!
    r_old = r_new.copy()
    r_new = np.matmul(M, r_old)
    iters += 1           # count the step we just performed
    # TODO -- and maybe here too!
    print("{:2.0f}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}  {:0.3f}".format(iters, r_new[0],r_new[1],r_new[2],r_new[3],r_new[4]))

 0  0.200  0.200  0.200  0.200  0.200
 1  0.233  0.267  0.133  0.100  0.267
 2  0.189  0.322  0.122  0.067  0.300
 3  0.183  0.289  0.122  0.061  0.344
 4  0.196  0.298  0.135  0.061  0.309
 5  0.191  0.299  0.123  0.068  0.319
 6  0.190  0.297  0.129  0.062  0.322
 7  0.192  0.298  0.128  0.064  0.318
 8  0.191  0.298  0.127  0.064  0.319
 9  0.191  0.298  0.128  0.064  0.319
10  0.192  0.298  0.128  0.064  0.319
11  0.191  0.298  0.128  0.064  0.319
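
One way to double-check the converged vector is to compare it with the eigenvector of $M$ for eigenvalue 1, since a fixed point of power iteration satisfies $r = Mr$. The sketch below wraps the loop in a hypothetical helper `power_iterate` (the name and the `max_iters` safeguard are not from the notebook) and compares its result with `np.linalg.eig`. Solving $r = Mr$ by hand for this particular matrix, with the entries summing to 1, gives $r = (9, 14, 6, 3, 15)/47$. That is a side calculation from $M$, not something stated in the notebook.

```python
import numpy as np

M = np.array([
    [0, 0, 1/2, 1/3, 1/3],
    [1, 0, 0,   0,   1/3],
    [0, 0, 0,   1/3, 1/3],
    [0, 0, 1/2, 0,   0  ],
    [0, 1, 0,   1/3, 0  ],
])

def power_iterate(M, tol=1e-3, max_iters=1000):
    """Hypothetical helper: repeat r <- M r until the L1 change is at most tol."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)              # uniform starting vector
    for k in range(1, max_iters + 1):
        r_next = M @ r
        if np.sum(np.abs(r_next - r)) <= tol:
            return r_next, k             # converged after k steps
        r = r_next
    return r, max_iters                  # gave up without meeting the tolerance

r_power, n_iters = power_iterate(M)

# Cross-check: eigenvector for the eigenvalue closest to 1, rescaled to sum to 1.
eigvals, eigvecs = np.linalg.eig(M)
idx = np.argmin(np.abs(eigvals - 1.0))
r_eig = np.real(eigvecs[:, idx])
r_eig = r_eig / r_eig.sum()

print(n_iters)
print(np.round(r_power, 4))
print(np.round(r_eig, 4))
```

Dividing by `r_eig.sum()` also fixes the sign, since `np.linalg.eig` may return the eigenvector scaled by a negative constant.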


How do the computed PageRanks compare to your hypothesis? In particular, can you explain:
* Why does Page 2 have higher rank than Page 1? This seems crazy because Page 1 has so many in-links! What's up with that?
* Why does Page 5 have the highest rank?
* Any other funky structure you notice?

<br>

### Exercise 2:  Random Walking

PageRank can equivalently be thought of in terms of an imaginary Tron-style person walking around on the internet. As the walker moves, she randomly follows one of the out-links from the current page to another one (possibly back to the current page, if there are self-links). Each step can be considered a point in time, during which the walker is at exactly one of the pages. A page's PageRank is the long-run proportion of time that the walker spends on that page. Here, "long-run" means we need the walker to move around for a very long time. If you have taken prob/stats classes covering *Markov chains* before, and/or were paying close attention during lecture, you are probably hissing *stationary distribution!* under your breath. That's perfectly normal.

So, we could also estimate PageRank by simulating a random walk on the graph defined above. Suppose the walker starts at Page 1. Then the first column of $M$ defines the probabilities of the walker landing in any of the other pages. In this case, she goes to Page 2 with probability 1 since that is the only out-link from Page 1.

In [85]:
print(M)

[[0.         0.         0.5        0.33333333 0.33333333]
 [1.         0.         0.         0.         0.33333333]
 [0.         0.         0.         0.33333333 0.33333333]
 [0.         0.         0.5        0.         0.        ]
 [0.         1.         0.         0.33333333 0.        ]]


Let's go for a walk, shall we?

We need a list to track all of the Pages the walker has visited, and we can start her off at Page 1.

In [88]:
M[:,0]

array([0., 1., 0., 0., 0.])

In [94]:
np.random.seed(4022)

# Keep track of all the pages visited 
visited_pages = []
all_pages = list(range(1, M.shape[0]+1))

# Start the walker on Page 1
current_page = 1
visited_pages.append(current_page)

# Pick a random new page
new_page = np.random.choice(all_pages, p=M[:,current_page-1])
print(new_page)

2


Now we reset the current page to the new one, save it to `visited_pages`, and continue to step forward in this manner a great many times.

In [96]:
# Step forward
current_page = new_page
visited_pages.append(current_page)

We can do this using a fixed number of iterations, or until we reach some convergence criterion as a stopping condition.

In [108]:
# Initialize
np.random.seed(4022)
visited_pages = []
all_pages = list(range(1, M.shape[0]+1))
current_page = 1  # Start the walker on Page 1
visited_pages.append(current_page)
niter = 100

# Iterate
for _ in range(niter):
    new_page = np.random.choice(all_pages, p=M[:,current_page-1])
    visited_pages.append(new_page)
    current_page = new_page

We can check the proportion of iterations that the walker spent on each page:

In [109]:
r_walk = [np.sum([visited_pages[k]==p for k in range(niter)])/niter for p in all_pages]
print(r_walk)

[0.19, 0.28, 0.15, 0.08, 0.3]


How does this compare to our final PageRank estimate from power iteration in Exercise 1? What does this say about whether or not we have run our random walk for long enough?

In [110]:
print(r_new)

[0.19149377 0.2978014  0.1277292  0.06376457 0.31921106]


Run the random walk for a total of 50,000 iterations, and check again how similar the estimated PageRanks are to the ranks from power iteration.

In [2]:
# SOLUTION:
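
One possible approach is sketched below; it is not necessarily the intended solution. It rebuilds $M$, runs the same kind of walk for 50,000 steps while keeping a running visit count per page instead of a list, and compares the result with a power-iteration reference. Pages are indexed 0-4 internally (so index 0 is Page 1), and the names `visits`, `current` and `r_ref` are illustrative. No output is shown because the exact numbers depend on the random draws.

```python
import numpy as np

M = np.array([
    [0, 0, 1/2, 1/3, 1/3],
    [1, 0, 0,   0,   1/3],
    [0, 0, 0,   1/3, 1/3],
    [0, 0, 1/2, 0,   0  ],
    [0, 1, 0,   1/3, 0  ],
])

np.random.seed(4022)                      # same seed value as the cells above
n_pages = M.shape[0]
niter = 50_000

visits = np.zeros(n_pages, dtype=int)     # number of time steps spent on each page
current = 0                               # 0-based index: start on Page 1
for _ in range(niter):
    visits[current] += 1
    # Column `current` of M holds the transition probabilities out of that page.
    current = np.random.choice(n_pages, p=M[:, current])

r_walk = visits / niter                   # long-run proportion of time per page

# Reference: plenty of power-iteration steps from the uniform vector.
r_ref = np.full(n_pages, 1.0 / n_pages)
for _ in range(100):
    r_ref = M @ r_ref

print(np.round(r_walk, 4))
print(np.round(r_ref, 4))
print("L1 distance:", np.sum(np.abs(r_walk - r_ref)))
```

With this many steps, the $L_1$ distance to the power-iteration result should be much smaller than for the 100-step walk, though it shrinks slowly (roughly like $1/\sqrt{\text{niter}}$), so the walk needs far more steps than power iteration for the same accuracy.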


Consider what it means for $\vec{r}$ to be a stationary distribution of this random walk:  $\vec{r}$ must satisfy the equation $\vec{r} = M \vec{r}$.

Based on this, and what you know from earlier in the semester about ways to measure the distance between two vectors, can you come up with another way to evaluate how well the random walk has converged to the stationary distribution (and therefore to the actual PageRanks)?

In [3]:
# SOLUTION:
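
One candidate measure, sketched here as an assumption about what the question is after, is the $L_1$ residual $d(M\vec{r}, \vec{r})$. It is zero exactly when $\vec{r} = M\vec{r}$, and it does not require knowing the true PageRanks in advance. The helpers `simulate_walk` and `stationarity_residual` are hypothetical names, and this version uses a local generator from `np.random.default_rng` rather than the global seed, so its numbers will not match the cells above.

```python
import numpy as np

M = np.array([
    [0, 0, 1/2, 1/3, 1/3],
    [1, 0, 0,   0,   1/3],
    [0, 0, 0,   1/3, 1/3],
    [0, 0, 1/2, 0,   0  ],
    [0, 1, 0,   1/3, 0  ],
])

def simulate_walk(M, niter, seed=4022, start=0):
    """Hypothetical helper: fraction of niter steps the walker spends on each page."""
    rng = np.random.default_rng(seed)
    n_pages = M.shape[0]
    visits = np.zeros(n_pages, dtype=int)
    current = start                        # 0-based page index
    for _ in range(niter):
        visits[current] += 1
        current = rng.choice(n_pages, p=M[:, current])
    return visits / niter

def stationarity_residual(M, r):
    """L1 distance between M r and r; zero exactly when r = M r."""
    return np.sum(np.abs(M @ r - r))

# Compare a short walk with a long one.
for niter in (100, 50_000):
    r_walk = simulate_walk(M, niter)
    print(niter, stationarity_residual(M, r_walk))
```

A small residual says the visit proportions are nearly unchanged by one more application of $M$, which is the defining property of the stationary distribution.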