# Module 6: Markov Chains - Practice

In this session, we will practice applying the **Markov chain** graphical model to a real-world dataset.

The dataset is a graph of connections between web pages.

Dataset: [Stanford web graph](https://snap.stanford.edu/data/web-Stanford.html)

The task is to estimate the popularity of each web page based solely on how it is connected
to other pages.  
This idea is part of Google's famous [PageRank](https://en.wikipedia.org/wiki/PageRank) algorithm.

In [1]:
import os, sys
import itertools
import pickle
import numpy as np
import tensorflow as tf

## Load dataset

This time the data is a **graph**, a data structure from computer science that represents
the objects under study as vertices and the relations between them as edges.

Formally, a graph is defined as $G=(V, E)$, where $V$ is the set of vertices and $E$ is the set of edges.

* [Wikipedia: Graph](https://goo.gl/Da4yAS)
* [What are graphs in laymen's terms](https://softwareengineering.stackexchange.com/questions/168058/what-are-graphs-in-laymens-terms)

We can still represent the graph with basic Python data structures such as `list` and `set`.

Here's how:

![Graph example](../resources/graph_example.png)

Take this graph as an example. Each circle (**vertex**) represents a web page on the Internet.  
Each arrow (**directed edge**) represents a hyperlink that sends the user from one page to another.

Here's an [adjacency list](https://en.wikipedia.org/wiki/Adjacency_list) representation of the graph.

~~~python
G = [
    {3, 4}, # descendants of vertex 0
    {5},    # descendants of vertex 1
    set(),  # descendants of vertex 2 (none); note that {} would be an empty dict
    {1},    # descendants of vertex 3
    {2, 5}, # descendants of vertex 4
    set()   # descendants of vertex 5 (none)
]
~~~

Now we can look up, for instance, all vertices that vertex 4 links to with **G[4]**.  
And we can check whether there is an edge from vertex 4 to vertex 3 using **3 in G[4]**. This check takes $O(1)$ time on average  
because the descendants are stored in a set, which is backed by a hash table.  
If you have no prior experience with this type of data structure, all you need to know  
is that this representation is relatively **convenient** and it's **fast**.
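The lookups described above can be sketched on the example graph as follows. The names `G_example`, `successors_of_4`, and so on are illustrative and do not appear elsewhere in this notebook. Empty descendant sets are written as `set()`, since `{}` is an empty dict in Python.

```python
# Adjacency list for the example graph in the figure (0-based vertex ids).
G_example = [
    {3, 4},  # 0 links to 3 and 4
    {5},     # 1 links to 5
    set(),   # 2 has no outbound links
    {1},     # 3 links to 1
    {2, 5},  # 4 links to 2 and 5
    set(),   # 5 has no outbound links
]

successors_of_4 = G_example[4]        # all vertices that 4 links to
has_edge_4_to_5 = 5 in G_example[4]   # True: the edge 4 -> 5 exists
has_edge_4_to_3 = 3 in G_example[4]   # False: 0 -> 3 exists, but 4 -> 3 does not
num_edges_example = sum(len(s) for s in G_example)  # total number of directed edges

print(successors_of_4, has_edge_4_to_5, has_edge_4_to_3, num_edges_example)
```

Note that membership is directional: `3 in G_example[4]` asks about the edge 4 -> 3, not 3 -> 4.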

The following cell will load in a graph in an **adjacency list** representation.

In [2]:
# Size of graph
num_vertices = 0
num_edges = 0
with open('/dsa/data/all_datasets/AppliedML_M6/web-Stanford.5600.pkl', 'rb') as f_adj:
    G = pickle.load(f_adj)
    for k, v in G.items():
        num_edges += len(v)
        # Track the largest vertex id seen; unpacking v also works when v is empty
        num_vertices = max(num_vertices, k, *v)

print('Graph size', 'V={} E={}'.format(num_vertices, num_edges))

Graph size V=5600 E=1209


Now we derive the transition matrix of the Markov chain from a connectivity matrix, defined as

$$ C_{ij} = [\text{whether there's an edge from i to j}] $$

For the example graph illustrated in the figure above, $C$ would be:

$$\begin{pmatrix} 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}$$
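As a small check, the matrix above can be rebuilt from the adjacency list of the example graph. This is a minimal sketch that assumes 0-based vertex ids, as in the figure; `G_example` and `C` are illustrative names.

```python
import numpy as np

# Example graph from the figure as an adjacency list (0-based vertex ids).
G_example = [{3, 4}, {5}, set(), {1}, {2, 5}, set()]

n = len(G_example)
C = np.zeros((n, n), dtype=int)
for i, successors in enumerate(G_example):
    for j in successors:
        C[i, j] = 1  # mark the directed edge i -> j

print(C)
```

Row $i$ of `C` lists the pages that page $i$ links to, so rows 2 and 5 are all zeros.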

In [5]:
# Initialize as all zeros
matG = np.zeros((num_vertices, num_vertices), dtype=bool) # Connectivity matrix

# Complete code below this comment  (Question #P6101)
# ----------------------------------
# Set 1 if there's a directed edge
for u, V in G.items(): # traverse vertices
    for v in V: # traverse directed edges outbound from u
        matG[u-1, v-1] = 1 # we are using [u-1, v-1] as index because data we loaded uses 1-based index,
                           #    while numpy array uses 0-based index.

Compute the transition matrix, assuming the transition probabilities out of each vertex follow a discrete uniform distribution.  
That is to say, on any given web page, the user is equally likely to click any link on that page next.

Therefore the transition matrix should be:

$$T_{ij} = \frac{C_{ij}}{deg^{+}(i)} $$

where $deg^{+}(i)$ denotes the out-degree of vertex $i$, so $T_{ij}$ is nonzero only when there is an edge from $i$ to $j$.

We can find the out-degree of every vertex by summing the **connectivity matrix** along **axis=1**.

Be careful with the edge case $deg^{+}(i)=0$, which by definition means vertex $i$ has no
outbound connections. In that case every transition probability in row $i$ should be 0, because
a user cannot go anywhere from a web page that contains no links.
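The rule above, including the zero out-degree edge case, can be sketched on the 6-vertex example graph. This is an illustrative sketch rather than the notebook's solution; `C`, `T`, and `has_links` are names chosen for this example.

```python
import numpy as np

# Connectivity matrix of the example graph from the figure.
C = np.array([
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
], dtype=float)

out_degree = C.sum(axis=1)   # out-degree of each vertex (row sums)
has_links = out_degree > 0   # rows that can be normalized safely

T = np.zeros_like(C)
# Divide each row with outbound links by its out-degree; other rows stay all zero.
T[has_links] = C[has_links] / out_degree[has_links][:, np.newaxis]

print(T)
print(T.sum(axis=1))
```

Rows for vertices with outbound links sum to 1, while rows for vertices 2 and 5 sum to 0, which reflects the convention chosen above for pages without links.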

In [6]:
# Compute transition matrix

# Complete code below this comment  (Question #P6102)
# ----------------------------------
out_degree = np.sum(matG,axis=1)
# ----------------------------------

# only deal with vertices which have any outbound connection at all
nonzero = np.flatnonzero(out_degree)
P = matG.copy().astype(float)
P[nonzero, :] /= out_degree[nonzero][..., np.newaxis]

print('Transition matrix shape', P.shape)

Transition matrix shape (5600, 5600)


Find the stationary distribution.

In [8]:
%%time

# Compute stationary state - power method
def mat_power(M, n):
    """ Construct a graph that raises square matrix M to n-th power where n>=1
    This generates a computational graph with space complexity O(log(n)).
    """
    assert n>=1
    # trivial cases
    if n==2:
        return tf.matmul(M, M)
    elif n==1:
        return M
    
    # divide & conquer
    A = mat_power(M, n//2)
    A2 = tf.matmul(A, A)
    if n&1: # odd power
        return tf.matmul(A2, M)
    else: # even power
        return A2

    
# Complete code below this comment  (Question #P6103)
# ----------------------------------
def get_stationary_state(P):
    pi0 = tf.constant(np.ones((1, len(P)))/len(P))
    transition_matrix = tf.constant(P)
    stationary_state = tf.squeeze(tf.matmul(pi0,mat_power(transition_matrix,50)))
    with tf.Session() as sess:
        return sess.run(stationary_state)

a = get_stationary_state(P)

CPU times: user 3min 55s, sys: 1.32 s, total: 3min 56s
Wall time: 21.3 s
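To see what the power method computes, here is a NumPy-only sketch of the same exponentiation-by-squaring idea on a tiny chain. The function `mat_power_np` and the 3-state matrix `P_small` are hypothetical and are not taken from the dataset. They are chosen so that the stationary distribution is easy to verify by hand.

```python
import numpy as np

def mat_power_np(M, n):
    """Raise square matrix M to the n-th power (n >= 1) by repeated squaring."""
    assert n >= 1
    if n == 1:
        return M
    half = mat_power_np(M, n // 2)  # M^(n//2)
    squared = half @ half           # M^(2 * (n//2))
    # An odd power needs one extra multiplication by M.
    return squared @ M if n & 1 else squared

# Hypothetical 3-state chain; every row sums to 1.
P_small = np.array([
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
])

pi0 = np.full((1, 3), 1 / 3)                     # uniform initial distribution
pi = (pi0 @ mat_power_np(P_small, 50)).ravel()   # distribution after 50 steps

print(pi)
```

For this chain the result is close to $(0.25, 0.5, 0.25)$, which satisfies $\pi = \pi P$. Repeated squaring needs only about $\log_2 n$ matrix multiplications instead of $n - 1$.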


Print out **fewer than 10** indices of top-ranked pages.

**Hint**: $a$ is a probability distribution. An uninformative uniform distribution would have 1/5600
in all of its entries, where 5600 is the number of vertices in this graph.
So this code currently selects the web pages that are more popular than average.

In [18]:
# Tweak code below this comment so that it only prints fewer than 10 entries (Question #P6104)
# ----------------------------------
print('Most popular pages', (np.flatnonzero(a>1/5600)+1)[:9])

Most popular pages [  80  165  784  891  925 1148 1305 1342 1887]
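Note that slicing with `[:9]` keeps the first nine above-average pages in index order, not necessarily the nine most probable ones. One possible alternative, sketched here on a hypothetical vector `a_example` (illustrative values, not results from the dataset), is to sort by probability with `np.argsort`:

```python
import numpy as np

# Hypothetical stationary distribution over 6 pages (illustrative values only).
a_example = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

k = 3
# Negate so that argsort orders from most to least probable,
# then add 1 to convert 0-based positions to 1-based page ids.
top_k = np.argsort(-a_example)[:k] + 1

print('Top pages', top_k)
```

Here the three highest entries are at positions 1, 3, and 4, so the 1-based page ids are 2, 4, and 5.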


# Save your notebook!