# Case 1: PageRank

First, import NumPy, which provides the array and linear-algebra functions used below:

In [1]:
import numpy as np

We now import the adjacency matrix from the example. It is stored in `adjacency_example.txt`.

In [2]:
A = np.loadtxt('adjacency_example.txt', dtype=int)
print(A)

[[0 1 1 1 0]
 [0 0 0 0 1]
 [0 1 0 0 0]
 [0 0 1 0 0]
 [1 0 0 1 0]]


## Task 1: Compute the transition matrix

In [3]:
def transition_matrix(A):
    '''
    transition_matrix(A):
        Input: 
            A: Adjacency matrix 
        Output:
            P: Transition matrix
    '''
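    m = np.sum(A, axis=1) # m[i] = number of outgoing links from node i
    n = m.size
    m = m.reshape((n,1)) # Make m an n x 1 array so it broadcasts over the rows of A
    C = A/m # C[i,j]=A[i,j]/m[i,0]
    return C.T # Transpose so that each column of P sums to 1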

Check if the transition_matrix function returns a stochastic matrix:

In [4]:
P=transition_matrix(A)
np.allclose(np.sum(P, axis=0), 1) # True if every column of P sums to 1, as required for a stochastic matrix

True
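
To see what the normalisation does, it can help to look at a single column. The sketch below rebuilds the example adjacency matrix by hand (copied from the printed output above, so no file is needed) and assumes, as the code above implies, that row `i` of `A` marks the outgoing links of node `i`. The names `out_degree`, `col0` and `col_sums` are only for illustration.

```python
import numpy as np

# Example adjacency matrix, typed in from the printed output above.
A = np.array([[0, 1, 1, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0]])

# Number of outgoing links per node, kept as a column so it broadcasts over rows.
out_degree = A.sum(axis=1, keepdims=True)

# Divide each row by its out-degree, then transpose.
P = (A / out_degree).T

col0 = P[:, 0]            # where a surfer on node 0 goes next
col_sums = P.sum(axis=0)  # should all be 1

print(col0)
print(col_sums)
```

Node 0 links to nodes 1, 2 and 3, so column 0 of `P` puts probability 1/3 on each of those three nodes and 0 elsewhere.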

## Task 2: The power method
Implement the power method. Your function should take as input the transition matrix, an initial vector of probabilities, and the number of iterations to perform.

In [5]:
def pow_method(P, v0, N):
    '''
    pow_method(P, v0, N):
        Input: 
            P:   transition matrix
            v0:  initial vector of probabilities
            N:   number of iterations
        Output:
            v:   final vector of probabilities v=P^N * v0
    '''
    v=v0
    for i in range(N):
        v=P@v
    return v

### Testing the method

In [6]:
n = P.shape[0]    # The number of nodes
x0 = np.ones(n)/n # initial probabilities: 1/n for every node
N=100             # number of iterations

q=pow_method(P,x0,N)
print(q)

[0.12500122 0.24999806 0.20833086 0.16666739 0.25000247]


If you have done everything correctly, the vector you get should be close to the exact steady-state vector
$$ \begin{aligned}
\mathbf{q}_{\text{exact}}=&\left[\frac{1}{8}, \frac{1}{4}, \frac{5}{24}, \frac{1}{6}, \frac{1}{4}\right]^T\\
\approx&[0.125, 0.25, 0.20833333, 0.16666667, 0.25 ]^T.
\end{aligned}$$

Increasing `N` should get you even closer to the exact vector.
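
One way to check this claim is to record the distance to the exact vector at a few iteration counts. The following is a minimal sketch, assuming the example adjacency matrix printed earlier and the exact steady-state vector stated above. The checkpoints 50, 100 and 200 and the dictionary `errors` are arbitrary choices for illustration.

```python
import numpy as np

# Example adjacency matrix, typed in from the printed output earlier.
A = np.array([[0, 1, 1, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0]])
P = (A / A.sum(axis=1, keepdims=True)).T

# Exact steady-state vector as given in the text.
q_exact = np.array([1/8, 1/4, 5/24, 1/6, 1/4])

v = np.ones(5) / 5   # uniform starting vector
errors = {}          # iteration count -> distance to q_exact
for k in range(1, 201):
    v = P @ v
    if k in (50, 100, 200):
        errors[k] = np.linalg.norm(v - q_exact)

for k, err in errors.items():
    print(k, err)
```

The printed distances should shrink as the iteration count grows, until they reach the level of floating-point rounding.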

## Task 3: NMBU Realtek
In this task, you will test your implementation on a set of webpages and links. The set consists of the webpages whose URL starts with www.nmbu.no/fakultet/realtek, together with the internal links between them (as of January 2022). Dangling nodes have been removed.

The adjacency matrix is stored in `adjacency_realtek.txt`, and the URL corresponding to each index is stored in `keyvalues.txt`.

What are the top 5 webpages by PageRank on the RealTek webpages?

Hint: `np.argsort` can be useful for answering this question.
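
As a small illustration of the hint, the sketch below ranks a made-up score vector. The values in `scores` are invented for this example and have nothing to do with the RealTek data.

```python
import numpy as np

# Made-up scores for five hypothetical pages.
scores = np.array([0.10, 0.40, 0.05, 0.30, 0.15])

# np.argsort returns indices that sort the array in ascending order;
# reversing them gives the indices from highest to lowest score.
order = np.argsort(scores)[::-1]
top3 = order[:3]

print(top3)  # [1 3 4]
```

Indexing a list of URLs with `top3` would then give the corresponding pages in rank order.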

In [9]:
A=np.loadtxt('adjacency_realtek.txt', dtype=int)
P=transition_matrix(A)
n=A.shape[0]
x0=np.ones(n)/n
q=pow_method(P,x0,100)

low_args=np.argsort(q) # smallest first
high_args=np.flip(low_args) # highest first
bestid=high_args[:10]

keyvals=np.loadtxt('keyvalues.txt', dtype=str)
print("Indices of highest ranked pages:")
print(bestid)

print("\nLinks to highest ranked pages:")
for u in bestid:
    print(keyvals[u,1])

Indices of highest ranked pages:
[ 0  2 54 59 65 62 64 66 63 61]

Links to highest ranked pages:
https://www.nmbu.no/fakultet/realtek
https://www.nmbu.no/fakultet/realtek/studier
https://www.nmbu.no/fakultet/realtek/studier/student
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/ppu
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/imrt100
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/frie-realfag
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/data-science
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/h-yere-rstrinn
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/lektor


## Task 4: Additional challenges for the interested.

a) In the power method implemented above, we specify the number of iterations. It would be better to iterate until 
$$\|\mathbf{x}_{k}-\mathbf{x}_{k-1}\|\le Tol$$
for some specified tolerance $Tol$. Implement a new function that does this.

*Hint*: `np.linalg.norm` will be useful.
For more efficient code, you should not compute the matrix-vector product $P\mathbf{x}_{k-1}$ more than once per iteration.
It is good practice to implement a maximum number of iterations, so your code doesn't get stuck in an infinite loop.

b) The actual PageRank algorithm is more robust than what we have done here. With some adjustments, it can handle dangling nodes (webpages without outgoing links) and other complications that may arise. Implement Adjustments 1 and 2 described in *Lay*, Chapter 10.2.
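
A possible starting point for b) is sketched here. It rests on our reading of the two adjustments, which should be checked against the book: Adjustment 1 replaces the column of every dangling node by a uniform column with entries $1/n$, and Adjustment 2 forms $G = pP_* + (1-p)K$, where $K$ is the $n\times n$ matrix with all entries $1/n$. The function name `google_matrix`, the default `p=0.85` and the three-node test graph are assumptions made for this example.

```python
import numpy as np

def google_matrix(A, p=0.85):
    """Sketch: transition matrix with both adjustments applied.

    A[i, j] = 1 is taken to mean that node i links to node j.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1)
    dangling = out_degree == 0

    # Adjustment 1 (as read here): a dangling node jumps to any page
    # with equal probability, so its column becomes uniform.
    P = np.empty((n, n))
    P[:, ~dangling] = (A[~dangling] / out_degree[~dangling, None]).T
    P[:, dangling] = 1.0 / n

    # Adjustment 2 (as read here): with probability p follow a link,
    # with probability 1 - p jump to a uniformly chosen page.
    K = np.full((n, n), 1.0 / n)
    return p * P + (1 - p) * K

# Hypothetical test graph: 0 -> 1, 0 -> 2, 1 -> 2, and node 2 is dangling.
A_demo = np.array([[0, 1, 1],
                   [0, 0, 1],
                   [0, 0, 0]])
G = google_matrix(A_demo)
print(G.sum(axis=0))  # every column should sum to 1
```

Because every entry of `G` is positive, the power method converges from any starting probability vector, which is the point of the second adjustment.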

## Some additional remarks
Our implementation is only practical for graphs with fewer than a few thousand nodes, because it stores the full $n\times n$ matrix. For larger graphs, better implementations are needed. Adjacency matrices are usually sparse, that is, most of the entries are zero. Efficient matrix algorithms take advantage of this. In Python, the most widely used implementation of sparse matrices is `scipy.sparse`. See https://docs.scipy.org/doc/scipy/reference/sparse.html.
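
To give an idea of what this could look like, here is a minimal sketch that builds the transition matrix of the small example graph in sparse form and runs the power method on it. It assumes SciPy is installed. The names `A_sparse`, `D_inv` and `P_sparse` are illustrative, and the example graph is far too small to show any speed-up.

```python
import numpy as np
from scipy import sparse

# Example adjacency matrix, typed in from the printed output earlier.
A = np.array([[0, 1, 1, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [1, 0, 0, 1, 0]])

A_sparse = sparse.csr_matrix(A)  # stores only the nonzero entries

# Row sums come back as an n x 1 matrix; flatten them to a 1-D array.
out_degree = np.asarray(A_sparse.sum(axis=1)).ravel()

# Scale each row by 1/out_degree with a sparse diagonal matrix, then transpose.
D_inv = sparse.diags(1.0 / out_degree)
P_sparse = (D_inv @ A_sparse).T.tocsr()

v = np.ones(5) / 5
for _ in range(100):
    v = P_sparse @ v  # sparse matrix-vector product

print(v)
```

The division by `out_degree` assumes there are no dangling nodes, as in the data sets used here.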

Really efficient implementations of the PageRank algorithm never explicitly store the adjacency matrix or the transition matrix. It is possible to compute a matrix-vector product $P\mathbf{v}$ using only the adjacency list (https://en.wikipedia.org/wiki/Adjacency_list) of the graph.
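
The idea can be sketched in a few lines. The dictionary `out_links` below is the adjacency list of the small example graph, read off from the printed adjacency matrix, and `step` is a hypothetical helper that computes one product $P\mathbf{v}$ without ever forming $P$. This is only an illustration of the principle, not an efficient implementation.

```python
import numpy as np

# Adjacency list of the example graph: out_links[i] lists the nodes that node i links to.
out_links = {0: [1, 2, 3], 1: [4], 2: [1], 3: [2], 4: [0, 3]}

def step(out_links, v):
    """Compute P @ v using only the adjacency list (no dangling nodes assumed)."""
    w = np.zeros_like(v)
    for i, targets in out_links.items():
        share = v[i] / len(targets)  # node i splits its probability evenly
        for j in targets:
            w[j] += share            # and passes one share to each linked node
    return w

v = np.ones(5) / 5
for _ in range(100):
    v = step(out_links, v)

print(v)
```

Each call to `step` touches every link once, so the work per iteration grows with the number of links rather than with $n^2$.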

The following function is one possible solution to Task 4 a):

In [8]:
def pow_method_tolerance(P, v0, tol):
    '''
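    pow_method_tolerance(P, v0, tol):
        Input: 
            P:   transition matrix
            v0:  initial vector of probabilities
            tol: tolerance. Iteration stops when ||v-Pv||<tol
        Output:
            v:   final vector of probabilities v=P^k * v0,
                 where k is the number of iterations performed
    '''
    v=v0
    maxit=50000 # maximum number of iterations, guards against an infinite loop
    for i in range(maxit):
        vold=v
        v=P@v # the only matrix-vector product in this iteration
        if np.linalg.norm(v-vold)<tol:
            # i counts from 0, so i+1 products have been computed at this point
            print('Convergence after '+str(i)+ ' iterations')
            break
    else: # Python for-else. else runs if the for-loop never breaks
        print('No convergence')
    
    return v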

In [9]:
q=pow_method_tolerance(P, x0, 1e-12)

low_args=np.argsort(q) # smallest first
high_args=np.flip(low_args) # highest first
bestid=high_args[:10]

keyvals=np.loadtxt('keyvalues.txt', dtype=str)
print("Indices of highest ranked pages:")
print(bestid)

print("\nLinks to highest ranked pages:")
for u in bestid:
    print(keyvals[u,1])

Convergence after 151 iterations
Indices of highest ranked pages:
[ 0  2 54 59 65 62 64 66 63 61]

Links to highest ranked pages:
https://www.nmbu.no/fakultet/realtek
https://www.nmbu.no/fakultet/realtek/studier
https://www.nmbu.no/fakultet/realtek/studier/student
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/ppu
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/imrt100
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/frie-realfag
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/data-science
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/h-yere-rstrinn
https://www.nmbu.no/fakultet/realtek/studier/student/semesterstart/lektor
