# PageRank

In this notebook, we will see a basic application of PageRank using PySpark. We start by reviewing the theory. Suppose we have a graph with nodes and edges, much like the web, and we want to determine the importance of each node. In the web analogy, this means ranking web pages according to their relevance. Of course, we cannot ask people to rank every page directly; we have to use the information already available in the links. Let's see an example of a small web. 

<img src="https://miro.medium.com/max/892/1*E1QqUL6eQpJsuxI9V5FE7g.png" alt="alt text" width="400"/>

Here we have $5$ web pages $\{0,1,2,3,4\}$. Each edge $i \rightarrow j$ means that web page $i$ refers to page $j$. A first idea for measuring the relevance of a web page is to count the number of other web pages referring to it. For example, page $1$ is referred to the most (by pages $2$, $3$ and $4$), so one could assume that it is the most relevant page. However, this count does not account for the relevance of the referring pages. If a very relevant page $x$ refers to another page $y$, we could safely say that $y$ is probably relevant too, even if it is referred to just a few times. Thus, to get the score $r_i$ of a page $i$, we can consider a score like this one:

$$ r_i = \sum_{j \rightarrow i} \frac{r_j}{d_j}$$

This means that the relevance score $r_i$ of page $i$ is a weighted sum of the relevance scores of all the pages referring to $i$. The score of each referring page $j$ is divided by its out-degree $d_j$ (the number of pages referred to from $j$), so a page splits its relevance evenly among the pages it refers to. Of course, without knowing the relevance of all pages referring to $i$, we cannot determine the relevance of $i$. Let's write what we know formally for the small web above. 

$$
\begin{alignat*}{4}
    r_0 =& \frac{r_4}{3}\\
    r_1 =& \frac{r_2}{2} + \frac{r_4}{3} + r_3\\
    r_2 =& \frac{r_0}{3} + \frac{r_4}{3}\\
    r_3 =& \frac{r_2}{2} + \frac{r_0}{3}\\
    r_4 =& \frac{r_0}{3} + r_1
\end{alignat*}
$$
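As a quick check of the score formula, here is a minimal sketch in plain Python (no Spark needed). The edge list is read off the five equations above, and `apply_score_formula` and the `candidate` scores are hypothetical names and values chosen only for illustration. It evaluates the right-hand side $\sum_{j \rightarrow i} r_j / d_j$ for every page and shows that one particular set of scores satisfies all five equations at once.

```python
from fractions import Fraction

# Edges (source, destination) of the small web, read off the equations above.
edges = [(0, 3), (0, 2), (0, 4),
         (1, 4),
         (2, 1), (2, 3),
         (3, 1),
         (4, 0), (4, 1), (4, 2)]

# Out-degree d_j: number of pages that page j refers to.
out_degree = {}
for src, _ in edges:
    out_degree[src] = out_degree.get(src, 0) + 1

def apply_score_formula(r):
    """Return sum_{j -> i} r_j / d_j for every page i."""
    new_r = [Fraction(0)] * len(r)
    for src, dst in edges:
        new_r[dst] += Fraction(r[src]) / out_degree[src]
    return new_r

# Hypothetical candidate scores, picked by hand for this illustration.
candidate = [Fraction(k, 27) for k in (3, 8, 4, 3, 9)]

# True: the candidate is left unchanged by the formula,
# so it satisfies all five equations simultaneously.
print(apply_score_formula(candidate) == candidate)
```

Exact fractions are used here only to avoid rounding noise in the comparison; floats would work equally well with a tolerance.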

In matrix form:

$$
P =\begin{bmatrix}
0 & 0 & 0 & 0 & \frac{1}{3}\\
0 & 0 & \frac{1}{2} & 1 & \frac{1}{3}\\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3}\\
\frac{1}{3} & 0 & \frac{1}{2} & 0 & 0\\
\frac{1}{3} & 1 & 0 & 0 & 0 \\
\end{bmatrix}
$$
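The matrix above can be rebuilt from the edge list. The following is a small sketch under the assumption that we store only the non-zero entries in a dictionary keyed by (row, column); the names `sparse_P` and `dense_P` are hypothetical. It then expands the dictionary to a dense matrix for comparison with the one shown above.

```python
from collections import Counter
from fractions import Fraction

# Edges (source, destination) of the small web.
edges = [(0, 3), (0, 2), (0, 4),
         (1, 4),
         (2, 1), (2, 3),
         (3, 1),
         (4, 0), (4, 1), (4, 2)]
n = 5

# Out-degree of every source page.
out_degree = Counter(src for src, _ in edges)

# Keep only the non-zero entries: the entry in row i, column j is 1/d_j
# for every edge j -> i.
sparse_P = {(dst, src): Fraction(1, out_degree[src]) for src, dst in edges}

# Expand to a dense matrix only to compare with the matrix in the text.
dense_P = [[sparse_P.get((i, j), Fraction(0)) for j in range(n)]
           for i in range(n)]
for row in dense_P:
    print([str(x) for x in row])

print(len(sparse_P), "stored entries instead of", n * n)
```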

The good news is that this is a sparse matrix (it has a lot of $0$s), so we can leverage the sparsity and store only the non-zero entries. Notice that $P^T$ is row stochastic, meaning that its rows sum to $1$. This is because each column of $P$ represents a probability distribution. Let's have a deeper look at the first column of the matrix (written here as a row):

$$
\begin{bmatrix}
0 & 0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\
\end{bmatrix}
$$

This column tells us where a person randomly browsing the web (called a random walker) goes next when currently on web page $0$. In this example, the random walker has probability $\frac{1}{3}$ of ending on page $2$, probability $\frac{1}{3}$ of ending on page $3$, and probability $\frac{1}{3}$ of ending on page $4$. Remember that the element $P_{ij}$ represents the probability of going $j \rightarrow i$. Let's suppose our random walker can start from any node with equal probability, so its initial probability distribution is: 

$$
\pi = \begin{bmatrix}
\frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5}\\
\end{bmatrix}
$$

Let's ask ourselves: what is the probability that a random walker with initial distribution $\pi$ ends on each page after one step? Let's begin with a simpler question: what is the probability of ending in state $4$? 

- We have a probability of $\frac{1}{5}$ of starting on $0$ and probability $\frac{1}{3}$ of moving to $4$ ($\frac{1}{5}\frac{1}{3}=\frac{1}{15}$).
- We have a probability of $\frac{1}{5}$ of starting on $1$ and probability $1$ of moving to $4$ ($\frac{1}{5}1=\frac{1}{5}$).
- We have a probability of $\frac{1}{5}$ of starting on $2$ and probability $0$ of moving to $4$ ($\frac{1}{5}0=0$).
- We have a probability of $\frac{1}{5}$ of starting on $3$ and probability $0$ of moving to $4$ ($\frac{1}{5}0=0$).
- We have a probability of $\frac{1}{5}$ of starting on $4$ and probability $0$ of moving to $4$ ($\frac{1}{5}0=0$).

So the probability of ending in $4$ is $\frac{1}{15} + \frac{1}{5} + 0 + 0 + 0 = \frac{4}{15}$. This amounts to multiplying the last row of $P$, written $P_4$, by $\pi$ (taken as a column vector):

$$
P_4\pi = \begin{bmatrix}
\frac{1}{3} & 1 & 0 & 0 & 0\\
\end{bmatrix}
\begin{bmatrix}
\frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5}\\
\end{bmatrix} = \frac{4}{15}
$$
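The value $\frac{4}{15}$ can also be checked empirically by simulating the random walker. This is only an illustrative sketch: the `successors` table is read off the graph, while the helper name, the seed and the number of trials are arbitrary assumptions.

```python
import random

# Pages each page refers to, read off the small web.
successors = {0: [2, 3, 4], 1: [4], 2: [1, 3], 3: [1], 4: [0, 1, 2]}

def simulate_one_step(rng):
    """Start on a uniformly random page and follow one random outgoing link."""
    start = rng.randrange(5)
    return rng.choice(successors[start])

rng = random.Random(0)   # fixed seed so that the run is repeatable
trials = 100_000
hits = sum(simulate_one_step(rng) == 4 for _ in range(trials))
estimate = hits / trials

print(f"estimated probability of ending on page 4: {estimate:.3f}")
print(f"exact value 4/15: {4 / 15:.3f}")
```

With this many trials the estimate should agree with the exact value to about two decimal places.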

In practice, we can obtain the probability of ending in any state after one step, $\pi^{(1)}$, by multiplying $P$ and $\pi$.

$$
P\pi = \pi^{(1)} =\begin{bmatrix}
0 & 0 & 0 & 0 & \frac{1}{3}\\
0 & 0 & \frac{1}{2} & 1 & \frac{1}{3}\\
\frac{1}{3} & 0 & 0 & 0 & \frac{1}{3}\\
\frac{1}{3} & 0 & \frac{1}{2} & 0 & 0\\
\frac{1}{3} & 1 & 0 & 0 & 0 \\
\end{bmatrix}
\begin{bmatrix}
\frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5} \\ \frac{1}{5}\\
\end{bmatrix}= 
\begin{bmatrix}
\frac{1}{15} \\ \frac{11}{30} \\ \frac{2}{15} \\ \frac{1}{6} \\ \frac{4}{15}\\
\end{bmatrix}
$$
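The product above can be reproduced with exact fractions. This is a minimal sketch in plain Python; `mat_vec` is a hypothetical helper, not part of any library.

```python
from fractions import Fraction as F

# The transition matrix P of the small web; column j holds the
# probabilities of leaving page j.
P = [
    [0,       0, 0,       0, F(1, 3)],
    [0,       0, F(1, 2), 1, F(1, 3)],
    [F(1, 3), 0, 0,       0, F(1, 3)],
    [F(1, 3), 0, F(1, 2), 0, 0],
    [F(1, 3), 1, 0,       0, 0],
]

# Uniform initial distribution.
pi = [F(1, 5)] * 5

def mat_vec(matrix, vector):
    """Dense matrix-vector product: one dot product per row."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

pi_1 = mat_vec(P, pi)
print([str(x) for x in pi_1])   # ['1/15', '11/30', '2/15', '1/6', '4/15']
print(sum(pi_1))                # 1
```

The entries still sum to $1$, as expected for a probability distribution.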

We can obtain the probability of a random walker being on any web page after two steps, $\pi^{(2)}$, by multiplying $P$ and $\pi^{(1)}$. Iterating this process, $\pi^{(k)} = P\pi^{(k-1)}$, gives the probability of being on any page after any finite number of steps. 
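The iterative process can be sketched in a few lines of plain Python before moving to Spark. The stopping rule used here (stop when no entry changes by more than `tolerance`, or after `max_steps`) is an assumption made for illustration; the text above only says that the multiplication is repeated.

```python
from collections import Counter

# Edges (source, destination) of the small web.
edges = [(0, 3), (0, 2), (0, 4),
         (1, 4),
         (2, 1), (2, 3),
         (3, 1),
         (4, 0), (4, 1), (4, 2)]
out_degree = Counter(src for src, _ in edges)

def step(p):
    """One step of the walk: returns P p using only the edge list."""
    new_p = [0.0] * len(p)
    for src, dst in edges:
        new_p[dst] += p[src] / out_degree[src]
    return new_p

tolerance = 1e-12   # hypothetical stopping threshold
max_steps = 200     # hypothetical safety cap

p = [1 / 5] * 5     # uniform initial distribution
for k in range(1, max_steps + 1):
    new_p = step(p)
    change = max(abs(a - b) for a, b in zip(new_p, p))
    p = new_p
    if change < tolerance:
        break

print(f"stopped after {k} steps")
print([round(x, 4) for x in p])
```

On this small graph the distribution settles quickly, with page $4$ ranked first and page $1$ second.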

## PageRank on Spark


In [1]:
# load spark
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .appName("Python Spark SQL basic example") \
                    .getOrCreate()
spark



In [2]:
# First let's try an example 
dataset = spark.sparkContext.parallelize([(0,3), (0,2), (0,4), 
                                          (1,4),
                                          (2,1), (2,3),
                                          (3,1),
                                          (4,0), (4,1), (4,2)])

In [3]:
# let's peek at the entries
dataset.collect()

[(0, 3),
 (0, 2),
 (0, 4),
 (1, 4),
 (2, 1),
 (2, 3),
 (3, 1),
 (4, 0),
 (4, 1),
 (4, 2)]

In [4]:
# largest page id; pages are numbered from 0, so there are total_pages + 1 pages
total_pages = max(dataset.max(lambda x:x[0])[0],dataset.max(lambda x:x[1])[1])
print(total_pages)

4


In [5]:
# compute the out-degree for each node
id2degree = dataset.countByKey()
id2degree[0],id2degree[1],id2degree[2]

(3, 1, 2)

In [6]:
# sparse transition entries stored as (source j, destination i, 1/d_j)
P = dataset.map(lambda x:(x[0],x[1],1/id2degree[x[0]]))
P.take(20)

[(0, 3, 0.3333333333333333),
 (0, 2, 0.3333333333333333),
 (0, 4, 0.3333333333333333),
 (1, 4, 1.0),
 (2, 1, 0.5),
 (2, 3, 0.5),
 (3, 1, 1.0),
 (4, 0, 0.3333333333333333),
 (4, 1, 0.3333333333333333),
 (4, 2, 0.3333333333333333)]

In [7]:
import numpy as np
p = np.full((total_pages+1,), 1/(total_pages+1))
p[:10]

array([0.2, 0.2, 0.2, 0.2, 0.2])

Now we need to implement the distributed version of the matrix-vector multiplication. This part is a little tricky. We will assume that the vector $p$ fits in memory. 
- The matrix $P$ is stored as triplets $(i,j,m_{ij})$, one for each non-zero entry.
- The vector $p$ is stored as pairs $(j, v_j)$.

The algorithm proceeds as follows:
- First, we map each $(i,j,m_{ij}) \rightarrow (i, m_{ij}v_j)$.
- Next, we reduce by key: $(i, [m_{ij}v_j, \dots, m_{it}v_t]) \rightarrow (i, m_{ij}v_j + \dots + m_{it}v_t)$.

$$Pv = y$$
$$y_i = \sum_k m_{ik}v_k$$
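The two steps can be mimicked without Spark to see what each one produces. In this sketch the map step is a list comprehension and the reduce-by-key step is a dictionary of running sums; it only imitates the data flow and is not how Spark executes it. The triplets are the non-zero entries of the small-web matrix.

```python
from collections import defaultdict

# Triplets (i, j, m_ij) for the non-zero entries of the small-web matrix P.
triplets = [(3, 0, 1 / 3), (2, 0, 1 / 3), (4, 0, 1 / 3),
            (4, 1, 1.0),
            (1, 2, 0.5), (3, 2, 0.5),
            (1, 3, 1.0),
            (0, 4, 1 / 3), (1, 4, 1 / 3), (2, 4, 1 / 3)]

# The vector, small enough to keep in memory: v[j] is the entry v_j.
v = [0.2] * 5

# Map: every triplet emits the pair (i, m_ij * v_j).
mapped = [(i, m * v[j]) for i, j, m in triplets]

# Reduce by key: add up all values that share the same row index i.
sums = defaultdict(float)
for i, value in mapped:
    sums[i] += value

y = [sums[i] for i in range(len(v))]
print([round(x, 4) for x in y])   # [0.0667, 0.3667, 0.1333, 0.1667, 0.2667]
```

Only the non-zero entries are ever touched, which is what makes the approach practical for a large sparse matrix.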


In [8]:
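# P stores (source, destination, weight); swap to (destination, source, weight),
# i.e. (i, j, P_ij) in the notation above
PT = P.map(lambda x: (x[1],x[0],x[2]))
# sanity check: summing the weights by source page shows that the columns do sum to 1
PT  .map(lambda x: (x[1],x[2]))\
    .reduceByKey(lambda x,y: x+y)\
    .take(10)

[(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0), (4, 1.0)]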

In [9]:

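for i in range(10):
    # one step of the random walk: new_p = P p, computed as map + reduceByKey
    new_p = PT.map(lambda x:(x[0],(x[2]*p[x[1]])))\
              .reduceByKey(lambda x,y: x+y)\
              .collect()
    # start from zeros so that a page without incoming links gets probability 0
    updated = np.zeros_like(p)
    for idx,prb in new_p:
        updated[idx] = prb
    p = updated
    
    print(f"iteration {i}")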


iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9


In [10]:
list(zip(p.argsort()[::-1], p[p.argsort()[::-1]]))

[(4, 0.3357537807583531),
 (1, 0.2961593337736455),
 (2, 0.14710325323036794),
 (3, 0.11074425477146095),
 (0, 0.11023937746617213)]

### On real data

We will study a real dataset from https://www.cs.cornell.edu/courses/cs685/2002fa/. You can download the raw file at https://www.cs.cornell.edu/courses/cs685/2002fa/data/gr0.California. These web subgraphs were constructed by expanding a 200-page response set to a search engine query, as in the hub/authority algorithm. This data was collected some time back, so a number of the links will be broken. 

Before going any further, try to implement the PageRank algorithm yourself.


In [11]:
## let's download the dataset
import wget, os
if not os.path.isfile("dataset.txt"):
    wget.download(url = "https://www.cs.cornell.edu/courses/cs685/2002fa/data/gr0.California", out = "dataset.txt")

In [12]:
# load the dataset
from pprint import pprint
dataset = spark.sparkContext.textFile(name = "dataset.txt", minPartitions = 2)

# let's have a peek at our dataset
print("dataset --->")
pprint(dataset.take(10))

dataset --->
['n 0 http://www.berkeley.edu/',
 'n 1 http://www.caltech.edu/',
 'n 2 http://www.realestatenet.com/',
 'n 3 http://www.ucsb.edu/',
 'n 4 http://www.washingtonpost.com/wp-srv/national/longterm/50states/ca.htm',
 'n 5 http://www-ucpress.berkeley.edu/',
 'n 6 http://www.ucr.edu/',
 'n 7 http://www.tegnetcorporation.com/',
 'n 8 http://www.research.digital.com/SRC/virtual-tourist/California.html',
 'n 9 http://www.leginfo.ca.gov/calaw.html']


In [13]:
# get nodes
id2ref = dataset.filter(lambda x:x.startswith("n"))\
                .map(lambda x:tuple(x.split(" ")))\
                .map(lambda x:(int(x[1]),x[2]))
print("firsts 10 nodes entries: ", id2ref.take(10),end="\n\n")

# get edges
id2id = dataset.filter(lambda x:x.startswith("e"))\
               .map(lambda x:tuple(x.split(" ")))\
               .map(lambda x:(int(x[1]),int(x[2])))
print("firsts 10 edges entries: ", id2id.take(10),end="\n\n")


firsts 10 nodes entries:  [(0, 'http://www.berkeley.edu/'), (1, 'http://www.caltech.edu/'), (2, 'http://www.realestatenet.com/'), (3, 'http://www.ucsb.edu/'), (4, 'http://www.washingtonpost.com/wp-srv/national/longterm/50states/ca.htm'), (5, 'http://www-ucpress.berkeley.edu/'), (6, 'http://www.ucr.edu/'), (7, 'http://www.tegnetcorporation.com/'), (8, 'http://www.research.digital.com/SRC/virtual-tourist/California.html'), (9, 'http://www.leginfo.ca.gov/calaw.html')]

firsts 10 edges entries:  [(0, 449), (0, 450), (0, 451), (0, 452), (0, 453), (0, 454), (0, 455), (0, 456), (0, 432), (0, 457)]
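The parsing rules used above can be tried on a handful of lines without Spark. This is a small sketch: the two `n` lines are copied from the dataset peek, while the `e` lines are an assumption about the raw format, inferred from the filter on `"e"` and from the parsed edges `(0, 449)` and `(0, 450)`. The function name `parse` is hypothetical.

```python
# A few sample lines: "n <id> <url>" declares a node,
# "e <source> <destination>" declares an edge (format assumed, see above).
sample_lines = [
    "n 0 http://www.berkeley.edu/",
    "n 1 http://www.caltech.edu/",
    "e 0 449",
    "e 0 450",
]

def parse(lines):
    """Split lines into a node table {id: url} and an edge list [(src, dst)]."""
    nodes, edges = {}, []
    for line in lines:
        fields = line.split(" ")
        if fields[0] == "n":
            nodes[int(fields[1])] = fields[2]
        elif fields[0] == "e":
            edges.append((int(fields[1]), int(fields[2])))
    return nodes, edges

nodes, edges = parse(sample_lines)
print(nodes)
print(edges)
```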



Next we need to fix an important problem. <br />
There are nodes that do not appear among the edges. <br />
This means that the row-stochastic matrix $P^T$ would have some all-zero rows. <br />
This is a big problem: after multiplying $P$ by $p$, we can no longer expect the result to be a probability distribution. <br />
There are many ways to solve this problem, but one of the simplest is to add edges to and from an artificial node. <br />
This artificial node, called hidden, reaches every other node and is reachable from every other node. <br />
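The effect of the hidden node can be seen on a tiny example. The three-page graph below is hypothetical and chosen only because page $2$ has no outgoing links; `step` is an illustrative helper that performs one multiplication by the transition matrix.

```python
from collections import Counter

# Hypothetical toy graph: page 2 has no outgoing links.
edges = [(0, 1), (0, 2), (1, 2)]
n = 3

def step(edge_list, p):
    """One multiplication by the transition matrix built from edge_list."""
    out_degree = Counter(src for src, _ in edge_list)
    new_p = [0.0] * len(p)
    for src, dst in edge_list:
        new_p[dst] += p[src] / out_degree[src]
    return new_p

# Without the hidden node, the probability sitting on page 2 is lost.
p = [1 / n] * n
total_without = sum(step(edges, p))
print(total_without)   # about 2/3, no longer a probability distribution

# Add a hidden node with id n, linked to and from every real node.
hidden = n
filled = (edges
          + [(i, hidden) for i in range(n)]
          + [(hidden, i) for i in range(n)])
q = [1 / (n + 1)] * (n + 1)
total_with = sum(step(filled, q))
print(total_with)      # 1, up to floating point rounding
```

Every node now has at least one outgoing edge, so no probability is lost from one step to the next.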

In [14]:
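# id of the artificial node: one past the largest id seen among nodes and edges
hidden = max(id2ref.keys().max(), id2id.keys().max(), id2id.values().max()) + 1
# every real node links to the hidden node, and the hidden node links to every real node
id2hidden = spark.sparkContext.parallelize([(i,hidden) for i in range(hidden)])
hidden2id = spark.sparkContext.parallelize([(hidden,i) for i in range(hidden)])
id2id_filled  = id2id.union(id2hidden).union(hidden2id)
id2ref_filled = id2ref.union(spark.sparkContext.parallelize([(hidden, "None")]))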

In [15]:
# compute the out-degree for each node
id2degree = id2id_filled.countByKey()
print(f"degree of node 0:{id2degree[0]}, degree of node 1:{id2degree[1]}, degree of node 2:{id2degree[2]}.\n")



degree of node 0:18, degree of node 1:2, degree of node 2:4.



                                                                                

In [None]:
# compute sparse transition matrix as (source, destination, 1/out-degree of source)
P = id2id_filled.map(lambda x:(x[0],x[1],1/id2degree[x[0]]))
# swap to (destination, source, weight) for the matrix-vector product
PT = P.map(lambda x: (x[1],x[0],x[2]))
print("firsts 10 matrix entries:", P.take(10), end="\n\n")
    
# compute total number of nodes (including the hidden node)
total_nodes = id2ref_filled.map(lambda x:x[0]).count()
print(f"total nodes: {total_nodes}\n")

# initial probability vector: uniform over all nodes
import numpy as np
p = np.full((total_nodes,), 1/(total_nodes))

# repeat p <- P*p for a fixed number of iterations
for i in range(10):
    new_p = PT.map(lambda x:(x[0],(x[2]*p[x[1]])))\
              .reduceByKey(lambda x,y: x+y)\
              .collect()
    # start from zeros so that a node without incoming links gets probability 0
    updated = np.zeros_like(p)
    for idx,prb in new_p:
        updated[idx] = prb
    p = updated

# p should still be a probability distribution, so the sum should be close to 1
print(p.sum())

# print top pages
for page in list(zip(p.argsort()[::-1], p[p.argsort()[::-1]]))[:10]:
    print(f"prb:{page[1]}, page:{id2ref_filled.lookup(page[0])}")



firsts 10 matrix entries: [(0, 449, 0.05555555555555555), (0, 450, 0.05555555555555555), (0, 451, 0.05555555555555555), (0, 452, 0.05555555555555555), (0, 453, 0.05555555555555555), (0, 454, 0.05555555555555555), (0, 455, 0.05555555555555555), (0, 456, 0.05555555555555555), (0, 432, 0.05555555555555555), (0, 457, 0.05555555555555555)]

total nodes: 9665



