# Google Page Rank
__MATH 420__  <br>
_Spring 2021_ <br>


We would like a way to assign a numerical value, let's call it the _rank,_ to the popularity of a web page. We'll describe a method that is the basis of the Google Page Rank.

To start our thinking about this, let's imagine that a popular page, say the _Wall Street Journal_ (WSJ) has a link to the _Kearney Hub._  The editor of the _Hub_ will be thrilled with the traffic that might result from being linked from such a popular web page. But if the _Kearney Hub_ links to the _Wall Street Journal,_  I'd guess that the editors of the WSJ would barely notice. So our first insight is that the rank of a page depends on the ranks of the pages that link to it. If a highly ranked page links to the _Kearney Hub,_ for example, it raises the rank of the _Hub._  But if a lowly ranked page links to the _Hub,_ it doesn't affect the rank of the _Hub_ all that much. 

Our second insight is that if a page links to many pages, that diminishes the influence of each link. A visitor to a page that links to a million other pages might click on any one of a million links, but a visitor to a page that links to just ten pages has a good chance of visiting any one of these ten pages. So the more links a page has, the less influence it has on the ranks of the pages it links to. We can think of each link from a web page as a vote, with the weight of each vote as $1/n$, where $n$ is the number of links from that web page. Thus the sum of all the votes from each web page is one.

Given these insights, let's define the _rank_ of a page to be proportional to the weighted sum of the ranks of the pages that link to it. Again, the weight of each link is the reciprocal of the number of pages the linking page links to. So if a page links to $107$ other pages, the weight of each of its links is $1/107$.

Let's take an example. Let's suppose we have four web pages labeled $A,B,C$, and $D$. 

Suppose pages $B$ and $C$ link to page $A$, and suppose page $B$ has a weight of $1/2$ (that is, it links to a total of two pages), and page $C$ has a weight of $1/3$ (thus page $C$ links to three pages). Calling the proportionality constant $\lambda$, the rank of $A$  satisfies
$$
  \lambda  \, \, \mbox{rank}(A) = \frac{1}{2}  \mbox{rank} (B) + \frac{1}{3}  \mbox{rank}(C).
$$

For the rank of $B$, suppose that pages $A$ and $C$ link to page $B$. And suppose the weight of page $A$ is $1/2$. Then
$$
  \lambda  \, \, \mbox{rank}(B) = \frac{1}{2}  \mbox{rank} (A) + \frac{1}{3} \mbox{rank} (C).
$$
For the other two pages, let's suppose that 
$$
  \lambda  \, \,  \mbox{rank}(C) = \frac{1}{2}  \mbox{rank} (B) + \mbox{rank} (D) ,
$$
$$
  \lambda   \, \, \mbox{rank}(D) = \frac{1}{2}  \mbox{rank} (A) +  \frac{1}{3} \mbox{rank} (C). 
$$

In matrix notation, our equations are
$$
  \lambda \begin{bmatrix} A \\ B \\ C \\ D \end{bmatrix} = \begin{bmatrix} 0 & 1/2 & 1/3 & 0 \\ 1/2 & 0 & 1/3 & 0 \\ 0 & 1/2 & 0 & 1 \\ 1/2 & 0 & 1/3 & 0 \end{bmatrix} \begin{bmatrix} A \\ B \\ C \\ D \end{bmatrix}. 
$$
Here I tired of writing $\mbox{rank}(A)$, so I wrote $A$ instead; and similarly for the other variables. Since $\lambda$ is unknown and the equations for $A, B, C$, and $D$ are linear, this is an _eigenvalue problem_. 

The matrix has several nice properties: (a) every column sum is one, and (b) all entries are in the interval $[0,1]$. Such a matrix is called a _Markov_ matrix. (See, for example, https://en.wikipedia.org/wiki/Stochastic_matrix). Our matrix has the additional property that in each column, every nonzero value is the same. For our matrix, for example, every nonzero member of the first column is $1/2$.
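As a rough sketch of where such a matrix comes from, the cell below builds it from lists of outgoing links and checks the column sums. The names `links` and `link_matrix` are made up for this illustration, the pages are numbered $A = 1$ through $D = 4$, and the sketch assumes every page links to at least one other page (otherwise a column would be all zeros).

```julia
# links[j] lists the pages that page j links to (A = 1, B = 2, C = 3, D = 4).
# These lists are read off from the equations above.
links = [[2, 4], [1, 3], [1, 2, 4], [3]]

# Hypothetical helper: column j holds the votes cast by page j.
function link_matrix(links)
    n = length(links)
    M = zeros(n, n)
    for (j, targets) in enumerate(links)
        for i in targets
            M[i, j] = 1 / length(targets)   # each vote from page j weighs 1/(number of links)
        end
    end
    return M
end

M = link_matrix(links)

# Each column should sum to one.
@show sum(M, dims = 1)
```

Storing the votes of page $j$ in column $j$ is what makes the column sums, rather than the row sums, equal to one.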

The equations for the unknowns $A, B, C$, and $D$ are homogeneous and linear. Accordingly, any multiple of a solution is another solution. This freedom allows us to require that the sum of the ranks have a specific value (for example 10).

Let's have Julia solve the eigenvalue problem for us. We'll need the package `LinearAlgebra`. 

In [1]:
using LinearAlgebra

We define the matrix `M` by hand. After that, we can find the eigenvalues and eigenvectors with one command:

In [2]:
M = [0 1/2 1/3 0; 1/2 0 1/3 0; 0 1/2 0 1; 1/2 0 1/3 0]

4×4 Array{Float64,2}:
 0.0  0.5  0.333333  0.0
 0.5  0.0  0.333333  0.0
 0.0  0.5  0.0       1.0
 0.5  0.0  0.333333  0.0

In [3]:
F  = eigen(M);

The eigenvalues (the possible values of $\lambda$) give the proportionality constant. The eigenvalues are

In [4]:
F.values

4-element Array{Float64,1}:
 -0.5000000062651664
 -0.4999999937348334
  0.0
  0.9999999999999998

In [5]:
F.vectors

4×4 Array{Float64,2}:
 -0.288675   0.288675  -0.471405  0.436436
 -0.288675   0.288675  -0.471405  0.436436
  0.866025  -0.866025   0.707107  0.654654
 -0.288675   0.288675   0.235702  0.436436

Each column is an eigenvector. Column 4, the eigenvector corresponding to the eigenvalue one, consists entirely of positive numbers. It's natural, I think, to require that a rank be nonnegative. Thus, we'll choose the eigenvector corresponding to the eigenvalue one for the ranks.

So the eigenvector corresponding to the eigenvalue 0.9999999999999998 is (the command `[:, 4]` returns column 4 of a matrix)

In [6]:
x = F.vectors[:, 4]

4-element Array{Float64,1}:
 0.4364357804719847
 0.4364357804719847
 0.6546536707079773
 0.4364357804719847

This says Page $C$ has the highest rank (its rank is about $0.65$ and each of the other ranks is about $0.44$). There is a three-way tie for second place. We might like to normalize the ranks to sum to $10 \,\, (= 1 + 2 + 3 + 4)$. A quick way to do this is to use

In [7]:
10 * x / sum(x)

4-element Array{Float64,1}:
 2.222222222222222
 2.222222222222222
 3.333333333333334
 2.222222222222222

In [8]:
x / sum(x)

4-element Array{Float64,1}:
 0.2222222222222222
 0.2222222222222222
 0.3333333333333334
 0.2222222222222222

Requiring that the sum of the ranks be 10, the rank of Page $C$ is about $3.3$ and each of the other ranks is about $2.2$.

Was it a coincidence that exactly one eigenvalue was positive and one eigenvector had only nonnegative terms?  No, there is a theory. For the theory, see https://en.wikipedia.org/wiki/Perron%E2%80%93Frobenius_theorem .
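Given that theory, one might pick out the ranking eigenvector programmatically instead of reading off column 4 by eye. The following is only a sketch: the names `k` and `v` are introduced here, and `real.` is applied because `eigen` can return complex values for a nonsymmetric matrix.

```julia
using LinearAlgebra

M = [0 1/2 1/3 0; 1/2 0 1/3 0; 0 1/2 0 1; 1/2 0 1/3 0]
F = eigen(M)

# Index of the eigenvalue with the largest real part (here the eigenvalue one).
k = argmax(real.(F.values))

# The matching eigenvector, scaled so its members sum to one.
# Dividing by the sum also fixes the sign if eigen returned a negative multiple.
v = real.(F.vectors[:, k])
v = v / sum(v)
@show v
```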

An eigenvalue problem $\lambda x = M x$ looks somewhat like a fixed point problem, except for the factor $\lambda$. But for an eigenvalue of one, the eigenvalue problem
$$
   x = M x
$$
is _exactly_ the form of a fixed point problem.  Let's try solving our problem using fixed point iteration. Here is a quick-to-write recursive method:

In [9]:
function fixed_point(M, x0, tol)
    x1 = M * x0
    @show(x1)
    if norm(x1-x0, 2) < tol x1 else fixed_point(M, x1, tol) end
end

fixed_point (generic function with 1 method)

In [10]:
fixed_point(M, [1; 0 ; 0; 0], 1.0e-6)

x1 = [0.0, 0.5, 0.0, 0.5]
x1 = [0.25, 0.0, 0.75, 0.0]
x1 = [0.25, 0.375, 0.0, 0.375]
x1 = [0.1875, 0.125, 0.5625, 0.125]
x1 = [0.25, 0.28125, 0.1875, 0.28125]
x1 = [0.203125, 0.1875, 0.421875, 0.1875]
x1 = [0.234375, 0.2421875, 0.28125, 0.2421875]
x1 = [0.21484375, 0.2109375, 0.36328125, 0.2109375]
x1 = [0.2265625, 0.228515625, 0.31640625, 0.228515625]
x1 = [0.2197265625, 0.21875, 0.3427734375, 0.21875]
x1 = [0.2236328125, 0.22412109375, 0.328125, 0.22412109375]
x1 = [0.221435546875, 0.22119140625, 0.336181640625, 0.22119140625]
x1 = [0.22265625, 0.2227783203125, 0.331787109375, 0.2227783203125]
x1 = [0.22198486328125, 0.221923828125, 0.33416748046875, 0.221923828125]
x1 = [0.22235107421875, 0.222381591796875, 0.3328857421875, 0.222381591796875]
x1 = [0.2221527099609375, 0.222137451171875, 0.3335723876953125, 0.222137451171875]
x1 = [0.222259521484375, 0.22226715087890625, 0.3332061767578125, 0.22226715087890625]
x1 = [0.22220230102539062, 0.222198486328125, 0.3334007263183594, 0.22219

4-element Array{Float64,1}:
 0.22222228348255157
 0.22222229093313217
 0.3333331346511841
 0.22222229093313217

These numbers are familiar! To within the tolerance, this is the result we got when we normalized by dividing by the sum of the ranks. For our matrix, we can show that if the sum of the members of $x$ is one, the sum of the members of $Mx$ is also one.
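One way to see this, writing $\mathbf{1}$ for the column vector of all ones (a notation introduced only for this sketch): the fact that every column of $M$ sums to one says $\mathbf{1}^T M = \mathbf{1}^T$. Then
$$
  \sum_i (Mx)_i = \mathbf{1}^T (M x) = (\mathbf{1}^T M) x = \mathbf{1}^T x = \sum_i x_i .
$$
So multiplying by $M$ never changes the sum of the members of a vector.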

> Since we started with a vector whose sum of components was one, the method returns a vector that also has a sum of components of one.
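The recursive method is fine for a small example, but a loop with an iteration cap may be a safer choice when convergence is slow or fails. Here is a minimal sketch of that alternative; the name `fixed_point_iter` and the keyword `maxiter` are assumptions of this example, not part of the method above.

```julia
using LinearAlgebra

# Loop-based fixed point iteration. Returns the final iterate and the
# number of iterations used, or throws an error after maxiter iterations.
function fixed_point_iter(M, x0, tol; maxiter = 1000)
    x = float.(x0)
    for k in 1:maxiter
        xnew = M * x
        if norm(xnew - x, 2) < tol
            return xnew, k
        end
        x = xnew
    end
    error("no convergence within $maxiter iterations")
end

M = [0 1/2 1/3 0; 1/2 0 1/3 0; 0 1/2 0 1; 1/2 0 1/3 0]
x, iterations = fixed_point_iter(M, [1, 0, 0, 0], 1.0e-6)
@show x
@show sum(x)
```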



> We've described what Wikipedia (https://en.wikipedia.org/wiki/PageRank#Simplified_algorithm) refers to as the _simplified version_.  

The Patent for the Google Page Rank (https://patentimages.storage.googleapis.com/db/8f/cb/dad63e985797ec/US7058628.pdf) replaces the Markov matrix $M$ of the simplified version by 
$$
  \frac{\alpha}{N} I  + (1 - \alpha) M,
$$
where $N$ is the number of nodes, $I$ is an identity matrix, and $\alpha \in [0,1]$. In general, this is _not_ a Markov matrix, and when $\alpha > 0$ and $N > 1$ its largest eigenvalue (called the _dominant eigenvalue_) is strictly less than one. In fact, all eigenvalues are then inside the unit circle; consequently, it can be shown that _every_ fixed point sequence converges to the zero vector. And that would make the page rank of every page equal to zero. Since every fixed point sequence converges to the zero vector when $\alpha > 0$, $\alpha$ is generally called a _damping factor._ 
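A short calculation suggests why, taking $I$ to be the identity matrix as above. If $M v = \mu v$, then
$$
  \left( \frac{\alpha}{N} I  + (1 - \alpha) M \right) v = \left( \frac{\alpha}{N} + (1 - \alpha) \mu \right) v,
$$
and every eigenvalue of the modified matrix has this form. Every eigenvalue $\mu$ of a Markov matrix satisfies $|\mu| \le 1$, so
$$
  \left| \frac{\alpha}{N} + (1 - \alpha) \mu \right| \le \frac{\alpha}{N} + (1 - \alpha) = 1 - \alpha \left(1 - \frac{1}{N} \right),
$$
which is strictly less than one when $\alpha > 0$ and $N > 1$. For example, with $\alpha = 0.15$, $N = 4$, and $\mu = 1$, the eigenvalue is $0.0375 + 0.85 = 0.8875$.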

But buried in the Patent application is 

"_Note that in order to ensure convergence, the norm of p, must be made equal to 1  after each iteration_"

And this means that the original method modifies the fixed point sequence by dividing each term of the sequence by its norm (which norm, the one, two, or infinity, doesn't matter). This is known as the power method for finding the dominant eigenvalue (see https://en.wikipedia.org/wiki/Power_iteration).
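To make the normalization step concrete, here is a minimal sketch of the power method applied to the damped version of our first matrix. The function name `power_method`, the keyword `maxiter`, and the choice of the one norm are assumptions of this example; this is not code from the Patent.

```julia
using LinearAlgebra

# Power method: multiply by A, then rescale so the one norm is 1.
function power_method(A, x0, tol; maxiter = 10_000)
    x = x0 / norm(x0, 1)
    for k in 1:maxiter
        y = A * x
        y = y / norm(y, 1)          # the normalization after each iteration
        if norm(y - x, 1) < tol
            return y
        end
        x = y
    end
    error("no convergence within $maxiter iterations")
end

M = [0 1/2 1/3 0; 1/2 0 1/3 0; 0 1/2 0 1; 1/2 0 1/3 0]
alpha = 0.15
N = 4
A = alpha / N * I + (1 - alpha) * M

p = power_method(A, [1.0, 0, 0, 0], 1.0e-10)
@show p

# Since p is nonnegative with one norm 1, this estimates the dominant eigenvalue.
@show norm(A * p, 1)
```

Without the rescaling line, the iterates would shrink toward the zero vector, because the dominant eigenvalue of `A` is less than one.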

With or without a damping factor, the matrix used can have two or more linearly independent eigenvectors corresponding to the eigenvalue with the greatest magnitude. This happens, for example, when there are two or more nonempty disjoint sets of web pages (call them clusters) whose pages link to other members of the same cluster, but not to other clusters.

We can see the need for a modification by considering the Markov matrix

In [11]:
M = [0 1 0 0; 1 0 0 0; 0 0 0 1; 0 0 1 0]

4×4 Array{Int64,2}:
 0  1  0  0
 1  0  0  0
 0  0  0  1
 0  0  1  0

Calling these pages $A$ through $D$, we see that $A$ and $B$ are linked and $C$ and $D$ are linked, but these two sets of nodes aren't linked together. What about the eigenvalues?

In [12]:
x = eigen(M);

Ha! There are two eigenvectors with eigenvalue 1. Using one eigenvector, the ranks of $A$ and $B$ tie, but the ranks of $C$ and $D$ are zero.  And the other eigenvector swaps this. 

In [13]:
x.values

4-element Array{Float64,1}:
 -0.9999999999999989
 -0.9999999999999989
  1.0
  1.0

In [14]:
x.vectors

4×4 Array{Float64,2}:
  0.707107   0.0       0.707107  0.0
 -0.707107   0.0       0.707107  0.0
  0.0        0.707107  0.0       0.707107
  0.0       -0.707107  0.0       0.707107
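To see what this ambiguity means for the ranks, the sketch below checks that two quite different rank vectors, and any mix of them, all satisfy $x = Mx$ for this matrix. The names `M2`, `u`, `v`, `w`, and `e1` are introduced just for this illustration.

```julia
M2 = [0 1 0 0; 1 0 0 0; 0 0 0 1; 0 0 1 0]

u = [0.5, 0.5, 0.0, 0.0]   # all of the rank on A and B
v = [0.0, 0.0, 0.5, 0.5]   # all of the rank on C and D
w = 0.3 * u + 0.7 * v      # an arbitrary mix whose members still sum to one

@show M2 * u ≈ u
@show M2 * v ≈ v
@show M2 * w ≈ w

# Starting from a single page, the iterates swap back and forth
# (a symptom of the eigenvalue -1), so fixed point iteration never settles.
e1 = [1.0, 0.0, 0.0, 0.0]
@show M2 * e1
@show M2 * (M2 * e1)
```

So for this matrix the ranks are not determined by the equations alone, and the result of an iteration depends on the starting vector.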

Including a damping factor still gives two linearly independent eigenvectors corresponding to the dominant eigenvalue, as the computation with $\alpha = 0.15$ below shows.

One way to fix this is to have a fictitious "super node" that links to every page and to which every page links. Effectively, the super node includes the possibility that a user will visit a page by entering a URL instead of randomly clicking.

In [15]:
alpha = 0.15

0.15

In [16]:
N = 4

4

In [17]:
xx = alpha/ N * I + (1-alpha) * M

4×4 Array{Float64,2}:
 0.0375  0.85    0.0     0.0
 0.85    0.0375  0.0     0.0
 0.0     0.0     0.0375  0.85
 0.0     0.0     0.85    0.0375

In [18]:
eigen(xx)

Eigen{Float64,Float64,Array{Float64,2},Array{Float64,1}}
values:
4-element Array{Float64,1}:
 -0.8124999999999989
 -0.8124999999999989
  0.8875
  0.8875
vectors:
4×4 Array{Float64,2}:
  0.707107   0.0       0.707107  0.0
 -0.707107   0.0       0.707107  0.0
  0.0        0.707107  0.0       0.707107
  0.0       -0.707107  0.0       0.707107
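For comparison, here is a rough sketch of the super node idea applied to the same two-cluster example. It assumes one particular way of weighting the extra links: the super node $S$ becomes a fifth node, each page splits its vote equally between its original link and $S$, and $S$ votes equally for all four pages. The names `Ms`, `Fs`, `n_dominant`, `k`, and `r` are introduced for this sketch only.

```julia
using LinearAlgebra

# Nodes A, B, C, D, S. Column j holds the votes cast by node j:
#   A links to B and S, B links to A and S,
#   C links to D and S, D links to C and S,
#   S links to each of A, B, C, D.
Ms = [0    1/2  0    0    1/4;
      1/2  0    0    0    1/4;
      0    0    0    1/2  1/4;
      0    0    1/2  0    1/4;
      1/2  1/2  1/2  1/2  0  ]

Fs = eigen(Ms)

# How many eigenvalues have magnitude (numerically) equal to one?
n_dominant = count(z -> isapprox(abs(z), 1; atol = 1e-8), Fs.values)
@show n_dominant

# The eigenvector for the eigenvalue one, scaled to sum to one.
k = argmax(real.(Fs.values))
r = real.(Fs.vectors[:, k])
r = r / sum(r)
@show r
```

With the super node included, only one eigenvalue has magnitude one, so the ranks are determined up to scaling. By symmetry the four real pages tie in this example.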