# Lecture 11 PageRank
__Math 3280: Data Mining__

__Outline__
1. Archaic search engines and Term Spam
2. Basic PageRank
3. Dead Ends and Spider Traps
4. Taxation

__Reading__ 
* Leskovec, Chapter 5
* [PageRank: Link Analysis Explanation and Python Implementation from Scratch, *Towards Data Science*](https://towardsdatascience.com/pagerank-3c568a7d2332)  
-----

Early search engines
* Programs would crawl through websites, listing terms (words or other strings of characters other than white space) used in that page.
* The list of terms would be stored in an inverted index (a data structure that makes it easy, given a term, to find all places where that term occurs)
* A search query would move through the inverted index and look for pages with the searched terms
  * The result would be a ranked list of pages that use those terms
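
As a rough illustration of the inverted index described above, here is a minimal sketch. The page names and texts are made up for the example, and the functions `build_inverted_index` and `search` are hypothetical helpers, not part of any real search engine.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for term in text.lower().split():
            index[term].add(name)
    return index

def search(index, query):
    """Return the pages that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Hypothetical pages, invented for this example
pages = {
    "trails.html": "local hike maps and trail reports",
    "gps.html": "gps units for sale",
    "parks.html": "local parks with a short hike",
}
index = build_inverted_index(pages)
print(sorted(search(index, "local hike")))  # ['parks.html', 'trails.html']
```

This sketch only finds matching pages; an early search engine would then rank them, for instance by how often the terms occur.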

Problems
* __Term Spam__ refers to techniques for fooling search engines into believing your page is about something it is not
  * To attract users, some developers would add a list of key terms to their webpage many times, in a font the same color as the background, so the page would get a high ranking in these early search engines
  * For example, suppose you have a website that sells GPS units, but you want to attract anyone who is searching for local hikes
    * You encode the webpage with the words "local" and "hike" repeated multiple times
    * If your page is grey, then you change the font color of the repeated words to be the same shade of grey so they aren't immediately apparent to the user
    * The more times the words "local" and "hike" appear in your page, the more likely your website will get a high rank in a search query
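
To see why repeated terms worked, here is a small sketch that assumes a search engine ranks pages purely by how often the query terms occur. The scoring function `term_count_score` and both page texts are invented for illustration; real engines of the time were more involved.

```python
def term_count_score(text, query):
    """Score a page by how many times the query terms occur in it."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

honest_page = "guide to every local hike in the county"
# The spam page sells GPS units but hides 50 copies of the target terms
spam_page = "gps units for sale " + "local hike " * 50

query = "local hike"
scores = {
    "honest": term_count_score(honest_page, query),
    "spam": term_count_score(spam_page, query),
}
print(scores)  # {'honest': 2, 'spam': 100}
```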
    
Larry Page and Sergey Brin, the founders of Google, developed an algorithm named __PageRank__ to combat term spam:
* PageRank simulates random __web surfers__, who move through the web by following links at random
  * A surfer on site A follows a link to site B, finds on site B a link to site C, and so on, moving from link to link through the web
* Other components are also considered, such as how many sites link to a page
* The probability that a surfer is at a given site after many steps gives a way to rank the sites

We'll look at
1. Strongly connected websites (no dead ends)
2. Other components on the web that introduce dead ends
3. Spider Traps
4. Using taxation to resolve dead ends and spider traps

## Strongly connected websites


Figure 5.1
* A -> B, C, D
* B -> A, D
* C -> A
* D -> B, C
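
The random-surfer idea can be simulated directly on this graph. The sketch below is one possible simulation, not the method used in the rest of the lecture; the seed and the number of steps are arbitrary choices. The visit frequencies should settle near 1/3 for A and 2/9 for each of B, C, and D.

```python
import random
from collections import Counter

# Links of Figure 5.1
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def simulate_surfer(links, steps, seed=0):
    """Follow random links for `steps` steps and return visit frequencies."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    page = rng.choice(sorted(links))    # start on a uniformly random page
    visits = Counter()
    for _ in range(steps):
        page = rng.choice(links[page])  # each link is equally likely
        visits[page] += 1
    return {p: visits[p] / steps for p in sorted(links)}

freq = simulate_surfer(links, steps=100_000)
print(freq)
```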

Create a transition matrix ($M$)
* Each column is a starting page (A, B, C, or D)
* Each row is a linked page that is linked to from a starting page
* The values are the probability that a linked page is selected from the given starting page
  * $1/k$ where there are $k$ links on a page
  * The matrix is __stochastic__
    * Each link on a page is equally likely to be followed
    * Every entry is a probability, and each column sums to 1
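
The rules above can be turned into a matrix mechanically. The following is a minimal sketch for the links of Figure 5.1; the helper name `transition_matrix_from_links` is made up for this example.

```python
import numpy as np

def transition_matrix_from_links(links, pages):
    """Build M with M[i, j] = 1/k when page j has k links and one goes to page i."""
    index = {p: i for i, p in enumerate(pages)}
    M = np.zeros((len(pages), len(pages)))
    for src, dests in links.items():
        for dest in dests:
            M[index[dest], index[src]] = 1 / len(dests)
    return M

pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}

M = transition_matrix_from_links(links, pages)
print(M)
print(M.sum(axis=0))  # column sums; all 1 for a stochastic matrix
```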

Starting vector $\mathbf{v_0}$
* A vector of size $n$ (the number of sites in the graph) whose elements are $1/n$, indicating the probability of that being a starting page

A simulated web surfer's first step:
$$v_1 = Mv_0$$

The result of this multiplication is the probability of landing on each site after the first step. Find the probability of each following step:
$$v_{i+1} = Mv_i$$

If we repeat this step a large number of times, the vector approaches a limit whose entries are the probabilities of a surfer ending up on each site. The sites with the highest probabilities are listed first in the search results.

In [1]:
# Transition Matrix 
import numpy as np
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0,   0, 1/3],  # Linked page A
                              [1/2,   0, 1/3,   0, 1/3],  #             B
                              [1/2,   0,   0, 1/2,   0],  #             C
                              [  0,   0, 1/3, 1/2, 1/3],  #             D
                              [  0,   1, 1/3,   0,   0]]) #             E

# Starting vector
v = np.array([1/5, 1/5, 1/5, 1/5, 1/5])

# Web crawler steps
for i in range(30):
    v = np.matmul(transition_matrix, v)
    print(v)

[0.06666667 0.23333333 0.2        0.23333333 0.26666667]
[0.08888889 0.18888889 0.15       0.27222222 0.3       ]
[0.1        0.19444444 0.18055556 0.28611111 0.23888889]
[0.07962963 0.18981481 0.19305556 0.28287037 0.25462963]
[0.08487654 0.18904321 0.18125    0.29066358 0.25416667]
[0.08472222 0.18757716 0.18777006 0.29047068 0.24945988]
[0.08315329 0.18810442 0.18759645 0.29097865 0.25016718]
[0.08338906 0.18749786 0.18706597 0.29141054 0.25063657]
[0.08354552 0.18759538 0.1873998  0.29160612 0.24985318]
[0.08328439 0.18752376 0.18757582 0.29155405 0.25006198]
[0.08335399 0.18752146 0.18741922 0.29165629 0.25004903]
[0.08334968 0.18749975 0.18750514 0.2916509  0.24999454]
[0.08333151 0.18750806 0.18750029 0.29165868 0.25000146]
[0.08333382 0.18749967 0.18749509 0.29166325 0.25000816]
[0.08333605 0.18750133 0.18749854 0.29166604 0.24999804]
[0.08333268 0.18750022 0.18750105 0.29166521 0.25000084]
[0.08333361 0.1875003  0.18749895 0.29166657 0.25000057]
[0.08333352 0.18749998 0.187500

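(Solution for the four-page graph of Figure 5.1: [3/9, 2/9, 2/9, 2/9]. The code cell above uses a different, five-page example, A through E, whose iterates approach [1/12, 3/16, 3/16, 7/24, 1/4].)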
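
The limit for Figure 5.1 can be checked with a short sketch that iterates until the vector stops changing instead of using a fixed number of steps. The function name `power_iterate` and the tolerance are choices made for this example.

```python
import numpy as np

# Transition matrix of Figure 5.1 (columns are starting pages A, B, C, D)
M = np.array([[0,   1/2, 1, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])

def power_iterate(M, tol=1e-10, max_iter=1000):
    """Repeat v <- M v from the uniform vector until the change is below tol."""
    n = M.shape[0]
    v = np.full(n, 1 / n)
    for step in range(1, max_iter + 1):
        v_next = M @ v
        if np.abs(v_next - v).sum() < tol:
            return v_next, step
        v = v_next
    return v, max_iter

v, steps = power_iterate(M)
print(v, steps)
```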

## Dead ends




Figure 5.3 - Same as figure 5.1, but remove the link from C to A
* A -> B, C, D
* B -> A, D
* C -> __Dead End__
* D -> B, C

This matrix is no longer stochastic, since column C sums to 0, not 1. A matrix whose columns each sum to at most 1 is called __substochastic__.
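
A quick way to spot dead ends is to look at the column sums. This sketch uses the four-page graph of Figure 5.3 and also tracks the total probability, which leaks away a little on every step.

```python
import numpy as np

# Transition matrix of Figure 5.3: column C is all zeros (dead end)
M = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   0, 1/2],
              [1/3, 1/2, 0, 0  ]])
pages = ["A", "B", "C", "D"]

# A page is a dead end when its column sums to 0
col_sums = M.sum(axis=0)
dead_ends = [p for p, s in zip(pages, col_sums) if np.isclose(s, 0)]

# Total probability after each of the first five steps
v = np.full(4, 1 / 4)
totals = []
for _ in range(5):
    v = M @ v
    totals.append(v.sum())

print(dead_ends)
print(totals)
```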

In [2]:
# B is the dead end in this five-page example: its column is all zeros
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0,   0, 1/3],  # Linked page A
                              [1/2,   0, 1/3,   0, 1/3],  #             B
                              [1/2,   0,   0, 1/2,   0],  #             C
                              [  0,   0, 1/3, 1/2, 1/3],  #             D
                              [  0,   0, 1/3,   0,   0]]) #             E

# Starting vector
v = np.array([1/5, 1/5, 1/5, 1/5, 1/5])

# Web surfer steps
for i in range(60):
    v = np.matmul(transition_matrix, v)
    print(v)

[0.06666667 0.23333333 0.2        0.23333333 0.06666667]
[0.02222222 0.12222222 0.15       0.20555556 0.06666667]
[0.02222222 0.08333333 0.11388889 0.175      0.05      ]
[0.01666667 0.06574074 0.09861111 0.14212963 0.03796296]
[0.01265432 0.05385802 0.07939815 0.11658951 0.03287037]
[0.01095679 0.04375    0.06462191 0.09571759 0.02646605]
[0.00882202 0.03584105 0.05333719 0.07822145 0.02154064]
[0.00718021 0.02937028 0.04352173 0.06407    0.01777906]
[0.00592635 0.02402371 0.03562511 0.0524686  0.01450724]
[0.00483575 0.01967396 0.02919748 0.04294508 0.01187504]
[0.00395835 0.01610871 0.02389042 0.03516338 0.00973249]
[0.00324416 0.01318681 0.01956086 0.02878933 0.00796347]
[0.00265449 0.01079686 0.01601675 0.02356944 0.00652029]
[0.00217343 0.00883959 0.01311197 0.01929706 0.00533892]
[0.00177964 0.00723701 0.01073525 0.01579883 0.00437066]
[0.00145689 0.00592512 0.00878923 0.01293471 0.00357842]
[0.00119281 0.00485099 0.0071958  0.01058991 0.00292974]
[0.00097658 0.00397158 0.005891

(Solution after a large number of iterations: every entry goes to 0, so [0,0,0,0] for Figure 5.3 and [0,0,0,0,0] for the five-page example above)

Two approaches to dealing with dead ends
1. Drop all dead ends
   * Will also need to drop sites that only lead to dead ends
   * If site J leads only to K, and K is a dead end, then K is dropped. J then becomes a dead end, so it will also need to be dropped
2. Modify the process by which random surfers are assumed to move about the web (the modification we'll look at is "taxation")
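
The first approach can be sketched as a loop that keeps removing dead ends until none remain. The graph below is a made-up example containing the J and K pages from the list above, and the sketch assumes every page appears as a key in `links`.

```python
def remove_dead_ends(links):
    """Repeatedly drop pages with no out-links until none remain."""
    graph = {page: set(dests) for page, dests in links.items()}
    removed = []
    while True:
        dead = [page for page, dests in graph.items() if not dests]
        if not dead:
            return graph, removed
        for page in dead:
            del graph[page]
            removed.append(page)
        # Links into removed pages disappear, which may create new dead ends
        for dests in graph.values():
            dests.difference_update(dead)

# Hypothetical graph: J leads only to K, and K is a dead end
links = {
    "A": ["B", "C"],
    "B": ["A", "J"],
    "C": ["A", "B"],
    "J": ["K"],
    "K": [],
}
graph, removed = remove_dead_ends(links)
print(removed)                                    # ['K', 'J']
print({p: sorted(d) for p, d in graph.items()})
```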

In [3]:
#              Starting Page     A,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0, 1/2],  # Linked page A
                              [  1,   0, 1/2,   0],  #             C
                              [  0, 1/2, 1/2, 1/2],  #             D
                              [  0, 1/2,   0,   0]]) #             E

# Starting vector
v = np.array([1/4, 1/4, 1/4, 1/4])

# Web surfer steps
for i in range(30):
    v = np.matmul(transition_matrix, v)
    print(v)

[0.125 0.375 0.375 0.125]
[0.0625 0.3125 0.4375 0.1875]
[0.09375 0.28125 0.46875 0.15625]
[0.078125 0.328125 0.453125 0.140625]
[0.0703125 0.3046875 0.4609375 0.1640625]
[0.08203125 0.30078125 0.46484375 0.15234375]
[0.07617188 0.31445312 0.45898438 0.15039062]
[0.07519531 0.30566406 0.46191406 0.15722656]
[0.07861328 0.30615234 0.46240234 0.15283203]
[0.07641602 0.30981445 0.46069336 0.15307617]
[0.07653809 0.3067627  0.46179199 0.15490723]
[0.07745361 0.30743408 0.46173096 0.15338135]
[0.07669067 0.30831909 0.46127319 0.15371704]
[0.07685852 0.30732727 0.46165466 0.15415955]
[0.07707977 0.30768585 0.46157074 0.15366364]
[0.07683182 0.30786514 0.46146011 0.15384293]
[0.07692146 0.30756187 0.46158409 0.15393257]
[0.07696629 0.30771351 0.46153927 0.15378094]
[0.07689047 0.30773592 0.46151686 0.15385675]
[0.07692838 0.3076489  0.46155477 0.15386796]
[0.07693398 0.30770576 0.46153581 0.15382445]
[0.07691222 0.30770189 0.46153301 0.15385288]
[0.07692644 0.30767873 0.46154389 0.15385094]
[0

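(Solution for Figure 5.3 with the dead end C removed, in the order A, B, D: [2/9, 4/9, 3/9]. The five-page example above, with B removed, approaches [1/13, 4/13, 6/13, 2/13].)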

## Spider traps
Occasionally, a website will link to itself (e.g. one link jumps to a section halfway down the page and another returns the user to the top of the page). If the only links on a website are to itself, then it is called a __spider trap__; more generally, a spider trap is any set of pages whose links all stay inside the set. Because the page has links, it is not a dead end. This may be sensible web design, but it breaks the basic PageRank algorithm, because a random surfer who enters the trap can never leave.



Figure 5.6 - Same as figure 5.3, but C has a spider trap (link to itself)
* A -> B, C, D
* B -> A, D
* C -> C
* D -> B, C

In [5]:
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0,   0, 1/3],  # Linked page A
                              [1/2,   0, 1/3,   0, 1/3],  #             B
                              [1/2,   0,   0,   0,   0],  #             C
                              [  0,   0, 1/3,   1, 1/3],  #             D
                              [  0,   1, 1/3,   0,   0]]) #             E

# Starting vector
v = np.array([1/5, 1/5, 1/5, 1/5, 1/5])

# Web surfer steps
for i in range(60):
    v = np.matmul(transition_matrix, v)
    print(v)

[0.06666667 0.23333333 0.1        0.33333333 0.26666667]
[0.08888889 0.15555556 0.03333333 0.45555556 0.26666667]
[0.08888889 0.14444444 0.04444444 0.55555556 0.16666667]
[0.05555556 0.11481481 0.04444444 0.62592593 0.15925926]
[0.05308642 0.09567901 0.02777778 0.69382716 0.12962963]
[0.04320988 0.07901235 0.02654321 0.7462963  0.10493827]
[0.03497942 0.0654321  0.02160494 0.79012346 0.08786008]
[0.02928669 0.05397805 0.01748971 0.8266118  0.07263374]
[0.02421125 0.0446845  0.01464335 0.85665295 0.05980796]
[0.01993599 0.03692273 0.01210562 0.88147005 0.04956561]
[0.01652187 0.03052507 0.00996799 0.90202713 0.04095793]
[0.01365264 0.02523624 0.00826094 0.91900244 0.03384774]
[0.01128258 0.02086255 0.00682632 0.93303866 0.02798989]
[0.00932996 0.01724669 0.00564129 0.94464407 0.02313799]
[0.00771266 0.01425807 0.00466498 0.95423716 0.01912712]
[0.00637571 0.01178703 0.00385633 0.96216786 0.01581307]
[0.00527102 0.00974432 0.00318785 0.96872433 0.01307248]
[0.00435749 0.00805562 0.002635

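(Solution for Figure 5.6: [0,0,1,0])

The result is a 100% chance of ending up in the spider trap. In the five-page example above the trap is D, whose entry approaches 1 while the others approach 0.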
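
For the four-page graph of Figure 5.6, a short sketch confirms that all of the probability drains into C. The diagonal check used here only finds single-page traps like this one; a trap made of several pages would need a proper graph search.

```python
import numpy as np

# Transition matrix of Figure 5.6: C links only to itself
M = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   1, 1/2],
              [1/3, 1/2, 0, 0  ]])
pages = ["A", "B", "C", "D"]

# A page whose diagonal entry is 1 sends all of its probability to itself
self_only = [p for i, p in enumerate(pages) if np.isclose(M[i, i], 1)]

v = np.full(4, 1 / 4)
for _ in range(100):
    v = M @ v

print(self_only)
print(v.round(6))
```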

## Taxation
To resolve both problems, we use a process called __taxation__. We modify the PageRank algorithm by giving each random surfer a small probability of *teleporting* to a random page. If $\beta$ (the taxation parameter) is the probability that a surfer is not teleported, then the update equation becomes
$$\mathbf{v}' = \beta M\mathbf{v} + \frac{1-\beta}{n}\mathbf{e}$$

where $\mathbf{e}$ is a vector of 1's the same size as $\mathbf{v}$ and $n$ is the size of $\mathbf{v}$ (the number of webpages in question). The first term is the surfer following a link, which happens with probability $\beta$. The second term spreads the remaining probability $1-\beta$ evenly over all $n$ pages (e.g. the user stops the current line of search and goes somewhere else). Generally, the probability of not being teleported is set between $\beta=0.7$ and $\beta=0.9$.

In [8]:
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0,   0, 1/3],  # Linked page A
                              [1/2,   0, 1/3,   0, 1/3],  #             B
                              [1/2,   0,   0,   0,   0],  #             C
                              [  0,   0, 1/3,   1, 1/3],  #             D
                              [  0,   1, 1/3,   0,   0]]) #             E

# Starting vector
v = np.array([1/5, 1/5, 1/5, 1/5, 1/5])

# Teleporting vector
e = np.array([1, 1, 1, 1, 1])
beta = 0.8

# Web surfer steps
for i in range(20):
    v = beta*np.matmul(transition_matrix, v) + e*(1-beta)/len(e)
    print(v)

[0.09333333 0.22666667 0.12       0.30666667 0.25333333]
[0.10755556 0.17688889 0.07733333 0.38488889 0.25333333]
[0.10755556 0.1712     0.08302222 0.43608889 0.20213333]
[0.09390222 0.1590637  0.08302222 0.46491259 0.19909926]
[0.09309314 0.15279328 0.07756089 0.48716247 0.18939022]
[0.09050406 0.14842422 0.07723725 0.50091694 0.18291753]
[0.08877801 0.14557623 0.07620162 0.51010816 0.17933597]
[0.08782293 0.14365456 0.0755112  0.51622989 0.17678142]
[0.08714171 0.1424072  0.07512917 0.52026194 0.17505997]
[0.08668266 0.14157379 0.07485668 0.52292666 0.17396021]
[0.08638939 0.14102424 0.07467306 0.5246925  0.17322081]
[0.08619222 0.14066079 0.07455576 0.52585903 0.17273221]
[0.08606192 0.14042034 0.07447689 0.52663068 0.17241017]
[0.08597604 0.14026132 0.07442477 0.52714109 0.17219678]
[0.08591914 0.14015616 0.07439042 0.52747862 0.17205566]
[0.08588151 0.14008661 0.07436766 0.52770185 0.17196238]
[0.08585663 0.14004061 0.0743526  0.52784949 0.17190066]
[0.08584018 0.14001019 0.074342

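(For the four-page graph of Figure 5.6 with $\beta=0.8$, the solution is [15/148, 19/148, 95/148, 19/148]. The five-page example above approaches roughly [0.086, 0.140, 0.074, 0.528, 0.172].)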
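
The fractions with denominator 148 can be checked on the four-page graph of Figure 5.6 in two ways. Setting $\mathbf{v}' = \mathbf{v}$ in the taxation equation gives the linear system $(I - \beta M)\mathbf{v} = \frac{1-\beta}{n}\mathbf{e}$, which can be solved directly. The sketch below compares that with plain iteration; the function name `pagerank` and the tolerance are choices made for this example.

```python
import numpy as np

# Transition matrix of Figure 5.6: C is a spider trap
M = np.array([[0,   1/2, 0, 0  ],
              [1/3, 0,   0, 1/2],
              [1/3, 0,   1, 1/2],
              [1/3, 1/2, 0, 0  ]])

def pagerank(M, beta=0.8, tol=1e-12, max_iter=1000):
    """Iterate v <- beta*M*v + (1-beta)/n * e until the change is below tol."""
    n = M.shape[0]
    v = np.full(n, 1 / n)
    teleport = np.full(n, (1 - beta) / n)
    for _ in range(max_iter):
        v_next = beta * (M @ v) + teleport
        if np.abs(v_next - v).sum() < tol:
            return v_next
        v = v_next
    return v

beta, n = 0.8, M.shape[0]
v_iter = pagerank(M, beta)

# Direct solution of (I - beta*M) v = (1-beta)/n * e
v_exact = np.linalg.solve(np.eye(n) - beta * M, np.full(n, (1 - beta) / n))

print(v_iter * 148)   # close to [15, 19, 95, 19]
print(v_exact * 148)
```

The direct solve is only practical for small examples; on a web-scale graph the iteration is the realistic option.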

We have looked at how PageRank is used based on links to other pages. Every search engine has a more complicated algorithm based on more components:
* Number of pages with links to that page
* Frequency of words used
* Relevance of word usage based on words around it (Natural Language Processing)
* ...etc...

Google is said to have over 250 different components to their algorithm. Normally, the weighting of properties is such that unless all the search terms are present, a page has very little chance of being high on the list.


## 5.2 Efficient Computation of PageRank
The real web is far larger than the examples above, which raises two major problems:
1. With billions of websites ($n$), the $n\times n$ matrix has over $10^{18}$ elements. It is very sparse, since no website links to more than a handful of the others.
2. We could use MapReduce to do the calculation, but the basic approaches are not sufficient to avoid heavy use of disk.

### Representing the matrix
A dense matrix representing $n$ websites takes $n^2$ bytes, even at one byte per element. However, if we store only the nonzero entries, then we don't need to store the entire (sparse) matrix. The space needed is then no longer quadratic in the number of pages, but linear in the number of links.

In addition, every value for a link within a page will have the same value: 1 divided by the out-degree of the page (the number of links on the page). Using the links from Figure 5.1,

$$M=\begin{bmatrix}
  0 & 1/2 & 1 & 0 \\
  1/3 & 0 & 0 & 1/2 \\
  1/3 & 0 & 0 & 1/2 \\
  1/3 & 1/2 & 0 & 0 \end{bmatrix}$$

The matrix $M$ has 16 elements, so it would take 16 bytes of memory at one byte per element. If we instead store only the nonzero entries, organized by source page,

| Source | Degree | Value | Destinations |
| ------ | ------ | ----- | ------------ |
| A      | 3      | 1/3   | B,C,D        |
| B      | 2      | 1/2   | A,D          |
| C      | 1      | 1     | A            |
| D      | 2      | 1/2   | B,C          |

The sparse list of (destination, value) pairs, grouped by source A, B, C, D, would be:
* (B, 1/3)
* (C, 1/3)
* (D, 1/3)
* (A, 1/2)
* (D, 1/2)
* (A, 1)
* (B, 1/2)
* (C, 1/2)

That is 8 entries. If each takes 4 bytes, then that is 32 bytes of memory. For this example, the sparse list takes up more space. However, the dense matrix grows with the square of the number of pages, while the list grows only with the number of links. If we add two more sites with 2 links each, $M$ increases to 36 bytes and the list to 12 entries (48 bytes); for a web of billions of pages with a handful of links each, the list is vastly smaller.
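
The Source/Degree/Destinations table above is enough to carry out one PageRank step without ever building $M$. The following is a minimal sketch of that idea for Figure 5.1; the dictionary layout and the name `sparse_step` are assumptions made for this example.

```python
# Sparse representation of Figure 5.1: source -> (out-degree, destinations)
sparse_links = {
    "A": (3, ["B", "C", "D"]),
    "B": (2, ["A", "D"]),
    "C": (1, ["A"]),
    "D": (2, ["B", "C"]),
}

def sparse_step(sparse_links, v):
    """One step v' = M v using only the link lists; M is never built."""
    v_next = {page: 0.0 for page in v}
    for source, (degree, destinations) in sparse_links.items():
        share = v[source] / degree      # each link carries 1/degree of v[source]
        for dest in destinations:
            v_next[dest] += share
    return v_next

v = {page: 0.25 for page in sparse_links}
v = sparse_step(sparse_links, v)
print(v)  # A gets 9/24; B, C, and D get 5/24 each
```

The work per step is proportional to the number of links rather than to $n^2$.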

## 5.3 Topic-Sensitive PageRank
So far, PageRank with teleporting gives the web surfer an equal probability of teleporting to any page. But could we make this even better? For example, the term *Diamondback* could refer to any of the following:
* The Diamondback rattlesnake (animal)
* A Diamondback Mountain Bike (recreation)
* The Arizona Diamondbacks MLB team (sports)

If the system knows a little more about the user, perhaps the PageRank could give better results.

All topics for websites can be divided into specific categories. One such useful set of 16 top-level categories comes from the Open Directory (DMOZ). Once we know the topic of each website, we can restrict teleporting to the pages on a chosen topic. Let $S$ be this teleport set, and $\mathbf{e}_S$ the vector that has 1 for each element in $S$ and 0 for all other elements. We calculate the __Topic-Sensitive PageRank__ as:
$$\mathbf{v}' = \beta M\mathbf{v} + \frac{1-\beta}{|S|}\mathbf{e}_S$$

Let us say that $S=\{C,D\}$, meaning the only sites teleported to are sites $C$ and $D$. Let's also leave the taxation parameter as $\beta=0.8$.

In [9]:
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0,   0,   0,   0, 1/3],  # Linked page A
                              [1/2,   0, 1/3,   0, 1/3],  #             B
                              [1/2,   0,   0, 1/2,   0],  #             C
                              [  0,   0, 1/3, 1/2, 1/3],  #             D
                              [  0,   1, 1/3,   0,   0]]) #             E

# Starting vector
v = np.array([0, 0, 1/2, 1/2, 0])

# Teleporting vector
es = np.array([0, 0, 1, 1, 0])
beta = 0.8

# Web surfer steps
for i in range(20):
    v = beta*np.matmul(transition_matrix, v) + es*(1-beta)/sum(es)
    print(v)

[0.         0.13333333 0.3        0.43333333 0.13333333]
[0.03555556 0.11555556 0.27333333 0.38888889 0.18666667]
[0.04977778 0.13688889 0.26977778 0.37822222 0.16533333]
[0.04408889 0.13594074 0.2712     0.36731852 0.18145185]
[0.04838716 0.13834272 0.26456296 0.36763457 0.18107259]
[0.04828602 0.13819101 0.26640869 0.36588998 0.1812243 ]
[0.04832648 0.13868321 0.2656704  0.36572479 0.18159513]
[0.04842537 0.1386014  0.26562051 0.36556072 0.18179201]
[0.04847787 0.13868015 0.26559444 0.36553429 0.18171325]
[0.04845687 0.1386732  0.26560486 0.36549577 0.1817693 ]
[0.04847181 0.13868252 0.26558105 0.36549808 0.18176652]
[0.04847107 0.13868141 0.26558796 0.36549192 0.18176763]
[0.04847137 0.13868325 0.2655852  0.36549159 0.18176859]
[0.04847162 0.13868289 0.26558519 0.36549098 0.18176932]
[0.04847182 0.13868318 0.26558504 0.36549093 0.18176903]
[0.04847174 0.13868315 0.2655851  0.36549079 0.18176923]
[0.04847179 0.13868318 0.26558501 0.3654908  0.18176921]
[0.04847179 0.13868318 0.265585
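
The cell above hard-codes the teleport vector `es = [0, 0, 1, 1, 0]`, which corresponds to teleporting only to pages C and D. A reusable version might look like the sketch below, where the function name `topic_sensitive_pagerank` and the choice of 100 steps are assumptions made for this example and the teleport set is given as a list of page indices.

```python
import numpy as np

def topic_sensitive_pagerank(M, teleport_set, beta=0.8, steps=100):
    """Iterate v <- beta*M*v + (1-beta)/|S| * e_S, starting from e_S/|S|."""
    n = M.shape[0]
    e_s = np.zeros(n)
    e_s[list(teleport_set)] = 1          # 1 for pages in S, 0 elsewhere
    v = e_s / e_s.sum()
    for _ in range(steps):
        v = beta * (M @ v) + (1 - beta) * e_s / e_s.sum()
    return v

# Same five-page transition matrix as the cell above (columns A, B, C, D, E)
M = np.array([[  0,   0,   0,   0, 1/3],
              [1/2,   0, 1/3,   0, 1/3],
              [1/2,   0,   0, 1/2,   0],
              [  0,   0, 1/3, 1/2, 1/3],
              [  0,   1, 1/3,   0,   0]])

v = topic_sensitive_pagerank(M, teleport_set=[2, 3])  # indices of C and D
print(v)
```

Changing `teleport_set` is then all that is needed to bias the ranking toward a different topic.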