# Lecture 12 TrustRank
__Math 3280: Data Mining__

__Outline__
1. Spam Farms
2. TrustRank
3. Spam Mass

__Reading__ 
* Leskovec, Chapter 5
* [PageRank: Link Analysis Explanation and Python Implementation from Scratch, *Towards Data Science*](https://towardsdatascience.com/pagerank-3c568a7d2332)  
-----

### 5.4 Link Spam
We talked earlier about term spam, where spammers add specific words to their webpage to rank higher in search results. PageRank combats term spam effectively because it ranks a page by the links pointing to it rather than by the words on it. However, spammers have found a way around PageRank. This is called __link spam__: increasing the PageRank of a target page by means of a collection of pages called a __spam farm__.

From a spammer's point of view, the web has three types of pages:
1. Inaccessible pages
    * Spammers cannot affect these pages
    * They generally don't link to spam pages
2. Accessible pages
    * Spammers don't control these pages, but can still add links to them
    * For example, on a blog, the spammer can't change the post itself, but they can add a comment to the effect of "Good points! I have some additional comments at [link to spam page]."
3. Own pages
    * Pages that the spammer owns and controls, which link to the target page(s)
  
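To analyze a spam farm, let's define the following:
* $t$: The spammer's target page
* $x$: The PageRank contributed to $t$ by accessible pages
* $m$: The number of supporting pages in the spam farm
* $n$: The total number of pages on the web
* $\beta$: The taxation parameter, i.e. the probability that the surfer follows a link instead of teleporting
* $y$: The total (unknown) PageRank of the target page $t$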

In the spam farm, $t$ links to each of the $m$ supporting pages and each supporting page links only back to $t$. The PageRank $z_i$ of a single supporting page is therefore the probability of arriving from $t$ ($\beta y/m$) plus the probability of being teleported to that page ($(1-\beta)/n$).
$$z_i = \frac{\beta y}{m} + \frac{1-\beta}{n}$$

The total PageRank of the $m$ supporting pages is then $z=mz_i$.

We can now calculate the total PageRank for $t$:
1. There is no contribution from Inaccessible Pages
2. The contribution from Accessible Pages is simply $x$
3. The contribution from the spam farm is $\beta z$, because each supporting page passes the fraction $\beta$ of its PageRank to $t$
4. The target page's own teleport share is $(1-\beta)/n$
    * This term is so small that it can be dropped
$$y = x + \beta z + \cancel{\frac{1-\beta}{n}}= x + \beta m\left(\frac{\beta y}{m} + \frac{1-\beta}{n}\right) + \cancel{\frac{1-\beta}{n}}$$
$$y = x + \beta^2 y + \beta(1-\beta)\frac{m}{n}$$

Solving for $y$,
$$y - \beta^2 y = x + \beta(1-\beta)\frac{m}{n}$$
$$y = \frac{x}{1-\beta^2} + \frac{\beta(1-\beta)}{1-\beta^2}\frac{m}{n}$$
$$y = \frac{x}{1-\beta^2} + \frac{\beta(1-\beta)}{(1-\beta)(1+\beta)}\frac{m}{n}$$
$$y = \frac{x}{1-\beta^2} + \frac{\beta}{1+\beta}\frac{m}{n}$$

Thus we see that the PageRank contribution from Accessible Pages is $x/(1-\beta^2)$ and the contribution from the spam farm is proportional to the ratio of the farm's size to the entire internet $(m/n)$.
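
As a sanity check on the derivation, here is a minimal sketch that builds a small toy web containing a spam farm and compares the simulated PageRank of $t$ with the formula. The layout (a ring of ordinary pages with a single accessible page), the sizes `n = 200` and `m = 20`, and the helper names `spam_farm_web` and `taxed_pagerank` are all assumptions made for illustration.

```python
import numpy as np

def spam_farm_web(n, m):
    """Column-stochastic matrix of a toy web with one spam farm (hypothetical layout).

    Page 0 is the target t, pages 1..m are supporting pages, and the
    remaining pages form a ring. The first ring page also links to t and
    plays the role of the only accessible page.
    """
    first = m + 1                      # index of the accessible page
    ring_size = n - first
    M = np.zeros((n, n))
    M[1:first, 0] = 1 / m              # t links to every supporting page
    M[0, 1:first] = 1                  # each supporting page links only to t
    for p in range(first + 1, n):      # ordinary ring pages link to their successor
        M[first + (p - first + 1) % ring_size, p] = 1
    M[0, first] = 0.5                  # accessible page: one link to t ...
    M[first + 1, first] = 0.5          # ... and one link to its ring successor
    return M

def taxed_pagerank(M, beta, steps=300):
    """Power iteration with uniform teleporting."""
    n = M.shape[0]
    r = np.full(n, 1 / n)
    for _ in range(steps):
        r = beta * (M @ r) + (1 - beta) / n
    return r

n, m, beta = 200, 20, 0.85
r = taxed_pagerank(spam_farm_web(n, m), beta)

y_simulated = r[0]
x = beta * r[m + 1] / 2                # the accessible page has two out-links
y_formula = x / (1 - beta**2) + beta / (1 + beta) * m / n
dropped = (1 - beta) / n / (1 - beta**2)   # effect of the cancelled teleport term

print(f"simulated y = {y_simulated:.6f}")
print(f"formula   y = {y_formula:.6f} (plus {dropped:.6f} from the dropped term)")
```

The two values differ only by the cancelled term $(1-\beta)/n$ divided by $1-\beta^2$, which shrinks as $n$ grows.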

__Example__: If we calculate the PageRank of a *link spam* target page using a taxation parameter of $\beta = 0.85$, then the contribution from accessible pages $x$ is amplified by a factor of
$$\frac{1}{1-\beta^2} = \frac{1}{0.2775} \approx 3.60 = 360\%$$

while the spam farm itself contributes the ratio $m/n$ scaled by the factor
$$\frac{\beta}{1+\beta} = \frac{0.85}{1.85} \approx 0.46 = 46\%$$
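
To see how the two factors change with the taxation parameter, the short sketch below evaluates them for a few values of $\beta$. The function name `spam_farm_factors` and the extra values 0.80 and 0.90 are chosen here for illustration only.

```python
def spam_farm_factors(beta):
    """Return (multiplier of x, multiplier of m/n) for a given beta."""
    return 1 / (1 - beta**2), beta / (1 + beta)

for beta in (0.80, 0.85, 0.90):
    amplification, farm_multiplier = spam_farm_factors(beta)
    print(f"beta={beta:.2f}: x is multiplied by {amplification:.2f}, "
          f"m/n by {farm_multiplier:.2f}")
```

The amplification of $x$ grows quickly as $\beta$ approaches 1, while the multiplier of $m/n$ stays below $1/2$.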

### Combating Link Spam
We have found that using PageRank discourages the use of term spam. How can we combat link spam? We do this simply with two tools:
1. TrustRank
2. Spam Mass

__TrustRank__ is simply a topic-sensitive PageRank in which the teleport set (the "topic") is a set of *trusted* pages.
* The likelihood of a trusted page linking to a spam page is very small
* Two common approaches for determining trusted pages:
   1. Humans examine pages and determine if they are trustworthy or not
       * Requires a lot of hands-on work, which means pages are sent in small batches to people
   2. Picking a domain whose membership is controlled (.edu, .mil, .gov, .ac.il, .edu.sg)
 
Two major issues with TrustRank are (1) building the trusted set by human inspection requires a lot of work, so it can only be done in small batches, and (2) every good page needs to be reachable from the trusted set, which isn't always the case. The bottom line is that TrustRank is very effective at filtering out link spam, but it also filters out valid but less common webpages along the way.
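
The second approach, picking controlled domains, is easy to automate. The sketch below filters a list of URLs by host-name suffix; the URLs are made up for illustration, and which suffixes to trust is a policy decision rather than something the method prescribes.

```python
from urllib.parse import urlsplit

# Suffixes taken from the list above; which ones to trust is a policy choice
TRUSTED_SUFFIXES = (".edu", ".mil", ".gov", ".ac.il", ".edu.sg")

def is_trusted(url):
    """Return True if the URL's host name ends with a controlled-domain suffix."""
    host = urlsplit(url).hostname or ""
    return host.endswith(TRUSTED_SUFFIXES)

# Hypothetical URLs for illustration
candidates = [
    "https://www.example.edu/courses",
    "https://example.com/cheap-pills.edu",   # ".edu" appears only in the path
    "http://research.example.ac.il/",
]
trusted_set = [url for url in candidates if is_trusted(url)]
print(trusted_set)
```

Checking the host name rather than the whole URL keeps a spammer from qualifying just by putting ".edu" somewhere in a path.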

The idea of __Spam Mass__ is to calculate the fraction of a page's PageRank that comes from spam. If we assume the PageRank ($r$) is the sum of the TrustRank ($t$, not to be confused with the target page above) and a spam contribution, then the contribution from spam is $r-t$. The fraction of the PageRank from spam is then
$$\text{Spam Mass} = \frac{r-t}{r} = 1 - \frac{t}{r}$$

A true fraction would lie between 0 and 1. However, the TrustRank can be larger than the PageRank, which gives a negative number. So,
* If the Spam Mass is negative, the page is probably trusted
* If the Spam Mass is close to 0, most of its PageRank comes from trusted pages, so it is probably not spam
* If the Spam Mass is close to 1, it has a very low TrustRank relative to its PageRank, so it is likely spam
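
The rules above can be sketched as a small helper. The cut-off `threshold=0.5` and the three sets of scores are hypothetical values picked for illustration; in practice the threshold would have to be tuned.

```python
import numpy as np

def label_by_spam_mass(pagerank, trustrank, threshold=0.5):
    """Label pages by spam mass; `threshold` is an illustrative cut-off, not a standard value."""
    mass = 1 - np.asarray(trustrank, dtype=float) / np.asarray(pagerank, dtype=float)
    labels = np.where(mass > threshold, "likely spam", "probably not spam")
    return mass, labels

# Hypothetical scores for three pages with equal PageRank
mass, labels = label_by_spam_mass(pagerank=[0.010, 0.010, 0.010],
                                  trustrank=[0.012, 0.009, 0.001])
for m_i, label in zip(mass, labels):
    print(f"spam mass {m_i:+.2f}: {label}")
```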

The following three cells use the four-page web of Figure 5.1 to calculate the PageRank (without taxation), the TrustRank (using $B$ and $D$ as trusted pages and $\beta = 0.80$), and the Spam Mass.

In [20]:
### PageRank ###
# Transition Matrix
import numpy as np
#              Starting Page     A,   B,   C,   D
transition_matrix = np.array([[  0, 1/2,   1,   0],  # Linked page A
                              [1/3,   0,   0, 1/2],  #             B
                              [1/3,   0,   0, 1/2],  #             C
                              [1/3, 1/2,   0,   0]]) #             D

# Starting vector
pagerank = np.array([1/4, 1/4, 1/4, 1/4])

# Web surfer steps
for i in range(30):
    pagerank = np.matmul(transition_matrix, pagerank)

print(f"PageRank = {pagerank}")

PageRank = [0.33333333 0.22222222 0.22222222 0.22222222]


In [21]:
### TrustRank ###
# Starting vector
trustrank = np.array([0, 1/2, 0, 1/2])

# Teleporting vector
es = np.array([0, 1, 0, 1])
beta = 0.80

# Web surfer steps
for i in range(20):
    trustrank = beta*np.matmul(transition_matrix, trustrank) + es*(1-beta)/sum(es)

print(f"TrustRank = {trustrank}")

TrustRank = [0.25714286 0.28095238 0.18095238 0.28095238]
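
As a cross-check on the iteration, the TrustRank vector is the fixed point of $v = \beta M v + (1-\beta)e_S/|S|$, so it can also be obtained from one linear solve. This is a sketch that repeats the matrix under the shorter name `M` so it runs on its own; the name `trustrank_exact` is introduced here only for the comparison.

```python
import numpy as np

# Same four-page web as above (columns are the starting pages A, B, C, D)
M = np.array([[  0, 1/2,   1,   0],
              [1/3,   0,   0, 1/2],
              [1/3,   0,   0, 1/2],
              [1/3, 1/2,   0,   0]])
beta = 0.80
trusted = np.array([0, 1, 0, 1])
teleport = (1 - beta) * trusted / trusted.sum()

# Fixed point of v = beta*M v + teleport  <=>  (I - beta*M) v = teleport
trustrank_exact = np.linalg.solve(np.eye(4) - beta * M, teleport)
print(f"TrustRank (linear solve) = {trustrank_exact}")
```

The result should agree with the iterated values, which correspond to the fractions $54/210$, $59/210$, $38/210$, and $59/210$.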


In [24]:
### Spam Mass ###
# Spam mass = 1 - TrustRank/PageRank, rounded to three decimals for readability
print(f"Spam Mass = {np.round(1 - (trustrank/pagerank), 3)}")

Spam Mass = [ 0.229 -0.264  0.186 -0.264]


* Since $B$ and $D$ are the trusted pages, their spam masses are negative
* $A$ and $C$ each receive a link from a trusted page, so their spam masses are small and positive
* If a page $E$ were introduced that no trusted page links to, its spam mass would likely be closer to 1

In [25]:
# Transition Matrix
#              Starting Page     A,   B,   C,   D,   E
transition_matrix = np.array([[  0, 1/2, 1/2,   0,   0],  # Linked page A
                              [1/3,   0,   0, 1/2,   0],  #             B
                              [1/3,   0,   0, 1/2,   0],  #             C
                              [1/3, 1/2,   0,   0,   0],  #             D
                              [  0,   0, 1/2,   0,   1]]) #             E <-- SPAM

# Starting vector
pagerank = np.array([1/5, 1/5, 1/5, 1/5, 1/5])
trustrank = np.array([1/5, 1/5, 1/5, 1/5, 1/5])
es = np.array([0, 1, 0, 1, 0])

# Web surfer steps
for i in range(30):
    pagerank = np.matmul(transition_matrix, pagerank)
    trustrank = beta*np.matmul(transition_matrix, trustrank) + es*(1-beta)/sum(es)

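print(f"Spam Mass = {1 - (trustrank/pagerank)}")

Spam Mass = [-30.48211026 -56.124792   -31.399158   -56.124792     0.73347195]

* $E$ links only to itself, so it acts as a spider trap. Because this PageRank loop uses no taxation, almost all of the PageRank has drained into $E$ after 30 steps
* The large negative values for $A$ through $D$ are a result of that drain: their TrustRank is now many times their remaining PageRank, and the exact magnitudes depend on the number of steps
* $E$ has the only positive spam mass, about 0.73, which marks it as the likely spam page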
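
Because $E$ links only to itself, an untaxed PageRank never settles on useful values for the other pages. One possible variation, which is not part of the original cells, is to tax PageRank with the same $\beta$ as TrustRank so that both vectors converge. The sketch below repeats the five-page matrix under the name `M`; the names `pagerank_taxed` and `trustrank_taxed` are introduced here for illustration.

```python
import numpy as np

# Same five-page web as above (columns are the starting pages A, B, C, D, E)
M = np.array([[  0, 1/2, 1/2,   0,   0],
              [1/3,   0,   0, 1/2,   0],
              [1/3,   0,   0, 1/2,   0],
              [1/3, 1/2,   0,   0,   0],
              [  0,   0, 1/2,   0,   1]])
beta = 0.80
n = M.shape[0]
trusted = np.array([0, 1, 0, 1, 0])

pagerank_taxed = np.full(n, 1 / n)
trustrank_taxed = trusted / trusted.sum()

for _ in range(200):
    # PageRank teleports uniformly; TrustRank teleports only to B and D
    pagerank_taxed = beta * (M @ pagerank_taxed) + (1 - beta) / n
    trustrank_taxed = beta * (M @ trustrank_taxed) + (1 - beta) * trusted / trusted.sum()

spam_mass_taxed = 1 - trustrank_taxed / pagerank_taxed
for page, mass in zip("ABCDE", spam_mass_taxed):
    print(f"{page}: {mass:+.3f}")
```

With taxation, $E$ is still the only page with a clearly positive spam mass (about 0.43), $B$ and $D$ are negative (about -0.76), and $A$ and $C$ come out at essentially 0, so the ranking no longer depends on how many steps are run.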


-----
## Homework
* Exercise 5.1.1
* Exercise 5.1.2
* Exercise 5.1.7
* Exercise 5.2.1
* Exercise 5.3.1 - Use $\beta=0.82$
* Exercise 5.4.2