<div style="text-align: right">© Moon</div>

# Project Report: Evaluation of the Centrality Algorithm, PageRank Part 2
***

## Introduction

In the previous notebook, we briefly introduced the PageRank algorithm and examined the theoretical and empirical complexity of a simplified version. PageRank algorithms are important members of the family of centrality algorithms: they rank the vertices of a graph by measuring the influence of each node based on proportional rank.

We showed empirically that the algorithm can run in O(N^2 * I) time, where N is the total number of nodes in the given graph and I is the number of iterations.


In this notebook, we dive deeper and demonstrate how the PR algorithm is used in industry.

This notebook demonstrates:
- The PR algorithm and adjustments for its limitations
- PR implementation on social media
- Topic-Specific (Personalized) PageRank
- Web spam detection algorithms

***

## The PageRank Algorithm

The PageRank algorithm gives each page a rating of its
importance, which is a recursively defined measure whereby a
page becomes important if important pages link to it. 

The PageRank of a page is the probability that a random surfer
lands on that page; the surfer is more likely to end up on important pages.

The behavior of the random surfer is an example of a Markov
process, in which the next state depends only on the current state of the system.
The surfer moves from state to state according to a probability distribution that gives the likelihood of moving from each state to every other possible state.
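
The Markov property described above can be illustrated with a minimal sketch. The three-page web `links` and the helper `step` are hypothetical names introduced only for this example; each call to `step` computes the surfer's next probability distribution from the current one and nothing else.

```python
# Hypothetical 3-page web: each page maps to the pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def step(dist, links):
    """Advance the surfer's probability distribution by one click.

    The result depends only on the current distribution (Markov property).
    """
    nxt = dict.fromkeys(links, 0.0)
    for page, prob in dist.items():
        # The surfer picks one of the page's out-links uniformly at random.
        for target in links[page]:
            nxt[target] += prob / len(links[page])
    return nxt


# Start with the surfer on page A and follow links for 100 clicks.
dist = {"A": 1.0, "B": 0.0, "C": 0.0}
for _ in range(100):
    dist = step(dist, links)

print(dist)
```

On this toy graph the distribution settles near 0.4 for A and C and 0.2 for B, which is the long-run share of time the surfer spends on each page.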


### Algorithm Concepts
1. Start with a set of pages. 
2. Crawl the web to determine the link structure. 
3. Assign each page an initial rank of 1 / N. 
4. Update the rank of each page by adding up the
weight of every page that links to it divided by the number
of links emanating from the referring page.
5. If a page has no outgoing links, redistribute its rank equally among the other pages in
the graph. 
6. Apply this redistribution to every page in the graph. 
7. Repeat this process until the page ranks stabilize. 
8. In practice, the PageRank algorithm adds a damping factor
at each stage to model the fact that users stop searching. 

### Algorithm Steps
The implementation of the classic PageRank algorithm uses an iterative method. At each iteration step, the PageRank values of all nodes in the graph are computed.

1. Initialize the PageRank of every node with a value of 1/N.
2. Iterate through the graph. For each iteration, update the PageRank of every node in the graph.
   1. The first page is reached only through the random walk.
   2. Other pages can be reached through the random walk or through inter-page links.
   3. Sum up the proportional rank from all of the node's in-neighbors.
   4. Update the PageRank with the weighted sum of the proportional rank and the random walk.
3. Normalize the PageRank values when the graph contains a terminal point (a node with no outgoing links). The PageRank values converge after enough iterations.
4. Return the PR scores.
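
The steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used later in the notebook: the function name `simple_pagerank`, the dict-of-lists graph format, and the uniform redistribution of rank from pages without out-links are assumptions made for the example.

```python
def simple_pagerank(links, d=0.85, max_iter=100, tol=1.0e-6):
    """Iterative PageRank on a dict {page: [pages it links to]}.

    Assumes every link target also appears as a key of `links`.
    """
    n = len(links)
    rank = dict.fromkeys(links, 1.0 / n)  # step 1: start at 1/N
    for _ in range(max_iter):  # step 2: iterate
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in links if not links[p])
        # Random-walk share plus the redistributed dangling rank.
        new = dict.fromkeys(links, (1.0 - d) / n + d * dangling / n)
        for page, targets in links.items():
            for target in targets:
                # Proportional rank passed along each out-link.
                new[target] += d * rank[page] / len(targets)
        err = sum(abs(new[p] - rank[p]) for p in links)
        rank = new
        if err < n * tol:
            break
    return rank


# Hypothetical 4-page graph; D has no in-links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = simple_pagerank(links)
print(rank)
```

Page D receives no links, so its score stays at the random-walk share (1 - d) / N, while C, which is linked from every other page, ends up with the highest score.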

***

## Limitations of the early PageRank 

The early PageRank has the following limitations:
- [5] Rank sinks: A rank sink occurs when a page does not link out. Such a page accumulates rank and refuses to share it.

- [5] Hoarding: A group of pages that link only to each other will also monopolize PageRank, introducing error.

- [5] Circular references: A pair of pages that link only to each other and to no other page. The iterative process can fail to converge, creating an infinite loop.
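
A minimal sketch of two of these limitations, using an uncorrected update (no damping and no dangling-node fix). The helper `naive_update` and both toy graphs are hypothetical and exist only for this illustration.

```python
def naive_update(rank, links):
    """One uncorrected PageRank round: no damping, no dangling-node fix."""
    new = dict.fromkeys(links, 0.0)
    for page, targets in links.items():
        for target in targets:
            new[target] += rank[page] / len(targets)
    return new


# Rank sink: C has no out-links, so the rank it receives is never passed on.
sink = {"A": ["B"], "B": ["C"], "C": []}
rank = dict.fromkeys(sink, 1.0 / 3)
for _ in range(3):
    rank = naive_update(rank, sink)
total_after_sink = sum(rank.values())

# Circular reference: A and B link only to each other.
cycle = {"A": ["B"], "B": ["A"]}
r0 = {"A": 0.9, "B": 0.1}  # deliberately non-uniform start
r1 = naive_update(r0, cycle)
r2 = naive_update(r1, cycle)

print(total_after_sink, r1, r2)
```

In the first graph the total rank drains away after three rounds. In the second, the two scores swap on every round, so starting from a non-uniform vector the iteration oscillates instead of settling.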


## Adjustment for PR

### First adjustment: Stochasticity Adjustment
The PageRank equation requires summations, which are computationally expensive. To save time, we can use matrices to convert the summations into simpler vector-matrix multiplications.

Matrices also let us take advantage of matrix algebra and Markov chain theory.

In the adjacency matrix, the rows and columns represent pages, and the value (0 or 1) at each intersection indicates whether or not the row page links to the column page. Instead of using 1 to indicate a link, we use 1/x, where x is the number of non-zero elements in that row. This strategy turns the non-zero values into probabilities and creates a row-substochastic matrix. Basically, this means that when you add the values of each row, some of the totals will equal 1 and the rest will equal zero. The zero totals come from the dangling nodes (rank sinks). For a row-stochastic matrix, all the rows must add up to 1.

In addition to the problems mentioned above, leaving the matrix unmodified does not guarantee that the values will ever converge, no matter how many iterations are performed. To fix these problems, the first adjustment was introduced. It replaces all zero rows (dangling nodes/rank sinks) with $\frac{1}{n} e^T$ ($e^T$ is a row vector of all 1s), making the matrix stochastic. Let's call this modified matrix S.
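
A minimal sketch of this adjustment, using plain nested lists as matrices. The function names `hyperlink_matrix` and `stochastic_fix` and the 3-page adjacency matrix are assumptions made for the example.

```python
def hyperlink_matrix(adj):
    """Row-normalise a 0/1 adjacency matrix; all-zero rows stay zero."""
    H = []
    for row in adj:
        out = sum(row)  # number of out-links of this page
        H.append([v / out if out else 0.0 for v in row])
    return H


def stochastic_fix(H):
    """Replace every all-zero row (dangling node) with the uniform row 1/n."""
    n = len(H)
    return [row if any(row) else [1.0 / n] * n for row in H]


# Hypothetical 3-page web; page 2 is a dangling node (no out-links).
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
H = hyperlink_matrix(adj)  # row-substochastic: row sums are 1, 1, 0
S = stochastic_fix(H)      # row-stochastic: every row sums to 1

print([sum(row) for row in H])
print([sum(row) for row in S])
```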


### Second adjustment: Primitivity Adjustment

In addition to solving the problems caused by rank sinks, it is desirable that the PageRank value of all pages is found quickly (in as few iterations as possible). Fortunately, applying the Power Method to a Markov matrix converges to a unique and positive vector called the stationary vector—in our case, the PageRank vector—as long as the matrix is stochastic, irreducible, and aperiodic. (Aperiodicity and irreducibility imply primitivity.)

Intuitively, the primitivity adjustment can be thought of as a random surfer that gets bored sometimes while following the hyperlink structure of the Web, and, instead of following links at random, enters a new URL in the browser navigation bar and continues from there. A proportion of the time he will be following links at random and a proportion of the time he will be 'teleporting' to a new URL.

In order to model this mathematically, a new factor is introduced: α, a scalar between 0 and 1. Page and Brin originally defined α as 0.85. For this suggested α, it means that 85% of the time the surfer is following links at random, and 15% of the time he is entering new URLs in the browser bar.

A new matrix is born from this adjustment. Let's call it G, the Google matrix.

$$G = \alpha S + (1 - \alpha)\,\frac{1}{n}\, e e^T \quad \text{or} \quad G = \alpha S + (1 - \alpha)\, E$$

where $E = \frac{1}{n} e e^T$ is the teleportation matrix (remember that $e^T$ is a row vector of all 1s).

The teleporting is random because the teleportation matrix $E = \frac{1}{n} e e^T$ is uniform, which means that the random surfer is equally likely to jump to any page when he teleports.
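
As a rough sketch of how G can be built and used, the following applies the Power Method to a small Google matrix stored as nested lists. The names `google_matrix` and `power_method`, the tolerance, and the 3-page stochastic matrix `S` are assumptions made for this illustration.

```python
def google_matrix(S, alpha=0.85):
    """G = alpha * S + (1 - alpha) * E, with E the uniform matrix 1/n."""
    n = len(S)
    return [
        [alpha * S[i][j] + (1.0 - alpha) / n for j in range(n)]
        for i in range(n)
    ]


def power_method(G, max_iter=100, tol=1.0e-10):
    """Repeat pi <- pi * G from the uniform vector until pi stops changing."""
    n = len(G)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


# Hypothetical row-stochastic matrix S; the last row is a fixed dangling node.
S = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0 / 3, 1.0 / 3, 1.0 / 3],
]
G = google_matrix(S)
pi = power_method(G)

print(pi)
```

Every entry of G is strictly positive, so the surfer can reach any page from any other page in one step, and the returned vector satisfies pi = pi * G up to the chosen tolerance.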

One of the challenges for the designers of any search engine is
ensuring that a commercial interest can’t artificially increase its
ranking by creating many other pages whose only purpose is to
link to that company’s home page.
- Adopting the PageRank algorithm makes it harder for authors to
manipulate the system because the ranking of a page depends
on the prestige of important pages that are typically outside the
control of those who are seeking to game the system.
- Preventing users from manipulating their own web rankings is
an ongoing problem for all search engine companies. To help
ensure that the rankings remain fair, Google must keep the
details of the ranking algorithms secret and change them often
enough to outwit the would-be saboteurs.

When you enter a set of search terms, Google allows you to
search for a sequence of consecutive words by enclosing those
words in quotation marks. In this example, searching for roast
or mules is useless; searching for the quoted string "roast mules"
brings the answer up immediately.

***

## Page Rank Algorithm

***

## Assumption

For each node, take the difference in PR score between the current iteration and the previous one; if this error falls below a given threshold, the computation has converged.

Starting from arbitrary values assigned to each node in the graph, the computation iterates until convergence below a given threshold is achieved.

[6] Convergence is achieved when the error rate for any vertex in the graph falls below a given threshold value. The error rate of a vertex is defined as the difference between the “real” score of the vertex, $PR(V_i)$, and the score computed at iteration $I$, $PR^{I}(V_i)$. Since the real score is not known in advance, the error rate is approximated by the difference between the scores computed at two successive iterations: $PR^{I+1}(V_i) - PR^{I}(V_i)$.


Computing PR is straightforward if scale is disregarded. As the damping factor increases toward 1, convergence slows: the power iteration needs more iterations to fall below the same threshold.
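
The effect of the damping factor on convergence can be measured with a small experiment. This is a sketch under stated assumptions: the helper `iterations_to_converge`, the three-page graph (chosen to have no dangling pages), and the iteration cap of 1000 are all introduced here for illustration.

```python
def iterations_to_converge(links, d, tol=1.0e-6, max_iter=1000):
    """Count power-iteration rounds until the l1 change drops below N * tol.

    Assumes every page in `links` has at least one out-link.
    """
    n = len(links)
    rank = dict.fromkeys(links, 1.0 / n)
    for k in range(1, max_iter + 1):
        new = dict.fromkeys(links, (1.0 - d) / n)
        for page, targets in links.items():
            for target in targets:
                new[target] += d * rank[page] / len(targets)
        # Error approximated by the change between successive iterations.
        err = sum(abs(new[p] - rank[p]) for p in links)
        rank = new
        if err < n * tol:
            return k
    return max_iter


# Hypothetical graph without dangling pages.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
counts = {d: iterations_to_converge(links, d) for d in (0.5, 0.85, 0.95)}

print(counts)
```

On this toy graph the number of rounds grows as d approaches 1, because the change between successive iterations shrinks by roughly a factor of d per round.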

The PageRank algorithm was designed for directed graphs. For this study, we will use only directed graphs generated with the NetworkX library, a damping factor of 0.85, and at most 100 iterations.




The output (a NumPy matrix) represents the transition matrix that describes the Markov chain used in PageRank. For PageRank to converge to a unique solution, there must exist a path between every pair of nodes in the graph. Otherwise, the resulting PR ranking may be invalid.
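
The path condition above (strong connectivity) can be checked before running the iteration. The following is a minimal pure-Python sketch; the names `reachable` and `is_strongly_connected` and the dict-of-lists graph format are assumptions for this example. NetworkX provides `nx.is_strongly_connected` for the same purpose on `DiGraph` objects.

```python
def reachable(start, links):
    """Return the set of pages reachable from `start` by following links."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(links):
    """True if a directed path exists between every ordered pair of pages.

    Assumes every link target also appears as a key of `links`.
    """
    if not links:
        return False
    # Build the reversed graph so we can test "everyone reaches start".
    reverse = {page: [] for page in links}
    for page, targets in links.items():
        for target in targets:
            reverse[target].append(page)
    start = next(iter(links))
    n = len(links)
    return len(reachable(start, links)) == n and len(reachable(start, reverse)) == n


ring = {"A": ["B"], "B": ["C"], "C": ["A"]}
orphan = {"A": ["B"], "B": ["A"], "C": ["A"]}  # nothing links to C

print(is_strongly_connected(ring), is_strongly_connected(orphan))
```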

    """Returns the PageRank of the nodes in the graph.

    PageRank computes a ranking of the nodes in the graph G based on
    the structure of the incoming links. It was originally designed as
    an algorithm to rank web pages.

    Parameters
    ----------
    G : graph
      A NetworkX graph.  Undirected graphs will be converted to a directed
      graph with two directed edges for each undirected edge.

    d : float, optional
      Damping factor for PageRank, default=0.85.

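    personalization: dict, optional
      The "personalization vector" consisting of a dictionary with a
      key for some subset of graph nodes and a personalization value for
      each of those. At least one personalization value must be non-zero.
      If not specified, a node's personalization value will be zero.
      By default, a uniform distribution is used.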

    max_iter : integer, optional
      Maximum number of iterations in power method eigenvalue solver.

    tol : float, optional
      Error tolerance used to check convergence in power method solver.

    weight : key, optional
      Edge data key to use as weight.  If None, weights are set to 1.

    dangling: dict, optional
      The outedges to be assigned to any "dangling" nodes, i.e., nodes without
      any outedges. 
      The dict key is the node the outedge points to and the dict
      value is the weight of that outedge. By default, dangling nodes are given
      outedges according to the personalization vector (uniform if not
      specified). This must be selected to result in an irreducible transition
      matrix. It may be common to have the
      dangling dict to be the same as the personalization dict.


    Returns
    -------
    pagerank : dictionary
       Dictionary of nodes with PageRank as value
    """


## PageRank Implementation on other graphs

In [15]:
import networkx as nx


In [20]:
def pageRank_graph(G, d=0.85, I=100, tol=1.0e-6):
    """Return the PageRank of each node in G, computed by power iteration."""
    if len(G) == 0:
        return {}

    # Work on a directed copy (undirected edges become two directed edges)
    D = G.to_directed()

    # Create a copy in (right) stochastic form; the normalized edge
    # weights are stored under the "weight" edge attribute
    W = nx.stochastic_graph(D, weight="weight")
    # Get the total number of nodes in the graph
    N = W.number_of_nodes()

    # Initialize the PageRank of every node with a value of 1/N | O(N)
    PR = dict.fromkeys(W, 1.0 / N)

    # Assign a uniform personalization vector
    p = dict.fromkeys(W, 1.0 / N)

    # Set dangling_weights to the personalization vector
    dangling_weights = p
    dangling_nodes = [n for n in W if W.out_degree(n, weight="weight") == 0.0]

    # Power iteration: make up to I iterations
    for _ in range(I):
        PRlast = PR
        PR = dict.fromkeys(PRlast.keys(), 0)
        danglesum = d * sum(PRlast[n] for n in dangling_nodes)
        for n in PR:
            # This matrix multiply looks odd because it is
            # doing a left multiply PR^T = PRlast^T * W
            for _, nbr, wt in W.edges(n, data="weight"):
                PR[nbr] += d * PRlast[n] * wt
            PR[n] += danglesum * dangling_weights.get(n, 0) + (1.0 - d) * p.get(n, 0)
        # Check convergence using the l1 norm
        err = sum(abs(PR[n] - PRlast[n]) for n in PR)
        if err < N * tol:
            return PR
    raise nx.PowerIterationFailedConvergence(I)

In [22]:
G = nx.DiGraph(nx.path_graph(4))
pr = pageRank_graph(G, d=0.9)
pr

{0: 0.17241401247723942,
 1: 0.32758598752276064,
 2: 0.32758598752276064,
 3: 0.17241401247723942}
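
The output above can be checked by hand for this small graph. The derivation below is a sketch that assumes the symmetry of the 4-node path: the two end nodes share one score $a$, the two middle nodes share one score $b$, and $2a + 2b = 1$. An end node receives rank only from its single neighbor, which has two out-links, so with damping factor $d$:

$$
\begin{aligned}
a &= \frac{1 - d}{4} + d\,\frac{b}{2}, \qquad b = \frac{1}{2} - a \\
a\left(1 + \frac{d}{2}\right) &= \frac{1 - d}{4} + \frac{d}{4} = \frac{1}{4} \\
a &= \frac{1}{4 + 2d}
\end{aligned}
$$

For $d = 0.9$ this gives $a = 1/5.8 \approx 0.172414$ and $b = 0.5 - a \approx 0.327586$, which matches the computed scores to roughly six decimal places, consistent with the convergence tolerance.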

## References
- [1] A. Langville and C. Meyer, "A survey of eigenvector methods for web information retrieval." http://citeseer.ist.psu.edu/713792.html
- [2] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank citation ranking: Bringing order to the Web." 1999. http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1999-66&format=pdf
