- Algorithm description
- Algorithm code
- Empirical and canonical test dataset
- Performance measures (analytical and empirical)
- Evaluation
- Conclusion


© Moon

# Project Report: Evaluation of the Centrality Algorithm, PageRank

## Introduction

Centrality algorithms are one category of graph algorithms. They identify the important nodes in a given graph, where important nodes are vertices with many direct or indirect connections. One such algorithm is PageRank (PR), which identifies the most important vertices of a graph by measuring the influence of each node based on proportional rank. Nowadays, developers use it in analytics, the web, social networks, and other domains. For example, PR can be used to identify influencers on social media or potential attack targets in a network, and Google has used it to rank websites in its search engine results.

This notebook demonstrates:
- Measuring time complexity theoretically
- Measuring time complexity empirically
- Drawing the graphs with the networkx library

## The Algorithm

The PR values depend on the number of pages, the damping factor, and the number of iterations. PageRank models a probability distribution: the likelihood that a person randomly clicking on links will arrive at any particular page. The probability that the person continues clicking at each step is the damping factor $d$. Computing PR requires iterating over the pages several times so that the approximate PR values approach the theoretical values.
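The random-clicking model described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the notebook: the function name `simulate_surfer`, the example graph, the value `d=0.85`, and the choice to jump to a random page when a page has no out-links are all hypothetical.

```python
import random


def simulate_surfer(out_links, d=0.85, steps=20_000, seed=0):
    """Estimate page importance by counting the visits of a random surfer.

    out_links maps each page to the list of pages it links to (assumed format).
    """
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if links and rng.random() < d:
            # With probability d the surfer keeps clicking on a link.
            current = rng.choice(links)
        else:
            # Otherwise (or on a page without links) jump to a random page.
            current = rng.choice(pages)
    # Visit frequencies approximate the probability of being on each page.
    return {page: count / steps for page, count in visits.items()}


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A
print(simulate_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```

The visit frequencies are only estimates and vary with the seed and the number of steps.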

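The iteration equation for the PageRank value of page $n$ is given by

$$PR(n) = \frac{1-d}{N} + d \left( \frac{PR(n_1)}{OutputDegree(n_1)} + \dots + \frac{PR(n_{last})}{OutputDegree(n_{last})} \right)$$

where $d$ is the damping factor, $N$ is the number of pages, $\frac{1-d}{N}$ is the random walk score, $n_1, \dots, n_{last}$ are the pages that link to $n$, and $OutputDegree(n_j)$ is the number of pages that page $n_j$ links to (its child pages). Note that the code in this notebook calls `pageRank` with $d = 0.15$ and weights the random walk term by $d$, so its `d` corresponds to $1-d$ in this equation.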
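As a worked example of the equation, assume (purely for illustration) $N = 3$ pages, $d = 0.85$, and every page currently at rank $\frac{1}{3}$. Suppose page $n$ has two in-neighbors: $n_1$, which links to 2 pages, and $n_2$, which links to 1 page. One update then gives:

$$PR(n) = \frac{1 - 0.85}{3} + 0.85 \left( \frac{1/3}{2} + \frac{1/3}{1} \right) = 0.05 + 0.85 \times 0.5 = 0.475$$

The first term is the share every page receives from the random walk, and the second is the rank passed along by the in-neighbors, each split evenly across its own out-links.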

### Algorithm Steps:
1. Initialize the PageRank of every node to 1/n.
2. Iterate through the graph. In each iteration, update the PageRank of every node in the graph.
   1. The first page receives rank only through the random walk.
   2. Every other page can receive rank through the random walk or through inter-page links.
   3. Sum the proportional rank from all of the page's in-neighbors.
   4. Update the PageRank with the weighted sum of the proportional rank and the random walk score.
3. Normalize the PageRank values when the graph has a terminal point (a page with no out-links). The PageRank values converge after enough iterations.
4. Return the ranks.
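The steps above can be sketched for a general graph as follows. This is a minimal illustration, not the notebook's implementation: the function name `pagerank_sketch`, the dictionary graph format, and the defaults `d=0.85` and `iterations=50` are assumptions. It uses the standard weighting, $(1-d)/N$ for the random walk and $d$ for the proportional rank.

```python
def pagerank_sketch(out_links, d=0.85, iterations=50):
    """Iterative PageRank on a graph given as {node: [nodes it links to]}.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)

    # Step 1: initialize every node with rank 1/n.
    ranks = {node: 1.0 / n for node in nodes}

    # Precompute the in-neighbors of every node.
    in_links = {node: [] for node in nodes}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    # Step 2: update every node in each iteration.
    for _ in range(iterations):
        new_ranks = {}
        for node in nodes:
            # Sum the proportional rank from all in-neighbors.
            proportional = sum(
                ranks[src] / len(out_links[src]) for src in in_links[node]
            )
            # Weighted sum of the random walk score and the proportional rank.
            new_ranks[node] = (1 - d) / n + d * proportional
        ranks = new_ranks

    # Step 3: normalize, since nodes without out-links lose rank.
    total = sum(ranks.values())
    # Step 4: return the ranks.
    return {node: rank / total for node, rank in ranks.items()}


# Hypothetical three-page graph: A -> B, A -> C, B -> C, C -> A
print(pagerank_sketch({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```

Unlike the chain-specific code in the next section, this sketch builds each iteration from the previous iteration's values rather than updating the ranks in place.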

### 1. Direct Iteration Method

In [1]:
# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from time import perf_counter
np.set_printoptions(precision=3)  # the precision to print


In [8]:
def pageRank(n, d, T):
    """PageRank for a chain of n pages in which page i-1 links to page i.

    n: the number of pages, e.g. 6
    d: damping factor, e.g. 0.15 (used here as the weight of the random walk term)
    T: the number of iterations, e.g. 1
    """
    # Step 1: Initialize the PageRank of every node with a value of 1/n | O(n)
    PR = np.ones(n) / n

    # Step 2: For each iteration, update the PageRank of every node in the graph.
    for i_t in range(T):  # O(T) where T is the number of iterations

        # 2-1: For the first page, it only processes through random walk.
        rand = 1 / n  # random walk score per page: O(1)
        PR[0] = d * rand  # assign value & computation: O(1)

        # 2-2: For other pages, they can process through random walk or inter-page links.
        for i in range(1, n):  # O(n) where n is the number of pages

            # 2-3: Sum up the proportional rank from all of its in-neighbors
            # (here the only in-neighbor is page i-1, whose out-degree is 1)
            i_prop = PR[i - 1] / 1  # assign value & computation: O(1)
            # 2-4: Update the PageRank with the weighted sum of proportional rank and random walk
            PR[i] = d * rand + (1 - d) * i_prop  # assign value & computation: O(1)

    # Step 3: normalize PR when there is terminal point
    PR /= PR.sum()  # sum and division over n entries: O(n)
    return PR  # returning value: O(1); the whole function runs in O(T * n)


print(pageRank(6, 0.15, 1))


[0.061 0.112 0.156 0.193 0.225 0.252]
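The printed values can be cross-checked against a closed form. Because each page in the chain depends only on the page before it, the update `PR[i] = d/n + (1-d) * PR[i-1]` with `PR[0] = d/n` unrolls into a geometric sum, `PR[i] = (1 - (1-d)**(i+1)) / n`, before normalization. The sketch below is a pure-Python check; the names `chain_pagerank` and `chain_closed_form` are hypothetical and not part of the notebook.

```python
def chain_pagerank(n, d, T):
    """Pure-Python copy of the chain iteration above (no NumPy)."""
    pr = [1.0 / n] * n
    for _ in range(T):
        pr[0] = d / n
        for i in range(1, n):
            pr[i] = d / n + (1 - d) * pr[i - 1]
    total = sum(pr)
    return [p / total for p in pr]


def chain_closed_form(n, d):
    """Closed form of the same recurrence, valid for T >= 1."""
    raw = [(1 - (1 - d) ** (i + 1)) / n for i in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


iterated = chain_pagerank(6, 0.15, 1)
closed = chain_closed_form(6, 0.15)
print([round(p, 3) for p in iterated])  # [0.061, 0.112, 0.156, 0.193, 0.225, 0.252]
print(max(abs(a - b) for a, b in zip(iterated, closed)) < 1e-12)  # True
```

One consequence for this particular chain: since the ranks are updated in place and `PR[0]` is reset in every sweep, any `T >= 1` recomputes the same values, so the result does not change with additional iterations.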
