- Algorithm description
- Algorithm code
- Empirical and canonical test dataset
- Performance measures (analytical and empirical)
- Evaluation
- Conclusion


                                                                                                                        © Moon

# Project Report: Evaluation of the Centrality Algorithm, PageRank

## Introduction

Centrality algorithms are one category of graph algorithms. They identify the important nodes in a given graph, where important nodes are vertices with many direct or indirect connections. One such algorithm is $PageRank (PR)$, which identifies the most important vertices of a graph by measuring the direct influence of nodes based on proportional rank. PR was invented by Larry Page and is used by Google Search to rank web pages in its search engine results [2].

PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set [2].


This notebook demonstrates:
- Implementing the PageRank algorithm and exploring it on graphs of social networks available in NetworkX
- Measuring time complexity theoretically
- Measuring time complexity empirically

## The Algorithm

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.

A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a document with a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to said document.

The PR values depend on the number of pages, the damping factor, and the number of iterations. PageRank models a person who randomly clicks on links and asks how likely that person is to arrive at any particular page. The probability that the person continues clicking at any step is the damping factor $d$. PR computations iterate over the pages repeatedly to move the approximate PR values toward their theoretical values.

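The update equation for the PageRank value of page $P_i$ is given by

$$PR(P_i) = \frac{1-d}{N} + d \sum_{P_j \in In(P_i)} \frac{PR(P_j)}{OutputDegree(P_j)}$$

where $d$ is the damping factor, $N$ is the number of pages, $\frac{1-d}{N}$ is the random walk score, $In(P_i)$ is the set of pages that link to $P_i$, and $OutputDegree(P_j)$ is the number of pages that page $P_j$ links to. The script in this notebook swaps the two weights, multiplying the random walk term by $d$ and the link term by $1-d$, so its `d = 0.15` corresponds to $1-d$ in this equation.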
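As a worked example of the update equation above, the following sketch computes one new rank for a single page. The four-page graph, the page names, the helper `update_rank`, and the value `d = 0.85` are assumptions chosen for illustration, not data from this notebook.

```python
# Hypothetical 4-page graph: each key links to the pages in its list.
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

d = 0.85                                   # damping factor (assumed value)
N = len(out_links)                         # number of pages
PR = {page: 1 / N for page in out_links}   # uniform starting ranks


def update_rank(page):
    """Apply the update equation once to `page`, using the current ranks."""
    # In-neighbors are the pages that link to `page`.
    in_neighbors = [p for p, targets in out_links.items() if page in targets]
    # Each in-neighbor passes on its rank, split evenly over its out-links.
    link_score = sum(PR[p] / len(out_links[p]) for p in in_neighbors)
    return (1 - d) / N + d * link_score


new_rank_C = update_rank("C")
print(round(new_rank_C, 3))  # 0.569
```

Page C receives 0.25/2 from A and 0.25 each from B and D, so the link term is 0.625 and the new rank is 0.15/4 + 0.85 × 0.625 = 0.56875.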

# 1. PageRank: Link Analysis

### Algorithm Steps:
1. Initialize the PageRank of every node with a value of 1/n.
2. Iterate through the graph. In each iteration, update the PageRank of every node in the graph.
   1. The first page is reached only through the random walk.
   2. The other pages can be reached through the random walk or through inter-page links.
   3. Sum up the proportional rank from all of the node's in-neighbors.
   4. Update the PageRank with the weighted sum of the proportional rank and the random walk.
3. Normalize the PageRank when there is a terminal point. The PageRank values converge after enough iterations.
4. Return the ranks.
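The steps above can be sketched for an arbitrary directed graph as follows. This is a minimal illustration rather than the notebook's own script: the function name `pagerank_sketch`, the adjacency-dictionary input, the `max_iter` and `tol` stopping parameters, and the choice to renormalize after every pass are assumptions. The sketch weights the link term by `d` and the random walk term by `1 - d`.

```python
def pagerank_sketch(out_links, d=0.85, max_iter=200, tol=1e-8):
    """Iterative PageRank on a dict mapping each page to the pages it links to."""
    pages = list(out_links)
    n = len(pages)

    # Step 1: start from the uniform distribution 1/n.
    rank = {p: 1 / n for p in pages}

    # Precompute in-neighbors so each update only visits incoming links.
    in_links = {p: [] for p in pages}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)

    # Step 2: repeat the update until the ranks stop changing.
    for _ in range(max_iter):
        new_rank = {}
        for p in pages:
            # Sum the proportional rank from all in-neighbors.
            link_score = sum(rank[q] / len(out_links[q]) for q in in_links[p])
            # Weighted sum of the random walk term and the link term.
            new_rank[p] = (1 - d) / n + d * link_score

        # Step 3: renormalize, since terminal pages (no out-links) leak rank.
        total = sum(new_rank.values())
        new_rank = {p: r / total for p, r in new_rank.items()}

        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break

    # Step 4: return the ranks.
    return rank


# Hypothetical graph: page "D" is a terminal page with no out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A", "D"], "D": []}
ranks = pagerank_sketch(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Unlike the in-place update used in the Direct Iteration script, this sketch builds a fresh `new_rank` on every pass, so each update within a pass sees only the ranks from the previous pass.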

# The Algorithm Script

### 1. Direct Iteration Method

In [1]:
# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
from time import perf_counter
np.set_printoptions(precision=3)  # the precision to print


In [8]:
# n = 6  # the number of pages
# d = 0.15  # damping factor
# T = 1  # the number of iterations

def pageRank(n, d, T):
    # Step 1: Initialize the PageRank of every node with a value of 1/n | O(n)
    PR = np.ones(n)/n

    # Step 2: For each iteration, update the PageRank of every node in the graph.
    for i_t in range(T):  # O(T) where T is the number of iterations
        
        # 2-1: For the first page, it only processes through random walk. 
        rand = 1 / n  # assign value: O(1)
        PR[0] = d * rand  # assign value & computation: O(1)

        #  2-2: For other pages, they can process through random walk or inter-page links.
        for i in range(1, n):  # O(n) where n is number of pages
            
            # 2 - 3: Sum up the proportional rank from all of its in-neighbors 
            i_prop = PR[i-1] / 1 # assign value & computation: O(1)
            # 2 - 4: Update the PageRank with the weighted sum of proportional rank and random walk
            PR[i] = d * rand + (1-d) * i_prop # assign value & computation: O(1)

    # Step 3: normalize PR when there is a terminal point
    PR /= PR.sum()  # sum and elementwise division: O(n)
    return PR # returning value: O(1)


print(pageRank(6, 0.15, 1))


[0.061 0.112 0.156 0.193 0.225 0.252]


# PageRank Implementation on other graphs

## Instruction


1. Run the PageRank algorithm on a directed or undirected graph of the data.
2. Iterate through 

## Theoretical Analysis

```
Time and space complexity of the clustering algorithm.

Refer to the algorithm and the script in the previous cells.

The main loop runs N times.
The inner loop runs K times, which is the number of clusters generated.
Worst case: N clusters are generated, giving O(n^2) complexity.

Average case: K clusters are generated with K << N, giving O(n) complexity.

Space complexity is O(K) + O(n), since we keep only the cluster information, i.e. centroids and the cluster assignment for each data point x.
```

# Empirical Time Complexity

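Counting the number of operations with an explicit counter is a machine-independent way to analyze a function empirically, because the count does not depend on hardware or system load the way wall-clock timing does.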
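To make the counter idea concrete, the sketch below re-implements the chain update from the Direct Iteration script in plain Python and counts one operation per rank update. The function name `pagerank_counted`, the choice of what counts as an operation, and the input sizes are assumptions for illustration. The `perf_counter` timings will vary by machine, so only the counts are meant to be compared.

```python
from time import perf_counter


def pagerank_counted(n, d, T):
    """Chain-graph PageRank as in the script above, returning ranks and an update count."""
    PR = [1 / n] * n
    ops = 0  # number of rank updates performed
    rand = 1 / n
    for _ in range(T):
        PR[0] = d * rand  # first page: random walk only
        ops += 1
        for i in range(1, n):
            # Other pages: random walk plus the rank of the previous page.
            PR[i] = d * rand + (1 - d) * PR[i - 1]
            ops += 1
    total = sum(PR)
    return [r / total for r in PR], ops


# Doubling n should double the count if the cost is linear in n for fixed T.
for n in (1_000, 2_000, 4_000):
    start = perf_counter()
    _, ops = pagerank_counted(n, d=0.15, T=5)
    elapsed = perf_counter() - start
    print(f"n={n:5d}  updates={ops:6d}  time={elapsed:.4f}s")
```

Under this counting scheme the function performs exactly `T * n` updates, which agrees with the per-line annotations in the script: an outer loop over `T` iterations around an inner loop over `n` pages.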



# Conclusion

```
In this study, we analyzed a clustering algorithm for its theoretical and empirical complexity.

Clustering algorithms are important members of unsupervised learning algorithms. In this study, we examined the single-pass clustering algorithm as a streaming, big data analytic. We empirically showed that the algorithm can run in O(n) time and therefore it can be used in streaming big data applications.
```



## References

[1] rajat95gupta. “Implementing PageRank on Famous Social Networks.” Kaggle, 29 Dec. 2021, www.kaggle.com/code/rajat95gupta/implementing-pagerank-on-famous-social-networks.

[2] Wikipedia Contributors. “PageRank.” Wikipedia, Wikimedia Foundation, 23 July 2022, en.wikipedia.org/wiki/PageRank. Accessed 24 July 2022.
