- Algorithm description
- Algorithm code
- Empirical and canonical test dataset
- Performance measures (analytical and empirical)
- Evaluation
- Conclusion


© Moon

# Project Report: Evaluation of the Centrality Algorithm, PageRank

## Introduction

Centrality algorithms are one category of graph algorithms. They identify the important nodes in a given graph, where important nodes are vertices with many direct or indirect connections. One centrality algorithm is PageRank (PR), which identifies the most important vertices of a graph by measuring the influence of nodes based on proportional rank. PR was invented by Larry Page and is used by Google Search to rank web pages in its search engine results [2].

This notebook demonstrates:
- The PageRank algorithm
- An implementation of the classic PageRank algorithm, explored on graphs generated with NetworkX, a Python library
- Measuring time complexity theoretically
- Measuring time complexity empirically

## Classic PageRank

> PageRank is a link analysis algorithm and it assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. [2]

The PageRank algorithm returns the probability that a person randomly surfing will arrive at any particular page. The rank is assumed to be evenly distributed over all pages at the beginning of the PR process.

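PR takes three inputs: the number of pages, the damping factor, and the number of iterations. It models a person who randomly clicks on links and computes the probability that this person arrives at any particular page. The damping factor d controls how often the person follows a link rather than jumping to a random page. In the code cells of this notebook, d (set to 0.15) is the probability of a random jump, so the person follows a link with probability 1 - d; many references use the opposite convention, in which d (typically 0.85) is the probability of continuing to follow links. The PR computation iterates over the pages repeatedly so that the approximate PR values approach their theoretical values.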

For instance, a node with a PR of 0.4 means there is a 40% chance that a person surfing randomly will be at that node.
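The random-surfer interpretation can be checked with a small simulation. The sketch below is illustrative only: the three-page graph, the function name `simulate_surfer`, the step count, and the seed are assumptions made for the example, and `d` is taken to be the probability of a random jump (the convention used by the code cells in this notebook).

```python
import random


def simulate_surfer(out_links, d=0.15, steps=200_000, seed=0):
    """Estimate how often a random surfer visits each page (illustrative sketch).

    With probability d the surfer jumps to a uniformly random page;
    otherwise the surfer follows a random outgoing link of the current page.
    """
    rng = random.Random(seed)
    pages = list(out_links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        # Jump at random with probability d, or when the page has no out-links.
        if rng.random() < d or not out_links[current]:
            current = rng.choice(pages)
        else:
            current = rng.choice(out_links[current])
        visits[current] += 1
    return {page: count / steps for page, count in visits.items()}


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
frequencies = simulate_surfer({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(frequencies)
```

The visit frequencies sum to 1, and each one approximates the PageRank value of the corresponding page under the stated assumptions.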

The implementation of the classic PageRank algorithm uses an iterative method. At each iteration step, the PageRank values of all nodes in the graph are computed.

### PageRank Formula





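The iteration equation for the PageRank value of page $P_i$ is given by

$$PR(P_i) = \frac{d}{n} + (1-d) \sum_{P_j \in In(P_i)} \frac{PR(P_j)}{OutputDegree(P_j)}$$

where $n$ is the number of pages, $d$ is the probability of a random jump (so $\frac{d}{n}$ denotes the random walk score), $In(P_i)$ is the set of pages that link to $P_i$, and $OutputDegree(P_j)$ denotes how many pages are linked as children of page $P_j$. This is the convention used by the code in this notebook, where $d = 0.15$. Many references, including [2], write the same equation with the roles of $d$ and $1-d$ swapped, that is, $PR(P_i) = \frac{1-d}{n} + d \sum_{P_j \in In(P_i)} \frac{PR(P_j)}{OutputDegree(P_j)}$ with $d$ typically set to 0.85.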
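As a hedged illustration of a single update step, the sketch below applies the rule once to a hypothetical three-page graph. The graph and the helper name `pagerank_update` are assumptions made for the example, and `d` is used as the probability of a random jump, matching the code cells in this notebook (`d = 0.15`).

```python
def pagerank_update(ranks, out_links, d=0.15):
    """Apply one synchronous PageRank update (illustrative sketch).

    ranks: dict mapping each page to its current PageRank value
    out_links: dict mapping each page to the list of pages it links to
    d: probability of a random jump (convention of this notebook's code)
    """
    n = len(ranks)
    new_ranks = {}
    for page in ranks:
        # Sum the proportional rank received from every in-neighbor.
        incoming = sum(
            ranks[src] / len(targets)
            for src, targets in out_links.items()
            if page in targets
        )
        new_ranks[page] = d / n + (1 - d) * incoming
    return new_ranks


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / 3 for page in out_links}
new_ranks = pagerank_update(ranks, out_links)
print(new_ranks)
```

Page C receives rank from both A and B, so after one step it holds the largest value, while the three values still sum to 1.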

### Algorithm Steps

1. Initialize the PageRank of every node with a value of 1/n.
2. Iterate through the graph. For each iteration, update the PageRank of every node in the graph:
   1. The first page, which has no in-neighbors in the chain graph used here, is reached only through random walk.
   2. Every other page can be reached through random walk or through inter-page links.
   3. Sum up the proportional rank from all of the page's in-neighbors.
   4. Update the PageRank with the weighted sum of the proportional rank and the random walk score.
3. Normalize the PageRank values when the graph has a terminal point (a node without outgoing links), so that they sum to 1. The PageRank values converge after enough iterations.
4. Return the ranks.
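The steps above can be sketched for a general graph stored as an adjacency list. This is a minimal sketch under stated assumptions, not the notebook's own script: the function name `pagerank_iterative`, the dictionary representation, and the example chain are hypothetical, every link target is assumed to appear as a key, and `d` is the probability of a random jump.

```python
def pagerank_iterative(out_links, d=0.15, iterations=100):
    """Iterative PageRank on an adjacency-list graph (illustrative sketch).

    out_links maps each node to the list of nodes it links to.
    d is the random-jump probability, as in this notebook's code cells.
    """
    nodes = list(out_links)
    n = len(nodes)
    # Step 1: initialize every node with 1/n.
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Step 2: start each node with its random-walk share d/n,
        # then add the proportional rank passed along each link.
        new_ranks = {node: d / n for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue  # terminal node: its rank is not passed on
            share = (1 - d) * ranks[src] / len(targets)
            for dst in targets:
                new_ranks[dst] += share
        # Step 3: renormalize, because terminal nodes leak rank.
        total = sum(new_ranks.values())
        ranks = {node: value / total for node, value in new_ranks.items()}

    # Step 4: return the ranks.
    return ranks


# Hypothetical example: a chain 0 -> 1 -> 2 -> 3 where node 3 is terminal.
chain = {0: [1], 1: [2], 2: [3], 3: []}
ranks = pagerank_iterative(chain)
print({node: round(value, 3) for node, value in ranks.items()})
```

In this chain, rank accumulates toward the end, so the values increase from node 0 to node 3.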

## The Algorithm Script

| **Input Argument** | **Type** | **Comment**                                                         |
|--------------------|----------|---------------------------------------------------------------------|
| G                  | graph    | input graph; only its number of nodes (n) is passed to the function |
| n                  | int      | total number of nodes of the given graph (G)                        |
| d                  | float    | damping factor                                                      |
| I                  | int      | number of iterations                                                |

| **Output Argument** | **Type** | **Comment**                                                   |
|---------------------|----------|---------------------------------------------------------------|
| PR                  | array    | array of length n holding the PageRank value of each node     |


In [5]:
# import packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import itertools
import random
from time import perf_counter
np.set_printoptions(precision=3)  # the precision to print
import networkx as nx


In [6]:
def pageRank(n, d, I):
    """Simplified PageRank for a chain of n pages (page i-1 links only to page i).

    n: number of pages, d: probability of a random jump, I: number of iterations.
    """
    # Step 1: Initialize the PageRank of every node with a value of 1/n | O(n)
    PR = np.ones(n) / n
    rand = 1 / n  # probability of reaching a given page by random walk: O(1)

    # Step 2: For each iteration, update the PageRank of every node in the graph.
    for i_t in range(I):  # O(I) where I is the number of iterations

        # 2-1: The first page has no in-neighbor, so it is reached only by random walk.
        PR[0] = d * rand  # assign value & computation: O(1)

        # 2-2: Other pages can be reached by random walk or through an inter-page link.
        for i in range(1, n):  # O(n) where n is the number of pages

            # 2-3: Sum up the proportional rank from all in-neighbors;
            # here the only in-neighbor is page i-1, which has out-degree 1.
            i_prop = PR[i - 1] / 1  # assign value & computation: O(1)
            # 2-4: Update the PageRank with the weighted sum of proportional rank and random walk.
            PR[i] = d * rand + (1 - d) * i_prop  # assign value & computation: O(1)

    # Step 3: Normalize PR so that the values sum to 1 | O(n)
    PR /= PR.sum()
    return PR  # returning value: O(1)


print(pageRank(10, 0.15, 50010))
# For the library implementation, see the NetworkX pagerank_alg documentation [1].

[0.028 0.051 0.071 0.088 0.102 0.114 0.125 0.134 0.141 0.147]



In [50]:
# global parameters
n = 6

# initialize adjacency matrix
M = np.zeros([n, n])
# add links of page 1 -> 2
M[0, 1] = 1
# add links of page 2 -> 3
M[1, 2] = 1
# add links of page 3 -> 4
M[2, 3] = 1
# add links of page 4 -> 5
M[3, 4] = 1
# add links of page 5 -> 6
M[4, 5] = 1
# add links of page 6 -> all
M[5, :] = 1
# show the matrix
print(M)


[[0. 1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1.]
 [1. 1. 1. 1. 1. 1.]]

Next, the adjacency matrix is used to calculate the transition probability matrix $P$, which is the row-wise normalized adjacency matrix. $P(i,j)$ denotes the probability of the transition from page $i$ to page $j$.

In [51]:
# row sums of M (sum over axis 1, i.e. across the columns of each row)
M_rowsum = np.sum(M, 1, keepdims=True)
print(f'row sum of M:\n {M_rowsum}')

# transition matrix
P = M / M_rowsum
print(f'transition matrix of M:\n {P}')

row sum of M:
 [[1.]
 [1.]
 [1.]
 [1.]
 [1.]
 [6.]]
transition matrix of M:
 [[0.    1.    0.    0.    0.    0.   ]
 [0.    0.    1.    0.    0.    0.   ]
 [0.    0.    0.    1.    0.    0.   ]
 [0.    0.    0.    0.    1.    0.   ]
 [0.    0.    0.    0.    0.    1.   ]
 [0.167 0.167 0.167 0.167 0.167 0.167]]


However, there is a probability $d$ of a random walk. Thus, the expected transition matrix $\hat{P}$ is the weighted average of $P$ and the uniform random walk probability $\frac{1}{n}$, that is, $\hat{P} = (1-d)P + \frac{d}{n}$:

In [52]:
# the probability of random walk (damping factor)
d = 0.15

# the probability of being reached if random walk
rand = 1 / n

# the expected transition matrix
P_hat = P * (1 - d) + rand * d

print(f'the expected transition matrix:\n {P_hat}')

the expected transition matrix:
 [[0.025 0.875 0.025 0.025 0.025 0.025]
 [0.025 0.025 0.875 0.025 0.025 0.025]
 [0.025 0.025 0.025 0.875 0.025 0.025]
 [0.025 0.025 0.025 0.025 0.875 0.025]
 [0.025 0.025 0.025 0.025 0.025 0.875]
 [0.167 0.167 0.167 0.167 0.167 0.167]]
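One way to turn the expected transition matrix into PageRank values is power iteration: start from the uniform distribution and repeatedly multiply it by the matrix until it stops changing. The sketch below is a hedged illustration that rebuilds the six-page example with plain Python lists so that it runs on its own; the iteration count of 200 is an assumption, not a tuned value.

```python
n = 6
d = 0.15  # probability of a random jump, as in the cell above

# Rebuild the adjacency matrix of the example: page i -> page i+1, page 6 -> all pages.
M = [[0.0] * n for _ in range(n)]
for i in range(n - 1):
    M[i][i + 1] = 1.0
M[n - 1] = [1.0] * n

# Row-normalize to get P, then mix with the uniform random jump to get P_hat.
P_hat = [[(1 - d) * value / sum(row) + d / n for value in row] for row in M]

# Power iteration: repeatedly apply rank <- rank * P_hat (row vector times matrix).
rank = [1.0 / n] * n
for _ in range(200):
    rank = [sum(rank[i] * P_hat[i][j] for i in range(n)) for j in range(n)]

print([round(r, 3) for r in rank])
```

Because every row of the matrix sums to 1, the rank vector keeps summing to 1 without an explicit normalization step.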


In [44]:
# [3]
# Create a scale-free graph on ten nodes; only its node count is passed to pageRank:
G = nx.scale_free_graph(10)
n = nx.number_of_nodes(G)
d = 0.15
T = 100

print(pageRank(n, d, T))


[0.028 0.051 0.071 0.088 0.102 0.114 0.125 0.134 0.141 0.147]


In [35]:
print(G)

MultiDiGraph with 100 nodes and 205 edges


1. Runs the PageRank algorithm on a directed or undirected graph of the data.
2. Iterate through 

## Theoretical Analysis


This section analyzes the time and space complexity of the classic PageRank implementation from the previous cells. The implementation uses an iterative method: at each iteration step, the PageRank values of all nodes in the graph are computed.

The main loop runs I times, where I is the total number of iterations. The inner loop runs n - 1 times per iteration, where n is the number of nodes, and each pass does a constant amount of work. The initialization and the final normalization each take O(n) time.

The time complexity is therefore O(I * n), where I is the number of iterations run over the n nodes.

The space complexity is O(n), where n is the total number of nodes, since only one PageRank value per node is stored.
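The O(I * n) time bound can be written out as an operation count. The constants $c_1, \dots, c_4$ below are illustrative placeholders for the cost of a single assignment or arithmetic step; they are assumptions, not measured values.

$$T(I, n) = \underbrace{c_1 n}_{\text{initialization}} + \underbrace{I \bigl(c_2 + c_3 (n - 1)\bigr)}_{\text{main loop}} + \underbrace{c_4 n}_{\text{normalization}} = O(I \cdot n)$$

For a fixed number of iterations, the running time therefore grows linearly with the number of nodes.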






## Empirical Time Complexity

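Counting the number of operations with a counter is a machine-independent way to analyze a function empirically, because the count does not depend on hardware or system load. Wall-clock timing can complement it.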
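A minimal sketch of such a counter is shown below. It is an instrumented copy of the chain-style update, written with plain lists so that it runs on its own; the name `pagerank_counted`, the choice to count one operation per rank assignment, and the problem sizes are assumptions made for illustration.

```python
from time import perf_counter


def pagerank_counted(n, d, iterations):
    """Chain-style PageRank that also counts rank assignments (illustrative sketch)."""
    ranks = [1.0 / n] * n
    rand = 1.0 / n
    operations = 0
    for _ in range(iterations):
        ranks[0] = d * rand
        operations += 1
        for i in range(1, n):
            ranks[i] = d * rand + (1 - d) * ranks[i - 1]
            operations += 1
    total = sum(ranks)
    return [r / total for r in ranks], operations


for n in (100, 200, 400):
    start = perf_counter()
    _, operations = pagerank_counted(n, 0.15, 50)
    elapsed = perf_counter() - start
    print(f"n={n:4d}  operations={operations:6d}  seconds={elapsed:.4f}")
```

With this counting rule the total is exactly the number of iterations times n, so doubling n doubles the count; the measured times vary from run to run and are not reproduced here.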



## Conclusion

In this study, we analyzed the PageRank algorithm for its theoretical and empirical complexity.

PageRank is an important member of the family of centrality algorithms. In this study, we examined an iterative implementation of the classic algorithm. The analysis shows that it runs in O(I * n) time, where I is the number of iterations and n is the number of nodes, and uses O(n) space.



## References

[1] “Networkx.algorithms.link_analysis.pagerank_alg — NetworkX 2.8.5 Documentation.” Networkx.org, 2022, networkx.org/documentation/stable/_modules/networkx/algorithms/link_analysis/pagerank_alg.html#pagerank. Accessed 24 July 2022.

[2] Wikipedia Contributors. “PageRank.” Wikipedia, Wikimedia Foundation, 23 July 2022, en.wikipedia.org/wiki/PageRank. Accessed 24 July 2022.

[3] “Scale_free_graph — NetworkX 2.8.5 Documentation.” Networkx.org, 2022, networkx.org/documentation/stable/reference/generated/networkx.generators.directed.scale_free_graph.html#networkx.generators.directed.scale_free_graph. Accessed 24 July 2022.

[4] rajat95gupta. “Implementing PageRank on Famous Social Networks.” Kaggle.com, Kaggle, 29 Dec. 2021, www.kaggle.com/code/rajat95gupta/implementing-pagerank-on-famous-social-networks.