# Computational Linear Algebra: PageRank Algorithm Homework

**Academic Year:** 2025/2026

### Team Members:
1. Indiano, Giovanni (357942);
2. Stradiotti, Fabio (359415).

## Introduction

In the early years of the **World Wide Web** during the 1990s, there was a need for search engines capable of filtering vast amounts of public data to deliver relevant information. Early solutions struggled with accuracy, frequently returning irrelevant links that did not match the user's request. The first algorithm that actually succeeded in this goal was **PageRank**, which is the primary reason behind **Google**'s enormous success. The algorithm quantitatively rates the importance of each webpage so that the most helpful results are returned first.

PageRank is a delightful application of **linear algebra**. It represents the web as a graph in which webpages are the vertices and links are the edges. The key idea is to assign each page an *importance* based not only on the number of incoming links (**backlinks**), but also on the importance of the pages those links come from. This relationship is encoded in a matrix $A$, called the **link matrix**: the importance score $x_k$ of page $k$ is the sum of the scores of the pages linking to it, each divided by the number of outgoing links on that page.
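
As a small illustration of this relationship, the sketch below builds the link matrix for a hypothetical four-page web and recovers the scores as the eigenvector for the eigenvalue 1. The `links` dictionary is an assumption made for this example and is not read from any of the assignment files.

```python
import numpy as np

# Hypothetical web: page -> pages it links to (1-based IDs, assumed for illustration)
links = {1: [2, 3, 4], 2: [3, 4], 3: [1], 4: [1, 3]}
n = len(links)

A = np.zeros((n, n))
for source, targets in links.items():
    for target in targets:
        # Page `source` splits its vote equally among its outgoing links
        A[target - 1, source - 1] = 1.0 / len(targets)

# Importance scores: eigenvector of A for the eigenvalue closest to 1
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmin(np.abs(eigvals - 1.0))
x = np.real(eigvecs[:, k])
x = x / x.sum()  # rescale so the scores sum to 1

print(np.round(x, 3))
```

For this particular web the scores are proportional to $(12, 4, 9, 6)$: page 3 has the most backlinks, yet page 1 ranks first because it receives the entire vote of page 3.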

Mathematically, the ranking problem consists in finding an **eigenvector** associated with the eigenvalue 1 of a **column-stochastic matrix**. Two problems can arise in practice:

- **Non-unique rankings**: if the web contains several disconnected subwebs, the eigenspace for the eigenvalue 1 has dimension greater than one, so there is no single unique ranking;
- **Dangling nodes**: pages with no outgoing links make the matrix substochastic, and it may then lack an eigenvalue equal to 1.

PageRank solves these problems by using the **Google Matrix** $M$, defined as:

$$
M = (1 - m)A + mS
$$

where $S$ is an *egalitarian* matrix whose entries all equal $\frac{1}{n}$, and $m$ is a real number with $0 < m \le 1$ (for $m = 0$ the matrix reduces to $A$). By the **Perron-Frobenius** theorem, when $A$ is column-stochastic this modification makes $M$ positive and column-stochastic, which guarantees a one-dimensional eigenspace for the eigenvalue 1, spanned by a positive eigenvector. The result is a stable and unambiguous ranking for the entire web.
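
The effect of the $mS$ term can be checked numerically on a tiny case. The sketch below assumes a hypothetical four-page web made of two disconnected subwebs and compares the dimension of the eigenspace for the eigenvalue 1 before and after the modification. The helper `eigenspace_dim` is introduced only for this example.

```python
import numpy as np

m = 0.15
# Hypothetical web with two disconnected subwebs: 1 <-> 2 and 3 <-> 4
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
S = np.full((n, n), 1.0 / n)   # egalitarian matrix, all entries 1/n
M = (1 - m) * A + m * S


def eigenspace_dim(B):
    """Dimension of the eigenspace of B for the eigenvalue 1."""
    size = B.shape[0]
    return size - np.linalg.matrix_rank(B - np.eye(size))


dim_A = eigenspace_dim(A)   # more than one independent ranking for A
dim_M = eigenspace_dim(M)   # a single ranking direction for M
print(dim_A, dim_M)
```

In this example the eigenspace of $A$ is two-dimensional (one independent ranking per subweb), while that of $M$ is one-dimensional.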

This report presents our solutions to the assignment exercises, based on linear algebra principles: the construction of the link matrix, the convergence of the power method, and the impact of web structure on page importance.


# Implementation of the PageRank Algorithm

In this section, we present the implementation of the PageRank algorithm. The code is structured to handle large-scale graphs efficiently using sparse matrices.

We use the **Power Iteration Method** to compute the dominant eigenvector of the Google Matrix $M$.

## Importing modules

In [11]:
import numpy as np
from scipy import sparse
from scipy.sparse import linalg as splinalg

# Damping parameter m: probability of a random jump (Google reportedly used m = 0.15, i.e. 1 - m = 0.85)
m = 0.15

## 1. Graph Reading and Matrix Construction

The function `read_dat` reads the graph structure from a file. To handle large graphs efficiently (like the web), we use **sparse matrices** (specifically the Compressed Sparse Column format, CSC).

The Link Matrix $A$ is constructed as follows:
1.  For each link $j \to i$, we place a $1$ in $A_{ij}$.
2.  We normalize the columns so that $A$ becomes **column-stochastic** (the sum of each column is 1). This is done by multiplying $A$ by the inverse of the diagonal matrix of column sums: $A = A D^{-1}$.

If a node has no outgoing links (dangling node), its column sum is 0. In that case we set the scale factor to 0 instead of dividing, so the column stays zero and no division by zero occurs.
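
The normalization step can be traced on a three-page example. The sketch below uses a hypothetical edge list in which page 3 is dangling. It follows the same idea as `read_dat`, but uses a boolean mask for the nonzero columns, which is a stylistic choice of this example.

```python
import numpy as np
from scipy import sparse

# Hypothetical edge list (source, target), 1-based; page 3 has no outgoing links
edges = [(1, 2), (1, 3), (2, 3)]
n = 3
rows = [target - 1 for _, target in edges]
cols = [source - 1 for source, _ in edges]
A = sparse.coo_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n)).tocsc()

col_sums = np.asarray(A.sum(axis=0)).ravel()
scale = np.zeros(n)
nonzero = col_sums != 0
scale[nonzero] = 1.0 / col_sums[nonzero]   # dangling columns keep scale 0

A = A @ sparse.diags(scale)                # A D^{-1}
new_sums = np.asarray(A.sum(axis=0)).ravel()
print(A.toarray())
print(new_sums)
```

Columns 1 and 2 now sum to 1, while the dangling column 3 remains zero, so the matrix is substochastic rather than stochastic.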

In [12]:
def read_dat(file_name):
    """
    Reads a graph from a .dat file and constructs the column-stochastic link matrix A.
    Returns:
        A (scipy.sparse.csc_matrix): The link matrix.
        labels (dict): A mapping from node ID to node name.
    """
    labels = {}
    row_indices = [] # Lists to store sparse matrix coordinates
    col_indices = []
    
    try:
        with open(file_name, 'r') as file:
            first_line = file.readline().strip()
            if not first_line:
                return None, None
            parts = first_line.split()
            num_nodes = int(parts[0])
            num_edges = int(parts[1])
            
            # Use COO format for construction (efficient for appending)
            # Later convert to CSC (Compressed Sparse Column) for calculation
            
            for _ in range(num_nodes):
                line = file.readline().strip()
                if line:
                    parts = line.split(maxsplit=1) 
                    node_id = int(parts[0])
                    node_name = parts[1]
                    labels[node_id] = node_name

            for _ in range(num_edges):
                line = file.readline().strip()
                if line:
                    parts = line.split()
                    source = int(parts[0])
                    target = int(parts[1])
                    # Store coordinates instead of filling dense matrix directly
                    # A[target-1][source-1]=1
                    row_indices.append(target - 1)
                    col_indices.append(source - 1)
            
            # Create sparse matrix with 1s at specific coordinates
            data = np.ones(len(row_indices))
            A = sparse.coo_matrix((data, (row_indices, col_indices)), shape=(num_nodes, num_nodes)).tocsc()
            
            # Efficient column normalization for sparse matrix
            # Calculate sum of each column
            col_sums = np.array(A.sum(axis=0)).flatten()
            
            # Avoid division by zero. If sum is 0, scaling factor is 0.
            with np.errstate(divide='ignore', invalid='ignore'):
                scale_factors = np.where(col_sums != 0, 1.0 / col_sums, 0)
            
            # Multiply A by diagonal matrix of inverse sums to normalize
            D_inv = sparse.diags(scale_factors)
            A = A @ D_inv
                    
    except FileNotFoundError:
        print(f"Error: File '{file_name}' not found.")
        return None, None
    except Exception as e:
        print(f"Error during the analysis of the file: {e}")
        return None, None
    
    return A, labels

## 2. The Power Iteration Method

To find the PageRank vector $x$, we need to find the stationary distribution of the Markov chain defined by the Google Matrix $M$:
$$M = (1-m)A + mS$$
where $S$ is the rank-one matrix $\mathbf{v}\mathbf{e}^T$, whose columns all equal $\mathbf{v}$ (with $\mathbf{v}$ usually being the uniform vector with entries $1/n$, and $\mathbf{e}$ the vector of all ones).

Instead of explicitly constructing the dense matrix $M$ (which would destroy sparsity), we compute the matrix-vector product by exploiting this structure:
$$x^{(k+1)} = M x^{(k)} = (1-m) A x^{(k)} + m S x^{(k)}$$

Since $S x^{(k)} = \mathbf{v}(\mathbf{e}^T x^{(k)}) = \mathbf{v} \cdot 1 = \mathbf{v}$ (assuming the entries of $x^{(k)}$ sum to 1), the update rule becomes:
$$x^{(k+1)} = (1-m) A x^{(k)} + m \mathbf{v}$$
where $\mathbf{v}$ is the vector of uniform probabilities ($1/n$).

We iterate until the $L^1$ norm of the difference between consecutive iterates falls below a tolerance $\epsilon$.
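
The equivalence between the dense product $Mx$ and the matrix-free update can be verified on a small case. The sketch below assumes a hypothetical $4 \times 4$ column-stochastic link matrix and an arbitrary vector `x` whose entries sum to 1. Both are made up for this check.

```python
import numpy as np
from scipy import sparse

m = 0.15
# Hypothetical column-stochastic link matrix (4 pages, no dangling nodes)
A_dense = np.array([[0,   0,   1, 0.5],
                    [1/3, 0,   0, 0],
                    [1/3, 0.5, 0, 0.5],
                    [1/3, 0.5, 0, 0]])
A = sparse.csc_matrix(A_dense)
n = A.shape[0]
v = np.ones(n) / n
x = np.array([0.4, 0.3, 0.2, 0.1])   # any vector whose entries sum to 1

# Dense reference: build M explicitly (every column of S equals v)
S = np.full((n, n), 1.0 / n)
M = (1 - m) * A_dense + m * S
dense_step = M @ x

# Matrix-free update: only the sparse product A @ x is needed
sparse_step = (1 - m) * (A @ x) + m * v

print(np.allclose(dense_step, sparse_step))
```

The dense version stores $n^2$ entries, whereas the matrix-free version only touches the nonzeros of $A$ plus one vector of length $n$.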

In [13]:
def power_iteration_with_vector(A, s, m, tolerance=1e-6, max_iterations=1000):
    """
    Computes the PageRank vector using the Power Iteration method.
    Args:
        A: The column-stochastic link matrix (sparse).
        s: The personalization vector (usually uniform 1/n).
        m: The damping factor (probability of random jump).
    """
    n = A.shape[0]
    x = np.ones(n) / n # initial vector (normalized)
    
    for iteration in range(max_iterations):
        # Sparse matrix multiplication (@) is efficient here
        # x_new = (1 - m) * A * x + m * s
        x_new = (1 - m) * (A @ x) + m * s
        
        # Re-normalize so the entries sum to 1 (the sum only stays 1 on its own
        # when A has no zero columns, i.e. no dangling nodes)
        x_new = x_new / np.sum(x_new)

        # Check convergence (L1 norm of the change between iterates)
        converged = np.linalg.norm(x_new - x, 1) < tolerance
        x = x_new  # keep the latest iterate, including on convergence
        if converged:
            print(f"  Converged in {iteration + 1} iterations")
            break
    else:
        print(f"  Warning: Maximum iterations ({max_iterations}) reached")

    return x

## 3. Helper Functions

We define utility functions to:
1.  **Check dangling nodes**: Identify pages with no outgoing links (columns of 0).
2.  **Check stochasticity**: Verify if a matrix is column-stochastic.
3.  **Analyze Graph**: A wrapper to run the full analysis pipeline on a given file.

In [14]:
def check_dangling_nodes(A):
    """Returns the indices of the zero columns of A (pages with no outgoing links)."""
    col_sums = np.array(A.sum(axis=0)).flatten()
    return np.flatnonzero(col_sums == 0).tolist()

def analyze_graph(filename, m=0.15, print_top_k=None):
    """
    Main function to analyze a graph file.
    Args:
        filename: Path to the .dat file.
        m: Damping factor.
        print_top_k: If set, prints only the top k results (useful for large datasets).
    """
    print(f"Analyzing {filename} ...")
    A, labels = read_dat(filename)
    if A is None:
        return
    
    n = A.shape[0]
    s = np.ones(n) / n # Uniform personalization vector
    
    # 1. Check Dangling Nodes
    dangling = check_dangling_nodes(A)
    if dangling:
        print(f"  - Warning: Found {len(dangling)} dangling node(s).")
    else:
        print(f"  - No dangling nodes detected.")
    
    # 2. Compute PageRank
    x = power_iteration_with_vector(A, s, m)
    
    # 3. Display Results
    sorted_indices = np.argsort(x)[::-1]
    
    print(f"  PageRank scores (Top results):")
    print(f"  {'-'*40}")
    
    limit = print_top_k if print_top_k else n
    for rank, idx in enumerate(sorted_indices[:limit], 1):
        node_label = labels[idx + 1]
        score = x[idx]
        print(f"  {rank}. {node_label:20s}: {score:.6f}")
        
    if print_top_k and n > print_top_k:
        print(f"  ... (and {n - print_top_k} more)")
    
    print("\n" + "="*60 + "\n")
    return
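
The stochasticity check listed among the helpers is not defined in the cell above. A minimal sketch of such a check could look as follows. The name `is_column_stochastic` and the tolerance `tol` are assumptions of this example.

```python
import numpy as np
from scipy import sparse


def is_column_stochastic(A, tol=1e-12):
    """Return True if A has nonnegative entries and every column sums to 1."""
    A = sparse.csc_matrix(A)
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    # A.data holds only the stored (nonzero) entries; guard against an empty matrix
    nonnegative = A.nnz == 0 or A.data.min() >= 0
    return bool(nonnegative and np.allclose(col_sums, 1.0, rtol=0.0, atol=tol))


stochastic = is_column_stochastic(np.array([[0.0, 1.0], [1.0, 0.0]]))
substochastic = is_column_stochastic(np.array([[0.5, 0.0], [0.5, 0.0]]))
print(stochastic, substochastic)
```

A matrix with a dangling column, like the second one, fails the check because that column sums to 0.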

## 4. Testing on Standard Graphs

As requested, we test the implementation on the two reference graphs provided in the paper (Figures 2.1 and 2.2) and the `hollins.dat` dataset.

In [15]:
# Analysis of Graph 1 (Paper Fig 2.1)
analyze_graph("Graphs/graph1.dat", m)

# Analysis of Graph 2 (Paper Fig 2.2)
analyze_graph("Graphs/graph2.dat", m)

# Analysis of Hollins Dataset
# We limit output to top 15 results to keep the notebook clean
analyze_graph("Graphs/hollins.dat", m, print_top_k=15)

Analyzing Graphs/graph1.dat ...
  - No dangling nodes detected.
  Converged in 19 iterations
  PageRank scores (Top results):
  ----------------------------------------
  1. Node1               : 0.368151
  2. Node3               : 0.287962
  3. Node4               : 0.202078
  4. Node2               : 0.141809


Analyzing Graphs/graph2.dat ...
  - No dangling nodes detected.
  Converged in 2 iterations
  PageRank scores (Top results):
  ----------------------------------------
  1. Node4               : 0.285000
  2. Node3               : 0.285000
  3. Node2               : 0.200000
  4. Node1               : 0.200000
  5. Node5               : 0.030000


Analyzing Graphs/hollins.dat ...
  Converged in 206 iterations
  PageRank scores (Top results):
  ----------------------------------------
  1. http://www.hollins.edu/: 0.017669
  2. http://www.hollins.edu/admissions/visit/visit.htm: 0.010238
  3. http://www.hollins.edu/about/about_tour.htm: 0.009457
  4. http://www.hollins.edu/htdig

## 5. Discussion of the Results

The PageRank algorithm was applied to the reference graphs and the Hollins University dataset with a damping factor $m=0.15$. The results are interpreted below through the lens of spectral graph theory and Markov chains.

### Analysis of Graph 1 (Figure 2.1)
**Convergence:** The power iteration method converged in **19 iterations** to a residual norm $< 10^{-6}$.

**Spectral Interpretation:**
The ranking vector $x$ represents the **stationary distribution** of the random walk defined by the Google Matrix $M$.
* **Dominant Eigenvector:** Node 1 achieves the highest score ($x_1 \approx 0.368$). This high **eigenvector centrality** is not merely a function of in-degree (number of links) but of the quality of those links. Specifically, Node 1 receives a directed edge from Node 3 ($x_3 \approx 0.288$), effectively absorbing a significant portion of the probability mass circulating in the network.
* **Flow of Authority:** The hierarchy $x_1 > x_3 > x_4 > x_2$ illustrates the flow of importance: Node 3 acts as a key "hub" that transfers authority to Node 1. Node 2, despite being part of the strongly connected component, acts largely as a tributary, passing its mass forward without receiving high-value backlinks in return.

### Analysis of Graph 2 (Figure 2.2)
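**Convergence:** The algorithm converged in just **2 iterations**. This speed is due to the starting vector rather than to the spectrum of $M$: after a single step from the uniform vector, the iterate already essentially coincides with the stationary vector, so the second step only confirms convergence. It does not imply that the second-largest eigenvalue modulus $|\lambda_2|$ of $M$ is small. If the web of Figure 2.2 consists of two disconnected subwebs, which the scores above are consistent with, then $A$ has the eigenvalue 1 with multiplicity at least two and $|\lambda_2| = 1 - m = 0.85$.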

**Topological Analysis:**
The ranking highlights the graph's automorphisms and connectivity issues:

1.  **Symmetries and Automorphisms:**
    The pairs $\{3, 4\}$ and $\{1, 2\}$ exhibit identical scores ($x_3=x_4 \approx 0.285$, $x_1=x_2 \approx 0.200$). This confirms that the graph possesses structural symmetries where these nodes are topologically indistinguishable (i.e., swapping Node 3 with Node 4 leaves the adjacency matrix invariant).

2.  **The Page Without Backlinks (Node 5):**
    Node 5 receives no links from the rest of the graph (in-degree 0), so the corresponding row of $A$ is zero. Its score of **0.030** is therefore derived exclusively from the **teleportation term**:
    $$x_5 = \frac{m}{n} \cdot \sum_{j} x_j = \frac{0.15}{5} \cdot 1 = 0.03$$
    This result illustrates the role of the damping term: without it (i.e., if $m=0$), $M$ would reduce to the reducible matrix $A$, Node 5 would receive a score of 0, and the power iteration could fail to converge or depend on the initial vector. The term $mS$ ensures that $M$ is strictly positive (hence primitive), guaranteeing a unique positive eigenvector (Perron-Frobenius).
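
The teleportation-only score of Node 5 can be reproduced on a small example. The sketch below uses an assumed five-page link structure (two two-page cycles plus a fifth page that links in but has no backlinks). It is an illustration chosen to match the reported scores, not the contents of `graph2.dat`.

```python
import numpy as np

m = 0.15
n = 5
# Assumed link structure (source -> targets), 1-based; not read from graph2.dat
links = {1: [2], 2: [1], 3: [4], 4: [3], 5: [3, 4]}

A = np.zeros((n, n))
for source, targets in links.items():
    for target in targets:
        A[target - 1, source - 1] = 1.0 / len(targets)

v = np.ones(n) / n
x = v.copy()
for _ in range(100):
    x = (1 - m) * (A @ x) + m * v

# Row 5 of A is zero (no backlinks), so only the teleportation term m/n survives
x5 = x[4]
print(np.round(x, 3), x5)
```

Under this assumed structure, the iteration reproduces the scores listed for Graph 2, with $x_5 = m/n = 0.03$.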

### Analysis of Hollins University Dataset
**Convergence:** Convergence required **206 iterations**, significantly more than the toy examples. This indicates a smaller **spectral gap** $(1 - |\lambda_2|)$, typical of large, sparse matrices representing real-world web structures.
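
The link between contraction rate and iteration count can be made concrete with a back-of-the-envelope estimate. The sketch below assumes the error shrinks by a fixed factor per step and ignores constants. The helper names `estimated_iterations` and `implied_rate` are hypothetical, and the numbers are rough guides rather than measured eigenvalues.

```python
import math


def estimated_iterations(rate, tolerance=1e-6):
    """Smallest k with rate**k <= tolerance (constant factors ignored)."""
    return math.ceil(math.log(tolerance) / math.log(rate))


def implied_rate(iterations, tolerance=1e-6):
    """Per-step contraction factor r such that r**iterations == tolerance."""
    return tolerance ** (1.0 / iterations)


k_085 = estimated_iterations(0.85)   # iterations expected for a rate of 1 - m
r_206 = implied_rate(206)            # rate consistent with 206 iterations
print(k_085, round(r_206, 3))
```

Under these simplifying assumptions, 206 iterations correspond to a per-step contraction factor of roughly 0.93 to 0.94.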

**Structural Insights:**

1.  **Dangling Nodes Handling:**
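    The dataset contains **3189 dangling nodes** ($\approx 53\%$ of the graph). These are pages with out-degree 0 (sinks), whose columns in $A$ are zero. The implementation does not modify these columns; instead, the probability mass lost at each step is restored by renormalizing the iterate so that its entries sum to 1. This prevents the "rank sink" effect where probability mass would otherwise be trapped and lost. Note that this renormalization is related to, but in general not identical to, the common alternative of replacing each zero column with the uniform vector $\mathbf{v} = \frac{1}{n}\mathbf{1}$ (i.e., treating a dangling page as linking to every page).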

2.  **Cluster Analysis:**
    * **Global Maxima:** The homepage (`www.hollins.edu`) and top-level directories (Admissions, About) dominate the ranking. This is consistent with a "nested" website topology where most leaf nodes point back to the root, concentrating the steady-state probability at the top of the hierarchy.
    * **Local Traps (Slide Galleries):** We observe a cluster of high-ranking pages related to specific academic slides (e.g., "Sculpture/sld001.htm"). These likely form a **tightly knit strongly connected component** (a clique of pages linking to Next/Previous). In a random walk, the surfer gets "trapped" in this subgraph for a long time before teleporting out, artificially inflating the local PageRank scores.
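
The uniform-column treatment of dangling nodes mentioned in item 1 can be written without modifying $A$, by adding the mass sitting on dangling pages back through $\mathbf{v}$. The sketch below is one way to express that idea on a hypothetical three-page web. The function name `pagerank_step_dangling` is an assumption of this example, and it is not the notebook's `power_iteration_with_vector`.

```python
import numpy as np
from scipy import sparse


def pagerank_step_dangling(A, x, v, m):
    """One update x -> M x where zero columns of A are treated as equal to v."""
    col_sums = np.asarray(A.sum(axis=0)).ravel()
    dangling_mass = x[col_sums == 0].sum()   # mass sitting on dangling pages
    return (1 - m) * (A @ x + dangling_mass * v) + m * v


# Hypothetical 3-page web: 1 -> 2, 1 -> 3, 2 -> 3; page 3 is dangling
A = sparse.csc_matrix(np.array([[0,   0, 0],
                                [0.5, 0, 0],
                                [0.5, 1, 0]]))
n = A.shape[0]
v = np.ones(n) / n
x = v.copy()
for _ in range(200):
    x = pagerank_step_dangling(A, x, v, m=0.15)

total = x.sum()   # stays 1 without any renormalization
print(np.round(x, 4), total)
```

With this update the entries of the iterate keep summing to 1 at every step, so no renormalization is required.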