# Task 3 — PageRank with Varying Damping Factors
This notebook runs PageRank on the citation network built from `dblp_subset.json` for nine damping factors, from 0.15 to 0.95 in steps of 0.10.

For each damping factor:
1. Compute PageRank scores for all papers.
2. Select the top-50 papers by PageRank.
3. Compute citation counts (in-degree) for these top-50 papers.
4. Calculate Pearson correlation between PageRank scores and citation counts.

Then report:
- Correlation values for all damping factors.
- Top-10 papers (with PageRank scores) for best and worst correlation values.
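
The four steps above can be sketched on a toy graph using only the standard library. This is a minimal illustration, not the NetworkX implementation used in the cells below: the helper names `pagerank`, `pearson`, and `top_k_correlation`, the six-paper edge list, and the choice to spread the rank of papers without references uniformly are all assumptions made for the example.

```python
from math import sqrt


def pagerank(edges, n, alpha, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank on nodes 0..n-1 with uniform teleportation."""
    out_links = [[] for _ in range(n)]
    for src, dst in sorted(set(edges)):  # ignore duplicate edges
        out_links[src].append(dst)

    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Assumption: rank held by papers with no references is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if not out_links[i])
        base = (1.0 - alpha) / n + alpha * dangling / n
        new_rank = [base] * n
        for i, targets in enumerate(out_links):
            if targets:
                share = alpha * rank[i] / len(targets)
                for j in targets:
                    new_rank[j] += share
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float('nan')
    return cov / sqrt(var_x * var_y)


def top_k_correlation(edges, n, alpha, k):
    """Steps 1-4: rank, take the top-k, count citations, correlate."""
    rank = pagerank(edges, n, alpha)
    in_degree = [0] * n
    for _, dst in set(edges):
        in_degree[dst] += 1
    top = sorted(range(n), key=lambda i: rank[i], reverse=True)[:k]
    corr = pearson([rank[i] for i in top], [in_degree[i] for i in top])
    return top, corr


# Hypothetical toy network: edge (i, j) means paper i cites paper j.
toy_edges = [(1, 0), (2, 0), (3, 0), (2, 1), (3, 1), (4, 2), (5, 2), (5, 4)]

for alpha in (0.15, 0.55, 0.95):
    top, corr = top_k_correlation(toy_edges, n=6, alpha=alpha, k=4)
    print(f'alpha={alpha:.2f} top-4={top} correlation={corr:.4f}')
```

The toy graph is far too small for the correlation values to mean anything; the point is only to make the data flow of the analysis explicit before it is applied to the full network.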

In [11]:
# Imports
import json
import os
import time
import numpy as np
import networkx as nx
from scipy.stats import pearsonr
import pandas as pd

In [12]:
# Load the subset file and build the citation graph
subset_path = os.path.join(os.getcwd(), 'dblp_subset.json')
print('Reading', subset_path)
start = time.time()
papers = []
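with open(subset_path, 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            papers.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip malformed records only; let other errors surface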

n = len(papers)
print(f'Read {n} papers in {time.time()-start:.2f}s')

# Build mappings
id_to_idx = {}
titles = []
refs_by_id = []
for idx, p in enumerate(papers):
    pid = p.get('id')
    id_to_idx[pid] = idx
    titles.append(p.get('title', '') or '')
    refs_by_id.append(p.get('references', []) or [])

print('Built mappings for', n, 'papers')

Reading /Users/ankushchhabra/Downloads/Data Mining Assignment2/dblp_subset.json
Read 49572 papers in 0.87s
Built mappings for 49572 papers


In [3]:
# Build directed citation graph using NetworkX
# Edge (i -> j) means paper i cites paper j
print('Building directed graph...')
start = time.time()
G = nx.DiGraph()
G.add_nodes_from(range(n))

for i, refs in enumerate(refs_by_id):
    for ref_id in refs:
        j = id_to_idx.get(ref_id)
        if j is not None:
            G.add_edge(i, j)

print(f'Graph built in {time.time()-start:.2f}s')
print(f'Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}')

Building directed graph...
Graph built in 0.38s
Nodes: 49572, Edges: 163309


In [4]:
# Function to compute PageRank and correlations
def compute_pagerank_and_correlation(G, alpha, k=50):
    """
    Compute PageRank with damping factor alpha.
    Select top-k papers by PageRank.
    Compute in-degree (citation count) for top-k papers.
    Return Pearson correlation between PageRank and in-degree.
    """
    # Compute PageRank
    pr = nx.pagerank(G, alpha=alpha)
    
    # Get top-k papers by PageRank
    top_k_papers = sorted(pr.items(), key=lambda x: x[1], reverse=True)[:k]
    top_k_indices = [idx for idx, _ in top_k_papers]
    top_k_pr_scores = [score for _, score in top_k_papers]
    
    # Compute in-degree (citation count) for top-k papers
    in_degrees = [G.in_degree(idx) for idx in top_k_indices]
    
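    # Compute Pearson correlation; it is undefined if either series is constant,
    # in which case we fall back to 0.0
    if (len(top_k_pr_scores) > 1
            and np.std(top_k_pr_scores) > 0
            and np.std(in_degrees) > 0):
        corr, _ = pearsonr(top_k_pr_scores, in_degrees)
    else:
        corr = 0.0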
    
    return pr, top_k_papers, top_k_indices, in_degrees, corr

print('Function defined')

Function defined


In [5]:
# Compute PageRank and correlations for all damping factors
# linspace fixes the number of values; arange with a float step can gain or lose the endpoint
damping_factors = np.linspace(0.15, 0.95, 9)
damping_factors = np.round(damping_factors, 2)

print('Damping factors:', damping_factors)

results = {}
k = 50

for alpha in damping_factors:
    print(f'\nComputing PageRank with alpha={alpha}...')
    start = time.time()
    pr, top_k_papers, top_k_indices, in_degrees, corr = compute_pagerank_and_correlation(G, alpha, k)
    elapsed = time.time() - start
    
    results[alpha] = {
        'pagerank': pr,
        'top_k_papers': top_k_papers,
        'top_k_indices': top_k_indices,
        'in_degrees': in_degrees,
        'correlation': corr
    }
    print(f'  Correlation: {corr:.4f} (computed in {elapsed:.2f}s)')

Damping factors: [0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95]

Computing PageRank with alpha=0.15...
  Correlation: 0.7858 (computed in 0.12s)

Computing PageRank with alpha=0.25...
  Correlation: 0.7688 (computed in 0.19s)

Computing PageRank with alpha=0.35...
  Correlation: 0.7457 (computed in 0.18s)

Computing PageRank with alpha=0.45...
  Correlation: 0.7323 (computed in 0.10s)

Computing PageRank with alpha=0.55...
  Correlation: 0.7128 (computed in 0.10s)

Computing PageRank with alpha=0.65...
  Correlation: 0.6889 (computed in 0.10s)

Computing PageRank with alpha=0.75...
  Correlation: 0.6742 (computed in 0.10s)

Computing PageRank with alpha=0.85...
  Correlation: 0.5542 (computed in 0.10s)

Computing PageRank with alpha=0.95...
  Correlation: 0.6400 (computed in 0.10s)


In [6]:
# Report: Correlation values for all damping factors
print('\n' + '='*70)
print('CORRELATION VALUES FOR EACH DAMPING FACTOR')
print('='*70)
print(f'{"Damping Factor":<20} {"Pearson Correlation":<20}')
print('-'*70)

correlations_dict = {}
for alpha in damping_factors:
    corr = results[alpha]['correlation']
    correlations_dict[alpha] = corr
    print(f'{alpha:<20.2f} {corr:<20.6f}')


CORRELATION VALUES FOR EACH DAMPING FACTOR
Damping Factor       Pearson Correlation 
----------------------------------------------------------------------
0.15                 0.785755            
0.25                 0.768759            
0.35                 0.745674            
0.45                 0.732333            
0.55                 0.712790            
0.65                 0.688880            
0.75                 0.674215            
0.85                 0.554154            
0.95                 0.640048            


In [7]:
# Find best and worst correlation values
best_alpha = max(correlations_dict, key=correlations_dict.get)
worst_alpha = min(correlations_dict, key=correlations_dict.get)

best_corr = correlations_dict[best_alpha]
worst_corr = correlations_dict[worst_alpha]

print(f'\nBest correlation: alpha={best_alpha}, correlation={best_corr:.6f}')
print(f'Worst correlation: alpha={worst_alpha}, correlation={worst_corr:.6f}')


Best correlation: alpha=0.15, correlation=0.785755
Worst correlation: alpha=0.85, correlation=0.554154


In [8]:
# Report: Top-10 papers for BEST correlation
print('\n' + '='*70)
print(f'TOP-10 PAPERS FOR BEST CORRELATION (alpha={best_alpha})')
print('='*70)
print(f'{"S.No.":<6} {"Title":<40} {"PageRank Score":<20}')
print('-'*70)

best_top_k = results[best_alpha]['top_k_papers'][:10]
for rank, (idx, pr_score) in enumerate(best_top_k, start=1):
    title = titles[idx]
    # Truncate title if too long
    if len(title) > 40:
        title = title[:37] + '...'
    print(f'{rank:<6} {title:<40} {pr_score:<20.8f}')


TOP-10 PAPERS FOR BEST CORRELATION (alpha=0.15)
S.No.  Title                                    PageRank Score      
----------------------------------------------------------------------
1      LIBSVM: A library for support vector ... 0.00084804          
2      The Pascal Visual Object Classes (VOC... 0.00027491          
3      Object Detection with Discriminativel... 0.00022282          
4      Community detection in graphs            0.00016004          
5      Fast and Scalable Local Kernel Machines  0.00014622          
6      What is Twitter, a social network or ... 0.00014267          
7      Reducibility Among Combinatorial Prob... 0.00013088          
8      Talking about tactile experiences        0.00011013          
9      ImageNet Classification with Deep Con... 0.00010258          
10     KEGG for representation and analysis ... 0.00009980          


In [9]:
# Report: Top-10 papers for WORST correlation
print('\n' + '='*70)
print(f'TOP-10 PAPERS FOR WORST CORRELATION (alpha={worst_alpha})')
print('='*70)
print(f'{"S.No.":<6} {"Title":<40} {"PageRank Score":<20}')
print('-'*70)

worst_top_k = results[worst_alpha]['top_k_papers'][:10]
for rank, (idx, pr_score) in enumerate(worst_top_k, start=1):
    title = titles[idx]
    # Truncate title if too long
    if len(title) > 40:
        title = title[:37] + '...'
    print(f'{rank:<6} {title:<40} {pr_score:<20.8f}')


TOP-10 PAPERS FOR WORST CORRELATION (alpha=0.85)
S.No.  Title                                    PageRank Score      
----------------------------------------------------------------------
1      LIBSVM: A library for support vector ... 0.00892469          
2      Fast and Scalable Local Kernel Machines  0.00727250          
3      The Pascal Visual Object Classes (VOC... 0.00448177          
4      Factored Shapes and Appearances for P... 0.00390998          
5      Object Detection with Discriminativel... 0.00262820          
6      ClassCut for unsupervised class segme... 0.00219337          
7      Community detection in graphs            0.00101820          
8      A Singular Value Thresholding Algorit... 0.00095931          
9      TwitterRank: finding topic-sensitive ... 0.00094809          
10     Guaranteed Minimum-Rank Solutions of ... 0.00090303          


In [10]:
# Save results to CSV for easy reference
import csv

# Save correlation table
corr_csv_path = os.path.join(os.getcwd(), 'pagerank_correlations.csv')
with open(corr_csv_path, 'w', newline='', encoding='utf-8') as f:
    w = csv.writer(f)
    w.writerow(['Damping_Factor', 'Pearson_Correlation'])
    for alpha in damping_factors:
        w.writerow([f'{alpha:.2f}', f'{correlations_dict[alpha]:.6f}'])

print(f'Saved correlation table to {corr_csv_path}')

# Save best top-10
best_csv_path = os.path.join(os.getcwd(), 'pagerank_best_top10.csv')
with open(best_csv_path, 'w', newline='', encoding='utf-8') as f:
    w = csv.writer(f)
    w.writerow(['Rank', 'Paper_Index', 'Title', 'PageRank_Score'])
    for rank, (idx, pr_score) in enumerate(best_top_k, start=1):
        w.writerow([rank, idx, titles[idx], f'{pr_score:.8f}'])

print(f'Saved best top-10 to {best_csv_path}')

# Save worst top-10
worst_csv_path = os.path.join(os.getcwd(), 'pagerank_worst_top10.csv')
with open(worst_csv_path, 'w', newline='', encoding='utf-8') as f:
    w = csv.writer(f)
    w.writerow(['Rank', 'Paper_Index', 'Title', 'PageRank_Score'])
    for rank, (idx, pr_score) in enumerate(worst_top_k, start=1):
        w.writerow([rank, idx, titles[idx], f'{pr_score:.8f}'])

print(f'Saved worst top-10 to {worst_csv_path}')

Saved correlation table to /Users/ankushchhabra/Downloads/Data Mining Assignment2/pagerank_correlations.csv
Saved best top-10 to /Users/ankushchhabra/Downloads/Data Mining Assignment2/pagerank_best_top10.csv
Saved worst top-10 to /Users/ankushchhabra/Downloads/Data Mining Assignment2/pagerank_worst_top10.csv


## Summary

This notebook computes PageRank scores for the citation network using various damping factors and measures the correlation between PageRank and citation counts (in-degree) for the top-50 papers.

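Key insights:
- The damping factor (alpha) is the probability that the random surfer follows a citation rather than jumping to a random paper, so it controls how far authority propagates through the network.
- Lower damping factors (0.15–0.35) keep scores close to the direct citation signal (in-degree); the correlation is highest at alpha = 0.15 (0.7858).
- Higher damping factors (0.75–0.95) give more weight to indirect paths and overall network structure; the correlation is lowest at alpha = 0.85 (0.5542).
- The decline is not strictly monotonic: the correlation falls steadily from 0.7858 at alpha = 0.15 to 0.6742 at alpha = 0.75, drops to 0.5542 at alpha = 0.85, and rises again to 0.6400 at alpha = 0.95.
- The correlation is computed over the top-50 papers only, so it measures agreement among the highest-ranked papers rather than across the whole network; higher values mean PageRank and raw citation counts order those papers similarly.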
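
The link between the damping factor and direct citations can be made explicit with a short derivation. It is a sketch under the standard formulation with uniform teleportation; the notation is introduced here and does not appear in the notebook. Let $N$ be the number of papers, $\alpha$ the damping factor, $d$ the number of papers with no outgoing references, and $P$ the row-stochastic transition matrix with $P_{ij} = 1/\mathrm{outdeg}(i)$ when paper $i$ cites paper $j$ and $P_{ij} = 1/N$ for every $j$ when paper $i$ has no references.

$$
\mathbf{p} = \alpha P^{\top}\mathbf{p} + \frac{1-\alpha}{N}\mathbf{1}
\quad\Longrightarrow\quad
\mathbf{p} = \frac{1-\alpha}{N}\sum_{k=0}^{\infty}\alpha^{k}\,(P^{\top})^{k}\,\mathbf{1}
$$

$$
p_j = \frac{1-\alpha}{N}\left[\,1 + \alpha\left(\sum_{i \to j}\frac{1}{\mathrm{outdeg}(i)} + \frac{d}{N}\right)\right] + O(\alpha^{2})
$$

The $k$-th term of the series collects citation chains of length $k$ and is weighted by $\alpha^{k}$. For small $\alpha$ the longer chains are suppressed, so the score of paper $j$ is driven mainly by its direct citations, each discounted by the citing paper's out-degree. As $\alpha$ approaches 1, long chains contribute almost as much as direct citations, and the ranking can diverge from raw in-degree.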

Results are saved to CSV files:
- `pagerank_correlations.csv` — all damping factors and correlations
- `pagerank_best_top10.csv` — top-10 papers for best correlation
- `pagerank_worst_top10.csv` — top-10 papers for worst correlation