<a href="https://colab.research.google.com/github/chihoang811/chihoang811/blob/main/PageRank_based_Link_analysis_on_Books/Book_rating_Similarity_checked.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***PageRank based Link Analysis on Books*** - Linh Chi HOANG

*(Considering the similarity in book titles)*


---



##### **1. Download Datasets**


---




*   Log in to your Kaggle account and create an API token (kaggle.json); the cells below upload this file and use it to download the dataset
*   This project uses only the ratings dataset (Books_rating.csv)



In [None]:
from google.colab import files
print("Please upload your kaggle.json file.")
files.upload()

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!pip install kaggle



In [None]:
!kaggle datasets download mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to /content
 97% 1.03G/1.06G [00:18<00:00, 176MB/s]
100% 1.06G/1.06G [00:18<00:00, 62.4MB/s]


In [None]:
!unzip amazon-books-reviews.zip

Archive:  amazon-books-reviews.zip
  inflating: Books_rating.csv        
  inflating: books_data.csv          


In [None]:
import pandas as pd
book_rating = pd.read_csv("Books_rating.csv")

##### **2. Parameters settings**


---


*   Note: Because of limited RAM, this project uses only popular books, i.e. books that at least 81 unique users rated 4 or higher




In [None]:
USE_SUBSAMPLE = True
MIN_RATING = 4 # keep only ratings with score >= 4
MIN_USERS_PER_BOOK = 81 # keep only books rated by at least 81 unique users
MIN_SHARED_USERS_FOR_EDGES = 2 # create an edge between two books only if at least 2 users rated both

##### **3. Data Preprocessing with dataset *book_rating***


---


*   Keep only ratings with **review/score ≥ 4.0**
*   Check NaN values
*   Normalize book titles
*   Check similar book titles (using Jaccard similarity)
*   Create Weight matrix W
*   Create Transition matrix M

In [None]:
# Check for NaN values
nan_check = book_rating.isnull()
print(nan_check.sum())

# Remove rows with NaN values
# (dropna() without arguments drops a row if ANY column is NaN, including Price)
book_rating = book_rating.dropna()

Id                          0
Title                     208
Price                 2518829
User_id                561787
profileName            561905
review/helpfulness          0
review/score                0
review/time                 0
review/summary            407
review/text                 8
dtype: int64


In [None]:
# Normalize the titles (some books appear under the same title written in different forms)
import re

def normalized_title(t):
    t = t.lower().strip()
    t = re.sub(r'[^a-z0-9 ]+', '', t)   # remove punctuation
    t = re.sub(r'\s+', ' ', t)          # collapse spaces
    return t

book_rating['Title_cleaned'] = book_rating['Title'].apply(normalized_title)

# Keep ratings >= MIN_RATING and select only the columns needed later
filtered_books = book_rating[book_rating['review/score'] >= MIN_RATING][['Title_cleaned', 'User_id', 'review/score']]
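
As a quick illustration of what the normalization does, the sketch below applies the same three steps to a few made-up title variants. The function name `normalize_title_demo` and the sample strings are hypothetical and are not taken from the dataset.

```python
import re

def normalize_title_demo(t):
    # Same steps as normalized_title above:
    # lowercase and strip, drop punctuation, collapse repeated spaces
    t = t.lower().strip()
    t = re.sub(r'[^a-z0-9 ]+', '', t)
    t = re.sub(r'\s+', ' ', t)
    return t

# Hypothetical variants of the same title, plus one title with an apostrophe
samples = ["Jane Eyre (Large Print)", "  JANE EYRE:  Large Print ", "Wizard's First Rule"]
normalized = [normalize_title_demo(s) for s in samples]
print(normalized)
```

The first two variants collapse to the same string, which is what lets the later `groupby` treat them as one book. Note that removing punctuation also drops apostrophes, so "Wizard's" becomes "wizards".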

In [None]:
# Count how many unique users rated each book
book_user_counts = filtered_books.groupby("Title_cleaned")["User_id"].nunique()

if USE_SUBSAMPLE:
    # Keep only books with at least MIN_USERS_PER_BOOK unique users
    popular_books = book_user_counts[book_user_counts >= MIN_USERS_PER_BOOK].index
    filtered_books = filtered_books[filtered_books["Title_cleaned"].isin(popular_books)]
else:
    popular_books = book_user_counts.index

In [None]:
filtered_books

Unnamed: 0,Title_cleaned,User_id,review/score
38660,gods and kings chronicles of the kings 1,A7IMBNFYANPNV,5.0
38661,gods and kings chronicles of the kings 1,A1WV4Q44JE40UF,5.0
38662,gods and kings chronicles of the kings 1,A32MYDPSMCHT1L,5.0
38663,gods and kings chronicles of the kings 1,AAOJ5T6VS5Z6O,5.0
38664,gods and kings chronicles of the kings 1,A2X1OYBRURS04R,5.0
...,...,...,...
2994351,first 100 words bright baby,AIFSOGUNXFADJ,5.0
2994352,first 100 words bright baby,A2B9B5F9FQ4JJ8,4.0
2994354,first 100 words bright baby,A2G9EZ3R716RH1,4.0
2994355,first 100 words bright baby,A37P45ZC5DSAJJ,5.0


###### *Compute title similarity with the Jaccard function*

In [None]:
# Number of unique titles
nunique_titles = filtered_books["Title_cleaned"].nunique()
print(nunique_titles)

378


In [None]:
# Convert each normalized title into a set of words
def title_to_set(title):
    return set(title.split())

In [None]:
# Jaccard similarity function
def jaccard_similarity(set1, set2):
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)
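
A small worked example may help here. The sketch below recomputes the Jaccard similarity for two normalized titles that both appear in the final ranking; `jaccard_demo` is a hypothetical copy of the function above so that the snippet runs on its own.

```python
def jaccard_demo(set1, set2):
    # Same definition as jaccard_similarity above: |intersection| / |union|
    if not set1 and not set2:
        return 0.0
    return len(set1 & set2) / len(set1 | set2)

a = set("jane eyre new windmill".split())
b = set("jane eyre large print".split())

# Shared words: {jane, eyre} -> 2; all distinct words: 6
sim_ab = jaccard_demo(a, b)
print(round(sim_ab, 3))
```

The two titles share 2 of 6 distinct words, so the similarity is 2/6 ≈ 0.333. This is below the 0.4 threshold used in the merge step, which is consistent with both editions remaining separate entries in the results.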

In [None]:
# Apply Jaccard function on all unique titles
import itertools

unique_titles = filtered_books["Title_cleaned"].unique()

# Convert each title to a set of words
title_sets = {t: title_to_set(t) for t in unique_titles}

similarities = []

for t1, t2 in itertools.combinations(unique_titles, 2):
    s1 = title_sets[t1]
    s2 = title_sets[t2]
    sim = jaccard_similarity(s1, s2)
    similarities.append((t1, t2, sim))

similar_df = pd.DataFrame(similarities, columns=['Title1', 'Title2', 'Sim'])
similar_df

Unnamed: 0,Title1,Title2,Sim
0,gods and kings chronicles of the kings 1,the mayor of casterbridge,0.222222
1,gods and kings chronicles of the kings 1,hyperspace a scientific odyssey through parall...,0.111111
2,gods and kings chronicles of the kings 1,solitary witch the ultimate book of shadows fo...,0.133333
3,gods and kings chronicles of the kings 1,stitch n bitch crochet the happy hooker,0.076923
4,gods and kings chronicles of the kings 1,why men love bitches from doormat to dreamgirl...,0.000000
...,...,...,...
71248,13 little blue envelopes,six days of war june 1967 and the making of th...,0.000000
71249,13 little blue envelopes,first 100 words bright baby,0.000000
71250,1491 new revelations of the americas before co...,six days of war june 1967 and the making of th...,0.111111
71251,1491 new revelations of the americas before co...,first 100 words bright baby,0.000000


###### *Merge titles whose Jaccard similarity ≥ 0.4*

In [None]:
# Build clusters of similar titles
# Keep only pairs with Jaccard >= 0.4
jaccard_filtered = similar_df[similar_df['Sim'] >= 0.4].copy()

clusters = []
visited = set()

# We will treat each connected group of titles as a cluster
for t in pd.unique(jaccard_filtered[['Title1', 'Title2']].values.ravel()):
    if t in visited:
        continue

    # start a new cluster with this title
    group = {t}
    changed = True

    # keep adding titles that are linked to any title in the group
    while changed:
        changed = False
        for _, row in jaccard_filtered.iterrows():
            a, b = row['Title1'], row['Title2']
            if a in group and b not in group:
                group.add(b)
                changed = True
            elif b in group and a not in group:
                group.add(a)
                changed = True

    visited |= group
    clusters.append(group)

print("Number of clusters:", len(clusters))


Number of clusters: 12
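
The loop above rescans every filtered pair until a group stops growing. For larger inputs, the same connected groups could be found in one pass with a union-find structure. The following is only an alternative sketch, not the method used in this notebook; `connected_groups` and the toy pairs are hypothetical.

```python
def connected_groups(pairs):
    """Group items that are linked by any chain of pairs (union-find sketch)."""
    parent = {}

    def find(x):
        # Return the representative of x, registering x on first sight
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # merge the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Toy input: a-b and b-c are linked, so a, b, c form one group
toy_pairs = [("a", "b"), ("b", "c"), ("d", "e")]
toy_groups = connected_groups(toy_pairs)
print(sorted(sorted(g) for g in toy_groups))
```

With the notebook's data, the pairs would be the `(Title1, Title2)` rows of `jaccard_filtered`. Either approach merges titles transitively, so two titles can end up in the same cluster even if their direct similarity is below 0.4.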


In [None]:
# Choose one canonical title per cluster
def choose_canonical(group):
    return min(group, key=len)

mapping = {}

for group in clusters:
    canon = choose_canonical(group)
    for t in group:
        mapping[t] = canon

filtered_books['Title_final'] = filtered_books['Title_cleaned'].map(mapping).fillna(filtered_books['Title_cleaned'])
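
To make the `map(...).fillna(...)` step concrete, here is a toy sketch with one hypothetical cluster and one title that belongs to no cluster. All titles and variable names in it are made up for illustration.

```python
import pandas as pd

# One hypothetical cluster of similar titles
toy_clusters = [{"jane eyre", "jane eyre illustrated"}]

# Map every title in a cluster to the shortest title of that cluster
toy_mapping = {t: min(g, key=len) for g in toy_clusters for t in g}

titles = pd.Series(["jane eyre illustrated", "jane eyre", "wuthering heights"])

# Titles missing from the mapping become NaN after map(), then fall back to themselves
final = titles.map(toy_mapping).fillna(titles)
print(final.tolist())
```

Titles outside every cluster are left unchanged. One caveat of picking the shortest title: if two titles in a cluster have the same length, the one returned by `min` depends on the iteration order of the set, so the canonical title may differ between runs.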


In [None]:
filtered_books = filtered_books[['Title_final', 'User_id', 'review/score']]
filtered_books

Unnamed: 0,Title_final,User_id,review/score
38660,gods and kings chronicles of the kings 1,A7IMBNFYANPNV,5.0
38661,gods and kings chronicles of the kings 1,A1WV4Q44JE40UF,5.0
38662,gods and kings chronicles of the kings 1,A32MYDPSMCHT1L,5.0
38663,gods and kings chronicles of the kings 1,AAOJ5T6VS5Z6O,5.0
38664,gods and kings chronicles of the kings 1,A2X1OYBRURS04R,5.0
...,...,...,...
2994351,first 100 words bright baby,AIFSOGUNXFADJ,5.0
2994352,first 100 words bright baby,A2B9B5F9FQ4JJ8,4.0
2994354,first 100 words bright baby,A2G9EZ3R716RH1,4.0
2994355,first 100 words bright baby,A37P45ZC5DSAJJ,5.0


###### *Create the Transition matrix M*

In [None]:
# For each pair of books, count how many users liked both
import numpy as np

# Build a user-book matrix (rows are users, columns are books, value = 1 if the user rated the book >= 4)
book_user = filtered_books.drop_duplicates().pivot_table(
    index='User_id',
    columns='Title_final',
    values='Title_final',
    aggfunc='count',
    fill_value=0
)
book_user = (book_user > 0).astype(int)

# Build Weighted adjacency matrix W where W[i,j] is the number of shared users between book i and book j
W = book_user.T.dot(book_user)

print(W.shape)
W.head()

(342, 342)


Title_final,13 little blue envelopes,1491 new revelations of the americas before columbus,1632 the assiti shards,1906,23 minutes in hell one mans story about what he saw heard and felt in that place of torment,500 lowcarb recipes 500 recipes from snacks to dessert that the whole family will love,a breath of snow and ashes outlander,a bride most begrudging,a caress of twilight meredith gentry book 2,a certain slant of light,...,where the sidewalk ends poems and drawings,who wrote the bible,why men love bitches from doormat to dreamgirl a womans guide to holding her own in a relationship,wicked the grimmerie a behindthescenes look at the hit broadway musical,wizards first rule sword of truth book 1,words that work its not what you say its what people hear,wuthering heights,yours until dawn,zen in the martial arts,zen shorts caldecott honor book
Title_final,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
13 little blue envelopes,109,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,1
1491 new revelations of the americas before columbus,0,345,1,0,0,0,0,0,0,0,...,1,1,0,0,1,1,3,0,0,0
1632 the assiti shards,0,1,141,0,0,0,1,1,2,0,...,0,0,0,0,3,0,0,1,0,0
1906,0,0,0,88,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23 minutes in hell one mans story about what he saw heard and felt in that place of torment,0,0,0,0,346,0,0,0,0,1,...,1,0,1,0,0,0,0,0,0,0
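
The product `book_user.T.dot(book_user)` counts co-occurrences. A minimal sketch on a made-up 3-user, 3-book matrix shows how to read the result; `B` and `W_toy` are hypothetical names.

```python
import numpy as np

# Rows are users, columns are books; 1 means the user rated the book >= 4 (toy data)
B = np.array([
    [1, 1, 0],   # user 0 liked books 0 and 1
    [1, 1, 0],   # user 1 liked books 0 and 1
    [0, 1, 1],   # user 2 liked books 1 and 2
])

# W_toy[i, j] = number of users who liked both book i and book j
W_toy = B.T @ B
print(W_toy)
```

The diagonal holds the number of users per book (2, 3 and 1 here), which is why the diagonal of `W` above shows values such as 109 and 345. The off-diagonal entries are the shared-user counts that become edge weights.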


In [None]:
# Remove self-counts on the diagonal
np.fill_diagonal(W.values, 0)

# Keep only edges with at least MIN_SHARED_USERS_FOR_EDGES shared users
W = W.where(W >= MIN_SHARED_USERS_FOR_EDGES, 0)

W.head()

Title_final,13 little blue envelopes,1491 new revelations of the americas before columbus,1632 the assiti shards,1906,23 minutes in hell one mans story about what he saw heard and felt in that place of torment,500 lowcarb recipes 500 recipes from snacks to dessert that the whole family will love,a breath of snow and ashes outlander,a bride most begrudging,a caress of twilight meredith gentry book 2,a certain slant of light,...,where the sidewalk ends poems and drawings,who wrote the bible,why men love bitches from doormat to dreamgirl a womans guide to holding her own in a relationship,wicked the grimmerie a behindthescenes look at the hit broadway musical,wizards first rule sword of truth book 1,words that work its not what you say its what people hear,wuthering heights,yours until dawn,zen in the martial arts,zen shorts caldecott honor book
Title_final,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
13 little blue envelopes,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
1491 new revelations of the americas before columbus,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,3,0,0,0
1632 the assiti shards,0,0,0,0,0,0,0,0,2,0,...,0,0,0,0,3,0,0,0,0,0
1906,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
23 minutes in hell one mans story about what he saw heard and felt in that place of torment,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Transition probability matrix M
A = W.to_numpy().astype(float)   # A[i, j] = weight from book j to book i

# Column sums: total outgoing weight from each book j
col_sums = A.sum(axis=0)

# Handle dangling nodes (books with no outgoing edges)
dangling = (col_sums == 0)
if np.any(dangling):
    A[:, dangling] = 1.0  # For dangling columns, pretend they link equally to all books
    col_sums = A.sum(axis=0)

# Normalize columns so that each column sums to 1
M_np = A / col_sums

M = pd.DataFrame(M_np, index=W.index, columns=W.columns)
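
The dangling-node handling and the column normalization can be checked on a tiny example. The sketch below uses a made-up 3-book weight matrix in which the third book has no edges; the `_toy` names are hypothetical.

```python
import numpy as np

# Toy weights: books 0 and 1 share 2 users, book 2 shares none (dangling)
A_toy = np.array([
    [0.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

col_sums_toy = A_toy.sum(axis=0)
dangling_toy = col_sums_toy == 0

# A dangling column is replaced by ones, i.e. it links equally to every book
A_toy[:, dangling_toy] = 1.0

# Dividing by a 1-D array of column sums scales each column j by its own sum
M_toy = A_toy / A_toy.sum(axis=0)
print(M_toy)
```

After this step every column sums to 1, and the dangling column becomes uniform (1/3 per entry here). Without the fix, dividing a zero column by its zero sum would produce NaN values.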

In [None]:
print(M.shape)
print("Column sums (should be ~1):")
print(M.sum(axis=0))

(342, 342)
Column sums (should be ~1):
Title_final
13 little blue envelopes                                                                       1.0
1491 new revelations of the americas before columbus                                           1.0
1632 the assiti shards                                                                         1.0
1906                                                                                           1.0
23 minutes in hell one mans story about what he saw heard and felt in that place of torment    1.0
                                                                                              ... 
words that work its not what you say its what people hear                                      1.0
wuthering heights                                                                              1.0
yours until dawn                                                                               1.0
zen in the martial arts                                   

##### **4. Apply PageRank-based Link Analysis**


---




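*   Update rule: `v' = βMv + (1-β)e/n`, where `M` is the transition matrix, `β` is the damping factor (called `alpha` in the code, default 0.85), `e` is the all-ones vector and `n` is the number of books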



In [None]:
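def pagerank(M, alpha=0.85, tol=1e-8, max_iter=100):
    """Power iteration for p' = alpha * M p + (1 - alpha) / N."""
    M = M.to_numpy() if isinstance(M, pd.DataFrame) else M
    N = M.shape[0]
    p = np.ones(N) / N               # start from the uniform distribution
    teleport = (1 - alpha) / N       # uniform teleportation term

    for _ in range(max_iter):
        p_new = alpha * (M @ p) + teleport
        delta = np.linalg.norm(p_new - p, 1)   # L1 change between iterations
        p = p_new                              # keep the latest iterate
        if delta < tol:
            break
    return p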
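
As a sanity check of the update rule, the sketch below runs the same power iteration on a made-up 3-book transition matrix and verifies that the result is a fixed point. `pagerank_demo` and `M_toy` are hypothetical names, and the tolerance and iteration limit are chosen only for this toy case.

```python
import numpy as np

def pagerank_demo(M, beta=0.85, tol=1e-10, max_iter=200):
    # Power iteration for v' = beta * M v + (1 - beta) / n
    n = M.shape[0]
    v = np.ones(n) / n
    for _ in range(max_iter):
        v_new = beta * (M @ v) + (1 - beta) / n
        done = np.abs(v_new - v).sum() < tol
        v = v_new
        if done:
            break
    return v

# Hypothetical column-stochastic matrix: each column sums to 1
M_toy = np.array([
    [0.0, 0.5, 1/3],
    [1.0, 0.0, 1/3],
    [0.0, 0.5, 1/3],
])

v = pagerank_demo(M_toy)
residual = np.abs(v - (0.85 * (M_toy @ v) + 0.15 / 3)).sum()
print(v, v.sum(), residual)
```

Because every column of the matrix sums to 1, the scores stay a probability distribution (they sum to 1) at every iteration. Book 1 receives all of book 0's weight plus a share of book 2's, so it ends up with the highest score in this toy graph.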

In [None]:
scores = pagerank(M)

# Attach book titles
pagerank_scores = pd.Series(scores, index=M.index, name='PageRank')

# Sort from most important book to least
pagerank_scores = pagerank_scores.sort_values(ascending=False)

pagerank_scores = pd.DataFrame(pagerank_scores)
pagerank_scores

Unnamed: 0_level_0,PageRank
Title_final,Unnamed: 1_level_1
the tao of pooh,0.046578
jane eyre new windmill,0.037223
jane eyre large print,0.037223
a christmas carol in prose being a ghost story of christmas collected works of charles dickens,0.025420
a christmas carol classic fiction,0.025420
...,...
cliffstestprep math review for standardized tests cliffs test prep math review standardized,0.000523
the wealthy spirit daily affirmations for financial stress reduction,0.000523
algebra survival guide a conversational guide for the thoroughly befuddled,0.000523
adobe photoshop cs2 oneonone,0.000523
