<a href="https://colab.research.google.com/github/chihoang811/chihoang811/blob/main/PageRank_based_Link_analysis_on_Books/Book_ratings_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***PageRank based Link Analysis on Books*** - Linh Chi HOANG

*(Without considering the similarity)*


---



##### **1. Download Datasets**



---


*   Log in to your Kaggle account and create an API token; the resulting kaggle.json file is uploaded below and used to download the datasets
*   This project uses only the ratings dataset (Books_rating.csv)





In [None]:
from google.colab import files
print("Please upload your kaggle.json file.")
files.upload()

In [2]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [3]:
!pip install kaggle



In [4]:
!kaggle datasets download mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to /content
 99% 1.05G/1.06G [00:03<00:00, 235MB/s]
100% 1.06G/1.06G [00:03<00:00, 300MB/s]


In [5]:
!unzip amazon-books-reviews.zip

Archive:  amazon-books-reviews.zip
  inflating: Books_rating.csv        
  inflating: books_data.csv          


In [6]:
import pandas as pd
book_rating = pd.read_csv("Books_rating.csv")

##### **2. Parameter settings**


---



*   Note: Because of limited RAM, this project uses only popular books, i.e. books with ratings of 4 or higher from at least 81 unique users





In [7]:
USE_SUBSAMPLE = True
MIN_RATING = 4 # keep only ratings with score >= 4
MIN_USERS_PER_BOOK = 81 # keep only books rated by at least 81 unique users
MIN_SHARED_USERS_FOR_EDGES = 2 # create an edge between two books only if at least 2 users rated both

##### **3. Data Preprocessing with dataset *book_rating***



---


*   Keep only ratings with **review/score ≥ 4.0**
*   Check for NaN values
*   Create the weight matrix W
*   Create the transition matrix M



In [8]:
# Check for NaN values
nan_check = book_rating.isnull()
print("Number of NAN values in each column:")
print(nan_check.sum())

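# Remove rows with NaN values.
# Note: dropna() without `subset` drops a row if ANY column is missing,
# so rows lacking only Price or profileName are removed as well.
book_rating = book_rating.dropna()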

Number of NAN values in each column:
Id                          0
Title                     208
Price                 2518829
User_id                561787
profileName            561905
review/helpfulness          0
review/score                0
review/time                 0
review/summary            407
review/text                 8
dtype: int64
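
The NaN counts above show that most missing values sit in `Price`, a column the later steps do not use. As a hedged illustration on made-up data (the `toy` frame below is hypothetical and not taken from Books_rating.csv), this sketch compares a plain `dropna()` with `dropna(subset=...)` restricted to the columns that matter for the graph:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from Books_rating.csv
toy = pd.DataFrame({
    "Title": ["Book A", "Book B", None, "Book C"],
    "Price": [np.nan, 9.99, 4.50, np.nan],
    "User_id": ["u1", "u2", "u3", None],
    "review/score": [5.0, 4.0, 3.0, 5.0],
})

# Drops every row that has a missing value in ANY column (Price included)
strict = toy.dropna()

# Drops only rows that are missing one of the columns used later
needed = toy.dropna(subset=["Title", "User_id"])

print(len(strict), len(needed))
```

Whether the stricter filter is acceptable depends on how much data the project can afford to lose; the subset variant is only one possible alternative.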


In [9]:
# Filter ratings (>= 4)
filtered_books = book_rating[book_rating['review/score'] >= MIN_RATING][['Title', 'User_id', 'review/score']]

In [10]:
# Count how many unique users rated each book
book_user_counts = filtered_books.groupby("Title")["User_id"].nunique()

if USE_SUBSAMPLE:
    # Keep only books with at least MIN_USERS_PER_BOOK unique users
    popular_books = book_user_counts[book_user_counts >= MIN_USERS_PER_BOOK].index
    filtered_books = filtered_books[filtered_books["Title"].isin(popular_books)]
else:
    popular_books = book_user_counts.index

In [11]:
print("Column types:")
print(filtered_books.dtypes)
filtered_books

Column types:
Title            object
User_id          object
review/score    float64
dtype: object


Unnamed: 0,Title,User_id,review/score
38660,Gods and Kings (Chronicles of the Kings #1),A7IMBNFYANPNV,5.0
38661,Gods and Kings (Chronicles of the Kings #1),A1WV4Q44JE40UF,5.0
38662,Gods and Kings (Chronicles of the Kings #1),A32MYDPSMCHT1L,5.0
38663,Gods and Kings (Chronicles of the Kings #1),AAOJ5T6VS5Z6O,5.0
38664,Gods and Kings (Chronicles of the Kings #1),A2X1OYBRURS04R,5.0
...,...,...,...
2994351,First 100 Words (Bright Baby),AIFSOGUNXFADJ,5.0
2994352,First 100 Words (Bright Baby),A2B9B5F9FQ4JJ8,4.0
2994354,First 100 Words (Bright Baby),A2G9EZ3R716RH1,4.0
2994355,First 100 Words (Bright Baby),A37P45ZC5DSAJJ,5.0


In [12]:
user_num = filtered_books["User_id"].nunique()
print(user_num)

title_num = filtered_books["Title"].nunique()
print(title_num)

60081
380


###### *Create the Transition matrix M*

In [13]:
# For each pair of books, count how many users liked both
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix, csc_matrix

# Remove duplicated (User_id, Title) pairs
df = filtered_books[["User_id", "Title"]].drop_duplicates()

# Convert User_id and Title into integer IDs
user_codes, user_index = pd.factorize(df["User_id"], sort=True)
title_codes, title_index = pd.factorize(df["Title"], sort=True)

n_users = len(user_index)
n_titles = len(title_index)

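# Build matrix B (users x titles): B[u, t] = 1 if user u liked title t
# int32 rather than int8: B.T @ B keeps the input dtype, and int8 counts
# would overflow for book pairs with more than 127 shared users
data = np.ones(len(df), dtype=np.int32)
B = coo_matrix((data, (user_codes, title_codes)), shape=(n_users, n_titles)).tocsr()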

# Shared-users matrix W (titles x titles): W[i, j] = #users who liked both titles
W = (B.T @ B).tocsr()

# Remove self-counts on the diagonal
W.setdiag(0)
W.eliminate_zeros()

print("W shape:", W.shape)
print("Nonzero edges:", W.nnz)


W shape: (380, 380)
Nonzero edges: 19724
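
To make the product `B.T @ B` concrete, here is a small sketch on a hypothetical 3-user, 3-book matrix (the values are invented for illustration). Each off-diagonal entry of the result counts the users who liked both books:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: 3 users (rows) x 3 books (columns)
B_toy = csr_matrix(np.array([
    [1, 1, 0],  # user 0 liked books 0 and 1
    [1, 1, 1],  # user 1 liked all three books
    [0, 1, 1],  # user 2 liked books 1 and 2
]))

# Entry (i, j) = number of users who liked both book i and book j
W_toy = (B_toy.T @ B_toy).toarray()

# The diagonal holds each book's own user count, which is not an edge
np.fill_diagonal(W_toy, 0)
print(W_toy)
```

Books 0 and 1 share two users, books 0 and 2 share one, and books 1 and 2 share two, so the matrix is symmetric by construction.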


In [14]:
# Keep only strong edges
W = W.multiply(W >= MIN_SHARED_USERS_FOR_EDGES)

W_df = pd.DataFrame(W.toarray(), index=title_index, columns=title_index)
W_df.head()

Unnamed: 0,13 Little Blue Envelopes,1491: New Revelations of the Americas Before Columbus,1632 (The Assiti Shards),1906,"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment","500 Low-Carb Recipes: 500 Recipes from Snacks to Dessert, That the Whole Family Will Love",A Breath of Snow and Ashes (Outlander),A Bride Most Begrudging,"A Caress of Twilight (Meredith Gentry, Book 2)",A Certain Slant of Light,...,"Wizard's First Rule (Sword of Truth, Book 1)","Words That Work: It's Not What You Say, It's What People Hear",Wuthering Heights,Year of Wonders (Turtleback School & Library Binding Edition),Yours Until Dawn,Zane's Afterburn: A Novel,Zen Shorts (Caldecott Honor Book),Zen in the Martial Arts,comeback - a mother and daughter's journey through hell and back,the Picture of Dorian Gray
13 Little Blue Envelopes,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
1491: New Revelations of the Americas Before Columbus,0,0,0,0,0,0,0,0,0,0,...,0,0,3,5,0,0,0,0,0,2
1632 (The Assiti Shards),0,0,0,0,0,0,0,0,2,0,...,3,0,0,0,0,0,0,0,0,0
1906,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
# W: sparse (n x n), W[i, j] = weight from book j to i
W = W.tocsc().astype(float)
n = W.shape[0]

col_sums = np.asarray(W.sum(axis=0)).ravel()
dangling = (col_sums == 0)

M = W.copy()

# Normalize non-dangling columns
for j in np.where(~dangling)[0]:
    start, end = M.indptr[j], M.indptr[j + 1]
    if start < end:
        M.data[start:end] /= col_sums[j]

# Fill dangling columns with the uniform distribution 1/n
if np.any(dangling):
    M = M.tolil()
    for j in np.where(dangling)[0]:
        M[:, j] = 1.0 / n
M = M.tocsr()

col_check = np.asarray(M.sum(axis=0)).ravel()
print("M shape:", M.shape)
print("Column sums in [min, max]:", col_check.min(), col_check.max())

M shape: (380, 380)
Column sums in [min, max]: 0.9999999999999987 1.0000000000000024
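
The same normalization can be sketched on a small dense matrix, which makes the handling of dangling columns easy to inspect. The 4-book matrix below is hypothetical; book 3 shares no users with any other book, so its column is all zeros:

```python
import numpy as np

# Hypothetical symmetric weight matrix; book 3 is isolated (dangling)
W_toy = np.array([
    [0.0, 2.0, 1.0, 0.0],
    [2.0, 0.0, 3.0, 0.0],
    [1.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
n = W_toy.shape[0]
col_sums = W_toy.sum(axis=0)
dangling = col_sums == 0

M_toy = np.zeros_like(W_toy)
# Divide each non-dangling column by its sum (broadcast over rows)
M_toy[:, ~dangling] = W_toy[:, ~dangling] / col_sums[~dangling]
# A dangling column jumps to every book with equal probability
M_toy[:, dangling] = 1.0 / n

print(M_toy.sum(axis=0))
```

A dense array is used here only for readability; for larger graphs a sparse representation, as in the cell above, is the more memory-friendly choice.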


##### **4. Apply PageRank-based Link Analysis**


---




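*   Update rule: $v' = \beta M v + (1-\beta)\,\mathbf{1}/n$, where $\beta$ is the damping factor (`alpha` in the code below), $M$ is the column-stochastic transition matrix, and $\mathbf{1}/n$ is the uniform teleport vector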
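
As a hedged illustration of this update, the sketch below applies it repeatedly to a hypothetical 3-node column-stochastic matrix (`M_toy`, `beta`, and the tolerance are illustrative choices, not values taken from the dataset):

```python
import numpy as np

beta = 0.85  # damping factor (assumed value for this toy run)

# Hypothetical column-stochastic matrix: column j holds the
# probabilities of moving from node j to every other node
M_toy = np.array([
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
])
n = M_toy.shape[0]
teleport = np.full(n, 1.0 / n)
v = np.full(n, 1.0 / n)  # start from the uniform distribution

for _ in range(100):
    v_next = beta * (M_toy @ v) + (1 - beta) * teleport
    delta = np.abs(v_next - v).sum()  # L1 change between iterates
    v = v_next
    if delta < 1e-8:
        break

print(v, v.sum())
```

Node 0 receives weight from both other nodes and ends up with the highest score, while the scores keep summing to 1 because every column of `M_toy` sums to 1.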





In [16]:
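def pagerank(M, alpha=0.85, tol=1e-8, max_iter=100):
    """Power iteration for PageRank on a column-stochastic matrix M."""
    if isinstance(M, pd.DataFrame):
        M = M.values

    N = M.shape[0]
    p = np.ones(N) / N         # start from the uniform distribution
    teleport = np.ones(N) / N  # uniform teleport vector

    for _ in range(max_iter):
        p_new = alpha * (M @ p) + (1 - alpha) * teleport
        delta = np.linalg.norm(p_new - p, 1)
        p = p_new  # keep the latest iterate, also on the converging step
        if delta < tol:
            break

    return p / p.sum()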

In [17]:
scores = pagerank(M)

# Attach book titles
pagerank_scores = pd.Series(scores, index=title_index, name='PageRank')

# Sort from most important book to least
pagerank_scores = pagerank_scores.sort_values(ascending=False)

pagerank_scores = pd.DataFrame(pagerank_scores)
pagerank_scores

Unnamed: 0,PageRank
Jane Eyre (Large Print),0.018190
Jane Eyre (New Windmill),0.018190
Wuthering Heights,0.016202
A Tale of Two Cities - Literary Touchstone Edition,0.015097
Great Expectations,0.014423
...,...
APA: The Easy Way! (for APA 5th edition),0.000468
Ultra Black Hair Growth II 2000 Edition,0.000468
Twenty Things Adopted Kids Wish Their Adoptive Parents Knew,0.000468
Total Workday Control Using Microsoft Outlook: The Eight Best Practices of Task and E-Mail Management,0.000468
