<a href="https://colab.research.google.com/github/chihoang811/chihoang811/blob/main/PageRank_based_Link_analysis_on_Books/Book_rating_no_Similarity_checked.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***PageRank based Link Analysis on Books*** - Linh Chi HOANG

*(Without considering the similarity)*


---



##### **1. Download Datasets**



---


*   Log in to your Kaggle account and download your API token (`kaggle.json`), then upload it in the cell below so that the dataset can be downloaded
*   This project uses only the ratings dataset (`Books_rating.csv`)





In [None]:
from google.colab import files
print("Please upload your kaggle.json file.")
files.upload()

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!pip install kaggle



In [None]:
!kaggle datasets download mohamedbakhet/amazon-books-reviews

Dataset URL: https://www.kaggle.com/datasets/mohamedbakhet/amazon-books-reviews
License(s): CC0-1.0
Downloading amazon-books-reviews.zip to /content
 96% 1.02G/1.06G [00:06<00:00, 101MB/s]
100% 1.06G/1.06G [00:07<00:00, 162MB/s]


In [None]:
!unzip amazon-books-reviews.zip

Archive:  amazon-books-reviews.zip
  inflating: Books_rating.csv        
  inflating: books_data.csv          


In [None]:
import pandas as pd
book_rating = pd.read_csv("Books_rating.csv")

##### **2. Parameter settings**


---



*   Note: Because of limited RAM, this project uses only popular books, i.e. books rated (score ≥ 4) by at least 81 unique users
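Since RAM is the constraint, another option is to load only the columns the later steps use. The following is a minimal sketch, assuming the column names `Title`, `User_id` and `review/score`; the in-memory CSV stands in for `Books_rating.csv` and its rows are made up. Note that the notebook's `dropna()` looks at every column, so loading fewer columns would change which rows survive that step.

```python
import io
import pandas as pd

# Made-up rows standing in for Books_rating.csv (subset of the real column names).
csv_text = io.StringIO(
    "Id,Title,Price,User_id,review/score\n"
    "1,Book A,,U1,5.0\n"
    "2,Book A,9.99,U2,3.0\n"
    "3,Book B,,U1,4.0\n"
)

# Load only the columns needed downstream; the others are never parsed into the frame.
needed = ["Title", "User_id", "review/score"]
ratings = pd.read_csv(csv_text, usecols=needed)

print(ratings.columns.tolist())
print(ratings.shape)
```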





In [None]:
USE_SUBSAMPLE = True
MIN_RATING = 4 # keep only ratings with score >= 4
MIN_USERS_PER_BOOK = 81 # keep only books rated by at least 81 unique users
MIN_SHARED_USERS_FOR_EDGES = 2 # create an edge between two books only if at least 2 users rated both

##### **3. Data Preprocessing with dataset *book_rating***



---


*   Check for NaN values and remove the affected rows
*   Keep only ratings with **review/score ≥ 4.0**
*   Create the weight matrix W
*   Create the transition matrix M



In [None]:
# Count NaN values per column
nan_check = book_rating.isnull()
print(nan_check.sum())

# Remove rows with a NaN in any column (this includes rows where only Price is missing)
book_rating = book_rating.dropna()

Id                          0
Title                     208
Price                 2518829
User_id                561787
profileName            561905
review/helpfulness          0
review/score                0
review/time                 0
review/summary            407
review/text                 8
dtype: int64
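`dropna()` without arguments removes a row if any column is missing, and the counts above show that `Price` is missing in about 2.5 million rows. If only the columns used later matter, `subset=` restricts the check to them. The sketch below uses made-up rows and is not the notebook's method; applying it to the real data would change the book and user counts reported further down.

```python
import numpy as np
import pandas as pd

# Made-up rows: Price is often missing, the key columns rarely are.
df = pd.DataFrame({
    "Title": ["Book A", "Book A", None, "Book B"],
    "Price": [np.nan, 9.99, np.nan, np.nan],
    "User_id": ["U1", "U2", "U3", None],
    "review/score": [5.0, 3.0, 4.0, 4.0],
})

# Default behaviour: a NaN in any column drops the row.
drop_any = df.dropna()

# Only the three key columns are checked; a missing Price is tolerated.
drop_keys = df.dropna(subset=["Title", "User_id", "review/score"])

print(len(drop_any), len(drop_keys))
```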


In [None]:
# Filter ratings (>= 4)
filtered_books = book_rating[book_rating['review/score'] >= MIN_RATING][['Title', 'User_id', 'review/score']]

In [None]:
# Count how many unique users rated each book
book_user_counts = filtered_books.groupby("Title")["User_id"].nunique()

if USE_SUBSAMPLE:
    # Keep only books with at least MIN_USERS_PER_BOOK unique users
    popular_books = book_user_counts[book_user_counts >= MIN_USERS_PER_BOOK].index
    filtered_books = filtered_books[filtered_books["Title"].isin(popular_books)]
else:
    popular_books = book_user_counts.index

In [None]:
filtered_books

,Title,User_id,review/score
38660,Gods and Kings (Chronicles of the Kings #1),A7IMBNFYANPNV,5.0
38661,Gods and Kings (Chronicles of the Kings #1),A1WV4Q44JE40UF,5.0
38662,Gods and Kings (Chronicles of the Kings #1),A32MYDPSMCHT1L,5.0
38663,Gods and Kings (Chronicles of the Kings #1),AAOJ5T6VS5Z6O,5.0
38664,Gods and Kings (Chronicles of the Kings #1),A2X1OYBRURS04R,5.0
...,...,...,...
2994351,First 100 Words (Bright Baby),AIFSOGUNXFADJ,5.0
2994352,First 100 Words (Bright Baby),A2B9B5F9FQ4JJ8,4.0
2994354,First 100 Words (Bright Baby),A2G9EZ3R716RH1,4.0
2994355,First 100 Words (Bright Baby),A37P45ZC5DSAJJ,5.0


In [None]:
user_num = filtered_books["User_id"].nunique()
print(user_num)

title_num = filtered_books["Title"].nunique()
print(title_num)

60081
380


###### *Create the Transition matrix M*

In [None]:
# For each pair of books, count how many users liked both
import numpy as np

# Build a user-book matrix (rows are users, columns are books, value = 1 if the user rated the book >= 4)
book_user = filtered_books.drop_duplicates().pivot_table(
    index='User_id',
    columns='Title',
    values='review/score',  # count ratings per (user, book); 'Title' is already a grouping key
    aggfunc='count',
    fill_value=0
)
book_user = (book_user > 0).astype(int)

# Build Weighted adjacency matrix W where W[i,j] is the number of shared users between book i and book j
W = book_user.T.dot(book_user)

print(W.shape)
W.head()

(380, 380)


Title,13 Little Blue Envelopes,1491: New Revelations of the Americas Before Columbus,1632 (The Assiti Shards),1906,"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment","500 Low-Carb Recipes: 500 Recipes from Snacks to Dessert, That the Whole Family Will Love",A Breath of Snow and Ashes (Outlander),A Bride Most Begrudging,"A Caress of Twilight (Meredith Gentry, Book 2)",A Certain Slant of Light,...,"Wizard's First Rule (Sword of Truth, Book 1)","Words That Work: It's Not What You Say, It's What People Hear",Wuthering Heights,Year of Wonders (Turtleback School & Library Binding Edition),Yours Until Dawn,Zane's Afterburn: A Novel,Zen Shorts (Caldecott Honor Book),Zen in the Martial Arts,comeback - a mother and daughter's journey through hell and back,the Picture of Dorian Gray
13 Little Blue Envelopes,109,0,0,0,0,0,0,0,0,3,...,0,0,0,1,0,0,1,0,0,1
1491: New Revelations of the Americas Before Columbus,0,345,1,0,0,0,0,0,0,0,...,1,1,3,5,0,0,0,0,1,2
1632 (The Assiti Shards),0,1,141,0,0,0,1,1,2,0,...,3,0,0,1,1,0,0,0,0,0
1906,0,0,0,88,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment",0,0,0,0,346,0,0,0,0,1,...,0,0,0,1,0,0,0,0,0,0

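To see why `book_user.T.dot(book_user)` counts shared users, here is a toy version with three users and three books (all names are made up). Entry (i, j) of the product sums, over users, the product of two 0/1 indicators, which is 1 only when that user liked both books. The diagonal therefore holds each book's own user count, which is why it is zeroed out afterwards.

```python
import pandas as pd

# Rows are users, columns are books; 1 means the user rated the book >= 4.
B = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 0]],
    index=["u1", "u2", "u3"],
    columns=["Book A", "Book B", "Book C"],
)

# Co-occurrence matrix: W_toy[i, j] = number of users who liked both book i and book j.
W_toy = B.T.dot(B)
print(W_toy)
```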

In [None]:
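# Remove self-counts on the diagonal (work on a copy, since W.values is not guaranteed to be writable)
W_values = W.to_numpy().copy()
np.fill_diagonal(W_values, 0)
W = pd.DataFrame(W_values, index=W.index, columns=W.columns)

# Keep only strong edges
W = W.where(W >= MIN_SHARED_USERS_FOR_EDGES, 0)

W.head()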

Title,13 Little Blue Envelopes,1491: New Revelations of the Americas Before Columbus,1632 (The Assiti Shards),1906,"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment","500 Low-Carb Recipes: 500 Recipes from Snacks to Dessert, That the Whole Family Will Love",A Breath of Snow and Ashes (Outlander),A Bride Most Begrudging,"A Caress of Twilight (Meredith Gentry, Book 2)",A Certain Slant of Light,...,"Wizard's First Rule (Sword of Truth, Book 1)","Words That Work: It's Not What You Say, It's What People Hear",Wuthering Heights,Year of Wonders (Turtleback School & Library Binding Edition),Yours Until Dawn,Zane's Afterburn: A Novel,Zen Shorts (Caldecott Honor Book),Zen in the Martial Arts,comeback - a mother and daughter's journey through hell and back,the Picture of Dorian Gray
13 Little Blue Envelopes,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
1491: New Revelations of the Americas Before Columbus,0,0,0,0,0,0,0,0,0,0,...,0,0,3,5,0,0,0,0,0,2
1632 (The Assiti Shards),0,0,0,0,0,0,0,0,2,0,...,3,0,0,0,0,0,0,0,0,0
1906,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment",0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Transition probability matrix M
A = W.to_numpy().astype(float)   # A[i, j] = weight from book j to book i

# Column sums: total outgoing weight from each book j
col_sums = A.sum(axis=0)

# Handle dangling nodes (books with no outgoing edges)
dangling = (col_sums == 0)
if np.any(dangling):
    A[:, dangling] = 1.0  # For dangling columns, pretend they link equally to all books
    col_sums = A.sum(axis=0)

# Normalize columns so that each column sums to 1
M_np = A / col_sums

M = pd.DataFrame(M_np, index=W.index, columns=W.columns)
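A small worked case of this normalization, with made-up weights: the third book shares no users with the others, so its column is all zeros and is replaced by a uniform column before dividing.

```python
import numpy as np

# Symmetric toy weights; the third book (index 2) has no edges at all.
A_toy = np.array([[0., 3., 0.],
                  [3., 0., 0.],
                  [0., 0., 0.]])

col_sums = A_toy.sum(axis=0)
dangling = col_sums == 0
A_toy[:, dangling] = 1.0          # dangling column links equally to every book
col_sums = A_toy.sum(axis=0)

# Broadcasting divides column j by col_sums[j].
M_toy = A_toy / col_sums
print(M_toy)
```

The first two columns each put all their weight on the other book, while the dangling column spreads 1/3 to every book, itself included.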

In [None]:
print(M.shape)
print("Column sums (should be ~1):")
print(M.sum(axis=0))

(380, 380)
Column sums (should be ~1):
Title
13 Little Blue Envelopes                                                                           1.0
1491: New Revelations of the Americas Before Columbus                                              1.0
1632 (The Assiti Shards)                                                                           1.0
1906                                                                                               1.0
23 Minutes In Hell: One Man's Story About What He Saw, Heard, and Felt in that Place of Torment    1.0
                                                                                                  ... 
Zane's Afterburn: A Novel                                                                          1.0
Zen Shorts (Caldecott Honor Book)                                                                  1.0
Zen in the Martial Arts                                                                            1.0
comeback - a mother and daug

##### **4. Apply PageRank-based Link Analysis**


---




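*   Update rule: $v' = \beta M v + (1-\beta)\,\frac{1}{n}\mathbf{1}$, where $\mathbf{1}$ is the all-ones vector, $n$ is the number of books, and $\beta$ is the damping factor (called `alpha` in the code)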
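As a worked instance of the update rule, the sketch below applies one step to a made-up 3-book column-stochastic matrix with β = 0.85. Because each column of M sums to 1, the total score is still 1 after the step.

```python
import numpy as np

beta = 0.85
# Made-up column-stochastic matrix: column j holds the transition probabilities out of book j.
M_toy = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
n = M_toy.shape[0]
v = np.ones(n) / n                      # uniform starting vector

# One application of v' = beta * M v + (1 - beta) / n
v_next = beta * (M_toy @ v) + (1 - beta) / n
print(v_next, v_next.sum())
```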





In [None]:
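def pagerank(M, alpha=0.85, tol=1e-8, max_iter=100):
    """Power iteration for a column-stochastic M: p' = alpha * M p + (1 - alpha) / N."""
    M = M.to_numpy() if isinstance(M, pd.DataFrame) else np.asarray(M)
    N = M.shape[0]
    p = np.ones(N) / N              # start from the uniform distribution
    teleport = (1 - alpha) / N      # uniform teleportation term

    for _ in range(max_iter):
        p_new = alpha * (M @ p) + teleport
        converged = np.linalg.norm(p_new - p, 1) < tol
        p = p_new                   # always keep the latest iterate
        if converged:
            break
    return p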
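One way to sanity-check a power-iteration result is to compare it with the dominant eigenvector of the dense matrix G = αM + (1-α)/N · 𝟏𝟏ᵀ, whose eigenvector for eigenvalue 1 is the PageRank vector. This is a sketch on a made-up matrix; `power_iteration` is a hypothetical stand-in written for this example and follows the same update as `pagerank`. Building G densely is only practical for small N.

```python
import numpy as np

def power_iteration(M, alpha=0.85, tol=1e-12, max_iter=1000):
    # Hypothetical helper for this example: same update rule as pagerank.
    N = M.shape[0]
    p = np.ones(N) / N
    for _ in range(max_iter):
        p_new = alpha * (M @ p) + (1 - alpha) / N
        done = np.linalg.norm(p_new - p, 1) < tol
        p = p_new
        if done:
            break
    return p

# Made-up column-stochastic matrix.
M_toy = np.array([[0.0, 0.5, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 0.5, 0.0]])
alpha, N = 0.85, 3

# Damped transitions plus uniform teleportation, as one dense matrix.
G = alpha * M_toy + (1 - alpha) / N * np.ones((N, N))
eigvals, eigvecs = np.linalg.eig(G)
lead = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
lead = lead / lead.sum()          # rescale so the entries sum to 1

p = power_iteration(M_toy)
print(np.abs(p - lead).max())     # largest entrywise difference
```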

In [None]:
scores = pagerank(M)

# Attach book titles
pagerank_scores = pd.Series(scores, index=M.index, name='PageRank')

# Sort from most important book to least
pagerank_scores = pagerank_scores.sort_values(ascending=False)

pagerank_scores = pd.DataFrame(pagerank_scores)
pagerank_scores

Title,PageRank
Jane Eyre (Large Print),0.026710
Jane Eyre (New Windmill),0.026710
The Picture of Dorian Gray,0.022949
the Picture of Dorian Gray,0.022949
The Picture of Dorian Gray (The Classic Collection),0.022949
...,...
APA: The Easy Way! (for APA 5th edition),0.000468
Ultra Black Hair Growth II 2000 Edition,0.000468
Twenty Things Adopted Kids Wish Their Adoptive Parents Knew,0.000468
Total Workday Control Using Microsoft Outlook: The Eight Best Practices of Task and E-Mail Management,0.000468
