<div class="alert alert-info">

## Introduction


</div>

In [3]:
# Imports
import os
from hashlib import sha1

import numpy as np
import pandas as pd

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import SVD, Dataset, Reader, accuracy
# Note: the import below shadows sklearn's train_test_split imported above.
from surprise.model_selection import cross_validate, train_test_split

<div class="alert alert-info">

## Data Description<a name="2"></a>
This project uses the Amazon US book reviews dataset from Kaggle (https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-books-v1-02-tsv-zip). The full dataset has 15 columns and 3,105,370 rows of customer reviews and star ratings for products in the book category.

Because the full dataset is too large to process efficiently, the models are trained on a subset capped at 10,000 rows. With the filtering and sampling used in this notebook, the final subset contains 4,510 reviews.
</div>

In [4]:
# Data
data = pd.read_csv("data/amazon_reviews_us_Books_v1_02.tsv", sep='\t', on_bad_lines='skip')


In [5]:
data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12076615,RQ58W7SMO911M,385730586,122662979,Sisterhood of the Traveling Pants (Book 1),Books,4.0,2.0,3.0,N,N,this book was a great learning novel!,this boook was a great one that you could lear...,2005-10-14
1,US,12703090,RF6IUKMGL8SF,811828964,56191234,The Bad Girl's Guide to Getting What You Want,Books,3.0,5.0,5.0,N,N,Fun Fluff,If you are looking for something to stimulate ...,2005-10-14
2,US,12257412,R1DOSHH6AI622S,1844161560,253182049,"Eisenhorn (A Warhammer 40,000 Omnibus)",Books,4.0,1.0,22.0,N,N,this isn't a review,never read it-a young relative idicated he lik...,2005-10-14
3,US,50732546,RATOTLA3OF70O,373836635,348672532,Colby Conspiracy (Colby Agency),Books,5.0,2.0,2.0,N,N,fine author on her A-game,Though she is honored to be Chicago Woman of t...,2005-10-14
4,US,51964897,R1TNWRKIVHVYOV,262181533,598678717,The Psychology of Proof: Deductive Reasoning i...,Books,4.0,0.0,2.0,N,N,Execellent cursor examination,Review based on a cursory examination by Unive...,2005-10-14


<div class="alert alert-info">

## Exploratory Data Analysis (EDA) <a name="3"></a>

This section inspects the structure of the dataset, handles missing values, and builds the subset used for modeling.

To build the subset, we first keep only customers and products with at least 10 reviews each. From these, we randomly select 1,500 customers and 2,000 products and keep the reviews that involve both a selected customer and a selected product. The sample is capped at 10,000 rows, but only 4,510 reviews satisfy both conditions, so the final subset contains 4,510 reviews from 1,210 distinct customers covering 1,615 distinct products.

</div>

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


In [7]:
data.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          0
product_category       0
star_rating            4
helpful_votes          4
total_votes            4
vine                   4
verified_purchase      4
review_headline       57
review_body            4
review_date          133
dtype: int64

In [8]:
# Convert placeholder strings to NaN first so that the dropna() below removes them too
data = data.replace(['null', 'N/A', '', ' '], np.nan)

In [9]:
data = data.dropna()

In [10]:
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64
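
The order of the two cleaning steps matters: placeholder strings only count as missing once they have been converted to `NaN`. The sketch below illustrates this on a made-up four-row frame (the `toy` DataFrame and its values are hypothetical and not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one real review, two placeholder strings, one true missing value.
toy = pd.DataFrame({
    "review_id": ["R1", "R2", "R3", "R4"],
    "review_body": ["Great read", " ", "null", None],
})
placeholders = ["null", "N/A", "", " "]

# Dropping first removes only R4; the placeholders become NaN afterwards and stay in the frame.
drop_then_replace = toy.dropna().replace(placeholders, np.nan)

# Replacing first turns the placeholders into NaN, so dropna() removes them as well.
replace_then_drop = toy.replace(placeholders, np.nan).dropna()

print("NaN left after drop-then-replace:", int(drop_then_replace["review_body"].isna().sum()))
print("Rows left after replace-then-drop:", len(replace_then_drop))
```

In the notebook itself the null counts are all zero after both steps, which suggests that no such placeholder strings were left in the data at that point.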

In [11]:
data.nunique()

marketplace                1
customer_id          1502265
review_id            3105184
product_id            779692
product_parent        666003
product_title         713665
product_category           1
star_rating                5
helpful_votes            942
total_votes             1024
vine                       2
verified_purchase          2
review_headline      2456998
review_body          3070458
review_date             3575
dtype: int64

In [20]:
# Step 1: Filter customers with at least 10 reviews
customer_review_counts = data.groupby('customer_id').size().reset_index(name='review_count')
customers_with_at_least_10_reviews = customer_review_counts[customer_review_counts['review_count'] >= 10]

# Step 2: Filter products with at least 10 reviews
product_review_counts = data.groupby('product_id').size().reset_index(name='review_count')
products_with_at_least_10_reviews = product_review_counts[product_review_counts['review_count'] >= 10]

# Step 3: Filter the original dataset to only include customers and products with at least 10 reviews
filtered_data = data[
    (data['customer_id'].isin(customers_with_at_least_10_reviews['customer_id'])) &
    (data['product_id'].isin(products_with_at_least_10_reviews['product_id']))
]
filtered_data.shape

(389017, 15)

In [22]:

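# Recount reviews within the filtered data: after the joint filter, some customers
# and products may have fewer than 10 reviews left.
customer_review_counts_filtered = filtered_data.groupby('customer_id').size().reset_index(name='review_count')
product_review_counts_filtered = filtered_data.groupby('product_id').size().reset_index(name='review_count')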

customers_with_at_least_10_reviews = customer_review_counts_filtered[customer_review_counts_filtered['review_count'] >= 10]
products_with_at_least_10_reviews = product_review_counts_filtered[product_review_counts_filtered['review_count'] >= 10]


num_customers_with_at_least_10_reviews = len(customers_with_at_least_10_reviews)
num_products_with_at_least_10_reviews = len(products_with_at_least_10_reviews)

print(f"Number of customers with at least 10 reviews: {num_customers_with_at_least_10_reviews}")
print(f"Number of products with at least 10 reviews: {num_products_with_at_least_10_reviews}")

Number of customers with at least 10 reviews: 14364
Number of products with at least 10 reviews: 10727


In [23]:
# Step 1: Filter customers with at least 10 reviews
customer_review_counts = filtered_data.groupby('customer_id').size().reset_index(name='review_count')
customers_with_at_least_10_reviews = customer_review_counts[customer_review_counts['review_count'] >= 10]['customer_id']

# Step 2: Filter products with at least 10 reviews
product_review_counts = filtered_data.groupby('product_id').size().reset_index(name='review_count')
products_with_at_least_10_reviews = product_review_counts[product_review_counts['review_count'] >= 10]['product_id']

# Step 3: Filter the dataset to include only reviews from these customers and products
filtered_customers_products_data = filtered_data[
    (filtered_data['customer_id'].isin(customers_with_at_least_10_reviews)) & 
    (filtered_data['product_id'].isin(products_with_at_least_10_reviews))
]

# Step 4: Select a subset of customers (e.g., 1500) and products (e.g., 2000)
subset_customers = customers_with_at_least_10_reviews.sample(n=1500, random_state=42)
subset_products = products_with_at_least_10_reviews.sample(n=2000, random_state=42)

# Step 5: Filter the dataset to include only reviews from the selected customers and products
subset_data = filtered_customers_products_data[
    (filtered_customers_products_data['customer_id'].isin(subset_customers)) & 
    (filtered_customers_products_data['product_id'].isin(subset_products))
]

# Step 6: Check the number of available reviews before sampling
available_reviews = len(subset_data)
print(f"Available reviews: {available_reviews}")

# Sample at most 10,000 reviews; if fewer are available, keep all of them (in shuffled order)
final_sample = subset_data.sample(n=min(available_reviews, 10000), random_state=123)

# Step 7: Display the results
final_num_reviews = len(final_sample)
final_num_customers = final_sample['customer_id'].nunique()
final_num_products = final_sample['product_id'].nunique()

print(f"Final number of reviews: {final_num_reviews}")
print(f"Final number of unique customers: {final_num_customers}")
print(f"Final number of unique products: {final_num_products}")


Available reviews: 4510
Final number of reviews: 4510
Final number of unique customers: 1210
Final number of unique products: 1615


In [24]:
final_sample.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
1484994,US,38484071,RWCTJ6DAIGZ7T,61087122,683956970,Master of Seduction (Sea Wolves Series),Books,4.0,7.0,10.0,N,N,Good book,I have only read two books by Kinley MacGregor...,2002-07-06
1271322,US,51165577,R1H1YNIAYYLSKF,345422805,234414967,Team Rodent : How Disney Devours the World,Books,4.0,7.0,9.0,N,N,the shady side of Disney,I read this book several years ago and since t...,2003-01-11
1784914,US,40669813,R1BQAW6T2KKOI4,679764410,864803147,American Sphinx: The Character of Thomas Jeffe...,Books,5.0,46.0,56.0,N,N,Sphinx?,Five star effort by Ellis for what he did. If ...,2001-11-04
2155611,US,50358298,RVKRFPUQGVE30,679879250,872660777,"The Subtle Knife (His Dark Materials, Book 2)",Books,5.0,2.0,2.0,N,N,"Where there's Will, there's a way...",This book was a worthy successor to The Golden...,2001-01-02
503798,US,20997233,R3RW172U8DSP4O,684840057,505830594,Radical Son: A Generational Odyssey,Books,5.0,19.0,23.0,N,N,Exceptionally Truthful Examination of a Tumult...,"\\""Radical Son\\"" is the well-written and brut...",2004-10-01


In [25]:
final_sample.nunique()

marketplace             1
customer_id          1210
review_id            4510
product_id           1615
product_parent       1571
product_title        1594
product_category        1
star_rating             5
helpful_votes         170
total_votes           187
vine                    1
verified_purchase       2
review_headline      4323
review_body          4465
review_date          1965
dtype: int64

In [26]:
final_sample.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4510 entries, 1484994 to 2185734
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   marketplace        4510 non-null   object 
 1   customer_id        4510 non-null   int64  
 2   review_id          4510 non-null   object 
 3   product_id         4510 non-null   object 
 4   product_parent     4510 non-null   int64  
 5   product_title      4510 non-null   object 
 6   product_category   4510 non-null   object 
 7   star_rating        4510 non-null   float64
 8   helpful_votes      4510 non-null   float64
 9   total_votes        4510 non-null   float64
 10  vine               4510 non-null   object 
 11  verified_purchase  4510 non-null   object 
 12  review_headline    4510 non-null   object 
 13  review_body        4510 non-null   object 
 14  review_date        4510 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 563.8+ KB


In [27]:
final_sample.to_csv('amazon_reviews_subset.csv', index=False)


<div class="alert alert-info">
    
## Collaborative Filtering
**Collaborative filtering** is a widely used technique for predicting the missing entries of a utility matrix, in which rows are users, columns are items, and each observed entry is a rating. It relies only on past user-item interactions and assumes that users who agreed in the past will tend to agree in the future, so a user's unknown preferences can be inferred from the ratings of similar users.

The approach is closely related to dimensionality reduction techniques such as Latent Semantic Analysis (LSA) and truncated Singular Value Decomposition (SVD): users and items are represented by a small number of latent factors, and missing ratings are predicted from those factors.

In this project, collaborative filtering serves as the baseline model for recommending books based on customers' historical star ratings.
</div>
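
As a rough illustration of the idea, the sketch below fills the gaps of a tiny utility matrix with a rank-2 truncated SVD. The matrix values are made up, and filling missing entries with item means before the decomposition is a simplifying assumption for this example. It is not how the Surprise `SVD` model used later is fitted.

```python
import numpy as np

# Toy utility matrix: rows are users, columns are items, NaN marks a missing rating.
# The values are made up for illustration and are not taken from the dataset.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 5.0, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 2.0, 4.0],
])
observed = ~np.isnan(R)

# Fill each missing entry with its item's (column's) mean rating so SVD can be applied.
item_means = np.nanmean(R, axis=0)
R_filled = np.where(observed, R, item_means)

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted ratings for the (user, item) cells that were missing.
predictions = {
    (int(u), int(i)): float(R_hat[u, i])
    for u, i in zip(*np.where(~observed))
}
for (u, i), rating in predictions.items():
    print(f"user {u}, item {i}: predicted rating {rating:.2f}")
```

The low-rank reconstruction forces every user and item to be described by only `k` factors, which is what lets the observed ratings inform the missing ones.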

In [28]:
coll_data = final_sample[['customer_id', 'product_id', 'star_rating']].reset_index(drop=True)
coll_data.head()

Unnamed: 0,customer_id,product_id,star_rating
0,38484071,61087122,4.0
1,51165577,345422805,4.0
2,40669813,679764410,5.0
3,50358298,679879250,5.0
4,20997233,684840057,5.0


In [33]:
# Column names used as user and item keys
user_key = "customer_id"
item_key = "product_id"

avg_nratings_per_user = coll_data.groupby(user_key).size().mean()
avg_nratings_per_product = coll_data.groupby(item_key).size().mean()
print(f"Average number of ratings per user   : {avg_nratings_per_user}")
print(f"Average number of ratings per product: {avg_nratings_per_product}")

Average number of ratings per user   : 3.727272727272727
Average number of ratings per product: 2.7925696594427243
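
These averages show how sparse the utility matrix is. The short calculation below reproduces them from the subset sizes reported earlier (4,510 reviews, 1,210 customers, 1,615 products) and also computes the matrix density, assuming each review corresponds to a distinct customer-product pair.

```python
# Subset sizes reported by the sampling step earlier in the notebook.
n_reviews = 4510
n_customers = 1210
n_products = 1615

avg_per_customer = n_reviews / n_customers
avg_per_product = n_reviews / n_products

# Density: share of customer-product cells that hold a rating,
# assuming each review is a distinct (customer, product) pair.
density = n_reviews / (n_customers * n_products)

print(f"Average ratings per customer: {avg_per_customer:.3f}")
print(f"Average ratings per product : {avg_per_product:.3f}")
print(f"Utility matrix density      : {density:.4%}")
```

Under that assumption only about 0.23% of the cells are observed, so any model has very little information per customer and per product to learn from.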


In [36]:
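# Build a Surprise dataset from the (user, item, rating) columns.
# Star ratings range from 1 to 5, which is also Reader's default scale.
reader = Reader(rating_scale=(1, 5))
# Note: this rebinds `data`, which previously held the raw reviews DataFrame.
data = Dataset.load_from_df(coll_data, reader)

# SVD matrix factorization with k latent factors
k = 10
algo = SVD(n_factors=k, random_state=42)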
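
For reference, Surprise's `SVD` is not a plain truncated SVD of the ratings matrix. As described in the Surprise documentation, it is a regularized matrix factorization with bias terms, fitted by stochastic gradient descent on the observed ratings only. In the notation below, $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the latent factor vectors of length `n_factors`, and $\lambda$ is the regularization strength. The predicted rating is

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
$$

and the parameters are chosen to minimize the regularized squared error over the training ratings:

$$
\sum_{r_{ui} \in R_{\text{train}}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
$$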

In [37]:
pd.DataFrame(cross_validate(algo, data, measures=["RMSE"], cv=5, verbose=True))

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0332  1.0968  1.0999  1.0774  1.0690  1.0752  0.0240  
Fit time          0.01    0.01    0.01    0.01    0.01    0.01    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    


Unnamed: 0,test_rmse,fit_time,test_time
0,1.033194,0.013326,0.003266
1,1.096768,0.008017,0.003322
2,1.099856,0.007618,0.002244
3,1.077363,0.007056,0.001945
4,1.068997,0.006439,0.001804
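
To make the metric concrete, the sketch below defines RMSE directly and then recomputes the mean and standard deviation from the per-fold values in the table above. The `rmse` helper and the four example ratings are illustrative and not part of the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up ratings and predictions, only to show how the metric behaves.
example = rmse([5, 3, 4, 1], [4, 3, 5, 3])
print(f"Example RMSE: {example:.4f}")

# Per-fold test RMSE values copied from the cross-validation table above.
fold_rmse = np.array([1.033194, 1.096768, 1.099856, 1.077363, 1.068997])
print(f"Mean RMSE: {fold_rmse.mean():.4f}")
print(f"Std  RMSE: {fold_rmse.std():.4f}")  # population standard deviation (ddof=0)
```

A mean RMSE of about 1.08 means that predictions are off by roughly one star on a five-star scale, which is plausible given how few ratings each customer and product has in the subset.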
