<div class="alert alert-info">

## Introduction


</div>

In [22]:
# Imports
import os

import numpy as np
import pandas as pd
from hashlib import sha1

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from surprise import accuracy

<div class="alert alert-info">

## Data Description<a name="2"></a>
This project uses the Amazon US book reviews dataset from Kaggle (https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-books-v1-02-tsv-zip). The dataset has 15 columns and 3,105,370 rows of customer reviews and star ratings for products in the book category.

Because the full dataset is large, the models are fitted on a filtered subset rather than on all rows. The subset keeps only customers and products with at least 100 reviews each, which leaves 24,466 rows; the filtering is described in the Exploratory Data Analysis section.
</div>

In [108]:
# Data
data = pd.read_csv("data/amazon_reviews_us_Books_v1_02.tsv", sep='\t', on_bad_lines='skip')


In [109]:
data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12076615,RQ58W7SMO911M,385730586,122662979,Sisterhood of the Traveling Pants (Book 1),Books,4.0,2.0,3.0,N,N,this book was a great learning novel!,this boook was a great one that you could lear...,2005-10-14
1,US,12703090,RF6IUKMGL8SF,811828964,56191234,The Bad Girl's Guide to Getting What You Want,Books,3.0,5.0,5.0,N,N,Fun Fluff,If you are looking for something to stimulate ...,2005-10-14
2,US,12257412,R1DOSHH6AI622S,1844161560,253182049,"Eisenhorn (A Warhammer 40,000 Omnibus)",Books,4.0,1.0,22.0,N,N,this isn't a review,never read it-a young relative idicated he lik...,2005-10-14
3,US,50732546,RATOTLA3OF70O,373836635,348672532,Colby Conspiracy (Colby Agency),Books,5.0,2.0,2.0,N,N,fine author on her A-game,Though she is honored to be Chicago Woman of t...,2005-10-14
4,US,51964897,R1TNWRKIVHVYOV,262181533,598678717,The Psychology of Proof: Deductive Reasoning i...,Books,4.0,0.0,2.0,N,N,Execellent cursor examination,Review based on a cursory examination by Unive...,2005-10-14


<div class="alert alert-info">

## Exploratory Data Analysis (EDA) <a name="3"></a>

This section explores the dataset and selects the subset used for model development.

To build the subset, we keep only the reviews whose customer and product each have at least 100 reviews in the full dataset. The result contains 24,466 reviews covering 1,672 distinct products and 1,230 distinct customers.

</div>

In [111]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


In [112]:
data.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          0
product_category       0
star_rating            4
helpful_votes          4
total_votes            4
vine                   4
verified_purchase      4
review_headline       57
review_body            4
review_date          133
dtype: int64

In [113]:
# Convert placeholder strings to NaN first so that dropna() below also removes them
data = data.replace(['null', 'N/A', '', ' '], np.nan)

In [114]:
data = data.dropna()

In [115]:
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

In [116]:
data.nunique()

marketplace                1
customer_id          1502265
review_id            3105184
product_id            779692
product_parent        666003
product_title         713665
product_category           1
star_rating                5
helpful_votes            942
total_votes             1024
vine                       2
verified_purchase          2
review_headline      2456998
review_body          3070458
review_date             3575
dtype: int64

In [110]:
# Selected a subset with customers and products with at least 100 reviews
# Step 1: Filter customers with at least 100 reviews
customer_review_counts = data.groupby('customer_id').size().reset_index(name='review_count')
customers_with_at_least_100_reviews = customer_review_counts[customer_review_counts['review_count'] >= 100]

# Step 2: Filter products with at least 100 reviews
product_review_counts = data.groupby('product_id').size().reset_index(name='review_count')
products_with_at_least_100_reviews = product_review_counts[product_review_counts['review_count'] >= 100]

# Step 3: Filter the original dataset to only include customers and products with at least 100 reviews
filtered_data = data[
    (data['customer_id'].isin(customers_with_at_least_100_reviews['customer_id'])) &
    (data['product_id'].isin(products_with_at_least_100_reviews['product_id']))
]
filtered_data.shape

(24466, 15)
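Note that the filter above counts reviews in the full dataset, so a customer or product can end up with fewer than 100 reviews inside the subset. The following is a minimal sketch of the same filtering idea on a hypothetical toy frame, using a threshold of 2 as a stand-in for 100; the frame and the names `toy`, `min_reviews`, and `subset` are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy review log; values are made up for illustration.
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "product_id": ["a", "b", "c", "a", "b", "c"],
})
min_reviews = 2  # stand-in for the threshold of 100 used above

# Attach to every row the review count of its customer and of its product,
# both computed on the full (unfiltered) table.
cust_n = toy.groupby("customer_id")["product_id"].transform("size")
prod_n = toy.groupby("product_id")["customer_id"].transform("size")
subset = toy[(cust_n >= min_reviews) & (prod_n >= min_reviews)]

# Product "c" had 2 reviews in the full table but keeps only 1 in the subset,
# because its other reviewer (customer 3) was filtered out.
per_product = subset.groupby("product_id").size()
print(per_product)
```

This is why the average number of ratings per customer and per product in the subset can be far below 100.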

In [149]:
filtered_data.to_csv('data/amazon_reviews_subset_100.csv', index=False)

In [120]:
filtered_data.nunique()

marketplace              1
customer_id           1230
review_id            24466
product_id            1672
product_parent        1485
product_title         1562
product_category         1
star_rating              5
helpful_votes          423
total_votes            460
vine                     1
verified_purchase        2
review_headline      23305
review_body          24314
review_date           2574
dtype: int64


<div class="alert alert-info">
    
## Collaborative Filtering
**Collaborative filtering** predicts the missing entries of a utility matrix from the ratings that users have already given. It rests on the assumption that users who agreed in the past will tend to agree in the future, so a user's unknown preferences can be inferred from the preferences of similar users.

The approach is closely related to dimensionality reduction techniques such as Latent Semantic Analysis (LSA) and truncated Singular Value Decomposition (SVD): it captures the underlying relationships between users and items and uses them to predict the missing ratings.

In this project, collaborative filtering serves as our baseline model for personalizing recommendations from historical ratings.
</div>
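To make the idea of a utility matrix with missing entries concrete, here is a small sketch on hypothetical ratings. The toy values and the name `utility` are assumptions for illustration only; the column names mirror those used in this notebook.

```python
import pandas as pd

# Hypothetical ratings: three customers, three products, five observed ratings.
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "product_id": ["a", "b", "a", "c", "b"],
    "star_rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# Rows are customers, columns are products; unrated pairs become NaN.
utility = ratings.pivot(index="customer_id", columns="product_id", values="star_rating")
print(utility)

# Collaborative filtering tries to predict these missing cells.
n_missing = int(utility.isna().sum().sum())
print(f"Missing entries: {n_missing} of {utility.size}")
```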

In [122]:
# Reading the data
coll_data = filtered_data[['customer_id', 'product_id', 'star_rating']].reset_index(drop=True)
coll_data.head()

Unnamed: 0,customer_id,product_id,star_rating
0,50230169,0451526341,4.0
1,50776149,038551428X,5.0
2,12598621,059035342X,5.0
3,49770667,1594480001,5.0
4,49828549,0671027360,1.0


In [124]:
# Number of customers and products
user_key = "customer_id"
item_key = "product_id"
N = len(np.unique(coll_data[user_key])) 
M = len(np.unique(coll_data[item_key]))
print(f"Number of customers (N)  : {N}")
print(f"Number of products (M) : {M}")

Number of customers (N)  : 1230
Number of products (M) : 1672


In [125]:
non_nan_ratings_percentage = (len(coll_data) / (N * M) * 100) 
print(f"Non-nan ratings percentage: {np.round(non_nan_ratings_percentage,3)}")

Non-nan ratings percentage: 1.19
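The percentage above can be reproduced from the counts reported earlier (24,466 ratings, 1,230 customers, 1,672 products). A quick arithmetic check, with the counts hard-coded rather than read from the data:

```python
n_ratings = 24466    # rows in the filtered subset
n_customers = 1230   # N
n_products = 1672    # M

n_cells = n_customers * n_products        # 2,056,560 possible (customer, product) pairs
density_pct = n_ratings / n_cells * 100   # share of cells with an observed rating
print(f"{density_pct:.2f}% observed, {100 - density_pct:.2f}% missing")
```

In other words, roughly 99 out of every 100 cells of the utility matrix have to be predicted.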


In [128]:
avg_nratings_per_user = coll_data.groupby(user_key).size().mean()
avg_nratings_per_product = coll_data.groupby(item_key).size().mean()
print(f"Average number of ratings per customer : {avg_nratings_per_user:.2f}")
print(f"Average number of ratings per product: {avg_nratings_per_product:.2f}")

Average number of ratings per customer : 19.89
Average number of ratings per product: 14.63


In [129]:
# Data Splitting
X = coll_data.copy()
y = coll_data['customer_id']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)

In [130]:
user_mapper = dict(zip(np.unique(coll_data[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(coll_data[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(coll_data[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(coll_data[item_key])))

In [131]:
def create_Y_from_ratings(data, N, M):
    Y = np.zeros((N, M))
    Y.fill(np.nan)
    for index, val in data.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        Y[n, m] = val["star_rating"]

    return Y

train_mat = create_Y_from_ratings(X_train, N, M)
valid_mat = create_Y_from_ratings(X_valid, N, M)
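The row-by-row loop in `create_Y_from_ratings` can be slow on larger subsets. As a possible alternative, the sketch below fills the matrix with vectorized indexing. The function name `build_utility_matrix` and the toy data are hypothetical, and the sketch assumes every ID in the frame appears in the supplied ID arrays.

```python
import numpy as np
import pandas as pd

def build_utility_matrix(df, user_ids, item_ids):
    """Return a NaN-filled (users x items) matrix populated from df."""
    # Map each ID to its position; assumes all IDs in df are present in the arrays.
    rows = pd.Index(user_ids).get_indexer(df["customer_id"])
    cols = pd.Index(item_ids).get_indexer(df["product_id"])
    Y = np.full((len(user_ids), len(item_ids)), np.nan)
    Y[rows, cols] = df["star_rating"].to_numpy()
    return Y

# Hypothetical toy ratings for illustration.
toy = pd.DataFrame({
    "customer_id": [10, 10, 20],
    "product_id": ["x", "y", "y"],
    "star_rating": [5.0, 3.0, 4.0],
})
users = np.unique(toy["customer_id"])
items = np.unique(toy["product_id"])
Y = build_utility_matrix(toy, users, items)
print(Y)
```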


In [132]:
# Number of observed (non-NaN) ratings in the training matrix
nnn_train_mat = np.sum(~np.isnan(train_mat))

# Number of observed (non-NaN) ratings in the validation matrix
nnn_valid_mat = np.sum(~np.isnan(valid_mat))
print(f"Number of non-nan elements in train_mat: {nnn_train_mat}")
print(f"Number of non-nan elements in valid_mat: {nnn_valid_mat}")

Number of non-nan elements in train_mat: 19085
Number of non-nan elements in valid_mat: 4855
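The two counts add up to 23,940, which is fewer than the 24,466 rows in the subset. This would be consistent with some (customer, product) pairs appearing more than once, since a repeated pair writes to the same matrix cell. The sketch below shows one way such repeats could be counted and, if desired, collapsed by averaging; the toy frame and the choice of averaging are assumptions for illustration, not something the original analysis does.

```python
import pandas as pd

# Hypothetical toy ratings in which customer 1 rated product "a" twice.
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product_id": ["a", "a", "a", "b"],
    "star_rating": [5.0, 4.0, 3.0, 2.0],
})

# Rows that repeat an earlier (customer, product) pair.
n_repeated = int(toy.duplicated(subset=["customer_id", "product_id"]).sum())
print(f"Repeated pairs: {n_repeated}")

# One possible policy: keep a single row per pair with the mean rating.
deduped = toy.groupby(["customer_id", "product_id"], as_index=False)["star_rating"].mean()
print(deduped)
```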


In [133]:
# Evaluation
def error(Y1, Y2):
    """
    Given two matrices of the same shape, 
    returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    """
    Given predicted utility matrix and train and validation utility matrices 
    print train and validation RMSEs.
    """
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))
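Because `np.nanmean` skips NaN cells, the RMSE is computed only over the entries that are observed in the target matrix. A small worked example with hypothetical values (`rmse_on_observed` is an illustrative name that mirrors `error` above):

```python
import numpy as np

def rmse_on_observed(pred, target):
    """RMSE over the cells where target is not NaN."""
    return np.sqrt(np.nanmean((pred - target) ** 2))

# Two observed ratings (5 and 3); the other two cells are unobserved.
target = np.array([[5.0, np.nan],
                   [np.nan, 3.0]])
pred = np.full(target.shape, 4.0)  # predict 4 everywhere

# Squared errors on observed cells are 1 and 1, so the RMSE is 1.0.
score = rmse_on_observed(pred, target)
print(score)
```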

In [134]:
# global average rating baseline
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

Global average train RMSE: 1.07
Global average valid RMSE: 1.06


In [135]:
# Per-user average baseline
avg_n = np.nanmean(train_mat, axis=1)
avg_n[
    np.isnan(avg_n)
] = avg  
pred_n = np.tile(avg_n[:, None], (1, M))
evaluate(pred_n, train_mat, valid_mat, model_name="Per-user average")

Per-user average train RMSE: 0.95
Per-user average valid RMSE: 1.01




In [136]:
# Per-product average baseline
avg_m = np.nanmean(train_mat, axis=0)
avg_m[np.isnan(avg_m)] = avg
pred_m = np.tile(avg_m[None, :], (N, 1))
evaluate(pred_m, train_mat, valid_mat, model_name="Per-product average")

Per-product average train RMSE: 0.94
Per-product average valid RMSE: 1.03




In [137]:
# Average of per-user and per-product average baselines
pred_n_m = (pred_n + pred_m) * 0.5
evaluate(pred_n_m, train_mat, valid_mat, model_name="Per-user and product average")

Per-user and product average train RMSE: 0.89
Per-user and product average valid RMSE: 0.96


In [138]:
# K-nearest neighbours imputation
from sklearn.impute import KNNImputer

num_neighs = [10, 15, 18, 20, 40]
for n_neighbors in num_neighs:
    print("\nNumber of neighbours: ", n_neighbors)
    imputer = KNNImputer(n_neighbors=n_neighbors, keep_empty_features=True)
    pred_knn = imputer.fit_transform(train_mat)
    # Pass model_name so the output is not labelled with the default "Global average"
    evaluate(pred_knn, train_mat, valid_mat, model_name="KNN imputation")


Number of neighbours:  10
KNN imputation train RMSE: 0.00
KNN imputation valid RMSE: 1.05

Number of neighbours:  15
KNN imputation train RMSE: 0.00
KNN imputation valid RMSE: 1.05

Number of neighbours:  18
KNN imputation train RMSE: 0.00
KNN imputation valid RMSE: 1.05

Number of neighbours:  20
KNN imputation train RMSE: 0.00
KNN imputation valid RMSE: 1.05

Number of neighbours:  40
KNN imputation train RMSE: 0.00
KNN imputation valid RMSE: 1.05
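The train RMSE of 0.00 is expected rather than a sign of a good fit: `KNNImputer` only fills in missing cells and leaves observed ratings unchanged, so the training error is zero by construction and only the validation RMSE is informative. A minimal sketch on a hypothetical toy matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy utility matrix with three missing ratings.
toy = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 2.0],
    [np.nan, 5.0, 1.0],
    [5.0, 4.0, 2.0],
])
filled = KNNImputer(n_neighbors=2).fit_transform(toy)

# Observed cells are copied through untouched; only NaN cells are imputed.
observed = ~np.isnan(toy)
unchanged = np.allclose(filled[observed], toy[observed])
print("Observed ratings unchanged:", unchanged)
print("Remaining NaNs:", int(np.isnan(filled).sum()))
```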


In [145]:
# collaborative filtering with TruncatedSVD()
def reconstruct_svd(Z, W, avg_n, avg_m):
    return Z @ W + 0.5 * avg_n[:, None] + 0.5 * avg_m[None]


train_mat_svd = train_mat - 0.5 * avg_n[:, None] - 0.5 * avg_m[None]
train_mat_svd = np.nan_to_num(train_mat_svd)

k_range = [10, 50, 100, 200, 500, 1000]
for k in k_range:
    print("\n")
    tsvd = TruncatedSVD(n_components=k)
    Z = tsvd.fit_transform(train_mat_svd)
    W = tsvd.components_
    X_hat = reconstruct_svd(Z, W, avg_n, avg_m)
    evaluate(X_hat, train_mat, valid_mat, model_name="TruncatedSVD (k = %d)" % k)



TruncatedSVD (k = 10) train RMSE: 0.83
TruncatedSVD (k = 10) valid RMSE: 0.96


TruncatedSVD (k = 50) train RMSE: 0.69
TruncatedSVD (k = 50) valid RMSE: 0.95


TruncatedSVD (k = 100) train RMSE: 0.57
TruncatedSVD (k = 100) valid RMSE: 0.95


TruncatedSVD (k = 200) train RMSE: 0.41
TruncatedSVD (k = 200) valid RMSE: 0.95


TruncatedSVD (k = 500) train RMSE: 0.15
TruncatedSVD (k = 500) valid RMSE: 0.95


TruncatedSVD (k = 1000) train RMSE: 0.01
TruncatedSVD (k = 1000) valid RMSE: 0.95
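The pattern above, where the train RMSE falls towards zero as k grows while the validation RMSE stays near 0.95, reflects a general property of low-rank approximation: each extra component reproduces the fitted matrix more exactly, which does not by itself improve predictions for unseen cells. The sketch below illustrates the first half of that statement on a hypothetical random matrix. It uses the exact `np.linalg.svd` rather than `TruncatedSVD` so that the result is deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))  # hypothetical dense matrix standing in for train_mat_svd
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_rmse(k):
    """RMSE between A and its best rank-k approximation."""
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.sqrt(np.mean((A - A_k) ** 2))

# Reconstruction error shrinks with k and vanishes at full rank (k = 5).
errors = [rank_k_rmse(k) for k in range(1, 6)]
for k, err in enumerate(errors, start=1):
    print(f"k = {k}: reconstruction RMSE = {err:.4f}")
```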


In [140]:
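# Using surprise package
reader = Reader(rating_scale=(1, 5))  # star ratings run from 1 to 5 (also the Reader default)
# Use a separate name so the original `data` DataFrame is not overwritten
surprise_data = Dataset.load_from_df(coll_data, reader)

k = 10
algo = SVD(n_factors=k, random_state=42)

In [144]:
pd.DataFrame(cross_validate(algo, surprise_data, measures=["RMSE"], cv=5, verbose=True))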

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9232  0.9419  0.9577  0.9425  0.9740  0.9479  0.0170  
Fit time          0.04    0.02    0.02    0.02    0.02    0.03    0.01    
Test time         0.01    0.01    0.01    0.01    0.01    0.01    0.00    


Unnamed: 0,test_rmse,fit_time,test_time
0,0.923235,0.03956,0.010284
1,0.941899,0.023084,0.008132
2,0.957692,0.022823,0.008161
3,0.942477,0.022872,0.008036
4,0.974015,0.022733,0.007917
