<div class="alert alert-info">

## Introduction
This project aims to build a recommendation system using Amazon book review data, predicting how users would rate books they haven’t rated yet. The system enhances user experience by offering personalized book suggestions. Two key models are employed:

**Collaborative Filtering**: The collaborative filtering model suggests books based on user-item interactions by identifying patterns in user behavior. It predicts a user’s rating of a book based on the ratings from similar users.
The model achieved a Root Mean Squared Error (RMSE) of 0.94, indicating its performance in predicting ratings.
Content-Based Filtering:

After collaborative filtering, a **content-based model** is applied, incorporating vectors derived from user reviews as each product's features. These review vectors represent important aspects of each book based on the content of its reviews.
By using these vectors, the model computes similarities between books and makes recommendations based on their content. This approach enhances the ability to recommend books with similar content to those the user has already rated.

The content-based model showed a significant improvement in prediction accuracy, reducing the RMSE to 0.69—a notable improvement over the collaborative filtering model.
This two-stage approach, first using collaborative filtering and then leveraging review-derived features in the content-based model, allows the system to generate more accurate and meaningful recommendations for users.

</div>

In [392]:
#import
import os
import re
import numpy as np
import pandas as pd
from hashlib import sha1

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from surprise import accuracy
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import BartForConditionalGeneration, BartTokenizer
from sentence_transformers import SentenceTransformer


<div class="alert alert-info">

## Data Description<a name="2"></a>
This project utilizes a comprehensive dataset sourced from Kaggle, which can be accessed via the following link: (https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-books-v1-02-tsv-zip). The dataset consists of 15 columns and encompasses a substantial total of 3,105,370 rows, providing rich insights into customer feedback and product ratings specifically within the book category.

</div>

In [393]:
# Data
data = pd.read_csv("data/amazon_reviews_us_Books_v1_02.tsv", sep='\t', on_bad_lines='skip')


In [394]:
data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12076615,RQ58W7SMO911M,385730586,122662979,Sisterhood of the Traveling Pants (Book 1),Books,4.0,2.0,3.0,N,N,this book was a great learning novel!,this boook was a great one that you could lear...,2005-10-14
1,US,12703090,RF6IUKMGL8SF,811828964,56191234,The Bad Girl's Guide to Getting What You Want,Books,3.0,5.0,5.0,N,N,Fun Fluff,If you are looking for something to stimulate ...,2005-10-14
2,US,12257412,R1DOSHH6AI622S,1844161560,253182049,"Eisenhorn (A Warhammer 40,000 Omnibus)",Books,4.0,1.0,22.0,N,N,this isn't a review,never read it-a young relative idicated he lik...,2005-10-14
3,US,50732546,RATOTLA3OF70O,373836635,348672532,Colby Conspiracy (Colby Agency),Books,5.0,2.0,2.0,N,N,fine author on her A-game,Though she is honored to be Chicago Woman of t...,2005-10-14
4,US,51964897,R1TNWRKIVHVYOV,262181533,598678717,The Psychology of Proof: Deductive Reasoning i...,Books,4.0,0.0,2.0,N,N,Execellent cursor examination,Review based on a cursory examination by Unive...,2005-10-14


<div class="alert alert-info">

## Exploratory Data Analysis(EDA) <a name="3"></a>

This section describes the exploratory data analysis (EDA) techniques employed to derive valuable insights from the dataset, which will inform the subsequent stages of model development.

To create a targeted subset for analysis, we identified product IDs and customer IDs associated with at least 100 reviews. This filtering process resulted in a dataset containing 24,466 rows, representing customer reviews. Our final subset includes 1,672 distinct products and 1,230 distinct customers, ensuring a diverse representation of both products and customers. This comprehensive approach enables us to conduct a thorough examination of customer feedback, facilitating deeper insights into their preferences and behaviors.

</div>

In [395]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


In [396]:
data.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          0
product_category       0
star_rating            4
helpful_votes          4
total_votes            4
vine                   4
verified_purchase      4
review_headline       57
review_body            4
review_date          133
dtype: int64

In [397]:
data = data.dropna()

In [398]:
data.replace(['null', 'N/A', '', ' '], np.nan, inplace=True)

In [399]:
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

In [400]:
data.nunique()

marketplace                1
customer_id          1502265
review_id            3105184
product_id            779692
product_parent        666003
product_title         713665
product_category           1
star_rating                5
helpful_votes            942
total_votes             1024
vine                       2
verified_purchase          2
review_headline      2456998
review_body          3070458
review_date             3575
dtype: int64

In [401]:
# Selected a subset with customers and products with at least 100 reviews
# Step 1: Filter customers with at least 100 reviews
customer_review_counts = data.groupby('customer_id').size().reset_index(name='review_count')
customers_with_at_least_100_reviews = customer_review_counts[customer_review_counts['review_count'] >= 100]

# Step 2: Filter products with at least 100 reviews
product_review_counts = data.groupby('product_id').size().reset_index(name='review_count')
products_with_at_least_100_reviews = product_review_counts[product_review_counts['review_count'] >= 100]

# Step 3: Filter the original dataset to only include customers and products with at least 100 reviews
filtered_data = data[
    (data['customer_id'].isin(customers_with_at_least_100_reviews['customer_id'])) &
    (data['product_id'].isin(products_with_at_least_100_reviews['product_id']))
]
filtered_data.shape

(24459, 15)

In [402]:
filtered_data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
295,US,50230169,R23MCAR8GSV3T0,0451526341,380925201,Animal farm: A Fairy Story,Books,4.0,2.0,2.0,N,N,Simple Yet Profound,"A generation ago, the sight of the cover of Ge...",2005-10-14
307,US,50776149,RUCZYTA3MP0MR,038551428X,970964974,"The Traveler (Fourth Realm Trilogy, Book 1)",Books,5.0,2.0,5.0,N,N,Great Marketing for a Pretty Good Book,The most interesting thing about this book is ...,2005-10-14
314,US,12598621,RCL2ARHKWH6RL,059035342X,667539744,Harry Potter and the Sorcerer's Stone,Books,5.0,2.0,2.0,N,N,I Think Part Of The Charm Is You Feel Like You...,Even though this is the shortest book in the H...,2005-10-14
363,US,49770667,R2P4B3STC980QP,1594480001,659516630,The Kite Runner,Books,5.0,4.0,4.0,N,N,Praiseworthy first novel,Well I thoroughly enjoyed this book. Although ...,2005-10-14
406,US,49828549,RM0CSYVWKHW5W,0671027360,141370518,Angels & Demons,Books,1.0,31.0,39.0,N,N,Preposterous,"Early in this novel, our hero finds out that a...",2005-10-14


In [403]:
filtered_data.to_csv('data/amazon_reviews_subset_100.csv', index=False)

In [404]:
filtered_data.nunique()

marketplace              1
customer_id           1229
review_id            24459
product_id            1672
product_parent        1485
product_title         1562
product_category         1
star_rating              5
helpful_votes          423
total_votes            460
vine                     1
verified_purchase        2
review_headline      23298
review_body          24307
review_date           2574
dtype: int64


<div class="alert alert-info">
    
## Collaborative Filtering
**Collaborative Filtering** is a widely-used technique for addressing the challenge of missing entries in a utility matrix, leveraging user behavior and interactions to make recommendations. This approach operates on the principle that users who have agreed in the past will continue to agree in the future, allowing the model to infer preferences based on the preferences of similar users.

This method can be likened to advanced dimensionality reduction techniques such as Latent Semantic Analysis (LSA) or Truncated Singular Value Decomposition (SVD). By capturing the underlying relationships between users and items, collaborative filtering helps to predict missing values, enhancing the accuracy and relevance of recommendations.

In this project, we implemented collaborative filtering as our baseline model to improve user experience by personalizing content based on historical data, thus enabling more informed decision-making.
</div>

In [405]:
# Reading the data
coll_data = filtered_data[['customer_id', 'product_id', 'star_rating']].reset_index(drop=True)
coll_data.head()

Unnamed: 0,customer_id,product_id,star_rating
0,50230169,0451526341,4.0
1,50776149,038551428X,5.0
2,12598621,059035342X,5.0
3,49770667,1594480001,5.0
4,49828549,0671027360,1.0


In [406]:
coll_data.nunique()

customer_id    1229
product_id     1672
star_rating       5
dtype: int64

In [407]:
# Number of customers and products
user_key = "customer_id"
item_key = "product_id"
N = len(np.unique(coll_data[user_key])) 
M = len(np.unique(coll_data[item_key]))
print(f"Number of customers (N)  : {N}")
print(f"Number of products (M) : {M}")

Number of customers (N)  : 1229
Number of products (M) : 1672


In [408]:
non_nan_ratings_percentage = (len(coll_data) / (N * M) * 100) 
print(f"Non-nan ratings percentage: {np.round(non_nan_ratings_percentage,3)}")

Non-nan ratings percentage: 1.19


In [409]:
avg_nratings_per_user = coll_data.groupby(user_key).size().mean()
avg_nratings_per_movie = coll_data.groupby(item_key).size().mean()
print(f"Average number of ratings per customer : {avg_nratings_per_user:.2f}")
print(f"Average number of ratings per product: {avg_nratings_per_movie:.2f}")

Average number of ratings per customer : 19.90
Average number of ratings per product: 14.63


In [410]:
# Data Splitting
X = coll_data.copy()
y = coll_data['customer_id']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)

In [411]:
user_mapper = dict(zip(np.unique(coll_data[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(coll_data[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(coll_data[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(coll_data[item_key])))

In [412]:
def create_Y_from_ratings(data, N, M):
    Y = np.zeros((N, M))
    Y.fill(np.nan)
    for index, val in data.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        Y[n, m] = val["star_rating"]

    return Y

train_mat = create_Y_from_ratings(X_train, N, M)
valid_mat = create_Y_from_ratings(X_valid, N, M)


In [413]:
# What's the number of non-nan elements in train_mat (nnn_train_mat)?
nnn_train_mat = np.sum(~np.isnan(train_mat)) 

# What's the number of non-nan elements in valid_mat (nnn_valid_mat)?
nnn_valid_mat = np.sum(~np.isnan(valid_mat)) 
print(f"Number of non-nan elements in train_mat: {nnn_train_mat}")
print(f"Number of non-nan elements in valid_mat: {nnn_valid_mat}")

Number of non-nan elements in train_mat: 19059
Number of non-nan elements in valid_mat: 4858


In [414]:
# Evaluation
def error(Y1, Y2):
    """
    Given two matrices of the same shape, 
    returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    """
    Given predicted utility matrix and train and validation utility matrices 
    print train and validation RMSEs.
    """
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))

In [415]:
# global average rating baseline
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

Global average train RMSE: 1.06
Global average valid RMSE: 1.09


In [416]:
# Per-user average baseline
avg_n = np.nanmean(train_mat, axis=1)
avg_n[
    np.isnan(avg_n)
] = avg  
pred_n = np.tile(avg_n[:, None], (1, M))
evaluate(pred_n, train_mat, valid_mat, model_name="Per-user average")

Per-user average train RMSE: 0.94
Per-user average valid RMSE: 1.04


  avg_n = np.nanmean(train_mat, axis=1)


In [417]:
# Per-product average baseline
avg_m = np.nanmean(train_mat, axis=0)
avg_m[np.isnan(avg_m)] = avg
pred_m = np.tile(avg_m[None, :], (N, 1))
evaluate(pred_m, train_mat, valid_mat, model_name="Per-product average")

Per-product average train RMSE: 0.93
Per-product average valid RMSE: 1.06


  avg_m = np.nanmean(train_mat, axis=0)


In [418]:
# Average of per-user and per-product average baselines
pred_n_m = (pred_n + pred_m) * 0.5
evaluate(pred_n_m, train_mat, valid_mat, model_name="Per-user and product average")

Per-user and product average train RMSE: 0.88
Per-user and product average valid RMSE: 0.99


In [419]:
# K-nearest neighbours imputation
from sklearn.impute import KNNImputer

num_neighs = [10, 15, 18, 20, 40]
for n_neighbors in num_neighs:
    print("\nNumber of neighbours: ", n_neighbors)
    imputer = KNNImputer(n_neighbors=n_neighbors, keep_empty_features=True)
    pred_knn = imputer.fit_transform(train_mat)
    evaluate(pred_knn, train_mat, valid_mat)


Number of neighbours:  10
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  15
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  18
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  20
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  40
Global average train RMSE: 0.00
Global average valid RMSE: 1.08


In [420]:
# collaborative filtering with TruncatedSVD()
def reconstruct_svd(Z, W, avg_n, avg_m):
    return Z @ W + 0.5 * avg_n[:, None] + 0.5 * avg_m[None]


train_mat_svd = train_mat - 0.5 * avg_n[:, None] - 0.5 * avg_m[None]
train_mat_svd = np.nan_to_num(train_mat_svd)

k_range = [10, 50, 100, 200, 500, 1000]
for k in k_range:
    print("\n")
    tsvd = TruncatedSVD(n_components=k)
    Z = tsvd.fit_transform(train_mat_svd)
    W = tsvd.components_
    X_hat = reconstruct_svd(Z, W, avg_n, avg_m)
    evaluate(X_hat, train_mat, valid_mat, model_name="TruncatedSVD (k = %d)" % k)



TruncatedSVD (k = 10) train RMSE: 0.82
TruncatedSVD (k = 10) valid RMSE: 0.98


TruncatedSVD (k = 50) train RMSE: 0.68
TruncatedSVD (k = 50) valid RMSE: 0.98


TruncatedSVD (k = 100) train RMSE: 0.56
TruncatedSVD (k = 100) valid RMSE: 0.97


TruncatedSVD (k = 200) train RMSE: 0.40
TruncatedSVD (k = 200) valid RMSE: 0.97


TruncatedSVD (k = 500) train RMSE: 0.15
TruncatedSVD (k = 500) valid RMSE: 0.97


TruncatedSVD (k = 1000) train RMSE: 0.01
TruncatedSVD (k = 1000) valid RMSE: 0.97


In [421]:
# Using surprise package
reader = Reader()
data = Dataset.load_from_df(coll_data, reader)  

k = 10
algo = SVD(n_factors=k, random_state=42)

In [422]:
pd.DataFrame(cross_validate(algo, data, measures=["RMSE"], cv=5, verbose=True))

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9471  0.9443  0.9562  0.9424  0.9480  0.9476  0.0048  
Fit time          0.02    0.02    0.02    0.02    0.02    0.02    0.00    
Test time         0.01    0.01    0.01    0.01    0.01    0.01    0.00    


Unnamed: 0,test_rmse,fit_time,test_time
0,0.947094,0.023036,0.00897
1,0.944289,0.022839,0.008253
2,0.956209,0.022565,0.008341
3,0.942358,0.022773,0.008378
4,0.948014,0.022638,0.008283


<div class="alert alert-info">
    
## Incorporating Reviews
The model used makes predictions using features derived from product reviews, capturing the essence of each book’s content.

The process involves several key steps:

**Review Cleaning and Summarization**: The first step is to clean the raw reviews, removing irrelevant content, and ensuring the text data is ready for further analysis.
The cleaned reviews are then summarized by product using the `facebook/bart-large-cnn` model. This summarization reduces the review data to its essential information for each product.

**Vectorization**: Once the reviews are summarized, they are transformed into vectors using the `SentenceTransformer('all-MiniLM-L6-v2')` model. This converts the textual summaries into numerical representations (embeddings) that capture the semantic meaning of the reviews.

**Dimensionality Reduction**: To handle the high dimensionality of the vectors, Principal Component Analysis (`PCA`) is applied. PCA reduces the complexity of the data while preserving the most important information, making the model more efficient without losing significant information.

**Cosine Similarity**: With the reduced-dimension vectors, cosine similarity is used to measure the similarity between different products. This metric identifies products that are most similar to each other based on their review vectors.

**Prediction and Evaluation**: The model predicts a user’s rating for an item by looking at the ratings of similar items they have already rated. A weighted average of these ratings is used to make predictions.
The content-based model achieved an `RMSE` of 0.69 on the test set and an `RMSE` of 0.505 on the train set, indicating improved accuracy over the collaborative filtering model.

This approach leverages both review content and sophisticated vectorization techniques to generate personalized and content-driven recommendations, significantly improving the model’s performance.


</div>

In [423]:
review_data = filtered_data[['customer_id', 'product_id', 'star_rating', 'review_body']].reset_index(drop=True)
review_data.head()

Unnamed: 0,customer_id,product_id,star_rating,review_body
0,50230169,0451526341,4.0,"A generation ago, the sight of the cover of Ge..."
1,50776149,038551428X,5.0,The most interesting thing about this book is ...
2,12598621,059035342X,5.0,Even though this is the shortest book in the H...
3,49770667,1594480001,5.0,Well I thoroughly enjoyed this book. Although ...
4,49828549,0671027360,1.0,"Early in this novel, our hero finds out that a..."


In [424]:
review_data.shape

(24459, 4)

In [425]:
review_data.nunique()

customer_id     1229
product_id      1672
star_rating        5
review_body    24307
dtype: int64

In [426]:
# clean the reviews
# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove stopwords and lemmatize
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text
# Clean the 'review_body' column
review_data['cleaned_review_body'] = review_data['review_body'].apply(clean_text)

# Step 1: Group by 'customer_id' and 'product_id' and aggregate
aggregated_data = review_data.groupby(['product_id']).agg(
    average_rating=('star_rating', 'mean'),         # Mean of the star ratings
    aggregated_reviews=('cleaned_review_body', ' '.join)  # Concatenate all cleaned review bodies
).reset_index()

# Display the aggregated DataFrame
aggregated_data.head()



Unnamed: 0,product_id,average_rating,aggregated_reviews
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...
2,006001203X,4.1,health care proffesional tell way traumatising...
3,0060096195,4.428571,started reading one bathtub get id gotten fina...
4,006016848X,3.5625,really like book time everyone want equality s...


In [427]:
aggregated_data.nunique()

product_id            1672
average_rating         423
aggregated_reviews    1672
dtype: int64

In [428]:
aggregated_data.shape

(1672, 3)

In [429]:
# Load the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)


In [430]:
def summarize_reviews(df):
    summaries = []
    for review in df['aggregated_reviews']:
        inputs = tokenizer(review, return_tensors="pt", max_length=1024, truncation=True)
        summary_ids = model.generate(inputs["input_ids"], max_length=50, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

# Summarizing reviews 
aggregated_data['summarized_reviews'] = summarize_reviews(aggregated_data)





In [431]:
aggregated_data.head()

Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...


In [432]:
aggregated_data[['product_id', 'average_rating', 'summarized_reviews']]

Unnamed: 0,product_id,average_rating,summarized_reviews
0,0020425651,5.000000,susan cooper dark rising sequence joined pryda...
1,0028610105,4.400000,sheer diversity recipe japanese thai indian fr...
2,006001203X,4.100000,health care proffesional tell way traumatising...
3,0060096195,4.428571,started reading one bathtub get id gotten fina...
4,006016848X,3.562500,really like book time everyone want equality s...
...,...,...,...
1667,1931412065,4.875000,confirmed low carber year year constant raveno...
1668,1931498717,4.727273,selection book group sunday september since or...
1669,1931561648,4.437500,said time traveler wife nonconventional love s...
1670,1931866007,5.000000,book consists transcript interview mike litman...


In [433]:
aggregated_data.to_csv('data/summarized_review.csv', index=False)

In [434]:
sample_data= aggregated_data

In [435]:
# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2') 

# Encode the summaries to get embeddings
embeddings = model.encode(sample_data['summarized_reviews'].tolist())

# Convert embeddings to a DataFrame
embeddings_df = pd.DataFrame(embeddings)

vectorized_data = pd.concat([sample_data, embeddings_df], axis=1)
vectorized_data.head()


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,0,1,2,3,4,5,...,374,375,376,377,378,379,380,381,382,383
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,-0.070799,-0.061837,-0.003631,0.012133,-0.035551,0.093195,...,0.065155,0.053727,0.003597,0.088892,-0.042067,0.041044,0.070728,-0.043085,-0.064512,0.038242
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,-0.073018,-0.023593,0.066772,0.036159,0.006666,-0.010815,...,0.042942,0.037886,-0.001067,-0.009074,0.065551,-0.054624,0.067726,0.079832,-0.015437,-0.041357
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,0.004559,0.033951,0.020969,0.073957,-0.02227,0.056416,...,-0.005109,0.098458,0.004446,0.010148,-0.050168,0.036669,0.133147,0.01329,0.06339,0.043042
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,-0.066638,-0.083743,0.053587,0.066727,-0.009095,0.022768,...,0.046725,0.029233,0.014859,0.084216,-0.085885,0.048461,0.023749,0.003057,-0.0757,-0.034523
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,-0.061388,0.048214,0.015069,-0.003492,-0.092024,0.02274,...,0.041296,0.018716,0.018985,0.015178,-0.019459,0.000856,0.134094,-0.078086,0.004868,-0.001025


In [436]:
# Dimensionality Reduction
# Set the desired number of components
n_components = 20

# Initialize TruncatedSVD and fit-transform the embeddings
svd = TruncatedSVD(n_components=n_components)
reduced_embeddings = svd.fit_transform(embeddings_df)

# Convert the reduced embeddings to a DataFrame
reduced_embeddings_df = pd.DataFrame(reduced_embeddings)

# Concatenate the original data with the reduced embeddings
vectorized_data = pd.concat([sample_data, reduced_embeddings_df], axis=1)
vectorized_data.head()


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,0,1,2,3,4,5,...,10,11,12,13,14,15,16,17,18,19
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,0.635252,-0.269045,0.012184,-0.010487,0.058941,0.251971,...,-0.184776,-0.0352,-0.05286,0.074156,-0.033203,0.029792,0.076104,0.132746,0.045042,0.031042
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,0.261072,0.044054,0.13019,0.080427,-0.001261,0.046605,...,-0.07835,-0.052345,0.057356,0.110849,0.248279,-0.254497,0.029056,0.108632,0.106938,0.194251
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,0.257239,0.295282,0.238382,0.178325,-0.07783,0.070369,...,-0.021567,0.065485,0.269843,0.109493,0.062807,-0.056614,0.152163,-0.151631,-0.050819,0.159799
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,0.683657,-0.247198,0.033077,0.060875,-0.078793,-0.056407,...,0.039104,-0.121871,0.087071,0.013348,0.125837,0.046312,-0.081648,0.077465,-0.08589,-0.125254
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,0.492497,0.233496,0.10542,0.168974,0.087518,-0.126515,...,-0.033602,-0.117381,-0.153551,-0.093654,0.01388,0.031424,0.073882,-0.089763,-0.05506,-0.027179


In [437]:
vectorized_data.to_csv('data/vectorized_data.csv', index=False)

In [441]:
# Finding Similarity and calculate RMSE 

# the relevant features start from the 4th column onwards (0-indexed)
features = vectorized_data.iloc[:, 4:]

# Step 1: Standardize the feature set
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Step 2: Split the data into training and testing sets
train_data, test_data = train_test_split(vectorized_data, test_size=0.2, random_state=42)

# Reset indices for easier access
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

# Select vector features from the training data
train_vector_features = scaled_features[train_data.index]

# Calculate cosine similarity matrix for training data
cosine_sim = cosine_similarity(train_vector_features)

# Initialize a list to store the predicted ratings
predicted_ratings = []

# Step 4: Loop over each item in the test set to predict its rating
for idx, row in test_data.iterrows():
    # Extract the index of the current test item
    test_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the test item and all training items
    similarity_scores = cosine_sim[test_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        # If similarity scores sum to zero, default to the mean rating of the training set
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    predicted_ratings.append(predicted_rating)

# Add the predicted ratings to the test DataFrame
test_data['predicted_rating'] = predicted_ratings

# Step 5: Calculate RMSE on the test set
rmse = round(np.sqrt(mean_squared_error(test_data['average_rating'], test_data['predicted_rating'])),4)
print(f"Test RMSE: {rmse}")



Test RMSE: 0.6693


In [442]:
# calculate RMSE for the training set 
train_predictions = []
for idx, row in train_data.iterrows():
    # Extract the index of the current train item
    train_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the train item and all training items
    similarity_scores = cosine_sim[train_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    train_predictions.append(predicted_rating)

# Add the predicted ratings to the train DataFrame
train_data['predicted_rating'] = train_predictions

# Calculate RMSE on the training set
train_rmse = round(np.sqrt(mean_squared_error(train_data['average_rating'], train_data['predicted_rating'])),4)
print(f"Train RMSE: {train_rmse}")

Train RMSE: 0.4971


In [443]:
def get_similar_products(product_id, top_n=5):
    # Find the index of the given product ID
    product_idx = vectorized_data_reduced[vectorized_data_reduced['product_id'] == product_id].index[0]
    
    # Get the similarity scores for this product
    similarity_scores = cosine_sim[product_idx]
    
    # Get the indices of the most similar products, excluding the product itself
    similar_indices = np.argsort(similarity_scores)[::-1][1:top_n + 1]
    
    # Get the similar products' details
    similar_products = vectorized_data_reduced.iloc[similar_indices][['product_id', 'average_rating']]
    
    return similar_products

# Example usage: Predict similar products for a given product ID
similar_products = get_similar_products('0020425651', top_n=5)
print(similar_products)

     product_id  average_rating
352  0345413350        4.526316
870  0451166582        4.428571
927  0505523892        3.636364
104  0064406970        4.666667
763  0440995779        4.142857


In [444]:
vectorized_sentiment_data = vectorized_data

In [445]:
from transformers import pipeline

# Load a pre-trained sentiment analysis model
sentiment_pipeline = pipeline("sentiment-analysis")

# Function to get sentiment
def get_sentiment(review):
    result = sentiment_pipeline(review)[0]
    return result['label'], result['score']

# Apply the function to summarize the reviews
vectorized_sentiment_data[['sentiment', 'sentiment_score']] = vectorized_sentiment_data['summarized_reviews'].apply(get_sentiment).apply(pd.Series)
vectorized_sentiment_data.head()


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,0,1,2,3,4,5,...,12,13,14,15,16,17,18,19,sentiment,sentiment_score
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,0.635252,-0.269045,0.012184,-0.010487,0.058941,0.251971,...,-0.05286,0.074156,-0.033203,0.029792,0.076104,0.132746,0.045042,0.031042,POSITIVE,0.998871
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,0.261072,0.044054,0.13019,0.080427,-0.001261,0.046605,...,0.057356,0.110849,0.248279,-0.254497,0.029056,0.108632,0.106938,0.194251,POSITIVE,0.975348
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,0.257239,0.295282,0.238382,0.178325,-0.07783,0.070369,...,0.269843,0.109493,0.062807,-0.056614,0.152163,-0.151631,-0.050819,0.159799,NEGATIVE,0.997175
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,0.683657,-0.247198,0.033077,0.060875,-0.078793,-0.056407,...,0.087071,0.013348,0.125837,0.046312,-0.081648,0.077465,-0.08589,-0.125254,POSITIVE,0.993411
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,0.492497,0.233496,0.10542,0.168974,0.087518,-0.126515,...,-0.153551,-0.093654,0.01388,0.031424,0.073882,-0.089763,-0.05506,-0.027179,NEGATIVE,0.977043


In [446]:
sentiment_mapping = {
    'POSITIVE': 1,
    'NEGATIVE': -1,
    'NEUTRAL': 0  
}
vectorized_sentiment_data['sentiment'] = vectorized_sentiment_data['sentiment'].map(sentiment_mapping)
vectorized_sentiment_data.head()

Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,0,1,2,3,4,5,...,12,13,14,15,16,17,18,19,sentiment,sentiment_score
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,0.635252,-0.269045,0.012184,-0.010487,0.058941,0.251971,...,-0.05286,0.074156,-0.033203,0.029792,0.076104,0.132746,0.045042,0.031042,1,0.998871
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,0.261072,0.044054,0.13019,0.080427,-0.001261,0.046605,...,0.057356,0.110849,0.248279,-0.254497,0.029056,0.108632,0.106938,0.194251,1,0.975348
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,0.257239,0.295282,0.238382,0.178325,-0.07783,0.070369,...,0.269843,0.109493,0.062807,-0.056614,0.152163,-0.151631,-0.050819,0.159799,-1,0.997175
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,0.683657,-0.247198,0.033077,0.060875,-0.078793,-0.056407,...,0.087071,0.013348,0.125837,0.046312,-0.081648,0.077465,-0.08589,-0.125254,1,0.993411
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,0.492497,0.233496,0.10542,0.168974,0.087518,-0.126515,...,-0.153551,-0.093654,0.01388,0.031424,0.073882,-0.089763,-0.05506,-0.027179,-1,0.977043


In [453]:
# Ensure all column names are strings
vectorized_sentiment_data.columns = vectorized_sentiment_data.columns.astype(str)

# Step 1: Standardize the feature set
features_sentiment = vectorized_sentiment_data.iloc[:, 4:]  
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features_sentiment)


In [454]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import cosine_similarity

# Ensure all column names are strings
# vectorized_data_reduced.columns = vectorized_sentiment_data.columns.astype(str)

# # Step 1: Standardize the feature set
# features = vectorized_sentiment_data.iloc[:, 4:]  
# scaler = StandardScaler()
# scaled_features = scaler.fit_transform(features)

# Step 2: Split the data into training and testing sets
train_data, test_data = train_test_split(vectorized_sentiment_data, test_size=0.2, random_state=42)

# Reset indices for easier access
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

# Select vector features from the training data
train_vector_features = scaled_features[train_data.index]
test_vector_features = scaled_features[test_data.index]

# Calculate cosine similarity matrix for training data
cosine_sim_train = cosine_similarity(train_vector_features)

# Step 3: Predict ratings for the train set
predicted_ratings_train = []
for idx, row in train_data.iterrows():
    # Compute similarity between the current training item and all other training items
    similarity_scores = cosine_sim_train[idx, :]
    
    # Get the indices of the top 5 most similar training items, excluding the item itself
    similar_indices = np.argsort(similarity_scores)[::-1][1:6]
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    predicted_ratings_train.append(predicted_rating)

# Calculate RMSE for training set
train_rmse = round(np.sqrt(mean_squared_error(train_data['average_rating'], predicted_ratings_train)), 4)
print(f"Train RMSE: {train_rmse}")

# Step 4: Predict ratings for the test set using the similarity matrix of training data
cosine_sim_test = cosine_similarity(test_vector_features, train_vector_features)
predicted_ratings_test = []
for idx, row in test_data.iterrows():
    # Compute similarity between the current test item and all training items
    similarity_scores = cosine_sim_test[idx, :]
    
    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    predicted_ratings_test.append(predicted_rating)

# Calculate RMSE for test set
test_rmse = round(np.sqrt(mean_squared_error(test_data['average_rating'], predicted_ratings_test)), 4)
print(f"Test RMSE: {test_rmse}")



Train RMSE: 0.6721
Test RMSE: 0.687
