<div class="alert alert-info">

## Introduction


</div>

In [1]:
#import
import os

import numpy as np
import pandas as pd
from hashlib import sha1

from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from surprise import accuracy
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

<div class="alert alert-info">

## Data Description<a name="2"></a>
Given the large size of the dataset, only 10000 rows of the dataset is used for the models.
This project utilizes a comprehensive dataset sourced from Kaggle, which can be accessed via the following link: (https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-books-v1-02-tsv-zip). The dataset consists of 15 columns and encompasses a substantial total of 3,105,370 rows, providing rich insights into customer feedback and product ratings specifically within the book category.

Due to the extensive size of the dataset, a subset of 10,000 rows has been selected for analysis and modeling. This reduction allows for efficient processing while still capturing the diverse range of reviews and ratings present in the original dataset.
</div>

In [237]:
# Data
data = pd.read_csv("data/amazon_reviews_us_Books_v1_02.tsv", sep='\t', on_bad_lines='skip')


In [238]:
data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12076615,RQ58W7SMO911M,385730586,122662979,Sisterhood of the Traveling Pants (Book 1),Books,4.0,2.0,3.0,N,N,this book was a great learning novel!,this boook was a great one that you could lear...,2005-10-14
1,US,12703090,RF6IUKMGL8SF,811828964,56191234,The Bad Girl's Guide to Getting What You Want,Books,3.0,5.0,5.0,N,N,Fun Fluff,If you are looking for something to stimulate ...,2005-10-14
2,US,12257412,R1DOSHH6AI622S,1844161560,253182049,"Eisenhorn (A Warhammer 40,000 Omnibus)",Books,4.0,1.0,22.0,N,N,this isn't a review,never read it-a young relative idicated he lik...,2005-10-14
3,US,50732546,RATOTLA3OF70O,373836635,348672532,Colby Conspiracy (Colby Agency),Books,5.0,2.0,2.0,N,N,fine author on her A-game,Though she is honored to be Chicago Woman of t...,2005-10-14
4,US,51964897,R1TNWRKIVHVYOV,262181533,598678717,The Psychology of Proof: Deductive Reasoning i...,Books,4.0,0.0,2.0,N,N,Execellent cursor examination,Review based on a cursory examination by Unive...,2005-10-14


<div class="alert alert-info">

## Exploratory Data Analysis(EDA) <a name="3"></a>

This section describes the exploratory data analysis (EDA) techniques employed to derive valuable insights from the dataset, which will inform the subsequent stages of model development.

To create a targeted subset for analysis, we identified product IDs and customer IDs associated with at least 100 reviews. This filtering process resulted in a dataset containing 24,466 rows, representing customer reviews. Our final subset includes 1,672 distinct products and 1,230 distinct customers, ensuring a diverse representation of both products and customers. This comprehensive approach enables us to conduct a thorough examination of customer feedback, facilitating deeper insights into their preferences and behaviors.

</div>

In [239]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


In [240]:
data.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          0
product_category       0
star_rating            4
helpful_votes          4
total_votes            4
vine                   4
verified_purchase      4
review_headline       57
review_body            4
review_date          133
dtype: int64

In [228]:
data = data.dropna()

In [229]:
data.replace(['null', 'N/A', '', ' '], np.nan, inplace=True)

In [230]:
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

In [231]:
data.nunique()

marketplace                1
customer_id          1502265
review_id            3105184
product_id            779692
product_parent        666003
product_title         713665
product_category           1
star_rating                5
helpful_votes            942
total_votes             1024
vine                       2
verified_purchase          2
review_headline      2456998
review_body          3070458
review_date             3575
dtype: int64

In [241]:
# Selected a subset with customers and products with at least 100 reviews
# Step 1: Filter customers with at least 100 reviews
customer_review_counts = data.groupby('customer_id').size().reset_index(name='review_count')
customers_with_at_least_100_reviews = customer_review_counts[customer_review_counts['review_count'] >= 100]

# Step 2: Filter products with at least 100 reviews
product_review_counts = data.groupby('product_id').size().reset_index(name='review_count')
products_with_at_least_100_reviews = product_review_counts[product_review_counts['review_count'] >= 100]

# Step 3: Filter the original dataset to only include customers and products with at least 100 reviews
filtered_data = data[
    (data['customer_id'].isin(customers_with_at_least_100_reviews['customer_id'])) &
    (data['product_id'].isin(products_with_at_least_100_reviews['product_id']))
]
filtered_data.shape

(24466, 15)

In [242]:
filtered_data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
295,US,50230169,R23MCAR8GSV3T0,0451526341,380925201,Animal farm: A Fairy Story,Books,4.0,2.0,2.0,N,N,Simple Yet Profound,"A generation ago, the sight of the cover of Ge...",2005-10-14
307,US,50776149,RUCZYTA3MP0MR,038551428X,970964974,"The Traveler (Fourth Realm Trilogy, Book 1)",Books,5.0,2.0,5.0,N,N,Great Marketing for a Pretty Good Book,The most interesting thing about this book is ...,2005-10-14
314,US,12598621,RCL2ARHKWH6RL,059035342X,667539744,Harry Potter and the Sorcerer's Stone,Books,5.0,2.0,2.0,N,N,I Think Part Of The Charm Is You Feel Like You...,Even though this is the shortest book in the H...,2005-10-14
363,US,49770667,R2P4B3STC980QP,1594480001,659516630,The Kite Runner,Books,5.0,4.0,4.0,N,N,Praiseworthy first novel,Well I thoroughly enjoyed this book. Although ...,2005-10-14
406,US,49828549,RM0CSYVWKHW5W,0671027360,141370518,Angels & Demons,Books,1.0,31.0,39.0,N,N,Preposterous,"Early in this novel, our hero finds out that a...",2005-10-14


In [149]:
filtered_data.to_csv('data/amazon_reviews_subset_100.csv', index=False)

In [243]:
filtered_data.nunique()

marketplace              1
customer_id           1230
review_id            24466
product_id            1672
product_parent        1485
product_title         1562
product_category         1
star_rating              5
helpful_votes          423
total_votes            460
vine                     1
verified_purchase        2
review_headline      23305
review_body          24314
review_date           2574
dtype: int64


<div class="alert alert-info">
    
## Collaborative Filtering
**Collaborative Filtering** is a widely-used technique for addressing the challenge of missing entries in a utility matrix, leveraging user behavior and interactions to make recommendations. This approach operates on the principle that users who have agreed in the past will continue to agree in the future, allowing the model to infer preferences based on the preferences of similar users.

This method can be likened to advanced dimensionality reduction techniques such as Latent Semantic Analysis (LSA) or Truncated Singular Value Decomposition (SVD). By capturing the underlying relationships between users and items, collaborative filtering helps to predict missing values, enhancing the accuracy and relevance of recommendations.

In this project, we will implement collaborative filtering as our baseline model to improve user experience by personalizing content based on historical data, thus enabling more informed decision-making.
</div>

In [244]:
# Reading the data
coll_data = filtered_data[['customer_id', 'product_id', 'star_rating']].reset_index(drop=True)
coll_data.head()

Unnamed: 0,customer_id,product_id,star_rating
0,50230169,0451526341,4.0
1,50776149,038551428X,5.0
2,12598621,059035342X,5.0
3,49770667,1594480001,5.0
4,49828549,0671027360,1.0


In [245]:
filtered_data.nunique()

marketplace              1
customer_id           1230
review_id            24466
product_id            1672
product_parent        1485
product_title         1562
product_category         1
star_rating              5
helpful_votes          423
total_votes            460
vine                     1
verified_purchase        2
review_headline      23305
review_body          24314
review_date           2574
dtype: int64

In [234]:
coll_data.nunique()

customer_id    1230
product_id     1672
star_rating       5
dtype: int64

In [246]:
# Number of customers and products
user_key = "customer_id"
item_key = "product_id"
N = len(np.unique(coll_data[user_key])) 
M = len(np.unique(coll_data[item_key]))
print(f"Number of customers (N)  : {N}")
print(f"Number of products (M) : {M}")

Number of customers (N)  : 1230
Number of products (M) : 1672


In [52]:
non_nan_ratings_percentage = (len(coll_data) / (N * M) * 100) 
print(f"Non-nan ratings percentage: {np.round(non_nan_ratings_percentage,3)}")

Non-nan ratings percentage: 1.19


In [53]:
avg_nratings_per_user = coll_data.groupby(user_key).size().mean()
avg_nratings_per_movie = coll_data.groupby(item_key).size().mean()
print(f"Average number of ratings per customer : {avg_nratings_per_user:.2f}")
print(f"Average number of ratings per product: {avg_nratings_per_movie:.2f}")

Average number of ratings per customer : 19.89
Average number of ratings per product: 14.63


In [247]:
# Data Splitting
X = coll_data.copy()
y = coll_data['customer_id']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)

In [248]:
user_mapper = dict(zip(np.unique(coll_data[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(coll_data[item_key]), list(range(M))))
user_inverse_mapper = dict(zip(list(range(N)), np.unique(coll_data[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(coll_data[item_key])))

In [249]:
def create_Y_from_ratings(data, N, M):
    Y = np.zeros((N, M))
    Y.fill(np.nan)
    for index, val in data.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        Y[n, m] = val["star_rating"]

    return Y

train_mat = create_Y_from_ratings(X_train, N, M)
valid_mat = create_Y_from_ratings(X_valid, N, M)


In [250]:
# What's the number of non-nan elements in train_mat (nnn_train_mat)?
nnn_train_mat = np.sum(~np.isnan(train_mat)) 

# What's the number of non-nan elements in valid_mat (nnn_valid_mat)?
nnn_valid_mat = np.sum(~np.isnan(valid_mat)) 
print(f"Number of non-nan elements in train_mat: {nnn_train_mat}")
print(f"Number of non-nan elements in valid_mat: {nnn_valid_mat}")

Number of non-nan elements in train_mat: 19085
Number of non-nan elements in valid_mat: 4855


In [251]:
# Evaluation
def error(Y1, Y2):
    """
    Given two matrices of the same shape, 
    returns the root mean squared error (RMSE).
    """
    return np.sqrt(np.nanmean((Y1 - Y2) ** 2))


def evaluate(pred_Y, train_mat, valid_mat, model_name="Global average"):
    """
    Given predicted utility matrix and train and validation utility matrices 
    print train and validation RMSEs.
    """
    print("%s train RMSE: %0.2f" % (model_name, error(pred_Y, train_mat)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_Y, valid_mat)))

In [243]:
# global average rating baseline
avg = np.nanmean(train_mat)
pred_g = np.zeros(train_mat.shape) + avg
evaluate(pred_g, train_mat, valid_mat, model_name="Global average")

Global average train RMSE: 1.06
Global average valid RMSE: 1.09


In [244]:
# Per-user average baseline
avg_n = np.nanmean(train_mat, axis=1)
avg_n[
    np.isnan(avg_n)
] = avg  
pred_n = np.tile(avg_n[:, None], (1, M))
evaluate(pred_n, train_mat, valid_mat, model_name="Per-user average")

Per-user average train RMSE: 0.94
Per-user average valid RMSE: 1.04


  avg_n = np.nanmean(train_mat, axis=1)


In [245]:
# Per-product average baseline
avg_m = np.nanmean(train_mat, axis=0)
avg_m[np.isnan(avg_m)] = avg
pred_m = np.tile(avg_m[None, :], (N, 1))
evaluate(pred_m, train_mat, valid_mat, model_name="Per-product average")

Per-product average train RMSE: 0.93
Per-product average valid RMSE: 1.06


  avg_m = np.nanmean(train_mat, axis=0)


In [246]:
# Average of per-user and per-product average baselines
pred_n_m = (pred_n + pred_m) * 0.5
evaluate(pred_n_m, train_mat, valid_mat, model_name="Per-user and product average")

Per-user and product average train RMSE: 0.88
Per-user and product average valid RMSE: 0.99


In [247]:
# K-nearest neighbours imputation
from sklearn.impute import KNNImputer

num_neighs = [10, 15, 18, 20, 40]
for n_neighbors in num_neighs:
    print("\nNumber of neighbours: ", n_neighbors)
    imputer = KNNImputer(n_neighbors=n_neighbors, keep_empty_features=True)
    pred_knn = imputer.fit_transform(train_mat)
    evaluate(pred_knn, train_mat, valid_mat)


Number of neighbours:  10
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  15
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  18
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  20
Global average train RMSE: 0.00
Global average valid RMSE: 1.08

Number of neighbours:  40
Global average train RMSE: 0.00
Global average valid RMSE: 1.08


In [248]:
# collaborative filtering with TruncatedSVD()
def reconstruct_svd(Z, W, avg_n, avg_m):
    return Z @ W + 0.5 * avg_n[:, None] + 0.5 * avg_m[None]


train_mat_svd = train_mat - 0.5 * avg_n[:, None] - 0.5 * avg_m[None]
train_mat_svd = np.nan_to_num(train_mat_svd)

k_range = [10, 50, 100, 200, 500, 1000]
for k in k_range:
    print("\n")
    tsvd = TruncatedSVD(n_components=k)
    Z = tsvd.fit_transform(train_mat_svd)
    W = tsvd.components_
    X_hat = reconstruct_svd(Z, W, avg_n, avg_m)
    evaluate(X_hat, train_mat, valid_mat, model_name="TruncatedSVD (k = %d)" % k)



TruncatedSVD (k = 10) train RMSE: 0.82
TruncatedSVD (k = 10) valid RMSE: 0.98


TruncatedSVD (k = 50) train RMSE: 0.68
TruncatedSVD (k = 50) valid RMSE: 0.98


TruncatedSVD (k = 100) train RMSE: 0.56
TruncatedSVD (k = 100) valid RMSE: 0.97


TruncatedSVD (k = 200) train RMSE: 0.40
TruncatedSVD (k = 200) valid RMSE: 0.97


TruncatedSVD (k = 500) train RMSE: 0.15
TruncatedSVD (k = 500) valid RMSE: 0.97


TruncatedSVD (k = 1000) train RMSE: 0.01
TruncatedSVD (k = 1000) valid RMSE: 0.97


In [249]:
# Using surprise package
reader = Reader()
data = Dataset.load_from_df(coll_data, reader)  

k = 10
algo = SVD(n_factors=k, random_state=42)

In [250]:
pd.DataFrame(cross_validate(algo, data, measures=["RMSE"], cv=5, verbose=True))

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9642  0.9487  0.9521  0.9302  0.9462  0.9483  0.0109  
Fit time          0.03    0.02    0.02    0.02    0.02    0.02    0.00    
Test time         0.01    0.01    0.01    0.01    0.01    0.01    0.00    


Unnamed: 0,test_rmse,fit_time,test_time
0,0.964154,0.025022,0.009257
1,0.948651,0.023375,0.008167
2,0.952118,0.02266,0.008023
3,0.930238,0.022565,0.007989
4,0.946221,0.022534,0.007865


<div class="alert alert-info">
    
## Incorporating Reviews

</div>

In [272]:
review_data = filtered_data[['customer_id', 'product_id', 'star_rating', 'review_body']].reset_index(drop=True)
review_data.head()

Unnamed: 0,customer_id,product_id,star_rating,review_body
0,50230169,0451526341,4.0,"A generation ago, the sight of the cover of Ge..."
1,50776149,038551428X,5.0,The most interesting thing about this book is ...
2,12598621,059035342X,5.0,Even though this is the shortest book in the H...
3,49770667,1594480001,5.0,Well I thoroughly enjoyed this book. Although ...
4,49828549,0671027360,1.0,"Early in this novel, our hero finds out that a..."


In [273]:
review_data.shape

(24466, 4)

In [274]:
review_data.nunique()

customer_id     1230
product_id      1672
star_rating        5
review_body    24314
dtype: int64

In [69]:
# clean the reviews
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


# Initialize stopwords and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove stopwords and lemmatize
    text = ' '.join(lemmatizer.lemmatize(word) for word in text.split() if word not in stop_words)
    return text
# Clean the 'review_body' column
review_data['cleaned_review_body'] = review_data['review_body'].apply(clean_text)

# Step 1: Group by 'customer_id' and 'product_id' and aggregate
aggregated_data = review_data.groupby(['product_id']).agg(
    average_rating=('star_rating', 'mean'),         # Mean of the star ratings
    aggregated_reviews=('cleaned_review_body', ' '.join)  # Concatenate all cleaned review bodies
).reset_index()

# Display the aggregated DataFrame
aggregated_data.head()



Unnamed: 0,product_id,average_rating,aggregated_reviews
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...
2,006001203X,4.1,health care proffesional tell way traumatising...
3,0060096195,4.428571,started reading one bathtub get id gotten fina...
4,006016848X,3.5625,really like book time everyone want equality s...


In [121]:
aggregated_data.nunique()

product_id            1672
average_rating         425
aggregated_reviews    1672
summarized_reviews    1669
dtype: int64

In [70]:
aggregated_data.shape

(1672, 3)

In [71]:
import pandas as pd
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)


In [74]:

def summarize_reviews(df):
    summaries = []
    for review in df['aggregated_reviews']:
        inputs = tokenizer(review, return_tensors="pt", max_length=1024, truncation=True)
        summary_ids = model.generate(inputs["input_ids"], max_length=50, num_beams=4, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

# Summarizing reviews 
aggregated_data['summarized_reviews'] = summarize_reviews(aggregated_data)



KeyError: "['customer_id'] not in index"

In [352]:
aggregated_data.head()

Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,sentiment,sentiment_score
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,1,0.998871
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,1,0.975348
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,-1,0.997175
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,1,0.993411
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,-1,0.977043


In [323]:
aggregated_data[['product_id', 'average_rating', 'summarized_reviews']]

Unnamed: 0,product_id,average_rating,summarized_reviews
0,0020425651,5.000000,susan cooper dark rising sequence joined pryda...
1,0028610105,4.400000,sheer diversity recipe japanese thai indian fr...
2,006001203X,4.100000,health care proffesional tell way traumatising...
3,0060096195,4.428571,started reading one bathtub get id gotten fina...
4,006016848X,3.562500,really like book time everyone want equality s...
...,...,...,...
1667,1931412065,4.875000,confirmed low carber year year constant raveno...
1668,1931498717,4.727273,selection book group sunday september since or...
1669,1931561648,4.437500,said time traveler wife nonconventional love s...
1670,1931866007,5.000000,book consists transcript interview mike litman...


In [286]:
aggregated_data.to_csv('data/summarized_review.csv', index=False)

In [324]:
sample_data= aggregated_data

In [325]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2') 

# Encode the summaries to get embeddings
embeddings = model.encode(sample_data['summarized_reviews'].tolist())

# Convert embeddings to a DataFrame
embeddings_df = pd.DataFrame(embeddings)

vectorized_data = pd.concat([sample_data, embeddings_df], axis=1)
vectorized_data.head()


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,sentiment,sentiment_score,0,1,2,3,...,374,375,376,377,378,379,380,381,382,383
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,1,0.998871,-0.070799,-0.061837,-0.003631,0.012133,...,0.065155,0.053727,0.003597,0.088892,-0.042067,0.041044,0.070728,-0.043085,-0.064512,0.038242
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,1,0.975348,-0.073018,-0.023593,0.066772,0.036159,...,0.042942,0.037886,-0.001067,-0.009074,0.065551,-0.054624,0.067726,0.079832,-0.015437,-0.041357
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,-1,0.997175,0.004559,0.033951,0.020969,0.073957,...,-0.005109,0.098458,0.004446,0.010148,-0.050168,0.036669,0.133147,0.01329,0.06339,0.043042
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,1,0.993411,-0.066638,-0.083743,0.053587,0.066727,...,0.046725,0.029233,0.014859,0.084216,-0.085885,0.048461,0.023749,0.003057,-0.0757,-0.034523
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,-1,0.977043,-0.061388,0.048214,0.015069,-0.003492,...,0.041296,0.018716,0.018985,0.015178,-0.019459,0.000856,0.134094,-0.078086,0.004868,-0.001025


In [375]:
from sklearn.decomposition import TruncatedSVD

# Set the desired number of components
n_components = 20

# Initialize TruncatedSVD and fit-transform the embeddings
svd = TruncatedSVD(n_components=n_components)
reduced_embeddings = svd.fit_transform(embeddings_df)

# Convert the reduced embeddings to a DataFrame
reduced_embeddings_df = pd.DataFrame(reduced_embeddings)

# Concatenate the original data with the reduced embeddings
vectorized_data_reduced = pd.concat([sample_data, reduced_embeddings_df], axis=1)
vectorized_data_reduced.head()


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,sentiment,sentiment_score,0,1,2,3,...,10,11,12,13,14,15,16,17,18,19
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,1,0.998871,0.635252,-0.269041,0.012194,-0.010467,...,-0.188243,-0.027312,-0.050258,0.066114,-0.039863,-0.032078,0.067193,-0.0369,0.138247,0.047895
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,1,0.975348,0.261072,0.044044,0.130149,0.080442,...,-0.076873,-0.055025,0.061867,0.125944,0.289912,0.249448,0.005777,0.024197,0.114358,0.179739
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,-1,0.997175,0.257239,0.295288,0.238396,0.178317,...,-0.015927,0.056299,0.268024,0.118565,0.062791,0.051532,0.155097,0.039039,-0.163372,0.166068
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,1,0.993411,0.683657,-0.247203,0.033083,0.060892,...,0.040029,-0.122108,0.089621,0.021345,0.117762,-0.058642,-0.08317,-0.118794,0.001701,-0.112346
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,-1,0.977043,0.492497,0.233488,0.105423,0.168909,...,-0.038277,-0.107366,-0.157789,-0.082718,0.001719,-0.044761,0.076308,0.034848,-0.099364,-0.002617


In [376]:


# Assuming your final dataset is stored in 'vectorized_data'
# and the relevant features start from the 6th column onwards (0-indexed)
features = vectorized_data_reduced.iloc[:, 6:]

# Step 1: Standardize the feature set
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Step 2: Split the data into training and testing sets
train_data, test_data = train_test_split(vectorized_data_reduced, test_size=0.2, random_state=42)

# Reset indices for easier access
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

# Select vector features from the training data
train_vector_features = scaled_features[train_data.index]

# Calculate cosine similarity matrix for training data
cosine_sim = cosine_similarity(train_vector_features)

# Initialize a list to store the predicted ratings
predicted_ratings = []

# Step 4: Loop over each item in the test set to predict its rating
for idx, row in test_data.iterrows():
    # Extract the index of the current test item
    test_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the test item and all training items
    similarity_scores = cosine_sim[test_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        # If similarity scores sum to zero, default to the mean rating of the training set
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    predicted_ratings.append(predicted_rating)

# Add the predicted ratings to the test DataFrame
test_data['predicted_rating'] = predicted_ratings

# Step 5: Calculate RMSE on the test set
rmse = np.sqrt(mean_squared_error(test_data['average_rating'], test_data['predicted_rating']))
print(f"Test RMSE: {rmse}")

# calculate RMSE for the training set 
train_predictions = []
for idx, row in train_data.iterrows():
    # Extract the index of the current train item
    train_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the train item and all training items
    similarity_scores = cosine_sim[train_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    train_predictions.append(predicted_rating)

# Add the predicted ratings to the train DataFrame
train_data['predicted_rating'] = train_predictions

# Calculate RMSE on the training set
train_rmse = np.sqrt(mean_squared_error(train_data['average_rating'], train_data['predicted_rating']))
print(f"Train RMSE: {train_rmse}")


Test RMSE: 0.6891104190425241
Train RMSE: 0.5050478806823305


In [287]:
vectorized_data.to_csv('data/vectorized_data.csv', index=False)

In [385]:
# Running the model and calculate RMSE 

# the relevant features start from the 6th column onwards (0-indexed)
features = vectorized_data_reduced.iloc[:, 6:]

# Step 1: Standardize the feature set
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# Step 2: Split the data into training and testing sets
train_data, test_data = train_test_split(vectorized_data_reduced, test_size=0.2, random_state=42)

# Reset indices for easier access
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)

# Select vector features from the training data
train_vector_features = scaled_features[train_data.index]

# Calculate cosine similarity matrix for training data
cosine_sim = cosine_similarity(train_vector_features)

# Initialize a list to store the predicted ratings
predicted_ratings = []

# Step 4: Loop over each item in the test set to predict its rating
for idx, row in test_data.iterrows():
    # Extract the index of the current test item
    test_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the test item and all training items
    similarity_scores = cosine_sim[test_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        # If similarity scores sum to zero, default to the mean rating of the training set
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    predicted_ratings.append(predicted_rating)

# Add the predicted ratings to the test DataFrame
test_data['predicted_rating'] = predicted_ratings

# Step 5: Calculate RMSE on the test set
rmse = round(np.sqrt(mean_squared_error(test_data['average_rating'], test_data['predicted_rating'])),4)
print(f"Test RMSE: {rmse}")



Test RMSE: 0.6891


In [386]:
# calculate RMSE for the training set 
train_predictions = []
for idx, row in train_data.iterrows():
    # Extract the index of the current train item
    train_index = row.name  # Get the original index in the dataset
    
    # Compute similarity between the train item and all training items
    similarity_scores = cosine_sim[train_index, :]

    # Get the indices of the top 5 most similar training items
    similar_indices = np.argsort(similarity_scores)[::-1][:5]  # Top 5
    
    # Retrieve the ratings of these similar training items
    similar_ratings = train_data.iloc[similar_indices]['average_rating']
    
    # Retrieve the corresponding similarity scores
    similar_sim_scores = similarity_scores[similar_indices]
    
    # Compute the weighted average to predict the rating
    if np.sum(similar_sim_scores) == 0:
        predicted_rating = train_data['average_rating'].mean()
    else:
        predicted_rating = np.dot(similar_sim_scores, similar_ratings) / np.sum(similar_sim_scores)
    
    # Append the predicted rating to the list
    train_predictions.append(predicted_rating)

# Add the predicted ratings to the train DataFrame
train_data['predicted_rating'] = train_predictions

# Calculate RMSE on the training set
train_rmse = round(np.sqrt(mean_squared_error(train_data['average_rating'], train_data['predicted_rating'])),4)
print(f"Train RMSE: {train_rmse}")

Train RMSE: 0.505


In [389]:
def get_similar_products(product_id, top_n=5):
    # Find the index of the given product ID
    product_idx = vectorized_data_reduced[vectorized_data_reduced['product_id'] == product_id].index[0]
    
    # Get the similarity scores for this product
    similarity_scores = cosine_sim[product_idx]
    
    # Get the indices of the most similar products, excluding the product itself
    similar_indices = np.argsort(similarity_scores)[::-1][1:top_n + 1]
    
    # Get the similar products' details
    similar_products = vectorized_data_reduced.iloc[similar_indices][['product_id', 'average_rating']]
    
    return similar_products

# Example usage: Predict similar products for a given product ID
similar_products = get_similar_products('0020425651', top_n=5)
print(similar_products)

      product_id  average_rating
352   0345413350        4.526316
763   0440995779        4.142857
870   0451166582        4.428571
104   0064406970        4.666667
1093  0671024248        4.111111


In [346]:
from transformers import pipeline

# Load a pre-trained sentiment analysis model
sentiment_pipeline = pipeline("sentiment-analysis")

# Function to get sentiment
def get_sentiment(review):
    result = sentiment_pipeline(review)[0]
    return result['label'], result['score']

# Apply the function to summarize the reviews
vectorized_data_reduced[['sentiment', 'sentiment_score']] = vectorized_data_reduced['summarized_reviews'].apply(get_sentiment).apply(pd.Series)
vectorized_data_reduced.head()


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,sentiment,sentiment_score,0,1,2,3,...,40,41,42,43,44,45,46,47,48,49
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,POSITIVE,0.998871,0.635252,-0.269045,0.012187,-0.010486,...,0.007408,-0.130559,-0.088907,-0.097262,-0.004827,-0.097124,0.106259,0.008301,-0.028821,-0.006177
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,POSITIVE,0.975348,0.261072,0.044055,0.130192,0.080377,...,-0.115873,0.020826,-0.093551,-0.112952,0.057351,0.000587,0.135271,0.006109,-0.027882,0.017403
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,NEGATIVE,0.997175,0.257239,0.295279,0.238385,0.178327,...,0.015217,-0.012752,0.036691,0.073162,-0.085413,0.005514,-0.008199,-0.041943,-0.069186,0.040457
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,POSITIVE,0.993411,0.683657,-0.247196,0.033078,0.060899,...,0.005285,0.064651,-0.008425,0.058347,-0.063713,-0.014532,-0.020493,-0.045625,0.054717,-0.064903
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,NEGATIVE,0.977043,0.492497,0.233495,0.105419,0.168958,...,-0.094077,0.050498,-0.065301,-0.012698,-0.071451,0.064249,-0.041519,-0.009113,0.018505,0.055179


In [347]:
sentiment_mapping = {
    'POSITIVE': 1,
    'NEGATIVE': -1,
    'NEUTRAL': 0  
}
vectorized_data_reduced['sentiment'] = vectorized_data_reduced['sentiment'].map(sentiment_mapping)
vectorized_data_reduced.head()

Unnamed: 0,product_id,average_rating,aggregated_reviews,summarized_reviews,sentiment,sentiment_score,0,1,2,3,...,40,41,42,43,44,45,46,47,48,49
0,0020425651,5.0,susan cooper dark rising sequence joined pryda...,susan cooper dark rising sequence joined pryda...,1,0.998871,0.635252,-0.269045,0.012187,-0.010486,...,0.007408,-0.130559,-0.088907,-0.097262,-0.004827,-0.097124,0.106259,0.008301,-0.028821,-0.006177
1,0028610105,4.4,sheer diversity recipe japanese thai indian fr...,sheer diversity recipe japanese thai indian fr...,1,0.975348,0.261072,0.044055,0.130192,0.080377,...,-0.115873,0.020826,-0.093551,-0.112952,0.057351,0.000587,0.135271,0.006109,-0.027882,0.017403
2,006001203X,4.1,health care proffesional tell way traumatising...,health care proffesional tell way traumatising...,-1,0.997175,0.257239,0.295279,0.238385,0.178327,...,0.015217,-0.012752,0.036691,0.073162,-0.085413,0.005514,-0.008199,-0.041943,-0.069186,0.040457
3,0060096195,4.428571,started reading one bathtub get id gotten fina...,started reading one bathtub get id gotten fina...,1,0.993411,0.683657,-0.247196,0.033078,0.060899,...,0.005285,0.064651,-0.008425,0.058347,-0.063713,-0.014532,-0.020493,-0.045625,0.054717,-0.064903
4,006016848X,3.5625,really like book time everyone want equality s...,really like book time everyone want equality s...,-1,0.977043,0.492497,0.233495,0.105419,0.168958,...,-0.094077,0.050498,-0.065301,-0.012698,-0.071451,0.064249,-0.041519,-0.009113,0.018505,0.055179
