<div class="alert alert-info">

## Introduction


</div>

In [2]:
#import
import pandas as pd
import numpy as np

<div class="alert alert-info">

## Data Description<a name="2"></a>
This project uses a dataset sourced from Kaggle, available at https://www.kaggle.com/datasets/beaglelee/amazon-reviews-us-books-v1-02-tsv-zip. The dataset has 15 columns and 3,105,370 rows of customer reviews and star ratings in the book category.

Because of the dataset's size, a subset of 10,000 rows is used for analysis and modeling. This keeps processing efficient while still capturing a range of the reviews and ratings present in the original dataset.
</div>

In [7]:
# Data
data = pd.read_csv("data/amazon_reviews_us_Books_v1_02.tsv", sep='\t', on_bad_lines='skip')


In [47]:
data.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,12076615,RQ58W7SMO911M,385730586,122662979,Sisterhood of the Traveling Pants (Book 1),Books,4.0,2.0,3.0,N,N,this book was a great learning novel!,this boook was a great one that you could lear...,2005-10-14
1,US,12703090,RF6IUKMGL8SF,811828964,56191234,The Bad Girl's Guide to Getting What You Want,Books,3.0,5.0,5.0,N,N,Fun Fluff,If you are looking for something to stimulate ...,2005-10-14
2,US,12257412,R1DOSHH6AI622S,1844161560,253182049,"Eisenhorn (A Warhammer 40,000 Omnibus)",Books,4.0,1.0,22.0,N,N,this isn't a review,never read it-a young relative idicated he lik...,2005-10-14
3,US,50732546,RATOTLA3OF70O,373836635,348672532,Colby Conspiracy (Colby Agency),Books,5.0,2.0,2.0,N,N,fine author on her A-game,Though she is honored to be Chicago Woman of t...,2005-10-14
4,US,51964897,R1TNWRKIVHVYOV,262181533,598678717,The Psychology of Proof: Deductive Reasoning i...,Books,4.0,0.0,2.0,N,N,Execellent cursor examination,Review based on a cursory examination by Unive...,2005-10-14


<div class="alert alert-info">

## Exploratory Data Analysis (EDA) <a name="3"></a>

This section describes the exploratory data analysis used to understand the dataset and to guide the subsequent model development.

To create a focused subset, we identified the 1,000 product IDs with the most reviews and sampled 10,000 rows from their reviews. The final subset has 10,000 rows covering 993 distinct products, with 10,000 unique review IDs and 9,993 distinct review bodies. Restricting the sample to frequently reviewed products gives each product more ratings to learn from while still covering a wide range of titles.

</div>

In [48]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3105370 entries, 0 to 3105369
Data columns (total 15 columns):
 #   Column             Dtype  
---  ------             -----  
 0   marketplace        object 
 1   customer_id        int64  
 2   review_id          object 
 3   product_id         object 
 4   product_parent     int64  
 5   product_title      object 
 6   product_category   object 
 7   star_rating        float64
 8   helpful_votes      float64
 9   total_votes        float64
 10  vine               object 
 11  verified_purchase  object 
 12  review_headline    object 
 13  review_body        object 
 14  review_date        object 
dtypes: float64(3), int64(2), object(10)
memory usage: 355.4+ MB


In [49]:
data.isnull().sum()

marketplace            0
customer_id            0
review_id              0
product_id             0
product_parent         0
product_title          0
product_category       0
star_rating            4
helpful_votes          4
total_votes            4
vine                   4
verified_purchase      4
review_headline       57
review_body            4
review_date          133
dtype: int64

In [50]:
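# Convert placeholder strings to NaN first so the dropna below removes them too
data.replace(['null', 'N/A', '', ' '], np.nan, inplace=True)

In [52]:
# Drop every row that has at least one missing value
data = data.dropna()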

In [53]:
data.isnull().sum()

marketplace          0
customer_id          0
review_id            0
product_id           0
product_parent       0
product_title        0
product_category     0
star_rating          0
helpful_votes        0
total_votes          0
vine                 0
verified_purchase    0
review_headline      0
review_body          0
review_date          0
dtype: int64

In [54]:
data.nunique()

marketplace                1
customer_id          1502265
review_id            3105184
product_id            779692
product_parent        666003
product_title         713665
product_category           1
star_rating                5
helpful_votes            942
total_votes             1024
vine                       2
verified_purchase          2
review_headline      2456998
review_body          3070458
review_date             3575
dtype: int64

In [1]:
# Getting subset
# 1. Group by product_id and count the number of reviews for each product
product_review_counts = data.groupby('product_id').size().reset_index(name='review_count')

# 2. Sort by review count in descending order and get the top 1000 products
top_1000_products = product_review_counts.sort_values(by='review_count', ascending=False).head(1000)

# 3. Filter the original dataset to get reviews only for the top 1000 products
top_1000_product_reviews = data[data['product_id'].isin(top_1000_products['product_id'])]

# 4. Sample 10000 rows from the filtered dataset
subset = top_1000_product_reviews.sample(n=10000, random_state=42)

# Display the resulting subset
subset.head()



In [63]:
subset.nunique()

marketplace              1
customer_id           9633
review_id            10000
product_id             993
product_parent         880
product_title          933
product_category         1
star_rating              5
helpful_votes          160
total_votes            205
vine                     1
verified_purchase        2
review_headline       9439
review_body           9993
review_date           2644
dtype: int64

In [64]:
subset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 2141488 to 2464275
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   marketplace        10000 non-null  object 
 1   customer_id        10000 non-null  int64  
 2   review_id          10000 non-null  object 
 3   product_id         10000 non-null  object 
 4   product_parent     10000 non-null  int64  
 5   product_title      10000 non-null  object 
 6   product_category   10000 non-null  object 
 7   star_rating        10000 non-null  float64
 8   helpful_votes      10000 non-null  float64
 9   total_votes        10000 non-null  float64
 10  vine               10000 non-null  object 
 11  verified_purchase  10000 non-null  object 
 12  review_headline    10000 non-null  object 
 13  review_body        10000 non-null  object 
 14  review_date        10000 non-null  object 
dtypes: float64(3), int64(2), object(10)
memory usage: 1.2+ MB


In [65]:
subset.to_csv('amazon_reviews_subset.csv', index=False)


<div class="alert alert-info">
    
## Collaborative Filtering
**Collaborative Filtering** is a widely used technique for filling in the missing entries of a utility matrix (users × items), using recorded user behavior and interactions to make recommendations. It rests on the assumption that users who have agreed in the past will tend to agree in the future, so a user's preferences can be inferred from the preferences of similar users.

Model-based variants of this method are closely related to dimensionality reduction techniques such as Latent Semantic Analysis (LSA) or Truncated Singular Value Decomposition (SVD): they represent users and items in a shared low-dimensional space and use that representation to predict the missing values.

In this project, we implement collaborative filtering as our baseline model, personalizing recommendations based on historical ratings.
</div>
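
The idea above can be illustrated with a minimal sketch. It is not the model used in this project: it assumes a tiny made-up ratings table with the same column names as the subset (`customer_id`, `product_id`, `star_rating`), fills missing entries with each user's mean rating, and uses a rank-`k` truncated SVD (via `numpy.linalg.svd`) to estimate the missing values. The toy data, the choice `k = 2`, and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings with the same column names as the subset; values are made up.
ratings = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4],
    "product_id": ["A", "B", "A", "C", "B", "C", "A"],
    "star_rating": [5.0, 3.0, 4.0, 2.0, 4.0, 1.0, 5.0],
})

# Utility matrix: one row per user, one column per product, NaN where unrated.
utility = ratings.pivot_table(
    index="customer_id", columns="product_id", values="star_rating"
)

# Center each user's ratings on that user's mean; filling the gaps with 0
# afterwards is equivalent to filling them with the user's mean rating.
user_means = utility.mean(axis=1)
centered = utility.sub(user_means, axis=0).fillna(0.0).to_numpy()

# Truncated SVD: keep only the k largest singular values.
k = 2  # assumed number of latent factors
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
approx = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Add the user means back and clip to the valid 1-5 star range.
predicted = pd.DataFrame(
    np.clip(approx + user_means.to_numpy()[:, None], 1.0, 5.0),
    index=utility.index,
    columns=utility.columns,
)

print(utility)
print(predicted.round(2))
```

Entries that are `NaN` in `utility` have an estimate in `predicted`; ranking a user's unrated products by these estimates yields recommendations. On the real subset the utility matrix is far sparser, so a sparse representation and a dedicated factorization routine would likely be more appropriate than a dense SVD.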