# Data Preparation

In [11]:
import pandas as pd
import os

In [2]:
# read raw dataset
df = pd.read_csv('../data/archive/amazon_reviews_us_Electronics_v1_00.tsv', sep='\t', on_bad_lines='skip')
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,41409413,R2MTG1GCZLR2DK,B00428R89M,112201306,yoomall 5M Antenna WIFI RP-SMA Female to Male ...,Electronics,5,0,0,N,Y,Five Stars,As described.,2015-08-31
1,US,49668221,R2HBOEM8LE9928,B000068O48,734576678,"Hosa GPM-103 3.5mm TRS to 1/4"" TRS Adaptor",Electronics,5,0,0,N,Y,It works as advertising.,It works as advertising.,2015-08-31
2,US,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,Electronics,5,1,1,N,Y,Five Stars,Works pissa,2015-08-31
3,US,38487968,R1EBPM82ENI67M,B000NU4OTA,72265257,LIMTECH Wall charger + USB Hotsync & Charging ...,Electronics,1,0,0,N,Y,One Star,Did not work at all.,2015-08-31
4,US,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,Electronics,5,1,1,N,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31


# Data selection

In this section we decide which data is relevant for our task.

In [26]:
# get all columns names
df.columns

Index(['marketplace', 'customer_id', 'review_id', 'product_id',
       'product_parent', 'product_title', 'product_category', 'star_rating',
       'helpful_votes', 'total_votes', 'vine', 'verified_purchase',
       'review_headline', 'review_body', 'review_date'],
      dtype='object')

Let's discuss why some columns could be considered irrelevant. The answer is model-dependent, as it influences how the downstream algorithm is developed, so we look at both recommendation-system designs currently under consideration.

**Personalized recommendation system**\
In the context of personalized recommendations (unique for each user) we only need to store the rating a user gave and the product it was given to, resulting in the tuple `customer_id`, `product_id` and `star_rating`. However, to make this data suitable for models, we should clear it of outliers and make some assumptions about the 'usefulness' and 'timeliness' of the reviews. For that reason it makes sense to keep all columns that could be used as filters in the next steps.

According to [Data Understanding](https://github.com/akmchnkv/Amazon-Predictions-Data-Mining-Innopolis-2024/blob/main/notebooks/EDA.ipynb), the columns `marketplace` and `product_category` can be removed entirely, as each contains only one unique value across all rows. We can also discard `vine` because of its heavy class imbalance: filtering on either `vine` value would change the distributions.
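
The two checks behind this decision (single-valued columns and class imbalance) can be sketched roughly as follows. The toy frame and its values are illustrative assumptions, not the real dataset; on the real data the same calls would be run on `df`.

```python
import pandas as pd

# Toy frame standing in for the review data (values are illustrative only).
toy = pd.DataFrame({
    "marketplace": ["US", "US", "US", "US"],
    "product_category": ["Electronics"] * 4,
    "vine": ["N", "N", "N", "Y"],
    "star_rating": [5, 1, 4, 5],
})

# Columns with a single unique value cannot be used for filtering or modeling.
n_unique = toy.nunique()
constant_columns = n_unique[n_unique == 1].index.tolist()

# Share of each `vine` value, to make the class imbalance visible.
vine_share = toy["vine"].value_counts(normalize=True)

print(constant_columns)
print(vine_share)
```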

In this architecture we also decided not to use `product_title`, as it cannot serve as a filter because of its variability and free-text nature.

**Generalized recommendation system**\
In the context of universal recommendations we can use almost all values that describe a product. We can use text embeddings, star ratings, even headlines and dates as features to develop a model. The only features we would discard in this case are `marketplace` and `product_category`, for the reasons mentioned above.

For now we prepare data only for the **Personalized recommendation system**, as it aligns more closely with our business objective. With a small change (adding one column back) we could also produce a dataset for the **Generalized recommendation system**.

In [7]:
drop_personalized_columns = ['marketplace', 'product_category', 'vine']
drop_generalized_columns = ['marketplace', 'product_category']

# Data cleaning

In [24]:
# check what percentage of values in each column is NaN (100 = all NaN)
df.isna().sum() * 100 / df.shape[0]

marketplace          0.000000
customer_id          0.000000
review_id            0.000000
product_id           0.000000
product_parent       0.000000
product_title        0.000129
product_category     0.000000
star_rating          0.000000
helpful_votes        0.000000
total_votes          0.000000
vine                 0.000000
verified_purchase    0.000000
review_headline      0.001262
review_body          0.004756
review_date          0.000776
dtype: float64

In [8]:
# Each column has fewer than 0.005% NaN values, so we can simply drop the affected rows
df = df.dropna()
df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,41409413,R2MTG1GCZLR2DK,B00428R89M,112201306,yoomall 5M Antenna WIFI RP-SMA Female to Male ...,Electronics,5,0,0,N,Y,Five Stars,As described.,2015-08-31
1,US,49668221,R2HBOEM8LE9928,B000068O48,734576678,"Hosa GPM-103 3.5mm TRS to 1/4"" TRS Adaptor",Electronics,5,0,0,N,Y,It works as advertising.,It works as advertising.,2015-08-31
2,US,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,Electronics,5,1,1,N,Y,Five Stars,Works pissa,2015-08-31
3,US,38487968,R1EBPM82ENI67M,B000NU4OTA,72265257,LIMTECH Wall charger + USB Hotsync & Charging ...,Electronics,1,0,0,N,Y,One Star,Did not work at all.,2015-08-31
4,US,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,Electronics,5,1,1,N,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31


# Data construction

In the data construction step we develop some features that may (or may not) prove useful in modeling and in later data preparation steps.
1. `helpfulness` - the fraction (0 to 1) of voters who considered the review helpful.
2. `review_full_text` - `review_headline` and `review_body` combined, for consistency (people sometimes write the review in the headline rather than the body).
3. `review_length` - the total length of the review headline and body (+1 for the separating space).

In [27]:
# fraction of voters who considered the review helpful (0 when there are no votes)
df['helpfulness'] = df['helpful_votes'] / df['total_votes']
df['helpfulness'] = df.helpfulness.fillna(0)

# full review text 
df["review_full_text"] = df["review_headline"] + " " + df["review_body"]

# length of review (number of symbols)
df['review_length'] = df.review_full_text.str.len()

df.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,helpfulness,review_full_text,review_length
0,US,41409413,R2MTG1GCZLR2DK,B00428R89M,112201306,yoomall 5M Antenna WIFI RP-SMA Female to Male ...,Electronics,5,0,0,N,Y,Five Stars,As described.,2015-08-31,0.0,Five Stars As described.,24
1,US,49668221,R2HBOEM8LE9928,B000068O48,734576678,"Hosa GPM-103 3.5mm TRS to 1/4"" TRS Adaptor",Electronics,5,0,0,N,Y,It works as advertising.,It works as advertising.,2015-08-31,0.0,It works as advertising. It works as advertising.,49
2,US,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,Electronics,5,1,1,N,Y,Five Stars,Works pissa,2015-08-31,1.0,Five Stars Works pissa,22
3,US,38487968,R1EBPM82ENI67M,B000NU4OTA,72265257,LIMTECH Wall charger + USB Hotsync & Charging ...,Electronics,1,0,0,N,Y,One Star,Did not work at all.,2015-08-31,0.0,One Star Did not work at all.,29
4,US,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,Electronics,5,1,1,N,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31,1.0,Overall pleased with the item Works well. Bass...,113


# Data formatting & integration

In this section we focus on formatting and filtering the data, as well as integrating additional information into the dataset. 
These steps are usually treated separately; however, to manage resources sensibly, we do them at the same time.

We propose 3 different solutions to data formatting, one of which requires additional data integration.

In [52]:
# build the personalized dataset (proposed in data selection)
df_personalized = df.drop(columns=drop_personalized_columns)
df_personalized.head()

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,verified_purchase,review_headline,review_body,review_date,helpfulness,review_full_text,review_length
0,41409413,R2MTG1GCZLR2DK,B00428R89M,112201306,yoomall 5M Antenna WIFI RP-SMA Female to Male ...,5,0,0,Y,Five Stars,As described.,2015-08-31,0.0,Five Stars As described.,24
1,49668221,R2HBOEM8LE9928,B000068O48,734576678,"Hosa GPM-103 3.5mm TRS to 1/4"" TRS Adaptor",5,0,0,Y,It works as advertising.,It works as advertising.,2015-08-31,0.0,It works as advertising. It works as advertising.,49
2,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,5,1,1,Y,Five Stars,Works pissa,2015-08-31,1.0,Five Stars Works pissa,22
3,38487968,R1EBPM82ENI67M,B000NU4OTA,72265257,LIMTECH Wall charger + USB Hotsync & Charging ...,1,0,0,Y,One Star,Did not work at all.,2015-08-31,0.0,One Star Did not work at all.,29
4,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,5,1,1,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31,1.0,Overall pleased with the item Works well. Bass...,113


#### Solution 1. No filtering.
The simplest idea is to not filter anything and just save all required data as it is.

In [53]:
os.makedirs('../data/processed', exist_ok=True) 

df_personalized[['customer_id', 'product_id', 'star_rating']].to_csv('../data/processed/unfiltered_personalized.csv', index=False)

#### Solution 2. Basic filtering.

The second idea is to apply some basic filters:
1. All reviews with a star rating of `3` are removed. This is common practice in recommendation systems, as a rating of `3` is ambiguous.
2. We keep only reviews that other users considered helpful. This should remove spam reviews and give a clearer, more honest picture of the products. Here `helpfulness` is defined as `helpful_votes` / `total_votes`, and a review counts as helpful when this value exceeds 50%.
3. To be on the safe side, we do not trust unverified purchases.

In [55]:
# .copy() lets us add columns to df_filtered later without a SettingWithCopyWarning
df_filtered = df_personalized[
    (df_personalized['star_rating'] != 3)
    & (df_personalized['total_votes'] > 0)
    & (df_personalized['helpfulness'] > 0.5)
    & (df_personalized['verified_purchase'] == 'Y')
].copy()
df_filtered

Unnamed: 0,customer_id,review_id,product_id,product_parent,product_title,star_rating,helpful_votes,total_votes,verified_purchase,review_headline,review_body,review_date,helpfulness,review_full_text,review_length
2,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,5,1,1,Y,Five Stars,Works pissa,2015-08-31,1.000000,Five Stars Works pissa,22
4,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,5,1,1,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31,1.000000,Overall pleased with the item Works well. Bass...,113
5,21257820,R1A4514XOYI1PD,B008NCD2LG,976385982,Pioneer SP-BS22-LR Andrew Jones Designed Books...,5,1,1,Y,Five Stars,The quality on these speakers is insanely good...,2015-08-31,1.000000,Five Stars The quality on these speakers is in...,125
18,47386264,R1WI5NISM6GAUG,B0045EJY90,892920832,TEAC CD-P650-B Compact Disc Player with USB an...,2,4,5,Y,It does not copy CD-R s to USB as advertised ...,It does not copy CD-R s to USB as advertised. ...,2015-08-31,0.800000,It does not copy CD-R s to USB as advertised ....,348
19,13000908,R27F4OF4BIA4LU,B003BT6BM8,631236454,Philips SHS8100/28 Earhook Headphones,2,1,1,Y,"Did not last long, Stop working within a year ...","Did not last long, Stop working within a year.",2015-08-31,1.000000,"Did not last long, Stop working within a year ...",97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3090919,51816404,RE2BZ5R2WNSN,B00000J3NE,184603412,Sharp MDMT821 Ultra-Thin Minidisc Player/Recorder,5,69,72,Y,Get one...it is great!,I purchased a Sony R55 a few months ago and re...,1999-08-03,0.958333,Get one...it is great! I purchased a Sony R55 ...,657
3090923,52708254,R2AN3ZV5E3ALB5,B00000J4C9,83670990,Koss PP257 Water-Resistant Sports Armband Digi...,1,18,19,Y,Really bad reception,Constant static with nearly all radio stations...,1999-07-30,0.947368,Really bad reception Constant static with near...,146
3090928,52476206,R1J16TEDOYCZTN,B00000J4FW,296815603,Sharp CDC472 Pro Logic Home Theater System (Di...,4,30,31,Y,Overall a good combo of features for price...,I'm looking into systems to replace a stereo t...,1999-07-27,0.967742,Overall a good combo of features for price... ...,324
3090931,52980656,RGHZIK8D6X7QR,B00000JFHW,931077461,Harman Kardon TC-1000 Take Control System Cont...,5,11,14,Y,My Wifes Favorite Toy,This device is a G*d send.We had 8 different r...,1999-07-26,0.785714,My Wifes Favorite Toy This device is a G*d sen...,330


To check that the filtering above did not substantially change the distribution, we compare the star rating statistics before and after filtering.

Even though we kept only 18.2% of the rows, the mean is almost identical and the std is slightly higher (which is expected, as we removed all reviews with 3 stars).
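
As a rough sketch of this comparison, the share of rows kept and a side-by-side summary could be computed as below. The helper name `compare_rating_stats` and the toy series are hypothetical; the two row counts are the ones reported by the `describe()` outputs in this notebook.

```python
import pandas as pd

def compare_rating_stats(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Side-by-side count, mean and std of ratings before and after filtering."""
    return pd.DataFrame({
        "before": before.agg(["count", "mean", "std"]),
        "after": after.agg(["count", "mean", "std"]),
    })

# Row counts taken from the describe() outputs of df and df_filtered.
rows_before, rows_after = 3_090_810, 561_980
kept_share = rows_after / rows_before  # fraction of rows that survive the filters
print(f"kept {kept_share:.1%} of rows")

# Toy ratings (illustrative only) to show the shape of the comparison table.
stats = compare_rating_stats(pd.Series([5, 3, 1, 5]), pd.Series([5, 1, 5]))
print(stats)
```

On the real data the call would be `compare_rating_stats(df.star_rating, df_filtered.star_rating)`.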

In [56]:
df_filtered.star_rating.describe()

count    561980.000000
mean          4.045016
std           1.462215
min           1.000000
25%           4.000000
50%           5.000000
75%           5.000000
max           5.000000
Name: star_rating, dtype: float64

In [57]:
df.star_rating.describe()

count    3.090810e+06
mean     4.035417e+00
std      1.387458e+00
min      1.000000e+00
25%      3.000000e+00
50%      5.000000e+00
75%      5.000000e+00
max      5.000000e+00
Name: star_rating, dtype: float64

In [58]:
os.makedirs('../data/processed', exist_ok=True) 

df_filtered[['customer_id', 'product_id', 'star_rating']].to_csv('../data/processed/filtered_simple_personalized.csv', index=False)

#### Solution 3. Text verification.

This idea is based on the assumption that some reviews carry a star rating that does not actually correspond to the review text. To check this, we use a BERT-like model to analyze the sentiment of each review. We apply this technique to the already filtered dataframe, as it takes about 4 hours to run on the whole dataset and only about 1 hour on the filtered one.
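
Given the run time, one option is to pass texts to the classifier in chunks and collect the scores directly into columns. The sketch below is not the code used in this notebook: `score_texts` and `fake_classifier` are hypothetical names, and it assumes the classifier returns one list of `{"label", "score"}` dicts per input text, which is the shape the row-by-row loop relies on when it indexes the output with `[0]`. Whether chunking actually saves time depends on the hardware and is not measured here.

```python
from typing import Callable, Dict, List

import pandas as pd

Scores = List[Dict[str, object]]  # one {"label", "score"} dict per class

def score_texts(texts: List[str],
                classifier: Callable[[List[str]], List[Scores]],
                chunk_size: int = 256) -> pd.DataFrame:
    """Run `classifier` over `texts` in chunks; return one score column per label."""
    rows = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        for per_text in classifier(chunk):
            rows.append({f'{s["label"]}_score': s["score"] for s in per_text})
    return pd.DataFrame(rows)

# Hypothetical stand-in for the real pipeline so the sketch runs without a GPU.
def fake_classifier(batch: List[str]) -> List[Scores]:
    return [[{"label": "positive", "score": 0.7},
             {"label": "neutral", "score": 0.2},
             {"label": "negative", "score": 0.1}] for _ in batch]

scores = score_texts(
    ["Five Stars As described.", "One Star Did not work at all."],
    fake_classifier,
    chunk_size=1,
)
print(scores)
```

Because the result has one row per input text in the original order, it can be attached to the filtered dataframe by position rather than by appending to three separate lists.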

In [59]:
from transformers import pipeline

distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student", 
    top_k=None,
    truncation=True,
    device=0,
)

In [67]:
from tqdm.auto import tqdm

pos_scores = []
neu_scores = []
neg_scores = []

for idx, row in tqdm(df_filtered.iterrows(), total=df_filtered.shape[0]):
    
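    ######### REMOVE THIS BLOCK IF YOU WANT TO RUN OUR CODE ON THE FULL DATASET #########
    ######### it is here only to show the approximate inference time #########
    ######### (with the early break, the score lists are shorter than df_filtered) #########
    if idx > 1000:
        break
    ######### END OF BLOCK TO REMOVE #########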
    
        
    text = row["review_full_text"]
    out = distilled_student_sentiment_classifier(text)[0]
    
    for sentiment in out:
        if sentiment["label"] == "positive":
            pos_scores.append(sentiment["score"])

        if sentiment["label"] == "neutral":
            neu_scores.append(sentiment["score"])

        if sentiment["label"] == "negative":
            neg_scores.append(sentiment["score"])

  0%|          | 0/561980 [00:00<?, ?it/s]

In [69]:
df_filtered["positive_score"] = pos_scores
df_filtered["neutral_score"] = neu_scores
df_filtered["negative_score"] = neg_scores

In [70]:
os.makedirs('../data/processed', exist_ok=True) 

df_filtered.to_csv("../data/processed/df_filtered_with_scores.csv", index=False)

In [72]:
# reload the scored data and apply additional preprocessing
df_scored = pd.read_csv('../data/processed/df_filtered_with_scores.csv')
df_scored.head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,review_full_text,positive_score,neutral_score,negative_score
0,US,12338275,R1P4RW1R9FDPEE,B000GGKOG8,614448099,Channel Master Titan 2 Antenna Preamplifier,Electronics,5,1,1,N,Y,Five Stars,Works pissa,2015-08-31,Five Stars Works pissa,0.63359,0.18822,0.17819
1,US,23732619,R372S58V6D11AT,B00JOQIO6S,308169188,Skullcandy Air Raid Portable Bluetooth Speaker,Electronics,5,1,1,N,Y,Overall pleased with the item,Works well. Bass is somewhat lacking but is pr...,2015-08-31,Overall pleased with the item Works well. Bass...,0.645478,0.196312,0.15821
2,US,21257820,R1A4514XOYI1PD,B008NCD2LG,976385982,Pioneer SP-BS22-LR Andrew Jones Designed Books...,Electronics,5,1,1,N,Y,Five Stars,The quality on these speakers is insanely good...,2015-08-31,Five Stars The quality on these speakers is in...,0.863384,0.101171,0.035446
3,US,47386264,R1WI5NISM6GAUG,B0045EJY90,892920832,TEAC CD-P650-B Compact Disc Player with USB an...,Electronics,2,4,5,N,Y,It does not copy CD-R s to USB as advertised ...,It does not copy CD-R s to USB as advertised. ...,2015-08-31,It does not copy CD-R s to USB as advertised ....,0.189209,0.427809,0.382981
4,US,13000908,R27F4OF4BIA4LU,B003BT6BM8,631236454,Philips SHS8100/28 Earhook Headphones,Electronics,2,1,1,N,Y,"Did not last long, Stop working within a year ...","Did not last long, Stop working within a year.",2015-08-31,"Did not last long, Stop working within a year ...",0.360234,0.313394,0.326372


In [76]:
# let's look at distributions
df_scored.positive_score.describe(), df_scored.neutral_score.describe(), df_scored.negative_score.describe()

(count    561980.000000
 mean          0.562636
 std           0.298186
 min           0.003431
 25%           0.296404
 50%           0.593064
 75%           0.842058
 max           0.996073
 Name: positive_score, dtype: float64,
 count    561980.000000
 mean          0.169324
 std           0.108059
 min           0.002360
 25%           0.081750
 50%           0.157580
 75%           0.238560
 max           0.762785
 Name: neutral_score, dtype: float64,
 count    561980.000000
 mean          0.268040
 std           0.238484
 min           0.000868
 25%           0.057454
 50%           0.204236
 75%           0.427561
 max           0.986743
 Name: negative_score, dtype: float64)

In [99]:
df_scored[(df_scored.star_rating > 3) & (df_scored.negative_score > 0.9)].head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,review_full_text,positive_score,neutral_score,negative_score
2844,US,46085619,R12JEGPXV1HAO9,B00E952W3A,444188034,Zipbuds Zipbuds Pro Mic Earbuds,Electronics,4,1,1,N,Y,The chord creates just a bit more noise than I...,I've used these for just over a month and ther...,2015-08-25,The chord creates just a bit more noise than I...,0.011376,0.05122,0.937404
5433,US,118523,R2PY15IIGOZ9G7,B00OHHG768,720920423,C&E Mini HDMI 3-In 1-Out (Hdmi V1.3) Intellige...,Electronics,5,1,1,N,Y,i really hate that our tv only came with limit...,i really hate that our tv only came with limit...,2015-08-18,i really hate that our tv only came with limit...,0.044456,0.023469,0.932075
5647,US,48402755,R155JEV23CFZC9,B00ILCRZRK,262655934,Yamaha RX-V577 7.2-channel Wi-Fi Network AV Re...,Electronics,4,3,4,N,Y,Awesome except AirPlay vulnerability,Only bad thing so far it's inability to block ...,2015-08-18,Awesome except AirPlay vulnerability Only bad ...,0.023146,0.044035,0.932818
7492,US,33853810,R3FHT7MAC46WS2,B004ZKXY7C,836118938,Sharp SPC800 Quartz Analog Twin Bell Alarm Clo...,Electronics,4,1,1,N,Y,My daughters hate it.,My daughters hate it. It is loud and obnoxious...,2015-08-13,My daughters hate it. My daughters hate it. It...,0.008132,0.0173,0.974568
11309,US,21509517,R2F7TJLI04COJO,B00PVMNTAA,363270033,GENERIC AA59-00766A RMCTPF Smart Hub Audio sou...,Electronics,5,1,1,N,Y,Great replacement for my dropped once to often...,Great replacement for my dropped once to often...,2015-08-04,Great replacement for my dropped once to often...,0.026462,0.065254,0.908284


In [101]:
df_scored[(df_scored.star_rating < 3) & (df_scored.positive_score > 0.9)].head()

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date,review_full_text,positive_score,neutral_score,negative_score
207,US,2602418,R2LUDAE6F9SHPA,B00M9LBMC8,210220996,Symphonized NRG Premium Genuine Wood In-ear No...,Electronics,1,1,1,N,Y,The sound was pretty good until one of the spe...,The sound was pretty good until one of the spe...,2015-08-31,The sound was pretty good until one of the spe...,0.925921,0.0526,0.021479
363,US,10819127,RXHYP037YFZ6S,B00NS3MRKC,677514508,FiiO X1 High Resolution Lossless Music Player,Electronics,2,3,3,N,Y,Good when it worked,Was awesome when it worked.<br />Got a blue sc...,2015-08-31,Good when it worked Was awesome when it worked...,0.908742,0.05054,0.040718
1320,US,2885890,R1EMHOAO2ATP8T,B00EO8L5TE,455429670,ClearTV X-72 HDTV Digital Indoor Antenna,Electronics,2,1,1,N,Y,Rabbit ears are twice as good!,Do Not Buy! Rabbit ears are twice as good!,2015-08-28,Rabbit ears are twice as good! Do Not Buy! Rab...,0.947384,0.036779,0.015837
2104,US,22000601,R318JY1HK01DW1,B0058O8H30,993527082,Atlantic 38435720 Oskar 464 Media Wall Unit P2,Electronics,2,1,1,N,Y,I will say that the design is good and the ver...,The cams that hold the unit together were the ...,2015-08-26,I will say that the design is good and the ver...,0.925071,0.040925,0.034004
3543,US,39062322,R3OPKVB1YZ13OB,B00MXCIK32,823852970,Panasonic eneloop pro AA High Capacity New Ni-...,Electronics,1,28,37,N,Y,It's great and the best,It's great and the best. Nothing wrong with th...,2015-08-23,It's great and the best It's great and the bes...,0.94418,0.050998,0.004821


After some further analysis, we noticed that reviews with a rating of `4` and a `negative_score` greater than 0.9 show an interesting pattern: they are positive judging by the rating, but the comments actually describe how bad the product is. Because of this inconsistency we decided to filter out such rows, as they may be misleading.

We did not find such a pattern for ratings 5, 1 and 2, which is quite interesting.

It is also worth mentioning that the model we used produces some false positives and false negatives, which may skew our results.
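
The per-rating comparison described above can be sketched as a count of reviews whose sentiment score strongly contradicts the star rating. The helper `count_mismatches` and the toy scores are hypothetical; the 0.9 threshold mirrors the one used in the cells above.

```python
import pandas as pd

def count_mismatches(scored: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Count, per star rating, reviews whose sentiment contradicts the rating."""
    high_but_negative = (scored["star_rating"] > 3) & (scored["negative_score"] > threshold)
    low_but_positive = (scored["star_rating"] < 3) & (scored["positive_score"] > threshold)
    mismatched = scored[high_but_negative | low_but_positive]
    return mismatched.groupby("star_rating").size()

# Toy scores (illustrative only): one contradictory 4-star and one contradictory 1-star review.
toy_scored = pd.DataFrame({
    "star_rating":    [4,    4,    5,    1,    2],
    "positive_score": [0.01, 0.80, 0.90, 0.95, 0.10],
    "negative_score": [0.95, 0.05, 0.02, 0.01, 0.70],
})
mismatch_counts = count_mismatches(toy_scored)
print(mismatch_counts)
```

On the real data, `count_mismatches(df_scored)` would give the number of contradictory reviews for each rating, which can then be inspected manually before deciding what to drop.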

In [118]:
df_without_outliers = df_scored[~((df_scored.star_rating == 4) & (df_scored.negative_score > 0.9))]
df_without_outliers.shape

(561932, 19)

In [123]:
df_without_outliers.star_rating.describe()

count    561932.000000
mean          4.045020
std           1.462277
min           1.000000
25%           4.000000
50%           5.000000
75%           5.000000
max           5.000000
Name: star_rating, dtype: float64

In [120]:
os.makedirs('../data/processed', exist_ok=True) 

df_without_outliers[['customer_id', 'product_id', 'star_rating']].to_csv("../data/processed/df_filtered_removed_outliers.csv", index=False)

Finally, we remove users with too few ratings (fewer than 3) to avoid cold-start users.

In [None]:
import pandas as pd

df = pd.read_csv("../data/processed/df_filtered_removed_outliers.csv")
counts = df['customer_id'].value_counts()

# drop customers with fewer than 3 ratings
ids_to_drop = counts[counts < 3].index
filtered_df = df[~df['customer_id'].isin(ids_to_drop)]
filtered_df.to_csv("../data/processed/df_no_cold_start.csv", index=False)
len(filtered_df), len(df)