# RECOMMENDATION ENGINES - AMAZON TOYS AND GAMES

## GROUP C
• Nikolas Artadi<BR>
• Camila Vasquez <BR>
• Assemgul Khametova <BR>
• Miguel Frutos <BR>

<BR>
<BR>

## TASK
- **DATA SELECTION AND PRE-PROCESSING** (Mandatory)<BR>
First, select a product category (from the “Small subsets for experiment”) and download the related file to create a training dataset and a testing dataset for the experiment. A recommended standard pre-processing strategy is for each user to randomly select 80% of their ratings as training ratings and to use the remaining 20% as testing ratings.
- **COLLABORATIVE FILTERING RECOMMENDER SYSTEM** (Mandatory)<BR>
Based on the training dataset, you should develop a Collaborative Filtering model/algorithm to predict the ratings in the testing set. You may use any existing algorithm implemented in Surprise (or any other library) or develop new algorithms yourself. After predicting the ratings in the testing set, evaluate your predictions by calculating the RMSE.
- **CONTENT-BASED RECOMMENDER SYSTEM** (Mandatory)<BR>
You should leverage the textual information related to the reviews to create a Content-based RS to predict the ratings for the users in the test set. I do recommend you make use of the lab session related to the topic.
- **HYBRID RS** (Optional)<BR>
As an extra, you can propose a hybrid recommender system joining the operation of the two previously developed systems. To that end, you can make use of any of the ideas explained in class.
<BR>
<BR>
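The per-user 80/20 split described in the task can be sketched as follows. This is a minimal illustration, not the split used for the results in this notebook: the helper name `split_per_user`, the seed, the rounding rule and the toy data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def split_per_user(df, user_col="reviewerID", test_frac=0.2, seed=42):
    """Hold out roughly `test_frac` of each user's ratings as a test set."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for _, rows in df.groupby(user_col):
        n_test = int(round(len(rows) * test_frac))
        # Keep at least one training rating for every user
        n_test = min(n_test, len(rows) - 1)
        if n_test > 0:
            chosen = rng.choice(rows.index.to_numpy(), size=n_test, replace=False)
            test_idx.extend(chosen)
    test = df.loc[test_idx]
    train = df.drop(index=test_idx)
    return train, test

# Toy data standing in for game_toy (values are made up for illustration)
toy = pd.DataFrame({
    "reviewerID": ["u1"] * 5 + ["u2"] * 10,
    "asin": [f"p{i}" for i in range(15)],
    "overall": [5, 4, 5, 3, 5, 4, 4, 5, 2, 5, 5, 1, 4, 5, 3],
})
train, test = split_per_user(toy)
print(len(train), len(test))
```

The function relies on the DataFrame having a unique index, which holds for the default `RangeIndex` produced by `pd.read_json`.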

## VARIABLES
- **user-id**: which is denoted as “reviewerID” in the dataset
- **product-id**: which is denoted as “asin” in the dataset
- **rating**: a 1-5 integer star rating that the user gave to the product, denoted as “overall” in the dataset
- **review**: the text of the review that the user wrote about the product, denoted as “reviewText” in the dataset
- **title**: the title of the review, which is denoted as “summary” in the dataset
- **timestamp**: time that the user made the rating and review
- **helpfulness**: contains two numbers, i.e., [#users that think this review is not helpful, #users that think this review is helpful]

# LET'S GET STARTED

## LIBRARIES INSTALLATION

In [2]:
# ! pip install scikit-surprise
# ! pip install plotly
# ! pip install seaborn
import numpy as np
import pandas as pd
import seaborn as sns



## READ DATA

In [36]:
game_toy = pd.read_json('game_toy.json',lines=True)

## ANALYZE THE DATA

Take a quick look at the data to check that the dataset is loaded correctly and to understand the variables' content and the schema.

In [37]:
game_toy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167597 entries, 0 to 167596
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      167597 non-null  object
 1   asin            167597 non-null  object
 2   reviewerName    166759 non-null  object
 3   helpful         167597 non-null  object
 4   reviewText      167597 non-null  object
 5   overall         167597 non-null  int64 
 6   summary         167597 non-null  object
 7   unixReviewTime  167597 non-null  int64 
 8   reviewTime      167597 non-null  object
dtypes: int64(2), object(7)
memory usage: 11.5+ MB


In [38]:
game_toy

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A1VXOAVRGKGEAK,0439893577,Angie,"[0, 0]",I like the item pricing. My granddaughter want...,5,Magnetic board,1390953600,"01 29, 2014"
1,A8R62G708TSCM,0439893577,Candace,"[1, 1]",Love the magnet easel... great for moving to d...,4,it works pretty good for moving to different a...,1395964800,"03 28, 2014"
2,A21KH420DK0ICA,0439893577,capemaychristy,"[1, 1]",Both sides are magnetic. A real plus when you...,5,love this!,1359331200,"01 28, 2013"
3,AR29QK6HPFYZ4,0439893577,dcrm,"[0, 0]",Bought one a few years ago for my daughter and...,5,Daughters love it,1391817600,"02 8, 2014"
4,ACCH8EOML6FN5,0439893577,DoyZ,"[1, 1]",I have a stainless steel refrigerator therefor...,4,Great to have so he can play with his alphabet...,1399248000,"05 5, 2014"
...,...,...,...,...,...,...,...,...,...
167592,A18Q24BZK2CB5P,B00LBI9BKA,nicole todhunter,"[0, 0]",This drone is very fun and super duarable. Its...,5,Very fun,1404691200,"07 7, 2014"
167593,A1I8ON1X0B2N2W,B00LBI9BKA,PF,"[1, 1]",This is my brother's most prized toy. It's ext...,5,Coolest toy on the market,1404691200,"07 7, 2014"
167594,A3V24H5350ULKI,B00LBI9BKA,Sara Tafuri,"[0, 0]",This Panther Drone toy is awesome. I definitel...,5,A great idea for kids!,1404777600,"07 8, 2014"
167595,A1W2F1WI0QZ4AJ,B00LBI9BKA,Tabitha Nicole,"[0, 0]",This is my first drone and it has proven to be...,5,Excellent Drone,1405641600,"07 18, 2014"


We include a mini-EDA to check for duplicates and missing data.

In [39]:
def missing_values_percentage(df):
    """Return the % of missing values for each column of the DataFrame that has any."""
    # The mean of the boolean null mask is the fraction of missing values per column
    pct_missing = 100 * df.isnull().mean()
    return pct_missing[pct_missing > 0]

In [40]:
missing_values_percentage(game_toy)

reviewerName    0.500009
dtype: float64

In [41]:
# Drop the reviewerName column: it is the only column with missing values (about 0.5%) and it adds no value for the recommender.
del game_toy['reviewerName']

In [42]:
missing_values_percentage(game_toy)

Series([], dtype: float64)

In [43]:
# Check for fully duplicated rows. The `helpful` column holds lists, which are
# unhashable, so rows are compared through their string representation.
game_toy.astype(str).duplicated().sum()
# Result: zero fully duplicated rows in the game_toy dataset

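Note: the description of the `helpful` column given above does not seem correct. In the rows shown here the first number never exceeds the second, which fits the reading [helpful votes, total votes] documented for the Amazon review dataset. The column names `users_nothelpful` and `users_helpful` created below should be read with that caveat.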

In [68]:
game_toy.sort_values("helpful", ascending=False).head(5)

Unnamed: 0,reviewerID,asin,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,users_nothelpful,users_helpful
46315,A1OUQCTNVKPVR9,B0010VS078,"[1589, 1637]",I loaned my iPod to my kid and he broke it. T...,4,It's a great portable music solution,1270166400,"04 2, 2010",1589,1637
103098,A4LD7XC56J3ZV,B004Z7H07K,"[1431, 1502]",Hi! I am Erin T. and I run a website called th...,5,My Son Won't Put it Down,1313712000,"08 19, 2011",1431,1502
131030,A1SC7Z2646QCP9,B0089RPUHO,"[1413, 1449]",If you want a child-friendly tablet-style devi...,5,Hands down the best choice for a child-friendl...,1350864000,"10 22, 2012",1413,1449
80422,A3DZFEICHK5LF2,B003JQT4Y0,"[1378, 1393]","Short version:The good: The pen is amazing, a ...",3,Great product but a lot more parent involvement.,1285632000,"09 28, 2010",1378,1393
103019,A2DG63DN704LOI,B004Z7H07K,"[1291, 1359]",I really want to like the LeapPad - my kids do...,3,"Kids like it, but educational value is not as ...",1315612800,"09 10, 2011",1291,1359


In [69]:
game_toy['users_nothelpful']=game_toy.helpful.str[0]
game_toy['users_helpful']=game_toy.helpful.str[1]

In [70]:
game_toy

Unnamed: 0,reviewerID,asin,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,users_nothelpful,users_helpful
0,A1VXOAVRGKGEAK,0439893577,"[0, 0]",I like the item pricing. My granddaughter want...,5,Magnetic board,1390953600,"01 29, 2014",0,0
1,A8R62G708TSCM,0439893577,"[1, 1]",Love the magnet easel... great for moving to d...,4,it works pretty good for moving to different a...,1395964800,"03 28, 2014",1,1
2,A21KH420DK0ICA,0439893577,"[1, 1]",Both sides are magnetic. A real plus when you...,5,love this!,1359331200,"01 28, 2013",1,1
3,AR29QK6HPFYZ4,0439893577,"[0, 0]",Bought one a few years ago for my daughter and...,5,Daughters love it,1391817600,"02 8, 2014",0,0
4,ACCH8EOML6FN5,0439893577,"[1, 1]",I have a stainless steel refrigerator therefor...,4,Great to have so he can play with his alphabet...,1399248000,"05 5, 2014",1,1
...,...,...,...,...,...,...,...,...,...,...
167592,A18Q24BZK2CB5P,B00LBI9BKA,"[0, 0]",This drone is very fun and super duarable. Its...,5,Very fun,1404691200,"07 7, 2014",0,0
167593,A1I8ON1X0B2N2W,B00LBI9BKA,"[1, 1]",This is my brother's most prized toy. It's ext...,5,Coolest toy on the market,1404691200,"07 7, 2014",1,1
167594,A3V24H5350ULKI,B00LBI9BKA,"[0, 0]",This Panther Drone toy is awesome. I definitel...,5,A great idea for kids!,1404777600,"07 8, 2014",0,0
167595,A1W2F1WI0QZ4AJ,B00LBI9BKA,"[0, 0]",This is my first drone and it has proven to be...,5,Excellent Drone,1405641600,"07 18, 2014",0,0
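If the `helpful` pair is read as [helpful votes, total votes], a helpfulness ratio per review can be derived from it. The sketch below works under that assumption; the column names `helpful_votes`, `total_votes` and `helpful_ratio` are hypothetical, and the rows are a few `helpful` values copied from the tables above.

```python
import numpy as np
import pandas as pd

# A few `helpful` values copied from the rows displayed above
reviews = pd.DataFrame({"helpful": [[0, 0], [1, 1], [1589, 1637], [1378, 1393]]})

# Assumption: helpful = [helpful votes, total votes]
votes = pd.DataFrame(
    reviews["helpful"].tolist(),
    columns=["helpful_votes", "total_votes"],
    index=reviews.index,
)
# Reviews without any vote get NaN instead of a division by zero
votes["helpful_ratio"] = votes["helpful_votes"] / votes["total_votes"].replace(0, np.nan)
print(votes)
```

A ratio like this could later serve as a weight or a filter for reviews, provided the interpretation of the column is confirmed first.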


In [71]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Count the number of times each rating appears in the dataset
data = game_toy['overall'].value_counts().sort_index(ascending=False)

# Create the histogram
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / game_toy.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} toys and games ratings'.format(game_toy.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

The distribution is extremely skewed towards high ratings: over 60% of the reviews give the product 5 stars.
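The share of each rating can also be read directly as numbers rather than from the chart. The snippet below is a small sketch on made-up ratings; on the real data the Series would be `game_toy['overall']`.

```python
import pandas as pd

# Made-up ratings for illustration only
ratings = pd.Series([5, 5, 5, 4, 3, 5, 1, 5, 4, 5], name="overall")

# normalize=True returns fractions instead of counts
share = ratings.value_counts(normalize=True).sort_index(ascending=False) * 100
print(share.round(1))
```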

In [74]:
# Number of ratings per game_toy
data = game_toy.groupby('asin')['overall'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'overall',
                     xbins = dict(start = 0,size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per game_toy',
                   xaxis = dict(title = 'Number of Ratings Per Product ID'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

We can see a long tail: only 50 products account for most of the reviews. It might be worth finding a way to remove the reviews that do not fulfil our quality criteria.
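One way to quantify how concentrated the reviews are is to compute the fraction of all reviews that belongs to the N most-reviewed products. This is a sketch with a hypothetical helper `top_n_share` and toy data; for the notebook's dataset it would be called as `top_n_share(game_toy, 50)`.

```python
import pandas as pd

def top_n_share(df, n, item_col="asin"):
    """Fraction of all rows that belong to the n most-reviewed items."""
    counts = df[item_col].value_counts()  # sorted by count, descending
    return counts.head(n).sum() / counts.sum()

# Toy data: product "a" has 6 reviews, "b" has 3, "c" has 1
toy = pd.DataFrame({"asin": ["a"] * 6 + ["b"] * 3 + ["c"]})
print(top_n_share(toy, 1))
print(top_n_share(toy, 2))
```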

In [75]:

# Number of ratings per user
data = game_toy.groupby('reviewerID')['overall'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0, size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

The same holds here: the distribution has an extremely long tail, so it might be worth removing users who have written only a few reviews.
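A possible way to act on both long tails is to keep only users and products with a minimum number of ratings. The sketch below is one option, not a step applied in this notebook: the helper `filter_min_ratings` and its thresholds are assumptions. It repeats the filter until it is stable, because removing sparse products can push some users below the threshold and vice versa.

```python
import pandas as pd

def filter_min_ratings(df, min_user=5, min_item=5,
                       user_col="reviewerID", item_col="asin"):
    """Iteratively drop users and items that have too few ratings."""
    while not df.empty:
        # Number of ratings of the user / item each row belongs to
        user_counts = df.groupby(user_col)[item_col].transform("size")
        item_counts = df.groupby(item_col)[user_col].transform("size")
        keep = (user_counts >= min_user) & (item_counts >= min_item)
        if keep.all():
            break
        df = df[keep]
    return df

# Toy data: u3 and product "c" each appear only once
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2", "u2", "u3"],
    "asin":       ["a",  "b",  "c",  "a",  "b",  "a"],
    "overall":    [5, 4, 3, 5, 5, 2],
})
filtered = filter_min_ratings(toy, min_user=2, min_item=2)
print(filtered)
```

Any such filtering changes the population being evaluated, so RMSE values before and after filtering are not directly comparable.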

In [76]:
from surprise import Dataset
from surprise import Reader

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(game_toy[['reviewerID', 'asin', 'overall']], reader)

In [78]:
from surprise import KNNBaseline
from surprise.model_selection import cross_validate
knn = KNNBaseline()
cross_validate(knn,data,measures=['RMSE'],cv=3,verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9709  0.9683  0.9727  0.9706  0.0018  
Fit time          18.37   18.00   14.81   17.06   1.60    
Test time         2.68    1.95    2.21    2.28    0.30    


{'test_rmse': array([0.9709298 , 0.9683175 , 0.97267938]),
 'fit_time': (18.370604991912842, 18.000859260559082, 14.809059143066406),
 'test_time': (2.6797292232513428, 1.9537627696990967, 2.2050440311431885)}
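For reference, the RMSE reported above is the square root of the mean squared difference between true and predicted ratings, and MAE is the mean absolute difference. The sketch below computes both by hand with NumPy on made-up ratings and predictions; the function names are chosen for this example and are not part of Surprise.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up true ratings and predictions
true_ratings = [5, 4, 5, 3, 1]
predictions = [4.5, 4.0, 4.0, 3.5, 2.0]
print(round(rmse(true_ratings, predictions), 4))
print(round(mae(true_ratings, predictions), 4))
```

Because errors are squared before averaging, RMSE penalises large misses more heavily than MAE does.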

In [80]:
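# Tuned configuration (Pearson-baseline similarity). Pass `knn_pearson` to cross_validate to evaluate it.
knn_pearson = KNNBaseline(k=40, min_k=2, sim_options={'name': 'pearson_baseline'}, verbose=True)
# This run re-evaluates the default `knn` model (MSD similarity) with 5 folds and adds MAE.
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)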

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9660  0.9673  0.9695  0.9676  0.9855  0.9712  0.0072  
MAE (testset)     0.6892  0.6901  0.6927  0.6906  0.7002  0.6925  0.0040  
Fit time          15.66   15.00   14.17   14.18   13.83   14.57   0.67    
Test time         1.33    1.34    1.41    1.41    1.41    1.38    0.04    


{'test_rmse': array([0.96598255, 0.96729823, 0.96952947, 0.9675773 , 0.98548297]),
 'test_mae': array([0.68918966, 0.69005059, 0.69274139, 0.69055642, 0.70015508]),
 'fit_time': (15.655402898788452,
  14.99592900276184,
  14.166626930236816,
  14.183858156204224,
  13.826802968978882),
 'test_time': (1.332442283630371,
  1.3446922302246094,
  1.4090800285339355,
  1.414013147354126,
  1.4109160900115967)}

In [81]:
from surprise import SVD
SVD1=SVD()
cross_validate(SVD1,data,measures=['RMSE','MAE'],cv=5,verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9071  0.9037  0.9013  0.8945  0.8951  0.9003  0.0049  
MAE (testset)     0.6676  0.6690  0.6649  0.6639  0.6599  0.6650  0.0032  
Fit time          7.91    8.29    8.57    8.55    8.59    8.38    0.26    
Test time         0.24    0.57    0.23    0.59    0.23    0.37    0.17    


{'test_rmse': array([0.90710539, 0.90372863, 0.90127347, 0.89445296, 0.89511202]),
 'test_mae': array([0.66757661, 0.66895324, 0.6648558 , 0.66386795, 0.65987436]),
 'fit_time': (7.908226728439331,
  8.286352157592773,
  8.571156024932861,
  8.549629926681519,
  8.591503858566284),
 'test_time': (0.242110013961792,
  0.5697827339172363,
  0.23186182975769043,
  0.5934720039367676,
  0.2309250831604004)}
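The two 5-fold runs can be compared by their mean RMSE. The snippet below copies the per-fold values printed above and computes the relative reduction. It is only a rough comparison: the folds are drawn randomly in each `cross_validate` call, so the two models were not evaluated on identical splits.

```python
import numpy as np

# Per-fold RMSE values copied from the two 5-fold outputs above
knn_rmse = np.array([0.96598255, 0.96729823, 0.96952947, 0.9675773, 0.98548297])
svd_rmse = np.array([0.90710539, 0.90372863, 0.90127347, 0.89445296, 0.89511202])

rel_reduction = (knn_rmse.mean() - svd_rmse.mean()) / knn_rmse.mean()
print(f"KNNBaseline mean RMSE: {knn_rmse.mean():.4f}")
print(f"SVD mean RMSE:         {svd_rmse.mean():.4f}")
print(f"Relative reduction:    {rel_reduction:.1%}")
```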