# RECOMMENDATION ENGINES - AMAZON TOYS AND GAMES

## GROUP C
- Nikolas Artadi
- Camila Vasquez
- Assemgul Khametova
- Miguel Frutos

## TASK
- **DATA SELECTION AND PRE-PROCESSING** (Mandatory)
First, you need to select a product category (from the “Small subsets for experiment”) and download the related file to create a training dataset and a testing dataset for the experiment. A recommended standard pre-processing strategy is that: each user randomly selects 80% of their ratings as the training ratings and uses the remaining 20% ratings as testing ratings.
- **COLLABORATIVE FILTERING RECOMMENDER SYSTEM** (Mandatory)
Based on the training dataset, you should develop a Collaborative Filtering model/algorithm to predict the ratings in the testing set. You may use any existing algorithm implemented in Surprise (or any other library) or develop new algorithms yourself. After predicting the ratings in the testing set, evaluate your predictions by calculating the RMSE.
- **CONTENT-BASED RECOMMENDER SYSTEM** (Mandatory)
You should leverage the textual information related to the reviews to create a Content-based RS to predict the ratings for the users in the test set. I do recommend you make use of the lab session related to the topic.
- **HYBRID RS** (Optional)
As an extra, you can propose a hybrid recommender system joining the operation of the two previously developed systems. To that end, you can make use of any of the ideas explained in class.
    
## DATASET
We use the Toys and Games dataset from the following [source](http://deepyeti.ucsd.edu/jianmo/amazon/index.html).

### Ratings only features explanation
- **reviewerID/user-id** - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- **asin/product-id** - ID of the product, e.g. 0000013714
- **reviewerName** - name of the reviewer
- **helpful/helpfulness** - helpfulness rating of the review, e.g. 2/3, stored as a list [#users that found this review helpful, #users that voted on it]
- **reviewText/review** - text of the review
- **overall/rating** - rating of the product
- **summary/title** - summary of the review
- **unixReviewTime/timestamp** - time of the review (unix time)
- **reviewTime** - time of the review (raw)

# We still need to create a training dataset. We can do this once we decide which other metadata to use for the content-based and hybrid systems; the datasets will need to be merged here.

# LET'S GET STARTED

## LIBRARIES INSTALLATION

In [7]:
# ! pip install scikit-surprise
# ! pip install plotly
# ! pip install seaborn

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split

from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import SVDpp
from surprise import SVD
from surprise import KNNBaseline
from surprise import NormalPredictor
from surprise import BaselineOnly
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering

from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

import matplotlib.pyplot as plt

## READ DATA

In [8]:
# df = pd.read_json('game_toy.json',lines=True)

# Ratings data
df = pd.read_json('/Users/niko/Desktop/Recommnedation systems/project/game_toy.json',lines=True)

# Split the dataset: 80% of all rows for training (df), the remaining 20% for testing (test_dataset)
df, test_dataset = train_test_split(df, train_size=0.80,random_state=42)
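
The task description recommends splitting per user (each user keeps about 80% of their own ratings for training) rather than over the whole table. A minimal sketch of that strategy, assuming a ratings frame with a `reviewerID` column and a unique index. `per_user_split` and the toy frame are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd


def per_user_split(ratings, user_col="reviewerID", train_frac=0.8, seed=42):
    """Randomly assign about train_frac of each user's rows to training."""
    rng = np.random.default_rng(seed)
    # One random draw per row, ranked within each user's group (1 = first pick).
    noise = pd.Series(rng.random(len(ratings)), index=ratings.index)
    rank = noise.groupby(ratings[user_col]).rank(method="first")
    # Number of rows each user contributes to training (at least one).
    size = ratings.groupby(user_col)[user_col].transform("count")
    n_train = np.maximum(1, np.floor(train_frac * size))
    train_mask = rank <= n_train
    return ratings[train_mask], ratings[~train_mask]


# Toy frame with made-up IDs: user A has 10 ratings, B has 5, C has 1.
toy = pd.DataFrame({
    "reviewerID": ["A"] * 10 + ["B"] * 5 + ["C"],
    "asin": [f"P{i}" for i in range(16)],
    "overall": [5, 4, 3, 5, 4, 5, 2, 5, 4, 5, 3, 4, 5, 5, 1, 4],
})
toy_train, toy_test = per_user_split(toy)
train_counts = toy_train.groupby("reviewerID").size().to_dict()
print(train_counts)
```

Users with a single rating stay entirely in the training set here, which is a design choice: such users could not be evaluated by a collaborative model that never saw them.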

## ANALYZE THE DATA

Take a quick look at the data to check that the dataset loaded correctly and to understand the variables' content and the schema.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134077 entries, 103644 to 121958
Data columns (total 9 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   reviewerID      134077 non-null  object
 1   asin            134077 non-null  object
 2   reviewerName    133401 non-null  object
 3   helpful         134077 non-null  object
 4   reviewText      134077 non-null  object
 5   overall         134077 non-null  int64 
 6   summary         134077 non-null  object
 7   unixReviewTime  134077 non-null  int64 
 8   reviewTime      134077 non-null  object
dtypes: int64(2), object(7)
memory usage: 10.2+ MB


In [10]:
df.describe()
# We can see that the ratings have min of 1 and max of 5

| | overall | unixReviewTime |
|---|---|---|
| count | 134077.0 | 134077.0 |
| mean | 4.356668 | 1348671000.0 |
| std | 0.992509 | 61008400.0 |
| min | 1.0 | 964742400.0 |
| 25% | 4.0 | 1335658000.0 |
| 50% | 5.0 | 1364342000.0 |
| 75% | 5.0 | 1388016000.0 |
| max | 5.0 | 1406074000.0 |


In [11]:
df

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
103644,A9XX8OHS2ZQ2X,B00508OLNY,R. Neil Scott,"[3, 4]",I received this puzzle for review through the ...,4,Does not Dissapoint,1345420800,"08 20, 2012"
119165,AX8ATTTB67KFM,B00767PSIO,Lita Counts,"[1, 2]",I ordered this for my sons 1st bithday. It wa...,5,good party scene,1342396800,"07 16, 2012"
117593,A1VWK2BNL8I93C,B006X415GK,Theresa Mead,"[0, 0]",I bought this doll for my daughter. She enjoy...,4,Grated deal,1357862400,"01 11, 2013"
131341,A1AJWJGB89GSIL,B008A2BA90,R. Huffman,"[0, 1]",I initially picked this game up due to many gl...,4,A SEASONS for all magicians,1374192000,"07 19, 2013"
94465,A3KOL1FYRGZPGQ,B004NIF5OQ,SL,"[0, 0]",We bought my son the train set for Christmas a...,5,My son loves it!,1357430400,"01 6, 2013"
...,...,...,...,...,...,...,...,...,...
119879,AG2IEP1MJQHFS,B007ADICI2,Kelly Houser,"[0, 2]",Ravensburger puzzles are absolutely the best j...,5,Great value and great fun!,1341532800,"07 6, 2012"
103694,A3OBP99ZV5TG6C,B00508OOSG,Jay,"[0, 0]",The quality of the pieces was excellent. No m...,5,Excellent Puzzle - not for beginners,1342396800,"07 16, 2012"
131932,A3JR8YFXZQQBU0,B008B68IE0,emd104,"[1, 1]",Her arms and legs are so stiff they don't bend...,3,Beautiful doll...but...,1387929600,"12 25, 2013"
146867,A7N541YZQZIXH,B00BDMNBMI,ttim12,"[0, 0]",The plastic in the drum area doesn't sound gre...,4,My 12 month old likes it,1379980800,"09 24, 2013"


We have included an EDA and have identified duplicates and missing data which we will handle in the next steps.

In [12]:
def missing_values_percentage(df):
    """Return the % of missing values for each column that has at least one."""
    # isnull().mean() is the fraction of missing entries per column
    pct_missing = 100 * df.isnull().mean()
    return pct_missing[pct_missing > 0]

In [13]:
missing_values_percentage(df)

reviewerName    0.504188
dtype: float64

In [14]:
#Decided to drop the reviewerName column: it has about 0.5% missing values and adds no value to the analysis.
del df['reviewerName']

In [15]:
missing_values_percentage(df)

Series([], dtype: float64)

In [16]:
#Drop duplicates
# drop_duplicates must be called (with parentheses) and its result assigned.
# The 'helpful' column holds lists, which cannot be hashed, so it is left out of the comparison.
dedup_cols = [c for c in df.columns if c != 'helpful']
print('Fully duplicated rows:', df.duplicated(subset=dedup_cols).sum())
df = df.drop_duplicates(subset=dedup_cols)
# We expect zero fully duplicated rows in the game_toy dataset

In [17]:
df.sort_values("helpful", ascending=False).head(5)

Unnamed: 0,reviewerID,asin,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
46315,A1OUQCTNVKPVR9,B0010VS078,"[1589, 1637]",I loaned my iPod to my kid and he broke it. T...,4,It's a great portable music solution,1270166400,"04 2, 2010"
103098,A4LD7XC56J3ZV,B004Z7H07K,"[1431, 1502]",Hi! I am Erin T. and I run a website called th...,5,My Son Won't Put it Down,1313712000,"08 19, 2011"
131030,A1SC7Z2646QCP9,B0089RPUHO,"[1413, 1449]",If you want a child-friendly tablet-style devi...,5,Hands down the best choice for a child-friendl...,1350864000,"10 22, 2012"
22121,ASGI7E0AJ8H5X,B0006O8Q7Y,"[1247, 1258]","Prior to purchasing, I searched all over to tr...",5,ultra stomp rocket vs junior stomp rocket,1231718400,"01 12, 2009"
39168,A1GALZCXD8FHOR,B000NOU54O,"[988, 1018]",Let's cut to the chase: If you're looking for ...,4,Hard to beat its total value in a beginner mic...,1200355200,"01 15, 2008"


In [18]:
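# Split the two-element 'helpful' list into separate columns.
# Caution: in the sample rows the first element never exceeds the second,
# so it may be "helpful votes" out of "total votes" rather than a not-helpful count.
df['users_nothelpful']=df.helpful.str[0]
df['users_helpful']=df.helpful.str[1]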

# Analyze the data

### See the count of ratings per rating

In [19]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Count the number of times each rating appears in the dataset
data = df['overall'].value_counts().sort_index(ascending=False)

# Create the histogram
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} Toys and Games Ratings'.format(df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Most reviews give 5 stars, so the distribution is heavily skewed towards the top of the scale: over 80% of the ratings are 4 or 5. We can infer that the ratings in this dataset are strongly biased towards positive values.
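
The share of positive ratings quoted above can be checked directly from the rating column. A small sketch on made-up ratings; `rating_shares` is a hypothetical helper, and on the real data the input would be `df['overall']`:

```python
import pandas as pd


def rating_shares(ratings):
    """Return the share of each rating value, highest rating first."""
    return ratings.value_counts(normalize=True).sort_index(ascending=False)


# Made-up ratings, only to illustrate the computation.
toy_ratings = pd.Series([5, 5, 5, 4, 4, 3, 2, 5, 1, 5])
shares = rating_shares(toy_ratings)
# Add up the shares of the 4 and 5 star ratings.
positive_share = shares[shares.index >= 4].sum()
print(f"Share of 4 and 5 star ratings: {positive_share:.0%}")
```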

### See the number of ratings per product

In [20]:
# Number of ratings per game_toy
data = df.groupby('asin')['overall'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'overall',
                     xbins = dict(start = 0,size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Product',
                   xaxis = dict(title = 'Number of Ratings Per Product ID'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

We can clearly see a long tail in the number of reviews per product: only about 50 products account for most of the reviews, while the rest have very few.
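
One way to put a number on the long tail is the fraction of all reviews that belongs to the N most-reviewed products. A sketch on made-up product IDs; `top_n_share` is a hypothetical helper, and on the real data the input would be `df['asin']`:

```python
import pandas as pd


def top_n_share(keys, n):
    """Fraction of all rows that belong to the n most frequent keys."""
    counts = keys.value_counts()  # sorted from most to least frequent
    return counts.head(n).sum() / counts.sum()


# Made-up product IDs: P1 has 6 reviews, P2 has 2, P3 and P4 have 1 each.
toy_asins = pd.Series(["P1"] * 6 + ["P2"] * 2 + ["P3", "P4"])
share_top1 = top_n_share(toy_asins, n=1)
print(share_top1)
```

The same helper applies to the per-user distribution by passing the user ID column instead.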

### See the number of ratings per user

In [21]:
# Number of ratings per user
data = df.groupby('reviewerID')['overall'].count()
# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0, size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

Here the picture is similar to the previous case: a small number of users wrote many reviews, followed by an extremely long tail of users with very few.

We can now safely remove the users that do not exceed our threshold of 50 reviews.

In [22]:
# Keep only the users with more than `tresh` ratings

tresh = 50
sub_df = df[df.groupby('reviewerID')['overall'].transform('count')>tresh].copy() 
print('Old shape: ',df.shape[0],'rows')
print('New shape: ',sub_df.shape[0],'rows')
print('Difference: ',-df.shape[0]+sub_df.shape[0],'rows')

Old shape:  134077 rows
New shape:  4952 rows
Difference:  -129125 rows


# Start of the collaborative filtering RS

In [23]:
reader = Reader(rating_scale=(1, 5))
# Note: the model is trained on the full training frame df, not on the filtered sub_df
data = Dataset.load_from_df(df[['reviewerID', 'asin', 'overall']], reader)

### 1. KNN

In [24]:
# # To use item-based cosine similarity
# sim_options = {
#     "name": "cosine",
#     "user_based": False,  # Compute  similarities between items
# }
# knn = KNNBaseline(sim_options=sim_options)

In [25]:
sim_options = {'name':'pearson_baseline'}

knn = KNNBaseline(k=40,min_k=2,sim_options=sim_options,verbose=True)

results = cross_validate(knn,data,measures=['RMSE','MAE'],cv=5,verbose=True)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9115  0.9040  0.9116  0.9192  0.9119  0.9117  0.0048  
MAE (testset)     0.6893  0.6860  0.6912  0.6925  0.6902  0.6898  0.0022  
Fit time          24.87   20.69   19.12   20.37   20.02   21.01   2.00    
Test time         1.53    0.86    0.77    1.04    0.8

The RMSE is about 0.91 (mean 0.9117, standard deviation 0.0048), with only slight variation between folds, which shows stable results across the 5 folds.
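
For reference, the two error measures reported above can be written out in a few lines of NumPy. This is a sketch on made-up ratings and predictions, not the Surprise implementation:

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more strongly."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mae(actual, predicted):
    """Mean absolute error: average size of the error in rating points."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs(actual - predicted)))


# Made-up true ratings and predictions.
toy_actual = [5, 4, 3, 5]
toy_predicted = [4.5, 4.0, 4.0, 3.0]
toy_rmse = rmse(toy_actual, toy_predicted)
toy_mae = mae(toy_actual, toy_predicted)
print(toy_rmse, toy_mae)
```

Because the errors are squared before averaging, the RMSE is never smaller than the MAE, which matches the fold results above.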

In [26]:
cross_validate(NormalPredictor(), data, measures=['RMSE'], cv=3, verbose=True)

Evaluating RMSE of algorithm NormalPredictor on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.2680  1.2683  1.2712  1.2692  0.0014  
Fit time          0.14    0.17    0.17    0.16    0.01    
Test time         0.73    0.29    0.28    0.43    0.21    


{'test_rmse': array([1.26801792, 1.26832657, 1.27116447]),
 'fit_time': (0.13781189918518066, 0.16914677619934082, 0.16768097877502441),
 'test_time': (0.7285256385803223, 0.28656983375549316, 0.27550792694091797)}

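NormalPredictor predicts random ratings drawn from a normal distribution fitted to the training ratings, so its RMSE of about 1.27 serves as the reference that any useful model on this dataset has to beat. KNN's RMSE of about 0.91 is clearly lower, which shows that KNN is learning from the data.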
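
To make the idea of a random baseline concrete, here is a sketch in the spirit of NormalPredictor, written with NumPy only. It is an illustration under stated assumptions (made-up training ratings, predictions clipped to the 1 to 5 scale), not Surprise's actual code:

```python
import numpy as np


def normal_baseline_predictions(train_ratings, n_predictions, seed=0, scale=(1, 5)):
    """Draw random predictions from a normal fitted to the training ratings."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.mean(train_ratings), np.std(train_ratings)
    draws = rng.normal(mu, sigma, size=n_predictions)
    # Keep the predictions inside the rating scale.
    return np.clip(draws, *scale)


# Made-up training ratings, skewed towards 5 like the real data.
toy_train_ratings = np.array([5, 5, 4, 5, 3, 5, 4, 1, 5, 4])
preds = normal_baseline_predictions(toy_train_ratings, n_predictions=1000)
print(preds[:5])
```

Such a predictor ignores both the user and the item, so any model that uses them should reach a lower RMSE.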

Tuning the KNN

In [27]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNBaseline, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matr

The best configuration is item-based, using mean squared difference (MSD) similarity with a minimum support of 5. It reaches an RMSE of about 0.92, which is similar to, but no better than, the 0.91 of the first KNN run with `pearson_baseline`.
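
The MSD similarity and the role of `min_support` can be sketched as follows. This is a simplified illustration for two items, assuming missing ratings are stored as NaN; the helper name and the toy vectors are hypothetical:

```python
import numpy as np


def msd_similarity(ratings_a, ratings_b, min_support=1):
    """Similarity of two items from the users who rated both of them."""
    a, b = np.asarray(ratings_a, float), np.asarray(ratings_b, float)
    common = ~np.isnan(a) & ~np.isnan(b)
    # Too few common users: treat the pair as not similar at all.
    if common.sum() < min_support:
        return 0.0
    msd = np.mean((a[common] - b[common]) ** 2)
    # A small mean squared difference gives a similarity close to 1.
    return float(1.0 / (msd + 1.0))


# Ratings of five users for two items (NaN = user did not rate the item).
item_a = [5, 4, np.nan, 3, 1]
item_b = [4, 4, 2, np.nan, 3]
sim_ab = msd_similarity(item_a, item_b, min_support=3)
sim_ab_strict = msd_similarity(item_a, item_b, min_support=5)
print(sim_ab, sim_ab_strict)
```

A higher `min_support` discards similarities that rest on only a handful of common users, which matters on sparse data like this one.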

### 2. Matrix Factorization

In [28]:
from surprise import SVDpp

# We'll use SVD++ (SVDpp), an extension of the famous SVD algorithm that also takes implicit ratings into account.
svd = SVDpp()

results = cross_validate(svd, data, measures=['RMSE'], cv=3, verbose=False)
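
For intuition, a biased matrix-factorisation model estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below shows the plain SVD form of that estimate with made-up numbers; SVD++ adds an implicit-feedback term that is left out here:

```python
import numpy as np


def svd_predict(mu, b_u, b_i, p_u, q_i, scale=(1, 5)):
    """Biased matrix-factorisation estimate: mu + b_u + b_i + q_i . p_u."""
    estimate = mu + b_u + b_i + float(np.dot(q_i, p_u))
    # Keep the estimate inside the rating scale.
    return float(np.clip(estimate, *scale))


# Made-up parameters for one user-item pair.
mu = 4.36                      # global mean rating
b_u, b_i = -0.2, 0.3           # user and item biases
p_u = np.array([0.1, -0.4])    # user factors
q_i = np.array([0.5, 0.25])    # item factors
toy_estimate = svd_predict(mu, b_u, b_i, p_u, q_i)
print(toy_estimate)
```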

### 3. Benchmarking

In [29]:
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), BaselineOnly(), CoClustering()]:

    print("Testing {}".format(algorithm))
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)

    # Average each metric over the folds and record the algorithm's class name
    # (Series.append was removed in pandas 2.0, so the name is assigned directly)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = type(algorithm).__name__
    benchmark.append(tmp)

pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Testing <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7ff0f7084910>
Testing <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x7ff0f7084400>
Testing <surprise.prediction_algorithms.slope_one.SlopeOne object at 0x7ff0f7084370>
Testing <surprise.prediction_algorithms.matrix_factorization.NMF object at 0x7ff0f70845e0>
Testing <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x7ff0f7084f70>
Testing <surprise.prediction_algorithms.knns.KNNBaseline object at 0x7ff0f70844f0>
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x7ff0f70842b0>
Estimating biases using als...
Estimating biases using als

| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 0.9083 | 18.285973 | 1.249162 |
| SVD | 0.914604 | 6.070348 | 0.430845 |
| BaselineOnly | 0.916206 | 0.472983 | 0.233745 |
| KNNBaseline | 0.968904 | 16.718082 | 1.39877 |
| CoClustering | 1.014459 | 4.091354 | 0.393917 |
| SlopeOne | 1.071099 | 4.265033 | 0.583375 |
| NMF | 1.139226 | 9.126509 | 0.415983 |
| NormalPredictor | 1.270089 | 0.168527 | 0.430159 |


SVDpp takes the longest to fit (about 18 s) but has the best RMSE (0.908). The gain over SVD (0.915) and BaselineOnly (0.916) is small, so depending on how quickly recommendations must reach the customer we can choose a faster model such as SVD, or even a simple one such as BaselineOnly, which has the shortest test time and an RMSE close to the best result.

In [30]:
df

Unnamed: 0,reviewerID,asin,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,users_nothelpful,users_helpful
103644,A9XX8OHS2ZQ2X,B00508OLNY,"[3, 4]",I received this puzzle for review through the ...,4,Does not Dissapoint,1345420800,"08 20, 2012",3,4
119165,AX8ATTTB67KFM,B00767PSIO,"[1, 2]",I ordered this for my sons 1st bithday. It wa...,5,good party scene,1342396800,"07 16, 2012",1,2
117593,A1VWK2BNL8I93C,B006X415GK,"[0, 0]",I bought this doll for my daughter. She enjoy...,4,Grated deal,1357862400,"01 11, 2013",0,0
131341,A1AJWJGB89GSIL,B008A2BA90,"[0, 1]",I initially picked this game up due to many gl...,4,A SEASONS for all magicians,1374192000,"07 19, 2013",0,1
94465,A3KOL1FYRGZPGQ,B004NIF5OQ,"[0, 0]",We bought my son the train set for Christmas a...,5,My son loves it!,1357430400,"01 6, 2013",0,0
...,...,...,...,...,...,...,...,...,...,...
119879,AG2IEP1MJQHFS,B007ADICI2,"[0, 2]",Ravensburger puzzles are absolutely the best j...,5,Great value and great fun!,1341532800,"07 6, 2012",0,2
103694,A3OBP99ZV5TG6C,B00508OOSG,"[0, 0]",The quality of the pieces was excellent. No m...,5,Excellent Puzzle - not for beginners,1342396800,"07 16, 2012",0,0
131932,A3JR8YFXZQQBU0,B008B68IE0,"[1, 1]",Her arms and legs are so stiff they don't bend...,3,Beautiful doll...but...,1387929600,"12 25, 2013",1,1
146867,A7N541YZQZIXH,B00BDMNBMI,"[0, 0]",The plastic in the drum area doesn't sound gre...,4,My 12 month old likes it,1379980800,"09 24, 2013",0,0


## Try out the solution

In [36]:
# Execute KNN
sim_options = {'name': 'pearson_baseline', 'user_based': False}
knn = KNNBaseline(sim_options=sim_options)
knn.fit(data.build_full_trainset())

# Target product to analyze its neighbourhood
game_name = 'B007HZ9S7C'

# Get the closest neighbours
neighbors = knn.get_neighbors(knn.trainset.to_inner_iid(game_name), k=10)
# Translate the internal ids used in the algorithm to the raw product ids (ASINs)
neighbors = (knn.trainset.to_raw_iid(inner_id) for inner_id in neighbors)

print()
print('The 10 nearest neighbors of {} are:\n'.format(game_name))
for game in neighbors:
    print("\t",game)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.

The 10 nearest neighbors of B007HZ9S7C are:

	 B007J3FAJ2
	 B0085UA8ZO
	 B0039X6XZG
	 B001W09LO8
	 B007Z8U7CG
	 B00BD9BXBM
	 B004ORV2O8
	 B004UCBU6M
	 B004ORWXFA
	 B00BFREGZ2


At first sight this output is not very useful: the ASINs only become meaningful once we look up the corresponding product information. Nonetheless, we can pass these IDs to our servers and join them with the product data to decide what to recommend to our users.
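
The join mentioned above could look like the sketch below. The metadata table, its `title` column, and the IDs are made up for illustration; in practice the titles would come from the product metadata, which this notebook does not load:

```python
import pandas as pd

# Hypothetical metadata table with made-up IDs and titles.
toy_meta = pd.DataFrame({
    "asin": ["X1", "X2", "X3"],
    "title": ["Wooden puzzle", "Train set", "Card game"],
})


def label_neighbors(neighbor_asins, meta):
    """Attach a title to each neighbour ASIN, keeping the neighbour order."""
    ranked = pd.DataFrame({"asin": list(neighbor_asins)})
    # A left join keeps neighbours without metadata (their title becomes NaN).
    return ranked.merge(meta[["asin", "title"]], on="asin", how="left")


labeled = label_neighbors(["X3", "X1", "X9"], toy_meta)
print(labeled)
```

The left join is a deliberate choice: a neighbour that is missing from the metadata stays visible instead of silently disappearing from the list.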

## Analyze bias

> Best games

In [40]:
# `data` must be the Surprise Dataset built in the collaborative filtering section.
# Fit on the full training set so the item biases line up with this trainset's inner ids.
trainset = data.build_full_trainset()
svd.fit(trainset)
game_name = [(b, trainset.to_raw_iid(i)) for i, b in enumerate(svd.bi)]
print("Best games:")
# Highest item bias first
sorted(game_name, key=lambda x: -x[0])[:15]

> Worst games

In [37]:
print("Worst games:")
# Lowest item bias first
sorted(game_name, key=lambda x: x[0])[:15]

### User bias

In [None]:
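# svd.trainset is the training set the model was last fitted on
user_bias = [(b, svd.trainset.to_raw_uid(i)) for i, b in enumerate(svd.bu)]
# User with the lowest bias (tends to rate below the average)
lowest_bias_user = sorted(user_bias, key=lambda x: x[0])[0]
lowest_bias_user

In [None]:
df[df.reviewerID == lowest_bias_user[1]]

In [None]:
# User with the highest bias (tends to rate above the average)
highest_bias_user = sorted(user_bias, key=lambda x: x[0])[-1]
highest_bias_user

In [None]:
df[df.reviewerID == highest_bias_user[1]]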

# CONTENT BASED FILTERING

In [None]:
import pandas as pd

#Storing the movie information into a pandas dataframe
movies_df = pd.read_csv('./ml-latest-small/movies.csv')
#Storing the user information into a pandas dataframe
ratings_df = pd.read_csv('./ml-latest-small/ratings.csv')
#Head is a function that gets the first N rows of a dataframe. N's default is 5.
movies_df.head()

In [None]:
#Every genre is separated by a | so we simply have to call the split function on |
movies_df['genres'] = movies_df.genres.str.split('|')

#Copying the movie dataframe into a new one since we won't need to use the genre information in our first case.
moviesWithGenres_df = movies_df.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
        
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

In [None]:
user_id = 2

# Get from the ratings dataframe only the rows (ratings) related to the user_id
user_rating = ratings_df[ratings_df.userId == user_id]

# Merge with the movies dataframe to add the movie title to facilitate the analysis of the results
# (drop(columns=...) replaces the positional axis argument, which newer pandas versions reject)
inputMovies = pd.merge(user_rating, movies_df, on='movieId').drop(columns=['timestamp', 'genres', 'userId'])
inputMovies

In [None]:
#Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

In [None]:
#Resetting the index to avoid future issues
userMovies = userMovies.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres'])
userGenreTable

In [None]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
userProfile

In [None]:
#Now let's get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres'])
genreTable.head(10) #This is for all movies

In [None]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())

#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)

recommendationTable_df

In [None]:
recommendationTable_df.drop(inputMovies.movieId, inplace=True)
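
The whole pipeline above (user profile, weighted average, removal of already rated items) can be followed on a tiny example. The items, features, and ratings below are made up for illustration:

```python
import pandas as pd

# Made-up item-feature matrix (1 = the item has the feature).
features = pd.DataFrame(
    {"puzzle": [1, 0, 1, 0], "outdoor": [0, 1, 0, 1], "toddler": [1, 1, 0, 0]},
    index=["item_a", "item_b", "item_c", "item_d"],
)
# Ratings the user gave to two of the items.
user_ratings = pd.Series({"item_a": 5, "item_b": 2})

# User profile: rating-weighted sum of the feature vectors of the rated items.
profile = features.loc[user_ratings.index].T.dot(user_ratings)
# Score: weighted average of each item's features under the profile.
scores = (features * profile).sum(axis=1) / profile.sum()
# Recommend only the items the user has not rated yet.
recommendations = scores.drop(user_ratings.index).sort_values(ascending=False)
print(recommendations)
```

Here the profile is puzzle = 5, outdoor = 2, toddler = 7, so the unrated puzzle item scores 5/14 and ranks above the unrated outdoor item at 2/14.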

# Content based - Textual Features

In [None]:
from sklearn.metrics.pairwise import linear_kernel 
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
df

In [None]:
# min_df=1 keeps every term; recent scikit-learn versions reject an integer min_df of 0
tfidf = TfidfVectorizer(analyzer='word', min_df=1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['reviewText'])


In [None]:
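# get_feature_names() was removed in recent scikit-learn versions
feature_names = tfidf.get_feature_names_out()
# Densify only a small sample of rows; the full matrix would not fit in memory
ndf = pd.DataFrame(tfidf_matrix[:100].toarray(), columns=feature_names)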
ndf

In [None]:
dict(ndf.sort_values(by=1, ascending=False, axis=1).iloc[1])


In [None]:


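# Assumption: `ds` is not defined earlier, so we build one document per product
# by joining its reviews; the 'id' column holds the ASIN
ds = df.groupby('asin')['reviewText'].apply(' '.join).reset_index().rename(columns={'asin': 'id'})
item_tfidf = TfidfVectorizer(analyzer='word', stop_words='english').fit_transform(ds['reviewText'])

# Compute cosine similarity (TF-IDF rows are L2-normalised, so the linear kernel equals the cosine)
# This is a dense items x items matrix and can need a lot of memory
cosine_similarities = linear_kernel(item_tfidf, item_tfidf)

# Iterate over the items in the dataset to find the most similar ones to each one
results = {}
for idx, row in ds.iterrows():
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    # Skip the first entry, which is normally the item itself
    results[row['id']] = similar_items[1:]

In [None]:
def item(item_id):
    # A short snippet of the product's review text stands in for its name
    return ds.loc[ds['id'] == item_id]['reviewText'].tolist()[0].split(' - ')[0][:80]

# Just reads the results out of the dictionary
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

In [None]:
# Item ids are ASINs; this one is the product used in the KNN example
recommend(item_id='B007HZ9S7C', num=5)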
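
Since the task asks for predicted ratings and not only similar items, one possible next step is to turn text similarities into a rating estimate: the predicted rating of a target item is the similarity-weighted average of the ratings the user gave to other items. The sketch below assumes a precomputed item-item similarity matrix. The function name, the toy matrix, and the fallback value are hypothetical:

```python
import numpy as np


def predict_rating(target_idx, rated_idx, ratings, sim, fallback):
    """Similarity-weighted average of the user's ratings on other items."""
    weights = sim[target_idx, rated_idx]
    if weights.sum() <= 0:
        # No textual overlap with anything the user rated: use a default value.
        return fallback
    return float(np.dot(weights, ratings) / weights.sum())


# Made-up cosine similarities between four items (symmetric, ones on the diagonal).
sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
rated_idx = np.array([1, 2])       # items the user rated in the training set
ratings = np.array([5.0, 2.0])     # the corresponding ratings
pred_item0 = predict_rating(0, rated_idx, ratings, sim, fallback=4.36)
pred_item3 = predict_rating(3, rated_idx, ratings, sim, fallback=4.36)
print(pred_item0, pred_item3)
```

Predictions produced this way can be compared with the true test ratings using the same RMSE measure as in the collaborative filtering section. The global training mean is a natural choice for the fallback.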

# Conclusions