# Building a user-based recommendation model for Amazon

## DESCRIPTION

The dataset provided contains movie reviews given by Amazon customers. Reviews were given between May 1996 and July 2014.

### Data Dictionary
- user_id – IDs of the 4848 customers who provided ratings
- Movie1 to Movie206 – the 206 movies rated by these 4848 distinct users

### Data Considerations
- Not every user has watched every movie, so most user-movie pairs have no rating. These missing values are represented by NA.
- Ratings are on a scale of -1 to 10, where -1 is the lowest rating and 10 is the best.
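
To make the layout concrete, here is a minimal sketch of such a wide ratings table and of how sparse it is. The three users, three movies and their ratings are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows are users, columns are movies, NaN = not rated
toy = pd.DataFrame(
    {
        "user_id": ["U1", "U2", "U3"],
        "Movie1": [5.0, np.nan, np.nan],
        "Movie2": [np.nan, 2.0, 4.0],
        "Movie3": [np.nan, np.nan, np.nan],
    }
)

ratings_only = toy.drop(columns="user_id")
n_cells = ratings_only.size                      # users x movies
n_rated = int(ratings_only.notna().sum().sum())  # observed ratings
sparsity = 1 - n_rated / n_cells                 # share of empty cells
print(f"{n_rated} of {n_cells} cells rated, sparsity = {sparsity:.2%}")
```

The same three lines applied to the real table would show how few of the 4848 x 206 cells actually hold a rating.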

### Analysis Task
- ##### Exploratory Data Analysis:
>- Which movies have maximum views/ratings?
>- What is the average rating for each movie? Define the top 5 movies with the maximum ratings.
>- Define the top 5 movies with the least audience.
- ##### Recommendation Model: Some of the movies have not been watched and are therefore not rated by the users. Amazon would like to take this as an opportunity and build a machine learning recommendation algorithm that predicts the ratings for each of the users.
>- Divide the data into training and test data
>- Build a recommendation model on training data
>- Make predictions on the test data
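
The first of these steps can be sketched in plain Python as a hold-out split over observed (user, movie, rating) triples. The triples, the `holdout_split` helper and its `test_size` and `seed` values are illustrative assumptions, not the split used later in this notebook.

```python
import random

# Hypothetical (user, movie, rating) triples; only observed ratings are listed
observed = [
    ("U1", "Movie1", 5.0), ("U1", "Movie2", 4.0),
    ("U2", "Movie2", 2.0), ("U2", "Movie3", 3.0),
    ("U3", "Movie1", 1.0), ("U3", "Movie3", 5.0),
    ("U4", "Movie2", 4.0), ("U4", "Movie3", 2.0),
]

def holdout_split(triples, test_size=0.25, seed=10):
    """Shuffle a copy of the triples and hold out the first `test_size` share as the test set."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(observed)
print(len(train), "training ratings,", len(test), "test ratings")
```

A model is then fitted on `train` only, and its predictions are compared with the held-out ratings in `test`.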

### Importing necessary libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

#show all columns when displaying wide DataFrames
pd.set_option('display.max_columns',None)

### Loading the data

In [2]:
amazon_data=pd.read_csv('Amazon - Movies and TV Ratings.csv')
amazon_data.head()

Unnamed: 0,user_id,Movie1,Movie2,Movie3,Movie4,Movie5,Movie6,Movie7,Movie8,Movie9,Movie10,Movie11,Movie12,Movie13,Movie14,Movie15,Movie16,Movie17,Movie18,Movie19,Movie20,Movie21,Movie22,Movie23,Movie24,Movie25,Movie26,Movie27,Movie28,Movie29,Movie30,Movie31,Movie32,Movie33,Movie34,Movie35,Movie36,Movie37,Movie38,Movie39,Movie40,Movie41,Movie42,Movie43,Movie44,Movie45,Movie46,Movie47,Movie48,Movie49,Movie50,Movie51,Movie52,Movie53,Movie54,Movie55,Movie56,Movie57,Movie58,Movie59,Movie60,Movie61,Movie62,Movie63,Movie64,Movie65,Movie66,Movie67,Movie68,Movie69,Movie70,Movie71,Movie72,Movie73,Movie74,Movie75,Movie76,Movie77,Movie78,Movie79,Movie80,Movie81,Movie82,Movie83,Movie84,Movie85,Movie86,Movie87,Movie88,Movie89,Movie90,Movie91,Movie92,Movie93,Movie94,Movie95,Movie96,Movie97,Movie98,Movie99,Movie100,Movie101,Movie102,Movie103,Movie104,Movie105,Movie106,Movie107,Movie108,Movie109,Movie110,Movie111,Movie112,Movie113,Movie114,Movie115,Movie116,Movie117,Movie118,Movie119,Movie120,Movie121,Movie122,Movie123,Movie124,Movie125,Movie126,Movie127,Movie128,Movie129,Movie130,Movie131,Movie132,Movie133,Movie134,Movie135,Movie136,Movie137,Movie138,Movie139,Movie140,Movie141,Movie142,Movie143,Movie144,Movie145,Movie146,Movie147,Movie148,Movie149,Movie150,Movie151,Movie152,Movie153,Movie154,Movie155,Movie156,Movie157,Movie158,Movie159,Movie160,Movie161,Movie162,Movie163,Movie164,Movie165,Movie166,Movie167,Movie168,Movie169,Movie170,Movie171,Movie172,Movie173,Movie174,Movie175,Movie176,Movie177,Movie178,Movie179,Movie180,Movie181,Movie182,Movie183,Movie184,Movie185,Movie186,Movie187,Movie188,Movie189,Movie190,Movie191,Movie192,Movie193,Movie194,Movie195,Movie196,Movie197,Movie198,Movie199,Movie200,Movie201,Movie202,Movie203,Movie204,Movie205,Movie206
0,A3R5OBKS7OM2IR,5.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,AH3QC2PC1VTGP,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,A3LKP6WPMP9UKX,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,AVIY68KEPQ5ZD,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,A1CV1WROP5KTTW,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [3]:
amazon_data.shape

(4848, 207)

In [4]:
amazon_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4848 entries, 0 to 4847
Columns: 207 entries, user_id to Movie206
dtypes: float64(206), object(1)
memory usage: 7.7+ MB


### Checking for missing values

In [7]:
missing_val=[cols for cols in amazon_data.columns if amazon_data[cols].isnull().sum()>0]
for cols in missing_val:
    #isnull().mean() is a fraction, so scale by 100 before labelling it as a percentage
    print('{} has {}% of missing values in it'.format(cols,np.round(amazon_data[cols].isnull().mean()*100,2)))

Movie1 has 99.98% of missing values in it
Movie2 has 99.98% of missing values in it
Movie3 has 99.98% of missing values in it
Movie4 has 99.96% of missing values in it
Movie5 has 99.4% of missing values in it
Movie6 has 99.98% of missing values in it
Movie7 has 99.98% of missing values in it
Movie8 has 99.98% of missing values in it
Movie9 has 99.98% of missing values in it
Movie10 has 99.98% of missing values in it
Movie11 has 99.96% of missing values in it
Movie12 has 99.9% of missing values in it
Movie13 has 99.98% of missing values in it
Movie14 has 99.98% of missing values in it
Movie15 has 99.98% of missing values in it
Movie16 has 93.4% of missing values in it
Movie17 has 99.98% of missing values in it
Movie18 has 99.98% of missing values in it
Movie19 has 99.96% of missing values in it
Movie20 has 99.98% of missing values in it
Movie21 has 99.98% of missing values in it
Movie22 has 99.96% of missing values in it
Movie23 has 99.94% of missing values in it
Movie24 has 99.9% of missing values in it
...

In [81]:
amazon_data.describe().T.head()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Movie1,1.0,5.0,,5.0,5.0,5.0,5.0,5.0
Movie2,1.0,5.0,,5.0,5.0,5.0,5.0,5.0
Movie3,1.0,2.0,,2.0,2.0,2.0,2.0,2.0
Movie4,2.0,5.0,0.0,5.0,5.0,5.0,5.0,5.0
Movie5,29.0,4.103448,1.496301,1.0,4.0,5.0,5.0,5.0


### Exploratory Data Analysis Task:

#### Which movies have maximum views/ratings?

In [8]:
#movie with maximum views?
amazon_data.describe().T['count'].sort_values(ascending=False)[:1].to_frame()

Unnamed: 0,count
Movie127,2313.0


_**Movie127** has the most views, with **2313** ratings._

In [9]:
#movie with the highest sum of ratings
amazon_data.drop('user_id',axis=1).sum().sort_values(ascending=False)[:1].to_frame()

Unnamed: 0,0
Movie127,9511.0


_**Movie127** also has the highest total rating, with a sum of **9511.0**._
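
The number of ratings and the sum of ratings happen to pick the same movie here, but they measure different things and need not agree. A minimal sketch with two made-up movies (`MovieA` and `MovieB` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column per movie, NaN = not rated
toy = pd.DataFrame(
    {
        "MovieA": [5.0, 5.0, np.nan, np.nan],  # 2 ratings, sum 10
        "MovieB": [2.0, 2.0, 2.0, 3.0],        # 4 ratings, sum 9
    }
)

n_ratings = toy.count()   # non-missing entries per movie ("views")
rating_sum = toy.sum()    # NaN is skipped by default

most_rated = n_ratings.idxmax()
highest_sum = rating_sum.idxmax()
print("most rated:", most_rated, "| highest sum:", highest_sum)
```

`count()` answers "which movie was rated most often", while `sum()` mixes popularity with how high the ratings are.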

#### What is the average rating for each movie? Define the top 5 movies with the maximum ratings.

In [10]:
#average rating of each movie
amazon_data.drop('user_id',axis=1).mean()

Movie1      5.000000
Movie2      5.000000
Movie3      2.000000
Movie4      5.000000
Movie5      4.103448
              ...   
Movie202    4.333333
Movie203    3.000000
Movie204    4.375000
Movie205    4.628571
Movie206    4.923077
Length: 206, dtype: float64

In [11]:
#top 5 movies with the maximum ratings
amazon_data.drop('user_id',axis=1).mean().sort_values(ascending=False)[:5].to_frame()

Unnamed: 0,0
Movie1,5.0
Movie55,5.0
Movie131,5.0
Movie132,5.0
Movie133,5.0


_**Movie1, Movie55, Movie131, Movie132, Movie133** are the top 5 movies by average rating, each with a mean of **5.0**. A mean of 5.0 can come from very few ratings: Movie1, for example, has only one._
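
Because a single 5-star rating is enough to reach the top of this ranking, one possible refinement is to rank only movies with a minimum number of ratings. The sketch below uses made-up data, and the `min_ratings` threshold of 3 is an arbitrary assumption rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column per movie, NaN = not rated
toy = pd.DataFrame(
    {
        "MovieA": [5.0, np.nan, np.nan, np.nan],  # one 5-star rating
        "MovieB": [5.0, 4.0, 5.0, 4.0],           # mean 4.5 over four ratings
        "MovieC": [3.0, 4.0, np.nan, 2.0],        # mean 3.0 over three ratings
    }
)

stats = pd.DataFrame({"mean": toy.mean(), "count": toy.count()})

min_ratings = 3  # assumed threshold, tune to the dataset
top_unfiltered = stats["mean"].idxmax()
top_filtered = stats.loc[stats["count"] >= min_ratings, "mean"].idxmax()
print("unfiltered:", top_unfiltered, "| with threshold:", top_filtered)
```

Without the threshold the single-rating movie wins; with it, the ranking reflects movies that several users agree on.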

#### Define the top 5 movies with the least audience.

In [12]:
amazon_data.describe().T['count'].sort_values(ascending=True)[:5].to_frame()

Unnamed: 0,count
Movie1,1.0
Movie71,1.0
Movie145,1.0
Movie69,1.0
Movie68,1.0


_**Movie1, Movie71, Movie145, Movie69, Movie68** are the 5 movies with the smallest audience, each with a single rating._

### Recommendation Model:

#### Importing libraries required for model building

In [54]:
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import train_test_split
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import cross_validate

In [21]:
movies_data = amazon_data.melt(id_vars = amazon_data.columns[0],value_vars=amazon_data.columns[1:],var_name="Movies",value_name="Rating")
movies_data.head()

Unnamed: 0,user_id,Movies,Rating
0,A3R5OBKS7OM2IR,Movie1,5.0
1,AH3QC2PC1VTGP,Movie1,
2,A3LKP6WPMP9UKX,Movie1,
3,AVIY68KEPQ5ZD,Movie1,
4,A1CV1WROP5KTTW,Movie1,
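
The melted table still contains one row per user-movie pair, including the unrated ones. As an alternative to filling those rows, they can be dropped so that only observed ratings remain. This is a minimal sketch on made-up data, not the approach taken in the cells that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical wide table: two users, two movies
wide = pd.DataFrame(
    {
        "user_id": ["U1", "U2"],
        "Movie1": [5.0, np.nan],
        "Movie2": [np.nan, 3.0],
    }
)

# Wide to long: one row per (user, movie) pair
long = wide.melt(id_vars="user_id", var_name="Movies", value_name="Rating")

# Keep only the pairs that actually have a rating
observed = long.dropna(subset=["Rating"]).reset_index(drop=True)
print(observed)
```

Here `long` has four rows and `observed` has two, one per real rating.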


In [22]:
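#reading the dataset; rating_scale matches the documented range of -1 to 10
rd = Reader(rating_scale=(-1,10))
#note: fillna(0) turns every unrated user-movie pair into an explicit rating of 0
data = Dataset.load_from_df(movies_data.fillna(0),reader=rd)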

In [34]:
#splitting the dataset
train_data,test_data = train_test_split(data,random_state=10)

In [35]:
#creating SVD (Singular Value Decomposition) object
svd = SVD()
svd.fit(train_data)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f6a8ba54610>
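
For intuition about what such a model learns, here is a small NumPy sketch of biased matrix factorization trained with stochastic gradient descent, where a rating is estimated as global mean + user bias + item bias + dot product of user and item factors. It is a simplified illustration under assumed hyperparameters and made-up data, not Surprise's actual implementation.

```python
import numpy as np

# Hypothetical toy data: (user index, item index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

def fit_biased_mf(triples, n_users, n_items, n_factors=2,
                  n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu, user/item biases and latent factors by SGD on squared error."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))  # global mean rating
    bu = np.zeros(n_users)                           # user biases
    bi = np.zeros(n_items)                           # item biases
    P = rng.normal(0, 0.1, (n_users, n_factors))     # user factors
    Q = rng.normal(0, 0.1, (n_items, n_factors))     # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()  # use the pre-update user factors for Q
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def rmse(model, triples):
    """Root mean squared error of the model on the given triples."""
    mu, bu, bi, P, Q = model
    errs = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in triples]
    return float(np.sqrt(np.mean(np.square(errs))))

untrained = fit_biased_mf(toy_ratings, 3, 3, n_epochs=0)
trained = fit_biased_mf(toy_ratings, 3, 3)
print("RMSE before training:", rmse(untrained, toy_ratings))
print("RMSE after training: ", rmse(trained, toy_ratings))
```

The comparison is on the training ratings only, so it shows that the model fits the data, not that it generalizes.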

In [38]:
preds = svd.test(test_data)

In [61]:
preds[:5]

[Prediction(uid='A3EJVZ5LCBP61X', iid='Movie153', r_ui=0.0, est=0.0069878860151298015, details={'was_impossible': False}),
 Prediction(uid='A3LCW5H7XK5B22', iid='Movie158', r_ui=0.0, est=0.2817704759799421, details={'was_impossible': False}),
 Prediction(uid='A3KGYTO6CF8MGF', iid='Movie99', r_ui=0.0, est=0.006140894879915938, details={'was_impossible': False}),
 Prediction(uid='A2869WCKQA2NNW', iid='Movie40', r_ui=0.0, est=0.005542913593096648, details={'was_impossible': False}),
 Prediction(uid='A1897EYRUTMKKI', iid='Movie9', r_ui=0.0, est=0.007377839720217094, details={'was_impossible': False})]

In [62]:
#checking for rmse accuracy
accuracy.rmse(preds)

RMSE: 0.2781


0.2781421541974895

In [63]:
#checking for mae accuracy
accuracy.mae(preds)

MAE:  0.0409


0.040881756868222
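
These error values should be read with the zero-filling in mind: when almost every target is a filled-in 0, predicting roughly 0 everywhere already gives a small average error, even if the real ratings are predicted badly. The arithmetic can be sketched with made-up numbers, assuming 98 filled cells and 2 real ratings:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Hypothetical (true, predicted) pairs
filled = [(0.0, 0.0)] * 98       # zero-filled cells, predicted as 0
real = [(5.0, 0.0), (4.0, 0.0)]  # real ratings, also predicted as 0

overall_rmse, overall_mae = rmse(filled + real), mae(filled + real)
real_rmse, real_mae = rmse(real), mae(real)
print(f"all cells:    RMSE={overall_rmse:.4f}  MAE={overall_mae:.4f}")
print(f"real ratings: RMSE={real_rmse:.4f}  MAE={real_mae:.4f}")
```

Evaluating on the real ratings alone gives a more honest picture of how well the model recovers what users actually rated.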

In [64]:
u_id='A3GUOAL2MF4IYT'
iid = 'Movie197'
r_ui = 5.0
svd.predict(u_id,iid,r_ui,verbose= True)

user: A3GUOAL2MF4IYT item: Movie197   r_ui = 5.00   est = -0.01   {'was_impossible': False}


Prediction(uid='A3GUOAL2MF4IYT', iid='Movie197', r_ui=5.0, est=-0.008017731571423764, details={'was_impossible': False})

In [65]:
cross_validate(svd, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2823  0.2818  0.2830  0.2824  0.0005  
MAE (testset)     0.0425  0.0430  0.0427  0.0427  0.0002  
Fit time          45.30   44.89   45.62   45.27   0.30    
Test time         3.69    4.32    4.36    4.12    0.31    


{'test_rmse': array([0.28229661, 0.28182134, 0.28304456]),
 'test_mae': array([0.04252376, 0.04300189, 0.04271283]),
 'fit_time': (45.30189538002014, 44.89337229728699, 45.62289333343506),
 'test_time': (3.6939566135406494, 4.317623853683472, 4.362195253372192)}

In [69]:
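def repeat(ml_type,df,min_,max_):
    #note: min_ and max_ are accepted but never used; Reader() falls back to its default 1-5 rating scale
    rd = Reader()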
    data = Dataset.load_from_df(df,reader=rd)
    print(cross_validate(ml_type, data, measures = ['RMSE', 'MAE'], cv = 3, verbose = True))
    print("________"*20)
    u_id = 'A3GUOAL2MF4IYT'
    iid = 'Movie197'
    r_ui = 5.0
    print(ml_type.predict(u_id,iid,r_ui,verbose=True))
    print("________"*20)
    print()

In [70]:
amazon_data= amazon_data.iloc[:3000, :50]
movies_data = amazon_data.melt(id_vars = amazon_data.columns[0],value_vars=amazon_data.columns[1:],var_name="Movies",value_name="Rating")

In [71]:
repeat(SVD(),movies_data.fillna(0),-1,10)
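#numeric_only=True restricts mean/median to the Rating column (the id columns are strings)
repeat(SVD(),movies_data.fillna(movies_data.mean(numeric_only=True)),-1,10)
repeat(SVD(),movies_data.fillna(movies_data.median(numeric_only=True)),-1,10)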

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0254  1.0306  1.0315  1.0292  0.0027  
MAE (testset)     1.0110  1.0130  1.0135  1.0125  0.0011  
Fit time          7.19    7.53    7.26    7.33    0.15    
Test time         0.48    0.48    1.21    0.72    0.34    
{'test_rmse': array([1.02542717, 1.03056233, 1.03154472]), 'test_mae': array([1.01101869, 1.0129967 , 1.01350196]), 'fit_time': (7.1880035400390625, 7.530601739883423, 7.2649760246276855), 'test_time': (0.4804365634918213, 0.4755537509918213, 1.2091670036315918)}
________________________________________________________________________________________________________________________________________________________________
user: A3GUOAL2MF4IYT item: Movie197   r_ui = 5.00   est = 1.00   {'was_impossible': False}
user: A3GUOAL2MF4IYT item: Movie197   r_ui = 5.00   est = 1.00   {'was_impossible': False}
____________________________________________
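
The estimate of exactly 1.00 is consistent with clipping. As far as the Surprise documentation describes it, `Reader()` without arguments assumes a rating scale of 1 to 5, and `predict` clips estimates to that scale by default. With zero-filled targets and estimates clipped to at least 1, an error of about 1 per cell is expected, which matches the RMSE and MAE above. The helper below is a hypothetical stand-in that mimics the clipping step only:

```python
def clip_to_scale(estimate, rating_scale=(1, 5)):
    """Clamp an estimate to the (low, high) bounds of the rating scale."""
    low, high = rating_scale
    return min(high, max(low, estimate))

raw_estimate = -0.01  # roughly what the earlier model predicted for this pair

print(clip_to_scale(raw_estimate))            # default 1-5 scale
print(clip_to_scale(raw_estimate, (-1, 10)))  # scale documented for this dataset
```

Passing `rating_scale=(min_, max_)` to `Reader` inside `repeat` would keep estimates on the -1 to 10 scale instead.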

In [78]:
#using grid search to find the best hyperparameter values for n_epochs, lr_all and n_factors
from surprise.model_selection import GridSearchCV

In [79]:
params = {'n_epochs':[20,30],
          'lr_all':[0.005,0.001],
          'n_factors':[50,100]}

In [80]:
GS = GridSearchCV(SVD,params,measures=['rmse','mae'],cv=3)
GS.fit(data)

In [81]:
GS.best_score

{'rmse': 0.2798426572690132, 'mae': 0.041522030061130095}

In [82]:
print(GS.best_params["rmse"])
print(GS.best_score["rmse"])

{'n_epochs': 30, 'lr_all': 0.005, 'n_factors': 50}
0.2798426572690132
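
A grid search of this kind tries every combination of the listed values. The plain-Python sketch below only enumerates the combinations for the same grid; `expand_grid` is a hypothetical helper, and no model is trained here.

```python
from itertools import product

params = {"n_epochs": [20, 30],
          "lr_all": [0.005, 0.001],
          "n_factors": [50, 100]}

def expand_grid(grid):
    """Return one dict per combination of the values in the grid."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

combos = expand_grid(params)
n_folds = 3  # same cv value as in the grid search above
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
for combo in combos:
    print(combo)
```

With two values per hyperparameter there are 8 combinations, so 3-fold cross-validation requires 24 fits, which is why the grid is kept small.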
