In [1]:
# imports
import pandas as pd 
import numpy as np
import warnings

from surprise import SVD, Reader, Dataset, accuracy
from surprise.model_selection import train_test_split


# settings
from IPython.display import HTML
pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
warnings.filterwarnings('ignore')


# data viz imports
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# load data
df1_phaseIII = pd.read_csv('Data/phaseII_cleaned.csv')

In [2]:
# glimpse at top 10 records
df1_phaseIII.head(10)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,2000-07-30
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,1996-11-08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,2005-01-25
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,2017-11-13
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,2011-05-18
5,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,18,3.5,2016-02-11
6,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,19,4.0,2000-08-08
7,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,21,3.5,2014-08-09
8,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,27,3.0,2000-07-04
9,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,31,5.0,1996-12-13


In [3]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df1_phaseIII[['userId', 'movieId', 'rating']], reader)

In [4]:
train, testset = train_test_split(data, test_size=0.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(train)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.8725


0.8724509666249268

After training and testing the algorithm on our data, the RMSE on the held-out test set is about 0.87. RMSE is an error measured in rating units, not an accuracy percentage: the typical prediction error is a little under one rating point on the 1 to 5 scale configured above, and lower is better. This is a reasonable starting point for the model.

We can also use a few more methods to evaluate the model and try to improve its performance.
    
__Cross Validation__

Cross validation reduces the bias that a single split between training set and test set can introduce. The data is divided into n folds (five by default in Surprise, and `cv=5` in this notebook); the model is trained on n-1 folds and evaluated on the remaining fold. This step is repeated until every fold has served as the test set once, and the scores are then summarized across folds.
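The splitting step described above can be sketched in plain Python. This is a minimal illustration, not Surprise's implementation: the helper name `k_fold_indices` is hypothetical, and the sketch does not shuffle the data, whereas a library splitter typically does.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Ten hypothetical samples split into five folds.
folds = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each rating is used for evaluation once and for training n-1 times.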

In [5]:
from surprise.model_selection import cross_validate
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8765  0.8679  0.8842  0.8745  0.8639  0.8734  0.0070  
MAE (testset)     0.6730  0.6664  0.6779  0.6723  0.6656  0.6710  0.0046  
Fit time          5.10    4.43    4.92    5.03    4.95    4.89    0.23    
Test time         0.12    0.24    0.12    0.20    0.14    0.16    0.05    


{'test_rmse': array([0.8764662 , 0.86794284, 0.88421791, 0.87451768, 0.86389767]),
 'test_mae': array([0.67295638, 0.66644103, 0.67794137, 0.67231241, 0.66556339]),
 'fit_time': (5.097747564315796,
  4.434219121932983,
  4.924458980560303,
  5.033463716506958,
  4.948090553283691),
 'test_time': (0.12167644500732422,
  0.24139666557312012,
  0.1186826229095459,
  0.20081210136413574,
  0.14165520668029785)}

<div class="alert alert-warning">
<H3>Findings </H3>

According to the results above, the mean RMSE across the five folds is 0.8734 (standard deviation 0.0070), which is very close to the 0.8725 obtained from the single train/test split without cross validation. The mean MAE is 0.6710.

</div>

__GridSearchCV__

Grid search is a hyperparameter tuning procedure: it evaluates every combination of the candidate values supplied for each hyperparameter and reports the combination with the best cross-validated score. This is significant because the performance of the model depends heavily on the hyperparameter values specified.

The hyperparameters here are:

- `n_epochs`: the number of iterations of the stochastic gradient descent (SGD) procedure. Default is 20.
- `lr_all`: the learning rate for all parameters. Default is 0.005.
- `reg_all`: the regularization term for all parameters. Default is 0.02.
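To make the size of the search concrete, the sketch below enumerates the combinations a grid search would try. It uses only the standard library and the candidate values searched in this notebook; it illustrates the enumeration and is not how Surprise implements `GridSearchCV` internally.

```python
from itertools import product

# Candidate values searched in this notebook.
param_grid = {
    'n_epochs': [5, 10],
    'lr_all': [0.002, 0.005],
    'reg_all': [0.4, 0.6],
}

# One dict per combination: the Cartesian product of the value lists.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))
for params in combinations:
    print(params)
```

With 2 x 2 x 2 = 8 combinations and 3-fold cross validation, the search fits the model 24 times. Note that none of the combinations contains the default values `n_epochs=20` or `reg_all=0.02`.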

In [6]:
from surprise.model_selection import GridSearchCV
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

In [7]:
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8935655773949193
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


As the results above show, the best values within the searched grid are:

n_epochs = 10
lr_all = 0.005
reg_all = 0.4

Using these values in our SVD algorithm, we re-evaluate the model on the same train/test split.

In [8]:

algo1 = SVD(n_epochs=10, lr_all=0.005, reg_all=0.4)
# Train the algorithm on the trainset, and predict ratings for the testset
algo1.fit(train)
predictions = algo1.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.8862


0.8862073935894843

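__Inference:__
As the results above show, the parameters selected by GridSearchCV did not improve the model: the test RMSE rose from 0.8725 with the default parameters to 0.8862, an increase of about 0.014 (lower RMSE is better). A likely reason is that the search grid did not include the default values (`n_epochs=20`, `reg_all=0.02`), so the best combination within the grid is still worse than the default configuration. A wider grid that includes the defaults would be needed before concluding that tuning helps.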

__Model Accuracy Analysis__

Prediction accuracy metrics answer the question of how well the recommender estimates a user's preference.

Decision support metrics answer how well the recommender finds good items.

Rank accuracy metrics look at how well the recommender estimates relative preference.

Metrics considered:
1. Fraction of Concordant Pairs
2. Mean Absolute Error
3. Mean Squared Error
4. Root Mean Squared Error

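__Fraction of Concordant Pairs__

FCP looks at the fraction of all pairs of items rated by the same user that the model places in the correct order. A pair is concordant when the item with the higher true rating also receives the higher predicted rating.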
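The idea can be sketched on toy data as follows. This is a simplified version that pools pairs across users; Surprise's `fcp` aggregates per-user counts and may differ slightly. The function name and the toy ratings are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fraction_of_concordant_pairs(rows):
    """rows: iterable of (user, true_rating, predicted_rating) tuples."""
    by_user = defaultdict(list)
    for user, true_r, est in rows:
        by_user[user].append((true_r, est))

    concordant = discordant = 0
    for ratings in by_user.values():
        # Only pairs of items rated by the same user are compared.
        for (r_i, est_i), (r_j, est_j) in combinations(ratings, 2):
            if r_i == r_j:
                continue  # no true preference between the two items
            if (r_i - r_j) * (est_i - est_j) > 0:
                concordant += 1
            else:
                discordant += 1  # wrong order, or tied predictions
    total = concordant + discordant
    return concordant / total if total else 0.0


# Hypothetical toy data, not taken from the notebook's dataset.
toy = [
    ('u1', 5.0, 4.6), ('u1', 3.0, 3.9), ('u1', 1.0, 4.1),
    ('u2', 4.0, 3.2), ('u2', 2.0, 3.5),
]
toy_fcp = fraction_of_concordant_pairs(toy)
print(toy_fcp)  # 0.5: two concordant pairs, two discordant pairs
```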


In [9]:
from surprise.accuracy import fcp
print(fcp(predictions))

FCP:  0.6702
0.6701777245366562


According to this metric, the model places about 67.02% of the pairs in the correct order. For comparison, a random ordering would be expected to score roughly 50%.

__Mean Absolute Error__

$$\text{MAE} = \frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} |r_{ui} - \hat{r}_{ui}|$$

This gives the mean absolute error between the predicted ratings and the actual ratings.
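As a worked illustration of the formula, the sketch below computes the MAE by hand on a few hypothetical (true rating, predicted rating) pairs. The helper name is an assumption for this example; on real predictions, Surprise's `accuracy.mae` could be called the same way as `mse` and `rmse`.

```python
def mean_absolute_error(pairs):
    """pairs: iterable of (true_rating, predicted_rating) tuples."""
    pairs = list(pairs)
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical ratings and predictions, for illustration only.
toy_pairs = [(4.0, 3.5), (2.5, 3.0), (5.0, 4.0), (3.0, 3.0)]
toy_mae = mean_absolute_error(toy_pairs)
print(toy_mae)  # (0.5 + 0.5 + 1.0 + 0.0) / 4 = 0.5
```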

In [10]:
from surprise.accuracy import mse
print(mse(predictions))

MSE: 0.7854
0.7853635444526672


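Note that this cell computes the mean squared error (MSE), which averages the squared errors, rather than the MAE. The MSE is about 0.7854, measured in squared rating units. It is an error, not an accuracy percentage, and lower is better.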



__Root Mean Squared Error__

$$\text{RMSE} = \sqrt{\frac{1}{|\hat{R}|} \sum_{\hat{r}_{ui} \in \hat{R}} (r_{ui} - \hat{r}_{ui})^2}$$

As the formula shows, this is the square root of the mean squared error computed above.

In [11]:
from surprise.accuracy import rmse
print(rmse(predictions))

RMSE: 0.8862
0.8862073935894843


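This is indeed the square root of the MSE: $\sqrt{0.7854} \approx 0.8862$. An RMSE of about 0.89 means the typical prediction error is a little under one rating point; it is not an accuracy percentage.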
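The relationship between the two metrics can be checked numerically with the MSE and RMSE values printed in this notebook. The variable names below are chosen for this example only.

```python
import math

# Values copied from the MSE and RMSE outputs in this notebook.
mse_value = 0.7853635444526672
rmse_value = 0.8862073935894843

# RMSE should equal the square root of MSE, up to floating-point rounding.
recomputed_rmse = math.sqrt(mse_value)
print(recomputed_rmse)
print(math.isclose(recomputed_rmse, rmse_value, rel_tol=1e-6))
```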

__Conclusion__

We built a collaborative filtering based recommender system with a test RMSE of about 0.87 using the default SVD parameters (about 0.89 with the grid-searched parameters). Rating predictions of this quality can help the business retain customers and increase customer engagement on the entertainment platform.

There is definitely scope for improvement by incorporating more information about users and movies.

__Recommendation__

When a user uses the IMDb database for the first time, we recommend that the system ask the user to rate a set of movies. This helps mitigate the cold start problem, in which we do not have enough information about a new user's preferences and recommendation becomes very difficult.

In [12]:
len(predictions)

25209