## Movie Recommendation System Models

In [28]:
#Import necessary libraries
import numpy as np
import pandas as pd
from surprise import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV


In [29]:
#Import data into a DataFrame and drop unnecessary columns 
df = pd.read_csv('cleaneddata', index_col=False)
df2 = df[['userId', 'movieId', 'rating']]



In [30]:
df2.head()

   userId  movieId  rating
0       1        1     4.0
1       1        3     4.0
2       1        6     4.0
3       1       47     5.0
4       1       50     5.0


In [31]:
#Check sparsity of matrix
numratings = len(df2['rating'])
numusers = len(df2['userId'].unique())
numitems = len(df2['movieId'].unique())

sparse = 1 - (numratings / (numusers*numitems))
sparse

0.9829821819213007

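Our matrix is very sparse: about 98.3% of the possible user-movie pairs have no rating, which could negatively impact our model results. To reduce the sparsity, we will keep only users who have rated more than 200 movies, and then only movies that have more than 10 ratings.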

In [32]:
#Keep users with more than 200 ratings, then movies with more than 10 ratings
df3 = df2.groupby('userId').filter(lambda x : len(x)>200)
df4 = df3.groupby('movieId').filter(lambda x : len(x)>10)
#Save the filtered data
df4.to_csv('d4')
df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47232 entries, 0 to 100817
Data columns (total 3 columns):
userId     47232 non-null int64
movieId    47232 non-null int64
rating     47232 non-null float64
dtypes: float64(1), int64(2)
memory usage: 1.4 MB
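
The order of the two filters matters: movie counts are taken only after the low-activity users have been removed. The following is a minimal sketch of the same two-stage filter on toy data, using `collections.Counter` rather than pandas. The tuples, the thresholds (2 and 1) and the helper name `filter_ratings` are made up for illustration.

```python
from collections import Counter

def filter_ratings(ratings, min_user_count, min_movie_count):
    """Keep users with more than min_user_count ratings, then movies
    with more than min_movie_count ratings among the remaining rows."""
    user_counts = Counter(user for user, _, _ in ratings)
    by_user = [r for r in ratings if user_counts[r[0]] > min_user_count]
    # Movie counts are computed on the user-filtered rows only
    movie_counts = Counter(movie for _, movie, _ in by_user)
    return [r for r in by_user if movie_counts[r[1]] > min_movie_count]

# Toy (userId, movieId, rating) tuples, not taken from the real data
toy = [
    (1, 10, 4.0), (1, 20, 3.5), (1, 30, 5.0), (1, 40, 2.5),
    (2, 10, 2.0), (2, 20, 4.5), (2, 30, 3.0),
    (3, 10, 1.0),
]
kept = filter_ratings(toy, min_user_count=2, min_movie_count=1)
print(kept)
```

User 3 is dropped by the first filter, and movie 40 is dropped by the second because only one remaining user rated it.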


In [33]:
#Check sparsity of new matrix
numratings = len(df4['rating'])
numusers = len(df4['userId'].unique())
numitems = len(df4['movieId'].unique())

sparse = 1 - (numratings / (numusers*numitems))
sparse

0.7931695867508024

This result looks much better: sparsity has dropped from about 98.3% to about 79.3%, well under 95%, so we hope to see improvements in our SVD models. Let's begin modeling and investigate our results.
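
The sparsity value can be checked by hand. The sketch below wraps the formula in a hypothetical helper, `sparsity`, and plugs in the 47232 ratings reported by `df4.info()`. The user and movie counts (133 and 1717) are the ones the train set reports later in this notebook, assumed here to match the filtered DataFrame.

```python
def sparsity(num_ratings, num_users, num_items):
    """Fraction of user-item pairs that have no rating."""
    return 1 - num_ratings / (num_users * num_items)

# 47232 ratings from df4.info(); 133 users and 1717 movies are the
# train-set counts, assumed to equal the counts in df4
filtered = sparsity(47232, 133, 1717)
print(round(filtered, 4))
```

With these counts the result agrees with the value printed above, which supports the assumption that every user and movie in `df4` also appears in the train set.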

In [34]:
#Look at the distribution of ratings again 
df4.rating.value_counts()

4.0    12991
3.0     9065
3.5     6970
5.0     5598
4.5     4265
2.0     3239
2.5     2911
1.0     1030
1.5      716
0.5      447
Name: rating, dtype: int64

In [35]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df4, reader) 

In [36]:
#Train test split 
trainset, testset = train_test_split(data, test_size=.2)

In [37]:
# Print number of users and items for the trainset
print('Number of users in train set : ', trainset.n_users, '\n')
print('Number of items in train set : ', trainset.n_items, '\n')


Number of users in train set :  133 

Number of items in train set :  1717 



### Baseline Model

In [38]:
#Instantiate a baseline model using KNNBaseline
baseline = KNNBaseline(random_state=42)

In [39]:
#Fit model on the trainset 
baseline.fit(trainset)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7fc4c3e42668>

In [40]:
#Predict on the test set 
baselinepreds = baseline.test(testset)

In [41]:
#Check RMSE and MAE results 
accuracy.rmse(baselinepreds)
accuracy.mae(baselinepreds)

RMSE: 0.8047
MAE:  0.6205


0.6204802792565017
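
For reference, RMSE is the square root of the mean squared error and MAE is the mean absolute error between actual and estimated ratings. The sketch below computes both by hand. The helper names and the rating pairs are hypothetical and are not taken from our test set.

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, estimated) pairs."""
    return math.sqrt(sum((actual - est) ** 2 for actual, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (actual, estimated) pairs."""
    return sum(abs(actual - est) for actual, est in pairs) / len(pairs)

# Hypothetical (actual rating, estimated rating) pairs
toy_pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(rmse(toy_pairs), mae(toy_pairs))
```

Because the errors are squared before averaging, RMSE penalizes large misses more heavily than MAE, which is why it is always at least as large as MAE.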

In [42]:
#Run 3-fold cross validation on the data and print results 
cv_baseline = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8203  0.8097  0.8156  0.8152  0.0044  
MAE (testset)     0.6258  0.6219  0.6283  0.6253  0.0026  
Fit time          0.09    0.06    0.07    0.07    0.01    
Test time         1.23    1.02    1.17    1.14    0.09    


In [43]:
# Print out the per-fold scores and timings
for i in cv_baseline.items():
    print(i)

('test_rmse', array([0.82034983, 0.80970616, 0.81562078]))
('test_mae', array([0.62575976, 0.62187136, 0.62831242]))
('fit_time', (0.0889582633972168, 0.06191706657409668, 0.06855916976928711))
('test_time', (1.2310552597045898, 1.0223491191864014, 1.1656522750854492))


In [44]:
#Find the average test RMSE from the 3-Fold cross-validation
np.mean(cv_baseline['test_rmse'])

0.8152255891327548
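
The `Mean` and `Std` columns of the cross-validation table can be reproduced from the per-fold scores. This sketch uses the three baseline RMSE values printed above and the standard-library `statistics` module. The population standard deviation (dividing by the number of folds) is the variant that matches the table.

```python
from statistics import mean, pstdev

# Per-fold test RMSE for the baseline model, copied from the output above
fold_rmse = [0.82034983, 0.80970616, 0.81562078]

print(round(mean(fold_rmse), 4))
print(round(pstdev(fold_rmse), 4))
```

A small standard deviation relative to the mean suggests the score is stable across folds.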

### Model 1

In [47]:
#Set parameters for GridSearch on SVD model 
parameters = {'n_factors': [20, 50, 80],
             'reg_all': [0.04, 0.06],
             'n_epochs': [10, 20, 30],
             'lr_all': [.002, .005, .01]}
gridsvd = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)

In [48]:
#Fit SVD model on data
gridsvd.fit(data)

In [49]:
#Print best score and best parameters from the GridSearch 
print(gridsvd.best_score)
print(gridsvd.best_params)

{'rmse': 0.7985941345340264, 'mae': 0.6113250586868515}
{'rmse': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}, 'mae': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}}
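
To get a sense of how much work the grid search does, we can enumerate the parameter combinations with `itertools.product`. This is only a sketch of the expansion, not Surprise's internal code. Each combination is then fitted once per cross-validation fold.

```python
from itertools import product

# Same grid as in the GridSearch cell above
parameters = {'n_factors': [20, 50, 80],
              'reg_all': [0.04, 0.06],
              'n_epochs': [10, 20, 30],
              'lr_all': [.002, .005, .01]}

# One dict per combination of values
combos = [dict(zip(parameters, values)) for values in product(*parameters.values())]
print(len(combos))

# Check whether the best combination sits on the edge of the grid
best = {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}
on_upper_edge = all(best[name] == max(values) for name, values in parameters.items())
print(on_upper_edge)
```

Every best value is the largest one we tried, so a wider grid might find a better model.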


In [50]:
#Reinstantiate the model with the best parameters from GridSearch
svdtuned = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)
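
As a reminder of what these parameters control: Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, and `n_factors` sets the length of those vectors. The sketch below applies that rule with made-up numbers and 3 factors instead of 80. The function name `svd_estimate` is hypothetical.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors):
    """Estimated rating: mean + biases + dot product of the factor vectors."""
    interaction = sum(p * q for p, q in zip(user_factors, item_factors))
    return global_mean + user_bias + item_bias + interaction

# Made-up values for illustration only
est = svd_estimate(global_mean=3.5,
                   user_bias=0.2,
                   item_bias=-0.1,
                   user_factors=[0.1, -0.3, 0.2],
                   item_factors=[0.4, 0.1, -0.2])
print(round(est, 2))
```

The remaining parameters shape training rather than prediction: `lr_all` is the learning rate, `reg_all` the regularization strength, and `n_epochs` the number of passes over the training data.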

In [52]:
#Fit and predict the model 
svdtuned.fit(trainset)
svdpreds = svdtuned.test(testset)

In [53]:
#Print RMSE and MAE results 
accuracy.rmse(svdpreds)
accuracy.mae(svdpreds)

RMSE: 0.7912
MAE:  0.6091


0.6091493720204718

In [54]:
#Perform 3-Fold cross validation for SVD tuned model
cv_svd_tuned = cross_validate(svdtuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8109  0.8053  0.8095  0.8086  0.0024  
MAE (testset)     0.6195  0.6173  0.6230  0.6199  0.0023  
Fit time          2.19    1.97    2.10    2.09    0.09    
Test time         0.13    0.21    0.12    0.15    0.04    


In [55]:
#Display the results for all 3-folds 
for i in cv_svd_tuned.items():
    print(i)

('test_rmse', array([0.81092214, 0.80529148, 0.80948999]))
('test_mae', array([0.61946174, 0.6173417 , 0.62297539]))
('fit_time', (2.189645767211914, 1.9661180973052979, 2.1042981147766113))
('test_time', (0.1334850788116455, 0.20677471160888672, 0.11876106262207031))


In [56]:
# Print out the average RMSE score for the test set
np.mean(cv_svd_tuned['test_rmse'])

0.8085678681642622

### Model 2

In [58]:
# Set parameters to be used in KNN models 
knn_params = {'name': ['cosine', 'pearson'],
              'user_based':[True, False], 
              'min_support':[True, False],
            'min_k' : [1, 2]}
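
One caveat about this grid: Surprise reads similarity settings such as `name`, `user_based` and `min_support` from a `sim_options` dictionary, and its `GridSearchCV` expects them nested under that key. The fit logs below report the `msd` similarity, Surprise's default, which suggests that the flat grid above only varies `min_k`. The following is a sketch of the nested layout and of how it expands into candidate settings. The `expand` helper and the integer `min_support` values are illustrative assumptions, not part of the original analysis.

```python
from itertools import product

# Nested form: similarity settings live under 'sim_options',
# while min_k stays a top-level algorithm argument
knn_params_nested = {
    'sim_options': {'name': ['cosine', 'pearson'],
                    'user_based': [True, False],
                    'min_support': [1, 5]},  # integer thresholds chosen for illustration
    'min_k': [1, 2],
}

def expand(grid):
    """Expand the nested grid into a list of keyword-argument dicts."""
    sim = grid['sim_options']
    sim_combos = [dict(zip(sim, values)) for values in product(*sim.values())]
    return [{'sim_options': s, 'min_k': k} for s in sim_combos for k in grid['min_k']]

candidates = expand(knn_params_nested)
print(len(candidates))
print(candidates[0])
```

Each candidate has the shape that a KNN constructor takes, for example `KNNBasic(**candidates[0])`, and the nested dictionary itself can be passed to `GridSearchCV` as the parameter grid.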

In [59]:
# Apply GridSearch to the KNN Basic model to identify the best parameters
gsknnbasic = GridSearchCV(KNNBasic, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbasic.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
... (the same two lines repeat for each fit; output truncated)

In [60]:
#Display the best scores and parameters from GridSearch
print(gsknnbasic.best_score)
print(gsknnbasic.best_params)

{'rmse': 0.8851060365662176, 'mae': 0.6857397819503204}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [61]:
#Reinstantiate the model with the best parameters from GridSearch
#min_k is an algorithm argument, so it is passed outside sim_options
knnbasic_tuned = KNNBasic(min_k=1,
                          sim_options={'name': 'cosine',
                                       'user_based': True,
                                       'min_support': True})

In [62]:
#Fit on the train set and predict on the test set 
knnbasic_tuned.fit(trainset)
knnbpreds = knnbasic_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [63]:
#Print RMSE and MAE results 
accuracy.rmse(knnbpreds)
accuracy.mae(knnbpreds)

RMSE: 0.9024
MAE:  0.7023


0.702316876127284

Another way to evaluate the model is to perform cross validation and print the resulting scores. We will explore this below:

In [64]:
#Conduct cross validation for the KNNBasic tuned model 
cv_knn_basic = cross_validate(knnbasic_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9032  0.9057  0.9042  0.9044  0.0010  
MAE (testset)     0.7018  0.7050  0.7023  0.7030  0.0014  
Fit time          0.03    0.04    0.03    0.03    0.00    
Test time         0.89    1.07    0.83    0.93    0.10    


In [45]:
# Print out results from the cross-validation
for i in cv_knn_basic.items():
    print(i)

('test_rmse', array([0.90296961, 0.90670276, 0.90767707]))
('test_mae', array([0.70033263, 0.70655386, 0.70547007]))
('fit_time', (0.02528095245361328, 0.02921605110168457, 0.028800010681152344))
('test_time', (0.7427136898040771, 0.7487399578094482, 0.811363935470581))


In [65]:
# Print out the average RMSE score for the test set
np.mean(cv_knn_basic['test_rmse'])

0.904374182050126

The average test RMSE across the cross-validation folds is approximately 0.90, similar to the RMSE of 0.9024 we found on the test set above. This is worse than our baseline model, which had a cross-validated RMSE of about 0.815.

### Model 3

In [66]:
#Apply KNN GridSearch parameters on the KNNBaseline model 
gsknnbaseline = GridSearchCV(KNNBaseline, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbaseline.fit(data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
... (the same three lines repeat for each fit; output truncated)

In [67]:
#Display the best score and the best parameters 
print(gsknnbaseline.best_score)
print(gsknnbaseline.best_params)

{'rmse': 0.8164795256796018, 'mae': 0.6254832352772427}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 2}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 2}}


In [68]:
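#Reinstantiate the model with the best parameters from GridSearch
#Note: Surprise reads min_k as a KNNBaseline argument, not from sim_options,
#so this model runs with the default min_k=1; pass min_k=2 directly to apply the GridSearch value
knnbaseline_tuned = KNNBaseline(sim_options={'name': 'cosine',
                                       'user_based': True,
                                       'min_support':True,
                                       'min_k':2, })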

In [69]:
#Fit the trainset and predict on the test set 
knnbaseline_tuned.fit(trainset)
knnbaselinepreds = knnbaseline_tuned.test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [70]:
#Print the RMSE and MAE scores 
accuracy.rmse(knnbaselinepreds)
accuracy.mae(knnbaselinepreds)

RMSE: 0.8086
MAE:  0.6237


0.6236677283081172

In [71]:
#Perform 3 fold cross validation 
cv_knn_baseline = cross_validate(knnbaseline_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8125  0.8270  0.8211  0.8202  0.0060  
MAE (testset)     0.6246  0.6309  0.6311  0.6289  0.0030  
Fit time          0.13    0.09    0.07    0.10    0.02    
Test time         1.24    1.39    2.04    1.56    0.35    


In [72]:
#Show the mean RMSE score for the test set 
np.mean(cv_knn_baseline['test_rmse'])

0.8201952569984375

### Model 4

In [73]:
#Apply GridSearch to the KNNWithMeans model 
gsknnWM = GridSearchCV(KNNWithMeans, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnWM.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
... (the same two lines repeat for each fit; output truncated)

In [74]:
#Display the best score and best parameters from GridSearch 
print(gsknnWM.best_score)
print(gsknnWM.best_params)

{'rmse': 0.8235921862187469, 'mae': 0.6314015203262237}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [75]:
#Reinstantiate the model with the best parameters
#min_k is an algorithm argument, so it is passed outside sim_options
knnwm_tuned = KNNWithMeans(min_k=1,
                           sim_options={'name': 'cosine',
                                        'user_based': True,
                                        'min_support': True})

In [76]:
#Fit on the trainset, predict on the testset 
knnwm_tuned.fit(trainset)
knnwmpreds = knnwm_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [77]:
#Print RMSE and MAE results
accuracy.rmse(knnwmpreds)
accuracy.mae(knnwmpreds)

RMSE: 0.8143
MAE:  0.6279


0.6278975721496151

In [78]:
#Perform 3-Fold cross validation on KNNWithMeans model 
cv_knn_wm = cross_validate(knnwm_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8286  0.8267  0.8216  0.8256  0.0030  
MAE (testset)     0.6368  0.6324  0.6303  0.6331  0.0027  
Fit time          0.03    0.04    0.04    0.03    0.00    
Test time         0.86    0.93    0.85    0.88    0.04    


In [79]:
#Print the average RMSE score for the test set 
np.mean(cv_knn_wm['test_rmse'])

0.8256196206316081

### All results

In [80]:
#Create a dictionary for each model's results
baselineresult = {'model': 'baseline','RMSE': accuracy.rmse(baselinepreds), 'MAE': accuracy.mae(baselinepreds), 'CV': np.mean(cv_baseline['test_rmse'])}
svdresult = {'model':'svd', 'RMSE': accuracy.rmse(svdpreds), 'MAE': accuracy.mae(svdpreds), 'CV': np.mean(cv_svd_tuned['test_rmse'])}
knnbasicresult = {'model':'knnbasic','RMSE': accuracy.rmse(knnbpreds), 'MAE': accuracy.mae(knnbpreds), 'CV': np.mean(cv_knn_basic['test_rmse'])}
knnbaselineresult = {'model':'knnbaseline','RMSE': accuracy.rmse(knnbaselinepreds), 'MAE': accuracy.mae(knnbaselinepreds), 'CV': np.mean(cv_knn_baseline['test_rmse'])}
knnwmresult = {'model':'knnwm','RMSE': accuracy.rmse(knnwmpreds), 'MAE': accuracy.mae(knnwmpreds), 'CV': np.mean(cv_knn_wm['test_rmse'])}

RMSE: 0.8047
MAE:  0.6205
RMSE: 0.7912
MAE:  0.6091
RMSE: 0.9024
MAE:  0.7023
RMSE: 0.8086
MAE:  0.6237
RMSE: 0.8143
MAE:  0.6279


In [63]:
#Combine all the results into a list 
result_list = [baselineresult, svdresult, knnbasicresult, knnbaselineresult, knnwmresult]

In [64]:
#Transform the results list into a DataFrame
df_results_updated = pd.DataFrame(result_list)
df_results_updated = df_results_updated.set_index('model')

In [65]:
#Display the results for all of the models 
df_results_updated

                 RMSE       MAE        CV
model
baseline     0.805064  0.619421  0.814845
svd          0.789376  0.607647  0.803589
knnbasic     0.892590  0.695836  0.905783
knnbaseline  0.807736  0.621723  0.817969
knnwm        0.815110  0.627821  0.826288
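
To make the comparison explicit, we can rank the models by their cross-validated RMSE. This sketch copies the values from the table above into a plain dictionary so that it runs without pandas. With the DataFrame itself, `df_results_updated.sort_values('CV')` would give the same ordering.

```python
# Values copied from the results table above
results = {
    'baseline':    {'RMSE': 0.805064, 'MAE': 0.619421, 'CV': 0.814845},
    'svd':         {'RMSE': 0.789376, 'MAE': 0.607647, 'CV': 0.803589},
    'knnbasic':    {'RMSE': 0.892590, 'MAE': 0.695836, 'CV': 0.905783},
    'knnbaseline': {'RMSE': 0.807736, 'MAE': 0.621723, 'CV': 0.817969},
    'knnwm':       {'RMSE': 0.815110, 'MAE': 0.627821, 'CV': 0.826288},
}

# Lower RMSE is better, so sort ascending by the CV column
ranked = sorted(results, key=lambda name: results[name]['CV'])
for name in ranked:
    print(name, results[name]['CV'])
```

The tuned SVD model has the lowest error on all three measures, and it is the only model that beats the baseline.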


In [81]:
#Display the new user ratings 
userrating

[{'userId': 1000,
  'movieId': 3745,
  'title': 'Titan A.E. (2000)',
  'rating': '3'},
 {'userId': 1000,
  'movieId': 260,
  'title': 'Star Wars: Episode IV - A New Hope (1977)',
  'rating': '4'},
 {'userId': 1000, 'movieId': 3702, 'title': 'Mad Max (1979)', 'rating': '2'}]

The new user has rated three movies: Titan A.E., Star Wars: Episode IV - A New Hope, and Mad Max. We can now generate predictions for this user by adding these ratings to our data and refitting the model.

In [82]:
#Add new ratings to our DataFrame (DataFrame.append was removed in pandas 2.0)
new_ratings_df = pd.concat([df4, pd.DataFrame(userrating)], ignore_index=True, sort=False)

In [83]:
#Drop the 'title' column so that our dataframe is ready to be put into surprise
new_ratings_df.drop(['title'], axis=1, inplace=True)

In [84]:
#Investigate new DataFrame
new_ratings_df.head()

   userId  movieId rating
0       1        1      4
1       1        3      4
2       1        6      4
3       1       47      5
4       1       50      5


Now we will redo the same modeling process as above in order to find predictions for the above movies.

In [85]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(new_ratings_df, reader)

In [86]:
#Train test split 
trainset, testset = train_test_split(data, test_size=.2)

In [88]:
#Reinstantiate the model with the best parameters from GridSearch and fit on the trainset 
svdtuned2 = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)
svdtuned2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fc4c9553828>

In [89]:
#Find predictions for the three movies that user with userId=1000 just rated
print(svdtuned2.predict(1000,3745))
print(svdtuned2.predict(1000,260))
print(svdtuned2.predict(1000,3702))


user: 1000       item: 3745       r_ui = None   est = 3.15   {'was_impossible': False}
user: 1000       item: 260        r_ui = None   est = 4.15   {'was_impossible': False}
user: 1000       item: 3702       r_ui = None   est = 3.38   {'was_impossible': False}


We now have predictions for the three movies that user 1000 has rated so far. This shows how new ratings can be added to the data and how the refitted model produces predictions for them.
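
Since we know what user 1000 actually rated these movies, we can compare the estimates with the real ratings. The sketch below copies the ratings and the `est` values from the outputs above, with the string ratings converted to floats.

```python
import math

# Ratings entered by user 1000 and the estimates printed above, keyed by movieId
rated = {3745: 3.0, 260: 4.0, 3702: 2.0}
predicted = {3745: 3.15, 260: 4.15, 3702: 3.38}

# Positive error means the model overestimated the rating
errors = {movie: predicted[movie] - rated[movie] for movie in rated}
user_mae = sum(abs(e) for e in errors.values()) / len(errors)
user_rmse = math.sqrt(sum(e ** 2 for e in errors.values()) / len(errors))
worst = max(errors, key=lambda movie: abs(errors[movie]))

print(errors)
print(round(user_mae, 2), round(user_rmse, 2), worst)
```

The model overestimates all three ratings, and the largest miss is for Mad Max (movieId 3702), the movie this user rated lowest. With only three ratings these figures are anecdotal rather than a measure of model quality.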

In [None]:
##Post-modeling EDA

# #Need a double loop over every userId and movieId to collect all estimates
# estimated = svdtuned.predict(1,1)[3]
# estimated

In [None]:
svdtuned.predict(1,1)
#Do post-modeling EDA on userId, movieId and estimated rating
#Is there popularity bias?
#Could plot a histogram of the prediction errors to see their distribution (leave to end)