## Movie Recommendation System Models

In [1]:
#Import necessary libraries
import numpy as np
import pandas as pd
from surprise import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV


In [2]:
#Import data into a DataFrame and drop unnecessary columns 
df = pd.read_csv('cleaneddata', index_col=False)
df2 = df[['userId', 'movieId', 'rating']]

In [3]:
#Look at first 5 rows of new Dataframe 
df2.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [4]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df2, reader) 

In [5]:
#Train test split with test size of 20%
trainset, testset = train_test_split(data, test_size=.2)

In [6]:
# Print number of users and items for the trainset
print('Number of users in train set : ', trainset.n_users, '\n')
print('Number of items in train set : ', trainset.n_items, '\n')

Number of users in train set :  133 

Number of items in train set :  1717 



### Baseline Model

Our baseline model will be a KNNBaseline model with its default hyperparameters.
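For orientation, the user-based KNNBaseline estimate, as described in the Surprise documentation, can be written as below. Here $b_{ui}$ is the baseline estimate built from the global mean $\mu$ and the user and item biases, and $N_i^k(u)$ denotes the (at most) $k$ nearest neighbours of user $u$ that have rated item $i$. This is a restatement of the library's documented formula, not something derived in this notebook.

```latex
\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\,(r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)},
\qquad b_{ui} = \mu + b_u + b_i
```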

In [7]:
#Instantiate a baseline model using KNNBaseline
baseline = KNNBaseline(random_state=42)

In [8]:
#Fit model on the trainset 
baseline.fit(trainset)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f889b0d83c8>

In [9]:
#Predict on the test set 
baselinepreds = baseline.test(testset)

In [10]:
#Check RMSE and MAE results 
accuracy.rmse(baselinepreds)
accuracy.mae(baselinepreds)

RMSE: 0.8237
MAE:  0.6310


0.6309938770172038

The RMSE for our baseline is 0.824 and the MAE is 0.631. These are the values we will look to improve by trying different models and tuning hyperparameters in later models.
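As a reminder of what these two metrics measure, here is a minimal sketch that computes RMSE and MAE by hand on a handful of made-up (true rating, estimated rating) pairs. The pairs and the function names are illustrative assumptions and are not taken from the notebook's data.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - e) ** 2 for r, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - e) for r, e in pairs) / len(pairs)

# Hypothetical (true, estimated) pairs, for illustration only
toy_pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (2.0, 3.0)]

toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(f"RMSE: {toy_rmse:.4f}  MAE: {toy_mae:.4f}")
```

Because the errors are squared before averaging, RMSE penalises large misses more heavily than MAE, which is why RMSE is never smaller than MAE on the same predictions.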

In [11]:
#Run 3-fold cross validation on the data and print results 
cv_baseline = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8115  0.8101  0.8239  0.8152  0.0062  
MAE (testset)     0.6217  0.6219  0.6312  0.6249  0.0044  
Fit time          0.07    0.06    0.06    0.06    0.00    
Test time         1.17    1.00    1.04    1.07    0.07    


In [12]:
# Print out the RMSE score for each fold 
for i in cv_baseline.items():
    print(i)

('test_rmse', array([0.81145173, 0.81008594, 0.82394321]))
('test_mae', array([0.62168751, 0.62194509, 0.63118392]))
('fit_time', (0.06551671028137207, 0.05542707443237305, 0.05528998374938965))
('test_time', (1.1677052974700928, 1.0024900436401367, 1.0357470512390137))


In [13]:
#Find the average test RMSE from the 3-Fold cross-validation
np.mean(cv_baseline['test_rmse'])

0.8151602907739218

Our 3-fold cross validation has an average test RMSE of approximately 0.815. We will look to reduce this RMSE in future models.
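The summary columns of the cross-validation table can be reproduced from the per-fold scores. The sketch below uses the three baseline fold RMSE values printed above and assumes the reported `Std` is a population standard deviation, which is consistent with the printed 0.0062.

```python
import statistics

# Per-fold test RMSE values copied from the baseline cross-validation output
fold_rmse = [0.81145173, 0.81008594, 0.82394321]

mean_rmse = statistics.fmean(fold_rmse)
# Population standard deviation (assumption: matches the table's "Std" column)
std_rmse = statistics.pstdev(fold_rmse)

print(f"mean={mean_rmse:.4f} std={std_rmse:.4f}")
```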

### Model 1

Our first model will be an SVD model tuned with GridSearch. We will first apply GridSearch to identify the parameters that minimize our RMSE, and then re-instantiate the model with these parameters so that we can fit on the trainset and predict on the test set.
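For reference, the SVD algorithm in Surprise predicts a rating from a global mean, user and item biases, and a dot product of latent factor vectors, and learns these by minimizing a regularized squared error with stochastic gradient descent. In the notation below (taken from the Surprise documentation), `n_factors` is the dimension of $p_u$ and $q_i$, `reg_all` plays the role of $\lambda$, `lr_all` is the SGD learning rate, and `n_epochs` is the number of passes over the training ratings.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u

\min_{b,\,p,\,q} \sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2
  + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
```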

In [14]:
#Set parameters for GridSearch on SVD model 
parameters = {'n_factors': [20, 50, 80],
             'reg_all': [0.04, 0.06],
             'n_epochs': [10, 20, 30],
             'lr_all': [.002, .005, .01]}
gridsvd = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)
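To get a feel for the size of this search, the sketch below expands the same grid into its individual candidate settings using only the standard library. It illustrates what an exhaustive grid search has to evaluate; it is not how Surprise implements the search internally.

```python
from itertools import product

# Same grid as in the cell above
parameters = {'n_factors': [20, 50, 80],
              'reg_all': [0.04, 0.06],
              'n_epochs': [10, 20, 30],
              'lr_all': [0.002, 0.005, 0.01]}

# Expand the grid into one dict per candidate combination
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]
n_combos = len(combos)

print(n_combos)   # 3 * 2 * 3 * 3 candidate settings
print(combos[0])  # first candidate
```

Each candidate is cross-validated in turn, so the total number of model fits is the number of candidates multiplied by the number of folds.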

In [15]:
#Fit SVD model on data
gridsvd.fit(data)

In [16]:
#Print best score and best parameters from the GridSearch 
print(gridsvd.best_score)
print(gridsvd.best_params)

{'rmse': 0.7975568149129171, 'mae': 0.6105187911574741}
{'rmse': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}, 'mae': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}}


In [17]:
#Reinstantiate the model with the best parameters from GridSearch
svdtuned = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)

In [18]:
#Fit and predict the model 
svdtuned.fit(trainset)
svdpreds = svdtuned.test(testset)

In [19]:
#Print RMSE and MAE results 
accuracy.rmse(svdpreds)
accuracy.mae(svdpreds)

RMSE: 0.8100
MAE:  0.6186


0.6186098688021788

Both our RMSE (0.810) and MAE (0.619) are lower than the baseline's (0.824 and 0.631). The SVD model likely also benefits from our earlier filtering of users and movies with few ratings, which reduced the sparsity of the ratings matrix.
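To make the sparsity point concrete, here is a small sketch on made-up data. It assumes the filtering was done on the number of ratings per user; the helper names and the threshold are hypothetical and do not come from the cleaning notebook.

```python
from collections import Counter

def sparsity(pairs):
    """Share of empty cells in the user x item matrix implied by (user, item) pairs."""
    users = {u for u, _ in pairs}
    items = {i for _, i in pairs}
    return 1 - len(pairs) / (len(users) * len(items))

def keep_active_users(pairs, min_ratings):
    """Keep only the pairs of users with at least min_ratings ratings (hypothetical filter)."""
    counts = Counter(u for u, _ in pairs)
    return [(u, i) for u, i in pairs if counts[u] >= min_ratings]

# Toy (user, item) pairs: user 'C' has rated a single, otherwise unrated item
toy_pairs = [('A', 1), ('A', 2), ('A', 3), ('B', 1), ('B', 2), ('C', 4)]

before = sparsity(toy_pairs)
after = sparsity(keep_active_users(toy_pairs, min_ratings=2))
print(f"sparsity before: {before:.3f}, after: {after:.3f}")
```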

In [20]:
#Perform 3-Fold cross validation for SVD tuned model
cv_svd_tuned = cross_validate(svdtuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8141  0.8172  0.7997  0.8103  0.0077  
MAE (testset)     0.6229  0.6235  0.6152  0.6205  0.0038  
Fit time          2.11    2.18    2.12    2.14    0.03    
Test time         0.15    0.11    0.10    0.12    0.02    


In [21]:
#Display the results for all 3-folds 
for i in cv_svd_tuned.items():
    print(i)

('test_rmse', array([0.81409697, 0.81722173, 0.79965133]))
('test_mae', array([0.6228691 , 0.62354316, 0.61516316]))
('fit_time', (2.107988119125366, 2.1823620796203613, 2.1179049015045166))
('test_time', (0.1483759880065918, 0.10505509376525879, 0.10311293601989746))


In [22]:
# Print out the average RMSE score for the test set
np.mean(cv_svd_tuned['test_rmse'])

0.8103233436968088

Our 3-fold cross validation test RMSE was approximately 0.810, a slight decrease from the baseline model's 3-fold cross validation result of 0.815.

### Model 2

Our next model will look at the KNNBasic algorithm to see if the results improve. We will again use GridSearch to look at different parameters in hopes of reducing our RMSE score.
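As a rough picture of what a basic user-based KNN prediction does, the sketch below estimates a rating as the similarity-weighted average of the ratings given by the most similar users. It is a simplified toy version written for illustration, with made-up users and ratings; it is not Surprise's implementation.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity computed over the items both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def knn_basic_estimate(ratings, user, item, k=2):
    """Similarity-weighted average of the k most similar users' ratings of item."""
    neighbours = [(cosine_sim(ratings[user], r), r[item])
                  for other, r in ratings.items()
                  if other != user and item in r]
    top = sorted(neighbours, reverse=True)[:k]
    total_sim = sum(s for s, _ in top)
    if total_sim <= 0:
        return None  # no usable neighbours
    return sum(s * x for s, x in top) / total_sim

# Hypothetical ratings: user -> {item: rating}
toy_ratings = {'u1': {'a': 4, 'b': 5},
               'u2': {'a': 4, 'b': 5, 'c': 3},
               'u3': {'a': 5, 'b': 4, 'c': 5}}

estimate = knn_basic_estimate(toy_ratings, 'u1', 'c')
print(round(estimate, 3))
```

The estimate lands closer to the rating of `u2`, whose tastes match `u1` exactly on the shared items, than to that of `u3`.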

In [23]:
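# Set parameters to be used in KNN models.
# Caution: Surprise reads similarity settings from a nested 'sim_options' dict, so the
# top-level 'name', 'user_based' and 'min_support' keys below are not applied; the
# grid-search logs accordingly report the default msd similarity.
knn_params = {'name': ['cosine', 'pearson'],
              'user_based': [True, False],
              'min_support': [True, False],
              'min_k': [1, 2]}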
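Surprise's documentation nests similarity settings under a `sim_options` key inside the parameter grid, while algorithm arguments such as `min_k` stay at the top level. A grid in that shape might look like the sketch below. The name `knn_params_nested` and the integer `min_support` values are illustrative assumptions, and results from such a grid would differ from the ones recorded in this notebook.

```python
# Sketch only: an alternative grid layout, not the grid used for the results below
knn_params_nested = {
    'sim_options': {
        'name': ['cosine', 'pearson'],   # similarity measure
        'user_based': [True, False],     # user-user vs item-item similarities
        'min_support': [1, 5],           # assumed integer thresholds for common ratings
    },
    'min_k': [1, 2],                     # algorithm argument, kept at the top level
}

# It would be passed to the search in the same way as the flat grid, for example:
# GridSearchCV(KNNBasic, knn_params_nested, measures=['rmse', 'mae'], cv=3)
```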

In [24]:
# Apply GridSearch to the KNN Basic model to identify the best parameters
gsknnbasic = GridSearchCV(KNNBasic, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbasic.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
[The same two log lines repeat for every grid-search fit; the notebook output is truncated here.]

In [25]:
#Display the best scores and parameters from GridSearch
print(gsknnbasic.best_score)
print(gsknnbasic.best_params)

{'rmse': 0.8848209983882409, 'mae': 0.685499813396917}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [26]:
#Reinstantiate the model with the best parameters from GridSearch 
knnbasic_tuned = KNNBasic(min_k=1,  # min_k is a constructor argument, not a sim_options key
                          sim_options={'name': 'cosine',
                                       'user_based': True,
                                       'min_support': True})

In [27]:
#Fit on the train set and predict on the test set 
knnbasic_tuned.fit(trainset)
knnbpreds = knnbasic_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [28]:
#Print RMSE and MAE results 
accuracy.rmse(knnbpreds)
accuracy.mae(knnbpreds)

RMSE: 0.9107
MAE:  0.7073


0.7073344086754061

Our RMSE value has increased from the baseline. We will perform a 3-fold cross validation to see if the result improves.

In [29]:
#Conduct cross validation for the KNNBasic tuned model 
cv_knn_basic = cross_validate(knnbasic_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9103  0.8992  0.9051  0.9049  0.0045  
MAE (testset)     0.7073  0.6999  0.7050  0.7040  0.0031  
Fit time          0.03    0.03    0.03    0.03    0.00    
Test time         1.03    0.75    0.86    0.88    0.11    


In [30]:
# Print out results from the cross-validation
for i in cv_knn_basic.items():
    print(i)

('test_rmse', array([0.91029943, 0.89918385, 0.90507235]))
('test_mae', array([0.70727711, 0.69985665, 0.70498138]))
('fit_time', (0.02928900718688965, 0.02982187271118164, 0.030117034912109375))
('test_time', (1.0250730514526367, 0.7518370151519775, 0.8638858795166016))


In [31]:
# Print out the average RMSE score for the test set
np.mean(cv_knn_basic['test_rmse'])

0.9048518746576839

The average test RMSE from our cross validation is approximately 0.905, similar to the test-set RMSE of 0.911 found above. This is higher than our baseline model's cross-validated RMSE of 0.815.

### Model 3

The next model we will explore is the KNN Baseline model. Note that our baseline model was a version of the KNNBaseline model; here, in contrast, we tune hyperparameters to see if the result can be improved.

In [32]:
#Apply KNN GridSearch parameters on the KNNBaseline model 
gsknnbaseline = GridSearchCV(KNNBaseline, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbaseline.fit(data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
[The same three log lines repeat for every grid-search fit; the notebook output is truncated here.]

In [33]:
#Display the best score and the best parameters 
print(gsknnbaseline.best_score)
print(gsknnbaseline.best_params)

{'rmse': 0.8163205138663231, 'mae': 0.6258784951016335}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [34]:
#Reinstantiate the model with the best parameters from GridSearch 
knnbaseline_tuned = KNNBaseline(min_k=1,  # min_k is a constructor argument, not a sim_options key
                                sim_options={'name': 'cosine',
                                             'user_based': True,
                                             'min_support': True})

In [35]:
#Fit the trainset and predict on the test set 
knnbaseline_tuned.fit(trainset)
knnbaselinepreds = knnbaseline_tuned.test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [36]:
#Print the RMSE and MAE scores 
accuracy.rmse(knnbaselinepreds)
accuracy.mae(knnbaselinepreds)

RMSE: 0.8268
MAE:  0.6336


0.6336194159994631

Our RMSE is 0.827, which is a slight increase from our baseline's 0.824. We will explore the 3-fold cross validation as another check to see if the results differ.

In [37]:
#Perform 3 fold cross validation 
cv_knn_baseline = cross_validate(knnbaseline_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8307  0.8224  0.8101  0.8211  0.0085  
MAE (testset)     0.6344  0.6309  0.6242  0.6299  0.0042  
Fit time          0.09    0.07    0.13    0.10    0.02    
Test time         1.50    1.80    1.39    1.56    0.17    


In [38]:
#Show the mean RMSE score for the test set 
np.mean(cv_knn_baseline['test_rmse'])

0.8210675122603274

Our 3-fold cross validation result is approximately 0.821, a slight increase from the baseline model's 0.815.

### Model 4

Our final model will look at the KNN With Means algorithm, and apply a GridSearch similar to the KNN models above to tune our hyperparameters further.
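For comparison with the other KNN variants, the user-based KNNWithMeans estimate documented in Surprise centres each neighbour's rating on that neighbour's mean rating $\mu_v$ and adds the result to the target user's own mean $\mu_u$. As before, $N_i^k(u)$ denotes the (at most) $k$ nearest neighbours of user $u$ that have rated item $i$; the formula is restated from the library documentation.

```latex
\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)\,(r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \operatorname{sim}(u, v)}
```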

In [39]:
#Apply GridSearch to the KNNWithMeans model 
gsknnWM = GridSearchCV(KNNWithMeans, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnWM.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
[The same two log lines repeat for every grid-search fit; the notebook output is truncated here.]

In [40]:
#Display the best score and best parameters from GridSearch 
print(gsknnWM.best_score)
print(gsknnWM.best_params)

{'rmse': 0.8233806291898969, 'mae': 0.6305076587109498}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [41]:
#Reinstantiate the model with the best parameters
knnwm_tuned = KNNWithMeans(min_k=1,  # min_k is a constructor argument, not a sim_options key
                           sim_options={'name': 'cosine',
                                        'user_based': True,
                                        'min_support': True})

In [42]:
#Fit on the trainset, predict on the testset 
knnwm_tuned.fit(trainset)
knnwmpreds = knnwm_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [43]:
#Print RMSE and MAE results
accuracy.rmse(knnwmpreds)
accuracy.mae(knnwmpreds)

RMSE: 0.8338
MAE:  0.6388


0.6387569121922141

Our RMSE result is 0.834, compared to the baseline's 0.824, so a slight increase.

In [44]:
#Perform 3-Fold cross validation on KNNWithMeans model 
cv_knn_wm = cross_validate(knnwm_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8183  0.8386  0.8225  0.8265  0.0087  
MAE (testset)     0.6282  0.6425  0.6296  0.6334  0.0064  
Fit time          0.05    0.10    0.10    0.08    0.02    
Test time         1.99    1.80    2.02    1.94    0.10    


In [45]:
#Print the average RMSE score for the test set 
np.mean(cv_knn_wm['test_rmse'])

0.8264569583546688

Under 3-fold cross validation the model also performs worse than the baseline: 0.826 is higher than our baseline result of 0.815.

### All results

Below we will take a look at all of the results for our models and compare them to the baseline.

In [46]:
#Create a dictionary for each model's results
baselineresult = {'model': 'baseline','RMSE': accuracy.rmse(baselinepreds), 'MAE': accuracy.mae(baselinepreds), 'CV': np.mean(cv_baseline['test_rmse'])}
svdresult = {'model':'svd', 'RMSE': accuracy.rmse(svdpreds), 'MAE': accuracy.mae(svdpreds), 'CV': np.mean(cv_svd_tuned['test_rmse'])}
knnbasicresult = {'model':'knnbasic','RMSE': accuracy.rmse(knnbpreds), 'MAE': accuracy.mae(knnbpreds), 'CV': np.mean(cv_knn_basic['test_rmse'])}
knnbaselineresult = {'model':'knnbaseline','RMSE': accuracy.rmse(knnbaselinepreds), 'MAE': accuracy.mae(knnbaselinepreds), 'CV': np.mean(cv_knn_baseline['test_rmse'])}
knnwmresult = {'model':'knnwm','RMSE': accuracy.rmse(knnwmpreds), 'MAE': accuracy.mae(knnwmpreds), 'CV': np.mean(cv_knn_wm['test_rmse'])}

RMSE: 0.8237
MAE:  0.6310
RMSE: 0.8100
MAE:  0.6186
RMSE: 0.9107
MAE:  0.7073
RMSE: 0.8268
MAE:  0.6336
RMSE: 0.8338
MAE:  0.6388


In [47]:
#Combine all the results into a list 
result_list = [baselineresult, svdresult, knnbasicresult, knnbaselineresult, knnwmresult]

In [48]:
#Transform the results lists into a DataFrame 
df_results_updated = pd.DataFrame.from_dict(result_list, orient='columns')
df_results_updated = df_results_updated.set_index('model')

In [49]:
#Display the results for all of the models 
df_results_updated

model,RMSE,MAE,CV
baseline,0.823742,0.630994,0.81516
svd,0.809976,0.61861,0.810323
knnbasic,0.910705,0.707334,0.904852
knnbaseline,0.826787,0.633619,0.821068
knnwm,0.83384,0.638757,0.826457


Unfortunately, all of our tuned KNN models perform worse than the baseline; only the SVD model improves on it. On the test set, SVD had an RMSE of 0.810, compared to the baseline's 0.824, and an MAE of 0.619, which is lower than the baseline's 0.631. Under 3-fold cross validation, SVD performs only slightly better than the baseline, with a mean RMSE of 0.810 compared to 0.815. As a result, we will disregard the KNN models for the remainder of our analysis and move forward with our tuned SVD model, as it performed best in terms of RMSE, MAE and CV.
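The comparison can also be checked programmatically. The sketch below copies the values from the results table into a plain dictionary and picks the model with the lowest score on each metric; the variable names are chosen here for illustration.

```python
# Values copied from the results table above
results = {
    'baseline':    {'RMSE': 0.823742, 'MAE': 0.630994, 'CV': 0.815160},
    'svd':         {'RMSE': 0.809976, 'MAE': 0.618610, 'CV': 0.810323},
    'knnbasic':    {'RMSE': 0.910705, 'MAE': 0.707334, 'CV': 0.904852},
    'knnbaseline': {'RMSE': 0.826787, 'MAE': 0.633619, 'CV': 0.821068},
    'knnwm':       {'RMSE': 0.833840, 'MAE': 0.638757, 'CV': 0.826457},
}

# Lower is better for every metric, so take the minimum per column
best = {metric: min(results, key=lambda m: results[m][metric])
        for metric in ('RMSE', 'MAE', 'CV')}

# Relative test-set RMSE improvement of SVD over the baseline
rel_gain = (results['baseline']['RMSE'] - results['svd']['RMSE']) / results['baseline']['RMSE']

print(best)
print(f"relative RMSE improvement: {rel_gain:.1%}")
```

The improvement is under two percent on the test set, which is in line with the description of SVD as only slightly better than the baseline.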

Therefore, we can conclude that our SVD model estimates ratings with a typical error (RMSE) of approximately 0.8. On a scale of 0-5 this is a modest error: predicting 3.8 for a movie that a user would rate 3 is not a significant difference in context. With these models we are trying to get a sense of how a user would rate a movie, and since the results are difficult to validate in practice (we do not actually know whether a user will enjoy a movie), an estimation error of about 0.8 is reasonably acceptable.

### Generating New Ratings 

We will create a function that generates ratings for a brand new user. We will then show how our model can use these ratings in order to make predictions. This step is important as it shows how our models and our recommendation systems can actually make predictions on new ratings!


In [50]:
#Define function that can generate new user movie ratings 
def movie_rater(movie_df, num, genre=None):
    #Create new user with userId = 1000
    userID = 1000

    #Create an empty list of ratings
    rating_list = []

    #Until num ratings are collected, sample a random movie (within the genre, if given) for the user to rate
    while num > 0:
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)

        #Prompt the user for a rating; 'n' skips the movie without counting it
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            continue

        #Store the userId, movieId, title and rating (kept as the raw input string)
        rating_one_movie = {'userId': userID,
                            'movieId': movie['movieId'].values[0],
                            'title': movie['title'].values[0],
                            'rating': rating}
        rating_list.append(rating_one_movie)
        num -= 1
    return rating_list
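Since `input` returns text, the collected ratings are strings, and anything other than `n` is accepted as a rating. One possible way to harden this step is a small validation helper like the sketch below. The name `parse_rating` and the 1-5 bounds (taken from the prompt text) are assumptions; the helper is not part of the original function.

```python
def parse_rating(raw, low=1.0, high=5.0):
    """Return the rating as a float, or None if raw is not a number within [low, high]."""
    try:
        value = float(raw)
    except ValueError:
        return None
    return value if low <= value <= high else None

# Example inputs a user might type at the prompt
print(parse_rating('4'))    # valid rating
print(parse_rating('n'))    # not a number
print(parse_rating('7'))    # outside the 1-5 scale
```

Inside the loop, a `None` result could be treated like a skipped movie, and the float could be stored instead of the raw string so that the ratings column stays numeric.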

In [51]:
dfnew = df[['userId', 'movieId', 'rating', 'title', 'genres']]

In [52]:
userrating = movie_rater(dfnew, 3, 'Action')

       userId  movieId  rating              title  \
33039     453     2126     1.0  Snake Eyes (1998)   

                              genres  
33039  Action|Crime|Mystery|Thriller  
How do you rate this movie on a scale of 1-5, press n if you have not seen :
2
      userId  movieId  rating                           title  \
9555     132     6934     3.0  Matrix Revolutions, The (2003)   

                                     genres  
9555  Action|Adventure|Sci-Fi|Thriller|IMAX  
How do you rate this movie on a scale of 1-5, press n if you have not seen :
4
      userId  movieId  rating                              title  \
6412      82      165     4.0  Die Hard: With a Vengeance (1995)   

                     genres  
6412  Action|Crime|Thriller  
How do you rate this movie on a scale of 1-5, press n if you have not seen :
2


In [53]:
## Display the new user ratings 
userrating

[{'userId': 1000,
  'movieId': 2126,
  'title': 'Snake Eyes (1998)',
  'rating': '2'},
 {'userId': 1000,
  'movieId': 6934,
  'title': 'Matrix Revolutions, The (2003)',
  'rating': '4'},
 {'userId': 1000,
  'movieId': 165,
  'title': 'Die Hard: With a Vengeance (1995)',
  'rating': '2'}]

The new user has rated three movies: Snake Eyes, The Matrix Revolutions, and Die Hard: With a Vengeance. Our model can now provide predictions for this user once these ratings are added to our data.

In [54]:
#Add new ratings to our DataFrame (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
new_ratings_df = pd.concat([df2, pd.DataFrame(userrating)], ignore_index=True, sort=False)

In [55]:
#Drop the 'title' column so that our dataframe is ready to be put into surprise
new_ratings_df.drop(['title'], axis=1, inplace=True)

In [56]:
#Investigate new DataFrame
new_ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4
1,1,3,4
2,1,6,4
3,1,47,5
4,1,50,5


Now we will redo the same modeling process as above in order to find predictions for the above movies.

In [57]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(new_ratings_df, reader)

In [58]:
#Train test split 
trainset, testset = train_test_split(data, test_size=.2)

In [59]:
#Reinstantiate the model with the best parameters from GridSearch and fit on the trainset 
svdtuned2 = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)
svdtuned2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f889c21aeb8>

In [60]:
#Predict ratings for user 1000 on three movieIds (note: these IDs differ from the movieIds 2126, 6934 and 165 rated in the session shown above)
print(svdtuned2.predict(1000,4643))
print(svdtuned2.predict(1000,96610))
print(svdtuned2.predict(1000,6534))


user: 1000       item: 4643       r_ui = None   est = 2.69   {'was_impossible': False}
user: 1000       item: 96610      r_ui = None   est = 3.67   {'was_impossible': False}
user: 1000       item: 6534       r_ui = None   est = 2.83   {'was_impossible': False}


We now have the model's estimated ratings for user 1000 on three movies. This shows how newly collected ratings can be added to the data, and how the retrained model can then produce predictions for the new user.

### Extracting Predictions for all Users and Movies

Our final step in the process will be to create a Dataframe that includes all the estimated ratings for every combination of userId and movieId. Once this information is clearly presented in a Dataframe, we will be able to conduct some post-modeling EDA to determine how or if trends from our estimators differ from our original ratings.

In [61]:
#Create list of unique userIds and movieIds 
userids = new_ratings_df['userId'].unique()
movieids = new_ratings_df['movieId'].unique()


In [62]:
#Create a list and append the userId, movieId, and estimated rating for every user-movie pair
predictions = []
for u in userids:
    for m in movieids:
        predicted = svdtuned2.predict(u, m)
        predictions.append([u, m, predicted.est])

In [63]:
#Convert the list to a dataframe
estimated = pd.DataFrame(predictions)


In [64]:
#rename columns of DataFrame 
estimated.rename(columns={0: 'userId', 1: 'movieId', 2:'estimatedrating'}, inplace=True)

In [65]:
#Print the final dataFrame
estimated

Unnamed: 0,userId,movieId,estimatedrating
0,1,1,4.510503
1,1,3,4.140471
2,1,6,4.380372
3,1,47,4.773006
4,1,50,4.934597
...,...,...,...
230073,1000,8640,3.075754
230074,1000,51412,3.296209
230075,1000,85510,3.303950
230076,1000,111364,3.001219


We now have a DataFrame that includes every userId, and every movieId, along with their estimated ratings. We can now use this data to create visualizations that demonstrate any trends or patterns with our estimations data.
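One natural use of such a table is to pull out the highest-estimated movies for each user. The sketch below does this on a few (userId, movieId, estimated rating) rows copied from the output above, using only the standard library. The function name `top_n` and the `already_rated` argument are hypothetical and chosen for illustration.

```python
from collections import defaultdict
import heapq

def top_n(triples, n=2, already_rated=frozenset()):
    """Map each user to their n highest-estimated movieIds, skipping (user, movie) pairs already rated."""
    by_user = defaultdict(list)
    for user, movie, est in triples:
        if (user, movie) not in already_rated:
            by_user[user].append((est, movie))
    return {user: [movie for _, movie in heapq.nlargest(n, scored)]
            for user, scored in by_user.items()}

# First rows of the estimated DataFrame shown above
rows = [(1, 1, 4.510503), (1, 3, 4.140471), (1, 6, 4.380372),
        (1, 47, 4.773006), (1, 50, 4.934597)]

all_movies = top_n(rows, n=2)
unseen_only = top_n(rows, n=2, already_rated={(1, 50)})
print(all_movies)
print(unseen_only)
```

Excluding pairs the user has already rated matters in practice, because the highest estimates often belong to movies the user has already seen and rated highly.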

In [66]:
#Export the estimated data to a csv file 
estimated.to_csv('estimated')