## Movie Recommendation System Models

In [1]:
#Import necessary libraries
import numpy as np
import pandas as pd
from surprise import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV


In [2]:
#Import data into a DataFrame and drop unnecessary columns 
df = pd.read_csv('cleaneddata', index_col=False)
df2 = df[['userId', 'movieId', 'rating']]

In [3]:
#Look at first 5 rows of new Dataframe 
df2.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [4]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(df2, reader) 

In [5]:
#Train test split with test size of 20%
trainset, testset = train_test_split(data, test_size=.2)

In [6]:
# Print number of users and items for the trainset
print('Number of users in train set : ', trainset.n_users, '\n')
print('Number of items in train set : ', trainset.n_items, '\n')

Number of users in train set :  133 

Number of items in train set :  1717 



### Baseline Model

Our baseline model will be a KNNBaseline model with its default hyperparameters.

In [7]:
#Instantiate a baseline model using KNNBaseline
baseline = KNNBaseline(random_state=42)

In [8]:
#Fit model on the trainset 
baseline.fit(trainset)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7faf010bd2e8>

In [9]:
#Predict on the test set 
baselinepreds = baseline.test(testset)

In [10]:
#Check RMSE and MAE results 
accuracy.rmse(baselinepreds)
accuracy.mae(baselinepreds)

RMSE: 0.8011
MAE:  0.6127


0.612688283316229

The RMSE for our baseline is 0.801 and the MAE is 0.613. These are the values we will look to improve by trying different models and tuning hyperparameters in later models.
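
For reference, the two metrics reported above can be written out by hand. The sketch below is a minimal illustration, not Surprise's own implementation; the helper names `rmse` and `mae` and the toy rating pairs are assumptions made for the example.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Toy values for illustration only; these are not taken from the dataset.
toy_pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]

print(round(rmse(toy_pairs), 4))
print(round(mae(toy_pairs), 4))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is why the RMSE values in this notebook are always at least as large as the MAE values.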

In [11]:
#Run 3-fold cross validation on the data and print results 
cv_baseline = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8160  0.8131  0.8163  0.8151  0.0015  
MAE (testset)     0.6265  0.6240  0.6240  0.6248  0.0012  
Fit time          0.07    0.08    0.06    0.07    0.01    
Test time         1.08    1.00    1.04    1.04    0.03    


In [12]:
# Print out the cross-validation results for each fold
for i in cv_baseline.items():
    print(i)

('test_rmse', array([0.81603438, 0.81305917, 0.81626081]))
('test_mae', array([0.62648796, 0.6239992 , 0.62396403]))
('fit_time', (0.06772494316101074, 0.08036613464355469, 0.06248211860656738))
('test_time', (1.0846278667449951, 1.0032038688659668, 1.0422539710998535))


In [13]:
#Find the average test RMSE from the 3-Fold cross-validation
np.mean(cv_baseline['test_rmse'])

0.815118123374194

Our 3-fold cross validation has an average test RMSE of approximately 0.815. We will look to reduce this RMSE in future models.

### Model 1

Our first model will be an SVD model tuned with GridSearch. We will first apply GridSearch to identify the parameters that give the lowest RMSE, and then re-instantiate our model with these parameters so that we can fit it on the trainset and predict on the test set.
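
To make the tuned parameters more concrete: Surprise's SVD is a biased matrix-factorization model, whose estimate is the global mean plus a user bias, an item bias, and the dot product of a user and an item latent vector. `n_factors` sets the length of those vectors, while `lr_all`, `reg_all`, and `n_epochs` control the gradient-descent training. The sketch below only illustrates the prediction step; the function name `svd_estimate` and all numbers are made up for the example.

```python
def svd_estimate(mu, b_u, b_i, p_u, q_i, rating_scale=(0, 5)):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    raw = mu + b_u + b_i + sum(q * p for q, p in zip(q_i, p_u))
    low, high = rating_scale
    # Keep the estimate inside the rating scale.
    return min(high, max(low, raw))

# Toy values with n_factors = 2, for illustration only.
est = svd_estimate(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.1, -0.3], q_i=[0.5, 0.2])
print(round(est, 2))
```

More factors give the model more capacity to capture taste patterns, and `reg_all` counteracts the resulting risk of overfitting, which is why the two are usually tuned together.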

In [14]:
#Set parameters for GridSearch on SVD model 
parameters = {'n_factors': [20, 50, 80],
             'reg_all': [0.04, 0.06],
             'n_epochs': [10, 20, 30],
             'lr_all': [.002, .005, .01]}
gridsvd = GridSearchCV(SVD, param_grid=parameters, n_jobs=-1)

In [15]:
#Fit SVD model on data
gridsvd.fit(data)

In [16]:
#Print best score and best parameters from the GridSearch 
print(gridsvd.best_score)
print(gridsvd.best_params)

{'rmse': 0.7963619719653734, 'mae': 0.6099887731059941}
{'rmse': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}, 'mae': {'n_factors': 80, 'reg_all': 0.06, 'n_epochs': 30, 'lr_all': 0.01}}


In [17]:
#Reinstantiate the model with the best parameters from GridSearch
svdtuned = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)

In [18]:
#Fit and predict the model 
svdtuned.fit(trainset)
svdpreds = svdtuned.test(testset)

In [19]:
#Print RMSE and MAE results 
accuracy.rmse(svdpreds)
accuracy.mae(svdpreds)

RMSE: 0.7897
MAE:  0.6048


0.6048424574166104

Both the RMSE (0.790) and the MAE (0.605) are lower than the baseline values. The SVD model likely also benefits from our earlier filtering of users and movies with few ratings, which reduced the sparsity of the ratings matrix.

In [20]:
#Perform 3-Fold cross validation for SVD tuned model
cv_svd_tuned = cross_validate(svdtuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.7922  0.8137  0.8188  0.8083  0.0115  
MAE (testset)     0.6085  0.6235  0.6254  0.6191  0.0076  
Fit time          2.30    2.40    2.70    2.47    0.17    
Test time         0.15    0.13    0.11    0.13    0.02    


In [21]:
#Display the results for all 3-folds 
for i in cv_svd_tuned.items():
    print(i)

('test_rmse', array([0.79224735, 0.81374189, 0.81884843]))
('test_mae', array([0.6085206 , 0.62348046, 0.62542037]))
('fit_time', (2.298301935195923, 2.4014530181884766, 2.700329065322876))
('test_time', (0.15017390251159668, 0.12639093399047852, 0.10625576972961426))


In [22]:
# Print out the average RMSE score for the test set
np.mean(cv_svd_tuned['test_rmse'])

0.808279222957499

Our 3-fold cross validation test RMSE was approximately 0.808, a slight decrease from the baseline model's 3-fold cross validation result of 0.815.
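
One way to judge how meaningful this improvement is would be to compare the gap between the two means with the fold-to-fold spread. The sketch below does this with the per-fold RMSE values printed above; the variable names are assumptions, and the population standard deviation is used because it reproduces the `Std` column of the cross-validation tables.

```python
from statistics import mean, pstdev

# Per-fold test RMSE values copied from the cross-validation output above.
baseline_folds = [0.81603438, 0.81305917, 0.81626081]
svd_folds = [0.79224735, 0.81374189, 0.81884843]

gap = mean(baseline_folds) - mean(svd_folds)
svd_spread = pstdev(svd_folds)  # population standard deviation

print(f"mean gap: {gap:.4f}")
print(f"SVD fold std: {svd_spread:.4f}")
print("gap smaller than SVD fold spread:", gap < svd_spread)
```

With only three folds this is a rough check rather than a statistical test, but it suggests the cross-validated advantage of SVD over the baseline is small relative to the variation between folds.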

### Model 2

Our next model will look at the KNNBasic algorithm to see if the results improve. We will again use GridSearch to look at different parameters in hopes of reducing our RMSE score.

In [23]:
# Set parameters to be used in KNN models 
knn_params = {'name': ['cosine', 'pearson'],
              'user_based':[True, False], 
              'min_support':[True, False],
            'min_k' : [1, 2]}
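
A note on the grid layout: Surprise reads similarity settings such as `name`, `user_based`, and `min_support` from a `sim_options` dictionary, and its grid-search documentation nests them under a `'sim_options'` key. The logs printed by the searches in this notebook report the msd similarity throughout, which suggests the top-level `name` entry did not take effect. The sketch below shows what a nested layout could look like; the integer `min_support` values are assumptions for illustration, and the `expand` helper is hypothetical (it only enumerates the combinations and is not Surprise's implementation).

```python
from itertools import product

# Hypothetical nested layout: similarity settings grouped under 'sim_options'.
nested_knn_params = {
    'min_k': [1, 2],
    'sim_options': {
        'name': ['cosine', 'pearson'],
        'user_based': [True, False],
        'min_support': [1, 5],  # assumed integer thresholds, for illustration
    },
}

def expand(grid):
    """Yield one parameter dict per combination in the nested grid."""
    sim = grid['sim_options']
    sim_keys = list(sim)
    for min_k in grid['min_k']:
        for values in product(*(sim[k] for k in sim_keys)):
            yield {'min_k': min_k, 'sim_options': dict(zip(sim_keys, values))}

combos = list(expand(nested_knn_params))
print(len(combos))
print(combos[0])
```

Each generated dictionary has the same shape as the keyword arguments a KNN model expects, with `min_k` at the top level and the similarity settings inside `sim_options`.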

In [24]:
# Apply GridSearch to the KNN Basic model to identify the best parameters
gsknnbasic = GridSearchCV(KNNBasic, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbasic.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [25]:
#Display the best scores and parameters from GridSearch
print(gsknnbasic.best_score)
print(gsknnbasic.best_params)

{'rmse': 0.8836200843160973, 'mae': 0.6847323505866144}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [26]:
#Reinstantiate the model with the best parameters from GridSearch 
# min_k is a model argument rather than a similarity option; GridSearch selected min_k=1
knnbasic_tuned = KNNBasic(min_k=1,
                          sim_options={'name': 'cosine',
                                       'user_based': True,
                                       'min_support': True})

In [27]:
#Fit on the train set and predict on the test set 
knnbasic_tuned.fit(trainset)
knnbpreds = knnbasic_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [28]:
#Print RMSE and MAE results 
accuracy.rmse(knnbpreds)
accuracy.mae(knnbpreds)

RMSE: 0.8893
MAE:  0.6876


0.6876254127587851

Our RMSE has increased from the baseline (0.889 versus 0.801). We will run 3-fold cross validation to see if the result improves.

In [29]:
#Conduct cross validation for the KNNBasic tuned model 
cv_knn_basic = cross_validate(knnbasic_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9012  0.9044  0.9015  0.9024  0.0015  
MAE (testset)     0.6994  0.7027  0.7017  0.7013  0.0014  
Fit time          0.03    0.04    0.04    0.03    0.00    
Test time         0.84    0.80    1.09    0.91    0.13    


In [30]:
# Print out results from the cross-validation
for i in cv_knn_basic.items():
    print(i)

('test_rmse', array([0.90123799, 0.90444378, 0.90147548]))
('test_mae', array([0.69937972, 0.70268252, 0.70173213]))
('fit_time', (0.02956223487854004, 0.03686380386352539, 0.03515291213989258))
('test_time', (0.8375589847564697, 0.8021156787872314, 1.0905179977416992))


In [31]:
# Print out the average RMSE score for the test set
np.mean(cv_knn_basic['test_rmse'])

0.9023857490888033

The average test RMSE across the cross-validation folds is approximately 0.902, similar to the RMSE of 0.889 we found above and higher than that of our baseline model.

### Model 3

The next model we will explore is the KNNBaseline model. Note that our baseline model was also a KNNBaseline model; here, in contrast, we tune its hyperparameters to see if the result can be improved.

In [32]:
#Apply KNN GridSearch parameters on the KNNBaseline model 
gsknnbaseline = GridSearchCV(KNNBaseline, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnbaseline.fit(data)

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [33]:
#Display the best score and the best parameters 
print(gsknnbaseline.best_score)
print(gsknnbaseline.best_params)

{'rmse': 0.8170886062358497, 'mae': 0.6260426540405558}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [34]:
#Reinstantiate the model with the best parameters from GridSearch 
# min_k is a model argument rather than a similarity option; GridSearch selected min_k=1
knnbaseline_tuned = KNNBaseline(min_k=1,
                                sim_options={'name': 'cosine',
                                             'user_based': True,
                                             'min_support': True})

In [35]:
#Fit the trainset and predict on the test set 
knnbaseline_tuned.fit(trainset)
knnbaselinepreds = knnbaseline_tuned.test(testset)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.


In [36]:
#Print the RMSE and MAE scores 
accuracy.rmse(knnbaselinepreds)
accuracy.mae(knnbaselinepreds)

RMSE: 0.8046
MAE:  0.6159


0.6158684640657873

Our RMSE is 0.805, which is a slight increase from our baseline. We will run 3-fold cross validation as another check to see if the results differ.

In [37]:
#Perform 3 fold cross validation 
cv_knn_baseline = cross_validate(knnbaseline_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBaseline on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8240  0.8222  0.8177  0.8213  0.0027  
MAE (testset)     0.6322  0.6296  0.6276  0.6298  0.0019  
Fit time          0.06    0.08    0.07    0.07    0.00    
Test time         1.18    1.21    1.17    1.18    0.02    


In [38]:
#Show the mean RMSE score for the test set 
np.mean(cv_knn_baseline['test_rmse'])

0.8212862368511474

Our 3-fold cross validation result is approximately 0.821, a slight increase from the baseline model's 0.815.

### Model 4

Our final model will look at the KNNWithMeans algorithm, and apply a GridSearch similar to the KNN models above to tune our hyperparameters further.

In [39]:
#Apply GridSearch to the KNNWithMeans model 
gsknnWM = GridSearchCV(KNNWithMeans, knn_params, measures=['rmse', 'mae'], cv=3)
gsknnWM.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [40]:
#Display the best score and best parameters from GridSearch 
print(gsknnWM.best_score)
print(gsknnWM.best_params)

{'rmse': 0.8247186748840228, 'mae': 0.6322291093818038}
{'rmse': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}, 'mae': {'name': 'cosine', 'user_based': True, 'min_support': True, 'min_k': 1}}


In [41]:
#Reinstantiate the model with the best parameters
# min_k is a model argument rather than a similarity option; GridSearch selected min_k=1
knnwm_tuned = KNNWithMeans(min_k=1,
                           sim_options={'name': 'cosine',
                                        'user_based': True,
                                        'min_support': True})

In [42]:
#Fit on the trainset, predict on the testset 
knnwm_tuned.fit(trainset)
knnwmpreds = knnwm_tuned.test(testset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


In [43]:
#Print RMSE and MAE results
accuracy.rmse(knnwmpreds)
accuracy.mae(knnwmpreds)

RMSE: 0.8133
MAE:  0.6221


0.6221008231343054

Our RMSE result is 0.813, again a slight increase from the baseline result.

In [44]:
#Perform 3-Fold cross validation on KNNWithMeans model 
cv_knn_wm = cross_validate(knnwm_tuned, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8333  0.8247  0.8265  0.8282  0.0037  
MAE (testset)     0.6361  0.6342  0.6336  0.6346  0.0011  
Fit time          0.04    0.05    0.06    0.05    0.01    
Test time         1.21    1.01    0.92    1.05    0.12    


In [45]:
#Print the average RMSE score for the test set 
np.mean(cv_knn_wm['test_rmse'])

0.8281502760658467

Our 3-fold cross validation result of 0.828 is also higher, and therefore worse, than the baseline result of 0.815.

### All results

Below we will take a look at all of the results for our models and compare them to the baseline.

In [46]:
#Create a dictionary for each model's results
baselineresult = {'model': 'baseline','RMSE': accuracy.rmse(baselinepreds), 'MAE': accuracy.mae(baselinepreds), 'CV': np.mean(cv_baseline['test_rmse'])}
svdresult = {'model':'svd', 'RMSE': accuracy.rmse(svdpreds), 'MAE': accuracy.mae(svdpreds), 'CV': np.mean(cv_svd_tuned['test_rmse'])}
knnbasicresult = {'model':'knnbasic','RMSE': accuracy.rmse(knnbpreds), 'MAE': accuracy.mae(knnbpreds), 'CV': np.mean(cv_knn_basic['test_rmse'])}
knnbaselineresult = {'model':'knnbaseline','RMSE': accuracy.rmse(knnbaselinepreds), 'MAE': accuracy.mae(knnbaselinepreds), 'CV': np.mean(cv_knn_baseline['test_rmse'])}
knnwmresult = {'model':'knnwm','RMSE': accuracy.rmse(knnwmpreds), 'MAE': accuracy.mae(knnwmpreds), 'CV': np.mean(cv_knn_wm['test_rmse'])}

RMSE: 0.8011
MAE:  0.6127
RMSE: 0.7897
MAE:  0.6048
RMSE: 0.8893
MAE:  0.6876
RMSE: 0.8046
MAE:  0.6159
RMSE: 0.8133
MAE:  0.6221


In [47]:
#Combine all the results into a list 
result_list = [baselineresult, svdresult, knnbasicresult, knnbaselineresult, knnwmresult]

In [48]:
#Transform the results lists into a DataFrame 
df_results_updated = pd.DataFrame.from_dict(result_list, orient='columns')
df_results_updated = df_results_updated.set_index('model')

In [49]:
#Display the results for all of the models 
df_results_updated

model,RMSE,MAE,CV
baseline,0.801115,0.612688,0.815118
svd,0.789745,0.604842,0.808279
knnbasic,0.889289,0.687625,0.902386
knnbaseline,0.804648,0.615868,0.821286
knnwm,0.813338,0.622101,0.82815


All of our tuned KNN models perform worse than the baseline; only the SVD model improves on it. SVD had an RMSE of 0.790, compared to the baseline of 0.801, and an MAE of 0.605, compared to the baseline of 0.613. Under 3-fold cross validation, SVD performs only slightly better than the baseline, with a mean RMSE of 0.808 against 0.815. As a result, we will disregard the KNN models for the remainder of our analysis and move forward with our tuned SVD model, as it performed best in terms of RMSE, MAE, and CV.

Therefore, we can conclude that our SVD model estimates ratings with a typical error of roughly 0.8 (an RMSE of about 0.79 to 0.81 and an MAE of about 0.60). On a scale of 0-5, an error of this size is modest: a predicted rating of 3.8 against an actual rating of 3 is not a large difference in context. Generally, with these models, we are trying to get a sense of how a user would rate a movie, and since these results are difficult to validate (we do not actually know whether a user will enjoy a movie in reality), an estimation error of about 0.80 is reasonably acceptable.

### Generating New Ratings 

We will create a function that generates ratings for a brand new user. We will then show how our model can use these ratings in order to make predictions. This step is important as it shows how our models and our recommendation systems can actually make predictions on new ratings!


In [82]:
#Define function that can generate new user movie ratings 
def movie_rater(movie_df, num, genre=None):
    #Create new user with userId = 1000
    userID = 1000
    
    #Create an empty list of ratings 
    rating_list = []
    
    #For all number of ratings, provide a random movie sample within the specified genre for the user to rate 
    while num > 0:
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print(movie)
    
        #Prompt the user to rate the movie, then record the userID, movieID, title and rating
        #in the rating_list
        rating = input('How do you rate this movie on a scale of 1-5, press n if you have not seen :\n')
        if rating == 'n':
            continue
        else:
            rating_one_movie = {'userId':userID,'movieId':movie['movieId'].values[0],'title':movie['title'].values[0], 'rating':rating}
            rating_list.append(rating_one_movie) 
            num -= 1
    return rating_list  
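
Because `input` returns a string, the ratings collected by this function are stored as text (for example `'3'`), as the output further down shows. The sketch below is one possible way to validate and convert the answer before it is stored; the helper name `parse_rating` and the 1-5 bounds (taken from the prompt text) are assumptions, and the helper is not part of the original function.

```python
def parse_rating(raw, low=1.0, high=5.0):
    """Return the rating as a float, or None if the input is not usable."""
    try:
        value = float(raw.strip())
    except ValueError:
        # Covers 'n' and any other non-numeric answer.
        return None
    # Reject numbers outside the allowed scale.
    return value if low <= value <= high else None

print(parse_rating('4'))
print(parse_rating(' 3.5 '))
print(parse_rating('n'))
print(parse_rating('9'))
```

Inside `movie_rater`, a `None` result could be handled the same way as the `'n'` answer, by skipping the movie and sampling another one.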

In [83]:
dfnew = df[['userId', 'movieId', 'rating', 'title', 'genres']]

In [84]:
userrating = movie_rater(dfnew, 3, 'Action')

       userId  movieId  rating                   title                  genres
29284     414     1240     5.0  Terminator, The (1984)  Action|Sci-Fi|Thriller
How do you rate this movie on a scale of 1-5, press n if you have not seen :
3
      userId  movieId  rating                 title           genres
2017      28     1608     2.0  Air Force One (1997)  Action|Thriller
How do you rate this movie on a scale of 1-5, press n if you have not seen :
4
      userId  movieId  rating                  title                 genres
7772     104     1833     2.5  Mercury Rising (1998)  Action|Drama|Thriller
How do you rate this movie on a scale of 1-5, press n if you have not seen :
2


In [85]:
## Display the new user ratings 
userrating

[{'userId': 1000,
  'movieId': 1240,
  'title': 'Terminator, The (1984)',
  'rating': '3'},
 {'userId': 1000,
  'movieId': 1608,
  'title': 'Air Force One (1997)',
  'rating': '4'},
 {'userId': 1000,
  'movieId': 1833,
  'title': 'Mercury Rising (1998)',
  'rating': '2'}]

The new user has rated three movies: The Terminator, Air Force One, and Mercury Rising. Our model can now provide predictions for this user once these ratings are added to our data.

In [86]:
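#Add new ratings to our DataFrame; pd.concat also works on pandas 2.x, where DataFrame.append is no longer available
new_ratings_df = pd.concat([df2, pd.DataFrame(userrating)], ignore_index=True, sort=False)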

In [87]:
#Drop the 'title' column so that our dataframe is ready to be put into surprise
new_ratings_df.drop(['title'], axis=1, inplace=True)

In [88]:
#Investigate new DataFrame
new_ratings_df.head()

,userId,movieId,rating
0,1,1,4
1,1,3,4
2,1,6,4
3,1,47,5
4,1,50,5


Now we will redo the same modeling process as above in order to find predictions for the above movies.

In [89]:
#Instantiate reader and data
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(new_ratings_df, reader)

In [90]:
#Train test split 
trainset, testset = train_test_split(data, test_size=.2)

In [91]:
#Reinstantiate the model with the best parameters from GridSearch and fit on the trainset 
svdtuned2 = SVD(n_factors=80,
               reg_all=0.06,
               n_epochs=30,
               lr_all=0.01)
svdtuned2.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7faf02df8b38>

In [75]:
#Find predictions for the three movies that user with userId=1000 just rated
print(svdtuned2.predict(1000,1240))
print(svdtuned2.predict(1000,96610))
print(svdtuned2.predict(1000,6534))


user: 1000       item: 4643       r_ui = None   est = 2.37   {'was_impossible': False}
user: 1000       item: 96610      r_ui = None   est = 3.51   {'was_impossible': False}
user: 1000       item: 6534       r_ui = None   est = 2.63   {'was_impossible': False}


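We now have predictions for user 1000. The IDs passed to `predict` should be those of the movies the user actually rated (1240, 1608, and 1833 in the session above); the cell above appears to come from an earlier run, since its movieIds and output do not line up with that session. Either way, we have seen how our model can take in new ratings and formulate predictions for a new user.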

### Extracting Predictions for All Users and Movies

Our final step in the process will be to create a DataFrame that includes the estimated rating for every combination of userId and movieId. Once this information is clearly presented in a DataFrame, we will be able to conduct some post-modeling EDA to determine whether, and how, trends in our estimates differ from those in our original ratings.

In [76]:
#Create list of unique userIds and movieIds 
userids = new_ratings_df['userId'].unique()
movieids = new_ratings_df['movieId'].unique()


In [77]:
#Create a list and append the userId, movieId, and estimated ratings 
predictions = [] 
for u in userids:
    for m in movieids:
        predicted = svdtuned2.predict(u, m)
        predictions.append([u, m, predicted.est])

In [78]:
#Convert the list to a dataframe
estimated = pd.DataFrame(predictions)


In [79]:
#rename columns of DataFrame 
estimated.rename(columns={0: 'userId', 1: 'movieId', 2:'estimatedrating'}, inplace=True)

In [80]:
#Print the final dataFrame
estimated

,userId,movieId,estimatedrating
0,1,1,4.604574
1,1,3,4.238422
2,1,6,4.467329
3,1,47,4.990054
4,1,50,4.883755
...,...,...,...
230073,1000,8640,2.641842
230074,1000,51412,2.714434
230075,1000,85510,2.797896
230076,1000,111364,2.552530


We now have a DataFrame that includes every userId and every movieId, along with the estimated rating for each pair. We can now use this data to create visualizations that show any trends or patterns in our estimates.
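
Beyond visualization, the same rows could also be ranked to produce recommendations. The sketch below is a minimal illustration that works on the `[userId, movieId, estimate]` list layout built earlier; the function name `top_n`, the `already_rated` set, and all toy values are assumptions made for the example.

```python
import heapq
from collections import defaultdict

def top_n(predictions, already_rated, n=2):
    """Keep the n highest-estimated unrated movies for each user."""
    by_user = defaultdict(list)
    for user_id, movie_id, est in predictions:
        # Skip movies the user has already rated.
        if (user_id, movie_id) not in already_rated:
            by_user[user_id].append((est, movie_id))
    return {
        user_id: [movie_id for est, movie_id in heapq.nlargest(n, rows)]
        for user_id, rows in by_user.items()
    }

# Toy rows in the same [userId, movieId, estimate] layout; values are made up.
toy_predictions = [
    [1, 10, 4.6], [1, 20, 3.9], [1, 30, 4.8], [1, 40, 2.5],
    [2, 10, 3.1], [2, 20, 4.2], [2, 30, 2.0], [2, 40, 4.4],
]
toy_rated = {(1, 30), (2, 20)}

print(top_n(toy_predictions, toy_rated))
```

Excluding already-rated pairs matters because a user's highest estimates often belong to movies they have already seen and rated highly.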

In [81]:
#Export the estimated data to a csv file 
estimated.to_csv('estimated')