# Implementing Recommender Systems - Lab

## Introduction

In this lab, you'll practice creating a recommender system model using `surprise`. You'll also get the chance to create a more complete recommender system pipeline to obtain the top recommendations for a specific user.


## Objectives

In this lab you will: 

- Use surprise's built-in reader class to process data to work with recommender algorithms 
- Obtain a prediction for a specific user for a particular item 
- Introduce a new user with rating to a rating matrix and make recommendations for them 
- Create a function that will return the top n recommendations for a user 


For this lab, we will be using the famous 1M movie dataset. It contains a collection of user ratings for many different movies. In the last lesson, you were exposed to working with `surprise` datasets. In this lab, you will also go through the process of reading in a dataset into the `surprise` dataset format. To begin with, load the dataset into a Pandas DataFrame. Determine which columns are necessary for your recommendation system and drop any extraneous ones.

In [1]:
import pandas as pd
df = pd.read_csv('./ml-latest-small/ratings.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
userId       100836 non-null int64
movieId      100836 non-null int64
rating       100836 non-null float64
timestamp    100836 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [2]:
# Drop unnecessary columns
new_df = df.drop('timestamp',axis=1)

It's now time to transform the dataset into something compatible with `surprise`. In order to do this, you're going to need `Reader` and `Dataset` classes. There's a method in `Dataset` specifically for loading dataframes.

In [3]:
from surprise import Reader, Dataset
# read in values as Surprise dataset 
reader=Reader()
data=Dataset.load_from_df(new_df,reader)

Let's look at how many users and items we have in our dataset. If using neighborhood-based methods, this will help us determine whether or not we should perform user-user or item-item similarity

In [4]:
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  610 

Number of items:  9724


## Determine the best model 

Now, compare the different models and see which ones perform best. For consistency sake, use RMSE to evaluate models. Remember to cross-validate! Can you get a model with a higher average RMSE on test data than 0.869?

In [5]:
# importing relevant libraries
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV
import numpy as np

In [6]:
## Perform a gridsearch with SVD
# ⏰ This cell may take several minutes to run
param_grid={'n_factors':[20,100],'reg_all':[0.4,0.6]}
gridsearch_model=GridSearchCV(SVD,param_grid,n_jobs=-1,joblib_verbose=5)
gridsearch_model.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:  2.1min finished


In [7]:
# print out optimal parameters for SVD after GridSearch
print(gridsearch_model.best_score)
print(gridsearch_model.best_params)

{'rmse': 0.8829152410102159, 'mae': 0.681920722368597}
{'rmse': {'n_factors': 20, 'reg_all': 0.4}, 'mae': {'n_factors': 20, 'reg_all': 0.4}}


In [8]:
# cross validating with KNNBasic
from surprise.prediction_algorithms import knns
from surprise.similarities import pearson
from surprise import accuracy
from surprise.model_selection import cross_validate

sim_pearson={'name':'pearson','user_based':True}
basic=knns.KNNBasic(sim_options=sim_pearson)
cv_knn_basic=cross_validate(basic,data,n_jobs=-1)

In [9]:
# print out the average RMSE score for the test set
for item in cv_knn_basic.items():
    print(item)
print('-'*15)
print(np.mean(cv_knn_basic['test_rmse']))

('test_rmse', array([0.96466985, 0.97530381, 0.97645029, 0.97006617, 0.97582677]))
('test_mae', array([0.74565013, 0.75314846, 0.75637784, 0.75022129, 0.75080062]))
('fit_time', (2.3469998836517334, 3.1569907665252686, 2.0739729404449463, 2.5579946041107178, 1.8489890098571777))
('test_time', (3.8189942836761475, 3.1980020999908447, 3.6259987354278564, 2.967010498046875, 5.350003480911255))
---------------
0.9724633788359777


In [10]:
# cross validating with KNNBaseline
sim_pearson={'name':'pearson','user_based':True}
baseline=knns.KNNBaseline(sim_options=sim_pearson)
cv_knn_baseline=cross_validate(baseline,data,n_jobs=-1)

In [11]:
# print out the average score for the test set
for item in cv_knn_baseline.items():
    print(item)
print('-'*15)
print(np.mean(cv_knn_baseline['test_rmse']))

('test_rmse', array([0.87767715, 0.87791632, 0.86818693, 0.87859792, 0.88049464]))
('test_mae', array([0.66957268, 0.67309557, 0.66401866, 0.66924598, 0.67215008]))
('fit_time', (2.668001890182495, 3.4969892501831055, 2.718001365661621, 2.8639931678771973, 1.979992151260376))
('test_time', (6.156996250152588, 5.070993185043335, 5.53098726272583, 4.299993276596069, 3.408992052078247))
---------------
0.8765745926990386


Based off these outputs, it seems like the best performing model is the SVD model with `n_factors = 50` and a regularization rate of 0.05. Use that model or if you found one that performs better, feel free to use that to make some predictions.

## Making Recommendations

It's important that the output for the recommendation is interpretable to people. Rather than returning the `movie_id` values, it would be far more valuable to return the actual title of the movie. As a first step, let's read in the movies to a dataframe and take a peek at what information we have about them.

In [12]:
df_movies = pd.read_csv('./ml-latest-small/movies.csv')

In [13]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


## Making simple predictions
Just as a reminder, let's look at how you make a prediction for an individual user and item. First, we'll fit the SVD model we had from before.

In [14]:
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(dataset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x27c96946ef0>

In [15]:
svd.predict(2, 4)

Prediction(uid=2, iid=4, r_ui=None, est=3.0727732514798203, details={'was_impossible': False})

This prediction value is a tuple and each of the values within it can be accessed by way of indexing. Now let's put our knowledge of recommendation systems to do something interesting: making predictions for a new user!

## Obtaining User Ratings 

It's great that we have working models and everything, but wouldn't it be nice to get to recommendations specifically tailored to your preferences? That's what we'll be doing now. The first step is to create a function that allows us to pick randomly selected movies. The function should present users with a movie and ask them to rate it. If they have not seen the movie, they should be able to skip rating it. 

The function `movie_rater()` should take as parameters: 

* `movie_df`: DataFrame - a dataframe containing the movie ids, name of movie, and genres
* `num`: int - number of ratings
* `genre`: string - a specific genre from which to draw movies

The function returns:
* rating_list : list - a collection of dictionaries in the format of {'userId': int , 'movieId': int , 'rating': float}

#### This function is optional, but fun :) 

In [18]:
def movie_rater(movie_df,num=5, genre=None):
    userId=1000
    rating_list=[]
    while num>0:
        if genre:
            movie = movie_df[movie_df['genres'].str.contains(genre)].sample(1)
        else:
            movie = movie_df.sample(1)
        print("\n")
        print(movie)
        rating=input("How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.")
        if rating=='n':
            continue
        else:
            rating_list.append({'userId':userId,'movieId':movie['movieId'].values[0],'rating':int(rating)})
            num-=1
    return rating_list

In [19]:
# try out the new function here!
movie_rater(df_movies,4)



      movieId                      title  genres
3846     5401  Undercover Brother (2002)  Comedy
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                               title  genres
9569   174045  Goon: Last of the Enforcers (2017)  Comedy
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.5


      movieId                title          genres
7390    79590  Rebound, The (2009)  Comedy|Romance
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                  title           genres
5768    31223  Racing Stripes (2005)  Children|Comedy
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                      title                  genres
4727     7049  Flying Down to Rio (1933)  Comedy|Musical|Romance
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                

[{'userId': 1000, 'movieId': 174045, 'rating': 5},
 {'userId': 1000, 'movieId': 7438, 'rating': 3},
 {'userId': 1000, 'movieId': 2012, 'rating': 5},
 {'userId': 1000, 'movieId': 7153, 'rating': 3}]

If you're struggling to come up with the above function, you can use this list of user ratings to complete the next segment

In [20]:
user_rating=[{'userId': 1000, 'movieId': 174045, 'rating': 5},
 {'userId': 1000, 'movieId': 7438, 'rating': 3},
 {'userId': 1000, 'movieId': 2012, 'rating': 5},
 {'userId': 1000, 'movieId': 7153, 'rating': 3}]

### Making Predictions With the New Ratings
Now that you have new ratings, you can use them to make predictions for this new user. The proper way this should work is:

* add the new ratings to the original ratings DataFrame, read into a `surprise` dataset 
* train a model using the new combined DataFrame
* make predictions for the user
* order those predictions from highest rated to lowest rated
* return the top n recommendations with the text of the actual movie (rather than just the index number) 

In [21]:
## add the new ratings to the original ratings DataFrame
new_ratings_df = new_df.append(user_rating,ignore_index=True)
new_data = Dataset.load_from_df(new_ratings_df,reader)

In [22]:
# train a model using the new combined DataFrame
svd = SVD(n_factors= 50, reg_all=0.05)
svd.fit(new_data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x27c9a54dd30>

In [24]:
# make predictions for the user
# you'll probably want to create a list of tuples in the format (movie_id, predicted_score)
list_of_movies = []
for movie in new_df['movieId'].unique():
    list_of_movies.append( (movie,svd.predict(1000,movie)[3]))

In [27]:
# order the predictions from highest to lowest rated
ranked_movies = sorted(list_of_movies,key=lambda x:x[1],reverse=True)

 For the final component of this challenge, it could be useful to create a function `recommended_movies()` that takes in the parameters:
* `user_ratings`: list - list of tuples formulated as (user_id, movie_id) (should be in order of best to worst for this individual)
* `movie_title_df`: DataFrame 
* `n`: int - number of recommended movies 

The function should use a `for` loop to print out each recommended *n* movies in order from best to worst

In [36]:
# return the top n recommendations using the 
def recommended_movies(user_ratings,movie_title_df,n):
    total=n+1
    print('Movie Recommendations:\n')
    while n>0:
        print(f'{total-n}. {movie_title_df.title[user_ratings[total-n-1][0]]}')
        n-=1
recommended_movies(ranked_movies,df_movies,10)

Movie Recommendations:

1. I Love Trouble (1994)
2. Oklahoma! (1955)
3. Pompatus of Love, The (1996)
4. Withnail & I (1987)
5. Leave It to Beaver (1997)
6. In the Company of Men (1997)
7. Virtuosity (1995)
8. Billy Elliot (2000)
9. TiMER (2009)
10. Night of the Living Dead (1968)


## Level Up (Optional)

* Try and chain all of the steps together into one function that asks users for ratings for a certain number of movies, then all of the above steps are performed to return the top $n$ recommendations
* Make a recommender system that only returns items that come from a specified genre

In [37]:
def movieRecommender(movie_df,num_to_rate=5, num_to_rec=5, genre=None):
    user_rating=movie_rater(movie_df,num_to_rate,genre)
    new_ratings_df = new_df.append(user_rating,ignore_index=True)
    new_data = Dataset.load_from_df(new_ratings_df,reader)
    svd = SVD(n_factors= 50, reg_all=0.05)
    svd.fit(new_data.build_full_trainset())
    list_of_movies = []
    for movie in new_df['movieId'].unique():
        list_of_movies.append((movie,svd.predict(1000,movie)[3]))
    ranked_movies = sorted(list_of_movies,key=lambda x:x[1],reverse=True)
    print('\n')
    recommended_movies(ranked_movies,movie_df,num_to_rec)
    return

In [39]:
movieRecommender(df_movies,10,10)



     movieId                      title                genres
976     1277  Cyrano de Bergerac (1990)  Comedy|Drama|Romance
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                     title        genres
5905    33646  Longest Yard, The (2005)  Comedy|Drama
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                                           title  \
8638   119155  Night at the Museum: Secret of the Tomb (2014)   

                                 genres  
8638  Adventure|Children|Comedy|Fantasy  
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.2


      movieId                     title                 genres
9222   152079  London Has Fallen (2016)  Action|Crime|Thriller
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                   title  genres
4254     6204  Meteor Man, The (1993)  Comedy
H



      movieId                        title genres
7412    80454  Princess (Prinsessa) (2010)  Drama
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


     movieId                             title genres
121      148  Awfully Big Adventure, An (1995)  Drama
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                          title                           genres
6732    59103  Forbidden Kingdom, The (2008)  Action|Adventure|Comedy|Fantasy
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


      movieId                        title                   genres
2879     3849  The Spiral Staircase (1945)  Horror|Mystery|Thriller
How do you rate this movie on a scale of 1-5? Press 'n' if you haven't seen it.n


     movieId           title                             genres
594      736  Twister (1996)  Action|Adventure|Romance|Thriller
How do you rate this movie on a scale 

## Summary

In this lab, you got the chance to implement a collaborative filtering model as well as retrieve recommendations from that model. You also got the opportunity to add your own recommendations to the system to get new recommendations for yourself! Next, you will learn how to use Spark to make recommender systems.