Recommender System Case Study 

Sam’s next exam is to build a “Recommender System” using the Singular Value Decomposition 
(SVD) algorithm. The questions are based on what you’ve learnt in the respective module.
Questions: 

1. Implement a user-based recommender system using the SVD (Singular Value Decomposition) method:
a. Load the ‘ratings’ and ‘movies’ datasets, which are part of ‘MovieLens’

b. Find the number of unique users and movies in the ‘ratings’ dataset

c. Create a rating matrix for the ‘ratings’ dataset and store it in ‘Ratings’

d. Load the ‘ratings’ dataset as a Surprise Dataset object and run 3-fold cross-validation using the SVD algorithm

e. Find all the movies rated 5 stars by user id ‘5’ and store them in the ‘ratings_1’ data frame

f. Create a shallow copy of the ‘movies’ dataset and store the result in ‘user_5’

g. Train a recommender system using the SVD object and predict the ratings for user id ‘5’ 

h. Print the top 10 movie recommendations for user id ‘5’

In [1]:
import pandas as pd
import numpy as np

movies_data = pd.read_csv('movies.csv')
ratings_data = pd.read_csv('ratings.csv')

# Users are counted from the ratings; movies are counted from the movies catalogue.
users = ratings_data['userId'].unique()
movies = movies_data['movieId'].unique()
print("Number of users:", len(users))
print("Number of movies:", len(movies))

Number of users: 7120
Number of movies: 27278
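
Question (b) asks for the counts within the ‘ratings’ dataset itself, whereas the cell above counts movies from the catalogue. A minimal sketch of counting both from the ratings, assuming the same `userId` and `movieId` column names; the small `toy_ratings` frame is made up for illustration and stands in for `ratings.csv`:

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv (same column names).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 30, 10],
    "rating":  [3.5, 4.0, 5.0, 2.0, 4.5],
})

# nunique() counts distinct values without materialising the unique array.
n_users = toy_ratings["userId"].nunique()
n_movies = toy_ratings["movieId"].nunique()
print("Number of users:", n_users)
print("Number of rated movies:", n_movies)
```

On the real data, the rated-movie count can be smaller than the catalogue size, since not every movie in `movies.csv` necessarily has a rating.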


In [2]:
print(movies_data.shape)
movies_data.head(3)

(27278, 3)


   movieId                    title                                       genres
0        1         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2           Jumanji (1995)                   Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)                               Comedy|Romance


In [3]:
print(ratings_data.shape)
ratings_data.head(3)

(1048575, 4)


   userId  movieId  rating   timestamp
0       1        2     3.5  1112486027
1       1       29     3.5  1112484676
2       1       32     3.5  1112484819


In [4]:
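# Create the rating matrix with rows as movies and columns as users.
# np.zeros guarantees that unrated cells are 0 (np.ndarray leaves them uninitialised).
# Note: dtype uint8 truncates half-star ratings, e.g. 3.5 is stored as 3.

Ratings = np.zeros(shape=(np.max(ratings_data.movieId.values), np.max(ratings_data.userId.values)),
    dtype=np.uint8)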

Ratings[ratings_data.movieId.values-1, ratings_data.userId.values-1] = ratings_data.rating.values
Ratings

array([[0, 0, 4, ..., 0, 5, 4],
       [3, 0, 0, ..., 0, 0, 4],
       [0, 4, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8)
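
The matrix above stores ratings as `uint8`, so half-star ratings lose their fractional part. As a sketch of one alternative (not the approach used in this notebook), a pandas pivot keeps the float values and labels rows and columns with the actual ids. The `toy_ratings` frame is a made-up stand-in for `ratings_data`:

```python
import pandas as pd

# Hypothetical toy data standing in for ratings_data (same column names).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [1, 3, 1, 2],
    "rating":  [3.5, 4.0, 5.0, 0.5],
})

# Rows are movies and columns are users, as in the cell above.
# Missing (movie, user) pairs become NaN and are then filled with 0.0.
ratings_matrix = (
    toy_ratings
    .pivot_table(index="movieId", columns="userId", values="rating")
    .fillna(0.0)
)
print(ratings_matrix)
```

A dense pivot of the full dataset would be large, so on real data a sparse representation may be the more practical choice.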

In [19]:
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate

# MovieLens ratings run from 0.5 to 5 in half-star steps.
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(ratings_data[['userId', 'movieId', 'rating']], reader)
algo = SVD()

# 3-fold cross-validation
cross_validate(algo, data, cv=3, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8459  0.8451  0.8451  0.8454  0.0004  
MAE (testset)     0.6482  0.6482  0.6475  0.6480  0.0003  
Fit time          41.91   47.79   47.22   45.64   2.65    
Test time         4.15    3.91    3.72    3.93    0.18    


{'test_rmse': array([0.84592897, 0.84512827, 0.84505773]),
 'test_mae': array([0.64822607, 0.64817361, 0.64754257]),
 'fit_time': (41.906856060028076, 47.79305958747864, 47.2236647605896),
 'test_time': (4.14899468421936, 3.912296772003174, 3.7203774452209473)}
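
To give a feel for what the `SVD` object fits: according to the Surprise documentation, it is the matrix-factorisation model popularised by Simon Funk, which predicts a rating as global mean + user bias + item bias + dot product of latent factors, trained by stochastic gradient descent. The following is only a rough NumPy sketch of that idea on made-up data. The function names, toy triples and hyperparameters are illustrative assumptions, not Surprise's implementation, and the error reported is training error rather than a cross-validated score.

```python
import numpy as np

def fit_funk_svd(triples, n_users, n_items, n_factors=2, n_epochs=200,
                 lr=0.02, reg=0.02, seed=0):
    """Fit biases and latent factors by SGD on (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))   # global mean rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)     # user and item biases
    P = rng.normal(0.0, 0.1, (n_users, n_factors))    # user factors
    Q = rng.normal(0.0, 0.1, (n_items, n_factors))    # item factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu_old = P[u].copy()                      # use the pre-update user factors
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu_old - reg * Q[i])
    return mu, bu, bi, P, Q

def rmse(triples, predict):
    """Root mean squared error of predict(u, i) over the given triples."""
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples])))

# Hypothetical toy ratings: (user index, item index, rating).
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.5)]
mu, bu, bi, P, Q = fit_funk_svd(toy, n_users=3, n_items=3)

baseline_rmse = rmse(toy, lambda u, i: mu)
fitted_rmse = rmse(toy, lambda u, i: mu + bu[u] + bi[i] + P[u] @ Q[i])
print("baseline RMSE:", baseline_rmse)
print("fitted RMSE:  ", fitted_rmse)
```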

In [35]:
ratings_data.dtypes

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object

In [38]:
print('Total number of 5-star ratings:', ratings_data[ratings_data['rating'] == 5].shape[0])

Total number of 5-star ratings: 152562


In [40]:
# Movies that user 5 rated 5 stars
mask = (ratings_data['userId'] == 5) & (ratings_data['rating'] == 5.0)

ratings_1 = ratings_data[mask]
print('No. of movies rated 5 stars by user 5:', ratings_1.shape[0], 'movies')
ratings_1.head(3)

No. of movies rated 5 stars by user 5: 38 movies


     userId  movieId  rating  timestamp
452       5       11     5.0  851527751
455       5       62     5.0  851526935
459       5      141     5.0  851526935


In [48]:
# DataFrame.copy(deep=False) makes the shallow copy explicit.
user_5 = movies_data.copy(deep=False)
user_5.head(3)

   movieId                    title                                       genres
0        1         Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2           Jumanji (1995)                   Adventure|Children|Fantasy
2        3  Grumpier Old Men (1995)                               Comedy|Romance
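
The difference between a shallow and a deep copy can be checked by asking whether the copy still points at the same underlying buffer as the original. A small sketch on a made-up frame (`toy_movies` is hypothetical, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for movies_data.
toy_movies = pd.DataFrame({"movieId": [1, 2, 3]})

shallow = toy_movies.copy(deep=False)  # new object that references the same data
deep = toy_movies.copy(deep=True)      # new object with its own copy of the data

shares_shallow = np.shares_memory(toy_movies["movieId"].to_numpy(),
                                  shallow["movieId"].to_numpy())
shares_deep = np.shares_memory(toy_movies["movieId"].to_numpy(),
                               deep["movieId"].to_numpy())
print("shallow copy shares memory:", shares_shallow)
print("deep copy shares memory:   ", shares_deep)
```

Whether a later write through one object becomes visible in the other depends on the pandas version and its copy-on-write settings, so a shallow copy is best treated as read-only.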


In [28]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import KFold

# Keep only user 5's ratings. Note that the model below therefore never sees
# the other users' ratings.
d1 = ratings_data[ratings_data['userId'] == 5]
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(d1[['userId', 'movieId', 'rating']], reader)

kf = KFold(n_splits=3)
algo = SVD()

# Refit on each fold; 'predictions' keeps the results of the last test fold only.
for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)

In [58]:
# Top 10 recommendations for user 5
# Note: 'rating' below is r_ui, the rating user 5 actually gave to movies in
# the last test fold; the model's estimate is in the 'est' column of df.

df = pd.DataFrame(predictions)
new_df = df[['uid', 'iid', 'r_ui']].rename(
    columns={'uid': 'userId', 'iid': 'movieId', 'r_ui': 'rating'})
top_10 = new_df.sort_values(by='rating', ascending=False)[:10]

# Attach titles and genres, and number the rows from 1.
s = pd.Series(range(1, 11))
top_10.merge(user_5, on='movieId', how='left').set_index(s)

    userId  movieId  rating                                      title                                           genres
 1       5      150     5.0                           Apollo 13 (1995)                             Adventure|Drama|IMAX
 2       5      377     5.0                               Speed (1994)                          Action|Romance|Thriller
 3       5      736     5.0                             Twister (1996)                Action|Adventure|Romance|Thriller
 4       5      595     5.0                Beauty and the Beast (1991)  Animation|Children|Fantasy|Musical|Romance|IMAX
 5       5       62     5.0                  Mr. Holland's Opus (1995)                                            Drama
 6       5     1028     5.0                        Mary Poppins (1964)                  Children|Comedy|Fantasy|Musical
 7       5       11     5.0             American President, The (1995)                             Comedy|Drama|Romance
 8       5     1291     5.0  Indiana Jones and the Last Crusade (1989)                                 Action|Adventure
 9       5      368     5.0                            Maverick (1994)                         Adventure|Comedy|Western
10       5      531     5.0                  Secret Garden, The (1993)                                   Children|Drama
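
The table above lists movies user 5 has already rated, ordered by the rating the user gave. A recommendation list would more commonly rank movies the user has not rated yet by the model's predicted rating. The following is a minimal sketch of that ranking step. `top_n_unseen` and `fake_predict` are hypothetical helpers written for this illustration, and the scores are made up:

```python
from typing import Callable, Iterable, List, Tuple

def top_n_unseen(predict: Callable[[int, int], float],
                 user_id: int,
                 all_movie_ids: Iterable[int],
                 seen_movie_ids: Iterable[int],
                 n: int = 10) -> List[Tuple[int, float]]:
    """Return the n unseen movies with the highest predicted rating."""
    seen = set(seen_movie_ids)
    scored = [(mid, predict(user_id, mid))
              for mid in all_movie_ids if mid not in seen]
    # Sort by predicted rating, highest first (ties keep their input order).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical stand-in for a trained model; the scores are made up.
def fake_predict(user_id: int, movie_id: int) -> float:
    return 5.0 - (movie_id % 5)

recs = top_n_unseen(fake_predict, user_id=5,
                    all_movie_ids=range(1, 11),
                    seen_movie_ids=[1, 5, 10], n=3)
print(recs)
```

With Surprise, the predictor could be `lambda u, i: algo.predict(u, i).est` for a model fitted on all users' ratings (for example via `data.build_full_trainset()`), with `seen_movie_ids` taken from user 5's rows in `ratings_data`. Merging the resulting movie ids with `user_5` would then add titles and genres.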
