# 3. Recommender Systems
----------------------
## Matrix Factorization:
* Matrix factorization stood out for its excellent performance during the Netflix challenge and has remained popular ever since.
* Factorize the ratings matrix 𝑅 (users × items) into two lower-dimensional matrices 𝑈 and 𝑉, so that 𝑅 ≈ 𝑈ᵀ𝑉. The match is approximate because most entries of 𝑅 are unobserved.
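
To make the factorization concrete, here is a minimal NumPy sketch that fits 𝑈 and 𝑉 to a toy matrix by plain gradient descent on the observed entries only. The toy ratings, the convention that 0 marks a missing rating, the number of factors `k`, and the values of `lr` and `reg` are all assumptions chosen for illustration. This is not how Surprise fits its models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4 users x 5 items rating matrix; 0 marks a missing rating (assumption).
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = R > 0

k = 2  # number of latent factors (illustrative choice)
U = rng.normal(scale=0.1, size=(k, R.shape[0]))  # k x n_users
V = rng.normal(scale=0.1, size=(k, R.shape[1]))  # k x n_items

lr, reg = 0.01, 0.02  # illustrative learning rate and L2 penalty
for _ in range(5000):
    # Residuals on observed entries only; missing entries contribute nothing.
    E = np.where(observed, R - U.T @ V, 0.0)
    # Gradient steps on the regularised squared error.
    U += lr * (V @ E.T - reg * U)
    V += lr * (U @ E - reg * V)

R_hat = U.T @ V  # dense matrix of predicted ratings
rmse = np.sqrt(np.mean((R - R_hat)[observed] ** 2))
print(f'RMSE on observed entries: {rmse:.3f}')
```

The entries of `R_hat` at positions where `R` was missing are the model's predictions for unseen user-item pairs.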

In [1]:
#!conda install -c conda-forge scikit-surprise
import surprise

In [2]:
import numpy as np
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly
import io
import zipfile

## a. Load the Data
* We use the LibRec FilmTrust dataset, which contains 35497 movie ratings and was originally collated for a recommender systems paper.
* Normally, recommender systems will use larger datasets than this.

In [3]:
# Download zip file.
tmpFile = urllib.request.urlopen('https://www.librec.net/datasets/filmtrust.zip')
# Unzip file.
tmpFile = zipfile.ZipFile(io.BytesIO(tmpFile.read()))
# Open desired data file as pandas dataframe, close zip file.
dataset = pd.read_table(io.BytesIO(tmpFile.read('ratings.txt')), sep=' ', names=['uid', 'iid', 'rating'])
tmpFile.close()

In [4]:
dataset.head()

   uid  iid  rating
0    1    1     2.0
1    1    2     4.0
2    1    3     3.5
3    1    4     3.0
4    1    5     4.0


* The dataset is stored in a sparse format, with one row per observed rating, which is the input format that matrix factorization methods expect.
* The first column is the user ID, which gives the row index $i$ of the matrix.
* The second column is the ID of the movie the user reviewed, which gives the column index $j$ of the matrix.
* The third column is the review score, which is the matrix entry $R_{ij}$.
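
As a small illustration of how these triples map onto the matrix, the sketch below pivots a few rows into a dense user × item table. The first five rows are the ones shown by `dataset.head()`. The two rows for user 2 are hypothetical values added only so that the result contains missing entries.

```python
import pandas as pd

# (uid, iid, rating) triples: five rows from the dataset head,
# plus two hypothetical rows for user 2 (made up for illustration).
triples = pd.DataFrame({
    'uid':    [1, 1, 1, 1, 1, 2, 2],
    'iid':    [1, 2, 3, 4, 5, 2, 6],
    'rating': [2.0, 4.0, 3.5, 3.0, 4.0, 3.0, 1.5],
})

# Rows are users (i), columns are items (j), entries are R_ij.
# Pairs with no rating become NaN.
R = triples.pivot(index='uid', columns='iid', values='rating')
print(R)

# Fraction of user-item pairs that actually have a rating.
density = R.notna().to_numpy().mean()
print(f'Density: {density:.2f}')
```

On a real dataset most entries are missing, which is why the triple format is preferred over storing the dense matrix.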

## b. Fitting the Model
----------------------
### SVDpp
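* SVD++ is an extension of the vanilla SVD algorithm that also takes implicit feedback into account (which items a user has rated, regardless of the rating value).
* It was one of the best performers in the Netflix challenge and has since become a popular method.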
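
For reference, the SVD++ prediction rule can be written in this section's indexing (user $i$, item $j$) as follows. The symbol names $b_i$, $c_j$, $y_k$ and $N(i)$ are choices made here for illustration and differ from the notation in the Surprise documentation.

$$\hat{R}_{ij} = \mu + b_i + c_j + v_j^{\top}\left(u_i + |N(i)|^{-1/2} \sum_{k \in N(i)} y_k\right)$$

Here $\mu$ is the global mean rating, $b_i$ and $c_j$ are the user and item biases, $u_i$ and $v_j$ are the user and item factor vectors (columns of $U$ and $V$), $N(i)$ is the set of items rated by user $i$, and $y_k$ is an extra factor vector per item that captures the implicit feedback. Dropping the sum over $N(i)$ recovers the vanilla SVD model with biases.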

In [5]:
# Let's first check the range of the reviews for this dataset.
lower_rating = dataset['rating'].min()
upper_rating = dataset['rating'].max()
print(f'Review range: {lower_rating} to {upper_rating}')

Review range: 0.5 to 4.0


In [6]:
# Change the review range when we load in our dataset.
reader = surprise.Reader(rating_scale=(lower_rating, upper_rating))
data = surprise.Dataset.load_from_df(dataset, reader)

In [7]:
# Train the model on the whole dataset.
alg = surprise.SVDpp()
trainset = data.build_full_trainset()
output = alg.fit(trainset)

In [8]:
# Predict and compare the ratings.
uid = 50 # User id
iid = 52 # movie id
q_out = dataset.query(f"uid == {uid} and iid == {iid}")
try:
    rating_origin = q_out.iloc[0]['rating']
except IndexError:
    # The query returned no rows: this user has not rated this movie.
    rating_origin = None
pred = alg.predict(uid=uid, iid=iid)
score = pred.est
print(f'Original rating: {rating_origin}, Predicted rating: {score}')

Original rating: None, Predicted rating: 3.9154517894758105


## c. Making Recommendations
* Let’s find one item to recommend to a particular user, the user with uid 50.
* First we need to find the movie ids that user 50 didn’t rate since we don’t want to recommend them a movie they’ve already watched!

In [9]:
# Get a list of all movie ids
iids = dataset['iid'].unique()
# Get a list of iids that uid 50 has rated
iids50 = dataset.loc[dataset['uid'] == 50, 'iid']
# Remove the iids that uid 50 has rated from the list of all movie ids
iids_to_pred = np.setdiff1d(iids, iids50)

* Next, we predict the score of each of the movie ids that user 50 didn’t rate, and find the best one.
* We create a test set containing the iids we want to predict, setting every rating to 4 as a placeholder since the true ratings are not needed.

In [10]:
testset = [[50, iid, 4.] for iid in iids_to_pred]
predictions = alg.test(testset)
predictions[0]

Prediction(uid=50, iid=14, r_ui=4.0, est=3.1088533419947546, details={'was_impossible': False})

In [11]:
# Convert each prediction object into an array of the predicted ratings.
pred_ratings = np.array([pred.est for pred in predictions])
# Find the index of the maximum predicted rating
i_max = pred_ratings.argmax()
# Use this to find the corresponding iid to recommend
iid = iids_to_pred[i_max]
print('Top item for user 50 has iid {0} with predicted rating {1}'.format(iid, pred_ratings[i_max]))

Top item for user 50 has iid 189 with predicted rating 4.0
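
The same idea extends from a single best item to a top-N list. The sketch below uses hypothetical item ids and predicted ratings as stand-ins for `iids_to_pred` and `pred_ratings` from the cells above, and the value of `n` is arbitrary.

```python
import numpy as np

# Hypothetical stand-ins for iids_to_pred and pred_ratings (made-up values).
iids_to_pred = np.array([14, 27, 52, 189, 203, 311])
pred_ratings = np.array([3.11, 3.87, 3.92, 4.00, 2.45, 3.60])

n = 3  # how many items to recommend
# argsort sorts ascending, so reverse it and keep the first n indices.
top_idx = np.argsort(pred_ratings)[::-1][:n]
top_n = [(int(iids_to_pred[i]), float(pred_ratings[i])) for i in top_idx]
print(top_n)  # [(189, 4.0), (52, 3.92), (27, 3.87)]
```

Surprise clips predictions to the rating scale by default, so several items may tie at the upper bound (4.0 here). In that case the order among tied items is arbitrary.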


## d. Tuning and Evaluating the Model
* SVD++ depends on several hyperparameters.
* We tune the learning rate and the regularisation term in this tutorial.
* In Surprise, tuning is performed by the GridSearchCV class, which uses cross-validation to pick, from a predefined grid of values, the combination that best predicts held-out data.
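
Conceptually, a grid search enumerates every combination of the listed values and keeps the one with the best score. The sketch below shows only that enumeration. The function `cv_rmse` is a hypothetical stand-in that returns a made-up score, not a real cross-validated error, and the grid values are illustrative.

```python
from itertools import product

# Illustrative grid: two values per hyperparameter gives 2 x 2 = 4 candidates.
param_grid = {'lr_all': [0.001, 0.01], 'reg_all': [0.1, 0.5]}

names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

def cv_rmse(params):
    """Hypothetical stand-in for a cross-validated RMSE (made-up formula)."""
    return abs(params['lr_all'] - 0.01) * 10 + params['reg_all']

# Keep the combination with the lowest score.
best = min(candidates, key=cv_rmse)
print(len(candidates), best)
```

The cost grows multiplicatively with each added hyperparameter, and each candidate is fitted once per cross-validation fold, so small grids are advisable for slower algorithms such as SVD++.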

In [12]:
# First, let’s define our list of constant values to check.
param_grid = {'lr_all': [.001, .01], 'reg_all': [.1, .5]}  # two distinct learning rates
gs = surprise.model_selection.GridSearchCV(surprise.SVDpp, param_grid, measures=['rmse', 'mae'], cv=3)
gs.fit(data)
# Print combination of parameters that gave best RMSE score.
print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}


In [13]:
# The performance of a model can be evaluated using cross-validation to compare methods or check whether the method is performing reasonably.
alg = surprise.SVDpp(lr_all = .001)  # parameter choices can be added here.
output = surprise.model_selection.cross_validate(alg, data, verbose = True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8174  0.8301  0.8309  0.8419  0.8209  0.8282  0.0086  
MAE (testset)     0.6464  0.6560  0.6573  0.6638  0.6528  0.6553  0.0057  
Fit time          11.33   10.98   10.93   11.39   11.05   11.13   0.19    
Test time         0.23    0.25    0.22    0.25    0.24    0.24    0.01    
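
The two error measures in the table can be computed by hand as follows. The true and predicted ratings below are hypothetical values chosen for illustration, not results from the model above.

```python
import math

# Hypothetical true and predicted ratings for five held-out reviews.
actual = [4.0, 3.5, 2.0, 3.0, 4.0]
predicted = [3.5, 3.5, 3.0, 2.5, 4.0]

errors = [p - a for p, a in zip(predicted, actual)]

# RMSE: square root of the mean squared error; penalises large errors more.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
# MAE: mean of the absolute errors; every error counts in proportion to its size.
mae = sum(abs(e) for e in errors) / len(errors)

print(f'RMSE: {rmse:.4f}, MAE: {mae:.4f}')  # RMSE: 0.5477, MAE: 0.4000
```

Both measures are in rating units, so an RMSE of about 0.83 on a 0.5 to 4.0 scale means a typical prediction is off by a little under one point.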


## Reference
 1. https://blog.cambridgespark.com/tutorial-practical-introduction-to-recommender-systems-dbe22848392b
 2. https://surprise.readthedocs.io/en/stable/getting_started.html