In [1]:
import numpy as np
import pandas as pd
import urllib
import io
import zipfile

In [2]:
# Download zip file
tmpFile = urllib.request.urlopen('https://www.librec.net/datasets/filmtrust.zip')

#Unzip the file
tmpFile = zipfile.ZipFile(io.BytesIO(tmpFile.read()))

In [3]:
#Open desired data file as pandas dataframe, close zipfile
dataset = pd.read_table(io.BytesIO(tmpFile.read('ratings.txt')),sep=' ',names=['uid','iid','rating'])
tmpFile.close()
dataset.head()

Unnamed: 0,uid,iid,rating
0,1,1,2.0
1,1,2,4.0
2,1,3,3.5
3,1,4,3.0
4,1,5,4.0


In [4]:
lower_rating = dataset['rating'].min()
upper_rating = dataset['rating'].max()
print('Review range: {0} to {1}'.format(lower_rating,upper_rating))

Review range: 0.5 to 4.0


In [5]:
import surprise
#So our review range goes from 0.5 to 4, which is a little non-standard (the default for surprise is 1-5). 
#So we will need to change this when we load in our dataset, which is done like this:
reader = surprise.Reader(rating_scale=(0.5,4.0))

In [6]:
data = surprise.Dataset.load_from_df(dataset,reader)

In [8]:
alg = surprise.SVDpp()
output = alg.fit(data.build_full_trainset())

In [9]:
#we can check the predicted score of, for example, user 50 on a music artist 52 using the predict method
pred = alg.predict(uid='50',iid='52')
score = pred.est
print(score)

3.0028030537791928


In [11]:
#Let’s make our recommendations to a particular user. Let’s focus on uid 50 and find one item to recommend them

#Get list of all movie ids
iids = dataset['iid'].unique()

#Get list of movies that uid50 has rated
iids50 = dataset.loc[dataset['uid']==50,'iid']

#Remove the iids that uid50 has rated from the list of all movie iids
iids_to_pred = np.setdiff1d(iids,iids50)

In [12]:
#create another dataset with the iids we want to predict in the sparse format as before of: uid, iid, rating.

testset = [[50,iid,4.] for iid in iids_to_pred]
predictions = alg.test(testset)
predictions

[Prediction(uid=50, iid=14, r_ui=4.0, est=3.1300727028828383, details={'was_impossible': False}),
 Prediction(uid=50, iid=15, r_ui=4.0, est=3.23537738475172, details={'was_impossible': False}),
 Prediction(uid=50, iid=16, r_ui=4.0, est=3.5388115936896862, details={'was_impossible': False}),
 Prediction(uid=50, iid=18, r_ui=4.0, est=3.5694397804620603, details={'was_impossible': False}),
 Prediction(uid=50, iid=19, r_ui=4.0, est=3.436574935600214, details={'was_impossible': False}),
 Prediction(uid=50, iid=20, r_ui=4.0, est=3.260456749689059, details={'was_impossible': False}),
 Prediction(uid=50, iid=21, r_ui=4.0, est=3.246376746280475, details={'was_impossible': False}),
 Prediction(uid=50, iid=22, r_ui=4.0, est=3.5466519140787143, details={'was_impossible': False}),
 Prediction(uid=50, iid=23, r_ui=4.0, est=3.5452324500380024, details={'was_impossible': False}),
 Prediction(uid=50, iid=24, r_ui=4.0, est=3.6704987821634965, details={'was_impossible': False}),
 Prediction(uid=50, iid=2

In [13]:
#As you can see from the output, each prediction is a special object. In order to find the best, we’ll convert this object into an array of the predicted ratings. 
#We’ll then use this to find the iid with the best predicted rating.

In [15]:
pred_ratings = np.array([pred.est for pred in predictions])

In [18]:
#Find the index of maximum predicted rating
i_max = pred_ratings.argmax()

In [19]:
#Find maximum rating iids to recommand
iids = iids_to_pred[i_max]

In [21]:
print('Top item for user 50 has iid {0} with predicted rating {1}'.format(iids,pred_ratings[i_max]))

Top item for user 50 has iid 286 with predicted rating 4.0


In [22]:
#Similarly you can get the top n items for user 50, just replace the argmax() method with the argpartition()

In [24]:
i_max = pred_ratings.argpartition(10)

In [25]:
i_max

array([ 703,  547,  660, ...,    2, 2030, 1015], dtype=int64)

In [26]:
#Find maximum rating iids to recommand
iids = iids_to_pred[i_max]

In [27]:
iids

array([ 743,  587,  700, ...,   16, 2070, 1055], dtype=int64)

In [30]:
for i in iids:
    print('Top item for user 50 has iid {0} with predicted rating {1}'.format(i,pred_ratings[i]))

Top item for user 50 has iid 743 with predicted rating 3.4350174446647537
Top item for user 50 has iid 587 with predicted rating 3.4910704151242835
Top item for user 50 has iid 700 with predicted rating 3.3825265160874767
Top item for user 50 has iid 734 with predicted rating 3.5957441146796043
Top item for user 50 has iid 592 with predicted rating 3.584955824973685
Top item for user 50 has iid 676 with predicted rating 3.4218632745892044
Top item for user 50 has iid 832 with predicted rating 3.167859185270481
Top item for user 50 has iid 100 with predicted rating 3.083223487546837
Top item for user 50 has iid 1003 with predicted rating 3.3105031331885466
Top item for user 50 has iid 523 with predicted rating 3.917377626638785
Top item for user 50 has iid 441 with predicted rating 3.466466728432458
Top item for user 50 has iid 243 with predicted rating 3.07176659692632
Top item for user 50 has iid 453 with predicted rating 3.3349981408938074
Top item for user 50 has iid 461 with predic

IndexError: index 2051 is out of bounds for axis 0 with size 2032

As you probably already know, it is bad practice to fit a model on the whole dataset without checking its performance and tuning parameters which affect the fit. So for the remainder of the tutorial we’ll show you how to tune the parameters of SVD++ and evaluate the performance of the method. The method SVD++, as well as most other matrix factorisation algorithms, will depend on a number of main tuning constants: the dimension DD affecting the size of UU and VV; the learning rate, which affects the performance of the optimisation step; the regularisation term affecting the overfitting of the model; and the number of epochs, which determines how many iterations of optimisation are used.
In this tutorial we’ll tune the learning rate and the regularisation term. SVD++ has more than one learning rate and regularisation term. But surprise lets a fixed value be set for all the learning rate values, and another for all the regularisation terms, so we'll do this for speed. In surprise, tuning is performed using a function called GridSearchCV, which picks the constants which perform the best at predicting a held out testset. This means constant values to try need to be predefined.
First let’s define our list of constant values to check, typically the learning rate is a small value between 0 and 1. In theory, the regularisation parameter can be any positive real value, but in practice it is limited as setting it too small will result in overfitting, while setting it too large will result in poor performance; so trying a list of reasonable values should be fine. The GridSearchCV function can then be used to determine the best performing parameter values using cross validation. We've chosen quite a limited list since this code can take a while to run, as it has to fit multiple models with different parameters.


In [31]:
param_grid = {'lr_all':[.001,.01],'reg_all':[.1,.5]}
gs = surprise.model_selection.GridSearchCV(surprise.SVDpp,param_grid,measures=['rmse','mae'],cv=3)

In [32]:
gs.fit(data)

In [33]:
print(gs.best_params['rmse'])

{'lr_all': 0.01, 'reg_all': 0.1}


The output prints the combination of parameters that gets the best RMSE on a held out test set, RMSE is a way of measuring the prediction error. In this case, we’ve only checked a few tuning constant values, because these procedures can take a while to run. But typically you will try out as many values as possible to get the best performance you can.
The performance of a particular model you’ve chosen can be evaluated using cross validation. This might be used to compare a number of methods for example, or just to check your method is performing reasonably. This can be done by running the following:

In [34]:
alg = surprise.SVDpp(lr_all=.001,reg_all=0.1)

In [35]:
output = surprise.model_selection.cross_validate(alg, data, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8245  0.8362  0.8248  0.8382  0.8260  0.8299  0.0060  
MAE (testset)     0.6525  0.6602  0.6550  0.6628  0.6556  0.6572  0.0037  
Fit time          15.65   20.69   27.38   36.70   20.92   24.27   7.25    
Test time         0.30    0.63    1.36    0.37    0.36    0.61    0.40    
