# Hands-on: movie recommender system
## Collaborative filtering (matrix factorization)

Imagine you run an online retail, travel, or movie review website, and you would like to help your visitors explore more of your products, destinations, or movies. You have data that describes the items themselves, the past transactions, trips, views, or preferences of your visitors, or both. You decide to leverage that data to provide relevant and meaningful recommendations.

This notebook implements a simple collaborative filtering system based on factorization of the user-item matrix.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [0]:
ratings="https://github.com/couturierc/tutorials/raw/master/recommender_system/data/ratings.csv"
movies="https://github.com/couturierc/tutorials/raw/master/recommender_system/data/movies.csv"

# If data stored locally
# ratings="./data/ratings.csv"
# movies="./data/movies.csv"

df_ratings = pd.read_csv(ratings, sep=',')
df_ratings.columns = ['userId', 'itemId', 'rating', 'timestamp']
df_movies = pd.read_csv(movies, sep=',')
df_movies.columns = ['itemId', 'title', 'genres']

In [0]:
df_movies.head()

Unnamed: 0,itemId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [0]:
df_ratings.head()

Unnamed: 0,userId,itemId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


## Quick exploration

Hints: use df.describe(), df.column_name.hist(), scatterplot matrix (sns.pairplot(df[column_range])), correlation matrix (sns.heatmap(df.corr()) ), check duplicates, ...

In [0]:
# Start your exploration -- use as many cells as you need !
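
One possible starting point for the hints above, shown on a tiny made-up ratings table so that it runs without the CSV files. The `toy` dataframe and its values are illustrative assumptions, not taken from the dataset; only the column names match `df_ratings`.

```python
import pandas as pd

# Toy stand-in for df_ratings (made-up values, same column names).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "itemId": [10, 20, 10, 30, 20, 20],
    "rating": [4.0, 5.0, 3.0, 2.5, 4.5, 4.5],
})

# Summary statistics of the ratings.
print(toy["rating"].describe())

# Number of ratings per user and per item.
ratings_per_user = toy.groupby("userId")["rating"].count()
ratings_per_item = toy.groupby("itemId")["rating"].count()

# A (userId, itemId) pair should appear at most once,
# otherwise pivoting to a user-item matrix raises an error.
n_duplicates = int(toy.duplicated(subset=["userId", "itemId"]).sum())
print("duplicated (user, item) pairs:", n_duplicates)
```

The same calls should apply unchanged to `df_ratings` once it is loaded.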


## Obtain the user-item matrix by pivoting df_ratings

In [0]:
##### FILL HERE (1 line) ######
df_user_item = NULL # Use df.pivot, rows ~ userId's, columns ~ itemId's
################################

# Sort index/rows (userId's) and columns (itemId's)
df_user_item.sort_index(axis=0, inplace=True)
df_user_item.sort_index(axis=1, inplace=True)
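
A minimal sketch of one way to fill in the pivot, on a made-up table (the `toy` values and the `toy_user_item` name are assumptions for illustration). Cells without a rating come out as `NaN`.

```python
import pandas as pd

# Toy stand-in for df_ratings (made-up values).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "itemId": [10, 20, 10, 30],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# Rows ~ userId's, columns ~ itemId's, cells ~ ratings (NaN where unrated).
toy_user_item = toy.pivot(index="userId", columns="itemId", values="rating")
toy_user_item = toy_user_item.sort_index(axis=0).sort_index(axis=1)

# Fraction of cells that hold no rating.
sparsity = toy_user_item.isna().to_numpy().mean()
print(toy_user_item)
print("share of missing cells:", sparsity)
```

Even in this 3 x 3 example more than half of the cells are empty; on real rating data the share is typically far higher.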

This matrix has **many** missing values:

In [0]:
df_user_item.head()

itemId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,


In [0]:
df_user_item.describe()

itemId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
count,215.0,110.0,52.0,7.0,49.0,102.0,54.0,8.0,16.0,132.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
mean,3.92093,3.431818,3.259615,2.357143,3.071429,3.946078,3.185185,2.875,3.125,3.496212,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0
std,0.834859,0.881713,1.054823,0.852168,0.907148,0.817224,0.977561,1.125992,0.974679,0.859381,...,,,,,,,,,,
min,0.5,0.5,0.5,1.0,0.5,1.0,1.0,1.0,1.5,0.5,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0
25%,3.5,3.0,3.0,1.75,3.0,3.125,3.0,2.75,2.875,3.0,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0
50%,4.0,3.5,3.0,3.0,3.0,4.0,3.0,3.0,3.0,3.5,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0
75%,4.5,4.0,4.0,3.0,3.5,4.5,4.0,3.0,3.25,4.0,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0
max,5.0,5.0,5.0,3.0,5.0,5.0,5.0,5.0,5.0,5.0,...,3.5,3.0,4.0,4.0,3.5,4.0,3.5,3.5,3.5,4.0


For instance, the ratings of userId=1 for movies with itemId 1 to 10:

In [0]:
df_user_item.loc[1][:10]

itemId
1     4.0
2     NaN
3     4.0
4     NaN
5     NaN
6     4.0
7     NaN
8     NaN
9     NaN
10    NaN
Name: 1, dtype: float64

In [0]:
# df_user_item.loc[1].dropna().sort_values(ascending=False)

Save the movie ids for user 1 for later:

In [0]:
item_rated_user_1 = df_user_item.loc[1].dropna().index
item_rated_user_1

Int64Index([   1,    3,    6,   47,   50,   70,  101,  110,  151,  157,
            ...
            3671, 3702, 3703, 3729, 3740, 3744, 3793, 3809, 4006, 5060],
           dtype='int64', name='itemId', length=232)

We want to find the matrix of rank $k$ which is closest to the original matrix.



## What not to do: Fill with 0's or mean values, then Singular Value Decomposition (SVD)

(Adapted from https://github.com/beckernick/matrix_factorization_recommenders/blob/master/matrix_factorization_recommender.ipynb)

Singular Value Decomposition factorizes a matrix $R$ into two unitary matrices and a diagonal matrix; keeping only the $k$ largest singular values then gives the best rank-$k$ (i.e. smaller/simpler) approximation of the original matrix $R$ in the least-squares sense. Mathematically:

$$\begin{equation}
R = U\Sigma V^{T}
\end{equation}$$

where: 
- $R$ is the users' ratings matrix,
- $U$ is the user "features" matrix; it represents how much users "like" each feature,
- $\Sigma$ is the diagonal matrix of singular values (essentially weights),
- $V^{T}$ is the movie "features" matrix; it represents how relevant each feature is to each movie,

with $U$ and $V^{T}$ orthogonal.
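
The decomposition and its rank-$k$ truncation can be checked numerically. The sketch below uses `np.linalg.svd` on a small random matrix (`R_toy` is made up for illustration), which returns the singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(0)
R_toy = rng.random((6, 4))  # small dense matrix with made-up values

# Thin SVD: R_toy = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(R_toy, full_matrices=False)

# The columns of U and the rows of Vt are orthonormal.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(4))

# Keep the k largest singular values to get a rank-k approximation.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(R_toy - R_k)
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```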

In [0]:
df_user_item = df_user_item.fillna(0)
df_user_item.head()

itemId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [0]:
R = df_user_item.values

In [0]:
R

array([[4. , 0. , 4. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       ...,
       [2.5, 2. , 2. , ..., 0. , 0. , 0. ],
       [3. , 0. , 0. , ..., 0. , 0. , 0. ],
       [5. , 0. , 0. , ..., 0. , 0. , 0. ]])

Apply SVD to R (e.g. using NumPy or SciPy)

In [0]:
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R, k = 50)

What do $U$, $\Sigma$, $V^T$ look like?

In [0]:
U

array([[ 0.0169691 ,  0.00367604,  0.01446425, ...,  0.01089745,
         0.06167385,  0.05555415],
       [ 0.0051006 ,  0.00091157,  0.01728319, ...,  0.00442345,
        -0.01773772,  0.0058663 ],
       [-0.00098164, -0.00537681, -0.00556381, ..., -0.00171517,
         0.00206861,  0.00135323],
       ...,
       [ 0.15137458, -0.10809064, -0.05384249, ...,  0.00976291,
         0.01184704,  0.11611442],
       [-0.00911714,  0.01038794, -0.00867598, ...,  0.03974124,
         0.01378463,  0.00757944],
       [-0.01073109,  0.02748873,  0.05781424, ..., -0.09267536,
        -0.20218445,  0.13886488]])

In [0]:
sigma

array([ 67.8676482 ,  68.43455046,  69.07855191,  69.50676339,
        69.93495369,  70.02143448,  70.20660519,  71.70985332,
        72.46953282,  73.2246949 ,  73.45188037,  74.05266585,
        74.29201322,  74.96494138,  75.40667214,  75.6272454 ,
        76.71225804,  78.00723454,  78.84651534,  79.16948319,
        79.52408732,  80.86997674,  81.73690785,  82.40743887,
        83.04476272,  85.15393734,  86.05702164,  87.29627026,
        88.83466993,  90.42515264,  90.97607986,  92.32408574,
        93.40879296,  97.11713355,  99.28999246,  99.87323589,
       102.05675293, 105.97376877, 107.93266172, 109.60313933,
       113.11144323, 121.44217651, 122.66302989, 135.65556768,
       147.33575651, 154.552948  , 170.42250831, 191.1508762 ,
       231.23661142, 534.41989777])

In [0]:
Vt

array([[-4.82157420e-02,  1.34410623e-03,  4.23829329e-03, ...,
        -1.19645832e-03, -1.19645832e-03,  3.02375151e-03],
       [ 1.66110170e-02, -3.11283049e-02,  1.32801055e-02, ...,
         1.22335963e-03,  1.22335963e-03, -1.71230857e-03],
       [-6.99543488e-02, -1.05175632e-02,  3.05311947e-02, ...,
        -1.19421011e-04, -1.19421011e-04, -4.26918965e-04],
       ...,
       [ 7.84438842e-02,  5.68447103e-02,  1.80051145e-02, ...,
        -8.71093879e-05, -8.71093879e-05,  1.22833344e-04],
       [ 2.75911949e-02,  2.06662722e-03,  2.47146155e-02, ...,
        -5.97586244e-04, -5.97586244e-04, -1.27236200e-03],
       [ 7.04498985e-02,  3.85393459e-02,  1.59129220e-02, ...,
         6.46836073e-05,  6.46836073e-05,  2.71729303e-04]])

Get recommendations:

In [0]:
# First make sigma a diagonal matrix:
sigma = np.diag(sigma)

In [0]:
R_after_svd = np.dot(np.dot(U, sigma), Vt)
R_after_svd

array([[ 2.18187197e+00,  3.93674189e-01,  8.38185756e-01, ...,
        -2.49842711e-02, -2.49842711e-02, -5.89881001e-02],
       [ 2.09809067e-01,  4.82051887e-03,  3.07424005e-02, ...,
         1.88951263e-02,  1.88951263e-02,  3.19658766e-02],
       [ 1.33940814e-02,  3.47258164e-02,  5.05247472e-02, ...,
        -1.61232411e-03, -1.61232411e-03, -5.29984436e-04],
       ...,
       [ 2.30963539e+00,  2.70243898e+00,  2.26419696e+00, ...,
        -1.25165145e-02, -1.25165145e-02,  9.27520866e-02],
       [ 7.83182598e-01,  5.30142683e-01,  9.79748203e-02, ...,
         9.84577917e-04,  9.84577917e-04, -5.49383653e-03],
       [ 5.35809290e+00, -2.88817350e-01, -9.07680249e-02, ...,
        -2.79227416e-02, -2.79227416e-02,  3.55476113e-02]])

Drawbacks of this approach: 
- the missing values (here filled with 0's) are feedback that the user did not give; we should not treat them as negative or null ratings.
- the dense matrix is huge, so applying SVD to it does not scale.
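
To make the first drawback concrete, here is a small sketch comparing a loss computed on the zero-filled matrix with a loss restricted to the known ratings. The matrices `R_obs` and `R_hat` are made-up assumptions.

```python
import numpy as np

# Made-up ratings: NaN marks a rating the user never gave.
R_obs = np.array([[4.0, np.nan, 5.0],
                  [np.nan, 3.0, np.nan]])
# Some candidate reconstruction that predicts 4.0 everywhere.
R_hat = np.full(R_obs.shape, 4.0)

mask = ~np.isnan(R_obs)

# Treating missing entries as 0 penalises R_hat for not predicting 0 there.
loss_zero_filled = np.mean((np.nan_to_num(R_obs) - R_hat) ** 2)

# Restricting the loss to known ratings ignores the missing entries.
loss_observed = np.mean((R_obs[mask] - R_hat[mask]) ** 2)

print(loss_zero_filled, loss_observed)
```

The zero-filled loss is dominated by entries that carry no information about the user's taste, which pushes the factorization towards predicting 0 for unrated movies.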

## Approximate SVD with stochastic gradient descent (SGD)


This time, we do **not** fill missing values. 

We absorb $\Sigma$ into $U$ and $V$, and try to find $P$ and $Q$ such that $\widehat{R} = P Q^{T}$ is close to $R$ **for the user-item pairs already rated**.


A first function simplifies the entries (userId/itemId): we map the set of user ids and the set of item ids to consecutive integers starting at 0.

In [0]:
def encode_ids(data):
    '''Takes a rating dataframe and returns:
    - a simplified rating dataframe with ids in range(nb unique ids) for users and movies
    - 2 mapping dictionaries (encoded id -> original id), one for users and one for items
    
    '''

    data_encoded = data.copy()
    
    users = pd.DataFrame(data_encoded.userId.unique(),columns=['userId'])  # df of all unique users
    dict_users = users.to_dict()    
    inv_dict_users = {v: k for k, v in dict_users['userId'].items()}

    items = pd.DataFrame(data_encoded.itemId.unique(),columns=['itemId']) # df of all unique items
    dict_items = items.to_dict()    
    inv_dict_items = {v: k for k, v in dict_items['itemId'].items()}

    data_encoded.userId = data_encoded.userId.map(inv_dict_users)
    data_encoded.itemId = data_encoded.itemId.map(inv_dict_items)

    return data_encoded, dict_users, dict_items
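
The same kind of encoding can also be sketched with `pd.factorize`, which assigns codes 0, 1, 2, ... in order of first appearance. This is an alternative illustration on made-up ids, not the function used in this notebook; the names `code_to_user` and `code_to_item` are hypothetical, and they are flat dictionaries rather than the nested ones returned by `encode_ids`.

```python
import pandas as pd

# Toy ratings with arbitrary (non-consecutive) ids, made up for illustration.
toy = pd.DataFrame({
    "userId": [7, 7, 42, 13],
    "itemId": [500, 10, 500, 99],
    "rating": [4.0, 3.0, 5.0, 2.0],
})

# factorize returns the integer codes and the unique original ids.
user_codes, user_ids = pd.factorize(toy["userId"])
item_codes, item_ids = pd.factorize(toy["itemId"])

toy_encoded = toy.assign(userId=user_codes, itemId=item_codes)

# Decoding: encoded id (position) -> original id.
code_to_user = dict(enumerate(user_ids))
code_to_item = dict(enumerate(item_ids))

print(toy_encoded)
```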
  

Here is the procedure we would like to implement in the function SGD():

1.   initialize P and Q to random values

2.   for $n_{epochs}$ passes on the data:

    *   for all known ratings $r_{ui}$
        *   compute the error between the predicted rating $p_u \cdot q_i$ and the known rating $r_{ui}$:
        $$ err = r_{ui} - p_u \cdot q_i $$
        *   update $p_u$ and $q_i$ with the following rule:
        $$ p_u \leftarrow p_u + \alpha \cdot err \cdot q_i  $$
        $$ q_i \leftarrow q_i + \alpha \cdot err \cdot p_u$$
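
A single update can be worked through by hand. The update rule is a gradient step on the squared error $(r_{ui} - p_u \cdot q_i)^2$, with the constant factor absorbed into $\alpha$. The numbers below are made up, and the sketch assumes both vectors are updated from their old values.

```python
import numpy as np

alpha = 0.1
p_u = np.array([1.0, 0.0])   # made-up user factors
q_i = np.array([0.5, 1.0])   # made-up item factors
r_ui = 2.0                   # made-up known rating

err = r_ui - p_u @ q_i                 # 2.0 - 0.5 = 1.5
p_u_new = p_u + alpha * err * q_i      # [1.075, 0.15]
q_i_new = q_i + alpha * err * p_u      # [0.65, 1.0], computed from the old p_u

# The prediction moves towards r_ui, so the error shrinks.
new_err = r_ui - p_u_new @ q_i_new
print(err, new_err)
```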







In [0]:
# Adapted from http://nicolas-hug.com/blog/matrix_facto_4
def SGD(data,           # dataframe containing 1 user|item|rating per row
        n_factors = 10, # number of factors
        alpha = .01,    # learning rate
        n_epochs = 3,   # number of iteration of the SGD procedure
       ):
    '''Learn the vectors P and Q (ie all the weights p_u and q_i) with SGD.
    '''

    # Encoding userId's and itemId's in data
    data, dict_users, dict_items = encode_ids(data)
    
    ##### FILL HERE (2 lines) ######
    n_users = NULL  # number of unique users
    n_items = NULL  # number of unique items
    ################################
    
    # Randomly initialize the user and item factors.
    p = np.random.normal(0, .1, (n_users, n_factors))
    q = np.random.normal(0, .1, (n_items, n_factors))

    # Optimization procedure
    for epoch in range(n_epochs):
        print ('epoch: ', epoch)
        # Loop over the rows in data
        for row in data.itertuples(index=False):
            u = int(row.userId)      # current userId = position in the p vector (thanks to the encoding)
            i = int(row.itemId)      # current itemId = position in the q vector
            r_ui = float(row.rating) # rating associated to the couple (user u , item i)
            
            ##### FILL HERE (1 line) ######
            err = NULL    # difference between the predicted rating (p_u . q_i) and the known ratings r_ui
            ################################
            
            # Update vectors p_u and q_i
            ##### FILL HERE (2 lines) ######
            p[u] = NULL  # cf. update rule above 
            q[i] = NULL
            ################################
            
    return p, q
    
    
def estimate(u, i, p, q):
    '''Estimate rating of user u for item i.'''
    ##### FILL HERE (1 line) ######
    return NULL             #scalar product of p[u] and q[i] /!\ dimensions
    ################################  
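
For reference, here is one possible way to complete the procedure, written as a standalone sketch on already-encoded toy data. The names `sgd_sketch` and `estimate_sketch`, the toy ratings, and the hyperparameter values are assumptions for illustration, not the solution intended by the notebook.

```python
import numpy as np

def sgd_sketch(users, items, ratings, n_users, n_items,
               n_factors=2, alpha=0.05, n_epochs=200, seed=0):
    """Learn user factors p and item factors q from (u, i, r_ui) triples."""
    rng = np.random.default_rng(seed)
    p = rng.normal(0, 0.1, (n_users, n_factors))
    q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r_ui in zip(users, items, ratings):
            err = r_ui - p[u] @ q[i]        # error on the known rating
            p_u_old = p[u].copy()           # keep old p_u for the q_i update
            p[u] += alpha * err * q[i]
            q[i] += alpha * err * p_u_old
    return p, q

def estimate_sketch(u, i, p, q):
    """Estimated rating of user u for item i (scalar product)."""
    return float(p[u] @ q[i])

# Toy encoded data: ids are already in range(n_users) / range(n_items).
users = np.array([0, 0, 1, 1, 2])
items = np.array([0, 1, 0, 2, 1])
ratings = np.array([4.0, 5.0, 3.0, 2.0, 4.5])

p, q = sgd_sketch(users, items, ratings, n_users=3, n_items=3)
train_rmse = np.sqrt(np.mean([(r - estimate_sketch(u, i, p, q)) ** 2
                              for u, i, r in zip(users, items, ratings)]))
print("training RMSE:", train_rmse)
```

In the notebook's `SGD` function, `n_users` and `n_items` could be obtained with `data.userId.nunique()` and `data.itemId.nunique()`.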

In [0]:
p, q = SGD(df_ratings)

epoch:  0
epoch:  1
epoch:  2


## Get the estimate for all user-item pairs:

Get the user-item matrix filled with predicted ratings:

In [0]:
df_user_item_filled = pd.DataFrame(np.dot(p, q.transpose()))
df_user_item_filled.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9714,9715,9716,9717,9718,9719,9720,9721,9722,9723
0,4.19512,3.64665,4.770994,4.723633,5.166779,3.863587,4.428188,4.524247,5.169087,3.215067,...,1.338584,1.327075,1.30737,1.328688,1.112436,0.802079,1.793594,1.202617,1.710734,1.233806
1,1.710783,1.791638,2.21413,2.199346,2.125607,2.036473,2.172976,1.57997,2.389387,1.649571,...,0.653825,0.647394,0.592676,0.6419,0.487238,0.301559,0.806451,0.583298,0.801552,0.620889
2,0.789019,0.856004,0.977775,1.088079,1.153905,0.914294,1.016302,0.772786,1.201006,0.76917,...,0.392188,0.315683,0.344761,0.269493,0.230205,0.164607,0.32116,0.238634,0.387512,0.27988
3,3.185667,2.945218,3.657897,3.707626,3.93868,2.942156,3.511712,3.461051,4.050755,2.494731,...,1.113513,1.040798,1.063283,1.047779,0.931484,0.704998,1.413862,0.924568,1.383933,0.974599
4,2.471003,1.797064,2.496036,2.70422,3.040842,1.854444,2.146399,2.9666,2.763359,1.541677,...,0.704799,0.651197,0.796864,0.70516,0.582905,0.472495,0.938227,0.585541,0.93337,0.634422


However, it uses the encoded ids; we need to retrieve the mapping from encoded ids to original ids and apply it:

In [0]:
df_ratings_encoded, dict_users, dict_items = encode_ids(df_ratings)

In [0]:
df_user_item_filled.rename(columns=(dict_items['itemId']), inplace=True)
df_user_item_filled.rename(index=(dict_users['userId']), inplace=True)

# Sort index/rows (userId's) and columns (itemId's)
df_user_item_filled.sort_index(axis=0, inplace=True)
df_user_item_filled.sort_index(axis=1, inplace=True)

df_user_item_filled.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.19512,4.115866,3.64665,1.676802,3.344185,4.770994,3.847754,2.436408,3.27079,4.443133,...,0.487156,0.604013,1.021152,0.895469,0.403743,0.415509,0.676512,0.722694,0.658691,1.364273
2,1.710783,1.815002,1.791638,0.80871,1.54288,2.21413,1.848502,1.195241,1.508239,2.066793,...,0.181261,0.340422,0.538887,0.493597,0.254402,0.24664,0.337079,0.246432,0.367124,0.790713
3,0.789019,0.885836,0.856004,0.410771,0.6947,0.977775,0.883386,0.524789,0.684351,0.999344,...,0.081667,0.173749,0.185066,0.21005,0.105538,0.070614,0.182921,0.093267,0.230083,0.390324
4,3.185667,3.250935,2.945218,1.355171,2.682441,3.657897,3.149466,1.969156,2.678482,3.590795,...,0.367021,0.474567,0.717169,0.71568,0.352299,0.421521,0.539508,0.495842,0.630814,1.087694
5,2.471003,2.302174,1.797064,0.952661,1.760048,2.496036,1.997266,1.211738,1.802595,2.371534,...,0.285611,0.234617,0.430081,0.366324,0.153739,0.199917,0.36246,0.506857,0.343069,0.724392


Originally available ratings for user 1:

In [0]:
df_user_item.loc[1][:10]

itemId
1     4.0
2     0.0
3     4.0
4     0.0
5     0.0
6     4.0
7     0.0
8     0.0
9     0.0
10    0.0
Name: 1, dtype: float64

Estimated ratings after the approximate SVD:

In [0]:
df_user_item_filled.loc[1][:10]

1     4.195120
2     4.115866
3     3.646650
4     1.676802
5     3.344185
6     4.770994
7     3.847754
8     2.436408
9     3.270790
10    4.443133
Name: 1, dtype: float64

## Give recommendation to a user

For instance, 10 recommended movies for user 10:

In [0]:
recommendations = list((df_user_item_filled.loc[10]).sort_values(ascending=False)[:10].index)
recommendations

[2959, 1104, 1223, 1272, 2324, 2571, 1267, 4993, 4226, 898]

In [0]:
df_movies[df_movies.itemId.isin(recommendations)]

Unnamed: 0,itemId,title,genres
680,898,"Philadelphia Story, The (1940)",Comedy|Drama|Romance
841,1104,"Streetcar Named Desire, A (1951)",Drama
924,1223,"Grand Day Out with Wallace and Gromit, A (1989)",Adventure|Animation|Children|Comedy|Sci-Fi
966,1267,"Manchurian Candidate, The (1962)",Crime|Thriller|War
971,1272,Patton (1970),Drama|War
1730,2324,Life Is Beautiful (La Vita è bella) (1997),Comedy|Drama|Romance|War
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller
3141,4226,Memento (2000),Mystery|Thriller
3638,4993,"Lord of the Rings: The Fellowship of the Ring,...",Adventure|Fantasy


vs. the ones this user rated highest initially:

In [0]:
already_rated = list((df_user_item.loc[10]).sort_values(ascending=False)[:10].index)
already_rated

[49286, 81845, 7458, 79091, 71579, 91529, 140110, 96079, 49272, 92259]

In [0]:
df_movies[df_movies.itemId.isin(already_rated)]

Unnamed: 0,itemId,title,genres
4948,7458,Troy (2004),Action|Adventure|Drama|War
6346,49272,Casino Royale (2006),Action|Adventure|Thriller
6352,49286,"Holiday, The (2006)",Comedy|Romance
7156,71579,"Education, An (2009)",Drama|Romance
7371,79091,Despicable Me (2010),Animation|Children|Comedy|Crime
7466,81845,"King's Speech, The (2010)",Drama
7768,91529,"Dark Knight Rises, The (2012)",Action|Adventure|Crime|IMAX
7802,92259,Intouchables (2011),Comedy|Drama
7955,96079,Skyfall (2012),Action|Adventure|Thriller|IMAX
9006,140110,The Intern (2015),Comedy


The recommendations are drawn from all the movies, in descending order of predicted rating. Let's remove the ones that were already rated.
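
A minimal sketch of that filtering step, on made-up predictions for a single user (the `predicted` values and the `already_seen` list are assumptions for illustration):

```python
import pandas as pd

# Made-up predicted ratings for one user, indexed by itemId.
predicted = pd.Series({10: 4.8, 20: 4.5, 30: 4.1, 40: 3.9, 50: 3.0})
# Items this user has already rated (hypothetical).
already_seen = [10, 30]

# Drop the rated items, then keep the n best remaining predictions.
n = 2
new_recommendations = list(
    predicted.drop(index=already_seen).sort_values(ascending=False)[:n].index
)
print(new_recommendations)  # [20, 40]
```

In the notebook, the rated items of a user could be taken from `df_ratings[df_ratings.userId == 10].itemId` and dropped from `df_user_item_filled.loc[10]` in the same way.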




---



To put this into production, you would first split the data into a training set and a validation set, then optimize the number of latent factors (n_factors) by minimizing the Root Mean Square Error on the validation set. 
It is easier to use a framework that supports this, along with cross-validation, grid search, etc.
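
As a rough sketch of what such a framework automates, a hold-out split and the RMSE can be written by hand. The 80/20 split, the number of ratings, and the helper name `rmse` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ratings = 10                       # made-up number of known ratings
indices = rng.permutation(n_ratings)

# Hold out 20% of the known ratings for validation.
n_valid = int(0.2 * n_ratings)
valid_idx, train_idx = indices[:n_valid], indices[n_valid:]

def rmse(y_true, y_pred):
    """Root Mean Square Error between known and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(rmse([4.0, 3.0, 5.0], [3.5, 3.0, 4.0]))
```

The model would be trained on the rows in `train_idx` only, and `rmse` evaluated on the rows in `valid_idx` for each candidate value of `n_factors`.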

# Gradient Descent SVD using Surprise

In [0]:
!pip install surprise
#!pip install scikit-surprise # if the first line does not work

Collecting surprise
  Downloading https://files.pythonhosted.org/packages/61/de/e5cba8682201fcf9c3719a6fdda95693468ed061945493dea2dd37c5618b/surprise-0.1-py2.py3-none-any.whl
Collecting scikit-surprise (from surprise)
  Downloading https://files.pythonhosted.org/packages/4d/fc/cd4210b247d1dca421c25994740cbbf03c5e980e31881f10eaddf45fdab0/scikit-surprise-1.0.6.tar.gz (3.3MB)
     |████████████████████████████████| 3.3MB 4.3MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/ec/c0/55/3a28eab06b53c220015063ebbdb81213cd3dcbb72c088251ec
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.0.6 surprise-0.1


In [0]:
# Following Surprise documentation examples 
# https://surprise.readthedocs.io/en/stable/getting_started.html

from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from collections import defaultdict

# As we're loading a custom dataset, we need to define a reader.
reader = Reader(rating_scale=(0.5, 5))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df_ratings[['userId', 'itemId', 'rating']], reader)

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8790  0.8691  0.8718  0.8705  0.8735  0.8728  0.0034  
MAE (testset)     0.6759  0.6637  0.6710  0.6686  0.6725  0.6703  0.0041  
Fit time          6.16    6.17    6.02    6.00    6.06    6.08    0.07    
Test time         0.15    0.25    0.15    0.24    0.16    0.19    0.04    


{'fit_time': (6.163897752761841,
  6.169508457183838,
  6.022744417190552,
  6.003763914108276,
  6.055308103561401),
 'test_mae': array([0.67592377, 0.66373146, 0.67098329, 0.66855377, 0.67245158]),
 'test_rmse': array([0.87898001, 0.86914204, 0.8718201 , 0.87049578, 0.87348853]),
 'test_time': (0.15217208862304688,
  0.24907231330871582,
  0.15253186225891113,
  0.23854446411132812,
  0.15514850616455078)}

#### Tune algorithm parameters with GridSearchCV



In [0]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.8943199768202422
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [0]:
# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator['rmse']
trainset = data.build_full_trainset()
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f770a4ed390>

In [0]:
algo.predict(621,1)

Prediction(uid=621, iid=1, r_ui=None, est=3.782323475434416, details={'was_impossible': False})

In [0]:
df_data = data.df
# Merge on the itemId column so that each rating gets the title and genres of its own movie
df_data = df_data.merge(df_movies, how="left", on='itemId')
df_data[df_data['userId']==1].sort_values(by = 'rating',ascending=False)[:10]


In [0]:
# From Surprise documentation: https://surprise.readthedocs.io/en/stable/FAQ.html
def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [0]:
# Predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
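
Conceptually, the anti-testset is the set of (user, item) pairs that are absent from the training data. A plain-Python sketch of that idea on made-up ids (this illustrates the concept only, not Surprise's internal implementation):

```python
from itertools import product

# Made-up known ratings: (userId, itemId, rating).
known = [(1, 10, 4.0), (1, 20, 5.0), (2, 10, 3.0)]

all_users = sorted({u for u, _, _ in known})
all_items = sorted({i for _, i, _ in known})
rated_pairs = {(u, i) for u, i, _ in known}

# Every (user, item) combination that has no known rating.
unrated_pairs = [(u, i) for u, i in product(all_users, all_items)
                 if (u, i) not in rated_pairs]
print(unrated_pairs)  # [(2, 20)]
```

The number of such pairs grows with the product of the number of users and items, which is why predicting on the full anti-testset can take a while on larger datasets.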

In [0]:
top_n = get_top_n(predictions, n=10)

In [0]:
top_n.items()

dict_items([(1, [(318, 4.734885597825979), (750, 4.684112490251269), (858, 4.668114608446733), (1204, 4.661303643781753), (904, 4.650780222998474), (48516, 4.61959146862477), (1221, 4.606729861177805), (912, 4.603535216130644), (1104, 4.59829042087042), (1276, 4.596100144700121)]), (2, [(750, 4.223886829247538), (858, 4.204682386162703), (904, 4.2018042990394955), (2959, 4.196618987712081), (296, 4.17323810979213), (922, 4.153981049483229), (50, 4.153739894783362), (3275, 4.153078940419774), (1213, 4.151097062417556), (260, 4.150635998561021)]), (3, [(318, 3.4682812393504254), (1104, 3.428976257843655), (750, 3.4267936068826015), (858, 3.399194940755246), (1204, 3.393205863998405), (2959, 3.3906188652466094), (741, 3.3760247903890166), (904, 3.3734378768936732), (296, 3.3619411108304806), (50, 3.3530774277328455)]), (4, [(318, 4.045896457529509), (750, 3.9949423592376334), (858, 3.9783160515692613), (1204, 3.9574832128408617), (48516, 3.9303375810657566), (50, 3.9293407188507508), (122

In [0]:
# Print the recommended items for user 1
print(1, [iid for (iid, _) in top_n[1]])

1 [318, 750, 858, 1204, 904, 48516, 1221, 912, 1104, 1276]


In [0]:
df_movies[df_movies.itemId.isin([iid for (iid, _) in top_n[1]])]

Unnamed: 0,itemId,title,genres
277,318,"Shawshank Redemption, The (1994)",Crime|Drama
602,750,Dr. Strangelove or: How I Learned to Stop Worr...,Comedy|War
659,858,"Godfather, The (1972)",Crime|Drama
686,904,Rear Window (1954),Mystery|Thriller
694,912,Casablanca (1942),Drama|Romance
841,1104,"Streetcar Named Desire, A (1951)",Drama
906,1204,Lawrence of Arabia (1962),Adventure|Drama|War
922,1221,"Godfather: Part II, The (1974)",Crime|Drama
975,1276,Cool Hand Luke (1967),Drama
6315,48516,"Departed, The (2006)",Crime|Drama|Thriller
