# LAB 10: Recommender Systems based on Collaborative filtering

In this lab, we will develop a recommender system based on collaborative filtering techniques.

We will use the MovieLens dataset (https://grouplens.org/datasets/movielens/).

* The dataset has been downloaded from http://files.grouplens.org/datasets/movielens/ml-100k.zip.

* The full dataset contains 100,000 ratings by 943 users on 1,682 items.
* Each user has rated at least 20 movies.  
* Users and items are numbered consecutively from 1.  


## Import libraries

First, we import libraries used in this lab.

In [953]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

# Plot data
import matplotlib.pyplot as plt
%matplotlib inline

## Load data

The code below expects the extracted data in the `./ml-100k/` directory. Adjust `path` if your copy is stored elsewhere (for example under '../../ressources/LAB11_Collaborative_filtering/ml-100k').


In [954]:
path = './ml-100k/'
header = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = pd.read_csv(path+'u.data', sep='\t', names=header)

## Explore and prepare data


Display the dataframe top rows.

In [955]:
ratings.head()

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


In [956]:
ratings.shape

(100000, 4)

Get the number of unique users and unique items (the items are movies).

In [957]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.item_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

Number of users = 943 | Number of movies = 1682


## Split the dataset into train and test data

We split the ratings at random into training and test sets, keeping 25% of the ratings for testing.

In [958]:
train_data, test_data = train_test_split(ratings, test_size=0.25)
print(train_data)
print(test_data)

       user_id  item_id  rating  timestamp
54729      435      222       3  884132027
55369      745      515       4  880122863
99931      379      621       4  880525815
69177      851     1047       3  874789005
67206      846     1311       2  883950712
69536      906      696       4  879435758
45985      560      111       3  879976731
43187        7      540       3  892132972
93031      735     1012       2  876698897
23998      145      929       2  888398069
41147        5      396       5  875636265
60879      823      124       4  878437925
90341      429      936       4  882385934
11867        7      185       5  892135346
84641      717      121       2  884642762
53511       85      173       3  879454045
79239      880      375       1  880242782
62618      848      899       3  887037471
91816      648      364       5  884882528
59798      452      132       2  875560255
61139      840      506       5  891204385
80178      937       14       4  876769080
45401      
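
Because the split above is random, the RMSE values reported later in this lab change from run to run. `train_test_split` accepts a `random_state` argument that fixes the shuffle. The sketch below illustrates the same idea with NumPy only; `seeded_split` and the seed value 42 are hypothetical names and choices, not part of the lab.

```python
import numpy as np

def seeded_split(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) row positions for a reproducible random split."""
    rng = np.random.default_rng(seed)           # seeded generator: same seed, same shuffle
    shuffled = rng.permutation(n_rows)          # random order of the row positions
    n_test = int(np.ceil(n_rows * test_size))   # number of rows kept for testing
    return shuffled[n_test:], shuffled[:n_test]

# 100000 rows, as in the ratings table of this lab.
train_idx, test_idx = seeded_split(100000, test_size=0.25, seed=42)
print(len(train_idx), len(test_idx))
```

With the lab's DataFrame, the positions could be applied as `ratings.iloc[train_idx]` and `ratings.iloc[test_idx]`. The shorter route is to call `train_test_split(ratings, test_size=0.25, random_state=42)` directly.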

## Collaborative filtering

We will consider memory-based and model-based collaborative filtering.

First, we create the user-item matrix for training data.

In [959]:
# Create the user-item matrix for the training data (rows: users, columns: items)
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    # line = (index, user_id, item_id, rating, timestamp); ids start at 1, array indices at 0
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
print(train_data_matrix)

[[0. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]]


We create the user-item matrix for testing data.

In [960]:
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]
    
print(test_data_matrix)

[[5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
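
The two loops above fill the matrices one rating at a time. As a minimal sketch, the same construction can be written with NumPy integer-array indexing. The helper name `build_user_item_matrix` and the toy data are hypothetical and only serve the illustration.

```python
import numpy as np

def build_user_item_matrix(user_ids, item_ids, values, n_users, n_items):
    """Fill an (n_users, n_items) matrix from 1-based ids; 0 means 'not rated'."""
    matrix = np.zeros((n_users, n_items))
    # Subtract 1 because ids start at 1 while array indices start at 0.
    matrix[np.asarray(user_ids) - 1, np.asarray(item_ids) - 1] = values
    return matrix

# Toy data (hypothetical): 3 users, 4 items, 4 ratings.
toy = build_user_item_matrix(
    user_ids=[1, 1, 2, 3],
    item_ids=[1, 3, 2, 4],
    values=[5, 3, 4, 1],
    n_users=3,
    n_items=4,
)
print(toy)
```

With the lab's data, the call would presumably look like `build_user_item_matrix(train_data.user_id.to_numpy(), train_data.item_id.to_numpy(), train_data.rating.to_numpy(), n_users, n_items)`.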


### Memory-Based Collaborative Filtering

The main idea behind **memory-based** collaborative filtering is to compute the **similarities** between users and/or items and to use them as "weights" when predicting the rating a user would give to an item.

We will test both:

* Item-Item Collaborative Filtering
* User-Item Collaborative Filtering

We use the cosine similarity, computed with the `cosine_similarity` function from the `sklearn.metrics.pairwise` module.

First, compute the similarity between users.

In [961]:
from sklearn.metrics import pairwise
user_similarity = pairwise.cosine_similarity(train_data_matrix)
print(user_similarity.shape)

(943, 943)
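
To make the similarity measure concrete, here is a minimal NumPy sketch of cosine similarity between the rows of a small matrix. The function name `cosine_similarity_rows` and the toy ratings are hypothetical. The sketch is meant to mirror what the library call computes on a dense matrix, not to replace it.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows of a dense matrix."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # all-zero rows get similarity 0 instead of NaN
    unit_rows = matrix / norms         # scale each row to length 1
    return unit_rows @ unit_rows.T     # dot products of unit vectors are cosines

# Toy ratings (hypothetical): 3 users x 3 items, 0 means 'not rated'.
toy_ratings = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
])
toy_similarity = cosine_similarity_rows(toy_ratings)
print(np.round(toy_similarity, 3))
```

Users 1 and 2 share a rated item and get a positive similarity, while user 3 has no rated item in common with them and gets a similarity of 0. Calling the function on the transposed matrix gives item-item similarities.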


Display the similarities between the first 25 users. The similarity between items (movies) is computed in the cell that follows this output.

In [962]:
user_similarity[:25, 0:25]

array([[1.        , 0.12277321, 0.0443837 , 0.01913439, 0.26252322,
        0.34500381, 0.33618687, 0.18174035, 0.05023975, 0.30448736,
        0.25178965, 0.20723227, 0.38381141, 0.23187204, 0.13270086,
        0.27996074, 0.17837499, 0.36882912, 0.10539678, 0.19877594,
        0.14918994, 0.22489869, 0.30212389, 0.20371273, 0.21628374],
       [0.12277321, 1.        , 0.10899961, 0.14192275, 0.04243365,
        0.20167526, 0.0822419 , 0.04115023, 0.12140179, 0.06791461,
        0.10945425, 0.07472248, 0.19247136, 0.13356013, 0.46417032,
        0.09908782, 0.21224712, 0.1090637 , 0.08494904, 0.06833907,
        0.1278567 , 0.03377375, 0.1322182 , 0.11125677, 0.09397444],
       [0.0443837 , 0.10899961, 1.        , 0.17615786, 0.02830531,
        0.04217824, 0.05021873, 0.05426119, 0.        , 0.01632876,
        0.06610565, 0.01459703, 0.11537311, 0.02831578, 0.07196062,
        0.04442544, 0.00691041, 0.01071658, 0.0674498 , 0.06230016,
        0.09536596, 0.0446941 , 0.02639441, 0.

In [963]:
# Transpose the matrix so that rows are items, then compute the cosine similarity between items (movies)
item_similarity = pairwise.cosine_similarity(train_data_matrix.T)

In [964]:
item_similarity[:5, 0:5]

array([[1.        , 0.28886707, 0.20368779, 0.31781696, 0.23608229],
       [0.28886707, 1.        , 0.16651094, 0.36996622, 0.22351254],
       [0.20368779, 0.16651094, 1.        , 0.22803064, 0.19919431],
       [0.31781696, 0.36996622, 0.22803064, 1.        , 0.22972298],
       [0.23608229, 0.22351254, 0.19919431, 0.22972298, 1.        ]])

Define a function for making predictions. It handles both cases: item-item and user-item collaborative filtering.

In [965]:
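def predict(ratings, similarity, kind='user'):
    # Sum of absolute similarities per row, used to normalize the weighted sums.
    # Both similarity matrices are symmetric, so row sums equal column sums.
    sum_sim = np.array([np.abs(similarity).sum(axis=1)])
    sum_sim[sum_sim == 0] = 1  # avoid division by zero
    if kind == 'user':
        # Average of all users' ratings, weighted by user similarity
        return similarity.dot(ratings) / sum_sim.T
    elif kind == 'item':
        # Average of a user's ratings, weighted by item similarity
        return ratings.dot(similarity) / sum_sim
    raise ValueError("kind must be 'user' or 'item'")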
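
A small worked example may help to see what this weighted average does. The sketch below repeats the function so that it runs on its own. The 2 x 2 ratings and the hand-written similarity matrix are hypothetical values chosen for easy arithmetic.

```python
import numpy as np

def predict(ratings, similarity, kind='user'):
    # Same logic as the lab's function, repeated so this sketch is self-contained.
    sum_sim = np.array([np.abs(similarity).sum(axis=1)])
    sum_sim[sum_sim == 0] = 1
    if kind == 'user':
        return similarity.dot(ratings) / sum_sim.T
    elif kind == 'item':
        return ratings.dot(similarity) / sum_sim

# Toy data (hypothetical): 2 users x 2 items; user 1 has not rated item 2.
toy_ratings = np.array([[4.0, 0.0],
                        [2.0, 2.0]])
toy_user_sim = np.array([[1.0, 0.5],
                         [0.5, 1.0]])

toy_pred = predict(toy_ratings, toy_user_sim, kind='user')
print(np.round(toy_pred, 2))
```

For user 1 and item 2, the prediction is (1.0 * 0 + 0.5 * 2) / 1.5, about 0.67. The missing rating enters the average as a 0, which pulls predictions toward zero. This is consistent with the small values visible in the `item_prediction` output of this lab.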

Make predictions for the item case.

In [966]:
item_prediction = predict(train_data_matrix, item_similarity, 'item')

In [967]:
item_prediction[0:5,0:3]

array([[0.86756933, 0.86474958, 0.86702304],
       [0.20540948, 0.11826552, 0.1560434 ],
       [0.07423658, 0.06156086, 0.08038561],
       [0.06316123, 0.04841776, 0.05723922],
       [0.51071511, 0.5410871 , 0.45078205]])

**Exercise**

Complete the following code to make predictions in the user case (user-item collaborative filtering).

In [968]:
user_prediction = None               # Replace None with the appropriate code

**Solution**

In [969]:
user_prediction = predict(train_data_matrix, user_similarity, 'user')

Define a model performance measure: the root mean squared error (RMSE). The function compares the true ratings with the predicted ratings.

Only non-zero values are compared. A zero means that the user has not rated the movie.

In [970]:
from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, true_value):
    prediction = prediction[true_value.nonzero()].flatten()
    true_value = true_value[true_value.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, true_value))
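
The masking step is easy to miss, so here is a minimal NumPy-only sketch of the same measure on a toy example. The name `rmse_on_rated` and the two small matrices are hypothetical. The sketch computes the square root of the mean squared error directly instead of calling `mean_squared_error`.

```python
import numpy as np

def rmse_on_rated(prediction, true_value):
    """RMSE restricted to the entries where true_value is non-zero (rated)."""
    rated = true_value.nonzero()                      # positions of the known ratings
    errors = prediction[rated] - true_value[rated]    # errors on those positions only
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy data (hypothetical): only the two non-zero entries of toy_true are compared.
toy_true = np.array([[5.0, 0.0],
                     [0.0, 3.0]])
toy_pred = np.array([[4.0, 2.0],
                     [1.0, 3.0]])
toy_rmse = rmse_on_rated(toy_pred, toy_true)
print(toy_rmse)
```

The two compared errors are -1 and 0, so the mean squared error is 0.5 and the RMSE is about 0.707. The predictions 2.0 and 1.0 sit on unrated positions and do not affect the result.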

Compute the user-based collaborative filtering RMSE. 

In [971]:
user_CF_RMSE = rmse(user_prediction, test_data_matrix)
print('User-based CF RMSE: ', user_CF_RMSE)

User-based CF RMSE:  3.034239981008242


**Exercise**

Complete the following code to compute and print the item-based collaborative filtering RMSE.

In [972]:
item_CF_RMSE = None                # Replace None with the appropriate code
print('Item-based CF RMSE: ', item_CF_RMSE)

Item-based CF RMSE:  None


**Solution**

In [973]:
item_CF_RMSE = rmse(item_prediction, test_data_matrix)
print('Item-based CF RMSE: ', item_CF_RMSE)

Item-based CF RMSE:  3.158747233298894


### Model-based Collaborative Filtering

The idea behind memory-based collaborative filtering carries over to model-based algorithms: the relationships between users and/or items are computed once, stored as a *model*, and the stored values are then used to predict ratings.

The model-based approach used here relies on matrix factorization.

We will use an SVD-based algorithm, which reduces the dimensionality of the dataset and captures its latent "features".

Check the sparsity of our dataset.

In [974]:
sparsity=round(1.0-len(ratings)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


Decompose the train_data_matrix using the SVD method.

In [975]:
import scipy.sparse as sp
from scipy.sparse.linalg import svds

# Get the SVD components of the training matrix, keeping k singular values (here k = 20)
u, s, vt = svds(train_data_matrix, k = 20)

Create the diagonal matrix of singular values.

In [976]:
s_diag_matrix=np.diag(s)

Compute the rating predictions from the decomposition values.

In [977]:
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
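
The effect of keeping only k singular values can be seen on a small matrix. The sketch below is a stand-in that uses `np.linalg.svd` from NumPy rather than the `svds` function of the lab, so that it runs without SciPy. The helper name `rank_k_approximation` and the toy ratings are hypothetical.

```python
import numpy as np

def rank_k_approximation(matrix, k):
    """Rebuild `matrix` from its k largest singular values and vectors."""
    # np.linalg.svd returns the singular values sorted in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Toy ratings (hypothetical): 3 users x 4 items, 0 means 'not rated'.
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])
for k in (1, 2, 3):
    approx = rank_k_approximation(toy_ratings, k)
    error = np.linalg.norm(toy_ratings - approx)   # Frobenius norm of the residual
    print(k, round(float(error), 3))
```

The reconstruction error shrinks as k grows and vanishes once k reaches the rank of the matrix. In this lab a small k is used on purpose: the low-rank reconstruction puts non-zero values on unrated positions, and these values serve as the predicted ratings. As in the memory-based case, the zeros of the training matrix are treated as actual values by the decomposition.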

Compute the model RMSE.

In [978]:
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.7149435517828784


**Exercise**

Test the SVD-based model with different values for k and compare the obtained results.

**Solution**

In [979]:
for k in [5, 10, 15, 20, 25, 30]:
    u, s, vt = svds(train_data_matrix, k = k)
    s_diag_matrix=np.diag(s)
    X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
    print('SVD-based CF RMSE (k={}): {}'.format(k, str(rmse(X_pred, test_data_matrix))))

SVD-based CF RMSE (k=5): 2.735324732893819
SVD-based CF RMSE (k=10): 2.6687606157658603
SVD-based CF RMSE (k=15): 2.680900850197953
SVD-based CF RMSE (k=20): 2.714943551782879
SVD-based CF RMSE (k=25): 2.7537585845413584
SVD-based CF RMSE (k=30): 2.8009845619022333
