# LAB 10: Recommender Systems Based on Collaborative Filtering

In this lab, we develop a recommender system based on collaborative filtering techniques.

We will use the MovieLens dataset (https://grouplens.org/datasets/movielens/).

* The dataset has been downloaded from http://files.grouplens.org/datasets/movielens/ml-100k.zip.

* The full dataset contains 100,000 ratings by 943 users on 1,682 items.
* Each user has rated at least 20 movies.  
* Users and items are numbered consecutively from 1.  


## Import libraries

First, we import libraries used in this lab.

In [86]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

# Plot data
import matplotlib.pyplot as plt
%matplotlib inline

## Load data

The data has been downloaded into the '../../ressources/LAB10_Collaborative_filtering/ml-100k' directory.


In [87]:
path = '../../ressources/LAB10_Collaborative_filtering/ml-100k/'
header = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = pd.read_csv(path+'u.data', sep='\t', names=header)

## Explore and prepare data


Display the first rows of the dataframe.

In [88]:
ratings.head()

   user_id  item_id  rating  timestamp
0      196      242       3  881250949
1      186      302       3  891717742
2       22      377       1  878887116
3      244       51       2  880606923
4      166      346       1  886397596


In [89]:
ratings.shape

(100000, 4)

Get the number of unique users and unique items (movies).

In [90]:
n_users = ratings.user_id.unique().shape[0]
n_items = ratings.item_id.unique().shape[0]
print ('Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_items))

Number of users = 943 | Number of movies = 1682


## Split the dataset into train and test data

We split the dataset into training and test sets, keeping 25% of the ratings for testing.

In [91]:
train_data, test_data = train_test_split(ratings, test_size=0.25)
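
Because the split is random, the RMSE values reported later change from run to run unless a seed is fixed (`train_test_split` accepts a `random_state` argument for this purpose). As a minimal sketch of what a seeded 75/25 split does, using NumPy only and a hypothetical helper named `seeded_split` that is not part of the lab:

```python
import numpy as np

def seeded_split(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) row indices for a reproducible random split."""
    rng = np.random.default_rng(seed)      # fixed seed -> same shuffle every run
    shuffled = rng.permutation(n_rows)     # random order of the row indices
    n_test = int(round(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# 100000 is the number of ratings in the dataset
train_idx, test_idx = seeded_split(100000)
print(len(train_idx), len(test_idx))       # 75000 25000
```

Calling the helper twice with the same seed returns the same indices, which makes results comparable across runs.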

## Collaborative filtering

We will consider memory-based and model-based collaborative filtering.

First, we create the user-item matrix for training data.

In [92]:
# Build the training user-item matrix (rows: users, columns: items, 0 = not rated)
train_data_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    # line[0] is the dataframe index; line[1], line[2], line[3] are user_id, item_id, rating
    train_data_matrix[line[1]-1, line[2]-1] = line[3]
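
The loop fills the matrix one rating at a time. The same kind of matrix can also be built in a single step with NumPy integer-array indexing. The sketch below uses a small hypothetical set of (user, item, rating) triples rather than the lab's data, and the helper name `build_matrix` is an assumption:

```python
import numpy as np

def build_matrix(user_ids, item_ids, values, n_users, n_items):
    """Build a dense user-item matrix from parallel lists of IDs and ratings."""
    matrix = np.zeros((n_users, n_items))
    # IDs start at 1, so subtract 1 to get 0-based row and column indices
    matrix[np.asarray(user_ids) - 1, np.asarray(item_ids) - 1] = values
    return matrix

# Hypothetical toy ratings: 3 users, 3 movies
toy_users = [1, 1, 2, 3]
toy_items = [1, 3, 2, 3]
toy_ratings = [5, 3, 4, 1]
toy_matrix = build_matrix(toy_users, toy_items, toy_ratings, n_users=3, n_items=3)
print(toy_matrix)
```

With the lab's dataframe, the arguments would be `train_data.user_id.values`, `train_data.item_id.values`, and `train_data.rating.values`.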

We create the user-item matrix for testing data.

In [93]:
test_data_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

### Memory-Based Collaborative Filtering

The main idea behind **memory-based** collaborative filtering is to compute the **similarities** between users and/or items and use them as weights when predicting the rating of a user for an item.

We will test both:

* Item-Item Collaborative Filtering
* User-Item Collaborative Filtering

We use cosine similarity, computed with the `cosine_similarity` function from the `sklearn.metrics.pairwise` module.
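
Cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. As a quick illustration on hypothetical toy rating vectors (not taken from the dataset), it can be computed by hand with NumPy; the helper name `cosine_sim` is an assumption:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical ratings of three users on four movies (0 = not rated)
user_a = [5, 3, 0, 1]
user_b = [4, 0, 0, 1]
user_c = [0, 0, 5, 0]

print(cosine_sim(user_a, user_b))  # users with movies in common: value between 0 and 1
print(cosine_sim(user_a, user_c))  # no movie rated by both users: 0.0
```

Since ratings are never negative, the similarities obtained this way always lie between 0 and 1.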

First, compute the similarity between users.

In [94]:
from sklearn.metrics import pairwise
user_similarity = pairwise.cosine_similarity(train_data_matrix)

Display the top-left 5×5 block of the user similarity matrix.

In [95]:
user_similarity[:5, 0:5]

array([[ 1.        ,  0.13748335,  0.03440301,  0.04778357,  0.30934333],
       [ 0.13748335,  1.        ,  0.05440946,  0.12649189,  0.01214068],
       [ 0.03440301,  0.05440946,  1.        ,  0.28779976,  0.02791685],
       [ 0.04778357,  0.12649189,  0.28779976,  1.        ,  0.03188143],
       [ 0.30934333,  0.01214068,  0.02791685,  0.03188143,  1.        ]])

In [96]:
# Compute the similarity between items (movies); the transpose puts items in rows
item_similarity = pairwise.cosine_similarity(train_data_matrix.T)

In [97]:
item_similarity[:5, 0:5]

array([[ 1.        ,  0.2779502 ,  0.24495551,  0.35193028,  0.21968451],
       [ 0.2779502 ,  1.        ,  0.22542778,  0.39673548,  0.19236885],
       [ 0.24495551,  0.22542778,  1.        ,  0.22282829,  0.14709108],
       [ 0.35193028,  0.39673548,  0.22282829,  1.        ,  0.24319678],
       [ 0.21968451,  0.19236885,  0.14709108,  0.24319678,  1.        ]])

Define a function for making predictions. It handles both cases: item-item and user-item collaborative filtering.

In [98]:
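def predict(ratings, similarity, kind='user'):
    """Predict ratings as a similarity-weighted average of the known ratings."""
    # Normalization term: sum of absolute similarities for each user (or item)
    sum_sim = np.array([np.abs(similarity).sum(axis=1)])
    sum_sim[sum_sim == 0] = 1  # avoid division by zero for all-zero similarity rows

    if kind == 'user':
        # Weight the ratings of all users by their similarity to the target user
        return similarity.dot(ratings) / sum_sim.T
    elif kind == 'item':
        # Weight the user's ratings on all items by their similarity to the target item
        return ratings.dot(similarity) / sum_sim
    raise ValueError("kind must be 'user' or 'item'")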
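
To make the weighted average concrete, here is a small worked example of the user case on hypothetical toy values (neither the ratings nor the similarities come from the dataset). It repeats the computation of the `'user'` branch with NumPy only:

```python
import numpy as np

# Hypothetical toy data: 3 users x 2 movies, 0 = not rated
toy_ratings = np.array([[5.0, 0.0],
                        [4.0, 2.0],
                        [0.0, 1.0]])
# Hypothetical user-user similarities (symmetric, ones on the diagonal)
toy_sim = np.array([[1.0, 0.5, 0.0],
                    [0.5, 1.0, 0.5],
                    [0.0, 0.5, 1.0]])

# Same computation as the 'user' branch of predict
weights_sum = np.abs(toy_sim).sum(axis=1, keepdims=True)
toy_pred = toy_sim.dot(toy_ratings) / weights_sum

# User 1, movie 1: (1.0*5 + 0.5*4 + 0.0*0) / (1.0 + 0.5 + 0.0)
print(toy_pred[0, 0])
# User 1, movie 2: (1.0*0 + 0.5*2 + 0.0*1) / (1.0 + 0.5 + 0.0)
print(toy_pred[0, 1])
```

Note that unrated entries enter the average as zeros, which pulls the predictions toward 0. This is worth keeping in mind when reading the predicted values and the RMSE scores.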

Make predictions for the item case.

In [99]:
item_prediction = predict(train_data_matrix, item_similarity, 'item')

In [100]:
item_prediction[0:5,0:3]

array([[ 0.93513621,  0.88891504,  0.8178256 ],
       [ 0.20009813,  0.10851607,  0.14797021],
       [ 0.08724431,  0.05770369,  0.06984525],
       [ 0.07044095,  0.05017328,  0.06117246],
       [ 0.52369501,  0.52349766,  0.43165267]])

**Exercise**

Complete the following code to make predictions in the user case (user-item collaborative filtering).

In [16]:
user_prediction = None               # Replace None with the appropriate code

**Solution**

In [101]:
user_prediction = predict(train_data_matrix, user_similarity, 'user')

Define the model performance measure: the root-mean-square error (RMSE). The function compares the true ratings with the predicted ratings.

Only nonzero entries are compared, since a zero means that the user has not rated the movie.

In [102]:
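from sklearn.metrics import mean_squared_error
from math import sqrt
def rmse(prediction, true_value):
    # Keep only the positions that hold an actual rating in true_value
    rated = true_value.nonzero()
    prediction = prediction[rated].flatten()
    true_value = true_value[rated].flatten()
    # mean_squared_error expects (y_true, y_pred); MSE is symmetric, so the result is unchanged
    return sqrt(mean_squared_error(true_value, prediction))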
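
As an illustration of the masking step, the sketch below computes the same quantity with NumPy only on hypothetical toy matrices. The name `masked_rmse` is an assumption and is not used elsewhere in the lab:

```python
import numpy as np

def masked_rmse(prediction, true_value):
    """RMSE restricted to the entries where true_value holds a rating."""
    mask = true_value != 0                      # True where a rating exists
    errors = prediction[mask] - true_value[mask]
    return np.sqrt(np.mean(errors ** 2))

toy_true = np.array([[4.0, 0.0],
                     [0.0, 2.0]])
toy_pred = np.array([[3.0, 5.0],
                     [1.0, 4.0]])

# Only positions (0, 0) and (1, 1) are rated: the errors are -1 and 2,
# so the result is sqrt((1 + 4) / 2)
print(masked_rmse(toy_pred, toy_true))
```

The predictions at the two unrated positions (5.0 and 1.0) have no influence on the score.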

Compute the user-based collaborative filtering RMSE. 

In [103]:
user_CF_RMSE = rmse(user_prediction, test_data_matrix)
print('User-based CF RMSE: ', user_CF_RMSE)

User-based CF RMSE:  3.049053384214934


**Exercise**

Complete the following code to compute and print the Item-based collaborative filtering RMSE.

In [20]:
item_CF_RMSE = None                # Replace None with the appropriate code
print('Item-based CF RMSE: ', item_CF_RMSE)

Item-based CF RMSE:  None


**Solution**

In [104]:
item_CF_RMSE = rmse(item_prediction, test_data_matrix)
print('Item-based CF RMSE: ', item_CF_RMSE)

Item-based CF RMSE:  3.1753239423465063


### Model-based Collaborative Filtering

Model-based algorithms build on the same idea as memory-based collaborative filtering: the similarities between users and/or items are computed once, stored as a *model*, and the stored values are then used to predict ratings.

Here, model-based collaborative filtering relies on matrix factorization.

We will use an SVD-based algorithm, which reduces the dimensionality of the dataset and captures its latent "features".

Check the sparsity of our dataset.

In [108]:
sparsity=round(1.0-len(ratings)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


Decompose `train_data_matrix` using truncated SVD, keeping the k largest singular values.

In [109]:
from scipy.sparse.linalg import svds

# Get the SVD components of the training matrix, keeping k singular values
u, s, vt = svds(train_data_matrix, k = 20)

Create the diagonal matrix of singular values.

In [110]:
s_diag_matrix=np.diag(s)

Compute the predicted ratings by multiplying the three factors back together.

In [111]:
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
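
To see what keeping only k singular values does, the following sketch applies the same reconstruction to a hypothetical 4×3 toy matrix. It uses `np.linalg.svd` as a stand-in for `svds` (which computes only the k largest singular triplets), so it is an illustration of the idea rather than the lab's exact procedure:

```python
import numpy as np

# Hypothetical toy ratings: 4 users x 3 movies
toy = np.array([[5.0, 4.0, 0.0],
                [4.0, 5.0, 1.0],
                [0.0, 1.0, 5.0],
                [1.0, 0.0, 4.0]])

# Full SVD; np.linalg.svd returns singular values in descending order
U, S, Vt = np.linalg.svd(toy, full_matrices=False)

k = 2  # number of latent features to keep
approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

print(approx.shape)                          # (4, 3): same shape as the input
print(np.linalg.matrix_rank(approx))         # the rank is at most k
print(np.linalg.norm(toy - approx), S[k])    # the error equals the dropped singular value
```

The reconstruction has the same shape as the original matrix but a lower rank, and the information lost is exactly what the discarded singular values carried.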

Compute the model RMSE.

In [112]:
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE: 2.725993770126907


**Exercise**

Test the SVD-based model with different values for k and compare the obtained results.

**Solution**

In [113]:
for k in [5, 10, 15, 20, 25, 30]:
    u, s, vt = svds(train_data_matrix, k = k)
    s_diag_matrix=np.diag(s)
    X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
    print('SVD-based CF RMSE (k={}): {}'.format(k, rmse(X_pred, test_data_matrix)))

SVD-based CF RMSE (k=5): 2.7490927453988503
SVD-based CF RMSE (k=10): 2.6800599240155183
SVD-based CF RMSE (k=15): 2.6932424686876604
SVD-based CF RMSE (k=20): 2.725993770126908
SVD-based CF RMSE (k=25): 2.7694519527676014
SVD-based CF RMSE (k=30): 2.8064007647621563
