# Collaborative Filtering Comparison

In this notebook we will compare the three main methods seen in previous notebooks:
* User based filtering
* Item based filtering
* Matrix factorization filtering

## Imports

In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618285 sha256=46ae30509348a6822363196e752b27b8d0748a756f780865b2c5787f9af4a723
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [2]:
import os

from surprise import KNNWithMeans 
from surprise import SVD 
from surprise import Dataset                                                     
from surprise.model_selection import train_test_split
from surprise import dump

from surprise import accuracy

## Load the dataset

In [3]:
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


## Train the algorithms

### User based
We use the Pearson correlation coefficient as the similarity measure between users.

In [4]:
sim_options_user_based = {
    'name': 'pearson',
    'user_based': True,  # compute similarities between users
}
algo_user_based = KNNWithMeans(sim_options=sim_options_user_based)
algo_user_based.fit(trainset)                     

Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7faee5a6c668>
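To make the similarity measure used above concrete, here is a minimal pure-Python sketch of the Pearson correlation between two users, computed over the items both have rated. The function name `pearson_sim` and the toy rating dictionaries are illustrative assumptions, not part of Surprise, which builds the full similarity matrix internally with its own optimized code.

```python
import math


def pearson_sim(ratings_u, ratings_v):
    """Pearson correlation between two users over their co-rated items.

    ratings_u, ratings_v: dicts mapping item id -> rating (hypothetical toy format).
    Returns 0.0 when fewer than two items are shared or one user has no variance.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    xs = [ratings_u[i] for i in common]
    ys = [ratings_v[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances are left unnormalized; the factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy users: bob rates like alice but one point lower, carol rates the opposite way.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3, "m4": 5}
carol = {"m1": 1, "m2": 5, "m3": 2}

print(pearson_sim(alice, bob))
print(pearson_sim(alice, carol))
```

Because each user's ratings are centered on that user's mean, a user who is systematically stricter (bob) can still be a perfect neighbour for alice, while carol gets a negative similarity.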

### Item based
We use cosine similarity as the similarity measure between items.

In [5]:
sim_options_item_based = {
    'name': 'cosine',
    'user_based': False,  # compute similarities between items
}
algo_item_based = KNNWithMeans(sim_options=sim_options_item_based)
algo_item_based.fit(trainset)                     

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7faee5a6c6a0>
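As an illustration of the item-based similarity, the following sketch computes the cosine between two items using only the users who rated both. The name `cosine_sim` and the toy data are assumptions made for this example; Surprise computes the whole item-item matrix internally.

```python
import math


def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity between two items over the users who rated both.

    ratings_i, ratings_j: dicts mapping user id -> rating (hypothetical toy format).
    Returns 0.0 when no user rated both items.
    """
    common = set(ratings_i) & set(ratings_j)
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)


# Toy items: item_b gets the same ratings as item_a from the shared users,
# item_c gets a different rating pattern.
item_a = {"u1": 5, "u2": 3}
item_b = {"u1": 5, "u2": 3, "u3": 1}
item_c = {"u1": 1, "u2": 5}

print(cosine_sim(item_a, item_b))
print(cosine_sim(item_a, item_c))
```

Since ratings are positive, this cosine is always greater than zero for items with common raters, unlike the Pearson coefficient, which can be negative.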

### Matrix factorization
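Unlike the two previous methods, SVD does not rely on a similarity measure: it learns latent factors and biases for users and items. We use the default hyperparameters.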

In [6]:
algo_svd = SVD()
algo_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7faee796a8d0>
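For intuition, a fitted SVD model predicts a rating as the global mean plus a user bias, an item bias and the dot product of the user and item factor vectors. The sketch below reproduces that formula on made-up numbers; the function `svd_predict` and all values are hypothetical, and the clipping to the 1 to 5 scale of ml-100k is included as an assumption that mirrors how predictions are usually kept inside the rating range.

```python
def svd_predict(mu, b_u, b_i, p_u, q_i, low=1.0, high=5.0):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u.

    mu: global mean rating; b_u, b_i: user and item biases;
    p_u, q_i: latent factor vectors of equal length (toy values here).
    The result is clipped to the [low, high] rating scale.
    """
    dot = sum(p * q for p, q in zip(p_u, q_i))
    est = mu + b_u + b_i + dot
    return min(high, max(low, est))


# A slightly generous user (+0.2), a slightly unpopular item (-0.1),
# and two latent factors.
est = svd_predict(mu=3.5, b_u=0.2, b_i=-0.1, p_u=[0.5, -0.3], q_i=[0.4, 0.1])
print(round(est, 2))

# An estimate above the scale is clipped to the maximum rating.
clipped = svd_predict(mu=4.8, b_u=0.5, b_i=0.3, p_u=[1.0], q_i=[1.0])
print(clipped)
```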

## Save models to disk

In [7]:
file_name_user = os.path.expanduser('~/dump_file_user')
dump.dump(file_name_user, algo=algo_user_based)

file_name_item = os.path.expanduser('~/dump_file_item')
dump.dump(file_name_item, algo=algo_item_based)

file_name_svd = os.path.expanduser('~/dump_file_svd')
dump.dump(file_name_svd, algo=algo_svd)

## Load models

In [8]:
_, loaded_algo_user = dump.load(file_name_user)
_, loaded_algo_item = dump.load(file_name_item)
_, loaded_algo_svd = dump.load(file_name_svd)
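`dump.load` returns a `(predictions, algo)` tuple, which is why the first element is discarded with `_` when only the algorithm was saved. The sketch below imitates that round trip with the standard `pickle` module on a stand-in object. The helpers `dump_sketch` and `load_sketch` are hypothetical and only approximate what the library does; they are not its implementation.

```python
import os
import pickle
import tempfile


def dump_sketch(file_name, predictions=None, algo=None):
    """Store predictions and/or a model in a single pickle file (illustrative)."""
    with open(file_name, "wb") as f:
        pickle.dump({"predictions": predictions, "algo": algo}, f)


def load_sketch(file_name):
    """Return a (predictions, algo) tuple, mirroring the shape of dump.load."""
    with open(file_name, "rb") as f:
        obj = pickle.load(f)
    return obj["predictions"], obj["algo"]


# A plain dict stands in for a fitted algorithm in this toy example.
toy_algo = {"name": "toy-model", "weights": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dump_file_toy")
    dump_sketch(path, algo=toy_algo)
    loaded_predictions, loaded_algo = load_sketch(path)

print(loaded_predictions, loaded_algo)
```

As with any pickle-based format, such files should only be loaded from trusted sources.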

## Predict values

In [9]:
predictions_user = loaded_algo_user.test(testset)
predictions_item = loaded_algo_item.test(testset)
predictions_svd = loaded_algo_svd.test(testset)


In [10]:
predictions_by_model = (predictions_user, predictions_item, predictions_svd)
metrics = (("rmse", accuracy.rmse), ("mse", accuracy.mse),
           ("mae", accuracy.mae), ("fcp", accuracy.fcp))

print("accuracy measures for user, item and svd:")
for name, metric in metrics:
    # One row per metric, one column per model (user, item, svd)
    scores = [metric(preds, verbose=False) for preds in predictions_by_model]
    print("{:<8}= ".format(name) + "  ".join("{:10.3f}".format(s) for s in scores))

accuracy measures for user, item and svd:
rmse    =      0.962       0.957       0.951
mse     =      0.925       0.916       0.905
mae     =      0.753       0.752       0.750
fcp     =      0.705       0.696       0.696


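On the error measures (RMSE, MSE and MAE, where lower is better), SVD obtains the best results on this dataset, followed by item-based and then user-based filtering, although the differences are small. On FCP (where higher is better), user-based filtering scores highest. Because `train_test_split` is called without a fixed `random_state`, the exact numbers will vary from run to run.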
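The error measures reported above can be written out in a few lines. This is a minimal sketch on made-up `(true rating, estimate)` pairs; the function names and data are assumptions for illustration, and FCP, which compares the ordering of predictions per user, is left out to keep the example short.

```python
import math


def mse(pairs):
    """Mean squared error over (true_rating, estimate) pairs."""
    return sum((r - est) ** 2 for r, est in pairs) / len(pairs)


def rmse(pairs):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Toy predictions: errors of 0.5, -1.0, 0.5 and 0.0.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]

print("mse  =", mse(pairs))
print("rmse =", round(rmse(pairs), 3))
print("mae  =", mae(pairs))
```

Since RMSE is the square root of MSE, the two always rank the models in the same order; MAE can rank them differently because it penalizes large errors less heavily.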

## Other datasets or configurations

Can you extrapolate these results to other datasets? By tuning the parameters of each algorithm, can you improve on the results of this notebook?
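One possible starting point for the parameter exploration is to enumerate candidate configurations for the neighbourhood models. The sketch below only builds the list of settings; the helper `knn_configs` and the chosen candidate values are assumptions for this example, not recommendations.

```python
from itertools import product


def knn_configs(names=("pearson", "cosine", "msd"),
                user_based=(True, False),
                ks=(20, 40, 60)):
    """Build every combination of similarity name, user/item mode and neighbourhood size."""
    configs = []
    for name, ub, k in product(names, user_based, ks):
        configs.append({"k": k, "sim_options": {"name": name, "user_based": ub}})
    return configs


configs = knn_configs()
print(len(configs), "configurations, first one:", configs[0])

# Each entry could then be used with the classes from this notebook, e.g.
#   algo = KNNWithMeans(k=cfg["k"], sim_options=cfg["sim_options"])
#   algo.fit(trainset)
#   accuracy.rmse(algo.test(testset), verbose=False)
```

Surprise also provides a `GridSearchCV` class in `surprise.model_selection` that combines this kind of search with cross-validation, which gives more reliable comparisons than a single train/test split.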