## Recommendation and Collaborative Filtering

Based on https://github.com/rounakbanik/movies/blob/master/movies_recommender.ipynb

A content-based engine suffers from some severe limitations. It can only suggest movies that are *close* to a given movie. That is, it cannot capture a user's tastes or provide recommendations across genres.

Also, the engine is not really personal, in that it doesn't capture the personal tastes and biases of a user. Anyone querying such an engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who they are.

**Collaborative Filtering** is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
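
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering on a made-up rating matrix. The data, the function names `cosine` and `predict_user_based`, and the choice of cosine similarity over co-rated items are all assumptions for illustration; this is not how Surprise implements its algorithms.

```python
import numpy as np

# Toy user x item rating matrix; 0 means "not rated". All values are made up.
R = np.array([
    [5.0, 4.0, 0.0, 1.0],   # target user, has not rated item 2
    [5.0, 5.0, 4.0, 1.0],   # tastes close to the target user
    [1.0, 1.0, 2.0, 5.0],   # tastes far from the target user
])

def cosine(u, v):
    """Cosine similarity computed only on items both users have rated."""
    both = (u > 0) & (v > 0)
    if not both.any():
        return 0.0
    return float(u[both] @ v[both] / (np.linalg.norm(u[both]) * np.linalg.norm(v[both])))

def predict_user_based(R, user, item):
    """Similarity-weighted average of the ratings other users gave to `item`."""
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue  # skip the target user and users who did not rate the item
        w = cosine(R[user], R[other])
        num += w * R[other, item]
        den += abs(w)
    return num / den if den else float("nan")

prediction = predict_user_based(R, user=0, item=2)
print(round(prediction, 2))
```

The prediction lands closer to the rating given by the similar user than to the one given by the dissimilar user, which is the behaviour the paragraph above describes.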

We will use the **Surprise** library, which provides algorithms such as **Singular Value Decomposition (SVD)** that are trained to minimise prediction error, measured here with RMSE (Root Mean Square Error).
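
As a rough illustration of what "SVD that minimises RMSE" means in this setting, the sketch below trains a small biased matrix-factorization model with stochastic gradient descent, predicting each rating as global mean + user bias + item bias + dot product of latent factors. The data, the helper names `train_mf` and `rmse`, and the hyperparameter values are assumptions; it is a toy version of the technique, not Surprise's implementation.

```python
import numpy as np

# (user, item, rating) triples; made-up data with 4 users and 3 items.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 3.0), (3, 2, 4.0)]

def train_mf(ratings, n_users, n_items, n_factors=2, n_epochs=200,
             lr=0.01, reg=0.02, seed=0):
    """SGD on the regularised squared error of mu + bu + bi + p_u . q_i."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            pu = P[u].copy()  # keep the old user factors for the item update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return mu, bu, bi, P, Q

def rmse(ratings, model):
    """Root mean square error of the model on the given ratings."""
    mu, bu, bi, P, Q = model
    errs = [r - (mu + bu[u] + bi[i] + P[u] @ Q[i]) for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errs))))

before = rmse(ratings, train_mf(ratings, 4, 3, n_epochs=0))
after = rmse(ratings, train_mf(ratings, 4, 3))
print(before, after)
```

Here the error is measured on the training ratings only, so it mainly shows that the optimisation lowers RMSE; the cross-validation later in this notebook is what estimates error on unseen ratings.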

### The Surprise package

Getting started: https://surprise.readthedocs.io/en/stable/getting_started.html

Building a custom algorithm: https://surprise.readthedocs.io/en/stable/building_custom_algo.html

In [1]:
import scipy
print(scipy.__version__)
import numpy
print(numpy.__version__)
import surprise
print(surprise.__version__)

1.1.0
1.14.3
1.0.6


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD, evaluate

# Silence all warnings (this also hides any deprecation warnings from the libraries above)
import warnings
warnings.simplefilter('ignore')

In [3]:
reader = Reader()

In [4]:
ratings = pd.read_csv('datas/ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
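
The `Unnamed: 0` column in the output above is what pandas produces when a CSV header starts with an empty field, which typically happens when a DataFrame was saved together with its index. Assuming that is the case for `datas/ratings.csv` (the file itself is not shown here), a sketch of reading it back without the extra column looks like this; the inline `csv_text` is a stand-in built from the first rows above.

```python
import io
import pandas as pd

# Stand-in for the first rows of the ratings file, with an unnamed index column.
csv_text = (
    ",userId,movieId,rating,timestamp\n"
    "0,1,1,4.0,964982703\n"
    "1,1,3,4.0,964981247\n"
)

# Default read: the unnamed first column becomes a regular "Unnamed: 0" column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats the first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either form works for the rest of the notebook, since only the `userId`, `movieId` and `rating` columns are passed to Surprise.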


In [5]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
# Deprecated API; newer Surprise code passes KFold(n_splits=5) as `cv` to cross_validate instead
data.split(n_folds=5)

In [6]:
svd = SVD()
# `evaluate` is deprecated; surprise.model_selection.cross_validate is its replacement
evaluate(svd, data, measures=['RMSE', 'MAE'])

Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8758
MAE:  0.6712
------------
Fold 2
RMSE: 0.8714
MAE:  0.6723
------------
Fold 3
RMSE: 0.8696
MAE:  0.6678
------------
Fold 4
RMSE: 0.8703
MAE:  0.6693
------------
Fold 5
RMSE: 0.8814
MAE:  0.6746
------------
------------
Mean RMSE: 0.8737
Mean MAE : 0.6710
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'mae': [0.6711976632318537,
                             0.672325753957933,
                             0.6677632329716944,
                             0.6693329343570376,
                             0.6745750024123661],
                            'rmse': [0.8758463740071303,
                             0.8713682821299826,
                             0.8696337719026809,
                             0.8703292697273627,
                             0.8814043677752365]})

In [7]:
trainset = data.build_full_trainset()
# `train` is the deprecated name of `fit`
svd.train(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7ffaf41a4f98>

In [8]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [9]:
from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# define a cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # train and test algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions, verbose=True)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /users/promo2018/amarion/.surprise_data/ml-100k
RMSE: 0.9511
RMSE: 0.9424
RMSE: 0.9435
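
Each `RMSE` line above is the square root of the mean squared difference between true and estimated ratings on one test fold. The sketch below spells that out in plain Python on made-up (true, estimated) pairs, then averages the three fold scores printed above; the helper name `rmse` and the `pairs` data are assumptions for illustration.

```python
import math

def rmse(pairs):
    """Root mean square error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

# Made-up predictions: squared errors are 0.25, 1.0, 0.25 and 1.0.
pairs = [(4.0, 3.5), (5.0, 4.0), (3.0, 3.5), (1.0, 2.0)]
toy_rmse = rmse(pairs)

# Fold scores copied from the output above.
fold_rmse = [0.9511, 0.9424, 0.9435]
mean_rmse = sum(fold_rmse) / len(fold_rmse)

print(toy_rmse, mean_rmse)
```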


In [10]:
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9636082550234252
{'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
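
`GridSearchCV` evaluates every combination of the values in `param_grid`, each with `cv` folds. As a sketch of how much work that represents, the grid can be expanded by hand with `itertools.product`; the variable names `combos` and `n_fits` are illustrative and not part of Surprise.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
cv = 3

# Cartesian product of all value lists, one dict per parameter combination.
names = list(param_grid)
combos = [dict(zip(names, values))
          for values in product(*(param_grid[name] for name in names))]

n_fits = len(combos) * cv  # one model is trained per combination and per fold
print(len(combos), n_fits)
```

The best combination reported above, `{'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}`, is one of these combinations; adding more values per parameter multiplies the number of fits accordingly.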
