## Collaborative Filtering
#### Model Based Approach

In [1]:
import pandas as pd

from surprise import SVD
from surprise import Dataset
from surprise import Reader

from surprise import accuracy

from surprise.model_selection import train_test_split

from surprise.model_selection import GridSearchCV

from surprise.model_selection import cross_validate

We will be working with the [same data](https://drive.google.com/file/d/1WvTmAfO09TCX7xp7uu06__ziic7JnrL5/view?usp=sharing) we used in the previous exercise.

In [2]:
book_ratings = pd.read_csv('BX-CSV-Dump\\BX-Book-Ratings.csv',sep=";", encoding="latin")

In [3]:
book_ratings

Unnamed: 0,User-ID,ISBN,Book-Rating
0,276725,034545104X,0
1,276726,0155061224,5
2,276727,0446520802,0
3,276729,052165615X,3
4,276729,0521795028,6
...,...,...,...
1149775,276704,1563526298,9
1149776,276706,0679447156,0
1149777,276709,0515107662,10
1149778,276721,0590442449,10


In [4]:
book_ratings.groupby('Book-Rating').describe()

Unnamed: 0_level_0,User-ID,User-ID,User-ID,User-ID,User-ID,User-ID,User-ID,User-ID
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
Book-Rating,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
0,716109.0,143370.549256,80345.14378,2.0,73394.0,145641.0,212898.0,278854.0
1,1770.0,127012.059322,86750.741252,1314.0,48423.75,116395.5,208668.0,278759.0
2,2759.0,138600.495469,84060.292551,387.0,62464.0,138018.0,216626.0,278764.0
3,5996.0,138070.616077,83818.299314,17.0,63180.25,137649.0,214553.75,278820.0
4,8904.0,138104.75876,83264.004877,86.0,65514.0,137190.0,213316.0,278723.0
5,50974.0,137953.412544,79099.693575,8.0,70183.0,138777.0,202200.25,278851.0
6,36924.0,135065.78082,81575.208656,8.0,64015.0,135149.0,206023.0,278854.0
7,76457.0,134472.395163,81432.702382,8.0,63694.0,133571.0,205485.0,278854.0
8,103736.0,135406.82201,80778.299921,32.0,67542.0,133123.0,206074.0,278854.0
9,67541.0,135306.242638,79049.826965,16.0,70594.0,129465.0,203611.0,278849.0


In [5]:
book_ratings['Book-Rating'].describe()

count    1.149780e+06
mean     2.866950e+00
std      3.854184e+00
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      7.000000e+00
max      1.000000e+01
Name: Book-Rating, dtype: float64

* create surprise dataset from book_ratings

In [6]:
reader = Reader(rating_scale=(0, 10))

# Loads Pandas dataframe
data = Dataset.load_from_df(book_ratings, reader)

* split data to train and test set, use test size 15%

In [7]:
train_set, test_set = train_test_split(data, test_size=0.3) # <<--the best

* Use SVD (with default settings) to create recommendations for each user
    - print default model's rmse that was computed on the test set (using object accuracy we imported in the beginning)

In [8]:
model = SVD()

In [9]:
model.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1d6350f34c0>

In [10]:
predictions = model.test(test_set)

In [11]:
accuracy.rmse(predictions)

RMSE: 3.4983


3.49828431056315

<font color= 'red'> Description: </font>
<font color= 'Orange'> Usually RMSE is expected to be between 0 and 1, while in this case it is not. If we look at the upper section's describe() results, we would see the count of the USER-IDs which have got ratings 0-10, total count, mean value, and standard deviation; we can notice such result is not that much bad </font>

<font color='orange'> Train test set splitting for grid search </font>

In [12]:
raw_ratings = data.raw_ratings
threshold = int(0.7 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]
data.raw_ratings = A_raw_ratings

* create parameters grid, use this params:
* 'n_factors': [110, 120, 140, 160]
* 'reg_all': [0.08, 0.1, 0.15]

In [13]:
param_grid = {'n_factors': [110, 120, 140, 160], 'reg_all': [0.08, 0.1, 0.15]}

* instantiate GridSearch with SVD as model, our pre-defined parameter grid and rmse and mae as evaluation metrics

In [14]:
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)

* fit GridSearch

In [15]:
grid_search.fit(data)

* print best RMSE score from training

In [16]:
algo = grid_search.best_estimator['rmse']

In [17]:
trainset = data.build_full_trainset()

In [18]:
algo.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1d632806860>

* predict test set with optimal model based on `RMSE`

In [19]:
predictions = algo.test(trainset.build_testset())

In [20]:
accuracy.rmse(predictions)

RMSE: 1.4213


1.4213083046559942

* print optimal model's RMSE that was computed on test set
    - is it better than the default parameters?

In [21]:
testset = data.construct_testset(B_raw_ratings)

In [22]:
test_predictions = algo.test(testset)

In [23]:
accuracy.rmse(test_predictions)

RMSE: 3.7771


3.777050320016362