# Collaborative Filtering Recommender System


## Surprise

Using the Surprise library, we benchmark the algorithms imported below with 3-fold cross-validation. We use RMSE (root-mean-square error) as the accuracy metric for the predictions.
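For reference, RMSE is the square root of the mean squared difference between true and estimated ratings, so it is expressed in rating points and penalizes large misses more heavily than small ones. A minimal standard-library sketch follows; the `rmse` helper and the sample pairs are illustrative and not part of Surprise.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative values only: one miss of 1 rating point and one of 3.
sample = [(6.0, 7.0), (1.0, 4.0)]
print(rmse(sample))  # sqrt((1 + 9) / 2) = sqrt(5)
```

Note how the single 3-point miss dominates the result, which is why a few very bad predictions can keep RMSE high.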

In [1]:
import pandas as pd
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split
from surprise import SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, \
    KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering


In [2]:
reader = Reader(line_format='item user rating', sep=',', skip_lines=1, rating_scale=(0.5,10))

In [3]:
data = Dataset.load_from_file('ratings_cleaned.csv', reader=reader)

In [4]:
algo_list = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
benchmark = []

for algo in algo_list:
    results = cross_validate(algo=algo, data=data, measures=["rmse"], cv=3, n_jobs=-1, verbose=False)
    algo_name = str(algo).split(' ')[0].split('.')[-1]
    print(algo_name)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp['Algorithm'] = algo_name
    benchmark.append(tmp)

SVD
SVDpp
SlopeOne
NMF
NormalPredictor
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
Done computing similarity matrix.
KNNBaseline
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
Done computing similarity matrix.
KNNBasic
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Done computing similarity matrix.
Done computing similarity matrix.
Done computing similarity matrix.
KNNWithMeans
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Computing the msd similarity matrix...
Done computing similarity matrix.
Done computing similarity 

In [5]:
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 2.046756 | 0.137173 | 0.039502 |
| BaselineOnly | 2.054566 | 0.052964 | 0.043458 |
| KNNBaseline | 2.055028 | 1.837526 | 0.035611 |
| SVD | 2.056549 | 0.297736 | 0.038188 |
| NMF | 2.10332 | 1.167993 | 0.056383 |
| KNNWithZScore | 2.103425 | 2.274471 | 0.037356 |
| CoClustering | 2.103482 | 1.959868 | 0.032777 |
| SlopeOne | 2.103501 | 0.569013 | 0.036005 |
| KNNWithMeans | 2.103728 | 1.582587 | 0.034793 |
| KNNBasic | 2.103797 | 1.491451 | 0.0395 |
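Each row in the benchmark is the mean over the three folds. The sketch below mimics that reduction with the standard library, assuming per-fold results keyed by `test_rmse`, `fit_time`, and `test_time` as in the table columns. It also uses `type(algo).__name__`, which is a simpler way to get the class name than parsing `str(algo)`. `FakeAlgo`, `summarize`, and the fold values are made-up placeholders, not benchmark output.

```python
from statistics import mean

class FakeAlgo:
    """Hypothetical stand-in for a Surprise algorithm instance."""

# Made-up per-fold results in the shape the benchmark loop expects (cv=3).
results = {
    "test_rmse": [2.0, 2.1, 2.2],
    "fit_time": [0.30, 0.25, 0.35],
    "test_time": [0.04, 0.05, 0.03],
}

def summarize(algo, results):
    """Average every metric across folds and tag the row with the class name."""
    row = {metric: mean(values) for metric, values in results.items()}
    row["Algorithm"] = type(algo).__name__
    return row

row = summarize(FakeAlgo(), results)
print(row["Algorithm"], round(row["test_rmse"], 3))
```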


## Train and Predict
`SVDpp` gave the lowest RMSE in the benchmark, although the top four algorithms are within about 0.01 of each other. We therefore tune `SVDpp` with a small grid search, then train it and predict.
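For context, SVD++ extends the biased SVD model with implicit-feedback item factors. In the form given in the Surprise documentation (reproduced here from memory, so check it against the docs), the predicted rating is:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\left(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\right)
```

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $y_j$ are the implicit item factors, and $I_u$ is the set of items rated by user $u$. `n_factors` sets the dimension of the factor vectors, while `lr_all` and `reg_all` set one learning rate and one regularization strength for all parameters.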

In [6]:
from surprise.model_selection import GridSearchCV

param_grid = {'n_factors': [15, 20, 25], 
              'n_epochs': [15, 20, 25], 
              'lr_all': [0.001, 0.007, 0.012],
              'reg_all': [0.01, 0.02, 0.03]}

gs = GridSearchCV(algo_class=SVDpp, param_grid=param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

2.0378257716600316
{'n_factors': 15, 'n_epochs': 25, 'lr_all': 0.012, 'reg_all': 0.02}
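An exhaustive grid search fits one model per parameter combination per fold, so its cost grows multiplicatively with every value added to the grid. The sketch below enumerates the same grid with the standard library to make that cost explicit; `expand_grid` is an illustrative helper, not a Surprise function.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    "n_factors": [15, 20, 25],
    "n_epochs": [15, 20, 25],
    "lr_all": [0.001, 0.007, 0.012],
    "reg_all": [0.01, 0.02, 0.03],
}

def expand_grid(grid):
    """Return every parameter combination as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]

combos = expand_grid(param_grid)
n_folds = 3
print(len(combos), "combinations,", len(combos) * n_folds, "model fits")
```

The best `n_epochs` (25) and `lr_all` (0.012) both sit at the upper edge of their ranges, and `n_factors` (15) at the lower edge, so extending the grid in those directions may be worth trying.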


We can now use these hyperparameters to train an `SVDpp` model on a 75% training split and evaluate its predictions on the remaining 25%.

In [13]:
# Split the dataset into training and testing sets
trainset, testset = train_test_split(data, test_size=0.25, shuffle=True)

# Create an SVDpp algorithm with the best hyperparameters
optimal_svd = SVDpp(n_factors=15, n_epochs=25, lr_all=0.012, reg_all=0.02)

# Train the algorithm on the training set
optimal_svd.fit(trainset)

# Make predictions on the test set
predictions = optimal_svd.test(testset)

# Evaluate the performance using RMSE
rmse = accuracy.rmse(predictions)
print("Test RMSE:", rmse)

RMSE: 2.0153
Test RMSE: 2.0152985009080933


To inspect our predictions in detail, we build a pandas data frame with all the predictions. The following code was largely taken from this [notebook](http://nbviewer.jupyter.org/github/NicolasHug/Surprise/blob/master/examples/notebooks/KNNBasic_analysis.ipynb).

In [14]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)
best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

In [15]:
best_predictions

| | uid | iid | rui | est | details | Iu | Ui | err |
|---|---|---|---|---|---|---|---|---|
| 2370 | 7699 | 166426 | 6.0 | 5.999426 | {'was_impossible': False} | 0 | 1 | 0.000574 |
| 378 | 7701 | 166426 | 6.0 | 5.999426 | {'was_impossible': False} | 0 | 1 | 0.000574 |
| 1820 | 6365 | 242033 | 6.0 | 6.000869 | {'was_impossible': False} | 0 | 1 | 0.000869 |
| 432 | 11209 | 493529 | 7.0 | 6.999038 | {'was_impossible': False} | 0 | 5 | 0.000962 |
| 11 | 10059 | 550988 | 7.0 | 6.99888 | {'was_impossible': False} | 0 | 5 | 0.00112 |
| 2207 | 5827 | 101299 | 6.0 | 6.003456 | {'was_impossible': False} | 0 | 2 | 0.003456 |
| 1127 | 9536 | 425001 | 6.0 | 6.004015 | {'was_impossible': False} | 0 | 1 | 0.004015 |
| 2632 | 10491 | 628900 | 6.0 | 5.99582 | {'was_impossible': False} | 0 | 2 | 0.00418 |
| 2696 | 7248 | 258489 | 6.0 | 5.990688 | {'was_impossible': False} | 0 | 1 | 0.009312 |
| 1740 | 6825 | 211672 | 6.0 | 6.01334 | {'was_impossible': False} | 0 | 1 | 0.01334 |


Small values in the **err** column mean the estimate landed very close to the true rating. They describe accuracy on these particular rows, not the model's confidence in its predictions.
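Every row in the table above has `Iu = 0`, meaning the user has no ratings in the training set. For such users a biased matrix-factorization model has no user bias or user factors to draw on, so the estimate can only come from the global mean and item-side terms. The sketch below is a simplified illustration of that fallback under this assumption. The `estimate` function, its arguments, and the numbers are hypothetical and do not reproduce Surprise's internals.

```python
def estimate(global_mean, user_bias=None, item_bias=None, interaction=None):
    """Baseline-style estimate that skips any term whose user or item is unknown.

    user_bias / item_bias are None when the user / item is absent from the
    training set; interaction (the factor dot product) needs both to be known.
    """
    est = global_mean
    if user_bias is not None:
        est += user_bias
    if item_bias is not None:
        est += item_bias
    if user_bias is not None and item_bias is not None and interaction is not None:
        est += interaction
    return est

# Hypothetical numbers: an unseen user rating a known item.
cold_user = estimate(global_mean=7.0, item_bias=-1.0)
print(cold_user)  # 6.0
```

Under this reading, a near-zero error for a cold-start user says mostly that the user happened to rate the item close to its typical level.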

In [16]:
worst_predictions

| | uid | iid | rui | est | details | Iu | Ui | err |
|---|---|---|---|---|---|---|---|---|
| 506 | 3028 | 686 | 1.0 | 7.076706 | {'was_impossible': False} | 0 | 3 | 6.076706 |
| 1040 | 7856 | 346364 | 1.0 | 7.137821 | {'was_impossible': False} | 0 | 4 | 6.137821 |
| 2832 | 8877 | 474350 | 1.0 | 7.142201 | {'was_impossible': False} | 0 | 7 | 6.142201 |
| 744 | 796 | 28 | 1.0 | 7.16885 | {'was_impossible': False} | 0 | 4 | 6.16885 |
| 2215 | 6977 | 334541 | 1.0 | 7.204771 | {'was_impossible': False} | 0 | 2 | 6.204771 |
| 202 | 2718 | 9598 | 1.0 | 7.222246 | {'was_impossible': False} | 0 | 1 | 6.222246 |
| 2566 | 4064 | 6171 | 1.0 | 7.249354 | {'was_impossible': False} | 0 | 1 | 6.249354 |
| 2493 | 10152 | 370172 | 1.0 | 7.284259 | {'was_impossible': False} | 0 | 5 | 6.284259 |
| 2531 | 11547 | 466420 | 1.0 | 7.314201 | {'was_impossible': False} | 0 | 5 | 6.314201 |
| 88 | 11202 | 603692 | 1.0 | 7.377509 | {'was_impossible': False} | 0 | 11 | 6.377509 |


The worst predictions show where the recommender failed badly. In all ten rows the true rating is 1.0 while the estimate lies between about 7.1 and 7.4, an error of more than 6 rating points. `Iu` is 0 in every row, so the model had no training ratings for these users and no way to learn that they rate low.
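One way to check whether large errors cluster at particular rating values is to group the absolute error by true rating. The sketch below does this with the standard library on a handful of `(rui, est)` pairs copied from the best and worst tables above; `mean_abs_error_by_rating` is an illustrative helper, and six rows are far too few to draw conclusions from.

```python
from collections import defaultdict

# (true rating, estimate) pairs copied from the best and worst tables above.
rows = [
    (6.0, 5.999426), (7.0, 6.999038), (6.0, 6.003456),
    (1.0, 7.076706), (1.0, 7.137821), (1.0, 7.377509),
]

def mean_abs_error_by_rating(rows):
    """Group absolute errors by true rating and average each group."""
    errors = defaultdict(list)
    for true, est in rows:
        errors[true].append(abs(est - true))
    return {rating: sum(errs) / len(errs) for rating, errs in sorted(errors.items())}

for rating, mae in mean_abs_error_by_rating(rows).items():
    print(f"rui={rating}: mean |err| = {mae:.3f}")
```

On the full predictions data frame, the same breakdown is `df.groupby('rui')['err'].mean()`.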

Addressing these cases is one route to improving the recommender. Options include trying more advanced algorithms, improving data preprocessing, or building a hybrid model that combines collaborative and content-based filtering, which may help for users with little or no rating history.