# SVDpp baseline algorithm using the Surprise package

This notebook uses the `matrix_factorization.SVDpp` algorithm from the Surprise prediction algorithms package: http://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

---
### [KBV09] Yehuda Koren. Matrix factorization techniques for recommender systems.

see [Matrix Factorization techniques](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf)

Explicit feedback is a set of (user, item, rating) triples, represented as a user-item matrix.

Matrix factorization models map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space.

Each rating is approximated as $\hat{r}_{ui} = q_i^Tp_u$. How do we compute $q_i$ and $p_u$?

We want to minimize $\min_{p_u, q_i} \sum_{x_{ui}}(r_{ui}-p_u^Tq_i)^2 + \lambda (\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$

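1. SGD can be used to optimize this objective (see Simon Funk). This is the focus here.
2. ALS (alternating least squares) can be used instead: fixing either the $p_u$ or the $q_i$ makes the objective quadratic in the other, so each step is a convex least-squares problem.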
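
The SGD option can be sketched as follows. This is a minimal pure-Python illustration of the update rule described in KBV09 ($e_{ui} = r_{ui} - q_i^Tp_u$, then $q_i \leftarrow q_i + \gamma(e_{ui}p_u - \lambda q_i)$ and $p_u \leftarrow p_u + \gamma(e_{ui}q_i - \lambda p_u)$), not the Surprise implementation. The toy ratings, the function names, and the hyperparameter values are assumptions made for the example.

```python
import random


def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))


def sgd_mf(ratings, n_users, n_items, n_factors=2, lr=0.02, reg=0.02,
           n_epochs=500, seed=0):
    """Fit user factors P and item factors Q on (u, i, r_ui) triples with SGD."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - dot(Q[i], P[u])  # e_ui = r_ui - q_i^T p_u
            for f in range(n_factors):
                p_uf, q_if = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
    return P, Q


def squared_error(ratings, P, Q):
    """Sum of squared errors over the known ratings (without the penalty term)."""
    return sum((r - dot(Q[i], P[u])) ** 2 for u, i, r in ratings)


# Hypothetical toy data: (user index, item index, rating)
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
       (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

P0, Q0 = sgd_mf(toy, n_users=3, n_items=3, n_epochs=0)  # initial factors only
P, Q = sgd_mf(toy, n_users=3, n_items=3)                # trained factors
print(squared_error(toy, P0, Q0), squared_error(toy, P, Q))
```

The same seed is used for both calls, so the two printed numbers compare the squared error before and after training from an identical starting point.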

__Adding Biases__:
Some users tend to give higher or lower ratings than others, and some items tend to receive higher or lower ratings than others.

Bias involved in rating $r_{ui}$ is denoted by $b_{ui}$:<br/>
$b_{ui} = \mu + b_i + b_u$ <br/>
$\mu$: average rating over all movies <br/>
$b_i$: deviation of item i from average <br/>
$b_u$: deviation of user u from average <br/>

The estimated rating is: <br/>
$\hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u$

adjusted objective: <br/>
$\min_{p_u, q_i} \sum_{x_{ui}}(r_{ui}-\mu-b_u-b_i -p_u^Tq_i)^2 + \lambda (\lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_u^2 + b_i^2)$
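
As a worked example of the bias terms, KBV09 estimates the rating of a critical user (Joe) for a popular movie (Titanic): with $\mu = 3.7$, $b_i = 0.5$ and $b_u = -0.3$, the baseline is $3.7 + 0.5 - 0.3 = 3.9$. The sketch below reproduces that arithmetic and then adds an interaction term. The function name and the factor vectors in the second call are made up for illustration.

```python
def predict_with_biases(mu, b_i, b_u, q_i, p_u):
    """Estimate r_ui = mu + b_i + b_u + q_i^T p_u."""
    interaction = sum(q * p for q, p in zip(q_i, p_u))
    return mu + b_i + b_u + interaction


mu, b_titanic, b_joe = 3.7, 0.5, -0.3

# With zero factor vectors the estimate reduces to the baseline b_ui.
baseline = predict_with_biases(mu, b_titanic, b_joe, q_i=[0.0, 0.0], p_u=[0.0, 0.0])
print(round(baseline, 2))  # 3.9

# Hypothetical factors: the inner product shifts the estimate away from the baseline.
estimate = predict_with_biases(mu, b_titanic, b_joe, q_i=[0.2, -0.1], p_u=[1.0, 0.5])
print(round(estimate, 2))
```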

---
### [RRSK10] Francesco Ricci. Recommender Systems Handbook.

section 5.3.2: SVD++ <br/>
Prediction accuracy is improved by considering also implicit feedback, which provides an additional indication of user preferences.
A second set of item factors is added, relating each item i to a factor vector $y_i$. Those new item factors are used to characterize users based on the set of items that they rated. The exact model is as follows: 

$\hat{r}_{ui} = \mu +b_i+b_u+q_i^T (p_u+|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)}y_j)$

The set $R(u)$ contains the items rated by user u.
Now, a user u is modeled as

$p_u+|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)}y_j$.
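
The SVD++ prediction rule can be sketched in a few lines. This is an illustration only, not the Surprise code: it assumes the square-root normalization $|R(u)|^{-1/2}$ used in the Surprise `SVDpp` documentation, and the function name and all numeric values are invented for the example.

```python
import math


def svdpp_predict(mu, b_i, b_u, q_i, p_u, y, rated_items):
    """Estimate r_ui with the SVD++ rule.

    y maps an item id j to its implicit factor vector y_j;
    rated_items is R(u), the items rated by user u.
    """
    user_vec = list(p_u)
    if rated_items:
        norm = 1 / math.sqrt(len(rated_items))  # |R(u)|^{-1/2}
        for f in range(len(user_vec)):
            user_vec[f] += norm * sum(y[j][f] for j in rated_items)
    return mu + b_i + b_u + sum(q * v for q, v in zip(q_i, user_vec))


# Hypothetical values with two latent factors and four rated items
y = {0: [0.2, 0.0], 1: [0.2, 0.0], 2: [0.4, 0.0], 3: [0.0, 0.0]}
est = svdpp_predict(mu=3.5, b_i=0.1, b_u=-0.2,
                    q_i=[1.0, 0.0], p_u=[0.5, 0.5],
                    y=y, rated_items=[0, 1, 2, 3])
print(round(est, 2))  # 4.3
```

Here the implicit term adds $0.5 \cdot 0.8 = 0.4$ to the first user factor, so the inner product is $0.9$ and the estimate is $3.5 + 0.1 - 0.2 + 0.9 = 4.3$.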


In [1]:
import pandas as pd
import numpy as np
from scipy import stats

from surprise import Reader, Dataset
from surprise.model_selection.search import RandomizedSearchCV
from surprise.prediction_algorithms.matrix_factorization import SVDpp

import helpers
from surprise_helpers import CustomReader, get_ratings_from_predictions

## Data loading

In [2]:
reader = CustomReader()
filepath = helpers.get_train_file_path()
data = Dataset.load_from_file(filepath, reader=reader)

## Search over params


In [None]:
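# scipy's uniform(loc, scale) samples from [loc, loc + scale]
param_grid = {'n_factors': stats.randint(5, 150),     # integers in [5, 150)
              'lr_all': stats.uniform(0.001, 0.01),   # [0.001, 0.011]
              'reg_all': stats.uniform(0.01, 0.1),    # [0.01, 0.11]
             }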

gs = RandomizedSearchCV(algo_class=SVDpp, param_distributions=param_grid, measures=['rmse'], 
                        cv=10, joblib_verbose=100, n_jobs=-1, n_iter=20)
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
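
For reference, the `rmse` measure is the root of the mean squared difference between true and estimated ratings. A minimal sketch of the formula follows; the helper name and the two example pairs are hypothetical, and this is not the Surprise `accuracy` module.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true rating, estimated rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


# Two hypothetical predictions, each off by 0.5
print(rmse([(3.0, 3.5), (4.0, 3.5)]))  # 0.5
```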

In [None]:
pd.DataFrame.from_dict(gs.cv_results)

## Results: params 

Note: run on the Leonhard cluster (20 cores, 22 GB memory) with `cv=10`. <br/>
Each entry below is an RMSE score followed by the corresponding parameters.

0.995445402547  
{'lr_all': 0.0092244620489602275, 'n_factors': 71, 'reg_all': 0.078864113503182592}

0.996853075175  
{'lr_all': 0.0080231466104846508, 'n_factors': 77, 'reg_all': 0.081201815513722436}

1.01384855614  
{}

0.996911527037  
{'lr_all': 0.01021104810095308, 'n_factors': 15, 'reg_all': 0.073931026257899185}

0.997546908531  
{'lr_all': 0.0073480945698983675, 'n_factors': 12, 'reg_all': 0.072260697833741921}

1.00041116787  
{'lr_all': 0.005849089260803208, 'n_factors': 69, 'reg_all': 0.090348276620233745}
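
The best entry can also be picked programmatically. A small sketch, with the (RMSE, params) pairs copied by hand from the results above (the variable names are only for illustration):

```python
# (RMSE, params) pairs copied from the results listed in this section
results = [
    (0.995445402547, {'lr_all': 0.0092244620489602275, 'n_factors': 71, 'reg_all': 0.078864113503182592}),
    (0.996853075175, {'lr_all': 0.0080231466104846508, 'n_factors': 77, 'reg_all': 0.081201815513722436}),
    (1.01384855614, {}),
    (0.996911527037, {'lr_all': 0.01021104810095308, 'n_factors': 15, 'reg_all': 0.073931026257899185}),
    (0.997546908531, {'lr_all': 0.0073480945698983675, 'n_factors': 12, 'reg_all': 0.072260697833741921}),
    (1.00041116787, {'lr_all': 0.005849089260803208, 'n_factors': 69, 'reg_all': 0.090348276620233745}),
]

# Lower RMSE is better, so take the minimum by score.
best_rmse, best_params = min(results, key=lambda pair: pair[0])
print(best_rmse, best_params)
```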

## Train

In [3]:
# best parameters from the results above (lowest RMSE: 0.995445402547)
params = {'lr_all': 0.0092244620489602275, 'n_factors': 71, 'reg_all': 0.078864113503182592}
algo = SVDpp(**params)

# train on the full training set
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x109d6ccc0>

## Predicting
We load the test data and predict a rating for each of its entries.

In [4]:
test_file_path = helpers.get_test_file_path()
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.construct_testset(test_data.raw_ratings)
predictions = algo.test(testset)
predictions[0]

Prediction(uid=36, iid=0, r_ui=3.0, est=3.3858988120492683, details={'was_impossible': False})

We need to convert the predictions into the submission format.

In [5]:
ratings = get_ratings_from_predictions(predictions)

Now we can write the file.

In [6]:
output = helpers.write_submission(ratings, 'submission_surprise_SVDpp_0.csv')
print(output[0:100])

Id,Prediction
r37_c1,3.385899
r73_c1,3.124768
r156_c1,3.766790
r160_c1,3.321574
r248_c1,3.561472
r25
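
Judging from the first prediction and the first output row above (`uid=36, iid=0` becomes `r37_c1`), the submission Id appears to be the 1-based row and column index. The sketch below shows one way such a conversion could look. It is an assumption and may differ from what `get_ratings_from_predictions` and `helpers.write_submission` actually do; `to_submission_rows` is a hypothetical name, and the `Prediction` tuple is a stand-in so the example runs without Surprise.

```python
from collections import namedtuple

# Stand-in with the same fields as the Prediction shown in the output above.
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])


def to_submission_rows(predictions):
    """Hypothetical helper: format predictions as 'r<row>_c<col>,<rating>' lines."""
    rows = ['Id,Prediction']
    for pred in predictions:
        # Assumption: ids are 0-based internally and 1-based in the submission file.
        row_id = 'r{}_c{}'.format(int(pred.uid) + 1, int(pred.iid) + 1)
        rows.append('{},{:.6f}'.format(row_id, pred.est))
    return rows


example = [Prediction(uid=36, iid=0, r_ui=3.0, est=3.3858988120492683,
                      details={'was_impossible': False})]
print('\n'.join(to_submission_rows(example)))
```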
