# SVD baseline algorithm using the Surprise package

This notebook uses the `matrix_factorization.SVD` algorithm from the Surprise package: http://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

---
### [KBV09] Yehuda Koren, Robert Bell, Chris Volinsky. Matrix factorization techniques for recommender systems.

See [Matrix Factorization techniques](https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf).

Explicit feedback consists of (user, item, rating) triples, represented as a user-item matrix.

Matrix factorization models map both users and items to a joint latent factor space of dimensionality f, such that user-item interactions are modeled as inner products in that space.

Assume $\hat{r}_{ui} = q_i^Tp_u$. How do we compute $q_i$ and $p_u$?

We want to solve $\min_{p_u, q_i} \sum_{x_{ui}} (r_{ui} - p_u^Tq_i)^2 + \lambda (\lVert q_i \rVert^2 + \lVert p_u \rVert^2)$, where the sum runs over the observed ratings $x_{ui}$.

1. Use SGD to optimize (see Simon Funk). This is the focus here.
2. Use ALS (alternating least squares): with one set of factors held fixed, the objective becomes convex (quadratic) in the other.
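
The SGD option can be sketched in a few lines of plain Python. This is a minimal illustration of the regularized objective above (without bias terms), not the Surprise implementation; the function names, the toy ratings, and the hyperparameter values are assumptions chosen for the example.

```python
import random

def sgd_factorize(ratings, n_users, n_items, n_factors=2,
                  lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit p_u and q_i by SGD on the regularized squared error (no biases)."""
    rng = random.Random(seed)
    # Small random initialization of the latent factors.
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            # Prediction error e_ui = r_ui - q_i^T p_u for one observed rating.
            err = r - sum(pf * qf for pf, qf in zip(p[u], q[i]))
            for f in range(n_factors):
                pf, qf = p[u][f], q[i][f]
                # Gradient step on both factors, each using the other's old value.
                p[u][f] += lr * (err * qf - reg * pf)
                q[i][f] += lr * (err * pf - reg * qf)
    return p, q

def squared_error(ratings, p, q):
    """Sum of squared errors over the observed ratings."""
    return sum((r - sum(a * b for a, b in zip(p[u], q[i]))) ** 2
               for u, i, r in ratings)

# Toy explicit feedback: (user, item, rating) triples.
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
p0, q0 = sgd_factorize(toy_ratings, 3, 3, n_epochs=0)    # untrained factors
p1, q1 = sgd_factorize(toy_ratings, 3, 3, n_epochs=200)  # trained factors
print(squared_error(toy_ratings, p0, q0), squared_error(toy_ratings, p1, q1))
```

The second printed value should be smaller than the first, since training reduces the error on the observed ratings.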

__Adding Biases__:
Some users tend to give higher or lower ratings than others, and some items tend to receive higher or lower ratings than others.

Bias involved in rating $r_{ui}$ is denoted by $b_{ui}$:<br/>
$b_{ui} = \mu + b_i + b_u$ <br/>
$\mu$: average rating over all movies <br/>
$b_i$: deviation of item i from average <br/>
$b_u$: deviation of user u from average <br/>

The estimate of the rating is: <br/>
$\hat{r}_{ui} = \mu + b_i + b_u + q_i^Tp_u$
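
As a small worked example of this estimate, the sketch below adds the four terms for made-up values. The function name `predict_rating` and all numbers are illustrative assumptions, not values learned from the data used in this notebook.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Return mu + b_u + b_i + q_i^T p_u."""
    interaction = sum(pf * qf for pf, qf in zip(p_u, q_i))
    return mu + b_u + b_i + interaction

# Bias-only estimate: a critical user (b_u < 0) rating a popular item (b_i > 0).
baseline = predict_rating(mu=3.7, b_u=-0.3, b_i=0.5, p_u=[0.0, 0.0], q_i=[0.0, 0.0])

# Same biases plus a user-item interaction term q_i^T p_u.
with_factors = predict_rating(mu=3.7, b_u=-0.3, b_i=0.5, p_u=[0.5, -0.2], q_i=[0.4, 0.5])

print(round(baseline, 2), round(with_factors, 2))
```

The biases shift the global average up or down, and the inner product adds whatever the latent factors capture on top of that.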

Adjusted objective (the biases are learned together with the factors): <br/>
$\min_{p_u, q_i, b_u, b_i} \sum_{x_{ui}} (r_{ui} - \mu - b_u - b_i - p_u^Tq_i)^2 + \lambda (\lVert q_i \rVert^2 + \lVert p_u \rVert^2 + b_u^2 + b_i^2)$
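
For reference, differentiating the adjusted objective with respect to each parameter for a single observed rating gives the SGD updates below. Here $e_{ui} = r_{ui} - \mu - b_u - b_i - q_i^Tp_u$ is the prediction error and $\gamma$ is the learning rate; the symbol $\gamma$ is introduced here for illustration, and the constant factor 2 from the gradient is absorbed into it. In Surprise's `SVD`, `lr_all` and `reg_all` set $\gamma$ and $\lambda$ for all parameters.

$$
\begin{aligned}
b_u &\leftarrow b_u + \gamma (e_{ui} - \lambda b_u) \\
b_i &\leftarrow b_i + \gamma (e_{ui} - \lambda b_i) \\
p_u &\leftarrow p_u + \gamma (e_{ui} \, q_i - \lambda p_u) \\
q_i &\leftarrow q_i + \gamma (e_{ui} \, p_u - \lambda q_i)
\end{aligned}
$$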

In [12]:
import pandas as pd
import numpy as np
from scipy import stats

import helpers
from surprise_helpers import CustomReader, get_ratings_from_predictions
from surprise import Reader, Dataset
from surprise.model_selection.search import RandomizedSearchCV
from surprise.prediction_algorithms.matrix_factorization import SVD

## Data loading

In [13]:
reader = CustomReader()
filepath = helpers.get_train_file_path()
data = Dataset.load_from_file(filepath, reader=reader)

## Search over params


In [None]:
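# Distributions to sample from. scipy's randint(low, high) draws integers in [low, high),
# and uniform(loc, scale) draws floats in [loc, loc + scale].
param_grid = {'n_epochs': stats.randint(5, 20),       # 5..19 epochs
              'lr_all': stats.uniform(0.002, 0.005),  # learning rate in [0.002, 0.007]
              'reg_all': stats.uniform(0.02, 0.6),    # regularization in [0.02, 0.62]
              'n_factors': stats.randint(50, 150),    # 50..149 latent factors
             }

# 100 random draws, each evaluated by 10-fold cross-validation on all available cores
gs = RandomizedSearchCV(algo_class=SVD, param_distributions=param_grid, measures=['rmse'],
                        cv=10, joblib_verbose=100, n_jobs=-1, n_iter=100)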
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
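
For orientation, the `rmse` measure reported above is the root mean squared error between true and estimated ratings, and `gs.best_score['rmse']` is the best average of that measure across the cross-validation folds. The sketch below computes RMSE for a handful of made-up (true, estimated) pairs; the function and the numbers are illustrative and not part of Surprise.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true rating, estimated rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

# Made-up (true, estimated) ratings, for illustration only.
toy_pairs = [(3.0, 3.5), (4.0, 3.5), (1.0, 2.0), (5.0, 4.0)]
toy_rmse = rmse(toy_pairs)
print(toy_rmse)
```

An RMSE near 1.0, as in the results of this notebook, means the estimates are off by roughly one rating point in the root-mean-square sense.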

In [None]:
pd.DataFrame.from_dict(gs.cv_results)

## Results: params 

Note: run on the Leonhard cluster (20 cores, 22 GB memory) with cv=10.
Each pair below shows a best RMSE score followed by the corresponding parameters.

0.996617863993
{'lr_all': 0.0080655611939959484, 'n_epochs': 19, 'n_factors': 9, 'reg_all': 0.042201220509606799}

0.998777808857
{'n_factors': 5}

1.00085021593
{'lr_all': 0.0035314408264436933, 'n_epochs': 19, 'n_factors': 50, 'reg_all': 0.027105037999075404}

1.00534587695
{'lr_all': 0.0034656840329879137, 'n_epochs': 10, 'n_factors': 42, 'reg_all': 0.12231592623013628}

1.00104676332
{'lr_all': 0.0066032381482039656, 'n_epochs': 17, 'n_factors': 107, 'reg_all': 0.036362623151074552}

1.00382744957
{'lr_all': 0.0045664408289589759, 'n_epochs': 12, 'n_factors': 9, 'reg_all': 0.04029560227746723}

## Train

In [19]:
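# Best parameters found by the search above (RMSE 0.9966), kept for reference:
#algo = SVD(n_epochs=19, lr_all=0.0080655611939959484, reg_all=0.042201220509606799, n_factors=9)
# Model used here: only n_factors is set, all other hyperparameters keep their defaults (RMSE 0.9988 above)
algo = SVD(n_factors=5)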

# train 
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10f07c128>

## Predicting
We load the test data and predict its ratings.

In [20]:
test_file_path = helpers.get_test_file_path()
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.construct_testset(test_data.raw_ratings)
predictions = algo.test(testset)
predictions[0]

Prediction(uid=36, iid=0, r_ui=3.0, est=3.175193448906952, details={'was_impossible': False})

We need to convert the predictions into the right format.

In [21]:
ratings = get_ratings_from_predictions(predictions)
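
The helper `get_ratings_from_predictions` is not shown in this notebook. As a rough idea of the kind of conversion involved, the sketch below turns prediction tuples into `Id,Prediction` lines. It assumes 0-based user and item ids that become 1-based `r<row>_c<col>` identifiers, which is inferred from the first prediction (`uid=36, iid=0`) and the `r37_c1` line in the submission output. `ToyPrediction` and `format_submission_rows` are hypothetical names, not part of `helpers` or Surprise.

```python
from collections import namedtuple

# Stand-in for Surprise's Prediction tuple so the sketch runs without the library.
ToyPrediction = namedtuple("ToyPrediction", ["uid", "iid", "est"])

def format_submission_rows(predictions):
    """Hypothetical helper: build 'r<row>_c<col>,<rating>' lines from 0-based ids."""
    rows = ["Id,Prediction"]
    for pred in predictions:
        # Assumption: ids are 0-based and the submission format is 1-based.
        row_id = "r{}_c{}".format(int(pred.uid) + 1, int(pred.iid) + 1)
        rows.append("{},{:.6f}".format(row_id, pred.est))
    return "\n".join(rows)

toy_predictions = [ToyPrediction(uid=36, iid=0, est=3.175193448906952)]
submission_text = format_submission_rows(toy_predictions)
print(submission_text)
```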

Now we can write the file.

In [22]:
output = helpers.write_submission(ratings, 'submission_surprise_SVD_1.csv')
print(output[0:100])

Id,Prediction
r37_c1,3.175193
r73_c1,3.022989
r156_c1,3.743569
r160_c1,3.393022
r248_c1,3.336667
r25
