# NMF baseline algorithm using the Surprise package

This notebook uses the `matrix_factorization.NMF` algorithm from http://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.NMF


A collaborative filtering algorithm based on Non-negative Matrix Factorization.
### [LZXZ14] Luo, Zhou, Xia and Zhu. An efficient non-negative matrix factorization-based approach to collaborative filtering for recommender systems.
This algorithm is very similar to SVD. The prediction $\hat{r}_{ui}$ is set as:
$\hat{r}_{ui} = q_i^T p_u$,
where user and item factors are kept positive.
The optimization procedure is a (regularized) stochastic gradient descent with a specific choice of step size that ensures non-negativity of factors, provided that their initial values are also positive.

At each step of the SGD procedure, each factor $f$ of user $u$ and item $i$ is updated as follows:
$p_{uf} \leftarrow p_{uf} \cdot \frac{\sum_{i \in I_u} q_{if} \cdot r_{ui}}{\sum_{i \in I_u} q_{if} \cdot \hat{r}_{ui} + \lambda_u |I_u| p_{uf}}$

$q_{if} \leftarrow q_{if} \cdot \frac{\sum_{u \in U_i} p_{uf} \cdot r_{ui}}{\sum_{u \in U_i} p_{uf} \cdot \hat{r}_{ui} + \lambda_i |U_i| q_{if}}$

where $\lambda_u$ and $\lambda_i$ are regularization parameters.
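
The update rules above can be sketched in NumPy to see why positive factors stay positive: every factor is multiplied by a ratio of non-negative sums. This is a minimal illustration under assumptions, not Surprise's implementation. The toy rating matrix `R`, the convention that `0` means "not rated", the function name `nmf_epoch`, and the choice to update all users and items from the same current factors are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rating matrix (assumption): rows are users, columns are items, 0 = not rated.
R = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 2],
    [1, 1, 0, 5, 0],
    [0, 1, 5, 4, 3],
], dtype=float)
mask = R > 0
n_users, n_items = R.shape
n_factors = 3

# Strictly positive initial factors, as the method requires.
P = rng.uniform(0.1, 1.0, size=(n_users, n_factors))
Q = rng.uniform(0.1, 1.0, size=(n_items, n_factors))


def nmf_epoch(P, Q, R, mask, reg_pu, reg_qi):
    """Apply one multiplicative update to all user and item factors."""
    R_obs = R * mask
    R_hat = (P @ Q.T) * mask  # predictions, kept only where a rating exists

    # Numerators and denominators of the user update: sums over I_u.
    user_num = R_obs @ Q
    user_den = R_hat @ Q + reg_pu * mask.sum(axis=1, keepdims=True) * P

    # Numerators and denominators of the item update: sums over U_i.
    item_num = R_obs.T @ P
    item_den = R_hat.T @ P + reg_qi * mask.sum(axis=0)[:, None] * Q

    return P * user_num / user_den, Q * item_num / item_den


def rmse(P, Q, R, mask):
    """Root mean squared error on the observed ratings only."""
    return np.sqrt(np.mean((R - P @ Q.T)[mask] ** 2))


print("RMSE before:", rmse(P, Q, R, mask))
for _ in range(20):
    P, Q = nmf_epoch(P, Q, R, mask, reg_pu=0.06, reg_qi=0.06)
print("RMSE after: ", rmse(P, Q, R, mask))
```

The sketch assumes every user and every item has at least one rating; otherwise a denominator would be zero and the update would need a guard.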

A biased version is available by setting the `biased` parameter to `True`. In this case, the prediction is set as
$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$,
while still keeping the factors positive.
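
As a small worked example of the two prediction formulas, the snippet below evaluates the unbiased and the biased prediction for a single user and item. All numbers (`p_u`, `q_i`, `mu`, `b_u`, `b_i`) are made up for illustration and are not taken from the trained model.

```python
import numpy as np

# Hypothetical positive factor vectors for one user and one item.
p_u = np.array([0.8, 0.1, 0.5])
q_i = np.array([1.2, 0.4, 2.0])

# Unbiased prediction: dot product of item and user factors.
r_hat = q_i @ p_u  # 0.96 + 0.04 + 1.0

# Biased prediction: global mean plus user and item biases plus the dot product.
mu, b_u, b_i = 3.0, 0.2, -0.4  # hypothetical values
r_hat_biased = mu + b_u + b_i + q_i @ p_u

print(r_hat, r_hat_biased)
```

Note that the biases may be negative; only the factor vectors are constrained to stay positive.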

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

import helpers
from surprise_helpers import CustomReader, get_ratings_from_predictions
from surprise import Reader, Dataset
from surprise.model_selection.search import RandomizedSearchCV
from surprise.prediction_algorithms.matrix_factorization import NMF

## Data Loading

In [2]:
reader = CustomReader()
filepath = helpers.get_train_file_path()
data = Dataset.load_from_file(filepath, reader=reader)

## Search over Parameters

In [None]:
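# Distributions to sample from. Note the scipy conventions:
# stats.randint(low, high) draws integers in [low, high - 1], and
# stats.uniform(loc, scale) draws floats in [loc, loc + scale].
param_grid = {
    'n_epochs': stats.randint(230, 290),
    'n_factors': stats.randint(1, 30),
    'reg_pu': stats.uniform(0.1, 0.2),  # i.e. between 0.1 and 0.3
    'reg_qi': stats.uniform(0.1, 0.2),  # i.e. between 0.1 and 0.3
    'biased': [True, False],
}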

gs = RandomizedSearchCV(algo_class=NMF, param_distributions=param_grid, measures=['rmse'], 
                        cv=10, joblib_verbose=100, n_jobs=-1, n_iter=5)
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

In [None]:
pd.DataFrame.from_dict(gs.cv_results)
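
The ranges that the search above actually samples from are easy to misread, because `stats.uniform` takes `(loc, scale)` rather than `(low, high)`. The following sketch draws samples from the same two kinds of distributions and prints their extremes. The variable names and the sample size are chosen here for illustration only.

```python
from scipy import stats

# Same distribution objects as in the search above.
n_epochs_dist = stats.randint(230, 290)  # integers from 230 up to 289
reg_dist = stats.uniform(0.1, 0.2)       # floats from 0.1 up to 0.1 + 0.2 = 0.3

# Draw reproducible samples to inspect the effective ranges.
epoch_samples = n_epochs_dist.rvs(size=1000, random_state=0)
reg_samples = reg_dist.rvs(size=1000, random_state=0)

print("n_epochs range:", epoch_samples.min(), epoch_samples.max())
print("reg range:     ", reg_samples.min(), reg_samples.max())
```

With these settings the regularization terms are never sampled below 0.1, which is worth keeping in mind when comparing against results from other search ranges.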

## Results: params 

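Note: run on the Leonhard cluster (20 cores and 22 GB of memory) with `cv=10`. <br/>
Each entry below shows the best RMSE of one search run, followed by the parameters that produced it.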

0.992061420614
{'biased': False, 'n_epochs': 193, 'n_factors': 31, 'reg_pu': 0.066509535353532462, 'reg_qi': 0.10446537083767632}

0.992958716098
{'biased': False, 'n_epochs': 176, 'n_factors': 37, 'reg_pu': 0.069167790265929271, 'reg_qi': 0.086294785465031928}

0.99311438531
{'biased': False, 'n_epochs': 191, 'n_factors': 31, 'reg_pu': 0.07808007756523791, 'reg_qi': 0.088612273434871519}

0.993592996799
{'biased': False, 'n_epochs': 188, 'n_factors': 38, 'reg_pu': 0.089413334809372413, 'reg_qi': 0.10338587840560964}

0.994897339341
{'biased': False, 'n_epochs': 183, 'n_factors': 27, 'reg_pu': 0.056499330646607303, 'reg_qi': 0.10566161340114467}

0.994350620317
{'biased': False, 'n_epochs': 150, 'n_factors': 42, 'reg_pu': 0.071276306378441942, 'reg_qi': 0.084673247552462749}

0.994469288179
{'n_epochs': 261, 'n_factors': 23, 'reg_bi': 1.6667791514258101, 'reg_bu': 0.64371831932001311, 'reg_pu': 0.10279700072169747, 'reg_qi': 0.12452450006647738}

0.995282415101
{'n_epochs': 256}



0.997362682285
{'biased': False, 'n_epochs': 165, 'n_factors': 21, 'reg_pu': 0.1015205155303968, 'reg_qi': 0.064342096085956021}

0.997185497246
{'biased': False, 'n_epochs': 181, 'n_factors': 14, 'reg_pu': 0.065538361077981833, 'reg_qi': 0.071701378482202344}

0.996537013593
{'biased': False, 'n_epochs': 146, 'n_factors': 44, 'reg_pu': 0.039543644841938855, 'reg_qi': 0.094522593860989254}




---

1.00796302363
{'reg_pu': 0.16886198480906289}

1.00722502711
{'reg_qi': 0.13702113849142605}

1.0092083515
{'reg_bu': 0.34284468705230597}

1.00944021138
{'reg_bi': 0.93748069281861202}

1.00976479276
{'n_factors': 15}

1.00594746757
{'n_epochs': 262, 'n_factors': 21, 'reg_bi': 1.4693650385908883, 'reg_bu': 0.29370679004185357, 'reg_pu': 0.16575459605631229, 'reg_qi': 0.13074898536926785}

1.00956524442
{}

1.01386334123
{'biased': True}

1.00241560742
{'biased': True, 'n_factors': 13}



## Train

In [3]:
# choose optimal params from above
# {'biased': False, 'n_epochs': 193, 'n_factors': 31, 'reg_pu': 0.066509535353532462, 'reg_qi': 0.10446537083767632}

algo = NMF(biased=False, n_epochs=193, n_factors=31, reg_pu=0.066509535353532462, reg_qi=0.10446537083767632)

# train 
algo.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.NMF at 0x105fed390>

## Predicting
We load the test data and predict a rating for each entry.

In [5]:
test_file_path = helpers.get_test_file_path()
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.construct_testset(test_data.raw_ratings)
predictions = algo.test(testset)
predictions[0]

Prediction(uid=36, iid=0, r_ui=3.0, est=3.2801484348638974, details={'was_impossible': False})

We need to convert the predictions into the right format.

In [6]:
ratings = get_ratings_from_predictions(predictions)
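
The implementation of `get_ratings_from_predictions` is not shown in this notebook. As a rough idea of what such a conversion involves, here is a hypothetical sketch that turns prediction tuples into `Id,Prediction` rows. The ID format `r<uid+1>_c<iid+1>` is only inferred from the sample prediction and the submission output shown in this notebook, the function name `to_submission_rows` is made up, and a `namedtuple` stands in for Surprise's `Prediction` so the snippet runs on its own.

```python
from collections import namedtuple

# Stand-in with the same field names as surprise's Prediction (assumption:
# only uid, iid and est are needed for the conversion).
Prediction = namedtuple("Prediction", ["uid", "iid", "r_ui", "est", "details"])


def to_submission_rows(predictions):
    """Format each prediction as 'r<row>_c<col>,<rating>' with 1-based indices."""
    rows = []
    for p in predictions:
        rows.append(f"r{p.uid + 1}_c{p.iid + 1},{p.est:.6f}")
    return rows


example = [
    Prediction(uid=36, iid=0, r_ui=3.0, est=3.2801484348638974,
               details={"was_impossible": False}),
]
print("\n".join(["Id,Prediction"] + to_submission_rows(example)))
```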

Now we can write the file.

In [7]:
output = helpers.write_submission(ratings, 'submission_surprise_NMF_1.csv')
print(output[0:100])

Id,Prediction
r37_c1,3.280148
r73_c1,2.938519
r156_c1,3.551042
r160_c1,3.216799
r248_c1,3.265300
r25
