# Alternating Least Squares 

This notebook fits a baseline rating predictor with the Surprise package.

We use the `baseline_only.BaselineOnly` algorithm, documented at http://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

---
## Theory

The rating of user $u$ for item $i$ is estimated as:

$\hat{r}_{ui} = \mu + b_u + b_i$

where $\mu$ is the overall average rating, and $b_u$ and $b_i$ are biases that capture the tendency of user $u$ to rate higher or lower than average and the tendency of item $i$ to be rated higher or lower than average.

The biases are found by minimizing the following regularized squared loss over the training ratings:

$\sum_{r_{ui} \in R_{train}}(r_{ui}-(\mu + b_u + b_i))^2 + \lambda (b_u^2 + b_i^2)$
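As a rough illustration, the loss can be evaluated with NumPy on a toy dataset. This is a minimal sketch, not Surprise's code: the function name, the toy ratings, and the value of `lam` are made up for illustration, and it assumes each bias is penalized once (one common reading of the formula).

```python
import numpy as np

def regularized_loss(users, items, ratings, mu, b_u, b_i, lam):
    """Squared error over observed ratings plus an L2 penalty on the biases.

    Assumption: every bias is penalized once, not once per rating.
    """
    residuals = ratings - (mu + b_u[users] + b_i[items])
    penalty = lam * (np.sum(b_u ** 2) + np.sum(b_i ** 2))
    return float(np.sum(residuals ** 2) + penalty)

# Toy data (hypothetical): 3 users, 2 items, 4 observed ratings.
users = np.array([0, 0, 1, 2])
items = np.array([0, 1, 0, 1])
ratings = np.array([4.0, 3.0, 5.0, 2.0])
mu = ratings.mean()

# With all biases at zero, the loss is just the squared deviation from the mean.
zero_bias_loss = regularized_loss(
    users, items, ratings, mu, np.zeros(3), np.zeros(2), lam=0.02
)
print(zero_bias_loss)
```

Any choice of biases that lowers this value fits the training ratings better, up to the penalty that keeps the biases small.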

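The loss is minimized with Alternating Least Squares (ALS), which alternately updates $b_u$ and $b_i$: one set of biases is held fixed while the other is solved for. With one set fixed, each remaining bias has a simple closed-form least-squares solution.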
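The alternating updates can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than Surprise's implementation: the function name `als_baseline`, the toy data, and the values of `lam` and `n_epochs` are hypothetical, and each bias is assumed to be penalized once, which gives the closed-form update $b_u = \sum_i (r_{ui} - \mu - b_i) / (\lambda + n_u)$, with $n_u$ the number of ratings by user $u$ (and symmetrically for $b_i$).

```python
import numpy as np

def als_baseline(users, items, ratings, n_users, n_items, lam=1.0, n_epochs=10):
    """Estimate mu, b_u and b_i by alternating closed-form updates.

    Assumes lam > 0 so that users or items without ratings get a bias of 0.
    """
    mu = ratings.mean()
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    n_per_user = np.bincount(users, minlength=n_users)
    n_per_item = np.bincount(items, minlength=n_items)

    for _ in range(n_epochs):
        # Hold the user biases fixed and solve for every item bias.
        dev_i = np.bincount(items, weights=ratings - mu - b_u[users], minlength=n_items)
        b_i = dev_i / (lam + n_per_item)
        # Hold the item biases fixed and solve for every user bias.
        dev_u = np.bincount(users, weights=ratings - mu - b_i[items], minlength=n_users)
        b_u = dev_u / (lam + n_per_user)
    return mu, b_u, b_i

# Toy data (hypothetical): 3 users, 2 items, 4 observed ratings.
users = np.array([0, 0, 1, 2])
items = np.array([0, 1, 0, 1])
ratings = np.array([4.0, 3.0, 5.0, 2.0])

mu, b_u, b_i = als_baseline(users, items, ratings, n_users=3, n_items=2)
estimates = mu + b_u[users] + b_i[items]
print(mu, b_u, b_i, estimates)
```

Surprise's ALS baseline takes separate regularization strengths for users and items (`reg_u` and `reg_i` in `bsl_options`); a single `lam` is used here to match the formula above.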

---

In [3]:
import data_handler
from surprise_extensions import CustomReader, get_ratings_from_predictions
from surprise import Reader, Dataset

## Data loading
We load the data using our custom reader.
See: http://surprise.readthedocs.io/en/stable/getting_started.html#use-a-custom-dataset

In [3]:
reader = CustomReader()
filepath = data_handler.get_train_file_path()
data = Dataset.load_from_file(filepath, reader=reader)

## Training and validation
We run cross-validation with the built-in function `cross_validate`, using its default of 5 folds.

By default it reports two error metrics on each held-out fold: RMSE and MAE.
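For reference, the two metrics can be computed by hand as follows. This is a minimal NumPy sketch of the standard definitions, not Surprise's own code; the helper names and the toy ratings are made up for illustration.

```python
import numpy as np

def rmse(true_ratings, estimates):
    """Root mean squared error: penalizes large errors more heavily."""
    errors = np.asarray(true_ratings, dtype=float) - np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(true_ratings, estimates):
    """Mean absolute error: the average size of the error."""
    errors = np.asarray(true_ratings, dtype=float) - np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(errors)))

# Toy example (hypothetical values): errors are 0, 1, 1 and -2.
true_ratings = [3.0, 4.0, 5.0, 1.0]
estimates = [3.0, 3.0, 4.0, 3.0]

toy_rmse = rmse(true_ratings, estimates)
toy_mae = mae(true_ratings, estimates)
print(toy_rmse, toy_mae)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same predictions.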

In [4]:
from surprise import BaselineOnly
from surprise.model_selection import cross_validate

algo = BaselineOnly()
# do a cross validation with 5 folds
results = cross_validate(algo, data, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0009  0.9986  1.0000  0.9984  0.9987  0.9993  0.0010  
MAE (testset)     0.8060  0.8044  0.8062  0.8042  0.8043  0.8050  0.0009  
Fit time          3.07    3.56    3.35    3.58    3.39    3.39    0.18    
Test time         3.62    3.27    3.56    3.35    3.52    3.46    0.13    


## Predicting
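We load the test data and predict a rating for every (user, item) pair in it. Note that `cross_validate` refits `algo` on each fold, so at this point the model has been trained on the last fold's training split only; calling `algo.fit(data.build_full_trainset())` first would train it on all available ratings.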

In [5]:
test_file_path = data_handler.get_test_file_path()
test_data = Dataset.load_from_file(test_file_path, reader=reader)
testset = test_data.construct_testset(test_data.raw_ratings)
predictions = algo.test(testset)
predictions[0]

Prediction(uid=36, iid=0, r_ui=3.0, est=3.2514233714031837, details={'was_impossible': False})

We need to convert the predictions into the format expected by the submission file.

In [8]:
ratings = get_ratings_from_predictions(predictions)

Now we can write the file.

In [9]:
output = data_handler.write_submission(ratings, 'submission_surprise_baseline.csv')
print(output[0:10])

'Id,Predict'