# Quick Overview

All toolbox algorithms operate on 2D pandas DataFrames with rows as unique _users_ and columns as unique _items_. To illustrate basic model fitting and how dense and sparse data are handled differently, we'll use a small toy dataset.

In [124]:
import numpy as np
import pandas as pd
from emotioncf import NNMF_sgd, estimate_performance, approximate_generalization

In [145]:
np.random.seed(0)
ratings_dict = {
    "Subject": [
        "A","A","A","A","A","B","B","B","B","B","C","C","C","C","C",
        "D","D","D","D","D","E","E","E","E","E","F","F","F","F","F"
    ],
    "Item": [1,2,3,4,5]*6,
    "Rating": np.random.randint(1, 101, size=30)
}
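df = pd.DataFrame(ratings_dict)
# Reshape the long-format ratings into a Subject x Item matrix.
# NOTE: create_sub_by_item_matrix is not imported in the cell above;
# it must be imported from the toolbox before running this line.
mat = create_sub_by_item_matrix(df)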
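
For readers who want to see what this reshaping step amounts to, here is a minimal sketch using a plain pandas pivot. It assumes the helper simply turns long-format data into one row per user and one column per item; the toolbox's own function may do more (for example, input validation), and the names `long_df` and `wide` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the same long-format ratings table used above.
np.random.seed(0)
long_df = pd.DataFrame(
    {
        "Subject": [s for s in "ABCDEF" for _ in range(5)],
        "Item": [1, 2, 3, 4, 5] * 6,
        "Rating": np.random.randint(1, 101, size=30),
    }
)

# Assumption: the reshaping is a long-to-wide pivot, with one row per
# user (Subject) and one column per item.
wide = long_df.pivot(index="Subject", columns="Item", values="Rating")
print(wide.shape)  # (6, 5)
```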

# Fitting a model to _dense_ data

Here we have a user x item matrix of ratings. Each of **6 users** rated **5 items** on a scale from 1 to 100. This is a _dense_ dataset because every user rated every item, i.e. no values are missing.

In [146]:
mat

| Subject | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| A | 45 | 48 | 65 | 68 | 68 |
| B | 10 | 84 | 22 | 37 | 88 |
| C | 71 | 89 | 89 | 13 | 59 |
| D | 66 | 40 | 88 | 47 | 89 |
| E | 82 | 38 | 26 | 78 | 73 |
| F | 10 | 21 | 81 | 70 | 80 |


We can test how well collaborative filtering would recover data _if they were missing_ using the `NNMF` algorithm trained via stochastic gradient descent. To do so, we initialize a model and ask it to "sparsify" our data by masking out 25% of the values and retaining 75%.

In [147]:
model = NNMF_sgd(mat, n_mask_items=.25)

With only 5 items per user, this masks one item per user, so the result approximates a dataset in which each user provided a rating for only 4 of the 5 items. We're now going to fit a model to try to recover what these missing ratings *would have been*.

In [148]:
model.masked_data

| Subject | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| A | 45.0 | 48.0 | NaN | 68 | 68.0 |
| B | 10.0 | NaN | 22.0 | 37 | 88.0 |
| C | NaN | 89.0 | 89.0 | 13 | 59.0 |
| D | 66.0 | NaN | 88.0 | 47 | 89.0 |
| E | 82.0 | 38.0 | NaN | 78 | 73.0 |
| F | 10.0 | 21.0 | 81.0 | 70 | NaN |
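
The masking step can be sketched in a few lines of NumPy and pandas. This is an illustration under stated assumptions, not the toolbox's implementation: `mask_items_per_user` is a hypothetical helper, and treating a value below 1 as a proportion that is truncated to a whole number of items is a guess that happens to match the one-item-per-user pattern above.

```python
import numpy as np
import pandas as pd


def mask_items_per_user(mat, n_mask_items, seed=None):
    """Return a float copy of `mat` with some items set to NaN in every row.

    Hypothetical helper: values below 1 are treated as a proportion of
    items, values of 1 or more as a count (an assumption for this sketch).
    """
    rng = np.random.default_rng(seed)
    n_items = mat.shape[1]
    if n_mask_items < 1:
        n_mask = int(n_items * n_mask_items)  # e.g. int(5 * 0.25) == 1
    else:
        n_mask = int(n_mask_items)
    masked = mat.astype(float)  # astype returns a copy; NaN needs floats
    for row in range(masked.shape[0]):
        # Pick distinct item positions to hide for this user.
        cols = rng.choice(n_items, size=n_mask, replace=False)
        masked.iloc[row, cols] = np.nan
    return masked


dense = pd.DataFrame(
    np.arange(1, 31).reshape(6, 5),
    index=list("ABCDEF"),
    columns=[1, 2, 3, 4, 5],
)
sparse = mask_items_per_user(dense, n_mask_items=0.25, seed=0)
print(sparse.isna().sum(axis=1).tolist())  # [1, 1, 1, 1, 1, 1]
```

Masking per user, rather than drawing cells at random from the whole matrix, guarantees that every user keeps at least some observed ratings.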


In [149]:
model.fit()
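
To give a rough sense of what fitting involves, the sketch below factorizes the masked matrix into two non-negative factor matrices using stochastic gradient descent on the observed cells only. It is a simplified illustration, not the toolbox's `NNMF_sgd`: the function name, the rescaling of ratings to 0-1, the learning rate, the number of factors, and the projection step that clips factors at zero are all assumptions made for this example.

```python
import numpy as np


def fit_nnmf_sgd(R, n_factors=3, lr=0.05, n_epochs=500, seed=0):
    """Approximate R by W @ H with W, H >= 0, using only observed cells."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    W = rng.uniform(0, 0.5, size=(n_users, n_factors))
    H = rng.uniform(0, 0.5, size=(n_factors, n_items))
    observed = np.argwhere(~np.isnan(R))  # (user, item) index pairs
    for _ in range(n_epochs):
        rng.shuffle(observed)  # visit observed cells in random order
        for u, i in observed:
            err = R[u, i] - W[u] @ H[:, i]
            w_old = W[u].copy()
            # Gradient step on the squared error, then clip at zero so
            # both factor matrices stay non-negative.
            W[u] = np.maximum(W[u] + lr * err * H[:, i], 0)
            H[:, i] = np.maximum(H[:, i] + lr * err * w_old, 0)
    return W, H


def observed_rmse(R, W, H):
    """RMSE between R and W @ H over the non-missing cells of R."""
    mask = ~np.isnan(R)
    return float(np.sqrt(np.mean((R[mask] - (W @ H)[mask]) ** 2)))


# The masked ratings shown above, rescaled to 0-1 for stable step sizes.
R = np.array([
    [45, 48, np.nan, 68, 68],
    [10, np.nan, 22, 37, 88],
    [np.nan, 89, 89, 13, 59],
    [66, np.nan, 88, 47, 89],
    [82, 38, np.nan, 78, 73],
    [10, 21, 81, 70, np.nan],
]) / 100.0

W0, H0 = fit_nnmf_sgd(R, n_epochs=0)  # untrained starting point
W, H = fit_nnmf_sgd(R)
print(observed_rmse(R, W0, H0), observed_rmse(R, W, H))

# W @ H has a value for every user/item pair, including the masked ones.
predictions = (W @ H) * 100
```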

## Examining model predictions

Let's take a look at the predicted ratings matrix. The model makes predictions for *every* user/item pair using only the _observed_ values, which is what allows us to recover the _missing_ ones.

In [150]:
model.predictions

| Subject | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| A | 44.999151 | 47.953675 | 58.691145 | 67.968257 | 67.969929 |
| B | 10.000185 | 46.30364 | 21.999898 | 36.999965 | 88.000474 |
| C | 66.349398 | 88.990034 | 88.999753 | 12.993177 | 58.99247 |
| D | 66.000434 | 50.697684 | 87.999868 | 46.99974 | 89.001361 |
| E | 81.999818 | 37.996828 | 58.978183 | 77.997957 | 72.99743 |
| F | 9.999371 | 21.006285 | 81.000367 | 70.005018 | 91.235274 |


We can calculate how well our model did using the `.score()` method. By default this calculates the root mean squared error (RMSE) of our predictions vs the _true missing_ values, i.e. the values we masked out.

RMSE is interpretable as the average amount of error on the *original scale* of the data (1-100). We can see that the model easily learns to re-create the original (`observed`) ratings nearly perfectly, but its performance is lower on the unobserved ratings (`missing`). The `full` RMSE just reflects the error combined across both `observed` and `missing` values.

In [151]:
print(f"RMSE observed: {model.score(dataset='observed')}")
print(f"RMSE missing: {model.score(dataset='missing')}")
print(f"RMSE all: {model.score(dataset='full')}")

RMSE observed: 0.01346242071583265
RMSE missing: 21.64361991977338
RMSE all: 9.679328573601552
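
As a sanity check, the `missing` and `full` scores can be reproduced by hand from the two tables shown earlier. The sketch below copies the dense ratings and the printed predictions into NumPy arrays and computes RMSE over each subset of cells; the variable names are illustrative, and because the predictions are copied at printed precision the results agree with `.score()` only up to rounding.

```python
import numpy as np

# Original dense ratings (rows A-F, items 1-5).
true_ratings = np.array([
    [45, 48, 65, 68, 68],
    [10, 84, 22, 37, 88],
    [71, 89, 89, 13, 59],
    [66, 40, 88, 47, 89],
    [82, 38, 26, 78, 73],
    [10, 21, 81, 70, 80],
], dtype=float)

# Model predictions as printed above.
predicted = np.array([
    [44.999151, 47.953675, 58.691145, 67.968257, 67.969929],
    [10.000185, 46.30364, 21.999898, 36.999965, 88.000474],
    [66.349398, 88.990034, 88.999753, 12.993177, 58.99247],
    [66.000434, 50.697684, 87.999868, 46.99974, 89.001361],
    [81.999818, 37.996828, 58.978183, 77.997957, 72.99743],
    [9.999371, 21.006285, 81.000367, 70.005018, 91.235274],
])

# True where a rating was masked out before fitting.
missing_mask = np.zeros(true_ratings.shape, dtype=bool)
missing_mask[[0, 1, 2, 3, 4, 5], [2, 1, 0, 1, 2, 4]] = True


def rmse(actual, estimate):
    """Root mean squared error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((actual - estimate) ** 2)))


rmse_missing = rmse(true_ratings[missing_mask], predicted[missing_mask])
rmse_observed = rmse(true_ratings[~missing_mask], predicted[~missing_mask])
rmse_full = rmse(true_ratings, predicted)

print(round(rmse_missing, 2))  # 21.64
print(round(rmse_full, 2))     # 9.68
print(rmse_observed)           # close to zero
```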


## Evaluating Model Fit

RMSE isn't the only performance metric we currently support. We can also calculate:

- mean-absolute-error (MAE)
- mean-squared-error (MSE)
- Pearson correlation

Each of these metrics can also be calculated by _grouping_ the data in two different ways:

1. _user_ model fit: this is the performance calculated _separately_ per user and then averaged. This approach is more common for calculating metrics in psychological and social science research. 
2. _overall_ model fit: this is the performance ignoring the fact that scores from the same user might be more similar. This is more commonly used in machine-learning or industry settings.

Rather than writing custom code to repeatedly call `.score()` for each metric and group, we provide a convenient `.summary()` method which will calculate all of these!


In [152]:
model.summary()

| | algorithm | dataset | group | metric | score |
|---|---|---|---|---|---|
| 0 | NNMF_sgd | full | all | correlation | 0.928144 |
| 1 | NNMF_sgd | full | all | mae | 3.457459 |
| 2 | NNMF_sgd | full | all | mse | 93.689402 |
| 3 | NNMF_sgd | full | all | rmse | 9.679329 |
| 4 | NNMF_sgd | full | user | correlation | 0.942099 |
| 5 | NNMF_sgd | full | user | mae | 3.457459 |
| 6 | NNMF_sgd | full | user | mse | 93.689402 |
| 7 | NNMF_sgd | full | user | rmse | 7.719451 |
| 8 | NNMF_sgd | missing | all | correlation | 0.30851 |
| 9 | NNMF_sgd | missing | all | mae | 17.26116 |
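
The difference between the `all` and `user` groupings is easiest to see by computing both by hand. The sketch below does this for RMSE on the `full` dataset, using the ratings and predictions printed earlier in this notebook. It assumes that the `user` score is a plain average of per-user scores, which is consistent with the values in the table above; the variable names are illustrative.

```python
import numpy as np

# Original dense ratings and model predictions, copied from the tables above.
true_ratings = np.array([
    [45, 48, 65, 68, 68],
    [10, 84, 22, 37, 88],
    [71, 89, 89, 13, 59],
    [66, 40, 88, 47, 89],
    [82, 38, 26, 78, 73],
    [10, 21, 81, 70, 80],
], dtype=float)
predicted = np.array([
    [44.999151, 47.953675, 58.691145, 67.968257, 67.969929],
    [10.000185, 46.30364, 21.999898, 36.999965, 88.000474],
    [66.349398, 88.990034, 88.999753, 12.993177, 58.99247],
    [66.000434, 50.697684, 87.999868, 46.99974, 89.001361],
    [81.999818, 37.996828, 58.978183, 77.997957, 72.99743],
    [9.999371, 21.006285, 81.000367, 70.005018, 91.235274],
])

squared_error = (true_ratings - predicted) ** 2

# "all": pool every cell, ignoring which user it came from.
overall_rmse = float(np.sqrt(squared_error.mean()))

# "user": one RMSE per row (user), then average those values.
per_user_rmse = np.sqrt(squared_error.mean(axis=1))
user_rmse = float(per_user_rmse.mean())

print(round(overall_rmse, 2))  # 9.68
print(round(user_rmse, 2))     # 7.72
```

The two groupings give identical MAE and MSE here because every user has the same number of ratings; RMSE differs because the square root is taken per user before averaging.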


## Benchmarking Model Performance

For benchmarking a model's performance given dense data, it's helpful to repeat the process above (masking, fitting, scoring) multiple times with different random masks. We can then compute the _average_ performance across these multiple runs to ensure that a model's performance isn't a fluke due to a particular combination of masked and unmasked data. 

For convenience we offer an `estimate_performance` function that does just that! Let's run it for 10 iterations, passing `n_mask_items=0.75` to each model.

In [153]:
all_results = estimate_performance(
    NNMF_sgd, mat, n_iter=10, model_kwargs={"n_mask_items": 0.75}
)

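Averaged over different random masks, performance on `missing` data is substantially worse than what we observed above, and there's quite a bit of variance across runs, suggesting the earlier score was at least partly a "lucky" overestimate. Much of this reflects how small our toy dataset is. Keep in mind, too, that this call used `n_mask_items=0.75` while the single model above used `.25`, so the two sets of scores are not directly comparable.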

In [154]:
all_results

| | algorithm | dataset | group | metric | mean | std |
|---|---|---|---|---|---|---|
| 0 | NNMF_sgd | full | all | correlation | 0.273217 | 0.084262 |
| 1 | NNMF_sgd | full | all | mae | 20.185084 | 1.769544 |
| 2 | NNMF_sgd | full | all | mse | 791.135264 | 164.59701 |
| 3 | NNMF_sgd | full | all | rmse | 28.005223 | 2.757363 |
| 4 | NNMF_sgd | full | user | correlation | 0.246005 | 0.163178 |
| 5 | NNMF_sgd | full | user | mae | 20.185084 | 1.769544 |
| 6 | NNMF_sgd | full | user | mse | 791.135264 | 164.59701 |
| 7 | NNMF_sgd | full | user | rmse | 26.824523 | 2.808243 |
| 8 | NNMF_sgd | missing | all | correlation | -0.057269 | 0.249291 |
| 9 | NNMF_sgd | missing | all | mae | 25.227155 | 2.208383 |
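
The mask, fit, score loop behind this kind of benchmark can be sketched as follows. To keep the example self-contained it swaps in a trivial item-mean baseline for the model, so the numbers it produces are not comparable to the NNMF results above. `repeat_masked_scoring` and `item_mean_predict` are hypothetical names, and reporting the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np


def item_mean_predict(masked):
    """Baseline stand-in model: predict each item's observed mean rating."""
    counts = (~np.isnan(masked)).sum(axis=0)
    sums = np.nansum(masked, axis=0)
    global_mean = np.nanmean(masked)
    # Fall back to the global mean if an item has no observed ratings.
    col_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return np.tile(col_means, (masked.shape[0], 1))


def repeat_masked_scoring(dense, n_mask, n_iter=10, seed=0):
    """Mask `n_mask` items per user, predict, and score RMSE, `n_iter` times."""
    rng = np.random.default_rng(seed)
    n_users, n_items = dense.shape
    scores = []
    for _ in range(n_iter):
        hidden = np.zeros(dense.shape, dtype=bool)
        for u in range(n_users):
            hidden[u, rng.choice(n_items, size=n_mask, replace=False)] = True
        masked = dense.astype(float)  # astype returns a copy
        masked[hidden] = np.nan
        pred = item_mean_predict(masked)
        # Score only the cells that were hidden from the model.
        scores.append(np.sqrt(np.mean((dense[hidden] - pred[hidden]) ** 2)))
    return {"mean": float(np.mean(scores)), "std": float(np.std(scores, ddof=1))}


dense = np.array([
    [45, 48, 65, 68, 68],
    [10, 84, 22, 37, 88],
    [71, 89, 89, 13, 59],
    [66, 40, 88, 47, 89],
    [82, 38, 26, 78, 73],
    [10, 21, 81, 70, 80],
], dtype=float)

result = repeat_masked_scoring(dense, n_mask=1, n_iter=10)
print(result)
```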


# Fitting a model to _sparse_ data

While manually "sparsifying" a dense dataset is useful for benchmarking, in many scenarios data will _already_ be sparse and we won't have ground truth values to compare against. Models support this use case as well. This time, when we initialize a model, we don't pass in anything for `n_mask_items`. If a model is given a sparse dataset and additional masking is also requested, it raises an error.

In [155]:
# Let's use our masked data from before as if it were a real dataset
real_sparse_mat = model.masked_data

new_model = NNMF_sgd(real_sparse_mat)

data contains NaNs...treating as pre-masked
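
The rule described above, accepting pre-masked data but refusing to mask it further, could look roughly like the check below. This is a hypothetical sketch of the logic only; `check_masking_request` is not a toolbox function, and the actual error type and message may differ.

```python
import numpy as np
import pandas as pd


def check_masking_request(mat, n_mask_items=None):
    """Return True if `mat` already has missing values.

    Raises ValueError if the data are already sparse and extra masking
    is requested (hypothetical rule mirroring the behaviour described).
    """
    already_sparse = bool(mat.isna().any().any())
    if already_sparse and n_mask_items is not None:
        raise ValueError(
            "data already contains NaNs; additional masking is not allowed"
        )
    return already_sparse


dense_mat = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])
sparse_mat = pd.DataFrame([[1.0, np.nan], [3.0, 4.0]])

print(check_masking_request(dense_mat, n_mask_items=0.25))  # False
print(check_masking_request(sparse_mat))                    # True
```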


Now we can proceed to fitting just like before, but there's a catch: this time we won't be able to obtain a score for the _missing_ data because we never observed it in the first place!

In [156]:
new_model.fit()
print(f"RMSE observed: {new_model.score(dataset='observed')}")
print(f"RMSE all: {new_model.score(dataset='full')}")

RMSE observed: 0.01167553207781378
RMSE all: 0.01167553207781378


In [157]:
print(f"RMSE missing: {new_model.score(dataset='missing')}")

RMSE missing: None




## Benchmarking Model Performance

One popular approach to handle this scenario is to use _cross-validation_, whereby a model is estimated on a subset of the data (_train_ set) and evaluated on an independent subset (_test_ set). In a collaborative filtering context, standard methods to generate cross-validation folds, such as those provided by `sklearn`, will not work. This is because we don't want to train on a subset of _users_ or a subset of _items_ (observations and features, respectively, in a typical supervised-learning situation). Instead, we want to predict new _combinations_ of users and items that we did not observe.

To do this, we can add _additional sparsity_ in a way that keeps track of which user-item values we mask out in the training set, ensuring that these values are "un-masked" in the testing set. By doing this, we can evaluate the performance of a model despite having sparse (missing) data to begin with!

Using this approach we actually have 2 different kinds of sparsity: 

1. A value that was *never* observed and therefore doesn't exist at all in either train or test splits. By definition, there's no way for us to incorporate this into model evaluation. 
2. A value we _manually mask out_ when fitting the model (train set) and _unmask_ when we want to evaluate the model (test set). Performance on these observations serves as our estimate of the model's _generalization_ performance.

To make this kind of estimation simple we offer another convenience function: `approximate_generalization`.

**Note**: keep in mind that using this approach will _increase_ the sparsity of an already sparse dataset!

You can control the extent to which this happens using the `n_folds` parameter. More folds means that the model is trained on _more_ data in each fold, thereby decreasing the additional sparsity.

Here we run this using 10 folds which means that ~90% of the _observed_ values will be used for training the model and ~10% of the _observed_ values will be used for testing performance.
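
The fold construction described above can be sketched with NumPy: collect the observed user-item cells, shuffle them, split them into folds, and for each fold hide its cells from the training matrix. This is an illustration under stated assumptions rather than the toolbox's code; `observed_cell_folds` is a hypothetical name, and the real function may split the cells differently.

```python
import numpy as np


def observed_cell_folds(R, n_folds=10, seed=0):
    """Yield (train_matrix, test_cells) pairs over the observed cells of R."""
    rng = np.random.default_rng(seed)
    cells = np.argwhere(~np.isnan(R))  # (user, item) pairs that were observed
    rng.shuffle(cells)                 # shuffles rows, keeping pairs intact
    for test_cells in np.array_split(cells, n_folds):
        train = R.copy()
        # Hide this fold's cells from training; they are kept for testing.
        train[test_cells[:, 0], test_cells[:, 1]] = np.nan
        yield train, test_cells


# The sparse matrix used in this section (NaN marks never-observed values).
R = np.array([
    [45, 48, np.nan, 68, 68],
    [10, np.nan, 22, 37, 88],
    [np.nan, 89, 89, 13, 59],
    [66, np.nan, 88, 47, 89],
    [82, 38, np.nan, 78, 73],
    [10, 21, 81, 70, np.nan],
])

folds = list(observed_cell_folds(R, n_folds=10))
for train, test_cells in folds:
    # Number of held-out cells, and observed cells left for training.
    print(len(test_cells), int((~np.isnan(train)).sum()))
```

With 24 observed values and 10 folds, each fold holds out 2 or 3 cells, so every training matrix is slightly sparser than the data we started with. Values that were never observed stay missing in every fold.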

In [158]:
cv_results = approximate_generalization(NNMF_sgd, real_sparse_mat, n_folds=10)

In this particular example, the performance of our model on unseen data drops even further because of how small our dataset was to begin with.

In [159]:
cv_results

| | algorithm | dataset | group | metric | mean | std |
|---|---|---|---|---|---|---|
| 0 | NNMF_sgd | test | all | correlation | -0.404187 | 0.8748287 |
| 1 | NNMF_sgd | test | all | mae | 36.741604 | 12.126 |
| 2 | NNMF_sgd | test | all | mse | 2054.950232 | 1159.434 |
| 3 | NNMF_sgd | test | all | rmse | 43.228941 | 14.38398 |
| 4 | NNMF_sgd | test | user | correlation | 0.0 | 1.414214 |
| 5 | NNMF_sgd | test | user | mae | 36.629386 | 12.06473 |
| 6 | NNMF_sgd | test | user | mse | 2042.186382 | 1160.706 |
| 7 | NNMF_sgd | test | user | rmse | 37.607371 | 13.46175 |
| 8 | NNMF_sgd | train | all | correlation | 1.0 | 4.291507e-08 |
| 9 | NNMF_sgd | train | all | mae | 0.004285 | 0.001732634 |


# Concluding Thoughts

This notebook has provided a full example of how to fit a model in several different ways. Here are a few parting tips to help you plan your own analyses:

If you already have a _sparse_ dataset, `approximate_generalization` can help you assess how well collaborative filtering works for your use case (at the cost of increasing sparsity during evaluation). This approach is the de facto standard in several other collaborative filtering toolboxes such as [Surprise](http://surpriselib.com/).

If you're working with a small but _dense_ dataset, using `estimate_performance` may be preferable. That's because you can leverage ground truth observations and more carefully control how much sparsity to use during model evaluation. This is useful when making new data collection or experimental design choices, as it simulates how well a model would perform had your dataset been _sparse_.
