In [None]:
from fastai.gen_doc.gen_notebooks import update_module_metadata
import fastai.collab
# For updating jekyll metadata. You MUST reload notebook immediately after executing this cell for changes to save
# Leave blank to autopopulate from mod.__doc__
update_module_metadata(fastai.collab, title='collab', summary='Application to collaborative filtering')

# Collaborative filtering

In [None]:
from fastai.gen_doc.nbdoc import *
from fastai.collab import * 
from fastai.docs import *

This package contains all the necessary functions to quickly train a model for a collaborative filtering task.

## Overview

Collaborative filtering is the task of predicting how much a user is going to like a certain item. The fastai library contains a `CollabFilteringDataset` class that will help you create datasets suitable for training, and a function `get_collab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before delving into the documentation.

For our example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. The task is to predict the rating a user gave a given movie (from 0 to 5). The data comes as a CSV file where each line is the rating of a movie by a given person.

In [None]:
ratings = get_movie_lens()
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,73,1097,4.0,1255504951
1,561,924,3.5,1172695223
2,157,260,3.5,1291598691
3,358,1210,5.0,957481884
4,130,316,2.0,1138999234


We'll first turn the `userId` and `movieId` columns into categorical columns, so that we can replace their values with category codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our CSV contained names of users or names of items.

In [None]:
series2cat(ratings, 'userId','movieId')
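
To make the idea of category codes concrete, here is a minimal plain-Python sketch. It is not the implementation of `series2cat`; the helper `to_category_codes` is hypothetical and only illustrates how arbitrary ids can be mapped to contiguous integers that an `Embedding` layer can index.

```python
def to_category_codes(values):
    """Map each distinct value to an integer code (hypothetical helper).

    Categories are sorted, so the smallest id gets code 0.
    """
    categories = sorted(set(values))
    code_of = {cat: code for code, cat in enumerate(categories)}
    return [code_of[v] for v in values], categories

# The user ids from the table above, with one repeated id.
user_ids = [73, 561, 157, 358, 130, 73]
codes, categories = to_category_codes(user_ids)
print(codes)       # [0, 4, 2, 3, 1, 0]
print(categories)  # [73, 130, 157, 358, 561]
```

The codes run from 0 to the number of distinct ids minus one, which is why the number of users and items can later be inferred from the data.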

Now that this step is done, we can directly create a `Learner` object:

In [None]:
learn = get_collab_learner(ratings, n_factors=50, pct_val=0.2, min_score=0., max_score=5.)

And then immediately begin training:

In [None]:
learn.fit_one_cycle(5, 5e-3, wd=0.1)

Total time: 00:02
epoch  train loss  valid loss
0      2.347037    1.848229    (00:00)
1      1.074374    0.703097    (00:00)
2      0.727364    0.668684    (00:00)
3      0.628840    0.661857    (00:00)
4      0.572673    0.659882    (00:00)



In [None]:
show_doc(CollabFilteringDataset, doc_string=False)

## <a id=CollabFilteringDataset></a>`class` `CollabFilteringDataset`
> `CollabFilteringDataset`(`user`:`Series`, `item`:`Series`, `ratings`:`ndarray`) :: `DatasetBase`
<a href="https://github.com/fastai/fastai/blob/master/fastai/collab.py#L10">[source]</a>

This is the basic class to build a `Dataset` suitable for collaborative filtering. `user` and `item` should be categorical series, which will be replaced with their codes internally, and `ratings` should hold the corresponding ratings. One of the factory methods will prepare the data in this format.

In [None]:
show_doc(CollabFilteringDataset.from_df, doc_string=False)

#### <a id=from_df></a>`from_df`
> `from_df`(`rating_df`:`DataFrame`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\[`str`\]=`None`, `item_name`:`Optional`\[`str`\]=`None`, `rating_name`:`Optional`\[`str`\]=`None`) -> `Tuple`\[`ColabFilteringDataset`, `ColabFilteringDataset`\]
<a href="https://github.com/fastai/fastai/blob/master/fastai/collab.py#L33">[source]</a>

Takes a `rating_df` and splits it randomly into training and validation sets following `pct_val` (unless it's None). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (they default to the first, second and third columns).
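
The random split can be pictured with the following plain-Python sketch. It is an illustration under stated assumptions, not the code of `from_df`: the function `random_split` and its `seed` argument are hypothetical, and returning an empty validation list when `pct_val` is None is an assumption made here for the example.

```python
import random

def random_split(rows, pct_val=0.2, seed=None):
    """Shuffle row indices and hold out a fraction `pct_val` (hypothetical helper)."""
    if pct_val is None:
        # Assumption for this sketch: no split, everything is training data.
        return list(rows), []
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(pct_val * len(rows))
    valid = [rows[i] for i in idx[:cut]]
    train = [rows[i] for i in idx[cut:]]
    return train, valid

rows = list(range(10))
train, valid = random_split(rows, pct_val=0.2, seed=0)
print(len(train), len(valid))  # 8 2
```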

In [None]:
show_doc(CollabFilteringDataset.from_csv, doc_string=False)

#### <a id=from_csv></a>`from_csv`
> `from_csv`(`csv_name`:`str`, `kwargs`) -> `Tuple`\[`ColabFilteringDataset`, `ColabFilteringDataset`\]
<a href="https://github.com/fastai/fastai/blob/master/fastai/collab.py#L49">[source]</a>

Opens the file in `csv_name` as a `DataFrame` and feeds it to `CollabFilteringDataset.from_df` with the `kwargs`.

## Model and `Learner`

In [None]:
show_doc(EmbeddingDotBias, doc_string=False, title_level=3)

### <a id=EmbeddingDotBias></a>`class` `EmbeddingDotBias`
> `EmbeddingDotBias`(`n_factors`:`int`, `n_users`:`int`, `n_items`:`int`, `min_score`:`float`=`None`, `max_score`:`float`=`None`) :: `Module`
<a href="https://github.com/fastai/fastai/blob/master/fastai/collab.py#L55">[source]</a>

Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings and adds the bias, then feeds the result to a sigmoid rescaled to go from `min_score` to `max_score`.
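
For a single (user, item) pair, the computation described above can be sketched in plain Python as follows. This is only an illustration of the arithmetic, not the model's `forward` method: the function `dot_bias_score` is hypothetical, and returning the raw score when no score range is given is an assumption of this sketch.

```python
import math

def dot_bias_score(user_vec, item_vec, user_bias, item_bias,
                   min_score=None, max_score=None):
    """Dot product of the latent factors plus both biases (hypothetical helper)."""
    raw = sum(u * i for u, i in zip(user_vec, item_vec)) + user_bias + item_bias
    if min_score is None or max_score is None:
        # Assumption for this sketch: without a range, return the raw score.
        return raw
    # Sigmoid rescaled so that the output lies between min_score and max_score.
    return min_score + (max_score - min_score) / (1 + math.exp(-raw))

# A raw score of 0 lands in the middle of the range.
print(dot_bias_score([0.0, 0.0], [1.0, 1.0], 0.0, 0.0, min_score=0., max_score=5.))  # 2.5
```

Rescaling the sigmoid keeps predictions inside the valid rating range, which is the role `min_score=0.` and `max_score=5.` play in the MovieLens example.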

In [None]:
show_doc(get_collab_learner, doc_string=False)

#### <a id=get_collab_learner></a>`get_collab_learner`
> `get_collab_learner`(`ratings`:`DataFrame`, `n_factors`:`int`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\[`str`\]=`None`, `item_name`:`Optional`\[`str`\]=`None`, `rating_name`:`Optional`\[`str`\]=`None`, `test`:`DataFrame`=`None`, `min_score`:`float`=`None`, `max_score`:`float`=`None`, `loss_fn`:`LossFunction`=`'mse_loss'`, `kwargs`) -> `Learner`
<a href="https://github.com/fastai/fastai/blob/master/fastai/collab.py#L70">[source]</a>

Creates a `Learner` object built from the data in `ratings`. `pct_val`, `user_name`, `item_name` and `rating_name` are passed to `CollabFilteringDataset`. Optionally, creates another `CollabFilteringDataset` for `test`. `kwargs` are fed to `DataBunch.create` with these datasets. The model is given by `EmbeddingDotBias` with `n_factors`, `min_score` and `max_score` (the numbers of users and items will be inferred from the data).

## Undocumented Methods - Methods moved below this line will intentionally be hidden

In [None]:
show_doc(EmbeddingDotBias.forward)