In [1]:
import fastbook
fastbook.setup_book()

In [2]:
from fastbook import *
from fastai.collab import *
from fastai.tabular.all import *

In [3]:
# Downloading the data the usual way
path = untar_data(URLs.ML_100k)

# Extracting the Ratings
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

# Extracting the Movie Titles
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                    usecols=(0,1), names=('movie', 'title'), header=None)

# Merging the two dataframes
ratings = ratings.merge(movies)

# Creating our DataLoaders
dls = CollabDataLoaders.from_df(ratings,
                                 user_name='user',
                                 item_name='title',
                                 rating_name = 'rating',
                                 bs=64)

# Initialising our Latent Factors
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

# Collaborative Filtering Deep Dive

## Collaborative Filtering from Scratch

Creating a new PyTorch module requires inheriting from `Module`, which provides the basic foundations we want to build on.
When the module is called, PyTorch calls a method in your class named `forward` and passes along any arguments included in the call.
Below is a class defining our dot product model.

In [4]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
    
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return (users * movies).sum(dim=1)

Note that the input of the model is a tensor of shape `batch_size x 2`, where the first column (`x[:,0]`) contains the user IDs and the second column (`x[:,1]`) contains the movie IDs.
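To make the call-to-`forward` dispatch and the `[user_id, movie_id]` batch layout concrete, here is a minimal sketch in plain Python. `ToyModule` and `ToyDotProduct` are hypothetical stand-ins, not PyTorch or fastai classes: plain lists replace tensors, and the real `Module` does considerably more (parameter registration, hooks) than this toy shows.

```python
# Toy stand-ins (hypothetical names); plain Python lists replace tensors.
class ToyModule:
    """Mimics one behaviour of a PyTorch module: calling the object runs `forward`."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

class ToyDotProduct(ToyModule):
    def __init__(self, user_factors, movie_factors):
        self.user_factors = user_factors    # one row of factors per user ID
        self.movie_factors = movie_factors  # one row of factors per movie ID

    def forward(self, x):
        preds = []
        for user_id, movie_id in x:         # each batch row is [user_id, movie_id]
            u = self.user_factors[user_id]
            m = self.movie_factors[movie_id]
            # Elementwise product summed over the factor dimension
            preds.append(sum(a * b for a, b in zip(u, m)))
        return preds

user_factors = [[1.0, 0.0], [0.5, 2.0]]                # 2 users, 2 factors each
movie_factors = [[2.0, 1.0], [0.0, 3.0], [1.0, 1.0]]   # 3 movies, 2 factors each
model = ToyDotProduct(user_factors, movie_factors)

batch = [[0, 0], [1, 1], [1, 2]]   # shape: batch_size x 2
preds = model(batch)               # we call the object, not model.forward(batch)
print(preds)
```

Each prediction is the dot product of one user's factor row with one movie's factor row, which is what `(users * movies).sum(dim=1)` computes for the whole batch at once.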

As explained before, we use the *embedding layers* to represent our matrices of user and movie latent factors.
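As a reminder of why an embedding layer can stand in for a factor matrix, the sketch below shows that looking up a row by index gives the same result as multiplying a one-hot vector by the matrix. It uses plain Python lists rather than tensors, and the helper names `one_hot` and `vec_mat` are made up for this illustration.

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix (both plain Python lists)."""
    n_cols = len(mat[0])
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(n_cols)]

# A tiny factor matrix: 3 users, 2 latent factors each
factors = [[0.1, -0.4],
           [0.7,  0.2],
           [-0.3, 0.9]]
user_id = 2

by_index = factors[user_id]                                    # direct lookup
by_matmul = vec_mat(one_hot(user_id, len(factors)), factors)   # one-hot times matrix

print(by_index, by_matmul)
```

The lookup avoids building the one-hot vector and doing the multiplication, which is the practical reason to index directly.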

In [5]:
x,y = dls.one_batch()
x.shape

torch.Size([64, 2])

Now that we have defined our architecture and created our parameter matrices, we need to create a `Learner` to optimise our model. Since we are doing things from scratch, we will use the plain `Learner` class.

In [6]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.371919,1.31346,00:03
1,1.095397,1.090162,00:03
2,0.982468,1.003651,00:03
3,0.855089,0.907152,00:03
4,0.76564,0.889513,00:03


### Improving our Model

To make this model a little bit better, we should force our predictions to be between 0 and 5. To do this, we just need to use `sigmoid_range` like we did previously. Empirically, it's better to have the range go a little bit beyond 5, so we use `(0, 5.5)`.
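One way to see why the upper bound is set slightly above 5 is the sketch below. It assumes the squashing is a sigmoid rescaled to `(low, high)`; the function names are hypothetical and the code works on single floats rather than tensors.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function for a single float."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_range_sketch(x, low, high):
    """Squash x into the open interval (low, high)."""
    return sigmoid(x) * (high - low) + low

def activation_needed(target, low, high):
    """Raw activation the model must output to predict `target` (inverse of the above)."""
    p = (target - low) / (high - low)
    return math.log(p / (1.0 - p))

mid = sigmoid_range_sketch(0.0, 0, 5.5)    # an activation of 0 maps to the midpoint
needed = activation_needed(5.0, 0, 5.5)    # finite: about ln(10)
print(mid, needed)
```

With a range of `(0, 5)` the sigmoid only approaches 5 as the activation grows without bound, so a rating of exactly 5 is unreachable. With `(0, 5.5)` an activation of roughly 2.3 is enough.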

In [7]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
    
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

In [9]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,1.0135,1.001645,00:03
1,0.872036,0.915327,00:03
2,0.702334,0.876489,00:03
3,0.476385,0.877797,00:03
4,0.376548,0.882103,00:03


Although this is a reasonable start, we can do better. One obvious missing piece is that some users are just more positive or negative in their ratings than others, and some movies are just plain better or worse than others. In our dot product representation, we have no way to encode either of these things. That's because at this point, we only have weights; we do not have any biases.

If we have a single number for each user that we can add to our scores, and ditto for each movie, that will handle this missing piece very nicely.
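A small worked example, using made-up numbers and a hypothetical `predict` helper, shows what the biases add. Two users with identical latent factors get identical dot products, so only a per-user bias can separate a generous rater from a harsh one. The scores here are the raw values before any squashing into the rating range.

```python
def predict(user_vec, movie_vec, user_bias=0.0, movie_bias=0.0):
    """Raw score: dot product of the factors plus one bias per user and per movie."""
    dot = sum(u * m for u, m in zip(user_vec, movie_vec))
    return dot + user_bias + movie_bias

same_taste = [0.5, -1.0]   # both users share these latent factors
movie = [1.0, 0.5]

no_bias = predict(same_taste, movie)                    # dot product only
generous = predict(same_taste, movie, user_bias=0.75)   # user who rates everything higher
harsh = predict(same_taste, movie, user_bias=-0.75)     # user who rates everything lower
good_movie = predict(same_taste, movie, user_bias=0.75, movie_bias=0.5)

print(no_bias, generous, harsh, good_movie)
```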

In [10]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

In [11]:
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.917701,0.944697,00:03
1,0.802276,0.870173,00:04
2,0.640088,0.868531,00:03
3,0.426239,0.888938,00:03
4,0.290958,0.895513,00:03


Instead of improving, our model ends up worse by the end of training. We can see this because the validation loss stopped improving in the middle of training and then started to get worse, while the training loss kept falling. This is a clear indication of overfitting.
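The same diagnosis can be read off the numbers programmatically. The sketch below copies the losses from the `DotProductBias` run above into plain lists and finds the epoch where the validation loss bottoms out; the variable names are arbitrary.

```python
# Losses copied from the DotProductBias training run (epochs 0 to 4)
train_loss = [0.917701, 0.802276, 0.640088, 0.426239, 0.290958]
valid_loss = [0.944697, 0.870173, 0.868531, 0.888938, 0.895513]

# Epoch with the lowest validation loss
best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__)

# Overfitting signature: training loss keeps dropping while validation loss rises
train_still_falling = all(a > b for a, b in zip(train_loss, train_loss[1:]))
valid_rose_after_best = all(v > valid_loss[best_epoch]
                            for v in valid_loss[best_epoch + 1:])

print(best_epoch, train_still_falling, valid_rose_after_best)
```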

In this problem, there is no way to use data augmentation, so we will have to use another regularisation technique.