# Collaborative Filtering
Problem setting: there are many products and many users. In collaborative filtering, we look at the products the current user has used or liked, find others who have liked similar products, and recommend products that those other users have used or liked.

## Latent Factors
The foundational idea is that of "latent factors." We don't really need to know much of anything about the users or about the products except for which users have used/liked which products. But we assume there are some underlying unifying characteristics about these users and/or products. Netflix could recommend a bunch of '70s sci-fi without possessing any concept of '70s sci-fi. But that underlying concept is still what we are getting at.

## The Data 
We will use a subset of the `MovieLens` dataset. The full dataset contains tens of millions of movie rankings (a rating, a movie ID, and a user ID). We will use a subset of 100,000 of these rankings.

In [19]:
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

In [20]:
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                     names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()

Unnamed: 0,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


Compared to the full dataset, this subset contains some of the most popular (most reviewed) movies, and some of the most prolific reviewers of movies. What we are ultimately interested in is being able to assign predicted ratings to user-movie combinations that do not have ratings.

## Learning Latent Factors
We can imagine a circumstance where each movie is rated according to a handful of factors, and each user has a certain proclivity for those factors. Maybe we have `['recency', 'high action', 'long movie']`. A movie with values `[-0.9, 0, 0.7]` would be an old, medium-action, long movie. A user's preferences can be expressed in the same way. A user who likes new, high-action, short movies could be represented with the list `[0.8, 0.9, -0.7]`. We could determine how likely the user is to like our old movie by taking the dot product. We end up with -1.21 -- a poor match (quality of match here could range from -3 to 3).

In the model of interest in this chapter, though, we want the *model* to learn the latent factors so we don't have to. That involves the following steps:

1. Randomly initialize some parameters. These parameters are a set of latent factors for each movie and user. We need to choose how many to use, but we do *not* choose what they mean.
2. Calculate predictions. We can take the dot products of the parameters for each movie and user to assign a predicted score to each movie/user combination.
3. Calculate the loss. We can use any loss function; for now we'll use MSE for simplicity. With these steps in place, we can optimize our parameters with SGD. 

My question at this phase is: how do we choose to optimize viewer vs. movie parameters? Is there any meaningful difference?

## Preparing the DataLoaders
We start by getting movie titles and corresponding IDs

In [21]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                    usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()

Unnamed: 0,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [22]:
# Merge with ratings
ratings = ratings.merge(movies, on = "movie")
ratings.head()

Unnamed: 0,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,63,242,3,875747190,Kolya (1996)
2,226,242,5,883888671,Kolya (1996)
3,154,242,3,879138235,Kolya (1996)
4,306,242,5,876503793,Kolya (1996)


In [23]:
# Build a DataLoaders
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,698,"Third Man, The (1949)",2
1,114,"Full Monty, The (1997)",4
2,291,Pink Floyd - The Wall (1982),4
3,605,Red Corner (1997),3
4,332,Leaving Las Vegas (1995),3
5,457,Stargate (1994),3
6,373,"Jungle Book, The (1994)",4
7,314,Roommates (1995),2
8,429,Candyman (1992),2
9,142,Blade Runner (1982),3


### Doing this in PyTorch
We can't use the pandas crosstab presentation directly. We need tensors.

In [24]:
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5 # Can adjust this however we want

user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

One challenge is how to look up our user-movie correspondences (i.e. to find the indices). To do this, we replace our indices with one-hot coded vectors.

In [25]:
one_hot_3 = one_hot(3, n_users).float()
one_hot_3[1:10]

tensor([0., 0., 1., 0., 0., 0., 0., 0., 0.])

In [26]:
user_factors.t() @ one_hot_3

tensor([-1.5439, -0.1012, -1.6105,  0.3358, -1.2102])

This is the same as if we used `user_factors[3]`. This vector-based approach resembles what is occurring behind the scenes. Behind the scences there is an "embedding matrix" that is multiplied to find indices.

As discussed above, in this model, we're not specifying what the factors of interest are. We're letting the model figure it out for itself by working through the connections between users and movies. We'll be able to see what sorts of movies are "grouped" at the end and to isolate genres, blockbusters, etc.

## Collaborative Filtering Model from Scratch
(begins with a little review of classes). We will use the PyTorch `module` class to define our own dot product class. I updated the original version from p261 to include the addition of `y_range` from 262, ensuring our predictions fall between 0 and 5.

In [33]:
class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range = (0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range=y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        # We defined y_range as a tuple so *self.y_range unpacks it.
        return sigmoid_range((users * movies).sum(dim=1), *self.y_range)

In [34]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3)

epoch,train_loss,valid_loss,time
0,1.000378,0.984772,00:08
1,0.87895,0.882145,00:08
2,0.697181,0.854221,00:08
3,0.471408,0.860707,00:08
4,0.352587,0.865184,00:08


So we now have a working model. What are the next steps? Some missing pieces:
- Some viewers are just more positive/negative and some movies are just good/bad. How do we account for this?

Answer: we currently just have weights. We also need a bias term. One bias term for each viewer and each movie will allow us to adjust for these considerations.

In [35]:
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users,1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies,1)
        self.y_range=y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users*movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)
    
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3)

epoch,train_loss,valid_loss,time
0,1.016422,1.002899,00:08
1,0.895334,0.885311,00:08
2,0.669896,0.850087,00:08
3,0.507418,0.857344,00:08
4,0.371815,0.862517,00:08


Our model actually got worse by the end. Oops. Next time, we'll get into *weight decay* as a method of solving this.