# Intro
7: Collaborative Filtering

# Lesson Overview

This lesson covers more notebooks on tweaking models to achieve better results, including how to get around GPU memory constraints. It also covers how to make a model predict multiple values, for example classifying an image into more than one set of categories. Collaborative filtering is covered in depth, following the textbook chapter.

# The topics covered, briefly
- How to get around GPU memory constraints using gradient accumulation.
- Using ensembles of different architectures.
- How to make a model predict more than one dependent variable.
- Cross entropy loss, mathematically.
- Collaborative filtering: what it is, how to create it from scratch.
- Collaborative filtering and recommender systems.
- Embedding matrices. 
- PCA.
- Positive feedback loops.
- The bootstrapping problem.
- Weight decay, avoiding overfitting, loss functions.

# Lecture Notes
The lecture covers what is inside a neural net and tweaks to the basic neural net we constructed previously.

## [Going through road to the top, part 3](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3):
- A larger model has more parameters, so it can find more features, but it also takes more GPU memory, which isn't as flexible as CPU memory.
- The notebook shows how to use as large a model as you like without worrying about memory. For example, Kaggle provides 16 GB GPUs.
- You can first find out how much memory a model uses. What's important is that training for longer does not actually require more GPU memory. 
- Gradient accumulation is how: run smaller batches, but accumulate their gradients over several batches before updating the weights, so training behaves as if we had used the normal, larger batch size.
- For certain models, gradient accumulation gives results identical to using a higher-memory GPU. This is the case for ConvNeXt and Transformers (NLP). If a model uses batch normalisation, the results won't be exactly the same, but they will probably still be good.
- Pick a batch size that fits your GPU memory; generally larger is better, and a multiple of 8 is preferable. As a rule of thumb (not always accurate), if you halve the batch size, halve the learning rate too.
- We can use ensembles of good models of different architectures, and get even better results. Furthermore, we can add in bagging too to train them on different sets of the training data. 
- At the start it may feel random as to why certain approaches/models are better, but as you develop intuition over time, it will feel less random and more systematic.
- Generally, it makes sense to iterate on small models then switch to large models, but there's a better way of ensuring this performance converts correctly. This is covered in the second course.
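
To make the gradient accumulation idea above concrete, here is a minimal sketch in plain Python rather than the notebook's fastai code. It assumes a toy one-weight model `y_hat = w * x` with mean squared error; the data, the starting weight and the helper name `mse_grad` are made up for illustration. It only shows that averaging the gradients of several small batches gives the same gradient as one large batch.

```python
# Toy data for a one-weight linear model y_hat = w * x (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5


def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Gradient from one large batch of 8 samples.
full_batch_grad = mse_grad(w, xs, ys)

# Same gradient from 4 micro-batches of 2 samples each:
# scale each micro-batch gradient by 1/accum_steps and add them up.
# Only after the last micro-batch would we take an optimiser step.
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size
accumulated_grad = 0.0
for i in range(0, len(xs), micro_batch_size):
    g = mse_grad(w, xs[i:i + micro_batch_size], ys[i:i + micro_batch_size])
    accumulated_grad += g / accum_steps

print(full_batch_grad, accumulated_grad)
```

Only one micro-batch needs to be in memory at a time, which is where the saving comes from. As far as I know, fastai wraps this pattern in its `GradientAccumulation` callback, so you would not normally write the loop by hand.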

## [Going through road to the top, part 4](https://www.kaggle.com/code/jhoward/multi-target-road-to-the-top-part-4):
- We now want a model to predict two things instead of one, i.e. two dependent variables. For example, from a rice photo, predict the type of rice (10 types) and the disease it may have (10 types), so there are 20 categories in total.
- This requires an understanding of making custom loss functions and a deeper look into how cross entropy loss works.
- Make a learner just for the first dependent variable, disease, and create a specific metric function for it.
- Cross entropy loss: Jeremy states it is really important to understand, and so goes through the maths in a separate Excel sheet. He tries to predict whether an image is a cat, dog, plane, fish or building, so there are 5 classification categories. The model outputs 5 numbers, one per category. Cross entropy loss first applies softmax to turn them into probabilities. Then it compares the actual value (1 for the correct category, 0 otherwise) to the softmax value for each category: it multiplies the log of each predicted probability by the actual value, sums the results and negates the sum, so only the log probability of the correct category contributes to the loss.
- Further info: https://chris-said.io/2020/12/26/two-things-that-confused-me-about-cross-entropy/
- Binary cross entropy is just cross entropy for 1 category: is a cat or not. Careful here, it's not for 2 categories e.g. cat or dog.
- Loss functions in PyTorch come in two forms: the functional form in `F` (e.g. `F.cross_entropy`) and the class form in `nn` (e.g. `nn.CrossEntropyLoss`). The latter has more parameters to play with.
- Set the number of outputs of the final layer to the total number of categories across both dependent variables (20 here), rather than the number of categories of just one.
- The loss function is what tells the model which outputs belong to which dependent variable. The overall loss is the sum of the losses for each dependent variable.
- This new model, that can predict 20 categories, actually is better than a model that just predicts disease type! This is because the training to do other types of predictions helps. Sometimes this approach is better, sometimes not!
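
The cross entropy steps and the summed loss described above can be sketched in plain Python. This is an illustration, not the notebook's code: the function names, the example outputs, and the assumption that the first 10 outputs score the disease and the last 10 score the rice variety are all mine.

```python
import math


def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log of the softmax probability of the correct category."""
    return -math.log(softmax(logits)[target_index])


def combined_loss(logits, disease_index, variety_index, n_disease=10):
    """Sum of two cross entropy losses over a 20-output model.

    Assumption: the first n_disease outputs score the disease and the
    remaining outputs score the rice variety.
    """
    disease_loss = cross_entropy(logits[:n_disease], disease_index)
    variety_loss = cross_entropy(logits[n_disease:], variety_index)
    return disease_loss + variety_loss


# Five made-up outputs for cat, dog, plane, fish, building; the image is a dog.
example_logits = [-4.9, 2.6, 0.4, -1.2, 3.1]
loss_dog = cross_entropy(example_logits, 1)

# A model with no preference (all outputs equal) over 10 + 10 categories.
uniform_loss = combined_loss([0.0] * 20, disease_index=3, variety_index=7)
print(loss_dog, uniform_loss)
```

Because the one-hot target is zero everywhere except the correct category, the full "multiply by the actual value and sum" step collapses to picking out a single log probability, which is what `cross_entropy` does here.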

## [Collaborative Filtering Deep Dive](https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook):
- Collaborative filtering is a key part of recommender systems.
- We use the MovieLens dataset of movie ratings with 3 columns: UserID, MovieID, and rating.
- Imagine a matrix of the users and their movie ratings. There are missing values for unrated/unseen movies. Collaborative filtering is just trying to fill in these missing values to complete the matrix.
- The problem is predicting how a user will rate an unrated movie for them. We want to match up the user's movie preferences with the movie's features to predict this.
- But we don't know their preferences and the movie features, these are called latent variables. We only have their ID, their previous ratings, and those of other users. We can however infer a user's preferences and a movie's features from this data.
- Let's assume there are 5 latent factors. For a movie these might be things like its genre or length, but we don't set them ourselves: we calculate them and can then try to interpret what they are.
- Choosing the number of latent factors is hard. fast.ai has a function that picks it based on Jeremy's intuition, but you can experiment too.
- Use SGD to optimise these latent factors after we set a loss function.
- What is embedding? Just looking something up in an array. An embedding matrix is the array that is looked up. Multiplying a one-hot encoded vector by an embedding matrix is the same as looking up a row by its index, much like a lookup function in Excel. Think about a dot product with a one-hot encoded vector: it just returns the value you're looking up.
- We then cover how to create a collaborative filtering model from scratch using Python and PyTorch classes.
- We create a DotProduct class to define embeddings and looking up values for UserIDs and MovieIDs.
- Some of our user rating predictions can be greater than 5, the maximum. We squish the predictions with a sigmoid to fix this.
- We noticed that some users just rate all movies highly, while others use a range of ratings. Let's incorporate this into our model. To do so, we add two more learned parameters, a movie bias and a user bias, reflecting that some movies tend to be rated especially well or badly, and that some users tend to rate all movies as good or bad.
- It's not covered, but I think we could try and cluster users instead to try and incorporate user and movie types/preferences?
- We can use L2 regularisation (weight decay) to avoid overfitting. This adds the sum of the squares of the weights, scaled by the weight decay factor, to the loss function. It also helps with useless latent factors, because their weights are pushed towards zero and they won't contribute. I suppose this makes getting the exact number of latent factors less important.
- In fast.ai, usually defaults are good, but for tabular data, it's hard to know good defaults, so it's good to test yourself.
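
Putting the pieces above together, a single prediction is a dot product of latent factors, plus the two biases, squashed by a sigmoid. The sketch below is plain Python rather than the PyTorch `DotProduct` class from the notebook, and all the numbers are hypothetical. The `(0, 5.5)` range is what I recall the notebook using, so treat it as an assumption; `sigmoid_range` here is my own helper, written to behave like the fastai function of the same name.

```python
import math


def sigmoid_range(x, low, high):
    """Squash x into the interval between low and high with a sigmoid."""
    return low + (high - low) / (1 + math.exp(-x))


def predict_rating(user_factors, movie_factors, user_bias, movie_bias,
                   y_range=(0.0, 5.5)):
    """Dot product of latent factors plus both biases, squashed into y_range."""
    dot = sum(u * m for u, m in zip(user_factors, movie_factors))
    return sigmoid_range(dot + user_bias + movie_bias, *y_range)


# Hypothetical learned values for one user and one movie (5 latent factors each).
user_factors = [0.9, -0.2, 0.4, 0.0, 1.1]
movie_factors = [1.2, 0.1, 0.3, -0.5, 0.8]
rating = predict_rating(user_factors, movie_factors, user_bias=0.3, movie_bias=-0.1)
print(round(rating, 2))
```

Making the upper bound slightly larger than 5 means the sigmoid can actually reach a rating of 5, which it otherwise only approaches.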

# Questions
The questions from now on will be a part of the blog post rather than uploaded to the lesson's GitHub repo.

**What problem does collaborative filtering solve?**

If you had a matrix of users and their ratings of movies, there would be empty spaces wherever a user hasn't rated a movie. Collaborative filtering attempts to make predictions to fill these spaces.

**How does it solve it?**

It uses the data we have to create/infer latent variables, like movie genre or user preferences, to use to predict.

**Why might a collaborative filtering predictive model fail to be a very useful recommendation system?**

A recommendation system might be better if it could tell users why they are being recommended certain movies. 
If we find it difficult to understand what the latent variables mean, this could be hard.
The system also can't act as fluidly as a human and recommend things with incredible movie domain and user preference knowledge. For example, if I asked at a bookstore for a recommendation, they could cater very well to my individual preferences and find a great niche book for me. A collab filtering model could struggle to do this, even with many latent factors.

**What does a crosstab representation of collaborative filtering data look like?**

We first show the matrix of userID against movieID, with the user ratings in the cells. Then next to each userID and movieID, we show the values of their latent factors; in our example, 5 each. This gives a neat view of each prediction: the userID and movieID for it, and the 10 latent factors in total behind it. As we train our model, we can see both the prediction and the 10 latent factors change. We use SGD to optimise as usual.

**Write the code to create a crosstab representation of the MovieLens data (you might need to do some web searching!)**

-pass-
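
One possible approach, sketched in plain Python on a handful of made-up ratings rather than the real MovieLens file. The rows and variable names are hypothetical; with the real data, pandas' `pd.crosstab` or `DataFrame.pivot_table` would be the more usual tools.

```python
# A few made-up (user_id, movie_id, rating) rows in the MovieLens layout.
ratings = [
    (1, 10, 4), (1, 20, 5),
    (2, 10, 3), (2, 30, 2),
    (3, 20, 1),
]

user_ids = sorted({u for u, _, _ in ratings})
movie_ids = sorted({m for _, m, _ in ratings})
lookup = {(u, m): r for u, m, r in ratings}

# Rows are users, columns are movies; None marks a movie the user has not rated.
crosstab = [[lookup.get((u, m)) for m in movie_ids] for u in user_ids]

print("user", *movie_ids, sep="\t")
for u, row in zip(user_ids, crosstab):
    print(u, *["" if r is None else r for r in row], sep="\t")
```

The `None` cells are exactly the gaps that collaborative filtering tries to fill.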

**What is a latent factor? Why is it "latent"?**

A latent factor/variable is a factor not present in the dataset that we infer in order to aid predictions. It's latent because we have to infer it from the data. 

**What is a dot product? Calculate a dot product manually using pure Python with lists.**

A dot product multiplies the corresponding elements of two vectors together and adds up the results, e.g. (1,2,3).(1,2,3) = 1 + 4 + 9 = 14.
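
For the "pure Python with lists" part of the question, a minimal sketch could look like this (the function name `dot_product` is my own):

```python
def dot_product(a, b):
    """Multiply corresponding elements of two equal-length lists and sum them."""
    if len(a) != len(b):
        raise ValueError("both lists must have the same length")
    return sum(x * y for x, y in zip(a, b))


print(dot_product([1, 2, 3], [1, 2, 3]))  # 1 + 4 + 9 = 14
```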

**What does pandas.DataFrame.merge do?**

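It joins two DataFrames on a shared column (or index), like a database join. In this case, we merge the ratings table with the movie titles table on the movie ID, because as humans we want to see not just movieIDs but their titles.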

**What is an embedding matrix?**

It's actually not so complicated or scary. An embedding matrix is just a matrix we multiply a one-hot-encoded matrix with. If we take the dot product of a one-hot-encoded vector with an embedding matrix, it returns only the row of values we want. Essentially, this is just a quick way to look up values in the embedding matrix. The embedding matrices themselves contain the user latent factors and the movie latent factors.

In concrete terms, say we want to make a prediction for user 32's rating of movie 45. We make a one-hot-encoded vector for each and take its dot product with the corresponding embedding matrix to get user 32's latent factors and movie 45's latent factors. Then with our latent factors, we can make our prediction.

**What is the relationship between an embedding and a matrix of one-hot-encoded vectors?**

Explained previously. The one-hot-encoded vector simply picks out which embedding matrix values to access/look up.
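
A small sketch in plain Python shows the equivalence. The matrix values and helper names are made up; a real embedding layer such as PyTorch's `nn.Embedding` does the indexing directly rather than the multiplication.

```python
# A tiny embedding matrix: 4 users, 3 latent factors each (made-up values).
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(index, size):
    """Vector of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]


def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(n_cols)]


user_index = 2
via_one_hot = vector_times_matrix(one_hot(user_index, 4), embedding_matrix)
via_lookup = embedding_matrix[user_index]
print(via_one_hot == via_lookup)
```

Both routes return the same row, but the lookup touches one row while the multiplication touches every entry in the matrix.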

**Why do we need Embedding if we could use one-hot-encoded vectors for the same thing?**

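Using one-hot-encoded vectors takes a lot more memory and time: most of the multiplication is by zero, whereas an embedding layer indexes directly into the matrix and gives the same result.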

**What does an embedding contain before we start training (assuming we're not using a pretrained model)?**

They are initialised to random values, so nothing meaningful. Just like weights in any model, they become meaningful through training. After training, they represent something about the latent factors used for predictions. In NLP, they capture words' relationships to each other.

**Create a class (without peeking, if possible!) and use it.**

-pass-

**What does x[:,0] return?**

The first column.

**Rewrite the DotProduct class (without peeking, if possible!) and train a model with it.**

-pass-

**What is a good loss function to use for MovieLens? Why?**

MovieLens contains movie ID, user ID, and numeric ratings. 
We use MSELossFlat as our loss function.

It's fine to use MSE for classification. In fact, for classification, cross-entropy and MSE can be sidegrades to each other. https://stats.stackexchange.com/questions/568238/disadvantages-of-using-a-regression-loss-function-in-multi-class-classification

But we are not doing classification, we are doing regression as our prediction is 1-5. For regression, MSE is better suited, but cross-entropy is faster when predictions vary significantly. https://rohanvarma.me/Loss-Functions/

**What would happen if we used cross-entropy loss with MovieLens? How would we need to change the model?**

-pass-

**What is the use of bias in a dot product model?**

To encode extra information about the problem, in our case, the fact that some users rate really highly or lowly, and some movies are generally really well rated or badly rated.

**What is another name for weight decay?**

L2 regularization

**Write the equation for weight decay (without peeking!).**

Add wd*(w^2), summed over all the weights, to the loss function: loss_with_wd = loss + wd * sum(w^2), where wd is the weight decay, a parameter that controls how much we dislike large weights.

**Write the equation for the gradient of weight decay. Why does it help reduce weights?**

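We add wd*(w^2) to the loss function, where wd is weight decay. The gradient of this term with respect to a weight w is 2*wd*w.
It penalises large weights, because as we try to lower the loss, the weight decay term will be large if the weight is too high. Since the gradient is proportional to the weight itself, each optimisation step pulls large weights towards zero more strongly than small ones.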
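
As a sanity check on that gradient, here is a plain-Python sketch that compares 2*wd*w with a finite-difference estimate and then takes one SGD step on the penalty alone. The weights, `wd` and `lr` values are arbitrary illustrative choices, and the function names are my own.

```python
def weight_decay_term(weights, wd):
    """Penalty added to the loss: wd times the sum of squared weights."""
    return wd * sum(w ** 2 for w in weights)


def weight_decay_grad(weights, wd):
    """Analytic gradient of the penalty: 2 * wd * w for each weight."""
    return [2 * wd * w for w in weights]


weights = [3.0, -0.5, 0.01]
wd = 0.1
lr = 0.5

# Check the analytic gradient against a finite-difference estimate for the first weight.
eps = 1e-6
bumped = [weights[0] + eps] + weights[1:]
numeric = (weight_decay_term(bumped, wd) - weight_decay_term(weights, wd)) / eps
analytic = weight_decay_grad(weights, wd)[0]

# One SGD step on the penalty alone shrinks every weight towards zero,
# and the largest weight moves the most.
stepped = [w - lr * g for w, g in zip(weights, weight_decay_grad(weights, wd))]
print(analytic, numeric, stepped)
```

In real training this gradient is added to the gradient of the ordinary loss, so the weights only stay large where the data justifies it.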

**Why does reducing weights lead to better generalization?**

Because larger weights are associated with the model fitting a more complex function. For instance, large weights result in a very sharp curve with many edges going to specific data points. Having lower weights corresponds to having a more gradual curve that is less overfitted to specific data points.

**What does argsort do in PyTorch?**

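It returns the indices that would sort a tensor's values, rather than the sorted values themselves. We can use those indices to pick out the elements with the largest or smallest values.
In our example, for movie_bias, we can easily see the movies with the lowest biases after training.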
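
To show what "indices that would sort" means, here is a plain-Python stand-in for `argsort`. It is not PyTorch's implementation, and the titles and bias values are hypothetical.

```python
def argsort(values, descending=False):
    """Return the indices that would sort values (a plain-Python stand-in)."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=descending)


# Hypothetical learned biases for four movies.
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
movie_bias = [0.31, -0.42, 0.05, -0.10]

lowest_first = argsort(movie_bias)
print([titles[i] for i in lowest_first[:2]])
```

The returned indices are then used to look up the titles, which is the same pattern as indexing the list of movie titles with the first few entries of `movie_bias.argsort()`.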

**Does sorting the movie biases give the same result as averaging overall movie ratings by movie? Why/why not?**

No. Specifically looking into the sorted movie biases tells us specific information we can interpret. In our case, the movies with the lowest biases are the movies that people tend not to like even if it's something that they'd normally enjoy (something that matches their preferences). This is because these movies' latent factors (genre, etc) match the users' latent factors (preferences) but still have very little movie bias. A higher movie bias tells us that users regardless of movie and user latent factors rate that movie highly, so a small movie bias tells us that users rate that movie badly, regardless of how well the movie and user latent factors match.

**How do you print the names and details of the layers in a model?**

learn.model

**Extra: What is PCA?**

Principal component analysis is a technique to interpret embedding matrices. For humans it's hard to understand latent factors, so this helps. We can plot a graph of the strongest PCA components, which are like latent variables, for the movies. Looking at it, we can then try to infer what these PCA components mean, and why they are so valuable for predictions. It's amazing, since it can discover things about the data with no help.

**What is the "bootstrapping problem" in collaborative filtering?**

The extreme case, when we have no users and so no data, what do we recommend to the first user? Or even if we have many users, how do we recommend a new product to them?

**How could you deal with the bootstrapping problem for new users? For new movies?**

We can ask new users about their preferences and create a separate model that tries to predict their embedding vector based on this. 

**How can feedback loops impact collaborative filtering systems?**

Say a small number of dedicated users really like anime. They only watch it rather than other genres, and always rate it. This tunes the recommendation system to recommend more anime, which in turn makes other existing users watch anime, makes users who don't like anime leave, and attracts new users who like anime. This causes a feedback loop and can make the original purpose of the website change.

**When using a neural network in collaborative filtering, why can we have different numbers of factors for movies and users?**

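In our previous non-deep-learning model, we took the dot product of the embeddings, which requires both to have the same number of factors. In the deep learning model we don't use the dot product. Instead, we take the results of the embedding lookups and concatenate their activations before passing them through the layers, so the two embeddings can have different sizes.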
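
A tiny plain-Python sketch of why concatenation removes the size constraint. The embedding values and the single hypothetical linear layer are made up for illustration and are not the notebook's `CollabNN`.

```python
user_embedding = [0.2, -0.1, 0.4]              # 3 factors (made-up values)
movie_embedding = [0.5, 0.3, -0.2, 0.1, 0.7]   # 5 factors (made-up values)

# Concatenation works for any pair of sizes; a dot product would need equal lengths.
layer_input = user_embedding + movie_embedding

# One hypothetical linear layer with a single output,
# with one weight per element of the concatenated input.
layer_weights = [0.1] * len(layer_input)
layer_bias = 0.0
activation = sum(w * x for w, x in zip(layer_weights, layer_input)) + layer_bias
print(len(layer_input), activation)
```

The first layer only needs to know the combined length, so each embedding can be sized independently.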

**Why is there an nn.Sequential in the CollabNN model?**

In order to create our neural network layers in the order we want.

**What kind of model should we use if we want to add metadata about users and items, or information such as date and time, to a collaborative filtering model?**

A deep learning model. 
Our collaborative filtering model dot product approach is hardcoded to only be able to use userID, movieID, ratings, latent variables, and biases. 
A DL model, or TabularModel, can incorporate more information, however as the next lesson discusses, there are advantages and disadvantages to both.

# Links

The course page for this session is https://course.fast.ai/Lessons/lesson7.html, which includes a lecture, notebooks, and a set of questions from the course book.