<a href="https://colab.research.google.com/github/alexvasyuk/fastai/blob/main/lesson7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Lesson 7, Collaborative Filtering
# Here's the notebook from the course: https://www.kaggle.com/code/jhoward/collaborative-filtering-deep-dive/notebook

In [2]:
# Quick refresher before Lesson 7:
#
# - Goal of ML: predict outputs (y) from inputs (x).
# - Training setup:
#     * Model (f(x; w)) makes predictions with weights w.
#     * Loss function (L(y_hat, y)) measures how wrong the prediction is.
#     * Optimization = tweak weights to minimize loss.
# - Gradient Descent (SGD):
#     * Compute gradients of loss w.r.t. weights.
#     * Update weights by nudging them in the opposite direction of the gradient.
# - Batches:
#     * Instead of updating per row (too noisy/slow) or whole dataset (too memory-heavy),
#       we use small batches (e.g. 16, 64, 128).
#     * For each batch: forward pass → loss → gradients averaged over batch → weight update.
#     * Each batch update = one "step". One full pass through dataset = one "epoch".
# - Neural Networks:
#     * Stack layers of parameters, with non-linear activations (ReLU, sigmoid, etc.).
#     * More layers/parameters let the model approximate more complex functions.
# - Final reminder:
#     * Goal is not just minimizing training loss, but generalizing well to unseen data
#       (avoid overfitting).
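
# A minimal sketch of the batch loop described above, in plain Python.
# Everything here is an assumption made for illustration: the toy data
# (generated from y = 3x + 2), the learning rate and the batch size.
# It is not fastai's training loop, only the same idea written out by hand.

```python
import random

random.seed(0)

# Toy data from y = 3x + 2 with no noise, so the right answer is known.
data = [(i / 10, 3.0 * (i / 10) + 2.0) for i in range(-20, 21)]

w, b = 0.0, 0.0      # model weights, starting from an arbitrary point
lr = 0.1             # learning rate (hypothetical value)
batch_size = 8

for epoch in range(200):                  # one epoch = one full pass over the data
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]    # one batch = one step
        grad_w = grad_b = 0.0
        for x, y in batch:
            err = (w * x + b) - y         # forward pass, then the error
            grad_w += 2 * err * x         # d(err**2)/dw
            grad_b += 2 * err             # d(err**2)/db
        # Average the gradients over the batch, then step against them.
        w -= lr * grad_w / len(batch)
        b -= lr * grad_b / len(batch)

print(round(w, 3), round(b, 3))
```

# After training, w and b should sit very close to the 3 and 2 used to
# generate the data.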


In [3]:
# Extra reminder before Lesson 7 (Collaborative Filtering & Embeddings):
#
# - Same supervised learning story:
#     * Inputs → user ID + movie ID
#     * Output → rating (or preference)
#     * Loss → difference between predicted rating and actual rating
#     * Optimization → tweak weights to reduce loss
#
# - New twist in Lesson 7:
#     * Instead of "normal" features (age, income, pixels, etc.),
#       we use IDs (user/movie).
#     * IDs get mapped into vectors (embeddings), which the model learns.
#     * These embeddings capture hidden structure, e.g.:
#         - users with similar tastes have similar embeddings
#         - movies with similar audiences have similar embeddings
#
# - Core idea: collaborative filtering = learn latent factors of users and items,
#   so the model can generalize and recommend even for user/movie pairs not seen before.


In [4]:
from fastai.collab import *
from fastai.tabular.all import *
set_seed(42)

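# The certificate for the fastai download has expired, so the download fails.
# Workaround: upgrade certifi and turn off HTTPS certificate verification
# (insecure; acceptable only in a throwaway notebook like this one).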
!pip install --upgrade certifi
import ssl
import certifi
ssl._create_default_https_context = ssl._create_unverified_context



In [5]:
# First thing is as usual - loading the data
path = untar_data(URLs.ML_100k)

In [6]:
# Let's see what the data looks like
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None, names=['user', 'movie', 'rating', 'timestamp'])
ratings.head()  # shows the first 5 rows

Unnamed: 0,user,movie,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [7]:
# The lesson then jumps to the idea of latent factors. I immediately question the
# connection between latent factors and collaborative filtering.

In [8]:
# Q: In classical collaborative filtering, to recommend movies for User X
#    we find a similar User Y, look at what Y liked, and suggest movies
#    X hasn't seen. How does this relate to the latent factor / embeddings
#    approach we’re using here?
#
# A: Classical (memory-based) CF:
#      * For User X, find another User Y with similar ratings.
#      * Recommend movies that Y liked but X hasn’t seen.
#      * Problem: slow for large datasets, fails when data is sparse.
#
#    Latent Factor (matrix factorization / embeddings) CF:
#      * Every user has a hidden "taste vector" (embedding).
#      * Every movie has a hidden "attribute vector" (embedding).
#      * Dot product(user_vec, movie_vec) ≈ predicted rating.
#      * Vectors are initialized randomly and learned with SGD by minimizing loss.
#
#    Connection:
#      * Users with similar embeddings end up close in vector space.
#      * Movies with similar embeddings also cluster together.
#      * So the "find User Y similar to User X" intuition is implicitly captured,
#        but compressed into a low-dimensional embedding space that the model learns.
#
#    Advantages of latent factors:
#      * Generalizes better (works even with sparse data).
#      * More efficient (compare vectors instead of scanning rating tables).
#      * Captures hidden structure (genres, tastes) automatically.
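
# For contrast with the latent factor approach, here is a rough sketch of the
# classical (memory-based) idea: find the most similar user, then suggest what
# they liked. The ratings dictionary, the function names and the choice of
# cosine similarity over shared movies are all assumptions for illustration.

```python
import math

# Hypothetical toy ratings: user -> {movie: rating}
ratings_by_user = {
    "X": {"A": 5, "B": 4, "C": 1},
    "Y": {"A": 5, "B": 5, "C": 1, "D": 5, "E": 2},
    "Z": {"A": 1, "B": 2, "C": 5, "E": 5},
}

def cosine_on_shared(u, v):
    """Cosine similarity computed only over movies both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)

def recommend(target, min_rating=4):
    """Pick the most similar other user and return what they liked that target has not seen."""
    others = [name for name in ratings_by_user if name != target]
    nearest = max(
        others,
        key=lambda n: cosine_on_shared(ratings_by_user[target], ratings_by_user[n]),
    )
    seen = ratings_by_user[target]
    picks = [m for m, r in ratings_by_user[nearest].items()
             if r >= min_rating and m not in seen]
    return nearest, picks

nearest_user, suggestions = recommend("X")
print(nearest_user, suggestions)  # Y ['D']
```

# Every recommendation scans other users' rating tables, which is the
# scaling problem mentioned above.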


In [9]:
# Q: Are latent factors and embeddings of a user/movie the same thing?
#    How does the terminology work here?
#
# A: They’re closely related but not exactly the same:
#
#    - Latent factors:
#        * The conceptual idea of hidden dimensions that explain preferences.
#        * Examples: "likes action movies", "prefers older films", "enjoys comedies".
#        * Not directly labeled — the model discovers them.
#
#    - Embeddings:
#        * The practical implementation in the model.
#        * A vector of learned numbers that represent a user or a movie.
#        * Each component in the embedding corresponds to one latent factor.
#
#    Example:
#        If we choose 5 latent factors, a user’s embedding might be:
#        [0.8, -0.3, 0.1, 0.7, -0.4]
#        and a movie’s embedding:
#        [0.6, -0.1, 0.3, 0.9, -0.5]
#        The dot product of these two embeddings ≈ predicted rating.
#
#    Summary:
#        * Latent factors = the hidden dimensions (conceptual idea).
#        * Embeddings = the learned numerical representation across those dimensions.
#        * In practice, people often use the terms interchangeably.
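
# The dot product from the example above, worked out in plain Python.
# The two vectors are the ones quoted in the example; the result is only the
# raw score, before any bias term or range squashing is applied.

```python
user_emb = [0.8, -0.3, 0.1, 0.7, -0.4]
movie_emb = [0.6, -0.1, 0.3, 0.9, -0.5]

# Multiply matching components and add them up:
# 0.48 + 0.03 + 0.03 + 0.63 + 0.20
score = sum(u * m for u, m in zip(user_emb, movie_emb))
print(round(score, 2))  # 1.37
```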


In [10]:
# Latent factor calculation
# 0. Pick the number of latent factors (LFs) that represent a user's interests and
#    a movie's attributes, say 5, plus a bias term for each user and each movie.
# 1. Initialize the embeddings to random values for every user and every movie.
# 2. For each rated (user, movie) pair, calculate the dot product of the two
#    embeddings. The rating itself is not an input.
# 3. Compare the prediction against the known rating y by calculating the loss
#    (e.g. mean squared error).
# 4. Use backpropagation to compute the gradient of the loss with respect to
#    the embeddings.
# 5. Update the embeddings of each (user, movie) pair by subtracting the
#    gradient times the learning rate.
# 6. Repeat over all pairs for N epochs until satisfied with the loss.

# Let's try it.
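
# Before the fastai version, the numbered steps above can be sketched by hand.
# This is a toy under stated assumptions: the eight ratings, the sizes, the
# learning rate and the epoch count are invented, and the gradients are written
# out manually instead of using autograd.

```python
import random

random.seed(42)

# Hypothetical toy data: (user_id, movie_id, rating)
observed = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0),
            (2, 1, 2.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 2.0)]
n_users, n_movies, n_factors = 4, 3, 2      # step 0
lr, n_epochs = 0.02, 1000

# Step 1: small random embeddings, biases start at zero
user_emb = [[random.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
movie_emb = [[random.uniform(-0.1, 0.1) for _ in range(n_factors)] for _ in range(n_movies)]
user_bias = [0.0] * n_users
movie_bias = [0.0] * n_movies

def predict(u, m):
    # Step 2: dot product of the two embeddings, plus both bias terms
    dot = sum(a * b for a, b in zip(user_emb[u], movie_emb[m]))
    return dot + user_bias[u] + movie_bias[m]

def mse():
    # Step 3: mean squared error over the known ratings
    return sum((predict(u, m) - r) ** 2 for u, m, r in observed) / len(observed)

initial_loss = mse()
for _ in range(n_epochs):                    # step 6: repeat
    random.shuffle(observed)
    for u, m, r in observed:
        err = predict(u, m) - r
        for k in range(n_factors):
            # Step 4: gradients of err**2 with respect to each embedding value
            grad_u = 2 * err * movie_emb[m][k]
            grad_m = 2 * err * user_emb[u][k]
            # Step 5: move against the gradient, scaled by the learning rate
            user_emb[u][k] -= lr * grad_u
            movie_emb[m][k] -= lr * grad_m
        user_bias[u] -= lr * 2 * err
        movie_bias[m] -= lr * 2 * err
final_loss = mse()

print(round(initial_loss, 3), round(final_loss, 3))
```

# The loss on the known ratings should drop sharply. Updating one pair at a
# time keeps the sketch short; the real training below works in batches.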

In [11]:
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie', 'title'), header=None)
movies.head()

Unnamed: 0,movie,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)


In [12]:
# Join on the shared 'movie' column to attach a title to every rating
ratings = ratings.merge(movies)
ratings.head()

Unnamed: 0,user,movie,rating,timestamp,title
0,196,242,3,881250949,Kolya (1996)
1,186,302,3,891717742,L.A. Confidential (1997)
2,22,377,1,878887116,Heavyweights (1994)
3,244,51,2,880606923,Legends of the Fall (1994)
4,166,346,1,886397596,Jackie Brown (1997)


In [13]:
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()

Unnamed: 0,user,title,rating
0,782,Starship Troopers (1997),2
1,943,Judge Dredd (1995),3
2,758,Mission: Impossible (1996),4
3,94,Farewell My Concubine (1993),5
4,23,Psycho (1960),4
5,296,Secrets & Lies (1996),5
6,940,"American President, The (1995)",4
7,334,Star Trek VI: The Undiscovered Country (1991),1
8,380,Braveheart (1995),4
9,690,So I Married an Axe Murderer (1993),1


In [14]:
# Let's make 2 matrices of latent factors for users and movies
n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])
n_factors = 5

user_factors = torch.rand(n_users, n_factors)
movies_factors = torch.rand(n_movies, n_factors)

In [15]:
# Next thing is we need the dot product of users and movies factors matrices

In [16]:
class DotProduct(Module):
  def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
    # y_range constrains predictions to (0, 5.5); the top is above 5 because
    # a sigmoid only approaches its upper limit and never reaches it
    self.user_factors = Embedding(n_users, n_factors)
    self.user_bias = Embedding(n_users, 1) # adjusting for users who are positive/negative in general
    self.movie_factors = Embedding(n_movies, n_factors)
    self.movie_bias = Embedding(n_movies, 1) # adjusting for movies that are good/bad in general
    self.y_range = y_range

  def forward(self, x):
    users = self.user_factors(x[:, 0])
    movies = self.movie_factors(x[:, 1])
    res = (users * movies).sum(dim=1, keepdim=True)
    res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
    return sigmoid_range(res, *self.y_range)
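
# What the last line of forward() does to the raw score, as a plain-Python
# sketch. The helper name sigmoid_range_sketch is made up here; it follows the
# formula lo + (hi - lo) * sigmoid(x) and is not fastai's own implementation.

```python
import math

def sigmoid_range_sketch(x, lo, hi):
    """Squash any real number into the open interval (lo, hi)."""
    return lo + (hi - lo) / (1 + math.exp(-x))

for raw in (-10.0, 0.0, 3.0, 10.0):
    print(raw, round(sigmoid_range_sketch(raw, 0, 5.5), 3))
```

# A raw score of 0 lands in the middle of the range (2.75), and even a very
# large score stays just below 5.5, which is why the upper bound is set above 5.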


In [17]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())

In [18]:
learn.fit_one_cycle(5, 5e-3)

epoch,train_loss,valid_loss,time
0,0.853494,0.925751,00:16
1,0.576824,0.910235,00:10
2,0.408276,0.93974,00:09
3,0.310911,0.953071,00:09
4,0.29624,0.953698,00:13


In [19]:
# Notice that valid_loss improves only until epoch 1 (0.910) and then rises again,
# while train_loss keeps falling. This is an indication of overfitting.
# It is happening because the coefficients have grown too large.
# The lesson explains the intuition with parabolas that have different coefficients.

In [20]:
# To mitigate this we need some type of regularization that will help us generalize.
# The lesson suggests using weight decay, which is a way of stopping our params
# from growing too large. How it works under the hood is unclear from the lesson.
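
# One common way to describe weight decay, sketched in plain Python: add
# wd * sum(w**2) to the loss, which adds 2 * wd * w to every gradient.
# The function name sgd_step and the numbers are assumptions for illustration,
# and the optimizer that fastai actually uses may apply the decay differently.

```python
def sgd_step(weights, grads, lr, wd=0.0):
    """One SGD step on loss + wd * sum(w**2); the penalty adds 2 * wd * w to each gradient."""
    return [w - lr * (g + 2 * wd * w) for w, g in zip(weights, grads)]

# With a zero data gradient, the only force left is the decay,
# which pulls every weight toward 0 by the same factor each step.
weights = [4.0, -2.0]
for _ in range(3):
    weights = sgd_step(weights, grads=[0.0, 0.0], lr=0.1, wd=0.5)

print([round(w, 3) for w in weights])  # [2.916, -1.458]
```

# Each step here multiplies the weights by 1 - lr * 2 * wd = 0.9, so large
# weights are penalized in proportion to their size.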

In [21]:
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)

epoch,train_loss,valid_loss,time
0,0.889936,0.942432,00:09
1,0.68231,0.887724,00:08
2,0.518489,0.859999,00:09
3,0.455478,0.849411,00:09
4,0.434298,0.844207,00:10


In [22]:
# The course goes into creating the Embedding module from scratch.
# I'm skipping this, because I'm not into that.

In [23]:
# Next is interpreting what the model has learned.
# This means looking at the learned embeddings (factors) and biases.

In [24]:
# Let's look at the movies with the lowest/highest biases

In [25]:
# Highest bias / Movies that rank the highest across all kinds of users
movie_bias = learn.model.movie_bias.weight.squeeze() # added .weight, because my
                                            # model is using the Embedding class
indices = movie_bias.argsort(descending=True)[:5]
[dls.classes['title'][i] for i in indices]


['L.A. Confidential (1997)',
 'Titanic (1997)',
 'Good Will Hunting (1997)',
 'Shawshank Redemption, The (1994)',
 "Schindler's List (1993)"]

In [26]:
# Lowest bias / Movies that rank the lowest across all kinds of users
movie_bias = learn.model.movie_bias.weight.squeeze() # added .weight, because my
                                            # model is using the Embedding class
indices = movie_bias.argsort()[:5]
[dls.classes['title'][i] for i in indices]

['Grease 2 (1982)',
 'Showgirls (1995)',
 'Children of the Corn: The Gathering (1996)',
 'Dracula: Dead and Loving It (1995)',
 'Spice World (1997)']

In [27]:
# The lesson presents 2 core ideas: distance and the bootstrap problem.
# Distance between vectors is important as it appears later in the LLM lessons.
# The idea is that similar movies/items have similar vectors and the distance
# between them is small. You can use this to find similar movies/items.

# The bootstrap problem is about what to do when you don't have enough data,
# say for a new user or a new product. Read more about it in the lesson.
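
# A small sketch of the distance idea in plain Python. The three movie vectors
# are invented for illustration and are not taken from the trained model, and
# Euclidean distance is just one possible choice of distance measure.

```python
import math

# Hypothetical 3-factor embeddings
movie_vecs = {
    "Movie A": [0.9, 0.1, -0.2],
    "Movie B": [0.8, 0.2, -0.1],
    "Movie C": [-0.7, 0.9, 0.3],
}

query = "Movie A"

# The most similar movie is the one whose vector is the shortest distance away.
closest = min(
    (title for title in movie_vecs if title != query),
    key=lambda title: math.dist(movie_vecs[query], movie_vecs[title]),
)
print(closest)  # Movie B
```

# With a trained model, the same lookup could be run on rows of the movie
# embedding matrix instead of these made-up vectors.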

In [28]:
# So far we've used the DotProduct model for our collab filtering. This approach is called
# PMF (probabilistic matrix factorization). But you can also use deep learning for CF.

In [29]:
# We now build a neural net for collaborative filtering:
#
# 1. There are two embedding matrices, one for users and one for movies:
#    - Users: [n_users × 74] trainable parameters
#    - Movies: [n_movies × 102] trainable parameters
#
# 2. For each data point (userID, movieID, rating):
#    - Look up the row for that user → vector of size [1×74]
#    - Look up the row for that movie → vector of size [1×102]
#    - Concatenate them → [1×176] input activations
#
# 3. Pass this vector through the dense layers:
#    - [1×176] → [176×100] + biases → [1×100]
#    - [1×100] → [100×50] + biases → [1×50]
#    - [1×50]  → [50×1]  + bias    → [1×1] prediction (the rating)
#
# 4. Compute the loss vs. true rating.
#
# 5. Backpropagation updates all trainable parameters:
#    - User and movie embedding matrices (so their rows become better latent features)
#    - Dense layer weights and biases
#
# 6. After training, the embeddings represent users and movies in a meaningful way,
#    and the model can predict ratings for any (user, movie) pair.
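
# The shapes above translate into parameter counts as follows. This is only
# arithmetic in plain Python; n_users=944 and n_movies=1665 are the example
# sizes from the lesson, and the variable names are made up for illustration.

```python
n_users, user_dim = 944, 74
n_movies, movie_dim = 1665, 102

# Concatenated embedding width, then the dense layer widths from step 3
layer_sizes = [user_dim + movie_dim, 100, 50, 1]

# One row per user/movie in each embedding matrix
embedding_params = n_users * user_dim + n_movies * movie_dim

# Each dense layer has n_in * n_out weights plus n_out biases
dense_params = sum(n_in * n_out + n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(layer_sizes[0], embedding_params, dense_params)  # 176 239686 22801
```

# Most of the trainable parameters sit in the two embedding matrices, not in
# the dense layers.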


In [30]:
# A small NN-based collaborative filtering model:
# - Looks up user/item embeddings
# - Concatenates them
# - Passes through a 2-layer NN
# - Maps output into a rating range with sigmoid_range

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        # user_sz and item_sz are tuples from fastai's get_emb_sz(dls), e.g.
        #   user_sz  = (n_users, emb_dim_users)
        #   item_sz  = (n_items, emb_dim_items)
        # Example from the lesson: user_sz=(944, 74), item_sz=(1665, 102)

        # Trainable lookup tables (nn.Embedding) for users and items.
        # Embedding(*user_sz) ≡ Embedding(n_users, emb_dim_users)
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)

        # A tiny NN that takes [emb_dim_users + emb_dim_items] inputs.
        # First Linear projects concat embeddings -> n_act, then ReLU,
        # then another Linear to a single scalar (the raw rating score).
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1)
        )

        # We’ll squash the raw scalar into the desired rating range later.
        self.y_range = y_range

    def forward(self, x):
        # x is a 2‑column LongTensor of IDs with shape [bs, 2]:
        #   x[:,0] = user ids, x[:,1] = item (movie) ids

        # Look up embeddings for the batch of user ids and item ids.
        # Shapes: user_vecs: [bs, emb_dim_users], item_vecs: [bs, emb_dim_items]
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])

        # Concatenate along feature dimension -> [bs, emb_dim_users + emb_dim_items]
        x = torch.cat(embs, dim=1)

        # Pass through NN -> [bs, 1] raw scores
        x = self.layers(x)

        # Map raw scores into the target rating interval (e.g., 0..5.5).
        # sigmoid_range(z, lo, hi) = lo + (hi - lo) * sigmoid(z)
        return sigmoid_range(x, *self.y_range)


In [31]:
# get_emb_sz returns the (n_rows, emb_dim) tuples for users and items
embs = get_emb_sz(dls)
model = CollabNN(*embs)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)