<a href="https://colab.research.google.com/github/cchummer/ml-dl-scratch/blob/main/classy_collaborative_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Collaborative Filtering from Scratch

These models were developed around the MovieLens dataset, with the main goal of learning to predict specific users' ratings of specific titles. However, collaborative filtering applies in many scenarios where a system benefits from correctly predicting a user's preferences or tastes. Four models are given: two variations of probabilistic matrix factorization (PMF) and two neural-net-based models.

In [1]:
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# Datasets + Dataloaders

We have four variations of collaborative filtering, using two custom dataset structures. The two PMF models are quite similar, with one simply incorporating an extra feature on top of the usual user and item IDs. The same goes for the two NN models. During development, this extra feature was movie genre data, but the code could easily be modified or extended to incorporate more features as needed.

In [2]:
# Utilizing pytorch Dataset + DataLoader functionality for easy batch size control + ID mapping
class collabSimpleDataset(Dataset):
  def __init__(self, df, user_col, item_col, score_col):

    # Create mappings of user and item ids to 0-based indices so they will play nicely with embedding lookups. Also allows for use of non-numeric columns (movie title, etc)
    self.user_col = user_col
    self.item_col = item_col
    self.score_col = score_col

    self.data = df
    self.user_mapping = {user_id: i for i, user_id in enumerate(self.data[self.user_col].unique())}
    self.item_mapping = {item_id: i for i, item_id in enumerate(self.data[self.item_col].unique())}

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    user_id = self.data.iloc[idx][self.user_col]
    item_id = self.data.iloc[idx][self.item_col]
    rating = self.data.iloc[idx][self.score_col]

    # Return our index values rather than the raw ID's (which are potentially non-numeric)
    # This has the effect of indices rather than ID's being returned to forward() in the model below
    user_idx = self.user_mapping[user_id]
    item_idx = self.item_mapping[item_id]

    #user_idx_tensor = torch.tensor(user_idx, dtype=torch.long)
    #item_idx_tensor = torch.tensor(item_idx, dtype=torch.long)
    #ratings_tensor = torch.tensor(rating, dtype=torch.float32)

    #return [user_idx, item_idx, ratings_tensor]
    return torch.tensor([user_idx, item_idx, rating], dtype=torch.float32)

In [3]:
# Define a new dataset which will also hold genre data
class collabExtendedDataset(Dataset):
  def __init__(self, df, user_col, item_col, genre_col, score_col):

    self.user_col = user_col
    self.item_col = item_col
    self.genre_col = genre_col
    self.score_col = score_col

    self.data = df
    self.user_mapping = {user_id: i for i, user_id in enumerate(self.data[self.user_col].unique())}
    self.item_mapping = {item_id: i for i, item_id in enumerate(self.data[self.item_col].unique())}

    # Create our mapping of unique genres
    self.genres_list = self._get_unique_genres()
    self.genre_mapping = {genre: i for i, genre in enumerate(self.genres_list)}

  def __len__(self):
    return len(self.data)

  def __getitem__(self, idx):
    user_id = self.data.iloc[idx][self.user_col]
    item_id = self.data.iloc[idx][self.item_col]
    genres = self.data.iloc[idx][self.genre_col]
    rating = self.data.iloc[idx][self.score_col]

    user_idx = self.user_mapping[user_id]
    genre_idxs = [self.genre_mapping[genre] for genre in genres.split('|')] # Is now a list
    item_idx = self.item_mapping[item_id]

    # Return a flat list: [user_idx, item_idx, rating, genre_idx_0, genre_idx_1, ...]
    # Nesting genre_idxs as its own element ([user_idx, item_idx, genre_idxs, rating]) would give a ragged list that
    # torch.tensor() cannot build. List concatenation keeps it flat, which is also easier to slice inside the model
    return torch.tensor([user_idx, item_idx, rating] + genre_idxs, dtype=torch.float32)

  # Helper function called in initialization
  def _get_unique_genres(self):
    unique_genres = set()
    for genres in self.data[self.genre_col].unique():
      unique_genres.update(genres.split('|'))
    return sorted(unique_genres)  # Sorted so the genre -> index mapping is the same on every run
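
To make the flat sample layout concrete, here is a small framework-free sketch. The helper names (`build_genre_mapping`, `encode_sample`, `decode_sample`) and the toy genre strings are illustrative only and not part of the notebook. It sorts the genre names before indexing so the mapping comes out the same on every run.

```python
def build_genre_mapping(genre_strings):
    """Map each pipe-separated genre name to a stable 0-based index."""
    unique = set()
    for s in genre_strings:
        unique.update(s.split('|'))
    return {g: i for i, g in enumerate(sorted(unique))}

def encode_sample(user_idx, item_idx, rating, genres, genre_mapping):
    """Flat layout: [user, item, rating, genre_0, genre_1, ...]."""
    return [user_idx, item_idx, rating] + [genre_mapping[g] for g in genres.split('|')]

def decode_sample(sample):
    """Split a flat sample back into its three fixed fields and the genre indices."""
    user_idx, item_idx, rating, *genre_idxs = sample
    return user_idx, item_idx, rating, genre_idxs

genre_mapping = build_genre_mapping(['Action|Thriller', 'Horror|Thriller'])
sample = encode_sample(3, 7, 4.5, 'Horror|Thriller', genre_mapping)

print(genre_mapping)          # {'Action': 0, 'Horror': 1, 'Thriller': 2}
print(sample)                 # [3, 7, 4.5, 1, 2]
print(decode_sample(sample))  # (3, 7, 4.5, [1, 2])
```

The first three positions are always user, item and rating, so everything from position 3 onward can be treated as genre indices regardless of how many there are.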

In [5]:
# Needed to handle variable length of samples' genre index lists. Pad tensors per batch
def collate_fn_with_padding(batch):

  # Extract individual components from the batch
  batch_size = len(batch)

  user_idxs = torch.zeros(batch_size, dtype=torch.long)
  item_idxs = torch.zeros(batch_size, dtype=torch.long)
  ratings = torch.zeros(batch_size, dtype=torch.float32)
  max_num_genres = max(len(item) - 3 for item in batch)  # Calculate max length of genre indices

  genre_idxs_padded = []

  for i, item in enumerate(batch):

    user_idxs[i] = item[0]
    item_idxs[i] = item[1]
    ratings[i] = item[2]
    genre_idxs = item[3:].long()  # item is already a tensor; .long() avoids the UserWarning raised by torch.tensor(tensor)
    padded_genre_idxs = torch.cat([genre_idxs, torch.zeros(max_num_genres - len(genre_idxs), dtype=torch.long)])
    genre_idxs_padded.append(padded_genre_idxs)

  genre_idxs_padded = torch.stack(genre_idxs_padded, dim=0)

  # Concatenate all tensors into a single tensor
  batch_tensor = torch.cat([user_idxs.unsqueeze(1),
                            item_idxs.unsqueeze(1),
                            ratings.unsqueeze(1),
                            genre_idxs_padded], dim=1)

  return batch_tensor
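
One caveat with zero-padding: index 0 is also a real genre in `genre_mapping`, so after the embedding lookup a padded slot is indistinguishable from that genre. A common workaround, sketched below without PyTorch, is to shift real indices up by one and reserve 0 for padding. The helper name `pad_genre_lists` and the toy batch are hypothetical.

```python
PAD_IDX = 0  # reserved for padding; real genres start at 1

def pad_genre_lists(genre_lists, pad_idx=PAD_IDX):
    """Shift genre indices by +1 and right-pad every list to the batch maximum."""
    max_len = max(len(genres) for genres in genre_lists)
    padded = []
    for genres in genre_lists:
        shifted = [g + 1 for g in genres]
        padded.append(shifted + [pad_idx] * (max_len - len(shifted)))
    return padded

batch_genres = [[0, 4], [2], [1, 3, 5]]
padded = pad_genre_lists(batch_genres)
print(padded)  # [[1, 5, 0], [3, 0, 0], [2, 4, 6]]
```

With this scheme the embedding table needs `n_genres + 1` rows. `nn.Embedding` accepts a `padding_idx` argument (for example `nn.Embedding(n_genres + 1, n_factors, padding_idx=0)`) that starts the reserved row at zero and excludes it from gradient updates, though a custom initialization applied to the whole weight matrix afterwards would overwrite that zero row.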

In [6]:
def create_data_loaders(trn_df, val_df, user_col, item_col, score_col, genre_col=None, batch_size=64):
  '''
  user_col: str name of column in dataframe containing user ids
  item_col: str name of column in dataframe containing item ids
  genre_col: (optional) str name of column in dataframe containing genres
  score_col: str name of column in dataframe containing ratings
  '''
  trn_ds = None
  val_ds = None
  trn_dl = None
  val_dl = None

  if genre_col is None:
    trn_ds = collabSimpleDataset(trn_df, user_col, item_col, score_col)
    val_ds = collabSimpleDataset(val_df, user_col, item_col, score_col)

    trn_dl = DataLoader(trn_ds, batch_size=batch_size, shuffle=True)
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=True)

  else:
    trn_ds = collabExtendedDataset(trn_df, user_col, item_col, genre_col, score_col)
    val_ds = collabExtendedDataset(val_df, user_col, item_col, genre_col, score_col)

    trn_dl = DataLoader(trn_ds, batch_size=batch_size, shuffle=True, collate_fn=collate_fn_with_padding) # See custom collate method above
    val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=True, collate_fn=collate_fn_with_padding)

  return trn_dl, val_dl, trn_ds, val_ds
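
Note that each dataset above builds its own ID-to-index mappings from its own dataframe, so the same title can land on different indices in the training and validation sets. If that is not intended, one option is to build the mappings once from the full data and share them. The sketch below uses plain lists in place of dataframe columns; `build_mapping` and the toy titles are hypothetical.

```python
def build_mapping(ids):
    """Assign 0-based indices in order of first appearance."""
    mapping = {}
    for raw_id in ids:
        mapping.setdefault(raw_id, len(mapping))
    return mapping

train_titles = ['Heat (1995)', 'Toy Story (1995)', 'Heat (1995)']
val_titles = ['Toy Story (1995)', 'Heat (1995)']

# Separate mappings disagree on which index a title gets
train_map = build_mapping(train_titles)
val_map = build_mapping(val_titles)
print(train_map['Toy Story (1995)'], val_map['Toy Story (1995)'])  # 1 0

# A shared mapping built from all titles is consistent across both splits
shared_map = build_mapping(train_titles + val_titles)
train_idxs = [shared_map[t] for t in train_titles]
val_idxs = [shared_map[t] for t in val_titles]
print(train_idxs, val_idxs)  # [0, 1, 0] [1, 0]
```

In the notebook this would mean passing the shared dictionaries into both dataset constructors rather than rebuilding them per split, so that a validation row looks up the same embedding row the model trained.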

# The Models
First the PMF-based models, then the neural nets.

In [7]:
def sigmoid_range(x, low, high):
  '''
  Sigmoid function with range `(low, high)`
  https://github.com/fastai/fastai/blob/master/fastai/layers.py#L100
  '''
  return torch.sigmoid(x) * (high - low) + low
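
A quick scalar check of what `sigmoid_range` does, using only the standard library. The function `sigmoid_range_scalar` is a stand-in written for this example. It also suggests why the models below default to an upper bound of 5.5 rather than 5: the sigmoid never reaches its bounds, so a slightly larger range lets a finite raw score map to a 5.0 rating.

```python
import math

def sigmoid_range_scalar(x, low, high):
    """Scalar version of sigmoid_range: squash x into the open interval (low, high)."""
    return 1.0 / (1.0 + math.exp(-x)) * (high - low) + low

midpoint = sigmoid_range_scalar(0.0, 0, 5.5)
five_stars = sigmoid_range_scalar(math.log(10), 0, 5.5)
capped = sigmoid_range_scalar(math.log(10), 0, 5.0)

print(midpoint)              # 2.75, a raw score of 0 maps to the middle of the range
print(round(five_stars, 6))  # 5.0, reachable with a finite raw score of ln(10)
print(capped < 5.0)          # True, with high=5.0 the same score falls short of 5.0
```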

In [8]:
# Simple PMF, user and item only
class DotProductBias(nn.Module):
  def __init__(self, n_users, n_items, n_factors, y_range=(0,5.5)):
    super().__init__()
    self.user_factors = nn.Embedding(n_users, n_factors)
    self.user_bias = nn.Embedding(n_users, 1)
    self.item_factors = nn.Embedding(n_items, n_factors)
    self.item_bias = nn.Embedding(n_items, 1)
    self.y_range = y_range

    # Initialize embeddings and biases
    nn.init.normal_(self.user_factors.weight, std=0.01)
    nn.init.normal_(self.item_factors.weight, std=0.01)
    nn.init.normal_(self.user_bias.weight, std=0.01)
    nn.init.normal_(self.item_bias.weight, std=0.01)

  def forward(self, x):

    user_idx = x[:, 0].long()
    item_idx = x[:, 1].long()
    #ratings = x[:, 2]

    users = self.user_factors(user_idx)
    items = self.item_factors(item_idx)
    users_bias = self.user_bias(user_idx).squeeze()
    items_bias = self.item_bias(item_idx).squeeze()

    dot_product = torch.sum(users * items, dim=1)
    bias = users_bias + items_bias

    prediction = dot_product + bias
    return sigmoid_range(prediction, *self.y_range)
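
In equation form, the forward pass above computes the following. The symbols are my own shorthand rather than names from the code: $\mathbf{p}_u$ and $\mathbf{q}_i$ are the rows of `user_factors` and `item_factors`, $b_u$ and $b_i$ are the corresponding bias entries, $\sigma$ is the sigmoid, and $(\text{low}, \text{high})$ is `y_range`.

$$
\hat{r}_{ui} = \text{low} + (\text{high} - \text{low}) \cdot \sigma\left(\mathbf{p}_u \cdot \mathbf{q}_i + b_u + b_i\right)
$$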

In [32]:
# Incorporate another feature
class DPBWithItemFeatures(nn.Module):
  '''
  Similar dot-product model, but with room for an extra categorical feature (currently genres), which is folded into the item representation.
  Could easily be modified to handle more features, for either users or items.
  '''
  def __init__(self, n_users, n_items, n_genres, n_factors, y_range=(0,5.5)):
    super().__init__()
    self.user_factors = nn.Embedding(n_users, n_factors)
    self.user_bias = nn.Embedding(n_users, 1)
    self.item_factors = nn.Embedding(n_items, n_factors)
    self.item_bias = nn.Embedding(n_items, 1)
    self.genre_factors = nn.Embedding(n_genres, n_factors)
    self.genre_bias = nn.Embedding(n_genres, 1)
    self.y_range = y_range

    # Initialize embeddings and biases
    nn.init.normal_(self.user_factors.weight, std=0.01)
    nn.init.normal_(self.item_factors.weight, std=0.01)
    nn.init.normal_(self.genre_factors.weight, std=0.01)
    nn.init.normal_(self.user_bias.weight, std=0.01)
    nn.init.normal_(self.item_bias.weight, std=0.01)
    nn.init.normal_(self.genre_bias.weight, std=0.01)

  def forward(self, x):

    user_idx = x[:, 0].long()
    item_idx = x[:, 1].long()
    # ratings = x[:, 2].long()
    genre_idxs = x[:, 3:].long() # Assuming genres now take up the 4th column and onward

    users = self.user_factors(user_idx)
    items = self.item_factors(item_idx)
    users_bias = self.user_bias(user_idx).squeeze()
    items_bias = self.item_bias(item_idx).squeeze()

    # Embedding lookup for genres
    genres_embedded = self.genre_factors(genre_idxs)

    # Currently summing biases of all the genres of the sample. Could also average
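    genre_bias = self.genre_bias(genre_idxs).squeeze(-1).sum(dim=1)  # squeeze(-1) only drops the trailing size-1 dim, so a batch padded to a single genre still works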

    # There are multiple ways to use the genre embeddings, especially if they are a different size than the item embeddings
    #item_with_genre = items.unsqueeze(1) * genres_embedded
    items_with_genre = torch.cat([items.unsqueeze(1) * genres_embedded, items.unsqueeze(1)], dim=1)

    # Sum over dim 1 (one genre-scaled copy of the item factors per genre, plus the plain item factors),
    # reducing (n_samples, n_genres + 1, n_item_factors) to (n_samples, n_item_factors)
    # This assumes item embedding and genre embedding sizes are equal
    items_with_genre = items_with_genre.sum(dim=1) # or mean

    dot_product = torch.sum(users * items_with_genre, dim=1)
    bias = users_bias + items_bias + genre_bias

    prediction = dot_product + bias
    return sigmoid_range(prediction, *self.y_range)
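
Written out, the genre handling above amounts to the following. The notation is my own shorthand: $\mathbf{p}_u$, $\mathbf{q}_i$ and $\mathbf{e}_g$ are rows of `user_factors`, `item_factors` and `genre_factors`, $b_u$, $b_i$ and $b_g$ are the bias entries, $\odot$ is the element-wise product, and $G_s$ is the list of genre indices in the sample's row of the batch (padding slots included).

$$
\tilde{\mathbf{q}}_i = \mathbf{q}_i + \sum_{g \in G_s} \mathbf{q}_i \odot \mathbf{e}_g = \mathbf{q}_i \odot \Big(\mathbf{1} + \sum_{g \in G_s} \mathbf{e}_g\Big)
$$

$$
\hat{r}_{ui} = \text{low} + (\text{high} - \text{low}) \cdot \sigma\Big(\mathbf{p}_u \cdot \tilde{\mathbf{q}}_i + b_u + b_i + \sum_{g \in G_s} b_g\Big)
$$

So the genres act as a per-factor rescaling of the item vector, which is why the genre and item embedding sizes have to match here.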

In [26]:
# Single hidden layer neural net, again only user + item
class nnSimpleCollab(nn.Module):
  def __init__(self, user_size, item_size, hidden_dim=128, y_range=(0,5.5)):
    super().__init__()
    self.user_embedding = nn.Embedding(*user_size)
    self.item_embedding = nn.Embedding(*item_size)

    # Realistically, incorporating dropout and/or other regularization techniques should be experimented with
    self.fc_layers = nn.Sequential(
      nn.Linear(user_size[1] + item_size[1], hidden_dim),
      nn.ReLU(),
      nn.Linear(hidden_dim, 1)  # Output is a single rating prediction
      )

    self.y_range = y_range

  def forward(self, x):

    user_embedded = self.user_embedding(x[:, 0].long())
    item_embedded = self.item_embedding(x[:, 1].long())

    # Concatenate user and item embeddings
    embedded = torch.cat([user_embedded, item_embedded], dim=1)

    # Pass through layers
    output = self.fc_layers(embedded)

    # Scale to y_range
    output = sigmoid_range(output, *self.y_range)

    return output.squeeze()

In [44]:
# Adds handling of additional (genre) feature
class nnExtendedCollab(nn.Module):
  def __init__(self, user_size, item_size, genre_size, hidden_dim=128, y_range=(0,5.5)):
    super().__init__()
    self.user_embedding = nn.Embedding(*user_size)
    self.item_embedding = nn.Embedding(*item_size)
    self.genre_embedding = nn.Embedding(*genre_size)

    '''
    We need to know what size to make our first linear layer, which is passed a concatenation of the relevant embeddings.
    There are a couple of ways to handle the variable number of genres per item:
      1. Sum or average the genre embeddings found, then concatenate the user and item embeddings with the summed/averaged genre embedding (which has the size of a single genre embedding)
        In this case, we only need the first layer to be (user_size[1] + item_size[1] + genre_size[1])
      2. Concatenate each relevant genre embedding to the user and item embeddings, padding unused space with 0's
        In this case, we need the first layer to be (user_size[1] + item_size[1] + max_num_genres * genre_size[1])
        Thus we will need to find the sample with the most genres before creating the model

    We will take the first approach for starters
    '''

    self.fc_layers = nn.Sequential(
      nn.Linear(user_size[1] + item_size[1] + genre_size[1], hidden_dim),
      nn.ReLU(),
      nn.Linear(hidden_dim, 1)  # Output is a single rating prediction
      )

    self.y_range = y_range

  def forward(self, x):

    user_idx = x[:, 0].long()
    item_idx = x[:, 1].long()
    # ratings = x[:, 2].long()
    genre_idxs = x[:, 3:].long() # Assuming genres now take up the 4th column and onward

    user_embedded = self.user_embedding(user_idx)
    item_embedded = self.item_embedding(item_idx)
    genres_embedded = self.genre_embedding(genre_idxs)

    # Sum/avg genre embeddings
    genre_embedded = torch.sum(genres_embedded, dim=1)
    #genre_embedded = torch.mean(genres_embedded, dim=1)

    # Concatenate embeddings
    embedded = torch.cat([user_embedded, item_embedded, genre_embedded], dim=1)

    # Pass through layers
    output = self.fc_layers(embedded)

    # Scale to y_range
    output = sigmoid_range(output, *self.y_range)

    return output.squeeze()
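
The two options described in the docstring above differ only in the input width of the first linear layer. A tiny sketch of that arithmetic follows; `first_layer_width` is a hypothetical helper and the embedding widths are made up for illustration.

```python
def first_layer_width(user_dim, item_dim, genre_dim, max_num_genres=None):
    """Input width of the first linear layer.

    max_num_genres=None -> option 1 (pool all genres into one vector)
    otherwise           -> option 2 (concatenate up to max_num_genres genre vectors)
    """
    if max_num_genres is None:
        return user_dim + item_dim + genre_dim
    return user_dim + item_dim + max_num_genres * genre_dim

# Illustrative embedding widths, not the notebook's actual values
print(first_layer_width(16, 32, 4))                    # 52
print(first_layer_width(16, 32, 4, max_num_genres=6))  # 72
```

Option 1 keeps the layer size independent of the data, while option 2 ties it to the longest genre list seen and would need every batch padded to that fixed length.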

# Training + Inference

In [10]:
# Simple training loop. Assumes rating is 3rd column in dataset
def train_pytorch_model(model, train_loader, optimizer, criterion, epochs=5):
  for epoch in range(epochs):

    model.train()
    running_loss = 0.0
    for batch in train_loader:

      targets = batch[:, 2]

      optimizer.zero_grad()

      outputs = model(batch)
      loss = criterion(outputs, targets)

      loss.backward()
      optimizer.step()

      running_loss += loss.item()

    print(f"Epoch {epoch+1}, Loss: {running_loss / len(train_loader)}")
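
The loop above reports only training loss. If a validation loss were also computed at the end of each epoch, a patience-based early stop could be layered on top. The sketch below is framework-free; the `EarlyStopping` class and the validation losses fed to it are made up purely for illustration.

```python
class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best_loss = float('inf')
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, val_loss, epoch):
        """Record this epoch's loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, one per epoch
fake_val_losses = [1.10, 0.95, 0.90, 0.93, 0.97, 1.05]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses, start=1):
    if stopper.step(val_loss, epoch):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch, stopper.best_loss)  # 5 3 0.9
```

In practice one would also save the model's `state_dict()` whenever `best_epoch` updates, so the best weights can be restored after stopping.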

Some simple preprocessing

In [11]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [12]:
ratings_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ML+DL/movielens_ratings.csv')
# Grab movie names from other csv
names_df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/ML+DL/movielens_movies.csv')
# Merge into ratings df
ratings_df = ratings_df.merge(names_df, on='movieId')

ratings_df

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
100831,610,160341,2.5,1479545749,Bloodmoon (1997),Action|Thriller
100832,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),Action|Crime|Drama
100833,610,160836,3.0,1493844794,Hazard (2005),Action|Drama|Thriller
100834,610,163937,3.5,1493848789,Blair Witch (2016),Horror|Thriller


In [13]:
#np.random.seed(42)
trn_df,val_df = train_test_split(ratings_df, test_size=0.25)

n_users = len(ratings_df.userId.unique())
n_movies = len(ratings_df.movieId.unique())
n_factors = 50

print(n_users)
print(n_movies)

610
9724


Ready to build dataloaders and train.

In [14]:
# First, the standard PMF model
trn_dl, val_dl, trn_ds, val_ds = create_data_loaders(trn_df, val_df, 'userId', 'title', 'rating') # No genre column, no neural net
test_model = DotProductBias(n_users, n_movies, n_factors)

In [15]:
#optimizer = torch.optim.Adam(test_model.parameters(), lr=0.005, weight_decay=0.1)
optimizer = torch.optim.Adam(test_model.parameters(), lr=0.005)
criterion = nn.MSELoss()

In [16]:
train_pytorch_model(test_model, trn_dl, optimizer, criterion, epochs=10)

Epoch 1, Loss: 0.9080956054570913
Epoch 2, Loss: 0.4273509998914554
Epoch 3, Loss: 0.23368170141997271
Epoch 4, Loss: 0.1821519449454072
Epoch 5, Loss: 0.1768910806002048
Epoch 6, Loss: 0.17468626905617174
Epoch 7, Loss: 0.16799746335258742
Epoch 8, Loss: 0.16370294591673537
Epoch 9, Loss: 0.15908487692992196
Epoch 10, Loss: 0.15605366658639988


Now some inference. Couple of interesting things to examine are item-item similarities and item bias values. In our case, user-user similarities and biases are not of much use since the users are only identified by IDs.

In [17]:
# Extract item bias embeddings from the model
item_bias_embeddings = test_model.item_bias.weight.squeeze().detach().numpy()

# Map the item bias embeddings back to their original item IDs, using the mapping we created in the dataset
item_biases = [(item_id, item_bias_embeddings[item_idx]) for item_id, item_idx in trn_ds.item_mapping.items()]

# Sort items based on bias values
sorted_item_biases = sorted(item_biases, key=lambda x: x[1], reverse=True)

# Output the items with the highest bias
num_top_items = 10
top_items_with_bias = sorted_item_biases[:num_top_items]
print(f"Top {num_top_items} items ranked by bias:")
for item_id, bias_value in top_items_with_bias:
  print(f"Item ID: {item_id}, Bias Value: {bias_value}")

Top 10 items ranked by bias:
Item ID: Shawshank Redemption, The (1994), Bias Value: 1.0387192964553833
Item ID: Lawrence of Arabia (1962), Bias Value: 0.9313034415245056
Item ID: Star Wars: Episode IV - A New Hope (1977), Bias Value: 0.8656998872756958
Item ID: Fight Club (1999), Bias Value: 0.8644293546676636
Item ID: Star Wars: Episode V - The Empire Strikes Back (1980), Bias Value: 0.8625494837760925
Item ID: 12 Angry Men (1957), Bias Value: 0.8552147746086121
Item ID: Philadelphia Story, The (1940), Bias Value: 0.8292230367660522
Item ID: Hustler, The (1961), Bias Value: 0.8070851564407349
Item ID: Godfather, The (1972), Bias Value: 0.8061210513114929
Item ID: Usual Suspects, The (1995), Bias Value: 0.8020727634429932


In [38]:
def calculate_item_item_similarity(model, item_embeddings):
  model.eval()

  # Get item embeddings
  #item_embeddings = model.item_factors.weight  # shape: (num_items, embedding_size)

  # Normalize item embeddings to unit length
  item_norms = torch.norm(item_embeddings, dim=1, keepdim=True)  # shape: (num_items, 1)
  item_embeddings_normalized = item_embeddings / item_norms

  # Calculate cosine similarities between all pairs of item embeddings
  cosine_similarities = torch.matmul(item_embeddings_normalized, item_embeddings_normalized.T)  # shape: (num_items, num_items)

  return cosine_similarities
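
The function above normalizes every embedding row to unit length and multiplies the result by its own transpose, so entry (i, j) of the output is the cosine similarity between items i and j. For a single pair of vectors the same quantity looks like this in plain Python (`cosine_similarity` and the toy vectors are for illustration only):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([3.0, 4.0], [-3.0, -4.0])

print(same_direction, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Only direction matters, not length, so two items with embeddings of very different magnitude can still score as highly similar.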

In [None]:
cosine_sims = calculate_item_item_similarity(test_model, test_model.item_factors.weight)

In [19]:
# Choose movie to inspect cosine similarities
base_item_id = 'Lawrence of Arabia (1962)'
item_idx = trn_ds.item_mapping[base_item_id]

similar_items = torch.argsort(cosine_sims[item_idx], descending=True)

# Print top 5 similar items
top_k = 5
for i in range(1, top_k + 1):  # Skip the first item (itself)
    similar_item_idx = similar_items[i].item()
    similar_item_id = next(key for key, val in trn_ds.item_mapping.items() if val == similar_item_idx)

    similarity_score = cosine_sims[item_idx, similar_item_idx].item()
    print(f"Item: {base_item_id} is similar to Item: {similar_item_id} with similarity score: {similarity_score:.4f}")

Item: Lawrence of Arabia (1962) is similar to Item: Hunt for Red October, The (1990) with similarity score: 0.5442
Item: Lawrence of Arabia (1962) is similar to Item: Man for All Seasons, A (1966) with similarity score: 0.5387
Item: Lawrence of Arabia (1962) is similar to Item: Monster (2003) with similarity score: 0.5188
Item: Lawrence of Arabia (1962) is similar to Item: Perfect Plan, A (Plan parfait, Un) (2012) with similarity score: 0.4984
Item: Lawrence of Arabia (1962) is similar to Item: Pickup on South Street (1953) with similarity score: 0.4911
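
The `next(...)` lookup above scans the whole mapping for every result. Inverting the dictionary once is a small alternative; the toy `item_mapping` below stands in for `trn_ds.item_mapping`.

```python
# Toy stand-in for trn_ds.item_mapping (title -> index)
item_mapping = {'Heat (1995)': 0, 'Toy Story (1995)': 1, 'Fargo (1996)': 2}

# Invert once (index -> title); safe because each title has a unique index
idx_to_item = {idx: item_id for item_id, idx in item_mapping.items()}

print(idx_to_item[2])  # Fargo (1996)
```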


In [22]:
# Let's just test the raw predictions, on the validation set this time
def test_model_preds(model, val_dl, criterion):
    model.eval()

    with torch.no_grad():
        total_loss = 0
        num_batches = 0

        for i, batch in enumerate(val_dl):
            # Extract targets from the batch
            targets = batch[:, 2]  # Ratings

            # Forward pass to get predictions
            predictions = model(batch)

            # Ensure targets and predictions are the same shape
            if predictions.shape != targets.shape:
                raise ValueError(f"Shape mismatch: predictions {predictions.shape}, targets {targets.shape}")

            # Compute loss
            loss = criterion(predictions, targets)
            total_loss += loss.item()
            num_batches += 1

            # Print predictions and targets
            for pred, actual in zip(predictions, targets):
                print(f"Predicted: {pred.item():.2f}, Actual: {actual.item()}")

        # Print average loss
        average_loss = total_loss / num_batches if num_batches > 0 else float('nan')
        print(f"Validation loss: {average_loss:.4f}")

In [None]:
test_model_preds(test_model, val_dl, criterion)

Let's compare with the standard neural-net-based model. Then we will compare the genre-inclusive models.

In [24]:
# Method to calculate an embedding size for a feature, based on the number of categories / unique values it holds.
# We will use it to determine our embedding sizes for users and items based on the number of each present in our dataset
def get_emb_size(n_cat):
  '''
  Quickly calculate number of factors for embedding layer of the given column in the dataframe
  https://github.com/fastai/fastai/blob/master/fastai/tabular/model.py#L12
  '''

  n_factors = min(600, round(1.6 * n_cat**0.56))
  return int(n_factors)
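
To get a feel for the rule of thumb, here it is applied to the category counts that appear in this notebook (610 users, 9724 movies, 20 genres). The function is repeated so the snippet runs on its own.

```python
def get_emb_size(n_cat):
    """Embedding width heuristic: 1.6 * n_cat^0.56, capped at 600."""
    return int(min(600, round(1.6 * n_cat**0.56)))

sizes = {name: get_emb_size(n_cat)
         for name, n_cat in [('users', 610), ('movies', 9724), ('genres', 20)]}
print(sizes)  # {'users': 58, 'movies': 274, 'genres': 9}
```

The width grows sub-linearly with the number of categories, so the movie embedding is only about five times wider than the user embedding despite there being roughly sixteen times as many movies.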

In [27]:
test_model = nnSimpleCollab((n_users, get_emb_size(n_users)), (n_movies, get_emb_size(n_movies)))

# Update optimizer to new model, criterion can stay the same
optimizer = torch.optim.Adam(test_model.parameters(), lr=0.001)
# criterion = nn.MSELoss()

In [28]:
train_pytorch_model(test_model, trn_dl, optimizer, criterion, epochs=10)

Epoch 1, Loss: 0.9046981757768318
Epoch 2, Loss: 0.7558842561823866
Epoch 3, Loss: 0.6947595283969001
Epoch 4, Loss: 0.6480425290902052
Epoch 5, Loss: 0.6055868424084384
Epoch 6, Loss: 0.5627825580247364
Epoch 7, Loss: 0.5188155913388265
Epoch 8, Loss: 0.4719425866950789
Epoch 9, Loss: 0.42646533763902844
Epoch 10, Loss: 0.3820772226616211


Off the bat, the NN model takes longer to train and, at least by training loss, did not reach the same accuracy as the PMF model. Let's check the validation loss for comparison; it was about 1.7 on my last run of the PMF model.

In [None]:
test_model_preds(test_model, val_dl, criterion)

Similar results (~1.75). Obviously, there is likely room for improvement in terms of regularization and hyperparameter tuning. Let's move on to the models incorporating genre information.
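
Since the criterion is `nn.MSELoss`, these numbers are in squared rating units. Taking the square root gives a rough "typical error in stars", which is easier to read. The values below are the approximate losses quoted in this notebook (final PMF training loss, then the two validation losses).

```python
import math

losses = {'PMF train': 0.156, 'PMF val': 1.7, 'NN val': 1.75}
rmse = {name: round(math.sqrt(mse), 2) for name, mse in losses.items()}
print(rmse)  # {'PMF train': 0.39, 'PMF val': 1.3, 'NN val': 1.32}
```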

In [33]:
# Get the number of unique genres
unique_genres = set()
for genres in ratings_df.genres.unique():
  unique_genres.update(genres.split('|'))
n_genres = len(list(unique_genres))

n_genres

20

In [49]:
# Recreate dataloaders with genre info
trn_dl, val_dl, trn_ds, val_ds = create_data_loaders(trn_df, val_df, 'userId', 'title', 'rating', 'genres')

test_model = DPBWithItemFeatures(n_users, n_movies, n_genres, n_factors)

optimizer = torch.optim.Adam(test_model.parameters(), lr=0.005)
#criterion = nn.MSELoss()

In [50]:
train_pytorch_model(test_model, trn_dl, optimizer, criterion, epochs=10)


Epoch 1, Loss: 0.8236763685336573
Epoch 2, Loss: 0.5048020575101
Epoch 3, Loss: 0.26478905312704354
Epoch 4, Loss: 0.18226403080544698
Epoch 5, Loss: 0.15894147946598566
Epoch 6, Loss: 0.14689810131558306
Epoch 7, Loss: 0.1365100285526863
Epoch 8, Loss: 0.1297222082353435
Epoch 9, Loss: 0.12939203933335197
Epoch 10, Loss: 0.12286290533008612


We can now inspect the genre biases and genre-genre similarities. Slight modification to code used above (for item biases and item-item similarities)

In [51]:
genre_bias_embeddings = test_model.genre_bias.weight.squeeze().detach().numpy()

# Map back to their original genre ids/names
genre_biases = [(genre_id, genre_bias_embeddings[genre_idx]) for genre_id, genre_idx in trn_ds.genre_mapping.items()]

sorted_genre_biases = sorted(genre_biases, key=lambda x: x[1], reverse=True)

num_top_items = 10
top_genres_with_bias = sorted_genre_biases[:num_top_items]
print(f"Top {num_top_items} genres ranked by bias:")
for genre_id, bias_value in top_genres_with_bias:
  print(f"Genre: {genre_id}, Bias Value: {bias_value}")

Top 10 genres ranked by bias:
Genre: Documentary, Bias Value: 0.6882173418998718
Genre: Film-Noir, Bias Value: 0.36215248703956604
Genre: Drama, Bias Value: 0.359806627035141
Genre: Animation, Bias Value: 0.2901704013347626
Genre: War, Bias Value: 0.2563265264034271
Genre: Western, Bias Value: 0.2541514039039612
Genre: Crime, Bias Value: 0.18676279485225677
Genre: Mystery, Bias Value: 0.14704154431819916
Genre: Musical, Bias Value: 0.14622989296913147
Genre: Comedy, Bias Value: 0.1361905336380005


In [39]:
cosine_sims = calculate_item_item_similarity(test_model, test_model.genre_factors.weight)

In [42]:
# Choose genre to inspect cosine similarities
base_genre_id = 'Thriller'
genre_idx = trn_ds.genre_mapping[base_genre_id]

similar_genres = torch.argsort(cosine_sims[genre_idx], descending=True)

# Print top 5 similar genres
top_k = 5
for i in range(1, top_k + 1):  # Skip the first genre (itself)
    similar_genre_idx = similar_genres[i].item()
    similar_genre_id = next(key for key, val in trn_ds.genre_mapping.items() if val == similar_genre_idx)

    similarity_score = cosine_sims[genre_idx, similar_genre_idx].item()
    print(f"Genre: {base_genre_id} is similar to genre: {similar_genre_id} with similarity score: {similarity_score:.4f}")

Genre: Thriller is similar to genre: Drama with similarity score: 0.6790
Genre: Thriller is similar to genre: (no genres listed) with similarity score: 0.6747
Genre: Thriller is similar to genre: Horror with similarity score: 0.6723
Genre: Thriller is similar to genre: Children with similarity score: 0.5754
Genre: Thriller is similar to genre: Documentary with similarity score: 0.5740


Let's see the predictions and validation loss

In [None]:
test_model_preds(test_model, val_dl, criterion)

Oof, not great (>2.0 on my last run). Perhaps there is room for better handling of the genre embeddings (averaging rather than summing per sample, etc). Let's take a look at the NN version.

In [45]:
test_model = nnExtendedCollab((n_users, get_emb_size(n_users)), (n_movies, get_emb_size(n_movies)), (n_genres, get_emb_size(n_genres)))

optimizer = torch.optim.Adam(test_model.parameters(), lr=0.001)
#criterion = nn.MSELoss()

In [46]:
train_pytorch_model(test_model, trn_dl, optimizer, criterion, epochs=10)


Epoch 1, Loss: 0.9001253139095258
Epoch 2, Loss: 0.7494193808429334
Epoch 3, Loss: 0.684297267379091
Epoch 4, Loss: 0.6326422305831247
Epoch 5, Loss: 0.583021673832448
Epoch 6, Loss: 0.5327189792004333
Epoch 7, Loss: 0.48103412732406314
Epoch 8, Loss: 0.43207636150309275
Epoch 9, Loss: 0.3880645381113398
Epoch 10, Loss: 0.34925424819418216


In [None]:
test_model_preds(test_model, val_dl, criterion)

Off the bat, still not great (~2.5 val loss). Again there is room for changes to the handling of genre data, and for hyperparameter tuning. Monitoring validation loss during training, and stopping when the best tradeoff of training and validation loss is reached may be beneficial. Rounding predictions to the nearest .5 at the last step may be of help. More testing to be done, as always :)
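
The rounding idea could be as simple as the sketch below. It assumes ratings live on a half-star scale from 0.5 to 5.0, which is an assumption about the dataset; adjust `low` and `high` as needed. `round_to_half` is a hypothetical helper.

```python
def round_to_half(x, low=0.5, high=5.0):
    """Snap a predicted rating to the nearest 0.5 and clamp it to [low, high]."""
    snapped = round(x * 2) / 2  # note: Python rounds exact ties to the even integer
    return min(high, max(low, snapped))

snapped_preds = [round_to_half(p) for p in (3.26, 4.74, 5.4, 0.1)]
print(snapped_preds)  # [3.5, 4.5, 5.0, 0.5]
```

This would only be applied at inference time; training should still use the unrounded output so the loss stays differentiable.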