# Bert4rec
Author: David Zelenay

This notebook builds a **movie recommendation system using a BERT-style model (BERT4Rec)**.

Here's the workflow:

1.  **Data Loading & Preprocessing:** Loads movie ratings and metadata.
2.  **User Sequence Creation:** Converts user rating histories into ordered sequences of `(rating, movieId)` interactions.
3.  **Tokenization & Padding:** Maps movie interactions to integer IDs and pads/truncates sequences to a fixed length.
4.  **Dataset Preparation:** Creates a PyTorch Dataset that randomly masks items in the sequences for training (Masked Language Model objective).
5.  **Model Training:**
    *   Defines a `BertForMaskedLM` model (from HuggingFace Transformers).
    *   Trains this model to predict the *masked* movies in the user sequences.
    *   Saves model checkpoints after each epoch.
6.  **Recommendation Generation (Inference):**
    *   Takes a user's recent movie sequence.
    *   Masks the *next potential item* in that sequence.
    *   Uses the trained model to predict the most probable movies for that masked slot.
    *   Filters out already seen movies and presents the top predictions as recommendations.

In short, it learns patterns from user movie interaction histories to predict what they might watch next.

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px


# Preprocessing

In [None]:
ratings_df = pd.read_parquet("../data/parquet/ratings.parquet")
ratings_df.info()

In [None]:
movies_df = pd.read_parquet("../data/parquet/movies.parquet")
movies_df.info()

# Prepare User Sequences
To train BERT4Rec, we need to convert the ratings data into sequences of movie IDs for each user, ordered by timestamp. Each sequence represents the user's interaction history.

In [None]:
# Create user sequences as tuples of (rating, movieId), ordered by timestamp
user_sequences = ratings_df.sample(frac=0.001).sort_values(['userId', 'timestamp']).groupby('userId')[['rating', 'movieId']].apply(lambda x: list(zip(x['rating'], x['movieId'])))
user_sequences = user_sequences.tolist()
print(f"Number of users: {len(user_sequences)}")
print(f"Example sequence: {user_sequences[0][:10]}")

## Install and Import Required Packages
We'll use PyTorch and HuggingFace Transformers for the BERT4Rec model. If not already installed, run the following cell.

## Tokenize and Pad Sequences
BERT4Rec requires sequences of equal length. We'll map movie IDs to integer tokens and pad the sequences to a fixed length.

In [None]:
from collections import defaultdict
from torch.nn.utils.rnn import pad_sequence
import torch

# Create movieId to index mapping
movie_ids = set([movie for seq in user_sequences for movie in seq])
movie2idx = {movie: idx+1 for idx, movie in enumerate(sorted(movie_ids))}  # 0 reserved for padding
idx2movie = {idx: movie for movie, idx in movie2idx.items()}

# Convert sequences to index lists
indexed_sequences = [[movie2idx[movie] for movie in seq] for seq in user_sequences]

# Pad sequences
max_seq_len = 10  # You can adjust this
padded_sequences = pad_sequence([torch.tensor(seq[-max_seq_len:]) for seq in indexed_sequences], batch_first=True, padding_value=0)
print(f"Padded sequences shape: {padded_sequences.shape}")

## Prepare PyTorch Dataset and DataLoader
We'll create a custom Dataset to feed the padded sequences into the BERT4Rec model. Each sample will be a sequence for masked language modeling.

In [None]:
from torch.utils.data import Dataset, DataLoader
import random

class BERT4RecDataset(Dataset):
    def __init__(self, sequences, mask_prob=0.15):
        self.sequences = sequences
        self.mask_prob = mask_prob
        self.vocab_size = len(movie2idx) + 1  # +1 for padding

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx].clone()
        labels = seq.clone()
        mask = torch.rand(seq.size()) < self.mask_prob
        seq[mask] = self.vocab_size - 1  # Use last index as [MASK] token
        labels[~mask] = -100  # Only compute loss on masked tokens
        return seq, labels

# Create dataset and dataloader
mask_token = len(movie2idx) + 1
train_dataset = BERT4RecDataset(padded_sequences)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Check a batch
for batch in train_loader:
    print(batch[0].shape, batch[1].shape)
    break

## Define and Train the BERT4Rec Model
We'll use HuggingFace's `BertForMaskedLM` as the base for BERT4Rec. The model will be trained to predict masked movies in user sequences.

In [20]:
import torch

# Robust device selection for Mac M3 GPU (Apple Silicon)
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
    print("Using Apple Silicon GPU (MPS backend)")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using CUDA GPU")
else:
    device = torch.device("cpu")
    print("Using CPU (no GPU available)")

from transformers import BertConfig, BertForMaskedLM
from torch.optim import AdamW
from tqdm.notebook import tqdm  # Add tqdm for progress bar
import os

config = BertConfig(
    vocab_size=len(movie2idx) + 2,  # +1 for padding, +1 for [MASK]
    max_position_embeddings=max_seq_len,
    num_attention_heads=4,
    num_hidden_layers=4,
    type_vocab_size=1
)
model = BertForMaskedLM(config).to(device)

optimizer = AdamW(model.parameters(), lr=5e-4)

# Training loop (all epochs with tqdm progress bar)
epochs = 5  # Set the number of epochs as needed
model.train()
# Create directory for checkpoints if it doesn't exist
os.makedirs("./model_checkpoints", exist_ok=True)
for epoch in tqdm(range(epochs), desc="Epochs", leave=True):
    # print(f"Epoch {epoch+1}/{epochs}")
    epoch_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Training", leave=True)
    for batch in progress_bar:
        input_ids, labels = batch[0].to(device), batch[1].to(device)
        # Create attention mask: 1 for non-padding, 0 for padding (padding token is 0)
        attention_mask = (input_ids != 0).long()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_loss += loss.item()
        progress_bar.set_postfix({"batch_loss": f"{loss.item():.4f}"})
    avg_loss = epoch_loss / len(train_loader)
    print(f"Epoch {epoch+1} average loss: {avg_loss:.4f}")
    # Save model checkpoint after each epoch
    checkpoint_path = f"./model_checkpoints/bert4rec_epoch_{epoch+1}.pth"
    torch.save(model.state_dict(), checkpoint_path)
    # print(f"Model saved to {checkpoint_path}")

Using Apple Silicon GPU (MPS backend)


Epochs:   0%|          | 0/5 [00:00<?, ?it/s]

Training:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch 1 average loss: 1.8859


Training:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch 2 average loss: 1.6335


Training:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch 3 average loss: 1.6816


Training:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch 4 average loss: 1.6716


Training:   0%|          | 0/313 [00:00<?, ?it/s]

Epoch 5 average loss: 1.6144


## Generate Recommendations with BERT4Rec
To recommend movies for a user, we mask the next position in their sequence and let the model predict likely movies. We then select the top predictions as recommendations.

In [21]:
# Example: Recommend movies for the first user in the dataset
user_seq = indexed_sequences[0][-max_seq_len:]  # Most recent interactions
seen_movies = set(user_seq)

# Prepare input: mask the last position
input_seq = user_seq.copy()
if len(input_seq) < max_seq_len:
    input_seq = [0] * (max_seq_len - len(input_seq)) + input_seq
input_seq[-1] = len(movie2idx) + 1  # [MASK] token
input_tensor = torch.tensor([input_seq]).to(device)
# Create attention mask for inference
attention_mask = (input_tensor != 0).long()

model.eval()
with torch.no_grad():
    outputs = model(input_tensor, attention_mask=attention_mask)
    logits = outputs.logits
    masked_pos = -1  # Last position
    probs = logits[0, masked_pos].softmax(dim=-1)
    topk = torch.topk(probs, k=10)
    rec_indices = topk.indices.cpu().numpy()
    # Filter out padding, mask, and already seen movies
    rec_movies = [idx2movie[idx] for idx in rec_indices if idx in idx2movie and idx not in seen_movies][:10]  # Top 10 recommendations

print("Recommended movie IDs:", rec_movies)
# Optionally, map to movie titles
rec_movie_ids = [movie_id for rating, movie_id in rec_movies if rating > 3]

movie_titles = movies_df.set_index('movieId').loc[rec_movie_ids]['title'].tolist()
print("Recommended movie titles:", movie_titles)

Recommended movie IDs: [(3.0, 377), (4.0, 110), (5.0, 318), (5.0, 3052), (4.0, 480), (5.0, 527), (4.0, 590), (4.5, 50), (3.0, 708)]
Recommended movie titles: ['Braveheart (1995)', 'Shawshank Redemption, The (1994)', 'Dogma (1999)', 'Jurassic Park (1993)', "Schindler's List (1993)", 'Dances with Wolves (1990)', 'Usual Suspects, The (1995)']


In [None]:
rec_movie_ids_only = [movie_id for _, movie_id in rec_movies]
print(rec_movie_ids_only)