# Introduction

Personalized recommendations are crucial in today's digital world, driving user engagement on platforms like e-commerce and streaming services. However, creating effective recommendation systems is challenging due to the complexity of predicting user behavior from vast amounts of sequential data. Traditional methods often struggle with this task. That's why I decided to implement a Transformer architecture, known for its ability to process sequential data with self-attention mechanisms that capture long-range dependencies and intricate patterns in user interactions.


For this project, I chose the `MovieLens` dataset because it is one of the most popular and accessible datasets for recommender systems, making it easy for anyone to replicate and build upon. Using user IDs and movie sequences, we'll build personalized movie recommendations. This project covers everything from data upload and preprocessing to sequence generation, building the `Transformer` model, and evaluating the results.

# Main Goal

The main goal of this project is to help others understand Transformer architectures and their logic, making them easier to digest. Every step is well-documented, which should provide clarity and insight into how Transformers can enhance recommendation systems. Whether you're new to this field or looking to deepen your understanding, this project is designed to be informative and engaging.

# Imports

Here we will import all the necessary libraries for this project.

In [172]:
# Standard Libraries
import time
import math
import os
from collections import Counter
from tempfile import TemporaryDirectory
from typing import Tuple
from zipfile import ZipFile
from urllib.request import urlretrieve

# Data Handling Libraries
import pandas as pd
import numpy as np

# PyTorch Libraries
import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

# PyTorch Text Libraries
from torchtext.vocab import vocab


# Downloading and Loading the Dataset

In this section, we download the MovieLens dataset and load it into pandas DataFrames.

As I mention before, the **MovieLens** dataset is a widely-used benchmark for recommender systems research. It includes user ratings for a variety of movies, along with user demographics and movie information.

The dataset is downloaded as a zip file, extracted, and then loaded into separate DataFrames for users, ratings, and movies.

In [173]:
# Downloading the dataset
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")
ZipFile("movielens.zip", "r").extractall()

# Loading the dataset
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
    engine='python'
)

ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    engine='python'
)

movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    encoding='latin-1',
    engine='python'
)


Let's take a look at the first few rows of each dataset to understand their structure and the information they contain.

In [174]:
print("Users DataFrame:")
print(users.head(), "\n")

print("Movies DataFrame:")
print(movies.head(), "\n")

print("Ratings DataFrame:")
print(ratings.head())


Users DataFrame:
   user_id sex  age_group  occupation zip_code
0        1   F          1          10    48067
1        2   M         56          16    70072
2        3   M         25          15    55117
3        4   M         45           7    02460
4        5   M         25          20    55455 

Movies DataFrame:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy 

Ratings DataFrame:
   user_id  movie_id  rating  unix_timestamp
0        1      1193       5       978300760
1        1       661       3       978302109
2        1       914       3       978301968
3        1  

# Data Preprocessing

The dataset is straightforward and well-organized, so extensive data preprocessing is not necessary. However, we do need to ensure that the IDs are formatted as strings with appropriate prefixes to avoid any potential conflicts or errors during processing.

In [175]:
users["user_id"] = "user_" + users["user_id"].astype(str)
movies["movie_id"] = "movie_" + movies["movie_id"].astype(str)
ratings["movie_id"] = "movie_" + ratings["movie_id"].astype(str)
ratings["user_id"] = "user_" + ratings["user_id"].astype(str)

In this step, we'll create dictionaries to convert the unique IDs into numerical indices that our model can use.

In [176]:
## Here we are generating a set of uninue movie IDs
unique_movie_ids = set(movies['movie_id'])

## Adding a special token for unknown values (serves as a placeholder for unknown tokens)
unique_movie_ids.add('<unk>')

## Creating a mapping from movie ID to numerical index
movie_id_to_index = {movie_id: idx for idx, movie_id in enumerate(unique_movie_ids)}

## Creating a dictionary mapping each movie ID to its corresponding title
movie_title_dict = dict(zip(movies['movie_id'], movies['title']))

movie_title_dict

{'movie_1': 'Toy Story (1995)',
 'movie_2': 'Jumanji (1995)',
 'movie_3': 'Grumpier Old Men (1995)',
 'movie_4': 'Waiting to Exhale (1995)',
 'movie_5': 'Father of the Bride Part II (1995)',
 'movie_6': 'Heat (1995)',
 'movie_7': 'Sabrina (1995)',
 'movie_8': 'Tom and Huck (1995)',
 'movie_9': 'Sudden Death (1995)',
 'movie_10': 'GoldenEye (1995)',
 'movie_11': 'American President, The (1995)',
 'movie_12': 'Dracula: Dead and Loving It (1995)',
 'movie_13': 'Balto (1995)',
 'movie_14': 'Nixon (1995)',
 'movie_15': 'Cutthroat Island (1995)',
 'movie_16': 'Casino (1995)',
 'movie_17': 'Sense and Sensibility (1995)',
 'movie_18': 'Four Rooms (1995)',
 'movie_19': 'Ace Ventura: When Nature Calls (1995)',
 'movie_20': 'Money Train (1995)',
 'movie_21': 'Get Shorty (1995)',
 'movie_22': 'Copycat (1995)',
 'movie_23': 'Assassins (1995)',
 'movie_24': 'Powder (1995)',
 'movie_25': 'Leaving Las Vegas (1995)',
 'movie_26': 'Othello (1995)',
 'movie_27': 'Now and Then (1995)',
 'movie_28': 'Persu

Now we will do the exact same thing for user_ids.

In [177]:
## Generating a set of unique user IDs
unique_user_ids = set(users['user_id'])

## Creating a mapping from user ID to numerical index
user_id_to_index = {user_id: idx for idx, user_id in enumerate(unique_user_ids)}

## Display the user ID to index mapping
user_id_to_index


{'user_2277': 0,
 'user_568': 1,
 'user_5815': 2,
 'user_5098': 3,
 'user_5074': 4,
 'user_317': 5,
 'user_1977': 6,
 'user_3644': 7,
 'user_5517': 8,
 'user_1315': 9,
 'user_4314': 10,
 'user_5894': 11,
 'user_5766': 12,
 'user_3719': 13,
 'user_4839': 14,
 'user_2937': 15,
 'user_2484': 16,
 'user_325': 17,
 'user_2778': 18,
 'user_3434': 19,
 'user_536': 20,
 'user_3098': 21,
 'user_3328': 22,
 'user_4225': 23,
 'user_1110': 24,
 'user_2105': 25,
 'user_4418': 26,
 'user_4142': 27,
 'user_1136': 28,
 'user_2648': 29,
 'user_5063': 30,
 'user_3521': 31,
 'user_2804': 32,
 'user_4978': 33,
 'user_2459': 34,
 'user_3128': 35,
 'user_2232': 36,
 'user_5372': 37,
 'user_1560': 38,
 'user_386': 39,
 'user_3372': 40,
 'user_3959': 41,
 'user_2447': 42,
 'user_4221': 43,
 'user_5934': 44,
 'user_3419': 45,
 'user_1755': 46,
 'user_2626': 47,
 'user_3897': 48,
 'user_4841': 49,
 'user_4657': 50,
 'user_3080': 51,
 'user_3902': 52,
 'user_328': 53,
 'user_5756': 54,
 'user_5133': 55,
 'user_5

Great! We are done with data preprocessing part of the project (thankfully, it was pretty short), and now we will move on to sequence generation.

# Generating Sequences

In recommender systems, maintaining the order of events is really crucial. Sorting makes sure that the sequence of user interactions is accurately represented, which helps in making better predictions and recommendations.

Here we will need to group and sort user interactions by the timestamp.

In [178]:
# Grouping ratings by user_id in order of increasing unix_timestamp
ratings_group_by_users = ratings.sort_values(by='unix_timestamp').groupby('user_id')

# Creating a DataFrame with grouped data
final_ratings_data = pd.DataFrame({
    "user_id": list(ratings_group_by_users.groups.keys()),
    "movie_ids": ratings_group_by_users['movie_id'].apply(list).tolist(),
    "timestamps": ratings_group_by_users['unix_timestamp'].apply(list).tolist()
})

final_ratings_data


Unnamed: 0,user_id,movie_ids,timestamps
0,user_1,"[movie_3186, movie_1270, movie_1022, movie_172...","[978300019, 978300055, 978300055, 978300055, 9..."
1,user_10,"[movie_858, movie_743, movie_597, movie_1210, ...","[978224375, 978224375, 978224375, 978224400, 9..."
2,user_100,"[movie_260, movie_1676, movie_1198, movie_541,...","[977593595, 977593595, 977593607, 977593624, 9..."
3,user_1000,"[movie_2990, movie_971, movie_260, movie_2973,...","[975040566, 975040566, 975040566, 975040629, 9..."
4,user_1001,"[movie_1198, movie_2885, movie_1617, movie_390...","[975039591, 975039702, 975039702, 975039898, 9..."
...,...,...,...
6035,user_995,"[movie_247, movie_260, movie_1894, movie_74, m...","[975054785, 975054785, 975054785, 975054853, 9..."
6036,user_996,"[movie_2146, movie_1347, movie_1961, movie_274...","[975052132, 975052132, 975052195, 975052284, 9..."
6037,user_997,"[movie_1196, movie_2082, movie_2447, movie_324...","[975044235, 975044425, 975044426, 975044426, 9..."
6038,user_998,"[movie_2266, movie_1641, movie_1097, movie_126...","[975043499, 975043593, 975043593, 975043593, 9..."


Now that the dataset is prepared, we will create sequences.

In [179]:
# First, let's specify some parameters

sequence_length = 4  # Length of each sequence
min_history = 1  # Minimum number of items that must remain in the list to continue creating sequences
step_size = 2  # Number of steps to move the sliding window for the next sequence

## function to create sequences from data
def create_sequences(values, window_size, step_size, min_history):
    sequences = []
    start_index = 0

    while len(values[start_index:]) > min_history:
        seq = values[start_index : start_index + window_size]
        if len(seq) == window_size:  # make sure sequence length matches the desired length
            sequences.append(seq)
        start_index += step_size

    return sequences

# applying the sequence creation function to the 'movie_ids' column
final_ratings_data['movie_ids'] = final_ratings_data['movie_ids'].apply(
    lambda ids: create_sequences(ids, sequence_length, step_size, min_history)
)



In [180]:
# deleting the timestamps column since it's not needed
del final_ratings_data["timestamps"]

## make sure to explode the sequences into separate rows
ratings_data_transformed = final_ratings_data.explode('movie_ids', ignore_index=True)

## renaming the column appropriately
ratings_data_transformed.rename(columns={'movie_ids': 'sequence_movie_ids'}, inplace=True)

ratings_data_transformed

Unnamed: 0,user_id,sequence_movie_ids
0,user_1,"[movie_3186, movie_1270, movie_1022, movie_1721]"
1,user_1,"[movie_1022, movie_1721, movie_2340, movie_1836]"
2,user_1,"[movie_2340, movie_1836, movie_3408, movie_1207]"
3,user_1,"[movie_3408, movie_1207, movie_2804, movie_1193]"
4,user_1,"[movie_2804, movie_1193, movie_720, movie_260]"
...,...,...
492578,user_999,"[movie_24, movie_2264, movie_2540, movie_2676]"
492579,user_999,"[movie_2540, movie_2676, movie_1363, movie_765]"
492580,user_999,"[movie_1363, movie_765, movie_3565, movie_1410]"
492581,user_999,"[movie_3565, movie_1410, movie_2269, movie_2504]"


Here's what we have done here:

- Created Sequences from User Ratings
  - We generated sequences of movie IDs using a sliding window approach with a length of 4 and a step size of 2, preserving the order in which movies were watched.

- Transformed the Data

  - The `create_sequences` function is applied to each user's movie list, and the sequences are exploded into separate rows, ensuring each row contains a single sequence.

- Final Output

  - The resulting DataFrame, `ratings_data_transformed`, contains `user_id` and `sequence_movie_ids`, where each row represents a user and a sequence of four movie IDs, ready for training recommendation models.

# Dataset Split

The data is split into training and testing sets using a random indexing approach, where approximately 85% of the data is allocated to the training set and the remaining 15% to the testing set. Although considering timestamps could potentially provide a more refined split, for the sake of simplicity, we opt for this random indexing method.

In [181]:
# Random indexing
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85

# Split train data
df_train_data = ratings_data_transformed[random_selection]
train_data_raw = df_train_data[["user_id", "sequence_movie_ids"]].values

# Split test data
df_test_data = ratings_data_transformed[~random_selection]
test_data_raw = df_test_data[["user_id", "sequence_movie_ids"]].values

Here we will define a custom PyTorch Dataset class, `MovieSeqDataset`, that is aimed to convert user IDs and sequences of movie IDs into numerical indices based on predefined mappings. The `collate_batch` function is then used to pad the movie sequences to ensure uniform length within each batch. Finally, DataLoader instances for the training and validation sets are created, which handle batching and shuffling of data, which should facilitate training and evaluation of the model.

In [182]:
# PyTorch Dataset for user interactions
class MovieSeqDataset(Dataset):
    # Initialize dataset
    def __init__(self, data, movie_id_to_index, user_id_to_index):
        '''
        data: List of tuples, each containing a user ID and a sequence of movie IDs
        movie_id_to_index: Dictionary mapping movie IDs to unique integers
        user_id_to_index: Dictionary mapping user IDs to unique integers
        '''
        self.data = data
        self.movie_id_to_index = movie_id_to_index
        self.user_id_to_index = user_id_to_index

    def __len__(self):
        '''
        Return the number of samples in the dataset
        '''
        return len(self.data)

    # Fetch data from the dataset
    def __getitem__(self, idx):
        '''
        idx: Index to retrieve a specific sample
        '''
        user, movie_sequence = self.data[idx]
        # Convert movie IDs to their integer representation
        movie_data = [self.movie_id_to_index.get(item, self.movie_id_to_index['<unk>']) for item in movie_sequence]
        # Convert user ID to its integer representation
        user_data = self.user_id_to_index[user]
        return torch.tensor(movie_data), torch.tensor(user_data)

# Collate function and padding
def collate_batch(batch):
    '''
    batch: List of samples fetched by the DataLoader
    '''
    movie_list = [item[0] for item in batch]
    user_list = [item[1] for item in batch]
    # Pad movie sequences to ensure they have the same length
    return pad_sequence(movie_list, padding_value=movie_id_to_index['<unk>'], batch_first=True), torch.stack(user_list)

BATCH_SIZE = 256
# Create instances of the Dataset for each set
train_dataset = MovieSeqDataset(train_data_raw, movie_id_to_index, user_id_to_index)
val_dataset = MovieSeqDataset(test_data_raw, movie_id_to_index, user_id_to_index)

# Create DataLoaders
train_iter = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=collate_batch)
val_iter = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                      shuffle=False, collate_fn=collate_batch)


# Model Definition

In this section we are finally starting the most interesting part of the project. We will define our Transformer model, starting with the crucial component of positional encoding. `Positional encodings` provide a way for the model to understand the order of elements within a sequence, which is essential for processing sequential data. We will build the positional encoder, which incorporates this positional information into the input embeddings, allowing the Transformer to effectively capture the sequence order during training.

In [183]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        '''Initializes the PositionalEncoding module.

        Args:
            d_model: The dimension of the model.
            dropout: The dropout rate to apply.
            max_len: The maximum length of the input sequences.
        '''
        super().__init__()

        ## initializing the dropout layer
        self.dropout = nn.Dropout(p=dropout)

        ## creating a tensor with position indices from 0 to max_len-1.
        ## `unsqueeze(1)` adds an extra dimension for broadcasting purposes.
        position = torch.arange(max_len).unsqueeze(1)

        ## calculating the scaling factor for positional encodings as per the formula in "Attention is All You Need".
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

        ## initializing a positional encoding matrix of zeros with shape (max_len, 1, d_model).
        pe = torch.zeros(max_len, 1, d_model)

        ## assigning the sine values to even indices in the last dimension, as per the formula PE(pos, 2i).
        pe[:, 0, 0::2] = torch.sin(position * div_term)

        ## assigning the cosine values to odd indices in the last dimension, as per the formula PE(pos, 2i+1).
        pe[:, 0, 1::2] = torch.cos(position * div_term)

        ## registering `pe` as a buffer, which means it is a persistent tensor not updated by the optimizer but saved with the model's state.
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        # Adding positional encodings to the input tensor `x`.
        # this step incorporates positional information into the input sequences.
        x = x + self.pe[:x.size(0)]
        # applying dropout to the tensor and returning it.
        return self.dropout(x)


## Explanation

The `PositionalEncoding` class we created above initializes a dropout layer and creates a positional encoding matrix using `sine` and `cosine` functions to `encode positional information` for sequences up to a specified maximum length (max_len).
- It first generates position indices, calculates scaling factors to ensure the positional encodings cover a range of frequencies, and assigns sine and cosine values to the positional encoding matrix.

- These encodings are then added to the input sequences, providing the model with information about the position of each element in the sequence.

In simple words, positional encoding is designed to give the model awareness of the order of movie sequences, which allows it to differentiate between different positions in the sequence. In our case, it adds this positional information to the sequences of movie IDs for each user, enabling the model to understand the sequential nature of user interactions with movies.

# Transformer Architecture

After setting up the positional encoder, we move on to defining our Transformer model. This model accepts both the user ID and the sequence of movie IDs as inputs, and its job is to generate movie predictions. By utilizing multi-head attention, positional encoding, and embedding layers, the model processes the sequences and makes accurate predictions based on the contextual relationships between movies and the user-specific information.

In [184]:
class TransformerModel(nn.Module):
    def __init__(self, ntoken: int, nuser: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):

        '''initializing the TransformerModel.

        Args:
            ntoken: Number of unique tokens (movies).
            nuser: Number of unique users.
            d_model: Dimension of the model (embedding size).
            nhead: Number of heads in the multihead attention mechanism.
            d_hid: Dimension of the feedforward network model.
            nlayers: Number of encoder layers.
            dropout: Dropout rate.
        '''


        super().__init__()
        self.model_type = 'Transformer'

        #Positional Encoder
        self.pos_encoder = PositionalEncoding(d_model, dropout) #initializes the positional encoding

        #Multihead attention
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers) #Initializes the transformer encoder

        #Embedding Layers
        self.movie_embedding = nn.Embedding(ntoken, d_model)
        self.user_embedding = nn.Embedding(nuser, d_model) #Initializes the embedding layers for movies and users

        # Defining the size of the input to the model.
        self.d_model = d_model

        #Linear Layer
        self.linear = nn.Linear(2*d_model, ntoken) #Initializes a linear layer to map the concatenated output

        self.init_weights()

    def init_weights(self) -> None:
        # Initializing the weights of the embedding and linear layers.
        initrange = 0.1
        self.movie_embedding.weight.data.uniform_(-initrange, initrange)
        self.user_embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange) # Uniformly initializes weights for the embeddings and linear layers

    def forward(self, src: Tensor, user: Tensor, src_mask: Tensor = None) -> Tensor:

        '''Defines the forward pass of the Transformer model.

        Args:
            src: Input tensor of shape [seq_len, batch_size].
            user: User tensor of shape [batch_size].
            src_mask: Source mask tensor of shape [seq_len, seq_len] (optional).

        Returns:
            Tensor of shape [seq_len, batch_size, ntoken].
        '''
        #Step 1
        # Embedding movie ids and user id
        movie_embed = self.movie_embedding(src) * math.sqrt(self.d_model) #Embeds the movie ids and scales by the square root of the model dimension for variance normalization
        user_embed = self.user_embedding(user) * math.sqrt(self.d_model) #Embeds the user id and scales by the square root of the model dimension for variance normalization.

        #Step 2
        movie_embed = self.pos_encoder(movie_embed) #applying positional encoding to the movie embeddings

        #Step 3
        output = self.transformer_encoder(movie_embed, src_mask) #processes the positionally encoded movie embeddings through the transformer encoder layers

        #Step 4

        user_embed = user_embed.expand(-1, output.size(1), -1) #expands the user embedding tensor to match the sequence length dimension of the output tensor

        #Step 5
        output = torch.cat((output, user_embed), dim = -1) #cancatenates the explanded user embeddings with the tranformer output along the last dimention

        #Step 6
        output = self.linear(output) #passes the concatenated tensor through the linear layer to map to the movie vocabulary size

        return output


## Explanation & Purpose
Here I would like to get into the details and explain exactly what is happening and the purpose behind each step:

### Step 1: Embedding Movie IDs and User ID

- Converts movie and user IDs into their respective embeddings and scales them for variance normalization.

- **Purpose:** To transform IDs into continuous vector representations that the model can process.

###Step 2: Applying Positional Encoding to the Movie Embeddings

- Adds positional information to the movie embeddings.

- **Purpose:** To help the model understand the order of movies in the sequence.

### Step 3: Processing Through Transformer Encoder

- Passes the positionally encoded movie embeddings through multiple layers of multi-head self-attention and feed-forward neural networks.

- **Purpose:** To really capture complex relationships and dependencies within the movie sequence.

### Step 4: Expanding User Embedding

- Expands the user embedding tensor to match the sequence length dimension of the movie embeddings.

**Purpose:** To prepare user embeddings for concatenation with movie embeddings, ensuring user-specific information is integrated.

### Step 5: Concatenating User and Movie Embeddings

- Combines the transformer output and user embeddings along the last dimension.

- **Purpose:** To merge contextual movie information with user-specific data for final prediction.

### Step 6: Passing Through Linear Layer

- Maps the combined embeddings to the movie vocabulary size.

- **Purpose:** To produce the final output predictions corresponding to movie tokens.

# Model Setup

Here's the setup for our Transformer model, including the definition of hyperparameters, the model initialization, and the configuration of the training components such as the loss function, optimizer, and learning rate scheduler.

In [185]:
ntokens = len(movie_id_to_index)  # size of movie vocabulary
nusers = len(user_id_to_index) # size of user vocabulary
emsize = 128  # embedding dimension
d_hid = 128  # dimension of the feedforward network model
nlayers = 2  # number of ``nn.TransformerEncoderLayer``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = TransformerModel(ntokens, nusers, emsize, nhead, d_hid, nlayers, dropout).to(device)

criterion = nn.CrossEntropyLoss()
lr = 1.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)



# Training Function

Our training function iterates over the training data, performing forward and backward passes through the model to update its parameters based on the computed loss. It includes steps for moving data to the appropriate device, calculating and backpropagating the loss, clipping gradients to prevent exploding gradients, and logging progress at specified intervals to monitor training performance.

I annotated every step of the code in comments bellow.

In [186]:
def train(model: nn.Module, train_iter, epoch) -> None:
  model.train() #initiating the training mode

  total_loss = 0 #initializes the total loss for the epoch to 0

  log_interval = 200 ## specifies the interval at which logs will be printed

  start_time = time.time() ### records the start time to calculate the time taken for each interval



  for i, (movie_data, user_data) in enumerate(train_iter):
    movie_data, user_data = movie_data.to(device), user_data.to(device) #moves the movie sequences and user IDs to the specified device (CPU or GPU).
    user_data = user_data.reshape(-1, 1) #reshapes the user data tensor to have a single column

    inputs, targets = movie_data[:, :-1], movie_data[:, 1:] #splits the movie sequence into inputs and targets. Inputs are all elements except the last, targets are all elements except the first.
    targets_flat = targets.reshape(-1) # flattens the targets tensor to a 1D array for calculating the loss

    #Predicting Movies
    output = model(inputs, user_data) # feeds the inputs and user data into the model to get the output predictions
    output_flat = output.reshape(-1, ntokens) ## reshapes the output tensor to match the dimensions required for the loss calculation.

    #Backprop
    loss = criterion(output_flat, targets_flat) #calculates the loss between the predicted output and the actual targets using Cross-Entropy Loss.
    optimizer.zero_grad() ## resets the gradients of all model paramenters
    loss.backward() # performs backpropagation to compute the gradients of the loss with respect to the model parameters

    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5) # clips the gradients to a maximum norm of 0.5 to prevent exploding gradients

    optimizer.step() # updates the model parameters using the gradients calculated during backpropagation

    total_loss += loss.item() ## accumulater the loss over each batch

    # Results
    if i % log_interval == 0 and i > 0:
      lr = scheduler.get_last_lr()[0] # getting the current learning rate from the scheduler

      ms_per_batch = (time.time() - start_time) * 1000 / log_interval #calculater time per batch in milliseconds

      cur_loss = total_loss / log_interval ## calculates the average loss over the logging inverval

      ppl = math.exp(cur_loss) # calculates perplexity, which is the exponentiation of the average loss.

      print(f'| epoch {epoch:3d} '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}') # prints the epoch number, learning rate, time per batch, average loss, and perplexity

      total_loss = 0 # resets the total loss for the next interval

      start_time = time.time() #resets the start time for the next interval



# Evaluation Function

To assess the model's effectiveness, we use an evaluation function that calculates the average loss on a separate validation dataset. This function runs the model in evaluation mode, where it processes the validation data without updating the model parameters, and computes the loss to gauge the model's performance on unseen data.

In [187]:
def evaluate(model: nn.Module, eval_data: Tensor) -> float:
  model.eval() #entering the evaluation mode
  total_loss = 0

  with torch.no_grad():
    for i, (movie_data, user_data) in enumerate(eval_data):
      movie_data, user_data = movie_data.to(device), user_data.to(device)
      user_data = user_data.reshape(-1, 1) # loading movie sequences and user ids

      inputs, targets = movie_data[:, :-1], movie_data[:, 1:]
      targets_flat = targets.reshape(-1) # splitting movie sequences to inputs and targets

      output = model(inputs, user_data)
      output_flat = output.reshape(-1, ntokens) # Predicting movies

      loss = criterion(output_flat, targets_flat)
      total_loss += loss.item() # calculating loss

  return total_loss / (len(eval_data) - 1)

# Training and Evaluation Loop

This section of the code handles the training and evaluation of the Transformer model over a specified number of epochs, while also implementing a mechanism to save the best-performing model parameters. It uses a TemporaryDirectory to store the best model parameters based on validation loss and iterates through the training and evaluation phases, updating the learning rate after each epoch. The loop logs performance metrics, including validation loss and perplexity, at the end of each epoch, and reloads the best model parameters once training is complete.

In [188]:
best_val_loss = float('inf')
epochs = 11

with TemporaryDirectory() as tempdir:
  best_model_params_path = os.path.join(tempdir, 'best_model_params.pt')

  for epoch in range (1, epochs + 1):
    epoch_start_time = time.time()

    #training
    train(model, train_iter, epoch)

    #evaluation
    val_loss = evaluate(model, val_iter)

    #perplexity on validation loss
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time

    #results
    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    # Save best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), best_model_params_path)

    scheduler.step()
  model.load_state_dict(torch.load(best_model_params_path)) # load best model states

| epoch   1 lr 1.00 | ms/batch 16.42 | loss  7.81 | ppl  2475.41
| epoch   1 lr 1.00 | ms/batch 11.63 | loss  7.63 | ppl  2050.54
| epoch   1 lr 1.00 | ms/batch 11.45 | loss  7.60 | ppl  2005.28
| epoch   1 lr 1.00 | ms/batch 11.42 | loss  7.58 | ppl  1961.74
| epoch   1 lr 1.00 | ms/batch 11.24 | loss  7.55 | ppl  1899.13
| epoch   1 lr 1.00 | ms/batch 15.07 | loss  7.51 | ppl  1821.61
| epoch   1 lr 1.00 | ms/batch 12.63 | loss  7.38 | ppl  1609.89
| epoch   1 lr 1.00 | ms/batch 11.52 | loss  7.22 | ppl  1363.31
-----------------------------------------------------------------------------------------
| end of epoch   1 | time: 22.83s | valid loss  7.10 | valid ppl  1208.61
-----------------------------------------------------------------------------------------
| epoch   2 lr 0.95 | ms/batch 11.64 | loss  7.00 | ppl  1094.17
| epoch   2 lr 0.95 | ms/batch 14.85 | loss  6.85 | ppl   947.67
| epoch   2 lr 0.95 | ms/batch 12.16 | loss  6.76 | ppl   864.99
| epoch   2 lr 0.95 | ms/batch 

# Baseline Model: Popular Movies

To evaluate our Transformer-based recommendation system, we need a baseline model for comparison.

A common and straightforward baseline is recommending the most popular items. The logic is simple: items that are frequently and highly rated are likely to be appealing to a wide audience. Despite its simplicity, this model can be effective in diverse environments with a variety of content.

By using the popular movies baseline, we set a basic performance benchmark, allowing us to measure the improvements offered by our Transformer model. This approach is widely used as an initial step in developing more complex recommendation systems.

In [192]:
def get_popular_movies(df_ratings):
  rating_counts = df_ratings['movie_id'].value_counts().reset_index()
  rating_counts.columns = ['movie_id', 'rating_count'] # calculating the number of ratings for each movie

  min_ratings_threshold = rating_counts['rating_count'].quantile(0.95) # getting the most frequently rated movies

  popular_movies = ratings.merge(rating_counts, on='movie_id')
  popular_movies = popular_movies[popular_movies['rating_count'] >= min_ratings_threshold] #  # filtering movies based on the minimum number of ratings

  average_ratings = popular_movies.groupby('movie_id')['rating'].mean().reset_index() # calculating the average rating for each movie

  top_10_movies = list(average_ratings.sort_values('rating', ascending=False).head(10).movie_id.values) # getting the top 10 rated movies

  return top_10_movies

In [193]:
top_10_movies = get_popular_movies(ratings)
[movie_title_dict[movie] for movie in top_10_movies]

['Shawshank Redemption, The (1994)',
 'Godfather, The (1972)',
 'Usual Suspects, The (1995)',
 "Schindler's List (1993)",
 'Raiders of the Lost Ark (1981)',
 'Star Wars: Episode IV - A New Hope (1977)',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'Casablanca (1942)',
 'Sixth Sense, The (1999)',
 "One Flew Over the Cuckoo's Nest (1975)"]

# Results Comparison

In this part, we're diving into the evaluation phase where we compare our Transformer model's recommendations to a baseline model. We start by setting up a decoder to map numerical indices back to their original movie IDs and prepare lists to store our recommendation results. We then loop through the validation dataset, inputting movie sequences and user IDs into our model to get predictions. These predictions are reshaped to pull out the top movie recommendations while filtering out any movies the user has already seen. For each sequence, we compare our model’s recommendations against the actual next movie in the sequence and the top 10 most popular movies. Finally, we convert these results into numpy arrays to make it easier to calculate performance metrics like NDCG (Normalized Discounted Cumulative Gain), setting us up for a detailed performance comparison.

In [197]:
index_to_movie_id = {idx: movie_id for movie_id, idx in movie_id_to_index.items()}

# placeholders to store results of recommendations
transformer_reco_results = list()
popular_reco_results = list()

# getting top 10 movies
k = 10

# iterating over the validation data
for i, (movie_data, user_data) in enumerate(val_iter):

    # feeding the input and get the outputs
    movie_data, user_data = movie_data.to(device), user_data.to(device)
    user_data = user_data.reshape(-1, 1)
    inputs, targets = movie_data[:, :-1], movie_data[:, 1:]
    output = model(inputs, user_data)
    output_flat = output.reshape(-1, ntokens)
    targets_flat = targets.reshape(-1)

    # reshaping the output_flat to get top predictions
    outputs = output_flat.reshape(output_flat.shape[0] // inputs.shape[1],
                                  inputs.shape[1],
                                  output_flat.shape[1])[:, -1, :]

    # k + len(inputs) = 13 movies obtained
    # In order to prevent to recommend already watched movies

    values, indices = outputs.topk(k + inputs.shape[1], dim=-1)

    for sub_sequence, sub_indice_org in zip(movie_data, indices):
        sub_indice_org = sub_indice_org.cpu().detach().numpy()
        sub_sequence = sub_sequence.cpu().detach().numpy()

        # generating mask array to eliminate already watched movies
        mask = np.isin(sub_indice_org, sub_sequence[:-1], invert=True)

        # after masking get top k movies
        sub_indice = sub_indice_org[mask][:k]

        # generating results array
        transformer_reco_result = np.isin(sub_indice, sub_sequence[-1]).astype(int)

        # decoding movie to search in popular movies
        target_movie_decoded = index_to_movie_id[sub_sequence[-1]]
        popular_reco_result = np.isin(top_10_movies, target_movie_decoded).astype(int)

        transformer_reco_results.append(transformer_reco_result)
        popular_reco_results.append(popular_reco_result)

# converting results to numpy arrays for easier calculation of metrics
transformer_reco_results = np.array(transformer_reco_results)
popular_reco_results = np.array(popular_reco_results)


**Normalized Discounted Cumulative Gain (NDCG)** is a super commonly used metric in the recommendation systems space to evaluate the quality of ranked lists.

It measures the usefulness or gain of an item based on its position in the list, where higher-ranked items contribute more to the overall score.

NDCG accounts for the relevance of recommendations by incorporating a discount factor that reduces the contribution of lower-ranked items, ensuring that items appearing earlier in the list have a higher impact on the score.

For more info about NDCG, please [read](https://medium.com/@readsumant/understanding-ndcg-as-a-metric-for-your-recomendation-system-5cd012fb3397) this article.

In [198]:
from sklearn.metrics import ndcg_score

# Since we have already sorted our recommendations
# An array that represent our recommendation scores is used.
representative_array = [[i for i in range(k, 0, -1)]] * len(transformer_reco_results)

for k in [3, 5, 10]:
  transformer_result = ndcg_score(transformer_reco_results,
                                  representative_array, k=k)
  popular_result = ndcg_score(popular_reco_results,
                              representative_array, k=k)

  print(f"Transformer NDCG result at top {k}: {round(transformer_result, 4)}")
  print(f"Popular recommendation NDCG result at top {k}: {round(popular_result, 4)}\n\n")


Transformer NDCG result at top 3: 0.0541
Popular recommendation NDCG result at top 3: 0.0044


Transformer NDCG result at top 5: 0.0689
Popular recommendation NDCG result at top 5: 0.0062


Transformer NDCG result at top 10: 0.0913
Popular recommendation NDCG result at top 10: 0.0095




Our evaluation indicates that the Transformer model **outperforms** the baseline popular recommendation method, achieving NDCG scores of 0.0541, 0.0689, and 0.0913 for the top 3, 5, and 10 recommendations, respectively.

In contrast, the baseline method scored 0.0044, 0.0062, and 0.0095 for the same metrics. This highlights the Transformer model's effectiveness in providing more relevant movie recommendations compared to the simple baseline approach.

# Generating Personal Recommendations

In this very last section of our project, we implement a function to generate **personalized movie recommendations using our trained Transformer model**.

The function, `generate_recommendation`, takes a user ID and a sequence of watched movies, then predicts the next movies the user is likely to enjoy. The model is evaluated by feeding the input tensors through the trained Transformer and extracting the top-k recommendations, ensuring that already watched movies are excluded. The output is a list of recommended movies based on the user's viewing history.

In [203]:
def generate_recommendation(user_id, movie_sequence, k=10):
    model.eval()
    input_sequence = movie_sequence[:-1]
    # tokenize and numerically encode the user id and movie sequence
    user_tensor = torch.tensor(user_id_to_index[user_id])
    movie_tensor = torch.tensor([movie_id_to_index[movie_id] for movie_id in input_sequence])
    # shape: [1, 1]
    user_tensor = user_tensor.unsqueeze(0).to(device)
    user_tensor = user_tensor.view(user_tensor.shape[0], 1)

    # shape: [1, seq_length]
    movie_tensor = movie_tensor.unsqueeze(0).to(device)[0]
    movie_tensor = movie_tensor.view(1, movie_tensor.shape[0])

    # pass the tensors through the model
    with torch.no_grad():
        predictions = model(movie_tensor, user_tensor)

    # the output is a probability distribution over the next movie.
    # topk to get most probable movies
    values, indices = predictions.topk(k + len(input_sequence), dim=-1)
    # eliminate already watched movies
    indices = [indice.item() for indice in indices[0, -1, :] if indice.item() not in movie_tensor[0]]
    indices = indices[:k]
    predicted_movies = [movie_title_dict[index_to_movie_id[movie]] for movie in indices]
    return predicted_movies


In [204]:
row_iter = test_data_raw[59233]
print("Input Sequence:")
print("-" + "\n-".join([movie_title_dict[ea_movie] for ea_movie in row_iter[1][:-1]]))
recos = '\n-'.join(generate_recommendation(row_iter[0], row_iter[1]))

print(f"Recommendations:\n-{recos}")


Input Sequence:
-Deep End of the Ocean, The (1999)
-Footloose (1984)
-Pleasantville (1998)
Recommendations:
-Clueless (1995)
-There's Something About Mary (1998)
-Grosse Pointe Blank (1997)
-Waking Ned Devine (1998)
-Clerks (1994)
-Much Ado About Nothing (1993)
-Opposite of Sex, The (1998)
-Life Is Beautiful (La Vita è bella) (1997)
-As Good As It Gets (1997)
-Full Monty, The (1997)


The recommendations generated by our Transformer model demonstrate a solid understanding of the user's preferences based on their viewing history. The suggested movies span various genres and include well-known, critically acclaimed films from the late 1990s, showcasing the model's ability to provide relevant and diverse recommendations. This approach ensures a higher likelihood of user satisfaction compared to a simple recommendation baseline.

Thank you for following along with this project. Your attention and interest are greatly appreciated!