## Content

In this blog post, we will define our transformer model and generate personalized recommendations based on user sequences at problemLens dataset. From data pre-processing and model training to making the final predictions, we will go through all steps one by one.

In [31]:
import math
import os
from tempfile import TemporaryDirectory
from typing import Tuple

import torch
from torch import nn, Tensor
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset
from torchtext.vocab import vocab
from torch.utils.data import DataLoader, Dataset
from torch.nn.utils.rnn import pad_sequence

from collections import Counter

from zipfile import ZipFile
from urllib.request import urlretrieve

import pandas as pd
import numpy as np

import time

# 1. Data Preprocessing
In this section, we'll start by loading the problemLens dataset. We will then construct vocabularies for problem IDs and user IDs, and create sequences of user interactions. These steps lay the groundwork for our recommendation model, converting the data into a format that our model can utilize effectively.
## 1.1 Loading Dataset
At first we will download our dataset to generate our sequences and vocabularies. Then user_id and problem_id values are processesed to fix their data types.

In [32]:
# urlretrieve("http://files.grouplens.org/datasets/problemlens/ml-1m.zip", "problemlens.zip")
# ZipFile("problemlens.zip", "r").extractall()

In [33]:
interactions = pd.read_csv(
    "/content/users_problems.csv",
    sep=",",
    names=["user_id", "problem_id", "timestamp"],
    skiprows=1
)


In [34]:
# Preventing ids to be written as integer or float data type

interactions["problem_id"] = interactions["problem_id"].apply(lambda x: f"problem_{x}")
interactions["user_id"] = interactions["user_id"].apply(lambda x: f"user_{x}")

In [35]:
interactions

Unnamed: 0,user_id,problem_id,timestamp
0,user_orzdevinwang,problem_1428:G2,1602994360
1,user_orzdevinwang,problem_1428:G1,1602992678
2,user_orzdevinwang,problem_1428:F,1602985166
3,user_orzdevinwang,problem_1428:E,1602983894
4,user_orzdevinwang,problem_1428:D,1602983856
...,...,...,...
14108,user_sv1shan,problem_1416:C,1665427829
14109,user_sv1shan,problem_1499:D,1665339759
14110,user_sv1shan,problem_1370:E,1665305440
14111,user_sv1shan,problem_850:B,1665257679


## 1.2 Creating Vocabulary
Now that we have our data ready, it's time to prepare our vocabularies for user IDs and problem IDs. This step will convert the unique IDs into numerical indices that our model can use. The following code snippet accomplishes this task.

In [36]:
# Generating a list of unique problem ids
problem_ids = interactions.problem_id.unique()

# Counter is used to feed problems to movive_vocab
problem_counter = Counter(problem_ids)

# Genarting vocabulary
problem_vocab = vocab(problem_counter, specials=['<unk>'])

# For indexing input ids
problem_vocab_stoi = problem_vocab.get_stoi()

# problem to title mapping dictionary
# problem_title_dict = dict(zip(problems.problem_id, problems.title))

# Similarly generating a vocabulary for user ids
user_ids = interactions.user_id.unique()
user_counter = Counter(user_ids)
user_vocab = vocab(user_counter, specials=['<unk>'])
user_vocab_stoi = user_vocab.get_stoi()

## 1.3 Generating Sequences
All interactions of users are first sorted by their interaction timestamp and then divided into sub sequences to train our model.

In [37]:
# Group ratings by user_id in order of increasing unix_timestamp.
ratings_group = interactions.sort_values(by=["timestamp"]).groupby("user_id")

interactions_data = pd.DataFrame(
    data={
        "user_id": list(ratings_group.groups.keys()),
        "problem_ids": list(ratings_group.problem_id.apply(list)),
        "timestamps": list(ratings_group.timestamp.apply(list)),
    }
)

In [38]:
interactions_data

Unnamed: 0,user_id,problem_ids,timestamps
0,user_0wuming0,"[problem_1200:E, problem_1451:F, problem_723:A...","[1606018760, 1606019571, 1606042942, 160622532..."
1,user_0x0002,"[problem_1842:G, problem_1824:B2, problem_1824...","[1705650931, 1705651653, 1705651717, 170567142..."
2,user_160cm,"[problem_104789:A, problem_104789:B, problem_1...","[1700213033, 1700214053, 1700467666, 170047034..."
3,user_1L1YA,"[problem_1270:G, problem_1601:D, problem_1882:...","[1695582688, 1695642605, 1695678230, 169570815..."
4,user_36champ,"[problem_1718:B, problem_1722:A, problem_1722:...","[1660672497, 1662086644, 1662086765, 166208730..."
...,...,...,...
111,user_yyyz04,"[problem_1582:F2, problem_1582:G, problem_1601...","[1635239587, 1635248059, 1635305780, 163567526..."
112,user_zdc123456,"[problem_1779:E, problem_1442:D, problem_743:C...","[1674379798, 1674388745, 1674457695, 167446782..."
113,user_zhouqixuan1,"[problem_1928:F, problem_1928:E, problem_1928:...","[1707720100, 1707720144, 1707720204, 170772023..."
114,user_zjy2008,"[problem_1806:D, problem_156:D, problem_1603:C...","[1679151237, 1679190604, 1679194023, 167927464..."


In [39]:
# Sequence length, min history count and window slide size
sequence_length = 5
min_history = 1
step_size = 2

# Creating sequences from lists with sliding window
def create_sequences(values, window_size, step_size, min_history):
  sequences = []
  start_index = 0
  while len(values[start_index:]) > min_history:
    seq = values[start_index : start_index + window_size]
    sequences.append(seq)
    start_index += step_size
  return sequences

interactions_data.problem_ids = interactions_data.problem_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size, min_history)
)


del interactions_data["timestamps"]

In [40]:
interactions_data

Unnamed: 0,user_id,problem_ids
0,user_0wuming0,"[[problem_1200:E, problem_1451:F, problem_723:..."
1,user_0x0002,"[[problem_1842:G, problem_1824:B2, problem_182..."
2,user_160cm,"[[problem_104789:A, problem_104789:B, problem_..."
3,user_1L1YA,"[[problem_1270:G, problem_1601:D, problem_1882..."
4,user_36champ,"[[problem_1718:B, problem_1722:A, problem_1722..."
...,...,...
111,user_yyyz04,"[[problem_1582:F2, problem_1582:G, problem_160..."
112,user_zdc123456,"[[problem_1779:E, problem_1442:D, problem_743:..."
113,user_zhouqixuan1,"[[problem_1928:F, problem_1928:E, problem_1928..."
114,user_zjy2008,"[[problem_1806:D, problem_156:D, problem_1603:..."


In [41]:
# Sub-sequences are exploded.
# Since there might be more than one sequence for each user.
interactions_data_transformed = interactions_data[["user_id", "problem_ids"]].explode(
    "problem_ids", ignore_index=True
)

interactions_data_transformed.rename(
    columns={"problem_ids": "sequence_problem_ids"},
    inplace=True,
)

In [42]:
print(interactions_data_transformed.sample(frac=1).reset_index(drop=True).head())
print(interactions_data_transformed.shape)

          user_id                               sequence_problem_ids
0     user__Gawd_  [problem_1914:G1, problem_833:B, problem_1515:...
1   user_MIKEFENG  [problem_431:E, problem_997:C, problem_845:C, ...
2    user_Mu_Silk  [problem_1952:J, problem_1952:C, problem_1952:...
3  user_Tricopter  [problem_1887:C, problem_378:E, problem_349:D,...
4  user_AA_Surely  [problem_19:D, problem_540:E, problem_609:F, p...
(7024, 2)


## 1.4 Train Test Split
The data is split into training and testing sets. Although considering timestamps could potentially provide a more refined split, for the sake of simplicity, we opt for a random indexing approach.

In [43]:
# Random indexing
random_selection = np.random.rand(len(interactions_data_transformed.index)) <= 0.85

# Split train data
df_train_data = interactions_data_transformed[random_selection]
train_data_raw = df_train_data[["user_id", "sequence_problem_ids"]].values

# Split test data
df_test_data = interactions_data_transformed[~random_selection]
test_data_raw = df_test_data[["user_id", "sequence_problem_ids"]].values

In [44]:
df_train_data

Unnamed: 0,user_id,sequence_problem_ids
1,user_0wuming0,"[problem_723:A, problem_1344:A, problem_1344:B..."
2,user_0wuming0,"[problem_1344:B, problem_1344:C, problem_1364:..."
3,user_0wuming0,"[problem_1364:A, problem_469:A, problem_262:A,..."
4,user_0wuming0,"[problem_262:A, problem_1426:A, problem_734:B,..."
6,user_0wuming0,"[problem_1391:A, problem_1455:B, problem_1433:..."
...,...,...
7017,user_zlxFTH,"[problem_1809:B, problem_1809:C, problem_1809:..."
7019,user_zlxFTH,"[problem_786:B, problem_1837:A, problem_1837:B..."
7020,user_zlxFTH,"[problem_1837:B, problem_1837:C, problem_1837:..."
7021,user_zlxFTH,"[problem_1837:D, problem_1837:E, problem_1837:..."


In [45]:
train_data_raw

array([['user_0wuming0',
        list(['problem_723:A', 'problem_1344:A', 'problem_1344:B', 'problem_1344:C', 'problem_1364:A'])],
       ['user_0wuming0',
        list(['problem_1344:B', 'problem_1344:C', 'problem_1364:A', 'problem_469:A', 'problem_262:A'])],
       ['user_0wuming0',
        list(['problem_1364:A', 'problem_469:A', 'problem_262:A', 'problem_1426:A', 'problem_734:B'])],
       ...,
       ['user_zlxFTH',
        list(['problem_1837:B', 'problem_1837:C', 'problem_1837:D', 'problem_1837:E', 'problem_1837:F'])],
       ['user_zlxFTH',
        list(['problem_1837:D', 'problem_1837:E', 'problem_1837:F', 'problem_1824:A', 'problem_1824:B1'])],
       ['user_zlxFTH', list(['problem_1824:B1', 'problem_1824:B2'])]],
      dtype=object)

DataLoader is defined to be used for training and evaluation as final pre-processing step.

In [46]:
# Pytorch Dataset for user interactions
class problemSeqDataset(Dataset):
    # Initialize dataset
    def __init__(self, data, problem_vocab_stoi, user_vocab_stoi):
        self.data = data

        self.problem_vocab_stoi = problem_vocab_stoi
        self.user_vocab_stoi = user_vocab_stoi


    def __len__(self):
        return len(self.data)

    # Fetch data from the dataset
    def __getitem__(self, idx):
        user, problem_sequence = self.data[idx]
        # Directly index into the vocabularies
        problem_data = [self.problem_vocab_stoi[item] for item in problem_sequence]
        user_data = self.user_vocab_stoi[user]
        return torch.tensor(problem_data), torch.tensor(user_data)


# Collate function and padding
def collate_batch(batch):
    problem_list = [item[0] for item in batch]
    user_list = [item[1] for item in batch]
    return pad_sequence(problem_list, padding_value=problem_vocab_stoi['<unk>'], batch_first=True), torch.stack(user_list)


BATCH_SIZE = 256
# Create instances of your Dataset for each set
train_dataset = problemSeqDataset(train_data_raw, problem_vocab_stoi, user_vocab_stoi)
val_dataset = problemSeqDataset(test_data_raw, problem_vocab_stoi, user_vocab_stoi)

# Create DataLoaders
train_iter = DataLoader(train_dataset, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=collate_batch)
val_iter = DataLoader(val_dataset, batch_size=BATCH_SIZE,
                      shuffle=False, collate_fn=collate_batch)


In [47]:
print(train_dataset[0])

(tensor([ 754, 1678, 1677, 1676, 1675]), tensor(18))


# 2. Model Definition
In this section we will define and initialize our model. Then the model will be trained with our previously generated dataset.
## 2.1 Positional Encoder
We start by defining the positional encoder, which is crucial for sequence-based models like the Transformer. This encoder will capture the positions of problem interactions in our sequences, thus embedding the order information that the Transformer model needs.

In [48]:
# class PositionalEncoding(nn.Module):

#     def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
#         super().__init__()
#         self.dropout = nn.Dropout(p=dropout)

#         position = torch.arange(max_len).unsqueeze(1)

#         # `div_term` is used in the calculation of the sinusoidal values.
#         div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))

#         # Initializing positional encoding matrix with zeros.
#         pe = torch.zeros(max_len, 1, d_model)

#         # Calculating the positional encodings.
#         pe[:, 0, 0::2] = torch.sin(position * div_term)
#         pe[:, 0, 1::2] = torch.cos(position * div_term)
#         self.register_buffer('pe', pe)

#     def forward(self, x: Tensor) -> Tensor:
#         """
#         Arguments:
#             x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
#         """
#         x = x + self.pe[:x.size(0)]
#         return self.dropout(x)

## 2.2 Transformer Model
Following the definition of our positional encoder, we then establish our transformer model. This model takes both the user id and the problem id sequence as input, and it is responsible for generating the output problem predictions.

In [49]:
# class TransformerModel(nn.Module):
#     def __init__(self, ntoken: int, nuser: int, d_model: int, nhead: int, d_hid: int,
#                  nlayers: int, dropout: float = 0.5):
#         super().__init__()
#         self.model_type = 'Transformer'
#         # positional encoder
#         self.pos_encoder = PositionalEncoding(d_model, dropout)

#         # Multihead attention mechanism.
#         encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
#         self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)

#         # Embedding layers
#         self.problem_embedding = nn.Embedding(ntoken, d_model)
#         self.user_embedding = nn.Embedding(nuser, d_model)

#         # Defining the size of the input to the model.
#         self.d_model = d_model

#         # Linear layer to map the output toproblem vocabulary.
#         self.linear = nn.Linear(2*d_model, ntoken)

#         self.init_weights()

#     def init_weights(self) -> None:
#         # Initializing the weights of the embedding and linear layers.
#         initrange = 0.1
#         self.problem_embedding.weight.data.uniform_(-initrange, initrange)
#         self.user_embedding.weight.data.uniform_(-initrange, initrange)
#         self.linear.bias.data.zero_()
#         self.linear.weight.data.uniform_(-initrange, initrange)

#     def forward(self, src: Tensor, user: Tensor, src_mask: Tensor = None) -> Tensor:
#         # Embedding problem ids and userid
#         problem_embed = self.problem_embedding(src) * math.sqrt(self.d_model)
#         user_embed = self.user_embedding(user) * math.sqrt(self.d_model)

#         # positional encoding
#         problem_embed = self.pos_encoder(problem_embed)

#         # generating output with final layers
#         output = self.transformer_encoder(problem_embed, src_mask)

#         # Expand user_embed tensor along the sequence length dimension
#         user_embed = user_embed.expand(-1, output.size(1), -1)

#         # Concatenate user embeddings with transformer output
#         output = torch.cat((output, user_embed), dim=-1)

#         output = self.linear(output)
#         return output


In [50]:
class LSTMModel(nn.Module):
    def __init__(self, ntoken: int, nuser: int, d_model: int, d_hid: int, nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'LSTM'

        # Embedding layers
        self.problem_embedding = nn.Embedding(ntoken, d_model)
        self.user_embedding = nn.Embedding(nuser, d_model)

        # LSTM layers
        self.lstm = nn.LSTM(d_model, d_hid, nlayers, batch_first=True, dropout=dropout)

        self.d_model = d_model

        # Linear layer to map the LSTM output to problem vocabulary
        self.linear = nn.Linear(2*d_hid, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        # Initializing the weights of the embedding and linear layers
        initrange = 0.1
        self.problem_embedding.weight.data.uniform_(-initrange, initrange)
        self.user_embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, user: Tensor) -> Tensor:
        # Embedding problem ids and user id
        problem_embed = self.problem_embedding(src)* math.sqrt(self.d_model)
        user_embed = self.user_embedding(user)* math.sqrt(self.d_model)

        # print("problem_embed shape:", problem_embed.shape)
        # print("user_embed shape:", user_embed.shape)

        # Pass the combined embeddings through LSTM layers
        lstm_output, _ = self.lstm(problem_embed)

        # print("lstm_output", lstm_output.shape)

        user_embed = user_embed.expand(-1, lstm_output.size(1), -1)
        # print("user_embed", user_embed.shape)

        output = torch.cat((lstm_output, user_embed), dim=-1)

        # print("hello", output.shape)

        # Apply linear layer to obtain the output logits
        output = self.linear(output)

        return output


Following the model definitions, we proceed to initialize our model using a set of arbitrarily selected hyperparameters.

In [51]:
ntokens = len(problem_vocab)  # size of vocabulary
nusers = len(user_vocab)
d_model = 128  # embedding dimension
d_hid = 128  # dimension of the LSTM hidden states
nlayers = 2  # number of LSTM layers
dropout = 0.2  # dropout probability

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMModel(ntokens, nusers, d_model, d_hid, nlayers, dropout).to(device)

criterion = nn.CrossEntropyLoss()
lr = 1.0  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)


In [52]:
# ntokens = len(problem_vocab)  # size of vocabulary
# nusers = len(user_vocab)
# emsize = 128  # embedding dimension
# d_hid = 128  # dimension of the feedforward network model
# nlayers = 2  # number of ``nn.TransformerEncoderLayer``
# nhead = 2  # number of heads in ``nn.MultiheadAttention``
# dropout = 0.2  # dropout probability

# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# model = TransformerModel(ntokens, nusers, emsize, nhead, d_hid, nlayers, dropout).to(device)

# criterion = nn.CrossEntropyLoss()
# lr = 1.0  # learning rate
# optimizer = torch.optim.SGD(model.parameters(), lr=lr)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.95)

# 3. Train & Evaluation
We're now ready to kick off the training process with our model, where it will learn from the dataset we've prepared. Following the training phase, we'll evaluate how well our model performs on unseen data to check its effectiveness.
## 3.1 Train Function

In [53]:
# def train(model: nn.Module, train_iter, epoch) -> None:
#     # Switch to training mode
#     model.train()
#     total_loss = 0.
#     log_interval = 200
#     start_time = time.time()

#     for i, (problem_data, user_data) in enumerate(train_iter):
#         # Load problem sequence and user id
#         problem_data, user_data = problem_data.to(device), user_data.to(device)
#         user_data = user_data.reshape(-1, 1)

#         # Split problem sequence to inputs and targets
#         inputs, targets = problem_data[:, :-1], problem_data[:, 1:]
#         targets_flat = targets.reshape(-1)

#         # Predict problems
#         output = model(inputs, user_data)
#         output_flat = output.reshape(-1, ntokens)

#         # Backpropogation process
#         loss = criterion(output_flat, targets_flat)
#         optimizer.zero_grad()
#         loss.backward()
#         torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
#         optimizer.step()

#         total_loss += loss.item()
#         # Results
#         if i % log_interval == 0 and i > 0:
#             lr = scheduler.get_last_lr()[0]
#             ms_per_batch = (time.time() - start_time) * 1000 / log_interval
#             cur_loss = total_loss / log_interval
#             ppl = math.exp(cur_loss)
#             print(f'| epoch {epoch:3d} '
#                   f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
#                   f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
#             total_loss = 0
#             start_time = time.time()

In [54]:
def train(model: nn.Module, train_iter, epoch) -> None:
    # Switch to training mode
    model.train()
    total_loss = 0.
    log_interval = 200
    start_time = time.time()


    for i, (problem_data, user_data) in enumerate(train_iter):
        # Load problem sequence and user id
        problem_data, user_data = problem_data.to(device), user_data.to(device)
        # user_data = user_data.reshape(-1, 1)

        # Split problem sequence to inputs and targets
        inputs, targets = problem_data[:, :-1], problem_data[:, 1:]
        targets_flat = targets.reshape(-1)
        user_data = user_data.unsqueeze(1)

        # print(f"Input shape before LSTM: {inputs.shape}")  # Print the shape of inputs

        # if (i==0):
        #   print("problems", problem_data.shape)
        #   print("inputs", inputs.shape)
        #   print("targets", targets.shape)
        #   print("targets_flat", targets_flat.shape)
        #   print("user", user_data.shape)

        # Predict problems
        output = model(inputs, user_data)

        # if (i==0):
        #   print("outputs", output.shape)

        # Backpropogation process
        loss = criterion(output.view(-1, ntokens), targets_flat)  # Use ntokens for the output view
        # if (i==0):
        #   print("loss_output",output.view(-1, ntokens).shape)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        # Results
        if i % log_interval == 0 and i > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()



## 3.2 Evaluation Function

In [55]:
# def evaluate(model: nn.Module, eval_data: Tensor) -> float:
#     # Switch the model to evaluation mode.
#     # This is necessary for layers like dropout,
#     model.eval()
#     total_loss = 0.

#     with torch.no_grad():
#         for i, (problem_data, user_data) in enumerate(eval_data):
#             # Load problem sequence and user id
#             problem_data, user_data = problem_data.to(device), user_data.to(device)
#             user_data = user_data.reshape(-1, 1)
#             # Split problem sequence to inputs and targets
#             inputs, targets = problem_data[:, :-1], problem_data[:, 1:]
#             targets_flat = targets.reshape(-1)
#             # Predict problems
#             output = model(inputs, user_data)
#             output_flat = output.reshape(-1, ntokens)
#             # Calculate loss
#             loss = criterion(output_flat, targets_flat)
#             total_loss += loss.item()
#     return total_loss / (len(eval_data) - 1)

In [56]:
def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    # Switch the model to evaluation mode.
    # This is necessary for layers like dropout,
    model.eval()
    total_loss = 0.

    with torch.no_grad():
        for i, (problem_data, user_data) in enumerate(eval_data):
            # Load problem sequence and user id
            problem_data, user_data = problem_data.to(device), user_data.to(device)

            # Split problem sequence to inputs and targets
            inputs, targets = problem_data[:, :-1], problem_data[:, 1:]
            targets_flat = targets.reshape(-1)
            user_data = user_data.unsqueeze(1)

            # Predict problems
            output = model(inputs, user_data)

            # Calculate loss
            loss = criterion(output.view(-1, ntokens), targets_flat)  # Reshape output for loss calculation
            total_loss += loss.item()

    # Return average loss over all batches
    return total_loss / len(eval_data)


## 3.3 Train & Evaluation Loop

In [57]:
best_val_loss = float('inf')
epochs = 10

with TemporaryDirectory() as tempdir:
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")

    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()

        # Training
        train(model, train_iter, epoch)

        # Evaluation
        val_loss = evaluate(model, val_iter)

        # Compute the perplexity of the validation loss
        val_ppl = math.exp(val_loss)
        elapsed = time.time() - epoch_start_time

        # Results
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)

        # Save best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

        scheduler.step()

    # After training, load the best model parameters
    model.load_state_dict(torch.load(best_model_params_path))


-----------------------------------------------------------------------------------------
| end of epoch   1 | time:  7.11s | valid loss  8.69 | valid ppl  5937.09
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   2 | time:  6.06s | valid loss  8.61 | valid ppl  5482.72
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   3 | time:  6.81s | valid loss  8.53 | valid ppl  5063.97
-----------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------
| end of epoch   4 | time:  5.82s | valid loss  8.44 | valid ppl  4630.86
--------------------------------------------------------------------------

## 3.4 Generating Popular problem Recommendations as Baseline
In order to compare our model success a baseline recommendation method is required. One of the easiest recommendation method is popular problem recommendation which is obtained by most frequent and highly rated problems.

In [58]:
# def get_popular_problems(interactions):
#   # Calculate the number of ratings for each problem
#   rating_counts = interactions['problem_id'].value_counts().reset_index()
#   rating_counts.columns = ['problem_id', 'rating_count']

#   # Get the most frequently rated problems
#   min_ratings_threshold = rating_counts['rating_count'].quantile(0.95)

#   # Filter problems based on the minimum number of ratings
#   popular_problems = interactions.merge(rating_counts, on='problem_id')
#   popular_problems = popular_problems[popular_problems['rating_count'] >= min_ratings_threshold]


#   # Calculate the average rating for each problem
#   average_ratings = popular_problems.groupby('problem_id')['rating'].mean().reset_index()
#   # Get the top 10 rated problems
#   top_10_problems = list(average_ratings.sort_values('rating', ascending=False).head(10).problem_id.values)
#   return top_10_problems

In [59]:
# top_10_problems = get_popular_problems(interactions)
# [problem_title_dict[problem] for problem in top_10_problems]

## 3.5 Recommendations Result Comparison
Like the evaluation function we will iterate our validation dataset and store recommendation results in lists to compare them with normalized discounted gain(NDCG) metric.

In [60]:
# # problem id decoder
# problem_vocab_itos = problem_vocab.get_itos()

# # A placeholders to store results of recommendations
# transformer_reco_results = list()
# popular_reco_results = list()

# # Get top 10 problems
# k = 10
# # Iterate over the validation data
# for i, (problem_data, user_data) in enumerate(val_iter):
#     # Feed the input and get the outputs
#     problem_data, user_data = problem_data.to(device), user_data.to(device)
#     user_data = user_data.reshape(-1, 1)
#     inputs, targets = problem_data[:, :-1], problem_data[:, 1:]
#     output = model(inputs, user_data)
#     output_flat = output.reshape(-1, ntokens)
#     targets_flat = targets.reshape(-1)

#     # Reshape the output_flat to get top predictions
#     outputs = output_flat.reshape(output_flat.shape[0] // inputs.shape[1],
#                                   inputs.shape[1],
#                                   output_flat.shape[1])[: , -1, :]
#     # k + len(inputs) = 13 problems obtained
#     # In order to prevent to recommend already watched problems
#     values, indices = outputs.topk(k + inputs.shape[1], dim=-1)

#     for sub_sequence, sub_indice_org in zip(problem_data, indices):
#         sub_indice_org = sub_indice_org.cpu().detach().numpy()
#         sub_sequence = sub_sequence.cpu().detach().numpy()

#         # Generate mask array to eliminate already watched problems
#         mask = np.isin(sub_indice_org, sub_sequence[:-1], invert=True)

#         # After masking get top k problems
#         sub_indice = sub_indice_org[mask][:k]

#         # Generate results array
#         transformer_reco_result = np.isin(sub_indice, sub_sequence[-1]).astype(int)

#         # Decode problem to search in popular problems
#         target_problem_decoded = problem_vocab_itos[sub_sequence[-1]]
#         popular_reco_result = np.isin(top_10_problems, target_problem_decoded).astype(int)

#         transformer_reco_results.append(transformer_reco_result)
#         popular_reco_results.append(popular_reco_result)

After generating result for each recommendation now time to compare baseline method vs transformer model.

In [61]:
# from sklearn.metrics import ndcg_score

# # Since we have already sorted our recommendations
# # An array that represent our recommendation scores is used.
# representative_array = [[i for i in range(k, 0, -1)]] * len(transformer_reco_results)

# for k in [3, 5, 10]:
#   transformer_result = ndcg_score(transformer_reco_results,
#                                   representative_array, k=k)
#   popular_result = ndcg_score(popular_reco_results,
#                               representative_array, k=k)

#   print(f"Transformer NDCG result at top {k}: {round(transformer_result, 4)}")
#   print(f"Popular recommendation NDCG result at top {k}: {round(popular_result, 4)}\n\n")


Here we have seen our model results are approximately 10 times better than popular problem recommendation at NDCG metric. A function to generate recommendation for single data is given below.

In [62]:
def generate_recommendation(user_id, problem_sequence, k=5):
    model.eval()
    input_sequence = problem_sequence[:-1]
    # Tokenize and numerically encode the user id and problem sequence
    user_tensor = torch.tensor(user_vocab_stoi[user_id])
    problem_tensor = torch.tensor([[problem_vocab_stoi[problem_id]] for problem_id in input_sequence])
    # Shape: [1, 1]
    user_tensor = user_tensor.unsqueeze(0).to(device)
    user_tensor = user_tensor.view(user_tensor.shape[0], 1)

    # Shape: [1, seq_length]
    problem_tensor = problem_tensor.unsqueeze(0).to(device)[0]
    problem_tensor = problem_tensor.view(1, problem_tensor.shape[0])

    # Pass the tensors through the model
    with torch.no_grad():
        predictions = model(problem_tensor, user_tensor)

    # The output is a probability distribution over the next problem.
    # Topk to get most probable problems
    values, indices = predictions.topk(k + len(input_sequence), dim=-1)

    # Eliminate already watched problems
    indices = [indice for indice in indices[-1, :][0] if indice not in problem_tensor][:k]
    predicted_problems = [problem_vocab.get_itos()[problem] for problem in indices]
    return predicted_problems

In [63]:
len(test_data_raw)

1090

In [73]:
row_iter = test_data_raw[600]
print("Input Sequence:")
print("-" + "\n-".join([ea_problem for ea_problem in row_iter[1][:-1]]))
recos = '\n-'.join(generate_recommendation(row_iter[0],row_iter[1]))

print(f"Recomendations:\n-{recos}")

Input Sequence:
-problem_293:E
-problem_1859:E
-problem_1920:F1
-problem_1918:E
Recomendations:
-problem_1922:F
-problem_1864:F
-problem_813:F
-problem_1919:D
-problem_1499:F


In [65]:
row_iter

array(['user_Fido_Puppy',
       list(['problem_1338:D', 'problem_119:D', 'problem_165:E', 'problem_768:D', 'problem_510:C'])],
      dtype=object)

In [66]:
row_iter[0] = '<unk>'

In [67]:
row_iter

array(['<unk>',
       list(['problem_1338:D', 'problem_119:D', 'problem_165:E', 'problem_768:D', 'problem_510:C'])],
      dtype=object)

In [68]:
# user_vocab_stoi

In [69]:
generate_recommendation(row_iter[0],row_iter[1])

['<unk>',
 'problem_1709:A',
 'problem_101991:A',
 'problem_545:E',
 'problem_1719:C']

# Conclusion
In this blog post, we have made an attempt to use the Transformer model, known for its effectiveness in NLP, to create a personalized problem recommendation system. We've gone through from data preprocessing to prediction step using the problemLens dataset. While this is a starting point and there's much more to learn, it hopefully sheds some light on how Transformer models can be used in different contexts, such as recommendation systems.

# References
https://pytorch.org/tutorials/beginner/transformer_tutorial.html
https://keras.io/examples/structured_data/problemlens_recommendations_transformers/
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ndcg_score.html