# Transformer models in PyTorch

## Table of Contents

1. [Understanding transformer models](#understanding-transformer-models)
2. [Setting up the environment](#setting-up-the-environment)
3. [Defining the input data](#defining-the-input-data)
4. [Implementing positional encoding](#implementing-positional-encoding)
5. [Building the scaled dot-product attention mechanism](#building-the-scaled-dot-product-attention-mechanism)
6. [Implementing multi-head attention](#implementing-multi-head-attention)
7. [Building the feed-forward network](#building-the-feed-forward-network)
8. [Constructing the transformer encoder](#constructing-the-transformer-encoder)
9. [Training the transformer model](#training-the-transformer-model)
10. [Evaluating the transformer model](#evaluating-the-transformer-model)
11. [Experimenting with different transformer configurations](#experimenting-with-different-transformer-configurations)

## Understanding transformer models

### **Key concepts**
Transformer models are a class of deep learning architectures that leverage self-attention mechanisms to process sequential data. They excel in tasks like natural language processing (NLP), computer vision, and beyond, by capturing long-range dependencies efficiently. Unlike recurrent or convolutional networks, Transformers process entire sequences in parallel, enabling faster computation and improved scalability.

Key features of Transformers include:
- **Self-attention mechanism**: Dynamically focuses on relevant parts of the input sequence for each token, capturing global context.
- **Multi-head attention**: Uses multiple attention heads in parallel to learn diverse relationships within the data.
- **Positional encoding**: Provides order information to the sequence, as Transformers process tokens without inherent order.
- **Feedforward layers**: Independently apply transformations to each token after self-attention, adding depth and expressiveness.
- **Layer normalization and residual connections**: Improve stability and efficiency during training.

In PyTorch, the `torch.nn.Transformer` module offers a flexible framework for building and training Transformer-based models.

### **Applications**
Transformer models are widely applied in tasks requiring sequence understanding and context modeling:
- **Natural language processing (NLP)**: Machine translation, text summarization, sentiment analysis, and question answering.
- **Computer vision**: Tasks like image classification, object detection, and image segmentation with architectures such as Vision Transformers (ViT).
- **Speech processing**: Speech-to-text, text-to-speech, and audio feature extraction tasks.
- **Time-series forecasting**: Modeling dependencies in long-term temporal data for predictive analytics.

### **Advantages**
- **Parallel processing**: Processes sequences as a whole, significantly speeding up training compared to recurrent models.
- **Scalability**: Handles large datasets and model sizes effectively, achieving state-of-the-art performance.
- **Versatility**: Adapts to a wide range of tasks and data modalities, from text to vision.
- **Global context**: Captures long-range dependencies without the need for sequential processing.

### **Challenges**
- **Computational intensity**: Requires substantial memory and compute resources, especially for long sequences.
- **Data dependency**: Performs best when trained on large, high-quality datasets.
- **Interpretability**: The complexity of attention mechanisms can make model decisions less transparent.
- **Training difficulty**: Hyperparameter tuning and model design can be intricate and resource-intensive.

## Setting up the environment


##### **Q1: How do you install the necessary libraries for building and training transformer models in PyTorch?**


In [36]:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# !pip install matplotlib
# !pip install numpy

##### **Q2: How do you import the required PyTorch modules to construct attention mechanisms and build transformer models?**


In [37]:
import torch
import torch.nn as nn
import torch.nn.functional as F

##### **Q3: How do you configure the environment to use GPU support for training transformer models in PyTorch?**

In [38]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## Defining the input data


##### **Q4: How do you define input sequences to feed into the transformer model?**


In [39]:
input_sequences = torch.tensor([
    [4, 7, 2, 9, 0],
    [5, 6, 1, 0, 0]
], dtype=torch.long).to(device)  # move to gpu if available

##### **Q5: How do you preprocess the input data and convert it into embeddings for the transformer model?**


In [40]:
vocab_size = 10
embedding_dim = 32

embedding_layer = nn.Embedding(vocab_size, embedding_dim).to(device)

In [41]:
embedded_inputs = embedding_layer(input_sequences)  # shape: (batch_size, seq_len, embedding_dim)

##### **Q6: How do you pad input sequences to ensure consistent lengths before feeding them into the transformer model?**

In [42]:
from torch.nn.utils.rnn import pad_sequence

# example tokenized input sequences (varying lengths)
sequences = [
    torch.tensor([4, 7, 2, 9], dtype=torch.long),
    torch.tensor([5, 6, 1], dtype=torch.long)
]

padded_sequences = pad_sequence(sequences, batch_first=True, padding_value=0).to(device)  # pad with 0

## Implementing positional encoding


##### **Q7: How do you implement sinusoidal positional encoding in PyTorch to represent the order of tokens in sequences?**


In [43]:
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)  # positional encoding table
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # shape: (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))  # decay rates

        pe[:, 0::2] = torch.sin(position * div_term)  # even indices
        pe[:, 1::2] = torch.cos(position * div_term)  # odd indices

        self.pe = pe.unsqueeze(0)  # shape: (1, max_len, d_model)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)].to(x.device)  # add positional encoding up to sequence length
        return x

##### **Q8: How do you add positional encodings to the input embeddings for the transformer model?**


In [44]:
pos_encoder = PositionalEncoding(d_model=embedding_dim).to(device)

encoded_inputs = pos_encoder(embedded_inputs)  # shape: (batch_size, seq_len, embedding_dim)

##### **Q9: How do you verify that the positional encoding has been correctly added to the input data?**

In [45]:
print("Embedded inputs (first token):", embedded_inputs[0, 0])  # before adding positional encoding
print("Positional encoded (first token):", encoded_inputs[0, 0])  # after adding positional encoding

difference = encoded_inputs[0, 0] - embedded_inputs[0, 0]
print("Difference (added position info):", difference)

Embedded inputs (first token): tensor([ 0.9781, -0.1581, -0.6428, -0.2876,  0.3417,  1.7808, -1.3230, -1.2824,
         0.8206, -0.8285, -0.3401,  0.5750, -0.9450, -0.6592, -0.4777,  0.1327,
         1.1768, -0.1969,  0.2404, -0.3387, -0.4212,  0.1310, -0.3772,  1.2905,
        -1.2184, -1.3354, -0.8109, -1.4791,  0.6834,  0.3014, -0.1476, -0.6980],
       device='cuda:0', grad_fn=<SelectBackward0>)
Positional encoded (first token): tensor([ 0.9781,  0.8419, -0.6428,  0.7124,  0.3417,  2.7808, -1.3230, -0.2824,
         0.8206,  0.1715, -0.3401,  1.5750, -0.9450,  0.3408, -0.4777,  1.1327,
         1.1768,  0.8031,  0.2404,  0.6613, -0.4212,  1.1310, -0.3772,  2.2905,
        -1.2184, -0.3354, -0.8109, -0.4791,  0.6834,  1.3014, -0.1476,  0.3020],
       device='cuda:0', grad_fn=<SelectBackward0>)
Difference (added position info): tensor([0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000,
        1.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 1.000

## Building the scaled dot-product attention mechanism


##### **Q10: How do you implement the scaled dot-product attention mechanism in PyTorch?**


In [46]:
def scaled_dot_product_attention(query, key, value, mask=None):
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)  # scaled dot-product

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # apply mask

    attn_weights = F.softmax(scores, dim=-1)  # normalize
    output = torch.matmul(attn_weights, value)  # weighted sum

    return output, attn_weights

##### **Q11: How do you compute attention scores by calculating the dot product of query and key matrices?**


In [47]:
batch_size = 2
seq_len = 5
d_k = 32

query = torch.rand(batch_size, seq_len, d_k).to(device)
key = torch.rand(batch_size, seq_len, d_k).to(device)

In [48]:
# compute raw attention scores (no scaling yet)
raw_scores = torch.matmul(query, key.transpose(-2, -1))  # shape: (batch_size, seq_len, seq_len)
print("Raw attention scores shape:", raw_scores.shape)

Raw attention scores shape: torch.Size([2, 5, 5])


##### **Q12: How do you apply softmax to the attention scores and multiply them by the value matrix to get the final output?**

In [49]:
value = torch.rand(batch_size, seq_len, d_k).to(device)

scaled_scores = raw_scores / math.sqrt(d_k)  # scale by sqrt(d_k)
attn_weights = F.softmax(scaled_scores, dim=-1)  # apply softmax
attn_output = torch.matmul(attn_weights, value)  # attention output

print("Attention output shape:", attn_output.shape)

Attention output shape: torch.Size([2, 5, 32])


## Implementing multi-head attention


##### **Q13: How do you implement multi-head attention by splitting input sequences into multiple attention heads in PyTorch?**


In [50]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0  # ensure divisibility
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, _ = x.size()

        # linear projections
        q = self.q_linear(x)  # shape: (batch_size, seq_len, embed_dim)
        k = self.k_linear(x)
        v = self.v_linear(x)

        # reshape for multi-head attention
        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)  # shape: (batch_size, num_heads, seq_len, head_dim)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        return q, k, v

##### **Q14: How do you apply the scaled dot-product attention mechanism separately for each head in the multi-head attention?**


In [51]:
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)  # shape: (batch_size, num_heads, seq_len, seq_len)

    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    attn_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attn_weights, v)  # shape: (batch_size, num_heads, seq_len, head_dim)

    return output, attn_weights

##### **Q15: How do you concatenate the outputs of the multiple attention heads and apply a linear projection?**

In [52]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        batch_size, seq_len, _ = x.size()

        q = self.q_linear(x)
        k = self.k_linear(x)
        v = self.v_linear(x)

        q = q.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        attn_output, attn_weights = scaled_dot_product_attention(q, k, v, mask)

        # concatenate heads
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, seq_len, self.embed_dim)

        # final linear projection
        output = self.out_proj(attn_output)

        return output, attn_weights

## Building the feed-forward network


##### **Q16: How do you implement the position-wise feed-forward network using `torch.nn.Linear` layers?**


In [53]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim):
        super().__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)
        self.linear2 = nn.Linear(ff_dim, embed_dim)

    def forward(self, x):
        return self.linear2(F.relu(self.linear1(x)))  # two-layer transformation with ReLU

##### **Q17: How do you apply activation functions after the linear layers in the feed-forward network?**


In [54]:
# already applied ReLU in Q16 within the forward method of PositionwiseFeedForward
# for illustration:
x = torch.rand(2, 5, 64).to(device)  # dummy input: (batch_size, seq_len, embed_dim)
ffn = PositionwiseFeedForward(embed_dim=64, ff_dim=256).to(device)

output = ffn(x)  # applies linear1 -> ReLU -> linear2
print("Output shape:", output.shape)

Output shape: torch.Size([2, 5, 64])


##### **Q18: How do you add dropout and layer normalization to the feed-forward network for regularization?**

In [55]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)
        self.linear2 = nn.Linear(ff_dim, embed_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        residual = x
        x = self.linear1(x)
        x = F.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)
        x = self.dropout(x)
        return self.layer_norm(x + residual)  # apply residual and normalization

## Constructing the transformer encoder


##### **Q19: How do you combine multi-head attention and the feed-forward network to construct a transformer encoder layer?**


In [56]:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = PositionwiseFeedForward(embed_dim, ff_dim, dropout)
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x, mask=None):
        attn_output, _ = self.self_attn(x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # residual connection after attention
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))  # residual connection after feed-forward
        return x

##### **Q20: How do you implement residual connections and layer normalization around the attention and feed-forward layers in the transformer encoder?**


In [57]:
# already applied in TransformerEncoderLayer forward method from Q19

##### **Q21: How do you stack multiple transformer encoder layers to create a deep transformer model?**

In [58]:
class TransformerEncoder(nn.Module):
    def __init__(self, num_layers, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(embed_dim, num_heads, ff_dim, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, x, mask=None):
        for layer in self.layers:
            x = layer(x, mask)
        return x

## Training the transformer model


##### **Q22: How do you define the loss function for a sequence-based task in the transformer model?**


In [59]:
criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding index during loss computation

##### **Q23: How do you set up the optimizer to update the transformer model’s parameters during training?**


In [60]:
model = TransformerEncoder(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

##### **Q24: How do you implement the training loop, including forward pass, loss calculation, and backpropagation for the transformer model?**


In [61]:
class OutputProjection(nn.Module):
    def __init__(self, embed_dim, vocab_size):
        super().__init__()
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        return self.linear(x)  # project to vocab size

In [62]:
embedding_layer = nn.Embedding(num_embeddings=10, embedding_dim=64).to(device)
pos_encoder = PositionalEncoding(d_model=64).to(device)
projection_layer = OutputProjection(embed_dim=64, vocab_size=10).to(device)

In [63]:
def train_step(model, projection_layer, optimizer, criterion, inputs, targets, mask=None):
    model.train()
    optimizer.zero_grad()

    x = embedding_layer(inputs)  # embed token indices
    x = pos_encoder(x)  # add positional encoding

    encoded = model(x, mask)  # pass through encoder
    logits = projection_layer(encoded)  # project to vocab size

    loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))  # flatten for loss
    loss.backward()
    optimizer.step()

    return loss.item(), logits

##### **Q25: How do you track and log the training loss and accuracy over multiple epochs when training the transformer model?**

In [64]:
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset

# sample input data
x_batch = torch.tensor([
    [4, 7, 2, 9, 0],
    [5, 6, 1, 0, 0]
], dtype=torch.long)

y_batch = torch.tensor([
    [7, 2, 9, 0, 0],
    [6, 1, 0, 0, 0]
], dtype=torch.long)

dataset = TensorDataset(x_batch, y_batch)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True)

train_losses = []
train_accuracies = []

for epoch in range(1, 11):
    total_loss = 0
    total_correct = 0
    total_tokens = 0

    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        loss, logits = train_step(model, projection_layer, optimizer, criterion, inputs, targets)

        predictions = logits.argmax(dim=-1)
        mask = targets != 0
        correct = (predictions[mask] == targets[mask]).sum().item()
        total = mask.sum().item()

        total_correct += correct
        total_tokens += total
        total_loss += loss

    avg_loss = total_loss / len(data_loader)
    accuracy = total_correct / total_tokens

    train_losses.append(avg_loss)
    train_accuracies.append(accuracy)

    print(f"Epoch {epoch}  Loss: {avg_loss:.4f}  Accuracy: {accuracy:.4f}")

Epoch 1  Loss: 2.6124  Accuracy: 0.0000
Epoch 2  Loss: 2.2403  Accuracy: 0.2000
Epoch 3  Loss: 2.0370  Accuracy: 0.2000
Epoch 4  Loss: 2.1122  Accuracy: 0.2000
Epoch 5  Loss: 1.9199  Accuracy: 0.2000
Epoch 6  Loss: 1.9140  Accuracy: 0.2000
Epoch 7  Loss: 1.7676  Accuracy: 0.4000
Epoch 8  Loss: 1.6318  Accuracy: 0.4000
Epoch 9  Loss: 1.5428  Accuracy: 0.6000
Epoch 10  Loss: 1.4302  Accuracy: 0.6000


## Evaluating the transformer model


##### **Q26: How do you evaluate the transformer model on a validation or test dataset to calculate performance metrics?**


In [65]:
model.eval()
total_loss = 0
total_correct = 0
total_tokens = 0

In [66]:
with torch.no_grad():
    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        x = embedding_layer(inputs)  # embed tokens
        x = pos_encoder(x)  # add positional encoding
        encoded = model(x)  # forward pass through transformer
        logits = projection_layer(encoded)  # project to vocab size

        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))  # compute loss
        total_loss += loss.item()

        predictions = logits.argmax(dim=-1)  # predict tokens
        mask = targets != 0  # exclude padding
        correct = (predictions[mask] == targets[mask]).sum().item()  # count correct
        total = mask.sum().item()  # count total valid tokens

        total_correct += correct
        total_tokens += total

In [67]:
avg_loss = total_loss / len(data_loader)
accuracy = total_correct / total_tokens

print(f"Transformer Evaluation — Loss: {avg_loss:.4f}  Accuracy: {accuracy:.4f}")

Transformer Evaluation — Loss: 1.3342  Accuracy: 0.8000


##### **Q27: How do you compute metrics such as accuracy or perplexity to evaluate the transformer’s performance?**


In [68]:
def compute_perplexity(loss):
    return math.exp(loss)

In [69]:
perplexity = compute_perplexity(avg_loss)

print(f"Transformer Evaluation — Perplexity: {perplexity:.2f}")

Transformer Evaluation — Perplexity: 3.80


##### **Q28: How do you compare the transformer model's performance to other baseline models, such as LSTMs or RNNs?**

In [70]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embed(x)
        output, _ = self.rnn(x)
        return self.fc(output)

In [71]:
rnn_model = RNNBaseline(vocab_size=10, embed_dim=64, hidden_dim=64).to(device)
rnn_optimizer = torch.optim.Adam(rnn_model.parameters(), lr=3e-4)

In [72]:
for epoch in range(1, 11):
    rnn_model.train()
    total_loss = 0
    total_correct = 0
    total_tokens = 0

    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        rnn_optimizer.zero_grad()
        logits = rnn_model(inputs)

        loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        rnn_optimizer.step()

        predictions = logits.argmax(dim=-1)
        mask = targets != 0
        correct = (predictions[mask] == targets[mask]).sum().item()
        total = mask.sum().item()

        total_correct += correct
        total_tokens += total
        total_loss += loss.item()

    avg_loss_rnn = total_loss / len(data_loader)
    accuracy_rnn = total_correct / total_tokens
    perplexity_rnn = compute_perplexity(avg_loss_rnn)

    print(f"RNN Epoch {epoch}  Loss: {avg_loss_rnn:.4f}  Accuracy: {accuracy_rnn:.4f}  Perplexity: {perplexity_rnn:.2f}")

RNN Epoch 1  Loss: 2.3661  Accuracy: 0.0000  Perplexity: 10.66
RNN Epoch 2  Loss: 2.3424  Accuracy: 0.0000  Perplexity: 10.41
RNN Epoch 3  Loss: 2.3188  Accuracy: 0.0000  Perplexity: 10.16
RNN Epoch 4  Loss: 2.2953  Accuracy: 0.0000  Perplexity: 9.93
RNN Epoch 5  Loss: 2.2719  Accuracy: 0.2000  Perplexity: 9.70
RNN Epoch 6  Loss: 2.2486  Accuracy: 0.2000  Perplexity: 9.47
RNN Epoch 7  Loss: 2.2254  Accuracy: 0.2000  Perplexity: 9.26
RNN Epoch 8  Loss: 2.2023  Accuracy: 0.2000  Perplexity: 9.05
RNN Epoch 9  Loss: 2.1793  Accuracy: 0.2000  Perplexity: 8.84
RNN Epoch 10  Loss: 2.1563  Accuracy: 0.2000  Perplexity: 8.64


In [73]:
class LSTMBaseline(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embed(x)
        output, _ = self.lstm(x)
        return self.fc(output)

In [74]:
lstm_model = LSTMBaseline(vocab_size=10, embed_dim=64, hidden_dim=64).to(device)
lstm_optimizer = torch.optim.Adam(lstm_model.parameters(), lr=3e-4)
lstm_criterion = nn.CrossEntropyLoss(ignore_index=0)

In [75]:
for epoch in range(1, 11):
    lstm_model.train()
    total_loss = 0
    total_correct = 0
    total_tokens = 0

    for inputs, targets in data_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        lstm_optimizer.zero_grad()
        logits = lstm_model(inputs)

        loss = lstm_criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()
        lstm_optimizer.step()

        predictions = logits.argmax(dim=-1)
        mask = targets != 0
        correct = (predictions[mask] == targets[mask]).sum().item()
        total = mask.sum().item()

        total_correct += correct
        total_tokens += total
        total_loss += loss.item()

    avg_loss_lstm = total_loss / len(data_loader)
    accuracy_lstm = total_correct / total_tokens
    perplexity_lstm = compute_perplexity(avg_loss_lstm)

    print(f"LSTM Epoch {epoch}  Loss: {avg_loss_lstm:.4f}  Accuracy: {accuracy_lstm:.4f}  Perplexity: {perplexity_lstm:.2f}")

LSTM Epoch 1  Loss: 2.2869  Accuracy: 0.2000  Perplexity: 9.84
LSTM Epoch 2  Loss: 2.2723  Accuracy: 0.2000  Perplexity: 9.70
LSTM Epoch 3  Loss: 2.2578  Accuracy: 0.2000  Perplexity: 9.56
LSTM Epoch 4  Loss: 2.2432  Accuracy: 0.2000  Perplexity: 9.42
LSTM Epoch 5  Loss: 2.2288  Accuracy: 0.2000  Perplexity: 9.29
LSTM Epoch 6  Loss: 2.2143  Accuracy: 0.2000  Perplexity: 9.15
LSTM Epoch 7  Loss: 2.1998  Accuracy: 0.2000  Perplexity: 9.02
LSTM Epoch 8  Loss: 2.1854  Accuracy: 0.2000  Perplexity: 8.89
LSTM Epoch 9  Loss: 2.1709  Accuracy: 0.2000  Perplexity: 8.77
LSTM Epoch 10  Loss: 2.1564  Accuracy: 0.4000  Perplexity: 8.64


## Experimenting with different transformer configurations


##### **Q29: How do you experiment with different numbers of layers and attention heads in the transformer model to observe their effect on performance?**


In [76]:
model_small = TransformerEncoder(num_layers=2, embed_dim=64, num_heads=2, ff_dim=256).to(device)
model_large = TransformerEncoder(num_layers=6, embed_dim=64, num_heads=8, ff_dim=256).to(device)

projection_layer_small = OutputProjection(embed_dim=64, vocab_size=10).to(device)
projection_layer_large = OutputProjection(embed_dim=64, vocab_size=10).to(device)

optimizer_small = torch.optim.Adam(model_small.parameters(), lr=3e-4)
optimizer_large = torch.optim.Adam(model_large.parameters(), lr=3e-4)

In [77]:
criterion = nn.CrossEntropyLoss(ignore_index=0)

In [78]:
inputs, targets = next(iter(data_loader))
inputs, targets = inputs.to(device), targets.to(device)

x_small = pos_encoder(embedding_layer(inputs))
x_large = pos_encoder(embedding_layer(inputs))

logits_small = projection_layer_small(model_small(x_small))
logits_large = projection_layer_large(model_large(x_large))

loss_small = criterion(logits_small.view(-1, logits_small.size(-1)), targets.view(-1))
loss_large = criterion(logits_large.view(-1, logits_large.size(-1)), targets.view(-1))

print(f"Small Transformer Loss: {loss_small.item():.4f}")
print(f"Large Transformer Loss: {loss_large.item():.4f}")

Small Transformer Loss: 2.8785
Large Transformer Loss: 2.7999


##### **Q30: How do you adjust the hidden dimension size of the transformer and analyze its impact on training time and accuracy?**


In [82]:
import time

model_dim32 = TransformerEncoder(num_layers=4, embed_dim=32, num_heads=4, ff_dim=128).to(device)
model_dim128 = TransformerEncoder(num_layers=4, embed_dim=128, num_heads=8, ff_dim=512).to(device)

embed32 = nn.Embedding(10, 32).to(device)
embed128 = nn.Embedding(10, 128).to(device)
proj32 = OutputProjection(32, 10).to(device)
proj128 = OutputProjection(128, 10).to(device)
pos_encoder_32 = PositionalEncoding(d_model=32).to(device)
pos_encoder_128 = PositionalEncoding(d_model=128).to(device)

In [83]:
start = time.time()
x = pos_encoder_32(embed32(inputs))
logits = proj32(model_dim32(x))
loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
end = time.time()
print(f"Dim 32 — Loss: {loss.item():.4f}  Time: {end - start:.4f}s")

Dim 32 — Loss: 2.4019  Time: 0.0061s


In [84]:
start = time.time()
x = pos_encoder_128(embed128(inputs))
logits = proj128(model_dim128(x))
loss = criterion(logits.view(-1, logits.size(-1)), targets.view(-1))
end = time.time()
print(f"Dim 128 — Loss: {loss.item():.4f}  Time: {end - start:.4f}s")

Dim 128 — Loss: 2.7081  Time: 0.0061s


##### **Q31: How do you experiment with different learning rates and dropout rates to optimize the transformer’s generalization and performance?**


In [85]:
model_lowlr_highdrop = TransformerEncoder(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256, dropout=0.3).to(device)
model_highlr_nodrop = TransformerEncoder(num_layers=4, embed_dim=64, num_heads=4, ff_dim=256, dropout=0.0).to(device)

proj_lowlr_highdrop = OutputProjection(64, 10).to(device)
proj_highlr_nodrop = OutputProjection(64, 10).to(device)

optimizer_lowlr_highdrop = torch.optim.Adam(model_lowlr_highdrop.parameters(), lr=1e-4)
optimizer_highlr_nodrop = torch.optim.Adam(model_highlr_nodrop.parameters(), lr=1e-2)

inputs, targets = next(iter(data_loader))
inputs, targets = inputs.to(device), targets.to(device)

x_low = pos_encoder(embedding_layer(inputs))
x_high = pos_encoder(embedding_layer(inputs))

logits_low = proj_lowlr_highdrop(model_lowlr_highdrop(x_low))
logits_high = proj_highlr_nodrop(model_highlr_nodrop(x_high))

loss_low = criterion(logits_low.view(-1, logits_low.size(-1)), targets.view(-1))
loss_high = criterion(logits_high.view(-1, logits_high.size(-1)), targets.view(-1))

print(f"LR=1e-4, Dropout=0.3 — Loss: {loss_low.item():.4f}")
print(f"LR=1e-2, Dropout=0.0 — Loss: {loss_high.item():.4f}")

LR=1e-4, Dropout=0.3 — Loss: 2.3417
LR=1e-2, Dropout=0.0 — Loss: 2.2674


##### **Q32: How do you analyze how the transformer model performs on different tasks by varying the input data and sequence lengths?**

In [86]:
short_seq = torch.tensor([[4, 7, 2]], dtype=torch.long).to(device)  # short sequence (length 3)
long_seq = torch.tensor([[5, 6, 1, 2, 3, 7, 9, 4, 0, 0]], dtype=torch.long).to(device)  # long sequence (length 10)

target_short = torch.tensor([[7, 2, 0]], dtype=torch.long).to(device)
target_long = torch.tensor([[6, 1, 2, 3, 7, 9, 4, 0, 0, 0]], dtype=torch.long).to(device)

embed_short = embedding_layer(short_seq)
embed_long = embedding_layer(long_seq)

pos_short = pos_encoder(embed_short)
pos_long = pos_encoder(embed_long)

encoded_short = model(pos_short)
encoded_long = model(pos_long)

logits_short = projection_layer(encoded_short)
logits_long = projection_layer(encoded_long)

loss_short = criterion(logits_short.view(-1, logits_short.size(-1)), target_short.view(-1))
loss_long = criterion(logits_long.view(-1, logits_long.size(-1)), target_long.view(-1))

print(f"Short Sequence — Loss: {loss_short.item():.4f}")
print(f"Long Sequence  — Loss: {loss_long.item():.4f}")

Short Sequence — Loss: 1.2265
Long Sequence  — Loss: 2.4804
