# CPSC 477/577 Spring 2025

Instructor: Arman Cohan  
Homework 2: Transformers and transfer learning

### Part 1: Implementation of a transformer model for language modeling

### Please write your name and NetID below

NAME: Yuan Chang

NetID:yc2238


**Instructions**: Read the notebook carefully, then add your implementation to places identified with `TODO` and also answer the Reflection questions.


**IMPORTANT SUBMISSION NOTE 1**: Please submit your notebook having run all cells and hide long debugging output before submission.


**IMPORTANT SUBMISSION NOTE 2**: If you included print or debugging statements, please remove those before your final run and submission.

---

## Introduction

In this assignment we will implement the transformer model architecture for language modeling from scratch.  

## Environment Setup

For this assignment, we will use Google Colab.

#### Using GPU in Colab
PyTorch and other deep learning libraries are much faster using GPU acceleration. For training and evaluating the models in this assignment, you should always use a GPU:

1. Go to __Runtime__ option on the top left
2. Click __Change runtime type__
3. Select "GPU" for __Hardware accelerator__
4. Click __SAVE__ button

However, Colab limits the amount of time that you can use a free GPU.
So you may wish to implement much of the assignment without the GPU. But note that you will have to run all cells again once you change the runtime type.
You can also connect Colab to your local GPU for faster iteration.

Alternatively, the course is already set up on the HPC cluster, so you can use the Jupyter notebook functionality there to run your code.

Colab has popular libraries such as PyTorch, TensorFlow, OpenCV and Keras already installed. Let's get started and verify this:

In [2]:
! pip install transformers tokenizers
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch import Tensor
from typing import Tuple, List

import random
import math
import os
import time
import json
import numpy as np

# We'll set the random seeds for deterministic results.
SEED = 1

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.enabled = False
torch.backends.cudnn.deterministic = True

class Placeholder:
    @property
    def DO(self):
        raise NotImplementedError("You haven't implemented this part of the assignment yet")

TO = Placeholder()



In [3]:
import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Pytorch version is: ", torch.__version__)
print("You are using: ", DEVICE)

Pytorch version is:  2.5.1+cu124
You are using:  cuda


## 1. Transformers

First, we will take a look at the famous *Transformer* model, which is the foundation of current LLMs such as GPT-4o and Claude, as well as earlier models such as GPT-3 and BERT.
We will use it in the following parts of this assignment.

Transformers were introduced in the paper ["Attention is all you need" (Vaswani et al. 2017)](https://arxiv.org/abs/1706.03762). As the paper title suggests, the key idea that makes transformers work is *attention*. If you want to review attention and transformers, some useful resources include

- the [original paper](https://arxiv.org/abs/1706.03762)
- chapters 9 and 10 of Jurafsky & Martin
- the blog posts [Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) and [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/). (The first of these doesn't cover transformers but is useful for understanding attention.)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/)
- YouTube videos such as [this one](https://youtu.be/OyFJWRnt_AY) or [this one](https://youtu.be/iDulhoQ2pro).

### 1.1 Model Input

Recall that in transformer models, the input is a sequence of tokens.
Concretely, here is the input pipeline for a transformer model (and for most neural network models in NLP):

1- Given an input text $x$, we first tokenize it into a sequence of tokens. Tokens can be words or sub-word units (or even characters).
We can assume that we have access to a blackbox tokenization method that given an input text $x$, returns a sequence of tokens $x_1, x_2, \ldots, x_n$.

2- The tokens are then converted into a sequence of token IDs. Each token ID is an integer that represents the token in the vocabulary.

3- The token IDs are then converted into a sequence of token embeddings. Each token is represented as a vector, and the sequence of vectors is called an *embedding*.

4- We can also optionally include additional information such as the position of each token in the sequence. This is done by adding a position embedding to the token embedding.
Here each position is an integer that represents the position of the token in the sequence and then a separate position embedding matrix is used to look up the position embedding for each token.
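
The four steps above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the whitespace split and the four-word `toy_vocab` are hypothetical stand-ins for a real tokenizer and vocabulary, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Steps 1 and 2: tokenize and map tokens to IDs (hypothetical toy vocabulary)
toy_vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
tokens = "the cat sat".split()  # whitespace split stands in for a real tokenizer
token_ids = torch.tensor([[toy_vocab.get(t, toy_vocab["<unk>"]) for t in tokens]])  # (1, 3)

# Step 3: look up one embedding vector per token ID
d_model = 8
token_embedding = nn.Embedding(len(toy_vocab), d_model)

# Step 4: look up one embedding vector per position and add it
max_len = 16
position_embedding = nn.Embedding(max_len, d_model)
positions = torch.arange(token_ids.shape[1])  # tensor([0, 1, 2])

# (1, 3, 8) + (3, 8) broadcasts over the batch dimension
inputs = token_embedding(token_ids) + position_embedding(positions)
print(inputs.shape)
```

The result has one `d_model`-dimensional vector per token, which is the form the transformer layers expect.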


#### 1.1.1 Tokenization

We assume we have access to a blackbox tokenization method that given an input text $x$, returns a sequence of tokens $x_1, x_2, \ldots, x_n$.
For this we use the tokenizers that ship with the Hugging Face `transformers` library, which covers a wide range of languages and models.
We will use the `GPT2Tokenizer` for this assignment.

In [4]:
from transformers import GPT2Tokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "This adaptation of the enigmatic novel by Liane Moriarty is supremely watchable but flawed."

# we can use tokenizer.encode to convert text to token IDs
tokens_ids = tokenizer.encode(text)

tokens_ids


[1212,
 16711,
 286,
 262,
 48584,
 5337,
 416,
 406,
 46470,
 32709,
 25494,
 318,
 17700,
 306,
 2342,
 540,
 475,
 19556,
 13]

In [5]:
# now convert token IDs back to text to inspect the tokens
tokens = tokenizer.convert_ids_to_tokens(tokens_ids)

tokens


['This',
 'Ġadaptation',
 'Ġof',
 'Ġthe',
 'Ġenigmatic',
 'Ġnovel',
 'Ġby',
 'ĠL',
 'iane',
 'ĠMori',
 'arty',
 'Ġis',
 'Ġsupreme',
 'ly',
 'Ġwatch',
 'able',
 'Ġbut',
 'Ġflawed',
 '.']

You will notice that some tokens start with a "Ġ".
Popular tokenizers use a special symbol to mark a preceding space: "Ġ" in byte-level BPE tokenizers such as GPT-2's, or "▁" in SentencePiece.
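
As a small illustration of the space marker, the tokens printed above can be joined back into text by hand. This is a simplification that only handles the "Ġ" marker; the real byte-level mapping in GPT-2 also covers other bytes, so `tokenizer.decode` should be preferred in practice.

```python
# First few tokens as printed by convert_ids_to_tokens in the cell above
tokens = ['This', 'Ġadaptation', 'Ġof', 'Ġthe', 'Ġenigmatic', 'Ġnovel']

# Simplification: treat "Ġ" as a leading space and ignore the rest of the byte-level mapping
text = "".join(tokens).replace("Ġ", " ")
print(text)
```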

#### Converting token IDs to text

We can also use the tokenizer to convert token IDs back to text. This is useful for presenting the output of the model in human-readable form.

In [6]:
# we can use tokenizer.decode() method to convert token IDs back to text

converted_text = tokenizer.decode(tokens_ids)
print(converted_text)

assert text == converted_text

This adaptation of the enigmatic novel by Liane Moriarty is supremely watchable but flawed.


### 1.2 Embeddings

We first need to implement the embedding layer of the transformer. Recall that this is layer 0. We will use the `nn.Embedding` layer in PyTorch to implement this.
The `nn.Embedding` layer creates an embedding matrix of size `vocab_size x embedding_dim`. Given a sequence of token IDs, we can use the `nn.Embedding` layer to look up the token embeddings for those IDs.
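
To make the lookup concrete, the sketch below checks that calling an `nn.Embedding` layer gives the same result as indexing its weight matrix with the token IDs. The sizes and IDs are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # weight has shape (10, 4)
ids = torch.tensor([[1, 2, 3], [3, 3, 0]])              # (batch_size=2, sequence_length=3)

looked_up = emb(ids)       # forward pass of the layer
indexed = emb.weight[ids]  # plain row indexing into the embedding matrix

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

The same ID always maps to the same row, so repeated tokens share one embedding vector.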

In [7]:
import torch
import torch.nn as nn
from torch import Tensor

class Embedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int):
        """
        Initializes the embedding layer.

        Args:
            vocab_size (int): The number of unique tokens in the vocabulary.
            d_model (int): The embedding dimension (size of each embedding vector).
        """
        super(Embedding, self).__init__()

        # 🔹 TODO: Define the embedding layer using nn.Embedding.
        # 💡 Hint: The nn.Embedding layer should map 'vocab_size' tokens into 'd_model' embeddings.


        self.wte = nn.Embedding(vocab_size,d_model)  # 🔻 REPLACE 'None' with your implementation

    def forward(self, x: Tensor) -> Tensor:
        """
        Performs a forward pass of the embedding layer.

        Args:
            x (Tensor): A tensor of token indices with shape (batch_size, sequence_length).

        Returns:
            Tensor: The corresponding embeddings with shape (batch_size, sequence_length, d_model).
        """

        # 🔹 TODO: Lookup embeddings for the given input indices.
        # 💡 Hint: Use self.wte to retrieve embeddings.

        return self.wte(x)  # 🔻 REPLACE 'None' with your implementation


In [8]:
# tests
vocab_size = 10
d_model = 16

embedding = Embedding(vocab_size, d_model)
x = torch.tensor([1, 2, 3, 4])
output = embedding(x)
assert output.shape == (4, d_model)


### 1.3 Positional Embeddings

Now we implement the positional embeddings. As mentioned above this is a learned position embedding method and we will use the `nn.Embedding` layer in PyTorch to implement this.

In [15]:
class PositionalEmbeddings(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super(PositionalEmbeddings, self).__init__()

        # 🔹 TODO: Implement
        # create a tensor of shape (1, max_len, d_model) to store the positional embeddings using nn.Embedding
        self.embedding = nn.Embedding(max_len, d_model)

    def forward(self, x: Tensor) -> Tensor:
        # look up the positional embeddings for the given position indices
        # (they are added to the token embeddings later, in TokenEmbedder)
        # 🔹 TODO: Implement
        return self.embedding(x)



In [16]:
# tests

d_model = 16
max_len = 64

positional_embeddings = PositionalEmbeddings(d_model, max_len)
x = torch.tensor([1, 2, 3, 4])
output = positional_embeddings(x)
assert output.shape == (4, d_model)
output

tensor([[-0.5284,  2.0264,  0.2286, -0.3184,  0.0555, -0.8794, -0.2590,  1.2397,
          1.0065,  0.9851, -0.4223,  0.1129, -0.5996, -0.0496,  3.0999, -0.3272],
        [-0.8858, -0.1698, -0.9845, -1.0561,  1.5622,  1.2579, -1.7318,  0.9355,
          1.7652, -0.1249,  1.4508, -0.5174, -0.2349, -0.3428,  0.0640,  1.2952],
        [ 1.9605,  0.3797, -0.4056,  1.9377, -0.7430,  0.6795, -0.1050,  0.1765,
          1.4710, -1.9942, -1.4419, -1.1211,  0.7427, -0.4917,  0.4131, -1.0259],
        [-1.1577,  0.6924,  1.4399, -1.5901,  0.5713, -0.9030, -0.2885, -0.7355,
          0.7207, -0.2414,  1.4590,  0.5052, -0.7366,  1.0573, -0.6979, -1.3404]],
       grad_fn=<EmbeddingBackward0>)

Now we put the token embeddings and positional embeddings together to get the input embeddings for the transformer model.

In [17]:
# Combine the embeddings and positional embeddings

class TokenEmbedder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super(TokenEmbedder, self).__init__()
        self.token_embedding = Embedding(vocab_size, d_model)
        self.positional_embedding = PositionalEmbeddings(d_model, max_len)

    def forward(self, x: Tensor) -> Tensor:
        # add the token embeddings and positional embeddings together
        pos = torch.arange(0, x.shape[1], dtype=torch.long) # shape: [sequence length]
        return self.token_embedding(x) + self.positional_embedding(pos)

In [18]:
# test the whole input pipeline

sample_texts = ["This adaptation of the enigmatic novel by Liane Moriarty is supremely watchable but flawed.",
                "The story is a bit of a slow burn, but the performances are top-notch and the ending is worth the wait."]

# encode the text
vocab_size = tokenizer.vocab_size
d_model = 64

# return_tensors="pt" returns pytorch tensors directly. truncation and padding are used to ensure the input length is the same
tokenizer.pad_token = tokenizer.eos_token
token_ids = tokenizer(sample_texts, return_tensors="pt", max_length=64, padding="longest", truncation=True)['input_ids']

token_embedder = TokenEmbedder(vocab_size=tokenizer.vocab_size, d_model=768, max_len=64)

# pass the token_ids to the token_embedder
output = token_embedder(token_ids)

output.shape

torch.Size([2, 26, 768])

### 1.4 Reflection Questions

1- What is the purpose of the positional embeddings? Why do we need them?

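🔹 TODO: Positional embeddings tell the model where each token sits in the sequence. Self-attention on its own is order-agnostic: it treats the input as a set of tokens, so without position information a reordered sentence would produce the same representations, just reordered. We therefore need to add position information to the token embeddings explicitly.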


2- What is the purpose of this line?   
`pos = torch.arange(0, x.shape[1], dtype=torch.long) # shape: [sequence length]`

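🔹 TODO: This line creates a 1-D tensor of position indices `0, 1, ..., seq_len - 1`, where `seq_len = x.shape[1]`. These indices are used to look up one positional embedding per position, which is then broadcast across the batch when added to the token embeddings.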

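3- What do the dimensions of the output of the TokenEmbedder correspond to?

🔹 TODO: They are (batch size, sequence length, model dimension). In the example above, `torch.Size([2, 26, 768])` means 2 sentences, 26 tokens each after padding, and `d_model = 768`.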


4- Why do we add the positional embeddings to the token embeddings? Can we use other methods like concatenation?

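🔹 TODO: Addition keeps the input dimension at `d_model` and lets every later layer mix token meaning and position in the same vector space. Concatenation would also work, but it increases the input dimension and therefore the parameter count and compute cost of the following layers.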
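
As a small illustration for question 1, the sketch below shows that self-attention without any position information is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way. `toy_self_attention` and its random weights are hypothetical and use a single head for brevity.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

n, d = 5, 8
X = torch.randn(n, d)  # token embeddings WITHOUT positional information
# Hypothetical single-head projection weights, scaled to keep scores moderate
Wq, Wk, Wv = (torch.randn(d, d) / d ** 0.5 for _ in range(3))

def toy_self_attention(X: torch.Tensor) -> torch.Tensor:
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = F.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return weights @ V

perm = torch.randperm(n)
out = toy_self_attention(X)
out_perm = toy_self_attention(X[perm])

# Permuting the input rows permutes the output rows identically
equivariant = torch.allclose(out[perm], out_perm, atol=1e-4)
print(equivariant)
```

Because each token's output is unchanged by reordering, the model cannot distinguish word orders unless positions are injected into the inputs.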

## 2. Transformer Model Architecture

Now that we have implemented the input pipeline we can move on to the transformer model architecture.

We will take a modular approach to implementing the transformer model. We will implement the following components of the transformer model:

1- QKV Projection

2- Multi-head self-attention

3- Position-wise feedforward network

4- Layer normalization

### 2.1 Projections

We first want to create separate projections for the input.
In class we implemented this:

In [21]:
import torch
import torch.nn as nn

class QKVProjection(nn.Module):
    def __init__(self, d_model, num_heads, d_k):
        super(QKVProjection, self).__init__()

        assert num_heads * d_k == d_model, "d_model must be equal to num_heads * d_k"

        self.num_heads = num_heads
        self.d_k = d_k

        # Linear layers for Q, K, V projections
        self.q_linear = nn.Linear(d_model, num_heads * d_k)
        self.k_linear = nn.Linear(d_model, num_heads * d_k)
        self.v_linear = nn.Linear(d_model, num_heads * d_k)

    def forward(self, X):
        batch_size, seq_length, d_model = X.shape

        # Compute Q, K, V
        # then reshape to [batch_size, seq_len, num_heads, d_k] and transpose
        # so that the result is of shape [batch_size, num_heads, seq_len, d_k]
        Q = self.q_linear(X).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        K = self.k_linear(X).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        V = self.v_linear(X).view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

        return Q, K, V

# Example usage:
d_model = 512  # Embedding size
num_heads = 8  # Number of attention heads
d_k = 64  # Dimension per head (num_heads * d_k must be d_model)

batch_size = 2
seq_length = 10

# Random input tensor
X = torch.rand(batch_size, seq_length, d_model)

# Instantiate and apply QKVProjection
qkv_projection = QKVProjection(d_model, num_heads, d_k)
Q, K, V = qkv_projection(X)

print(Q.shape, K.shape, V.shape)

torch.Size([2, 8, 10, 64]) torch.Size([2, 8, 10, 64]) torch.Size([2, 8, 10, 64])


### 2.2 Multi-head self-attention

The self-attention mechanism is the key idea that makes transformers work. It allows the model to weigh the importance of different tokens in the input sequence when computing the output for each token.
The main parameters of this layer are the projection matrices $W_Q, W_K, W_V$ and we have $H$ heads (for each head we have separate projections).
Recall that the dimension of the input sequence is $d_{model}$ and the dimension of the output sequence is also $d_{model}$.
And the dimension of the projected queries, keys and values is $d_k = d_v = d_{model} / H$.

Based on this information we can implement the multi-head self-attention layer.

Hint: For tensor multiplications you can use either `torch.matmul` or `torch.einsum`.
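
As a quick illustration of the hint, the sketch below computes the scaled attention scores once with `torch.matmul` and once with `torch.einsum` and checks that they agree. The tensors are random and the shapes follow the `[batch_size, num_heads, seq_len, d_k]` layout returned by `QKVProjection`.

```python
import math
import torch

torch.manual_seed(0)

batch_size, num_heads, seq_length, d_k = 2, 8, 10, 64
Q = torch.randn(batch_size, num_heads, seq_length, d_k)
K = torch.randn(batch_size, num_heads, seq_length, d_k)

# matmul: transpose the last two dimensions of K so that (.., q, d) @ (.., d, k) -> (.., q, k)
scores_matmul = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

# einsum: sum over the shared feature index d for every (query, key) pair
scores_einsum = torch.einsum("bhqd,bhkd->bhqk", Q, K) / math.sqrt(d_k)

print(scores_matmul.shape)
print(torch.allclose(scores_matmul, scores_einsum, atol=1e-4))
```

Both forms give one score per (query, key) pair per head; which one to use is mostly a readability choice.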

In [22]:
import math
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads, d_k):
        super(MultiHeadAttention, self).__init__()

        assert num_heads * d_k == d_model, "d_model must be equal to num_heads * d_k"

        self.num_heads = num_heads
        self.d_k = d_k

        # Use the improved QKVProjection
        self.qkv_projection = QKVProjection(d_model, num_heads, d_k)

        # Final output projection
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, X):
        """
        Perform multi-head self-attention.

        Args:
            X (torch.Tensor): Input tensor of shape (batch_size, seq_length, d_model).

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: A tuple containing:
                - attention_weights (torch.Tensor): Attention weights of shape (batch_size, num_heads, seq_length, seq_length).
                - output (torch.Tensor): Output tensor of shape (batch_size, seq_length, d_model).
        """
        # Generate Q, K, V using the projection module
        Q, K, V = self.qkv_projection(X)

        # 🔹 TODO: Ensure the following implementation is clear and complete:
        batch_size, num_heads, seq_length, d_k = Q.shape

        # 🔹 TODO: Compute scaled dot-product attention for the raw_scores:
        #    a) Compute the raw attention scores using the dot product between Q and K.
        #    b) Scale the scores by dividing by sqrt(d_k).
        raw_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

        # 🔹 TODO: Apply softmax to get attention weights
        attention_weights = F.softmax(raw_scores, dim=-1)

        # 🔹 TODO: Now we should compute the context vector
        context = torch.matmul(attention_weights, V)

        # 🔹 TODO: Compute the output vector.
        # Hint: Reshape the context tensor to match the original shape of X.
        # After transpose the context tensor is not contiguous, so call .contiguous() before .view()
        output = context.transpose(1, 2).contiguous().view(batch_size, seq_length, -1)
        # Apply the final output projection W_O to mix information across heads
        output = self.out_linear(output)

        return attention_weights, output

# Example usage:
d_model = 512  # Embedding size
num_heads = 8  # Number of attention heads
d_k = 64  # Dimension per head (num_heads * d_k must be d_model)

batch_size = 2
seq_length = 10

# Random input tensor
X = torch.rand(batch_size, seq_length, d_model)

# Instantiate and apply Multi-Head Attention
multi_head_attn = MultiHeadAttention(d_model, num_heads, d_k)
_, C = multi_head_attn(X)

print(C.shape)  # Expected: (batch_size, seq_length, d_model)




# ----------- Unit tests ---------------
# -----------DO NOT EDIT THIS PART -----


import unittest
import torch

class TestMultiHeadAttention(unittest.TestCase):
    def setUp(self):
        self.d_model = 512
        self.num_heads = 8
        self.d_k = 64
        self.batch_size = 2
        self.seq_length = 10
        torch.manual_seed(42)
        self.multi_head_attn = MultiHeadAttention(self.d_model, self.num_heads, self.d_k)
        self.X = torch.rand(self.batch_size, self.seq_length, self.d_model, requires_grad=True)

    def test_output_shape(self):
        _, output = self.multi_head_attn(self.X)
        expected_shape = (self.batch_size, self.seq_length, self.d_model)
        self.assertEqual(output.shape, expected_shape, f"Expected shape {expected_shape}, but got {output.shape}")

    def test_deterministic_behavior(self):
        torch.manual_seed(42)
        _, output1 = self.multi_head_attn(self.X)
        torch.manual_seed(42)
        _, output2 = self.multi_head_attn(self.X)
        self.assertTrue(torch.allclose(output1, output2, atol=1e-6), "MultiHeadAttention is not deterministic!")

    def test_gradient_computation(self):
        """Ensure gradients are computed properly."""
        _, output = self.multi_head_attn(self.X)
        loss = output.sum()  # Simple loss function
        loss.backward()

        self.assertIsNotNone(self.X.grad, "Gradients were not computed for input!")
        self.assertGreater(self.X.grad.abs().sum().item(), 0, "Gradient sum is zero!")

    def test_attention_softmax(self):
        """Ensure that the attention scores sum up to ~1."""
        attention_weights, _ = self.multi_head_attn(X)

        # Sum of softmax probabilities along last dimension should be close to 1
        attention_sum = attention_weights.sum(dim=-1)
        ones = torch.ones_like(attention_sum)

        self.assertTrue(torch.allclose(attention_sum, ones, atol=1e-6), "Attention weights do not sum to 1!")

    def test_known_computation(self):
        d_model = 4
        num_heads = 1
        d_k = 4
        seq_length = 3
        batch_size = 1
        multi_head_attn = MultiHeadAttention(d_model, num_heads, d_k)
        with torch.no_grad():
            # For Q, K, V projections
            for linear in [multi_head_attn.qkv_projection.q_linear,
                           multi_head_attn.qkv_projection.k_linear,
                           multi_head_attn.qkv_projection.v_linear,
                           multi_head_attn.out_linear]:
                linear.weight.copy_(torch.eye(d_model))
                if linear.bias is not None:
                    linear.bias.zero_()
        X_known = torch.ones(batch_size, seq_length, d_model)
        attention_weights, output = multi_head_attn(X_known)
        expected_output = torch.ones(batch_size, seq_length, d_model)
        self.assertTrue(torch.allclose(output, expected_output, atol=1e-6),
                        f"Expected output {expected_output}, but got {output}")

unittest.main(argv=[''], exit=False)


.....
----------------------------------------------------------------------
Ran 5 tests in 0.158s

OK


torch.Size([2, 10, 512])



### 2.3 Feedforward network

The feedforward network is a simple two-layer neural network with a ReLU activation function in between the layers.
The main parameters of this layer are the weight matrices $W_1, W_2$ and the bias vectors $b_1, b_2$.
This layer takes as input a sequence of vectors of dimension $d_{model}$ and returns a sequence of vectors of the same dimension.
This layer is applied to each position in the sequence independently.
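
To illustrate what "applied to each position independently" means, the sketch below compares running a small two-layer network on a whole sequence with running it on one position at a time. The `nn.Sequential` stack is a stand-in for illustration (without dropout), not the `FeedForward` class of this assignment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, d_ff = 8, 32
# Stand-in feedforward network: Linear -> ReLU -> Linear
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(2, 5, d_model)  # (batch_size, seq_len, d_model)

full = ffn(x)  # applied to the whole sequence at once
# applied to each position separately, then stacked back along the sequence dimension
per_position = torch.stack([ffn(x[:, t]) for t in range(x.shape[1])], dim=1)

print(torch.allclose(full, per_position, atol=1e-5))
```

No information flows between positions in this layer; mixing across positions happens only in the attention layer.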

We'll next implement this module.

In [23]:
class FeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super(FeedForward, self).__init__()

        self.dropout = nn.Dropout(dropout)

        # 🔹 TODO Implement the init function. Recall that we need 2 linear layers. We can use nn.Linear
        self.a = nn.Linear(d_model, d_ff)
        self.b = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()




    def forward(self, x: Tensor) -> Tensor:
        # 🔹 TODO: Implement the forward pass
        # YOUR CODE HERE
        x = self.a(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.b(x)
        return x

### 2.4 Layer normalization

Layer normalization is a simple normalization technique that is applied to the output of each sub-layer in the transformer model.  
Recall that the main learnable parameters of this layer are the scaling and shifting parameters $\gamma, \beta$.  
Next we implement this module.
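
As a reference point, the sketch below writes out the standard layer normalization formula, $(x - \mu) / \sqrt{\sigma^2 + \epsilon}$ with the biased variance over the feature dimension, and compares it with `torch.nn.functional.layer_norm`. It assumes the lecture formula matches this standard definition, and it leaves out $\gamma$ and $\beta$ (equivalent to $\gamma = 1$, $\beta = 0$).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

eps = 1e-6
x = torch.randn(2, 3, 8)  # (batch_size, seq_len, d_model)

# Statistics are computed over the last (feature) dimension only
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance
manual = (x - mean) / torch.sqrt(var + eps)

reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)
print(torch.allclose(manual, reference, atol=1e-5))
```

Each token vector is normalized on its own, so the result does not depend on the other tokens or on the batch.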

In [24]:
class LayerNorm(nn.Module):
    def __init__(self, hidden_size, eps: float):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(hidden_size))
        self.beta = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x: Tensor) -> Tensor:
        # 🔹 TODO: Implement the layer normalization
        # hint: use the formula from the lecture slides and recall the shape of the tensor x: [batch_size, seq_len, d_model]
        # YOUR CODE HERE
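        mean = x.mean(dim=-1, keepdim=True)
        # biased variance over the feature dimension, as in the standard LayerNorm formula
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta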


#### 2.4.1 Causal Multi-Head Attention

Recall that for language modeling we need to prevent multi-head attention from attending to future tokens. We do this with a causal mask.
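
A minimal sketch of how such a mask acts on the attention weights, using all-zero raw scores for a sequence of length 4 so the effect is easy to read. Positions above the diagonal (future tokens) are set to $-\infty$ before the softmax and therefore receive zero weight.

```python
import torch
import torch.nn.functional as F

seq_length = 4
raw_scores = torch.zeros(seq_length, seq_length)  # placeholder scores, all equal

# Lower-triangular mask: row i may attend to columns 0..i only
mask = torch.tril(torch.ones(seq_length, seq_length))
masked_scores = raw_scores.masked_fill(mask == 0, float("-inf"))

weights = F.softmax(masked_scores, dim=-1)
print(weights)
```

With equal scores, row `i` spreads its weight uniformly over the first `i + 1` positions, and every row still sums to 1.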

In this part you need to do the following:

1- Subclass `MultiHeadAttention`.

2- Extend the `forward` method to accept `causal_mask: bool` and `attn_pdrop: float` arguments. `attn_pdrop` is the dropout probability applied to the attention weights, which helps with regularization during training. Apply this dropout to the attention weights before computing the final context vector.

Complete the forward function below.


In [25]:
class CausalMultiHeadAttention(MultiHeadAttention):

    def forward(self, X, causal_mask: bool = False, attn_pdrop: float = 0.1):
        # 🔹 TODO: Implement the forward function
        # Hint: You can use torch.tril or torch.triu https://pytorch.org/docs/stable/generated/torch.tril.html#torch-tril
        # YOUR CODE HERE
        Q, K, V = self.qkv_projection(X)
        batch_size, num_heads, seq_length, d_k = Q.shape
        raw_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        if causal_mask:
            mask = torch.tril(torch.ones(seq_length, seq_length, device=X.device)).view(1, 1, seq_length, seq_length)
            raw_scores = raw_scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = F.softmax(raw_scores, dim=-1)
        attention_weights = F.dropout(attention_weights, p=attn_pdrop, training=self.training)
        context = torch.matmul(attention_weights, V)
        output = context.transpose(1, 2).contiguous().view(batch_size, seq_length, -1)
        output = self.out_linear(output)
        return attention_weights, output


In [28]:
#test
d_model = 512
num_heads = 8
d_k = 64
seq_length = 10
batch_size = 2


X = torch.rand(batch_size, seq_length, d_model)
causal_attn = CausalMultiHeadAttention(d_model, num_heads, d_k)
attn_weights, output = causal_attn(X, causal_mask=True)

print(output.shape)

torch.Size([2, 10, 512])


### 2.5 Transformer Block

Now we can put the multi-head self-attention and the feedforward network together to implement the transformer block.  
Recall that we also need to implement the layer normalization and the residual connections.


You are provided with a code skeleton for the TransformerBlock class. Your task is to implement the forward method to correctly combine the attention mechanism, feed-forward network, residual connections, and layer normalization.
Pay close attention to the following:

Input Parameters:

`d_model`: The dimension of the input embeddings.   
`num_heads`: The number of attention heads.  
`attn_pdrop`: The dropout probability used in the attention module.  
`dropout`: The dropout probability for the feed-forward network.  
`d_ff`: The hidden dimension size in the feed-forward network.  
`eps`: A small constant for numerical stability in layer normalization.  
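
The hints in the skeleton below describe a pre-layer-norm residual pattern: normalize, apply the sub-layer, apply dropout, then add the result back to the input. The sketch below shows that pattern in isolation. `pre_ln_residual` is a hypothetical helper, and the `nn.Linear` layer is only a stand-in for attention or the feed-forward network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)  # stand-in for attention or feed-forward
dropout = nn.Dropout(p=0.1)

def pre_ln_residual(x: torch.Tensor) -> torch.Tensor:
    # normalize -> sub-layer -> dropout -> add back to the unnormalized input
    return x + dropout(sublayer(norm(x)))

x = torch.randn(2, 5, d_model)

# Zero the stand-in sub-layer so that only the residual path contributes
with torch.no_grad():
    sublayer.weight.zero_()
    sublayer.bias.zero_()

y = pre_ln_residual(x)
print(torch.equal(y, x))
```

When the sub-layer contributes nothing, the input passes through unchanged. A full block applies this pattern twice, and the second residual must add to the output of the first.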

In [29]:
# This uses your previous implementation of the previous blocks.
# Remember that we need to use the MultiHeadAttention, FeedForward, and LayerNorm modules that you implemented earlier

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, attn_pdrop: float, dropout: float, d_ff: int, eps: float):
        super(TransformerBlock, self).__init__()

        # 🔹 TODO: initialize and use any of the modules that are required for a Transformer Block
        # YOUR CODE HERE
        self.attn = CausalMultiHeadAttention(d_model, num_heads, d_model // num_heads)
        self.attn_pdrop = attn_pdrop  # dropout probability for the attention weights
        self.ff = FeedForward(d_model, d_ff, dropout)
        self.layer_norm1 = LayerNorm(d_model, eps)
        self.layer_norm2 = LayerNorm(d_model, eps)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x: Tensor, causal_mask: bool) -> Tensor:
        """
        Perform the forward pass of the transformer block.

        Args:
            x (Tensor): Input tensor with shape (batch_size, seq_length, d_model).
            causal_mask (bool): If True, apply a causal mask in the self-attention layer
                                (useful for decoder self-attention to prevent attending to future tokens).

        Returns:
            Tensor: The output of the transformer block with shape (batch_size, seq_length, d_model).
        """
        # 🔹 TODO Implement the forward pass
        # hint: the forward pass is similar to the one in the lecture slides. You should use the MultiHeadAttention and FeedForward modules that you implemented earlier
        # hint: remember to use residual connections (x + layer_output) after attention and feed-forward
        # hint: remember to apply layer normalization before attention and feed-forward
        # hint: remember to apply dropout after attention and feed-forward but before residual
        # YOUR CODE HERE
        # Pass the configured attention dropout instead of relying on the default value
        attn_weights, attn_output = self.attn(self.layer_norm1(x), causal_mask=causal_mask, attn_pdrop=self.attn_pdrop)
        attn_output = self.dropout1(attn_output)
        x = x + attn_output
        ff_output = self.ff(self.layer_norm2(x))
        ff_output = self.dropout2(ff_output)
        # Residual connection around the feed-forward sub-layer:
        # add to x, which already includes the attention residual
        x = x + ff_output
        return x


### 2.6 Transformer Stack

We can now stack the transformer blocks on top of each other to implement the transformer stack.  
The output of each block is passed as the input to the next block.

**Key Concepts**
- Sequential processing: Each layer builds upon the representations learned by previous layers
- Parameters of layers: All layers share the same architecture but have different learned parameters
- Depth vs width: The number of layers (depth) is crucial for model capacity, while d_model represents width
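
One practical detail when stacking layers in PyTorch: the layers should be stored in an `nn.ModuleList` rather than a plain Python list, otherwise their parameters are not registered with the parent module. The sketch below illustrates this with small `nn.Linear` layers as stand-ins for transformer blocks; the class names are hypothetical.

```python
import torch.nn as nn

class WithModuleList(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # Registered: parameters are visible to .parameters(), optimizers and .to(device)
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(num_layers)])

class WithPlainList(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # Not registered: the parent module does not know about these layers
        self.layers = [nn.Linear(4, 4) for _ in range(num_layers)]

def count_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

registered = count_parameters(WithModuleList(3))
unregistered = count_parameters(WithPlainList(3))
print(registered, unregistered)
```

Each `nn.Linear(4, 4)` has 16 weights and 4 biases, so three independent layers give 60 registered parameters, while the plain list registers none.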

In [30]:
class TransformerStack(nn.Module):
    def __init__(self, num_layers: int, d_model: int, num_heads: int, attn_pdrop: float, dropout: float, d_ff: int, eps: float):
        """
        A stack of Transformer blocks that processes input sequentially through multiple layers.
        This architecture allows for deep hierarchical processing of input sequences.

        Architecture:
            Input → TransformerBlock₁ → TransformerBlock₂ → ... → TransformerBlockₙ → Output
        """
        super(TransformerStack, self).__init__()

        # 🔹 TODO Implement
        # hint: you should create a list of TransformerBlock modules and store it in self.layers
        self.layers = nn.ModuleList([TransformerBlock(d_model, num_heads, attn_pdrop, dropout, d_ff, eps) for _ in range(num_layers)])


    def forward(self, x: Tensor, causal_mask: bool) -> Tensor:
        """
        Process input through all transformer blocks sequentially.

        Implementation steps:
        1. Iterate through each layer in the stack
        2. Pass the output of each layer as input to the next layer
        3. Maintain the causal mask throughout if specified

        Args:
            x (Tensor): Input tensor of shape (batch_size, seq_length, d_model)
            causal_mask (bool): If True, apply causal masking in all attention layers

        Returns:
            Tensor: Processed output of shape (batch_size, seq_length, d_model)
        """
        # 🔹 TODO: Implement
        # YOUR CODE HERE
        for layer in self.layers:
            x = layer(x, causal_mask)
        return x   # 🔻 REPLACE x with whatever your implementation returns

### 2.7 Building a Complete Transformer Model

Finally we can put the input embeddings and the transformer stack together to implement the transformer model.  
This includes the input pipeline, the transformer stack and the output layer.  
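
The output layer produces one vector of vocabulary logits per position. As a sketch of how such logits are typically used for language modeling (the training loop is not part of this section, so this is an assumption about later use), position `t` is trained to predict token `t + 1`. The all-zero logits below are a stand-in for a model output.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, seq_length, vocab_size = 2, 6, 50
token_ids = torch.randint(0, vocab_size, (batch_size, seq_length))
logits = torch.zeros(batch_size, seq_length, vocab_size)  # stand-in for model(token_ids, causal_mask=True)

# Position t predicts token t + 1: drop the last position's logits and the first token
pred = logits[:, :-1, :].reshape(-1, vocab_size)  # (batch_size * (seq_length - 1), vocab_size)
target = token_ids[:, 1:].reshape(-1)             # (batch_size * (seq_length - 1),)

loss = F.cross_entropy(pred, target)
print(loss.item(), math.log(vocab_size))
```

With uniform logits the loss equals `log(vocab_size)`, which is a useful sanity value to expect from an untrained model with near-uniform predictions.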

In [33]:
class TransformerModel(nn.Module):
    """
    Complete Transformer model for language modeling tasks.

    Architecture:
        Input Tokens → Token Embeddings → Transformer Stack → Language Model Head → Output Logits
    """
    def __init__(self, vocab_size: int, d_model: int, num_heads: int, attn_pdrop: float, dropout: float, d_ff: int, max_len: int, num_layers: int, eps: float):
        super(TransformerModel, self).__init__()

        # 🔹 TODO: Implement
        # hint: you should create the TokenEmbedder and TransformerStack modules
        # hint: at the end, you should add a linear layer to convert the transformer output to the vocabulary size
        # this is the language model head which allows us to predict the next token in the sequence

        # YOUR CODE HERE
        self.token_embedder = TokenEmbedder(vocab_size, d_model, max_len)
        self.transformer_stack = TransformerStack(num_layers, d_model, num_heads, attn_pdrop, dropout, d_ff, eps)
        self.output_layer = nn.Linear(d_model, vocab_size)


    def forward(self, x: Tensor, causal_mask: bool) -> Tensor:
        # 🔹 TODO: Implement
        # YOUR CODE HERE
        x = self.token_embedder(x)
        x = self.transformer_stack(x, causal_mask)
        x = self.output_layer(x)
        return x # 🔻 REPLACE None with whatever your implementation returns

Let's test your model!

In [36]:
# hyperparameters

vocab_size = tokenizer.vocab_size
d_model = 768
num_heads = 12
attn_pdrop = 0.1
dropout = 0.1
d_ff = 3072
max_len = 20
num_layers = 12
eps = 1e-6

# create the model
model = TransformerModel(vocab_size, d_model, num_heads, attn_pdrop, dropout, d_ff, max_len, num_layers, eps)

token_ids = tokenizer(sample_texts, return_tensors="pt", max_length=max_len, padding="longest", truncation=True)['input_ids']
output = model(token_ids, causal_mask=True)

## 3. Reflection questions

Please answer the following questions in the markdown cells below.

1- What is the purpose of the output projection $W_O$ in the transformer model?

TODO: The output projection $W_O$ mixes the concatenated outputs of all attention heads back into a single $d_{model}$-dimensional representation, so that information from different heads can be combined.

2- What is the total number of trainable parameters in the multi-head self-attention layer that you implemented above?  
Provide the final answer first, and then show your work.

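TODO: 2,362,368 (for $d_{model} = 768$).  
The layer has four linear maps ($W_Q$, $W_K$, $W_V$, $W_O$), each with a $768 \times 768$ weight matrix and a bias of size 768: $4 \cdot 768 \cdot 768 + 4 \cdot 768 = 2{,}362{,}368$.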
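
The count for question 2 can be checked by instantiating four `d_model x d_model` linear layers with biases, mirroring the Q, K, V and output projections, and summing their parameter sizes. The `nn.ModuleDict` container is only used here for illustration.

```python
import torch.nn as nn

d_model = 768

# Four linear maps with bias: query, key, value and output projections
projections = nn.ModuleDict(
    {name: nn.Linear(d_model, d_model) for name in ["q", "k", "v", "out"]}
)

num_params = sum(p.numel() for p in projections.parameters() if p.requires_grad)
print(num_params)
```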

3- What is the time complexity of the multi-head self-attention layer that you implemented above?  
Provide your answer in terms of the sequence length $n$, the dimension of the input sequence $d_{model}$.
Also show your work on how you arrived at your answer.

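TODO:
- Computing Q, K, V: three projections of an $n \times d_{model}$ matrix by $d_{model} \times d_{model}$ weights, so $O(n\, d_{model}^2)$.
- Computing the scores $QK^\top$: each of the $h$ heads multiplies an $n \times d_k$ matrix by a $d_k \times n$ matrix, which is $O(n^2 d_k)$ per head and $O(n^2 d_{model})$ in total since $h \cdot d_k = d_{model}$.
- Computing the softmax: $O(h\, n^2)$.
- Computing the weighted sum of the values: $O(n^2 d_{model})$.
- Output projection: $O(n\, d_{model}^2)$.

So the overall time complexity is $O(n^2 d_{model} + n\, d_{model}^2)$; the $n^2$ term dominates for long sequences.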
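
As a rough check for question 3, the multiply-add counts of the main matrix products can be tallied directly. `attention_mult_adds` is a hypothetical helper that ignores constants, biases and the softmax; it only shows how each term scales with the sequence length $n$.

```python
def attention_mult_adds(n: int, d_model: int) -> dict:
    """Approximate multiply-add counts per sequence for multi-head self-attention."""
    return {
        "qkv_projections": 3 * n * d_model * d_model,  # X @ W_Q, X @ W_K, X @ W_V
        "scores": n * n * d_model,                     # Q @ K^T summed over all heads
        "weighted_sum": n * n * d_model,               # attention_weights @ V over all heads
        "output_projection": n * d_model * d_model,    # context @ W_O
    }

short = attention_mult_adds(n=128, d_model=768)
long = attention_mult_adds(n=256, d_model=768)

# Doubling n quadruples the score term but only doubles the projection term
print(long["scores"] / short["scores"])
print(long["qkv_projections"] / short["qkv_projections"])
```

The projection terms grow linearly in $n$, while the score and weighted-sum terms grow quadratically, which is where the $n^2$ dependence comes from.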


4- [Extra credit] Modify your implementation above (by adding a block of code below) to make your model work as an encoder-decoder transformer instead of a decoder-only model.  
Explain only the required changes; don't just dump the new code.  
However, when needed, provide concrete implementation changes in the form of short snippets.

In [None]:
# Optional: Encoder-decoder modifications