# 04. The Transformer Block

In the previous notebook, we implemented the Self-Attention mechanism, which allows the model to attend to different parts of the input sequence. However, attention alone is not enough. To build a powerful language model, we need to combine attention with other components: **Layer Normalization**, **Feed-Forward Networks**, and **Residual Connections**.

These components come together to form the **Transformer Block**, the fundamental building block of the Transformer architecture.

## Goals
1.  Understand and implement **Layer Normalization**.
2.  Understand and implement the **Feed-Forward Network (FFN)**.
3.  Understand **Residual Connections**.
4.  Assemble the complete **Transformer Decoder Block**.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
from dataclasses import dataclass

# Configuration class to hold model hyperparameters
@dataclass
class ModelConfig:
    n_embd: int = 768
    n_head: int = 12
    n_layer: int = 12
    n_positions: int = 1024
    vocab_size: int = 50257
    dropout: float = 0.1
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

## 1. Layer Normalization

Layer Normalization (LayerNorm) is a technique used to stabilize the training of deep neural networks. It normalizes each token's activations across the feature dimension, rather than across the batch dimension as Batch Normalization does.

For a given input vector $x$, LayerNorm computes:

$$ \text{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$

where:
- $\mu$ is the mean of the elements in $x$.
- $\sigma^2$ is their variance (the biased estimate, which divides by the number of features).
- $\epsilon$ is a small constant for numerical stability.
- $\gamma$ (scale) and $\beta$ (shift) are learnable parameters.
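
To make the formula concrete, here is a minimal sketch that computes LayerNorm by hand and compares it with `torch.nn.functional.layer_norm`. The helper name `manual_layer_norm` is hypothetical. The sketch follows the form PyTorch documents, with the biased variance and $\epsilon$ inside the square root.

```python
import torch
import torch.nn.functional as F

def manual_layer_norm(x, gamma, beta, eps=1e-5):
    """Hypothetical helper: LayerNorm over the last dimension, written out by hand."""
    # Statistics are taken over the feature dimension only, separately per token.
    mu = x.mean(dim=-1, keepdim=True)
    # Biased variance (divide by N, not N - 1).
    var = x.var(dim=-1, unbiased=False, keepdim=True)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
x = torch.randn(2, 3, 4)   # [batch, seq_len, n_embd]
gamma = torch.ones(4)      # scale, initialized as nn.LayerNorm does
beta = torch.zeros(4)      # shift, initialized as nn.LayerNorm does

ours = manual_layer_norm(x, gamma, beta)
reference = F.layer_norm(x, (4,), weight=gamma, bias=beta, eps=1e-5)
print("Max abs difference:", (ours - reference).abs().max().item())
```

The two results should agree up to floating-point rounding.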

In PyTorch, we can use `nn.LayerNorm`.

In [2]:
# Example of LayerNorm
batch_size = 2
seq_len = 3
n_embd = 4

x = torch.randn(batch_size, seq_len, n_embd)
ln = nn.LayerNorm(n_embd)
out = ln(x)

print("Input shape:", x.shape)
print("Output shape:", out.shape)
print("Mean of output (approx 0):", out.mean(dim=-1))
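# Note: torch.std uses the unbiased estimator (divides by N - 1) by default, while
# LayerNorm normalizes with the biased one (divides by N). With N = 4 features this
# prints sqrt(4/3) ≈ 1.1547; pass unbiased=False to see values close to 1.
print("Std of output (approx 1):", out.std(dim=-1))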

Input shape: torch.Size([2, 3, 4])
Output shape: torch.Size([2, 3, 4])
Mean of output (approx 0): tensor([[-4.4703e-08, -2.9802e-08, -2.6077e-08],
        [-5.9605e-08,  5.9605e-08,  1.4901e-08]], grad_fn=<MeanBackward1>)
Std of output (approx 1): tensor([[1.1547, 1.1547, 1.1547],
        [1.1547, 1.1547, 1.1547]], grad_fn=<StdBackward0>)
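
The printed standard deviation is 1.1547 rather than 1 because `torch.std` applies Bessel's correction by default. The sketch below reruns the same setup and compares the two estimators; the seed is arbitrary and only there for repeatability.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 4
x = torch.randn(2, 3, n_embd)
out = nn.LayerNorm(n_embd)(x)

std_unbiased = out.std(dim=-1)                # divides by N - 1 (torch default)
std_biased = out.std(dim=-1, unbiased=False)  # divides by N, as LayerNorm does

print("Unbiased std:", std_unbiased)
print("Biased std:", std_biased)
print("sqrt(N / (N - 1)):", math.sqrt(n_embd / (n_embd - 1)))
```

The ratio between the two estimators is $\sqrt{N/(N-1)}$, which is $\sqrt{4/3} \approx 1.1547$ for four features. The gap shrinks as the feature dimension grows.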


## 2. Feed-Forward Network (FFN)

The Feed-Forward Network is a simple Multi-Layer Perceptron (MLP) applied independently to each position in the sequence. It consists of two linear transformations with a non-linear activation function in between.

The standard architecture expands the dimensionality by a factor of 4 and then projects it back.

$$ \text{FFN}(x) = \text{GELU}(x W_1 + b_1) W_2 + b_2 $$

We use the **GELU** (Gaussian Error Linear Unit) activation function, which is commonly used in Transformers (like GPT-2 and BERT).
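
As a rough illustration of what GELU computes, the sketch below writes out the exact form $x \, \Phi(x)$ and the common tanh approximation, then checks both against `F.gelu`. The helper names are hypothetical, and the `approximate="tanh"` argument assumes PyTorch 1.12 or later.

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF expressed through erf.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh-based approximation of the same function.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + torch.tanh(inner))

x = torch.linspace(-4.0, 4.0, steps=9)
exact_matches = torch.allclose(gelu_exact(x), F.gelu(x), atol=1e-5)
tanh_matches = torch.allclose(gelu_tanh(x), F.gelu(x, approximate="tanh"), atol=1e-5)
approx_gap = (gelu_exact(x) - gelu_tanh(x)).abs().max().item()

print("Exact form matches F.gelu:", exact_matches)
print("Tanh form matches F.gelu(approximate='tanh'):", tanh_matches)
print("Max gap between the two forms:", approx_gap)
```

`F.gelu` defaults to the exact form. The two variants differ only slightly, so the choice mostly matters when reproducing a specific checkpoint bit for bit.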

In [3]:
class FeedForward(nn.Module):
    """Position-wise feed-forward network."""
    
    def __init__(self, config: ModelConfig):
        super().__init__()
        # Inner dimension is usually 4 * n_embd
        self.fc1 = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.fc2 = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: [batch, seq_len, n_embd]
        x = self.fc1(x)
        x = F.gelu(x)
        x = self.fc2(x)
        x = self.dropout(x)
        return x
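
"Applied independently to each position" can be checked directly. The sketch below uses a small `nn.Sequential` stand-in for `FeedForward` (dropout omitted so the comparison is deterministic, and a width of 8 chosen purely for illustration). It runs the network on the whole sequence at once and then one position at a time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8  # small illustrative width, not the notebook's 768

# Stand-in for the FeedForward module above, without dropout.
ffn = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.GELU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(2, 5, n_embd)  # [batch, seq_len, n_embd]

with torch.no_grad():
    full = ffn(x)
    # Feed each position separately and stack the results along the sequence axis.
    per_position = torch.stack([ffn(x[:, t, :]) for t in range(x.size(1))], dim=1)

same = torch.allclose(full, per_position, atol=1e-5)
n_params = sum(p.numel() for p in ffn.parameters())
expected_params = 8 * n_embd * n_embd + 5 * n_embd  # two weight matrices plus two biases

print("Position-wise results match:", same)
print("Parameters:", n_params, "expected:", expected_params)
```

Because no information moves between positions here, all mixing across the sequence has to come from the attention layer.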

## 3. Multi-Head Attention (Recap)

For completeness, we include the `MultiHeadAttention` class from the previous notebook here so we can build the full block.

In [4]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention layer with causal masking.
    """
    
    def __init__(self, config: ModelConfig):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_dim = config.n_embd // config.n_head
        self.dropout = config.dropout
        
        # QKV projection
        self.qkv_proj = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        
        # Output projection
        self.out_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)
        
        # Dropout
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)
        
        # Causal mask
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.n_positions, config.n_positions))
            .view(1, 1, config.n_positions, config.n_positions)
        )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()  # batch, sequence length, embedding dimensionality
        assert T <= self.mask.size(-1), "sequence length exceeds n_positions"
        
        # Calculate QKV
        qkv = self.qkv_proj(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        
        # Reshape for multi-head attention
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)  # [B, n_head, T, head_dim]
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        
        # Attention scores
        attn = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.head_dim))
        
        # Apply causal mask
        attn = attn.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        
        # Softmax and dropout
        attn = F.softmax(attn, dim=-1)
        attn = self.attn_dropout(attn)
        
        # Apply attention to values
        y = attn @ v  # [B, n_head, T, head_dim]
        
        # Concatenate heads
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        
        # Output projection
        y = self.out_proj(y)
        y = self.resid_dropout(y)
        
        return y
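
The masking logic in `forward` can be sanity-checked in isolation. The sketch below is a simplified version with no projections and no dropout, using a hypothetical helper `causal_attention`. It compares the manual mask-and-softmax path with `F.scaled_dot_product_attention(..., is_causal=True)`, which assumes PyTorch 2.0 or later, and then confirms that changing the last token leaves earlier outputs untouched.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Hypothetical helper: scaled dot-product attention with a lower-triangular mask."""
    T, head_dim = q.size(-2), q.size(-1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))  # block attention to future positions
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
B, n_head, T, head_dim = 2, 3, 5, 4
q = torch.randn(B, n_head, T, head_dim)
k = torch.randn(B, n_head, T, head_dim)
v = torch.randn(B, n_head, T, head_dim)

manual, weights = causal_attention(q, k, v)
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Perturb only the last position's key and value.
k2, v2 = k.clone(), v.clone()
k2[:, :, -1, :] += 1.0
v2[:, :, -1, :] += 1.0
perturbed, _ = causal_attention(q, k2, v2)

print("Matches fused kernel:", torch.allclose(manual, fused, atol=1e-5))
print("Earlier positions unchanged:", torch.allclose(manual[:, :, :-1], perturbed[:, :, :-1]))
```

Only the output at the final position should change, since it is the only one allowed to attend to the perturbed token.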

## 4. The Transformer Block

Now we assemble the components. A GPT-style (decoder-only) Transformer block has no cross-attention and consists of:

1.  **LayerNorm** (before attention)
2.  **Multi-Head Attention**
3.  **Residual Connection** (add the attention output to the block input)
4.  **LayerNorm** (before FFN)
5.  **Feed-Forward Network**
6.  **Residual Connection** (add the FFN output to the result of step 3)

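This "Pre-Norm" arrangement (LayerNorm applied to the input of each sub-layer, inside the residual branch) is the one used by GPT-2 and GPT-3. The original Transformer instead normalized after the residual addition ("Post-Norm"). Pre-Norm is generally found to train more stably in deep stacks, because the residual path itself is never normalized.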
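
The difference between the two orderings is easiest to see on a toy sub-layer. In the sketch below, `sublayer` is a hypothetical stand-in for attention or the FFN, and its weights are zeroed so that only the residual path and the normalization remain.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 8
ln = nn.LayerNorm(n_embd)
sublayer = nn.Linear(n_embd, n_embd)  # hypothetical stand-in for attention or the FFN

def pre_norm(x):
    # Normalize inside the branch; the skip path carries x through unchanged.
    return x + sublayer(ln(x))

def post_norm(x):
    # Normalize after the addition; the skip path is rescaled as well.
    return ln(x + sublayer(x))

x = torch.randn(2, 5, n_embd) * 3.0 + 1.0  # deliberately not zero-mean, unit-variance

with torch.no_grad():
    # Silence the sub-layer so only the skip path contributes.
    sublayer.weight.zero_()
    sublayer.bias.zero_()
    pre_out = pre_norm(x)
    post_out = post_norm(x)

print("Pre-norm returns the input:", torch.allclose(pre_out, x))
print("Post-norm returns the input:", torch.allclose(post_out, x))
```

With the sub-layer silenced, the Pre-Norm block reduces to the identity, while the Post-Norm block still rescales its input. That untouched identity path is the usual intuition for why gradients pass more easily through deep Pre-Norm stacks.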

In [5]:
class TransformerBlock(nn.Module):
    """Transformer decoder block."""
    
    def __init__(self, config: ModelConfig):
        super().__init__()
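        # Pass config.bias through so the flag also controls the LayerNorm bias,
        # as the ModelConfig comment states (the bias argument needs PyTorch >= 2.1).
        self.ln1 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.attn = MultiHeadAttention(config)
        self.ln2 = nn.LayerNorm(config.n_embd, bias=config.bias)
        self.ffn = FeedForward(config)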
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm architecture (GPT-2 style)
        # 1. Attention branch
        x = x + self.attn(self.ln1(x))
        
        # 2. FFN branch
        x = x + self.ffn(self.ln2(x))
        return x

## 5. Verification

Let's verify that our block works as expected by passing some dummy data through it.

In [6]:
config = ModelConfig()
block = TransformerBlock(config)

print("Model Config:", config)

# Create dummy input: [batch_size, seq_len, n_embd]
x = torch.randn(2, 32, config.n_embd)

# Forward pass
output = block(x)

print("Input shape:", x.shape)
print("Output shape:", output.shape)

assert output.shape == x.shape, "Output shape must match input shape!"
print("Verification successful!")

Model Config: ModelConfig(n_embd=768, n_head=12, n_layer=12, n_positions=1024, vocab_size=50257, dropout=0.1, bias=True)
Input shape: torch.Size([2, 32, 768])
Output shape: torch.Size([2, 32, 768])
Verification successful!
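
Beyond the shape check, counting parameters is a cheap way to confirm that the block is wired as intended. The sketch below works the count out arithmetically for the default configuration (`n_embd=768`, `bias=True`). The helper names are hypothetical, and the causal mask is excluded because it is registered as a buffer, not a parameter.

```python
def linear_params(d_in, d_out):
    # Weight matrix plus bias vector.
    return d_in * d_out + d_out

def layer_norm_params(d):
    # Scale (gamma) plus shift (beta).
    return 2 * d

def block_params(n_embd):
    attn = linear_params(n_embd, 3 * n_embd) + linear_params(n_embd, n_embd)
    ffn = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
    norms = 2 * layer_norm_params(n_embd)
    return attn, ffn, norms

attn, ffn, norms = block_params(768)
total = attn + ffn + norms
print("Attention:", attn)
print("FFN:", ffn)
print("LayerNorms:", norms)
print("Total per block:", total)  # 7087872
```

The total can be compared against `sum(p.numel() for p in block.parameters())` for the block built above. Roughly two thirds of the parameters sit in the feed-forward network.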
