# Bastien GUILLOU and Ryan KHOU

# Chapter 4 - Exercises

> Author: Badr TAJINI - Large Language Models (LLMs) - ESIEE 2024-2025

---

# Exercise 4.1: Parameters in the feed forward versus attention module

**Key Exercise Question: How do the parameter counts differ between the `feed-forward` neural network module and `multi-head attention` mechanism in our transformer architecture?**

*Methodological Approach:*
The investigation focuses on a systematic computational analysis of parameter allocation across two critical transformer neural network components:

1. **Feed-Forward Neural Network Module**
   - Characterization: Nonlinear transformation module
   - Primary computational function: Introducing network complexity and representational capacity
   - Parametric considerations: Linear transformation layers, activation functions

2. **Multi-Head Attention Mechanism**
   - Characterization: Contextual feature interaction module
   - Primary computational function: Capturing inter-token relational dynamics
   - Parametric considerations: Projection matrices, attention computation

*Analytical Objectives:*
- Quantify the exact number of trainable parameters in each architectural component
- Comparative assessment of parametric complexity
- Understand the relative computational resource allocation

*Theoretical Implications:*
- Insights into architectural parameter efficiency
- Empirical understanding of transformer module design
- Potential implications for model optimization and architectural design

*Computational Methodology:*
1. Enumerate parameters in `feed-forward` module
2. Enumerate parameters in `multi-head attention` module
3. Perform comparative statistical analysis
4. Interpret parametric distribution characteristics

*Recommended Investigative Approach:*
- Utilize precise computational tracing
- Consider layer-specific parameter counting
- Account for bias terms and weight matrices

# Solution

In [1]:
import torch
import torch.nn as nn
from gpt import TransformerBlock

GPT_CONFIG_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.1,
    "qkv_bias": False
}

block = TransformerBlock(GPT_CONFIG_124M)
print(block)

# Count parameters
params_ff = sum(p.numel() for p in block.ff.parameters() if p.requires_grad)
params_mha = sum(p.numel() for p in block.att.parameters() if p.requires_grad)

print(f"Feed-Forward Module Parameters: {params_ff}")
print(f"Multi-Head Attention Parameters: {params_mha}")


TransformerBlock(
  (att): MultiHeadAttention(
    (W_query): Linear(in_features=768, out_features=768, bias=False)
    (W_key): Linear(in_features=768, out_features=768, bias=False)
    (W_value): Linear(in_features=768, out_features=768, bias=False)
    (out_proj): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (ff): FeedForward(
    (layers): Sequential(
      (0): Linear(in_features=768, out_features=3072, bias=True)
      (1): GELU()
      (2): Linear(in_features=3072, out_features=768, bias=True)
    )
  )
  (norm1): LayerNorm()
  (norm2): LayerNorm()
  (drop_shortcut): Dropout(p=0.1, inplace=False)
)
Feed-Forward Module Parameters: 4722432
Multi-Head Attention Parameters: 2360064


1. Enumerate parameters in feed-forward module

The feed-forward module of the transformer block consists of two linear layers. The first expands each 768-dimensional input to 3,072 hidden units (followed by the GELU activation, which has no parameters), and the second projects the result back from 3,072 to 768 dimensions. This gives:

- First layer: 768 × 3072 + 3072 (bias) = 2,362,368
- Second layer: 3072 × 768 + 768 (bias) = 2,360,064

Total 4,722,432 parameters for the feed-forward module.
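
The arithmetic above can be reproduced without instantiating the model. The sketch below assumes, as in the printed module structure, two biased linear layers with a 4× hidden expansion; the helper names `linear_params` and `feed_forward_params` are hypothetical and not part of `gpt.py`.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight matrix plus optional bias vector of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def feed_forward_params(emb_dim: int, expansion: int = 4) -> int:
    """Assumes Linear(emb_dim -> expansion * emb_dim), GELU, Linear(back to emb_dim)."""
    hidden = expansion * emb_dim
    # GELU has no trainable parameters, so only the two linear layers count.
    return linear_params(emb_dim, hidden) + linear_params(hidden, emb_dim)


print(f"First layer:  {linear_params(768, 3072):,}")
print(f"Second layer: {linear_params(3072, 768):,}")
print(f"Total:        {feed_forward_params(768):,}")
```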

2. Enumerate parameters in multi-head attention module

The multi-head attention module has four 768 × 768 linear layers (queries, keys, values, and the output projection). Because `qkv_bias` is `False`, the first three have no bias term; only the output projection includes one:

- Matrices for queries, keys, values: 3 × (768 × 768) = 1,769,472
- Output projection (with bias): 768 × 768 + 768 = 590,592
    
That is a total of 2,360,064 parameters for multi-head attention.
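
As a cross-check, the same count can be written as a small function. This is a minimal sketch that assumes the layer layout shown in the printed block (three query/key/value projections plus a biased output projection); `attention_params` is a hypothetical helper name.

```python
def attention_params(emb_dim: int, qkv_bias: bool = False) -> int:
    """Parameter count of the attention module, assuming d_in == d_out == emb_dim."""
    # Each of W_query, W_key, W_value is emb_dim x emb_dim, with an optional bias.
    qkv_each = emb_dim * emb_dim + (emb_dim if qkv_bias else 0)
    # The output projection always carries a bias in the printed structure.
    out_proj = emb_dim * emb_dim + emb_dim
    return 3 * qkv_each + out_proj


print(f"qkv_bias=False: {attention_params(768):,}")
print(f"qkv_bias=True:  {attention_params(768, qkv_bias=True):,}")
```

The number of heads does not appear in the formula: the heads share the 768 output dimensions of each projection, so changing `n_heads` alone leaves the parameter count unchanged.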

3. Perform comparative statistical analysis

The feed-forward module contains almost exactly twice as many parameters as the multi-head attention module (a ratio of about 2.001):

- Feed-forward: 4,722,432 parameters
- Multi-head attention: 2,360,064 parameters
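
The ratio is not specific to `emb_dim` = 768. Under the same assumptions (4× expansion in the feed-forward module, no bias on the query, key, and value projections), the counts reduce to 8d² + 5d and 4d² + d for embedding width d, so the ratio stays just above 2. The following sketch, with hypothetical helper names, evaluates it for the four GPT-2 widths:

```python
def ff_params(d: int) -> int:
    # Linear(d -> 4d) and Linear(4d -> d), both with bias: 8d^2 + 5d
    return 8 * d * d + 5 * d


def mha_params(d: int) -> int:
    # Four d x d matrices, bias only on the output projection: 4d^2 + d
    return 4 * d * d + d


for d in (768, 1024, 1280, 1600):
    ratio = ff_params(d) / mha_params(d)
    print(f"emb_dim={d}: feed-forward / attention = {ratio:.4f}")
```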

4. Interpret parametric distribution characteristics

The feed-forward module holds about two thirds of the parameters of these two modules combined. Its two large linear layers, separated by a GELU nonlinearity, are applied to every token position independently and provide most of the block's capacity to transform and re-represent features.

The multi-head attention module has about half as many parameters and serves a different purpose: it models the relationships between tokens in a sequence. Although it adds less to the parameter count, it is the only component of the block that lets information flow between positions, so it is essential for interpreting how the elements of an input relate to one another.

# Exercise 4.2: Initialize larger GPT models

- **GPT2-small** (the 124M configuration we already implemented):
    - "emb_dim" = 768
    - "n_layers" = 12
    - "n_heads" = 12

- **GPT2-medium:**
    - "emb_dim" = 1024
    - "n_layers" = 24
    - "n_heads" = 16

- **GPT2-large:**
    - "emb_dim" = 1280
    - "n_layers" = 36
    - "n_heads" = 20

- **GPT2-XL:**
    - "emb_dim" = 1600
    - "n_layers" = 48
    - "n_heads" = 25

**Key Exercise Question: Can you systematically scale the GPT-2 model architecture from the small configuration to medium, large, and XL variants by exclusively modifying the configuration parameters?**

*Architectural Scaling Challenge:*
This exercise explores the methodological expansion of the GPT-2 model across different scales, demonstrating how architectural complexity can be incrementally increased through strategic parameter modifications.

*Model Variants to Implement:*
1. **GPT-2 Small (Current Implementation)**
   - Embedding Dimensions ("emb_dim"): 768
   - Transformer Blocks ("n_layers"): 12
   - Multi-Head Attention Heads ("n_heads"): 12

2. **GPT-2 Medium**
   - Embedding Dimensions ("emb_dim"): 1,024
   - Transformer Blocks ("n_layers"): 24
   - Multi-Head Attention Heads ("n_heads"): 16

3. **GPT-2 Large**
   - Embedding Dimensions ("emb_dim"): 1,280
   - Transformer Blocks ("n_layers"): 36
   - Multi-Head Attention Heads ("n_heads"): 20

4. **GPT-2 XL**
   - Embedding Dimensions ("emb_dim"): 1,600
   - Transformer Blocks ("n_layers"): 48
   - Multi-Head Attention Heads ("n_heads"): 25

*Methodological Constraints:*
- Modify only the configuration file
- Utilize the existing `GPTModel` class without code alterations
- Demonstrate parameter scaling capabilities
- Calculate total parameters for each model variant

**Bonus Challenge:**
**Compute the total number of trainable parameters for each model variant, highlighting how quickly model size grows from one variant to the next.**



# Solution

In [4]:
import torch
from gpt import GPTModel

configurations = {
    "GPT-2 Small": {
        "emb_dim": 768,
        "n_layers": 12,
        "n_heads": 12,
        "vocab_size": 50257,
        "context_length": 1024,  
        "drop_rate": 0.1,  
        "qkv_bias": False      
    },
    "GPT-2 Medium": {
        "emb_dim": 1024,
        "n_layers": 24,
        "n_heads": 16,
        "vocab_size": 50257,
        "context_length": 1024,
        "drop_rate": 0.1,
        "qkv_bias": False    
    },
    "GPT-2 Large": {
        "emb_dim": 1280,
        "n_layers": 36,
        "n_heads": 20,
        "vocab_size": 50257,
        "context_length": 1024,
        "drop_rate": 0.1,
        "qkv_bias": False      
    },
    "GPT-2 XL": {
        "emb_dim": 1600,
        "n_layers": 48,
        "n_heads": 25,
        "vocab_size": 50257,
        "context_length": 1024,
        "drop_rate": 0.1,
        "qkv_bias": False       
    }
}


models = {}
for name, cfg in configurations.items():
    model = GPTModel(cfg)
    models[name] = model
    print(f"Model: {name}")
    total_params = sum(p.numel() for p in model.parameters())
    print(f"Total number of parameters: {total_params:,}")
    total_params_gpt2 =  total_params - sum(p.numel() for p in model.out_head.parameters())
    print(f"Number of trainable parameters considering weight tying: {total_params_gpt2:,}")
    print("\n")


Model: GPT-2 Small
Total number of parameters: 163,009,536
Number of trainable parameters considering weight tying: 124,412,160


Model: GPT-2 Medium
Total number of parameters: 406,212,608
Number of trainable parameters considering weight tying: 354,749,440


Model: GPT-2 Large
Total number of parameters: 838,220,800
Number of trainable parameters considering weight tying: 773,891,840


Model: GPT-2 XL
Total number of parameters: 1,637,792,000
Number of trainable parameters considering weight tying: 1,557,380,800
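
These totals can be cross-checked with a closed-form count that needs no PyTorch. The sketch below is an approximation of the architecture under stated assumptions: a 4× feed-forward expansion, a biased output projection in attention, and a `LayerNorm` with one learnable scale and one learnable shift vector of size `emb_dim` (the printed `LayerNorm()` does not show its parameters, so this is an assumption). The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(cfg: dict) -> tuple[int, int]:
    """Return (total parameters, parameters without the output head)."""
    d, n = cfg["emb_dim"], cfg["n_layers"]
    vocab, ctx = cfg["vocab_size"], cfg["context_length"]

    qkv_bias = d if cfg["qkv_bias"] else 0
    attention = 3 * (d * d + qkv_bias) + (d * d + d)
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)
    layer_norm = 2 * d  # assumed: scale and shift vectors
    block = attention + feed_forward + 2 * layer_norm

    embeddings = vocab * d + ctx * d  # token and positional embeddings
    out_head = d * vocab              # no bias on the output head
    total = embeddings + n * block + layer_norm + out_head
    return total, total - out_head


# n_heads is omitted because it does not affect the parameter count.
BASE = {"vocab_size": 50257, "context_length": 1024, "qkv_bias": False}
SIZES = {
    "GPT-2 Small": {"emb_dim": 768, "n_layers": 12},
    "GPT-2 Medium": {"emb_dim": 1024, "n_layers": 24},
    "GPT-2 Large": {"emb_dim": 1280, "n_layers": 36},
    "GPT-2 XL": {"emb_dim": 1600, "n_layers": 48},
}

for name, size in SIZES.items():
    total, tied = gpt_param_count({**BASE, **size})
    print(f"{name}: total = {total:,}, with weight tying = {tied:,}")
```

The formula also shows where the growth comes from: the transformer blocks contribute roughly `n_layers` × 12 × `emb_dim`² parameters, so increasing both depth and width compounds quickly.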




# Exercise 4.3: Using separate dropout parameters

**Key Exercise Question: How can we enhance the dropout configuration of the GPT model by implementing layer-specific dropout rates?**

*Architectural Dropout Refinement:*
The current implementation employs a uniform dropout rate across multiple model components, which presents an opportunity for more nuanced regularization strategies. This exercise challenges you to develop a more sophisticated approach to dropout implementation within neural network architectures.

*Dropout Localization:*
Three critical architectural components require distinct dropout configurations:
1. Embedding Layer
2. Shortcut (Residual) Connections
3. Multi-Head Attention Module

*Methodological Approach:*
You must modify the existing `GPT_CONFIG_124M` configuration to:
- Replace the monolithic `drop_rate` parameter
- Introduce a hierarchical dropout configuration
- Maintain the overall structural integrity of the model architecture

*Conceptual Challenge:*
The exercise requires a deep understanding of:
- Regularization techniques in neural network design
- The functional role of dropout in different architectural components
- Systematic configuration of model hyperparameters

# Solution

In [2]:
GPT_CONFIG_124M = {
    "vocab_size": 50257,  
    "context_length": 1024, 
    "emb_dim": 768,          
    "n_heads": 12,           
    "n_layers": 12,         
    "drop_rate_emb": 0.1,  
    "drop_rate_res": 0.05,   
    "drop_rate_attn": 0.15,  
    "qkv_bias": False        
}
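
If several existing configurations still use the single `drop_rate` key, they can be converted programmatically rather than edited by hand. The helper below is a hypothetical convenience, not part of `gpt.py`: it copies the configuration, removes `drop_rate`, and fills the three new keys, falling back to the old uniform rate wherever no specific value is given.

```python
def split_drop_rate(cfg: dict, emb=None, res=None, attn=None) -> dict:
    """Return a copy of cfg with 'drop_rate' replaced by three specific rates."""
    base = cfg.get("drop_rate", 0.0)
    new_cfg = {k: v for k, v in cfg.items() if k != "drop_rate"}
    rates = {"drop_rate_emb": emb, "drop_rate_res": res, "drop_rate_attn": attn}
    for key, rate in rates.items():
        rate = base if rate is None else rate
        # A dropout probability must lie in [0, 1].
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{key} must be in [0, 1], got {rate}")
        new_cfg[key] = rate
    return new_cfg


old_cfg = {"emb_dim": 768, "n_heads": 12, "drop_rate": 0.1, "qkv_bias": False}
print(split_drop_rate(old_cfg, res=0.05, attn=0.15))
```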

In [3]:
import torch
import torch.nn as nn
from gpt import LayerNorm, MultiHeadAttention, FeedForward

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"], 
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate_attn"],  # Attention dropout rate
            qkv_bias=cfg["qkv_bias"] 
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate_res"])

    def forward(self, x):
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut

        return x


class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate_emb"])  # Embedding dropout rate

        self.trf_blocks = nn.Sequential(*[TransformerBlock(cfg) for _ in range(cfg["n_layers"])])

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)

        return logits


We can test our change like this:

In [None]:
def test_model_initialization_and_forward_pass():
    cfg = {
        "vocab_size": 50257,
        "context_length": 1024,
        "emb_dim": 768,
        "n_heads": 12,
        "n_layers": 12,
        "drop_rate_emb": 0.1,
        "drop_rate_res": 0.05,
        "drop_rate_attn": 0.15,
        "qkv_bias": False
    }

    model = GPTModel(cfg)
    model.eval()  # evaluation mode: dropout layers are disabled
    
    dummy_input = torch.randint(0, cfg['vocab_size'], (1, cfg['context_length']))
    
    try:
        output = model(dummy_input)
        print("Forward pass successful. Output shape:", output.shape)
    except Exception as e:
        print("Error during forward pass:", e)

test_model_initialization_and_forward_pass()


Forward pass successful. Output shape: torch.Size([1, 1024, 50257])
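
Because the test calls `model.eval()`, dropout is inactive and the three rates do not influence this output; the test only confirms that the new configuration keys are wired in correctly. As a rough illustration of what each rate controls in training mode, the sketch below implements inverted dropout in plain Python. It is a hypothetical helper for illustration, not PyTorch's implementation, although `nn.Dropout` applies the same rule of zeroing elements with probability `p` and scaling the survivors by 1 / (1 - p).

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of floats (illustrative sketch only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("this sketch only supports 0 <= p < 1")
    if not training or p == 0.0:
        return list(values)  # evaluation mode: values pass through unchanged
    scale = 1.0 / (1.0 - p)  # keeps the expected value of each element the same
    return [0.0 if rng.random() < p else v * scale for v in values]


rng = random.Random(0)
x = [1.0] * 10
for name, p in [("emb", 0.1), ("res", 0.05), ("attn", 0.15)]:
    print(name, dropout(x, p, training=True, rng=rng))
print("eval", dropout(x, 0.15, training=False, rng=rng))
```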
