# Chapter 3 - Exercises

> Author : Badr TAJINI - Large Language model (LLMs) - ESIEE 2024-2025

---

# Exercise 3.1

The `nn.Linear` layers in `SelfAttention_v2` use a different weight initialization scheme than the `nn.Parameter(torch.rand(d_in, d_out))` used in `SelfAttention_v1`, so the two classes produce different outputs for the same input. To check that the two implementations are otherwise equivalent, we transfer the weights from one instance to the other and compare the resulting outputs.

**Key Exercise Question: Can you transfer the weights from `SelfAttention_v2` to `SelfAttention_v1` such that both implementations produce identical output tensors?**

*Specific Challenges:*
- Recognize that `nn.Linear` stores its weight matrix in transposed form, with shape `(d_out, d_in)`
- Carefully map and transfer weights between the two self-attention implementations
- Verify that the transferred weights result in mathematically equivalent computational results

The primary objective is to systematically transfer weight matrices from an instantiated `SelfAttention_v2` object to a `SelfAttention_v1` instance, requiring a nuanced understanding of the underlying weight matrix representation.
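
The transposition mentioned above can be checked directly. The following is a minimal sketch, with the dimensions and the variable names `linear` and `manual` chosen only for illustration. It shows that `nn.Linear` keeps its weight with shape `(d_out, d_in)` and that multiplying the input by the transposed weight reproduces the layer's output when no bias is used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d_in, d_out = 4, 8                      # illustrative sizes, not prescribed by the exercise
linear = nn.Linear(d_in, d_out, bias=False)
x = torch.rand(5, d_in)

# nn.Linear stores its weight as (out_features, in_features)
print(linear.weight.shape)              # torch.Size([8, 4])

# Multiplying by the transposed weight matches the layer's forward pass
manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual))
```

This is why the weights must be transposed before they are copied into the `(d_in, d_out)` parameters of `SelfAttention_v1`.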

The following exercises build on two extensions of the self-attention mechanism:

1. **Causal Masking**: This modification introduces a constraint preventing the attention mechanism from accessing future sequence elements. Such a constraint is particularly pivotal in generative language modeling contexts, where each token's prediction must be conditioned exclusively on preceding contextual information.

2. **Multi-Head Attention**: This approach involves partitioning the attention mechanism into parallel computational "heads." Each head operates as a distinct learnable feature extractor, capable of capturing diverse representational characteristics across different subspaces and positional contexts. By enabling simultaneous multi-perspective representation learning, this technique substantially augments the model's capacity to process complex, high-dimensional representations.

These architectural refinements collectively contribute to more sophisticated and contextually aware neural network architectures, particularly in sequence modeling domains.
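
As a rough illustration of causal masking, the sketch below applies an upper-triangular mask to a small matrix of made-up attention scores before the softmax. The scores are random placeholders, and the variable names are assumptions for this example only.

```python
import torch

torch.manual_seed(123)

num_tokens = 4
attn_scores = torch.rand(num_tokens, num_tokens)   # placeholder scores

# True above the diagonal marks "future" positions that must be hidden
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()

# Setting masked scores to -inf makes their softmax weight exactly zero
masked_scores = attn_scores.masked_fill(mask, -torch.inf)
attn_weights = torch.softmax(masked_scores, dim=-1)

print(attn_weights)
print("Future weights are zero:", bool(torch.all(attn_weights[mask] == 0)))
```

Each row still sums to 1, so every token distributes its attention only over itself and earlier tokens.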

In [34]:
import torch.nn as nn
import torch

In [18]:
class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        
        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec

In [20]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

In [44]:
d_in, d_out = 4, 8
x = torch.rand(5, d_in)

v1 = SelfAttention_v1(d_in, d_out)
v2 = SelfAttention_v2(d_in, d_out)

In [46]:
v1.W_query

Parameter containing:
tensor([[0.0685, 0.6606, 0.4592, 0.1848, 0.5046, 0.4392, 0.1914, 0.4091],
        [0.0777, 0.8080, 0.9856, 0.6567, 0.9515, 0.8317, 0.0852, 0.8176],
        [0.8297, 0.6035, 0.5858, 0.1195, 0.2602, 0.5640, 0.5136, 0.5549],
        [0.7499, 0.0186, 0.5960, 0.7015, 0.3423, 0.1344, 0.2942, 0.5245]],
       requires_grad=True)

In [48]:
v2.W_query.weight

Parameter containing:
tensor([[-0.1911,  0.4821,  0.3072,  0.2805],
        [-0.3136,  0.4807,  0.4617,  0.4485],
        [ 0.2543, -0.0261, -0.2113,  0.3516],
        [-0.0614, -0.0181, -0.4040,  0.1788],
        [ 0.3506,  0.0157, -0.4754,  0.3092],
        [-0.1775, -0.0442, -0.3498,  0.2022],
        [-0.4878,  0.2346, -0.0882, -0.0358],
        [ 0.0382,  0.4425,  0.2727, -0.3696]], requires_grad=True)

In [50]:
with torch.no_grad():
    # Transpose and assign weights
    v1.W_query.copy_(v2.W_query.weight.T)
    v1.W_key.copy_(v2.W_key.weight.T)
    v1.W_value.copy_(v2.W_value.weight.T)

output_v1 = v1(x)
output_v2 = v2(x)

print("Output from SelfAttention_v1:\n", output_v1)
print("Output from SelfAttention_v2:\n", output_v2)
print("Are outputs identical? ", torch.allclose(output_v1, output_v2, atol=1e-6))


Output from SelfAttention_v1:
 tensor([[-0.1959,  0.1466, -0.0125, -0.1522,  0.1652,  0.0632, -0.4644, -0.2865],
        [-0.1924,  0.1482, -0.0101, -0.1501,  0.1689,  0.0585, -0.4633, -0.2887],
        [-0.1988,  0.1450, -0.0159, -0.1519,  0.1564,  0.0706, -0.4617, -0.2801],
        [-0.1958,  0.1468, -0.0130, -0.1510,  0.1624,  0.0648, -0.4627, -0.2843],
        [-0.1943,  0.1475, -0.0116, -0.1508,  0.1658,  0.0616, -0.4633, -0.2867]],
       grad_fn=<MmBackward0>)
Output from SelfAttention_v2:
 tensor([[-0.1959,  0.1466, -0.0125, -0.1522,  0.1652,  0.0632, -0.4644, -0.2865],
        [-0.1924,  0.1482, -0.0101, -0.1501,  0.1689,  0.0585, -0.4633, -0.2887],
        [-0.1988,  0.1450, -0.0159, -0.1519,  0.1564,  0.0706, -0.4617, -0.2801],
        [-0.1958,  0.1468, -0.0130, -0.1510,  0.1624,  0.0648, -0.4627, -0.2843],
        [-0.1943,  0.1475, -0.0116, -0.1508,  0.1658,  0.0616, -0.4633, -0.2867]],
       grad_fn=<MmBackward0>)
Are outputs identical?  True


# Exercise 3.2

**Key Exercise Question: How can you modify the input arguments to the `MultiHeadAttentionWrapper(num_heads=2)` to transform the output context vectors from four-dimensional to two-dimensional while maintaining the `num_heads=2` configuration?**

*Specific Challenges:*
- Identify the input parameter that controls the dimensionality of output context vectors
- Understand the relationship between input arguments and tensor shape
- Achieve dimensionality reduction without modifying the core `MultiHeadAttentionWrapper` class implementation

*Architectural Context:*
Up to this point, we have developed a `MultiHeadAttentionWrapper` that integrates multiple single-head attention modules through sequential processing, implemented via the comprehension `[head(x) for head in self.heads]` in the forward method. This current implementation represents a foundational approach to multi-head attention mechanisms.

*Potential Optimization Strategies:*
1. **Sequential Processing Limitation**: The current implementation processes attention heads sequentially, which may introduce computational inefficiencies.

2. **Parallel Processing Approach**: An optimization is to compute all attention head outputs at once with batched matrix multiplications instead of looping over the heads. This can improve performance by replacing several small matrix multiplications with fewer, larger ones.
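
The difference between the two strategies can be sketched for the projection step alone. The example below is a simplified illustration, not the full attention computation: the per-head weight matrices in `W_heads` are hypothetical random tensors, and the sizes are chosen arbitrarily.

```python
import torch

torch.manual_seed(0)

b, num_tokens, d_in = 1, 6, 3
head_dim, num_heads = 2, 2
x = torch.rand(b, num_tokens, d_in)

# Hypothetical per-head projection weights, one (d_in, head_dim) matrix per head
W_heads = [torch.rand(d_in, head_dim) for _ in range(num_heads)]

# Sequential: project with each head separately, then concatenate
sequential = torch.cat([x @ W for W in W_heads], dim=-1)

# Parallel: stack the weights side by side and use a single matmul
W_all = torch.cat(W_heads, dim=-1)       # (d_in, num_heads * head_dim)
parallel = x @ W_all

print(sequential.shape, parallel.shape)
print(torch.allclose(sequential, parallel))
```

The same idea, applied to queries, keys, and values and followed by a reshape into a `num_heads` dimension, is what a combined multi-head implementation relies on.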

*Theoretical Implications:*
The ability to dynamically adjust output dimensionality while maintaining the multi-head attention structure highlights the flexibility of modern neural network architectural designs. Such manipulations are crucial in adapting attention mechanisms to diverse computational requirements across different machine learning domains.

*Practical Recommendation:*
Carefully examine the input arguments of the `MultiHeadAttentionWrapper` and consider how specific parameters might influence the output tensor's dimensionality. The solution likely involves a subtle adjustment that does not require restructuring the core implementation.
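
Because the wrapper concatenates head outputs along the last dimension, the final width is `num_heads * d_out`. The sketch below illustrates this with plain `nn.Linear` layers standing in for the attention heads; the helper `concat_heads_dim` is hypothetical and exists only for this example.

```python
import torch
import torch.nn as nn

def concat_heads_dim(d_in, d_out, num_heads, num_tokens=6):
    """Return the last-dimension size after concatenating per-head outputs."""
    # nn.Linear is only a stand-in for a head that maps d_in -> d_out
    heads = nn.ModuleList(
        [nn.Linear(d_in, d_out, bias=False) for _ in range(num_heads)]
    )
    x = torch.rand(1, num_tokens, d_in)
    return torch.cat([head(x) for head in heads], dim=-1).shape[-1]

print(concat_heads_dim(d_in=3, d_out=2, num_heads=2))  # 4
print(concat_heads_dim(d_in=3, d_out=1, num_heads=2))  # 2
```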

In [63]:
class CausalAttention(nn.Module):

    def __init__(self, d_in, d_out, context_length,
                 dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout) # New
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1)) # New

    def forward(self, x):
        b, num_tokens, d_in = x.shape # New batch dimension b
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2) # Changed transpose
        attn_scores.masked_fill_(  # New, _ ops are in-place
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)  # `:num_tokens` handles inputs shorter than the supported context_length
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights) # New

        context_vec = attn_weights @ values
        return context_vec

In [65]:
class MultiHeadAttentionWrapper(nn.Module):

    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([head(x) for head in self.heads], dim=-1)

In [90]:
d_in, d_out = 3, 1  # d_out is per head: 2 heads x 1 dim = 2-dimensional context vectors
context_length = 6
x = torch.rand(1, context_length, d_in)

In [92]:
x

tensor([[[0.9153, 0.7751, 0.6749],
         [0.1166, 0.8858, 0.6568],
         [0.8459, 0.3033, 0.6060],
         [0.9882, 0.8363, 0.9010],
         [0.3950, 0.8809, 0.1084],
         [0.5432, 0.2185, 0.3834]]])

In [103]:
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)

context_vecs = mha(x)

print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[0.2575, 0.0361],
         [0.1946, 0.0328],
         [0.1508, 0.0100],
         [0.1602, 0.0085],
         [0.2221, 0.0366],
         [0.1916, 0.0276]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([1, 6, 2])


# Exercise 3.3

**Key Exercise Question: Can you configure a `MultiHeadAttention` module that precisely replicates the architectural specifications of the smallest GPT-2 model?**

*Specific Model Specifications:*
- Number of Attention Heads: 12
- Input/Output Embedding Dimensions: 768
- Context Length: 1,024 tokens

*Architectural Parameters:*
- `num_heads`: 12
- `d_in` and `d_out`: 768
- `context_length`: 1,024

*Theoretical Considerations:*
The proposed configuration mirrors the smallest variant of the GPT-2 model, which represents a fundamental architecture in transformer-based language models. By precisely replicating these specifications, we can explore the intricate design choices that contribute to the model's effectiveness in natural language processing tasks.

*Key Implementation Details:*
- Ensuring 12 parallel attention heads allows for multi-perspective feature representation
- The 768-dimensional embedding space provides a rich, high-dimensional representation of linguistic context
- The 1,024 token context length enables comprehensive sequence processing

*Practical Recommendation:*
Carefully construct the `MultiHeadAttention` initialization to match these exact specifications, paying close attention to the dimensionality and number of heads to accurately reproduce the smallest GPT-2 model's architectural characteristics.
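
The numbers implied by these specifications can be worked out with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes the layer layout of the `MultiHeadAttention` class used in this exercise: three bias-free projections for queries, keys, and values, plus an output projection with a bias.

```python
num_heads = 12
d_in = d_out = 768
context_length = 1024

# The class requires d_out to split evenly across heads
assert d_out % num_heads == 0
head_dim = d_out // num_heads

# W_query, W_key, W_value without bias (qkv_bias=False)
qkv_params = 3 * d_in * d_out
# out_proj is an nn.Linear with its default bias
out_proj_params = d_out * d_out + d_out
total_params = qkv_params + out_proj_params

# The causal mask is a buffer, not a trainable parameter
mask_entries = context_length * context_length

print("head_dim:", head_dim)              # 64
print("trainable params:", total_params)  # 2360064
print("mask entries:", mask_entries)      # 1048576
```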

In [106]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert (d_out % num_heads == 0), \
            "d_out must be divisible by num_heads"

        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads # Reduce the projection dim to match desired output dim

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # Linear layer to combine head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape

        keys = self.W_key(x) # Shape: (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (b, num_tokens, d_out) -> (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim) 
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (b, num_tokens, num_heads, head_dim) -> (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention) with a causal mask
        attn_scores = queries @ keys.transpose(2, 3)  # Dot product for each head

        # Original mask truncated to the number of tokens and converted to boolean
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]

        # Use the mask to fill attention scores
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Shape: (b, num_tokens, num_heads, head_dim)
        context_vec = (attn_weights @ values).transpose(1, 2) 
        
        # Combine heads, where self.d_out = self.num_heads * self.head_dim
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec) # optional projection

        return context_vec

In [113]:
d_in, d_out = 768, 768
context_length = 1024
x = torch.rand(8, context_length//8, d_in)  # batch of 8 sequences, 128 tokens each

In [121]:
mha = MultiHeadAttention(d_in, d_out, context_length, 0.0, num_heads=12)

context_vecs = mha(x)

print("context_vecs.shape:", context_vecs.shape)

context_vecs.shape: torch.Size([8, 128, 768])
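
To see how the 768-dimensional projections are divided among the 12 heads, the following sketch traces the tensor shapes through the split and merge steps. It uses a random tensor in place of the real query projection and reuses it as the keys purely to illustrate shapes, so the values themselves carry no meaning.

```python
import torch

b, num_tokens, d_out, num_heads = 8, 128, 768, 12
head_dim = d_out // num_heads

# Stand-in for the output of W_query(x)
queries = torch.rand(b, num_tokens, d_out)

# Split the last dimension into heads, then move the head axis forward
split = queries.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)
print(split.shape)    # torch.Size([8, 12, 128, 64])

# One (num_tokens x num_tokens) score matrix per head and batch element
scores = split @ split.transpose(2, 3)
print(scores.shape)   # torch.Size([8, 12, 128, 128])

# Merging the heads back restores the original layout exactly
merged = split.transpose(1, 2).contiguous().view(b, num_tokens, d_out)
print(torch.equal(merged, queries))  # True
```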
