# 🧐 Attention is All You Need? Understanding the Attention Mechanism

In our previous notebooks, we explored RNNs and Seq2Seq models.

However, the world of NLP was revolutionized in 2017 by the paper **"Attention Is All You Need"** (Vaswani et al.). This paper introduced a new model, the **Transformer**, which *completely* discarded RNNs and relied *entirely* on a more powerful form of attention called **self-attention**.

This notebook is dedicated to understanding this mechanism. We will build, from scratch, the core components of Transformer-style attention:
1.  **The "Query, Key, Value" Concept:** The high-level theory.
2.  **Scaled Dot-Product Attention:** The core mathematical engine.
3.  **Multi-Head Attention (MHA):** The powerful, parallelized version that gives the Transformer its strength.
4.  **Putting it Together:** We'll build a full `EncoderBlock` to see how MHA is used in practice.

Our goal is to *isolate* and *understand* these components before we build a full Transformer model.

## 1. Setup and Imports

We'll be using PyTorch to build our components.

In [12]:
import torch
import torch.nn as nn
import math
import torch.nn.functional as F

print(f"PyTorch version: {torch.__version__}")

# Set a seed for reproducible results
SEED = 1234
torch.manual_seed(SEED)

PyTorch version: 2.8.0+cu126


<torch._C.Generator at 0x782324238c10>

## 2. The "Query, Key, Value" Analogy

How does "self-attention" work? How does a sequence "attend to itself"?

The "Query, Key, Value" (QKV) model is the key. Think of it like a search engine or a database retrieval.

Imagine you are at a library.
* **Query (Q):** This is your **question**. For a word, the Query is its "request for information." For example, the word "it" might have a Query like, "I'm a pronoun... what noun do I refer to?"
* **Key (K):** This is the **label on a book's spine**. Every word in the sequence has a "Key" that describes what it is. The word "car" might have a Key like, "I am a singular, non-human noun."
* **Value (V):** This is the **actual content of the book**. Every word also has a "Value" representing its actual meaning or content.

**The Self-Attention Process:**
1.  **Compare:** For a single word (let's call it "Word 1"), its **Query (Q1)** is compared against the **Key** of *every word in the sequence*, its own included **(K1, K2, K3...)**. This comparison (a dot product) generates a "similarity score" or "attention score."
2.  **Get Weights:** These scores are passed through a **softmax** function, which turns them into a probability distribution (non-negative weights that sum to 1). These weights are the "attention": they show how much "Word 1" should care about each word in the sequence.
3.  **Get Output:** The final output for "Word 1" is a **weighted sum** of all the **Values (V1, V2, V3...)** in the sequence, based on the weights calculated in step 2.



In essence, the output for each word is a new vector, "blended" from every word in the sequence (itself included), based on *how relevant* each one is (Q-K similarity).
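The three steps above can be traced by hand on a toy example. This is only a sketch in plain Python: the 2-dimensional vectors and the names `q1`, `keys`, and `values` are made-up illustrations, not learned values, and the scaling factor introduced in the next section is left out.

```python
import math

# Hypothetical 2-D vectors for a 3-word sequence (illustrative numbers only).
q1 = [1.0, 0.0]                                  # Query of "Word 1"
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # K1, K2, K3
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]  # V1, V2, V3

# 1. Compare: dot product of the query with every key (its own included).
scores = [sum(q * k for q, k in zip(q1, key)) for key in keys]

# 2. Get weights: softmax turns the scores into weights that sum to 1.
exps = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

# 3. Get output: weighted sum of the value vectors.
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]

print("scores :", scores)
print("weights:", [round(w, 3) for w in weights])
print("output :", [round(o, 3) for o in output])
```

K1 and K3 have the same dot product with `q1`, so they receive equal weight, and the output leans towards V1 and V3 rather than V2.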

## 3. Part 1: Scaled Dot-Product Attention

This is the core engine that implements the QKV logic. It's defined by the following equation:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Let's break this down:

1.  **$QK^T$**: This is the "Compare" step. We perform a matrix multiplication between all Queries and all Keys (transposed). This efficiently calculates the "similarity score" between every query and every key.
    * If `Q` is `[seq_len_q, d_k]` and `K` is `[seq_len_k, d_k]`, then `K^T` is `[d_k, seq_len_k]`.
    * The result `QK^T` is a matrix of shape `[seq_len_q, seq_len_k]` containing all the scores.

2.  **$\frac{...}{\sqrt{d_k}}$**: This is the "Scaled" part.
    * `d_k` is the dimension of the Key vectors.
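    * **Why scale?** If the components of a query and a key are roughly independent with mean 0 and variance 1, their dot product has variance `d_k`. For large `d_k`, the entries of $QK^T$ can therefore become very large in magnitude, which pushes the softmax function into regions with tiny gradients (it gets "saturated").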
    * By dividing by the square root of `d_k`, we "tame" these values, keeping the gradients healthy and making training more stable.

3.  **$\text{softmax}(...)$**: This is the "Get Weights" step. We apply softmax along the "keys" dimension. This converts the raw scores into a probability distribution (weights) for each query.

4.  **$...V$**: This is the "Get Output" step. We multiply our weights (a `[seq_len_q, seq_len_k]` matrix) by the Value matrix (`[seq_len_k, d_v]`). This gives us a final, weighted-sum output of shape `[seq_len_q, d_v]`.
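The effect of the scaling step can be checked numerically. The sketch below assumes query and key components drawn from a standard normal distribution (an illustrative assumption; real activations will differ) and compares the spread of the dot products, and the resulting softmax weights, with and without the division by $\sqrt{d_k}$.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # arbitrary seed, only for repeatability

for d_k in (4, 64, 1024):
    # 10,000 random query/key pairs with unit-variance components (assumption).
    q = torch.randn(10_000, d_k)
    k = torch.randn(10_000, d_k)
    dots = (q * k).sum(dim=-1)
    print(f"d_k={d_k:5d}  std(q.k)={dots.std().item():7.2f}  "
          f"std(q.k/sqrt(d_k))={(dots / math.sqrt(d_k)).std().item():.2f}")

# Effect on softmax: one query against 8 random keys with d_k = 1024.
d_k = 1024
query = torch.randn(1, d_k)
keys = torch.randn(8, d_k)
raw = query @ keys.T
unscaled = F.softmax(raw, dim=-1)
scaled = F.softmax(raw / math.sqrt(d_k), dim=-1)
print("max weight, unscaled:", unscaled.max().item())
print("max weight, scaled  :", scaled.max().item())
```

The unscaled standard deviation grows roughly like $\sqrt{d_k}$, while the scaled one stays near 1. Without scaling, the softmax tends to put almost all of its weight on a single key.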

Let's implement this.

In [13]:
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Computes Scaled Dot-Product Attention.

    Args:
        query (Tensor): Shape [batch_size, n_heads, seq_len_q, d_k]
        key (Tensor): Shape [batch_size, n_heads, seq_len_k, d_k]
        value (Tensor): Shape [batch_size, n_heads, seq_len_k, d_v]
        mask (Tensor, optional): Shape [batch_size, 1, 1, seq_len_k] or [batch_size, 1, seq_len_q, seq_len_k]

    Returns:
        output (Tensor): Shape [batch_size, n_heads, seq_len_q, d_v]
        attn_weights (Tensor): Shape [batch_size, n_heads, seq_len_q, seq_len_k]
    """

    # d_k is the per-head size of the query/key vectors (the two must match).
    # The value size d_v may differ in general; in this notebook d_k = d_v.
    d_k = query.size(-1)

    # 1. MatMul(Q, K^T)
    # (batch_size, n_heads, seq_len_q, d_k) @ (batch_size, n_heads, d_k, seq_len_k)
    # -> (batch_size, n_heads, seq_len_q, seq_len_k)
    scores = torch.matmul(query, key.transpose(-2, -1))

    # 2. Scale
    scores = scores / math.sqrt(d_k)

    # 3. Mask (Optional)
    # Masks are used to hide certain positions from the attention.
    # e.g., padding tokens, or future tokens in a decoder.
    if mask is not None:
        # masked_fill overwrites the hidden positions (mask == 0) with a large
        # negative number (-1e9); after softmax their weights are effectively 0.
        scores = scores.masked_fill(mask == 0, -1e9)

    # 4. Softmax
    attn_weights = F.softmax(scores, dim=-1)

    # 5. MatMul(weights, V)
    # (batch_size, n_heads, seq_len_q, seq_len_k) @ (batch_size, n_heads, seq_len_k, d_v)
    # -> (batch_size, n_heads, seq_len_q, d_v)
    output = torch.matmul(attn_weights, value)

    return output, attn_weights

# --- Let's test it with some dummy data ---

# We'll use 4 dimensions, 1 head, batch size 1, for simplicity
# (batch_size, n_heads, seq_len, d_k/d_v)
B, H, L, D = 1, 1, 3, 4 # Batch=1, Head=1, SeqLen=3, Dim=4

# Create 3 "words"
q = torch.rand(B, H, L, D)
k = torch.rand(B, H, L, D)
v = torch.rand(B, H, L, D)

print("Query shape:", q.shape)

output, weights = scaled_dot_product_attention(q, k, v)

print("\nOutput shape:", output.shape)
print("Attention Weights shape:", weights.shape)
print("\nAttention Weights (summed over last dim):")
print(weights.sum(dim=-1)) # Should sum to 1

Query shape: torch.Size([1, 1, 3, 4])

Output shape: torch.Size([1, 1, 3, 4])
Attention Weights shape: torch.Size([1, 1, 3, 3])

Attention Weights (summed over last dim):
tensor([[[1., 1., 1.]]])
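The test above never passes a `mask`. As a hedged sketch of how the optional mask behaves, the cell below builds a causal ("look-ahead") mask and applies the same steps inline so that it runs on its own. The cross-check uses `torch.nn.functional.scaled_dot_product_attention`, which ships with PyTorch 2.0 and later; the tensor sizes are arbitrary.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

B, H, L, D = 1, 1, 4, 8
q, k, v = (torch.rand(B, H, L, D) for _ in range(3))

# Causal mask: 1 = may attend, 0 = hidden.
# Shape [1, 1, L, L] broadcasts over batch and heads.
causal_mask = torch.tril(torch.ones(L, L)).view(1, 1, L, L)

# Same steps as the function above, written inline so this cell stands alone.
scores = q @ k.transpose(-2, -1) / math.sqrt(D)
scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = F.softmax(scores, dim=-1)
output = weights @ v

print(weights[0, 0])  # entries above the diagonal are 0

# Cross-check against PyTorch's built-in, where a boolean attn_mask
# uses True for "may attend".
reference = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask.bool())
print("matches built-in:", torch.allclose(output, reference, atol=1e-6))
```

Each row still sums to 1, but position *i* only distributes its weight over positions 0 to *i*.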


## 4. Part 2: Multi-Head Attention

Why just do this once? The model might need to ask multiple questions simultaneously.

* When looking at "it," one part of the model (one "head") might ask, "What noun does this refer to?"
* Another head might ask, "Is this part of a possessive phrase?"
* Another might ask, "Is this a subject or an object?"

**Multi-Head Attention (MHA)** runs the Scaled Dot-Product Attention mechanism multiple times (*h* times) in parallel.

**Process:**
1.  **Project:** Conceptually, each of the *h* heads has its own Linear layers (Wq, Wk, Wv) that project the input `Query`, `Key`, and `Value` into that head's own, smaller Q, K, and V vectors.
2.  **Split:** In practice, we don't build *h* separate linear layers. Instead, we use one *large* Linear layer per input (e.g., from `d_model` to `d_model`) and then "split" the resulting `d_model` vector into *h* "heads."
    * If `d_model = 512` and `h = 8`, we split the 512-dim vector into 8 chunks of 64 dimensions. Each chunk is one "head."
3.  **Attend:** Run Scaled Dot-Product Attention on each head *in parallel*. Each head gets its own output and attention weights.
4.  **Concatenate:** Re-combine the outputs of all *h* heads (e.g., concatenate the 8 chunks of 64-dim vectors back into one 512-dim vector).
5.  **Final Linear Layer:** Pass this concatenated vector through one final linear layer (Wo) to mix the information from all heads.



This allows each head to "attend" to different parts of the input, learning different types of relationships.
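The "Split" and "Concatenate" steps are pure reshaping. As a minimal sketch with arbitrary sizes, the cell below shows that splitting a `d_model` vector into heads and merging it back recovers the original tensor exactly, and that head *i* holds a contiguous 64-dimensional slice of it.

```python
import torch

batch_size, seq_len, d_model, h = 2, 5, 512, 8
d_k = d_model // h  # 64 dimensions per head

x = torch.rand(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, seq_len, h, d_k) -> (batch, h, seq_len, d_k)
heads = x.view(batch_size, seq_len, h, d_k).transpose(1, 2)
print("split :", heads.shape)

# Head i holds columns [i*d_k, (i+1)*d_k) of the original vector.
assert torch.equal(heads[:, 3], x[:, :, 3 * d_k : 4 * d_k])

# Concatenate: undo the transpose, then flatten (h, d_k) back into d_model.
# .contiguous() is needed because transpose returns a non-contiguous view.
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
print("merged:", merged.shape)
assert torch.equal(merged, x)
```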

In [15]:
class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, h):
        """
        Args:
            d_model (int): The dimensionality of the input/output.
            h (int): The number of attention heads.
        """
        super().__init__()

        assert d_model % h == 0, "d_model must be divisible by h"

        self.d_model = d_model
        self.h = h
        self.d_k = d_model // h # Dimension of each head's Key
        self.d_v = d_model // h # Dimension of each head's Value

        # 1. Define the big Linear layers for Q, K, V
        # These will project d_model -> d_model
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)

        # 5. Define the final Linear layer (Wo)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        """
        Splits the last dimension (d_model) into h heads.
        Shape: (batch_size, seq_len, d_model) -> (batch_size, h, seq_len, d_k)
        """
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, h, d_k)
        x = x.view(batch_size, -1, self.h, self.d_k)
        # (batch_size, seq_len, h, d_k) -> (batch_size, h, seq_len, d_k)
        return x.transpose(1, 2)

    def forward(self, query_in, key_in, value_in, mask=None):
        # In self-attention, query_in, key_in, and value_in are all the same tensor.
        # But for encoder-decoder attention, they can be different.

        batch_size = query_in.size(0)

        # 1. Project through Wq, Wk, Wv
        # (batch_size, seq_len_q, d_model)
        Q = self.W_q(query_in)
        # (batch_size, seq_len_k, d_model)
        K = self.W_k(key_in)
        # (batch_size, seq_len_k, d_model)
        V = self.W_v(value_in)

        # 2. Split heads
        # (batch_size, h, seq_len_q, d_k)
        Q = self.split_heads(Q, batch_size)
        # (batch_size, h, seq_len_k, d_k)
        K = self.split_heads(K, batch_size)
        # (batch_size, h, seq_len_k, d_v)
        V = self.split_heads(V, batch_size)

        # 3. Scaled Dot-Product Attention
        # `context` shape: (batch_size, h, seq_len_q, d_v)
        # `attn_weights` shape: (batch_size, h, seq_len_q, seq_len_k)
        context, attn_weights = scaled_dot_product_attention(Q, K, V, mask)

        # 4. Concatenate heads
        # (batch_size, h, seq_len_q, d_v) -> (batch_size, seq_len_q, h, d_v)
        context = context.transpose(1, 2).contiguous()
        # (batch_size, seq_len_q, h, d_v) -> (batch_size, seq_len_q, d_model)
        context = context.view(batch_size, -1, self.d_model)

        # 5. Pass through final linear layer (Wo)
        # (batch_size, seq_len_q, d_model)
        output = self.W_o(context)

        return output, attn_weights

# --- Let's test it! ---
D_MODEL = 512
HEADS = 8
BATCH_SIZE = 32
SEQ_LEN = 10 # 10 "words" in our sentence

# Create a MHA module
mha = MultiHeadAttention(d_model=D_MODEL, h=HEADS)

# Create a dummy input tensor (batch, seq_len, d_model)
# This is "self-attention" so Q, K, and V all come from the same source `x`
x = torch.rand(BATCH_SIZE, SEQ_LEN, D_MODEL)

# Pass it through the MHA
output, weights = mha(query_in=x, key_in=x, value_in=x)

print(f"Input shape (x): {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

# Check the weights shape: (batch_size, h, seq_len_q, seq_len_k)
assert weights.shape == (BATCH_SIZE, HEADS, SEQ_LEN, SEQ_LEN)
# Check the output shape: (batch_size, seq_len_q, d_model)
assert output.shape == (BATCH_SIZE, SEQ_LEN, D_MODEL)

Input shape (x): torch.Size([32, 10, 512])
Output shape: torch.Size([32, 10, 512])
Attention weights shape: torch.Size([32, 8, 10, 10])
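The test above covers self-attention only. As a sanity check for the encoder-decoder case, where queries and keys/values come from different sequences, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` rather than the class defined in this notebook. Its parameters are organised differently, so only the shapes are comparable. The names `decoder_states` and `encoder_states` are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
batch_size, tgt_len, src_len = 4, 7, 10

# batch_first=True matches the (batch, seq, feature) layout used in this notebook.
builtin_mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=h, batch_first=True)

decoder_states = torch.rand(batch_size, tgt_len, d_model)  # hypothetical queries
encoder_states = torch.rand(batch_size, src_len, d_model)  # hypothetical keys/values

out, attn = builtin_mha(
    decoder_states, encoder_states, encoder_states,
    need_weights=True, average_attn_weights=False,  # keep one weight map per head
)
print("output :", out.shape)   # (batch, tgt_len, d_model)
print("weights:", attn.shape)  # (batch, h, tgt_len, src_len)
```

The output length follows the query sequence, while the last dimension of the weights follows the key sequence, which is the same pattern as `seq_len_q` and `seq_len_k` in the shape comments above.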


## 5. Part 3: How is this used? (A Transformer Encoder Block)

So we have this powerful `MultiHeadAttention` module. How is it used?

In a Transformer, it's used inside an **Encoder Block**. This block has two main parts:
1.  **Multi-Head Attention:** The part we just built. This allows the sequence to look at itself.
2.  **Position-wise Feed-Forward Network:** A simple 2-layer MLP that is applied independently to *each token* (position).

It also uses two crucial components for deep learning:
* **Residual Connections:** `x + Sublayer(x)`. We add the *input* to the *output* of the sub-layer. This helps prevent the vanishing gradient problem.
* **Layer Normalization:** This normalizes the features across the `d_model` dimension for each token independently. It provides stability during training.
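To make "normalizes across the `d_model` dimension for each token" concrete, here is a small sketch that reproduces `nn.LayerNorm` by hand and then applies it after a residual addition. The tensor sizes are arbitrary, and `sublayer_out` is a random stand-in for an attention or feed-forward output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(2, 3, 8)  # (batch, seq_len, d_model) with a small d_model

layer_norm = nn.LayerNorm(8)  # freshly initialised: gain = 1, bias = 0

# Manual version: per-token mean and (biased) variance over the last dimension.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(layer_norm(x), manual, atol=1e-6))

# Residual connection followed by normalisation ("Add & Norm").
sublayer_out = torch.rand_like(x)  # stand-in for an attention or FFN output
y = layer_norm(x + sublayer_out)
print(y.mean(dim=-1))  # close to 0 for every token
```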

The structure is:
1.  **Input:** `x`
2.  `x_attn = MultiHeadAttention(x)`
3.  `x_norm1 = LayerNorm(x + x_attn)` (Add & Norm)
4.  `x_ffn = FeedForwardNetwork(x_norm1)`
5.  `output = LayerNorm(x_norm1 + x_ffn)` (Add & Norm)

In [17]:
class PositionWiseFeedForward(nn.Module):
    """ A simple 2-layer MLP for the Encoder/Decoder blocks """
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        # (batch, seq_len, d_model) -> (batch, seq_len, d_ff) -> (batch, seq_len, d_model)
        return self.linear2(self.relu(self.linear1(x)))

class EncoderBlock(nn.Module):
    """ A single Transformer Encoder Block """
    def __init__(self, d_model, h, d_ff, dropout=0.1):
        super().__init__()

        self.mha = MultiHeadAttention(d_model, h)
        self.ffn = PositionWiseFeedForward(d_model, d_ff)

        self.layernorm1 = nn.LayerNorm(d_model)
        self.layernorm2 = nn.LayerNorm(d_model)

        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

    def forward(self, x, mask=None):

        # 1. Multi-Head Attention sub-layer
        attn_output, _ = self.mha(x, x, x, mask)

        # 2. Add & Norm
        # x + dropout(attn_output)
        x = self.layernorm1(x + self.dropout1(attn_output))

        # 3. Feed-Forward sub-layer
        ffn_output = self.ffn(x)

        # 4. Add & Norm
        # x + dropout(ffn_output)
        output = self.layernorm2(x + self.dropout2(ffn_output))

        return output

# --- Let's test the full block ---
D_MODEL = 512
HEADS = 8
D_FF = 2048 # Standard in the paper
BATCH_SIZE = 32
SEQ_LEN = 10

# Create a dummy input tensor
x = torch.rand(BATCH_SIZE, SEQ_LEN, D_MODEL)

# Create an encoder block
encoder_block = EncoderBlock(D_MODEL, HEADS, D_FF)

# Pass the input through
output = encoder_block(x)

print(f"Input shape (x): {x.shape}")
print(f"Final output shape: {output.shape}")

# The output shape should be identical to the input shape!
assert output.shape == x.shape

Input shape (x): torch.Size([32, 10, 512])
Final output shape: torch.Size([32, 10, 512])


## 6. Conclusion and Next Steps

We've done it! We have successfully built the core components of the Transformer model from the ground up.

* We started with the **QKV concept**.
* We implemented the math in **Scaled Dot-Product Attention**.
* We parallelized it and made it powerful with **Multi-Head Attention**.
* We saw how it fits into a full **Encoder Block** with Layer Normalization and Feed-Forward networks.

You now understand the "attention" in "Attention Is All You Need."

**Next Steps:**
* **Positional Encodings:** One thing we've ignored is *word order*. Our MHA has no notion of position: it is *permutation-equivariant*, often loosely called "permutation-invariant" (shuffling the words gives the same outputs, just shuffled in the same way). The next step is to learn about Positional Encodings, which inject word-order information.
* **Building a Full Transformer:** We can now stack several `EncoderBlock`s to build a full Encoder, and then build a Decoder (which uses *masked* MHA) to create a full Encoder-Decoder Transformer model for machine translation.
* **BERT / GPT:** You now have the foundational block for understanding modern pre-trained models. BERT is essentially a stack of these Encoder Blocks, while GPT-style models stack decoder-style blocks that use *masked* self-attention.
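The word-order point in the first bullet can be checked directly. This sketch uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the class built in this notebook (an assumption made to keep the cell self-contained) and confirms that shuffling the input tokens only shuffles the output rows in the same way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Dropout defaults to 0.0, so the layer is deterministic.
mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.rand(1, 5, 16)   # one "sentence" of 5 tokens
perm = torch.randperm(5)   # a random reordering of the tokens
x_shuffled = x[:, perm]

with torch.no_grad():
    out, _ = mha(x, x, x)
    out_shuffled, _ = mha(x_shuffled, x_shuffled, x_shuffled)

# Shuffling the input only shuffles the output rows in the same way.
print(torch.allclose(out[:, perm], out_shuffled, atol=1e-5))
```

Because the layer cannot tell the two orderings apart, any information about word order has to be added to the inputs, which is the job of positional encodings.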