# Tutorial 10-1: Attention is All You Need – "Building a Transformer Block"

**Course:** CSEN 342: Deep Learning  
**Topic:** Transformers, Self-Attention, Multi-Head Attention, and Positional Encoding

## Objective
The Transformer model (Vaswani et al., 2017) revolutionized Deep Learning by eschewing recurrence entirely in favor of **Attention mechanisms**. 

In this tutorial, we will not use the black-box `nn.Transformer`. Instead, we will build the architecture layer-by-layer to understand the math behind the magic.

We will implement:
1.  **Scaled Dot-Product Attention:** The mathematical core.
2.  **Multi-Head Attention:** Running multiple attention mechanisms in parallel.
3.  **Positional Encoding:** Injecting sequence order since we have no RNN loops.
4.  **Encoder Block:** Assembling the full building block used in BERT; GPT stacks a closely related decoder-style block with causal masking.

---

## Part 1: Scaled Dot-Product Attention

The core attention mechanism is defined as:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:
* **Q (Query):** What I am looking for.
* **K (Key):** What I match against.
* **V (Value):** What I retrieve.
* **$d_k$:** Dimension of the keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension; without it, large scores saturate the softmax and its gradients vanish.
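
To make the role of the $\sqrt{d_k}$ factor concrete, here is a minimal sketch that is not part of the original notebook. It assumes query and key entries are independent standard normal values (an idealization), and the variable names are illustrative. Under that assumption the unscaled dot products have a standard deviation near $\sqrt{d_k}$, while the scaled ones stay near 1.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 512

# Assumption: entries of q and k are i.i.d. standard normal
q = torch.randn(10000, d_k)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)        # unscaled dot products, one per row
scaled = raw / math.sqrt(d_k)    # scaled as in the attention formula

raw_std = raw.std().item()       # roughly sqrt(d_k)
scaled_std = scaled.std().item() # roughly 1

# Effect on softmax: one query scored against 10 keys
query = torch.randn(d_k)
keys = torch.randn(10, d_k)
logits = keys @ query
peak_raw = F.softmax(logits, dim=-1).max().item()
peak_scaled = F.softmax(logits / math.sqrt(d_k), dim=-1).max().item()

print(f"std unscaled: {raw_std:.2f}, std scaled: {scaled_std:.2f}")
print(f"largest weight unscaled: {peak_raw:.3f}, scaled: {peak_scaled:.3f}")
```

The unscaled softmax tends to put nearly all of its weight on a single key, which is the saturated regime where gradients become very small.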

Let's implement the attention formula as a PyTorch module.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import matplotlib.pyplot as plt
import numpy as np

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, q, k, v, mask=None):
        # q, k, v: (Batch, Heads, Seq_Len, Dim)
        
        d_k = q.size(-1)
        
        # 1. Dot Product: (Batch, Heads, Seq, Dim) x (Batch, Heads, Dim, Seq)
        # Result: (Batch, Heads, Seq, Seq) -> The "Similarity Matrix"
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
        
        # 2. Masking (Optional)
        # Used in Decoder to hide future tokens, or to hide Padding tokens
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        
        # 3. Softmax
        attn_weights = F.softmax(scores, dim=-1)
        
        # 4. Weighted Sum of Values
        output = torch.matmul(attn_weights, v)
        
        return output, attn_weights

# Verification
attention = ScaledDotProductAttention()
# Dummy Data: Batch=1, Heads=1, Seq=5, Dim=64
q = torch.randn(1, 1, 5, 64)
k = torch.randn(1, 1, 5, 64)
v = torch.randn(1, 1, 5, 64)
out, weights = attention(q, k, v)

print(f"Output Shape: {out.shape} (Expected: 1, 1, 5, 64)")
print(f"Weights Shape: {weights.shape} (Expected: 1, 1, 5, 5)")
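
The `mask` argument is not exercised in the verification above. As a hedged illustration, the sketch below applies a causal (lower-triangular) mask using a small functional re-statement of the same computation, so that it runs on its own. The helper name `attention_fn` and the variable `causal_mask` are illustrative, not part of the tutorial's classes.

```python
import math
import torch
import torch.nn.functional as F

def attention_fn(q, k, v, mask=None):
    # Same computation as the module above, written as a function
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
seq_len = 5
q = torch.randn(1, 1, seq_len, 64)
k = torch.randn(1, 1, seq_len, 64)
v = torch.randn(1, 1, seq_len, 64)

# Lower-triangular mask: position i may attend only to positions 0..i.
# Shape (1, 1, Seq, Seq) broadcasts over batch and heads.
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)

out, weights = attention_fn(q, k, v, causal_mask)
print(weights[0, 0])  # entries above the diagonal should be (numerically) zero
```

Each row of the weight matrix still sums to 1, but the probability mass is spread only over the current and earlier positions.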

---

## Part 2: Multi-Head Attention

Instead of a single attention mechanism, we split the embedding dimension into multiple "heads," each of size `d_k = d_model / n_heads`. This lets the model attend to different kinds of information simultaneously (e.g., one head might track syntax while another tracks topic).

**Logic:**
1.  Project Input $X$ into $Q, K, V$ using linear layers.
2.  **Split** into `n_heads`.
3.  Run Scaled Dot-Product Attention on each head.
4.  **Concatenate** the results.
5.  Project back to original dimension.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_model = d_model
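        # The heads must evenly partition the embedding dimension
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads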
        
        # Linear Projections
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        
        self.attention = ScaledDotProductAttention()
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        
        # 1. Linear Projection + Split Heads
        # Transform: (Batch, Seq, Dim) -> (Batch, Seq, Heads, d_k) -> (Batch, Heads, Seq, d_k)
        q = self.w_q(q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # 2. Attention
        # Output: (Batch, Heads, Seq, d_k)
        out, weights = self.attention(q, k, v, mask)
        
        # 3. Concatenate
        # Transform: (Batch, Heads, Seq, d_k) -> (Batch, Seq, Heads, d_k) -> (Batch, Seq, Dim)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        
        # 4. Final Linear Layer
        out = self.fc(out)
        
        return out, weights
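
The split and concatenate steps are pure reshapes, which the following small sketch tries to make visible. It uses tiny illustrative sizes (not the tutorial's) and no learned projections: after splitting, head `h` holds the feature slice `[h*d_k, (h+1)*d_k)` of every token, and the concatenate step recovers the original tensor exactly.

```python
import torch

torch.manual_seed(0)
batch_size, seq_len, d_model, n_heads = 2, 5, 16, 4
d_k = d_model // n_heads

x = torch.randn(batch_size, seq_len, d_model)

# Split: (Batch, Seq, Dim) -> (Batch, Seq, Heads, d_k) -> (Batch, Heads, Seq, d_k)
heads = x.view(batch_size, seq_len, n_heads, d_k).transpose(1, 2)

# Concatenate: (Batch, Heads, Seq, d_k) -> (Batch, Seq, Heads, d_k) -> (Batch, Seq, Dim)
# .contiguous() is needed because transpose returns a non-contiguous view
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape)             # (Batch, Heads, Seq, d_k)
print(torch.equal(merged, x))  # the round trip is lossless
```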

---

## Part 3: Positional Encoding

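Since the Transformer processes all tokens in parallel (unlike an RNN), it has no inherent notion of order. Without positional information, self-attention is permutation-equivariant: "Dog bites Man" and "Man bites Dog" contain the same tokens, so each token would receive the same representation in both sentences.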

We inject order by adding a **Positional Encoding (PE)** vector to the input embedding.

$$ PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}}) $$
$$ PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}}) $$

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        # Create matrix of [SeqLen, HiddenDim] representing the positional encoding for max_len inputs
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # exp(-log(10000) * 2i / d_model) == 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (Batch, Seq_Len, Dim)
        # Add positional encoding to input embeddings
        x = x + self.pe[:, :x.size(1)]
        return x

# Visualization
d_model = 128
pos_enc = PositionalEncoding(d_model, max_len=100)
dummy_input = torch.zeros(1, 100, d_model)
output = pos_enc(dummy_input)

plt.figure(figsize=(10, 6))
plt.imshow(output[0].numpy(), cmap='RdBu', aspect='auto')
plt.title("Positional Encoding (100 positions, 128 dimensions)")
plt.xlabel("Embedding Dimension")
plt.ylabel("Sequence Position")
plt.colorbar()
plt.show()

### Discussion
Look at the plot. Each row represents a position in the sequence, and each column is an embedding dimension.
The low-index dimensions oscillate quickly from one position to the next, while the high-index dimensions vary slowly. Together they give every position a unique, deterministic "signature" that the model can learn to recognize.
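
As a sanity check on the formula, the sketch below rebuilds the encoding table in a standalone helper (the name `sinusoidal_pe` is illustrative), compares one entry against the closed-form definition, and counts the distinct rows. The chosen position and index are arbitrary.

```python
import math
import torch

def sinusoidal_pe(max_len, d_model):
    # Same construction as PositionalEncoding.__init__, without the batch dimension
    position = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

d_model = 128
pe = sinusoidal_pe(max_len=100, d_model=d_model)

# Spot-check against PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and its cosine partner
pos, i = 7, 3
expected_sin = math.sin(pos / 10000 ** (2 * i / d_model))
expected_cos = math.cos(pos / 10000 ** (2 * i / d_model))
print(pe[pos, 2 * i].item(), expected_sin)
print(pe[pos, 2 * i + 1].item(), expected_cos)

# Every position should get its own row
n_unique = torch.unique(pe, dim=0).size(0)
print(f"{n_unique} distinct rows out of {pe.size(0)}")
```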

---

## Part 4: The Encoder Layer

We now assemble the full **Transformer Encoder Layer**. 

**Architecture:**
1.  **Multi-Head Attention**
2.  **Add & Norm:** Residual Connection + Layer Normalization
3.  **Feed-Forward Network:** Two linear layers with ReLU in between.
4.  **Add & Norm:** Residual Connection + Layer Normalization

This is the block that BERT repeats: 12 layers in BERT-Base and 24 in BERT-Large (BERT swaps the ReLU for GELU).

In [None]:
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # 1. Attention Sublayer
        # Post-norm residual: LayerNorm(x + Dropout(Sublayer(x)))
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # 2. Feed-Forward Sublayer
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        
        return x

# Verify Assembly
d_model = 512
n_heads = 8
d_ff = 2048
seq_len = 20
batch_size = 4

layer = EncoderLayer(d_model, n_heads, d_ff)
dummy_input = torch.randn(batch_size, seq_len, d_model)
output = layer(dummy_input)

print(f"Input Shape: {dummy_input.shape}")
print(f"Output Shape: {output.shape}")
print("Success! Dimensions are preserved (ready for stacking).")
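
For a rough cross-check, PyTorch's built-in `nn.TransformerEncoderLayer` can be configured with the same hyperparameters. This is a separate implementation, not the classes written above, so the sketch only confirms that the same shape contract holds and shows how a stack of layers is applied. It assumes a PyTorch version that supports `batch_first=True`.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_ff = 512, 8, 2048

# Built-in layer; like the EncoderLayer above, it applies LayerNorm after each residual by default
builtin_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dim_feedforward=d_ff,
    dropout=0.1,
    batch_first=True,  # inputs are (Batch, Seq, Dim), matching this tutorial
)

# Stack two copies of the layer (the depth here is arbitrary)
encoder = nn.TransformerEncoder(builtin_layer, num_layers=2)
encoder.eval()  # disable dropout for a deterministic forward pass

x = torch.randn(4, 20, d_model)
with torch.no_grad():
    y = encoder(x)

print(f"Input Shape: {x.shape}")
print(f"Output Shape: {y.shape}")
```

Because each layer maps `(Batch, Seq, Dim)` to the same shape, any number of layers can be chained this way.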

### Conclusion
You have successfully implemented the core components of a Transformer from scratch.

1.  **Q, K, V:** You learned how attention is just a "soft" database lookup.
2.  **Multi-Head:** You saw how parallel heads allow the model to focus on different things at once.
3.  **Positional Encoding:** You visualized how we inject order into a parallel architecture.
4.  **Encoder Block:** You built the Lego brick that powers modern NLP (BERT, etc.).

In the next tutorial, we will use the `transformers` library to fine-tune a pre-trained version of this architecture.