# Attention Mechanism from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/attention_from_scratch.ipynb)

This notebook implements the attention mechanism using **only low-level tensor operations** — no `nn.Module`, no `nn.Linear`, no high-level wrappers.

We build everything from raw matrix multiplications and element-wise operations so you can see exactly what happens at each step.

> **Companion notebook:** [attention_with_pytorch.ipynb](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/attention_with_pytorch.ipynb) — same content using `nn.Module`.

In [None]:
# Install dependencies (Colab already has torch, but this ensures compatibility)
!pip install torch matplotlib -q

In [None]:
import torch
import math
import matplotlib.pyplot as plt

torch.manual_seed(42)

# Use GPU if available (Colab: Runtime > Change runtime type > GPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
def plot_heatmap(data, tokens, title):
    """Helper to plot attention weights/scores with values."""
    fig, ax = plt.subplots(figsize=(5, 4))
    im = ax.imshow(data, cmap='Blues')
    
    # Add colorbar
    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
    
    # Ticks and labels
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens)
    ax.set_yticklabels(tokens)
    ax.set_xlabel("Key (attending to)")
    ax.set_ylabel("Query (token)")
    ax.set_title(title)
    
    # Text annotations: switch to white text on the darker half of the color scale
    threshold = (data.max() + data.min()) / 2
    for i in range(len(tokens)):
        for j in range(len(tokens)):
            val = data[i, j]
            color = 'white' if val > threshold else 'black'
            ax.text(j, i, f"{val:.2f}", ha='center', va='center', color=color)
    
    plt.tight_layout()
    plt.show()


## 1. The Input Pipeline: Text to Tensors

Before we jump into attention, let's see how raw text becomes the matrix `X` we use as input. This process involves three main steps:
1. **Tokenization**: Converting words to discrete IDs.
2. **Embedding**: Mapping IDs to continuous vectors.
3. **Positional Encoding**: Adding information about word order.


In [None]:
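sentence = "The cat sat down"
tokens = sentence.split()
seq_len = len(tokens)  # number of tokens; reused by later cells
# Toy vocabulary: one ID per word (all four words are distinct here)
vocab = {word: i for i, word in enumerate(tokens)}
token_ids = torch.tensor([vocab[w] for w in tokens])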

print(f"Sentence: {sentence}")
print(f"Token IDs: {token_ids}")


### Step 1.1: Learned Word Embeddings

Each token ID points to a row in an **Embedding Matrix**. This matrix is learned during training, allowing the model to represent word meanings as vectors.


In [None]:
d_model = 8  # embedding dimension
vocab_size = len(vocab)

# In a real model, this is an nn.Embedding layer (learned parameters)
embedding_matrix = torch.randn(vocab_size, d_model)

# Lookup embeddings for our IDs
word_embeddings = embedding_matrix[token_ids]  # Shape: (seq_len, d_model)

print(f"Word Embeddings shape: {word_embeddings.shape}")
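
Indexing `embedding_matrix[token_ids]` is plain row selection. As a sanity check, the sketch below shows that the same result comes from multiplying a one-hot matrix by the table. The sizes and the names `table`, `ids`, and `one_hot` are illustrative assumptions, not part of the notebook's state.

```python
import torch

torch.manual_seed(0)

vocab_size, d_model = 4, 8                  # toy sizes, chosen for illustration
table = torch.randn(vocab_size, d_model)    # stand-in for the embedding matrix
ids = torch.tensor([2, 0, 3, 2])            # token IDs; repeats are allowed

# Route 1: direct row lookup (what the cell above does)
looked_up = table[ids]

# Route 2: build one-hot rows by hand, then matrix-multiply
one_hot = torch.zeros(len(ids), vocab_size)
one_hot[torch.arange(len(ids)), ids] = 1.0
via_matmul = one_hot @ table

print(torch.allclose(looked_up, via_matmul))
```

The lookup is the cheaper route because it never materializes the one-hot matrix, but the matmul view makes it clear why an embedding table behaves like a linear layer without bias.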


### Step 1.2: Positional Encoding (The 'Where')

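Self-attention on its own has no notion of order: shuffling the input tokens simply shuffles the outputs the same way (it is permutation-equivariant), so the sentence is effectively treated like a 'bag of words'. To give the model a sense of order, we add **Positional Encodings** ($PE$).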

We use the standard Sine/Cosine formula from the original Transformer paper:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$


In [None]:
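def get_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model); assumes even d_model."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)  # (seq_len, 1)
    # 1 / 10000^(2i / d_model), computed via exp/log; shape (d_model / 2,)
    div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
    
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe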

pos_encoding = get_positional_encoding(seq_len, d_model)
print(f"Positional Encoding shape: {pos_encoding.shape}")
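
To make the formula concrete, here is a small hand computation in plain Python, written directly from the sine/cosine definition rather than the vectorized `div_term` form. The helper name `pe_value` is hypothetical and exists only for this sketch.

```python
import math

def pe_value(pos, dim, d_model):
    """PE(pos, dim) computed straight from the sine/cosine formula."""
    i = dim // 2                                # pair index shared by dims 2i and 2i+1
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

d_model = 8
row0 = [pe_value(0, d, d_model) for d in range(d_model)]
row1 = [pe_value(1, d, d_model) for d in range(d_model)]

print(row0)                                     # position 0: sin(0)=0 and cos(0)=1 alternate
print([round(v, 4) for v in row1])

# The exp/log expression used for div_term is the same frequency, written differently
i = 1
freq_direct = 1 / (10000 ** (2 * i / d_model))
freq_explog = math.exp(2 * i * -(math.log(10000.0) / d_model))
print(abs(freq_direct - freq_explog) < 1e-12)
```

Each pair of dimensions oscillates at its own frequency: the first pair changes quickly with position, while later pairs change slowly, which is what lets nearby and distant positions be told apart.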


### Step 1.3: Final Input Matrix X

We simply add the two matrices. Now, each vector in `X` contains both **what** the word is and **where** it is.


In [None]:
# Move X to the same device as the weight matrices created later (a no-op on CPU)
X = (word_embeddings + pos_encoding).to(device)
print(f"Final Input X shape: {X.shape} (seq_len, d_model)")


## 2. Projecting to Query, Key, Value

Attention doesn't operate on raw embeddings directly. We first project `X` into three different spaces using weight matrices:

$$Q = X \cdot W_Q$$
$$K = X \cdot W_K$$
$$V = X \cdot W_V$$

Each weight matrix has shape `(d_model, d_k)`. We use raw `torch.randn` to create them — no `nn.Linear`.

## 0. Mathematical Foundations

Before implementing the projections and the attention computation, let's build up the math they rely on.

---

### 0.1 Vectors and Dot Products

A **vector** $\mathbf{a} \in \mathbb{R}^d$ is an ordered list of $d$ real numbers. The **dot product** of two vectors measures their alignment:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{d} a_i \, b_i = \|\mathbf{a}\| \, \|\mathbf{b}\| \, \cos\theta$$

where $\theta$ is the angle between them. Key intuitions:
- **Positive & large**: vectors point in a similar direction (high similarity)
- **Near zero**: vectors are roughly orthogonal (unrelated)
- **Negative**: vectors point in opposite directions

In attention, the dot product $q_i \cdot k_j$ measures **how relevant** token $j$ is to token $i$.
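
The three cases above can be checked with a few hand-picked vectors. This is a minimal plain-Python sketch; the vectors and the helper names `dot` and `cosine` are made up for illustration.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product normalized by both lengths, i.e. cos(theta)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

q = [1.0, 2.0, 0.0]
k_similar = [2.0, 4.0, 0.0]      # same direction as q
k_orthogonal = [0.0, 0.0, 3.0]   # perpendicular to q
k_opposite = [-1.0, -2.0, 0.0]   # opposite direction

print(dot(q, k_similar))         # 10.0
print(dot(q, k_orthogonal))      # 0.0
print(dot(q, k_opposite))        # -5.0
print(round(cosine(q, k_similar), 6), round(cosine(q, k_opposite), 6))
```

Note that the raw dot product depends on vector length as well as direction, which is one reason the scores are later rescaled before softmax.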

---

### 0.2 Matrix Multiplication

If $X \in \mathbb{R}^{n \times d}$ and $W \in \mathbb{R}^{d \times m}$, then:

$$(XW)_{ij} = \sum_{k=1}^{d} X_{ik} \, W_{kj}$$

Each row of the result is a **linear combination** of the rows of $W$, weighted by the entries of the corresponding row of $X$. This is how we project embeddings into Q, K, V spaces: a learned linear transformation.

When we compute $Q K^T$, we're doing all pairwise dot products at once:

$$(Q K^T)_{ij} = \mathbf{q}_i \cdot \mathbf{k}_j$$

So row $i$ of the resulting matrix contains the similarity of query $i$ with every key.
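
A quick way to convince yourself of the pairwise-dot-product identity is to rebuild $Q K^T$ one entry at a time. The sketch below uses small random matrices; the sizes are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
n, d_k = 3, 4                      # toy sizes, chosen for illustration
Q = torch.randn(n, d_k)
K = torch.randn(n, d_k)

scores = Q @ K.T                   # all pairwise dot products at once

# Rebuild the same matrix one entry at a time
manual = torch.zeros(n, n)
for i in range(n):
    for j in range(n):
        manual[i, j] = torch.dot(Q[i], K[j])

print(torch.allclose(scores, manual, atol=1e-6))
```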

---

### 0.3 The Softmax Function

Softmax converts a vector of arbitrary real numbers into a **probability distribution** (non-negative, sums to 1):

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Properties:
- **Output range**: each value is in $(0, 1)$, and they sum to exactly $1$
- **Monotonic**: larger inputs get larger probabilities
- **Sharpness**: as differences between inputs grow, softmax approaches a one-hot vector (winner-take-all)

**Numerical stability trick**: since $e^{z_i}$ can overflow for large $z_i$, we use the identity:

$$\text{softmax}(\mathbf{z})_i = \text{softmax}(\mathbf{z} - c)_i \quad \text{for any constant } c$$

Setting $c = \max_j z_j$ keeps the exponents in a safe range.
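
The shift identity is easy to see in plain Python, where `math.exp` raises on overflow instead of silently returning infinity. The function names below are hypothetical and the inputs are chosen only to trigger the overflow.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]          # overflows for large v
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    c = max(z)                               # shift so the largest exponent is 0
    exps = [math.exp(v - c) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1001.0, 1002.0, 1003.0]             # same gaps, shifted by 1000

print(softmax_stable(small))
print(softmax_stable(large))                 # same distribution as for `small`

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive version failed:", err)
```

In PyTorch the naive version would not raise: `torch.exp` returns `inf` for large inputs and the division then produces `nan`, which is harder to notice.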

---

### 0.4 Visualizing `d_k` and Matrix Multiplication

**What is `d_k`?**
It is the **inner dimension** of the query/key vectors. When we compute the attention scores ($Q \cdot K^T$), we perform a matrix multiplication where `d_k` is the dimension we sum over.

**Visual Representation:**

Let's visualize the operation `Score = Q @ K.T`.

```text
       Q (Query Matrix)               K.T (Transposed Key Matrix)
    [ seq_len x d_k ]                    [ d_k x seq_len ]

      +-----------+d_k+                +-------------------+
      | . . . . . | ^                  | . . . . | . . . . |
      | - Row i - | |                  | . . . . | Col j . |
      | . . . . . | | seq_len      d_k | . . . . | . . . . |
      +-----------+ v                  | . . . . | . . . . |
                                       +-------------------+
                                                 ^
                                                 | seq_len
```

When calculating the score for a single pair (Row $i$ of $Q$, Column $j$ of $K^T$), we compute the **dot product**:

$$ \text{Score}_{i,j} = \sum_{z=1}^{d_k} Q_{i,z} \cdot K^T_{z,j} $$

**Why does `d_k` matter?**
Notice the sum symbol $\sum_{z=1}^{d_k}$.
- We are adding up **`d_k` distinct terms**.
- If `d_k` is small (e.g., 64), we sum 64 terms.
- If `d_k` is huge (e.g., 1024), we sum 1024 terms.

**The variance problem:**
If the elements of $Q$ and $K$ are independent random variables with mean 0 and variance 1, each product $Q_{i,z} K_{j,z}$ also has variance 1, so the sum of `d_k` of them (the dot product) has variance $d_k$ and standard deviation $\sqrt{d_k}$.
- **Larger `d_k`** $\rightarrow$ **Larger variance** $\rightarrow$ **Larger values** (for `d_k = 1024`, scores around $\pm 32$ are typical).

Large values push the **Softmax** function into regions with extremely small gradients (vanishing gradients), which slows or stalls learning.

**The Fix - Scaling:**
We divide by $\sqrt{d_k}$ to scale the variance back to 1, keeping the gradients healthy.

$$\text{Var}\!\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$
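
The variance claim can be checked empirically. The simulation below is a rough sketch that assumes independent standard-normal entries (an idealization; trained projections need not behave this way) and estimates the variance of the dot product before and after scaling.

```python
import random
import statistics

random.seed(0)

def sample_dot(d_k):
    """Dot product of two random vectors with i.i.d. N(0, 1) entries."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))

d_k = 64
dots = [sample_dot(d_k) for _ in range(4000)]
scaled = [v / d_k ** 0.5 for v in dots]

var_raw = statistics.pvariance(dots)         # expected to land near d_k
var_scaled = statistics.pvariance(scaled)    # expected to land near 1
print(f"d_k={d_k}: Var(q.k) ~ {var_raw:.1f}, Var(q.k / sqrt(d_k)) ~ {var_scaled:.2f}")
```

The estimates fluctuate from run to run, but they should stay within a few percent of `d_k` and 1 respectively at this sample size.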

---

### 0.5 Attention as a Weighted Average

The final attention output for token $i$ is:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, \mathbf{v}_j$$

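where the attention weights $\alpha_{ij} = \dfrac{\exp\!\left(\mathbf{q}_i \cdot \mathbf{k}_j / \sqrt{d_k}\right)}{\sum_{l=1}^{n} \exp\!\left(\mathbf{q}_i \cdot \mathbf{k}_l / \sqrt{d_k}\right)}$ are the softmax, taken over $j$, of the scaled scores in row $i$.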

This is a **convex combination** of the value vectors — the output lives in the convex hull of the $\mathbf{v}_j$'s. Tokens with higher similarity scores contribute more to the output.
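
As a small worked example of the weighted average, take one row of weights and four two-dimensional value vectors. All numbers here are invented for illustration.

```python
weights = [0.1, 0.5, 0.3, 0.1]            # one row of attention weights, sums to 1
values = [                                 # four value vectors with d_v = 2
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [-1.0, 3.0],
]

d_v = len(values[0])
output = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(d_v)]
print(output)                              # approximately [0.6, 1.4]

# A convex combination never leaves the per-coordinate range of the inputs
for c in range(d_v):
    column = [v[c] for v in values]
    assert min(column) <= output[c] <= max(column)
```

Because the weights are non-negative and sum to 1, attention can blend value vectors but cannot extrapolate beyond them; any larger change has to come from the projections around it.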

---

### 0.6 Linear Projections: Why Q, K, V?

Raw embeddings aren't optimized for measuring relevance. The learnable matrices $W_Q, W_K, W_V$ project each token into three roles:

| Matrix | Role | Analogy |
|--------|------|--------|
| $W_Q$ | **Query** — "what am I looking for?" | A database query |
| $W_K$ | **Key** — "what do I contain?" | A database index |
| $W_V$ | **Value** — "what information do I carry?" | The actual data |

The dot product $\mathbf{q}_i \cdot \mathbf{k}_j$ computes **relevance** (query matches key), and the weighted sum over $\mathbf{v}_j$ retrieves **content** from the relevant tokens.

---

Now let's implement all of this step by step.

In [None]:
d_k = 6  # dimension of queries and keys
d_v = 6  # dimension of values

# Weight matrices — raw parameters, no nn.Module
W_Q = torch.randn(d_model, d_k, device=device) * 0.1 # Shape: (d_model, d_k)
W_K = torch.randn(d_model, d_k, device=device) * 0.1 # Shape: (d_model, d_k)
W_V = torch.randn(d_model, d_v, device=device) * 0.1 # Shape: (d_model, d_v)

# Project: simple matrix multiplication
Q = X @ W_Q   # Shape: (seq_len, d_k)
K = X @ W_K   # Shape: (seq_len, d_k)
V = X @ W_V   # Shape: (seq_len, d_v)

print(f"Q shape: {Q.shape}      (seq_len, d_k)")
print(f"K shape: {K.shape}      (seq_len, d_k)")
print(f"V shape: {V.shape}      (seq_len, d_v)")


## 3. Scaled Dot-Product Attention (Step by Step)

The core formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

Let's break it into individual steps.

### Step 3a: Raw Attention Scores

$$ \text{scores} = Q \cdot K^T $$

Each element `scores[i][j]` measures how much token `i`'s query aligns with token `j`'s key.

In [None]:
# Dot product between every pair of (query, key)
scores = Q @ K.T   # (seq_len, seq_len)

print(f"Raw attention scores shape: {scores.shape} (seq_len, seq_len)")
print(scores)

In [None]:
tokens = ["The", "cat", "sat", "down"]
plot_heatmap(scores.detach().cpu().numpy(), tokens, "Raw Attention Scores (Q @ K.T)")


### Step 3b: Scale

$$ \text{scaled\_scores} = \frac{\text{scores}}{\sqrt{d_k}} $$

Without scaling, large `d_k` pushes dot products to extreme values, making softmax output nearly one-hot (vanishing gradients).

In [None]:
scale = math.sqrt(d_k)
scaled_scores = scores / scale

print(f"Scale factor: sqrt({d_k}) = {scale:.4f}")
print("Scaled scores:")
print(scaled_scores)
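
To see why the scale factor matters, the sketch below takes a fixed set of scores with a spread of about 1, inflates them by $\sqrt{d_k}$ to mimic unscaled dot products, and compares the largest softmax weight with and without the division. The score values and the `results` dictionary are assumptions made for this illustration.

```python
import math

def softmax(z):
    c = max(z)                                   # max-subtraction for stability
    exps = [math.exp(v - c) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

unit_scores = [1.0, 0.5, 0.0, -0.5]              # scores with a spread of roughly 1

results = {}
for d_k in (4, 64, 1024):
    raw = [s * math.sqrt(d_k) for s in unit_scores]   # spread grows like sqrt(d_k)
    scaled = [s / math.sqrt(d_k) for s in raw]        # dividing restores it
    results[d_k] = (max(softmax(raw)), max(softmax(scaled)))
    print(d_k, round(results[d_k][0], 3), round(results[d_k][1], 3))
```

Without scaling, the top weight climbs toward 1 as `d_k` grows, so the output collapses onto a single token; with scaling, the distribution stays the same regardless of `d_k`.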

### Step 3c: Softmax (implemented manually)

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

We implement it by hand with the numerical stability trick: subtract the max before exponentiating.

In [None]:
def softmax(x):
    """Row-wise softmax, implemented from scratch."""
    # Subtract max for numerical stability (prevents overflow in exp)
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

attn_weights = softmax(scaled_scores)  # (seq_len, seq_len)


plot_heatmap(attn_weights.detach().cpu().numpy(), tokens, "Attention Weights (Softmax)")
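
A hand-written softmax is worth checking against the library version once. The sketch below repeats the same row-wise implementation so that it runs on its own, feeds it deliberately large random scores, and compares the result with `torch.softmax`.

```python
import torch

def softmax(x):
    """Row-wise softmax with the max-subtraction trick."""
    x_max = x.max(dim=-1, keepdim=True).values
    exp_x = torch.exp(x - x_max)
    return exp_x / exp_x.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
scores = torch.randn(4, 4) * 50          # deliberately large to stress stability

ours = softmax(scores)
reference = torch.softmax(scores, dim=-1)

print(torch.allclose(ours, reference, atol=1e-6))
print(ours.sum(dim=-1))                   # each row should sum to 1
```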

**Reading the weight matrix:** Row `i` tells you how much each token contributes to the output of token `i`.

For example, if `attn_weights[1] = [0.1, 0.5, 0.3, 0.1]`, then the output for "cat" (token 1) is 50% influenced by itself, 30% by "sat", and 10% each by "The" and "down".

### Step 3d: Weighted Sum of Values

$$\text{output} = \text{attn\_weights} \cdot V$$

Each output row is a weighted combination of all value vectors.

**Visualizing the Weighted Sum:**

How do the attention weights and the Value matrix ($V$) combine to produce the final output?

```text
      Weights (Softmax)                V (Value Matrix)                Output Matrix
    [ seq_len x seq_len ]             [ seq_len x d_v ]              [ seq_len x d_v ]

      +---------------+                +-----------+d_v+              +---------------+
      | . . . . . . . |                | . . . . . | ^                | . . . . . . . |
      | --- Row i --- |       @        | - Row 0 - | |       =        | --- Row i --- |
      | . . . . . . . |                | - Row 1 - | | seq_len        | . . . . . . . |
      +---------------+                | . . . . . | |                +---------------+
                                       +-----------+ v
```

**Row $i$ Calculation:**

Each element $j$ in **Row $i$ of the Weights matrix** tells us the importance of **Row $j$ of the Value matrix ($V$)** for the current token $i$.

$$\text{Output}_i = \sum_{j=1}^{\text{seq\_len}} \text{Weight}_{i,j} \cdot \mathbf{v}_j$$

where $\mathbf{v}_j$ is the $j$-th row of $V$. This shows that the output is a **weighted average** of all the value vectors. If a weight is near 1, the output will be very similar to that specific value vector; if weights are distributed, the output is a blend.


In [None]:
attn_output = attn_weights @ V  # (seq_len, d_v)

print(f"Attention output shape: {attn_output.shape} (seq_len, d_v)")
print(attn_output)

### Putting it all together

In [None]:
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Computes scaled dot-product attention using only basic ops.
    
    Q: (seq_len, d_k)
    K: (seq_len, d_k)
    V: (seq_len, d_v)
    mask: optional (seq_len, seq_len) — True where we want to block attention
    
    Returns: (seq_len, d_v) attention output, (seq_len, seq_len) weights
    """
    d_k = Q.shape[-1]
    
    # 1. Compute scores
    scores = Q @ K.T / math.sqrt(d_k)
    
    # 2. Apply mask (for causal / decoder attention)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))
    
    # 3. Softmax
    weights = softmax(scores)
    
    # 4. Weighted sum
    output = weights @ V
    
    return output, weights

# Verify it matches our step-by-step result
out, w = scaled_dot_product_attention(Q, K, V)
print("Matches step-by-step?", torch.allclose(out, attn_output, atol=1e-6))

## 4. Causal (Decoder) Mask

In autoregressive models (like GPT), token `i` should only attend to tokens `0..i`, not future tokens. We achieve this with a **causal mask** that sets future positions to `-inf` before softmax.

```
         The  cat  sat  down
The    [  ok  -inf -inf -inf ]
cat    [  ok   ok  -inf -inf ]
sat    [  ok   ok   ok  -inf ]
down   [  ok   ok   ok   ok  ]
```

In [None]:
# 1. Create Mask
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)

# 2. Apply Mask to raw scaled scores (re-calculating for demo)
scores_raw = Q @ K.T / math.sqrt(d_k)
scores_masked = scores_raw.masked_fill(causal_mask, float('-inf'))

# 3. Softmax
weights_masked = softmax(scores_masked)

# Visual comparison: weights without the mask (the masked weights are plotted in a later cell)
plot_heatmap(softmax(scores_raw).detach().cpu().numpy(), tokens, "Pre-Mask Weights")


In [None]:
print("Causal Mask (Upper Triangle blocked):")
plot_heatmap(causal_mask.float().cpu().numpy(), tokens, "Causal Mask")


In [None]:
print("Weights AFTER masking (future positions set to 0.00):")
plot_heatmap(weights_masked.detach().cpu().numpy(), tokens, "Masked Attention Weights")
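
The practical meaning of the causal mask is that changing a later token cannot affect the output of an earlier one. The sketch below tests that property on random inputs. The function `causal_attention` is a compact stand-in written for this example, and it uses `torch.softmax` for brevity rather than the hand-written version.

```python
import math
import torch

def causal_attention(Q, K, V):
    """Scaled dot-product attention with an upper-triangular (future) mask."""
    n, d_k = Q.shape
    scores = Q @ K.T / math.sqrt(d_k)
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
n, d = 4, 6                                   # toy sizes, chosen for illustration
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
out, weights = causal_attention(Q, K, V)

# Perturb only the last token's key and value
K2, V2 = K.clone(), V.clone()
K2[-1] += 1.0
V2[-1] += 1.0
out2, _ = causal_attention(Q, K2, V2)

print(torch.allclose(out[:-1], out2[:-1]))    # earlier tokens are unaffected
print(weights)                                 # entries above the diagonal are exactly 0
```

Only the last row of the output is allowed to change here, because the last token is the only one that can see itself.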


## 5. Multi-Head Attention (from raw ops)

Instead of one attention function, we run `n_heads` in parallel, each with its own Q/K/V projections, then concatenate and project the result.

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \cdot W_O$$

where each head:
$$\text{head}_i = \text{Attention}(X W_Q^i,\; X W_K^i,\; X W_V^i)$$

In [None]:
n_heads = 2
d_k_per_head = d_model // n_heads  # 4
d_v_per_head = d_model // n_heads  # 4

print(f"n_heads={n_heads}, d_k_per_head={d_k_per_head}, d_v_per_head={d_v_per_head}")

# Create separate weight matrices for each head — raw tensors
W_Qs = [torch.randn(d_model, d_k_per_head, device=device) * 0.1 for _ in range(n_heads)] # List of (d_model, d_k_per_head)
W_Ks = [torch.randn(d_model, d_k_per_head, device=device) * 0.1 for _ in range(n_heads)] # List of (d_model, d_k_per_head)
W_Vs = [torch.randn(d_model, d_v_per_head, device=device) * 0.1 for _ in range(n_heads)] # List of (d_model, d_v_per_head)

# Output projection: maps concatenated heads back to d_model
W_O = torch.randn(n_heads * d_v_per_head, d_model, device=device) * 0.1 # Shape: (n_heads * d_v_per_head, d_model)


In [None]:
# Run each head independently
head_outputs = []

for i in range(n_heads):
    Q_i = X @ W_Qs[i] # Shape: (seq_len, d_k_per_head)
    K_i = X @ W_Ks[i] # Shape: (seq_len, d_k_per_head)
    V_i = X @ W_Vs[i] # Shape: (seq_len, d_v_per_head)
    
    head_out, head_weights = scaled_dot_product_attention(Q_i, K_i, V_i)
    head_outputs.append(head_out)
    
    print(f"Head {i} attention weights:")
    print(head_weights)
    print()

In [None]:
# Concatenate all heads along the last dimension
concat = torch.cat(head_outputs, dim=-1)  # Shape: (seq_len, n_heads * d_v_per_head)
print(f"Concatenated shape: {concat.shape} (seq_len, d_model)")

# Final linear projection (raw matmul, no nn.Linear)
multi_head_output = concat @ W_O  # (seq_len, d_model)
print(f"Multi-head output shape: {multi_head_output.shape} (seq_len, d_model)")
print(multi_head_output)

**Why multiple heads?** Each head can learn to attend to different types of relationships. For instance, one head might focus on adjacent tokens (local syntax), while another head attends to the subject of the sentence (long-range dependency).

### 5.1 Understanding the Reshape Logic

In efficient Multi-Head Attention, we don't loop.
Instead, we use `view` and `transpose` to shuffle the data.

**The Goal:** Transform `(seq_len, d_model)` $\rightarrow$ `(n_heads, seq_len, d_k)`.

**Step 1: View (Split heads)**
`x.view(seq_len, n_heads, d_k)` separates the `d_model` dimension into heads.
Logically, this groups features belonging to the same head together, but they are still nested inside each token.

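**Step 2: Transpose (Group by head)**
`x.transpose(0, 1)` swaps the first two dimensions.
Now, all tokens for `head_0` sit under index 0 of the first dimension, so a single batched matrix multiplication handles every head at once. Note that `transpose` only changes how the tensor is indexed (its strides); the data is not rearranged in memory until `.contiguous()` is called.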

```text
Input:  [Token 1] [Token 2] ...
           |         |
        [h1,h2]   [h1,h2]

          || view
          VV
        [[h1],    [[h1],
         [h2]]     [h2]]

          || transpose
          VV
Head 1: [Token 1_h1, Token 2_h1, ...]
Head 2: [Token 1_h2, Token 2_h2, ...]
```
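
The same shuffle can be traced on a tiny tensor whose entries are just their own positions. The sizes below are assumptions picked so that every element can be followed by eye.

```python
import torch

seq_len, n_heads, d_k = 2, 2, 2           # toy sizes, so d_model = 4
d_model = n_heads * d_k

# Fill with 0..7 so every element is easy to track
x = torch.arange(seq_len * d_model).view(seq_len, d_model)
# x = [[0, 1, 2, 3],    <- token 0: head 0 owns [0, 1], head 1 owns [2, 3]
#      [4, 5, 6, 7]]    <- token 1: head 0 owns [4, 5], head 1 owns [6, 7]

split = x.view(seq_len, n_heads, d_k)     # (seq_len, n_heads, d_k)
by_head = split.transpose(0, 1)           # (n_heads, seq_len, d_k)

print(by_head[0])                          # head 0: rows [0, 1] and [4, 5]
print(by_head[1])                          # head 1: rows [2, 3] and [6, 7]
print(by_head.is_contiguous())             # False: transpose only changed the strides

# Merging heads back: start from a freshly laid-out per-head tensor, as an
# attention output would be, then transpose, make contiguous, and flatten
heads = by_head.contiguous()
merged = heads.transpose(0, 1).contiguous().view(seq_len, d_model)
print(torch.equal(merged, x))              # True
```

Calling `.view` directly on the transposed tensor in the merge step would raise an error, which is why `.contiguous()` appears before it.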


## 6. Multi-Head Attention (Batched / Efficient Version)

The loop above is clear but slow. In practice, we pack all heads into a single large projection, then reshape. Still no `nn.Module` — just reshape tricks.

In [None]:
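def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
    """
    Efficient multi-head attention using reshape instead of loops.
    
    X:    (seq_len, d_model)
    W_Q:  (d_model, d_model), all heads packed into one matrix
    W_K:  (d_model, d_model)
    W_V:  (d_model, d_model)
    W_O:  (d_model, d_model)
    n_heads: number of heads; must divide d_model
    mask: optional (seq_len, seq_len) bool tensor, True where attention is blocked
    
    Returns: (seq_len, d_model) output, (n_heads, seq_len, seq_len) weights
    """
    seq_len, d_model = X.shape
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_model // n_heads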
    
    # 1. Project all heads at once
    Q = X @ W_Q  # Shape: (seq_len, d_model)
    K = X @ W_K
    V = X @ W_V
    
    # 2. Reshape to (n_heads, seq_len, d_k) — split d_model into heads
    Q = Q.view(seq_len, n_heads, d_k).transpose(0, 1)  # Shape: (n_heads, seq_len, d_k)
    K = K.view(seq_len, n_heads, d_k).transpose(0, 1)
    V = V.view(seq_len, n_heads, d_k).transpose(0, 1)
    
    # 3. Scaled dot-product attention (batched over heads)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # Shape: (n_heads, seq_len, seq_len)
    
    if mask is not None:
        scores = scores.masked_fill(mask.unsqueeze(0), float('-inf'))
    
    weights = softmax(scores)  # softmax works because we wrote it for dim=-1
    attn_out = weights @ V     # Shape: (n_heads, seq_len, d_k)
    
    # 4. Concatenate heads: transpose back and reshape
    attn_out = attn_out.transpose(0, 1).contiguous().view(seq_len, d_model)
    
    # 5. Output projection
    output = attn_out @ W_O  # (seq_len, d_model)
    
    return output, weights


# Create packed weight matrices (all heads in one tensor)
W_Q_packed = torch.randn(d_model, d_model, device=device) * 0.1
W_K_packed = torch.randn(d_model, d_model, device=device) * 0.1
W_V_packed = torch.randn(d_model, d_model, device=device) * 0.1
W_O_packed = torch.randn(d_model, d_model, device=device) * 0.1

mha_output, mha_weights = multi_head_attention(
    X, W_Q_packed, W_K_packed, W_V_packed, W_O_packed, n_heads
)

print("MHA output shape:", mha_output.shape)
print("Attention weights shape:", mha_weights.shape, " — (n_heads, seq_len, seq_len)")
print("\nHead 0 weights:")
print(mha_weights[0])
print("\nHead 1 weights:")
print(mha_weights[1])
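
The packed version and the per-head loop compute the same thing when head `h` is given columns `h*d_k` to `(h+1)*d_k` of each packed matrix. The sketch below checks this on random weights. It is self-contained, uses `torch.softmax` for brevity, and skips the output projection `W_O` since that step is identical in both routes; `attend` and `split_heads` are hypothetical helper names.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads = 4, 8, 2       # toy sizes, chosen for illustration
d_k = d_model // n_heads

X = torch.randn(seq_len, d_model)
W_Q = torch.randn(d_model, d_model) * 0.1
W_K = torch.randn(d_model, d_model) * 0.1
W_V = torch.randn(d_model, d_model) * 0.1

def attend(Q, K, V):
    """Scaled dot-product attention; works for 2-D and batched 3-D inputs."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    return torch.softmax(scores, dim=-1) @ V

def split_heads(t):
    """(seq_len, d_model) -> (n_heads, seq_len, d_k)."""
    return t.view(seq_len, n_heads, d_k).transpose(0, 1)

# Route 1: packed projection, then view + transpose
packed = attend(split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V))
packed = packed.transpose(0, 1).contiguous().view(seq_len, d_model)

# Route 2: explicit loop, where head h uses columns [h*d_k, (h+1)*d_k)
looped = []
for h in range(n_heads):
    cols = slice(h * d_k, (h + 1) * d_k)
    looped.append(attend(X @ W_Q[:, cols], X @ W_K[:, cols], X @ W_V[:, cols]))
looped = torch.cat(looped, dim=-1)

print(torch.allclose(packed, looped, atol=1e-6))
```

This is why the packed matrices can be read as the per-head matrices placed side by side: the reshape only decides which columns each head sees.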

## 7. Visualizing Attention Weights

In [None]:
# Visualizing heads using our helper
for i in range(n_heads):
    w = mha_weights[i].detach().cpu().numpy()
    plot_heatmap(w, tokens, f"Multi-Head Attention - Head {i}")


## 8. Summary

Everything we built uses only these primitive operations:

| Operation | PyTorch op | Purpose |
|-----------|-----------|----------|
| Matrix multiply | `@` / `torch.matmul` | Project Q, K, V; compute scores; weighted sum |
| Division | `/` | Scale scores by √d_k |
| Exp | `torch.exp` | Part of softmax |
| Sum | `.sum()` | Part of softmax |
| Max | `.max()` | Numerical stability in softmax |
| Masked fill | `.masked_fill()` | Causal masking |
| Reshape | `.view()`, `.transpose()` | Splitting/merging heads |

No `nn.Module`, no `nn.Linear`, no `nn.MultiheadAttention` — just tensors and math.