## 🧠 How attention works

In our **bigram model**, each token predicted the next without any real understanding of what came before. It had **no context**, just a lookup table. That’s very limited.

Now we want to go **beyond bigrams** and allow tokens to **"look back"** and summarize what’s happened so far. This is where attention begins, and with it the foundation of transformer models.

---

### 🔍 The Core Idea

Imagine a sequence of 8 tokens like:

```
[Token₁, Token₂, Token₃, ..., Token₈]
```

- The **5th token** should ideally make decisions based on tokens `1, 2, 3, 4` (plus its own embedding).
- It should **not** "see the future" (tokens `6, 7, 8`), since we're generating left-to-right.
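
The visibility rule above can be sketched as a boolean lower-triangular matrix. This is only an illustration: the name `visible` is made up here, and it assumes a token may also use its own embedding, which is how the averaging code in this walkthrough treats it.

```python
import torch

T = 8
# Row t marks which positions token t is allowed to use (True = visible).
visible = torch.tril(torch.ones(T, T)).bool()

t = 4  # the 5th token (0-based index 4)
print(visible[t].tolist())
# [True, True, True, True, True, False, False, False]
```

Positions 0 to 4 are visible to the 5th token, while positions 5 to 7 (tokens 6, 7, 8) are hidden.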

## 1. Naive Averaging

In [22]:
import torch

B, T, C = 4, 8, 2  # Batch size, Time steps (sequence length), Channels (embedding dim)
x = torch.randn(B, T, C)  # Random input (think of this like token embeddings)

# Allocate output tensor
xbow = torch.zeros(B, T, C)

# For each time step, average over all past and current tokens
for b in range(B):
    for t in range(T):
        xbow[b, t] = torch.mean(x[b, :t+1], dim=0)  # Mean of tokens up to t

# Print the original input for batch 0
print("Input x[0]:")
print(x[0])

# Print the contextualized embeddings (averaged tokens)
print("\nNaive Averaged xbow[0]:")
print(xbow[0])

Input x[0]:
tensor([[-0.8345,  0.5978],
        [-0.0514, -0.0646],
        [-0.4970,  0.4658],
        [-0.2573, -1.0673],
        [ 2.0089, -0.5370],
        [ 0.2228,  0.6971],
        [-1.4267,  0.9059],
        [ 0.1446,  0.2280]])

Naive Averaged xbow[0]:
tensor([[-0.8345,  0.5978],
        [-0.4429,  0.2666],
        [-0.4610,  0.3330],
        [-0.4100, -0.0171],
        [ 0.0738, -0.1210],
        [ 0.0986,  0.0153],
        [-0.1193,  0.1425],
        [-0.0863,  0.1532]])


### 🔎 What’s Actually Happening in Naive Averaging?

1. **Original Embedding**  
   - `x[0][4]` is the raw embedding for **token 5** (index 4) in sequence 0.  

2. **Averaged Embedding**  
   - `xbow[0][4]` = **mean** of `x[0][0:5]` → the average of embeddings for tokens 1–5.  
   - This gives a **smoothed, context-aware** vector that “remembers” everything seen so far.

---

### 📊 Step-by-Step Visual

Suppose your sequence embeddings look like this:

| Time step (t) | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  |
|--------------:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| Token embedding | T₀ | T₁ | T₂ | T₃ | T₄ | T₅ | T₆ | T₇ |

Then for each position _t_:

- **xbow[0]** = mean([ T₀ ])  
- **xbow[1]** = mean([ T₀, T₁ ])  
- **xbow[2]** = mean([ T₀, T₁, T₂ ])  
- …  
- **xbow[7]** = mean([ T₀, T₁, T₂, T₃, T₄, T₅, T₆, T₇ ])  

In other words, **every xbow[t] blends all tokens from 0…t** into a single vector.
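
As a side note, the same prefix means can be sketched with a running sum. This is an alternative formulation that the walkthrough itself does not use; `counts` and `xbow_cumsum` are names introduced only for this example.

```python
import torch

torch.manual_seed(0)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# Running sum along the time dimension, divided by how many tokens were summed.
counts = torch.arange(1, T + 1).view(1, T, 1)   # 1, 2, ..., T
xbow_cumsum = torch.cumsum(x, dim=1) / counts   # (B, T, C)

# Compare against the explicit prefix mean at one position.
t = 4
expected = x[:, : t + 1].mean(dim=1)
print(torch.allclose(xbow_cumsum[:, t], expected, atol=1e-6))
```

Position `t` of the running sum holds `T₀ + … + Tₜ`, so dividing by `t + 1` gives the same blend as `xbow[t]`.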

---

### ⚠️ Key Limitations

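- **Order lost:**  T₀ + T₁ = T₁ + T₀, so the average cannot tell which token came first  
- **Uniform weights:** All past tokens count equally  
- **Inefficient loops:** B × T Python-level iterations, each averaging up to T rows (O(T²) work per sequence)  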
- **Blurred details:** Sharp patterns get diluted  

> 👉 Despite these drawbacks, averaging introduces the fundamental idea that **each token should incorporate its history**—a stepping stone toward full self-attention.  
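
The "order lost" limitation can be checked with a small sketch: reversing a sequence leaves its overall mean unchanged. The tensor sizes and the name `x_reversed` are arbitrary choices for this illustration.

```python
import torch

torch.manual_seed(0)
T, C = 5, 2
x = torch.randn(T, C)

# Same tokens, opposite order.
x_reversed = x.flip(0)

# The mean over the whole sequence ignores the order entirely.
same_mean = torch.allclose(x.mean(dim=0), x_reversed.mean(dim=0), atol=1e-6)
print(same_mean)
```

Two sequences that contain the same tokens in a different order produce the same averaged vector at the final position, so the model cannot tell them apart from that vector alone.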



## 2. Efficient Averaging Using Matrix Multiplication


In [None]:
import torch

# Example dimensions
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)  # (B, T, C) token embeddings

# 1️⃣ Build the causal mask (T×T)
a = torch.tril(torch.ones(T, T))           # shape = (T, T)
# 2️⃣ Normalize each row to sum to 1
a = a / a.sum(dim=1, keepdim=True)         # still (T, T)

# 3️⃣ Vectorized averaging: apply mask to x
#    (T×T) @ (B×T×C) → (B×T×C), broadcasting over batch dim
xbow_vec = a @ x                           # shape = (B, T, C)

# 4️⃣ Inspect results for batch 0
print("Original x[0]:")
print(x[0])

print("\nVectorized Averaged xbow_vec[0]:")
print(xbow_vec[0])


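Toy example of the underlying idea: with an all-ones 3×3 matrix `a`, every row of `c = a @ b` is the column-wise sum of `b` (2 + 6 + 6 = 14 and 7 + 4 + 5 = 16). Making `a` lower-triangular and row-normalized turns these sums into causal averages.

a =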
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
b =
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c =
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


### 🔎 What’s Actually Happening with Matrix Averaging?

1. **Causal Mask Matrix**  
   - We build a lower-triangular matrix **a** of shape (T, T), where  
     ```python
     a = torch.tril(torch.ones(T, T))
     ```  
   - This mask has 1’s for positions ≤ t and 0’s for future positions > t.

2. **Row Normalization**  
   - We normalize each row so it sums to 1:  
     ```python
     a = a / a.sum(1, keepdim=True)
     ```  
   - Now **a[t]** is a uniform averaging distribution over tokens 0…t.

3. **Vectorized Averaging**  
   - Multiply the mask by your embedding tensor `x` (shape `(B, T, C)`):  
     ```python
     xbow = a @ x  # result shape (B, T, C)
     ```  
   - For each position _t_, `xbow[:, t, :]` equals the mean of `x[:, 0:t+1, :]` over the time dimension, because row _t_ of **a** puts weight `1/(t+1)` on positions 0…t and 0 elsewhere.

---

### 📊 Step-by-Step Visual (T=3 example)

1. **Build & Normalize**  
   ```python
   a = torch.tril(torch.ones(3, 3))
   # a = [[1,0,0],
   #      [1,1,0],
   #      [1,1,1]]

   a = a / a.sum(1, keepdim=True)
   # a = [[1.0, 0.0, 0.0],
   #      [0.5, 0.5, 0.0],
   #      [0.33,0.33,0.33]]
   ```

👉 This vectorized trick computes exactly the same result as the averaging loop above and shares its limitations, but it removes the Python loops and is therefore more efficient.
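
To round off the T=3 example, here is a small sketch that applies the normalized matrix to a 3×2 matrix `b`. The values of `b` are the ones from the toy output earlier in this section, and the hand-computed rows in the comments are there so you can check the arithmetic.

```python
import torch

a = torch.tril(torch.ones(3, 3))
a = a / a.sum(1, keepdim=True)

# 3 "tokens" with 2 channels each
b = torch.tensor([[2., 7.],
                  [6., 4.],
                  [6., 5.]])

c = a @ b
print(c)
# Row 0: [2, 7]                          (token 0 alone)
# Row 1: [(2+6)/2, (7+4)/2] = [4, 5.5]
# Row 2: [(2+6+6)/3, (7+4+5)/3] ≈ [4.6667, 5.3333]
```

Each row of `c` is the average of the rows of `b` up to and including that position.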

#### 🧠 Apply to All Batches Efficiently



In [None]:
import torch

# Ensure reproducibility
torch.manual_seed(1337)

B, T, C = 4, 8, 2

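# Reuses the x from the naive approach (it must already be defined;
# this cell does not create it, so the seed above does not change its values)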

# Naive for-loop version
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = torch.mean(x[b, :t+1], dim=0)

# Efficient matrix version
wei = torch.tril(torch.ones(T, T))          # Causal mask
wei = wei / wei.sum(1, keepdim=True)        # Normalize each row
xbow2 = wei @ x                             # Broadcast over batch

# Compare outputs
print("Original x[0]:")
print(x[0])

print("\nNaive Averaged xbow[0]:")
print(xbow[0])

print("\nEfficient Averaged xbow2[0]:")
print(xbow2[0])

# Confirm they match
print("\nDo they match?")
print(torch.allclose(xbow, xbow2, rtol=1e-4, atol=1e-6))



Original x[0]:
tensor([[-0.8345,  0.5978],
        [-0.0514, -0.0646],
        [-0.4970,  0.4658],
        [-0.2573, -1.0673],
        [ 2.0089, -0.5370],
        [ 0.2228,  0.6971],
        [-1.4267,  0.9059],
        [ 0.1446,  0.2280]])

Naive Averaged xbow[0]:
tensor([[-0.8345,  0.5978],
        [-0.4429,  0.2666],
        [-0.4610,  0.3330],
        [-0.4100, -0.0171],
        [ 0.0738, -0.1210],
        [ 0.0986,  0.0153],
        [-0.1193,  0.1425],
        [-0.0863,  0.1532]])

Efficient Averaged xbow2[0]:
tensor([[-0.8345,  0.5978],
        [-0.4429,  0.2666],
        [-0.4610,  0.3330],
        [-0.4100, -0.0171],
        [ 0.0738, -0.1210],
        [ 0.0986,  0.0153],
        [-0.1193,  0.1425],
        [-0.0863,  0.1532]])

Do they match?
True


## 🧠 Final Version: Weighted Averaging with Softmax — the Birth of Attention

We’ve seen how to compute averages using a **triangular matrix**, where each token only sees previous tokens.

But what if we want to assign **different importance** to each past token — instead of giving equal weight?

That’s what self-attention does! It uses **softmax** to turn scores into **normalized weights** over each token’s past, and those scores become **learnable** once queries and keys are added.

---

### 🔶 Step 1: Create a Causal Mask

We want tokens to attend only to the **past** (and themselves), so we use a lower-triangular matrix again. This is a **causal mask**: it ensures that token `t` can only "see" tokens `0...t`.

In [30]:
import torch

T = 4  # Sequence length
tril = torch.tril(torch.ones(T, T))

print(tril)

tensor([[1., 0., 0., 0.],
        [1., 1., 0., 0.],
        [1., 1., 1., 0.],
        [1., 1., 1., 1.]])


### 🔶 Step 2: Mask Out the Future with `-inf`

➡️ Start from all-zero scores and fill every **future position** with `-inf`. The current and past positions stay `0`.

In [31]:
wei = torch.zeros(T, T)
wei = wei.masked_fill(tril == 0, float('-inf'))

print(wei)

tensor([[0., -inf, -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., -inf],
        [0., 0., 0., 0.]])


### 🔶 Step 3: Apply Softmax

💡 **Softmax** turns each row into a **probability distribution**:

* All weights on each row add up to 1
* With all visible logits equal (here they are all `0`), every visible token gets the same weight, so the result matches the uniform average

> **Bottom line:** Softmax turns arbitrary scores into a “soft pick” over past tokens—equal if scores are equal, or biased toward the most relevant ones when scores differ.  



In [32]:
import torch.nn.functional as F

wei = F.softmax(wei, dim=-1)

print(wei)

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500]])


### 🔁 What's Happening?

Let’s say we have 4 tokens: `T₀, T₁, T₂, T₃`

The attention weights become:

| Token | Weighted average of          |
| ----- | ---------------------------- |
| `T₀`  | `T₀` only (1.0)              |
| `T₁`  | 0.5 × `T₀` + 0.5 × `T₁`      |
| `T₂`  | Equal weights over `T₀`–`T₂` |
| `T₃`  | Equal weights over `T₀`–`T₃` |

Each token blends its **history** — just like in attention.
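
To see what changes once the scores are no longer all equal, here is a sketch with made-up logits. The values in `scores` are purely hypothetical; in real self-attention they would come from query-key dot products.

```python
import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))

# Hypothetical scores: every query "likes" key 0 more than the others.
scores = torch.zeros(T, T)
scores[:, 0] = 2.0

scores = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(scores, dim=-1)
print(wei)
# Row 1 is roughly [0.8808, 0.1192, 0, 0]: token 0 now dominates.
```

The rows still sum to 1 and the future positions are still zero, but the weights are no longer uniform.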


### 📌 Why `-inf`?

Because `exp(-inf) = 0`, softmax assigns exactly zero weight to any position whose score is `-inf`.

We use it to force **future tokens to zero**, i.e. prevent information leakage during training.
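
A minimal pure-Python sketch makes the point. The `softmax` helper below is written just for this illustration (it is not PyTorch's implementation), and it compares masking with `-inf` against masking with `0`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(math.exp(float('-inf')))                          # 0.0

masked_with_inf = softmax([0.0, 0.0, float('-inf')])
print(masked_with_inf)                                   # [0.5, 0.5, 0.0]

# Filling the "future" slot with 0 instead would not mask anything:
masked_with_zero = softmax([0.0, 0.0, 0.0])
print(masked_with_zero)                                  # three equal weights of 1/3
```

Only `-inf` removes the future token from the distribution; a score of `0` still receives a full share of the weight.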


### 🧠 Why This Masked Softmax Method Matters for Self-Attention

In earlier steps, we used simple averages to summarize past tokens. But what if we don't want to treat all past tokens equally? What if **some past tokens are more relevant** than others?

That’s exactly what self-attention does — it decides *how much each past token should contribute* to the current one.
The masked softmax approach gives us this control.

---

### 🔍 What's Happening Here?

```python
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
```

* `wei` starts as all zeros.
* We apply a **lower-triangular mask** to make sure **each token only attends to the past**.
* `-inf` ensures softmax assigns zero probability to "future" tokens.
* `F.softmax` converts the remaining scores into **attention weights** that decide *how much to pay attention to each previous token*. With all-zero scores these weights are still uniform.

---

- ✅ Gives **non-uniform attention weights** as soon as the scores differ
- ✅ Still uses **only the past** (causal)
- ✅ Forms the **core logic of self-attention**
- ✅ Can be **learned** once we add key and query vectors


## Implementing Single-Head Self-Attention from Scratch

Now that we can average past tokens in one go, let’s upgrade to **self-attention**, which learns **which** past tokens matter most.

In self-attention, each token plays three roles:

1. 🔍 **Query**: “What am I looking for?”  
2. 📡 **Key**: “Here’s what I contain.”  
3. 💡 **Value**: “Here’s what I pass on.”

**Process**:

- Compute **scores** by dot-product: `score[i,j] = query[i] · key[j]`.  
- Apply a **causal mask** (no future peeking) and **softmax** to turn scores into weights.  
- Multiply weights by the **values** to get a new, context-aware embedding for each token.

---

> **Result:** Instead of fixed uniform averages, each token now **dynamically** focuses on the most relevant parts of its history.  




In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)

B, T, C = 4, 8, 32  # Batch size, Sequence length (time), Embedding dimension (channels)
x = torch.randn(B, T, C)  # Random token embeddings (input)


We simulate a batch of 4 sequences (`B=4`), each with 8 tokens (`T=8`), where each token is embedded in a 32-dimensional space (`C=32`).


### 🎯 The Goal

Each token should attend to earlier tokens to gather relevant context, but **not all previous tokens are equally useful**. That's where **query-key matching** comes in.


### 🔧 Step 1: Linear Projections to Get `q`, `k`, `v`


In [None]:
head_size = 16  # Attention head dimension (can be smaller than C)

# Create linear projections: no bias for simplicity
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

# Project the input x into keys, queries, and values
k = key(x)    # (B, T, 16)
q = query(x)  # (B, T, 16)
v = value(x)  # (B, T, 16)

Each token now emits:

* **Key**: What it offers to others.
* **Query**: What it's looking for.
* **Value**: What it gives when attended to.

> 📌 All three are derived from the same input `x`.
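
It can help to see that a bias-free `nn.Linear` is just a matrix multiply. The sketch below checks this for the key projection; `k_manual` is a name introduced only for the comparison, and the seed is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)

key = nn.Linear(C, head_size, bias=False)
print(key.weight.shape)  # torch.Size([16, 32])

# With bias=False the layer multiplies by the transposed weight matrix.
k = key(x)                    # (B, T, 16)
k_manual = x @ key.weight.T   # (B, T, 16)
print(torch.allclose(k, k_manual, atol=1e-5))
```

The same weight matrix is applied to every token at every position, so the projection itself carries no information about where a token sits in the sequence.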

### 🔍 Step 2: Compute Attention Scores (`q @ kᵀ`)

In [None]:
# Dot-product attention: how well does each query match each key?
wei = q @ k.transpose(-2, -1)  # Shape: (B, T, T)

> 🔄 Each token’s **query** vector is dotted with every token’s **key** vector in the same sequence, producing a raw score for each pair.

For a single batch entry `b`, `wei[b]` is a full `T × T` matrix of dot products. Only the lower triangle survives the causal mask applied in Step 3, after which it looks like:

```
           key t0   key t1   key t2   key t3  …
query t0 [ q0·k0    -inf     -inf     -inf    … ]
query t1 [ q1·k0   q1·k1     -inf     -inf    … ]
query t2 [ q2·k0   q2·k1    q2·k2     -inf    … ]
query t3 [ q3·k0   q3·k1    q3·k2    q3·k3    … ]
   …
```

* Row *i* shows how well **query** *i* matches each **key** *j* for *j ≤ i*; entries with *j > i* are set to `-inf` by the mask and become 0 after softmax.
* Those scores then get softmaxed into attention weights.
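
As a sanity check on the batched matrix product, the sketch below confirms that one entry of `wei` is exactly one query-key dot product. The random `q` and `k` are stand-ins for `query(x)` and `key(x)`, and the index choice is arbitrary.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 2, 4, 16
q = torch.randn(B, T, head_size)   # stand-in for query(x)
k = torch.randn(B, T, head_size)   # stand-in for key(x)

wei = q @ k.transpose(-2, -1)      # (B, T, T)

# Entry (i, j) of batch b is the dot product of query i with key j.
b, i, j = 0, 2, 1
single = torch.dot(q[b, i], k[b, j])
print(torch.allclose(wei[b, i, j], single, atol=1e-5))
```

Transposing only the last two dimensions keeps the batch dimension intact, so each sequence gets its own `T × T` score matrix.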


### 🔒 Step 3: Apply Causal Mask (Prevent "Future Peeking")

In [None]:
tril = torch.tril(torch.ones(T, T))  # Lower triangle = 1, upper = 0
wei = wei.masked_fill(tril == 0, float('-inf'))  # Block future attention

This ensures that:

* Token 5 **cannot** look at tokens 6, 7, or 8.
* Only **past and current** tokens are visible.

### 📊 Step 4: Normalize with Softmax (Attention Weights)

In [None]:
wei = F.softmax(wei, dim=-1)  # Normalize across tokens

> ✅ Now each row of `wei[b]` is a probability distribution over the current and past tokens.

📌 **Interpretation**:
Each token now *softly selects* which previous tokens it wants to attend to.
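
Both properties of the masked softmax (rows sum to 1, no weight on the future) can be verified with a short sketch. The random `scores` tensor is a stand-in for `q @ kᵀ`, and `future_mass` is a name made up for this check.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T = 2, 5
scores = torch.randn(B, T, T)   # stand-in for q @ k^T

tril = torch.tril(torch.ones(T, T))
wei = F.softmax(scores.masked_fill(tril == 0, float('-inf')), dim=-1)

row_sums = wei.sum(dim=-1)                 # every entry should be 1
future_mass = (wei * (1 - tril)).sum()     # total weight placed on future tokens
print(row_sums)
print(future_mass.item())
```

The first row of every batch entry is always `[1, 0, 0, …]`, because the first token has nothing but itself to attend to.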


### 📦 Step 5: Apply Attention to Values



In [None]:
out = wei @ v  # (B, T, 16)

### Why Multiply by the Value Vectors?

1. **Keys & Queries Only Score Relevance**  
   - **Query** (`qᵢ`) and **Key** (`kⱼ`) let us compute a score `scoreᵢⱼ = qᵢ·kⱼ`  
   - After softmax, we get a weight `wᵢⱼ` that says, “How much should token _j_ influence token _i_?”

2. **Values Carry the Actual Content**  
   - Each token also has a **Value** vector `vⱼ` that encodes “what this token represents.”  
   - We don’t want to pass along the raw scores—those only tell us _how much_ to pay attention, not _what_ to pass.

3. **Weighted Sum Produces Contextual Output**  
   - For position _i_, we compute  
     ```
     outᵢ = ∑ⱼ wᵢⱼ · vⱼ
     ```  
   - In matrix form:  
     ```python
     out = wei @ v
     ```  
   - This blends each token’s content (`vⱼ`) according to its importance (`wᵢⱼ`), producing a new, context-aware embedding for token _i_.

---

> **In plain English:**  
> 1. **Score** each past token for relevance (q·k).  
> 2. **Turn** those scores into attention weights (softmax).  
> 3. **Gather** each token’s content (v) and **mix** them according to those weights.  
>  
> The result is a fresh embedding that “knows” which past tokens mattered most.  
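
The weighted sum can also be written out explicitly for a single position and compared with the matrix form. This is a sketch with random stand-in weights and values; `manual` and the small sizes are illustrative choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, head_size = 4, 3
tril = torch.tril(torch.ones(T, T))
wei = F.softmax(torch.randn(T, T).masked_fill(tril == 0, float('-inf')), dim=-1)
v = torch.randn(T, head_size)

out = wei @ v   # matrix form

# Explicit sum for position i: out_i = sum_j w_ij * v_j
i = 2
manual = torch.zeros(head_size)
for j in range(T):
    manual += wei[i, j] * v[j]

print(torch.allclose(out[i], manual, atol=1e-6))
```

Terms with `j > i` contribute nothing because their weights are zero, so the loop effectively runs only over the past and current tokens.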


### 📈 Final Output



We’ve just transformed our input `x` of shape `(B, T, C)` into a new output `(B, T, head_size)`.

In [None]:
print(out.shape)  # torch.Size([4, 8, 16])

* Each row `wei[b, t]` tells us *how much weight* to give to each value vector `v[b, 0…t]`.
* This produces a new vector `out[b, t]` for each token: a context-aware representation.
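
Putting the five steps together, a compact sketch might look like the following. The helper `single_head_attention` is a hypothetical wrapper written for this recap, not part of the original cells, and the final check (changing the last token must leave earlier outputs untouched) is one way to confirm that the mask works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def single_head_attention(x, key, query, value):
    """Illustrative helper chaining steps 1-5 of the walkthrough."""
    B, T, C = x.shape
    k, q, v = key(x), query(x), value(x)              # step 1: projections
    wei = q @ k.transpose(-2, -1)                     # step 2: scores (B, T, T)
    tril = torch.tril(torch.ones(T, T))
    wei = wei.masked_fill(tril == 0, float('-inf'))   # step 3: causal mask
    wei = F.softmax(wei, dim=-1)                      # step 4: weights
    return wei @ v                                    # step 5: weighted values

torch.manual_seed(1337)
B, T, C, head_size = 4, 8, 32, 16
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

with torch.no_grad():
    out = single_head_attention(x, key, query, value)
    print(out.shape)  # torch.Size([4, 8, 16])

    # Causality check: replace the last token and recompute.
    x2 = x.clone()
    x2[:, -1] = torch.randn(B, C)
    out2 = single_head_attention(x2, key, query, value)
    causal_ok = torch.allclose(out[:, :-1], out2[:, :-1], atol=1e-5)
    print(causal_ok)
```

Only the output at the final position depends on the last token, which is the behaviour we want when generating left-to-right.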