# Simple Self-Attention Mechanism

This notebook demonstrates the core concepts of **self-attention**, the key innovation in Transformer models.

**What you'll learn:**
1. How **dot products** measure similarity between token embeddings
2. How **softmax** converts raw scores to attention weights
3. How the **context vector** blends information from all tokens
4. How to compute attention for all tokens at once (attention matrix)

## Install Dependencies

In [16]:
!pip install torch tiktoken transformers

## Toy Example: Input Embeddings

**Important:** These embeddings are **HAND-CRAFTED** for demonstration purposes!

In real models:
- Embeddings start **random**
- Similarity is **learned** via backpropagation over millions of examples

Notice that "journey" `[0.55, 0.87, 0.66]` and "starts" `[0.57, 0.85, 0.64]` have nearly identical values. The author chose them this way **on purpose** to show how attention works; no model learned this similarity.

In [17]:
import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

## Select the Query Token

We select "journey" (index 1) as our **query**. The goal: compute how much "journey" should attend to every word in the sentence, itself included.

In [18]:
query = inputs[1]
query

tensor([0.5500, 0.8700, 0.6600])

## Allocate Memory for Attention Scores

`torch.empty()` creates an **uninitialized** tensor: it allocates memory without setting values.

The arbitrary numbers in the output below (like `1.1210e-43`) are **garbage values**, whatever happened to be in that memory location, so they will differ from run to run. This is a **performance optimization**: since we're about to overwrite every value anyway, there's no point initializing to zeros first.

In [19]:
attn_scores_2 = torch.empty(inputs.shape[0])
attn_scores_2

tensor([3.1625e+02, 0.0000e+00, 8.1076e+02, 0.0000e+00, 1.1210e-43, 0.0000e+00])

## Compute Attention Scores with Dot Product

The **dot product** measures similarity between vectors:

```
A · B = (a1 × b1) + (a2 × b2) + (a3 × b3)
```

**For vectors of similar magnitude, a higher dot product means more similar vectors**

- When both vectors have large values in the same positions → big × big → large score
- When one is large where the other is small → big × small → small contribution

Results for the query "journey" (selected rows):

| Token | Score | Why? |
|-------|-------|------|
| journey | 1.4950 | Highest: the query compared with itself |
| starts | 1.4754 | Values nearly identical to journey's |
| one | 0.7070 | Lowest: least aligned with journey |

**Note:** The dot product also depends on **magnitude**. A token with small values (like "with") can score lower against itself (0.4937) than against a larger-magnitude token such as "journey" (0.8434).

In [20]:
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)

tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
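
The magnitude effect described in the note can be checked by hand. The sketch below uses plain Python rather than PyTorch so every arithmetic step is visible. The helper names `dot` and `cosine` and the variable `with_tok` are illustrative, not part of the notebook, and cosine similarity is brought in here only as a reference point for what "similarity without magnitude" looks like.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product with both magnitudes divided out (direction only)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

journey = [0.55, 0.87, 0.66]
with_tok = [0.22, 0.58, 0.33]  # the token "with"; renamed because `with` is a Python keyword

self_score = dot(with_tok, with_tok)   # "with" scored against itself
cross_score = dot(with_tok, journey)   # "with" scored against "journey"
print("self:", round(self_score, 4), "cross:", round(cross_score, 4))

# Once magnitude is divided out, a vector is most similar to itself.
print("cos self:", round(cosine(with_tok, with_tok), 4))
print("cos cross:", round(cosine(with_tok, journey), 4))
```

The raw dot product ranks "journey" above "with" itself, while the cosine comparison does not, which suggests the difference comes from vector length rather than direction.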


## Normalize Scores to Get Attention Weights

Raw scores need to be converted to **weights that sum to 1** (like probabilities).

### Simple Normalization (divide by sum)
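This is a naive approach: divide each score by the total. It yields valid weights here only because every score is positive; a negative score would produce a negative weight.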

In [21]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)
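
To see where divide-by-sum normalization can go wrong, here is a small plain-Python sketch. The three scores are hypothetical values chosen to include a negative number; they do not come from the notebook's embeddings.

```python
import math

scores = [2.0, -1.0, 0.5]  # hypothetical raw scores, one of them negative

# Divide-by-sum: the weights still sum to 1, but one of them is negative.
total = sum(scores)
sum_weights = [s / total for s in scores]

# Exponentiating first keeps every weight positive.
exps = [math.exp(s) for s in scores]
exp_total = sum(exps)
softmax_weights = [e / exp_total for e in exps]

print("divide by sum:", [round(w, 4) for w in sum_weights])
print("softmax:      ", [round(w, 4) for w in softmax_weights])
```

A negative weight has no sensible reading as "how much to attend", which is one motivation for exponentiating before normalizing.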


### Softmax Normalization (better approach)

**Softmax** is preferred because:
1. It exponentiates values first (`e^x`), which amplifies differences between scores
2. All outputs are guaranteed positive, even when a raw score is negative
3. It is smooth and differentiable everywhere, which suits gradient-based training

Formula: `softmax(x_i) = e^(x_i) / Σ_j e^(x_j)`

In [22]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)


### PyTorch's Built-in Softmax

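`torch.softmax()` computes the same function but is **numerically stable**: for very large scores the naive version overflows inside `torch.exp` (giving `inf`, and then `nan` after the division), whereas the built-in handles very large and very small inputs safely.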

In [23]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
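
A common way to make softmax numerically stable is to subtract the maximum score before exponentiating, which leaves the result unchanged mathematically. The plain-Python sketch below illustrates that trick. It is not a copy of PyTorch's implementation, and the function names `naive_softmax_py` and `stable_softmax_py` are made up for this example.

```python
import math

def naive_softmax_py(scores):
    """Direct translation of the formula; overflows for large inputs."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax_py(scores):
    """Shift by the maximum so the largest exponent is exp(0) = 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

large = [1000.0, 1001.0, 1002.0]  # hypothetical scores far outside exp's safe range

try:
    naive_softmax_py(large)
    naive_failed = False
except OverflowError:
    # math.exp raises instead of returning inf the way torch.exp does
    naive_failed = True

stable = stable_softmax_py(large)
print("naive overflowed:", naive_failed)
print("stable weights:", [round(w, 4) for w in stable])
```

Because only differences between scores matter, `stable_softmax_py([1000.0, 1001.0, 1002.0])` gives the same weights as `stable_softmax_py([0.0, 1.0, 2.0])`.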


## Compute the Context Vector

The **context vector** is the output of attention — a **weighted sum** of all input embeddings.

```
context_vec = Σ (attention_weight_i × input_i)
```

Each input embedding is scaled by its attention weight and then summed:

| Token | Weight | Contribution |
|-------|--------|--------------|
| Your | 0.1385 | `[0.43, 0.15, 0.89] × 0.1385` |
| journey | 0.2379 | `[0.55, 0.87, 0.66] × 0.2379` |
| starts | 0.2333 | `[0.57, 0.85, 0.64] × 0.2333` |
| with | 0.1240 | `[0.22, 0.58, 0.33] × 0.1240` |
| one | 0.1082 | `[0.77, 0.25, 0.10] × 0.1082` |
| step | 0.1581 | `[0.05, 0.80, 0.55] × 0.1581` |

**Result:** A new representation of "journey" that **blends information from all words**, but is **biased toward similar words** (journey and starts contribute ~47% together).

This is the core idea of attention: creating context-aware representations!

In [24]:
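query = inputs[1]  # "journey"
context_vec_2 = torch.zeros(query.shape)  # accumulator with the same shape as one embedding
for i, x_i in enumerate(inputs):
    # Scale each embedding by its attention weight and add it to the running sum
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)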

tensor([0.4419, 0.6515, 0.5683])
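
The weighted sum can also be reproduced in plain Python from the table of contributions. This sketch uses the rounded weights printed earlier in the notebook, so the result agrees with the tensor output only to about three decimal places. The names `tokens`, `contributions`, and `share` are illustrative.

```python
tokens = ["Your", "journey", "starts", "with", "one", "step"]
inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]
# Rounded attention weights for the query "journey", copied from the output above
weights = [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581]

# One scaled embedding per token (the "Contribution" column of the table)
contributions = [[w * v for v in x] for w, x in zip(weights, inputs)]

# Sum the contributions component by component
context = [sum(col) for col in zip(*contributions)]
print("context vector:", [round(c, 4) for c in context])

# Combined share of the two most similar tokens
share = weights[tokens.index("journey")] + weights[tokens.index("starts")]
print("journey + starts share:", round(share, 4))
```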


## Compute All Attention Scores at Once

Instead of computing attention for just one query token, we can compute attention scores for **all tokens simultaneously**.

This creates a **6×6 attention matrix** where:
- Row `i` = the scores computed with token `i` as the query
- Column `j` = the scores every query assigns to token `j`

Each entry `[i, j]` is the dot product between token `i` and token `j`, i.e., how strongly token `i` attends to token `j` before normalization.

In [25]:
n_tokens = inputs.shape[0]
attn_scores = torch.empty(n_tokens, n_tokens)  # filled in below, one entry per (query, token) pair
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


## Matrix Multiplication Shortcut

The nested loop above is inefficient: it makes one Python-level `torch.dot` call per pair of tokens (36 calls here). We can compute all dot products at once using **matrix multiplication**:

```
attn_scores = inputs @ inputs.T
```

This is equivalent to: for each row `i` in `inputs`, compute dot product with every row `j` in `inputs`.

The `@` operator is matrix multiplication, and `.T` transposes the matrix (swaps rows and columns).

In [26]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])
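
To make the shapes concrete, the sketch below redoes `inputs @ inputs.T` in plain Python: a `[6×3]` matrix times its `[3×6]` transpose gives a `[6×6]` result. It also checks a property visible in the output above, namely that the score matrix is symmetric because the dot product does not depend on argument order. The helpers `transpose` and `matmul` are illustrative stand-ins for `.T` and `@`, not PyTorch functions.

```python
inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Entry [i][j] is row i of `a` dotted with column j of `b`."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

scores = matmul(inputs, transpose(inputs))
print(len(scores), "x", len(scores[0]))               # shape of the result
print("journey vs starts:", round(scores[1][2], 4))

n = len(scores)
is_symmetric = all(abs(scores[i][j] - scores[j][i]) < 1e-12
                   for i in range(n) for j in range(n))
print("symmetric:", is_symmetric)
```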


## Apply Softmax to All Rows

Apply softmax across each row (`dim=-1`) to convert scores to weights.

- `dim=-1` means "normalize along the last dimension", so for this 2-D tensor the entries within each row are normalized together
- Each row now sums to 1.0
- Row 2 (index 1) matches our earlier single-query calculation!

In [27]:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])
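
The choice of dimension matters: normalizing within each row makes rows sum to 1, but columns generally do not. The plain-Python sketch below shows this on a small hypothetical 2×3 score matrix (not the notebook's data). The function name `softmax_row` is illustrative.

```python
import math

def softmax_row(row):
    """Softmax over one list of scores, shifted by the max for stability."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

scores = [
    [1.0, 2.0, 3.0],  # hypothetical scores for query 0
    [1.0, 1.0, 1.0],  # hypothetical scores for query 1
]

# Row-wise normalization: one probability distribution per query
weights = [softmax_row(row) for row in scores]

row_sums = [sum(row) for row in weights]
col_sums = [sum(col) for col in zip(*weights)]
print("row sums:", [round(s, 4) for s in row_sums])
print("col sums:", [round(s, 4) for s in col_sums])
```

Each query gets its own distribution over tokens, so only the row sums are expected to equal 1.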


## Verify Row Sums

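Each row of attention weights should sum to 1.0, since each row is a probability distribution.

The cell below checks this two ways: by summing the rounded row 2 (index 1, "journey") values copied from the output above, and by summing every row of the tensor directly.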

In [28]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


## Compute All Context Vectors at Once

Now compute context vectors for **all tokens** using matrix multiplication:

```
all_context_vecs = attn_weights @ inputs
```

This multiplies the `[6×6]` attention weights matrix by the `[6×3]` inputs matrix, producing a `[6×3]` output — one context vector per token.

Each row `i` in the output is a weighted sum of all input embeddings, using the attention weights from row `i`.

In [29]:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


## Verify: Matrix Result Matches Loop Result

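The 2nd row (index 1) of `all_context_vecs` should match `context_vec_2`, which we computed earlier with the loop. Compare the vector printed below with the second row of the matrix output above.

This confirms that the matrix operations produce the same results as the step-by-step calculations (up to floating-point rounding), with far less Python-level looping.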

In [30]:
print("Previous 2nd context vector:", context_vec_2)

Previous 2nd context vector: tensor([0.4419, 0.6515, 0.5683])
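
Rather than comparing printed values by eye, the agreement can be checked programmatically. The sketch below reruns the whole pipeline in plain Python, once as a per-query loop and once as matrix products, and compares the two results for "journey" with a tolerance. All helper names (`dot`, `softmax`, `matmul`) are illustrative; in the notebook itself the same check could presumably be done with a tolerance-based comparison on the tensors.

```python
import math

inputs = [
    [0.43, 0.15, 0.89],  # Your
    [0.55, 0.87, 0.66],  # journey
    [0.57, 0.85, 0.64],  # starts
    [0.22, 0.58, 0.33],  # with
    [0.77, 0.25, 0.10],  # one
    [0.05, 0.80, 0.55],  # step
]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[dot(row, col) for col in zip(*b)] for row in a]

# Matrix route: scores -> row-wise softmax -> weights times inputs
scores = matmul(inputs, [list(col) for col in zip(*inputs)])
weights = [softmax(row) for row in scores]
all_context = matmul(weights, inputs)

# Loop route for the single query "journey" (index 1)
query = inputs[1]
query_weights = softmax([dot(x_i, query) for x_i in inputs])
loop_context = [0.0, 0.0, 0.0]
for w_i, x_i in zip(query_weights, inputs):
    for k, value in enumerate(x_i):
        loop_context[k] += w_i * value

match = all(math.isclose(a, b, rel_tol=1e-9)
            for a, b in zip(all_context[1], loop_context))
print("matrix row 1:", [round(v, 4) for v in all_context[1]])
print("loop result: ", [round(v, 4) for v in loop_context])
print("match:", match)
```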
