In [1]:
!pip install torch tiktoken transformers

Collecting tiktoken
  Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.12.0-cp312-cp312-manylinux_2_28_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.12.0


## Toy Example: Input Embeddings

**Important:** These embeddings are **HAND-CRAFTED** for demonstration purposes!

In real models:
- Embeddings start **random**
- Similarity is **learned** via backpropagation over millions of examples

Notice that "journey" `[0.55, 0.87, 0.66]` and "starts" `[0.57, 0.85, 0.64]` have very similar values. This is **intentional**: the values were chosen by hand to demonstrate how attention works. It is NOT because a model learned this similarity.

In [None]:
import torch
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
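
To make the "very similar values" claim concrete, here is a minimal sketch that compares the hand-crafted vectors with cosine similarity. Cosine similarity is not part of the attention computation in this notebook; it is used here only as a magnitude-independent check, and the variable names `cos_js` and `cos_jo` are illustrative.

```python
import torch
import torch.nn.functional as F

# Same hand-crafted embeddings as in the cell above
inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

journey, starts, one = inputs[1], inputs[2], inputs[4]

# Cosine similarity ignores vector length and compares direction only
cos_js = F.cosine_similarity(journey, starts, dim=0)
cos_jo = F.cosine_similarity(journey, one, dim=0)

print("journey vs starts:", cos_js)
print("journey vs one:   ", cos_jo)
```

"journey" and "starts" point in almost the same direction, while "journey" and "one" are noticeably further apart.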

## Select the Query Token

We select "journey" (index 1) as our **query**. The goal: compute how much "journey" should attend to each word in the sentence, including itself.

In [None]:
query = inputs[1]
query

## Allocate Memory for Attention Scores

`torch.empty()` creates an **uninitialized** tensor — it allocates memory without setting values.

The random-looking numbers (like `2.9910e-32`) are **garbage values**: whatever happened to be in that memory location, so they can differ from run to run. This is a **performance optimization**: since we're about to overwrite every value anyway, there's no point initializing to zeros first.

In [None]:
attn_scores_2 = torch.empty(inputs.shape[0])
attn_scores_2
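
As a small sketch of why the uninitialized values are harmless, the snippet below allocates a tensor with `torch.empty()` and then overwrites every element. The name `scratch` and the fill values are made up for illustration.

```python
import torch

# Allocate without initializing: contents are arbitrary until written
scratch = torch.empty(6)

# Overwrite every position, so the initial garbage never matters
for i in range(scratch.shape[0]):
    scratch[i] = float(i)

print(scratch)
```

If some positions might be read before they are written, `torch.zeros()` is the safer choice.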

## Compute Attention Scores with Dot Product

The **dot product** measures similarity between vectors:

```
A · B = (a1 × b1) + (a2 × b2) + (a3 × b3)
```

**Higher dot product = more similar vectors** (when the vectors have comparable magnitudes)

- When both vectors have large values in the same positions → big × big → large score
- When one is large where the other is small → big × small → small contribution

Results:

| Token | Score | Why? |
|-------|-------|------|
| journey | 1.4950 | Highest: the query compared with itself |
| starts | 1.4754 | Values very close to those of "journey" |
| one | 0.7070 | Lowest: least similar to "journey" |

**Note:** The dot product also depends on **magnitude**. A token with small values (like "with") can score lower against itself (0.4937) than against a larger-magnitude token such as "journey" (0.8434), so the highest score does not always belong to the query itself.
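
The magnitude effect can be checked directly. This is a minimal sketch using the embeddings for "with" and "journey" from the toy input; `self_score` and `cross_score` are illustrative names.

```python
import torch

with_vec = torch.tensor([0.22, 0.58, 0.33])  # "with"
journey = torch.tensor([0.55, 0.87, 0.66])   # "journey"

# Dot product of "with" with itself vs. with the larger-magnitude "journey"
self_score = torch.dot(with_vec, with_vec)
cross_score = torch.dot(with_vec, journey)

print("with · with:   ", self_score)
print("with · journey:", cross_score)
```

Here the self-score is the smaller of the two, which is why a raw dot product should not be read as a pure similarity measure.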

In [None]:
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)
print(attn_scores_2)
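
The loop above can also be written as a single matrix-vector product. The sketch below assumes the same toy `inputs` and checks that both approaches agree; `scores_loop` and `scores_matmul` are illustrative names.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]

# Loop version: one dot product per token
scores_loop = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    scores_loop[i] = torch.dot(x_i, query)

# Vectorized version: (6, 3) @ (3,) -> (6,)
scores_matmul = inputs @ query

print(torch.allclose(scores_loop, scores_matmul))
```

The loop is easier to read step by step; the matrix form is what you would typically use once the idea is clear.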

## Normalize Scores to Get Attention Weights

Raw scores need to be converted to **weights that sum to 1** (like probabilities).

### Simple Normalization (divide by sum)
This is a naive approach: just divide each score by the total. It only yields valid weights when all scores are non-negative and their sum is not zero.

In [14]:
attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())

Attention weights: tensor([0.1455, 0.2278, 0.2249, 0.1285, 0.1077, 0.1656])
Sum: tensor(1.0000)


### Softmax Normalization (better approach)

**Softmax** is preferred because:
1. It exponentiates values first (`e^x`), which amplifies differences between scores
2. All outputs are guaranteed positive, even when some scores are negative
3. It is smooth and differentiable, which suits gradient-based training

Formula: `softmax(x_i) = e^(x_i) / Σ e^(x_j)`

In [15]:
def softmax_naive(x):
    return torch.exp(x) / torch.exp(x).sum(dim=0)

attn_weights_2_naive = softmax_naive(attn_scores_2)
print("Attention weights:", attn_weights_2_naive)
print("Sum:", attn_weights_2_naive.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
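
To illustrate why positivity matters, here is a small sketch with made-up scores that include a negative value. These scores are hypothetical and do not come from the toy sentence above.

```python
import torch

# Hypothetical scores, chosen only to include a negative entry
scores = torch.tensor([2.0, -1.0, 0.5])

# Divide-by-sum keeps the sign, so one "weight" comes out negative
weights_sum_norm = scores / scores.sum()

# Softmax exponentiates first, so every weight is positive
weights_softmax = torch.exp(scores) / torch.exp(scores).sum(dim=0)

print("Divide by sum:", weights_sum_norm)
print("Softmax:      ", weights_softmax)
```

Both results sum to 1, but only the softmax output can be read as a set of proportions.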


### PyTorch's Built-in Softmax

`torch.softmax()` computes the same function but is **numerically stable**: it avoids the overflow that the naive version hits when `torch.exp` is applied to very large scores.

In [16]:
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)
print("Attention weights:", attn_weights_2)
print("Sum:", attn_weights_2.sum())

Attention weights: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Sum: tensor(1.)
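
The stability difference only shows up with large scores. The sketch below uses made-up large values to trigger overflow in the naive formula, and compares against a common stabilizing trick (subtracting the maximum before exponentiating). The trick is shown as an assumption about how stable softmax is usually done, not as a description of PyTorch internals.

```python
import torch

# Hypothetical large scores: exp(1000) overflows float32 to inf
big = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive formula: inf / inf produces nan
naive = torch.exp(big) / torch.exp(big).sum(dim=0)

# Built-in softmax stays finite
stable = torch.softmax(big, dim=0)

# Common trick: shift by the max, which leaves the result unchanged
shifted = torch.exp(big - big.max())
manual_stable = shifted / shifted.sum(dim=0)

print("Naive:        ", naive)
print("torch.softmax:", stable)
print("Max-shifted:  ", manual_stable)
```

For the small scores in this notebook the naive and built-in versions agree, which is why their outputs above are identical.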


## Compute the Context Vector

The **context vector** is the output of attention — a **weighted sum** of all input embeddings.

```
context_vec = Σ (attention_weight_i × input_i)
```

Each input embedding is scaled by its attention weight and then summed:

| Token | Weight | Contribution |
|-------|--------|--------------|
| Your | 0.1385 | `[0.43, 0.15, 0.89] × 0.1385` |
| journey | 0.2379 | `[0.55, 0.87, 0.66] × 0.2379` |
| starts | 0.2333 | `[0.57, 0.85, 0.64] × 0.2333` |
| with | 0.1240 | `[0.22, 0.58, 0.33] × 0.1240` |
| one | 0.1082 | `[0.77, 0.25, 0.10] × 0.1082` |
| step | 0.1581 | `[0.05, 0.80, 0.55] × 0.1581` |

**Result:** A new representation of "journey" that **blends information from all words**, but is **biased toward similar words** (journey and starts contribute ~47% together).

This is the core idea of attention: creating context-aware representations!

In [18]:
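query = inputs[1]  # "journey" is the query token
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    # Scale each input embedding by its attention weight and accumulate
    context_vec_2 += attn_weights_2[i] * x_i
print(context_vec_2)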

tensor([0.4419, 0.6515, 0.5683])
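
The weighted sum can also be expressed as one vector-matrix product. This is a minimal sketch that recomputes the weights for "journey" from the toy inputs and checks that the loop and the matrix form give the same context vector; `context_loop` and `context_matmul` are illustrative names.

```python
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)
query = inputs[1]

# Scores and softmax weights for the "journey" query
attn_weights = torch.softmax(inputs @ query, dim=0)

# Loop version: accumulate weight_i * input_i
context_loop = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_loop += attn_weights[i] * x_i

# Vectorized version: (6,) @ (6, 3) -> (3,)
context_matmul = attn_weights @ inputs

print(context_matmul)
print(torch.allclose(context_loop, context_matmul))
```

Both forms should reproduce the context vector printed above, up to floating-point rounding.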
