References:
- [Let's build GPT: from scratch, in code, spelled out - Andrej Karpathy](https://youtu.be/kCc8FmEb1nY?si=eJvE37IA6l754PY_)
- [Google Colab notebook](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing#scrollTo=Yw1LKNCgwjj1)
- [GitHub repo](https://github.com/karpathy/ng-video-lecture)

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm import tqdm
import gc

print("imports done!")

imports done!


In [3]:
# hyperparameters
batch_size = 64       # how many independent sequences will we process in parallel?
block_size = 256      # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

torch.manual_seed(1337)

print("hyperparameters set!")

hyperparameters set!


In [4]:
with open('../tinyshakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print("file read!")

file read!


In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split, device):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss(model, device):
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


print("basic utility functions defined!")

basic utility functions defined!


**Causal Constraint in Self-Attention** is a core component of the **Transformer Decoder** architecture, particularly in auto-regressive models like **GPT** (Generative Pre-trained Transformer). This is often referred to as **Masked Self-Attention**.

---

### **Causal Constraint in Transformer Self-Attention**

The purpose of this mechanism is to ensure that information flow is **uni-directional** (causal) during the generation or prediction of a sequence, which is essential for language models like GPT that predict the next token based on the previous ones.

#### **1. The Necessity of Causal Flow**

- **Prediction Context:** When a model is "about to try to predict the future," it must only use information from the past.
- **Uni-directional Communication:** Any given token in the sequence (e.g., the fifth token) should only be able to **communicate with tokens at previous locations** (the fourth, third, second, and first) and **not** with future tokens (the sixth, seventh, and eighth).
- **Preventing Cheating:** Accessing "future tokens in the sequence" would give away the answer and invalidate the training goal of predicting what comes next.
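
The causal property in the list above can be checked numerically. The following is a minimal sketch, assuming a uniform causal average as the aggregation rule (the helper name `causal_average` and the toy sizes are illustrative, not part of the lecture code): if the future tokens are overwritten, the outputs at earlier positions should not move.

```python
import torch

torch.manual_seed(0)
T, C = 8, 4                       # toy sequence length and channel count
x = torch.randn(T, C)

def causal_average(x):
    """Average each position with all earlier positions (uniform causal weights)."""
    T = x.shape[0]
    weights = torch.tril(torch.ones(T, T))
    weights = weights / weights.sum(dim=1, keepdim=True)   # each row sums to 1
    return weights @ x

out_original = causal_average(x)

# Overwrite the "future" (positions 5, 6, 7) with different values.
x_changed = x.clone()
x_changed[5:] = torch.randn(3, C)
out_changed = causal_average(x_changed)

# Positions 0..4 never look at positions 5..7, so their outputs are unchanged.
past_unchanged = torch.allclose(out_original[:5], out_changed[:5])
# Positions 5..7 do depend on the overwritten values.
future_changed = not torch.allclose(out_original[5:], out_changed[5:])
print(past_unchanged, future_changed)
```

The same check works for any causal aggregation, including the learned attention weights introduced later.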

#### **2. The Simplified Implementation (Cumulative Average)**

Here's a basic, conceptual way to achieve historical context aggregation:
- **Simple Aggregation:** The "easiest way" for a token to incorporate information from its past is to **average the vectors (channels) of the current token and all preceding tokens**.
- **Context Vector:** This results in a "feature vector that summarizes me in the context of my history".
- **Weak Interaction:** This simple sum/average is a "weak form of interaction" because it's "extremely lossy," losing information about the relative positioning ("spatial arrangements") of the tokens, a limitation that is addressed later in the full self-attention mechanism.
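
To make the "extremely lossy" point concrete, here is a small sketch (the tensor names and the particular permutation are hypothetical): reordering the history leaves the average untouched, so the summary cannot tell the two orderings apart.

```python
import torch

torch.manual_seed(0)
C = 2
history = torch.randn(5, C)             # five token vectors, oldest first

perm = torch.tensor([4, 2, 0, 3, 1])    # an arbitrary reordering of the same tokens
shuffled = history[perm]

mean_original = history.mean(dim=0)
mean_shuffled = shuffled.mean(dim=0)

# Same set of vectors, different order, same summary (up to float rounding).
same_summary = torch.allclose(mean_original, mean_shuffled, atol=1e-6)
print(same_summary)
```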

---

### **Below are 4 ways of incorporating historical context aggregation into the current token**

#### **1. For Loop Implementation**

In [8]:
# consider the following toy example:

torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [11]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [12]:
# We want xbow[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))  # bow stands for "bag of words"
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1]    # (t+1, C): the current token and everything before it
        xbow[b,t] = torch.mean(xprev, 0)


In [14]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

#### **2. Matrix Multiplication for Weighted Aggregation**

The first challenge is how to calculate a running average (or sum) of all preceding tokens efficiently, without using slow Python `for` loops, as done in the above step (for loop implementation). The solution leverages the power of **matrix multiplication** to achieve **vectorized causal aggregation**.

##### **The Lower Triangular Matrix**

Any cumulative sum or average can be expressed as a matrix multiplication between the data and a special weight matrix.

1.  **Creation of the Matrix (A):** A $T \times T$ matrix is created, where $T$ is the sequence length. It is a **Lower Triangular Matrix**: every entry above the main diagonal is zero, and here every entry on or below the diagonal is 1 (or, for averaging, a positive weight).
2.  **Matrix Multiplication:** The weight matrix $\mathbf{A}$ is multiplied by the input token embeddings $\mathbf{X}$. The output is a new matrix $\mathbf{C}$ containing the aggregated information.
    $$\mathbf{C} = \mathbf{A} \cdot \mathbf{X}$$
3.  **Result:** When calculating the $i$-th row of $\mathbf{C}$, the $i$-th row of $\mathbf{A}$ is dot-producted with every column of $\mathbf{X}$. Because the first $i$ entries of $\mathbf{A}$'s $i$-th row are non-zero (1s) and the rest are zero, the operation only sums up (or averages) the first $i$ rows of $\mathbf{X}$ (the current token and all previous tokens). Used this way, $\mathbf{A}$ acts as a **Causal Mask**.

In [None]:
torch.tril(torch.ones(3,3))   # lower triangular matrix

tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In [17]:
torch.tril(torch.arange(1,10).reshape(3,3))   # lower triangular matrix

tensor([[1, 0, 0],
        [4, 5, 0],
        [7, 8, 9]])

In [None]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)   # normalize rows so that they sum to 1 (used for averaging)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


In [18]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (T, T) @ (B, T, C): wei is broadcast to (B, T, T) ----> (B, T, C)

xbow2[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [19]:
# check if xbow and xbow2 are the same
torch.allclose(xbow, xbow2)

True

#### **3. Softmax Function**

To turn the aggregation into a proper, normalized attention mechanism, the weight matrix $\mathbf{A}$ is produced using the **Softmax** function:

1.  **Masking:** A lower triangular matrix of ones and zeros (the causal mask) is used to fill the upper triangular part of the *unnormalized attention scores* with **negative infinity**.
2.  **Softmax:** Applying the **Softmax** function to these scores along the row dimension:
    * The negative infinities become **zero** after exponentiation, ensuring future tokens are ignored.
    * The non-zero entries (representing the past and current token) are normalized, ensuring the weights for each row sum to **one**, making it a true weighted average.

In [None]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))

# set the entries of wei to -inf wherever attention is not allowed (future positions)
# softmax then gives those positions weight 0, because exp(-inf) = 0
wei = wei.masked_fill(tril == 0, float('-inf'))
wei

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])

In [25]:
wei = F.softmax(wei, dim=-1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [23]:
xbow3 = wei @ x

xbow3[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])

In [21]:
# check if xbow and xbow3 are the same
torch.allclose(xbow, xbow3)

True

#### **4. Self-Attention Mechanism: The Data-Dependent Aggregation**

While the matrix trick solved the *efficiency* problem, the simple cumulative average still suffers from being **uniform**: it treats all past tokens as equally important. **Self-Attention** solves this by making the aggregation **data-dependent**.

The mechanism relies on every token generating three distinct vectors:

| Vector | Analogy | Purpose |
| :--- | :--- | :--- |
| **Query (Q)** | **"What am I looking for?"** | Generated by the current token, it is used to interrogate all other tokens. |
| **Key (K)** | **"What do I contain?"** | Generated by every token in the context (the current token and those before it), it is a descriptor of that token's content, used to align with the Query. |
| **Value (V)** | **"What information will I share?"** | Generated by every token in the context, it holds the actual information that is aggregated by the tokens attending to it. |

### **The Core Attention Calculation**

1.  **Affinity Calculation (Attention Scores):** The attention scores (affinities) between tokens are calculated by taking the **dot product** of the **Query (Q)** of the current token with the **Key (K)** of every token in the context, including the current token itself.
    * A high dot product (high score) means the Query found a relevant Key, indicating a strong *affinity* or *interaction strength*.
2.  **Scaling and Masking:** The attention scores are scaled by dividing by the square root of the head size ($\sqrt{d_k}$) to prevent the Softmax function from becoming too "peaky" (concentrating almost all weight on a single token). The causal mask (lower triangular) is then applied so that future tokens are ignored.
3.  **Weighted Aggregation:** The masked, scaled scores are passed through **Softmax** to create the final weights, which are then multiplied by the **Value (V)** vectors. The results are summed up to form the output vector for the current token.

This final calculation is the **Scaled Dot-Product Attention**:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \text{Mask}\right) \mathbf{V}$$
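
The formula can be written as one small function. This is a sketch rather than the lecture's code: the function name `causal_attention` and the toy shapes are assumptions. It also uses `masked_fill` instead of adding a mask, which gives the same result when the additive mask is 0 on allowed positions and $-\infty$ on future ones.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """Scaled dot-product attention with a causal mask.

    q, k, v: tensors of shape (B, T, d_k).
    Returns (output, weights) with shapes (B, T, d_k) and (B, T, T).
    """
    T, d_k = q.shape[-2], q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)         # (B, T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))     # True where attention is allowed
    scores = scores.masked_fill(~mask, float('-inf'))         # hide future positions
    weights = F.softmax(scores, dim=-1)                       # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
B, T, d_k = 2, 5, 16
q = torch.randn(B, T, d_k)
k = torch.randn(B, T, d_k)
v = torch.randn(B, T, d_k)
out, weights = causal_attention(q, k, v)
print(out.shape, weights.shape)
```

Unlike the version 4 cell in this notebook, this sketch includes the division by $\sqrt{d_k}$.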

In [None]:
"""
We've seen the below code in version 3 above, and
the problem with this is that it assigns equal "importance"
to all the previous tokens.
This is not desirable because we want to assign more importance
to some tokens and less to others.
"""
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x
out.shape

torch.Size([4, 8, 2])

In [None]:
# equal "normalized" importance to all previous tokens (normalized as the importances sum up to 1)
wei

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])

In [None]:
"""
Self-attention solves the above problem like so:
Every token emits 3 vectors: a key, a query, and a value vector.

query: what i'm interested in
key: what i have
value: what i can give you

The dot product of the query vector of the current token with 
the key vectors of every other token determines the attention weights.


"""

# version 4: self-attention
torch.manual_seed(1337)
B, T, C = 4, 8, 32    # batch, time, channels
x = torch.randn(B, T, C)

# let's see a single head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
v = value(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1)    # (B, T, 16) @ (B, 16, T)  --->  (B, T, T)
# k.transpose(-2,-1) means transpose the last 2 dimensions

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
# out = wei @ x
out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [35]:
# the weight matrix below assigns different importances to the previous tokens,
# unlike the previous 3 versions, which gave every previous token the same importance
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

In [36]:
out[0]

tensor([[-0.1571,  0.8801,  0.1615, -0.7824, -0.1429,  0.7468,  0.1007, -0.5239,
         -0.8873,  0.1907,  0.1762, -0.5943, -0.4812, -0.4860,  0.2862,  0.5710],
        [ 0.6764, -0.5477, -0.2478,  0.3143, -0.1280, -0.2952, -0.4296, -0.1089,
         -0.0493,  0.7268,  0.7130, -0.1164,  0.3266,  0.3431, -0.0710,  1.2716],
        [ 0.4823, -0.1069, -0.4055,  0.1770,  0.1581, -0.1697,  0.0162,  0.0215,
         -0.2490, -0.3773,  0.2787,  0.1629, -0.2895, -0.0676, -0.1416,  1.2194],
        [ 0.1971,  0.2856, -0.1303, -0.2655,  0.0668,  0.1954,  0.0281, -0.2451,
         -0.4647,  0.0693,  0.1528, -0.2032, -0.2479, -0.1621,  0.1947,  0.7678],
        [ 0.2510,  0.7346,  0.5939,  0.2516,  0.2606,  0.7582,  0.5595,  0.3539,
         -0.5934, -1.0807, -0.3111, -0.2781, -0.9054,  0.1318, -0.1382,  0.6371],
        [ 0.3428,  0.4960,  0.4725,  0.3028,  0.1844,  0.5814,  0.3824,  0.2952,
         -0.4897, -0.7705, -0.1172, -0.2541, -0.6892,  0.1979, -0.1513,  0.7666],
        [ 0.1866, -0.0

Notes:
- Attention is a **communication mechanism**. It can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
- There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
- Each example along the batch dimension is processed completely independently; examples never "talk" to each other.
- In an "encoder" attention block just delete the single line that does masking with `tril`, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
- "self-attention" just means that the keys and values are produced from the same source as queries. In "cross-attention", the queries still get produced from x, but the keys and values come from some other, external source (e.g. an encoder module)
- "Scaled" attention additionally multiplies `wei` by 1/sqrt(head_size). When the inputs Q and K have unit variance, `wei` then has unit variance too, so Softmax stays diffuse and does not saturate too much.
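
The effect of the scaling factor can be illustrated with random stand-ins for the queries and keys. This is a sketch under the assumption that Q and K are unit-variance Gaussian; the example logits and the factor of 8 are arbitrary choices for demonstration.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 4, 8, 16
q = torch.randn(B, T, head_size)   # unit-variance stand-ins for real queries
k = torch.randn(B, T, head_size)   # unit-variance stand-ins for real keys

wei_unscaled = q @ k.transpose(-2, -1)            # variance grows with head_size
wei_scaled = wei_unscaled * head_size ** -0.5     # variance brought back to roughly 1

print(wei_unscaled.var().item())
print(wei_scaled.var().item())

# Softmax on small logits is diffuse; the same logits multiplied by 8 are much peakier.
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = torch.softmax(logits, dim=-1)
peaky = torch.softmax(logits * 8, dim=-1)
print(diffuse.max().item(), peaky.max().item())
```

Without the scaling, large scores at initialization would push Softmax toward a near one-hot distribution, so each token would effectively aggregate from a single other token.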

In [30]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1)    # (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# k.transpose(-2,-1) means transpose the last 2 dimensions

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [None]:


v = value(x)
out = wei @ v
#out = wei @ x

out.shape

### **The Full Transformer Decoder Block**

The full GPT-style decoder block is built by combining this self-attention mechanism with other stabilizing and computational layers.

#### **Multi-Head Attention (MHA)**

**Multi-Head Attention** improves the model's ability to focus on different types of information simultaneously.

* Instead of running one large attention calculation (one "Head"), MHA runs **multiple smaller attention heads in parallel**.
* Each head learns a different set of Q, K, and V transformations, allowing it to focus on a unique relationship (e.g., one head finds nouns, another finds preceding punctuation).
* The output from all parallel heads is then **concatenated** back together.
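
A minimal sketch of the idea follows. The class names `Head` and `MultiHeadAttention`, the toy sizes, and the final linear projection `proj` are assumptions for illustration; the projection is not described in the list above but is a common companion to the concatenation step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """One head of causal self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower-triangular mask; a buffer moves with the module but is not trained.
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)     # each (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5      # (B, T, T), scaled
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        wei = F.softmax(wei, dim=-1)
        return wei @ v                                           # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    """Several heads run independently; their outputs are concatenated."""

    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_head * head_size, n_embd)  # mixes the concatenated heads

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_head * head_size)
        return self.proj(out)

torch.manual_seed(0)
mha = MultiHeadAttention(n_embd=32, n_head=4, block_size=8)
x = torch.randn(4, 8, 32)
y = mha(x)
print(y.shape)
```

With the hyperparameters set at the top of this notebook (`n_embd = 384`, `n_head = 6`), each head would have size 384 // 6 = 64.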

#### **Position-wise Feed Forward Network (FFN)**

After the communication step (MHA) is complete, each token needs to process the information it has just gathered, on its own. The FFN is a simple, two-layer **Multi-Layer Perceptron (MLP)** applied **independently and identically** to every token's vector.
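
The sketch below shows what "independently and identically" means in practice. The class name `FeedForward`, the 4x hidden expansion, and the ReLU activation are assumptions (common choices, not fixed by the text above).

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Two linear layers with a non-linearity, applied to each token separately."""

    def __init__(self, n_embd, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand (the factor 4 is an assumption)
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # project back to the embedding size
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
ffn = FeedForward(n_embd=32)
ffn.eval()                       # disable dropout so the comparison is deterministic
x = torch.randn(4, 8, 32)        # (B, T, C)
y = ffn(x)

# Position-wise: feeding one token alone gives the same result as feeding it in the batch.
single = ffn(x[1, 3])
same = torch.allclose(y[1, 3], single, atol=1e-6)
print(y.shape, same)
```

Because `nn.Linear` only acts on the last dimension, no information moves between time steps here; all mixing across tokens happens in the attention layer.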

#### **Key Optimizations for Deep Networks**

Two crucial techniques make it possible to stack multiple blocks (layers) of MHA and FFN to form a deep Transformer:

1.  **Residual Connections (Skip Connections):** The input ($\mathbf{X}$) of each sub-layer (MHA or FFN) is added directly to that sub-layer's output. Because addition passes gradients through unchanged, this creates a path along which gradients flow unimpeded from the output back to the input layers during training, which mitigates vanishing gradients in deep networks.
    $$\mathbf{X}_{\text{out}} = \mathbf{X}_{\text{in}} + \text{SubLayer}(\mathbf{X}_{\text{in}})$$
2.  **Layer Normalization:** This technique stabilizes the activations within the network. It calculates the mean and variance of the features **across the channels of a single token** (i.e., each row of the $T \times C$ input is normalized on its own) and normalizes them, so that inputs to sub-layers are consistently distributed. In the modern "pre-Norm" architecture, it is applied **before** the MHA and FFN sub-layers.
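
Both techniques fit in a few lines. In this sketch the wrapper name `PreNormResidual` is hypothetical and `nn.Linear` stands in for a real MHA or FFN sub-layer; it shows per-token normalization and the pre-norm residual pattern $\mathbf{X} + \text{SubLayer}(\text{LayerNorm}(\mathbf{X}))$.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Computes x + sublayer(LayerNorm(x)), the pre-norm residual pattern."""

    def __init__(self, n_embd, sublayer):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.ln(x))

torch.manual_seed(0)
n_embd = 32
x = torch.randn(4, 8, n_embd) * 3.0 + 5.0    # deliberately shifted and scaled input

# LayerNorm normalizes each token's channel vector on its own.
ln = nn.LayerNorm(n_embd)
normed = ln(x)
per_token_mean = normed.mean(dim=-1)                  # close to 0 for every token
per_token_var = normed.var(dim=-1, unbiased=False)    # close to 1 for every token
print(per_token_mean.abs().max().item(), per_token_var.mean().item())

# nn.Linear is only a stand-in for an attention or feed-forward sub-layer.
block = PreNormResidual(n_embd, nn.Linear(n_embd, n_embd))
y = block(x)
print(y.shape)
```

If the sub-layer outputs zeros, the block reduces to the identity, which is why a deep stack of such blocks stays easy to optimize early in training.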

#### **Decoder-Only Architecture (GPT)**

The architecture implemented in the video is a **Decoder-Only Transformer**, as used in GPT. This means it only has the blocks that enforce the **causal (masked)** communication. It lacks the separate **Encoder** block and the **Cross-Attention** layers that would be used in sequence-to-sequence tasks like translation, where the model needs to condition its output on an entirely separate source of information (like a French sentence).