# Building GPT From Scratch

## 1. Data Loading and Preparation

### Reading the Text Data

In [24]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [1]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

### Inspecting the Data Size

In [2]:
print("Length of dataset in characters: ", len(text))

Length of dataset in characters:  1115394


### Previewing the Dataset

In [5]:
print(text[:100])  # print the first 100 characters

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


## 2. Tokenization

### Creating the Vocabulary
A neural network cannot operate on raw characters; it needs numbers. Our first step is therefore to build a "vocabulary" of all the unique characters present in the text. We also compute `vocab_size`, the number of unique characters, which determines the size of the model's embedding and output layers.

In [7]:
chars = sorted(list(set(text)))  # create a sorted list of unique characters
vocab_size = len(chars)
print("All the unique characters in the dataset are as follows: ", ''.join(chars))

All the unique characters in the dataset are as follows:  
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [9]:
print("Vocabulary size:", vocab_size)

Vocabulary size: 65


### Creating Encoder and Decoder

Now we create a mapping (a simple character-level "tokenizer") that converts characters to integers and back.
- **`stoi` (string-to-integer):** A dictionary that maps each unique character to a unique integer.
- **`itos` (integer-to-string):** A dictionary that does the reverse, mapping integers back to characters.
- **`encode`:** A function that takes a string and converts it into a list of integers (tokens) using `stoi`.
- **`decode`:** A function that takes a list of integers and converts it back into a string using `itos`.

In [10]:
# create a mapping from characters to integers and vice versa
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

In [11]:
print(encode("hello world")) # encode the string "hello world" to integers
print(decode(encode("hello world"))) # decode back to string

[46, 43, 50, 50, 53, 1, 61, 53, 56, 50, 42]
hello world
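
One property worth checking is that `decode(encode(s))` returns `s` for any string built from the vocabulary, and that a character outside the vocabulary fails loudly. The sketch below rebuilds the same kind of mapping on a tiny made-up corpus (`toy_text` is an assumption for illustration, not the Shakespeare file) so that it runs on its own.

```python
# Toy corpus standing in for input.txt (an assumption for illustration).
toy_text = "hello world"

chars = sorted(set(toy_text))                  # unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> character


def encode(s):
    """Map a string to a list of integer tokens."""
    return [stoi[c] for c in s]


def decode(tokens):
    """Map a list of integer tokens back to a string."""
    return "".join(itos[i] for i in tokens)


round_trip = decode(encode("hello"))
print(round_trip)

# A character that never appeared in the corpus has no entry in stoi,
# so encoding it raises KeyError instead of silently dropping it.
try:
    encode("z")
    unseen_ok = True
except KeyError:
    unseen_ok = False
print(unseen_ok)
```

With a character-level vocabulary built from the full dataset this failure mode only matters for prompts containing symbols the training text never used.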


### Encoding the Entire Dataset

In [13]:
data = torch.tensor(encode(text), dtype=torch.long)  # encode the entire text dataset and store it in a torch tensor
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


## 3. Data Splitting and Batching

### Train and Validation Split


In [14]:
n = int(0.9*len(data))  # first 90% will be train, rest validation
train_data = data[:n]
val_data = data[n:]
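
As a quick sanity check on the split, the sizes can be worked out without loading the data. This sketch assumes only the dataset length printed earlier (1,115,394 characters); the variable names are illustrative.

```python
total = 1115394            # dataset length in characters, from the output above
n = int(0.9 * total)       # same cut point as the cell above
train_len = n              # tokens used for training
val_len = total - n        # tokens held out for validation
print(train_len, val_len)
```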

### Understanding Context and Targets
A language model works by predicting the next token in a sequence given the preceding context. The maximum length of this context is called the **block size** or **context length**.<br>
Each training chunk is split into two pieces:
- The input `x` is a chunk of text of length `block_size`.
- The target `y` is the same chunk shifted by one position, so `y[t]` is the token that follows `x[t]` in the text.

For every position `t` in the sequence, the model learns to predict the target `y[t]` from the context `x[0]` through `x[t]`.

In [15]:
block_size = 8  # context length for predictions
train_data[:block_size+1]  # a chunk of text of length block_size + 1

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [16]:
x = train_data[:block_size]  # input
y = train_data[1:block_size+1]  # target
for t in range(block_size):
    context = x[:t+1]  # the context is everything up to and including position t
    target = y[t]  # the target is the next character we want to predict
    print(f"When input is {context} the target: {target}")

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58
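
The loop above shows that a single chunk of `block_size + 1` tokens packs `block_size` training examples. Here is a plain-Python sketch of the same idea, using the token values printed above; the helper name `context_target_pairs` is made up for illustration.

```python
def context_target_pairs(chunk):
    """Return every (context, target) pair contained in one chunk.

    A chunk of length block_size + 1 yields block_size pairs.
    """
    x, y = chunk[:-1], chunk[1:]          # inputs, and targets shifted by one
    return [(x[: t + 1], y[t]) for t in range(len(x))]


chunk = [18, 47, 56, 57, 58, 1, 15, 47, 58]   # first block_size + 1 tokens
pairs = context_target_pairs(chunk)
for context, target in pairs:
    print(context, "->", target)
```

Training on all of these prefixes at once is what lets the model cope with contexts as short as one token at generation time.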


### Creating a Batch Generation Function
Training on the entire dataset at once is computationally expensive. Instead we train on small random chunks of data called minibatches.
The `get_batch` function does the following:
1. Selects `batch_size` (e.g., 4) random starting points in the dataset (`train_data` or `val_data`).
2. For each starting point, it grabs a sequence of length `block_size` for the input (`x`).
3. It grabs the corresponding sequence of length `block_size`, shifted by one, for the target (`y`).
4. It stacks these individual sequences into two tensors, `x` and `y`, of shape `(batch_size, block_size)`; the caller receives them as `xb` and `yb`.

This provides a batch of independent examples that we can process in parallel, which is efficient on modern hardware like GPUs.

In [18]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel
block_size = 8 # what is the maximum context length for predictions

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # pick random starting indices for the batch
    x = torch.stack([data[i:i+block_size] for i in ix])  # gather the input sequences for each starting index
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # gather the target sequences (shifted by one) for each starting index
    return x, y


xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
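
A quick way to confirm the batch is built correctly is to check that every target row is its input row shifted by one position. The sketch copies the values printed above into plain lists so that it runs without PyTorch.

```python
xb = [
    [24, 43, 58, 5, 57, 1, 46, 43],
    [44, 53, 56, 1, 58, 46, 39, 58],
    [52, 58, 1, 58, 46, 39, 58, 1],
    [25, 17, 27, 10, 0, 21, 1, 54],
]
yb = [
    [43, 58, 5, 57, 1, 46, 43, 39],
    [53, 56, 1, 58, 46, 39, 58, 1],
    [58, 1, 58, 46, 39, 58, 1, 46],
    [17, 27, 10, 0, 21, 1, 54, 39],
]

# Targets overlap inputs everywhere except the final position of each row.
shifted = all(y[:-1] == x[1:] for x, y in zip(xb, yb))
print(shifted)

# Each row holds block_size examples, so a batch carries B * T of them.
num_examples = len(xb) * len(xb[0])
print(num_examples)
```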


In [20]:
for b in range(batch_size):  # batch dimension
    for t in range(block_size):  # time dimension
        context = xb[b, :t+1]  # the context is everything up to and including position t
        target = yb[b, t]  # the target is the next character we want to predict
        print(f"When input is {context.tolist()} the target: {target}")

When input is [24] the target: 43
When input is [24, 43] the target: 58
When input is [24, 43, 58] the target: 5
When input is [24, 43, 58, 5] the target: 57
When input is [24, 43, 58, 5, 57] the target: 1
When input is [24, 43, 58, 5, 57, 1] the target: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
When input is [44] the target: 53
When input is [44, 53] the target: 56
When input is [44, 53, 56] the target: 1
When input is [44, 53, 56, 1] the target: 58
When input is [44, 53, 56, 1, 58] the target: 46
When input is [44, 53, 56, 1, 58, 46] the target: 39
When input is [44, 53, 56, 1, 58, 46, 39] the target: 58
When input is [44, 53, 56, 1, 58, 46, 39, 58] the target: 1
When input is [52] the target: 58
When input is [52, 58] the target: 1
When input is [52, 58, 1] the target: 58
When input is [52, 58, 1, 58] the target: 46
When input is [52, 58, 1, 58, 46] the target: 39
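When input is [52, 58, 1, 58, 46, 39] the target: 58
When input is [52, 58, 1, 58, 46, 39, 58] the target: 1
When input is [52, 58, 1, 58, 46, 39, 58, 1] the target: 46
When input is [25] the target: 17
When input is [25, 17] the target: 27
When input is [25, 17, 27] the target: 10
When input is [25, 17, 27, 10] the target: 0
When input is [25, 17, 27, 10, 0] the target: 21
When input is [25, 17, 27, 10, 0, 21] the target: 1
When input is [25, 17, 27, 10, 0, 21, 1] the target: 54
When input is [25, 17, 27, 10, 0, 21, 1, 54] the target: 39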

In [23]:
print(xb)  # print the input batch

tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])


## 4. A Simple Bigram Model

### Creating A Bigram Model
This model predicts the next token using only the single immediately preceding token. It has no memory of older tokens. It serves as a baseline against which we can measure the Transformer's improvement.

The model is implemented as a PyTorch `nn.Module`.
- `nn.Embedding(vocab_size, vocab_size)`: This is the core of the model. It's a simple lookup table where, for each of the `vocab_size` possible input tokens, it stores a `vocab_size`-dimensional vector of logits (raw scores) for the next token.
- `forward` **pass:**
    - It takes the input batch `idx` and looks up the logits from the embedding table.
    - If `targets` are provided, it flattens the logits to `(B*T, C)` and the targets to `(B*T)` and calculates the cross-entropy loss, the standard loss function for classification tasks. It measures how well the predicted logits correspond to the actual target tokens. Without targets, the loss is `None`.
- `generate` **method:**
    - This method produces new text autoregressively.
    - It takes the current context `idx`, gets the logits for the next token from the model, applies softmax to convert logits to probabilities, and then samples the next token from this probability distribution using `torch.multinomial`.
    - This new token is then appended to the context and the process repeats.

In [25]:
torch.manual_seed(1337)
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape  # batch size, time steps, vocab size (channels)
            logits = logits.view(B*T, C)  # reshape to (B*T, C) for cross-entropy loss
            targets = targets.view(B*T)  # reshape to (B*T) for cross-entropy loss
            loss = F.cross_entropy(logits, targets)  # compute the cross-entropy loss
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


m = BigramLanguageModel(vocab_size)  # instantiate the model
logits, loss = m(xb, yb)  # forward pass
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)
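
For reference, a model that assigns equal probability to all 65 characters would score a cross-entropy of ln(65) ≈ 4.17, so an initial loss of about 4.88 suggests the randomly initialised logits are somewhat worse than uniform. The sketch below computes cross-entropy by hand for a single position in plain Python; the example logits are made up, and `cross_entropy` here is a local helper rather than the PyTorch function.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)                                   # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


vocab_size = 65
uniform_loss = cross_entropy([0.0] * vocab_size, target=3)
print(round(uniform_loss, 4))      # equals ln(65)

# Made-up logits that favour the correct class give a lower loss.
confident = [0.0] * vocab_size
confident[3] = 5.0
confident_loss = cross_entropy(confident, target=3)
print(confident_loss < uniform_loss)
```

`F.cross_entropy` averages this quantity over all `B*T` positions by default, which is why the logits and targets are flattened first.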


In [27]:
print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


pJ:Bpm&yiltNCjeO3:Cx&vvMYW-txjuAd IRFbTpJ$zkZelxZtTlHNzdXXUiQQY:qFINTOBNLI,&oTigq z.c:Cq,SDXzetn3XVj
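
The sampling step inside `generate` can be sketched without PyTorch: softmax turns the last position's logits into probabilities, and one index is drawn in proportion to them. Here `random.choices` stands in for `torch.multinomial`, and the logits are invented for the example, so this is an illustration of the idea rather than the model's actual behaviour.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, rng):
    """Draw one token index in proportion to softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(1337)
logits = [2.0, 0.5, -1.0, 0.0]      # made-up scores for a 4-token vocabulary
context = [0]                       # start from token 0, as in the cell above
for _ in range(5):
    # A bigram model would look up logits for context[-1]; this toy reuses
    # the same logits at every step to keep the example small.
    context.append(sample_next(logits, rng))
print(context)
```

Sampling, rather than always taking the most likely token, is what makes repeated calls produce different text. The untrained model's output above is gibberish because its logits are still random.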


### Training the Bigram Model
Now we train our simple model using the **AdamW optimizer**, a variant of Adam with decoupled weight decay and a standard, effective choice.
The training loop is simple:
1. Get a batch of data.
2. Run the forward pass to get the logits and loss.
3. Reset the gradients from the previous step (`optimizer.zero_grad()`).
4. Perform the backward pass (`loss.backward()`) to compute the gradients for all model parameters.
5. Update the parameters using the optimizer (`optimizer.step()`).

We repeat this for a number of steps. The loss should gradually decrease as the model learns the bigram statistics from the data.

In [29]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)  # create an AdamW optimizer

In [30]:
batch_size = 32
for steps in range(100):  # increase number of steps for good results

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)  # reset the gradients
    loss.backward()  # backward pass to compute gradients
    optimizer.step()  # update the parameters
print(loss.item())

4.554598808288574
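
It can help to see what "bigram statistics" means concretely. For a lookup-table model trained with cross-entropy, the best achievable next-character probabilities are roughly the normalised bigram counts from the data (ignoring the small weight decay that AdamW applies). The sketch counts them directly on a toy string; `toy_text` and `bigram_probs` are illustrative names, not part of the notebook.

```python
from collections import Counter, defaultdict


def bigram_probs(text):
    """Estimate P(next char | current char) from raw counts."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for cur, followers in counts.items()
    }


toy_text = "abababac"               # made-up corpus
probs = bigram_probs(toy_text)
print(probs["a"])                   # 'a' is followed by 'b' three times, 'c' once
print(probs["b"])                   # 'b' is always followed by 'a'
```

Gradient descent reaches a similar table gradually, which is why 100 steps only move the loss a little.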


In [31]:
print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


wBM;o-opr$mOiPJEYD-CfigkzD3p3?zvS;ADz;.y?o,ivCuC'zqHxcVT cHA
rT'Fd,SBMZyOslg!NXeF$sBe,juUzLq?w-wzP-h
ERjjxlgJzPbHxf$ q,q,KCDCU fqBOQT
SV&CW:xSVwZv'DG'NSPypDhKStKzC -$hslxIVzoivnp ,ethA:NCCGoi
tN!ljjP3fwJMwNelgUzzPGJlgihJ!d?q.d
pSPYgCuCJrIFtb
jQXg
pA.P LP,SPJi
DBcuBM:CixjJ$Jzkq,OLI3KLQLMGph$O 3DfiPHnXKuHMlyjxEiyZib3FsCV-oJa!zoc'XSP :CKGUhd?lgCOF$;;DTHZMlvvcmZAm;:iv'MMgO&Ywbc;BLCUd&vZINLIzkuTGZa
D.?EGQhFttk!aUiZa!qB-pcL?OER:PAc'd,ip.SPyI-g:I'nviM;halgd
dFIad,rA'b?qotd,!mJ.vcoibrIdZKtMb?s,SjKuBUzo-


## 5. The Mathematical Trick in Self-Attention

The core idea of a Transformer is **self-attention**. It's a mechanism that allows tokens in a sequence to communicate with and aggregate information from each other.

### Toy Example: Weighted Aggregation

This small example demonstrates the core trick.
- Matrix `a` is a lower-triangular matrix of weights. The weights in each row sum to 1.
- Matrix `b` contains our data vectors.
- The matrix multiplication `c = a @ b` produces a new set of vectors `c` where each vector `c[i]` is a weighted average of the vectors `b[0]...b[i]`.

This shows how matrix multiplication can efficiently perform a weighted sum over a sequence, which is the fundamental operation in self-attention.

In [32]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))  # lower triangular matrix
a = a / torch.sum(a, 1, keepdim=True)  # normalize rows to sum to 1
b = torch.randint(0, 10, (3, 2)).float()  # random data matrix
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])
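
To see that the matrix product really is a running average, the same computation can be spelled out with nested lists. This is a dependency-free sketch; `averaging_matrix` and `matmul` are helper names introduced here, not PyTorch functions.

```python
def averaging_matrix(n):
    """Lower-triangular n x n matrix whose row i holds 1/(i+1) in columns 0..i."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


b = [[2.0, 7.0], [6.0, 4.0], [6.0, 5.0]]    # same data as the cell above
c = matmul(averaging_matrix(3), b)
for row in c:
    print([round(v, 4) for v in row])
```

Row `i` of the result only touches rows `0..i` of `b` because the zeros in the upper triangle cancel every later row.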


### Self-Attention Version 1: for loop

We have a batch of `B` sequences, each of length `T` with `C` channels (embedding dimensions). We want the output at each position `t` to be the average of the token at `t` and all the tokens preceding it in its sequence.<br>
Here we use `for` loops to iterate through the batch and the time steps. It works, but it is slow because the Python loops handle one position at a time.

In [33]:
torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)
x.shape

torch.Size([4, 8, 2])

In [35]:
xbow = torch.zeros((B, T, C))  # initialize the output tensor
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1]  # (t+1, C): all time steps up to and including t
        xbow[b, t] = torch.mean(xprev, 0)  # average them

In [36]:
x[0]

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])

In [37]:
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
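
The nested loops recompute each prefix mean from scratch. As an aside, the same causal average can be produced in one pass with a running sum. The sketch below does this in plain Python on the first sequence printed above; the inputs are the rounded four-decimal values, so the results match `xbow[0]` only to about three decimals. `prefix_means` is a name introduced for this example.

```python
def prefix_means(rows):
    """Row t of the result is the mean of rows[0..t], computed with a running sum."""
    running = [0.0] * len(rows[0])
    out = []
    for t, row in enumerate(rows, start=1):
        running = [r + v for r, v in zip(running, row)]
        out.append([r / t for r in running])
    return out


x0 = [
    [0.1808, -0.0700], [-0.3596, -0.9152], [0.6258, 0.0255], [0.9545, 0.0643],
    [0.3612, 1.1679], [-1.3499, -0.5102], [0.2360, -0.2398], [-0.9211, 1.5433],
]
xbow0 = prefix_means(x0)
print([round(v, 4) for v in xbow0[1]])
print([round(v, 4) for v in xbow0[-1]])
```

A running sum only works for plain averages. The matrix form is the one that carries over to attention, where each position uses its own set of weights.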

### Self-Attention Version 2: Matrix Multiplication
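Here we use the matrix multiplication trick from our toy example to replace both `for` loops.
- `wei = torch.tril(...)`: We create a `(T, T)` lower-triangular weight matrix and normalize each row to sum to 1.
- `xbow2 = wei @ x`: We matrix multiply the weights with our input data `x`. PyTorch broadcasts `wei` across the batch dimension, so every sequence is averaged in one call.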

In [38]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))  # (T,T) lower triangular matrix
wei = wei / wei.sum(1, keepdim=True)  # normalize rows to sum to 1
xbow2 = wei @ x  # (T,T) is broadcast to (B,T,T); (B,T,T) @ (B,T,C) -> (B,T,C)
torch.allclose(xbow, xbow2)

True

### Self-Attention Version 3: Softmax
Instead of creating the weights by simple division, we'll use `softmax`.
1. `tril = torch.tril(...)`: Create the lower-triangular matrix of ones and zeros.
2. `wei = torch.zeros((T, T))` followed by `wei = wei.masked_fill(...)`: We start from all-zero scores and use `masked_fill` to set every position where `tril` is 0 (the upper triangle) to negative infinity.
3. `wei = F.softmax(wei, dim=-1)`: We apply softmax along each row. Since `exp(-inf)` is 0, the masked positions get zero weight and the remaining equal scores are normalized to equal weights. The result matches the row-normalized matrix from version 2, but it's a more general mechanism that will be crucial for attention.

In [39]:
# version3: using softmax
tril = torch.tril(torch.ones(T, T))  # (T,T) lower triangular matrix
wei = torch.zeros((T, T))  # start from all-zero scores
wei = wei.masked_fill(tril == 0, float('-inf'))  # fill the upper triangular part with -inf
wei = F.softmax(wei, dim=-1)  # apply softmax to get the weights
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

True
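
The reason softmax reproduces the averaging weights can be checked by hand: `exp(-inf)` is 0, so masked positions get zero weight, and the remaining equal scores share the probability mass evenly. Here is a plain-Python sketch for `T = 4`, with a local `softmax` helper standing in for `F.softmax`:

```python
import math

T = 4
NEG_INF = float("-inf")

# Scores start at 0; positions in the future (j > i) are masked with -inf.
scores = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]


def softmax(row):
    """Softmax of one row; exp(-inf) evaluates to 0.0, so masked entries vanish."""
    m = max(row)                     # finite, since the diagonal is never masked
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


wei = [softmax(row) for row in scores]
for row in wei:
    print([round(w, 4) for w in row])
```

Because the scores start at zero, every visible token gets the same weight. In self-attention those zeros are replaced with data-dependent scores, while the mask and the softmax stay the same.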