# GPT from scratch

- From [YouTube (Andrej Karpathy): Let's build GPT: from scratch, in code, spelled out.](https://youtu.be/kCc8FmEb1nY?si=rrytCRSJeL4jaCJt)
- Refactored / trimmed down version of [Google colab notebook for video](https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing)
    - Functionally the same
    - Extra comments here as well

## Imports and Paths

In [81]:
from typing import Callable

import torch
import torch.nn.functional as F
import torch.optim as optim
from torch import nn
from tqdm import tqdm

## Dataset

### Download Dataset

In [1]:
# Download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-12-06 09:47:44--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-12-06 09:47:44 (15.3 MB/s) - ‘input.txt’ saved [1115394/1115394]



### Build Dataset (Implicit Tokenizer)

- Character-level tokenizer built implicitly via `stoi` and `itos`
    - Real-world tokenizers work on subwords instead:
        - [google/sentencepiece](https://github.com/google/sentencepiece)
        - [openai/tiktoken](https://github.com/openai/tiktoken)
    - Both implement BPE (byte-pair encoding); sentencepiece also supports a unigram model
- Workflow is as follows:
    - Text is read in and its sorted unique characters become the vocab: `chars` (len=65)
        - Build lookup table `stoi` (char to int)
        - Build lookup table `itos` (int to char)
    - Build encoder function `encode: str -> list[int]`, using `stoi` (char to int)
    - Build decoder function `decode: list[int] -> str`, using `itos` (int to char), then join
    - Build dataset `data` by encoding every character of the text
        - Split into train (first 90%), val (last 10%)
    - Build dataloaders that sample random mini-batches from each split (infinite dataloader)
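
For contrast with the character-level tokenizer built below, here is a toy sketch of a single BPE merge step. It is not how sentencepiece or tiktoken are implemented; the helper names `most_frequent_pair` and `merge_pair`, the sample string, and the new token id 256 are all assumptions chosen for illustration.

```python
from collections import Counter


def most_frequent_pair(ids: list[int]) -> tuple[int, int]:
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


# Start from raw UTF-8 bytes (ids 0..255), so the first new token id is 256
ids = list("aaabdaaabac".encode("utf-8"))
pair = most_frequent_pair(ids)      # (97, 97), i.e. "aa"
merged = merge_pair(ids, pair, 256)
print(pair, len(ids), "->", len(merged))  # (97, 97) 11 -> 9
```

Repeating this step grows the vocabulary by one token per merge and shortens the encoded sequence, which is the trade-off a subword tokenizer makes relative to the 65-character vocab used here.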

In [None]:
def build_dataset(
    data_path: str, 
    verbose: bool = False
) -> tuple[torch.Tensor, torch.Tensor, Callable, Callable, int]:
    # Read input.txt
    with open(data_path, 'r', encoding='utf-8') as f:
        text = f.read()

    # Create vocab
    chars = sorted(list(set(text)))
    vocab_size = len(chars)

    # Create a mapping from characters to integers
    # This is our "tokenizer"
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }
    
    # Tokenizer encoder: take a string, output a list of integers
    def encode(s: str) -> list[int]:
        return [stoi[c] for c in s] 
    
    # Tokenizer decoder: take a list of integers, output a string
    def decode(l: list[int]) -> str:
        return ''.join([itos[i] for i in l]) 

    # Encode entire dataset into tensor
    data = torch.tensor(encode(text), dtype=torch.long)

    # Create splits: 90% train, 10% val
    n_train = int(0.9 * len(data))
    train_data = data[:n_train]
    val_data = data[n_train:]

    # Print some useful things
    if verbose:
        print(f'Vocab (size {vocab_size}): {"".join(chars)}')
        print(f'Data tensor shape:\n  {data.shape}, type: {data.dtype}')
        print(f'Data tensor example (100 chars):\n  {data[:100]}')
        print(f'Encode example:\n  encode("hii there") == {encode("hii there")}')
        print(f'Decode example:\n  decode(encode("hii there")) == {decode(encode("hii there"))}')
    
    return train_data, val_data, encode, decode, vocab_size

train_data, val_data, _encode, _decode, _vocab_size = build_dataset(
    data_path='input.txt', verbose=True
)

Vocab (size 65): 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Data tensor shape:
  torch.Size([1115394]), type: torch.int64
Data tensor example (100 chars):
  tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
Encode example:
  encode("hii there") == [46, 47, 47, 1, 58, 46, 43, 56, 43]
Decode example:
  decode(encode("hii there")) == hii there


### Build Dataloader

In [None]:
class InfiniteDataLoader:
    def __init__(self, data: torch.Tensor, batch_size: int, block_size: int, seed: int):
        self.data: torch.Tensor = data
        self.batch_size: int = batch_size  # e.g. 4
        self.block_size: int = block_size  # context/seq length, e.g. 8
        self.gen: torch.Generator = torch.Generator().manual_seed(seed)
        
    def __iter__(self):
        return self
    
    def __next__(self):
        # Random start offsets, shape (batch_size,). `high` is exclusive, so the largest
        # offset is len(data) - block_size - 1 and the target slice i+1 : i+block_size+1
        # stays in bounds.
        ix = torch.randint(
            len(self.data) - self.block_size, (self.batch_size,),
            generator=self.gen,
        )
        x = [self.data[i:i+self.block_size] for i in ix]  # batch_size tensors of shape (block_size,)
        y = [self.data[i+1:i+self.block_size+1] for i in ix]  # same windows, one token ahead
        x = torch.stack(x)  # -> (batch_size, block_size), e.g. (4,8)
        y = torch.stack(y)  # -> (batch_size, block_size), e.g. (4,8)
        return x, y
            
def build_dataloaders(
    train_data: torch.Tensor,
    val_data: torch.Tensor,
    batch_size: int, 
    block_size: int, # context/seq length
    seed: int = 1337,    
) -> tuple[InfiniteDataLoader, InfiniteDataLoader]:
    
    train_dataloader = InfiniteDataLoader(
        data=train_data, batch_size=batch_size, block_size=block_size, seed=seed,
    )
    val_dataloader = InfiniteDataLoader(
        data=val_data, batch_size=batch_size, block_size=block_size, seed=seed,
    )
    return train_dataloader, val_dataloader

### Dataloader Example

In [98]:
batch_size = 4
block_size = 8  # context/seq length
train_dataloader, val_dataloader = build_dataloaders(
    train_data=train_data, val_data=val_data, batch_size=batch_size, block_size=block_size
)

xb, yb = next(iter(train_dataloader))
print(f'batch 0 inputs (shape {xb.shape}):')
print(xb)
print(f'batch 0 targets (shape {yb.shape}), aka inputs shifted right one:')
print(yb)

print('----')
print('batch 0 training sequence:')
for b in range(1): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")    

batch 0 inputs (shape torch.Size([4, 8])):
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
batch 0 targets (shape torch.Size([4, 8])), aka inputs shifted right one:
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
batch 0 training sequence:
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
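
The relationship between inputs and targets shown above can be checked directly. The sketch below uses `torch.arange(100)` as a stand-in for the encoded corpus (an assumption, so that each value equals its position) and also shows a vectorized alternative to the list comprehension, which is not part of the notebook.

```python
import torch

data = torch.arange(100)  # stand-in corpus: token value == position
batch_size, block_size = 4, 8
gen = torch.Generator().manual_seed(1337)

# Same sampling scheme as InfiniteDataLoader.__next__
ix = torch.randint(len(data) - block_size, (batch_size,), generator=gen)
x_loop = torch.stack([data[i:i + block_size] for i in ix])
y_loop = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# Vectorized alternative: broadcast start offsets against a (block_size,) range
offsets = torch.arange(block_size)          # (block_size,)
x_vec = data[ix[:, None] + offsets]         # (batch_size, block_size)
y_vec = data[ix[:, None] + offsets + 1]     # (batch_size, block_size)

# Targets are the inputs moved one token ahead in the corpus
assert torch.equal(x_loop[:, 1:], y_loop[:, :-1])
assert torch.equal(x_vec, x_loop) and torch.equal(y_vec, y_loop)
```

Each row therefore packs `block_size` training examples: position `t` of `y` is the label for the context `x[:t+1]`.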


## Model

### Build Transformer Block

In [None]:
class FeedFoward(nn.Module):
    """ position-wise MLP: linear (4x expansion) -> ReLU -> linear projection -> dropout """

    def __init__(self, n_embd: int, dropout: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class SingleHeadAttention(nn.Module):
    """ one head of self-attention 
    
    Previously called 'Head'"""

    def __init__(self, n_embd: int, head_size: int, block_size: int, dropout: float):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape   # C == n_embd
        k = self.key(x)   # (B,T,hs), hs == head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        # NOTE: this scales by n_embd**-0.5 (C); standard scaled dot-product attention
        # scales by head_size**-0.5, i.e. k.shape[-1]**-0.5
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

class SimpleMultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel 
    
    Previously called 'MultiHeadAttention'
    
    This is the "easy" way, where we just repeat this in parallel, concatenate the result, and
    add an output projection. The "better" way is to fuse the W_k, W_q, W_v matrices of all heads
    into one projection and batch the attn calculation across heads, which is much faster.
    """

    def __init__(self, n_embd: int, num_heads: int, head_size: int, dropout: float):
        super().__init__()
        # NOTE: `block_size` is not a constructor argument; it is read from the notebook's
        # global scope, so it must already be defined when this module is built.
        self.heads = nn.ModuleList(
            [
                SingleHeadAttention(n_embd=n_embd, head_size=head_size, block_size=block_size, dropout=dropout)
                for _ in range(num_heads)
            ]
        )
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out



class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd: int, n_head: int, dropout: float):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
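        head_size = n_embd // n_head
        # The attention class defined above is SimpleMultiHeadAttention (renamed from MultiHeadAttention)
        self.sa = SimpleMultiHeadAttention(
            n_embd=n_embd, num_heads=n_head, head_size=head_size, dropout=dropout
        )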
        self.ffwd = FeedFoward(n_embd=n_embd, dropout=dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))  # Pre-norm (more common now; orig paper was post-norm)
        x = x + self.ffwd(self.ln2(x))  # Pre-norm (more common now; orig paper was post-norm)
        return x
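
The "better" way mentioned in the `SimpleMultiHeadAttention` docstring could look roughly like the sketch below. The class name `FusedMultiHeadAttention` and its layout are assumptions, not code from the video. It also scales by `head_size**-0.5` (the standard choice), which differs from the `C**-0.5` used in `SingleHeadAttention`, so it is not a drop-in numerical replacement.

```python
import torch
import torch.nn.functional as F
from torch import nn


class FusedMultiHeadAttention(nn.Module):
    """Causal multi-head self-attention with one fused q/k/v projection (illustrative sketch)."""

    def __init__(self, n_embd: int, num_heads: int, block_size: int, dropout: float):
        super().__init__()
        assert n_embd % num_heads == 0, "n_embd must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_size = n_embd // num_heads
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)  # all heads' W_q, W_k, W_v at once
        self.proj = nn.Linear(n_embd, n_embd)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=-1)  # each (B,T,C)
        # Split channels into heads: (B,T,C) -> (B,nh,T,hs)
        q = q.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_size).transpose(1, 2)
        wei = q @ k.transpose(-2, -1) * self.head_size ** -0.5  # (B,nh,T,T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        wei = self.attn_dropout(F.softmax(wei, dim=-1))
        out = wei @ v  # (B,nh,T,T) @ (B,nh,T,hs) -> (B,nh,T,hs)
        out = out.transpose(1, 2).contiguous().view(B, T, C)  # merge heads back: (B,T,C)
        return self.resid_dropout(self.proj(out))


torch.manual_seed(0)
mha = FusedMultiHeadAttention(n_embd=64, num_heads=4, block_size=32, dropout=0.0)
x = torch.randn(2, 8, 64)
out = mha(x)
print(out.shape)  # torch.Size([2, 8, 64])
```

Taking `block_size` as an explicit constructor argument also avoids depending on a notebook-level global.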

### Build Bigram Model w/ Transformer Blocks

In [None]:
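# Small GPT-style (decoder-only) Transformer language model.
# The class name is kept from the bigram model it grew out of in the video.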
class BigramLanguageModel(nn.Module):

    def __init__(
        self, 
        vocab_size: int, 
        n_embd: int, 
        block_size: int, 
        n_head: int, 
        n_layer: int, 
        dropout: float,
        loss_fn: Callable = F.cross_entropy,  # NOTE: currently unused; forward() calls F.cross_entropy directly
    ):
        super().__init__()
        self.block_size = block_size
        
        # token ids are mapped to n_embd-dim vectors (no longer direct next-token logits as in
        # a pure bigram model); the logits come from lm_head after the transformer blocks
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)  # No max seq. length here
        self.position_embedding_table = nn.Embedding(block_size, n_embd)  # Max seq length here
        blocks = [
            Block(n_embd=n_embd, n_head=n_head, dropout=dropout) 
            for _ in range(n_layer)
        ]
        self.blocks = nn.Sequential(*blocks)
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        
        # print the number of parameters in the model
        num_params = sum(p.numel() for p in self.parameters())
        print(f'{self.__class__.__name__}: {num_params / 1e6: .2f} M parameters')

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=tok_emb.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Big picture: take (B,T), generate new tokens, e.g. generate((B,T)) -> (B,T+1) and so on
        #   - Crop the sequence to the last L = min(T, block_size) tokens
        #   - Forward pass: (B,L) -> (B,L,C), where C = vocab_size
        #   - Last time step (B,-1,C) gives logits for the next token
        #   - Apply softmax to get probs
        #   - Use torch.multinomial to sample next token index
        #   - Append next token: (B,T) -> (B,T+1)
        #   - Repeat
        
        # Batch generate
        #   idx.shape: (B,T) where B = batch_size, T = num_tokens (seq len)
        # Here T doesn't have to be block_size == max context len, can just be 1
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens (if T < block_size this is just T)
            idx_cond = idx[:, -self.block_size:]  # (B,T) -> (B,L)
            # get the predictions
            logits, _loss = self(idx_cond, targets=None)  # (B,L) -> (B,L,C)
            # focus only on the last time step
            logits = logits[:, -1, :] # (B,L,C) -> (B,C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B,C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B,C) -> (B,1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B,T) -> (B,T+1)
        return idx
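
The parameter count printed by the constructor can be reproduced by hand. The sketch below is plain arithmetic over the layers defined above; the function name `count_params` is an assumption, and the arguments are the vocab size (65) and the hyperparameters used in the full train loop.

```python
def count_params(vocab_size: int, n_embd: int, block_size: int, n_head: int, n_layer: int) -> int:
    head_size = n_embd // n_head

    tok_emb = vocab_size * n_embd                    # token_embedding_table
    pos_emb = block_size * n_embd                    # position_embedding_table

    attn_heads = n_head * 3 * n_embd * head_size     # key/query/value per head, bias=False
    attn_proj = n_embd * n_embd + n_embd             # output projection (weights + bias)
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                   # ln1 + ln2, each with weight and bias
    block = attn_heads + attn_proj + ffwd + layer_norms

    ln_f = 2 * n_embd                                # final layer norm
    lm_head = n_embd * vocab_size + vocab_size       # weights + bias

    # The `tril` mask is a registered buffer, not a parameter, so it is not counted
    return tok_emb + pos_emb + n_layer * block + ln_f + lm_head


total = count_params(vocab_size=65, n_embd=64, block_size=32, n_head=4, n_layer=4)
print(total, f"{total / 1e6:.2f} M")  # 209729 0.21 M
```

Nearly all of the parameters (4 x 49,792 = 199,168) sit in the transformer blocks, and about two thirds of each block is the feed-forward layer.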

## Train Loop

### Full Loss

In [None]:
@torch.no_grad()
def estimate_loss(
    model: nn.Module, 
    train_dataloader: InfiniteDataLoader, 
    val_dataloader: InfiniteDataLoader, 
    eval_iters: int,
    device: str,
) -> dict[str, torch.Tensor]:
    out = {}
    model.eval()
    for split, dataloader in [('train', train_dataloader), ('val', val_dataloader)]:       
        losses = torch.zeros(eval_iters)
        for k, (X,Y) in enumerate(dataloader):
            # Move data to device (usually part of collate_fn but not implemented in InfiniteDataLoader)
            X = X.to(device)
            Y = Y.to(device)
            
            # Forward pass
            _logits, loss = model(X, Y)
            
            losses[k] = loss.item()
            if k+1 >= eval_iters:  # b/c infinite dataloader
                break
        out[split] = losses.mean()
    model.train()
    return out

### Full Train Loop

In [92]:
# ------------
# model hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------
# train hyperparams
max_iters = 5000
print_iters = None  # max_iters // 20; skipping, just using eval_interval
eval_interval = max_iters // 10
learning_rate = 1e-3
eval_iters = 200
data_path = 'input.txt'
seed = 1337
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# ------------

# Build dataset and tokenizer
train_data, val_data, _encode, decode, vocab_size = build_dataset(data_path=data_path)

# Build dataloaders
train_dataloader, val_dataloader = build_dataloaders(
    train_data=train_data, 
    val_data=val_data, 
    batch_size=batch_size, 
    block_size=block_size, 
    seed=seed,
)

# Build model and move to device
model = BigramLanguageModel(
    vocab_size=vocab_size, 
    n_embd=n_embd, 
    block_size=block_size, 
    n_head=n_head, 
    n_layer=n_layer, 
    dropout=dropout
)
model = model.to(device)

# Build optimizer and scheduler
reduce_lr_steps = max(1, int(max_iters * 0.8))
optimizer = optim.AdamW(params=model.parameters(), lr=learning_rate)  # pyright: ignore[reportPrivateImportUsage]
scheduler = optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=reduce_lr_steps, gamma=0.1)

for idx, (xb, yb) in tqdm(enumerate(train_dataloader), desc="iter", total=max_iters):
    
    # Move data to device (usually part of collate_fn but not implemented in InfiniteDataLoader)
    xb = xb.to(device)
    yb = yb.to(device)
    
    # Forward pass
    logits, loss = model(xb, yb)
    
    # Backward pass
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    
    # Update
    optimizer.step()
    scheduler.step()
    
    # Print loss on current batch (skipped if print_iters is None)
    if print_iters is not None and (idx+1) % print_iters == 0:
        print(
            f"Iter [{idx+1}/{max_iters}]:"
            f" loss={loss.item():.3f}, lr={optimizer.param_groups[0]['lr']}"
        )
    
    # Print loss / mini-evaluate
    if (eval_interval is not None and (idx+1) % eval_interval == 0) or (idx+1) == max_iters:
        losses = estimate_loss(
            model=model, 
            train_dataloader=train_dataloader, 
            val_dataloader=val_dataloader, 
            eval_iters=eval_iters,
            device=device,
        )
        print(
            f"Iter [{idx+1}/{max_iters}]:"
            f" train_loss={losses['train']:.4f},"
            f" val_loss={losses['val']:.4f},"
            f" lr={optimizer.param_groups[0]['lr']}"
        )
    
    if idx+1 >= max_iters:
        break  # b/c infinite dataloader

BigramLanguageModel:  0.21 M parameters


iter:  10%|█         | 505/5000 [00:15<12:20,  6.07it/s]

Iter [500/5000]: train_loss=2.3012, val_loss=2.3193, lr=0.001


iter:  20%|██        | 1009/5000 [00:29<09:03,  7.34it/s]

Iter [1000/5000]: train_loss=2.0963, val_loss=2.1303, lr=0.001


iter:  30%|███       | 1506/5000 [00:44<11:44,  4.96it/s]

Iter [1500/5000]: train_loss=1.9637, val_loss=2.0396, lr=0.001


iter:  40%|████      | 2005/5000 [00:58<07:01,  7.10it/s]

Iter [2000/5000]: train_loss=1.8625, val_loss=1.9771, lr=0.001


iter:  50%|█████     | 2508/5000 [01:13<07:12,  5.77it/s]

Iter [2500/5000]: train_loss=1.7999, val_loss=1.9254, lr=0.001


iter:  60%|██████    | 3009/5000 [01:27<04:43,  7.02it/s]

Iter [3000/5000]: train_loss=1.7508, val_loss=1.9026, lr=0.001


iter:  70%|███████   | 3507/5000 [01:41<03:54,  6.38it/s]

Iter [3500/5000]: train_loss=1.7308, val_loss=1.8715, lr=0.001


iter:  80%|████████  | 4006/5000 [01:56<02:37,  6.31it/s]

Iter [4000/5000]: train_loss=1.6916, val_loss=1.8336, lr=0.0001


iter:  90%|█████████ | 4507/5000 [02:10<01:17,  6.39it/s]

Iter [4500/5000]: train_loss=1.6358, val_loss=1.7833, lr=0.0001


iter: 100%|█████████▉| 4999/5000 [02:25<00:00, 34.42it/s]

Iter [5000/5000]: train_loss=1.6227, val_loss=1.7835, lr=0.0001
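
The learning-rate drop visible at iteration 4000 comes from `StepLR` with `step_size = int(max_iters * 0.8)` and `gamma=0.1`. The sketch below replays that schedule on a scaled-down run; `max_iters = 10` and the dummy `nn.Linear` model are assumptions made so it runs instantly.

```python
import torch
from torch import nn, optim

max_iters = 10                                   # scaled down from 5000
reduce_lr_steps = max(1, int(max_iters * 0.8))   # 8 here, 4000 in the full run

model = nn.Linear(1, 1)                          # dummy model, only needed to build an optimizer
optimizer = optim.AdamW(params=model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer=optimizer, step_size=reduce_lr_steps, gamma=0.1)

lrs = []
for idx in range(max_iters):
    optimizer.step()      # no gradients here, so parameters are left untouched
    scheduler.step()
    lrs.append(optimizer.param_groups[0]['lr'])  # lr as it would be logged at iter idx+1

for it, lr in enumerate(lrs, start=1):
    print(f"iter {it}: lr={lr}")
```

The learning rate stays at `1e-3` through iteration 7 and is multiplied by 0.1 once the scheduler has been stepped 8 times, mirroring the switch to `lr=0.0001` in the log at iteration 4000.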





### Generate

In [102]:
# Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)  # Zero is newline char (reasonable start token)
generated = model.generate(context, max_new_tokens=200)  # (1,1) -> (1,201): start token + 200 new tokens
generated_txt = decode(generated[0].tolist())  # (1,201) -> (201,) -> list -> decode int to char
print(generated_txt)



Come but I am on joy, Bukly be a their awomer:
Here conferrows weeps.

MERCUTIO:
How nother:
Mare before.

LADY ANNE:
Make to accoudio, their a grant me lord.
Indep you spriect?

FLORIZARET:
To the k
