<a href="https://colab.research.google.com/github/b-schoen/gpt_from_scratch/blob/main/colab/gpt2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# clone repo
!rm -rf gpt_from_scratch
!git clone https://github.com/b-schoen/gpt_from_scratch.git

Cloning into 'gpt_from_scratch'...
remote: Enumerating objects: 261, done.
remote: Counting objects: 100% (261/261), done.
remote: Compressing objects: 100% (201/201), done.
remote: Total 261 (delta 132), reused 175 (delta 53), pack-reused 0 (from 0)
Receiving objects: 100% (261/261), 4.28 MiB | 14.50 MiB/s, done.
Resolving deltas: 100% (132/132), done.


In [2]:
# change into the repo directory
import os

os.chdir('gpt_from_scratch')

print("Current Working Directory:", os.getcwd())

Current Working Directory: /content/gpt_from_scratch


In [3]:
# now we can operate as if this was a local notebook

In [4]:
%load_ext autoreload
%autoreload 2

## Download dataset locally

In [5]:
# let's load tinystories for comparison
#
# note: `datasets` can list datasets but is deprecated
import huggingface_hub

# from https://huggingface.co/docs/huggingface_hub/en/guides/download#from-latest-version
import dataclasses
from typing import Callable, Generic, TypeVar
import pathlib

T = TypeVar('T')
R = TypeVar('R')

@dataclasses.dataclass(frozen=True)
class TrainAndVal(Generic[T]):
    """Helper for common pattern of transforming both train and val."""

    train: T
    val: T

    def apply(self, func: Callable[[T], R]) -> 'TrainAndVal[R]':
        return dataclasses.replace(self,
            train=func(self.train),
            val=func(self.val),
        )

def download_file_from_tinystories(filename: str) -> pathlib.Path:

    print(f"Downloading {filename}...")
    filepath = huggingface_hub.hf_hub_download(
        repo_id='roneneldan/TinyStories',
        filename=filename,
        repo_type="dataset",
    )

    print(f"Downloaded {filename} to {filepath}")
    return pathlib.Path(filepath)

# original in paper
# train_filename, val_filename = 'TinyStories-train.txt', 'TinyStories-valid.txt'

# GPT-4 only, significantly larger but newer
filenames = TrainAndVal('TinyStoriesV2-GPT4-train.txt', 'TinyStoriesV2-GPT4-valid.txt')

# download
filepaths = filenames.apply(download_file_from_tinystories)

# read train as input text
input_text = filepaths.train.read_text(encoding='utf-8')

Downloading TinyStoriesV2-GPT4-train.txt...




Downloaded TinyStoriesV2-GPT4-train.txt to /root/.cache/huggingface/hub/datasets--roneneldan--TinyStories/snapshots/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/TinyStoriesV2-GPT4-train.txt
Downloading TinyStoriesV2-GPT4-valid.txt...
Downloaded TinyStoriesV2-GPT4-valid.txt to /root/.cache/huggingface/hub/datasets--roneneldan--TinyStories/snapshots/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/TinyStoriesV2-GPT4-valid.txt


In [6]:
# note: if this gets annoying can do an actual pip install requirements
!pip install tiktoken
!pip install jaxtyping



# Starting to optimize

> [!NOTE]
> Start from "what hardware do I have, and am I fully utilizing it?"

Looking up the NVIDIA spec sheet for the A100, we see:

| Specification | A100 80GB PCIe | A100 80GB SXM |
|---------------|----------------|---------------|
| FP64 | 9.7 TFLOPS | 9.7 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS |
| Tensor Float 32 (TF32) | 156 TFLOPS \| 312 TFLOPS\* | 156 TFLOPS \| 312 TFLOPS\* |
| BFLOAT16 Tensor Core | 312 TFLOPS \| 624 TFLOPS\* | 312 TFLOPS \| 624 TFLOPS\* |
| FP16 Tensor Core | 312 TFLOPS \| 624 TFLOPS\* | 312 TFLOPS \| 624 TFLOPS\* |
| INT8 Tensor Core | 624 TOPS \| 1248 TOPS\* | 624 TOPS \| 1248 TOPS\* |
| GPU Memory | 80GB HBM2e | 80GB HBM2e |
| GPU Memory Bandwidth | 1,935GB/s | 2,039GB/s |

\* With sparsity.


With default FP32 math, we're currently at:

| Specification | A100 80GB PCIe | A100 80GB SXM |
|---------------|----------------|---------------|
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS |

But it turns out we don't really need that much precision for deep learning. The formats split their bits as follows:

| Format | Sign bits | Exponent bits (range) | Mantissa bits (precision) |
|--------|------|------------------|----------------------|
| FP32   | 1    | 8                | 23                   |
| TF32   | 1    | 8                | 10                   |
| FP16   | 1    | 5                | 10                   |
| BF16   | 1    | 8                | 7                    |
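
To make the trade-off concrete, the bit counts in the table can be turned into a relative step size (machine epsilon) and a largest finite value. This is a minimal sketch that assumes every format follows the usual IEEE-754-style layout (implicit leading 1, exponent bias of `2**(e-1) - 1`, top exponent code reserved for inf/NaN); `FORMATS` and `describe` are illustrative names, not part of the notebook.

```python
# Bit layouts copied from the table above: (exponent bits, mantissa bits).
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
    "BF16": (8, 7),
}


def describe(exponent_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (machine epsilon, largest finite value) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    epsilon = 2.0 ** -mantissa_bits             # gap between 1.0 and the next value
    largest = (2.0 - epsilon) * 2.0 ** bias     # all-ones mantissa, top normal exponent
    return epsilon, largest


for name, (exponent_bits, mantissa_bits) in FORMATS.items():
    epsilon, largest = describe(exponent_bits, mantissa_bits)
    print(f"{name}: epsilon={epsilon:.3e}, largest={largest:.3e}")
```

BF16 keeps FP32's exponent range with a much coarser step, while FP16 has a finer step but overflows past 65504.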

In [7]:
import time

import tiktoken

from gpt_from_scratch.gpt2_from_scratch import data_loader
from gpt_from_scratch.gpt2_from_scratch.train_gpt2 import (
    GPT,
    GPTConfig,
    get_best_available_torch_device,
)

import torch
import torch.optim
import torch.nn as nn
import torch.nn.functional as F

## Sampling

In [8]:
# sample some outputs to get an idea of where we are

from typing import TYPE_CHECKING

if TYPE_CHECKING:
  from gpt_from_scratch import tokenizer_utils

def sample_model(
    prompt: str,
    num_samples: int,
    max_tokens: int,
    model: nn.Module,
    tokenizer: 'tokenizer_utils.Tokenizer',
    device: torch.device,
) -> None:

    # tokenize
    tokens = tokenizer.encode(prompt)
    tokens = torch.tensor(tokens, dtype=torch.long)

    tokens = tokens.unsqueeze(0).repeat(num_samples, 1) # (num_samples, prompt_len)

    # tokens in this case is just the prompt, and is small enough to fit on GPU
    x = tokens.to(device)

    while x.size(1) < max_tokens:

        # forward the model to get the logits
        with torch.no_grad():

            logits, loss = model(x) # (B, T, vocab_size)

            # take the logits at the last position
            # throw away all the logits from things other than the last position
            logits = logits[:, -1, :] # (B, vocab_size)

            # get the probabilities
            probs = F.softmax(logits, dim=-1)

            # do top-k sampling of 50 (huggingface pipeline default)
            # topk_probs and topk_indices both become (B, 50)
            #
            # "anything lower than the 50th, we clamp to 0 and never sample it"
            #
            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)

            # select a token from the top-k probabilities
            # note: multinomial does not demand the input to sum to 1
            ix = torch.multinomial(topk_probs, 1) # (B, 1)

            # gather the corresponding indices
            xcol = torch.gather(topk_indices, -1, ix) # (B, 1)

            # append to the sequence
            x = torch.cat((x, xcol), dim=1)

    # print the generated text
    for i in range(num_samples):

        tokens = x[i, :max_tokens].tolist()

        decoded = tokenizer.decode(tokens)

        print(f"\n [{i}] >", decoded)
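
The top-k step inside `sample_model` can be hard to see through the tensor shapes, so here is a minimal sketch of the same idea for a single position in plain Python. It is not the notebook's implementation: `sample_top_k` is a hypothetical helper, and `random.choices` stands in for `torch.multinomial` (both accept weights that do not sum to 1).

```python
import math
import random


def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    """Sample one token index from the k highest-probability entries."""
    # softmax, shifted by the max for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [value / total for value in exps]

    # keep only the k most likely token indices; everything else is never sampled
    top_indices = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    top_probs = [probs[i] for i in top_indices]

    # random.choices normalizes the weights itself, so top_probs need not sum to 1
    return rng.choices(top_indices, weights=top_probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))  # only the two highest-logit indices can appear
```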

## Data Loading

In [9]:
import math

def closest_power_of_two(n: int) -> int:
    # Find the power of 2 less than or equal to n
    lower = 2 ** math.floor(math.log2(n))

    # Find the power of 2 greater than n
    upper = lower * 2

    # Return the closest one
    return lower if (n - lower) < (upper - n) else upper

def next_power_of_two(n: int) -> int:

    # Find the smallest power of 2 greater than or equal to n
    return 2 ** math.ceil(math.log2(n))

def get_first_n_examples(input_text: str, n: int) -> str:

    delimiter = "<|endoftext|>"

    examples = input_text.split(delimiter)

    # Return all text if n is greater than available examples
    if n > len(examples) - 1:
        return input_text

    result = delimiter.join(examples[:n]) + delimiter
    return result.strip()

In [10]:
print('\n--- First 1000 characters: ---\n')
print(input_text[:1000])

# print('\n--- Last 1000 characters: ---\n')
# print(input_text[-1000:])


--- First 1000 characters: ---


Once upon a time there was a little boy named Ben. Ben loved to explore the world around him. He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was walking through the store when he came across a very special vase. When Ben saw it he was amazed!  
He said, “Wow, that is a really amazing vase! Can I buy it?” 
The shopkeeper smiled and said, “Of course you can. You can take it home and show all your friends how amazing it is!”
So Ben took the vase home and he was so proud of it! He called his friends over and showed them the amazing vase. All his friends thought the vase was beautiful and couldn't believe how lucky Ben was. 
And that's how Ben found an amazing vase in the store!
<|endoftext|>
Once upon a time, there was a reliable otter named Ollie. He lived in a river with his family. They all loved to play and swim together.
One day, Ollie's mom said, "Ollie, hurry and get some fish for dinner!" Ollie swam f

In [11]:
# 2,717,700 stories
num_samples = len(input_text.split('<|endoftext|>'))

print(f'{num_samples=}')

# arbitrarily choosing 1/10 as scale factor
num_samples = num_samples // 10

print(f'{num_samples=} after scaling')

num_samples = closest_power_of_two(num_samples)

print(f'{num_samples=} after choosing closest power of 2')

num_samples=2717700
num_samples=271770 after scaling
num_samples=262144 after choosing closest power of 2
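
As a cross-check on the numbers printed above, the same rounding can be done with integer arithmetic only, which avoids any floating-point `log2` concerns for large inputs. This is a sketch: `closest_power_of_two_int` is a hypothetical variant of the notebook's `closest_power_of_two`, and it assumes `n >= 1`.

```python
def closest_power_of_two_int(n: int) -> int:
    """Integer-only variant of closest_power_of_two (hypothetical helper, n >= 1)."""
    lower = 1 << (n.bit_length() - 1)  # largest power of 2 <= n
    upper = lower << 1                 # smallest power of 2 > n
    return lower if (n - lower) < (upper - n) else upper


num_stories = 2_717_700            # story count printed above
scaled = num_stories // 10         # the arbitrary 1/10 scale factor
chosen = closest_power_of_two_int(scaled)

# prints the scaled count, the chosen power of two, and its exponent
print(scaled, chosen, chosen.bit_length() - 1)
```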


In [12]:
# clip the input text at number of samples
input_text = get_first_n_examples(input_text, n=num_samples)

In [13]:
# we'll trim down the dataset to something that loads quickly

In [14]:
# create tokenizer
tokenizer = tiktoken.get_encoding('gpt2')

# tokenize input text
# note: the dataset already has `<|endoftext|>` in it, we need to tell the
#       encoder that that's okay and that we genuinely do want to treat it
#       as `<|endoftext|>`
tokens = tokenizer.encode(input_text, allowed_special={'<|endoftext|>'})
tokens = torch.tensor(tokens, dtype=torch.long)

# load text via dataloader
# TODO(bschoen): Why do we pick this?
total_batch_size = 524288 # 2**19, ~0.5M, in number of tokens

B = 16 # micro batch size
T = 1024 # sequence length

assert total_batch_size % (B * T) == 0, "make sure total_batch_size is divisible by B * T"

grad_accum_steps = total_batch_size // (B * T)
print(f"total desired batch size: {total_batch_size}")
print(f"=> calculated gradient accumulation steps: {grad_accum_steps}")

# create a train loader that will continually give us new batches
train_loader = data_loader.DataLoaderLite(B=B, T=T, tokens=tokens)

# note: these are computed based on data loading

# want to make it through all of our tokens

# this seems way too low @ 100, thus the override
max_steps = 1000
# max_steps = len(tokens) // total_batch_size

# chosen fairly arbitrarily
# TODO(bschoen): GPT-2 seems to do this as a fraction of tokens (proportional)
warmup_steps = int(max_steps * 0.1)

# learning rate
max_lr = 6e-4
min_lr = max_lr * 0.1

print(f'| {max_steps=} | {warmup_steps=} | {max_lr=:.6f} | {min_lr=:.6f} |')

total desired batch size: 524288
=> calculated gradient accumulation steps: 32
loaded 52796537 tokens
1 epoch = 3222 batches (steps to make one pass through data)
| max_steps=1000 | warmup_steps=100 | max_lr=0.000600 | min_lr=0.000060 |
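
The numbers in this output follow from a few integer divisions, which also explains why `len(tokens) // total_batch_size` came out at only 100 steps. A small worked sketch, using only values printed above; the epoch estimate assumes the loader wraps around when it reaches the end of the tokens.

```python
total_batch_size = 2 ** 19      # tokens per optimizer step (524288)
B, T = 16, 1024                 # micro-batch size and sequence length
num_tokens = 52_796_537         # from the "loaded ... tokens" line
max_steps = 1000

tokens_per_micro_batch = B * T
grad_accum_steps = total_batch_size // tokens_per_micro_batch
micro_batches_per_epoch = num_tokens // tokens_per_micro_batch
optimizer_steps_per_epoch = num_tokens // total_batch_size

# how many passes over the trimmed dataset max_steps corresponds to
epochs_covered = max_steps * total_batch_size / num_tokens

print(f"{grad_accum_steps=}")
print(f"{micro_batches_per_epoch=}")
print(f"{optimizer_steps_per_epoch=}")
print(f"epochs_covered={epochs_covered:.2f}")
```

So overriding `max_steps` to 1000 means roughly ten passes over the trimmed dataset instead of one.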


In [15]:
# Initial layer dominates pretty much everything
#
# Decrease your batch size until things fit
# By default you want to max it out with nice numbers
#
# Tokens / sec is best unit because agnostic to batch size etc, it's the thing we really care about
#
# Karpathy recommends the `Automatic Mixed Precision` pytorch tutorial specifically, others are confusing
#

# Initial w/ Float32 - (B=4, T=32) - mps
#
#   | step 49 | loss: 6.8048 | dt: 136.36ms | tok/sec: 938.68 |
#
# Initial w/ Float32 - (B=4, T=32) - cpu
#
#   | step 14 | loss: 7.6758 | dt: 2578.34ms | tok/sec: 49.64 |
#
# Initial w/ Float32 - (B=4, T=32) - cuda
#
#   | step 48 | loss: 6.3560 | dt: 31.72ms | tok/sec: 4035.35 |
#
# Initial w/ Float32 - (B=16, T=1024) - cuda
#
#   | step 49 | loss: 6.1039 | dt: 1041.67ms | tok/sec: 15728.63 |
#
#   * Pretty stable
#   * Using full 40 GB GPU (~38.5 GB)
#
# ... + torch.set_float32_matmul_precision('high')
#
#   | step 49 | loss: 6.2045 | dt: 382.83ms | tok/sec: 42797.34 |
#
#   * lets float32 matmuls use TF32 internally (lower precision, faster)
#
# ... + bfloat16 (automatic mixed precision)
#
#   | step 49 | loss: 6.0319 | dt: 335.56ms | tok/sec: 48826.04 |
#
#   * decrease amount of storage we're using per float when moving around
#   * pytorch docs *specifically* say to only apply to the model's forward pass and loss calculation
#
# ... + torch.compile
#
#   | step 49 | loss: 6.0414 | dt: 192.20ms | tok/sec: 85246.46 |
#
#   * Karpathy: "Really incredible piece of code from the pytorch team"
#   * Like LLVM for pytorch
#   * No reason to not use it
#
# ... + scaled flash attention
#
#   | step 49 | loss: 6.1316 | dt: 143.52ms | tok/sec: 114161.25 |
#
#   * There are operations that torch.compile will not find
#   * Kernel fusion, but kernel fusion that torch.compile can't find
#   * Flash attention actually does more FLOPs! But it is mindful of the memory hierarchy (what's in HBM vs. shared memory, minimizing reads/writes)
#   * ~7.6x faster
#   * Flash attention 3?
#   * In particular never materialize the T*T matrix
#   * Uses "online softmax trick"
#   * Allows you to update the softmax value online using intermediate values
#   * "Flops don't matter, the entire memory operation matters"
#   * "I'm not exactly sure why torch.compile doesn't fuse our original implementation into flash attention operation"
#
# ... + nice vocab size
#
#   | step 49 | loss: 6.1674 | dt: 107.45ms | tok/sec: 152477.50 |
#
#   * "The dumbest optimization"
#   * "In some ways still surprises me"
#   * IN GENERAL, SCAN YOUR CODE AND LOOK FOR UGLY NUMBERS, ex: `3`
#   * ex: the `25` as number of heads in GPT2-XL lol
#   * basically can always increase the number until it's a nice power of 2
#   * 50304 is super divisible by a bunch of different powers of 2
#   * this is literally more FLOPS lmao
#   * most kernels have a whole second phase where they handle anything that's not blocked as a special case to be correct
#   * "one of my favorite examples of having to know how stuff works under the hood- knowing what to tinker with"
#
# ... + AdamW params and grad clipping set
#
#   | step   49 | loss: 5.9391 | norm: 0.7900 | dt: 109.41ms | tok/sec: 149755.44 |
#
#   * so a _little_ slower but loss is converging much faster
#   * clipping the global norm
#   * if you get unlucky in a sample, you don't want a huge loss to throw off your whole batch
#   * definitely a hack lmao
#   * the norm is useful to watch as you train, e.g. for spikes or for it climbing steadily
#   * for example early on high gradients when learning easy dumb stuff
#
# ... + cosine decay learning schedule with warmup
#
#   | step   49 | loss: 5.8699 | lr 6.0832e-05 | norm: 0.7640 | dt: 108.81ms | tok/sec: 150577.77 |
#
#   * a little bit better plus a little bit faster
#   * probably matters a lot more later in training? Or is this thinking about it wrong
#   * the warmup is _part_ of the process where we eventually decay
#   * we're replicating this from GPT-3 paper (since don't know for GPT-2)
#
# ... + batch size scheduling
#
#  * Karpathy: "We skip this, because complicates everything and isn't that big of an improvement"
#  * intuition is that early on you actually don't need huge batches because what you're learning is so dumb
#
# ... + model.configure_optimizer - add weight decay, only for 2D params, and add fused AdamW
#
#   | step   49 | loss: 5.8977 | lr 6.0832e-05 | norm: 0.6617 | dt: 103.32ms | tok/sec: 158582.07 |
#
#  * num decayed parameter tensors: 50, with 124,354,560 parameters
#  * num non-decayed parameter tensors: 98, with 121,344 parameters
#  * using fused AdamW: True
#
# ... + gradient accumulation
#
#   | step   35 | loss: 5.8420 | lr 2.2668e-04 | norm: 0.2565 | dt: 3227.67ms | tok/sec: 162435.59 |
#
# ... + use batch size 32 instead of 16 for full gpu utilization
#
#   | step    7 | loss: 8.0427 | lr 4.8000e-04 | norm: 2.0357 | dt: 3084.79ms | tok/sec: 169958.89 |
#
# ... + DistributedDataParallel (multi gpu, torchrun)
#
#   * everything looks pretty much the same
#   * we skip this, as we only have one GPU
#   * does bring it up to ~1.5M tok/sec, but he has 8 GPUs (roughly 169958 * 8 ≈ 1.36M)
#
# ... + switching over to tinystories
#
#   | step   49 | loss: 4.6334 | lr 6.0832e-05 | norm: 0.3634 | dt: 3104.65ms | tok/sec: 168872.06 |
#
#   * interestingly the same tokens per second
#
# ... + (B=16) (since was running out of GPU space)
#
#   | step   49 | loss: 4.2973 | lr 3.0000e-04 | norm: 1.3571 | dt: 3232.42ms | tok/sec: 162196.82 |
#   ...
#   | step  999 | loss: 1.1815 | lr 6.0002e-05 | norm: 0.3806 | dt: 3234.65ms | tok/sec: 162084.83 |
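
The `tok/sec` figures in the log above are just tokens per step divided by step time, so they can be recomputed from `dt` and compared across stages. A minimal sketch using three rows from the log; `tokens_per_second` is an illustrative helper, and small differences from the logged values come from `dt` being rounded to two decimals.

```python
def tokens_per_second(
    batch_size: int, seq_len: int, grad_accum_steps: int, dt_ms: float
) -> float:
    """Tokens processed in one optimizer step divided by the step time."""
    return batch_size * seq_len * grad_accum_steps / (dt_ms / 1000.0)


# (B=16, T=1024), values of dt taken from the log above
baseline = tokens_per_second(16, 1024, 1, 1041.67)      # initial float32 on cuda
flash = tokens_per_second(16, 1024, 1, 143.52)          # after flash attention
accumulated = tokens_per_second(16, 1024, 32, 3227.67)  # with gradient accumulation

print(f"baseline:    {baseline:10.0f} tok/sec")
print(f"flash:       {flash:10.0f} tok/sec ({flash / baseline:.2f}x baseline)")
print(f"accumulated: {accumulated:10.0f} tok/sec ({accumulated / baseline:.2f}x baseline)")
```

Note that `dt` grows by roughly 32x once gradient accumulation is on, but each step also processes 32x the tokens, which is why tok/sec is the number to compare.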

In [16]:
import math

def get_learning_rate(
    step: int,
    warmup_steps: int,
    max_steps: int,
    min_lr: float,
    max_lr: float,
) -> float:

    # 1) linear warmup for the first warmup_steps steps
    if step < warmup_steps:
        # the +1 is because for the 1st iteration no reason to multiply by 0
        return max_lr * (step + 1) / warmup_steps

    # 2) if step > max_steps, return the min learning rate
    if step > max_steps:
        return min_lr

    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1

    # coeff starts at 1 and goes to 0
    # note: this is cosine decay of the learning rate (a schedule), not weight decay
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))

    return min_lr + coeff * (max_lr - min_lr)
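
A quick way to sanity-check the schedule is to evaluate it at a few steps and compare against the `lr` values in the training logs. The sketch below restates the same logic as a standalone function so it runs on its own; `cosine_lr` is an illustrative name, and the defaults are the values this notebook uses (`warmup_steps=100`, `max_steps=1000`, `max_lr=6e-4`, `min_lr=6e-5`).

```python
import math


def cosine_lr(
    step: int,
    warmup_steps: int = 100,
    max_steps: int = 1000,
    min_lr: float = 6e-5,
    max_lr: float = 6e-4,
) -> float:
    """Linear warmup followed by cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)


# start of warmup, mid warmup, peak, midpoint of decay, final step
for step in (0, 49, 99, 100, 550, 999):
    print(f"step {step:4d} -> lr {cosine_lr(step):.4e}")
```

Steps 0, 49, and 999 should line up with the `lr` column logged for those steps in the TinyStories runs.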

In [17]:
# allow TF32 for float32 matmuls (trades a little precision for speed)
torch.set_float32_matmul_precision('high')

# now we'll try multiple batches
device = get_best_available_torch_device()

print(f'Using device: {device}')

# use nice number for vocab size
model = GPT(GPTConfig(vocab_size=50304))
model.to(device)

print("Compiling model...")
model = torch.compile(model)
print("Done compiling model")

# Karpathy: "AdamW is basically a bugfix of Adam"
#
# note: pretty good default learning rate for early experimentation
optimizer = model.configure_optimizers(
    weight_decay=0.1,
    learning_rate=max_lr,
    device=device.type,
)

for i in range(max_steps):

    t0 = time.time()

    optimizer.zero_grad()

    # gradient accumulation
    loss_accum = 0.0

    for micro_step in range(grad_accum_steps):

        x, y = train_loader.next_batch()

        x, y = x.to(device), y.to(device)

        # automatic mixed precision
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):

            logits, loss = model(x, y)

        # we have to scale the loss to account for gradient accumulation,
        # because the gradients just add on each successive backward().
        # addition of gradients corresponds to a SUM in the objective, but
        # instead of a SUM we want MEAN. Scale the loss here so it comes out right
        #
        # "accumulation in the gradients is equivalent to the sum in the loss"
        #
        # used small self contained version of just this chunk to debug
        # since the loss objects etc can be used in isolation
        loss = loss / grad_accum_steps
        loss_accum += loss.detach()
        loss.backward()

    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # determine and set the learning rate for this iteration
    lr = get_learning_rate(
        step=i,
        warmup_steps=warmup_steps,
        max_steps=max_steps,
        min_lr=min_lr,
        max_lr=max_lr,
    )

    # update optimizer
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    optimizer.step()

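    # wait for the GPU to finish work so the timing is accurate
    # (torch.cuda.synchronize only applies when running on CUDA)
    if device.type == 'cuda':
        torch.cuda.synchronize()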

    t1 = time.time()

    dt = t1 - t0 # time difference in seconds

    tokens_processed = train_loader.B * train_loader.T * grad_accum_steps
    tokens_per_sec = tokens_processed / dt

    print(f"| step {i:4d} | loss: {loss_accum:.4f} | lr {lr:.4e} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f} |")

Using device: cuda
Compiling model...
Done compiling model
num decayed parameter tensors: 50, with 124,354,560 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
| step    0 | loss: 10.9909 | lr 6.0000e-06 | norm: 20.5855 | dt: 29888.82ms | tok/sec: 17541.27 |
| step    1 | loss: 10.5371 | lr 1.2000e-05 | norm: 16.8769 | dt: 3211.99ms | tok/sec: 163228.63 |
| step    2 | loss: 10.0238 | lr 1.8000e-05 | norm: 10.4843 | dt: 3215.61ms | tok/sec: 163044.86 |
| step    3 | loss: 9.6534 | lr 2.4000e-05 | norm: 7.3438 | dt: 3216.31ms | tok/sec: 163009.13 |
| step    4 | loss: 9.3873 | lr 3.0000e-05 | norm: 5.1010 | dt: 3214.80ms | tok/sec: 163085.58 |
| step    5 | loss: 9.2134 | lr 3.6000e-05 | norm: 3.9777 | dt: 3216.05ms | tok/sec: 163022.56 |
| step    6 | loss: 9.1017 | lr 4.2000e-05 | norm: 3.5790 | dt: 3222.38ms | tok/sec: 162702.19 |
| step    7 | loss: 9.0170 | lr 4.8000e-05 | norm: 3.3834 | dt: 3221.52ms | tok/sec: 162745.71 |
| step   
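
The `loss / grad_accum_steps` scaling in the loop above is easy to get wrong, so it helps to check it in isolation. The following is a minimal sketch on a toy one-parameter least-squares problem in plain Python, not the notebook's own debugging snippet; `mse_grad` and the data are made up for illustration. It shows that summing scaled micro-batch gradients reproduces the full-batch mean gradient, while the unscaled sum is too large by a factor of `grad_accum_steps`.

```python
def mse_grad(w: float, xs: list[float], ys: list[float]) -> float:
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 3.9, 6.1, 8.0, 9.8, 12.2, 14.1, 15.9]

# reference: gradient of the mean loss over the whole batch at once
full_batch_grad = mse_grad(w, xs, ys)

micro_batch_size = 2
grad_accum_steps = len(xs) // micro_batch_size

scaled_sum = 0.0    # what the training loop accumulates (loss / grad_accum_steps)
unscaled_sum = 0.0  # what you would get without the scaling
for step in range(grad_accum_steps):
    start, stop = step * micro_batch_size, (step + 1) * micro_batch_size
    micro_grad = mse_grad(w, xs[start:stop], ys[start:stop])
    scaled_sum += micro_grad / grad_accum_steps
    unscaled_sum += micro_grad

print(f"full batch:   {full_batch_grad:.6f}")
print(f"scaled sum:   {scaled_sum:.6f}")
print(f"unscaled sum: {unscaled_sum:.6f}")
```

The equivalence relies on every micro-batch having the same size, which holds here because `total_batch_size` is asserted to be divisible by `B * T`.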

In [39]:
sample_model(
    prompt="Goog threw the ball to Lily. Lily threw it back to him.",
    num_samples=10,
    max_tokens=100,
    model=model,
    tokenizer=tokenizer,
    device=device,
)


 [0] > Goog threw the ball to Lily. Lily threw it back to him. It flew away fast. It landed on a pile of leaves. Sara and Ben ran to get the ball. They took it to Mom and Dad.
Mom and Dad looked at each other. They called them. They said, "That was a good kick. But you are glad you played nicely. When you are older, you can have a good time. We love you?"
Sara and Ben nodded. They

 [1] > Goog threw the ball to Lily. Lily threw it back to him. It hit her hard and she started to cry.
Sam stopped crying and ran to Lily. He hugged her and said, "It's okay, Lily. I'm sorry I didn't watch the ball. It's mean. Can we play together again?"
Lily looked at Sam. She looked at him and said, "No, we can't. The ball is too big. It might roll on us."

 [2] > Goog threw the ball to Lily. Lily threw it back to him. She landed on the ball. The ball flew through the air and landed on Ben's head. Ben fell back on the ball. He scraped his knee and cried.
Lily ran to Ben and picked him up. She felt sorry 