<a href="https://colab.research.google.com/github/b-schoen/gpt_from_scratch/blob/main/colab/gpt2_tinystories_custom_tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# clone repo
!rm -rf gpt_from_scratch
!git clone https://github.com/b-schoen/gpt_from_scratch.git

Cloning into 'gpt_from_scratch'...
remote: Enumerating objects: 328, done.
remote: Counting objects: 100% (328/328), done.
remote: Compressing objects: 100% (249/249), done.
remote: Total 328 (delta 175), reused 215 (delta 72), pack-reused 0 (from 0)
Receiving objects: 100% (328/328), 4.88 MiB | 6.90 MiB/s, done.
Resolving deltas: 100% (175/175), done.


In [2]:
# change into the repo directory
import os

os.chdir('gpt_from_scratch')

print("Current Working Directory:", os.getcwd())

Current Working Directory: /content/gpt_from_scratch


In [3]:
# now we can operate as if this was a local notebook

In [4]:
%load_ext autoreload
%autoreload 2

## Download dataset locally

In [5]:
# let's load tinystories for comparison
from gpt_from_scratch.dataset_loaders import tinystories_loader

tinystories_version = tinystories_loader.TinyStoriesVersion.V2

tinystories_filepaths = tinystories_loader.download_tinystories(tinystories_version)

# read train as input text
input_text = tinystories_filepaths.train.read_text()

Downloading TinyStoriesV2-GPT4-train.txt...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloaded TinyStoriesV2-GPT4-train.txt to /root/.cache/huggingface/hub/datasets--roneneldan--TinyStories/snapshots/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/TinyStoriesV2-GPT4-train.txt
Downloading TinyStoriesV2-GPT4-valid.txt...
Downloaded TinyStoriesV2-GPT4-valid.txt to /root/.cache/huggingface/hub/datasets--roneneldan--TinyStories/snapshots/f54c09fd23315a6f9c86f9dc80f725de7d8f9c64/TinyStoriesV2-GPT4-valid.txt


In [6]:
# note: if this gets annoying can do an actual pip install requirements
!pip install tiktoken
!pip install jaxtyping
!pip install colored



In [7]:
import time

import tiktoken

from gpt_from_scratch.gpt2_from_scratch import data_loader
from gpt_from_scratch.gpt2_from_scratch.train_gpt2 import (
    GPT,
    GPTConfig,
    get_best_available_torch_device,
)

import torch
import torch.optim
import torch.nn as nn
import torch.nn.functional as F

## Sampling

In [8]:
# sample some outputs to get an idea of where we are

from typing import TYPE_CHECKING

if TYPE_CHECKING:
  from gpt_from_scratch import tokenizer_utils

def sample_model(
    prompt: str,
    num_samples: int,
    max_tokens: int,
    model: nn.Module,
    tokenizer: 'tokenizer_utils.Tokenizer',
    device: torch.device,
    stop_token: str | None = None,
) -> None:

    # tokenize
    tokens = tokenizer.encode(prompt)
    tokens = torch.tensor(tokens, dtype=torch.long)

    tokens = tokens.unsqueeze(0).repeat(num_samples, 1) # (num_samples, prompt_len)

    # tokens in this case is just the prompt, and is small enough to fit on GPU
    x = tokens.to(device)

    while x.size(1) < max_tokens:

        # forward the model to get the logits
        with torch.no_grad():

            logits, loss = model(x) # (B, T, vocab_size)

            # take the logits at the last position
            # throw away all the logits from things other than the last position
            logits = logits[:, -1, :] # (B, vocab_size)

            # get the probabilities
            probs = F.softmax(logits, dim=-1)

            # do top-k sampling with k=50 (huggingface pipeline default)
            # topk_probs and topk_indices both have shape (B, 50)
            #
            # "anything lower than the 50th, we clamp to 0 and never sample it"
            #
            topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)

            # select a token from the top-k probabilities
            # note: multinomial does not demand the input to sum to 1
            ix = torch.multinomial(topk_probs, 1) # (B, 1)

            # gather the corresponding indices
            xcol = torch.gather(topk_indices, -1, ix) # (B, 1)

            # append to the sequence
            x = torch.cat((x, xcol), dim=1)

    # print the generated text
    for i in range(num_samples):

        tokens = x[i, :max_tokens].tolist()

        decoded = tokenizer.decode(tokens)

        # cut off at the first stop token
        if stop_token and stop_token in decoded:
            position_of_stop_token = decoded.find(stop_token)
            decoded = decoded[:position_of_stop_token]

        print(f"\n [{i}] >", decoded)
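The top-k step in `sample_model` can be illustrated without a model. This is a minimal pure-Python sketch, not the notebook's implementation: `softmax` and `sample_top_k` are hypothetical helpers, and `random.choices` stands in for `torch.multinomial` (both accept unnormalized weights).

```python
import math
import random


def softmax(logits: list[float]) -> list[float]:
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_k(logits: list[float], k: int, rng: random.Random) -> int:
    probs = softmax(logits)
    # indices of the k most probable tokens; everything else is never sampled
    top_indices = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_probs = [probs[i] for i in top_indices]
    # weights do not need to sum to 1, mirroring the multinomial call above
    return rng.choices(top_indices, weights=top_probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
samples = [sample_top_k(logits, k=2, rng=rng) for _ in range(1000)]

# only the two highest-logit token ids (3 and 0) can ever appear
print(sorted(set(samples)))
```

With `k=2` only the two highest-logit ids can be drawn, which is the same clamping behavior the quoted comment describes for `k=50`.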

## Data Loading

In [9]:
print('\n--- First 1000 characters: ---\n')
print(input_text[:1000])

# print('\n--- Last 1000 characters: ---\n')
# print(input_text[-1000:])


--- First 1000 characters: ---


Once upon a time there was a little boy named Ben. Ben loved to explore the world around him. He saw many amazing things, like beautiful vases that were on display in a store. One day, Ben was walking through the store when he came across a very special vase. When Ben saw it he was amazed!  
He said, “Wow, that is a really amazing vase! Can I buy it?” 
The shopkeeper smiled and said, “Of course you can. You can take it home and show all your friends how amazing it is!”
So Ben took the vase home and he was so proud of it! He called his friends over and showed them the amazing vase. All his friends thought the vase was beautiful and couldn't believe how lucky Ben was. 
And that's how Ben found an amazing vase in the store!
<|endoftext|>
Once upon a time, there was a reliable otter named Ollie. He lived in a river with his family. They all loved to play and swim together.
One day, Ollie's mom said, "Ollie, hurry and get some fish for dinner!" Ollie swam f

In [10]:
from gpt_from_scratch import python_utils

# 2,717,700 stories in the full training split
#
# 9/10 of that, rounded to the closest power of two -> 2,097,152 stories -> ~750M words
num_samples = len(input_text.split('<|endoftext|>'))

print(f'{num_samples=}')

# arbitrarily choosing 9/10 as scale factor (we use this so it's easier to experiment with smaller chunks without changing much code)
num_samples = int(num_samples * 0.9)

print(f'{num_samples=} after scaling')

num_samples = python_utils.closest_power_of_two(num_samples)

print(f'{num_samples=} after choosing closest power of 2')

num_samples=2717700
num_samples=2445930 after scaling
num_samples=2097152 after choosing closest power of 2
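The rounding step above can be checked by hand. The real `python_utils.closest_power_of_two` is not shown in this notebook, so the function below is only a guess at its behavior (including the assumption that ties round down); it does reproduce the printed value.

```python
def closest_power_of_two(n: int) -> int:
    """Hypothetical stand-in for python_utils.closest_power_of_two."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    lower = 1 << (n.bit_length() - 1)  # largest power of two <= n
    upper = lower << 1                 # smallest power of two > n
    # assumption: ties go to the lower power
    return lower if (n - lower) <= (upper - n) else upper


# 2,445,930 sits between 2**21 and 2**22 and is much closer to 2**21
print(closest_power_of_two(2445930))
```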


In [11]:
# clip the input text at number of samples
input_text = python_utils.get_first_n_examples(input_text, n=num_samples, delimiter='<|endoftext|>')
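For reference, clipping a delimited corpus to its first `n` examples might look like the sketch below. `first_n_examples` is a hypothetical name and the actual `python_utils.get_first_n_examples` may differ, for example in whether the final delimiter is kept. Scanning with `str.find` avoids materializing millions of substrings the way `split` would.

```python
def first_n_examples(text: str, n: int, delimiter: str) -> str:
    """Return the prefix of `text` containing the first `n` delimited examples."""
    pos = 0
    for _ in range(n):
        idx = text.find(delimiter, pos)
        if idx == -1:
            # fewer than n delimiters: keep everything
            return text
        pos = idx + len(delimiter)
    # assumption: keep the trailing delimiter of the n-th example
    return text[:pos]


stories = "a<|endoftext|>b<|endoftext|>c<|endoftext|>"
print(first_n_examples(stories, n=2, delimiter="<|endoftext|>"))
```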

In [12]:
# we'll trim down the dataset to something that loads quickly

In [13]:
# create tokenizer
from gpt_from_scratch import (
    byte_pair_encoding_tokenizer,
    file_utils,
    tokenizer_utils,
)

# load pretrained tokenizer
tokenizer_filepath = 'tokenizer_bin/tokenizer__vocab_2048_samples_100000_dataset_tinystories.pkl'
tokenizer = file_utils.deserialize_dataclass_from_pickle_file(
    cls=byte_pair_encoding_tokenizer.BytePairEncodingWordTokenizer,
    file_path=tokenizer_filepath,
)
print(f"Loaded tokenizer from {tokenizer_filepath}")

Loaded tokenizer from tokenizer_bin/tokenizer__vocab_2048_samples_100000_dataset_tinystories.pkl


In [14]:
# make sure we can tokenize some example text
tokenizer_utils.show_token_mapping(tokenizer, "Jack and Jill were doing mechinterp research")

Splitting text into words via regex...


Encoding words as tokens: 100%|██████████| 13/13 [00:00<00:00, 62102.45it/s]


Input:		Jack and Jill were doing mechinterp research
Splitting text into words via regex...


Encoding words as tokens: 100%|██████████| 13/13 [00:00<00:00, 53667.28it/s]

Tokenized:	Jack and Jill were doing mechinterp research
Token ID | Token Bytes | Token String
---------+-------------+--------------
     957 | 4A 61 63 6B | 'Jack'
          Jack and Jill were doing mechinterp research
          U+004A LATIN CAPITAL LETTER J (1 bytes: 4A)
          U+0061 LATIN SMALL LETTER A (1 bytes: 61)
          U+0063 LATIN SMALL LETTER C (1 bytes: 63)
          U+006B LATIN SMALL LETTER K (1 bytes: 6B)
      32 | 20          | ' '
          Jack and Jill were doing mechinterp research
          U+0020 SPACE (1 bytes: 20)
     263 | 61 6E 64    | 'and'
          Jack and Jill were doing mechinterp research
          U+0061 LATIN SMALL LETTER A (1 bytes: 61)
          U+006E LATIN SMALL LETTER N (1 bytes: 6E)
          U+0064 LATIN SMALL LETTER D (1 bytes: 64)
      32 | 20          | ' '
          Jack and Jill were doing mechinterp research
          U+0020 SPACE (1 bytes: 20)
    1890 | 4A 69 6C 6C | 'Jill'
          Jack and Jill were doing mechinterp research




In [15]:
# tokenize input text
# note: tiktoken's encoder is implemented in Rust (lib.rs), so it is much faster than this tokenizer
print(f'Tokenizing input text of length: {len(input_text)}')
tokens = tokenizer.encode(input_text)
tokens = torch.tensor(tokens, dtype=torch.long)
print('Finished tokenizing input text')

Tokenizing input text of length: 1718216655
Splitting text into words via regex...


Encoding words as tokens: 100%|██████████| 746873790/746873790 [49:25<00:00, 251891.81it/s]


Finished tokenizing input text


In [16]:
len(tokenizer.vocab)

2048

In [17]:
len(tokenizer.merges)

2048

In [18]:
# move to GPU, since we can fit it for this dataset
device = get_best_available_torch_device()

tokens = tokens.to(device)

In [19]:
# TODO(bschoen): Shrink other parameters from `config` to match TinyStories paper

# vocab_size = 50304 # note: nice number after ~52,000 initially used by GPT-2

# TODO(bschoen): The tokenizer vocab is really 2048 + 1 special token, so 4096 over-allocates
vocab_size = 4096

# load text via dataloader
# TODO(bschoen): Why do we pick this?
total_batch_size = 524288 # 2**19, ~0.5M, in number of tokens

# let's pick number of tokens closest power of two (above so we get all tokens)
# total_batch_size = python_utils.next_power_of_two(len(tokens))

B = 128 # micro batch size
# T = 1024 # sequence length (from GPT-2)
T = 512 # sequence length (matches tinystories paper)

assert total_batch_size % (B * T) == 0, "make sure total_batch_size is divisible by B * T"

# compute what our gradient accumulation should be
grad_accum_steps = total_batch_size // (B * T)

print(f'total length of input text (in characters): {len(input_text)}')
print(f'total number of tokens: {len(tokens)}')
print(f"total desired batch size: {total_batch_size}")
print(f"=> calculated gradient accumulation steps: {grad_accum_steps}")

# create a train loader that will continually give us new batches
train_loader = data_loader.DataLoaderLite(B=B, T=T, tokens=tokens)
# train_loader = DataLoaderLiteBasedOnPytorch(B=B, T=T, tokens=tokens)
# pytorch_train_data_loader = train_loader.get_dataloader()

# note: these are computed based on data loading

# want to make it through all of our tokens

# this seems way too low @ 100, thus the override
max_steps = 10000
# max_steps = len(tokens) // total_batch_size

# chosen fairly arbitrarily
# TODO(bschoen): GPT-2 seems to do this as a fraction of tokens (proportional)
warmup_steps = int(max_steps * 0.1)

# learning rate
max_lr = 6e-4
min_lr = max_lr * 0.1

print(f'| {max_steps=} | {warmup_steps=} | {max_lr=:.6f} | {min_lr=:.6f} |')

total length of input text (in characters): 1718216655
total number of tokens: 786770743
total desired batch size: 524288
=> calculated gradient accumulation steps: 8
loaded 786770743 tokens
1 epoch = 12005 batches (steps to make one pass through data)
| max_steps=10000 | warmup_steps=1000 | max_lr=0.000600 | min_lr=0.000060 |
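The numbers printed above follow from plain integer arithmetic. A quick check using only values that appear in this cell and its output (the variable names on the left are just for illustration):

```python
total_batch_size = 524288  # tokens per optimizer step (2**19)
B, T = 128, 512            # micro batch size, sequence length
num_tokens = 786770743     # from the output above
max_steps = 10000

tokens_per_micro_batch = B * T
grad_accum_steps = total_batch_size // tokens_per_micro_batch

# "1 epoch = 12005 batches" counts micro batches of B * T tokens
micro_batches_per_epoch = num_tokens // tokens_per_micro_batch

# optimizer steps per pass, i.e. what the commented-out max_steps formula gives
optimizer_steps_per_epoch = num_tokens // total_batch_size

# passes over the data implied by the max_steps override
epochs = max_steps * total_batch_size / num_tokens

print(f"{tokens_per_micro_batch=}")
print(f"{grad_accum_steps=}")
print(f"{micro_batches_per_epoch=}")
print(f"{optimizer_steps_per_epoch=}")
print(f"epochs={epochs:.2f}")
```

So the `max_steps = 10000` override corresponds to roughly 6.7 passes over the tokenized data, versus 1500 optimizer steps for a single pass.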


In [20]:
# Initial layer dominates pretty much everything
#
# Decrease your batch size until things fit
# By default you want to max it out with nice numbers
#
# ... + switching over to tinystories
#
#   | step   49 | loss: 4.6334 | lr 6.0832e-05 | norm: 0.3634 | dt: 3104.65ms | tok/sec: 168872.06 |
#
#   * interestingly the same tokens per second
#
# ... + (B=16) (since was running out of GPU space)
#
#   | step   49 | loss: 4.2973 | lr 3.0000e-04 | norm: 1.3571 | dt: 3232.42ms | tok/sec: 162196.82 |
#   ...
#   | step  999 | loss: 1.1815 | lr 6.0002e-05 | norm: 0.3806 | dt: 3234.65ms | tok/sec: 162084.83 |
#
# ... + custom tokenizer
#
#   | step   49 | loss: 3.2605 | lr 6.0658e-05 | norm: 0.1714 | dt: 2323.44ms | tok/sec: 225651.47 |
#
# ... + custom data loader
#
#   | step   49 | loss: 3.2779 | lr 6.0658e-05 | norm: 0.1305 | dt: 4052.50ms | tok/sec: 129373.97 |
#
#   * literally slower, reverting in favor of just putting everything on the GPU since can fit it for this dataset
#
# ... + moving everything in input dataset to GPU first
#
#   | step   49 | loss: 3.1756 | lr 3.0000e-04 | norm: 2.3773 | dt: 2296.51ms | tok/sec: 228297.49 |
#
# ... + increasing microbatch size to 64
#
#   | step   49 | loss: 3.1903 | lr 3.0000e-04 | norm: 1.8769 | dt: 2132.62ms | tok/sec: 245841.70 |
#   ...
#   | step  999 | loss: 0.8000 | lr 6.0002e-05 | norm: 0.2571 | dt: 2136.10ms | tok/sec: 245441.52 |
#
# ... + context size 512
#
#   | step   36 | loss: 3.3956 | lr 1.7902e-04 | norm: 0.5462 | dt: 2095.98ms | tok/sec: 250139.40 |
#
# ... + batch size 128
#
#   | step   21 | loss: 3.5883 | lr 4.4836e-04 | norm: 1.9053 | dt: 2023.95ms | tok/sec: 259041.59 |
#
#

In [21]:
import math

def get_learning_rate(
    step: int,
    warmup_steps: int,
    max_steps: int,
    min_lr: float,
    max_lr: float,
  ) -> float:

    # 1) linear warmup for warmup_steps steps
    if step < warmup_steps:
        # the +1 is because for the 1st iteration no reason to multiply by 0
        return max_lr * (step + 1) / warmup_steps

    # 2) if step > max_steps, return min learning rate
    if step > max_steps:
        return min_lr

    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1

    # coeff starts at 1 and goes to 0
    # note: this is cosine decay of the learning rate (weight decay is separate, set in the optimizer)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))

    return min_lr + coeff * (max_lr - min_lr)
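To sanity-check the schedule, it helps to evaluate it at the boundaries. The block below repeats the function so it runs standalone and plugs in this notebook's values (`max_steps=10000`, `warmup_steps=1000`, `max_lr=6e-4`, `min_lr=6e-5`).

```python
import math


def get_learning_rate(step, warmup_steps, max_steps, min_lr, max_lr):
    # same logic as the cell above, copied so this check is self-contained
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    decay_ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)


kwargs = dict(warmup_steps=1000, max_steps=10000, min_lr=6e-5, max_lr=6e-4)

# start of warmup, end of warmup, midpoint of decay, end of decay
for step in (0, 999, 5500, 10000):
    print(f"step {step:5d} -> lr {get_learning_rate(step, **kwargs):.4e}")
```

Step 0 gives `max_lr / warmup_steps`, the peak is reached on the last warmup step, the decay midpoint sits halfway between `max_lr` and `min_lr`, and the final step lands on `min_lr`.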

In [22]:
import os

# reset unused CUDA memory
# torch.cuda.empty_cache()

# allow TF32 for float32 matmuls (faster, slightly lower precision)
torch.set_float32_matmul_precision('high')

# now we'll try multiple batches
device = get_best_available_torch_device()

print(f'Using device: {device}')

print("Creating model...")
config = GPTConfig(
    vocab_size=vocab_size,
    block_size=T,
)

model = GPT(config)
model.to(device)

Using device: cuda
Creating model...


GPT(
  (transformer): ModuleDict(
    (wte): Embedding(4096, 768)
    (wpe): Embedding(512, 768)
    (h): ModuleList(
      (0-11): 12 x Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): MLP(
          (c_fc): Linear(in_features=768, out_features=3072, bias=True)
          (gelu): GELU(approximate='tanh')
          (c_proj): Linear(in_features=3072, out_features=768, bias=True)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=4096, bias=False)
)

In [23]:
print("Compiling model...")
model = torch.compile(model)
print("Done compiling model")

Compiling model...
Done compiling model


In [24]:
# Karpathy: "AdamW is basically a bugfix of Adam"
#
# note: pretty good default learning rate for early experimentation
optimizer = model.configure_optimizers(
    weight_decay=0.1,
    learning_rate=max_lr,
    device=device.type,
)

num decayed parameter tensors: 50, with 88,473,600 parameters
num non-decayed parameter tensors: 98, with 121,344 parameters
using fused AdamW: True
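The two parameter counts can be reproduced from the layer shapes in the model printout. This sketch rests on two assumptions not shown in this notebook: `lm_head` shares its weight with `wte` (the 50-tensor count is consistent with that), and weight decay applies only to tensors with two or more dimensions.

```python
n_layer, n_embd, vocab_size, block_size = 12, 768, 4096, 512

# 2D tensors (assumed to receive weight decay)
embeddings = vocab_size * n_embd + block_size * n_embd  # wte + wpe
per_block_weights = (
    n_embd * 3 * n_embd    # attn.c_attn weight
    + n_embd * n_embd      # attn.c_proj weight
    + n_embd * 4 * n_embd  # mlp.c_fc weight
    + 4 * n_embd * n_embd  # mlp.c_proj weight
)
decayed_params = embeddings + n_layer * per_block_weights
decayed_tensors = 2 + n_layer * 4

# 1D tensors: biases and LayerNorm weight/bias (assumed no weight decay)
per_block_1d = (
    2 * n_embd    # ln_1 weight + bias
    + 3 * n_embd  # attn.c_attn bias
    + n_embd      # attn.c_proj bias
    + 2 * n_embd  # ln_2 weight + bias
    + 4 * n_embd  # mlp.c_fc bias
    + n_embd      # mlp.c_proj bias
)
non_decayed_params = n_layer * per_block_1d + 2 * n_embd  # + ln_f
non_decayed_tensors = n_layer * 8 + 2

print(f"{decayed_tensors=} {decayed_params=:,}")
print(f"{non_decayed_tensors=} {non_decayed_params=:,}")
```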


In [25]:
for i in range(max_steps):

    t0 = time.time()

    optimizer.zero_grad()

    # gradient accumulation
    loss_accum = 0.0

    for micro_step in range(grad_accum_steps):

        # print(f' - {micro_step=}')
        x, y = train_loader.next_batch()

        x, y = x.to(device), y.to(device)

        # automatic mixed precision
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):

          logits, loss = model(x, y)

        # we have to scale the loss to account for gradient accumulation,
        # because the gradients just add on each successive backward().
        # addition of gradients corresponds to a SUM in the objective, but
        # instead of a SUM we want MEAN. Scale the loss here so it comes out right
        #
        # "accumulation in the gradients is equivalent to the sum in the loss"
        #
        # used small self contained version of just this chunk to debug
        # since the loss objects etc can be used in isolation
        loss = loss / grad_accum_steps
        loss_accum += loss.detach()
        loss.backward()

    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # determine and set the learning rate for this iteration
    lr = get_learning_rate(
        step=i,
        warmup_steps=warmup_steps,
        max_steps=max_steps,
        min_lr=min_lr,
        max_lr=max_lr,
    )

    # update optimizer
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    optimizer.step()

    if device.type == 'cuda':
        torch.cuda.synchronize() # wait for the GPU to finish work

    t1 = time.time()

    dt = t1 - t0 # time difference in seconds

    tokens_processed = train_loader.B * train_loader.T * grad_accum_steps
    tokens_per_sec = tokens_processed / dt

    print(f"| step {i:4d} | loss: {loss_accum:.4f} | lr {lr:.4e} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f} |")

| step    0 | loss: 8.5812 | lr 6.0000e-07 | norm: 95.2466 | dt: 28760.22ms | tok/sec: 18229.62 |


OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 
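The `loss / grad_accum_steps` scaling in the loop above can be verified on a toy problem. This is a pure-Python sketch with a one-parameter model and made-up data, not the GPT model: summing the scaled micro-batch gradients reproduces the full-batch gradient, provided the micro batches are equally sized.

```python
import math


def grad_mse(w: float, xs: list[float], ys: list[float]) -> float:
    # d/dw of mean((w * x - y) ** 2) over the given examples
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# gradient of the mean loss over the whole batch at once
full_grad = grad_mse(w, xs, ys)

grad_accum_steps = 4
micro = len(xs) // grad_accum_steps

scaled_accum = 0.0    # what the training loop computes
unscaled_accum = 0.0  # what would happen without the division
for s in range(grad_accum_steps):
    lo, hi = s * micro, (s + 1) * micro
    g = grad_mse(w, xs[lo:hi], ys[lo:hi])
    scaled_accum += g / grad_accum_steps
    unscaled_accum += g

print(f"{full_grad=:.6f} {scaled_accum=:.6f} {unscaled_accum=:.6f}")
```

Without the division, the accumulated gradient is `grad_accum_steps` times too large, which is the SUM-versus-MEAN issue the comments in the loop describe.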

In [None]:
sample_model(
    # example from validation set
    prompt="""Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.""",
    num_samples=10,
    max_tokens=300,
    model=model,
    tokenizer=tokenizer,
    device=device,
    stop_token='<|endoftext|>',
)



In [None]:
# A lot of these with the custom tokenizer are actually pretty good
"""
 [0] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. They looked all around the pit, but they could not find it. Then, something unexpected happened. A little bird flew down and said, "I found your red ball!" Tom and Sam were happy and surprised. They looked at the bird and had a new friend to play with.
<|endoftext|>

 [1] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. They looked near the trees and under the big tree. Finally, they found the red ball near a small pond. Tom was so happy! He said, "Thank you, Sam! You are a good friend."
From that day on, Tom and Sam played in the park with the red ball. They were always happy and had lots of fun.
<|endoftext|>

 [2] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. Sam nodded and said, "Lets search together." They looked in the pit, under the leaves, and in the yard. And now, they found the red ball in the pit. Tom was so happy, and he thanked Sam for coming to play. They sat on the pit and played together.
<|endoftext|>

 [3] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. They looked in the park, in the grass, and behind the rock. They could not find the ball either. They were very sad.
Then, Tom had an idea. He asked his friend, Tom, if they could search for the red ball. They looked in his house, behind a tree, and in the yard. With Toms help, Tom searched very high and low. Finally, they found the red ball! Tom was so happy and thanked Tom. From that day on, Tom and Tom played with the red ball all by the pit together.
<|endoftext|>

 [4] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. They looked around the big pit and near the big pond. At first, they found a note under a tree. The note said, "This ball is magic!"
Tom and Sam looked all around the pit. Then, they saw a big red balloon near the pond. Tom picked up the big red balloon. Suddenly, the big red balloon popped! Inside the balloon, there were many small balls with the red balloon inside.
Tom and Sam were very happy to have a new toy balloon. They played with the balls all day, taking turns with the magic balloon from the pit. When they played with them, they were not hungry at the red balloon.
<|endoftext|>

  ...

 [7] > Once upon a time, in a warm and sunny place, there was a big pit. A little boy named Tom liked to play near the pit. One day, Tom lost his red ball. He was very sad.
Tom asked his friend, Sam, to help him search for the ball. Sam said, "Dont worry, Tom. I will help you find it." They looked under the trees and behind the trees. They did not find the red ball.
At the end of the day, they found the red ball. It was behind a big tree. Tom was very happy. He hugged his red ball and said, "Thank you, Sam. You are very kind." Sam smiled and said, "Youre welcome! Come back to my home, Tom."
<|endoftext|>
"""