<a href="https://colab.research.google.com/github/anjelammcgraw/Unsupervised-Pre-Training-of-a-GPT-Style-Model-Shakespeare-Generative-Model/blob/main/2_Shakespeare_Generative_Model_from_Scratch_Unsupervised_Pre_Training_of_GPT_Style_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Pre-Training of GPT-Style Model

In this notebook, we'll pre-train a GPT-style model from scratch using unsupervised next-token prediction.

The base model we'll use is Andrej Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT).

All of the model code can be found in the [`model.py`](https://github.com/karpathy/nanoGPT/blob/master/model.py) file!

## Data Selection

We'll be using a toy dataset called `tinyshakespeare`.

You could extend this example to use the [OpenWebText](https://skylion007.github.io/OpenWebTextCorpus/) dataset, which was used to pre-train GPT-2.

> NOTE: Training LLMs can take a very long time. To get results similar to the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), you will need 8xA100s and ~4-5 days of training with a parallelized strategy (DDP) on the OpenWebText corpus.

In [None]:
!git clone https://github.com/karpathy/nanoGPT.git

Cloning into 'nanoGPT'...
remote: Enumerating objects: 649, done.
remote: Total 649 (delta 0), reused 0 (delta 0), pack-reused 649
Receiving objects: 100% (649/649), 936.45 KiB | 2.57 MiB/s, done.
Resolving deltas: 100% (371/371), done.


## Dependencies

In [None]:
!pip install tiktoken requests cohere openai -qU

Download the dataset!

In [None]:
import os
import requests
import tiktoken
import numpy as np

current_path = "/data/shakespeare"
data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'

if not os.path.exists(current_path):
    os.makedirs(current_path)

# download the tiny shakespeare dataset
input_file_path = os.path.join(os.path.dirname(current_path), 'input.txt')
if not os.path.exists(input_file_path):

    with open(input_file_path, 'w') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r') as f:
    data = f.read()

n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

Tokenizers

In [None]:
!pip install tokenizers -qU


In [None]:
input_text = """
After pre-tokenization, a set of unique words has been created and the frequency with which each word occurred in the training data has been determined. Next, BPE creates a base vocabulary consisting of all symbols that occur in the set of unique words and learns merge rules to form a new symbol from two symbols of the base vocabulary. It does so until the vocabulary has attained the desired vocabulary size. Note that the desired vocabulary size is a hyperparameter to define before training the tokenizer.
"""

naive_word_list = input_text.split()

Counting words for frequency.

In [None]:
from collections import defaultdict

vocab_and_frequencies = defaultdict(int)

for word in naive_word_list:
  vocab_and_frequencies[" ".join(list(word))] += 1

sorted(vocab_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[('t h e', 8), ('a', 4), ('o f', 4), ('v o c a b u l a r y', 4), ('h a s', 3)]

Find base vocabulary

In [None]:
from typing import Dict, Tuple, List, Set

def find_vocabulary_size(current_vocab: Dict[str, int]) -> int:
  vocab = set()

  for word in current_vocab.keys():
    for subword in word.split():
      vocab.add(subword)

  return len(vocab)

In [None]:
find_vocabulary_size(vocab_and_frequencies)

34

Now we can start constructing our pairs.

In [None]:
def find_pairs_and_frequencies(current_vocab: Dict[str, int]) -> Dict[Tuple[str, str], int]:
  pairs = {}

  for word, frequency in current_vocab.items():
    symbols = word.split()

    for i in range(len(symbols) - 1):
      pair = (symbols[i], symbols[i + 1])
      current_frequency = pairs.get(pair, 0)
      pairs[pair] = current_frequency + frequency

  return pairs

In [None]:
pairs_and_frequencies = find_pairs_and_frequencies(vocab_and_frequencies)

In [None]:
sorted(pairs_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[(('t', 'h'), 11),
 (('i', 'n'), 10),
 (('r', 'e'), 8),
 (('h', 'e'), 8),
 (('a', 't'), 7)]

Merge pairs into a single token.

In [None]:
import re

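def merge_vocab(most_common_pair: Tuple[str, str], current_vocab: Dict[str, int]) -> Dict[str, int]:
  vocab_out = {}

  # match the pair only on whole-symbol boundaries, so that ('t', 'h')
  # cannot match the tail of an already-merged 'at' followed by 'h'
  pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(most_common_pair)) + r'(?!\S)')
  replacement = ''.join(most_common_pair)

  for word_in in current_vocab:
      word_out = pattern.sub(replacement, word_in)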
      vocab_out[word_out] = current_vocab[word_in]

  return vocab_out

In [None]:
new_vocab_and_frequencies = merge_vocab(
    sorted(pairs_and_frequencies.items(), key = lambda x: x[1], reverse=True)[0][0],
    vocab_and_frequencies
)

In [None]:
sorted(new_vocab_and_frequencies.items(), key = lambda x: x[1], reverse=True)[:5]

[('th e', 8), ('a', 4), ('o f', 4), ('v o c a b u l a r y', 4), ('h a s', 3)]

In [None]:
find_vocabulary_size(new_vocab_and_frequencies)

35
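
The single merge above is one step of BPE training. As a rough sketch, the full procedure repeats "count pairs, merge the most frequent pair" until a target number of merges is reached. The helper names (`count_pairs`, `merge_pair`, `train_bpe`) and the toy word-frequency table below are illustrative assumptions, not part of the notebook or of any library.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def count_pairs(vocab: Dict[str, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Dict[str, int]) -> Dict[str, int]:
    """Fuse every whole-symbol occurrence of `pair` into a single symbol."""
    # The lookarounds keep the match aligned to symbol boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}


def train_bpe(vocab: Dict[str, int], num_merges: int):
    """Apply up to `num_merges` merges; return the new vocab and the merge rules."""
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return vocab, merges


# Hypothetical toy table: space-separated characters -> word frequency.
toy_vocab = {"t h e": 8, "t h a t": 3, "o f": 4}
final_vocab, merges = train_bpe(toy_vocab, num_merges=2)
print(merges)
print(final_vocab)
```

In a real tokenizer the loop stops when the vocabulary reaches the desired size, which is the hyperparameter mentioned in the sample text.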

## Training Our Tokenizer


1. Initialize our `Tokenizer` with a `BPE` model. Be sure to include the `unk_token`.

2. We'll add a normalizer: a `Sequence` that applies `NFD()` Unicode normalization.

3. We'll also add our `ByteLevel()` pre-tokenizer, and our `ByteLevelDecoder()` decoder.

In [None]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFD, Sequence
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD()])
tokenizer.pre_tokenizer = ByteLevel()
tokenizer.decoder = ByteLevelDecoder()

In [None]:
trainer = BpeTrainer(
    vocab_size=50000,
    show_progress=True,
    special_tokens=[
      "<s>",
      "<pad>",
      "</s>",
      "<unk>",
      "<mask>"
    ]
)

In [None]:
tokenizer.train(files=[input_file_path], trainer=trainer)

Save tokenizer and then load it as a `GPT2Tokenizer` through the Hugging Face Library!

In [None]:
save_path = '/content/tokenizer'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.model.save(save_path)

['/content/tokenizer/vocab.json', '/content/tokenizer/merges.txt']

In [None]:
!pip install transformers -qU


In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained(save_path, unk_token="[UNK]")

Tokenizing Inputs

In [None]:
input_sentence = "Hark, my name be Romeo! I am but a beautiful summer's day!"

In [None]:
tokenized_sentence = tokenizer.tokenize(input_sentence)
tokenized_sentence

['Hark',
 ',',
 'Ġmy',
 'Ġname',
 'Ġbe',
 'ĠRomeo',
 '!',
 'ĠI',
 'Ġam',
 'Ġbut',
 'Ġa',
 'Ġbeautiful',
 'Ġsummer',
 "'s",
 'Ġday',
 '!']

In [None]:
encoded_tokens = tokenizer.convert_tokens_to_ids(tokenized_sentence)
encoded_tokens

[12077, 9, 124, 637, 121, 826, 5, 87, 295, 219, 72, 9113, 2999, 141, 511, 5]

In [None]:
decoded_tokens = tokenizer.decode(encoded_tokens, clean_up_tokenization_spaces=False)
decoded_tokens

"Hark, my name be Romeo! I am but a beautiful summer's day!"

## Tokenizing Dataset

Create a dataset we can leverage with the `nanoGPT` library.


In [None]:
train_ids = tokenizer.encode(train_data)
val_ids = tokenizer.encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

train has 291,284 tokens
val has 34,223 tokens


In [None]:
# export to bin files
data_path = "/data/shakespeare/"

train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(data_path), 'train.bin'))
val_ids.tofile(os.path.join(os.path.dirname(data_path), 'val.bin'))
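
Storing IDs as `uint16` only works while every token ID is below 65,536, and each token then takes two bytes on disk. The sketch below checks both properties on a small, made-up list of IDs written to a temporary directory; the IDs and the file location are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical token IDs standing in for `train_ids`.
ids = [21, 388, 876, 13, 68, 6804]

# uint16 holds 0..65535, so every ID must fit below 2**16.
assert max(ids) < 2**16

with tempfile.TemporaryDirectory() as tmp_dir:
    bin_path = os.path.join(tmp_dir, "train.bin")
    np.array(ids, dtype=np.uint16).tofile(bin_path)

    # Two bytes per token on disk.
    file_size = os.path.getsize(bin_path)

    # Read the raw bytes back with the same dtype to recover the IDs.
    restored = np.fromfile(bin_path, dtype=np.uint16).tolist()

print(file_size, restored)
```

A vocabulary larger than 65,535 entries would need a wider dtype when writing, and the same dtype when reading the file back.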

In [None]:
train_ids[:100]

array([  21,  388,  876,   13,   68, 6804,  373,  153, 2501,  622, 2092,
          9,  496,  136,  433,   11,   68,   68,   16,   89,   13,   68,
         34, 7882,    9,  433,   11,   68,   68,   21,  388,  876,   13,
         68,   40,   73,  252,  227, 3778, 1304,  103,  781,  351,  103,
       7504,   15,   68,   68,   16,   89,   13,   68,   33,   97, 5790,
         11, 3778,   11,   68,   68,   21,  388,  876,   13,   68,   21,
        388,    9,  104,  330, 3317, 1177,  145, 3563, 1766,  103,   80,
       1006,   11,   68,   68,   16,   89,   13,   68, 7797,  330,  486,
          9,  153,  330,  486,   11,   68,   68,   21,  388,  876,   13,
         68], dtype=uint16)

### 🏗️ Activity:

Write Python code that will return the first 100 tokens as text.

> HINT: An example of this code was used above!

In [None]:
decoded_tokens = tokenizer.decode(train_ids[:100])
print(decoded_tokens)

## Training The Model

In [None]:
%cd nanoGPT

/content/nanoGPT


In [None]:
import os
import time
import math
import pickle
from contextlib import nullcontext

import numpy as np
import torch

# from the local repo
from model import GPTConfig, GPT

### Hyper-Parameters


In [None]:
out_dir = 'out'

#### Initialization

In [None]:
init_from = 'scratch'

In [None]:
eval_interval = 250
eval_iters = 200
log_interval = 10
eval_only = False
always_save_checkpoint = True

#### Dataset


In [None]:
dataset = 'shakespeare'

In [None]:
gradient_accumulation_steps = 1
batch_size = 16
block_size = 512

#### Model Architecture

In [None]:
n_layer = 6
n_head = 6
n_embd = 516
dropout = 0.2
bias = False

##### ❓ Question:

How many attention heads (total) will our final network have?



In [None]:
total_attention_heads = n_layer * n_head
print(total_attention_heads)

36


**ANSWER:** Our final network will have a total of 36 attention heads.
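
A related quantity is the width of each head. The sketch below assumes the usual multi-head layout, in which `n_embd` is split evenly across the heads of a layer, so `n_embd` must be divisible by `n_head`.

```python
n_layer = 6
n_head = 6
n_embd = 516

# The embedding width must divide evenly across the heads of one layer.
assert n_embd % n_head == 0

head_size = n_embd // n_head
total_attention_heads = n_layer * n_head
print(f"{total_attention_heads} heads in total, each {head_size} dimensions wide")
```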

#### Optimizer Hyper-Parameters

In [None]:
# adamw optimizer
learning_rate = 1e-3
max_iters = 5_000
beta1 = 0.9
beta2 = 0.99

# lr decay settings
decay_lr = True
weight_decay = 1e-1
lr_decay_iters = 5_000
min_lr = 1e-4

# clipping and warmup
grad_clip = 1.0
warmup_iters = 100

In [None]:
backend = 'nccl'
device = 'cuda'
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16'
compile = True
# -----------------------------------------------------------------------------
config_keys = [k for k,v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys}
# -----------------------------------------------------------------------------
master_process = True
seed_offset = 0
ddp_world_size = 1
tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")
os.makedirs(out_dir, exist_ok=True)

tokens per iteration will be: 8,192
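
To put the tokens-per-iteration figure in context, the arithmetic below estimates how many passes over the training set the configured run makes. It reuses the hyper-parameters above and the 291,284 training tokens reported when the dataset was tokenized. Because `get_batch` samples windows at random, this is only an approximate "epoch" count.

```python
gradient_accumulation_steps = 1
ddp_world_size = 1
batch_size = 16
block_size = 512
max_iters = 5_000
train_tokens = 291_284  # from the tokenization output earlier in the notebook

tokens_per_iter = gradient_accumulation_steps * ddp_world_size * batch_size * block_size
total_tokens_seen = tokens_per_iter * max_iters
approx_epochs = total_tokens_seen / train_tokens

print(f"tokens per iteration: {tokens_per_iter:,}")
print(f"tokens seen in {max_iters:,} iterations: {total_tokens_seen:,}")
print(f"approximate passes over the training set: {approx_epochs:.1f}")
```

With well over a hundred passes over such a small corpus, the model can memorize passages, so it is worth watching the validation loss for overfitting.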


### Torch Settings


In [None]:
torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
device_type = 'cuda' if 'cuda' in device else 'cpu'
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

### Dataloader

1. Set the data path
2. Load the dataset we tokenized earlier from the `.bin` we saved
3. Define a `get_batch` function

In [None]:
data_dir = os.path.join('/data', dataset)
train_data = np.memmap(os.path.join(data_dir, 'train.bin'), dtype=np.uint16, mode='r')
val_data = np.memmap(os.path.join(data_dir, 'val.bin'), dtype=np.uint16, mode='r')

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

In [None]:
ix = torch.randint(len(train_data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy((train_data[i:i+block_size]).astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy((train_data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])

In [None]:
print(f"Our randomly selected indices were: {ix}")

Our randomly selected indices were: tensor([ 99775, 155569, 263696,  32920,  52919, 231541, 153767, 229238, 136782,
        263618,  39008,  14208,  39429, 189430, 194466,  76798])


In [None]:
print(f"The first 10 elements of `x` at the first randomly selected index is:\n{x[0][:10]}")

The first 10 elements of `x` at the first randomly selected index is:
tensor([   68,    16,    81,  2358, 19949,   116,   172,  1280,     9,    68])


In [None]:
print(f"The first 10 elements of `y` at the first randomly selected index is:\n{y[0][:10]}")

The first 10 elements of `y` at the first randomly selected index is:
tensor([   16,    81,  2358, 19949,   116,   172,  1280,     9,    68,    16])


##### ❓ Question:

Both `x` and `y` are lists of tokens, as expected, but what relationship do you notice between `x` and `y`?


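**ANSWER:** `y` is `x` shifted one position to the right: `y[i]` is the token that follows `x[i]` in the text (compare the two outputs above, where `y[0][:9]` equals `x[0][1:10]`). Our auto-regressive language model is trained to predict each token of `y` from the tokens of `x` up to and including that position.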




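So the first step in `get_batch` picks `batch_size` random start indices from the training data, leaving room after each start for a full block of `block_size` tokens plus the one-token shift needed for `y`.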
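
The shift between inputs and targets can be checked on a toy example. The sketch below mirrors the indexing used in `get_batch` with plain Python lists; the token stream, block size, and batch size are made-up values.

```python
import random

data = list(range(100, 120))  # hypothetical token stream of 20 IDs
block_size = 8
batch_size = 4

random.seed(0)
# Start indices leave room for block_size tokens plus the one-token shift.
starts = [random.randrange(len(data) - block_size) for _ in range(batch_size)]

x = [data[i:i + block_size] for i in starts]
y = [data[i + 1:i + 1 + block_size] for i in starts]

for xi, yi in zip(x, y):
    # Every target row is the input row moved one token to the right.
    assert xi[1:] == yi[:-1]
    print(xi, "->", yi)
```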

### Simple Initialization of Model

In [None]:
iter_num = 0
best_val_loss = 1e9

In [None]:
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = tokenizer.vocab_size
meta_vocab_size

20099

In [None]:
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout)

In [None]:
if init_from == 'scratch':
    print("Initializing a new model from scratch")
    if meta_vocab_size is None:
        print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")
    model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
    gptconf = GPTConfig(**model_args)
    model = GPT(gptconf)

Initializing a new model from scratch
number of parameters: 29.55M


In [None]:
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size

In [None]:
model.to(device)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(20099, 516)
    (wpe): Embedding(512, 516)
    (drop): Dropout(p=0.2, inplace=False)
    (h): ModuleList(
      (0-5): 6 x Block(
        (ln_1): LayerNorm()
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=516, out_features=1548, bias=False)
          (c_proj): Linear(in_features=516, out_features=516, bias=False)
          (attn_dropout): Dropout(p=0.2, inplace=False)
          (resid_dropout): Dropout(p=0.2, inplace=False)
        )
        (ln_2): LayerNorm()
        (mlp): MLP(
          (c_fc): Linear(in_features=516, out_features=2064, bias=False)
          (gelu): GELU(approximate='none')
          (c_proj): Linear(in_features=2064, out_features=516, bias=False)
          (dropout): Dropout(p=0.2, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm()
  )
  (lm_head): Linear(in_features=516, out_features=20099, bias=False)
)
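
The parameter count can be reproduced by hand from the layer shapes printed above. The sketch assumes two things that the printout does not state directly: that `lm_head` shares its weight matrix with `wte` (weight tying), and that the "number of parameters" line excludes the position embeddings. Both assumptions are consistent with the totals reported by the optimizer setup below.

```python
vocab_size, block_size = 20099, 512
n_layer, n_embd = 6, 516

wte = vocab_size * n_embd   # token embeddings (assumed shared with lm_head)
wpe = block_size * n_embd   # position embeddings
layer_norm = n_embd         # weight only, since bias=False

attn = n_embd * 3 * n_embd + n_embd * n_embd      # c_attn + c_proj
mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd   # c_fc + c_proj
per_block = 2 * layer_norm + attn + mlp

total = wte + wpe + n_layer * per_block + layer_norm  # final term is ln_f
non_position = total - wpe

print(f"all parameters: {total:,}")
print(f"excluding position embeddings: {non_position / 1e6:.2f}M")
```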

In [None]:
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

In [None]:
optimizer = model.configure_optimizers(
    weight_decay,
    learning_rate,
    (beta1, beta2),
    device_type
)

checkpoint = None

num decayed parameter tensors: 26, with 29,805,708 parameters
num non-decayed parameter tensors: 13, with 6,708 parameters
using fused AdamW: True


In [None]:
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model) # requires PyTorch 2.0

compiling the model... (takes a ~minute)


In [None]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

### Creating our LR Scheduler

![img](https://i.imgur.com/KoFEl0b.png)


In [None]:
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr
    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff ranges 0..1
    return min_lr + coeff * (learning_rate - min_lr)

### ❓ Question:

What advantages does a learning-rate scheduler have over a static learning rate?

Feel free to consult and cite any resources you find!

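**ANSWER:** Compared with a static learning rate, a scheduler lets the step size change as training progresses. The linear warmup keeps early updates small while the randomly initialized weights and the optimizer's moment estimates are still unreliable, which reduces the risk of divergence. The cosine decay then shrinks the step size so the model can settle into a minimum instead of oscillating around it. In practice this tends to give faster, more stable convergence than a single fixed rate, and it makes training less sensitive to the exact peak learning rate chosen.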
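
To see the schedule's shape concretely, the sketch below re-declares `get_lr` with the hyper-parameters from this notebook and evaluates it at a few iterations. The chosen iteration numbers are arbitrary sample points.

```python
import math

learning_rate, min_lr = 1e-3, 1e-4
warmup_iters, lr_decay_iters = 100, 5_000


def get_lr(it: int) -> float:
    # 1) linear warmup
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) past the decay horizon, hold at the minimum
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)


for it in (0, 50, 100, 2550, 5000, 6000):
    print(f"iter {it:>4}: lr = {get_lr(it):.6f}")
```

The rate climbs linearly to its peak at iteration 100, passes the midpoint between the peak and the minimum at iteration 2,550, and ends at `min_lr`.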

In [None]:
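# `!export` only changes a throwaway subshell; `%env` sets the variable
# for this kernel and the child processes it launches
%env LC_ALL=en_US.UTF-8
%env LD_LIBRARY_PATH=/usr/lib64-nvidia
%env LIBRARY_PATH=/usr/local/cuda/lib64/stubs
!ldconfig /usr/lib64-nvidia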

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link



## The Training Loop

In [None]:
X, Y = get_batch('train')
t0 = time.time()
local_iter_num = 0
raw_model = model
running_mfu = -1.0 # model flops utilization

while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0 and master_process:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': raw_model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    if iter_num == 0 and eval_only:
        break

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train')
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()
    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()
    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0 and master_process:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        if local_iter_num >= 5: # let the training loop settle a bit
            mfu = raw_model.estimate_mfu(batch_size * gradient_accumulation_steps, dt)
            running_mfu = mfu if running_mfu == -1.0 else 0.9*running_mfu + 0.1*mfu
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt*1000:.2f}ms, mfu {running_mfu*100:.2f}%")
    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

step 0: train loss 9.9352, val loss 9.9273
iter 0: loss 9.9333, time 97771.85ms, mfu -100.00%
iter 10: loss 8.3523, time 208.69ms, mfu 2.47%
iter 20: loss 7.3770, time 208.44ms, mfu 2.47%
iter 30: loss 6.4399, time 209.33ms, mfu 2.47%
iter 40: loss 5.8015, time 209.69ms, mfu 2.47%
iter 50: loss 5.7303, time 210.07ms, mfu 2.47%
iter 60: loss 5.5226, time 210.12ms, mfu 2.47%
iter 70: loss 5.2349, time 210.74ms, mfu 2.46%
iter 80: loss 5.1079, time 212.94ms, mfu 2.46%
iter 90: loss 4.9914, time 213.43ms, mfu 2.45%
iter 100: loss 4.5892, time 212.50ms, mfu 2.45%
iter 110: loss 4.6100, time 212.17ms, mfu 2.45%
iter 120: loss 4.5501, time 212.92ms, mfu 2.45%
iter 130: loss 4.5344, time 212.16ms, mfu 2.45%
iter 140: loss 4.4540, time 212.93ms, mfu 2.44%
iter 150: loss 4.4568, time 212.78ms, mfu 2.44%
iter 160: loss 4.4422, time 213.23ms, mfu 2.44%
iter 170: loss 4.3045, time 214.52ms, mfu 2.43%
iter 180: loss 4.4319, time 214.84ms, mfu 2.43%
iter 190: loss 4.2649, time 214.49ms, mfu 2.43%
ite
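
A quick sanity check on the `step 0` losses in the log: a model that assigns equal probability to every token has a cross-entropy of `ln(vocab_size)`. The sketch below computes that baseline for this notebook's vocabulary of 20,099 tokens. A freshly initialized model should start near that value, though the exact figure depends on the initialization.

```python
import math

vocab_size = 20099

# Cross-entropy of a uniform guess over the whole vocabulary.
uniform_loss = math.log(vocab_size)
print(f"uniform-guess loss: {uniform_loss:.4f}")
```

The logged starting losses of about 9.93 sit close to this baseline, which suggests the model and data pipeline are wired up sensibly before training begins.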

## Generating Outputs with our New Model

### Generation Set Up and Model Loading

In [None]:
import os
import pickle
from contextlib import nullcontext
import torch
import tiktoken
from model import GPTConfig, GPT

# -----------------------------------------------------------------------------
init_from = 'resume' # either 'resume' (from an out_dir) or a gpt2 variant (e.g. 'gpt2-xl')
out_dir = 'out' # ignored if init_from is not 'resume'
start = "\n" # or "<|endoftext|>" or etc. Can also specify a file, use as: "FILE:prompt.txt"
num_samples = 10 # number of samples to draw
max_new_tokens = 500 # number of tokens generated in each sample
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability
seed = 1337
device = 'cuda' # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype = 'bfloat16' if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = False # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

In [None]:
# model
if init_from == 'resume':
    # init from a model saved in a specific directory
    ckpt_path = os.path.join(out_dir, 'ckpt.pt')
    checkpoint = torch.load(ckpt_path, map_location=device)
    gptconf = GPTConfig(**checkpoint['model_args'])
    model = GPT(gptconf)
    state_dict = checkpoint['model']
    unwanted_prefix = '_orig_mod.'
    for k,v in list(state_dict.items()):
        if k.startswith(unwanted_prefix):
            state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
    model.load_state_dict(state_dict)

number of parameters: 29.55M
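
The prefix handling above is needed because the checkpoint was saved from a model wrapped by `torch.compile`, whose `state_dict` keys carry an `_orig_mod.` prefix. The sketch below shows the same renaming on a plain dictionary with made-up keys and placeholder values, so it runs without PyTorch.

```python
# Hypothetical state-dict keys; the values stand in for tensors.
state_dict = {
    "_orig_mod.transformer.wte.weight": "tensor-a",
    "_orig_mod.lm_head.weight": "tensor-b",
    "already.clean.key": "tensor-c",
}

unwanted_prefix = "_orig_mod."

# Build a new dict so nothing is mutated while iterating.
cleaned = {
    (key[len(unwanted_prefix):] if key.startswith(unwanted_prefix) else key): value
    for key, value in state_dict.items()
}

print(sorted(cleaned))
```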


In [None]:
model.eval()
model.to(device)
if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

In [None]:
enc = tokenizer
encode = lambda s: enc.encode(s)
decode = lambda l: enc.decode(l)

### Generation!

In [None]:
# encode the beginning of the prompt
if start.startswith('FILE:'):
    with open(start[5:], 'r', encoding='utf-8') as f:
        start = f.read()
start_ids = encode(start)
x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

# run generation
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(x, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
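
The `temperature` and `top_k` settings shape how each next token is drawn. The sketch below shows the usual recipe in plain Python: divide the logits by the temperature, mask everything outside the `top_k` largest, apply a softmax, then sample. It illustrates the idea and is not a copy of `model.generate`; the function name `sample_next` and the example logits are assumptions.

```python
import math
import random


def sample_next(logits, temperature=0.8, top_k=3, rng=random):
    """Return (sampled index, probabilities) after temperature and top-k filtering."""
    # 1) temperature: values below 1.0 sharpen the distribution
    scaled = [value / temperature for value in logits]

    # 2) top-k: everything below the k-th largest logit gets probability 0
    if top_k is not None:
        kth_largest = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth_largest else float("-inf") for s in scaled]

    # 3) numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 4) draw one index according to the probabilities
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return index, probs


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for 5 tokens
idx, probs = sample_next(logits, temperature=0.8, top_k=3, rng=random.Random(1337))
print(idx, [round(p, 3) for p in probs])
```

Raising `temperature` flattens the distribution and makes samples more varied, while a smaller `top_k` restricts sampling to the most likely tokens.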


To the swift ambassador,
Where the swift ambassador,
To a clear his bosom find
Where you shall cross this tied in the rigour of severest law.

PRINCE:
We wakes; there's man? what can he say in this?
Where is Romeo's Romeo's Romeo's man?? that Romeo? that kill'd Mercutio?

BALTHASAR:
I brought my master news of Juliet's death;
And then in post he came from Mantua
To this same place, to this same monument.
This letter he early bid me give his father,
And threatened me with death, going in the vault,
I departed not and left him there.

PRINCE:
Give me the letter; I will look on it.
Where is the county's page, that raised the watch?
Sirrah, what made your master in this place?

PAGE:
He came with flowers to strew his lady's grave;
And bid me stand aloof, and so I did:
Anon comes one with light to ope the tomb;
And by and by and by my master drew on him;
And then I ran away to call the watch.

PRINCE:
This letter doth make good the friar's words,
Their course of love, the tidings of her de