By default, this script contains some improvements over train.ipynb:
* far fewer layers and heads, and a smaller embedding dimension, to reduce the model size
* dataloader v2, which uses a custom tokenizer (again to reduce model size)
* no positional embeddings (to reduce model complexity)
* weight tying (to reduce model size)


A few things were implemented here for the first time:
* validation losses
* increased the model size to be just below 30M parameters
* reduced the amount of training data to keep the training (wall-clock) time consistent
* made the graph more informative

This script contains a couple of improvements from train2.ipynb:
* gradient accumulation is enabled
* the dataloader chunks from the start of an example up to max_length or the endoftext token, whichever comes first
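
The chunking rule in the last bullet can be sketched as follows. This is a minimal illustration, not the dataloader from train2.ipynb: the function name `chunk_examples` is hypothetical, and two details are assumptions (the endoftext token is kept inside the chunk, and the tail of an over-long example is dropped).

```python
def chunk_examples(token_ids, eot_id, max_length):
    """Yield one chunk per example.

    Each chunk starts at the first token of an example and ends at the
    end-of-text token or after max_length tokens, whichever comes first.
    """
    chunk = []
    skipping = False  # True while discarding the tail of an over-long example
    for tok in token_ids:
        if skipping:
            if tok == eot_id:
                skipping = False  # the next token starts a new example
            continue
        chunk.append(tok)
        if tok == eot_id:
            yield chunk
            chunk = []
        elif len(chunk) == max_length:
            yield chunk
            chunk = []
            skipping = True  # assumption: the rest of this example is dropped
    if chunk:
        yield chunk  # trailing example without an end-of-text token


stream = [1, 2, 3, 0, 4, 5, 6, 7, 8, 0, 9]  # 0 plays the role of endoftext
chunks = list(chunk_examples(stream, eot_id=0, max_length=4))
```

Because every chunk begins at an example boundary, the model never sees a chunk that starts in the middle of a story.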

This contains some changes from train3.ipynb:
* an accuracy metric has been implemented
* one cycle learning rate schedule is being used
* weight tying is disabled
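
For intuition, the shape of a one cycle schedule with cosine annealing can be sketched without any dependencies. This is an approximation for illustration only, not the `OneCycleLR` source: `div_factor=25` and `final_div_factor=1e4` are assumed to match the PyTorch defaults, the exact phase boundaries may differ by a step, and `total_steps=7044` is taken from the checkpoint file name used later in this notebook.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, pct_start=0.1,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step` for a two-phase cosine one-cycle schedule."""
    initial_lr = max_lr / div_factor          # starting learning rate
    min_lr = initial_lr / final_div_factor    # learning rate at the very end
    warmup_steps = max(1, int(pct_start * total_steps))

    def cos_interp(start, end, frac):
        # frac=0 returns start, frac=1 returns end, following half a cosine wave
        return end + (start - end) * (1 + math.cos(math.pi * frac)) / 2

    if step < warmup_steps:
        return cos_interp(initial_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    return cos_interp(max_lr, min_lr, (step - warmup_steps) / decay_steps)


total_steps, max_lr = 7044, 1e-3 / 2
schedule = [one_cycle_lr(s, total_steps, max_lr) for s in range(total_steps)]
```

With `pct_start=0.1` the learning rate climbs for roughly the first 10% of steps and decays for the rest.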

This contains one improvement from train4.ipynb: the attention module now uses PyTorch's implementation of scaled dot-product attention (SDPA). It also uses a text generation function to display the capabilities of the trained model.
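
As a reference for what SDPA computes, here is a dependency-free sketch for a single attention head. It is not the attention module from this project, and the function name `sdpa` is only illustrative; it assumes every query row is allowed to attend to at least one key.

```python
import math


def sdpa(q, k, v, mask):
    """softmax(q @ k^T / sqrt(d)) @ v for one head.

    q, k, v are lists of vectors (seq_len x head_dim); mask[i][j] is True
    when query position i is allowed to attend to key position j.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
causal = [[j <= i for j in range(3)] for i in range(3)]  # lower triangle
attended = sdpa(q, k, v, causal)
```

In PyTorch the same computation is available as `torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)`, where a boolean mask uses `True` for positions that may be attended to.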

This contains some changes over train5.ipynb:
* data loading hooks set up the train/validation data loaders
* the setup hook sets up the GPT model
* compile is called on the GPT model before training
* some code for investigating memory leaks was added

The train6 files were used to determine the cause of the memory leak, which seems to have been the use of multiple dataloader workers, which causes copy-on-read to occur. Setting num_workers=0 in the dataloader resolves this issue.
See issue: https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662
Blog post: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/

The train7 files started using datasets processed by huggingface libraries. This file continues from there.

This file has many improvements over the train8.ipynb files. By default this script uses:
* packed sequences
* attention masks
* a deep model with many layers
* mixed precision training
* a small vocab of 4096 token ids
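
To make the first two bullets concrete, here is a small sketch of how one packed row, its per-document positions, and its attention mask could be built. It is an illustration under assumptions, not the preprocessing code behind the dataset used below: the function name `pack_row` is hypothetical, padding uses id 0, and each padding position is allowed to attend only to itself.

```python
def pack_row(sequences, context_length, pad_id=0):
    """Pack whole sequences into one row of length context_length.

    Returns (tokens, positions, mask). positions restart at 0 for every
    document, and mask[i][j] is True only when j <= i and both positions
    belong to the same document.
    """
    tokens, doc_ids, positions = [], [], []
    for doc, seq in enumerate(sequences):
        if len(tokens) + len(seq) > context_length:
            break  # a fuller implementation would start a new row here
        tokens += seq
        doc_ids += [doc] * len(seq)
        positions += list(range(len(seq)))

    pad = context_length - len(tokens)
    tokens += [pad_id] * pad
    doc_ids += [-1] * pad  # -1 marks padding
    positions += [0] * pad

    mask = [
        [
            j <= i
            and doc_ids[i] == doc_ids[j]
            and (doc_ids[i] != -1 or i == j)  # padding only sees itself
            for j in range(context_length)
        ]
        for i in range(context_length)
    ]
    return tokens, positions, mask


tokens, positions, mask = pack_row([[5, 6, 7], [8, 9]], context_length=6)
```

The mask is block-diagonal and lower-triangular: the second story cannot attend to the first, and no token can attend to a later one.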

~~This file shows a huge improvement in accuracy and loss. Many possible explanations:~~
* ~~Rotational positional encoding was added~~
* ~~the dataset used was cleaned of weird symbols (accents, chinese characters, etc.)~~
* ~~smaller max_lr~~

The huge improvements were false positives caused by bad causal mask construction (data leakage).
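
A cheap guard against this kind of leak is to scan a mask for any query position that can see a later key position. The helper below is a hypothetical sketch, not code from this project; it assumes a square mask given as nested lists of booleans, where `True` means "may attend".

```python
def find_mask_leaks(mask):
    """Return (query, key) pairs where a query can attend to a future key."""
    return [
        (i, j)
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed and j > i
    ]


good_mask = [[j <= i for j in range(3)] for i in range(3)]  # lower triangle
bad_mask = [[True] * 3 for _ in range(3)]                   # everything visible

leaks_in_good = find_mask_leaks(good_mask)
leaks_in_bad = find_mask_leaks(bad_mask)
```

An empty result only shows that the mask itself is causal; it does not rule out leakage elsewhere in the pipeline.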

Further improvements:
* ~~apply positions from dataset to RoPE~~ (done)
* ~~turn off default positional encodings (or use positions from dataset)~~ (done, turned off absolute position embeddings)

In the train9.ipynb files we have changed a few things:
* cleaned the dataset before training a tokenizer and processing the data
* used RoPE for positional embeddings
* switched to a wide, shallow model in line with the original TinyStories paper
* shortened the learning rate warm-up
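
As a reminder of what RoPE does, here is a dependency-free sketch that rotates consecutive pairs of a query or key vector by an angle that depends on the token position. It is not the `GPTModel_RoPE` implementation: the interleaved pairing and `base=10000` are assumptions, and real implementations differ in how they pair dimensions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate the pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Both pairs of positions are 2 apart, so the attention scores should agree.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_further_on = dot(rope(q, 12), rope(k, 10))
```

Because the score depends only on the relative offset, feeding per-document positions (which restart at 0 inside a packed row) keeps each story's encoding independent of where it lands in the row.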

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from tokenizers import Tokenizer
import tokenizers.decoders

tokenizer = Tokenizer.from_file("./TinyStories_tokenizer_small_cleaned_BPE.json")
tokenizer.decoder = tokenizers.decoders.ByteLevel()
vocab_size = tokenizer.get_vocab_size()

In [3]:
GPT_CONFIG_30M = {
    "vocab_size": vocab_size,
    "context_length": 512,
    "emb_dim": 512,
    "n_heads": 4,
    "n_layers": 4,
    "drop_rate": 0.0,
    "qkv_bias": False,
    "weight_tying": True,
    "no_pos_emb": True
}

In [4]:
GPT_CONFIG_120M_DEEP = {
    "vocab_size": vocab_size,
    "context_length": 512, # this must be multiple of 64 for the flash attention implementation
    "emb_dim": 512, # this must be multiple of 16 * n_heads for the flash attention implementation
    "n_heads": 32,
    "n_layers": 36,
    "drop_rate": 0.1,
    "qkv_bias": False,
    "weight_tying": False,
    "no_pos_emb": True # conflicts with sequence packing
}

In [5]:
GPT_CONFIG_120M_SHALLOW = {
    "vocab_size": vocab_size,
    "context_length": 512,
    "emb_dim": 768,
    "n_heads": 24,
    "n_layers": 16,
    "drop_rate": 0.0,
    "qkv_bias": False,
    "weight_tying": False,
    "no_pos_emb": True # conflicts with sequence packing and RoPE
}

In [6]:
import socket
hostname = socket.gethostname().lower()
if "laptop" in hostname:
    GPT_CONFIG = GPT_CONFIG_30M
else:
    GPT_CONFIG = GPT_CONFIG_120M_SHALLOW

In [7]:
import torch.nn as nn
import torch

torch.set_float32_matmul_precision('medium')

In [8]:
trainer_config = {
    "batch_size": 32 if "laptop" in hostname else 32,
    "epochs": 2,
    "num_workers": 23,
    "max_lr": 1e-3 / 2,
    "compile": "laptop" not in hostname
}
trainer_config["grad_batches"] = 256 // trainer_config["batch_size"]
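
The arithmetic behind `grad_batches` is sketched below. It assumes the value is used as the number of batches accumulated per optimizer step (the Trainer construction is not shown in this notebook), and the helper name `accumulation_plan` is hypothetical.

```python
def accumulation_plan(batch_size, target_effective_batch=256):
    """Return (grad_batches, effective_batch_size) for a given batch size."""
    grad_batches = target_effective_batch // batch_size
    return grad_batches, grad_batches * batch_size


plan_32 = accumulation_plan(32)  # 8 batches of 32 give 256 rows per optimizer step
plan_48 = accumulation_plan(48)  # integer division: 5 batches of 48 give only 240
```

Batch sizes that do not divide 256 silently shrink the effective batch, so powers of two are the safe choice here.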

In [9]:
from components.gptmodel import GPTModel_RoPE
from torch.optim.lr_scheduler import OneCycleLR
import lightning as L
from datasets import load_from_disk
from torch.utils.data import DataLoader


class LitGPTModel(L.LightningModule):
    def __init__(self, trainer_config, gpt_config):
        super().__init__()
        self.save_hyperparameters()
        self.gpt_config = gpt_config
        self.trainer_config = trainer_config

        self.train_accuracy = []
        self.val_accuracy = []
        self.train_losses = []
        self.val_losses = []
        self.val_steps = []
        self.learning_rates = []
        self.batch_step = 0

    def _accuracy(self, output, expected):
        total_matching = (torch.argmax(output, dim=-1) == expected).sum().item()
        total_numel = expected.numel()
        return total_matching / total_numel

    def training_step(self, batch, batch_idx):
        self.batch_step += 1

        x, y = batch["packed_inputs"][:, :-1], batch["packed_inputs"][:, 1:]
        attn_mask = batch["attention_mask"][:, :-1, :-1]
        positions = batch["padded_positions"][:, :-1]
        logits = self.model([x, attn_mask, positions])

        accuracy = self._accuracy(logits, y)
        self.log("accuracy", accuracy, prog_bar=True, on_step=True, on_epoch=True)
        self.train_accuracy.append(accuracy)

        loss = self.loss(logits, y)
        self.log("loss", loss, prog_bar=True, on_step=True, on_epoch=True)
        self.train_losses.append(loss.item())

        current_lr = self.optimizers().param_groups[0]["lr"]
        self.learning_rates.append(current_lr)

        return loss

    def validation_step(self, batch, batch_idx):
        self.val_steps.append(self.batch_step)
        x, y = batch["packed_inputs"][:, :-1], batch["packed_inputs"][:, 1:]
        attn_mask = batch["attention_mask"][:, :-1, :-1]
        positions = batch["padded_positions"][:, :-1]
        logits = self.model([x, attn_mask, positions])

        accuracy = self._accuracy(logits, y)
        self.log("val_accuracy", accuracy, prog_bar=True, on_step=True, on_epoch=True)
        self.val_accuracy.append(accuracy)

        loss = self.loss(logits, y)
        self.log("val_loss", loss, prog_bar=True, on_step=True, on_epoch=True)
        self.val_losses.append(loss.item())
        
        return loss

    def loss(self, output, expected):
        loss = nn.functional.cross_entropy(
            output.flatten(0, 1), expected.flatten()
        )
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(
            self.parameters(), lr=self.trainer_config["max_lr"], weight_decay=0.1
        )

        scheduler = OneCycleLR(
            optimizer,
            max_lr=self.trainer_config["max_lr"],
            total_steps=self.trainer.estimated_stepping_batches,
            pct_start = 0.1,
        )
        lr_scheduler_config = {
            "scheduler": scheduler,
            "interval": "step",
            "monitor": "loss"
        }

        return {
            "optimizer": optimizer,
            "lr_scheduler": lr_scheduler_config
        }

    def setup(self, stage):
        self.packed_dataset = load_from_disk("packed_dataset_with_mask_smallVocab_cleaned_BPE")
        self.packed_dataset.set_format('torch')

    def configure_model(self):
        if hasattr(self, "model"):
            return

        self.model = GPTModel_RoPE(self.gpt_config)
        if self.trainer_config["compile"]:
            self.model = torch.compile(self.model, fullgraph=True)


    def train_dataloader(self):
        return DataLoader(
            self.packed_dataset["train"],
            batch_size=self.trainer_config["batch_size"],
            shuffle=True,
            num_workers=self.trainer_config["num_workers"],
            pin_memory=True,
            persistent_workers=True,
            prefetch_factor=2,
            drop_last=True,
        )

    def val_dataloader(self):
        return DataLoader(
            self.packed_dataset["validation"],
            batch_size=self.trainer_config["batch_size"],
            shuffle=False,
            num_workers=self.trainer_config["num_workers"],
            pin_memory=True,
            persistent_workers=True,
            prefetch_factor=2,
            drop_last=True,
        )



In [10]:
litmodel = LitGPTModel.load_from_checkpoint(
  checkpoint_path="./checkpoints/120M_SHALLOw_smallVocabCleaned_RoPE2_BPE/epoch=1-step=7044.ckpt",
)
# make sure NOT to compile so that we can load from checkpoint

In [11]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
litmodel.model.to(device);

In [14]:
from components.generatetext import top_p_sampling

start_text = "One day a girl walked into the living room and noticed something very strange. There was a huge cabinet standing in the corner. It looked very old and heavy. She walked over and tried to open it, when suddenly"

generated_text = top_p_sampling(
    litmodel.model,
    tokenizer,
    start_text,
)

print(generated_text)

 One day a girl walked into the living room and noticed something very strange. There was a huge cabinet standing in the corner. It looked very old and heavy. She walked over and tried to open it, when suddenly it shook! The girl was surprised, she couldn't believe her eyes.

 "Hello, are you here?" she asked.

 The cabinet nodded and suddenly opened. Inside was a secret room! The girl stepped inside and was amazed by what she saw. There were amazing lights, toys and pictures everywhere.

 "Oh, this must be the host of this cabinet!" said the girl.

 Just then, a deep voice came from inside the cabinet. "Horse? What do you want?"

 The girl was scared, but she could not answer. The voice asked again. Then the cabinet opened, revealing a surprise. It was a big, colorful elephant!

 The girl was so happy. She thanked the cabinet for revealing the cabinet, and they said goodbye and the girl ran out of the room, never to return.
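
For readers unfamiliar with top-p (nucleus) sampling, the idea is to sample only from the smallest set of tokens whose probabilities add up to at least `p`. The sketch below is not the `top_p_sampling` function from `components.generatetext`; the function names and the default `p=0.9` are illustrative assumptions.

```python
import math
import random


def top_p_filter(logits, p=0.9):
    """Return {token_id: probability} for the renormalised nucleus."""
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # stable softmax numerators
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus now covers at least p of the probability mass

    norm = sum(prob for prob, _ in kept)
    return {idx: prob / norm for prob, idx in kept}


def sample_top_p(logits, p=0.9, rng=random):
    """Draw one token id from the nucleus."""
    nucleus = top_p_filter(logits, p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
nucleus = top_p_filter(logits, p=0.7)            # keeps only tokens 0 and 1
token = sample_top_p(logits, p=0.7, rng=random.Random(0))
```

Lower values of `p` cut off more of the unlikely tail, which tends to make the generated stories more conservative.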


In [None]:
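import csv
import time

# NOTE: generate_text_with_attn_positions is not imported in any cell shown here;
# it must already be in scope before this cell runs.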

with open('evaluation_prompts.csv', mode='r') as file:
    csv_reader = csv.DictReader(file)
    data = [row for row in csv_reader]  # Each row is a dictionary

data_len = len(data)
current_index = 0
# Record the start time so the remaining time can be estimated
start_time = time.time()
# Generate a completion for every prompt and store it on the row
for row in data:
    current_index += 1
    starting_text = row['prompt']
    output_text = generate_text_with_attn_positions(litmodel.model, tokenizer, starting_text, 512, device, topk=3, temperature=1, output_only=True)
    row['completion'] = output_text
    if current_index % 10 == 0:
        # print prompt
        print(f"Prompt {current_index}: {starting_text}")
        # print output
        print(f"Row {current_index}: {output_text}")
        current_time = time.time()
        elapsed_time = current_time - start_time
        time_left = (data_len - current_index) * (elapsed_time / current_index)
        print(f"Processed {current_index}/{data_len} rows. Estimated time left: {time_left:.2f} seconds.")


Prompt 10: One day a girl walked into the living room and noticed something very strange. There was a huge cabinet standing in the corner. It looked very old and heavy. She walked over and tried to open it, when suddenly
Row 10:  she heard a loud noise. She jumped back in surprise and ran away. The girl's mum came into the room. She looked very angry. "What are you going to do?" She asked. The girl pointed to the cabinet and said, "I'm going to open the old cabinet!" Mum shook her head. "No, you can't do that," she said. "You're too little and you might break something!" The girl was very sad. She didn't understand why she couldn't open the old cabinet, and why it was so heavy. She wanted to open it, but she knew that it was too dangerous. The girl looked at her mum and sighed. She knew she had made a mistake, but it was too late now.
Processed 10/300 rows. Estimated time left: 239.27 seconds.
Prompt 20: Jack asked his mom if he could ride the bike all the way to his grandmother's hous

In [None]:
with open('evaluation_outputs_smallVocabCleaned2_RoPE_shallow_BPE.csv', mode='w', newline='') as file:
    fieldnames = data[0].keys()  # Get column names
    csv_writer = csv.DictWriter(file, fieldnames=fieldnames)
    csv_writer.writeheader()
    csv_writer.writerows(data)

In [None]:
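# NOTE: `trainer` is not created in any cell shown here; a Lightning Trainer must exist before this call.
trainer.validate()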

INFO: Restoring states from the checkpoint path at /home/gebrial/projects/llm-from-scratch/src/checkpoints/120M_SHALLOw_smallVocabCleaned_RoPE2_BPE/epoch=1-step=7044.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at /home/gebrial/projects/llm-from-scratch/src/checkpoints/120M_SHALLOw_smallVocabCleaned_RoPE2_BPE/epoch=1-step=7044.ckpt


Validation DataLoader 0: 100%|██████████| 286/286 [00:27<00:00, 10.34it/s]


[{'val_accuracy_epoch': 0.6962168216705322, 'val_loss_epoch': 1.09814453125}]
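
One way to read the validation loss is to convert it to perplexity. The snippet below assumes the reported `val_loss_epoch` is a mean cross-entropy in nats, which is what `nn.functional.cross_entropy` returns by default.

```python
import math

val_loss = 1.09814453125          # val_loss_epoch reported above
perplexity = math.exp(val_loss)   # effective number of equally likely next tokens

print(f"perplexity: {perplexity:.2f}")  # about 3.0
```

For comparison, guessing uniformly over a 4096-token vocabulary would give a perplexity of 4096.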