# Tiny Stories Hackathon
> From Cluster of stars study group

## TinyStories Hackathon Rules
This hackathon is intended to be a fun competition to give ourselves practice pretraining LLMs on consumer hardware. We will follow the [TinyStories paper](<https://arxiv.org/abs/2305.07759>) and train small language models on small datasets and hardware.

The hackathon will end on April 7th, [AOE](<https://en.wikipedia.org/wiki/AoE>).

### Datasets
1. [**TinyStories:**](<https://huggingface.co/datasets/roneneldan/TinyStories>)
   Note that the TinyStories dataset comes in two versions, both available in the HF dataset:
   - GPT-3.5 generated TinyStories
   - GPT-4 generated TinyStories

   The tar file appears to have the cleanest versions with the fewest duplicates.
2. **[Simple Wikipedia](<https://huggingface.co/datasets/lsb/simplewiki2023>)** (optional)
   This dataset can be used to give your model more world knowledge than the TinyStories dataset alone provides. But be careful that it doesn't cause your model to use words which a typical 3- to 4-year-old doesn't understand. It may need to be cleaned.

### Evaluation
Models will be evaluated by LLM-as-a-judge following the methodology outlined in the TinyStories paper. More details, including how to submit your model's outputs, will follow early next week.

### Model Size Limits
Participants will be slotted into one of the following categories based on their hardware:
- **Small**: Up to 30M parameters. Low-to-mid range laptop GPUs and Apple Silicon.
- **Medium**: Up to 60M parameters. Mid-range GPUs (including high-end laptop GPUs and Apple Silicon).
- **Large**: Up to 120M parameters. High-end GPUs and multi-GPU systems.

### Tokenizers
While you must train your model from scratch, you are welcome to use any pre-trained tokenizer or train your own tokenizer.
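
Tokenizer choice interacts with the parameter budget, because the token embedding table holds `vocab_size * d_model` weights. The arithmetic below is a rough sketch: the GPT-2 vocabulary size (50257) and the 30M "Small" limit are real, while `d_model = 512`, the 8192-token custom vocabulary, and the helper `embedding_params` are assumptions picked only for illustration.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Weights in the token embedding table (doubled if the output head is not tied)."""
    table = vocab_size * d_model
    return table if tied else 2 * table

gpt2_vocab = 50257          # size of the GPT-2 BPE vocabulary
d_model = 512               # hypothetical model width
small_budget = 30_000_000   # limit of the "Small" category

gpt2_cost = embedding_params(gpt2_vocab, d_model)
custom_cost = embedding_params(8192, d_model)   # hypothetical custom tokenizer
gpt2_share = gpt2_cost / small_budget

print(f"gpt2 vocab:   {gpt2_cost:,} weights ({gpt2_share:.0%} of the Small budget)")
print(f"custom vocab: {custom_cost:,} weights")
```

Under these assumptions, a large pre-trained vocabulary leaves little room for transformer layers in the Small category, which is one argument for training a smaller tokenizer.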

### Model Architecture
You are welcome to use any model architecture you want provided you stay within the parameter budget of your hardware by following the parameter counting rules below.

### Parameter Counting
The parameter budget is the number of unique floating-point weights receiving gradient updates:
- Unique Weights: Count each distinct floating-point weight stored in the model once.
- Reuse Multiplier: For each weight, multiply by the number of distinct times it contributes to forward computation (e.g., due to layer-sharing, layer reuse, or non-standard head-sharing). Weight-tied embedding and decoder weights are the exception and are only counted once. MQA/GQA doesn't count as head-sharing.
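
The counting rules above could be checked with something like the following sketch. It is not an official scoring script: the helper `count_budget`, its `reuse` argument (a mapping from parameter-name prefix to the number of times that module runs in one forward pass), and the toy model `Tiny` are all hypothetical. It relies on `named_parameters()` yielding a shared tensor only once, so tied embedding and decoder weights are not double counted.

```python
import torch.nn as nn

def count_budget(model, reuse=None):
    """Sum trainable weights, multiplying reused modules by their reuse count."""
    reuse = reuse or {}
    total = 0
    # named_parameters() de-duplicates shared tensors, so tied weights count once
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        mult = max((m for prefix, m in reuse.items() if name.startswith(prefix)), default=1)
        total += p.numel() * mult
    return total

class Tiny(nn.Module):
    def __init__(self, vocab=100, d=8):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.block = nn.Linear(d, d, bias=False)
        self.head = nn.Linear(d, vocab, bias=False)
        self.head.weight = self.emb.weight   # weight tying: counted once
    def forward(self, x):
        h = self.emb(x)
        for _ in range(3):                   # the same block is applied 3 times
            h = self.block(h)
        return self.head(h)

model = Tiny()
unique = count_budget(model)                       # 100*8 + 8*8
budget = count_budget(model, reuse={'block': 3})   # 100*8 + 3*(8*8)
print(unique, budget)
```

The reuse counts have to be supplied by hand here, since they depend on how `forward` is written rather than on the module tree.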

### Teams
Teams are limited to a maximum of 2 members and must be formed and declared within the first week.

### Training Frameworks
You might want to take a look at the following libraries and frameworks and adopt one for pretraining:
- [Composer](<https://docs.mosaicml.com/projects/composer/en/stable/index.html>) and optionally [LLM Foundry](<https://github.com/mosaicml/llm-foundry>)
- [PyTorch Lightning](<https://lightning.ai/docs/pytorch/stable/>) and optionally [LitGPT](<https://github.com/Lightning-AI/litgpt>)
- Hugging Face [Trainer](<https://huggingface.co/docs/transformers/en/main_classes/trainer>), [Accelerate](<https://huggingface.co/docs/accelerate/en/index>), and optionally [Axolotl](<https://axolotl-ai-cloud.github.io/axolotl/>) (a wrapper on top of HF)
- [fastai](<https://docs.fast.ai/>) with either [fastxtend](<https://fastxtend.benjaminwarner.dev/text.huggingface.html>) or [blurr](<https://ohmeow.github.io/blurr/>)

## Data

### Dataset (?)

In [1]:
from datasets import load_dataset
import tiktoken
import torch

from minai import *

Grab the TinyStories data from Hugging Face.

In [2]:
ds = load_dataset('roneneldan/TinyStories')
trn = ds['train']
val = ds['validation']
trn

Dataset({
    features: ['text'],
    num_rows: 2119719
})

For now, we can just use the GPT-2 tokenizer to get started.

In [3]:
tokenizer = tiktoken.get_encoding('gpt2')

txt = trn[0]['text']
txt

'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'

In [4]:
tokenizer.encode(txt)[:10]

[3198, 1110, 11, 257, 1310, 2576, 3706, 20037, 1043, 257]

In [5]:
tokenizer.decode(tokenizer.encode(txt)[:10])

'One day, a little girl named Lily found a'

Let's encode every story. `with_transform` applies `encode` on the fly, when rows are accessed.

In [6]:
def encode(b):
    b['text'] = [tokenizer.encode(o) for o in b['text']]
    return b

In [7]:
ds = ds.with_transform(encode)
trn = ds['train']
val = ds['validation']
trn

Dataset({
    features: ['text'],
    num_rows: 2119719
})

Now the dataset returns token ids, so we have to decode them to read the text.

In [8]:
trn[0]['text'][:5]

[3198, 1110, 11, 257, 1310]

In [9]:
tokenizer.decode(trn[0]['text'])

'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'

### Chunk

In [10]:
seq_len = 1024
chunk_sz = seq_len + 1
eot_token = 50256  # id of GPT-2's <|endoftext|> token
eot_tensor = torch.tensor([eot_token])

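Let's start with 1% of the data. Our goal is to append `eot_token` to the end of each story, concatenate everything into one long stream, and chop the stream into chunks of `chunk_sz = seq_len + 1` tokens, so that each chunk yields `seq_len` inputs and `seq_len` shifted targets.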
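
As a toy illustration of this plan, here is a sketch with `seq_len = 4` and three made-up "stories". The helper `pack` and the token values are hypothetical, and `0` stands in for the end-of-text id.

```python
import torch

def pack(seqs, seq_len, eot):
    """Append `eot` to each sequence, concatenate, and cut into rows of seq_len + 1."""
    chunk_sz = seq_len + 1
    flat = torch.cat([torch.cat([torch.tensor(s), torch.tensor([eot])]) for s in seqs])
    n = flat.size(0) // chunk_sz                  # number of complete chunks
    chunks = flat[:n * chunk_sz].view(n, chunk_sz)
    remainder = flat[n * chunk_sz:]               # tokens that do not fill a chunk
    return chunks, remainder

chunks, remainder = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=4, eot=0)
print(chunks)      # rows may span story boundaries, separated by the eot id
print(remainder)
```

Each row holds one extra token so that the inputs (`row[:-1]`) and the targets (`row[1:]`) both have length `seq_len`.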

In [11]:
data_len = trn.num_rows // 100
data_len

21197

In [12]:
seq_tensor = [torch.tensor(o) for o in trn[:data_len]['text']]
seq_tensor[0]

tensor([ 3198,  1110,    11,   257,  1310,  2576,  3706, 20037,  1043,   257,
        17598,   287,   607,  2119,    13,  1375,  2993,   340,   373,  2408,
          284,   711,   351,   340,   780,   340,   373,  7786,    13, 20037,
         2227,   284,  2648,   262, 17598,   351,   607,  1995,    11,   523,
          673,   714, 34249,   257,  4936,   319,   607, 10147,    13,   198,
          198,    43,   813,  1816,   284,   607,  1995,   290,   531,    11,
          366, 29252,    11,   314,  1043,   428, 17598,    13,  1680,   345,
         2648,   340,   351,   502,   290, 34249,   616, 10147,  1701,  2332,
         1995, 13541,   290,   531,    11,   366,  5297,    11, 20037,    11,
          356,   460,  2648,   262, 17598,   290,  4259,   534, 10147,   526,
          198,   198, 41631,    11,   484,  4888,   262, 17598,   290,   384,
        19103,   262,  4936,   319, 20037,   338, 10147,    13,   632,   373,
          407,  2408,   329,   606,   780,   484,   547,  7373, 

In [13]:
cat = torch.cat([torch.cat([s, eot_tensor]) for s in seq_tensor])
cat[:200]

tensor([ 3198,  1110,    11,   257,  1310,  2576,  3706, 20037,  1043,   257,
        17598,   287,   607,  2119,    13,  1375,  2993,   340,   373,  2408,
          284,   711,   351,   340,   780,   340,   373,  7786,    13, 20037,
         2227,   284,  2648,   262, 17598,   351,   607,  1995,    11,   523,
          673,   714, 34249,   257,  4936,   319,   607, 10147,    13,   198,
          198,    43,   813,  1816,   284,   607,  1995,   290,   531,    11,
          366, 29252,    11,   314,  1043,   428, 17598,    13,  1680,   345,
         2648,   340,   351,   502,   290, 34249,   616, 10147,  1701,  2332,
         1995, 13541,   290,   531,    11,   366,  5297,    11, 20037,    11,
          356,   460,  2648,   262, 17598,   290,  4259,   534, 10147,   526,
          198,   198, 41631,    11,   484,  4888,   262, 17598,   290,   384,
        19103,   262,  4936,   319, 20037,   338, 10147,    13,   632,   373,
          407,  2408,   329,   606,   780,   484,   547,  7373, 

In [14]:
cat.shape

torch.Size([4730875])

Let's split the stream into chunks of `chunk_sz` tokens, dropping the incomplete tail for now.

In [15]:
num_complete_segments = cat.size(0) // chunk_sz
num_complete_segments

4615

In [16]:
complete_segments = cat[:num_complete_segments * chunk_sz].view(-1, chunk_sz)
complete_segments.shape

torch.Size([4615, 1025])

> TODO

The leftover tokens after the last complete chunk do not fill a whole `chunk_sz`. We can pad them and use them later.

In [17]:
remainder = cat[num_complete_segments * chunk_sz:]
remainder.shape

torch.Size([500])
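
One possible way to pad the leftover tokens is sketched below with toy values. It assumes the loss is `F.cross_entropy`, whose default `ignore_index` is `-100`, so padded target positions do not contribute to the loss. The toy `remainder`, the vocabulary size of 10, and the random `logits` are stand-ins, not notebook values.

```python
import torch
import torch.nn.functional as F

seq_len, eot_token = 8, 0                    # toy values; the notebook uses 1024 and 50256
chunk_sz = seq_len + 1
remainder = torch.tensor([5, 6, 7, 0, 8])    # leftover tokens, shorter than chunk_sz

n_pad = chunk_sz - remainder.size(0)
padded = F.pad(remainder, (0, n_pad), value=eot_token)   # right-pad up to chunk_sz

inp, targ = padded[:-1].clone(), padded[1:].clone()
if n_pad:
    targ[-n_pad:] = -100                     # positions that cross_entropy should skip

logits = torch.randn(seq_len, 10)            # stand-in for model output, vocab of 10
loss = F.cross_entropy(logits, targ)         # averaged over the non-ignored targets only
print(inp, targ, loss)
```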

### Dataset (!)

Let's create inputs and targets for a dataset.

In [19]:
inps = complete_segments[:, :-1]
targs = complete_segments[:, 1:]
inps.shape, targs.shape

(torch.Size([4615, 1024]), torch.Size([4615, 1024]))

In [20]:
inps[0][:20]

tensor([ 3198,  1110,    11,   257,  1310,  2576,  3706, 20037,  1043,   257,
        17598,   287,   607,  2119,    13,  1375,  2993,   340,   373,  2408])

In [21]:
targs[0][:20]

tensor([ 1110,    11,   257,  1310,  2576,  3706, 20037,  1043,   257, 17598,
          287,   607,  2119,    13,  1375,  2993,   340,   373,  2408,   284])

We can create a dataset now.

In [22]:
trn_ds = Dataset(inps, targs)
trn_ds[0]

(tensor([ 3198,  1110,    11,  ..., 24829,   284,   262]),
 tensor([1110,   11,  257,  ...,  284,  262, 7586]))
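
`Dataset` here comes from the `minai` star import. Its source is not shown in this notebook, but a paired dataset of this kind is presumably close to the sketch below. The class name `PairDataset` is hypothetical and the toy tensor only demonstrates the indexing.

```python
import torch

class PairDataset:
    """Hypothetical stand-in for the `Dataset` imported from minai."""
    def __init__(self, x, y):
        assert len(x) == len(y), "inputs and targets must have the same number of rows"
        self.x, self.y = x, y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]          # one (input, target) pair

segments = torch.arange(10).view(2, 5)       # two toy chunks with chunk_sz = 5
toy_ds = PairDataset(segments[:, :-1], segments[:, 1:])
x0, y0 = toy_ds[0]
print(x0, y0)
```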

We got the training dataset. Now, we can get the validation dataset with the same approach.

In [24]:
val_data_len = val.num_rows // 100
val_data_len

219

In [25]:
val_seq_tensor = [torch.tensor(o) for o in val[:val_data_len]['text']]
val_seq_tensor[0]

tensor([32565,    13, 15899,  2497,   262, 22441,  1097,   290,   531,    11,
          366, 22017,    11, 21168,    11,   534,  1097,   318,   523,  6016,
          290,  3424,  2474, 21168, 13541,   290,  8712,    11,   366, 10449,
          345,    11, 15899,    13,   314, 25245,   340,   790,  1110,   526,
          198,   198,  3260,  2712,   351,   262,  1097,    11, 21168,   290,
        15899,  2936, 47124,    13,  1119,  1043,   257,  1402, 16723,   351,
         1598,  1660,    13,  1119, 24070,   262,  1660,   290,  2936,   845,
         3772,    13,  1119,  2826,  1978,   477,  1110,   290,  2627,  1266,
         2460,    13])

In [26]:
val_cat = torch.cat([torch.cat([s, eot_tensor]) for s in val_seq_tensor])
val_cat[:200]

tensor([32565,    13, 15899,  2497,   262, 22441,  1097,   290,   531,    11,
          366, 22017,    11, 21168,    11,   534,  1097,   318,   523,  6016,
          290,  3424,  2474, 21168, 13541,   290,  8712,    11,   366, 10449,
          345,    11, 15899,    13,   314, 25245,   340,   790,  1110,   526,
          198,   198,  3260,  2712,   351,   262,  1097,    11, 21168,   290,
        15899,  2936, 47124,    13,  1119,  1043,   257,  1402, 16723,   351,
         1598,  1660,    13,  1119, 24070,   262,  1660,   290,  2936,   845,
         3772,    13,  1119,  2826,  1978,   477,  1110,   290,  2627,  1266,
         2460,    13, 50256,  7454,  2402,   257,   640,    11,   287,   257,
         1263,  8222,    11,   612,  5615,   257,  9529,   259,   420, 27498,
         3706,   371, 23536,    13,   371, 23536,  6151,   284, 12080,    13,
         1375, 19952,  7150,    11, 12586,    11,   290, 18639,    13,  1881,
         1110,    11,   371, 23536,  1043,   281, 30284, 12788, 

In [27]:
val_num_complete_segments = val_cat.size(0) // chunk_sz
val_num_complete_segments

45

In [28]:
val_complete_segments = val_cat[:val_num_complete_segments * chunk_sz].view(-1, chunk_sz)
val_complete_segments.shape

torch.Size([45, 1025])

In [29]:
val_inps = val_complete_segments[:, :-1]
val_targs = val_complete_segments[:, 1:]
val_inps.shape, val_targs.shape

(torch.Size([45, 1024]), torch.Size([45, 1024]))

In [30]:
val_ds = Dataset(val_inps, val_targs)
val_ds[0]

(tensor([32565,    13, 15899,  ...,    13,  4186,   373]),
 tensor([   13, 15899,  2497,  ...,  4186,   373,  3772]))

### DataLoader

We need dataloaders that serve the datasets in batches of size `bs`.

In [31]:
bs = 4

trn_dl, val_dl = dls = get_dls(trn_ds, val_ds, bs)
xb,yb = next(iter(trn_dl))
xb.shape,yb.shape

(torch.Size([4, 1024]), torch.Size([4, 1024]))

In [32]:
xb[:5], yb[:5]

(tensor([[   13,  1375,  7224,  ...,    13,   679,   857],
         [ 1719,  1576,  5252,  ...,   262,  3084,   290],
         [  470,  1234,   319,  ...,   257,  7604,   329],
         [ 2474,   198,   198,  ...,   262, 47009,   290]]),
 tensor([[ 1375,  7224,   262,  ...,   679,   857,   407],
         [ 1576,  5252,   284,  ...,  3084,   290,  1816],
         [ 1234,   319,   465,  ...,  7604,   329,   606],
         [  198,   198,    50,  ..., 47009,   290,   547]]))
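
`get_dls` also comes from `minai`. For readers without that library, roughly the same thing can be built with plain PyTorch, as in this sketch. The helper `make_dls` is hypothetical and may differ from `get_dls` in details such as the validation batch size; the toy tensors replace the real chunks.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_dls(trn_ds, val_ds, bs):
    """Hypothetical equivalent of get_dls: shuffle training data only."""
    return (DataLoader(trn_ds, batch_size=bs, shuffle=True),
            DataLoader(val_ds, batch_size=bs))

toy = torch.arange(60).view(12, 5)           # 12 toy chunks with chunk_sz = 5
trn_toy = TensorDataset(toy[:8, :-1], toy[:8, 1:])
val_toy = TensorDataset(toy[8:, :-1], toy[8:, 1:])

toy_trn_dl, toy_val_dl = make_dls(trn_toy, val_toy, bs=4)
toy_xb, toy_yb = next(iter(toy_trn_dl))
print(toy_xb.shape, toy_yb.shape)
```

Shuffling changes the order of the rows but not their contents, so every target row is still its input row shifted by one token.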

### Model

We build the model as a transformer.

In [33]:
import torch.nn as nn

Here's the `MultiHeadAttention`.

In [72]:
x = torch.randn((1, 2, 3)) # (bs, ctx_len, d_in)

In [88]:
class MultiHeadAttention(nn.Module):
    def __init__(self, n_head, d_in, d_out, ctx_len, qkv_bias=False):
        super().__init__()
        self.n_head = n_head
        self.d_in = d_in
        self.d_out = d_out
        self.ctx_len = ctx_len
        self.head_dim = d_out // n_head
        self.w_q = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.w_k = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.w_v = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.mask = nn.Buffer(torch.triu(torch.ones((ctx_len, ctx_len)), diagonal=1).bool())
    
    def forward(self, x): 
        bs, ctx_len, d_in = x.shape
        q = self.w_q(x)  # (bs, ctx_len, d_out)
        k = self.w_k(x)
        v = self.w_v(x)
        
        q = q.view(bs, ctx_len, self.n_head, self.head_dim)  # (bs, ctx_len, n_head, head_dim)
        k = k.view(bs, ctx_len, self.n_head, self.head_dim)
        v = v.view(bs, ctx_len, self.n_head, self.head_dim)
        
        q = q.transpose(1,2) # (bs, n_head, ctx_len, head_dim)
        k = k.transpose(1,2)
        v = v.transpose(1,2)
        
        attn_scr = q@k.transpose(2,3) # (bs, n_head, ctx_len, ctx_len)
        
        # mask out future positions; slice so inputs shorter than ctx_len also work
        attn_scr = attn_scr.masked_fill(self.mask[:ctx_len, :ctx_len], -torch.inf)
        
        # attn_wt
        attn_wt = torch.softmax(attn_scr / k.shape[-1]**0.5, -1)
        # ctx_vec
        ctx_vec = attn_wt@v  # (bs, n_head, ctx_len, head_dim)
        ctx_vec = ctx_vec.transpose(1,2).reshape(bs, ctx_len, -1) # (bs, ctx_len, d_out)
        
        # the reshape above already concatenated the heads along the last dim
        return ctx_vec

In [89]:
mh = MultiHeadAttention(n_head=2, d_in=3, d_out=4, ctx_len=2)
mh(x)

tensor([[[-0.2039,  0.4090,  0.4439,  0.0045],
         [-0.3425, -0.1251, -0.1322, -0.0244]]], grad_fn=<UnsafeViewBackward0>)
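
As a sanity check on the attention math, the masked softmax computation can be compared with PyTorch's built-in `F.scaled_dot_product_attention` (available since PyTorch 2.0). This sketch works directly on random `q`, `k`, `v` tensors rather than on the class above, and the toy shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, n_head, ctx_len, head_dim = 2, 2, 5, 4
q, k, v = (torch.randn(bs, n_head, ctx_len, head_dim) for _ in range(3))

# manual causal attention, same steps as in MultiHeadAttention.forward
mask = torch.triu(torch.ones(ctx_len, ctx_len), diagonal=1).bool()
scores = (q @ k.transpose(2, 3)).masked_fill(mask, -torch.inf)
manual = torch.softmax(scores / head_dim**0.5, dim=-1) @ v

# fused kernel with a causal mask and the default 1/sqrt(head_dim) scaling
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)
match = torch.allclose(manual, fused, atol=1e-5)
print(match)
```

If the two agree, swapping the manual steps for the fused call inside `forward` is an option worth trying, since it can be faster and use less memory.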