<center><h1> Load The Dataset </h1></center>


For this Small Language Model we will use a dataset called **TinyStories**. It is a synthetic dataset of short stories that use only words a typical 3- to 4-year-old understands, generated by **GPT-3.5** and **GPT-4**. We can get it from [Hugging Face](https://huggingface.co/datasets/roneneldan/TinyStories).

In [1]:
!pip install datasets



In [4]:
from datasets import load_dataset
ds = load_dataset("roneneldan/TinyStories")

<center><h1>Tokenize The Dataset</h1></center>

- **Tokenization** is the process of breaking down a sequence of text into smaller units called tokens.
The tokenizer we will use is the **GPT-2 sub-word tokenizer**, which uses **Byte Pair Encoding (BPE)**.

- Dataset -> Tokenizer -> Tokens -> TokenID
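
To make the pipeline above concrete, here is a minimal toy sketch of the BPE idea: start from single characters, repeatedly merge the most frequent adjacent pair, then map the resulting tokens to integer IDs. This is an illustration only, not the GPT-2 tokenizer itself (which works on bytes with a fixed, pre-trained merge table); the helper names `most_frequent_pair`, `merge_pair`, and `toy_bpe` are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of symbols, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Start from single characters and apply up to `num_merges` greedy merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens


text = "aaabdaaabac"
tokens = toy_bpe(text, num_merges=2)

# Toy vocabulary: assign an integer ID to every distinct token (Tokens -> TokenID).
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]
print(tokens)
print(token_ids)
```

Each merge shortens the sequence, which is why frequent character runs end up as single tokens.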

**In this step, we will:**

1. Tokenize the dataset into `tokenIDs`.
2. Create two binary files:
   - `train.bin`, built from the training split (about **2,000,000** rows)
   - `validation.bin`, built from the validation split (about **22,000** rows)
3. Store all the token IDs of each split in a single .bin file.
   - This keeps the token IDs on disk, not in RAM.
      - Fast loading during training
      - No need to re-tokenize

These files will store the `tokenIDs` generated from the entire dataset.
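
As a rough illustration of why a flat binary file works well here, the sketch below writes a handful of token IDs as unsigned 16-bit integers and then reads back only a small window through a memory map. It uses only the standard library and assumes little-endian byte order; the notebook itself uses `np.memmap` for the same purpose. The helper names and the sample IDs are hypothetical.

```python
import mmap
import os
import struct
import tempfile

BYTES_PER_TOKEN = 2  # an unsigned 16-bit integer holds values up to 65535


def write_token_file(path, ids):
    """Write token IDs back to back as little-endian uint16 values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(ids)}H", *ids))


def read_token_slice(path, start, stop):
    """Read tokens [start, stop) through a memory map, without loading the whole file."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[start * BYTES_PER_TOKEN : stop * BYTES_PER_TOKEN]
    return list(struct.unpack(f"<{stop - start}H", raw))


# Hypothetical token IDs standing in for tokenizer output (50256 is the largest GPT-2 ID).
token_ids = [464, 2068, 7586, 21831, 50256, 0, 11]

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_tokens.bin")
write_token_file(path, token_ids)

file_size = os.path.getsize(path)      # 2 bytes per token
window = read_token_slice(path, 2, 5)  # only these three tokens are decoded
print(file_size, window)
```

Because every token occupies exactly two bytes, token `i` always starts at byte `2 * i`, so any window can be located without scanning the file.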

In [13]:
!pip install tiktoken #Tiktoken is a library from OpenAI from which we can get different tokenizers.
import tiktoken
import os
import numpy as np
from tqdm.auto import tqdm

enc = tiktoken.get_encoding("gpt2")

def process(example):
    ids = enc.encode_ordinary(example['text']) #encode_ordinary ignores any special tokens
    out = {'ids': ids, 'len': len(ids)}
    return out

if not os.path.exists('train.bin'):
    tokenized = ds.map(
        process,
        remove_columns=['text'],
        desc= 'tokenizing the splits',
        num_proc= 8,
    )

    #Concatenate all the ids in each split into one large file which will be used for training.
    #The loop sits inside the `if` so that `tokenized` is always defined when it runs.
    for split, dset in tokenized.items():
        arr_len = np.sum(dset['len'], dtype= np.uint64)
        filename = f'{split}.bin'
        dtype = np.uint16 #can do since enc.max_token_value == 50256 is < 2**16
        arr = np.memmap(filename, dtype = dtype, mode = 'w+', shape= (arr_len,))
        total_batches = 1024

        idx = 0
        for batch_idx in tqdm(range(total_batches), desc=f'writing {filename}'):
            #Batch samples together for faster write
            batch = dset.shard(num_shards = total_batches, index = batch_idx, contiguous = True).with_format('numpy')
            arr_batch = np.concatenate(batch['ids'])
            #Write into the memory map at the current offset
            arr[idx : idx + len(arr_batch)] = arr_batch
            idx += len(arr_batch)
        arr.flush()



writing train.bin: 100%|██████████| 1024/1024 [06:08<00:00,  2.78it/s]
writing validation.bin: 100%|██████████| 1024/1024 [00:05<00:00, 185.68it/s]
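
The write loop above can be pictured with a small pure-Python sketch: rows are split into contiguous shards, each shard's token IDs are concatenated, and the result is copied into a preallocated flat buffer at a running offset. The splitting rule in `contiguous_shard_bounds` is an assumption chosen for illustration and is not necessarily how `Dataset.shard` divides rows internally; the rows are made-up data.

```python
def contiguous_shard_bounds(num_rows, num_shards, index):
    """Return (start, stop) row indices of shard `index` for near-equal contiguous chunks."""
    base, extra = divmod(num_rows, num_shards)
    start = index * base + min(index, extra)
    stop = start + base + (1 if index < extra else 0)
    return start, stop


# Hypothetical tokenized rows of different lengths.
rows = [[5, 6], [7], [8, 9, 10], [11], [12, 13]]

total_tokens = sum(len(row) for row in rows)
flat = [0] * total_tokens  # stands in for the preallocated memmap
num_shards = 3

offset = 0
for shard_idx in range(num_shards):
    start, stop = contiguous_shard_bounds(len(rows), num_shards, shard_idx)
    # Concatenate every row of this shard into one chunk.
    chunk = [tok for row in rows[start:stop] for tok in row]
    # Copy the chunk into the flat buffer and advance the write position.
    flat[offset : offset + len(chunk)] = chunk
    offset += len(chunk)

print(flat)
```

Keeping the shards contiguous means the tokens land in the file in the same order as the rows appear in the dataset.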


<center><h1>Input-Output Batches</h1></center>

- **Input Batch**: A group of tokenized sequences fed into the model simultaneously during training or inference. Each sequence represents a segment of text, and the entire batch enables efficient parallel processing.

- **Output Batch**: The corresponding group of target sequences that the model is trained to predict. Each target is the input sequence shifted by one token, so the target at every position is the next token in the text. These are used to compute the loss during training.

In [6]:
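import torch

# block_size, batch_size, device and device_type must be defined before this function is called
def get_batch(split):
    # Memory-map the token file so only the sampled windows are read from disk
    if split == 'train':
        data = np.memmap('train.bin', dtype = np.uint16, mode = 'r')
    else:
        data = np.memmap('validation.bin', dtype = np.uint16, mode = 'r')

    # Pick batch_size random start positions, leaving room for the shifted target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i+block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i+1:i+1+block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # Pinned host memory allows the copy to the GPU to run asynchronously
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y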
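
To see the input/target relationship without PyTorch, here is a minimal pure-Python sketch of the same sampling idea on a made-up token list. The function name `sample_windows` and the toy data are hypothetical; only the windowing logic mirrors `get_batch`.

```python
import random


def sample_windows(data, block_size, batch_size, rng):
    """Sample random input windows and their targets shifted one token ahead."""
    # Exclusive upper bound on the start index, so the shifted target still fits.
    max_start = len(data) - block_size
    starts = [rng.randrange(max_start) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + 1 + block_size] for i in starts]
    return x, y


# Made-up token stream: consecutive integers make the one-token shift easy to see.
data = list(range(100, 120))
x, y = sample_windows(data, block_size=4, batch_size=3, rng=random.Random(0))

for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

In every pair, the target repeats the input from its second token onward and adds one new token at the end, which is the next-token prediction objective.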