## Implemented two simple tokenizers from scratch and demonstrated tiktoken library.
Dataset used: 'The Verdict' by Edith Wharton (1908)

Implementing the first step of data preparation and sampling: tokenization

Step 1: Creating tokens, i.e. tokenizing the text (word-based tokenizer)

In [1]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
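
The cell above only loads the text. A minimal sketch of the word-based splitting step could look like the following; the regular expression and the short stand-in sentence are assumptions chosen for illustration (in the notebook, `raw_text` would take the place of `sample`).

```python
import re

# A short stand-in sentence; in the notebook this would be raw_text.
sample = "Hello, world. Is this-- a test?"

# Split on punctuation, double dashes, and whitespace, keeping the delimiters
# (the capturing group makes re.split return them as separate items).
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)

# Drop empty strings and whitespace-only pieces.
tokens = [p.strip() for p in pieces if p.strip()]
print(tokens)
```

Whitespace is discarded here to keep the token list short; keeping it would be the better choice for text where exact spacing matters, such as source code.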

Step 2: Creating Token IDs

Implementing a Python class for tokenization.
This class has two methods, encode and decode.

Step 1: Store the vocabulary as class attribute for access in the encode and decode method.

Step 2: Create an inverse vocabulary that maps token IDs back to the original text tokens.

Step 3: Process input text into token IDs

Step 4: Convert token IDs back into text
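
The four steps above can be sketched as a small class. This is only an illustrative version: the class name `SimpleTokenizer`, the regular expressions, and the tiny vocabulary are assumptions, not necessarily what the original notebook used.

```python
import re

class SimpleTokenizer:
    """Word-level tokenizer sketch; names and regexes are illustrative assumptions."""

    def __init__(self, vocab):
        self.str_to_int = vocab                             # Step 1: store the vocabulary
        self.int_to_str = {i: s for s, i in vocab.items()}  # Step 2: inverse vocabulary

    def encode(self, text):
        # Step 3: split the text into tokens, then look up each token's ID.
        pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [p.strip() for p in pieces if p.strip()]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        # Step 4: map IDs back to tokens and join them with spaces.
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that join() puts in front of punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

# Build a toy vocabulary from a handful of tokens.
words = ["Hello", ",", "world", ".", "Is", "this", "a", "test", "?"]
vocab = {token: idx for idx, token in enumerate(sorted(set(words)))}

tok = SimpleTokenizer(vocab)
ids = tok.encode("Hello, world. Is this a test?")
print(ids)
print(tok.decode(ids))
```

A word that is missing from the vocabulary raises a `KeyError` in `encode`, which is the limitation that an `<|unk|>` token or a subword tokenizer addresses.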

## BYTE PAIR ENCODING
Implementing BPE from scratch can be relatively complicated, so we use an existing open-source Python library called tiktoken, a fast BPE tokenizer for use with OpenAI's models.

In [2]:
import tiktoken
import importlib.metadata
print('tiktoken version: ', importlib.metadata.version("tiktoken"))

tiktoken version:  0.8.0


In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

The BPE tokenizer can handle unknown words. How can it achieve this without using an <|unk|> token?

The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.
This enables it to handle out-of-vocabulary (OOV) words.

An example to illustrate how the BPE tokenizer deals with unknown tokens
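
To make the fallback idea concrete without depending on tiktoken, here is a toy sketch. It is not the BPE merge procedure and not what tiktoken does internally (the GPT-2 tokenizer works on bytes and applies learned merges); it is a simplified greedy longest-match splitter over a made-up vocabulary. The function name `greedy_subword_split` and `toy_vocab` are hypothetical. It only shows why no `<|unk|>` token is needed: anything that is not a known piece decomposes into smaller pieces, down to single characters.

```python
def greedy_subword_split(word, vocab):
    """Split word into the longest pieces found in vocab, falling back to single characters."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # A single character is always accepted as the last resort.
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces

# Made-up subword vocabulary for illustration only.
toy_vocab = {"some", "unknown", "Place", "un", "known"}

known_pieces = greedy_subword_split("someunknownPlace", toy_vocab)
char_fallback = greedy_subword_split("xyz", toy_vocab)
print(known_pieces)
print(char_fallback)
```

The first call splits the unseen word into known subwords, while the second falls all the way back to individual characters.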

## Creating input-output pairs

We implement a data loader that fetches the input-output pairs using a sliding window approach.

In [4]:
enc_text = tokenizer.encode(raw_text)
enc_sample = enc_text[50:]

## Context size
The context size determines how many tokens are included in the input. The model is trained to look at a sequence of context_size tokens to predict the next token in the sequence.

Each input-output pair contains context_size prediction tasks, one for every prefix of the input.

In [5]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x:  {x}")
print(f"y:    {y}")

x:  [290, 4920, 2241, 287]
y:    [4920, 2241, 287, 257]


In [6]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [7]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


## Data Loader
The data loader iterates over the input dataset and returns the inputs and targets as PyTorch tensors.

We implement it using PyTorch's Dataset and DataLoader classes.

It returns two tensors: an input tensor and a target tensor.

Batching lets the model process several samples in parallel.

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Tokenize the entire text once
        self.token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Slide a window of max_length tokens over the text in steps of stride
        for i in range(0, len(self.token_ids) - max_length, stride):
            input_chunk = self.token_ids[i : i + max_length]
            # Targets are the inputs shifted one token to the right
            target_chunk = self.token_ids[i + 1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Number of (input, target) pairs
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Return the (input, target) pair at index idx
        return self.input_ids[idx], self.target_ids[idx]

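drop_last = if True, drops the last batch if it is shorter than the specified batch_size, to prevent loss spikes during training.

batch_size = number of samples (input-target pairs) in each batch.

max_length = context length, i.e. the number of tokens in each input sequence.

stride = number of token positions the window moves between consecutive samples.

num_workers = number of worker subprocesses used for data loading; 0 loads the data in the main process.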
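
To see how these parameters interact without PyTorch, here is a plain-Python sketch. The helper names `sliding_windows` and `batches` are hypothetical, and the token IDs are stand-in integers. The window loop mirrors the one in GPTDatasetV1, while `batches` only imitates the grouping that a DataLoader performs with shuffle=False.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length],
                      token_ids[i + 1:i + max_length + 1]))
    return pairs

def batches(samples, batch_size, drop_last):
    """Group samples into consecutive batches, optionally dropping a short final batch."""
    out = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()
    return out

ids = list(range(10))  # stand-in token IDs 0..9

# stride == max_length: windows do not overlap.
pairs_s4 = sliding_windows(ids, max_length=4, stride=4)
print(pairs_s4)

# stride == 1: consecutive windows overlap in all but one token.
pairs_s1 = sliding_windows(ids, max_length=4, stride=1)
print(len(pairs_s1))

# Six samples with batch_size=4: the second batch holds only two samples.
print(len(batches(pairs_s1, batch_size=4, drop_last=False)))
print(len(batches(pairs_s1, batch_size=4, drop_last=True)))
```

A stride equal to max_length uses every token once as an input, whereas a smaller stride yields more, overlapping samples from the same text.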

In [14]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("o200k_base")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #creating dataset
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader

# This function handles batching: batch_size sets how many samples are processed together.
# It creates the input-output pairs from the dataset class defined earlier.

We now convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function.

In [24]:
import torch
print("PyTorch Version: ", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs: ", inputs)
print("targets: ", targets)

PyTorch Version:  2.5.1+cpu
Inputs:  tensor([[    40, 148954,   3324,   4525],
        [ 10874, 165003,  33750,   7542],
        [   261,  12424,  59245,    375],
        [  6460,    261,   1899,  19807],
        [  4951,    375,    786,    480],
        [   673,    860,   2212,  19005],
        [   316,    668,    316,   9598],
        [   484,     11,    306,    290]])
targets:  tensor([[148954,   3324,   4525,  10874],
        [165003,  33750,   7542,    261],
        [ 12424,  59245,    375,   6460],
        [   261,   1899,  19807,   4951],
        [   375,    786,    480,    673],
        [   860,   2212,  19005,    316],
        [   668,    316,   9598,    484],
        [    11,    306,    290,   4679]])


In [19]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[148954,   3324,   4525,  10874]]), tensor([[  3324,   4525,  10874, 165003]])]


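The second batch shown above comes from a run with batch_size=1, max_length=4, and stride=1, which is why it differs from the batch_size=8 example. A batch size of 1 is used for illustration purposes. Small batch sizes require less memory during training but lead to noisier model updates.

Batch size is a trade-off and a hyperparameter to experiment with when training LLMs.

The model processes one batch before making each parameter update.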

## TOKEN EMBEDDINGS