## Implemented two simple tokenizers from scratch and demonstrated the tiktoken library.
Dataset used: 'The Verdict' by Edith Wharton (1908)
Embedding: The process of converting data into a vector format.
Implementing the first step of data preparation and sampling: tokenization.

Step 1: Creating tokens (word based tokenizers) or tokenizing text

In [None]:
import os
import urllib.request

In [2]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

## BYTE PAIR ENCODING
Implementing BPE from scratch can be relatively complicated, so we use an existing open-source Python library called tiktoken, a fast BPE tokenizer for use with OpenAI's models.

Encode: Tokenization and conversion into token IDs.

In [3]:
import torch
import tiktoken
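import importlib.metadata  # import the submodule explicitly; a bare "import importlib" does not guarantee it is loaded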
print('tiktoken version: ', importlib.metadata.version("tiktoken"))

tiktoken version:  0.9.0


In [4]:
tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
text = ( "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace." )

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [6]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The BPE tokenizer can handle unknown words. How does it achieve this without using an <|unk|> token?


The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.
This enables it to handle out-of-vocabulary (OOV) words.

An example to illustrate how the BPE tokenizer deals with unknown tokens.

In [7]:
integers = tokenizer.encode("Akwirw ier")
print(f"integers: {integers}")
print(f"type: {type(integers)}")

strings = tokenizer.decode(integers)
print(f"strings: {strings}")

integers: [33901, 86, 343, 86, 220, 959]
type: <class 'list'>
strings: Akwirw ier
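
The fallback to smaller units can be sketched with a toy merge procedure. This is a simplified illustration, not tiktoken's implementation: the function name `toy_bpe` and the merge rules are hypothetical, and the real GPT-2 tokenizer works on bytes with a much larger learned merge table.

```python
# Toy illustration only: these merge rules are hypothetical and are NOT GPT-2's real merges.
def toy_bpe(word, merges):
    """Split a word into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge an adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("t", "e"), ("te", "a"), ("i", "e"), ("ie", "r")]

known = toy_bpe("tea", merges)       # every merge applies: one token
mixed = toy_bpe("tier", merges)      # partly covered: subword pieces
unknown = toy_bpe("Akwirw", merges)  # no merge applies: single characters

print(known)
print(mixed)
print(unknown)
```

Because every string can always be split into its smallest units, no input is ever "unknown"; it is only split into more pieces.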


## Creating input-output pairs

We implement a data loader that fetches the input-output pairs using a sliding window approach.

In [8]:
enc_text = tokenizer.encode(raw_text)
print(f"Length of enc_text: {len(enc_text)}")
enc_sample = enc_text[50:]

Length of enc_text: 5145


## Context size 
Determines how many tokens are included in the input. The model is trained to look at a sequence of context_size tokens to predict the next token in the sequence.

Each input-output pair contains context_size prediction tasks: at every position in the input, the model predicts the token that follows the tokens seen so far.
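
A minimal sketch of how one input-output pair unrolls into several prediction tasks. The token IDs in `sample` are hypothetical placeholders, not values from the dataset.

```python
# Hypothetical token IDs used only for illustration (not taken from the dataset).
sample = [101, 102, 103, 104, 105]
context_size = 4

x = sample[:context_size]        # input tokens
y = sample[1:context_size + 1]   # targets: the input shifted by one position

tasks = []
for i in range(1, context_size + 1):
    context = sample[:i]   # tokens the model has seen so far
    desired = sample[i]    # the token it should predict next
    tasks.append((context, desired))
    print(context, "---->", desired)
```

With context_size = 4, the loop prints four context-target lines, one per prediction task.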

## Data Loader
Iterates over the input dataset and returns inputs and targets as PyTorch tensors. We are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor containing the targets for the LLM to predict. We implement the data loader using PyTorch's Dataset and DataLoader classes, which also let us process several samples together in batches.

In [9]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        self.token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        for i in range(0, len(self.token_ids)-max_length, stride):
            input_chunk = self.token_ids[i : max_length+i]
            target_chunk = self.token_ids[i+1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
        
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx] #idx=index
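
The effect of max_length and stride can be seen on a toy list of token IDs. The sketch below mirrors the loop in GPTDatasetV1 in plain Python, without tensors; the helper name `sliding_windows` and the integer IDs are assumptions made for illustration.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; targets are the inputs shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: consecutive inputs do not overlap
no_overlap = sliding_windows(token_ids, max_length=4, stride=4)

# stride == 1: each input is shifted by a single token, so inputs overlap heavily
overlap = sliding_windows(token_ids, max_length=4, stride=1)

print(len(no_overlap), no_overlap)
print(len(overlap), overlap[-1])
```

A smaller stride yields more samples from the same text, at the cost of overlapping inputs.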

drop_last=True: drops the last batch if it is shorter than the specified batch_size, to prevent loss spikes during training.

batch_size = number of samples (input-target pairs) grouped into each batch and processed together

max_length = context length

num_workers = number of worker subprocesses used to load data in parallel (0 means data is loaded in the main process).
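
To make batch_size and drop_last concrete, here is a plain-Python sketch of the grouping behavior. It is not PyTorch's implementation; `make_batches` is a hypothetical helper and the integer samples stand in for input-target pairs.

```python
def make_batches(samples, batch_size, drop_last):
    """Group samples into consecutive batches of batch_size items."""
    batches = [samples[i : i + batch_size] for i in range(0, len(samples), batch_size)]
    # Optionally discard a final batch that is shorter than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

samples = list(range(10))  # ten stand-in samples

kept = make_batches(samples, batch_size=4, drop_last=False)
dropped = make_batches(samples, batch_size=4, drop_last=True)

print(kept)     # the last batch holds only two samples
print(dropped)  # the short batch is removed
```

With ten samples and batch_size=4, keeping the last batch gives three batches, while drop_last=True gives two full ones.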

In [10]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("o200k_base")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #creating dataset
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader

#This function uses GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader.
#The batch size governs how many samples are grouped and processed together.
#It helps us create the input-output pairs from the dataset class defined earlier.

We now convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function.

In [11]:
print("PyTorch Version: ", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs: ", inputs)
print("targets: ", targets)

PyTorch Version:  2.6.0
Inputs:  tensor([[    40, 148954,   3324,   4525],
        [ 10874, 165003,  33750,   7542],
        [   261,  12424,  59245,    375],
        [  6460,    261,   1899,  19807],
        [  4951,    375,    786,    480],
        [   673,    860,   2212,  19005],
        [   316,    668,    316,   9598],
        [   484,     11,    306,    290]])
targets:  tensor([[148954,   3324,   4525,  10874],
        [165003,  33750,   7542,    261],
        [ 12424,  59245,    375,   6460],
        [   261,   1899,  19807,   4951],
        [   375,    786,    480,    673],
        [   860,   2212,  19005,    316],
        [   668,    316,   9598,    484],
        [    11,    306,    290,   4679]])


In [12]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[  4679,    328,   1232,  40373],
        [    11,    501,   1458,  22664],
        [  1232,  21352,     11,  17189],
        [   261,  10358, 101819,     11],
        [   326,  12812,  11166,    306],
        [   261,  38350,    402,    290],
        [123397,     13,    350,  52861],
        [   357,   7542,   4525,    480]]), tensor([[   328,   1232,  40373,     11],
        [   501,   1458,  22664,   1232],
        [ 21352,     11,  17189,    261],
        [ 10358, 101819,     11,    326],
        [ 12812,  11166,    306,    261],
        [ 38350,    402,    290, 123397],
        [    13,    350,  52861,    357],
        [  7542,   4525,    480,   1481]])]


Small batch sizes require less memory during training but lead to noisier model updates. Batch size is a trade-off and a hyperparameter to experiment with when training LLMs. The model processes one batch before making each parameter update.