## Implemented two simple tokenizers from scratch and demonstrated the tiktoken library
Dataset used: 'The Verdict' by Edith Wharton (1908)
Embedding: The process of converting data into a vector format.
Implementing the first step of data preparation and sampling: tokenization

Step 1: Creating tokens (word-based tokenizers), i.e. tokenizing text

## BYTE PAIR ENCODING
Implementing BPE from scratch can be relatively complicated, so we use tiktoken, an existing open-source Python library that provides a fast BPE tokenizer for use with OpenAI's models.

In [70]:
import tiktoken
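import importlib.metadata  # import the submodule explicitly; a bare "import importlib" does not guarantee it is loaded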
print('tiktoken version: ', importlib.metadata.version("tiktoken"))

tiktoken version:  0.9.0


In [71]:
tokenizer = tiktoken.get_encoding("gpt2")

In [72]:
text = ( "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace." )

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [73]:
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The BPE tokenizer can handle unknown words. How does it achieve this without an <|unk|> token?


The algorithm underlying BPE breaks down words that aren't in its predefined vocabulary into smaller subword units or even individual characters.
This enables it to handle out-of-vocabulary (OOV) words.

An example to illustrate how the BPE tokenizer deals with unknown tokens

In [74]:
integers = tokenizer.encode("Akwirw ier")
print("integers: ", integers)

strings = tokenizer.decode(integers)
print("strings: ", strings)

integers:  [33901, 86, 343, 86, 220, 959]
strings:  Akwirw ier
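
To make the fallback to smaller units concrete, here is a minimal toy sketch of BPE. It is an illustration under simplifying assumptions, not tiktoken's implementation: it works on characters rather than bytes, and the function names (learn_bpe_merges, apply_merge, bpe_segment) are hypothetical. It learns merge rules from a tiny word list and then segments both a seen and an unseen word.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE training)."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere and remember the rule.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def bpe_segment(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = learn_bpe_merges(["tea", "tea", "team"], num_merges=2)
print(merges)
print(bpe_segment("team", merges))    # seen word: built from learned subwords
print(bpe_segment("Akwirw", merges))  # unseen word: stays as single characters
```

Because every word can always be expressed as a sequence of its smallest units, no unknown-word token is ever required; the GPT-2 tokenizer applies the same idea at the byte level with a much larger set of merges.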


## Creating input-output pairs

We implement a data loader that fetches the input-output pairs using a sliding window approach.

In [80]:
enc_text = tokenizer.encode(raw_text)
print(f"Length of enc_text: {len(enc_text)}")
enc_sample = enc_text[50:]

Length of enc_text: 5145


## Context size
Determines how many tokens are included in the input. The model is trained to look at a sequence of context_size tokens to predict the next token in the sequence.

Each input-output pair contains context_size prediction tasks.

In [81]:
context_size = 4
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x:  {x}")
print(f"y:    {y}")

x:  [290, 4920, 2241, 287]
y:    [4920, 2241, 287, 257]


In [82]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "---->", desired)

[290] ----> 4920
[290, 4920] ----> 2241
[290, 4920, 2241] ----> 287
[290, 4920, 2241, 287] ----> 257


In [83]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

 and ---->  established
 and established ---->  himself
 and established himself ---->  in
 and established himself in ---->  a


## Data Loader
Iterates over the input dataset and returns inputs and targets as PyTorch tensors. We are interested in returning two tensors: an input tensor containing the text that the LLM sees and a target tensor containing the targets for the LLM to predict. We implement the data loader using PyTorch's Dataset and DataLoader classes, which also let us process the data in batches and, optionally, load it in parallel.

In [84]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        self.token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        for i in range(0, len(self.token_ids)-max_length, stride):
            input_chunk = self.token_ids[i : max_length+i]
            target_chunk = self.token_ids[i+1 : i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
        
    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx] #idx=index

drop_last = if True, drops the last batch if it is shorter than the specified batch_size, to prevent loss spikes during training.

batch_size = the number of samples (input-target pairs) in each batch.

max_length = the context length, i.e. the number of tokens in each input chunk.

num_workers = the number of worker subprocesses used to load data in parallel (0 loads the data in the main process).
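
To make the roles of max_length and stride concrete, here is a small framework-free sketch of the sliding window used in GPTDatasetV1. The helper name sliding_windows and the stand-in token IDs are hypothetical; the loop bounds mirror the class above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one token."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


token_ids = list(range(10))  # stand-in for real token IDs

# stride == max_length: consecutive inputs do not overlap
non_overlapping = sliding_windows(token_ids, max_length=4, stride=4)
for x, y in non_overlapping:
    print(x, "---->", y)

# stride == 1: consecutive inputs overlap in all but one token
overlapping = sliding_windows(token_ids, max_length=4, stride=1)
print(len(overlapping), "windows with stride=1")

# With drop_last=True only full batches are kept
batch_size = 4
print(len(overlapping) // batch_size, "full batch(es) of size", batch_size)
```

A smaller stride yields more training samples from the same text at the cost of more overlap between them; a stride equal to max_length uses each token as an input once.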

In [None]:
def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")  # same encoding as above; its 50257-token vocabulary matches vocab_size below
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) #creating dataset
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader

# This function wraps the dataset in a DataLoader, which groups the input-target pairs
# produced by GPTDatasetV1 into batches of batch_size samples.

We now convert the dataloader into a Python iterator to fetch the next entry via Python's built-in next() function.

In [None]:
import torch
print("PyTorch Version: ", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs: ", inputs)
print("targets: ", targets)

In [None]:
second_batch = next(data_iter)
print(second_batch)

Small batch sizes, such as 1, are useful for illustration purposes. They require less memory during training but lead to noisier model updates.

Batch size is a trade-off and a hyperparameter to experiment with when training LLMs.

The model processes one batch before each parameter update.

## TOKEN EMBEDDINGS

For demonstration purposes, we create an embedding layer for a vocabulary of size 6 that maps each token ID to a vector in R^3.
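
A minimal sketch of such a layer follows. The seed value and the sample token IDs are arbitrary choices made for illustration. It also shows that an embedding lookup is simply row selection from the layer's weight matrix.

```python
import torch

torch.manual_seed(123)  # arbitrary seed, only to make the random initial weights reproducible

vocab_size = 6   # six possible token IDs: 0..5
output_dim = 3   # each token ID maps to a vector in R^3

embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(embedding_layer.weight.shape)  # one row of 3 values per token ID

token_ids = torch.tensor([2, 3, 5, 1])  # arbitrary example IDs
vectors = embedding_layer(token_ids)
print(vectors.shape)

# The lookup returns exactly the rows of the weight matrix indexed by the token IDs.
print(torch.equal(vectors, embedding_layer.weight[token_ids]))
```

The weights start out random and are optimized during training like any other model parameter.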

## Positional encoding/embedding

In [None]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

In [None]:
max_length = 4
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=max_length, stride=max_length, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

In [None]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Now we convert each token ID to a 256-dimensional vector.

In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

Embedding layer for positional embedding

In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [None]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

In [None]:

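# Add the positional embeddings to the token embeddings to obtain the input embeddings
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)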
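
The addition above works even though the two tensors have different shapes, because PyTorch broadcasts the positional embeddings across the batch dimension. The sketch below illustrates this with deliberately small, hypothetical sizes (a toy vocabulary of 10 tokens and 8-dimensional embeddings) instead of the values used in the notebook.

```python
import torch

torch.manual_seed(123)  # arbitrary seed for reproducible random weights

# Small, hypothetical sizes chosen only for illustration
batch_size, context_length, output_dim = 2, 4, 8
vocab_size = 10

token_layer = torch.nn.Embedding(vocab_size, output_dim)
pos_layer = torch.nn.Embedding(context_length, output_dim)

token_ids = torch.randint(0, vocab_size, (batch_size, context_length))
tok = token_layer(token_ids)                   # shape: (batch_size, context_length, output_dim)
pos = pos_layer(torch.arange(context_length))  # shape: (context_length, output_dim)

# pos has no batch dimension, so it is broadcast: the same positional
# vectors are added to every sequence in the batch.
combined = tok + pos
print(tok.shape, pos.shape, combined.shape)
```

Every sequence in the batch therefore receives identical positional vectors, while the token vectors differ according to the token IDs.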

In [1]:
import torch

In [3]:
inputs = torch.tensor([[0.43, 0.15, 0.89], #your
                       [0.55, 0.87, 0.66], #journey
                       [0.57, 0.85, 0.64], #starts
                       [0.22, 0.58, 0.33], #with
                       [0.77, 0.25, 0.10], #one
                       [0.05, 0.80, 0.55]]) #step

In [5]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [7]:
words = ["Your", "journey", "starts", "with", "one", "step"]

x_coords = inputs[:, 0].numpy()
y_coords = inputs[:, 1].numpy()
z_coords = inputs[:, 2].numpy()