Creating input-target pairs. We implement a data loader that fetches the input-target pairs using a sliding window approach. We start by tokenizing the whole of "The Verdict" story using the BPE tokenizer.

In [2]:
import tiktoken
tokenizer = tiktoken.get_encoding("gpt2")

In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

enc_text = tokenizer.encode(raw_text)
print('length enc_text: ', len(enc_text))
print(raw_text[:60])
print(enc_text[:60])

length enc_text:  5145
I HAD always thought Jack Gisburn rather a cheap genius--tho
[40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11, 290, 4920, 2241, 287, 257, 4489, 64, 319, 262, 34686]


In [4]:
# Skip the first 50 tokens for demonstration purposes; the passage that follows is slightly more interesting.
enc_sample = enc_text[50:]
print(enc_sample[:4])

# A context size of 4 means the model looks at a sequence of up to 4 tokens to predict the next token in the sequence.
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

[290, 4920, 2241, 287]
x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


The targets are the inputs shifted by one position. Processing the inputs along with the targets, we can create the next-word prediction tasks as follows.

In [5]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(context, "--->", desired)

[290] ---> 4920
[290, 4920] ---> 2241
[290, 4920, 2241] ---> 287
[290, 4920, 2241, 287] ---> 257


Everything on the left of the arrow refers to the input an LLM would receive, and the token ID on the right side of the arrow represents the target token ID that the LLM is supposed to predict.

We now repeat the same loop, decoding the token IDs back into text:

In [6]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(tokenizer.decode(context), "--->", tokenizer.decode([desired]))

 and --->  established
 and established --->  himself
 and established himself --->  in
 and established himself in --->  a


## IMPLEMENTING A DATA LOADER
Using PyTorch's built-in Dataset and DataLoader classes

Step 1: Tokenize the entire text

Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length tokens

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset

In [23]:
import torch
from torch.utils.data import Dataset, DataLoader

# The dataset stores the text as input-target pairs
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):  # max_length = context size
        self.input_ids = []
        self.target_ids = []
        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})
        #using a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids)-max_length, stride):
            input_chunk = token_ids[i:i+max_length]
            target_chunk = token_ids[i+1: i+max_length+1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))
            
    def __len__(self):
        return len(self.input_ids)
        
    # Called by the data loader to fetch a single row
    def __getitem__(self, idx):  # idx = row index
        # Return the input and target tensors for the requested row
        return self.input_ids[idx], self.target_ids[idx]

# A data loader needs a map-style or an iterable-style dataset; here we use map style.
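
The sliding-window loop and the map-style access pattern can be checked without PyTorch or a tokenizer. The sketch below is a simplified stand-in, not the class above: `ToyWindowDataset` is a hypothetical name, it stores plain lists instead of tensors, and it assumes the text has already been converted to integer token IDs.

```python
class ToyWindowDataset:
    """Hypothetical plain-Python stand-in for GPTDatasetV1 (lists instead of tensors)."""

    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Same loop bounds as GPTDatasetV1: stop early enough that the
        # target window (shifted right by one) still fits inside token_ids.
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(token_ids[i:i + max_length])
            self.target_ids.append(token_ids[i + 1:i + max_length + 1])

    def __len__(self):
        # Number of rows (input-target pairs)
        return len(self.input_ids)

    def __getitem__(self, idx):
        # One row: the input window and its shifted target window
        return self.input_ids[idx], self.target_ids[idx]


toy = ToyWindowDataset(list(range(10)), max_length=4, stride=2)
print(len(toy))   # 3
print(toy[0])     # ([0, 1, 2, 3], [1, 2, 3, 4])
print(toy[2])     # ([4, 5, 6, 7], [5, 6, 7, 8])
```

The row count equals `len(range(0, len(token_ids) - max_length, stride))`, so the 5145 tokens of the story with max_length=4 and stride=1 would give 5141 rows.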

The following code uses GPTDatasetV1 to load the inputs in batches via PyTorch's DataLoader:

Step 1: Initialize the tokenizer

Step 2: Create dataset

Step 3: Set drop_last=True to drop the last batch if it is shorter than the specified batch size, which prevents loss spikes during training

Step 4: Set num_workers, the number of CPU processes to use for preprocessing

In [24]:
# This function wraps the dataset in a DataLoader, which handles batching and optional parallel loading.
# It yields the input-target pairs from the dataset we defined earlier.
# num_workers = number of worker subprocesses that load data in parallel (0 loads in the main process)
# batch_size = number of input-target pairs per batch
# max_length = context length

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    #initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")
    #create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    # Create the dataloader; it calls the dataset's __getitem__ method and groups the
    # returned input-target pairs into batches.
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader

There is a difference between batch size and number of workers.

Batch size is the number of samples (input-target pairs) the model processes at once before updating its parameters. The data is usually chunked into batches so that the model updates its parameters more often: with a batch size of 4, it updates after every 4 samples rather than after going through the entire dataset.

num_workers sets the number of worker subprocesses that load data in parallel on the CPU.

create_dataloader_v1 handles all of this for us; otherwise, managing the batch size and the number of workers by hand would be very challenging.
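
The effect of batch_size and drop_last on the number of batches can be sketched in plain Python. This only mimics the grouping arithmetic, assuming samples are taken in order without shuffling; `make_batches` is a hypothetical helper and not part of PyTorch.

```python
def make_batches(samples, batch_size, drop_last):
    """Group samples into consecutive batches (hypothetical helper, not a PyTorch API)."""
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    # With drop_last=True, discard a final batch that is shorter than batch_size
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


samples = list(range(10))  # stand-in for 10 input-target pairs

kept = make_batches(samples, batch_size=4, drop_last=False)
dropped = make_batches(samples, batch_size=4, drop_last=True)

print([len(b) for b in kept])     # [4, 4, 2]
print([len(b) for b in dropped])  # [4, 4]
```

With 10 samples and a batch size of 4, the last batch holds only 2 samples, and drop_last=True removes it so that every batch has the same size.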

We now test the dataloader with a batch size of 1 for an LLM with a context size of 4, which helps develop an intuition for how create_dataloader_v1 and GPTDatasetV1 work together.

In [25]:
with open("the-verdict.txt", 'r', encoding="utf-8") as f:
    raw_text = f.read()

Now we create a data loader and convert it into a Python iterator to fetch the next entry in the dataset via Python's built-in next() function:

In [49]:
import torch
print("PyTorch version:", torch.__version__)
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

PyTorch version: 2.5.1+cpu
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]


In [50]:
second_batch=next(data_iter)
second_batch

[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
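
With stride=1, the second batch is the first batch shifted by a single token, as the two outputs above show. As a rough illustration of how a larger stride changes this, the sketch below applies the same slicing to the first nine token IDs printed earlier, using plain lists rather than the dataloader; `windows` is a hypothetical helper.

```python
# First nine token IDs of the encoded text, copied from the output printed earlier
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]


def windows(ids, max_length, stride):
    """Return (input, target) pairs using the same slicing as GPTDatasetV1 (hypothetical helper)."""
    return [
        (ids[i:i + max_length], ids[i + 1:i + max_length + 1])
        for i in range(0, len(ids) - max_length, stride)
    ]


# stride=1: each input window starts one token after the previous one
for x, y in windows(token_ids, max_length=4, stride=1)[:2]:
    print(x, y)

# stride=4 (equal to max_length): the input windows do not overlap
for x, y in windows(token_ids, max_length=4, stride=4):
    print(x, y)
```

Setting the stride equal to max_length uses each token as an input exactly once, while a smaller stride produces more, overlapping rows from the same text.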