<a href="https://colab.research.google.com/github/abdussahid26/Dara-preparation-and-sampling-for-LLMs/blob/main/Data_Sampling_with_a_Sliding_Window_Approach.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This notebook implements a dataloader that fetches input-target pairs from the training dataset using a sliding window approach.**



In [1]:
!pip install tiktoken



In [2]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.5.1+cu121
tiktoken version: 0.8.0


In [3]:
import urllib.request

url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

Next, we tokenize the entire short story 'The Verdict' using the **byte pair encoding (BPE)** tokenizer.

In [4]:
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

In [5]:
# raw_text was already loaded above, so it can be encoded directly.
enc_text = tokenizer.encode(raw_text)
print("Total number of tokens in the training set after applying the BPE tokenizer: ", len(enc_text))

Total number of tokens in the training set after applying the BPE tokenizer:  5145


Now, remove the first 50 tokens from the dataset for demonstration purposes.

In [6]:
enc_sample = enc_text[50:]

context_size = 4 # The context size determines how many tokens are included in the input.
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


Pairing each input with its target, which is the input shifted by one position, gives us the next-word prediction tasks.

In [7]:
print("Following are the input-output pairs that we can use for LLM training:")

for i in range(1, context_size+1):
    context = enc_sample[:i]
    target = enc_sample[i]
    print(context, "--->", target)
    print(tokenizer.decode(context), "--->", tokenizer.decode([target]))


Following are the input-output pairs that we can use for LLM training:
[290] ---> 4920
 and --->  established
[290, 4920] ---> 2241
 and established --->  himself
[290, 4920, 2241] ---> 287
 and established himself --->  in
[290, 4920, 2241, 287] ---> 257
 and established himself in --->  a


# **The efficient dataloader implementation.**

#### Step 1: Tokenize the entire text.
#### Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length.
#### Step 3: Return the total number of rows in the dataset.
#### Step 4: Return a single row from the dataset.

In [8]:
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    # max_length is the context size; stride is the number of tokens the window moves between consecutive chunks.
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        token_ids = tokenizer.encode(txt) # Tokenizes the entire text

        for i in range(0, len(token_ids) - max_length, stride): # Uses a sliding window approach to chunk the book into overlapping sequences of max_length
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self): # Returns the total number of rows in the dataset
        return len(self.input_ids)

    def __getitem__(self, idx): # Returns a single row from the dataset
        return self.input_ids[idx], self.target_ids[idx]
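
To get a feel for how many rows `__len__` reports, the loop bounds in `__init__` can be reproduced without a tokenizer. The helper below is a minimal sketch: `num_windows` is a hypothetical name that is not part of the notebook, and the token count of 5145 is the one printed earlier for 'The Verdict'.

```python
def num_windows(num_tokens, max_length, stride):
    """Count the chunks produced by range(0, num_tokens - max_length, stride)."""
    # Same loop bounds as GPTDatasetV1.__init__; an empty range means the text
    # is too short to yield even one input-target pair.
    return len(range(0, num_tokens - max_length, stride))


total_tokens = 5145  # token count of "The Verdict" reported earlier

print(num_windows(total_tokens, max_length=4, stride=1))      # 5141
print(num_windows(total_tokens, max_length=4, stride=4))      # 1286
print(num_windows(total_tokens, max_length=256, stride=128))  # 39
```

A larger stride produces fewer rows, and a text with no more than `max_length` tokens produces none.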

The following code uses the GPTDatasetV1 to load the inputs in batches via a PyTorch DataLoader.

#### Step 1: Initialize the tokenizer.
#### Step 2: Create dataset.
#### Step 3: Set drop_last=True to drop the last batch if it is shorter than the specified batch_size, which prevents loss spikes during training.
#### Step 4: Set num_workers, the number of CPU processes to use for preprocessing.

Why wrap the dataset in a DataLoader? It groups individual rows into batches, can shuffle them between epochs, and can load data in parallel with multiple worker processes.
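
The effect of `drop_last` can be illustrated in plain Python. The sketch below only mimics how rows are grouped when shuffling is off; `batch_indices` is a hypothetical helper for illustration and not PyTorch's actual implementation.

```python
def batch_indices(num_rows, batch_size, drop_last=True):
    """Group row indices 0..num_rows-1 into consecutive batches (no shuffling)."""
    batches = [
        list(range(start, min(start + batch_size, num_rows)))
        for start in range(0, num_rows, batch_size)
    ]
    # With drop_last=True an incomplete final batch is discarded.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(batch_indices(10, batch_size=4, drop_last=True))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(batch_indices(10, batch_size=4, drop_last=False))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With 10 rows and a batch size of 4, the two leftover rows are either dropped or returned as a shorter final batch.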

In [9]:
# batch_size: the number of input-target rows in each batch. With batch_size=4, the model typically updates its parameters once after processing a batch of 4 rows.
# num_workers: the number of CPU subprocesses used to load data in parallel (0 means loading happens in the main process).

def create_dataloader_v1(txt, batch_size=4, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2") # Initializes the BPE tokenizer.
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride) # It creates an instance of the GPTDatasetV1.
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last, # drop_last = True; drops the last batch if it is shorter than the specified batch_size to prevent loss spikes during training.
        num_workers=num_workers # The number of CPU processes to use for preprocessing.
    )

    return dataloader

Let's test the dataloader with batch_size=1 for an LLM with a context size of 4 to develop an intuition for how the GPTDatasetV1 class and the create_dataloader_v1 function work together.

In [10]:
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=1, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
print("1st batch: ", first_batch) # [tensor(inputs), tensor(targets)]

second_batch = next(data_iter)
print("2nd batch: ", second_batch)

1st batch:  [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
2nd batch:  [tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]


Notice that small batch sizes require less memory during training but lead to noisier model updates. Just like in regular deep learning, the batch size is a tradeoff and a hyperparameter to experiment with when training LLMs. Let’s look briefly at how we can use the data loader to sample with a batch size greater than 1:

In [11]:
dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)

# Note that we increase the stride to 4, which equals max_length, so the windows cover the dataset fully (we don’t skip a single word)
# without overlapping. Avoiding overlap between the rows matters because more overlap could lead to increased overfitting.

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Targets:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])
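
The difference between stride=1 and stride=4 can also be checked on a handful of token IDs without building a dataloader. This is a minimal sketch: `sliding_windows` is a hypothetical helper that reuses the indexing of `GPTDatasetV1`, and the nine token IDs are read off the batches printed above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs using the same indexing as GPTDatasetV1."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        pairs.append((token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1]))
    return pairs


# First nine token IDs of the story, read off the batches printed above.
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899]

overlapping = sliding_windows(token_ids, max_length=4, stride=1)
disjoint = sliding_windows(token_ids, max_length=4, stride=4)

print(len(overlapping), overlapping[1][0])  # 5 [367, 2885, 1464, 1807]
print(len(disjoint), disjoint[1][0])        # 2 [1807, 3619, 402, 271]
```

With stride=1, consecutive inputs share three of their four tokens. With stride=4, each input starts where the previous one ended. A stride larger than max_length would skip tokens entirely.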
