In this section, we’ll implement a function that generates **input–target pairs** for training a language model.

---

### 💡 Why We Don’t Need Labeled Data

Language model training is **self-supervised**: we don’t need human-labeled datasets like those used in classification or regression.

Instead, we can use the **text itself** to create labels:

> Each token predicts the **next token** in the sequence.

For example, take the sentence `I love deep learning`.

Then the model learns pairs like:

| Input | Target |
|--------|--------|
| I | love |
| I love | deep |
| I love deep | learning |
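
The pairs in the table can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace splitting stands in for a real tokenizer; the helper name `next_token_pairs` is hypothetical and not part of the notebook.

```python
def next_token_pairs(tokens):
    """Return (input, target) pairs where the target is the token right after the input."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i is the input; the token at i is the target
        pairs.append((tokens[:i], tokens[i]))
    return pairs

# Assumption: splitting on whitespace stands in for a real tokenizer
words = "I love deep learning".split()
pairs = next_token_pairs(words)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

No labels are written by hand here: every target comes from the text itself.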

---

### 🌀 The Twist: Sliding Window

We’ll use a **sliding window** approach to efficiently create these input–target pairs.

---

### ⚙️ Steps

1. **Define a Context Window**
   - The context window represents the **maximum number of tokens** (words, subwords, or characters) the model can see at once.  
   - Example: `context_window = 5`

2. **Slide Through the Text**
   - Start from the beginning of the text and take **chunks of size = context_window**.
   - For each position in the text:
     - The **input** is a sequence of up to `context_window` tokens.
     - The **target** is the **next token** following that sequence.

3. **Repeat Until End of Text**
   - Keep shifting the window by 1 token and continue generating pairs until you reach the end.
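
The three steps above can be sketched on a toy list of token IDs. This is an illustrative sketch only: `sliding_window_pairs` and the toy IDs are hypothetical, and the inputs are capped at `context_window` tokens by dropping the oldest tokens.

```python
def sliding_window_pairs(token_ids, context_window):
    """Build (input, target) pairs, capping each input at `context_window` tokens."""
    pairs = []
    for i in range(1, len(token_ids)):
        # Keep at most `context_window` tokens immediately before position i
        start = max(0, i - context_window)
        pairs.append((token_ids[start:i], token_ids[i]))
    return pairs

# Hypothetical token IDs, used only for illustration
toy_ids = [10, 11, 12, 13, 14, 15, 16]
window_pairs = sliding_window_pairs(toy_ids, context_window=3)

for inputs, target in window_pairs:
    print(inputs, "--->", target)
```

Once the window is full, each shift by one token drops the oldest token from the input and adds the newest one.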

---

In [1]:
# loading the data 
with open('the-verdict.txt', 'r', encoding='utf-8') as f:
    raw_data = f.read()

In [2]:
# create token_ids
import tiktoken

tokenizer = tiktoken.get_encoding('gpt2')
data_tokens = tokenizer.encode(raw_data)
print(f"Total tokens : {len(data_tokens)}")
print(f"Sample tokens : {data_tokens[:20]}")

Total tokens : 5145
Sample tokens : [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438]


In [3]:
# context size
context_size = 4
enc_sample = data_tokens[:100]
# if input is 4 tokens [1, 2, 3, 4], then output should be [2, 3, 4, 5]
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]

print(f"x : {x}")
print(f"y :     {y}")

x : [40, 367, 2885, 1464]
y :     [367, 2885, 1464, 1807]


In [4]:
for i in range(1, context_size+1):
    inputs = enc_sample[:i]
    output = enc_sample[i]

    print(f"inputs : {inputs} ---> {output}")

inputs : [40] ---> 367
inputs : [40, 367] ---> 2885
inputs : [40, 367, 2885] ---> 1464
inputs : [40, 367, 2885, 1464] ---> 1807


### Dataset and DataLoader

A Dataset is a Python class that tells PyTorch:<br>
-> How to access your data<br>
-> How many samples you have<br>
-> How to fetch any item by index<br>
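
As a rough illustration of that contract, here is a sketch of a class with the two methods a map-style dataset provides, `__len__` and `__getitem__`. It is plain Python on purpose (an assumption made for brevity; a real PyTorch dataset would subclass `torch.utils.data.Dataset` and usually return tensors), and the name `ToyPairDataset` is hypothetical.

```python
class ToyPairDataset:
    """Plain-Python stand-in for a map-style dataset of (input, target) windows."""

    def __init__(self, token_ids, context_size):
        self.samples = []
        for i in range(len(token_ids) - context_size):
            x = token_ids[i:i + context_size]
            # The target is the input shifted one token to the right
            y = token_ids[i + 1:i + context_size + 1]
            self.samples.append((x, y))

    def __len__(self):
        # How many samples there are
        return len(self.samples)

    def __getitem__(self, idx):
        # Fetch one sample by index
        return self.samples[idx]

toy = ToyPairDataset([1, 2, 3, 4, 5, 6], context_size=4)
print(len(toy))
print(toy[0])
```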

DataLoader takes a Dataset and handles:

✔️ Batching (e.g., batch size 32)<br>
✔️ Shuffling (important for training)<br>
✔️ Parallel data loading using workers<br>
✔️ Optionally dropping the last incomplete batch (`drop_last`)<br>
✔️ Pinning batches in page-locked host memory (`pin_memory=True`) so copies to the GPU are faster; moving tensors to the GPU is still done in your own code<br>
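
To make batching and dropping the leftover samples concrete, here is a small pure-Python sketch of how sample indices are grouped when shuffling is off. It only imitates the grouping behavior; `batch_indices` is a hypothetical helper, not a PyTorch function.

```python
def batch_indices(num_samples, batch_size, drop_last):
    """Group sample indices into consecutive batches, optionally dropping a short final batch."""
    batches = [
        list(range(start, min(start + batch_size, num_samples)))
        for start in range(0, num_samples, batch_size)
    ]
    if drop_last and batches and len(batches[-1]) < batch_size:
        # The final batch is incomplete, so discard it
        batches.pop()
    return batches

kept = batch_indices(10, 4, drop_last=False)
dropped = batch_indices(10, 4, drop_last=True)
print(kept)
print(dropped)
```

With 10 samples and a batch size of 4, two samples are left over; they form a short third batch unless `drop_last` is set.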

In [8]:
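# Implementing an efficient DataLoader that iterates over the data and returns batches of PyTorch tensors
import torch
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, raw_text, tokenizer, context_size, stride):
        self.input_ids = []
        self.output_ids = []
        # Tokenize the whole text once
        token_ids = tokenizer.encode(raw_text)

        # Slide a window of `context_size` tokens over the text, moving `stride` tokens per step.
        # Note: this bound stops one position early, so the last valid start index
        # (len(token_ids) - context_size - 1) is never used; a bound of
        # len(token_ids) - context_size would also allow it.
        for i in range(0, len(token_ids) - context_size - 1, stride):
            x = token_ids[i:i+context_size]
            # Targets are the inputs shifted one token to the right
            y = token_ids[i+1:i+context_size+1]
            self.input_ids.append(torch.tensor(x))
            self.output_ids.append(torch.tensor(y))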

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, idx):
        return self.input_ids[idx], self.output_ids[idx]


In [9]:
gpt_dataset = GPTDatasetV1(raw_data, tokenizer, 4, 1)

In [10]:
print(len(gpt_dataset))

5140
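
The sample count can be checked without building the dataset. This sketch mirrors the loop bound used in `GPTDatasetV1` and uses the token count printed earlier (5145); `num_windows` is a hypothetical helper, and the `stride=4` case is an extra illustration that is not run in the notebook.

```python
def num_windows(num_tokens, context_size, stride):
    """Count the windows produced by range(0, num_tokens - context_size - 1, stride)."""
    return len(range(0, num_tokens - context_size - 1, stride))

# Same settings as the cell above: 5145 tokens, context_size=4, stride=1
count_stride_1 = num_windows(5145, context_size=4, stride=1)

# A stride equal to the context size gives non-overlapping inputs
count_stride_4 = num_windows(5145, context_size=4, stride=4)

print(count_stride_1, count_stride_4)
```

A larger stride means fewer, less overlapping samples; a stride of 1 reuses almost every token `context_size` times.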


In [13]:
# printing an example 
x = gpt_dataset[0][0]
y = gpt_dataset[0][1]

print(f"x : {x}")
print(f"y :  {y}")

x : tensor([  40,  367, 2885, 1464])
y :  tensor([ 367, 2885, 1464, 1807])


In [None]:
## creating a DataLoader

def create_dataloader_v1(raw_data, batch_size=4, context_size=256, stride=128, shuffle=True, drop_last=True, num_workers=0):

    tokenizer = tiktoken.get_encoding('gpt2')
    dataset = GPTDatasetV1(raw_data, tokenizer=tokenizer, context_size=context_size, stride=stride)

    # batch_size : number of samples in each batch; the model processes one batch before each parameter update
    # num_workers : number of subprocesses used to load data in parallel (0 loads in the main process)

    # If your dataset size is not divisible by the batch_size, you’ll end up with one last smaller batch.
    # The drop_last flag controls whether to keep or drop that final partial batch.

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

    return dataloader

In [16]:
dataloader = create_dataloader_v1(raw_data)
batch = next(iter(dataloader))
batch

[tensor([[ 314,  550, 3750,  ..., 6451,   11,  286],
         [6164,   25,  366,  ...,   11, 4844,  286],
         [ 286, 1762,   30,  ...,  388,  351,  262],
         [2612, 4369,   11,  ...,  655, 4030,  465]]),
 tensor([[  550,  3750,   351,  ...,    11,   286,  2612],
         [   25,   366, 16773,  ...,  4844,   286,   262],
         [ 1762,    30,  2011,  ...,   351,   262,  1459],
         [ 4369,    11,   523,  ...,  4030,   465,  2951]])]
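
The batch above has two tensors with 4 rows each, one row per sample. As a hedged back-of-the-envelope check, the sketch below works out how many samples and full batches the default settings should give for this text, assuming the same 5145 tokens and the same loop bound as `GPTDatasetV1`.

```python
# Values taken from the notebook: 5145 tokens, defaults context_size=256, stride=128, batch_size=4
num_tokens = 5145
context_size, stride, batch_size = 256, 128, 4

# Same loop bound as GPTDatasetV1.__init__
num_samples = len(range(0, num_tokens - context_size - 1, stride))

# drop_last=True discards the final partial batch
full_batches = num_samples // batch_size
leftover = num_samples % batch_size

print(num_samples, full_batches, leftover)
# Expected shape of each of the two tensors in a batch: (batch_size, context_size)
print((batch_size, context_size))
```

Under these assumptions, one pass over the data yields 9 batches, and the 3 leftover samples are dropped in each epoch.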