# Creating Input-Target pairs


In [2]:
with open("the-verdict.txt",'r',encoding='utf-8') as f:
    raw_text=f.read()
print('Total number of characters:', len(raw_text))
print(raw_text[:100])

Total number of characters: 20479
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [3]:
! pip install tiktoken
 






In [4]:
import tiktoken


In [5]:
tokenizer=tiktoken.get_encoding("gpt2")

In [6]:
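# Note: "<|endoftoken|>" is not a GPT-2 special token (that one is "<|endoftext|>"),
# so encode() treats it as ordinary text and splits it into several sub-word tokens.
text='Hello, how are you doing today? <|endoftoken|> I hope you are having a great day!'
integers=tokenizer.encode(text)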
print(integers)

[15496, 11, 703, 389, 345, 1804, 1909, 30, 1279, 91, 437, 11205, 4233, 91, 29, 314, 2911, 345, 389, 1719, 257, 1049, 1110, 0]


In [7]:
strings=tokenizer.decode(integers)
print(strings)

Hello, how are you doing today? <|endoftoken|> I hope you are having a great day!


## Tokenizing the whole raw text

In [8]:
txt_enc=tokenizer.encode(raw_text)
print(f'Total number of tokens: {len(txt_enc)}')

Total number of tokens: 5145


In [9]:
context_size=4
x=txt_enc[:context_size]
y=txt_enc[1:context_size+1]
print(f"x={x}")
print(f"y={y}")

x=[40, 367, 2885, 1464]
y=[367, 2885, 1464, 1807]


In [10]:
for i in range(1,context_size+1):
    context=txt_enc[:i]
    desired=txt_enc[i]
    print(tokenizer.decode(context),'-----> ',tokenizer.decode([desired]))

I ----->   H
I H ----->  AD
I HAD ----->   always
I HAD always ----->   thought


## Using a DataLoader

<div class="alert alert-block alert-info">
    
Step 1: Tokenize the entire text
    
Step 2: Use a sliding window to chunk the book into overlapping sequences of max_length

Step 3: Return the total number of rows in the dataset

Step 4: Return a single row from the dataset
</div>
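
The sliding-window step can be tried out on a toy list of token IDs before involving a tokenizer. The sketch below is plain Python; the helper name `sliding_windows` and the toy IDs are made up for illustration and are not part of the notebook. It shows how `stride` controls the overlap between consecutive rows.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens.
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

toy_ids = list(range(10, 20))  # hypothetical token IDs 10..19

# stride=1: consecutive rows overlap in all but one token
pairs_overlap = sliding_windows(toy_ids, max_length=4, stride=1)

# stride=max_length: consecutive input chunks do not overlap
pairs_disjoint = sliding_windows(toy_ids, max_length=4, stride=4)

for inp, tgt in pairs_disjoint:
    print(inp, "----->", tgt)
```

With `stride` equal to `max_length`, each token appears in at most one input chunk; a smaller stride yields more rows from the same text at the cost of overlap between them.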


In [None]:
import torch
from torch.utils.data import Dataset,DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text; "<|endoftext|>" is GPT-2's special token
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to create overlapping input-target pairs
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1:i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        # Total number of rows (input-target pairs) in the dataset
        return len(self.input_ids)

    def __getitem__(self, idx):
        # A single row: (input tensor, target tensor)
        return self.input_ids[idx], self.target_ids[idx]

<div class="alert alert-block alert-warning">

The GPTDatasetV1 class in listing 2.5 is based on the PyTorch Dataset class.

It defines how individual rows are fetched from the dataset.

Each row consists of max_length token IDs stored in an input_chunk tensor.

The target_chunk tensor contains the corresponding targets: the same tokens shifted one position to the right.

I recommend reading on to see what the data returned from this dataset looks like when we combine the dataset with a PyTorch DataLoader; this will bring additional intuition and clarity.
    
</div>
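
To give a sense of what that combination looks like, here is a minimal sketch that wraps a toy dataset in a PyTorch `DataLoader`. It assumes `torch` is installed. `ToyPairs` and its integer IDs are hypothetical stand-ins for `GPTDatasetV1` and real token IDs, chosen so the example runs without a tokenizer or the text file. The `batch_size`, `shuffle`, and `drop_last` values are illustrative choices, not settings taken from the notebook.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyPairs(Dataset):
    """Hypothetical stand-in for GPTDatasetV1, built from a plain list of IDs."""
    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# 20 toy IDs, windows of 4 tokens, no overlap between input chunks
dataset = ToyPairs(list(range(20)), max_length=4, stride=4)

# The DataLoader stacks individual rows into batches of shape (batch_size, max_length)
loader = DataLoader(dataset, batch_size=2, shuffle=False, drop_last=True)

inputs, targets = next(iter(loader))
print("inputs:\n", inputs)
print("targets:\n", targets)
```

Each batch is a pair of 2-D tensors: one row per sample, one column per token position, with the targets offset from the inputs by one token.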