<a href="https://colab.research.google.com/github/Vivek-afk81/LLM_from_scratch/blob/main/data_preprocessing_pipeline_llm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Pipeline for LLMs

**4 Main Steps**


*   Tokenization
*   Token Embeddings
*   Positional Embeddings
*   Input Embeddings


In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
import os
os.chdir("/content/drive/My Drive/build_llm_from_scratch")
print(os.getcwd())

/content/drive/My Drive/build_llm_from_scratch


## Step 1: Tokenization

#### Creating tokens

In [4]:
with open("The Call of the Wild.txt", "r", encoding="utf-8") as f:
  raw_text = f.read()
print("Total number of characters:", len(raw_text))
print(raw_text[:200])

Total number of characters: 175584
The Call of the Wild

by Jack London




Contents

 Chapter I. Into the Primitive
 Chapter II. The Law of Club and Fang
 Chapter III. The Dominant Primordial Beast
 Chapter IV. Who Has Won to Mastersh


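The regex pattern controls how the text is split: it breaks on punctuation, whitespace, and double hyphens (`--`), and the capturing group keeps each delimiter as its own token.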

In [5]:
import re
preprocessed = re.split(r'([,.:;?_!()"\'\s]|--)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(type(preprocessed))
print(preprocessed[:30])

<class 'list'>
['The', 'Call', 'of', 'the', 'Wild', 'by', 'Jack', 'London', 'Contents', 'Chapter', 'I', '.', 'Into', 'the', 'Primitive', 'Chapter', 'II', '.', 'The', 'Law', 'of', 'Club', 'and', 'Fang', 'Chapter', 'III', '.', 'The', 'Dominant', 'Primordial']


In [6]:
print(len(preprocessed))

36384
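To see what the capturing group in the pattern does, here is a minimal sketch on a made-up sentence (the sentence and the variable names are illustrative, not taken from the book). Because the delimiters sit inside a capturing group, `re.split` returns them alongside the words; the list comprehension then drops the empty and whitespace-only pieces.

```python
import re

# Same pattern as in the cell above.
pattern = r'([,.:;?_!()"\'\s]|--)'

# Illustrative sentence, not from the book.
sample = "Hello, world. Is this-- a test?"

# re.split keeps the captured delimiters; strip() filters out whitespace pieces.
pieces = re.split(pattern, sample)
tokens = [item.strip() for item in pieces if item.strip()]

print(tokens)
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Punctuation survives as separate tokens, while spaces are discarded, which is why the decoded text later cannot restore the original spacing exactly.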


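## Step 2: Creating token IDs

Complete training dataset → tokenized text → vocabulary

1. Tokenization breaks down the input text into individual tokens.

2. Each unique token is added to the vocabulary in sorted order. Python's `sorted` orders strings by code point, so punctuation and digits come first, then capitalized words, then lowercase words.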

In [7]:
all_words= sorted(set(preprocessed))
vocab_size=len(all_words)
vocab_size

5216

In [8]:
# Assign an integer ID to each token
vocab = {token: integer for integer, token in enumerate(all_words)}

In [9]:
for i,item in enumerate(vocab.items()):
  print(item)
  if i>=50:
    break

('!', 0)
('(', 1)
(')', 2)
(',', 3)
('.', 4)
('1897', 5)
(':', 6)
(';', 7)
('?', 8)
('A', 9)
('About', 10)
('Across', 11)
('After', 12)
('Again', 13)
('Air-holes', 14)
('Alaska', 15)
('Alaskan', 16)
('Alice', 17)
('All', 18)
('Alpine', 19)
('Also', 20)
('Always', 21)
('Among', 22)
('An', 23)
('And', 24)
('Angry', 25)
('Another', 26)
('Apt', 27)
('Arctic', 28)
('As', 29)
('Ask', 30)
('Association', 31)
('At', 32)
('Back', 33)
('Bar', 34)
('Barge', 35)
('Barracks', 36)
('Barrens', 37)
('Bay', 38)
('Be', 39)
('Beast', 40)
('Because', 41)
('Before', 42)
('Being', 43)
('Bellying', 44)
('Bench', 45)
('Benches', 46)
('Bennett', 47)
('Bernard', 48)
('Besides', 49)
('Best', 50)
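A word-level vocabulary like this one can only encode tokens it has already seen. The sketch below illustrates that on a toy text; the helper names `tokenize`, `encode`, and `decode` and the toy sentences are assumptions made for this example, not part of the notebook.

```python
import re

PATTERN = r'([,.:;?_!()"\'\s]|--)'

def tokenize(text):
    """Split text with the same regex used above and drop whitespace pieces."""
    return [t.strip() for t in re.split(PATTERN, text) if t.strip()]

# Toy corpus standing in for the book.
toy_text = "The dog ran. The dog slept."
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(tokenize(toy_text))))}
inverse_vocab = {i: tok for tok, i in toy_vocab.items()}

def encode(text, vocab):
    """Map each token to its integer ID; raises KeyError for unseen tokens."""
    return [vocab[tok] for tok in tokenize(text)]

def decode(ids, inverse):
    """Map IDs back to tokens, joined by single spaces."""
    return " ".join(inverse[i] for i in ids)

ids = encode("The dog slept.", toy_vocab)
print(ids)                          # [1, 2, 4, 0]
print(decode(ids, inverse_vocab))   # The dog slept .

# A word that never appeared in the corpus cannot be encoded.
missing = None
try:
    encode("The cat ran.", toy_vocab)
except KeyError as err:
    missing = err.args[0]
print("Unknown token:", missing)    # Unknown token: cat
```

This failure on unseen words is one motivation for moving to a subword tokenizer with an unknown-token fallback.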


## Train a Hugging Face BPE Tokenizer

We will use **BPE** (byte-pair encoding). Instead of treating each word as a token, BPE starts from characters and repeatedly merges the most frequent adjacent pair of symbols. This lets the tokenizer represent common words as single tokens while still composing rare or unseen words from subword pieces.

Example:

"lower", "lowest"

→ l o w e r
→ l o w e s t

Frequent merges:
* l+o → lo
* lo+w → low
* e+r → er
* e+s → es
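The merge loop can be sketched in a few lines of plain Python. This is a simplified illustration, not the `tokenizers` implementation: the function names are hypothetical, ties are broken by whichever pair was seen first (an assumption; real libraries may break ties differently), and the corpus is just the two example words, so after `lo` and `low` the merges differ from the illustrative list above.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Each word starts as a tuple of characters with a frequency of 1.
words = {tuple("lower"): 1, tuple("lowest"): 1}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # first-seen pair wins ties
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(words)   # {('lowe', 'r'): 1, ('lowe', 's', 't'): 1}
```

On a real corpus, frequent endings such as `er` and `es` occur across many words, so their merges are learned early, as in the list above.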


In [10]:
#Rust-backed, production-grade library
!pip install -q tokenizers

### Create a BPE tokenizer

In [12]:
#Initializing the tokenizer

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

In [13]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))  # [UNK] is the fallback for anything the vocabulary cannot represent


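#### Pre-tokenization

Pre-tokenization defines the initial splits. BPE does NOT work on raw text directly; it works on pre-tokenized units, and merges never cross their boundaries.

Here we use `Whitespace`, which splits on whitespace and also separates punctuation from words before the BPE merges are learned.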

In [14]:
tokenizer.pre_tokenizer=Whitespace()
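The `tokenizers` documentation describes the `Whitespace` pre-tokenizer as splitting with the regex `\w+|[^\w\s]+`. The sketch below applies that pattern with the standard `re` module to show the kind of units BPE receives. It approximates the library's behavior rather than calling it, and the function name `pre_tokenize` is hypothetical.

```python
import re

# Pattern documented for the Whitespace pre-tokenizer:
# runs of word characters, or runs of non-word, non-space characters.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def pre_tokenize(text):
    """Return (piece, (start, end)) pairs, similar in shape to the library's output."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]

units = pre_tokenize("the dog ran quickly.")
print(units)
# [('the', (0, 3)), ('dog', (4, 7)), ('ran', (8, 11)), ('quickly', (12, 19)), ('.', (19, 20))]
```

With the library installed, `Whitespace().pre_tokenize_str("the dog ran quickly.")` should return the same kind of list. The period becomes its own unit, so no learned merge can glue it onto `quickly`.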

In [15]:
# Define how the vocabulary will be learned.
#
# Word-level vocab (5216): every distinct word or punctuation mark gets its own ID.
# BPE vocab (5000): we ask the tokenizer to learn a smaller, more
# efficient set of subword tokens.
trainer=BpeTrainer(
    vocab_size=5000,
    special_tokens=["[UNK]","[PAD]","[BOS]","[EOS]"]
)

In [16]:
# Train the tokenizer

tokenizer.train(
    files=["The Call of the Wild.txt"],
    trainer=trainer
)

In [20]:
tokenizer.get_vocab_size()

5000

In [21]:
tokenizer.get_vocab()

{'bla': 3096,
 'rility': 3386,
 'south': 4666,
 'shreds': 3515,
 'hen': 564,
 'inde': 2531,
 'Jack': 4319,
 'eas': 870,
 'Py': 3062,
 'sooner': 4056,
 'lifeless': 3601,
 'move': 689,
 'creat': 1098,
 'nor': 559,
 'stly': 3343,
 'cur': 696,
 'asingly': 3315,
 'drink': 2454,
 'boast': 3614,
 'proceed': 3636,
 'closely': 3883,
 'aud': 4395,
 'ring': 480,
 'pro': 347,
 'prou': 3208,
 'rope': 684,
 'ofs': 4901,
 'imal': 1016,
 'birds': 2907,
 'sights': 3898,
 'trying': 3234,
 'whole': 855,
 'lained': 4562,
 'came': 255,
 'cause': 542,
 'Pike': 918,
 'do': 118,
 'ggled': 3718,
 'tumbled': 4703,
 'given': 2327,
 'so': 182,
 'speedily': 4071,
 'Japanese': 4158,
 'sworn': 4671,
 'ines': 2528,
 'Jo': 346,
 'all': 169,
 'seeming': 3697,
 'uttered': 1467,
 'futi': 3969,
 'rid': 1613,
 'demon': 2645,
 'surpri': 961,
 'rest': 384,
 'virility': 3772,
 'Fin': 4295,
 'good': 511,
 'prided': 3651,
 '[PAD]': 1,
 'lodge': 1860,
 'kept': 850,
 'fire': 410,
 'seemed': 570,
 'search': 4098,
 'wned': 2652,
 ...

#### Encoding

In [25]:
encoded=tokenizer.encode("the dog ran quickly.")
for token_id in encoded.ids:
  print(f"{token_id}: {tokenizer.decode([token_id])}")



77: the
163: dog
185: ran
1004: quick
107: ly
9: .


#### Saving the tokenizer


In [27]:
# Already saved once; reloading this file keeps the token IDs reproducible across sessions.
# tokenizer.save("bpe_tokenizer.json")

### Input-Target Creation

We now implement a GPT-style training data pipeline: the continuous token stream is cut into overlapping fixed-length context windows, and each window yields an input-target pair for next-token prediction.

In [28]:
from tokenizers import Tokenizer

tokenizer=Tokenizer.from_file("bpe_tokenizer.json")

encoded=tokenizer.encode(raw_text)
token_ids=encoded.ids

print("total number of tokens: ",len(token_ids))

total number of tokens:  40213


In [32]:
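# Chunk the token stream into overlapping windows

def create_chunks(token_ids, context_length, stride):
  chunks = []
  i = 0

  # Each chunk holds context_length + 1 tokens: the extra token is the
  # target for the last input position.
  while i + context_length + 1 <= len(token_ids):
    chunk = token_ids[i : i + context_length + 1]
    chunks.append(chunk)
    i += stride  # a stride smaller than context_length makes windows overlap

  return chunks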

#### Create Input-Target Pairs

In [33]:
def create_input_target_pairs(chunks):
  inputs=[]
  targets=[]

  for chunk in chunks:
    inputs.append(chunk[:-1])
    targets.append(chunk[1:])

  return inputs,targets
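A toy run makes the windowing easier to follow. The sketch below repeats the two functions so it is self-contained and feeds them the integers 0 to 9 in place of real token IDs; the toy values and the small `context_length` and `stride` are chosen only for illustration.

```python
def create_chunks(token_ids, context_length, stride):
    """Slide a window of context_length + 1 tokens over the stream."""
    chunks = []
    i = 0
    while i + context_length + 1 <= len(token_ids):
        chunks.append(token_ids[i : i + context_length + 1])
        i += stride
    return chunks

def create_input_target_pairs(chunks):
    """Inputs drop the last token; targets drop the first (shift by one)."""
    inputs = [chunk[:-1] for chunk in chunks]
    targets = [chunk[1:] for chunk in chunks]
    return inputs, targets

toy_ids = list(range(10))  # stand-in for real token IDs
toy_chunks = create_chunks(toy_ids, context_length=4, stride=2)
toy_inputs, toy_targets = create_input_target_pairs(toy_chunks)

for x, y in zip(toy_inputs, toy_targets):
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

At every position the target is the token that follows the input token. The final token (9) never appears because no full window fits after the third chunk.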

In [34]:
context_length = 128  # small context window for a toy model
stride = 64  # each window overlaps the previous one by 50%

chunks=create_chunks(token_ids,context_length,stride)
inputs,targets=create_input_target_pairs(chunks)

print("Number of samples:", len(inputs))
print("Input shape:", len(inputs[0]))
print("Target shape:", len(targets[0]))

Number of samples: 627
Input shape: 128
Target shape: 128
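The sample count can be checked by hand from the loop condition in `create_chunks`. The sketch below is a closed-form version of that loop; the helper name `expected_num_chunks` is made up for this check.

```python
def expected_num_chunks(num_tokens, context_length, stride):
    """Number of windows of size context_length + 1 that fit when stepping by stride."""
    window = context_length + 1
    if num_tokens < window:
        return 0
    return (num_tokens - window) // stride + 1

num_tokens = 40213  # token count printed earlier
n_chunks = expected_num_chunks(num_tokens, context_length=128, stride=64)
print(n_chunks)  # 627

# Tokens after the end of the last window are never used.
last_start = (n_chunks - 1) * 64
unused_tail = num_tokens - (last_start + 128 + 1)
print(unused_tail)  # 20
```

The result matches the 627 samples reported above; the last 20 tokens of the book fall outside the final window.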


In [37]:
# Sanity check: decode one sample

sample_input=tokenizer.decode(inputs[0])
sample_target=tokenizer.decode(targets[0])

print("INPUT:")
print(sample_input)

print("\nTARGET:")
print(sample_target)  #works like a charm

INPUT:
The Call of the Wild by Jack Lond on Con tents Chapter I . Into the Primitive Chapter II . The Law of Club and Fang Chapter III . The Dominant Primordial Beast Chapter IV . Who Has Won to Mastership Chapter V . The Toil of Trace and Trail Chapter VI . For the Love of a Man Chapter VII . The Sounding of the Call Chapter I . Into the Primitive “ Old long ings nomad ic leap , Ch af ing at custom ’ s chain ; Again from its bru mal sleep Waken s the fer ine strain .” Buck did not read the newspapers , or he would have known that trouble was bre wing , not alone for himself , but for

TARGET:
Call of the Wild by Jack Lond on Con tents Chapter I . Into the Primitive Chapter II . The Law of Club and Fang Chapter III . The Dominant Primordial Beast Chapter IV . Who Has Won to Mastership Chapter V . The Toil of Trace and Trail Chapter VI . For the Love of a Man Chapter VII . The Sounding of the Call Chapter I . Into the Primitive “ Old long ings nomad ic leap , Ch af ing at custom ’ s cha

### PyTorch Dataset & DataLoader

Wrap our (input, target) pairs into a clean, reusable Dataset, then load them efficiently with a DataLoader.

In [38]:
import torch
from torch.utils.data import Dataset


In [48]:
class llm_text_dataset(Dataset):
  def __init__(self,inputs,targets):
    assert len(inputs) == len(targets), "Inputs and targets must have the same length; otherwise, recheck the previous cells"
    self.inputs=inputs
    self.targets=targets

  def __len__(self):
    return len(self.inputs)

  def __getitem__(self,idx):
    x=torch.tensor(self.inputs[idx],dtype=torch.long)
    y=torch.tensor(self.targets[idx],dtype=torch.long)
    return x,y

In [49]:
dataset=llm_text_dataset(inputs,targets)
print("Total training samples:", len(dataset))

Total training samples: 627


In [45]:
from torch.utils.data import DataLoader


In [50]:
batch_size=32

dataloader=DataLoader(
    dataset,
    batch_size=batch_size,
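    shuffle=True,   # reshuffle samples each epoch so batches do not follow the book's order
    drop_last=True  # drop the final partial batch so every batch has batch_size samples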
)

In [51]:
# Check the shapes of one batch

batch_1,batch_2=next(iter(dataloader))

print("batch input shape: ",batch_1.shape)
print("batch target shape: ",batch_2.shape)

batch input shape:  torch.Size([32, 128])
batch target shape:  torch.Size([32, 128])
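With `drop_last=True`, the number of batches per epoch follows from simple arithmetic. The sketch below works it out for the 627 samples and batch size of 32 used here; the variable names are illustrative.

```python
import math

num_samples = 627
batch_size = 32

# drop_last=True keeps only full batches.
full_batches, leftover = divmod(num_samples, batch_size)
print(full_batches)  # 19 batches per epoch
print(leftover)      # 19 samples skipped each epoch

# With drop_last=False there would be one extra, smaller batch.
batches_keep_last = math.ceil(num_samples / batch_size)
print(batches_keep_last)  # 20
```

Because `shuffle=True` draws a new ordering every epoch, the 19 skipped samples are generally different ones each time.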


##### Decode one sample

In [52]:
sample_input_text = tokenizer.decode(batch_1[0].tolist())
sample_target_text = tokenizer.decode(batch_2[0].tolist())

print("INPUT:")
print(sample_input_text)

print("\nTARGET:")
print(sample_target_text)


INPUT:
m .” Con cer ning that night ’ s ride , the man spoke most elo quently for himself , in a little shed back of a saloon on the San Fran cis co water front . “ All I get is fifty for it ,” he grumb led ; “ an ’ I wouldn ’ t do it over for a thousand , cold cash .” His hand was wra pped in a bloody hand ker chief , and the right trous er leg was ripped from knee to ank le . “ How much did the other mug get ?” the saloon - keeper demanded . “ A hundred ,” was the reply . “ Would n ’ t take a sou less , so help

TARGET:
.” Con cer ning that night ’ s ride , the man spoke most elo quently for himself , in a little shed back of a saloon on the San Fran cis co water front . “ All I get is fifty for it ,” he grumb led ; “ an ’ I wouldn ’ t do it over for a thousand , cold cash .” His hand was wra pped in a bloody hand ker chief , and the right trous er leg was ripped from knee to ank le . “ How much did the other mug get ?” the saloon - keeper demanded . “ A hundred ,” was the reply . “ 