# 📓 Notebook 3: Tokenization, Labeling & Data Splitting
**LLM Data Processing Pipeline · Stage 3 of 3**

The final preprocessing stage before model training:
- Word-level and sub-word tokenization
- Vocabulary building & token-to-ID mapping
- Data labeling for supervised tasks
- Train / Validation / Test splitting

> **Prerequisites:** `pip install pandas scikit-learn` (tokenization is shown with pure Python first, then with `transformers` if available)


## 3.1 Setup & Normalized Input

Simulated output from Notebook 2: fully normalized text, ready for tokenization.


In [1]:
import re
import json
import pandas as pd
from collections import Counter

# -------------------------------------------------------------------
# Simulated output from Notebook 2
# -------------------------------------------------------------------
sentences = [
    "the quick brown fox jumped over the lazy dog",
    "large language models learn from massive volumes of text data",
    "neural networks are the backbone of modern ai",
    "transformers changed natural language processing forever",
    "deep learning models require large datasets for training",
    "do not underestimate the importance of data quality",
    "it is a great time to be working in machine learning",
    "data preparation is often the most time consuming step",
    "tokenization converts text into numerical representations",
    "bias removal is essential for responsible ai systems",
]

labels = [
    "general", "llm", "neural_nets", "llm", "deep_learning",
    "data_quality", "general", "data_quality", "tokenization", "ethics",
]

df = pd.DataFrame({"text": sentences, "label": labels})
print(f"Corpus: {len(df)} sentences, {df['label'].nunique()} unique labels")
df


Corpus: 10 sentences, 7 unique labels


,text,label
0,the quick brown fox jumped over the lazy dog,general
1,large language models learn from massive volum...,llm
2,neural networks are the backbone of modern ai,neural_nets
3,transformers changed natural language processi...,llm
4,deep learning models require large datasets fo...,deep_learning
5,do not underestimate the importance of data qu...,data_quality
6,it is a great time to be working in machine le...,general
7,data preparation is often the most time consum...,data_quality
8,tokenization converts text into numerical repr...,tokenization
9,bias removal is essential for responsible ai s...,ethics


## 3.2 Word-Level Tokenization

The simplest tokenization splits the text into words. Every distinct word becomes one vocabulary entry.
Word-level tokenizers cannot represent words they have never seen (OOV, out of vocabulary).


In [2]:
# -------------------------------------------------------------------
# Word tokenizer: split on whitespace / punctuation boundaries
# -------------------------------------------------------------------
def word_tokenize(text):
    return re.findall(r"\b\w+\b", text.lower())

tokenized = [word_tokenize(sent) for sent in df["text"]]

print("Example tokenization:")
for i in range(2):
    print(f"  [{i}] {tokenized[i]}")

# Build vocabulary counts from every token in the corpus
# (in a real pipeline, count the training split only to avoid leakage)
all_tokens = [tok for sent_tokens in tokenized for tok in sent_tokens]
vocab_counts = Counter(all_tokens)
print(f"\nTotal tokens: {len(all_tokens)}")
print(f"Unique tokens (vocab size): {len(vocab_counts)}")
print(f"Top 10 most frequent: {vocab_counts.most_common(10)}")


Example tokenization:
  [0] ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
  [1] ['large', 'language', 'models', 'learn', 'from', 'massive', 'volumes', 'of', 'text', 'data']

Total tokens: 83
Unique tokens (vocab size): 65
Top 10 most frequent: [('the', 5), ('of', 3), ('data', 3), ('is', 3), ('large', 2), ('language', 2), ('models', 2), ('text', 2), ('ai', 2), ('learning', 2)]
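
To make the OOV problem concrete, here is a minimal sketch that reports which tokens of a new sentence are missing from a vocabulary. The vocabulary `known_words`, the helper `oov_tokens`, and the example sentence are hypothetical stand-ins, not part of the notebook's corpus.

```python
import re

def word_tokenize(text):
    # Same regex-based word tokenizer used in this notebook
    return re.findall(r"\b\w+\b", text.lower())

# Tiny stand-in vocabulary (hypothetical; the real one is built from the corpus)
known_words = {"the", "quick", "brown", "fox", "data", "text", "models"}

def oov_tokens(text, vocab):
    """Return the tokens in `text` that are missing from `vocab`."""
    return [tok for tok in word_tokenize(text) if tok not in vocab]

new_sentence = "The quick brown fox tokenizes text"
missing = oov_tokens(new_sentence, known_words)
oov_rate = len(missing) / len(word_tokenize(new_sentence))

print(f"OOV tokens: {missing}")
print(f"OOV rate  : {oov_rate:.2f}")
```

A word-level encoder would map every token listed as OOV to the same `[UNK]` ID, which discards whatever information the word carried.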


## 3.3 Building a Vocabulary & Token-to-ID Mapping

Each unique token is assigned an integer ID. The special tokens
`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, and `[MASK]` are reserved at the start of the vocabulary.


In [3]:
# -------------------------------------------------------------------
# Reserve special tokens, then assign IDs to corpus vocabulary
# -------------------------------------------------------------------
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# Sort by frequency (most common first) for stable vocab
sorted_vocab = [tok for tok, _ in vocab_counts.most_common()]

# Combine: special tokens first, then corpus tokens
full_vocab = SPECIAL_TOKENS + sorted_vocab

token2id = {tok: idx for idx, tok in enumerate(full_vocab)}
id2token = {idx: tok for tok, idx in token2id.items()}

print(f"Vocabulary size (including special tokens): {len(token2id)}")
print("\nFirst 15 entries:")
for tok, idx in list(token2id.items())[:15]:
    print(f"  {idx:3d}  {tok}")


Vocabulary size (including special tokens): 70

First 15 entries:
    0  [PAD]
    1  [UNK]
    2  [CLS]
    3  [SEP]
    4  [MASK]
    5  the
    6  of
    7  data
    8  is
    9  large
   10  language
   11  models
   12  text
   13  ai
   14  learning
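
One caveat on stability: `Counter.most_common()` keeps tokens with equal counts in first-seen order, so the IDs of tied tokens depend on the order of the corpus. A minimal sketch of one way to make the ordering independent of corpus order, assuming an alphabetical tie-break is acceptable; `stable_vocab` and the toy counts are hypothetical.

```python
from collections import Counter

# Toy counts (hypothetical): same tokens, seen in two different orders
counts_a = Counter(["data", "ai", "data", "text", "ai"])
counts_b = Counter(["ai", "text", "data", "ai", "data"])

def stable_vocab(counts):
    # Sort by descending frequency, then alphabetically to break ties
    return sorted(counts, key=lambda tok: (-counts[tok], tok))

# most_common() order differs between the two corpora ...
print([tok for tok, _ in counts_a.most_common()])
print([tok for tok, _ in counts_b.most_common()])

# ... while the explicit sort key gives the same order for both
print(stable_vocab(counts_a))
print(stable_vocab(counts_b))
```

With this ordering, re-running the pipeline on a shuffled copy of the same corpus yields the same token-to-ID mapping.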


## 3.4 Encoding Sentences to Integer Sequences

Convert each sentence from a list of words to a list of integer IDs.
Unknown tokens (not in vocab) map to the `[UNK]` ID.


In [4]:
# -------------------------------------------------------------------
# Encode text → integer IDs; unknown words → [UNK]
# -------------------------------------------------------------------
UNK_ID = token2id["[UNK]"]

def encode(tokens, vocab):
    return [vocab.get(tok, UNK_ID) for tok in tokens]

def decode(ids, id_map):
    return [id_map.get(i, "[UNK]") for i in ids]

encoded = [encode(toks, token2id) for toks in tokenized]

print("Encoded examples:")
for i in range(3):
    print(f"  Sentence : {df['text'].iloc[i]}")
    print(f"  Token IDs: {encoded[i]}")
    print(f"  Decoded  : {decode(encoded[i], id2token)}")
    print()


Encoded examples:
  Sentence : the quick brown fox jumped over the lazy dog
  Token IDs: [5, 17, 18, 19, 20, 21, 5, 22, 23]
  Decoded  : ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

  Sentence : large language models learn from massive volumes of text data
  Token IDs: [9, 10, 11, 24, 25, 26, 27, 6, 12, 7]
  Decoded  : ['large', 'language', 'models', 'learn', 'from', 'massive', 'volumes', 'of', 'text', 'data']

  Sentence : neural networks are the backbone of modern ai
  Token IDs: [28, 29, 30, 5, 31, 6, 32, 13]
  Decoded  : ['neural', 'networks', 'are', 'the', 'backbone', 'of', 'modern', 'ai']
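
The encoded sequences above have different lengths and do not yet use the reserved special tokens. A minimal sketch of how `[CLS]`, `[SEP]`, and `[PAD]` are commonly applied before batching, assuming the same ID layout as this notebook (`[PAD]`=0, `[CLS]`=2, `[SEP]`=3); the helper `add_special_and_pad` and the choice of `max_len` are hypothetical.

```python
# IDs mirror the special-token layout used in this notebook
PAD_ID, CLS_ID, SEP_ID = 0, 2, 3

def add_special_and_pad(batch, max_len):
    """Wrap each sequence in [CLS] ... [SEP], then truncate or pad to max_len."""
    padded, masks = [], []
    for ids in batch:
        # Keep room for the two wrapper tokens when truncating
        seq = [CLS_ID] + ids[: max_len - 2] + [SEP_ID]
        # Attention mask: 1 for real tokens, 0 for padding
        mask = [1] * len(seq) + [0] * (max_len - len(seq))
        seq = seq + [PAD_ID] * (max_len - len(seq))
        padded.append(seq)
        masks.append(mask)
    return padded, masks

batch = [
    [5, 17, 18, 19, 20, 21, 5, 22, 23],  # sentence 0 from the output above
    [28, 29, 30, 5, 31, 6, 32, 13],      # sentence 2 from the output above
]
padded, masks = add_special_and_pad(batch, max_len=12)
for seq, mask in zip(padded, masks):
    print(seq, mask)
```

Every row now has the same length, so the batch can be stacked into a single tensor, and the mask tells the model which positions to ignore.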



## 3.5 Sub-Word Tokenization (BPE concept)

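Real LLMs use sub-word tokenizers (BPE, WordPiece) to handle rare and OOV words.
The cell below tokenizes a few examples with the pretrained BERT WordPiece tokenizer when `transformers` is installed, and otherwise prints an illustrative BPE-style merge walkthrough.
For production, use the `tokenizers` or `transformers` libraries from Hugging Face.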


In [5]:
# -------------------------------------------------------------------
# Sub-word demonstration: show how words such as "preprocessing"
# become sub-word pieces rather than [UNK]
# -------------------------------------------------------------------

# Use a pretrained Hugging Face tokenizer (works if transformers is installed;
# the tokenizer files are downloaded on first use)
try:
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    examples = [
        "tokenization",
        "preprocessing",
        "unrelated",
        "LLMs learn from massive datasets",
    ]
    print("HuggingFace BERT WordPiece tokenization:")
    for ex in examples:
        toks = tokenizer.tokenize(ex)
        ids  = tokenizer.encode(ex, add_special_tokens=True)
        print(f"  Input : {ex!r}")
        print(f"  Tokens: {toks}")
        print(f"  IDs   : {ids}")
        print()

except (ImportError, OSError):
    # ImportError: transformers is missing; OSError: tokenizer files could not be loaded
    print("transformers or the pretrained tokenizer is unavailable; showing manual sub-word concept instead.")
    print()
    # Manual character-pair demonstration
    word = "unrelated"
    chars = list(word)
    print(f"Word: {word!r}")
    print(f"Character-level split: {chars}")
    print()
    # In BPE, frequent adjacent pairs are merged, e.g. 'u' + 'n' → 'un'.
    # The steps below are illustrative, not the output of a trained tokenizer.
    print("BPE would progressively merge frequent pairs:")
    print("  Step 1: ['u','n','r','e','l','a','t','e','d']")
    print("  Step 2: ['un','r','e','l','a','t','e','d']   (merge 'u'+'n')")
    print("  Step 3: ['un','re','l','a','t','e','d']       (merge 'r'+'e')")
    print("  Step 4: ['un','relat','ed']                   (after more merges)")
    print("  → Final sub-words map to known vocabulary IDs, no [UNK] needed!")




HuggingFace BERT WordPiece tokenization:
  Input : 'tokenization'
  Tokens: ['token', '##ization']
  IDs   : [101, 19204, 3989, 102]

  Input : 'preprocessing'
  Tokens: ['prep', '##ro', '##ces', '##sing']
  IDs   : [101, 17463, 3217, 9623, 7741, 102]

  Input : 'unrelated'
  Tokens: ['unrelated']
  IDs   : [101, 15142, 102]

  Input : 'LLMs learn from massive datasets'
  Tokens: ['ll', '##ms', 'learn', 'from', 'massive', 'data', '##set', '##s']
  IDs   : [101, 2222, 5244, 4553, 2013, 5294, 2951, 13462, 2015, 102]
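
For readers who want to see the merge step itself rather than a pretrained tokenizer, here is a minimal pure-Python sketch of BPE training on a toy word list. The word frequencies are hypothetical, the tie-break rule is an arbitrary choice made for determinism, and production tokenizers add details (end-of-word markers, byte-level fallback) that are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy word frequencies (hypothetical), each word split into characters
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pairs = get_pair_counts(words)
    # Pick the most frequent pair; ties are broken by the pair itself (arbitrary)
    best = max(pairs, key=lambda p: (pairs[p], p))
    merges.append(best)
    words = merge_pair(words, best)

print("Learned merges :", merges)
print("Segmented words:", list(words))
```

Each learned merge becomes a vocabulary entry. At inference time the same merges are replayed in order on new words, so an unseen word is broken into known pieces instead of collapsing to `[UNK]`.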



## 3.6 Data Labeling

For supervised tasks the model needs labels alongside input text.
Here we show the label encoding step: converting string labels to integer class IDs.


In [6]:
# -------------------------------------------------------------------
# Encode string labels → integer class IDs
# -------------------------------------------------------------------
unique_labels = sorted(df["label"].unique())
label2id = {lbl: idx for idx, lbl in enumerate(unique_labels)}
id2label = {idx: lbl for lbl, idx in label2id.items()}

df["label_id"] = df["label"].map(label2id)

print("Label mapping:")
for lbl, idx in label2id.items():
    print(f"  {idx}  →  {lbl}")

print()
print(df[["text", "label", "label_id"]])


Label mapping:
  0  →  data_quality
  1  →  deep_learning
  2  →  ethics
  3  →  general
  4  →  llm
  5  →  neural_nets
  6  →  tokenization

                                                text          label  label_id
0       the quick brown fox jumped over the lazy dog        general         3
1  large language models learn from massive volum...            llm         4
2      neural networks are the backbone of modern ai    neural_nets         5
3  transformers changed natural language processi...            llm         4
4  deep learning models require large datasets fo...  deep_learning         1
5  do not underestimate the importance of data qu...   data_quality         0
6  it is a great time to be working in machine le...        general         3
7  data preparation is often the most time consum...   data_quality         0
8  tokenization converts text into numerical repr...   tokenization         6
9  bias removal is essential for responsible ai s...         ethics         2
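
Note that `Series.map` leaves `NaN` for any label that is missing from the mapping, which can go unnoticed until training. A minimal sketch of a stricter encoder plus a round-trip check; the label list and the helper `encode_label` are hypothetical.

```python
# Hypothetical labels; the mapping is built the same way as in the cell above
labels = ["general", "llm", "ethics", "llm"]
unique_labels = sorted(set(labels))
label2id = {lbl: idx for idx, lbl in enumerate(unique_labels)}
id2label = {idx: lbl for lbl, idx in label2id.items()}

def encode_label(label):
    # Fail loudly instead of silently producing a missing value
    if label not in label2id:
        raise KeyError(f"Unknown label: {label!r}")
    return label2id[label]

label_ids = [encode_label(lbl) for lbl in labels]
round_trip = [id2label[i] for i in label_ids]

print("IDs       :", label_ids)
print("Round trip:", round_trip)
```

The round trip through `id2label` is also how integer model predictions are turned back into readable class names.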

## 3.7 Train / Validation / Test Split

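The prepared data is divided into three non-overlapping sets, targeting 70% / 15% / 15%.
Stratified splitting keeps each class proportionally represented, but it needs
at least two examples per class. Several labels in this 10-sentence corpus occur
only once, so the code below falls back to a random split, and rounding on so few
rows yields 6 / 2 / 2 rather than the target ratio.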


In [7]:
# -------------------------------------------------------------------
# Target split: 70% train, 15% validation, 15% test
# Stratify only when every class has at least 2 samples
# -------------------------------------------------------------------
try:
    from sklearn.model_selection import train_test_split

    # Stratification needs at least 2 samples per class; otherwise split randomly
    stratify_col = df["label_id"] if df["label_id"].value_counts().min() > 1 else None

    # First split off the test set
    train_val, test = train_test_split(
        df, test_size=0.15, random_state=42, stratify=stratify_col
    )
    # Then split validation from the remainder (this step is not stratified)
    train, val = train_test_split(
        train_val, test_size=0.176, random_state=42  # 0.176 ≈ 0.15 / 0.85, i.e. 15% of the total
    )

    print(f"Train size     : {len(train):4d}  ({len(train)/len(df)*100:.0f}%)")
    print(f"Validation size: {len(val):4d}  ({len(val)/len(df)*100:.0f}%)")
    print(f"Test size      : {len(test):4d}  ({len(test)/len(df)*100:.0f}%)")
    print(f"Total          : {len(train)+len(val)+len(test)}")

    print("\nTrain set:")
    print(train[["text","label"]].to_string(index=False))
    print("\nValidation set:")
    print(val[["text","label"]].to_string(index=False))
    print("\nTest set:")
    print(test[["text","label"]].to_string(index=False))

except ImportError:
    print("scikit-learn not installed. Falling back to manual split.")
    n = len(df)
    train = df.iloc[:int(n*0.70)]
    val   = df.iloc[int(n*0.70):int(n*0.85)]
    test  = df.iloc[int(n*0.85):]
    print(f"Train: {len(train)} | Val: {len(val)} | Test: {len(test)}")


Train size     :    6  (60%)
Validation size:    2  (20%)
Test size      :    2  (20%)
Total          : 10

Train set:
                                                    text        label
     do not underestimate the importance of data quality data_quality
    it is a great time to be working in machine learning      general
  data preparation is often the most time consuming step data_quality
    bias removal is essential for responsible ai systems       ethics
           neural networks are the backbone of modern ai  neural_nets
transformers changed natural language processing forever          llm

Validation set:
                                                    text         label
            the quick brown fox jumped over the lazy dog       general
deep learning models require large datasets for training deep_learning

Test set:
                                                         text        label
    tokenization converts text into numerical representations tokenization
large language models learn from massive volumes of text data          llm
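
The 6 / 2 / 2 sizes reported above follow from rounding on a 10-row corpus. A minimal sketch of the arithmetic, assuming the held-out count of each split is rounded up (which is consistent with the output above); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_frac=0.15, val_frac_of_rest=0.176):
    """Mimic the two-stage split arithmetic, rounding held-out counts up."""
    n_test = math.ceil(test_frac * n_samples)
    n_rest = n_samples - n_test
    n_val = math.ceil(val_frac_of_rest * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test

for n in (10, 100, 1000):
    train_n, val_n, test_n = split_sizes(n)
    print(f"n={n:5d}  train={train_n}  val={val_n}  test={test_n}")
```

On larger corpora the rounding error becomes negligible and the realized proportions approach the 70 / 15 / 15 target.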


## 3.8 Exporting the Preprocessed Dataset

Save each split so it can be loaded by a training script without repeating all preprocessing.


In [9]:
import json

# -------------------------------------------------------------------
# Save splits and vocabulary to JSON (works in any environment)
# -------------------------------------------------------------------
output = {
    "vocab": token2id,
    "label_map": label2id,
    "splits": {
        "train": train[["text","label","label_id"]].to_dict(orient="records"),
        "val":   val[["text","label","label_id"]].to_dict(orient="records"),
        "test":  test[["text","label","label_id"]].to_dict(orient="records"),
    }
}

with open("preprocessed_dataset.json", "w", encoding="utf-8") as f:
    json.dump(output, f, indent=2)

print("Saved: preprocessed_dataset.json")
print(f"  Vocab size : {len(token2id)}")
print(f"  Train rows : {len(output['splits']['train'])}")
print(f"  Val rows   : {len(output['splits']['val'])}")
print(f"  Test rows  : {len(output['splits']['test'])}")
print()
print("This file feeds directly into model training.")


Saved: preprocessed_dataset.json
  Vocab size : 70
  Train rows : 6
  Val rows   : 2
  Test rows  : 2

This file feeds directly into model training.
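
On the consuming side, a training script would read the file back and rebuild any reverse mappings it needs. A minimal sketch, assuming the same top-level keys as the export above; the miniature payload and the temporary file path are hypothetical so that the example runs on its own.

```python
import json
import os
import tempfile

# Hypothetical miniature payload with the same top-level keys as the export
payload = {
    "vocab": {"[PAD]": 0, "[UNK]": 1, "the": 2, "data": 3},
    "label_map": {"general": 0, "llm": 1},
    "splits": {
        "train": [{"text": "the data", "label": "general", "label_id": 0}],
        "val": [],
        "test": [],
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocessed_dataset.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# JSON object keys are always strings, so an int-keyed dict such as id2token
# would not survive a round trip unchanged; rebuild it from token2id instead.
token2id = loaded["vocab"]
id2token = {idx: tok for tok, idx in token2id.items()}
unk_id = token2id["[UNK]"]

train_rows = loaded["splits"]["train"]
encoded = [
    [token2id.get(tok, unk_id) for tok in row["text"].split()]
    for row in train_rows
]
print(encoded)
```

Storing only `token2id` and deriving `id2token` after loading avoids the string-key pitfall entirely.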


## 3.9 Summary: Full Pipeline Recap

| Stage | Notebook | Key Steps |
|-------|----------|-----------|
| **Raw Data** | - | Web scrape, books, code repos, datasets |
| **Data Cleaning** | 01 | Dedup, empty removal, noise removal, spell correction |
| **Normalization** | 02 | Contractions, lowercase, punctuation, whitespace, bias flagging |
| **Tokenization** | 03 | Word/sub-word split, vocab build, token→ID encoding |
| **Labeling** | 03 | String labels → integer class IDs |
| **Splitting** | 03 | 70% train / 15% val / 15% test |
| **Training** | *next* | Feed `preprocessed_dataset.json` to model |

> **Next step:** Load the saved JSON into your model training loop (PyTorch, TensorFlow, or HuggingFace `Trainer`).
