# 🧩 Week 5-6, Notebook 4: Advanced Tokenization and Custom Vocabularies

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Tokenization is the critical, often overlooked, bridge between the raw, messy text from the factory floor and the structured numerical inputs a Large Language Model requires. A well-designed tokenizer understands your domain's unique language—from part numbers and error codes to specialized verbs. A poorly suited one will shred important terms into meaningless sub-pieces, crippling your model's ability to understand context and nuance.

In this notebook, we move beyond using pre-trained tokenizers and take a crucial step toward building a true domain-specific model: we will **train our own tokenizer** from scratch on a corpus of manufacturing data. This custom tokenizer will form a key component of our Manufacturing Copilot, ensuring it speaks the language of our factory and can interpret technical information correctly.

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1.  **Diagnose Tokenizer Mismatches:** Articulate and demonstrate how standard, pre-trained tokenizers can fail on specialized manufacturing jargon.
2.  **Train a Custom Tokenizer:** Build, train, and save a new Byte-Pair Encoding (BPE) tokenizer from scratch using a corpus of sample maintenance logs.
3.  **Measure Vocabulary Coverage:** Quantify the effectiveness of a tokenizer by calculating its Out-of-Vocabulary (OOV) rate on a given dataset.
4.  **Package and Distribute a Tokenizer:** Properly save a trained tokenizer in a format that can be easily loaded and shared using the standard Hugging Face `AutoTokenizer` class.
5.  **Control Padding and Truncation:** Configure a tokenizer to handle batches of text with varying lengths, a crucial step for both training and inference.

## 🏭 Part 1: The Problem with Off-the-Shelf Tokenizers

Pre-trained models from the Hugging Face Hub come with their own tokenizers, which were trained on massive, general-purpose datasets like Wikipedia and Common Crawl. While powerful, these tokenizers have often never seen the specific jargon, part numbers, and error codes common in a manufacturing environment.

Let's see what happens when we feed some typical manufacturing text to these standard tokenizers. We will examine how they "see" the text by looking at the tokens they produce.

# Hands-On: Diagnosing Tokenizer Failures
from transformers import AutoTokenizer
import pandas as pd

# A list of terms and phrases commonly found in a manufacturing setting
manufacturing_terms = [
    'Hydroforming pressure calibration check for part #HFP-2024A.',
    'OEE dropped to 71% after unplanned downtime on CNC-12.',
    'Favor revisar torque 450 Nm en lote 18. (Spanish)',
    'Robot axis-3 grease refill overdue per SOP-442-V3.'
]

# Let's test a few popular tokenizers with different underlying algorithms.
# Note: meta-llama/Meta-Llama-3-8B-Instruct is a gated model; you may need to request access and log in to the Hub.
tokenizers_to_test = {
    'GPT-2 (BPE)': 'gpt2',
    'BERT (WordPiece)': 'bert-base-uncased',
    'Llama-3 (BPE)': 'meta-llama/Meta-Llama-3-8B-Instruct'
}

results = []
for name, model_path in tokenizers_to_test.items():
    try:
        # Load the tokenizer from the Hub
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        for text in manufacturing_terms:
            # Get the list of token strings
            tokens = tokenizer.tokenize(text)
            results.append({
                'Tokenizer': name,
                'Original Text': text,
                'Token Count': len(tokens),
                'Tokens': ' | '.join(tokens)  # Use a separator for clarity
            })
    except Exception as e:
        print(f"Could not load tokenizer '{model_path}'. It may require special access. Error: {e}")

# Display the results in a DataFrame for easy comparison
df = pd.DataFrame(results)
df

### **Analysis and Diagnosis**

The results clearly show the problem:

1.  **Fragmented Technical Terms:** The tokenizers break our domain-specific identifiers apart. `HFP-2024A`, for example, comes out as several pieces such as `H`, `FP`, `-`, `2024`, and `A` (the exact split varies by tokenizer). Nothing in the token sequence marks this as a single, unique part number.
2.  **Loss of Meaning:** Look at how each tokenizer handles the unit `Nm` (newton-meters). If it is split into `N` and `m`, the model has to reassemble the unit of torque from two unrelated pieces.
3.  **Inconsistent Tokenization:** `CNC-12` is handled differently by each tokenizer. A given model only ever sees its own tokenizer's output, but the differences matter as soon as you mix components, for example when you count tokens with one tokenizer and send the text to a model that uses another.

**The Core Problem:** The vocabularies of these general-purpose tokenizers do not contain our specialized terms.

**The Solution:** We need to train a new tokenizer that learns *our* vocabulary from *our* data.

## 🛠️ Part 2: Training a Domain-Specific Tokenizer

To solve this problem, we will use the `tokenizers` library from Hugging Face to train our own **Byte-Pair Encoding (BPE)** tokenizer. BPE is a subword tokenization algorithm that starts with a base vocabulary of individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new token.

This bottom-up approach allows the tokenizer to "discover" the common words and subwords in our corpus. By training it on our maintenance logs, it can learn to represent terms like `CNC-12` and `SKF-6205-2Z` as single units, as long as the pre-tokenizer does not split them apart first.

In a real-world project, you would use thousands or even millions of documents for training. For this demonstration, we will use a small, representative sample of maintenance logs.
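
The merge loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the `tokenizers` implementation: the function name `train_bpe`, the word list, and the tie-breaking rule (first pair seen wins) are assumptions made for the example.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE: learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Hypothetical mini-corpus: the identifier "CNC-12" is the most frequent word.
toy_words = ["CNC-12", "CNC-12", "CNC-14", "CNC"]
merges, segmented = train_bpe(toy_words, num_merges=10)
print("Learned merges:", merges)
print("Final segmentation:", list(segmented))
```

Because `CNC-12` appears often, its characters are merged step by step until the whole identifier is one symbol. The loop stops early once no adjacent pairs remain.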

# Hands-On: Training a Custom BPE Tokenizer
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# --- Step 1: Prepare the Training Corpus ---
# This should be an iterator that yields strings. For a large dataset,
# you could have a generator that reads lines from a file.
maintenance_logs_corpus = [
    'Press-24 hydraulic accumulator leak detected at 03:14.',
    'Torque wrench calibration overdue for cell B; schedule before shift 2.',
    'Robot cell 3 axis-2 grease refill triggered due to high temperature.',
    'Lathe #4 vibration at 12.5 mm/s despite new SKF-6205-2Z bearing.',
    'Favor revisar torque 450 Nm en lote 18 y reportar a calidad.',
    'OEE for CNC-12 dropped to 71%. Root cause: spindle overheating.',
    'Part #HFP-2024A failed quality inspection due to surface defects.'
]

# --- Step 2: Configure the Tokenizer ---
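# We start with a blank BPE model. The `unk_token` stands in for any symbol that is not in the vocabulary.
custom_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# The pre-tokenizer splits the text into words, and BPE merges never cross those boundaries.
# `Whitespace` also splits on punctuation, which would break `CNC-12` into `CNC`, `-`, `12`
# before training starts, so we split on whitespace only to keep hyphenated identifiers intact.
# Trade-off: trailing punctuation stays attached to the preceding word during training.
from tokenizers.pre_tokenizers import WhitespaceSplit
custom_tokenizer.pre_tokenizer = WhitespaceSplit()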

# --- Step 3: Configure and Run the Trainer ---
# The trainer will learn the merge rules from our corpus.
trainer = BpeTrainer(
    vocab_size=1000,  # The desired size of the final vocabulary.
    min_frequency=1,  # Only merge pairs that occur at least this many times.
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] # Define special tokens.
)

# Train the tokenizer on our data
custom_tokenizer.train_from_iterator(maintenance_logs_corpus, trainer=trainer)

# --- Step 4: Save the Trained Tokenizer ---
# We save the tokenizer's configuration and vocabulary to a single JSON file.
# This file contains everything needed to reuse the tokenizer later.
artifacts_dir = Path("artifacts/custom_tokenizer")
artifacts_dir.mkdir(parents=True, exist_ok=True)
tokenizer_path = str(artifacts_dir / "maintenance-tokenizer.json")
custom_tokenizer.save(tokenizer_path)

print(f"Custom tokenizer trained and saved to: {tokenizer_path}")
print(f"Final Vocabulary Size: {custom_tokenizer.get_vocab_size()}")

### Testing Our Newly Trained Tokenizer

Now for the moment of truth. Let's compare how our custom tokenizer handles a test sentence compared to the general-purpose BERT tokenizer.

# A test sentence containing our specialized terms
test_sentence = "New SKF-6205-2Z bearing for CNC-12 shows high vibration."

# Encode the sentence with our custom tokenizer
encoding = custom_tokenizer.encode(test_sentence)

print(f"Test Sentence: '{test_sentence}'")
print("-" * 50)
print(f"Custom Tokenizer Output:\n{encoding.tokens}")
print("-" * 50)

# For comparison, let's see how the standard BERT tokenizer handles it
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(f"BERT Tokenizer Output:\n{bert_tokenizer.tokenize(test_sentence)}")

**Analysis:** Check the custom tokenizer's output for `SKF-6205-2Z` and `CNC-12`. Each should come out as a single token, but only if the pre-tokenizer splits on whitespace alone. A pre-tokenizer that also splits on punctuation, such as `Whitespace`, separates the hyphens before BPE runs, so the best possible result is `CNC`, `-`, `12`. The BERT tokenizer breaks both identifiers into several subword pieces.

Fewer, more meaningful tokens per identifier give a more compact representation of our text. That benefit only reaches a language model that is trained or fine-tuned with this tokenizer; a pre-trained model must keep the tokenizer it was trained with, because its embeddings are tied to that vocabulary.
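
To see why the pre-tokenizer decides whether an identifier can ever become one token, the sketch below approximates two splitting strategies in plain Python. The regular expression mirrors the pattern documented for the `Whitespace` pre-tokenizer, and `str.split()` stands in for `WhitespaceSplit`. Both helper functions are illustrative stand-ins with made-up names, not the library's implementation.

```python
import re


def split_like_whitespace(text):
    # Approximates the `Whitespace` pre-tokenizer: runs of word characters,
    # or runs of characters that are neither word characters nor spaces.
    return re.findall(r"\w+|[^\w\s]+", text)


def split_like_whitespace_split(text):
    # Approximates `WhitespaceSplit`: split on whitespace only.
    return text.split()


sample = "New SKF-6205-2Z bearing for CNC-12"
print(split_like_whitespace(sample))
print(split_like_whitespace_split(sample))
```

BPE merges only happen inside each piece returned by the pre-tokenizer, so the first strategy can never produce `CNC-12` as a single token, while the second can.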

## 📊 Part 3: Measuring Vocabulary Coverage with the OOV Rate

A key metric for evaluating a tokenizer's quality is its **Out-of-Vocabulary (OOV) rate**. This is the percentage of tokens in a given text that the tokenizer does not recognize and therefore maps to its special `[UNK]` (unknown) token.

A high OOV rate is a major problem. If the tokenizer frequently encounters unknown words, the model loses valuable information and cannot make accurate predictions. Our goal is to minimize the OOV rate on our target domain data. A lower OOV rate indicates a better fit between the tokenizer's vocabulary and the text it will be processing.

# Hands-On: Calculating the OOV Rate
def calculate_oov_rate(tokenizer, text_iterator):
    """Return the fraction of tokens that map to the unknown token."""
    total_tokens = 0
    unk_tokens = 0

    # `transformers` tokenizers expose `unk_token_id`; raw `tokenizers`
    # objects do not, so we look the ID up by the token string instead.
    is_transformers_tokenizer = hasattr(tokenizer, "unk_token_id")
    if is_transformers_tokenizer:
        unk_token_id = tokenizer.unk_token_id
    else:
        unk_token_id = tokenizer.token_to_id("[UNK]")

    for text in text_iterator:
        # Skip special tokens such as [CLS] and [SEP] so they do not inflate the total.
        if is_transformers_tokenizer:
            ids = tokenizer.encode(text, add_special_tokens=False)
        else:
            ids = tokenizer.encode(text, add_special_tokens=False).ids

        total_tokens += len(ids)
        unk_tokens += ids.count(unk_token_id)

    # Avoid division by zero on an empty corpus
    return (unk_tokens / total_tokens) if total_tokens > 0 else 0.0

# --- Compare OOV Rates ---

# 1. Load our custom tokenizer from the saved file
from tokenizers import Tokenizer
custom_tok_from_file = Tokenizer.from_file(tokenizer_path)

# 2. Load the baseline BERT tokenizer
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# 3. Calculate OOV rates on our maintenance log corpus
custom_oov_rate = calculate_oov_rate(custom_tok_from_file, maintenance_logs_corpus)
bert_oov_rate = calculate_oov_rate(bert_tokenizer, maintenance_logs_corpus)

print(f"OOV Rate (Custom Tokenizer): {custom_oov_rate:.2%}")
print(f"OOV Rate (BERT Tokenizer):   {bert_oov_rate:.2%}")

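As expected, our custom tokenizer has a 0% OOV rate on the data it was trained on. The BERT tokenizer will usually score close to 0% as well: WordPiece falls back to smaller pieces and emits `[UNK]` mainly when a word contains a character missing from its vocabulary. A low OOV rate therefore does not mean the tokenizer represents our terms well. Compare it with the token counts from Part 1, which show how heavily each identifier is fragmented.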

**The Real Test:** In a real project, you would perform this calculation on a **held-out test set**—a collection of documents that the tokenizer did *not* see during training. This gives you a much more honest measure of how well your tokenizer will perform on new, unseen data.
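
Because subword tokenizers rarely emit `[UNK]`, it can help to track a second number on the held-out set: the average number of tokens produced per whitespace-delimited word. The sketch below is one possible way to compute it. The helper name `tokens_per_word` and the two stand-in tokenizers are hypothetical, and the sample sentences are borrowed from elsewhere in this notebook purely for illustration.

```python
def tokens_per_word(tokenize, texts):
    """Average number of tokens produced per whitespace-delimited word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    return total_tokens / total_words if total_words else 0.0


# Stand-in tokenizers (hypothetical) that mark the two extremes.
def char_level(text):
    # One token per non-space character: maximal fragmentation.
    return [ch for ch in text if not ch.isspace()]


def word_level(text):
    # One token per word: no fragmentation at all.
    return text.split()


held_out_logs = [
    "Shift 1: verify coolant pressure before restart.",
    "Alert: axis-3 vibration exceeded 9 mm/s threshold on CNC-12.",
]
print(f"Word-level:      {tokens_per_word(word_level, held_out_logs):.2f} tokens/word")
print(f"Character-level: {tokens_per_word(char_level, held_out_logs):.2f} tokens/word")
```

With the real objects from this notebook you could pass `bert_tokenizer.tokenize` or `lambda t: custom_tokenizer.encode(t).tokens` as the `tokenize` argument. A value close to 1 means most words survive as single tokens; larger values indicate heavier fragmentation.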

## 📦 Part 4: Packaging for Easy Use with `AutoTokenizer`

The `tokenizers` library is excellent for training, but for deployment and sharing, we want our tokenizer to behave just like a standard Hugging Face tokenizer. This means being able to load it with the one-liner: `AutoTokenizer.from_pretrained(...)`.

To achieve this, we need to convert our single `tokenizer.json` file into the standard Hugging Face format, which typically includes the `tokenizer.json` file along with a `tokenizer_config.json` and `special_tokens_map.json` (the exact set of files can vary with the `transformers` version). The `PreTrainedTokenizerFast` class provides a convenient way to do this.

# Hands-On: Saving in Hugging Face Format
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer
import json

# --- Step 1: Load the tokenizer we trained ---
# `Tokenizer.from_file` returns the Rust-backed tokenizer object from the `tokenizers` library.
# (In Hugging Face terms, "slow" refers to the pure-Python tokenizers in `transformers`, not to this object.)
raw_tokenizer = Tokenizer.from_file(tokenizer_path)

# --- Step 2: Wrap it in a `PreTrainedTokenizerFast` object ---
# This is the Hugging Face "fast" tokenizer implementation that wraps the underlying Rust-based tokenizer.
hf_fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

# --- Step 3: Save it using the standard Hugging Face method ---
# This will create a directory with all the necessary files (tokenizer.json, config, etc.).
hf_tokenizer_dir = Path("artifacts/hf_custom_tokenizer")
hf_fast_tokenizer.save_pretrained(str(hf_tokenizer_dir))

print(f"Hugging Face compatible tokenizer saved to: '{hf_tokenizer_dir}'")
print("Directory contents:", [p.name for p in hf_tokenizer_dir.iterdir()])

### Loading and Using the Packaged Tokenizer

Now, anyone on your team can load and use your custom tokenizer with a single, standard line of code, just as they would with any pre-trained tokenizer from the Hugging Face Hub. This makes it incredibly easy to share and integrate into other parts of your project.

# Load the tokenizer from the directory we just saved it to
reloaded_tokenizer = AutoTokenizer.from_pretrained(str(hf_tokenizer_dir))

# Use it just like any other Hugging Face tokenizer
test_sentence = "Part #HFP-2024A from CNC-12 needs a new SKF-6205-2Z bearing."
tokens = reloaded_tokenizer.tokenize(test_sentence)

print(f"Test Sentence: '{test_sentence}'")
print(f"Tokens from reloaded tokenizer: {tokens}")

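# The tokens should match what `custom_tokenizer` produces for the same sentence,
# which confirms that the packaging round trip preserved the vocabulary and merges.
# The packaged tokenizer can now be reused in later notebooks, such as the RAG pipeline (Notebook 08),
# for example to measure chunk lengths in tokens. Note that a pre-trained embedding model
# still tokenizes with its own vocabulary.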

## 🪄 Part 5: Padding and Truncation for Batch Processing

When you process multiple texts at once (a "batch"), you need to ensure that all the input sequences have the same length. This is because the underlying models and frameworks like PyTorch and TensorFlow require inputs to be in rectangular tensors. We use **padding** and **truncation** to achieve this.

*   **Padding:** Adds special `[PAD]` tokens to the end of shorter sequences to make them match the length of the longest sequence in the batch.
*   **Truncation:** Cuts off tokens from the end of longer sequences to ensure they do not exceed a specified maximum length.

The tokenizer can handle both of these operations for you automatically.
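
Before handing the work to the tokenizer, it can help to see the two operations on plain lists of token IDs. The sketch below is a minimal illustration. The function name `pad_and_truncate`, the default `pad_id=0`, and the sample IDs are assumptions for this example, and it pads and truncates on the right only.

```python
def pad_and_truncate(batch_ids, max_length=None, pad_id=0):
    """Return (input_ids, attention_mask) as lists of equal-length rows."""
    # Without an explicit max_length, pad to the longest sequence in the batch.
    if max_length is None:
        target = max(len(ids) for ids in batch_ids)
    else:
        target = max_length

    input_ids, attention_mask = [], []
    for ids in batch_ids:
        kept = ids[:target]            # truncation: drop tokens from the end
        padding = target - len(kept)   # padding: fill on the right
        input_ids.append(kept + [pad_id] * padding)
        attention_mask.append([1] * len(kept) + [0] * padding)
    return input_ids, attention_mask


# Hypothetical token IDs for three sequences of different lengths.
batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
ids, mask = pad_and_truncate(batch, max_length=4)
print(ids)
print(mask)
```

Every row now has the same length, and the attention mask records which positions hold real tokens (1) and which hold padding (0).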

# Hands-On: Padding and Truncation
# A sample batch of texts with different lengths
sample_batch = [
    "Shift 1: verify coolant pressure before restart.",
    "Alert: axis-3 vibration exceeded 9 mm/s threshold on CNC-12.",
    "Favor revisar torque 450 Nm en lote 18."
]

# --- Strategy 1: Dynamic Padding ---
# Pad each sequence to the length of the *longest sequence in the current batch*.
# This is efficient for inference, as it minimizes the number of padding tokens.
encoded_dynamic = reloaded_tokenizer(sample_batch, padding=True, return_tensors="pt")

print("--- Dynamic Padding ---")
print("Shape of Input IDs:", encoded_dynamic['input_ids'].shape)
print("Input IDs:\n", encoded_dynamic['input_ids'])
print("Attention Mask:\n", encoded_dynamic['attention_mask'])
print("\nNote: The attention mask is 1 for real tokens and 0 for padding tokens.")

# --- Strategy 2: Static Padding to Max Length ---
# Pad all sequences to a fixed `max_length`. If a sequence is longer, it will be truncated.
# This yields fixed tensor shapes, which helps on hardware or compiled graphs that expect static shapes, at the cost of extra padding.
encoded_static = reloaded_tokenizer(
    sample_batch,
    padding="max_length",
    max_length=24,  # A fixed length for all sequences
    truncation=True,
    return_tensors="pt"
)

print("\n--- Static Padding to Max Length (24) ---")
print("Shape of Input IDs:", encoded_static['input_ids'].shape)
print("Input IDs:\n", encoded_static['input_ids'])

## ✅ Summary and Next Steps

In this notebook, you dove deep into the world of tokenization and took a major step toward building a true domain-specific NLP application. You have learned:

-   **Why standard tokenizers fail** on specialized text and how to diagnose these failures by inspecting the token output.
-   **How to train a custom BPE tokenizer** from scratch on a domain-specific corpus, enabling it to learn and correctly represent your unique vocabulary.
-   **How to measure tokenizer quality** using the Out-of-Vocabulary (OOV) rate, giving you a quantitative way to assess its performance.
-   **How to package and save a custom tokenizer** in the standard Hugging Face format, making it easy to share, version, and load with `AutoTokenizer`.
-   **How to control padding and truncation**, essential techniques for handling batches of text for both training and inference.

You are now equipped with one of the most powerful techniques for adapting language models to new domains.

In the next notebook, we will shift our focus from the data to the model itself and explore the art and science of **Prompt Engineering**. You will learn how to craft effective prompts to control the behavior and output of large language models for a variety of tasks.