# 🧩 Week 5-6 · Notebook 04 · Advanced Tokenizers

**Module:** LLMs, Prompt Engineering & RAG  
**Project:** Build the Knowledge Core for the Manufacturing Copilot

---

Tokenization is the critical, often overlooked, bridge between raw, messy text from the factory floor and the structured inputs a Large Language Model requires. A good tokenizer understands your domain's language—from part numbers to error codes. A bad one will shred important terms into meaningless pieces, crippling your model's performance.

In this notebook, we will train our own **domain-specific tokenizer** on manufacturing data. This will be a key component of our Manufacturing Copilot, ensuring it understands the unique vocabulary of our factory.

## 🎯 Learning Outcomes

By the end of this notebook, you will be able to:
1. ✅ **Diagnose Tokenizer Issues:** See how different standard tokenizers fail on manufacturing jargon.
2. ✅ **Train a Custom Tokenizer:** Build a BPE tokenizer from scratch using maintenance logs.
3. ✅ **Measure Vocabulary Coverage:** Quantify the improvement of a custom tokenizer by measuring the Out-of-Vocabulary (OOV) rate.
4. ✅ **Package & Save a Tokenizer:** Properly save a trained tokenizer so it can be loaded with `AutoTokenizer`.
5. ✅ **Control Padding & Truncation:** Configure tokenizers for different inference scenarios (e.g., real-time vs. batch).

## 🏭 The Problem: Standard Tokenizers vs. Manufacturing Jargon

Let's see how popular, off-the-shelf tokenizers handle text they've likely never seen before.

from transformers import AutoTokenizer
import pandas as pd

terms = [
    'Hydroforming pressure calibration check for part #HFP-2024A.',
    'OEE dropped to 71% after unplanned downtime on CNC-12.',
    'Favor revisar torque 450 Nm en lote 18. (Spanish)',
    'Robot axis-3 grease refill overdue per SOP-442-V3.'
]

# Note: You may need to request access or log in for meta-llama/Meta-Llama-3-8B
tokenizers_to_test = {
    'GPT-2 (BPE)': 'gpt2',
    'BERT (WordPiece)': 'bert-base-uncased',
    # 'Llama-3 (BPE)': 'meta-llama/Meta-Llama-3-8B' # Uncomment if you have access
}

rows = []
for label, model_name in tokenizers_to_test.items():
    try:
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        for text in terms:
            tokens = tokenizer.tokenize(text)
            rows.append({'Tokenizer': label, 'Text': text, 'Token Count': len(tokens), 'Tokens': ' '.join(tokens)})
    except Exception as e:
        print(f"Could not load tokenizer {model_name}. Error: {e}")


df = pd.DataFrame(rows)
df

### **Observations & Diagnosis**

1.  **Fragmented Terms:** Notice how `HFP-2024A` is split into many pieces by all tokenizers (e.g., `H`, `FP`, `-`, `2024`, `A`). The model loses the concept of this being a single part number.
2.  **Units:** GPT-2 splits `450` and `Nm` into `['450', 'N', 'm']`. The model might not understand `Nm` as a unit of torque.
3.  **Inconsistency:** `CNC-12` is handled differently by each tokenizer. This inconsistency makes it hard for a model to learn what `CNC-12` refers to.

**Conclusion:** We need a tokenizer that learns *our* vocabulary. Let's build one.
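
One way to put a number on the fragmentation described above is tokens per whitespace-delimited word. The sketch below is illustrative only: `fragmentation_report` and the stand-in `toy_tokenize` are hypothetical helpers, not part of `transformers`. In the notebook you would pass `tokenizer.tokenize` from any loaded tokenizer instead of the stand-in.

```python
import re
from typing import Callable, Dict, List


def fragmentation_report(tokenize: Callable[[str], List[str]], text: str) -> Dict[str, int]:
    """Map each whitespace-delimited word to the number of tokens it becomes."""
    return {word: len(tokenize(word)) for word in text.split()}


def toy_tokenize(word: str) -> List[str]:
    """Stand-in tokenizer (assumption): letter runs, digit runs, and single symbols."""
    return re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d\s]", word)


report = fragmentation_report(toy_tokenize, "OEE dropped on CNC-12 for part #HFP-2024A")
worst = max(report, key=report.get)
print(report)
print(f"Most fragmented term: {worst} ({report[worst]} tokens)")
```

Words that map to one token are intact; identifiers that map to three or more are the ones a domain tokenizer should target.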

## 🛠️ Training a Domain-Specific Tokenizer

We will use the `tokenizers` library to train a Byte-Pair Encoding (BPE) tokenizer on a small sample of maintenance logs. In a real project, you would use thousands of documents.
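
Before handing the work to the library, it may help to see what BPE training does in principle: start from single characters and repeatedly fuse the most frequent adjacent pair. The following is a toy pure-Python sketch under simplifying assumptions (no special tokens, no vocabulary-size limit, ties broken by first occurrence); `learn_bpe_merges` is a hypothetical name and this is not how `tokenizers` is implemented internally.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    corpus: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts: Counter = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = learn_bpe_merges(["CNC-12", "CNC-12", "CNC-14", "CNC"], num_merges=3)
print(merges)
```

Because `CNC` occurs in every word, its characters are fused first; frequent domain terms become single tokens for the same reason.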

from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import WhitespaceSplit

# Step 1: Prepare our training data (a list of strings)
maintenance_logs = [
    'Press-24 hydraulic accumulator leak detected at 03:14.',
    'Torque wrench calibration overdue for cell B; schedule before shift 2.',
    'Robot cell 3 axis-2 grease refill triggered due to high temperature.',
    'Lathe #4 vibration at 12.5 mm/s despite new SKF-6205-2Z bearing.',
    'Favor revisar torque 450 Nm en lote 18 y reportar a calidad.',
    'OEE for CNC-12 dropped to 71%. Root cause: spindle overheating.',
    'Part #HFP-2024A failed quality inspection due to surface defects.'
]

# Step 2: Configure the Tokenizer
# WhitespaceSplit splits on whitespace only, so hyphenated IDs such as CNC-12 stay in one
# pre-token and BPE can merge them into a single token. The `Whitespace` pre-tokenizer also
# splits at punctuation (CNC, -, 12), and BPE never merges across pre-token boundaries.
# Trade-off: trailing punctuation stays attached to words (e.g. 'defects.').
custom_tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
custom_tokenizer.pre_tokenizer = WhitespaceSplit()

# Step 3: Configure the Trainer
trainer = BpeTrainer(
    vocab_size=1000,  # A larger vocab size can capture more specific terms
    min_frequency=1,  # Merge symbol pairs that occur at least once
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
)

# Step 4: Train the tokenizer
custom_tokenizer.train_from_iterator(maintenance_logs, trainer=trainer)

# Step 5: Save the tokenizer file
artifacts_dir = Path('artifacts/custom_tokenizer')
artifacts_dir.mkdir(parents=True, exist_ok=True)
tokenizer_path = str(artifacts_dir / 'maintenance-tokenizer.json')
custom_tokenizer.save(tokenizer_path)

print(f"Custom tokenizer saved to: {tokenizer_path}")
print(f"Vocabulary size: {custom_tokenizer.get_vocab_size()}")

### Testing our new tokenizer

test_sentence = 'New SKF-6205-2Z bearing for CNC-12 shows high vibration.'
encoding = custom_tokenizer.encode(test_sentence)

print(f"Test Sentence: '{test_sentence}'")
print(f"Custom Tokenizer Output: {encoding.tokens}")

# Compare with BERT
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(f"BERT Tokenizer Output:   {bert_tokenizer.tokenize(test_sentence)}")

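**Analysis:** Check whether `SKF-6205-2Z` and `CNC-12` come out as single tokens. BPE never merges across pre-token boundaries, so the result depends on the pre-tokenizer: `WhitespaceSplit` splits only on whitespace and lets the tokenizer learn these identifiers whole from our small dataset, whereas `Whitespace` also splits at punctuation and breaks them at every hyphen. When the identifiers do survive intact, that is a large improvement over BERT's fragments.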
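
Whether an identifier can ever become a single token depends on the pre-tokenizer, because BPE only merges symbols inside one pre-token. The sketch below approximates the two splitting rules in plain Python: the regex `\w+|[^\w\s]+` is the pattern the `tokenizers` documentation gives for `Whitespace`, and `str.split()` stands in for `WhitespaceSplit`. The function names are hypothetical, and neither function calls the library itself.

```python
import re
from typing import List


def whitespace_style(text: str) -> List[str]:
    """Approximate `Whitespace`: runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)


def whitespace_split_style(text: str) -> List[str]:
    """Approximate `WhitespaceSplit`: split on whitespace only."""
    return text.split()


sample = "New SKF-6205-2Z bearing for CNC-12"
print(whitespace_style(sample))        # hyphenated IDs are cut at every hyphen
print(whitespace_split_style(sample))  # hyphenated IDs stay in one piece
```

With the first rule, no amount of training can produce a `CNC-12` token; with the second, it is one merge sequence away.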

## 📊 Coverage Audit: Measuring Out-of-Vocabulary (OOV) Risk

A key metric for a tokenizer's quality is its OOV rate—the percentage of tokens that it doesn't recognize and maps to `[UNK]`. A lower OOV rate is better.

def calculate_oov_rate(tokenizer, text_iterator):
    """Return the fraction of tokens mapped to the unknown token across all texts."""
    total_tokens = 0
    unk_tokens = 0

    # Handle both 'tokenizers' library and 'transformers' library objects
    is_hf_tokenizer = hasattr(tokenizer, 'vocab')
    unk_token_id = tokenizer.unk_token_id if is_hf_tokenizer else tokenizer.token_to_id('[UNK]')

    for text in text_iterator:
        if is_hf_tokenizer:
            # Skip [CLS]/[SEP] so special tokens do not inflate the denominator
            ids = tokenizer(text, add_special_tokens=False)['input_ids']
        else:
            ids = tokenizer.encode(text).ids

        total_tokens += len(ids)
        unk_tokens += ids.count(unk_token_id)

    return (unk_tokens / total_tokens) if total_tokens > 0 else 0.0

# Load our custom tokenizer from the file
from tokenizers import Tokenizer
custom_tok = Tokenizer.from_file(tokenizer_path)

# The baseline tokenizer (BERT)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Calculate OOV rates
custom_oov_rate = calculate_oov_rate(custom_tok, maintenance_logs)
bert_oov_rate = calculate_oov_rate(bert_tokenizer, maintenance_logs)

print(f"OOV Rate (Custom Tokenizer): {custom_oov_rate:.2%}")
print(f"OOV Rate (BERT Tokenizer):   {bert_oov_rate:.2%}")

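Our custom tokenizer has a 0% OOV rate on its training data, which is expected. The real test is on a *held-out* test set of logs it has never seen before. Subword tokenizers such as BERT's WordPiece rarely emit `[UNK]` either, because they fall back to smaller pieces, so read the OOV rate alongside the token counts from the first comparison.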
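
For a BPE model without byte-level fallback, as configured in this notebook, `[UNK]` should only arise from characters that never appeared in the training data. Under that assumption, the share of unseen characters in a held-out set is a cheap proxy for OOV risk. The sketch below is pure Python; `unseen_characters` is a hypothetical helper and the held-out logs are invented for illustration.

```python
from typing import Iterable, Set, Tuple


def unseen_characters(train_texts: Iterable[str], heldout_texts: Iterable[str]) -> Tuple[Set[str], float]:
    """Return held-out characters absent from training and their share of held-out characters."""
    # Whitespace is ignored because the pre-tokenizer removes it before BPE runs.
    alphabet = {ch for text in train_texts for ch in text if not ch.isspace()}
    heldout_chars = [ch for text in heldout_texts for ch in text if not ch.isspace()]
    unseen = [ch for ch in heldout_chars if ch not in alphabet]
    rate = len(unseen) / len(heldout_chars) if heldout_chars else 0.0
    return set(unseen), rate


train_logs = ["OEE for CNC-12 dropped to 71%.", "Lathe #4 vibration at 12.5 mm/s."]
# Hypothetical held-out logs, invented for illustration only.
heldout_logs = ["Spindle temp 85°C on CNC-12.", "Gap measured at 40 µm."]

missing, rate = unseen_characters(train_logs, heldout_logs)
print(f"Unseen characters: {sorted(missing)}")
print(f"Unseen-character rate: {rate:.2%}")
```

Symbols such as `°` and `µ` are typical offenders in factory logs; adding a few representative lines that contain them to the training corpus closes the gap.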

## 📦 Packaging for HuggingFace `AutoTokenizer`

To make our tokenizer easily reusable, we need to save it in a format that `AutoTokenizer.from_pretrained()` understands. This requires the `tokenizer.json` file and a `tokenizer_config.json`.

from transformers import PreTrainedTokenizerFast
import json

# The `tokenizers` library produces a single JSON file. 
# We can wrap this in a `PreTrainedTokenizerFast` object, which is the standard HuggingFace format.

# 1. Load the trained tokenizer
raw_tokenizer = Tokenizer.from_file(tokenizer_path)

# 2. Wrap it in the HuggingFace Fast Tokenizer implementation
# The special tokens must match the ones passed to the trainer.
hf_fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token='[UNK]',
    pad_token='[PAD]',
    cls_token='[CLS]',
    sep_token='[SEP]',
    mask_token='[MASK]',
)

# 3. Save it using the HuggingFace `save_pretrained` method
hf_tokenizer_dir = Path('artifacts/hf_custom_tokenizer')
hf_fast_tokenizer.save_pretrained(str(hf_tokenizer_dir))

print(f"HuggingFace tokenizer saved to: {hf_tokenizer_dir}")
print("Saved files:", sorted(p.name for p in hf_tokenizer_dir.iterdir()))

### Loading and Using the Packaged Tokenizer

Now, anyone on the team can load our custom tokenizer with a single, standard line of code.

# Load the tokenizer from the directory we just saved it to
reloaded_tokenizer = AutoTokenizer.from_pretrained(str(hf_tokenizer_dir))

test_sentence = 'Part #HFP-2024A from CNC-12 needs a new SKF-6205-2Z bearing.'
tokens = reloaded_tokenizer.tokenize(test_sentence)

print(f"Reloaded tokenizer output: {tokens}")

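This packaged tokenizer is now ready to be used in our RAG pipeline (Notebook 08), so that text chunking and any model trained on this vocabulary share the exact same tokens. Keep in mind that a pretrained embedding model must still be fed IDs from the tokenizer it was trained with; a custom vocabulary pays off only with a model trained or adapted on it.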
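
As one possible use in the chunking step, the tokenizer can serve as the yardstick for chunk size. The sketch below greedily packs sentences into chunks under a token budget. `chunk_by_token_budget` and `count_words` are hypothetical helpers, and the whitespace word count is only a stand-in; in the notebook you could pass `lambda s: len(reloaded_tokenizer.tokenize(s))` instead.

```python
from typing import Callable, List


def chunk_by_token_budget(sentences: List[str], count_tokens: Callable[[str], int], max_tokens: int) -> List[str]:
    """Greedily pack sentences into chunks whose summed token counts stay within `max_tokens`."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


def count_words(text: str) -> int:
    """Stand-in counter (assumption): whitespace word count instead of a real tokenizer."""
    return len(text.split())


logs = [
    "Press-24 hydraulic accumulator leak detected.",
    "Torque wrench calibration overdue for cell B.",
    "OEE for CNC-12 dropped to 71%.",
]
chunks = chunk_by_token_budget(logs, count_words, max_tokens=12)
print(chunks)
```

A sentence that alone exceeds the budget still gets its own chunk here; a production splitter would need a rule for that case.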

## 🪄 Padding and Truncation

When processing batches of text, we need all input sequences to be the same length. We use padding and truncation to achieve this.

sample_batch = [
    'Shift 1: verify coolant pressure before restart.',
    'Alert: axis-3 vibration exceeded 9 mm/s threshold on CNC-12.',
    'Favor revisar torque 450 Nm en lote 18.'
]

# Dynamic Padding: Pads each batch to the length of the longest sequence in that batch.
# Ideal for inference APIs where batch composition changes.
encoded_dynamic = reloaded_tokenizer(sample_batch, padding=True, return_tensors='pt')

# Static Padding: Pads all sequences to a fixed `max_length`.
# Useful for training on GPUs where uniform shapes are more efficient.
encoded_static = reloaded_tokenizer(sample_batch, padding='max_length', max_length=24, truncation=True, return_tensors='pt')

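print(f"Dynamic padding shape: {tuple(encoded_dynamic['input_ids'].shape)}")
print(f"Static padding shape:  {tuple(encoded_static['input_ids'].shape)}")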
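
To make the two strategies concrete, here is a minimal pure-Python sketch of what padding and truncation do to a batch of token IDs. `pad_and_truncate` is a hypothetical helper that mimics right-padding with an attention mask; it is not the `transformers` implementation, and the token IDs are made up.

```python
from typing import Dict, List, Optional


def pad_and_truncate(batch: List[List[int]], pad_id: int, max_length: Optional[int] = None) -> Dict[str, List[List[int]]]:
    """Right-pad (and optionally truncate) ID lists; the mask is 1 for real tokens, 0 for padding."""
    # Dynamic padding uses the longest sequence in the batch; static padding uses `max_length`.
    target = max_length if max_length is not None else max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        kept = ids[:target]  # truncation: drop anything beyond the target length
        padding = target - len(kept)
        input_ids.append(kept + [pad_id] * padding)
        attention_mask.append([1] * len(kept) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


toy_batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]  # made-up token IDs
dynamic = pad_and_truncate(toy_batch, pad_id=0)
static = pad_and_truncate(toy_batch, pad_id=0, max_length=4)
print(dynamic["input_ids"])
print(static["attention_mask"])
```

The attention mask is what lets the model ignore padded positions, which is why the tokenizer returns it together with `input_ids`.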