# 🧩 Week 5-6 · Notebook 04 · Advanced Tokenizers

Tokenization is the bridge between messy shop-floor text and model-friendly inputs. We will benchmark off-the-shelf tokenizers, build a domain-specific vocabulary, and wire the results into downstream RAG & prompt workflows.

## 🎯 Learning Outcomes
- Diagnose how different tokenizers handle manufacturing jargon, units, and multilingual snippets.
- Train a SentencePiece/BPE tokenizer on plant maintenance logs and export artifacts.
- Measure vocabulary coverage and OOV (out-of-vocabulary) risk across data sources.
- Configure padding, truncation, and special tokens for real-time vs. batch inference.
- Package the tokenizer for use with HuggingFace `AutoTokenizer` and vector stores.

## 🏭 Manufacturing Text Fingerprints
| Corpus | Example | Avg Tokens (WordPiece) | Notes |
| --- | --- | --- | --- |
| Shift handover | `Press 12 coolant alarm cleared, watch valve drift.` | 32 | Dense with component IDs |
| Maintenance ticket | `Lathe #4 vibration 12 mm/s despite new bearing.` | 44 | Units + shorthand |
| SOP snippet | `Lockout-tagout for turret press v3.2` | 65 | Mix of legal + procedural text |
| Supplier email (es/en) | `Favor revisar torque 450 Nm en lote 18.` | 54 | Multilingual & informal |

## 🧠 Tokenizer Families at a Glance
| Family | Examples | Strengths | Watch-outs |
| --- | --- | --- | --- |
| Byte-Pair Encoding (BPE) | GPT-2, RoBERTa, LLaMA | Robust on rare words, compact vocab | Sensitive to casing; may split units awkwardly |
| WordPiece | BERT, DistilBERT, ELECTRA | Balanced vocabulary, good for classification | Requires pre-tokenization; emits `[UNK]` for words with unseen characters |
| SentencePiece (BPE/Unigram) | T5, mT5 | Language-agnostic, trains on raw text | Needs proper normalization config |
| Char / Byte | CANINE, ByT5 | Zero OOV | Longer sequences; higher compute |
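
The byte-level trade-off in the last row can be seen without any model. The sketch below is illustrative only: it compares a naive whitespace split with the raw UTF-8 bytes of one supplier-email line, which is roughly the kind of sequence a byte-level model consumes. The variable names are assumptions made for this example.

```python
# Illustrative comparison: whitespace words vs. raw UTF-8 bytes for one log line.
text = "Favor revisar torque 450 Nm en lote 18."

word_tokens = text.split()                 # naive whitespace split, for reference only
byte_tokens = list(text.encode("utf-8"))   # byte-level IDs, always in the range 0..255

# Non-ASCII characters cost more than one byte, so accented text grows further.
accented = "calibración"
accented_bytes = list(accented.encode("utf-8"))

print(len(word_tokens), len(byte_tokens))
print(len(accented), len(accented_bytes))
```

Every possible string maps onto the same 256 byte values, which is why OOV cannot occur, but the sequence is several times longer than a word- or subword-level encoding.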

In [None]:
from transformers import AutoTokenizer
import pandas as pd

terms = [
    'Hydroforming pressure calibration check',
    'OEE dropped to 71% after unplanned downtime',
    'Favor revisar torque 450 Nm en lote 18',
    'Robot axis-3 grease refill overdue per SOP-442',
]

tokenizers = {
    'GPT-2 BPE': AutoTokenizer.from_pretrained('gpt2'),
    'BERT WordPiece': AutoTokenizer.from_pretrained('bert-base-uncased'),
    'Longformer': AutoTokenizer.from_pretrained('allenai/longformer-base-4096'),
}

rows = []
for label, tok in tokenizers.items():
    for text in terms:
        pieces = tok.tokenize(text)
        rows.append({'tokenizer': label, 'text': text, 'token_count': len(pieces), 'tokens': pieces})

pd.DataFrame(rows)

**Observations**
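- GPT-2's byte-level BPE may split units such as `Nm` into several pieces and marks leading spaces with `Ġ`, which inflates token counts; check the `tokens` column for the exact split.
- BERT WordPiece (`bert-base-uncased`) keeps many common units intact but lowercases everything, including machine IDs such as `SOP-442`.
- Longformer reuses RoBERTa's byte-level BPE vocabulary, not BERT's, so its splits should largely match GPT-2's; its value here is the 4,096-token window for SOP ingestion.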
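
To compare tokenizers on more than raw counts, one option is tokens per whitespace-delimited word (sometimes called fertility). The helper below is a minimal sketch; `tokens_per_word` and the stand-in tokenizer `chunk4` are hypothetical names introduced for this example, not part of any library.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words if total_words else 0.0


def chunk4(text: str) -> List[str]:
    """Stand-in tokenizer (hypothetical): cut each word into pieces of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["torque 450 Nm", "axis-3 grease refill"]
print(round(tokens_per_word(chunk4, sample), 3))
```

In the notebook, passing `tok.tokenize` from the comparison loop in place of `chunk4` would give one score per tokenizer; values close to 1.0 mean most words survive as single tokens.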

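## 🛠️ Training a Domain Tokenizer (BPE with HuggingFace `tokenizers`)
We'll train a BPE model with the HuggingFace `tokenizers` library (not the SentencePiece package itself) on a curated sample of maintenance logs. Five sentences are enough to demonstrate the API, but the learned merges will be noisy; in production, feed thousands of tickets to stabilize merge statistics.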
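
To make "merge statistics" concrete, here is a simplified sketch of a single BPE training step: count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair. It illustrates the classic algorithm only and is not the `tokenizers` library's internal implementation; the function names and the toy word counts are assumptions for this example.

```python
from collections import Counter
from typing import Dict, Tuple

WordFreqs = Dict[Tuple[str, ...], int]


def most_frequent_pair(word_freqs: WordFreqs) -> Tuple[str, str]:
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs: Counter = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(word_freqs: WordFreqs, pair: Tuple[str, str]) -> WordFreqs:
    """Rewrite every word so that occurrences of `pair` become one merged symbol."""
    merged: WordFreqs = {}
    for symbols, freq in word_freqs.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy word counts (made up): each word starts as a tuple of characters.
toy = {tuple("torque"): 3, tuple("tool"): 2, tuple("press"): 2}
best = most_frequent_pair(toy)
after_merge = merge_pair(toy, best)
print(best, sorted(after_merge))
```

With only a handful of sentences, many pairs tie or differ by one occurrence, so the merge order is fragile; a larger corpus makes the top pairs clearly separated.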

In [None]:
from pathlib import Path
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

maintenance_logs = [
    'Press 24 hydraulic accumulator leak detected 03:14.',
    'Torque wrench calibration overdue for cell B; schedule before shift 2.',
    'Robot cell 3 axis-2 grease refill triggered due to temperature rise.',
    'Lathe #4 vibration 12 mm/s despite new bearing install.',
    'Favor revisar torque 450 Nm en lote 18 y reportar a calidad.',
]

tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=600,
    min_frequency=1,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '<SHIFT>', '<ALERT>']
)

tokenizer.train_from_iterator(maintenance_logs, trainer=trainer)
artifacts_dir = Path('artifacts/tokenizer')
artifacts_dir.mkdir(parents=True, exist_ok=True)
tokenizer.save(str(artifacts_dir / 'maintenance-bpe.json'))
tokenizer.get_vocab_size()

### Coverage Audit
Measure the share of tokens that become `[UNK]` when using a base vs. custom tokenizer.

In [None]:
import json

baseline_tok = AutoTokenizer.from_pretrained('bert-base-uncased')
custom_tok = Tokenizer.from_file(str(artifacts_dir / 'maintenance-bpe.json'))

def oov_ratio(tokenizer, logs):
    """Return the share of emitted tokens that are the unknown token."""
    total = 0
    oov = 0
    if hasattr(tokenizer, 'tokenize'):
        # transformers tokenizer: compare against its configured unknown token
        for text in logs:
            tokens = tokenizer.tokenize(text)
            total += len(tokens)
            oov += sum(1 for tok in tokens if tok == tokenizer.unk_token)
    else:
        # tokenizers.Tokenizer: compare token IDs against the [UNK] ID
        unk_id = tokenizer.token_to_id('[UNK]')
        for text in logs:
            encoding = tokenizer.encode(text)
            total += len(encoding.ids)
            oov += sum(1 for token_id in encoding.ids if token_id == unk_id)
    return oov / total if total else 0.0

baseline_oov = oov_ratio(baseline_tok, maintenance_logs)
custom_oov = oov_ratio(custom_tok, maintenance_logs)
{'bert_wordpiece_oov': round(baseline_oov, 4), 'custom_bpe_oov': round(custom_oov, 4)}

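Both tokenizers fall back to subword pieces, so `[UNK]` rates are usually low for each; BERT WordPiece emits `[UNK]` mainly for words containing characters outside its vocabulary. Because the custom tokenizer is scored here on its own training logs, its ratio is optimistic. Rerun the audit on held-out tickets, especially bilingual logs and unit notation, and record the metrics in your experiment tracker.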
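
The audit above scores the custom tokenizer on the same logs it was trained on. One way to get a fairer number is to hold out a slice of tickets before training. The helper below is a sketch under that assumption; `split_logs`, the 20% fraction, and the seed are hypothetical choices, not values from this notebook.

```python
import random
from typing import List, Sequence, Tuple


def split_logs(
    logs: Sequence[str], holdout_fraction: float = 0.2, seed: int = 13
) -> Tuple[List[str], List[str]]:
    """Shuffle deterministically and return (train_logs, heldout_logs)."""
    shuffled = list(logs)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one held-out log whenever there is any data at all.
    n_holdout = max(1, int(len(shuffled) * holdout_fraction)) if shuffled else 0
    return shuffled[n_holdout:], shuffled[:n_holdout]


example_logs = [f"ticket {i}" for i in range(10)]
train_logs, heldout_logs = split_logs(example_logs)
print(len(train_logs), len(heldout_logs))
```

Training on `train_logs` and calling `oov_ratio` on `heldout_logs` would then measure how the vocabulary generalizes to unseen tickets.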

## 📦 Packaging for HuggingFace & Vector Stores
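Fast tokenizers in the HuggingFace ecosystem load from a single `tokenizer.json`; separate vocabulary and merges files are only needed by slow (pure-Python) BPE tokenizers and other tools. We export both. Note that `model.save` prefixes the file names, producing `maintenance-bpe-vocab.json` and `maintenance-bpe-merges.txt`.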

In [None]:
# Keep the Whitespace pre-tokenizer used during training: swapping it afterwards
# would split text differently from what the merges were learned on.

# Export the raw BPE vocab and merges files for tools that need them
files = tokenizer.model.save(str(artifacts_dir), 'maintenance-bpe')

# tokenizer_class tells AutoTokenizer to load tokenizer.json through the
# generic fast wrapper; the remaining keys register the special tokens.
hf_config = {
    'tokenizer_class': 'PreTrainedTokenizerFast',
    'unk_token': '[UNK]',
    'pad_token': '[PAD]',
    'cls_token': '[CLS]',
    'sep_token': '[SEP]',
    'mask_token': '[MASK]'
}

with open(artifacts_dir / 'tokenizer.json', 'w', encoding='utf-8') as f:
    f.write(tokenizer.to_str())

with open(artifacts_dir / 'tokenizer_config.json', 'w', encoding='utf-8') as f:
    json.dump(hf_config, f, indent=2)

sorted(str(path) for path in artifacts_dir.iterdir())

Load with `AutoTokenizer.from_pretrained('artifacts/tokenizer')` and reuse in RAG pipelines (Notebook 08) so chunking stays aligned with vector embeddings.

## 🪄 Padding, Truncation & Special Tokens
### Dynamic vs. Static Padding
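- **Dynamic (`padding=True`)**: Pads each batch only to its longest sequence. Ideal for online APIs because it minimizes wasted compute.
- **Static (`padding='max_length'`)**: Pads every sequence to a fixed `max_length`. Useful for batch jobs on GPU/TPU where uniform shapes matter (for example, to avoid recompilation).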
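
The cost difference between the two modes can be estimated on plain lists of token IDs. This is a minimal sketch: `PAD_ID = 0`, the helper names, and the toy IDs are assumptions for illustration, not values taken from a real tokenizer.

```python
from typing import List

PAD_ID = 0  # assumed padding ID for this example


def pad_batch(batch: List[List[int]], target_len: int) -> List[List[int]]:
    """Truncate or right-pad every sequence to exactly target_len tokens."""
    return [ids[:target_len] + [PAD_ID] * max(0, target_len - len(ids)) for ids in batch]


def pad_fraction(padded: List[List[int]]) -> float:
    """Share of positions in the padded batch that hold PAD_ID."""
    total = sum(len(row) for row in padded)
    pads = sum(row.count(PAD_ID) for row in padded)
    return pads / total if total else 0.0


# Toy token IDs for three messages of length 4, 6, and 3.
batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 102], [101, 7, 102]]

dynamic = pad_batch(batch, max(len(ids) for ids in batch))  # pad to longest in batch
static = pad_batch(batch, 24)                               # pad to a fixed length

print(round(pad_fraction(dynamic), 3), round(pad_fraction(static), 3))
```

The padded share is a rough proxy for wasted compute: short shop-floor messages padded to a long fixed length spend most positions on padding.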

### Truncation Strategies
1. **Front (left) truncation**: Drop the oldest tokens and keep the latest sentences for incident response (`truncation_side='left'` on HuggingFace tokenizers).
2. **Sliding window with overlap**: Maintain continuity for SOPs (feeds vector store chunks).
3. **Hierarchy-aware trimming**: Preserve safety warnings before product specs.
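
The first two strategies can be sketched on a plain list of token IDs. The functions below are illustrative helpers with hypothetical names, not HuggingFace APIs; hierarchy-aware trimming depends on document structure and is not covered here.

```python
from typing import List


def truncate_left(ids: List[int], max_len: int) -> List[int]:
    """Front truncation: keep only the last max_len tokens."""
    return ids[-max_len:] if max_len > 0 else []


def sliding_windows(ids: List[int], window: int, overlap: int) -> List[List[int]]:
    """Split ids into windows of `window` tokens that share `overlap` tokens."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    stride = window - overlap
    chunks = []
    for start in range(0, len(ids), stride):
        chunks.append(ids[start:start + window])
        if start + window >= len(ids):
            break  # the last window already reaches the end of the sequence
    return chunks


token_ids = list(range(10))
latest = truncate_left(token_ids, 3)
windows = sliding_windows(token_ids, window=4, overlap=1)
print(latest)
print(windows)
```

Each window repeats the final `overlap` tokens of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is long enough.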

In [None]:
sample_batch = [
    'Shift 1: verify coolant pressure before restart.',
    'Alert: axis-3 vibration exceeded 9 mm/s threshold.',
    'Favor revisar torque 450 Nm en lote 18.'
]

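# Dynamic: pad to the longest sequence in this batch
encoded_dynamic = baseline_tok(sample_batch, padding=True, return_tensors='pt')
# Static: pad to a fixed length; truncation=True keeps longer inputs from
# producing ragged rows that cannot be stacked into one tensor
encoded_static = baseline_tok(sample_batch, padding='max_length', truncation=True, max_length=24, return_tensors='pt')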

encoded_dynamic['input_ids'].shape, encoded_static['input_ids'].shape

## 🔄 Integrating with Embedding Pipelines
- Size chunks with the embedding model's own tokenizer when possible, so token counts match what the model actually sees and chunks are not silently truncated.
- Map chunk IDs to vector store metadata so original tokens can be reconstructed for audit.
- Align chunk size and overlap with the embedding model's maximum sequence length (Notebook 09 dives deeper into embeddings).
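
One way to keep chunks auditable is to store token offsets alongside each chunk ID. The record layout below is a sketch; `ChunkRecord`, its field names, and the `source#index` ID scheme are assumptions for this example rather than a vector store's required schema.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ChunkRecord:
    chunk_id: str        # hypothetical scheme: "<source_doc>#<index>"
    source_doc: str
    token_start: int     # inclusive offset into the document's token sequence
    token_end: int       # exclusive offset
    tokenizer_name: str  # which tokenizer produced the offsets


def build_records(
    source_doc: str, n_tokens: int, window: int, overlap: int, tokenizer_name: str
) -> List[ChunkRecord]:
    """Describe overlapping chunks of a document by their token offsets."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("require window > 0 and 0 <= overlap < window")
    records = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        chunk_id = f"{source_doc}#{len(records)}"
        records.append(ChunkRecord(chunk_id, source_doc, start, end, tokenizer_name))
        if end == n_tokens:
            break
        start += window - overlap
    return records


records = build_records("SOP-442", n_tokens=10, window=4, overlap=1,
                        tokenizer_name="maintenance-bpe")
metadata = [asdict(record) for record in records]  # plain dicts for a metadata field
print(metadata[1])
```

Storing the tokenizer name with the offsets matters because the same offsets point at different text if the tokenizer is later retrained.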

## 🧪 Lab · Tokenizer Bake-Off
1. Export the custom tokenizer artifacts and push to an internal Git repository.
2. Evaluate on 500 historical tickets: measure token count reduction vs. baseline.
3. Compute API latency savings given shorter sequences (reuse Notebook 02 pipelines).
4. Present a decision memo recommending tokenizer + padding strategy for production rollout.
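
For step 2, the token count reduction can be computed from per-ticket counts produced by each tokenizer. The function name and the sample counts below are made up for illustration; real counts would come from the 500 historical tickets.

```python
from typing import Sequence


def token_reduction(baseline_counts: Sequence[int], custom_counts: Sequence[int]) -> float:
    """Fractional reduction in total tokens; positive means the custom tokenizer is shorter."""
    if len(baseline_counts) != len(custom_counts):
        raise ValueError("expected one count per ticket from each tokenizer")
    baseline_total = sum(baseline_counts)
    if baseline_total == 0:
        return 0.0
    return (baseline_total - sum(custom_counts)) / baseline_total


# Made-up counts for three tickets, purely for illustration.
baseline = [20, 30, 50]
custom = [15, 25, 40]
reduction = token_reduction(baseline, custom)
print(f"{reduction:.1%}")
```

A negative result is possible and worth reporting: a small custom vocabulary can produce longer sequences than a large pretrained one.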

## ✅ Implementation Checklist
- [ ] Tokenizer comparison table logged with metrics
- [ ] Custom tokenizer artifacts versioned in source control
- [ ] OOV ratio benchmarked across top corpora
- [ ] Padding & truncation strategy documented with latency trade-offs
- [ ] Integration test covering `AutoTokenizer.from_pretrained('artifacts/tokenizer')`

## 📚 References
- HuggingFace Tokenizers documentation
- *Efficient Tokenization for Industrial NLP*, ABB Research (2024)
- SentencePiece project: https://github.com/google/sentencepiece
- Notebook 09 · Vector Embeddings (for downstream chunking)