This project provides a unified experimental framework for comparing three tokenizer strategies (byte-level, HuggingFace BPE, and a custom BPE implementation) across multiple languages (English and Chinese). It trains a small GPT-style model on streaming Wikipedia data and evaluates tokenization compression ratio, perplexity, bits per character (BPC), and inference speed.
- Byte-level tokenizer (256 fixed vocab)
- HuggingFace BPE tokenizer (trainable, with special tokens)
- Custom GPT-2–style BPE tokenizer (full merge-learning implementation)
- Unified GPT training loop for fair comparison
- Compression ratio metrics (tokens per character)
- Bits-per-character (BPC) evaluation
- Inference speed measurement across tokenizers
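
As noted above, training text is drawn from Wikipedia in streaming mode. Below is a minimal sketch of such a stream, assuming the HuggingFace `datasets` library and the `wikimedia/wikipedia` dataset; the helper name, snapshot string, and field access are illustrative, not necessarily what `experiment.py` does:

```python
from datasets import load_dataset

def wikipedia_stream(lang="en", snapshot="20231101"):
    # Stream articles without downloading the full dump; lang may be
    # "en" or "zh" for the English/Chinese experiments (illustrative only)
    ds = load_dataset("wikimedia/wikipedia", f"{snapshot}.{lang}",
                      split="train", streaming=True)
    for article in ds:
        yield article["text"]
```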
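
The byte-level and HuggingFace BPE strategies listed above can be sketched roughly as follows; the function names, vocabulary size, and special tokens are assumptions for illustration and may differ from the repository's actual interfaces:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def byte_encode(text):
    # Byte-level tokenization: each UTF-8 byte is its own token id (fixed vocab of 256)
    return list(text.encode("utf-8"))

def byte_decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")

def train_hf_bpe(texts, vocab_size=8000):
    # Train a HuggingFace BPE tokenizer from an iterator of raw text
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```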
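
The custom GPT-2-style BPE tokenizer learns its merges directly from data. The core pair-counting and merge loop looks roughly like this (a simplified sketch; the actual implementation in this repository may differ in structure and optimizations):

```python
from collections import Counter

def learn_bpe_merges(sequences, num_merges):
    # sequences: iterable of byte-id lists (e.g. from text.encode("utf-8"))
    # Repeatedly merge the most frequent adjacent pair into a new token id
    corpus = [list(seq) for seq in sequences]
    merges, next_id = [], 256
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append((best, next_id))
        # Replace every occurrence of the best pair with the new id
        new_corpus = []
        for seq in corpus:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(next_id)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
        next_id += 1
    return merges
```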
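
The evaluation metrics (compression ratio, perplexity, and BPC) follow standard definitions; here is a minimal sketch with hypothetical helper names:

```python
import math

def compression_ratio(num_tokens, num_chars):
    # Tokens per character: lower means the tokenizer compresses the text better
    return num_tokens / num_chars

def perplexity(total_nll_nats, num_tokens):
    # Token-level perplexity: exp of the mean negative log-likelihood
    return math.exp(total_nll_nats / num_tokens)

def bits_per_char(total_nll_nats, num_chars):
    # Convert the summed NLL from nats to bits and normalize by character
    # count, so models with different tokenizers/vocab sizes stay comparable
    return total_nll_nats / (math.log(2) * num_chars)
```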
Run all experiments:

```bash
python experiment.py --mode all
```

Run a specific tokenizer mode:

```bash
python experiment.py --mode hf_bpe
```

Available modes: `all`, `byte`, `hf_bpe`, `custom_bpe`.

The repository includes:
- main_results.ipynb — visualizations, training/validation curves, BPC comparisons, tokenizer-efficiency plots, and inference-speed analysis generated from the experiment outputs.
Dependencies are listed in requirements.txt.