Add tokenizer training pipeline and extract src/tokenizer.py#6

Closed
ncylich wants to merge 2 commits into main from tokenizer-pipeline



ncylich commented Mar 9, 2026

Summary

  • Extract src/tokenizer.py from src/data.py: all tokenizer code (constants, NeedleTokenizer, pre_tokenize(), train_tokenizer(), get_tokenizer(), SP config) now lives in its own module. data.py re-exports everything so existing callers are unaffected.
  • Add src/tokenizer_train.py: standalone GCS corpus pipeline with prepare, train, and validate subcommands for running on a GCP VM.
  • pre_tokenize() surrounds each of the 8 structural characters ( ) { } [ ] " , with spaces so BPE never merges them into multi-character tokens.
  • user_defined_symbols hardcodes <tool_call>=4, <transcribe>=5, and all 8 structural chars at fixed IDs 6–13.
  • Trained tokenizer results: 8192 vocab, 4.63 chars/token compression on multilingual corpus, 0 isolated-char violations. Model/vocab files are gitignored (hosted on GCS).
  • Gitignore docs/tokenization_plan.md (internal design doc).
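
For reviewers, a minimal sketch of the isolation step described above. This is not the code in src/tokenizer.py: the name pre_tokenize_sketch, the regex approach, and the collapsing of repeated spaces are all assumptions made for illustration.

```python
import re

# The eight structural characters named in this PR.
STRUCTURAL_CHARS = '(){}[]",'

# Character class matching any single structural character.
_STRUCTURAL_RE = re.compile("([" + re.escape(STRUCTURAL_CHARS) + "])")


def pre_tokenize_sketch(text: str) -> str:
    """Hypothetical stand-in for pre_tokenize().

    Pads every structural character with a space on each side, then
    collapses runs of spaces (the collapsing step is an assumption).
    """
    spaced = _STRUCTURAL_RE.sub(r" \1 ", text)
    return re.sub(r" {2,}", " ", spaced).strip()


if __name__ == "__main__":
    print(pre_tokenize_sketch('f(a,b)'))
    print(pre_tokenize_sketch('{"k":[1]}'))
```

Because every structural character ends up delimited by spaces, a BPE trainer that does not merge across whitespace never sees one adjacent to another symbol, which is what keeps each of them a single-character token.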

Test plan

  • All imports resolve: src.tokenizer, src.data re-exports, src.tokenizer_train
  • pre_tokenize() correctly isolates all 8 chars
  • Full pipeline ran on e2-standard-4 VM: prepare (3.3M examples, 5.3GB corpus) → train (8192 BPE) → validate
  • Isolated char constraint: 0 violations across all 8192 tokens
  • Compression ratio: 4.63 chars/token (corpus), 4.76 (English)
  • Special token IDs match constants (PAD=0, EOS=1, BOS=2, UNK=3, tool_call=4, transcribe=5)
  • Verify needle train --toy still works end-to-end with new tokenizer
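
The two numeric checks in the test plan (isolated-char violations and chars/token) could be reproduced roughly as follows. This is a sketch, not the validate subcommand itself: find_violations, chars_per_token, and the exact definition of a violation (a piece longer than one character that contains a structural character, ignoring SentencePiece's "▁" marker) are assumptions.

```python
from typing import Callable, Iterable, List, Sequence

STRUCTURAL_CHARS = set('(){}[]",')
WORD_BOUNDARY = "\u2581"  # the "▁" whitespace marker used in SentencePiece pieces


def find_violations(vocab: Iterable[str]) -> List[str]:
    """Return pieces that merge a structural character with anything else.

    Assumed definition: after dropping the word-boundary marker, a piece
    violates the constraint if it is longer than one character and still
    contains a structural character.
    """
    violations = []
    for piece in vocab:
        core = piece.replace(WORD_BOUNDARY, "")
        if len(core) > 1 and any(ch in STRUCTURAL_CHARS for ch in core):
            violations.append(piece)
    return violations


def chars_per_token(texts: Sequence[str], encode: Callable[[str], Sequence]) -> float:
    """Compression ratio: total characters divided by total tokens."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens if total_tokens else 0.0


if __name__ == "__main__":
    # Toy vocabulary and a whitespace "tokenizer" stand in for the real model.
    print(find_violations(["\u2581hello", "(", '",', "<tool_call>"]))
    print(chars_per_token(["ab cd", "efg"], str.split))
```

With the trained model, encode would be the tokenizer's own encode function and vocab the list of all 8192 pieces; the reported results correspond to an empty violation list and a ratio of 4.63 on the corpus.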

ncylich added 2 commits March 9, 2026 02:02
- Extract all tokenizer code from data.py into src/tokenizer.py:
  token ID constants, NeedleTokenizer, pre_tokenize(), train_tokenizer(),
  get_tokenizer(), and SP training config (_SP_TRAIN_KWARGS)
- Re-export everything from data.py so existing callers are unaffected
- Add src/tokenizer_train.py: standalone GCS corpus pipeline with
  prepare/train/validate subcommands
- pre_tokenize() isolates ( ) { } [ ] " , so BPE never merges them
- user_defined_symbols hardcodes tool_call=4, transcribe=5, and all
  8 structural chars at fixed IDs 6-13
- Trained tokenizer: 8192 vocab, 4.63 chars/token compression,
  0 isolated-char violations (model/vocab gitignored, on GCS)
- Add docs/tokenization_plan.md with full design rationale
- Gitignore docs/tokenization_plan.md
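
The fixed-ID layout from the commit message can be written out as a small table. The sketch below does not reproduce _SP_TRAIN_KWARGS: the dict name SP_KWARGS_SKETCH, the helper expected_ids, and the order of the structural characters within IDs 6-13 are assumptions. It relies on the understanding that SentencePiece gives user_defined_symbols the IDs immediately after the reserved pad/eos/bos/unk IDs, in list order.

```python
# Reserved IDs as listed in the test plan.
PAD_ID, EOS_ID, BOS_ID, UNK_ID = 0, 1, 2, 3

# Task tokens first, then the eight structural characters.
# The order of the structural characters is an assumption.
USER_DEFINED_SYMBOLS = ["<tool_call>", "<transcribe>"] + list('(){}[]",')


def expected_ids(symbols, first_id=4):
    """Map each user-defined symbol to the ID it should receive."""
    return {sym: first_id + i for i, sym in enumerate(symbols)}


# Hypothetical trainer options; the option names are SentencePiece trainer
# options, but the real _SP_TRAIN_KWARGS may differ.
SP_KWARGS_SKETCH = dict(
    vocab_size=8192,
    model_type="bpe",
    pad_id=PAD_ID,
    eos_id=EOS_ID,
    bos_id=BOS_ID,
    unk_id=UNK_ID,
    user_defined_symbols=USER_DEFINED_SYMBOLS,
)

if __name__ == "__main__":
    for sym, idx in expected_ids(USER_DEFINED_SYMBOLS).items():
        print(idx, sym)
```

A check like this can be compared against the trained model's piece-to-ID lookup to confirm that the constants in src/tokenizer.py still match the vocabulary.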