Add tokenizer training pipeline and extract src/tokenizer.py by ncylich · Pull Request #6 · cactus-compute/needle

ncylich · 2026-03-09T09:02:54Z

Summary

Extract src/tokenizer.py from src/data.py: all tokenizer code (constants, NeedleTokenizer, pre_tokenize(), train_tokenizer(), get_tokenizer(), SP config) now lives in its own module. data.py re-exports everything so existing callers are unaffected.
Add src/tokenizer_train.py: standalone GCS corpus pipeline with prepare, train, and validate subcommands for running on a GCP VM.
pre_tokenize() isolates ( ) { } [ ] " , with spaces so BPE never merges them into multi-char tokens.
user_defined_symbols hardcodes <tool_call>=4, <transcribe>=5, and all 8 structural chars at fixed IDs 6–13.
Trained tokenizer results: 8192 vocab, 4.63 chars/token compression on multilingual corpus, 0 isolated-char violations. Model/vocab files are gitignored (hosted on GCS).
Gitignore docs/tokenization_plan.md (internal design doc).
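The isolation step described for pre_tokenize() can be illustrated with a minimal sketch. This is not the code from src/tokenizer.py: the name pre_tokenize_sketch, the regex approach, and the collapsing of repeated spaces are assumptions made for illustration only.

```python
import re

# The eight structural characters named in the summary.
STRUCTURAL_CHARS = '(){}[]",'

# Character class matching any single structural character.
_STRUCTURAL_RE = re.compile("[" + re.escape(STRUCTURAL_CHARS) + "]")


def pre_tokenize_sketch(text: str) -> str:
    """Surround each structural character with spaces.

    Assumption: runs of spaces are collapsed afterwards; the real
    pre_tokenize() may treat whitespace differently.
    """
    spaced = _STRUCTURAL_RE.sub(lambda m: f" {m.group(0)} ", text)
    return re.sub(r" {2,}", " ", spaced).strip()


if __name__ == "__main__":
    print(pre_tokenize_sketch('get_weather(city="Paris")'))
```

With every structural character separated from its neighbours by a space, a BPE trainer that does not merge across whitespace has no opportunity to form pieces such as `":` or `({`.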

Test plan

All imports resolve: src.tokenizer, src.data re-exports, src.tokenizer_train
pre_tokenize() correctly isolates all 8 chars
Full pipeline ran on e2-standard-4 VM: prepare (3.3M examples, 5.3GB corpus) → train (8192 BPE) → validate
Isolated char constraint: 0 violations across all 8192 tokens
Compression ratio: 4.63 chars/token (corpus), 4.76 (English)
Special token IDs match constants (PAD=0, EOS=1, BOS=2, UNK=3, tool_call=4, transcribe=5)
Verify needle train --toy still works end-to-end with new tokenizer
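The two numeric checks in the test plan (isolated-char violations and chars/token) could be computed roughly as follows. This is a sketch under stated assumptions, not the validate subcommand itself: the function names are hypothetical, a "violation" is assumed to mean any vocabulary entry longer than one character that contains a structural character, and `encode` stands in for whatever callable turns text into a token sequence.

```python
from typing import Callable, Iterable, List, Sequence

STRUCTURAL_CHARS = set('(){}[]",')


def isolated_char_violations(vocab: Iterable[str]) -> List[str]:
    """Return multi-character tokens that contain a structural character.

    Assumption: any token longer than one character counts, including one
    that only adds a word-boundary marker in front of the character.
    """
    return [
        token
        for token in vocab
        if len(token) > 1 and any(ch in STRUCTURAL_CHARS for ch in token)
    ]


def chars_per_token(texts: Sequence[str], encode: Callable[[str], Sequence]) -> float:
    """Compression ratio: total characters divided by total tokens."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens if total_tokens else 0.0
```

A run that reports zero violations corresponds to `isolated_char_violations(vocab)` returning an empty list over all 8192 entries.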

- Extract all tokenizer code from data.py into src/tokenizer.py: token ID constants, NeedleTokenizer, pre_tokenize(), train_tokenizer(), get_tokenizer(), and SP training config (_SP_TRAIN_KWARGS)
- Re-export everything from data.py so existing callers are unaffected
- Add src/tokenizer_train.py: standalone GCS corpus pipeline with prepare/train/validate subcommands
- pre_tokenize() isolates ( ) { } [ ] " , so BPE never merges them
- user_defined_symbols hardcodes tool_call=4, transcribe=5, and all 8 structural chars at fixed IDs 6-13
- Trained tokenizer: 8192 vocab, 4.63 chars/token compression, 0 isolated-char violations (model/vocab gitignored, on GCS)
- Add docs/tokenization_plan.md with full design rationale

- Extract all tokenizer code from data.py into src/tokenizer.py: token ID constants, NeedleTokenizer, pre_tokenize(), train_tokenizer(), get_tokenizer(), and SP training config (_SP_TRAIN_KWARGS)
- Re-export everything from data.py so existing callers are unaffected
- Add src/tokenizer_train.py: standalone GCS corpus pipeline with prepare/train/validate subcommands
- pre_tokenize() isolates ( ) { } [ ] " , so BPE never merges them
- user_defined_symbols hardcodes tool_call=4, transcribe=5, and all 8 structural chars at fixed IDs 6-13
- Trained tokenizer: 8192 vocab, 4.63 chars/token compression, 0 isolated-char violations (model/vocab gitignored, on GCS)
- Gitignore docs/tokenization_plan.md
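The fixed ID layout from the commit messages can be written out as a small table. This is a sketch rather than the contents of _SP_TRAIN_KWARGS: the order of the eight structural characters within IDs 6-13 is not stated in the PR and is assumed here, and the helper expected_symbol_ids is hypothetical. It also assumes that user-defined symbols receive consecutive IDs, in list order, immediately after the four reserved IDs.

```python
# Reserved IDs as listed in the test plan.
PAD_ID, EOS_ID, BOS_ID, UNK_ID = 0, 1, 2, 3
TOOL_CALL_ID, TRANSCRIBE_ID = 4, 5

# Assumed order: the PR states only that the 8 chars occupy IDs 6-13.
USER_DEFINED_SYMBOLS = [
    "<tool_call>", "<transcribe>",
    "(", ")", "{", "}", "[", "]", '"', ",",
]


def expected_symbol_ids(symbols, first_id=TOOL_CALL_ID):
    """Map each user-defined symbol to its expected fixed ID.

    Assumes IDs are handed out consecutively in list order, starting
    right after the reserved PAD/EOS/BOS/UNK block.
    """
    return {sym: first_id + i for i, sym in enumerate(symbols)}


if __name__ == "__main__":
    for sym, idx in expected_symbol_ids(USER_DEFINED_SYMBOLS).items():
        print(idx, sym)
```

Comparing such a table against the trained model's piece-to-ID mapping is one way to confirm that the special token IDs match the constants.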

ncylich added 2 commits March 9, 2026 02:02

ncylich closed this Mar 10, 2026

bs258q mentioned this pull request May 13, 2026

Feat/tool descriptions and confidence, MCP server #21

Open

ncylich wants to merge 2 commits into main from tokenizer-pipeline
