tokenization: use BPE instead of chars/4 approximation #309
Conversation
Pull request overview
Updates the skill-validator’s skill profiling to use a real BPE tokenizer (cl100k_base via Microsoft.ML.Tokenizers) instead of the previous chars/4 heuristic when classifying skill complexity and generating token-size warnings.
Changes:
- Compute BPE token counts in `SkillProfiler` and use them for complexity tiering and token-size warnings (while still retaining the chars/4 estimate for display).
- Extend `SkillProfile` to carry both the chars/4 and BPE counts, and update the formatted output/warnings accordingly.
- Update the comprehensive-skill unit test to use varied text (to avoid repeated-char tokenization artifacts) and add tokenizer package references.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test content generation to reliably exceed the “comprehensive” threshold under BPE tokenization. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers + cl100k data package; minor RunArguments quoting cleanup. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Implements BPE token counting and switches tiering/warnings/output to use BPE counts. |
Skill Validation Results
[1] Quality improved but weighted score is -8.6% due to: tokens (22963 → 142576), tool calls (2 → 11), time (34.0s → 48.9s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Curious, why?
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Skill Validation Results
[1] Quality unchanged but weighted score is -8.5% due to: tokens (11164 → 30813), tool calls (0 → 1), time (12.4s → 17.5s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Today the evaluator estimates the token size of the prompt by chars/4. I am proposing to use BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges non-prose content: code blocks (e.g. fenced `csharp` sections) waste characters on syntax that BPE compresses efficiently.

cl100k_base is the right reference: it's the BPE vocabulary used by GPT-4, and it closely matches the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with the actual tokenizer gives the true cost.
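
The "both counts" design can be sketched as follows. This is a hedged Python analogue of the C# `SkillProfile` change, not the actual code: the type and function names are illustrative, and the lambda passed in below is a stand-in tokenizer for the demo (a real profiler would count with cl100k_base, e.g. via `Microsoft.ML.Tokenizers` in C# or `tiktoken` in Python).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillProfileSketch:
    """Illustrative analogue of the extended profile: carries both the
    chars/4 estimate (kept for display) and the BPE count (used for
    complexity tiering and warnings)."""
    chars4_estimate: int
    bpe_tokens: int


def profile_skill(text: str, bpe_count: Callable[[str], int]) -> SkillProfileSketch:
    """Compute both counts; the caller injects the real BPE tokenizer."""
    return SkillProfileSketch(
        chars4_estimate=len(text) // 4,
        bpe_tokens=bpe_count(text),
    )
```

Carrying both numbers lets the formatted output keep showing the familiar chars/4 figure while the evaluation path switches to the accurate BPE count.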
The PR still reports the chars/4 figure in the output, but uses BPE for evaluation.

BPE vs chars/4 token estimation (all 25 skills)