tokenization: use BPE instead of chars/4 approximation #280
DeagleGross wants to merge 1 commit into dotnet:main
Conversation
Pull request overview
Updates the skill-validator’s prompt size evaluation to use a real BPE tokenizer (cl100k_base family) instead of the chars/4 heuristic, improving accuracy for code/markdown-heavy skills while still reporting the chars/4 estimate for reference.
Changes:
- Add ML Tokenizers dependencies and compute BPE token counts during skill profiling.
- Switch complexity tier classification and token-size warnings to be based on BPE token count (while retaining chars/4 as an estimate).
- Update the “comprehensive skill” unit test data generation to avoid repeated-character compression artifacts.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE token counting, uses it for tiering/warnings, and updates profile formatting/output. |
| eng/skill-validator/src/SkillValidator.csproj | Adds tokenizer package references and adjusts RunArguments quoting. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates the comprehensive-skill test to use varied text suitable for BPE counting. |
```csharp
public sealed record SkillProfile(
    string Name,
    int TokenCount,
    int BpeTokenCount,
    string ComplexityTier, // "compact" | "detailed" | "standard" | "comprehensive"
```
SkillProfile.TokenCount now represents the chars/4 estimate, while BpeTokenCount is the real BPE count used for tiering/warnings. The TokenCount name is now misleading and can easily cause downstream misuse (especially since SkillProfile is public). Consider renaming TokenCount to something explicit like Chars4TokenEstimate (or similar), and updating formatting/output accordingly.
```diff
 {
     warnings.Add(
-        $"Skill is only {tokenCount} tokens — may be too sparse to provide actionable guidance.");
+        $"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
```
The "too sparse" warning formats bpeTokenCount and chars4TokenCount without :N0, unlike the other token warnings in this method. For consistency/readability (and to avoid very large numbers being printed raw), apply the same numeric formatting here.
Suggested change:
```diff
-        $"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
+        $"Skill is only {bpeTokenCount:N0} BPE tokens (chars/4 estimate: {chars4TokenCount:N0}) — may be too sparse to provide actionable guidance.");
```
Please submit this change from a non-forked branch. See #282
reopened at #309
Today the evaluator estimates the token size of the prompt by chars/4. I am proposing to use BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges code- and markdown-heavy content: fenced blocks (```csharp) waste characters on syntax that BPE compresses efficiently.

cl100k_base is the right reference: it's the BPE vocabulary used by GPT-4 and closely matches the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with the actual tokenizer gives the true cost.
The PR still reports the chars/4 estimate in the output, but uses BPE for evaluation.

BPE vs chars/4 token estimation — all 25 skills
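As a rough illustration of the approach, here is a minimal C# sketch of counting with cl100k_base while still computing the chars/4 estimate. It assumes the Microsoft.ML.Tokenizers NuGet package (plus its cl100k_base data package); the class and method names reflect my reading of that API, not the PR's exact code.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Sketch: measure a skill's prompt cost with the real cl100k_base BPE
// vocabulary, while still reporting the old chars/4 heuristic for reference.
// Assumes the Microsoft.ML.Tokenizers package (and its Cl100kBase data).
class TokenCountSketch
{
    static void Main()
    {
        string skillText = "```csharp\nvar items = new List<int>();\n```";

        // Real BPE count using the cl100k_base encoding (GPT-4 family).
        Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");
        int bpeTokenCount = tokenizer.CountTokens(skillText);

        // Old heuristic: every 4 characters ≈ 1 token.
        int chars4TokenCount = skillText.Length / 4;

        Console.WriteLine(
            $"Skill is {bpeTokenCount:N0} BPE tokens (chars/4 estimate: {chars4TokenCount:N0}).");
    }
}
```

For code like the sample above, the two numbers can diverge noticeably, which is exactly the gap the PR targets.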