
tokenization: use BPE instead of chars/4 approximation #280

Closed
DeagleGross wants to merge 1 commit into dotnet:main from DeagleGross:dmkorolev/bpe

Conversation

@DeagleGross
Member

Today the evaluator estimates the token size of the prompt as chars/4. I propose using BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges:

  • Code-heavy skills: identifiers like MSBuildProjectFullPath are 1-2 tokens but 22 characters. chars/4 says 6 tokens — a 3-4x overcount.
  • Markdown structure: headers, bullets, and fenced code blocks (```csharp fences) spend characters on syntax that BPE compresses efficiently.
  • Repeated patterns: BPE exploits common subword patterns (e.g. dotnet, build, --configuration) that the chars/4 heuristic cannot.
  • Whitespace-heavy formatting: indentation in code blocks inflates char count but BPE merges whitespace runs into single tokens.

cl100k_base is the right reference: it's the BPE vocabulary used by GPT-4 and closely matches the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with the actual tokenizer gives the true cost.

The PR still reports the chars/4 figure in the output, but uses BPE for the evaluation itself.
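A minimal sketch of the two counting approaches, assuming Python's third-party `tiktoken` package for the cl100k_base vocabulary (the PR itself uses Microsoft.ML.Tokenizers in C#, so this is an illustration, not the PR's code):

```python
import math

def chars4_estimate(text: str) -> int:
    """The old heuristic: roughly one token per 4 characters (rounded up)."""
    return math.ceil(len(text) / 4)

def bpe_count(text: str) -> int:
    """Count tokens with the cl100k_base BPE vocabulary (requires tiktoken)."""
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

if __name__ == "__main__":
    sample = "MSBuildProjectFullPath"
    print("chars/4:", chars4_estimate(sample))  # 6 for this 22-char identifier
    try:
        print("BPE:", bpe_count(sample))  # a real tokenizer merges the subwords
    except ImportError:
        print("tiktoken not installed; BPE count skipped")
```

The gap between the two numbers is exactly what the table below measures across all skills.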

BPE vs chars/4 token estimation — all 25 skills

| Skill | Chars | Chars/4 | BPE | Diff% |
|---|---:|---:|---:|---:|
| optimizing-ef-core-queries | 5,744 | 1,436 | 1,335 | +7.6% |
| analyzing-dotnet-performance | 11,310 | 2,828 | 2,561 | +10.4% |
| android-tombstone-symbolication | 8,189 | 2,048 | 2,109 | -2.9% |
| clr-activation-debugging | 20,140 | 5,035 | 4,976 | +1.2% |
| dotnet-trace-collect | 22,621 | 5,656 | 5,119 | +10.5% |
| dump-collect | 4,267 | 1,067 | 1,069 | -0.2% |
| microbenchmarking | 13,155 | 3,289 | 2,674 | +23.0% |
| binlog-failure-analysis | 3,761 | 941 | 972 | -3.2% |
| binlog-generation | 2,853 | 714 | 716 | -0.3% |
| build-parallelism | 3,638 | 910 | 820 | +11.0% |
| build-perf-baseline | 12,357 | 3,090 | 2,842 | +8.7% |
| build-perf-diagnostics | 6,595 | 1,649 | 1,560 | +5.7% |
| check-bin-obj-clash | 16,259 | 4,065 | 3,600 | +12.9% |
| directory-build-organization | 9,418 | 2,355 | 2,102 | +12.0% |
| eval-performance | 4,336 | 1,084 | 949 | +14.2% |
| including-generated-files | 6,843 | 1,711 | 1,447 | +18.2% |
| incremental-build | 13,442 | 3,361 | 2,958 | +13.6% |
| msbuild-antipatterns | 14,241 | 3,561 | 3,456 | +3.0% |
| msbuild-modernization | 16,712 | 4,178 | 4,148 | +0.7% |
| dotnet-aot-compat | 16,752 | 4,188 | 3,901 | +7.4% |
| migrate-nullable-references | 35,735 | 8,934 | 7,835 | +14.0% |
| thread-abort-migration | 12,792 | 3,198 | 2,772 | +15.4% |
| csharp-scripts | 5,183 | 1,296 | 1,239 | +4.6% |
| dotnet-pinvoke | 18,866 | 4,717 | 4,521 | +4.3% |
| nuget-trusted-publishing | 9,254 | 2,314 | 2,215 | +4.5% |

Diff% = how much chars/4 overestimates (+) or underestimates (−) relative to BPE.
chars/4 overestimates for 21/25 skills (up to +23%); the mean signed difference is ≈ +7.9%.
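The Diff% column can be reproduced directly from the Chars/4 and BPE columns:

```python
# Diff% = how far the chars/4 estimate deviates from the real BPE count,
# expressed relative to BPE. Sample rows taken from the table above.
rows = {
    "optimizing-ef-core-queries": (1436, 1335),       # overestimate
    "microbenchmarking": (3289, 2674),                # the largest overestimate
    "android-tombstone-symbolication": (2048, 2109),  # an underestimate
}

def diff_pct(chars4: int, bpe: int) -> float:
    return (chars4 - bpe) / bpe * 100

for name, (c4, bpe) in rows.items():
    print(f"{name}: {diff_pct(c4, bpe):+.1f}%")
```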

Contributor

Copilot AI left a comment


Pull request overview

Updates the skill-validator’s prompt size evaluation to use a real BPE tokenizer (cl100k_base family) instead of the chars/4 heuristic, improving accuracy for code/markdown-heavy skills while still reporting the chars/4 estimate for reference.

Changes:

  • Add ML Tokenizers dependencies and compute BPE token counts during skill profiling.
  • Switch complexity tier classification and token-size warnings to be based on BPE token count (while retaining chars/4 as an estimate).
  • Update the “comprehensive skill” unit test data generation to avoid repeated-character compression artifacts.
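The repeated-character artifact mentioned in the last change can be sketched as follows. A test string padded with one repeated character is merged into very few BPE tokens, so char-based and BPE counts diverge wildly; varied, word-like text behaves more like real skill content. The `varied_text` helper below is a hypothetical illustration, not the PR's actual test code:

```python
import itertools

def varied_text(n_chars: int) -> str:
    """Build up to n_chars of varied word-like text (hypothetical helper,
    similar in spirit to the PR's test-data change)."""
    words = itertools.cycle(["build", "dotnet", "skill", "token", "prompt"])
    out, total = [], 0
    for w in words:
        if total + len(w) + 1 > n_chars:
            break
        out.append(w)
        total += len(w) + 1
    return " ".join(out)

repeated = "a" * 100        # BPE collapses this run into a handful of tokens
varied = varied_text(100)   # tokenizes much more like real prose
print(len(repeated), len(varied))
```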

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE token counting, uses it for tiering/warnings, and updates profile formatting/output. |
| eng/skill-validator/src/SkillValidator.csproj | Adds tokenizer package references and adjusts RunArguments quoting. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates the comprehensive-skill test to use varied text suitable for BPE counting. |


Comment on lines 7 to 11:

```csharp
public sealed record SkillProfile(
    string Name,
    int TokenCount,
    int BpeTokenCount,
    string ComplexityTier, // "compact" | "detailed" | "standard" | "comprehensive"
```

Copilot AI Mar 8, 2026


SkillProfile.TokenCount now represents the chars/4 estimate, while BpeTokenCount is the real BPE count used for tiering/warnings. The TokenCount name is now misleading and can easily cause downstream misuse (especially since SkillProfile is public). Consider renaming TokenCount to something explicit like Chars4TokenEstimate (or similar), and updating formatting/output accordingly.

```diff
 {
     warnings.Add(
-        $"Skill is only {tokenCount} tokens — may be too sparse to provide actionable guidance.");
+        $"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
```

Copilot AI Mar 8, 2026


The "too sparse" warning formats bpeTokenCount and chars4TokenCount without :N0, unlike the other token warnings in this method. For consistency/readability (and to avoid very large numbers being printed raw), apply the same numeric formatting here.

Suggested change:

```diff
-$"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
+$"Skill is only {bpeTokenCount:N0} BPE tokens (chars/4 estimate: {chars4TokenCount:N0}) — may be too sparse to provide actionable guidance.");
```

@ViktorHofer
Member

Please submit this change from a non-forked branch. See #282

@DeagleGross
Member Author

reopened at #309

