tokenization: use BPE instead of chars/4 approximation #280
DeagleGross wants to merge 1 commit into dotnet:main
Conversation
Pull request overview
Updates the skill-validator’s prompt size evaluation to use a real BPE tokenizer (cl100k_base family) instead of the chars/4 heuristic, improving accuracy for code/markdown-heavy skills while still reporting the chars/4 estimate for reference.
Changes:
- Add ML Tokenizers dependencies and compute BPE token counts during skill profiling.
- Switch complexity tier classification and token-size warnings to be based on BPE token count (while retaining chars/4 as an estimate).
- Update the “comprehensive skill” unit test data generation to avoid repeated-character compression artifacts.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| eng/skill-validator/src/Services/SkillProfiler.cs | Adds BPE token counting, uses it for tiering/warnings, and updates profile formatting/output. |
| eng/skill-validator/src/SkillValidator.csproj | Adds tokenizer package references and adjusts RunArguments quoting. |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates the comprehensive-skill test to use varied text suitable for BPE counting. |
```csharp
public sealed record SkillProfile(
    string Name,
    int TokenCount,
    int BpeTokenCount,
    string ComplexityTier, // "compact" | "detailed" | "standard" | "comprehensive"
```
SkillProfile.TokenCount now represents the chars/4 estimate, while BpeTokenCount is the real BPE count used for tiering/warnings. The TokenCount name is now misleading and can easily cause downstream misuse (especially since SkillProfile is public). Consider renaming TokenCount to something explicit like Chars4TokenEstimate (or similar), and updating formatting/output accordingly.
```diff
 {
     warnings.Add(
-        $"Skill is only {tokenCount} tokens — may be too sparse to provide actionable guidance.");
+        $"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
```
The "too sparse" warning formats bpeTokenCount and chars4TokenCount without :N0, unlike the other token warnings in this method. For consistency/readability (and to avoid very large numbers being printed raw), apply the same numeric formatting here.
Suggested change:
```diff
-        $"Skill is only {bpeTokenCount} BPE tokens (chars/4 estimate: {chars4TokenCount}) — may be too sparse to provide actionable guidance.");
+        $"Skill is only {bpeTokenCount:N0} BPE tokens (chars/4 estimate: {chars4TokenCount:N0}) — may be too sparse to provide actionable guidance.");
```
Please submit this change from a non-forked branch. See #282
reopened at #309
Today the evaluator estimates the token size of the prompt by chars/4. I am proposing to use BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges code- and markdown-heavy content: fenced blocks (```csharp) waste characters on syntax that BPE compresses efficiently.

cl100k_base is the right reference: it's the BPE vocabulary used by GPT-4 and closely matches the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with the actual tokenizer gives the true cost.
The PR still reports the chars/4 estimate in the output, but uses BPE for evaluation.

BPE vs chars/4 token estimation — all 25 skills
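As a rough illustration of the approach, here is a minimal C# sketch of counting with cl100k_base while still computing the chars/4 estimate. It assumes the Microsoft.ML.Tokenizers NuGet package (plus its cl100k_base data package); the class and method names reflect my reading of that API, not the PR's exact code.

```csharp
using System;
using Microsoft.ML.Tokenizers;

// Sketch: measure a skill's prompt cost with the real cl100k_base BPE
// vocabulary, while still reporting the old chars/4 heuristic for reference.
// Assumes the Microsoft.ML.Tokenizers package (and its Cl100kBase data).
class TokenCountSketch
{
    static void Main()
    {
        string skillText = "```csharp\nvar items = new List<int>();\n```";

        // Real BPE count using the cl100k_base encoding (GPT-4 family).
        Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");
        int bpeTokenCount = tokenizer.CountTokens(skillText);

        // Old heuristic: every 4 characters ≈ 1 token.
        int chars4TokenCount = skillText.Length / 4;

        Console.WriteLine(
            $"Skill is {bpeTokenCount:N0} BPE tokens (chars/4 estimate: {chars4TokenCount:N0}).");
    }
}
```

For code like the sample above, the two numbers can diverge noticeably, which is exactly the gap the PR targets.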