tokenization: use BPE instead of chars/4 approximation #309
Conversation
Pull request overview
Updates the skill-validator’s skill profiling to use a real BPE tokenizer (cl100k_base via Microsoft.ML.Tokenizers) instead of the previous chars/4 heuristic when classifying skill complexity and generating token-size warnings.
Changes:
- Compute BPE token counts in `SkillProfiler` and use them for complexity tiering and token-size warnings (while still retaining the chars/4 estimate for display).
- Extend `SkillProfile` to carry both the chars/4 and BPE counts, and update the formatted output/warnings accordingly.
- Update the comprehensive-skill unit test to use varied text (to avoid repeated-char tokenization artifacts) and add tokenizer package references.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test content generation to reliably exceed the “comprehensive” threshold under BPE tokenization. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers + cl100k data package; minor RunArguments quoting cleanup. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Implements BPE token counting and switches tiering/warnings/output to use BPE counts. |
Skill Validation Results
[1] Quality improved but weighted score is -8.6% due to: tokens (22963 → 142576), tool calls (2 → 11), time (34.0s → 48.9s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Curious, why?
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
Skill Validation Results
[1] Quality unchanged but weighted score is -8.5% due to: tokens (11164 → 30813), tool calls (0 → 1), time (12.4s → 17.5s)
Model: claude-opus-4.6 | Judge: claude-opus-4.6
Today the evaluator estimates the token size of the prompt by chars/4. I am proposing to use BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges non-prose content: code blocks (e.g. fenced `csharp` sections) waste characters on syntax that BPE compresses efficiently.

cl100k_base is the right reference: it's the BPE vocabulary used by GPT-4, and it closely matches the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with the actual tokenizer gives the true cost.
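
The "both counts" design can be sketched as follows. This is a hedged Python analogue of the C# `SkillProfile` change, not the actual code: the type and function names are illustrative, and the lambda passed in below is a stand-in tokenizer for the demo (a real profiler would count with cl100k_base, e.g. via `Microsoft.ML.Tokenizers` in C# or `tiktoken` in Python).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SkillProfileSketch:
    """Illustrative analogue of the extended profile: carries both the
    chars/4 estimate (kept for display) and the BPE count (used for
    complexity tiering and warnings)."""
    chars4_estimate: int
    bpe_tokens: int


def profile_skill(text: str, bpe_count: Callable[[str], int]) -> SkillProfileSketch:
    """Compute both counts; the caller injects the real BPE tokenizer."""
    return SkillProfileSketch(
        chars4_estimate=len(text) // 4,
        bpe_tokens=bpe_count(text),
    )
```

Carrying both numbers lets the formatted output keep showing the familiar chars/4 figure while the evaluation path switches to the accurate BPE count.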
The PR still reports the chars/4 figure in the output, but uses BPE for evaluation.

BPE vs chars/4 token estimation (all 25 skills)