
tokenization: use BPE instead of chars/4 approximation #309

Merged
JanKrivanek merged 6 commits into main from dmkorolev/bpe on Mar 10, 2026

@DeagleGross
Member

Today the evaluator estimates the token size of the prompt as chars/4. I propose using BPE instead.

chars/4 is a rough average that assumes every 4 characters ≈ 1 token. This is only accurate for plain English prose. It systematically misjudges:

  • Code-heavy skills: identifiers like MSBuildProjectFullPath are 1-2 tokens but 22 characters. chars/4 says about 6 tokens, a 3-4x overcount.
  • Markdown structure: headers, bullets, and fenced code blocks (```csharp) spend characters on syntax that BPE compresses efficiently.
  • Repeated patterns: BPE exploits common subword patterns (e.g. dotnet, build, --configuration) that the chars/4 heuristic cannot.
  • Whitespace-heavy formatting: indentation in code blocks inflates char count but BPE merges whitespace runs into single tokens.
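As a rough illustration of why repeated patterns and whitespace runs compress well under BPE, here is a toy character-level BPE in Python. It is only a sketch under stated assumptions: cl100k_base is a byte-level tokenizer with a large pretrained vocabulary, whereas this example learns a handful of merges from its own input. The function names (`learn_bpe`, `apply_merge`, `encode`, `chars_over_4`) and the sample corpus are made up for the example and are not part of the PR.

```python
import math
from collections import Counter


def apply_merge(tokens, a, b):
    """Replace every adjacent (a, b) pair with the fused token a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(corpus, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    tokens = list(corpus)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so merging gains nothing
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges


def encode(text, merges):
    """Tokenize text by replaying the learned merges in order."""
    tokens = list(text)
    for a, b in merges:
        tokens = apply_merge(tokens, a, b)
    return tokens


def chars_over_4(text):
    """The heuristic being replaced: one token per 4 characters, rounded up."""
    return math.ceil(len(text) / 4)


# Hypothetical sample: one CLI line repeated, as in a command-heavy skill.
corpus = "dotnet build --configuration Release\n" * 20
merges = learn_bpe(corpus, num_merges=30)
tokens = encode(corpus, merges)
print("chars/4 estimate:", chars_over_4(corpus))
print("toy BPE tokens:  ", len(tokens))
```

The more often a substring recurs, the earlier it is merged into a single token, which is the effect chars/4 cannot see because it only looks at length.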

cl100k_base is a reasonable reference: it is the BPE vocabulary used by GPT-4, and it serves as a close proxy for the tokenization of Claude models (and others). Since skills are injected into these models' context windows, counting with a real BPE tokenizer gives a much closer estimate of the actual cost.

PR still gives the output in chars/4, but uses BPE for evaluation.

BPE vs chars/4 token estimation — all 25 skills

| Skill | Chars | Chars/4 | BPE | Diff% |
| --- | ---: | ---: | ---: | ---: |
| optimizing-ef-core-queries | 5,744 | 1,436 | 1,335 | +7.6% |
| analyzing-dotnet-performance | 11,310 | 2,828 | 2,561 | +10.4% |
| android-tombstone-symbolication | 8,189 | 2,048 | 2,109 | -2.9% |
| clr-activation-debugging | 20,140 | 5,035 | 4,976 | +1.2% |
| dotnet-trace-collect | 22,621 | 5,656 | 5,119 | +10.5% |
| dump-collect | 4,267 | 1,067 | 1,069 | -0.2% |
| microbenchmarking | 13,155 | 3,289 | 2,674 | +23.0% |
| binlog-failure-analysis | 3,761 | 941 | 972 | -3.2% |
| binlog-generation | 2,853 | 714 | 716 | -0.3% |
| build-parallelism | 3,638 | 910 | 820 | +11.0% |
| build-perf-baseline | 12,357 | 3,090 | 2,842 | +8.7% |
| build-perf-diagnostics | 6,595 | 1,649 | 1,560 | +5.7% |
| check-bin-obj-clash | 16,259 | 4,065 | 3,600 | +12.9% |
| directory-build-organization | 9,418 | 2,355 | 2,102 | +12.0% |
| eval-performance | 4,336 | 1,084 | 949 | +14.2% |
| including-generated-files | 6,843 | 1,711 | 1,447 | +18.2% |
| incremental-build | 13,442 | 3,361 | 2,958 | +13.6% |
| msbuild-antipatterns | 14,241 | 3,561 | 3,456 | +3.0% |
| msbuild-modernization | 16,712 | 4,178 | 4,148 | +0.7% |
| dotnet-aot-compat | 16,752 | 4,188 | 3,901 | +7.4% |
| migrate-nullable-references | 35,735 | 8,934 | 7,835 | +14.0% |
| thread-abort-migration | 12,792 | 3,198 | 2,772 | +15.4% |
| csharp-scripts | 5,183 | 1,296 | 1,239 | +4.6% |
| dotnet-pinvoke | 18,866 | 4,717 | 4,521 | +4.3% |
| nuget-trusted-publishing | 9,254 | 2,314 | 2,215 | +4.5% |

Diff% = how much chars/4 overestimates (+) or underestimates (−) relative to BPE.
chars/4 overestimates for 21/25 skills (up to +23%). The mean Diff% across all 25 skills is ≈ +7.9%; summed over all skills, chars/4 overcounts by ≈ +8.4%.
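For readers who want to reproduce the Diff% column, the following Python sketch recomputes it for three rows of the table. It assumes that the Chars/4 column is rounded up (which matches every row shown) and that Diff% is the chars/4 estimate divided by the BPE count, minus one. The helper names `chars4_estimate` and `diff_percent` are invented for this example.

```python
import math


def chars4_estimate(char_count):
    """chars/4 heuristic, rounded up (assumption: matches the table's Chars/4 column)."""
    return math.ceil(char_count / 4)


def diff_percent(char_count, bpe_tokens):
    """How far chars/4 is above (+) or below (-) the BPE count, in percent."""
    return round((chars4_estimate(char_count) / bpe_tokens - 1) * 100, 1)


# (chars, BPE tokens) copied from three rows of the table above.
rows = {
    "microbenchmarking": (13155, 2674),
    "android-tombstone-symbolication": (8189, 2109),
    "dump-collect": (4267, 1069),
}

for name, (chars, bpe) in rows.items():
    print(f"{name}: chars/4={chars4_estimate(chars)}, BPE={bpe}, "
          f"diff={diff_percent(chars, bpe):+.1f}%")
```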

Contributor

Copilot AI left a comment


Pull request overview

Updates the skill-validator’s skill profiling to use a real BPE tokenizer (cl100k_base via Microsoft.ML.Tokenizers) instead of the previous chars/4 heuristic when classifying skill complexity and generating token-size warnings.

Changes:

  • Compute BPE token counts in SkillProfiler and use them for complexity tiering and token-size warnings (while still retaining the chars/4 estimate for display).
  • Extend SkillProfile to carry both chars/4 and BPE counts, and update formatted output/warnings accordingly.
  • Update the comprehensive-skill unit test to use varied text (to avoid repeated-char tokenization artifacts) and add tokenizer package references.
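The shape of that change can be sketched in Python as follows. This is not the PR's C# code: the threshold value, the "standard" tier name, the warning text, and the names `SkillProfile`, `profile_skill`, and `COMPREHENSIVE_MIN` are all assumptions made for illustration. Only the idea comes from the PR: carry both counts, display chars/4, and let the BPE count decide the tier and the warning. The token counter is passed in as a function so the sketch stays self-contained; a whitespace word count stands in for a real cl100k_base encoder.

```python
import math
from dataclasses import dataclass
from typing import Callable

# Hypothetical threshold in BPE tokens; the PR text does not state the real value.
COMPREHENSIVE_MIN = 5000


@dataclass(frozen=True)
class SkillProfile:
    chars4_tokens: int   # legacy chars/4 estimate, retained for display
    bpe_tokens: int      # BPE count, drives tiering and warnings
    tier: str
    warnings: tuple


def profile_skill(text: str,
                  count_bpe_tokens: Callable[[str], int],
                  comprehensive_min: int = COMPREHENSIVE_MIN) -> SkillProfile:
    """Build a profile whose tier and warnings depend only on the BPE count."""
    chars4 = math.ceil(len(text) / 4)
    bpe = count_bpe_tokens(text)
    tier = "comprehensive" if bpe >= comprehensive_min else "standard"
    warnings = ()
    if tier == "comprehensive":
        warnings = (f"skill is {bpe} BPE tokens (chars/4 estimate: {chars4})",)
    return SkillProfile(chars4, bpe, tier, warnings)


def stand_in_counter(text: str) -> int:
    """Stand-in tokenizer (whitespace word count); NOT a real BPE tokenizer."""
    return len(text.split())


profile = profile_skill("word " * 6000, stand_in_counter)
print(profile.tier, profile.bpe_tokens, profile.chars4_tokens)
```

Injecting the counter also makes the tiering logic testable without loading tokenizer data.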

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| eng/skill-validator/tests/SkillProfileTests.cs | Updates test content generation to reliably exceed the “comprehensive” threshold under BPE tokenization. |
| eng/skill-validator/src/SkillValidator.csproj | Adds Microsoft.ML.Tokenizers + cl100k data package; minor RunArguments quoting cleanup. |
| eng/skill-validator/src/Services/SkillProfiler.cs | Implements BPE token counting and switches tiering/warnings/output to use BPE counts. |


@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
csharp-scripts Test a C# language feature with a script 4.0/5 → 5.0/5 🟢 ✅ csharp-scripts; tools: skill, create, edit 🟡 0.32 [1]
nuget-trusted-publishing Set up trusted publishing for a new NuGet library 3.0/5 → 4.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill ✅ 0.11
nuget-trusted-publishing Set up NuGet publishing without mentioning trusted publishing 2.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, report_intent, glob, view ✅ 0.11
nuget-trusted-publishing Migrate existing workflow from API key to trusted publishing 3.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, view ✅ 0.11
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 4.0/5 → 5.0/5 🟢 ✅ dotnet-pinvoke; tools: skill ✅ 0.12
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 3.0/5 → 5.0/5 🟢 ✅ dotnet-pinvoke; tools: skill ✅ 0.12
dotnet-trace-collect High CPU in Kubernetes on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.12
dotnet-trace-collect .NET Framework on Windows without admin privileges 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.12
dotnet-trace-collect .NET 10 on Linux with root access and native call stacks 1.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.12
dotnet-trace-collect Memory leak on Linux (.NET 8) 4.0/5 → 3.0/5 🔴 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Slow requests on Windows with PerfView 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Excessive GC on Linux (.NET 8) 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill ✅ 0.12 [2]
dotnet-trace-collect Hang or deadlock diagnosis on Linux 3.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Windows container high CPU with PerfView 1.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view, glob ✅ 0.12
dotnet-trace-collect Long-running intermittent issue with PerfView triggers 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.12
dotnet-trace-collect Linux pre-.NET 10 needing native call stacks 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.12
dotnet-trace-collect Windows modern .NET with admin high CPU 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Memory leak on .NET Framework Windows 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Kubernetes with console access prefers console tools 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.12
dotnet-trace-collect Container installation without .NET SDK 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.12
dotnet-trace-collect HTTP 500s from downstream service on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
dotnet-trace-collect Networking timeouts on Windows with admin (.NET 8) 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.12
microbenchmarking Investigate runtime upgrade performance impact 4.0/5 → 5.0/5 🟢 ✅ microbenchmarking; tools: skill, glob, stop_bash ✅ 0.10
clr-activation-debugging Diagnose unexpected FOD dialog from native build tool 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Diagnose FOD suppressed but activation still failing 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Explain why same binary behaves differently under different launch methods 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Analyze healthy managed EXE activation 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Identify multiple activation sequences in a single log 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Explain useLegacyV2RuntimeActivationPolicy in activation log 2.0/5 → 4.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Decline non-CLR-activation issue 1.0/5 → 5.0/5 🟢 ℹ️ not activated (expected) ✅ 0.10
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 1.0/5 → 5.0/5 ⏰ 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 1.0/5 → 4.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 1.0/5 → 2.0/5 ⏰ 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 1.0/5 → 3.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 1.0/5 → 4.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 1.0/5 → 4.0/5 ⏰ 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.13
android-tombstone-symbolication Symbolicate .NET frames in an Android tombstone 3.0/5 → 3.0/5 ✅ android-tombstone-symbolication; tools: skill, stop_bash ✅ 0.17
android-tombstone-symbolication Recognize tombstone with no .NET frames 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill ✅ 0.17 [3]
android-tombstone-symbolication Symbolicate CoreCLR frames in an Android tombstone 3.0/5 → 4.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.17
android-tombstone-symbolication Recognize NativeAOT tombstone with app binary and libSystem.Native.so 3.0/5 → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, bash, glob ✅ 0.17
android-tombstone-symbolication Symbolicate multi-thread tombstone 4.0/5 → 4.0/5 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.17
android-tombstone-symbolication Handle .NET frames with no BuildId metadata 5.0/5 → 4.0/5 🔴 ✅ android-tombstone-symbolication; tools: skill, bash, glob ✅ 0.17 [4]
android-tombstone-symbolication Symbolicate tombstone with multiple .NET libraries and different BuildIds 3.0/5 → 4.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill ✅ 0.17
android-tombstone-symbolication Reject iOS crash log as wrong format 5.0/5 → 5.0/5 ℹ️ not activated (expected) ✅ 0.17
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.28
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 2.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.28
dump-collect Recover crash dump from macOS NativeAOT without createdump 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.28 [5]
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.28
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, bash 🟡 0.28
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view 🟡 0.28
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 3.0/5 → 3.0/5 ✅ dump-collect; tools: skill 🟡 0.28 [6]
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, bash 🟡 0.28 [7]
dump-collect Decline dump analysis request 2.0/5 → 4.0/5 🟢 ℹ️ not activated (expected) 🟡 0.28
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 4.0/5 → 5.0/5 🟢 ✅ optimizing-ef-core-queries; tools: skill 🟡 0.30
build-parallelism Analyze build parallelism bottlenecks 4.0/5 → 4.0/5 ✅ build-parallelism; tools: skill ✅ 0.13
including-generated-files Diagnose generated file inclusion failure 3.0/5 → 5.0/5 🟢 ✅ including-generated-files; tools: skill ✅ 0.18
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 4.0/5 → 5.0/5 🟢 ✅ msbuild-antipatterns; tools: skill ✅ 0.06 [8]
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 → 4.0/5 🟢 ✅ build-perf-baseline; build-perf-diagnostics; tools: skill 🟡 0.24
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 → 5.0/5 ✅ msbuild-modernization; tools: skill ✅ 0.04 [9]
directory-build-organization Organize build infrastructure for a multi-project repo 3.0/5 → 5.0/5 🟢 ✅ msbuild-antipatterns; directory-build-organization; tools: skill ✅ 0.13
check-bin-obj-clash Diagnose bin/obj output path clashes 4.0/5 → 5.0/5 🟢 ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob ✅ 0.13
incremental-build Analyze incremental build issues 3.0/5 → 4.0/5 🟢 ✅ incremental-build; tools: skill ✅ 0.14
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 → 5.0/5 🟢 ✅ eval-performance; tools: skill ✅ 0.11
build-perf-diagnostics Analyze analyzer performance impact on builds 5.0/5 → 5.0/5 ✅ binlog-generation; build-perf-diagnostics; tools: skill 🟡 0.30 [10]
binlog-generation Build project with /bl flag 1.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill 🟡 0.48 [11]
binlog-generation Build with /bl in PowerShell 3.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill 🟡 0.48
binlog-generation Build multiple configurations with unique binlogs 5.0/5 → 5.0/5 ✅ binlog-generation; tools: skill 🟡 0.48
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 4.0/5 → 5.0/5 🟢 ✅ binlog-failure-analysis; tools: skill ✅ 0.07 [12]
dotnet-maui-doctor Plan macOS MAUI setup with Xcode 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill ✅ 0.20
dotnet-maui-doctor Plan Linux MAUI environment for Android 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view, bash ✅ 0.20
dotnet-maui-doctor Guardrail against workload update and repair 1.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill ✅ 0.20
dotnet-maui-doctor Diagnose non-Microsoft JDK causing build failure 1.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill ✅ 0.20
dotnet-maui-doctor Plan complete MAUI setup on Windows 4.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill ✅ 0.20 [13]
dotnet-maui-doctor Prevent incorrect JAVA_HOME configuration 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill ✅ 0.20
dotnet-maui-doctor Determine required Android SDK packages for specific .NET version 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view, glob, bash, stop_bash ✅ 0.20
dotnet-maui-doctor Fix stale MAUI workloads after SDK update 2.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill ✅ 0.20
thread-abort-migration Worker thread with abort-based cancellation 5.0/5 → 5.0/5 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Timeout enforcement via Thread.Abort 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Blocking WaitHandle with Thread.Interrupt 4.0/5 → 4.0/5 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration ASP.NET Response.End and Response.Redirect with Thread.Abort 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.12
thread-abort-migration Thread.Join and Thread.Sleep only — should not migrate 5.0/5 → 5.0/5 ✅ thread-abort-migration; tools: skill ✅ 0.12
migrate-nullable-references Enable NRT in a small library with mixed nullability 5.0/5 → 5.0/5 ✅ migrate-nullable-references; tools: skill, glob ✅ 0.12
migrate-nullable-references File-by-file migration: only modify the targeted file 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.12 [14]
migrate-nullable-references Enable NRT in ASP.NET Core Web API with EF Core 3.0/5 → 3.0/5 ✅ migrate-nullable-references; tools: skill ✅ 0.12 [15]
dotnet-aot-compat Make Azure.ResourceManager AOT-compatible 1.0/5 → 2.0/5 ⏰ 🟢 ✅ dotnet-aot-compat; tools: skill, create, read_agent ✅ 0.16 [16]

[1] Quality improved but weighted score is -8.6% due to: tokens (22963 → 142576), tool calls (2 → 11), time (34.0s → 48.9s)
[2] Quality unchanged but weighted score is -17.7% due to: judgment, tokens (33531 → 106250), tool calls (3 → 6), time (23.0s → 36.1s)
[3] Quality unchanged but weighted score is -5.8% due to: tokens (23229 → 43240), tool calls (2 → 3)
[4] Quality dropped but weighted score is +5.6% due to: efficiency metrics
[5] Quality unchanged but weighted score is -6.0% due to: tokens (11185 → 82741), tool calls (0 → 6), time (13.7s → 36.0s)
[6] Quality unchanged but weighted score is -4.5% due to: tokens (44786 → 84328), time (39.0s → 47.6s)
[7] Quality unchanged but weighted score is -2.6% due to: tokens (38611 → 100265), time (35.7s → 46.4s)
[8] Quality improved but weighted score is -15.0% due to: judgment, tokens (52762 → 90288)
[9] Quality unchanged but weighted score is -8.4% due to: judgment
[10] Quality unchanged but weighted score is -8.5% due to: tokens (123360 → 368759), tool calls (12 → 24), time (62.5s → 88.1s)
[11] Quality improved but weighted score is -1.3% due to: tokens (44784 → 54417)
[12] Quality improved but weighted score is -3.4% due to: tokens (434842 → 1292335), tool calls (22 → 41)
[13] Quality improved but weighted score is -2.9% due to: completion (✓ → ✗), tokens (34781 → 52532), tool calls (3 → 8), time (41.6s → 55.7s)
[14] Quality unchanged but weighted score is -0.6% due to: efficiency metrics
[15] Quality unchanged but weighted score is -20.9% due to: judgment, tokens (87356 → 188143), quality, time (78.3s → 102.4s), tool calls (20 → 24)
[16] Quality improved but weighted score is -10.8% due to: judgment, errors (0 → 1), tool calls (140 → 209), time (795.5s → 1050.0s)

⏰ timeout: run hit the scenario timeout limit; scoring may be impacted because model execution was aborted before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

@ViktorHofer
Member

PR still gives the output in chars/4, but uses BPE for evaluation.

Curious, why?

DeagleGross enabled auto-merge (squash) March 10, 2026 18:03
Copilot AI review requested due to automatic review settings March 10, 2026 18:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



@github-actions
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
csharp-scripts Test a C# language feature with a script 3.0/5 → 4.0/5 🟢 ✅ csharp-scripts; tools: skill, create 🟡 0.32
nuget-trusted-publishing Set up trusted publishing for a new NuGet library 3.0/5 → 4.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill ✅ 0.09
nuget-trusted-publishing Set up NuGet publishing without mentioning trusted publishing 2.0/5 → 5.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, report_intent, glob, view, bash, create ✅ 0.09
nuget-trusted-publishing Migrate existing workflow from API key to trusted publishing 3.0/5 → 4.0/5 🟢 ✅ nuget-trusted-publishing; tools: skill, view ✅ 0.09
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET 8+) 5.0/5 → 5.0/5 ✅ dotnet-pinvoke; tools: skill ✅ 0.09
dotnet-pinvoke Generate LibraryImport declaration from C header (.NET Framework) 5.0/5 → 5.0/5 ✅ dotnet-pinvoke; tools: skill ✅ 0.09
dotnet-trace-collect High CPU in Kubernetes on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view, glob ✅ 0.10
dotnet-trace-collect .NET Framework on Windows without admin privileges 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.10
dotnet-trace-collect .NET 10 on Linux with root access and native call stacks 1.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.10
dotnet-trace-collect Memory leak on Linux (.NET 8) 3.0/5 → 3.0/5 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Slow requests on Windows with PerfView 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Excessive GC on Linux (.NET 8) 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.10
dotnet-trace-collect Hang or deadlock diagnosis on Linux 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill ✅ 0.10
dotnet-trace-collect Windows container high CPU with PerfView 1.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: report_intent, skill, view, glob ✅ 0.10
dotnet-trace-collect Long-running intermittent issue with PerfView triggers 3.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Linux pre-.NET 10 needing native call stacks 2.0/5 → 4.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Windows modern .NET with admin high CPU 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Memory leak on .NET Framework Windows 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Kubernetes with console access prefers console tools 5.0/5 → 5.0/5 ✅ dotnet-trace-collect; tools: skill ✅ 0.10 [1]
dotnet-trace-collect Container installation without .NET SDK 4.0/5 → 2.0/5 🔴 ✅ dotnet-trace-collect; tools: skill ✅ 0.10
dotnet-trace-collect HTTP 500s from downstream service on Linux (.NET 8) 4.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
dotnet-trace-collect Networking timeouts on Windows with admin (.NET 8) 2.0/5 → 5.0/5 🟢 ✅ dotnet-trace-collect; tools: skill, report_intent, view ✅ 0.10
microbenchmarking Investigate runtime upgrade performance impact 5.0/5 → 5.0/5 ✅ microbenchmarking; tools: skill, glob, create ✅ 0.10
clr-activation-debugging Diagnose unexpected FOD dialog from native build tool 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Diagnose FOD suppressed but activation still failing 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Explain why same binary behaves differently under different launch methods 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Analyze healthy managed EXE activation 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Identify multiple activation sequences in a single log 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Explain useLegacyV2RuntimeActivationPolicy in activation log 2.0/5 → 3.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
clr-activation-debugging Decline non-CLR-activation issue 1.0/5 → 5.0/5 🟢 ✅ clr-activation-debugging; tools: skill ✅ 0.10
analyzing-dotnet-performance Detects compiled regex startup budget and regex chain allocations 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects CurrentCulture comparer and compiled regex budget in inflection rules 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds per-call Dictionary allocation not hoisted to static 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Catches compound allocations in recursive number converter with ToLower 1.0/5 → 4.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds StringComparison.Ordinal missing and FrozenDictionary opportunities 1.0/5 → 3.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects Aggregate+Replace chain and struct missing IEquatable 1.0/5 → 4.0/5 ⏰ 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Finds branched Replace chain in format string manipulation 1.0/5 → 1.0/5 ⏰ ✅ analyzing-dotnet-performance; tools: skill, write_bash, stop_bash ✅ 0.14 [2]
analyzing-dotnet-performance Catches LINQ on hot-path string processing and All(char.IsUpper) 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Detects LINQ pipeline in TimeSpan formatting and collection processing 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill ✅ 0.14
analyzing-dotnet-performance Flags Span inconsistencies and compound method chains in truncation library 1.0/5 → 5.0/5 🟢 ✅ analyzing-dotnet-performance; tools: skill, grep ✅ 0.14
analyzing-dotnet-performance Identifies unsealed leaf classes and locale hierarchy patterns 1.0/5 → 1.0/5 ⏰ ✅ analyzing-dotnet-performance; tools: skill, read_bash, stop_bash ✅ 0.14
android-tombstone-symbolication Symbolicate .NET frames in an Android tombstone 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill ✅ 0.11 [3]
android-tombstone-symbolication Recognize tombstone with no .NET frames 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill ✅ 0.11 [4]
android-tombstone-symbolication Symbolicate CoreCLR frames in an Android tombstone 2.0/5 ⏰ → 4.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.11
android-tombstone-symbolication Recognize NativeAOT tombstone with app binary and libSystem.Native.so 3.0/5 → 1.0/5 ⏰ 🔴 ✅ android-tombstone-symbolication; tools: skill, bash, read_bash ✅ 0.11
android-tombstone-symbolication Symbolicate multi-thread tombstone 1.0/5 ⏰ → 5.0/5 🟢 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.11
android-tombstone-symbolication Handle .NET frames with no BuildId metadata 5.0/5 → 5.0/5 ✅ android-tombstone-symbolication; tools: skill, bash ✅ 0.11
android-tombstone-symbolication Symbolicate tombstone with multiple .NET libraries and different BuildIds 4.0/5 → 4.0/5 ✅ android-tombstone-symbolication; tools: skill, glob ✅ 0.11
android-tombstone-symbolication Reject iOS crash log as wrong format 5.0/5 → 5.0/5 ℹ️ not activated (expected) ✅ 0.11 [5]
dump-collect Configure automatic crash dumps for CoreCLR app on Linux 5.0/5 → 5.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob 🟡 0.28 [6]
dump-collect Set up NativeAOT crash dumps with createdump in Kubernetes 2.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.28
dump-collect Recover crash dump from macOS NativeAOT without createdump 2.0/5 → 4.0/5 🟢 ✅ dump-collect; tools: skill, report_intent, view, glob, bash 🟡 0.28
dump-collect Configure CoreCLR dump collection in Alpine Docker as non-root 5.0/5 → 5.0/5 ✅ dump-collect; tools: skill, report_intent, view, glob, bash 🟡 0.28 [7]
dump-collect Advisory: macOS NativeAOT crash dump recovery steps 4.0/5 → 4.0/5 ✅ dump-collect; tools: skill, glob, bash 🟡 0.28
dump-collect Advisory: CoreCLR Alpine Docker non-root configuration 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.28
dump-collect Advisory: NativeAOT Kubernetes dump collection setup 3.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.28
dump-collect Detect runtime and configure crash dumps for unknown .NET app on Linux 4.0/5 → 5.0/5 🟢 ✅ dump-collect; tools: skill 🟡 0.28 [8]
dump-collect Decline dump analysis request 2.0/5 → 4.0/5 🟢 ℹ️ not activated (expected) 🟡 0.28
optimizing-ef-core-queries Optimize bulk operations with EF Core 7+ ExecuteUpdate and ExecuteDelete 4.0/5 → 4.0/5 ✅ optimizing-ef-core-queries; tools: skill 🟡 0.35
build-parallelism Analyze build parallelism bottlenecks 4.0/5 → 4.0/5 ✅ binlog-generation; build-parallelism; tools: skill ✅ 0.20
including-generated-files Diagnose generated file inclusion failure 3.0/5 → 5.0/5 🟢 ✅ including-generated-files; tools: skill 🟡 0.22
msbuild-antipatterns Review MSBuild files for anti-patterns and style issues 5.0/5 → 5.0/5 ✅ msbuild-antipatterns; tools: skill ✅ 0.06
build-perf-baseline Establish build performance baseline and recommend optimizations 3.0/5 → 4.0/5 🟢 ✅ build-perf-baseline; tools: skill 🟡 0.24
msbuild-modernization Modernize legacy project to SDK-style 5.0/5 → 5.0/5 ✅ msbuild-modernization; tools: skill ✅ 0.04
directory-build-organization Organize build infrastructure for a multi-project repo 3.0/5 → 4.0/5 🟢 ✅ directory-build-organization; msbuild-antipatterns; tools: skill ✅ 0.20
check-bin-obj-clash Diagnose bin/obj output path clashes 4.0/5 → 5.0/5 🟢 ✅ check-bin-obj-clash; binlog-generation; tools: skill, glob, edit ✅ 0.14
incremental-build Analyze incremental build issues 3.0/5 → 4.0/5 🟢 ✅ incremental-build; tools: skill ✅ 0.13
eval-performance Analyze MSBuild evaluation performance issues 4.0/5 → 5.0/5 🟢 ✅ eval-performance; tools: skill ✅ 0.08
build-perf-diagnostics Analyze analyzer performance impact on builds 4.0/5 → 5.0/5 🟢 ✅ binlog-generation; build-perf-diagnostics; tools: skill, edit 🟡 0.24
binlog-generation Build project with /bl flag 1.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build with /bl in PowerShell 3.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-generation Build multiple configurations with unique binlogs 4.0/5 → 5.0/5 🟢 ✅ binlog-generation; tools: skill ✅ 0.00
binlog-failure-analysis Diagnose build failures from binlog only (no source files) 4.0/5 → 4.0/5 ✅ binlog-failure-analysis; tools: skill ✅ 0.04
dotnet-maui-doctor Plan macOS MAUI setup with Xcode 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.25
dotnet-maui-doctor Plan Linux MAUI environment for Android 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view, glob, bash, stop_bash 🟡 0.25
dotnet-maui-doctor Guardrail against workload update and repair 1.0/5 → 3.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.25
dotnet-maui-doctor Diagnose non-Microsoft JDK causing build failure 4.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.25
dotnet-maui-doctor Plan complete MAUI setup on Windows 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.25 [9]
dotnet-maui-doctor Prevent incorrect JAVA_HOME configuration 3.0/5 → 4.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill 🟡 0.25
dotnet-maui-doctor Determine required Android SDK packages for specific .NET version 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: report_intent, skill, view, bash, glob 🟡 0.25
dotnet-maui-doctor Fix stale MAUI workloads after SDK update 3.0/5 → 5.0/5 🟢 ✅ dotnet-maui-doctor; tools: skill 🟡 0.25
thread-abort-migration Worker thread with abort-based cancellation 5.0/5 → 5.0/5 ✅ thread-abort-migration; tools: skill ✅ 0.11
thread-abort-migration Timeout enforcement via Thread.Abort 5.0/5 → 5.0/5 ✅ thread-abort-migration; tools: skill ✅ 0.11 [10]
thread-abort-migration Blocking WaitHandle with Thread.Interrupt 4.0/5 → 3.0/5 🔴 ✅ thread-abort-migration; tools: skill ✅ 0.11 [11]
thread-abort-migration ASP.NET Response.End and Response.Redirect with Thread.Abort 4.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.11
thread-abort-migration Thread.Join and Thread.Sleep only — should not migrate 3.0/5 → 5.0/5 🟢 ✅ thread-abort-migration; tools: skill ✅ 0.11
migrate-nullable-references Enable NRT in a small library with mixed nullability 5.0/5 → 5.0/5 ✅ migrate-nullable-references; tools: skill ✅ 0.15 [12]
migrate-nullable-references File-by-file migration: only modify the targeted file 5.0/5 → 5.0/5 ⚠️ NOT ACTIVATED ✅ 0.15 [13]
migrate-nullable-references Enable NRT in ASP.NET Core Web API with EF Core 3.0/5 → 3.0/5 ✅ migrate-nullable-references; tools: skill ✅ 0.15 [14]
dotnet-aot-compat Make Azure.ResourceManager AOT-compatible 2.0/5 ⏰ → 2.0/5 ⏰ ✅ dotnet-aot-compat; tools: skill, create, read_agent ✅ 0.16

[1] Quality unchanged but weighted score is -8.5% due to: tokens (11164 → 30813), tool calls (0 → 1), time (12.4s → 17.5s)
[2] Quality unchanged but weighted score is -3.0% due to: tokens (44326 → 144631), errors (0 → 1), tool calls (4 → 12), time (18.4s → 180.0s)
[3] Quality unchanged but weighted score is -12.5% due to: judgment, errors (0 → 1)
[4] Quality unchanged but weighted score is -6.4% due to: tokens (23036 → 43262), tool calls (2 → 3), time (13.0s → 16.7s)
[5] Quality unchanged but weighted score is -1.0% due to: time (29.5s → 35.6s), tokens (23867 → 26318)
[6] Quality unchanged but weighted score is -6.8% due to: tokens (11217 → 82397), tool calls (0 → 6), time (11.5s → 31.9s)
[7] Quality unchanged but weighted score is -10.0% due to: tokens (11368 → 104022), tool calls (0 → 9), time (13.9s → 36.3s)
[8] Quality improved but weighted score is -4.4% due to: quality, tokens (76340 → 100270)
[9] Quality improved but weighted score is -3.1% due to: completion (✓ → ✗), tokens (34748 → 53019), tool calls (3 → 9), time (40.4s → 56.3s)
[10] Quality unchanged but weighted score is -11.1% due to: tokens (12123 → 28596), quality, tool calls (0 → 1)
[11] Quality dropped but weighted score is +10.0% due to: completion (✗ → ✓)
[12] Quality unchanged but weighted score is -1.1% due to: tokens (118647 → 214824)
[13] Quality unchanged but weighted score is -3.0% due to: tokens (58630 → 89836)
[14] Quality unchanged but weighted score is -1.1% due to: tokens (122789 → 188410), time (78.0s → 100.6s)

⏰ timeout: run hit the scenario timeout limit; scoring may be impacted because model execution was aborted before it could produce its full output

Model: claude-opus-4.6 | Judge: claude-opus-4.6

Full results

JanKrivanek disabled auto-merge March 10, 2026 19:16
JanKrivanek merged commit 484d5d2 into main Mar 10, 2026
17 checks passed
JanKrivanek deleted the dmkorolev/bpe branch March 10, 2026 19:17
