Fix eval timeouts for migrate-mstest-v3-to-v4 scenarios #548

Merged: Evangelink merged 3 commits into dotnet:main from Evangelink:fix/mstest-sdk-eval-timeout on Apr 23, 2026

@Evangelink (Member) commented Apr 20, 2026

Problem

Several migrate-mstest-v3-to-v4 evaluation scenarios were timing out on CI:

  • Fix TestMethodAttribute CallerInfo constructor breaking change — baseline timed out at 180s
  • Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout — baseline timed out at 240s
  • Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout — baseline timed out at 240s
  • Fix sealed custom TestMethodAttribute with Timeout changes — scenario didn't exist yet

Root Cause

Two contributing factors:

  1. Cold NuGet cache: The agent calls dotnet build, which triggers NuGet package downloads. Resolving the MSTest packages and MSTest.Sdk adds significant latency on CI runners with cold caches.
  2. Insufficient timeout headroom: The baseline agent (without skill) needs many iterations for complex multi-fix scenarios (up to 30 tool calls, 500K tokens), and on slower CI runners these iterations exceed the timeout budget.

Fix

Pre-warm NuGet cache

Added dotnet restore setup commands to scenarios 3 (multiple breaking changes), 5 (CallerInfo), and the new scenario 10 (sealed TestMethod). This runs before the agent starts, so NuGet packages are already cached when the agent runs dotnet build.

Increase timeouts for complex scenarios

  • CallerInfo scenario: 180s → 240s
  • Multiple breaking changes scenario: 240s → 300s (6 breaking changes require many baseline iterations)
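
A minimal sketch of how the restore pre-warming and the timeout increases might look together in eval.yaml. Only `setup.commands` and `timeout` are named in this PR; the top-level `scenarios` key, the `name` field, and the overall layout are assumptions for illustration and may not match the actual file.

```yaml
# Hypothetical excerpt of tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml.
# Only `setup.commands` and `timeout` come from this PR; the `scenarios` and
# `name` keys are assumed for illustration.
scenarios:
  - name: "Fix TestMethodAttribute CallerInfo constructor breaking change"
    setup:
      commands:
        - dotnet restore   # runs before the timed agent session, warming the NuGet cache
    timeout: 240           # seconds; previously 180

  - name: "Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout"
    setup:
      commands:
        - dotnet restore
    timeout: 300           # seconds; previously 240
```

Because the setup commands run before the agent starts, the restore time is not charged against the scenario's timeout budget.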

New scenario

Added "Fix sealed custom TestMethodAttribute with Timeout changes" (Goal 10) with:

  • Sealed TimedTestMethodAttribute subclass that overrides Execute
  • Multiple [Timeout(TestTimeout.Infinite)] usages
  • Fixture files at fixtures/v3-sealed-testmethod/
  • 240s timeout with dotnet restore pre-warming

Verification

All 4 scenarios pass locally with no timeouts:

| Scenario | Timeout | Baseline | Skilled |
| --- | --- | --- | --- |
| CallerInfo | 240s | 6 turns, no timeout | 8 turns |
| Multiple breaking changes | 300s | 17 turns, no timeout | 13 turns |
| MSTest.Sdk | 240s | 7 turns, no timeout | 5 turns |
| Sealed TestMethod (new) | 240s | 17 turns, no timeout | 5 turns |

Copilot AI review requested due to automatic review settings April 20, 2026 09:40
@Evangelink Evangelink enabled auto-merge (squash) April 20, 2026 09:41
@Evangelink
Member Author

/evaluate

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses CI timeouts in the migrate-mstest-v3-to-v4 eval scenario by pre-warming NuGet/package restore work before the timed agent run begins, reducing repeated cold-cache restore/build overhead across parallel sessions.

Changes:

  • Adds a scenario setup.commands step to run dotnet restore for the MSTest.Sdk v4 migration fixture.

| File | Description |
| --- | --- |
| tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml | Adds a setup restore command to warm NuGet cache/obj prior to the timed evaluation. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0

@Evangelink
Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Apr 20, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| migrate-mstest-v3-to-v4 | Migrate custom TestMethodAttribute from Execute to ExecuteAsync | 2.0/5 → 4.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.07 | |
| migrate-mstest-v3-to-v4 | Replace ExpectedExceptionAttribute with Assert.ThrowsExactly | 3.3/5 → 4.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.07 | [1] |
| migrate-mstest-v3-to-v4 | Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout | 3.7/5 ⏰ → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | |
| migrate-mstest-v3-to-v4 | Handle net6.0 target framework dropped in MSTest v4 | 3.0/5 → 4.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.07 | |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute CallerInfo constructor breaking change | 4.0/5 → 4.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | [2] |
| migrate-mstest-v3-to-v4 | Understand behavioral changes after MSTest v4 upgrade | 3.3/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | |
| migrate-mstest-v3-to-v4 | Handle MSTest.Sdk and MTP changes in v4 | 2.0/5 → 3.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | [3] |
| migrate-mstest-v3-to-v4 | Full MSTest v3 to v4 migration with multiple breaking changes | 3.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, bash, create | ✅ 0.07 | |
| migrate-mstest-v3-to-v4 | Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout | 4.0/5 → 4.0/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | [4] |
| migrate-mstest-v3-to-v4 | Correctly identify MSTest v3 project and recommend v4 migration | 4.7/5 → 4.7/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.07 | [5] |

[1] ⚠️ High run-to-run variance (CV=1.67) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.92) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=25.81) — consider re-running with --runs 5. (Plugin) Quality improved but weighted score is -14.7% due to: judgment, quality
[4] ⚠️ High run-to-run variance (CV=1.19) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -23.9% due to: completion (✓ → ✗), judgment, quality
[5] ⚠️ High run-to-run variance (CV=62.91) — consider re-running with --runs 5

timeout — run(s) hit the (240s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink
Member Author

/evaluate

github-actions Bot added a commit that referenced this pull request Apr 20, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| migrate-mstest-v3-to-v4 | Migrate custom TestMethodAttribute from Execute to ExecuteAsync | 1.7/5 → 3.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | |
| migrate-mstest-v3-to-v4 | Replace ExpectedExceptionAttribute with Assert.ThrowsExactly | 3.3/5 → 4.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, read_bash / ⚠️ NOT ACTIVATED | ✅ 0.08 | [1] |
| migrate-mstest-v3-to-v4 | Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout | 3.0/5 ⏰ → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, edit | ✅ 0.08 | [2] |
| migrate-mstest-v3-to-v4 | Handle net6.0 target framework dropped in MSTest v4 | 3.0/5 → 4.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ⚠️ NOT ACTIVATED | ✅ 0.08 | |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute CallerInfo constructor breaking change | 4.0/5 → 4.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.08 | [3] |
| migrate-mstest-v3-to-v4 | Understand behavioral changes after MSTest v4 upgrade | 3.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.08 | |
| migrate-mstest-v3-to-v4 | Handle MSTest.Sdk and MTP changes in v4 | 2.0/5 → 3.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill | ✅ 0.08 | [4] |
| migrate-mstest-v3-to-v4 | Full MSTest v3 to v4 migration with multiple breaking changes | 4.3/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.08 | [5] |
| migrate-mstest-v3-to-v4 | Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout | 4.0/5 → 4.0/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.08 | [6] |
| migrate-mstest-v3-to-v4 | Correctly identify MSTest v3 project and recommend v4 migration | 4.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.08 | |

[1] ⚠️ High run-to-run variance (CV=2.50) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.82) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=3.18) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -19.8% due to: quality, judgment
[4] (Plugin) Quality improved but weighted score is -2.2% due to: completion (✓ → ✗), tokens (22793 → 50555), tool calls (1 → 3)
[5] ⚠️ High run-to-run variance (CV=1.58) — consider re-running with --runs 5
[6] (Isolated) Quality unchanged but weighted score is -18.1% due to: completion (✓ → ✗), judgment

timeout — run(s) hit the (240s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

Copilot AI review requested due to automatic review settings April 21, 2026 07:15
@Evangelink
Member Author

/evaluate

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses CI timeouts in the migrate-mstest-v3-to-v4 evaluation by pre-warming the NuGet cache for the MSTest.Sdk v4 fixture before the agent session begins, reducing subsequent dotnet build/dotnet restore latency during the timed portion of the scenario.

Changes:

  • Add a setup.commands step to run dotnet restore for the affected MSTest.Sdk v4 migration scenario.
  • Ensure dependency resolution happens before agent execution to avoid consuming scenario timeout budget.

| File | Description |
| --- | --- |
| tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml | Adds a setup-time dotnet restore to warm NuGet/obj for the MSTest.Sdk v4 scenario, mitigating CI timeouts. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0

github-actions Bot added a commit that referenced this pull request Apr 21, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| migrate-mstest-v3-to-v4 | Migrate custom TestMethodAttribute from Execute to ExecuteAsync | 1.7/5 → 3.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Replace ExpectedExceptionAttribute with Assert.ThrowsExactly | 3.7/5 → 4.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, bash | ✅ 0.05 | [1] |
| migrate-mstest-v3-to-v4 | Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout | 3.7/5 ⏰ → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Handle net6.0 target framework dropped in MSTest v4 | 3.0/5 → 3.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute CallerInfo constructor breaking change | 3.3/5 ⏰ → 4.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill / ✅ migrate-mstest-v3-to-v4; tools: skill, read_bash | ✅ 0.05 | [2] |
| migrate-mstest-v3-to-v4 | Understand behavioral changes after MSTest v4 upgrade | 3.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Handle MSTest.Sdk and MTP changes in v4 | 2.0/5 → 3.3/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [3] |
| migrate-mstest-v3-to-v4 | Full MSTest v3 to v4 migration with multiple breaking changes | 3.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [4] |
| migrate-mstest-v3-to-v4 | Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout | 2.7/5 ⏰ → 4.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, edit | ✅ 0.05 | [5] |
| migrate-mstest-v3-to-v4 | Verified MSTest v3 to v4 package update with build and test | 4.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [6] |
| migrate-mstest-v3-to-v4 | Fix sealed custom TestMethodAttribute with Timeout changes | 3.7/5 ⏰ → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [7] |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute and TestMethod display name constructor | 5.0/5 → 5.0/5 | ✅ migrate-mstest-v3-to-v4; tools: skill, bash | ✅ 0.05 | [8] |
| migrate-mstest-v3-to-v4 | Fix Assert.IsInstanceOfType out parameter removal | 5.0/5 → 5.0/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [9] |
| migrate-mstest-v3-to-v4 | Address TreatDiscoveryWarningsAsErrors and behavioral changes | 2.0/5 → 3.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | |
| migrate-mstest-v3-to-v4 | Correctly identify MSTest v3 project and recommend v4 migration | 4.3/5 → 4.3/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.05 | [10] |

[1] ⚠️ High run-to-run variance (CV=19.33) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.43) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=0.96) — consider re-running with --runs 5
[4] ⚠️ High run-to-run variance (CV=0.60) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=1.05) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=0.91) — consider re-running with --runs 5
[7] ⚠️ High run-to-run variance (CV=0.62) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.58) — consider re-running with --runs 5
[9] ⚠️ High run-to-run variance (CV=0.87) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -17.1% due to: judgment, quality, tokens (112007 → 148762)
[10] ⚠️ High run-to-run variance (CV=1.04) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -15.1% due to: judgment, quality

timeout — run(s) hit the (180s, 240s) scenario timeout limit; scoring may be impacted by aborting model execution before it could produce its full output (increase via timeout in eval.yaml)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

- Add dotnet restore setup commands to scenarios 3 (multiple breaking
  changes) and 5 (CallerInfo) to pre-warm NuGet cache before agent runs
- Increase timeout for CallerInfo scenario (180s -> 240s) and multiple
  breaking changes scenario (240s -> 300s) to accommodate baseline agent
  iteration cycles
- Add new scenario 10: Fix sealed custom TestMethodAttribute with
  Timeout changes, with fixture files and dotnet restore pre-warming
Copilot AI review requested due to automatic review settings April 21, 2026 08:53
@Evangelink Evangelink force-pushed the fix/mstest-sdk-eval-timeout branch from f7fc9de to 79e9e62 Compare April 21, 2026 08:53
@Evangelink Evangelink changed the title from "Fix eval timeout for MSTest.Sdk scenario by pre-warming NuGet cache" to "Fix eval timeouts for migrate-mstest-v3-to-v4 scenarios" Apr 21, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the migrate-mstest-v3-to-v4 eval to avoid CI timeouts caused by slow first-time NuGet restores (notably for MSTest.Sdk-based projects) by pre-warming dependencies during scenario setup.

Changes:

  • Add setup.commands: ["dotnet restore"] to multiple scenarios so package restore happens before the agent session timeout window.
  • Increase timeouts for two existing scenarios to better accommodate slower CI environments.
  • Add a new Goal 10 eval scenario (plus fixture files) covering a sealed TestMethodAttribute override and TestTimeout/Timeout breaking changes.

| File | Description |
| --- | --- |
| tests/dotnet-test/migrate-mstest-v3-to-v4/fixtures/v3-sealed-testmethod/TestProject.csproj | New fixture project referencing MSTest 4.1.0 to support the added sealed-attribute scenario. |
| tests/dotnet-test/migrate-mstest-v3-to-v4/fixtures/v3-sealed-testmethod/PerformanceTests.cs | New fixture demonstrating Execute override + [Timeout(TestTimeout.Infinite)] usage for migration guidance. |
| tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml | Adds pre-restore setup commands, adjusts a couple timeouts, and introduces a new Goal 10 scenario. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 0

Resolve merge conflicts in eval.yaml:
- Keep detailed rubric and timeout:300 for Goal 3 (multiple breaking changes)
- Include main's new Goal 10 (verified package update with build and test)
- Use v3-sealed-testmethod fixture with dotnet restore for sealed scenario
- Include main's new Goals 12-15 (display name, IsInstanceOfType, behavioral changes, v3 recognition)
The plugin run timed out at 240s during local evaluation. Increase to
300s to match other complex multi-fix scenarios.
Copilot AI review requested due to automatic review settings April 21, 2026 17:25
Copilot AI (Contributor) left a comment

Pull request overview

Updates the migrate-mstest-v3-to-v4 evaluation suite to reduce CI timeouts by pre-warming NuGet restores, increasing scenario timeouts for longer baseline runs, and adding a new sealed TestMethodAttribute migration scenario.

Changes:

  • Add setup.commands: dotnet restore to several scenarios to reduce cold-cache NuGet latency.
  • Increase timeouts for the most complex scenarios to provide more headroom on slower CI runners.
  • Add a new “sealed custom TestMethodAttribute + Timeout(TestTimeout.Infinite)” fixture and evaluation scenario.

| File | Description |
| --- | --- |
| tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml | Adds restore pre-warming, adjusts timeouts, and introduces a new sealed-attribute migration scenario. |
| tests/dotnet-test/migrate-mstest-v3-to-v4/fixtures/v3-sealed-testmethod/TestProject.csproj | New fixture project used by the sealed TestMethodAttribute scenario. |
| tests/dotnet-test/migrate-mstest-v3-to-v4/fixtures/v3-sealed-testmethod/PerformanceTests.cs | New fixture code intentionally using v3-era APIs (Execute, TestTimeout.Infinite) to drive migration guidance. |

Copilot's findings

  • Files reviewed: 3/3 changed files
  • Comments generated: 4

Comment threads (4): tests/dotnet-test/migrate-mstest-v3-to-v4/eval.yaml
@Evangelink
Member Author

/evaluate

@Evangelink Evangelink merged commit dee829b into dotnet:main Apr 23, 2026
38 checks passed
@Evangelink Evangelink deleted the fix/mstest-sdk-eval-timeout branch April 23, 2026 11:33
github-actions Bot added a commit that referenced this pull request Apr 23, 2026
@github-actions
Contributor

Skill Validation Results

| Skill | Scenario | Quality | Skills Loaded | Overfit | Verdict |
| --- | --- | --- | --- | --- | --- |
| migrate-mstest-v3-to-v4 | Migrate custom TestMethodAttribute from Execute to ExecuteAsync | 1.0/5 → 1.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.06 | [1] |
| migrate-mstest-v3-to-v4 | Replace ExpectedExceptionAttribute with Assert.ThrowsExactly | 1.3/5 → 1.3/5 | ✅ migrate-mstest-v3-to-v4; tools: skill, report_intent, view, edit, bash / ✅ migrate-mstest-v3-to-v4; tools: report_intent, skill, view, edit, bash | ✅ 0.06 | [2] |
| migrate-mstest-v3-to-v4 | Fix multiple v4 breaking changes: Assert, ClassCleanup, TestContext, Timeout | 1.0/5 → 2.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Handle net6.0 target framework dropped in MSTest v4 | 1.0/5 → 1.0/5 | ⚠️ NOT ACTIVATED | ✅ 0.06 | [3] |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute CallerInfo constructor breaking change | 1.0/5 → 3.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Understand behavioral changes after MSTest v4 upgrade | 3.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Handle MSTest.Sdk and MTP changes in v4 | 2.0/5 → 3.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | [4] |
| migrate-mstest-v3-to-v4 | Full MSTest v3 to v4 migration with multiple breaking changes | 3.3/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | [5] |
| migrate-mstest-v3-to-v4 | Migrate MSTest.Sdk v3 project using ManagedType and TestTimeout | 3.0/5 → 4.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, edit, bash | ✅ 0.06 | [6] |
| migrate-mstest-v3-to-v4 | Verified MSTest v3 to v4 package update with build and test | 2.0/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Fix sealed custom TestMethodAttribute with Timeout changes | 1.0/5 → 3.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill, edit | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Fix TestMethodAttribute and TestMethod display name constructor | 3.7/5 → 3.7/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | [7] |
| migrate-mstest-v3-to-v4 | Fix Assert.IsInstanceOfType out parameter removal | 5.0/5 → 5.0/5 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | [8] |
| migrate-mstest-v3-to-v4 | Address TreatDiscoveryWarningsAsErrors and behavioral changes | 2.0/5 → 3.7/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | |
| migrate-mstest-v3-to-v4 | Correctly identify MSTest v3 project and recommend v4 migration | 4.7/5 → 5.0/5 🟢 | ✅ migrate-mstest-v3-to-v4; tools: skill | ✅ 0.06 | [9] |

[1] ⚠️ High run-to-run variance (CV=1.52) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=1.82) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -9.1% due to: tokens (38 → 32037), tool calls (0 → 3), time (28.8s → 47.7s)
[3] ⚠️ High run-to-run variance (CV=1.72) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -0.0% due to: efficiency metrics
[4] ⚠️ High run-to-run variance (CV=1.60) — consider re-running with --runs 5
[5] ⚠️ High run-to-run variance (CV=0.74) — consider re-running with --runs 5
[6] ⚠️ High run-to-run variance (CV=3.64) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -24.0% due to: completion (✓ → ✗), judgment, quality
[7] ⚠️ High run-to-run variance (CV=2.04) — consider re-running with --runs 5
[8] ⚠️ High run-to-run variance (CV=0.61) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -19.6% due to: judgment, quality, tokens (89780 → 155291)
[9] ⚠️ High run-to-run variance (CV=0.51) — consider re-running with --runs 5

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions
