Skip to content

Fix coverage-analysis activation for plateau diagnosis prompts#647

Merged
Evangelink merged 12 commits into
mainfrom
dev/amauryleve/coverage-analysis-activation-fix
May 14, 2026
Merged

Fix coverage-analysis activation for plateau diagnosis prompts#647
Evangelink merged 12 commits into
mainfrom
dev/amauryleve/coverage-analysis-activation-fix

Conversation

@Evangelink
Copy link
Copy Markdown
Member

Problem

The "Coverage plateau diagnosis" eval scenario (coverage-analysis/eval.vally.yaml) shows skill activation issues. The prompt "My coverage is stuck at 75% and I can't get it higher. What's blocking me?" gets intercepted by code-testing-agent instead of coverage-analysis, because code-testing-agent's description includes "improve test coverage, add test coverage" — a close semantic match for the user's desire to raise coverage.

Changes

coverage-analysis SKILL.md:

  • Trim verbose implementation details (provider detection, ReportGenerator) that consumed description budget without aiding activation
  • Add explicit USE FOR keywords: coverage stuck, coverage plateau, can't increase coverage, what's blocking coverage

code-testing-agent SKILL.md:

  • Add diagnosing coverage plateaus or CRAP score computation (use coverage-analysis) to the DO NOT USE FOR boundary to prevent the test-generation skill from intercepting diagnostic prompts

Both descriptions fit within the 1,024 char per-skill and 15,000 char aggregate limits (validated via skill-validator).

coverage-analysis SKILL.md:
- Trim verbose implementation details (provider detection,
  ReportGenerator) that consumed description budget without
  aiding skill activation
- Add explicit USE FOR keywords: coverage stuck, coverage plateau,
  can't increase coverage, what's blocking coverage

code-testing-agent SKILL.md:
- Add 'diagnosing coverage plateaus or CRAP score computation
  (use coverage-analysis)' to DO NOT USE FOR boundary to prevent
  test-generation skill from intercepting diagnostic prompts
Copilot AI review requested due to automatic review settings May 13, 2026 16:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tunes .NET test skill activation so coverage plateau diagnosis prompts are routed to coverage-analysis instead of test generation.

Changes:

  • Shortens and refocuses coverage-analysis frontmatter description around plateau/risk diagnosis.
  • Adds explicit coverage plateau exclusion guidance to code-testing-agent.
Show a summary per file
File Description
plugins/dotnet-test/skills/coverage-analysis/SKILL.md Refines activation keywords and boundaries for coverage/CRAP analysis.
plugins/dotnet-test/skills/code-testing-agent/SKILL.md Adds a boundary redirect for coverage plateau and CRAP-related requests.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 2/2 changed files
  • Comments generated: 1

Comment thread plugins/dotnet-test/skills/code-testing-agent/SKILL.md Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Skill Coverage Report

Plugin Skill Covered Coverage
dotnet-test code-testing-agent 4/5 80%
Uncovered: dotnet-test/code-testing-agent
  • [WorkflowStep] Step 2: Invoke the Test Generator (line 81)

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 13, 2026 16:55
@Evangelink
Copy link
Copy Markdown
Member Author

/evaluate

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 1

Comment thread plugins/dotnet-test/skills/code-testing-agent/SKILL.md Outdated
github-actions Bot added a commit that referenced this pull request May 13, 2026
github-actions Bot added a commit that referenced this pull request May 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.7/5 → 1.0/5 🔴 ✅ coverage-analysis; tools: skill / ✅ coverage-analysis; tools: skill, create ✅ 0.10
coverage-analysis Run coverage from scratch without existing data 1.0/5 → 1.0/5 ✅ coverage-analysis; tools: skill, glob / ✅ coverage-analysis; tools: skill, glob, create ✅ 0.10 [1]
coverage-analysis Coverage plateau diagnosis 3.3/5 → 2.3/5 🔴 ✅ coverage-analysis; tools: skill, create ✅ 0.10 [2]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 3.0/5 → 3.3/5 🟢 ✅ code-testing-agent; tools: skill / ⚠️ NOT ACTIVATED ✅ 0.02 [3]

[1] ⚠️ High run-to-run variance (CV=0.57) — consider re-running with --runs 5. (Isolated) Quality unchanged but weighted score is -18.4% due to: judgment, tokens (45430 → 79699), quality
[2] ⚠️ High run-to-run variance (CV=1.07) — consider re-running with --runs 5
[3] ⚠️ High run-to-run variance (CV=0.90) — consider re-running with --runs 5. (Isolated) Quality improved but weighted score is -19.4% due to: judgment, quality, tokens (1348036 → 1503735)

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

▶ Sessions Visualisation -- interactive replay of all evaluation sessions

@Evangelink Evangelink enabled auto-merge (squash) May 13, 2026 17:37
…olated mode

Address eval regressions reported on PR #647 (run 25813728646):

1. code-testing-agent: `Generate tests for ContosoUniversity ASP.NET Core MVC app`
   was NOT ACTIVATED in plugin mode (detectedSkills=[], skillEventCount=0,
   invokedAgents=[]). The model bypassed the skill system entirely.

   - SKILL.md description: restructure to use the proven `Use when user says ...`
     pattern with quoted trigger phrases (matching the run-tests skill that
     consistently activates), make the link to the code-testing-generator
     sub-agent explicit, and tighten DO NOT USE FOR clauses.
   - eval prompt (eval.yaml + eval.vally.yaml): make the request
     pipeline-shaped (`project-wide, multi-file test generation task`,
     `scaffold a new test project`) so the model recognizes it as multi-step
     work that benefits from the orchestrated pipeline. Explicitly request
     coverlet.collector + a Cobertura XML run so rubric criterion 1
     (`high line coverage as reported by the Cobertura XML in TestResults/`)
     becomes achievable without overfitting.

2. code-testing-tester agent + code-testing-extensions/dotnet.md: open a
   scoped exception to the `skip coverage tools` rule. Default behavior
   stays the same, but when the user/harness explicitly asks for a
   Cobertura/XML coverage artifact, the agent may add coverlet.collector
   to the generated test csproj so the harness's coverage command produces
   output. The agent still does not run the coverage command itself.

3. coverage-analysis SKILL.md: add a `User-visible output is mandatory`
   guard at the top of the Workflow section. The latest eval showed isolated
   mode producing literally `(no output)` in 2 of 3 scenarios — the agent
   ran Compute-CrapScores.ps1 / Extract-MethodCoverage.ps1 / ReportGenerator
   in parallel, then the session ended without ever surfacing findings.
   The guard tells the agent to always return a partial summary instead of
   ending silent, and to deprioritize ReportGenerator HTML when budget is
   tight. (Plugin-mode quality is already strong: 4.3 / 4.3 / 5.0 — no
   regression risk there.)

Aggregate dotnet-test plugin description size: 14,925 chars (limit 15,000).
skill-validator check passes (22 skills, 11 agents, 1 plugin); markdownlint
passes for all 4 modified files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Evangelink
Copy link
Copy Markdown
Member Author

Pushed 174f9e56b to address the eval regressions reported in the previous comment.

What changed

code-testing-agent activation (was NOT ACTIVATED in plugin mode for ContosoUniversity)

  • plugins/dotnet-test/skills/code-testing-agent/SKILL.md — restructure description to use the proven Use when user says "..." pattern with quoted trigger phrases (matching run-tests), make the link to the code-testing-generator sub-agent explicit, and tighten DO NOT USE FOR.
  • tests/dotnet-test/code-testing-agent/eval.yaml + eval.vally.yaml — make the prompt pipeline-shaped (project-wide, multi-file test generation task, scaffold a new test project) and explicitly request coverlet.collector + a Cobertura XML run so rubric criterion 1 (high line coverage as reported by the Cobertura XML in TestResults/) becomes achievable.

coverage-analysis isolated-mode (no output) regression

  • plugins/dotnet-test/skills/coverage-analysis/SKILL.md — add a User-visible output is mandatory guard at the top of the ## Workflow section. The previous run showed isolated mode producing literally (no output) in 2 of 3 scenarios (agent ran scripts then ended silent). Plugin-mode quality is already strong (4.3 / 4.3 / 5.0) — this only targets the silent-end failure mode.

Allow coverlet.collector when explicitly required (rubric criterion 1)

  • plugins/dotnet-test/agents/code-testing-tester.agent.md and plugins/dotnet-test/skills/code-testing-extensions/extensions/dotnet.md — open a scoped exception to the existing skip coverage tools rule. Default behavior unchanged; the exception only kicks in when the user/harness asks for coverlet.collector or --collect:"XPlat Code Coverage".

Verification

  • skill-validator check --plugin ./plugins/dotnet-test ✅ (22 skills, 11 agents, 1 plugin)
  • Aggregate dotnet-test plugin description size: 14,925 / 15,000 chars
  • code-testing-agent description: 904 / 1,024 chars
  • markdownlint-cli2 ✅ on all 4 modified files

@Evangelink
Copy link
Copy Markdown
Member Author

/evaluate

Copilot AI review requested due to automatic review settings May 14, 2026 09:30
auto-merge was automatically disabled May 14, 2026 09:30

Head branch was pushed to by a user without write access

@Evangelink Evangelink review requested due to automatic review settings May 14, 2026 09:30
github-actions Bot added a commit that referenced this pull request May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 5.0/5 → 5.0/5 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent, glob, grep ✅ 0.02
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.0/5 → 1.0/5 🔴 ✅ coverage-analysis; tools: skill, view / ✅ coverage-analysis; tools: skill, view, create ✅ 0.10
coverage-analysis Run coverage from scratch without existing data 1.0/5 → 2.3/5 🟢 ✅ coverage-analysis; tools: skill, create / ✅ coverage-analysis; tools: skill, create, glob ✅ 0.10 [1]
coverage-analysis Coverage plateau diagnosis 3.3/5 → 1.0/5 🔴 ✅ coverage-analysis; tools: skill / ✅ coverage-analysis; tools: skill, create ✅ 0.10

[1] ⚠️ High run-to-run variance (CV=0.67) — consider re-running with --runs 5

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

…ortGenerator

The previous workflow encouraged the agent to run `dotnet tool install` for
ReportGenerator in parallel with the CRAP scoring scripts (Phase 2 "Steps
3 and 4 in parallel" + Phase 3 "Steps 5 and 6 in parallel"). In isolated
mode that pattern reliably crashed the session with "Failed to persist
session events: timeout while waiting for mutex to become available"
right after the scripts returned valid data, so the agent never produced
the user-facing summary.

Restructure the workflow into 5 phases:

- Phase 1 (Setup) - unchanged
- Phase 2 (Test execution) - skip when Cobertura XML already exists
- Phase 3 (Analysis) - run only the two PowerShell scripts, no RG
- Phase 4 (User-facing summary) - MANDATORY, must be the next assistant
  response after Phase 3, before any RG work; also save
  coverage-analysis.md as a secondary follow-up
- Phase 5 (ReportGenerator HTML/CSV) - strictly optional, post-summary,
  skipped by default for existing-Cobertura and plateau-diagnosis paths

Also update references/output-format.md so the Reports section marks RG
artifacts as "Not generated (optional - request HTML reports to enable)"
when Phase 5 has not run, and update references/guidelines.md so the
"show and open the markdown report" rule explicitly defers to the
user-facing assistant response.

Targets the isolated-mode regressions in PR #647 eval:
- Project-wide coverage with existing Cobertura: 1.0/5 -> expected 3+
- Coverage plateau diagnosis: 1.0/5 -> expected 3+
- Run coverage from scratch: 2.3/5 -> expected steady or up

Verified: skill-validator check --plugin ./plugins/dotnet-test passes;
markdownlint-cli2 clean on all 3 modified files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 14, 2026 14:41
@Evangelink
Copy link
Copy Markdown
Member Author

/evaluate

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

  • Files reviewed: 8/8 changed files
  • Comments generated: 4

Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md Outdated
Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md
Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md Outdated
Copilot AI review requested due to automatic review settings May 14, 2026 16:03
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot's findings

Comments suppressed due to low confidence (1)

plugins/dotnet-test/skills/coverage-analysis/SKILL.md:75

  • This says Phases 1–4 are required, but the preceding instruction explicitly skips Phase 2 when existing Cobertura XML is available. That contradiction can cause agents to run unnecessary dotnet test despite the existing-data path; clarify that Phase 2 is conditional.
The workflow runs in five phases. Phases 1–4 are required; Phase 5 (ReportGenerator HTML/CSV reports) is strictly optional and runs **after** the user-facing summary has been delivered. Do not parallelize Phase 5 with earlier phases — the heavy `dotnet tool install` for ReportGenerator can crash the session before Phase 4 completes.
  • Files reviewed: 9/9 changed files
  • Comments generated: 8

Comment thread plugins/dotnet-test/skills/coverage-analysis/scripts/Compute-CrapScores.ps1 Outdated
Comment thread plugins/dotnet-test/skills/coverage-analysis/references/output-format.md Outdated
Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md Outdated
Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md
Comment thread plugins/dotnet-test/skills/code-testing-agent/SKILL.md
Comment thread tests/dotnet-test/code-testing-agent/eval.yaml
Comment thread tests/dotnet-test/code-testing-agent/eval.vally.yaml
Comment thread plugins/dotnet-test/skills/coverage-analysis/SKILL.md Outdated
@Evangelink
Copy link
Copy Markdown
Member Author

@copilot address all review comments

auto-merge was automatically disabled May 14, 2026 16:25

Head branch was pushed to by a user without write access

Copilot AI review requested due to automatic review settings May 14, 2026 16:26
@Evangelink Evangelink review requested due to automatic review settings May 14, 2026 16:26
github-actions Bot added a commit that referenced this pull request May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.3/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, create, view ✅ 0.16
coverage-analysis Run coverage from scratch without existing data 4.0/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, create, glob / ✅ coverage-analysis; tools: skill, create ✅ 0.16
coverage-analysis Coverage plateau diagnosis 3.0/5 → 4.3/5 🟢 ✅ coverage-analysis; tools: skill, create, read_bash, stop_bash / ✅ coverage-analysis; tools: skill, create ✅ 0.16 [1]
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 5.0/5 → 4.3/5 🔴 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, grep, glob, read_agent ✅ 0.02 [2]

[1] ⚠️ High run-to-run variance (CV=2.01) — consider re-running with --runs 5
[2] ⚠️ High run-to-run variance (CV=0.67) — consider re-running with --runs 5

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

Copilot AI review requested due to automatic review settings May 14, 2026 16:28
@Evangelink Evangelink review requested due to automatic review settings May 14, 2026 16:28
Copilot AI review requested due to automatic review settings May 14, 2026 16:28
@Evangelink Evangelink review requested due to automatic review settings May 14, 2026 16:28
Copy link
Copy Markdown
Contributor

Copilot AI commented May 14, 2026

@copilot address all review comments

Done — I addressed all unresolved review comments in commits 18c1092, bbe0f00, dcc86f6, and 3178fc0.

@Evangelink Evangelink enabled auto-merge (squash) May 14, 2026 17:06
@Evangelink
Copy link
Copy Markdown
Member Author

/evaluate

@Evangelink Evangelink merged commit 3d59e44 into main May 14, 2026
35 of 37 checks passed
@Evangelink Evangelink deleted the dev/amauryleve/coverage-analysis-activation-fix branch May 14, 2026 17:41
github-actions Bot added a commit that referenced this pull request May 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Skill Validation Results

Skill Scenario Quality Skills Loaded Overfit Verdict
code-testing-agent Generate tests for ContosoUniversity ASP.NET Core MVC app 5.0/5 → 4.7/5 🔴 ✅ code-testing-agent; tools: skill / ✅ code-testing-agent; code-testing-extensions; tools: skill, task, read_agent, read_bash ✅ 0.02 [1]
coverage-analysis Project-wide coverage analysis with existing Cobertura data 2.0/5 → 4.7/5 🟢 ✅ coverage-analysis; tools: skill, view, read_bash, stop_bash, create / ✅ coverage-analysis; tools: skill, view, create ✅ 0.16
coverage-analysis Run coverage from scratch without existing data 4.0/5 → 5.0/5 🟢 ✅ coverage-analysis; tools: skill, read_bash, stop_bash, create, glob / ✅ coverage-analysis; tools: skill, create, glob ✅ 0.16
coverage-analysis Coverage plateau diagnosis 3.3/5 → 4.7/5 🟢 ✅ coverage-analysis; tools: skill, create, view / ✅ coverage-analysis; tools: skill, read_bash, stop_bash, create, view ✅ 0.16

[1] ⚠️ High run-to-run variance (CV=1.32) — consider re-running with --runs 5

Model: claude-opus-4.6 | Judge: claude-opus-4.6

🔍 Full Results - additional metrics and failure investigation steps

Evangelink added a commit that referenced this pull request May 15, 2026
…on (#652)

Dashboard data showed coverage-analysis failing to activate in 8/10
recent scheduled plugin-mode runs for the 'Coverage plateau diagnosis'
scenario, while activating reliably in isolated mode. PR #647 added
positive triggers to coverage-analysis but did not address sibling
attention competition.

The crap-score description matched the plateau prompt almost as well as
coverage-analysis (it advertised 'evaluate whether complex methods have
sufficient test coverage' + 'Requires code coverage data (Cobertura
XML)') without redirecting project-wide / stuck-coverage diagnosis to
coverage-analysis. With 22 sibling skills competing for attention this
overlap is enough to suppress activation altogether.

Tighten the crap-score frontmatter to:
- Scope positive triggers to a named method, class, or single source
  file (the actual eval surface — see tests/dotnet-test/crap-score/
  eval.yaml, all 3 scenarios target OrderService.cs).
- Add explicit DO NOT USE FOR redirects covering project-wide coverage
  analysis, coverage plateau / stuck coverage, what's blocking
  coverage, and where to add tests across a project — all of which
  point at coverage-analysis.

skill-validator check passes (22 skills, 11 agents, 1 plugin).
Aggregate dotnet-test description size: 14,932 chars (limit 15,000).
markdownlint passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants