
dotnet-test: add unit-under-test + behaviors quality cue to code-testing-generator #646

Merged: Evangelink merged 4 commits into dotnet:main from YuliiaKovalova:dev/ykovalova/cta-quality-cue on May 13, 2026


Member

YuliiaKovalova commented May 13, 2026

What

Adds a quality cue to the orchestrator agent (code-testing-generator), with small companion changes in code-testing-implementer and code-testing-fixer. The cue asks each subagent to extract the unit-under-test contract and the behaviors to test before writing or fixing tests.

The cue is orchestrator-level (a verification gate after dispatch) with bounded re-dispatch, not new content rules baked into planner/researcher/implementer prompts. Same structural pattern as the existing research-cue.
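A minimal sketch of what a verification gate with bounded re-dispatch could look like. It is illustrative only: the function names, the section markers the gate looks for, and the retry bound are all assumptions, since the PR text does not spell them out.

```python
from typing import Callable, Tuple

MAX_REDISPATCH = 2  # assumed bound; the PR does not state the actual limit


def has_contract_and_behaviors(research: str) -> bool:
    """Hypothetical gate check: both section markers must appear in the output."""
    text = research.lower()
    return "unit under test:" in text and "behaviors:" in text


def dispatch_with_gate(dispatch: Callable[[str], str], prompt: str) -> Tuple[str, int]:
    """Dispatch once, then re-dispatch at most MAX_REDISPATCH times if the gate fails.

    Returns the last subagent output and the number of re-dispatches used.
    """
    output = dispatch(prompt)
    retries = 0
    while not has_contract_and_behaviors(output) and retries < MAX_REDISPATCH:
        retries += 1
        # Re-ask with the missing pieces named explicitly.
        output = dispatch(
            prompt + "\nState the unit under test (file:line) and the behaviors to test."
        )
    return output, retries
```

The bound matters because a gate that loops until it is satisfied could stall the whole run on a subagent that never produces the requested sections.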

Why

CTA was losing rubric quality (nice%, agg%) to the vanilla CLI baseline despite matching coverage. The proximate cause (verified via per-instance patch inspection) was that the planner under-specified what the test was supposed to demonstrate, and the implementer wrote weak assertions (e.g. assert.NotNil(err) instead of assert.Contains(err.Error(), "must not be empty")).

This cue makes the contract explicit before code is written, without prescribing assertion style, so it is framework-agnostic and not benchmark-specific.
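To illustrate the weak-versus-strong assertion gap described above, here is a small sketch. The validator, its error message, and the regressed variant are hypothetical stand-ins, not code from the benchmark. The weak check mirrors `assert.NotNil(err)` and the strong check mirrors `assert.Contains(err.Error(), "must not be empty")`.

```python
def validate_name(name: str) -> None:
    """Hypothetical unit under test: rejects empty names with a specific message."""
    if not name:
        raise ValueError("name must not be empty")


def broken_validate_name(name: str) -> None:
    """A regressed variant that fails for an unrelated reason."""
    raise ValueError("unexpected internal state")


def weak_test(fn) -> bool:
    """Passes whenever any ValueError is raised, regardless of why."""
    try:
        fn("")
    except ValueError:
        return True
    return False


def strong_test(fn) -> bool:
    """Passes only if the error message states the contract being tested."""
    try:
        fn("")
    except ValueError as err:
        return "must not be empty" in str(err)
    return False
```

Under these assumptions the weak check accepts both implementations, while the strong check rejects the regressed one. A test can only catch that regression if the contract was written down before the test was.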

YuliiaKovalova and others added 2 commits May 11, 2026 11:21
Experiment branch to isolate the impact of "invoke prompted subagents
more often" from the impact of "make those subagents do richer work."
Same baseline as dev/ykovalova/cta-prompt-tuning (main = 66628b6), but
strips out every content/quality rule and keeps only the dispatch
plumbing.

Comparison branch: dev/ykovalova/cta-prompt-tuning (HEAD: bd530be)
which contains both the dispatch mechanics AND content/quality rules.

Files modified (3 vs 5 in cta-prompt-tuning):
- code-testing-generator.agent.md  +110 lines
- code-testing-implementer.agent.md +11 lines
- code-testing-fixer.agent.md      +1  line
- code-testing-researcher.agent.md  UNTOUCHED (baseline)
- code-testing-planner.agent.md     UNTOUCHED (baseline)

KEPT (invocation / dispatch mechanics):

Generator:
- Rule 1: every task() call MUST use agent_type "dotnet-test:code-testing-..."
  (without this, calls dispatch generic built-ins and never reach the
  named CTA agents)
- Rule 2: routing table -- which named agent for which job
- Rule 3: prefer one named-agent dispatch over many tool calls
- Rule 4: orchestrator MUST NOT edit/create test files itself
  (forces implementer dispatch)
- Rule 5: orchestrator MUST NOT run builds/tests via terminal
  (forces builder/tester dispatch)
- Rule 6: every run MUST dispatch the planner (no exceptions for "small"
  scope; Direct still goes through planner)
- Rule 7: every build/test failure MUST dispatch the fixer
- Step 1b: mandatory initial researcher dispatch (every strategy)
- Direct strategy rewritten: dispatches planner -> implementer -> builder
  -> tester -> fixer -> linter (was "Skip Steps 3-5, write tests inline")
- All Step 3/4/5/6/7/8/9 dispatches converted from runSubagent({agent:...})
  to task({ agent_type: "dotnet-test:code-testing-...", name:..., prompt:...})
- Step 9 validator dispatch (forces builder dispatch for cleanup)
- Steps 6/7 mandatory builder/tester dispatch wrapper
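
Rules 1 and 2 above can be sketched as a routing table plus a prefix check. This is an illustration only: the job keys and the helper names are invented here, and the builder, tester, and linter agent names are assumed to follow the same "dotnet-test:code-testing-..." pattern as the agents named in this PR.

```python
AGENT_PREFIX = "dotnet-test:code-testing-"

# Hypothetical routing table (Rule 2): job -> named agent suffix.
ROUTES = {
    "research": "researcher",
    "plan": "planner",
    "implement": "implementer",
    "build": "builder",   # assumed agent name
    "test": "tester",     # assumed agent name
    "fix": "fixer",
    "lint": "linter",     # assumed agent name
}


def build_task_call(job: str, name: str, prompt: str) -> dict:
    """Build the argument object for a task() call routed to a named agent."""
    if job not in ROUTES:
        raise KeyError(f"no named agent registered for job {job!r}")
    return {"agent_type": AGENT_PREFIX + ROUTES[job], "name": name, "prompt": prompt}


def is_named_cta_dispatch(call: dict) -> bool:
    """Rule 1 check: the call must target a named agent, not a generic built-in."""
    return str(call.get("agent_type", "")).startswith(AGENT_PREFIX)
```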

Implementer:
- Section 5: "you MUST dispatch fixer for build errors" + no-inline-edit
  block (forces fixer dispatch on build failures)
- Section 6: "you MUST dispatch fixer for test failures" + no-inline-edit
  block (forces fixer dispatch on test failures)
- Section 7: "Format Code (mandatory if a lint command exists)"
  (was "Optional"; mandatory firing of linter)
- Rule 6: never declare SUCCESS while build/tests fail (gates SUCCESS on
  fixer dispatch)
- Rule 7: no inline test-file edits between failed dispatch and fixer

Fixer:
- Frontmatter description widened to advertise handling of failing tests
  (without this, the orchestrator's routing logic does not select the
  fixer for test failures, so even Rule 7's mandate produces no firing
  -- this is the change that took fixer firing from 0.00/inst to 0.39/inst
  in earlier iterations)

DROPPED (content / quality rules -- in cta-prompt-tuning, NOT here):

Generator:
- Test-strength rules embedded in implementer dispatch prompt
- Test-design rules embedded in implementer dispatch prompt (OFAT,
  mutation self-check, never mock subject under test)
- File-location rules embedded in implementer dispatch prompt
- TARGET ENTITIES / PHASE CHECKLIST / TEST TRACEABILITY blocks in
  implementer dispatch prompt
- CHECKLIST format spec in planner dispatch prompt
- Step 9 validator's detailed cleanup classification

Implementer:
- Section 4b "Verify CHECKLIST coverage" pre-completion check
- Section 8 "CHECKLIST COVERAGE" report block
- "Honor the CHECKLIST" rule

Fixer:
- "Process -- Failing Tests" section (5-step diagnosis flow)
- All anti-weakening / anti-skipping rules
- "Re-derive expected from production source" guidance

Planner:
- CHECKLIST format ("one item per TARGET BEHAVIOR, Source/Variants/
  Expected mandatory")
- "Test name from research.md conventions" rule
- "At least 2 phases" rule

Researcher:
- Section 8 "Extract Local Test Naming & Style Conventions"
- TARGET ENTITIES / TARGET BEHAVIORS / TEST INFRASTRUCTURE structure
  in research.md
- Test naming pattern extraction

WHAT THE SUBAGENTS WILL ACTUALLY DO:

The researcher / planner / implementer / fixer all operate at baseline
behavior -- they receive the same prompts they receive in the upstream
"vanilla" runs. The only difference vs vanilla is that the orchestrator
ACTUALLY DISPATCHES THEM (where vanilla often inlines the work or skips
sub-agent dispatch entirely).

EXPECTED COMPARISON:

If quality on this branch is similar to or higher than dev/ykovalova/
cta-prompt-tuning (bd530be), then "more dispatches" is the dominant
quality lever and the content/quality rules in cta-prompt-tuning are
adding marginal or noise-level value.

If quality on this branch is materially lower than cta-prompt-tuning,
then the content/quality rules are doing the heavy lifting and the
dispatch mechanics alone are insufficient.

If quality on this branch matches or exceeds vanilla but trails
cta-prompt-tuning, then the dispatch mechanics provide a baseline lift
and the content rules add an incremental quality layer on top.
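
The three outcomes above can be written as a small decision function. This is a sketch under stated assumptions: the tolerance for "similar to" is invented, the scores are any comparable quality metric, and the overlap between the second and third cases is resolved here by checking the vanilla baseline first, which the text above does not specify.

```python
def interpret(branch: float, tuned: float, vanilla: float, tol: float = 1.0) -> str:
    """Classify an experiment result.

    branch  -- quality on this dispatch-only branch
    tuned   -- quality on cta-prompt-tuning
    vanilla -- quality on the vanilla baseline
    tol     -- assumed margin within which two scores count as "similar"
    """
    if branch >= tuned - tol:
        # Similar to or higher than cta-prompt-tuning.
        return "dispatch-dominant"
    if branch >= vanilla:
        # Trails cta-prompt-tuning but matches or exceeds vanilla.
        return "dispatch-lift-plus-content-layer"
    # Materially lower than cta-prompt-tuning and below vanilla.
    return "content-dominant"
```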

Rubber-duck check passed (validated dispatch-vs-content classification;
fixer frontmatter is routing metadata not a runtime gate; surviving
dispatch prompts contain no dangling references to removed CHECKLIST /
TARGET ENTITIES / TEST STRENGTH / naming-convention concepts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@YuliiaKovalova YuliiaKovalova marked this pull request as ready for review May 13, 2026 10:20
Contributor

Copilot AI left a comment


Pull request overview

Updates the dotnet-test CTA agent prompt set to improve test-quality outcomes by making the unit-under-test contract and behaviors explicit earlier in the pipeline, and by tightening dispatch/verification discipline across the orchestrator and sub-agents.

Changes:

  • Add orchestrator-level dispatch discipline rules and a new “unit-under-test + behaviors” verification gate in code-testing-generator.
  • Strengthen “no inline band-aid fixes” guidance in code-testing-implementer (mandatory fixer dispatch on failures; lint/format expectation).
  • Expand code-testing-fixer metadata to include failing-test assertion correction (in addition to compilation errors).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| plugins/dotnet-test/agents/code-testing-generator.agent.md | Adds dispatch discipline rules and a researcher verification gate intended to improve test intent/behavior specificity. |
| plugins/dotnet-test/agents/code-testing-implementer.agent.md | Tightens implementer behavior on build/test failures (must dispatch fixer; no inline edits after failures) and makes linting conditional-mandatory. |
| plugins/dotnet-test/agents/code-testing-fixer.agent.md | Updates fixer front-matter to explicitly include failing-test fixes. |

Comments suppressed due to low confidence (1)

plugins/dotnet-test/agents/code-testing-generator.agent.md:165

  • Step 1b already mandates an initial researcher dispatch that writes .testagent/research.md, but Step 3 then starts another researcher phase that also writes to .testagent/research.md. This duplicates work and can overwrite the contract/behaviors you just verified; clarify the flow by removing one of these steps or making Step 3 conditional / reference Step 1b’s output instead of re-dispatching unconditionally.
### Step 3: Research Phase

```text
task({
  agent_type: "dotnet-test:code-testing-researcher",
```


Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated
Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated
Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated
Comment thread plugins/dotnet-test/agents/code-testing-generator.agent.md Outdated
Comment thread plugins/dotnet-test/agents/code-testing-fixer.agent.md
- Step 1b prompt: explicitly request unit-under-test (file:line) and behaviors
  so the verification gate rarely needs re-dispatch.
- Step 3: rename to 'Deep Research Phase', mark skipped for Direct strategy,
  and switch from overwriting research.md to extending it (no double-research).
- Step references: update '6-9' -> '6-10' and 'Step 9' -> 'Steps 9-10' in the
  strategy table and the All-strategies-MUST line, since reporting is Step 10.
- Step 6 builder prompt: drop '*.sln' glob (could expand to multiple args);
  use 'dotnet build --no-incremental' (auto-discovers .sln) per dotnet.md.
- Step 9: stop overloading the builder agent; perform diff/cleanup directly
  in the orchestrator (Rule 5 forbids inline build/test, not git/fs hygiene).
- Fixer agent: update mission text to cover failing tests and assertion
  correction (front-matter description already mentioned this; body now
  matches), with explicit no-Ignore/no-Skip/no-production-rewrite guardrails.
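
The '*.sln' concern in the Step 6 item can be illustrated without invoking the build. The sketch below assumes a hypothetical repository root containing two solution files and only shows how the argument list would look after glob expansion. It does not run `dotnet`.

```python
import glob
import os
import tempfile
from typing import List


def build_argv_with_glob(repo_dir: str) -> List[str]:
    """Mimic shell expansion of '*.sln': every match becomes its own argument."""
    matches = sorted(glob.glob(os.path.join(repo_dir, "*.sln")))
    return ["dotnet", "build"] + [os.path.basename(m) for m in matches]


def build_argv_without_glob() -> List[str]:
    """The form the commit switches to: no path argument at all."""
    return ["dotnet", "build", "--no-incremental"]


def demo() -> List[str]:
    """Create two hypothetical solution files and show the expanded arguments."""
    with tempfile.TemporaryDirectory() as repo:
        for name in ("App.sln", "Tools.sln"):
            open(os.path.join(repo, name), "w").close()
        return build_argv_with_glob(repo)
```

With two solution files present, the glob form yields two positional arguments, which is the multi-argument expansion the commit message warns about.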

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@YuliiaKovalova YuliiaKovalova removed the request for review from JanKrivanek May 13, 2026 11:34
@Evangelink
Member

/evaluate

@github-actions
Contributor

⏭️ No skills to evaluate — no changed skills with tests were found in this PR. View workflow run

@Evangelink Evangelink merged commit f1b09eb into dotnet:main May 13, 2026
37 checks passed

3 participants