dotnet-test: add unit-under-test + behaviors quality cue to code-testing-generator #646
Merged
Evangelink merged 4 commits on May 13, 2026
Experiment branch to isolate the impact of "invoke prompted subagents more often" from the impact of "make those subagents do richer work." Same baseline as dev/ykovalova/cta-prompt-tuning (main = 66628b6), but strips out every content/quality rule and keeps only the dispatch plumbing.

Comparison branch: dev/ykovalova/cta-prompt-tuning (HEAD: bd530be), which contains both the dispatch mechanics AND content/quality rules.

Files modified (3 vs 5 in cta-prompt-tuning):
- code-testing-generator.agent.md +110 lines
- code-testing-implementer.agent.md +11 lines
- code-testing-fixer.agent.md +1 line
- code-testing-researcher.agent.md UNTOUCHED (baseline)
- code-testing-planner.agent.md UNTOUCHED (baseline)

KEPT (invocation / dispatch mechanics):

Generator:
- Rule 1: every task() call MUST use agent_type "dotnet-test:code-testing-..." (without this, calls dispatch generic built-ins and never reach the named CTA agents)
- Rule 2: routing table -- which named agent for which job
- Rule 3: prefer one named-agent dispatch over many tool calls
- Rule 4: orchestrator MUST NOT edit/create test files itself (forces implementer dispatch)
- Rule 5: orchestrator MUST NOT run builds/tests via terminal (forces builder/tester dispatch)
- Rule 6: every run MUST dispatch the planner (no exceptions for "small" scope; Direct still goes through planner)
- Rule 7: every build/test failure MUST dispatch the fixer
- Step 1b: mandatory initial researcher dispatch (every strategy)
- Direct strategy rewritten: dispatches planner -> implementer -> builder -> tester -> fixer -> linter (was "Skip Steps 3-5, write tests inline")
- All Step 3/4/5/6/7/8/9 dispatches converted from runSubagent({agent:...}) to task({ agent_type: "dotnet-test:code-testing-...", name:..., prompt:...})
- Step 9 validator dispatch (forces builder dispatch for cleanup)
- Steps 6/7 mandatory builder/tester dispatch wrapper

Implementer:
- Section 5: "you MUST dispatch fixer for build errors" + no-inline-edit block (forces fixer dispatch on build failures)
- Section 6: "you MUST dispatch fixer for test failures" + no-inline-edit block (forces fixer dispatch on test failures)
- Section 7: "Format Code (mandatory if a lint command exists)" (was "Optional"; mandatory firing of linter)
- Rule 6: never declare SUCCESS while build/tests fail (gates SUCCESS on fixer dispatch)
- Rule 7: no inline test-file edits between failed dispatch and fixer

Fixer:
- Frontmatter description widened to advertise handling of failing tests (without this, the orchestrator's routing logic does not select the fixer for test failures, so even Rule 7's mandate produces no firing -- this is the change that took fixer firing from 0.00/inst to 0.39/inst in earlier iterations)

DROPPED (content / quality rules -- in cta-prompt-tuning, NOT here):

Generator:
- Test-strength rules embedded in implementer dispatch prompt
- Test-design rules embedded in implementer dispatch prompt (OFAT, mutation self-check, never mock subject under test)
- File-location rules embedded in implementer dispatch prompt
- TARGET ENTITIES / PHASE CHECKLIST / TEST TRACEABILITY blocks in implementer dispatch prompt
- CHECKLIST format spec in planner dispatch prompt
- Step 9 validator's detailed cleanup classification

Implementer:
- Section 4b "Verify CHECKLIST coverage" pre-completion check
- Section 8 "CHECKLIST COVERAGE" report block
- "Honor the CHECKLIST" rule

Fixer:
- "Process -- Failing Tests" section (5-step diagnosis flow)
- All anti-weakening / anti-skipping rules
- "Re-derive expected from production source" guidance

Planner:
- CHECKLIST format ("one item per TARGET BEHAVIOR, Source/Variants/Expected mandatory")
- "Test name from research.md conventions" rule
- "At least 2 phases" rule

Researcher:
- Section 8 "Extract Local Test Naming & Style Conventions"
- TARGET ENTITIES / TARGET BEHAVIORS / TEST INFRASTRUCTURE structure in research.md
- Test naming pattern extraction

WHAT THE SUBAGENTS WILL ACTUALLY DO:

The researcher / planner / implementer / fixer all operate at baseline behavior -- they receive the same prompts they receive in the upstream "vanilla" runs. The only difference vs vanilla is that the orchestrator ACTUALLY DISPATCHES THEM (where vanilla often inlines the work or skips sub-agent dispatch entirely).

EXPECTED COMPARISON:

- If quality on this branch is similar to or higher than dev/ykovalova/cta-prompt-tuning (bd530be), then "more dispatches" is the dominant quality lever and the content/quality rules in cta-prompt-tuning are adding marginal or noise-level value.
- If quality on this branch is materially lower than cta-prompt-tuning, then the content/quality rules are doing the heavy lifting and the dispatch mechanics alone are insufficient.
- If quality on this branch matches or exceeds vanilla but trails cta-prompt-tuning, then the dispatch mechanics provide a baseline lift and the content rules add an incremental quality layer on top.

Rubber-duck check passed (validated dispatch-vs-content classification; fixer frontmatter is routing metadata not a runtime gate; surviving dispatch prompts contain no dangling references to removed CHECKLIST / TARGET ENTITIES / TEST STRENGTH / naming-convention concepts).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
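Rule 1 in the commit message above lends itself to a mechanical check. The following is a minimal sketch, assuming a task() call can be represented as a plain dictionary with the `agent_type`, `name`, and `prompt` keys the message mentions; the function name `check_dispatch` and the dictionary representation are hypothetical and not part of the plugin.

```python
# Hypothetical sketch: check_dispatch and the dict shape are illustrative only.
REQUIRED_PREFIX = "dotnet-test:code-testing-"


def check_dispatch(call: dict) -> list:
    """Return a list of problems found in a task() call described as a dict."""
    problems = []
    agent_type = call.get("agent_type", "")
    # Rule 1: calls without the prefix would reach a generic built-in agent.
    if not agent_type.startswith(REQUIRED_PREFIX):
        problems.append(f"agent_type {agent_type!r} does not name a CTA agent")
    # The converted dispatches also carry a name and a prompt.
    for key in ("name", "prompt"):
        if not call.get(key):
            problems.append(f"missing {key}")
    return problems
```

A call such as `check_dispatch({"agent_type": "general-purpose", "name": "fix", "prompt": "..."})` would report one problem, since the agent type lacks the required prefix.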
Contributor
Pull request overview
Updates the dotnet-test CTA agent prompt set to improve test-quality outcomes by making the unit-under-test contract and behaviors explicit earlier in the pipeline, and by tightening dispatch/verification discipline across the orchestrator and sub-agents.
Changes:
- Add orchestrator-level dispatch discipline rules and a new “unit-under-test + behaviors” verification gate in `code-testing-generator`.
- Strengthen “no inline band-aid fixes” guidance in `code-testing-implementer` (mandatory fixer dispatch on failures; lint/format expectation).
- Expand `code-testing-fixer` metadata to include failing-test assertion correction (in addition to compilation errors).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| plugins/dotnet-test/agents/code-testing-generator.agent.md | Adds dispatch discipline rules and a researcher verification gate intended to improve test intent/behavior specificity. |
| plugins/dotnet-test/agents/code-testing-implementer.agent.md | Tightens implementer behavior on build/test failures (must dispatch fixer; no inline edits after failures) and makes linting conditional-mandatory. |
| plugins/dotnet-test/agents/code-testing-fixer.agent.md | Updates fixer front-matter to explicitly include failing-test fixes. |
Comments suppressed due to low confidence (1)
plugins/dotnet-test/agents/code-testing-generator.agent.md:165
- Step 1b already mandates an initial researcher dispatch that writes `.testagent/research.md`, but Step 3 then starts another researcher phase that also writes to `.testagent/research.md`. This duplicates work and can overwrite the contract/behaviors you just verified; clarify the flow by removing one of these steps, or by making Step 3 conditional or having it reference Step 1b’s output instead of re-dispatching unconditionally.
### Step 3: Research Phase
```text
task({
  agent_type: "dotnet-test:code-testing-researcher",
```
- Step 1b prompt: explicitly request unit-under-test (file:line) and behaviors so the verification gate rarely needs re-dispatch.
- Step 3: rename to 'Deep Research Phase', mark skipped for Direct strategy, and switch from overwriting research.md to extending it (no double-research).
- Step references: update '6-9' -> '6-10' and 'Step 9' -> 'Steps 9-10' in the strategy table and the All-strategies-MUST line, since reporting is Step 10.
- Step 6 builder prompt: drop '*.sln' glob (could expand to multiple args); use 'dotnet build --no-incremental' (auto-discovers .sln) per dotnet.md.
- Step 9: stop overloading the builder agent; perform diff/cleanup directly in the orchestrator (Rule 5 forbids inline build/test, not git/fs hygiene).
- Fixer agent: update mission text to cover failing tests and assertion correction (front-matter description already mentioned this; body now matches), with explicit no-Ignore/no-Skip/no-production-rewrite guardrails.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Evangelink approved these changes on May 13, 2026
Member
/evaluate
Contributor
⏭️ No skills to evaluate: no changed skills with tests were found in this PR.
What
Adds a quality cue to the orchestrator agent (`code-testing-generator`) and small companion changes in `code-testing-implementer` and `code-testing-fixer` that ask each subagent to extract the unit-under-test contract and the behaviors to test before writing or fixing tests.

The cue is orchestrator-level (a verification gate after dispatch) with bounded re-dispatch, not new content rules baked into planner/researcher/implementer prompts. Same structural pattern as the existing research-cue.
Why
CTA was losing rubric quality (`nice%`, `agg%`) to the vanilla CLI baseline despite matching coverage. The proximate cause (verified via per-instance patch inspection) was that the planner under-specified what the test was supposed to demonstrate, and the implementer wrote weak assertions (e.g. `assert.NotNil(err)` instead of `assert.Contains(err.Error(), "must not be empty")`).

This cue makes the contract explicit before code is written, without prescribing assertion style, so it's framework-agnostic and not benchmark-specific.
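To illustrate why the weak assertion loses quality, here is a small sketch that transposes the two assertion styles quoted above into Python. Everything in it is hypothetical: `validate_name` stands in for a unit under test, and `validate_name_regressed` is an invented regression that fails for the wrong reason.

```python
def validate_name(name: str) -> None:
    """Hypothetical unit under test: rejects empty names."""
    if not name:
        raise ValueError("name must not be empty")


def validate_name_regressed(name: str) -> None:
    """Hypothetical regression: always fails, and for an unrelated reason."""
    raise ValueError("internal error")


def weak_check(fn) -> bool:
    """Passes whenever any error is raised (the analogue of assert.NotNil(err))."""
    try:
        fn("")
    except Exception:
        return True
    return False


def strong_check(fn) -> bool:
    """Passes only when the error carries the message the contract promises."""
    try:
        fn("")
    except ValueError as err:
        return "must not be empty" in str(err)
    return False
```

Both checks pass for `validate_name`, but only `weak_check` still passes for `validate_name_regressed`. A test that knows the contract up front can catch the regression, while one that only checks for "some error" cannot.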