Problem
A significant fraction of Copilot-generated PRs are closed without merging, and `fix:` PRs are the dominant category. This points to a systemic gap in pre-flight validation: agent sessions complete and produce a PR, but the implementation does not pass CI or meet reviewer standards. Each wasted PR represents a full agent session (typically 20–60 minutes) that delivers no value.
Evidence
- Analysis window: 2026-04-21 to 2026-05-04
- Sessions analyzed: 50 (metadata only; no event logs available)
- Key metrics and examples:
  - 1000 Copilot PRs analyzed in total; 215 closed without merging (21.5% closure rate)
  - Of the 215 closed-unmerged PRs, 165 carry the `fix:` prefix (77% of all closures)
  - The overall merge success rate is 78.4% (779 merged / 994 non-open PRs)
  - Second-largest closed category: `feat:` (104 closures visible in title scan)
  - `fix:` PRs are the highest-volume PR type and the highest-volume closure type, which suggests that quick fix tasks are particularly prone to insufficient validation before a PR is opened
  - Zero session logs were available to diagnose which validation steps (lint, tests, build, fmt) failed most often
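The percentages in this list can be rechecked from the raw counts. The following is a minimal sketch that uses only the counts quoted above; the constant and function names are illustrative and not part of any existing tooling. Note that 779/994 evaluates to about 78.4%.

```python
# Counts copied from the Evidence list; nothing here is fetched from GitHub.
TOTAL_PRS = 1000
CLOSED_UNMERGED = 215
MERGED = 779
FIX_CLOSED = 165


def pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# PRs that reached a final state (merged or closed without merging).
non_open = MERGED + CLOSED_UNMERGED

closure_rate = pct(CLOSED_UNMERGED, TOTAL_PRS)  # closed-unmerged share of all PRs
merge_rate = pct(MERGED, non_open)              # merged share of non-open PRs
fix_share = pct(FIX_CLOSED, CLOSED_UNMERGED)    # fix: share of closures

print(f"closure rate: {closure_rate}%")
print(f"merge rate:   {merge_rate}%")
print(f"fix: share:   {fix_share}%")
```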
Proposed Change
- Enforce `make agent-finish` as a mandatory pre-PR gate in the Copilot coding agent workflow instructions (AGENTS.md already documents this, but sessions may skip it under time pressure). Add a runtime enforcement step in the agent harness that runs `make fmt && make test-unit` and blocks `create_pull_request` if either fails.
- Add a lightweight pre-flight check step in the workflow that runs `make build && make fmt` immediately after the first code edit (not just at the end), so compile errors surface early rather than after long exploration phases.
- Instrument closure reasons by adding a label system for closed PRs (e.g., `closed:ci-failure`, `closed:reviewer-rejected`) so future optimization runs can triage the 215 closures by actual root cause rather than by title-prefix heuristics.
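One possible shape for the runtime enforcement step is sketched below. It is an illustration under stated assumptions, not the harness's actual API: `run_gate` and `maybe_create_pull_request` are hypothetical names, and `create_pr` stands in for whatever callable the harness uses to open a PR. Only the `make fmt` and `make test-unit` targets come from the proposal. The commands run in order and stop at the first failure, mirroring `&&`.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

# Gate commands named in the proposal, run in order.
DEFAULT_GATE = (("make", "fmt"), ("make", "test-unit"))


def run_gate(
    commands: Sequence[Sequence[str]] = DEFAULT_GATE,
) -> Optional[Sequence[str]]:
    """Run each command; return the first one that fails, or None if all pass."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, check=False)
        except FileNotFoundError:
            # A missing executable counts as a failed gate step.
            return cmd
        if result.returncode != 0:
            return cmd
    return None


def maybe_create_pull_request(
    create_pr: Callable[[], None],
    commands: Sequence[Sequence[str]] = DEFAULT_GATE,
) -> bool:
    """Call create_pr() only when every gate command succeeds (hypothetical hook)."""
    failed = run_gate(commands)
    if failed is not None:
        print("PR blocked; gate step failed:", " ".join(failed), file=sys.stderr)
        return False
    create_pr()
    return True
```

The same `run_gate` helper could serve the early pre-flight check by passing `(("make", "build"), ("make", "fmt"))` after the first code edit.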
Expected Impact
- Reduce the `fix:` PR closure rate by catching the most common failure modes (formatting, build, failing tests) before a PR is opened
- Save 1–3 agent sessions per week currently wasted on PRs that fail CI immediately after opening
- Enable data-driven triage of future closure events with labeled root causes
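Once closed PRs carry `closed:*` labels, triage could reduce to a simple tally. The sketch below assumes each PR is available as a dictionary with a `labels` list of strings; that record shape and the `unlabeled` bucket are assumptions made for illustration, not an existing export format.

```python
from collections import Counter
from typing import Iterable, Mapping

# Prefix shared by the proposed labels (closed:ci-failure, closed:reviewer-rejected).
LABEL_PREFIX = "closed:"


def tally_closure_reasons(prs: Iterable[Mapping]) -> Counter:
    """Count closed PRs per closed:* label; PRs with none count as 'unlabeled'."""
    counts: Counter = Counter()
    for pr in prs:
        reasons = [
            label for label in pr.get("labels", [])
            if label.startswith(LABEL_PREFIX)
        ]
        if not reasons:
            counts["unlabeled"] += 1
        for reason in reasons:
            counts[reason] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records, for demonstration only.
    sample = [
        {"title": "fix: a", "labels": ["closed:ci-failure"]},
        {"title": "fix: b", "labels": ["closed:reviewer-rejected", "bug"]},
        {"title": "feat: c", "labels": []},
    ]
    for reason, count in tally_closure_reasons(sample).most_common():
        print(reason, count)
```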
Notes
- Distinct root cause category: late/missing validation strategy
- Data quality caveats: the session logs directory (`/tmp/gh-aw/session-data/logs/`) was empty; close reasons for individual PRs were not fetched (gh CLI unavailable in this session). The 215-closure count and category breakdown are derived from title-prefix analysis of the full PR dataset. Actual failure mode distribution requires per-PR inspection.
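For context, a title-prefix heuristic of the kind described in the caveat might look like the sketch below. The regular expression and function names are assumptions; the exact rule used to produce the counts in this report is not recorded.

```python
import re
from collections import Counter
from typing import Iterable

# Leading type token such as "fix:" or "feat(cli):"; the pattern is illustrative.
PREFIX_RE = re.compile(r"^\s*([a-z]+)(?:\([^)]*\))?!?:")


def title_prefix(title: str) -> str:
    """Return the lowercase type prefix of a PR title, or 'other' if none."""
    match = PREFIX_RE.match(title.lower())
    return match.group(1) if match else "other"


def count_prefixes(titles: Iterable[str]) -> Counter:
    """Tally PR titles by their type prefix."""
    return Counter(title_prefix(t) for t in titles)
```

A heuristic like this only groups PRs by declared intent; it says nothing about why a PR was closed, which is why per-PR labels are proposed.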
Generated by Copilot Opt · ● 2.5M · ◷