Analysis Period: 2026-05-06 → 2026-05-24 (last ~18 days, 1,000 most recent Copilot SWE PRs)
Total Copilot SWE PRs Analyzed: 997 (3 dropped for empty/short bodies)
Clusters Identified: 8 (k chosen by silhouette score over k=4..8)
Overall Merge Rate: 81.1%
Silhouette Score (k=8): 0.048
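The k selection described above (silhouette score over k=4..8) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the pipeline's actual code: the Euclidean distance, the function names silhouette_score and pick_k, and the toy points (standing in for whatever PR-text features the run used) are all hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (O(n^2), fine for small n)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

def pick_k(points, labelings):
    """labelings maps each candidate k to its cluster labels; return (best_k, scores)."""
    scores = {k: silhouette_score(points, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): three well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
labelings = {
    2: [0, 0, 0, 0, 1, 1],  # merges the first two pairs into one cluster
    3: [0, 0, 1, 1, 2, 2],  # one cluster per pair
}
best_k, scores = pick_k(points, labelings)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

A silhouette score near 0, such as the 0.048 reported for k=8, generally indicates heavily overlapping clusters, so the cluster boundaries in this report are probably best read as soft themes rather than sharp categories.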
Methodology note: Workflow turn-count enrichment was skipped. The workflow logs available in this run (/tmp/gh-aw/agent/workflow-logs/) belong to unrelated gh-aw agentic workflows (PR Sous Chef, Pull Request Reviewer, etc.), not to the upstream Copilot SWE agent that authored these PRs, so turn counts and durations could not be cleanly joined. The analysis is therefore based on PR title, PR body, and interaction metadata (comments, reviews, commits, files changed, additions/deletions). The "PR body" is the agent's own change description: a reasonable proxy for the work done (it correlates with the original prompt), but not the literal prompt text.
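For illustration, the join that could not be performed in this run might look like the sketch below, assuming Copilot SWE telemetry were available as one row per PR. The field names (number, cluster, pr_number, turns) and the sample rows are hypothetical and are not taken from the analyzed data.

```python
from collections import defaultdict

def join_telemetry(prs, telemetry):
    """Attach turn counts to PR records by PR number; also report unmatched PRs."""
    by_number = {row["pr_number"]: row for row in telemetry}
    joined, missing = [], []
    for pr in prs:
        row = by_number.get(pr["number"])
        if row is None:
            missing.append(pr["number"])  # no telemetry row for this PR
        else:
            joined.append({**pr, "turns": row["turns"]})
    return joined, missing

def mean_turns_by_cluster(joined):
    """Average agent turns per cluster over the joined records."""
    totals = defaultdict(lambda: [0, 0])  # cluster -> [turn_sum, pr_count]
    for pr in joined:
        totals[pr["cluster"]][0] += pr["turns"]
        totals[pr["cluster"]][1] += 1
    return {cluster: turn_sum / count for cluster, (turn_sum, count) in totals.items()}

# Invented sample rows; PR numbers 1-4 are placeholders, not real PRs.
prs = [
    {"number": 1, "cluster": "C5"},
    {"number": 2, "cluster": "C5"},
    {"number": 3, "cluster": "C2"},
    {"number": 4, "cluster": "C7"},
]
telemetry = [
    {"pr_number": 1, "turns": 40},
    {"pr_number": 2, "turns": 20},
    {"pr_number": 3, "turns": 10},
]
joined, missing = join_telemetry(prs, telemetry)
mean_turns = mean_turns_by_cluster(joined)
print(mean_turns, missing)
```

Reporting the unmatched PR numbers separately keeps a partial telemetry export from silently skewing the per-cluster averages.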
Note on C6: top terms (bin, gh, usr, usr bin) suggest this cluster picked up residual firewall-block log fragments that survived cleaning. The cluster is small, so it doesn't distort the overall picture, but treat its theme as "miscellaneous / data-leakage" rather than a real task category.
Key Findings
Two dominant work patterns: C2 (bug fixes & tests) and C4 (workflow/docs/schema additions) together make up 58% of all Copilot SWE work in this window. Both merge at ~85% — these are the agent's bread and butter.
Lock-file / workflow-recompile churn is risky: C0 (workflow / pr / workflows) has the largest average diff outside C5 (+658/-616 lines, 35 files) and merges at only 73%. These are the "regenerate compiled workflow artifacts" PRs, which often touch hundreds of files and run into conflicts.
Version/golden-fixture bumps are the heaviest and riskiest: C5 (awf / claude / golden) averages 1,342 additions and 124 files per PR and merges at only 68%. This cluster also draws heavy discussion (7.4 comments/PR), suggesting these PRs need a lot of back-and-forth.
Sous-chef tasks merge best: C3 (sous chef formatting/cleanup) merges at 89% despite having the highest avg comments (12.7) and commits (9.0) — lots of iteration but very high success.
"Fix failing GitHub Actions job" PRs underperform: C7 (24 PRs, all titled [WIP] Fix failing GitHub Actions job ...) merges at 73%, with low review engagement (0.9 reviews/PR). These are auto-generated CI-fixer PRs that frequently get abandoned.
Cluster C3 — sous / chef / pr
Top terms: sous, sous chef, chef, pr sous, pr, id, workflow, model, run, workflow id
Example PRs:
✅ #34263 — Add opusplan builtin alias to Claude model routing
✅ #34245 — Clarify stable vs prerelease upgrade messaging
✅ #34149 — Use Copilot BYOK platform default model
❌ #33840 — create-pull-request: keep branch push on protected-files fallback
Cluster C6 — bin / gh / usr (data-leakage cluster)
Size: 28 PRs (2.8% of corpus)
Merge rate: 79%
Top terms are all firewall-log fragments — these PRs likely had very short bodies that fell through to residual log noise. Real themes inside are mixed (docs, small fixes, deps).
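The cleaning step that these fragments slipped past can be sketched as follows. This is a minimal illustration that assumes the firewall notices arrive as GitHub-style "> [!WARNING]" blockquote alerts. The regex, the function names, and the sample body are hypothetical, and the pipeline's real pattern may differ.

```python
import re

# Matches a "> [!WARNING]" line plus every following line that continues the blockquote.
WARNING_BLOCK = re.compile(
    r"^>[ \t]*\[!WARNING\][^\n]*(?:\n>[^\n]*)*\n?",
    re.MULTILINE,
)

def strip_warning_blocks(body):
    """Remove whole [!WARNING] alert blocks from a PR body."""
    return WARNING_BLOCK.sub("", body)

def noise_ratio(body):
    """Fraction of the body's characters that sit inside [!WARNING] blocks."""
    if not body:
        return 0.0
    removed = sum(len(m.group(0)) for m in WARNING_BLOCK.finditer(body))
    return removed / len(body)

# Invented sample body; the warning text is a placeholder, not a real log line.
body = (
    "Fix flaky test.\n"
    "\n"
    "> [!WARNING]\n"
    "> placeholder firewall notice mentioning /usr/bin/gh\n"
    "> second line of the notice\n"
    "\n"
    "Details here.\n"
)
cleaned = strip_warning_blocks(body)
ratio = noise_ratio(body)
print(cleaned)
```

A ratio threshold (for example, flagging bodies that are more than half warning text) would be one way to act on the idea that a mostly-noise body is itself a signal. The cutoff would need tuning against real PR bodies.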
Cluster C7 — actions / job / fix
All are titled [WIP] Fix failing GitHub Actions job ... and look like an automated CI-fixer pattern. Low review engagement and a below-average merge rate suggest that several are abandoned without human engagement.
Recommendations
Lean on the bug-fix and docs/schema patterns: C2 (fix/bug) and C4 (workflow/docs/schema) cover 58% of the workload at 84–85% merge rate. These are the agent's strongest task shapes; keep routing this kind of work to it.
Investigate C5 (version/golden bumps): 68% merge rate, +1,342/-1,181 lines per PR, 7.4 comments per PR. These PRs are heavy and contentious. Consider:
Splitting "bump version X + regenerate golden fixtures" into two PRs.
Adding a stricter scope-guard so the agent doesn't update unrelated golden files.
Investigate C7 ([WIP] Fix failing GitHub Actions job ...): 73% merge rate and near-zero human engagement (0.2 comments per PR). Either these PRs aren't being triaged, or the prompt is producing PRs that are unfit to review. It is worth sampling five closed ones to see why.
Watch C0 (workflow recompile) diff size: the average is +658/-616 lines across about 35 files. This is the noisy "merge main + regenerate lock files" pattern. The lower merge rate (73%) suggests these PRs often get superseded; consider a queue/dedup rule so that only one such PR is open at a time.
C6 cleanup: a small number of PRs (~2.8%) ended up in a noise cluster because their bodies were dominated by firewall-block warnings. The clustering pipeline could strip these even more aggressively (currently using a [!WARNING] block regex) — or these PRs could be flagged upstream because a body that's mostly firewall noise is itself a signal of agent trouble.
Backfill workflow telemetry: turn counts and per-PR durations weren't available in this run because the only workflow logs present were for unrelated agentic workflows. Joining Copilot SWE telemetry (turns, model usage, cost) by PR number would let us see whether C5/C7 also burn the most agent time before producing their (often abandoned) PRs.
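The scope guard suggested for C5 above could take roughly this shape: compare the files a PR changes against an allow-list of glob patterns and report anything outside it. This is a sketch under stated assumptions. The function name, the patterns, and the file paths are hypothetical and are not taken from the repository.

```python
from fnmatch import fnmatchcase

def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]

# Hypothetical example: a version bump that should only touch one fixture directory.
allowed = ["testdata/golden/claude/*", "go.mod"]
changed = [
    "testdata/golden/claude/a.golden",
    "testdata/golden/codex/b.golden",
    "go.mod",
]
violations = out_of_scope(changed, allowed)
print(violations)
```

fnmatchcase is used instead of fnmatch so that matching stays case-sensitive on every platform. Note that its * wildcard also matches path separators, so a pattern such as testdata/golden/claude/* covers nested files too.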
Summary
Cluster Overview
Key Findings
Per-cluster breakdown with example PRs
Cluster C2 — fix / bug / added
… panicinlibrarycode in CI and tune it for accepted repo patterns
… time.Sleep literals in pkg/cli with named duration constants
Cluster C4 — added / workflow / docs
Cluster C0 — workflow / pr / workflows
… jqschema.sh to unblock Copilot PR data fetch
Cluster C1 — agent / span / step
… copilot_harness
Cluster C5 — awf / claude / golden
Cluster C3 — sous / chef / pr
Cluster C6 — bin / gh / usr (data-leakage cluster)
Cluster C7 — actions / job / fix
Cluster comparison table (merge rate × effort × engagement)
Recommendations