Summary
Analysis Period: Last 30 days (2026-05-31 → 2026-06-20) · PRs Analyzed: 998 (of 1000; 2 empty bodies skipped)
Clusters: 11 (k chosen by silhouette, score 0.054) · Overall Merge Rate: 80.0% (792 merged / 990 decided · 8 still open)
The merge rate holds steady at 80%, in line with the past two weeks (79–81%). The standout signal is unchanged from prior runs: WIP-marked PRs merge at only 37% vs 80% overall — and one entire cluster (auto "fix failing actions" PRs) is 100% WIP.
Key Findings
WIP prompts are the dominant failure mode. 43 PRs are WIP-flagged; they merge at 37% vs 80% baseline. Cluster C10 (actions / fix / job, n=19) is entirely WIP (19/19) and the worst-performing cluster at 68% — these are auto-generated "fix the failing GitHub Actions job" PRs that frequently get superseded rather than merged.
Safe-output work is the hardest real task type. Cluster C3 (safe / safe output / safe outputs, n=60) merges at 71% — lowest among non-WIP-driven clusters — suggesting safe-output plumbing changes need more iteration or are more often rejected.
Narrow, mechanical tasks merge best. C8 (package / analyzer / spec, 88%), C4 (aic / forecast / et — cost/footer reporting, 86%), and C2 (claude / smoke, 85%) are the top performers — all tightly-scoped, well-templated changes.
The bulk of work is workflow/schema plumbing. The three largest clusters, C1 (workflow steps/jobs, 26% of PRs), C9 (schema/validation, 14%), and C0 (prompt/guidance authoring, 14%), account for 54% of all PRs, and all three merge at the 79–80% baseline.
Methodology & data notes
Source: copilot-prs.json, 1000 PRs authored by app/copilot-swe-agent in github/gh-aw, created 2026-05-31 → 2026-06-20.
Prompt text: extracted from PR body, stripped of code blocks, inline code, URLs, and #refs; lowercased; PRs with <30 chars of cleaned text excluded (2).
Vectorization: TF-IDF, 1–3 grams, max_features=600, min_df=3, max_df=0.6. Clustering: K-means, k selected by silhouette over k∈[6,11] → k=11 (sil 0.054; low absolute value is expected for short, topically-overlapping prompts).
Success = mergedAt != null; rate denominator excludes the 8 still-open PRs. WIP = title/body matches wip/[wip]/work in progress.
Limitation: the pr-full-data cache (comments/reviews/commits/turn counts) is stale (PRs 30xxx, zero overlap with the current 36xxx–40xxx window), so per-PR turn/iteration and commit metrics are not available this run. Analysis is prompt-text + merge-outcome only. Trend chart uses historical merge rates from cache-memory/clustering/history.jsonl.
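The prompt-cleaning, WIP, and success rules in the notes above could be sketched roughly as follows. This is an illustrative reconstruction, not the workflow's actual code: the regular expressions, the function names (`clean_prompt`, `is_wip`, `merge_rate`), and the `title`, `body`, and `state` record fields are assumptions; only `mergedAt` and the 30-character threshold come from the report.

```python
import re

# Assumed patterns: the report names what is stripped, not the exact regexes.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
URL = re.compile(r"https?://\S+")
ISSUE_REF = re.compile(r"#\d+")
# Word-boundary match, so "[wip]" and "WIP:" match but "swipe" does not.
WIP = re.compile(r"\bwip\b|\bwork in progress\b", re.IGNORECASE)

MIN_CHARS = 30  # PRs with fewer cleaned characters are excluded


def clean_prompt(body: str) -> str:
    """Strip code, URLs, and #refs, then lowercase and collapse whitespace."""
    text = body
    # Fenced blocks go first so the inline-code pattern never sees their backticks.
    for pattern in (FENCED_CODE, INLINE_CODE, URL, ISSUE_REF):
        text = pattern.sub(" ", text)
    return " ".join(text.lower().split())


def is_usable(body: str) -> bool:
    """True when the cleaned prompt is long enough to be clustered."""
    return len(clean_prompt(body)) >= MIN_CHARS


def is_wip(pr: dict) -> bool:
    """True when the title or body carries a WIP marker."""
    return bool(WIP.search(f"{pr.get('title', '')}\n{pr.get('body', '')}"))


def merge_rate(prs: list[dict]) -> float:
    """Merged PRs divided by decided PRs; still-open PRs are left out."""
    decided = [pr for pr in prs if pr.get("state") != "OPEN"]
    if not decided:
        return 0.0
    merged = sum(1 for pr in decided if pr.get("mergedAt") is not None)
    return merged / len(decided)
```

Applied to the report's headline counts, the same rule gives 792 / 990 = 0.80: the 8 still-open PRs are dropped from the denominator rather than counted as failures.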
Recommendations
Filter or gate the auto "fix failing actions" generator (C10). It produces a steady stream of WIP PRs that merge only 68% of the time and inflate noise. Consider converting these to draft/issue-only until a fix is verified, or dedup against existing open fix-PRs.
Treat WIP as a triage signal, not a normal state. At a 37% merge rate, WIP PRs go unmerged roughly three times as often as the baseline (63% vs 20%). Surfacing "WIP PR open >N days" for cleanup would cut churn.
Add a safe-output change checklist to the prompt (C3, 71%). This is the lowest-merging substantive task family — likely needing schema + golden-file + doc updates together. A templated checklist could raise first-pass acceptance.
Reuse the patterns from C4/C8. The cost-footer and analyzer/package clusters (86–88%) are the most reliable — tightly scoped, single-concern prompts. Steering ambiguous requests toward this shape should lift the baseline.
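One way the "WIP PR open >N days" cleanup signal from the recommendations could be surfaced is sketched below. It is a minimal illustration under stated assumptions, not an existing tool: the function name `stale_wip_prs`, the 7-day default, and the record fields `state`, `createdAt`, `title`, and `body` are all hypothetical, and the WIP pattern is a guess at the report's matching rule.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed WIP pattern; matches "[WIP]", "WIP:", and "work in progress".
WIP = re.compile(r"\bwip\b|\bwork in progress\b", re.IGNORECASE)


def stale_wip_prs(prs: list[dict], now: datetime, max_age_days: int = 7) -> list[dict]:
    """Return open, WIP-flagged PRs created more than max_age_days ago, oldest first."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for pr in prs:
        if pr.get("state") != "OPEN":
            continue  # merged or closed PRs need no cleanup
        if not WIP.search(f"{pr.get('title', '')}\n{pr.get('body', '')}"):
            continue
        # Assumes ISO 8601 timestamps with a trailing "Z" for UTC.
        created = datetime.fromisoformat(pr["createdAt"].replace("Z", "+00:00"))
        if created < cutoff:
            stale.append((created, pr))
    stale.sort(key=lambda item: item[0])
    return [pr for _, pr in stale]


if __name__ == "__main__":
    example = [
        {"number": 1, "title": "[WIP] Fix failing job", "body": "",
         "state": "OPEN", "createdAt": "2026-06-01T00:00:00Z"},
    ]
    for pr in stale_wip_prs(example, now=datetime(2026, 6, 20, tzinfo=timezone.utc)):
        print(pr["number"], pr["title"])
```

The `now` argument must be timezone-aware so it can be compared with the parsed timestamps; passing it in explicitly also keeps the function deterministic in tests.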
References: §27868474007