[prompt-clustering] Copilot Prompt Clustering — 2026-06-20: 80% merge rate, WIP PRs the key drag #40457

2026-06-20T10:58:32Z

github-actions[bot]
Bot Jun 20, 2026

Summary

Analysis Period: Last 30 days (sampled PRs created 2026-05-31 → 2026-06-20) · PRs Analyzed: 998 (of 1000; 2 skipped for having under 30 characters of cleaned prompt text)
Clusters: 11 (k chosen by silhouette, score 0.054) · Overall Merge Rate: 80.0% (792 merged / 990 decided · 8 still open)

The merge rate holds steady at 80%, in line with the past two weeks (79–81%). The standout signal is unchanged from prior runs: WIP-marked PRs merge at only 37% vs 80% overall — and one entire cluster (auto "fix failing actions" PRs) is 100% WIP.

Key Findings

WIP prompts are the dominant failure mode. 43 PRs are WIP-flagged; they merge at 37% vs 80% baseline. Cluster C10 (actions / fix / job, n=19) is entirely WIP (19/19) and the worst-performing cluster at 68% — these are auto-generated "fix the failing GitHub Actions job" PRs that frequently get superseded rather than merged.
Safe-output work is the hardest real task type. Cluster C3 (safe / safe output / safe outputs, n=60) merges at 71% — lowest among non-WIP-driven clusters — suggesting safe-output plumbing changes need more iteration or are more often rejected.
Narrow, mechanical tasks merge best. C8 (package / analyzer / spec, 88%), C4 (aic / forecast / et — cost/footer reporting, 86%), and C2 (claude / smoke, 85%) are the top performers — all tightly-scoped, well-templated changes.
The bulk of work is workflow/schema plumbing. The three largest clusters, C1 (workflow steps/jobs, 26% of PRs), C9 (schema/validation, 14%), and C0 (prompt/guidance authoring, 14%), account for 54% of all PRs, and each merges at 79–80%, right at the baseline.

Cluster Breakdown

Cluster	Theme	Size	Merge %	WIP	Recent examples
C1	step / workflow / files	258 (25.9%)	79%	4	#40441, #40425, #40414
C9	schema / validation / field	139 (13.9%)	80%	1	#40442, #40422, #40394
C0	prompt / workflow / guidance	138 (13.8%)	80%	7	#40421, #40413, #40399
C8	package / analyzer / spec	84 (8.4%)	88%	1	#40420, #40366, #40248
C5	sdk / permission / copilot	83 (8.3%)	77%	1	#40419, #40368, #40124
C7	ai / credits / budget	68 (6.8%)	79%	1	#40000, #39061, #39058
C3	safe / safe output	60 (6.0%)	71%	4	#40423, #40367, #40363
C4	aic / forecast / cost footer	58 (5.8%)	86%	0	#40353, #40231, #39830
C2	claude / domains / smoke	47 (4.7%)	85%	5	#40388, #40208, #40166
C6	pr sous-chef	44 (4.4%)	80%	0	#40085, #40062, #39953
C10	actions / fix / job (auto)	19 (1.9%)	68%	19	#40239, #40109, #40007
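
The table's totals can be cross-checked with a few lines of arithmetic. The sketch below copies the Size, Merge %, and WIP columns from the table and recomputes the PR count, the WIP count, the share held by the three largest clusters, and a size-weighted merge rate. The weighted rate is only an approximation of the reported 80.0%, because cluster sizes include the 8 still-open PRs and the per-cluster rates are rounded. The names `CLUSTERS` and `summarize` are illustrative, not part of the workflow.

```python
# Rows copied from the Cluster Breakdown table:
# (cluster, size, merge rate in %, WIP count)
CLUSTERS = [
    ("C1", 258, 79, 4),
    ("C9", 139, 80, 1),
    ("C0", 138, 80, 7),
    ("C8", 84, 88, 1),
    ("C5", 83, 77, 1),
    ("C7", 68, 79, 1),
    ("C3", 60, 71, 4),
    ("C4", 58, 86, 0),
    ("C2", 47, 85, 5),
    ("C6", 44, 80, 0),
    ("C10", 19, 68, 19),
]


def summarize(rows):
    """Return (total PRs, total WIP, weighted merge %, top-3 share %)."""
    total = sum(size for _, size, _, _ in rows)
    wip_total = sum(wip for _, _, _, wip in rows)
    # Size-weighted mean of the rounded per-cluster merge rates (approximate).
    weighted = sum(size * rate for _, size, rate, _ in rows) / total
    # Share of all PRs held by the three largest clusters.
    largest = sorted((size for _, size, _, _ in rows), reverse=True)[:3]
    top3_share = 100 * sum(largest) / total
    return total, wip_total, weighted, top3_share


total, wip_total, weighted, top3_share = summarize(CLUSTERS)
print(total, wip_total, round(weighted, 1), round(top3_share, 1))
# prints: 998 43 79.9 53.6
```

The recomputed figures agree with the summary: 998 PRs, 43 WIP-flagged, roughly 80% merged, and about 54% of PRs in the three largest clusters.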

Methodology & data notes

Source: copilot-prs.json — 1000 PRs authored by app/copilot-swe-agent in github/gh-aw, created 2026-05-31 → 2026-06-20.
Prompt text: extracted from PR body, stripped of code blocks, inline code, URLs, and #refs; lowercased; PRs with <30 chars of cleaned text excluded (2).
Vectorization: TF-IDF, 1–3 grams, max_features=600, min_df=3, max_df=0.6. Clustering: K-means, k selected by silhouette over k∈[6,11] → k=11 (sil 0.054; low absolute value is expected for short, topically-overlapping prompts).
Success = mergedAt != null; rate denominator excludes the 8 still-open PRs. WIP = title/body matches wip/[wip]/work in progress.
Limitation: the pr-full-data cache (comments/reviews/commits/turn counts) is stale (PRs 30xxx, zero overlap with the current 36xxx–40xxx window), so per-PR turn/iteration and commit metrics are not available this run. Analysis is prompt-text + merge-outcome only. Trend chart uses historical merge rates from cache-memory/clustering/history.jsonl.
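
The text-preparation and outcome rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the workflow's actual code: the regular expressions are one plausible reading of "code blocks, inline code, URLs, and #refs", the `state` field and its `"OPEN"` value are assumed, and the function names are hypothetical. Only `mergedAt` and the 30-character threshold come from the notes above.

```python
import re

# Assumed patterns for the cleaning steps named in the methodology.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
URL = re.compile(r"https?://\S+")
ISSUE_REF = re.compile(r"#\d+")
# "\bwip\b" also matches the bracketed form "[wip]".
WIP = re.compile(r"\bwip\b|\bwork in progress\b", re.IGNORECASE)
MIN_CHARS = 30  # PRs with less cleaned text than this are excluded


def clean_prompt(body):
    """Strip code, URLs and #refs, lowercase, and collapse whitespace."""
    text = FENCED_CODE.sub(" ", body or "")
    text = INLINE_CODE.sub(" ", text)
    text = URL.sub(" ", text)
    text = ISSUE_REF.sub(" ", text)
    return " ".join(text.lower().split())


def is_wip(title, body):
    """True if the title or body carries a WIP marker."""
    return bool(WIP.search(title or "") or WIP.search(body or ""))


def merge_rate(prs):
    """Merged share of decided PRs; still-open PRs are left out.

    Each PR is a dict with 'mergedAt' (None if unmerged) and an assumed
    'state' field whose value is "OPEN" for undecided PRs.
    """
    decided = [
        pr for pr in prs
        if pr["mergedAt"] is not None or pr["state"] != "OPEN"
    ]
    if not decided:
        return None
    merged = sum(1 for pr in decided if pr["mergedAt"] is not None)
    return merged / len(decided)
```

A PR would then be kept for clustering only when `len(clean_prompt(body)) >= MIN_CHARS`. Word boundaries in the WIP pattern keep words such as "swipe" from being flagged.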

Recommendations

Filter or gate the auto "fix failing actions" generator (C10). It produces a steady stream of WIP PRs that merge only 68% of the time and add noise. Consider converting these to draft/issue-only until a fix is verified, or deduplicating them against existing open fix PRs.
Treat WIP as a triage signal, not a normal state. At 37%, WIP PRs merge at less than half the baseline rate and end up closed unmerged about 3× as often (63% vs 20% of decided PRs). Surfacing "WIP PR open >N days" for cleanup would cut churn.
Add a safe-output change checklist to the prompt (C3, 71%). This is the lowest-merging substantive task family — likely needing schema + golden-file + doc updates together. A templated checklist could raise first-pass acceptance.
Reuse the patterns from C4/C8. The cost-footer and analyzer/package clusters (86–88%) are the most reliable — tightly scoped, single-concern prompts. Steering ambiguous requests toward this shape should lift the baseline.
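
The "WIP PR open >N days" check suggested above could look roughly like the following. This is a sketch under assumptions: the record fields (`number`, `is_open`, `is_wip`, `created_at`), the function name, and the 7-day default are all hypothetical and would need to be mapped onto whatever PR data the workflow already loads.

```python
from datetime import datetime, timedelta, timezone


def stale_wip_prs(prs, now, max_age_days=7):
    """Return numbers of open WIP PRs older than max_age_days, oldest first.

    Each PR is a dict with hypothetical keys: 'number' (int),
    'is_open' (bool), 'is_wip' (bool), 'created_at' (aware datetime).
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = [
        pr for pr in prs
        if pr["is_open"] and pr["is_wip"] and pr["created_at"] < cutoff
    ]
    stale.sort(key=lambda pr: pr["created_at"])
    return [pr["number"] for pr in stale]


# Illustrative data only; these are not real PR numbers.
now = datetime(2026, 6, 20, tzinfo=timezone.utc)
sample = [
    {"number": 1, "is_open": True, "is_wip": True,
     "created_at": now - timedelta(days=10)},
    {"number": 2, "is_open": True, "is_wip": True,
     "created_at": now - timedelta(days=2)},
    {"number": 3, "is_open": False, "is_wip": True,
     "created_at": now - timedelta(days=30)},
    {"number": 4, "is_open": True, "is_wip": True,
     "created_at": now - timedelta(days=15)},
]
print(stale_wip_prs(sample, now))  # prints: [4, 1]
```

Sorting oldest first puts the PRs most likely to have been superseded at the top of a cleanup list.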

References: §27868474007

Generated by 📊 Copilot Agent Prompt Clustering Analysis · 143.7 AIC · ⌖ 18.5 AIC · ⊞ 13.8K · ◷

expires on Jun 21, 2026, 2:58 AM UTC-08:00
