[prompt-clustering] Copilot SWE Agent Prompt Clustering — 997 PRs (2026-05-06 → 2026-05-24) #34410

2026-05-24T10:40:10Z

github-actions[bot]
Bot May 24, 2026

Summary

Analysis Period: 2026-05-06 → 2026-05-24 (last ~18 days, 1,000 most recent Copilot SWE PRs)
Total Copilot SWE PRs Analyzed: 997 (3 dropped for empty/short bodies)
Clusters Identified: 8 (k chosen by silhouette score over k=4..8)
Overall Merge Rate: 81.1%
Silhouette Score (k=8): 0.048
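The summary says k was chosen by silhouette score over k=4..8. As a rough illustration of that selection step, here is a minimal pure-Python sketch. It assumes Euclidean distance and takes precomputed labelings for each candidate k. The function names (`silhouette_score`, `choose_k`) and the toy points are hypothetical and not taken from the actual pipeline, which may well use a library and a different distance metric.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: distance to own cluster
        b = min(  # separation: distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, labelings):
    """Return (best_k, scores) given a dict mapping each candidate k to its labels."""
    scores = {k: silhouette_score(points, labels) for k, labels in labelings.items()}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data (hypothetical): two tight groups far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labelings = {
    2: [0, 0, 1, 1],  # matches the real grouping
    3: [0, 1, 2, 2],  # splits one group needlessly
}
best_k, scores = choose_k(points, labelings)
```

Silhouette scores range from -1 to 1, and values near 0, such as the 0.048 reported here, indicate heavily overlapping clusters, so the cluster themes are best read as loose groupings.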

Methodology note: Workflow turn-count enrichment was skipped. The workflow logs available in this run (/tmp/gh-aw/agent/workflow-logs/) are for unrelated gh-aw agentic workflows (PR Sous Chef, Pull Request Reviewer, etc.), not the upstream Copilot SWE agent that authored these PRs — so turn counts and durations could not be cleanly joined. Analysis is therefore based on PR title + body + interaction metadata (comments, reviews, commits, files changed, additions/deletions). The "PR body" is the agent's own change description, which is a reasonable proxy for the work done (and correlates with the original prompt) but is not the literal prompt text.

Cluster Overview

Cluster	Theme	Tasks	% of corpus	Merge rate	Avg files	Avg lines +/-	Top terms
C2	fix / bug / added	300	30.1%	85%	20.3	+232/-134	fix, bug, added, test, error, safe
C4	added / workflow / docs	277	27.8%	84%	24.4	+290/-122	added, workflow, docs, schema, updated, new
C0	workflow / pr / workflows	150	15.0%	73%	35.5	+658/-616	workflow, pr, workflows, files, review, lock
C1	agent / span / step	106	10.6%	82%	28.2	+284/-99	agent, span, step, prompt, workflow, setup
C5	awf / claude / golden	66	6.6%	68%	124.5	+1342/-1181	awf, claude, golden, version, smoke, bump
C3	sous / chef / pr	44	4.4%	89%	36.4	+704/-257	sous, sous chef, chef, pr sous, pr, id
C6	bin / gh / usr (mostly noise)	28	2.8%	79%	3.1	+77/-28	bin, gh, usr, usr bin, git, bin gh
C7	actions / job / fix	26	2.6%	73%	51.6	+263/-59	actions, failing github, fix failing, actions job, job

Note on C6: top terms (bin, gh, usr, usr bin) suggest this cluster picked up residual firewall-block log fragments that survived cleaning. The cluster is small, so it doesn't distort the overall picture, but treat its theme as "miscellaneous / data-leakage" rather than a real task category.

Key Findings

Two dominant work patterns: C2 (bug fixes & tests) and C4 (workflow/docs/schema additions) together make up 58% of all Copilot SWE work in this window. Both merge at ~85% — these are the agent's bread and butter.
Lock-file / workflow-recompile churn is risky: C0 (workflow / pr / workflows) has the largest average diff outside C5 (+658/-616 lines, 35.5 files) and merges at only 73%. These are the "regenerate compiled workflow artifacts" PRs, which touch many generated files at once and often run into conflicts.
Version/golden-fixture bumps are the heaviest and riskiest: C5 (awf / claude / golden) averages 1,342 additions and 124.5 files per PR and merges at only 68%. This cluster also has the second-highest comment count (7.4/PR, behind only C3 at 12.7), suggesting these PRs need a lot of back-and-forth.
Sous-chef tasks merge best: C3 (sous chef formatting/cleanup) merges at 89% despite having the highest avg comments (12.7) and commits (9.0) — lots of iteration but very high success.
"Fix failing GitHub Actions job" PRs underperform: C7 (26 PRs, all titled [WIP] Fix failing GitHub Actions job ...) merges at 73%, with low review engagement (0.9 reviews/PR). These are auto-generated CI-fixer PRs that frequently get abandoned.

Per-cluster breakdown with example PRs

Cluster C2 — fix / bug / added

Size: 300 PRs (30.1% of corpus)
Merge rate: 85% (255 merged of 300)
Avg files / additions / deletions: 20.3 / 232 / 134
Avg comments / reviews / commits: 2.2 / 1.6 / 4.4
Top terms: fix, bug, added, test, error, safe, path, workflow, branch, tests
Example PRs:
- ✅ #34374 — Enforce panicinlibrarycode in CI and tune it for accepted repo patterns
- ✅ #34373 — Replace magic time.Sleep literals in pkg/cli with named duration constants
- ✅ #34312 — fix: IsCompatible returns false for invalid semver inputs
- ❌ #34305 — fix: defer file Close() and mutex Unlock() to prevent resource leaks

Cluster C4 — added / workflow / docs

Size: 277 PRs (27.8% of corpus)
Merge rate: 84% (233 merged of 277)
Avg files / additions / deletions: 24.4 / 290 / 122
Avg comments / reviews / commits: 1.9 / 1.6 / 3.9
Top terms: added, workflow, docs, schema, updated, new, output, field, experiment, guidance
Example PRs:
- ✅ #34304 — deps(go): bump charmbracelet golden to 798e623 pseudo-version
- ✅ #34300 — fix: use actual resolved model name in effective tokens footer
- ✅ #34291 — Prefix effective-token footer values with deterministic 5-char model IDs
- ❌ #34303 — fix: use actual model name from token-usage.jsonl in effective tokens footer

Cluster C0 — workflow / pr / workflows

Size: 150 PRs (15.0% of corpus)
Merge rate: 73% (109 merged of 150)
Avg files / additions / deletions: 35.5 / 658 / 616
Avg comments / reviews / commits: 2.5 / 1.4 / 4.7
Top terms: workflow, pr, workflows, files, review, lock, run, comment, command, reviewer
Example PRs:
- ✅ #34322 — feat: add Avenger hourly CI fixer workflow
- ✅ #34316 — fix: lint Go, update node:lts-alpine SHA, recompile lock files
- ✅ #34301 — Set executable bit on jqschema.sh to unblock Copilot PR data fetch
- ❌ #34317 — [WIP] Update Go version from 1.25 to 1.26

Cluster C1 — agent / span / step

Size: 106 PRs (10.6% of corpus)
Merge rate: 82% (87 merged of 106)
Avg files / additions / deletions: 28.2 / 284 / 99
Avg comments / reviews / commits: 1.3 / 1.5 / 3.4
Top terms: agent, span, step, prompt, workflow, setup, spans, sub, pre, token
Example PRs:
- ✅ #34244 — Pin Agent Persona Explorer to explicit Copilot model
- ✅ #34230 — Inline Copilot error detection into copilot_harness
- ✅ #34229 — Optimize mattpocock-skills-reviewer with inline small-model sub-agent
- ❌ #34286 — [WIP] Fix copilot-harness post-step set_output ENOENT error

Cluster C5 — awf / claude / golden

Size: 66 PRs (6.6% of corpus)
Merge rate: 68% (45 merged of 66) — lowest merge rate
Avg files / additions / deletions: 124.5 / 1342 / 1181 — largest diffs
Avg comments / reviews / commits: 7.4 / 1.9 / 4.1 (second-most discussion per PR, after C3)
Top terms: awf, claude, golden, version, smoke, bump, workflow, smoke claude, changeset, codex
Example PRs:
- ✅ #34338 — Disable npm release-age cooldown for Claude, Codex, and Gemini engine installs
- ✅ #34321 — chore: bump AWF firewall to v0.25.53
- ✅ #34307 — bump: Claude Code 2.1.150, Copilot CLI 1.0.51, GitHub MCP Server v1.0.5
- ❌ #34324 — chore: bump gh-aw-firewall to v0.25.53

Cluster C3 — sous / chef / pr

Size: 44 PRs (4.4% of corpus)
Merge rate: 89% (39 merged of 44) — highest merge rate
Avg files / additions / deletions: 36.4 / 704 / 257
Avg comments / reviews / commits: 12.7 / 3.1 / 9.0 — most iteration
Top terms: sous, sous chef, chef, pr sous, pr, id, workflow, model, run, workflow id
Example PRs:
- ✅ #34263 — Add opusplan builtin alias to Claude model routing
- ✅ #34245 — Clarify stable vs prerelease upgrade messaging
- ✅ #34149 — Use Copilot BYOK platform default model
- ❌ #33840 — create-pull-request: keep branch push on protected-files fallback

Cluster C6 — bin / gh / usr (data-leakage cluster)

Size: 28 PRs (2.8% of corpus)
Merge rate: 79%
Top terms are all firewall-log fragments — these PRs likely had very short bodies that fell through to residual log noise. Real themes inside are mixed (docs, small fixes, deps).
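The recommendations mention that the pipeline currently strips firewall warnings with a [!WARNING] block regex. A minimal sketch of that kind of cleaning step follows. The exact pattern, the sample body, the function names, and the 0.5 threshold are assumptions for illustration, not the pipeline's actual code.

```python
import re

# Matches a blockquote that opens with "> [!WARNING]" plus every following "> ..." line.
WARNING_BLOCK = re.compile(r"^>[ \t]*\[!WARNING\][^\n]*(?:\n>[^\n]*)*\n?", re.MULTILINE)


def strip_warning_blocks(body):
    """Remove [!WARNING] blockquotes and collapse the blank lines left behind."""
    cleaned = WARNING_BLOCK.sub("", body)
    return re.sub(r"\n{3,}", "\n\n", cleaned).strip()


def noise_ratio(body):
    """Fraction of the body's characters that sat inside warning blocks."""
    if not body:
        return 0.0
    removed = sum(len(m.group(0)) for m in WARNING_BLOCK.finditer(body))
    return removed / len(body)


def is_mostly_noise(body, threshold=0.5):
    """Flag bodies dominated by warning text (threshold is an arbitrary assumption)."""
    return noise_ratio(body) > threshold


# Hypothetical PR body for illustration.
body = (
    "Fixes the parser.\n"
    "\n"
    "> [!WARNING]\n"
    ">\n"
    "> Firewall rules blocked a connection attempt by /usr/bin/gh\n"
    "\n"
    "Adds tests.\n"
)
cleaned = strip_warning_blocks(body)
```

Computing the ratio before stripping keeps both options open: the cleaned text can go to the clustering step, while bodies above the threshold can be flagged for a closer look.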

Cluster C7 — actions / job / fix

Size: 26 PRs (2.6% of corpus)
Merge rate: 73%
Avg files / additions / deletions: 51.6 / 263 / 59
Avg comments / reviews / commits: 0.2 / 0.9 / 3.0 — lowest engagement
Top terms: actions, failing github, fix failing, actions job, job, github actions
All titled [WIP] Fix failing GitHub Actions job ... — these look like an automated CI-fixer pattern. Low review engagement and below-average merge rate suggest several get abandoned without humans engaging.

Cluster comparison table (merge rate × effort × engagement)

Cluster	Theme	Size	Merge %	Avg files	Avg lines +/-	Avg comments	Avg reviews	Avg commits
C2	fix / bug / added	300	85%	20	+232/-134	2.2	1.6	4.4
C4	added / workflow / docs	277	84%	24	+290/-122	1.9	1.6	3.9
C0	workflow / pr / workflows	150	73%	36	+658/-616	2.5	1.4	4.7
C1	agent / span / step	106	82%	28	+284/-99	1.3	1.5	3.4
C5	awf / claude / golden	66	68%	124	+1342/-1181	7.4	1.9	4.1
C3	sous / chef / pr	44	89%	36	+704/-257	12.7	3.1	9.0
C6	bin / gh / usr (noise)	28	79%	3	+77/-28	0.4	1.1	4.2
C7	actions / job / fix	26	73%	52	+263/-59	0.2	0.9	3.0
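As a quick consistency check on the table above, the overall merge rate can be approximated as the size-weighted mean of the per-cluster rates. The sketch below uses the rounded percentages from the table, so it can only reproduce the reported 81.1% to within rounding error. The variable names are illustrative.

```python
# (size, merge rate) per cluster, copied from the comparison table.
clusters = {
    "C2": (300, 0.85),
    "C4": (277, 0.84),
    "C0": (150, 0.73),
    "C1": (106, 0.82),
    "C5": (66, 0.68),
    "C3": (44, 0.89),
    "C6": (28, 0.79),
    "C7": (26, 0.73),
}

total_prs = sum(size for size, _ in clusters.values())
weighted_rate = sum(size * rate for size, rate in clusters.values()) / total_prs

# Combined share of the two largest clusters (C2 and C4).
top_two_share = (clusters["C2"][0] + clusters["C4"][0]) / total_prs

print(f"total PRs: {total_prs}")
print(f"size-weighted merge rate: {weighted_rate:.1%}")
print(f"C2 + C4 share: {top_two_share:.1%}")
```

The sizes sum to the 997 PRs analyzed, the weighted rate comes out at about 81%, and C2 plus C4 account for roughly 58% of the corpus, all in line with the summary figures.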

Recommendations

Lean on the bug-fix and docs/schema patterns: C2 (fix/bug) and C4 (workflow/docs/schema) cover 58% of the workload at 84–85% merge rate. These are the agent's strongest task shapes — keep routing this kind of work to it.
Investigate C5 (version/golden bumps): 68% merge rate, +1342/-1181 lines per PR, 7.4 comments per PR. These are heavy and contentious. Consider:
- Splitting "bump version X + regenerate golden fixtures" into two PRs.
- Adding a stricter scope-guard so the agent doesn't update unrelated golden files.
Investigate C7 (Fix failing GitHub Actions job ...): 73% merge rate, near-zero human review engagement (0.2 comments). Either these aren't being triaged, or the prompt is producing PRs that are unfit to review. Worth sampling 5 closed ones to see why.
Watch C0 (workflow recompile) diff size: average +658/-616 lines / 36 files — this is the noisy "merge main + regenerate lock files" pattern. Lower merge rate (73%) suggests these often get superseded; consider a queue/dedup rule so only one such PR is open at a time.
C6 cleanup: a small number of PRs (~2.8%) ended up in a noise cluster because their bodies were dominated by firewall-block warnings. The clustering pipeline could strip these even more aggressively (currently using a [!WARNING] block regex) — or these PRs could be flagged upstream because a body that's mostly firewall noise is itself a signal of agent trouble.
Backfill workflow telemetry: turn counts and per-PR durations weren't available in this run because the only workflow logs present were for unrelated agentic workflows. Joining Copilot SWE telemetry (turns, model usage, cost) by PR number would let us see whether C5/C7 also burn the most agent time before producing their (often abandoned) PRs.
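The telemetry backfill suggested in the last item amounts to a join on PR number followed by a per-cluster aggregate. A minimal sketch, assuming telemetry arrives as one record per PR with a `turns` field. The record shape, field names, PR numbers, and turn counts are all hypothetical placeholders, since no such telemetry was available in this run.

```python
from collections import defaultdict


def mean_turns_by_cluster(cluster_by_pr, telemetry):
    """Join telemetry to cluster labels by PR number and average turns per cluster.

    PRs without a telemetry record are counted as unmatched rather than guessed.
    """
    turns = defaultdict(list)
    unmatched = 0
    for pr_number, cluster in cluster_by_pr.items():
        record = telemetry.get(pr_number)
        if record is None:
            unmatched += 1
            continue
        turns[cluster].append(record["turns"])
    means = {cluster: sum(vals) / len(vals) for cluster, vals in turns.items()}
    return means, unmatched


# Hypothetical inputs: placeholder PR numbers and made-up turn counts.
cluster_by_pr = {101: "C5", 102: "C5", 103: "C7", 104: "C2"}
telemetry = {
    101: {"turns": 40},
    102: {"turns": 60},
    103: {"turns": 25},
    # 104 has no telemetry record
}
means, unmatched = mean_turns_by_cluster(cluster_by_pr, telemetry)
```

Reporting the unmatched count alongside the means makes it visible how much of each cluster the telemetry actually covers.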

References:

Workflow run §26358559973

Generated by 📊 Copilot Agent Prompt Clustering Analysis · ● opu47 10.2M · ◷

expires on May 25, 2026, 10:40 AM UTC
