[prompt-clustering] Copilot Agent Prompt Clustering Analysis #33980

2026-05-22T11:00:42Z

github-actions[bot]
Bot May 22, 2026

Summary

Analyzed 1108 copilot-swe-agent PRs opened between 2026-05-02 and 2026-05-22 (881 merged, 203 closed unmerged, 24 open). TF-IDF on cleaned PR bodies plus K-means (k=7, chosen for interpretability after a silhouette sweep over k∈[4,9]) identified 7 prompt clusters spanning agent self-improvement, bug triage, compiler-output regeneration, safe-outputs plumbing, and CI failure auto-fixes.

Overall merge rate: 79.5%
Avg files changed: 30.0 per PR
Avg cycle time: 2.0h from open → close
Best-performing cluster: C1 · Bug fixes (CI failures, stale assertions, lint) — 85.7% merge rate
Worst-performing cluster: C4 · AWF compiler / golden file regeneration — 67.2% merge rate

Key findings

AWF compiler / golden-file regeneration is the pain point. C4 AWF compiler / golden file regeneration shows the worst outcomes: 67.2% merge rate, avg 81 files changed, and ~3.1h cycle time. Golden-file churn is dominating both diff size and failure rate.
Safe-outputs / review-workflow PRs need the most hand-holding. C3 Safe-outputs / review / comment-workflow plumbing averages 6.4 commits and 6.0 comments per PR, against 3.2–4.2 commits and 0.3–4.8 comments in the other clusters. The agent gets the first draft wrong here more often.
Bug-fix PRs have the highest merge rate (85.7%, C1). Narrow, well-defined bug-fix prompts convert well; broad refactor prompts (C5, 80.4%) are a step behind.
The "[WIP] Fix failing GitHub Actions job ..." template (C2) is a recognizable formulaic prompt — 0.27 avg comments (vs 2.2 overall) confirms these are batch-shipped with minimal human review. Worth auditing the template directly.
OpenTelemetry plumbing (C6) is small but coherent — focused niche, healthy 81% merge, fast 1.5h cycle.

Clustering quality

Silhouette scores were tight across k=4–9: k=4:0.030, k=5:0.032, k=6:0.030, k=7:0.035, k=8:0.035, k=9:0.037. k=7 chosen for interpretability (k=9 fragments into clusters of 26–34 PRs without meaningful gain). The general-refactor cluster (C5, 479 PRs ≈ 43% of total) is genuinely heterogeneous — TF-IDF can't separate it further without dropping more generic terms.
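To make "tight" concrete, the short Python sketch below compares the silhouette scores quoted above. Silhouette values range from -1 to 1, and values near 0 indicate heavily overlapping clusters. The variable names are illustrative and are not taken from the analysis script.

```python
# Silhouette scores per k, copied from this report.
silhouette = {4: 0.030, 5: 0.032, 6: 0.030, 7: 0.035, 8: 0.035, 9: 0.037}

# k with the highest raw score.
best_k = max(silhouette, key=silhouette.get)

# Range of scores across the whole sweep.
spread = max(silhouette.values()) - min(silhouette.values())

# Score given up by pinning k=7 instead of the raw maximum.
gap_to_chosen = silhouette[best_k] - silhouette[7]

print(best_k, round(spread, 3), round(gap_to_chosen, 3))
```

On these numbers the raw maximum sits at k=9, but the whole sweep spans well under 0.01, so the score alone barely distinguishes the candidates. That is consistent with settling the choice on interpretability.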

Cluster summary table

ID	Theme	PRs	Merge rate	Files (avg)	+Lines (avg)	Commits (avg)	Comments (avg)	Cycle time (avg)
C5	General refactors, doc updates, small fixes	479	80.4%	21.7	310	4.1	2.2	1.8h
C0	Agent self-improvement (prompts, models, sub-agents)	195	81.5%	17.7	252	3.2	1.3	1.0h
C1	Bug fixes (CI failures, stale assertions, lint)	133	85.7%	28.8	404	4.2	2.3	1.8h
C4	AWF compiler / golden file regeneration	125	67.2%	80.5	805	3.8	4.8	3.1h
C3	Safe-outputs / review / comment-workflow plumbing	118	78.8%	25.1	442	6.4	6.0	3.6h
C6	OpenTelemetry / spans / attributes	32	81.2%	38.6	317	3.7	2.2	1.5h
C2	Auto-generated 'Fix failing GitHub Actions job' PRs	26	76.9%	51.5	264	3.2	0.3	1.2h
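The headline figures in the Summary can be re-derived from the per-cluster rows. The sketch below is a minimal Python check that uses only counts and averages quoted in this report; the variable names are illustrative. Because the inputs are rounded cluster averages, the weighted means are approximations.

```python
# Per-cluster figures copied from this report:
# cluster -> (PRs, merged, avg files changed, avg cycle time in hours)
clusters = {
    "C5": (479, 385, 21.7, 1.8),
    "C0": (195, 159, 17.7, 1.0),
    "C1": (133, 114, 28.8, 1.8),
    "C4": (125, 84, 80.5, 3.1),
    "C3": (118, 93, 25.1, 3.6),
    "C6": (32, 26, 38.6, 1.5),
    "C2": (26, 20, 51.5, 1.2),
}

total_prs = sum(n for n, _, _, _ in clusters.values())
total_merged = sum(m for _, m, _, _ in clusters.values())

# Overall merge rate is merged / all PRs, so open PRs count against it.
overall_merge_rate = 100 * total_merged / total_prs

# Overall averages as PR-count-weighted means of the cluster averages.
avg_files = sum(n * f for n, _, f, _ in clusters.values()) / total_prs
avg_cycle_h = sum(n * c for n, _, _, c in clusters.values()) / total_prs

# Per-cluster merge rates, then the best and worst cluster.
merge_rate = {cid: 100 * m / n for cid, (n, m, _, _) in clusters.items()}
best = max(merge_rate, key=merge_rate.get)
worst = min(merge_rate, key=merge_rate.get)

print(total_prs, total_merged)
print(f"{overall_merge_rate:.1f}% merged, {avg_files:.1f} files, {avg_cycle_h:.1f}h")
print(best, worst)
```

The recomputed totals, overall merge rate, average files changed, and best and worst clusters agree with the Summary.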

Per-cluster deep dive (top terms, sample PRs, metrics)

C5 · General refactors, doc updates, small fixes

Size: 479 PRs (43.2% of total)
Merge rate: 80.4% (385 merged, 79 closed unmerged, 15 open)
Diff footprint: avg 21.7 files, +310/-146 lines
Iteration: avg 4.1 commits, 2.2 comments, 1.5 reviews per PR
Cycle time: avg 1.8h from open to close
Top TF-IDF terms: updated, behavior, workflow, existing, docs, error, shared, coverage
Theme: Heterogeneous bucket — covers refactors, docs drift fixes, anti-pattern cleanup, dependency tweaks. Most diverse cluster; clustering coherence is weak here.
Representative PRs:
- #29860 — feat: migrate playwright workflows to CLI mode, deprecate MCP mode (Merged)
- #29888 — refactor: extract shared process_runner.cjs from claude and copilot harnesses (Merged)
- #29900 — fix: add context.Context to ResolveSHA for graceful cancellation (Merged)
- #29901 — fix: replace fmt.Errorf("%s", str) anti-pattern with errors.New(str) in pkg/cli (Merged)
- #29925 — docs: resolve API drift in pkg/parser, pkg/workflow, pkg/cli, pkg/console (Merged)

C0 · Agent self-improvement (prompts, models, sub-agents)

Size: 195 PRs (17.6% of total)
Merge rate: 81.5% (159 merged, 35 closed unmerged, 1 open)
Diff footprint: avg 17.7 files, +252/-75 lines
Iteration: avg 3.2 commits, 1.3 comments, 1.4 reviews per PR
Cycle time: avg 1.0h from open to close
Top TF-IDF terms: prompt, workflow, run, analysis, model, issue, experiment, report
Theme: Meta-work on the agentic workflows themselves: prompt tuning, model fallbacks, sub-agent extraction, agent-performance analysis.
Representative PRs:
- #29843 — optimize agent-performance-analyzer with inline sub-agents (Merged)
- #29902 — fix(skill-optimizer): update workflow for v2.0.0 CLI interface (Merged)
- #29930 — feat(spec-librarian): add inline sub-agents for phases 1–3 (Merged)
- #29932 — fix(design-decision-gate): increase max-turns from 12 to 20 (Merged)
- #29950 — feat: update daily-subagent-optimizer to prioritize common tool prefix optimization (Merged)

C1 · Bug fixes (CI failures, stale assertions, lint)

Size: 133 PRs (12.0% of total)
Merge rate: 85.7% (114 merged, 19 closed unmerged, 0 open)
Diff footprint: avg 28.8 files, +404/-457 lines
Iteration: avg 4.2 commits, 2.3 comments, 1.3 reviews per PR
Cycle time: avg 1.8h from open to close
Top TF-IDF terms: bug, did, bug bug, workflow, job, failed, ci, output
Theme: Bug-fix PRs, many auto-triaged via post-mortem reports. Includes go fmt runs, dependency bumps, and broken-test fixes. Highest merge rate of any cluster.
Representative PRs:
- #30035 — feat: add default codex_harness.cjs with retry logic for Codex engine (Merged)
- #30100 — Fix stale $INSTRUCTION assertion in TestEngineArgsIntegrationCodex (Merged)
- #30199 — fix: format Go code with go fmt (Merged)
- #30222 — chore(deps): update fsnotify v1.9.0 → v1.10.0 (Merged)
- #30239 — Fix agent job needs not populated from engine.env needs expressions (Merged)

C4 · AWF compiler / golden file regeneration

Size: 125 PRs (11.3% of total)
Merge rate: 67.2% (84 merged, 38 closed unmerged, 3 open)
Diff footprint: avg 80.5 files, +805/-809 lines
Iteration: avg 3.8 commits, 4.8 comments, 1.6 reviews per PR
Cycle time: avg 3.1h from open to close
Top TF-IDF terms: awf, workflow, golden, lock, generated, updated, compiled, recompiled
Theme: Workflow compiler changes that regenerate compiled .lock.yml / golden test outputs. Huge diffs (avg 80 files, about 800 lines added and 800 removed). Worst merge rate and second-longest cycle time (3.1h, behind C3 at 3.6h).
Representative PRs:
- #29848 — fix: version-pin AWF config $schema URL and add _schema field to JSONL types (Merged)
- #29858 — Add model aliases and fallbacks to AWF config (Merged)
- #29899 — fix: Pi engine uses COPILOT_GITHUB_TOKEN instead of PI_API_KEY (Merged)
- #29958 — fix: standardize safe outputs setup step names to use "Generate" verb consistently (Merged)
- #29962 — feat: add provider-prefix support to Pi engine (copilot/claude/codex routing) (Merged)

C3 · Safe-outputs / review / comment-workflow plumbing

Size: 118 PRs (10.6% of total)
Merge rate: 78.8% (93 merged, 21 closed unmerged, 4 open)
Diff footprint: avg 25.1 files, +442/-97 lines
Iteration: avg 6.4 commits, 6.0 comments, 2.1 reviews per PR
Cycle time: avg 3.6h from open to close
Top TF-IDF terms: comment, review, run, commit, validation, workflow, sous, sous chef
Theme: Touches safe-outputs, comment/review workflows, sous-chef. Highest iteration: ~6 commits and ~6 comments per PR. Lots of back-and-forth before merge.
Representative PRs:
- #29999 — fix: smoke-copilot add_comment targets newly created discussion, not the one closed by close-older-discussions (Merged)
- #30028 — feat: query /reflect before and after running the agent in harnesses (Merged)
- #30071 — refactor: decouple safe-outputs checkout from event trigger context (Merged)
- #30122 — Add mattpocock-skills-reviewer agentic workflow (Merged)
- #30197 — fix: add actions: read permission to smoke-water.yml (#investigate-smoke-water-failure) (Merged)

C6 · OpenTelemetry / spans / attributes

Size: 32 PRs (2.9% of total)
Merge rate: 81.2% (26 merged, 5 closed unmerged, 1 open)
Diff footprint: avg 38.6 files, +317/-18 lines
Iteration: avg 3.7 commits, 2.2 comments, 1.6 reviews per PR
Cycle time: avg 1.5h from open to close
Top TF-IDF terms: span, conclusion, spans, attributes, conclusion span, setup, attribute, env
Theme: Telemetry plumbing: token-usage breakdowns, conclusion spans, resource attributes, OTLP routing. Small, focused, well-defined scope.
Representative PRs:
- #29987 — feat: add gen_ai.usage token breakdown to conclusion spans (Merged)
- #30198 — Add service.version to setup job spans via compiler env injection (Merged)
- #30215 — fix(otlp): add standard resource attributes to logSpan tool spans (Merged)
- #30273 — feat: emit gh-aw.detection.conclusion and gh-aw.detection.reason as OTLP span attributes (Merged)
- #30350 — fix(otel): eliminate gen_ai.usage.* double-counting and gen_ai.request.model duplicate on agent span (Merged)

C2 · Auto-generated 'Fix failing GitHub Actions job' PRs

Size: 26 PRs (2.3% of total)
Merge rate: 76.9% (20 merged, 6 closed unmerged, 0 open)
Diff footprint: avg 51.5 files, +264/-60 lines
Iteration: avg 3.2 commits, 0.3 comments, 1.0 reviews per PR
Cycle time: avg 1.2h from open to close
Top TF-IDF terms: job, analyze logs, failure implement, progress failing, job url, implement check, logs identify, identify root
Theme: Boilerplate '[WIP] Fix failing GitHub Actions job X' template — formulaic. Lowest comment count (0.27 avg) suggests these are batched and resolved with little human review.
Representative PRs:
- #32003 — [WIP] Fix failing GitHub Actions job lint-go (Merged)
- #32004 — [WIP] Fix failing GitHub Actions job Lint Gate (Merged)
- #32036 — [WIP] Fix failing GitHub Actions job lint-js (Merged)
- #32041 — [WIP] Fix failing GitHub Actions job for CLI completion (Merged)
- #32042 — [WIP] Fix failing GitHub Actions job lint-js (Merged)

Daily PR volume by cluster (last 20 days)

2D projection of prompts (TF-IDF → SVD)

Each point is one PR; color = cluster assignment. Overlap is real — the prompt vocabulary is shared across themes, and silhouette confirms only weak separation.

Sample PR table (80 rows, most recent first; all rows shown are from C0)

PR	Cluster	State	Files	Commits	Title
#33944	C0	Open	3	3	Fix Step Name Alignment manifest path to avoid workspace access denials
#33896	C0	Merged	4	3	Auto-start docs server and gate agent execution on server readiness
#33827	C0	Merged	1	2	optimize(pr-code-quality-reviewer): ~290K token/run reduction
#33826	C0	Merged	2	1	feat(issue-monster): prioritize community-labeled issues first
#33817	C0	Closed	0	1	[WIP] Refactor to extract progressive disclosure guidelines into shared compo...
#33781	C0	Closed	0	1	[WIP] Refactor workflows to adopt github-guard-policy.md
#33755	C0	Closed	14	5	Precompute Daily Semgrep findings before agent execution
#33753	C0	Merged	234	3	[ab-advisor] A/B experiment: sub_agent_strategy for agent-persona-explorer
#33699	C0	Merged	5	2	fix(model-inventory): 2026-05-21 — critical gpt-5.1-codex-mini multiplier fix...
#33661	C0	Merged	240	5	Update model alias inventory and ET multiplier registry for 2026-05-21
#33657	C0	Merged	2	2	feat: PR triage agent reads customer triage rules from .github/triage.md at r...
#33655	C0	Merged	2	3	contribution-check: offload report formatting and comment routing to small-mo...
#33646	C0	Merged	87	2	Sync lock files with MinDiscussionBodyLength schema change; confirm formattin...
#33629	C0	Merged	9	2	feat: show effective-token delta per MCP tool call in agent log
#33628	C0	Merged	2	2	feat(token-usage): per-turn rows with ΔET and compounded ET in step summary
#33625	C0	Merged	7	2	fix: set per-workflow token budgets and narrow file-glob patterns in meta-orc...
#33623	C0	Merged	2	2	Add OTLP data quality validator workflow for end-to-end telemetry integrity c...
#33596	C0	Merged	7	2	Normalize report formatting guidelines across 7 agentic workflows
#33595	C0	Merged	5	4	fix: guard create_discussion against PLACEHOLDER-only bodies
#33570	C0	Merged	2	2	Update Daily OTel Advisor to use shared Sentry/Grafana OTEL MCP imports
#33540	C0	Merged	2	2	Add `sub_agent_strategy` A/B experiment to `smoke-gemini` workflow
#33523	C0	Merged	1	4	Improve Daily Reliability Review readability with progressive disclosure
#33430	C0	Merged	5	3	Add missing `gemini-3.5-flash` ET multiplier to model inventory
#33368	C0	Merged	3	5	Collapse generated footer install instructions behind details/summary disclosure
#33363	C0	Merged	2	4	Reduce CLI Consistency Checker token usage via pre-agent help capture and pro...
#33335	C0	Merged	3	4	Normalize report-formatting guidance across reporting workflows
#33314	C0	Merged	3	5	Harden daily experiment reporting with run→branch state verification
#33296	C0	Merged	2	6	Add `prompt_compression` A/B experiment and `caveman` prompt variant to agent...
#33247	C0	Merged	5	3	Reduce Step Name Alignment agent turns via deterministic pre-agent manifest
#33221	C0	Merged	2	3	Multi-Device Docs Tester: move Astro server startup to pre-agent steps to unb...
#33220	C0	Merged	6	2	Normalize report-style guidance across non-compliant issue/report workflows
#33218	C0	Merged	1	2	[aw] Prevent Step Name Alignment from using invalid `gh search issues --state...
#33179	C0	Merged	3	3	Optimize CLI Consistency Checker via inline small-model sub-agents
#33177	C0	Merged	7	2	Model inventory 2026-05-19: add `raptor-mini` alias coverage and missing GPT-...
#33152	C0	Merged	2	3	Wrap experiment assignment summary in collapsible details block
#33125	C0	Merged	2	2	feat(pr-sous-chef): run formatters and push to branch
#33085	C0	Merged	3	4	Trim token spend in Matt Pocock skills reviewer workflow
#32939	C0	Merged	2	2	fix(model-inventory): enrich /reflect null models via models_url fallback
#32904	C0	Merged	2	3	Add UK AI operational resilience workflow with recent-change triage and sub-a...
#32861	C0	Merged	1	2	Add log-triage and workflow-file-scanner inline sub-agents to q.md
#32836	C0	Merged	3	2	Normalize report-formatting guidance across non-compliant reporting workflows
#32802	C0	Merged	2	2	feat(daily-semgrep-scan): add semgrep_output_format A/B experiment
#32771	C0	Merged	3	7	Add shared AgentDB MCP import and wire deep-report for large-scale discussion...
#32746	C0	Closed	2	3	lint-monster: skip-if recent open issues (24h), single agent session
#32742	C0	Merged	3	4	Refine ET budget exhaustion message for scanability, link fidelity, and optim...
#32735	C0	Merged	4	2	[model-inventory] Register 2026-05-17 OpenAI/Gemini model variants in ET mult...
#32689	C0	Merged	2	4	fix: remove screenshot requirement from Documentation Unbloat workflow (#32666)
#32687	C0	Merged	2	1	Add daily LintMonster workflow for custom linter triage and agent-driven reme...
#32677	C0	Merged	39	3	Model alias inventory update 2026-05-16: add coding/vision aliases and claude...
#32650	C0	Merged	3	2	Add ASCII chart guidance and route chart requests in agentic-workflows dispat...
#32643	C0	Merged	3	3	Prevent Multi-Device Docs Tester from self-terminating during cleanup
#32642	C0	Merged	1	2	[workflow-style] Normalize Daily Go Function Namer issue-body header formatting
#32641	C0	Merged	2	3	Optimize `spec-enforcer` prompt with inline small-model sub-agents
#32636	C0	Merged	1	3	Add sub_agent_strategy, caveman_mode, and model_size variant types to ab-test...
#32630	C0	Merged	2	2	Add prompt_style A/B experiment to blog-auditor workflow
#32607	C0	Merged	1	1	Prevent empty issue creation in Smoke OTEL Backends
#32535	C0	Merged	3	4	Add `output_format` A/B experiment to daily-code-metrics workflow
#32534	C0	Closed	2	7	feat: wire analysis_type, tags, and notify into pick_experiment runtime
#32531	C0	Merged	1	2	Prevent Linter Miner runs from completing without a terminal safe output
#32506	C0	Merged	230	3	Raise Daily Observability workflow ET budget to prevent proxy-enforced exhaus...
#32442	C0	Closed	1	5	Quick Start: define frontmatter early, clarify `.lock.yml` ownership, and uni...
#32416	C0	Merged	1	4	Optimize daily AgentRx trace workflow with three inline small-model sub-agents
#32415	C0	Merged	2	4	Optimize linter-miner token usage via preloaded context, narrower mining wind...
#32406	C0	Merged	2	2	feat(experiments): add output_format A/B test to daily-compiler-quality
#32403	C0	Merged	2	2	Route Agent of the Day chart PNGs through `upload-asset` and enforce text-onl...
#32364	C0	Merged	4	2	Generalize workflow-failure assignment instruction to be agent-agnostic
#32339	C0	Merged	2	2	Add `prompt_style` A/B experiment to `ci-coach` with concise vs detailed prom...
#32255	C0	Merged	2	1	Enable experiment-driven model selection in smoke-copilot workflow
#32253	C0	Merged	3	5	Expand audit-workflows repo memory taxonomy
#32238	C0	Merged	2	4	Reduce token pressure in Daily Observability Report workflow
#32237	C0	Merged	3	3	Prefetch Copilot reflect data before agent startup in Daily Model Inventory C...
#32211	C0	Merged	4	6	Improve ET budget exhaustion failure issue message
#32209	C0	Closed	2	3	Raise Daily Cache Strategy Analyzer ET budget above default 25M cap
#32122	C0	Closed	0	1	[WIP] Refactor to create shared agentic workflows tool for analysis
#32121	C0	Closed	0	1	[WIP] Create shared/github-proxy-default.md for toolset bundling
#32120	C0	Closed	0	1	[WIP] Create shared/strict-copilot.md for strict and copilot-requests
#32102	C0	Merged	63	3	feat(architecture-guardian): offload violation classification to small inline...
#32001	C0	Closed	0	1	[WIP] Recompile workflows to update lock files
#31953	C0	Closed	4	3	Reduce Multi-Device Docs Tester invocation burn and stop non-retriable 429 re...
#31927	C0	Merged	2	3	Add `detail_level` A/B experiment to daily architecture diagram workflow output

Full per-PR cluster assignments are in pr-clusters.csv (1108 rows).

Recommendations

Cut golden-file churn (C4). The compiler regenerates large lock files on most config touches. Either (a) make golden regeneration a separate workflow step that human reviewers can skip past, or (b) include only the relevant golden diff in the PR rather than the full regenerated set. A 3.1h average cycle time and a 32.8% non-merge rate (38 closed unmerged plus 3 still open, out of 125) are a clear signal.
Tighten safe-outputs PR prompts (C3). At 6.4 commits and 6.0 comments per PR, the first draft is rarely the final one. Audit the prompt templates that produce these PRs; they likely need clearer acceptance criteria or examples up front.
Audit the "[WIP] Fix failing GitHub Actions job" template (C2). Auto-generated with near-zero human comments. Either great (true automation) or invisible (slipping by review). Sample a handful to confirm.
Promote the bug-fix prompt pattern. C1's 85.7% merge rate is the bar to beat — narrow scope + concrete failure + targeted fix. The "did X, did Y" post-mortem format that shows up in top terms is working.
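For the C2 audit, a reproducible random sample is one way to pick the handful of PRs to read. The Python sketch below assumes pr-clusters.csv can be read with the standard csv module; the column names (pr, cluster, state) and the helper sample_cluster are hypothetical, since the report does not describe the file's schema. The demo rows reuse PR numbers listed in this report.

```python
import csv
import io
import random

def sample_cluster(csv_text, cluster_id, n, seed=0):
    """Return up to n randomly chosen rows belonging to cluster_id."""
    rows = [r for r in csv.DictReader(io.StringIO(csv_text))
            if r["cluster"] == cluster_id]
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    return rng.sample(rows, min(n, len(rows)))

# Stand-in for pr-clusters.csv; the column names are hypothetical.
demo_csv = """pr,cluster,state
32003,C2,Merged
32004,C2,Merged
32036,C2,Merged
33944,C0,Open
32041,C2,Merged
"""

for row in sample_cluster(demo_csv, "C2", n=3):
    print(row["pr"], row["state"])
```

Fixing the seed means a second reviewer can regenerate the same sample and check the same PRs.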

Methodology

Data: 1108 copilot-swe-agent PRs from pr-full-data/pr-*.json, opened 2026-05-02 – 2026-05-22.
Cleaning: stripped code fences, inline code, URLs, the firewall WARNING trailer, and the `` boilerplate. Kept body text only.
Vectorization: TF-IDF, 1–2 grams, min_df=5, max_df=0.7, 800 features, sublinear TF. Custom stopwords drop generic Copilot/AW noise (agent, workflow, fix, update, claude, copilot, etc.) to surface real topical signal.
Clustering: K-means (n_init=20, random_state=42). k selected via silhouette on a 500-PR sample over k∈[4,9]; pinned k=7 for interpretability since k=8,9 fragmented into <40-PR niches without separating C5.
Workflow turn counts were NOT integrated — log download (gh-aw logs) wasn't run for this analysis; cycle time stands in as a proxy for iteration cost.
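The cleaning step can be sketched with a few regular expressions. This is a minimal Python illustration, not the analysis script itself: the patterns and the clean_body helper are assumptions, and the firewall WARNING trailer and other boilerplate are left out because their exact text is not given in this report.

```python
import re

TICKS = "`" * 3  # fence delimiter, built at runtime

# Fenced code blocks, inline code spans, and bare http(s) URLs.
FENCE = re.compile(TICKS + r".*?" + TICKS, re.DOTALL)
INLINE = re.compile(r"`[^`\n]*`")
URL = re.compile(r"https?://\S+")

def clean_body(body: str) -> str:
    """Strip code fences, inline code, and URLs, then collapse whitespace."""
    text = FENCE.sub(" ", body)   # fences first, so their backticks are gone
    text = INLINE.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())

example = (
    "Fix stale assertion in `TestFoo`.\n"
    + TICKS + "go\nt.Fatal()\n" + TICKS
    + "\nLogs: https://example.com/run/1 attached"
)
print(clean_body(example))
```

Removing fences before inline code matters: otherwise the inline pattern could consume part of a fence delimiter and leave the code block's contents in the text that reaches the vectorizer.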

References:

Workflow run: §26282897632

Generated by 📊 Copilot Agent Prompt Clustering Analysis · ● 10.7M

expires on May 23, 2026, 11:00 AM UTC

2026-05-23T10:37:21Z

github-actions[bot]
Bot May 23, 2026
Author

This discussion has been marked as outdated by Copilot Agent Prompt Clustering Analysis.

A newer discussion is available at Discussion #34200.
