[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-05-02 #29739
Closed
Replies: 2 comments
💥 KA-BOOM! The smoke test agent was HERE! 🦸♂️ WHOOSH! Like a speeding bullet through the CI pipeline, the Claude engine blazed through all smoke tests at warp speed! ⚡ ZAP! ⚡ Run #25249938003 reporting for duty — all systems NOMINAL! The agentic workflows are strong with this one. POW! 🔥 To Be Continued...
This discussion was automatically closed because it expired on 2026-05-03T10:26:40.750Z.
Summary
Analysis Period: 2026-04-13 → 2026-05-02 (last 30 days)
Total PRs Analyzed: 996
Clusters Identified: 4
Overall Merge Success Rate: 77.8% (775 merged / 996 total)
Methodology: TF-IDF vectorization (300 features, bigrams) + K-means (k=4, chosen by silhouette score), applied to extracted task prompts and PR bodies. Charts were generated from the full dataset.
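For concreteness, a minimal scikit-learn sketch of this pipeline. Only the 300-feature cap, the bigrams, the k-means step, and the silhouette-based choice of k come from the summary above; the `documents` list, the English stop-word filter, and the k sweep range are illustrative assumptions.

```python
# Minimal sketch of the clustering pipeline (assumptions noted above).
# `documents` is assumed to be one string per PR: extracted task prompt + PR body.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_prompts(documents, k_range=range(2, 9)):
    # TF-IDF with unigrams + bigrams, capped at 300 features (per the summary).
    vectorizer = TfidfVectorizer(max_features=300, ngram_range=(1, 2),
                                 stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Choose k by silhouette score; the summary reports k=4 won this sweep.
    best = max(
        (KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_range),
        key=lambda km: silhouette_score(X, km.labels_),
    )
    return vectorizer, best, X
```

The same scaffold, re-run on only the C2 rows with a larger k range, would implement the sub-clustering suggested in the Recommendations below.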
Cluster Analysis
Cluster 1 (C0): WIP / Abandoned Plans — 0% success
All 24 PRs carry `[WIP]` titles and were closed without merging. Each has exactly one commit ("Initial plan") and zero file changes — the agent created a plan but never produced any code. Examples: `[WIP] Bundle Dependabot updates for Go dependencies`, `[WIP] Investigate and fix 403 errors in GitHub MCP get_me tool`, `[WIP] Fix performance regression in BenchmarkFindIncludesInContent`.
Top signals: `date form`, `work started`, `pr description`, `asking work`, `plan make`
These are false-start tasks — either the agent got blocked by the firewall before writing code, or the task scope was too ambiguous for an initial plan to proceed.
Cluster 2 (C1): Code Review Response — 97.1% success ✅
The highest-success cluster by a large margin. These tasks arise directly from `@copilot` comment threads or `Prompt: > ...` instructions responding to prior reviews. The tasks are tightly scoped: address specific reviewer feedback, fix a named issue, add missing tests.
Top signals: `review comments`, `comments`, `review`, `fix`, `add`, `tests`, `remove`
Sample PRs: #26023 (refactor helpers), #26056 (retry jitter), #26137 (token propagation), #26148 (deterministic audit metrics), #26157 (upload_artifact fix)
Why it succeeds: The prompts come after a human has already reviewed the code. They are precise, bounded, and context-rich. The agent doesn't need to discover the problem — it just needs to implement the solution.
Cluster 3 (C2): General Workflow & Infrastructure — 76.0% success
The dominant cluster — nearly 4 in 5 copilot PRs fall here. This catch-all covers bug fixes, dependency bumps, CI configuration, docs, new features, and refactors — anything that doesn't fit the other patterns. 76% success means roughly 1 in 4 of these tasks doesn't merge.
Top signals: `workflow`, `run`, `workflows`, `pr`, `agent`, `update`, `file`, `files`
Sample PRs: #26037 (GH_HOST fix), #26046 (CI cleanup), #26059 (crypto bump), #26073 (docs guide), #26074 (Docker validation)
This cluster's lower success rate likely reflects scope variety: some tasks are well-defined (dependency bumps merge almost always), while others involve exploratory problem-solving with higher failure risk.
Cluster 4 (C3): Merge + Recompile Operations — 89.3% success
High-volume mechanical operations: merging `main` branch updates and recompiling hundreds of lock files. Despite touching the most files on average (67), these tasks succeed at 89.3% — confirming the agent handles bulk mechanical changes reliably.
Top signals: `merge main`, `recompile`, `merge`, `main recompile`, `review comments`
Sample PRs: #26060 (AWF proxy Gemini routing), #26113 (env field support), #26229 (model-not-supported detection), #26450 (architecture refactor), #26482 (annotated tag peeling)
The partial overlap with C1 keywords (`review comments`) suggests some of these are post-review recompile tasks — merge + recompile after review approval.
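The per-cluster "top signals" quoted throughout this analysis can be read straight off the fitted model; a sketch, assuming the `vectorizer` and k-means model returned by the pipeline sketch in the Summary:

```python
# Rank each cluster's TF-IDF features by centroid weight to get its "top signals".
import numpy as np

def top_signals(vectorizer, km, n_terms=8):
    terms = vectorizer.get_feature_names_out()
    # Sort each centroid row's feature weights in descending order,
    # then keep the first n_terms indices per cluster.
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    return {c: [terms[i] for i in order[c, :n_terms]]
            for c in range(km.n_clusters)}
```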
View Charts
PCA 2D Projection (all 996 PRs): [chart image]
PR Outcomes by Cluster: [chart image]
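The PCA projection can be regenerated from the same objects; a sketch, assuming `X` and the fitted model `km` from the pipeline sketch above (scikit-learn's PCA needs a dense array, hence `.toarray()`):

```python
# Project the 300-dim TF-IDF vectors to 2D and colour points by cluster.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2, random_state=42).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=km.labels_, s=8, cmap="tab10")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D projection of 996 PR prompts")
plt.savefig("pca_projection.png", dpi=150)
```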
Success Rate Summary Table

| Cluster | Label | PRs | Merge Success |
|---------|-------|-----|---------------|
| C0 | WIP / Abandoned Plans | 24 | 0% |
| C1 | Code Review Response | n/a | 97.1% |
| C2 | General Workflow & Infrastructure | 779 | 76.0% |
| C3 | Merge + Recompile Operations | n/a | 89.3% |
| All | Full dataset | 996 | 77.8% |
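The table reduces to a one-line groupby; a sketch, where `merged_flags` (one boolean per PR, in the same order as `documents`) is an assumed input:

```python
# Per-cluster PR counts and merge success rates from the fitted labels.
import pandas as pd

df = pd.DataFrame({"cluster": km.labels_, "merged": merged_flags})
table = df.groupby("cluster")["merged"].agg(prs="size", success_rate="mean")
table["success_rate"] = (table["success_rate"] * 100).round(1)
print(table)
```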
Key Findings
Review-driven tasks are nearly infallible (97.1%): When a human has already identified the problem and the agent is explicitly told what to fix, success is near-certain. Tight scope + clear context = the highest-quality task type.
WIP PRs are 100% failures: all 24 plan-only PRs closed without merging. These are likely firewall-blocked tasks or over-ambitious scope that stalls in the planning phase. Identifying these early could save cycles.
Bulk mechanical changes succeed despite high file counts: The merge+recompile cluster (C3) averages 67 files and 921 lines but still achieves 89.3% success, confirming the agent's reliability for large-but-predictable operations.
76% of general tasks merge: The broad C2 cluster has a non-trivial 24% non-merge rate. These represent the most variable task quality in the dataset.
Recommendations
Prefer prompt patterns like C1: When writing new `@copilot` tasks, front-load with specific context — what already exists, what changed, what reviewer feedback must be addressed. Tasks phrased as review responses consistently produce the best outcomes.
Investigate WIP stalls (C0): The 24 `[WIP]` PRs that never produced code changes should be audited. If firewall blocks are the cause, pre-authorising the relevant endpoints in setup steps would recover these tasks.
Sub-cluster C2 for better insights: The 779-PR general cluster is too broad for targeted improvement. A follow-up pass at k=8–10 within C2 would separate dependency bumps (likely high success) from exploratory bug-fixes (likely lower success) and enable more targeted optimisations.
Track C0 rate as a health metric: The fraction of PRs that are "plan-only" (0 code changes) is a leading indicator of blocked or unclear tasks. Monitoring this weekly would surface recurring blockers early.
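A sketch of what that weekly check could look like against the GitHub REST API; `OWNER/REPO` is a placeholder, and a real version would also filter to Copilot-authored PRs and the reporting window:

```python
# Estimate the "plan-only" rate: closed, unmerged PRs with zero file changes.
import requests

API = "https://api.github.com/repos/OWNER/REPO"  # placeholder repo
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for real rate limits

def plan_only_rate(per_page=100):
    prs = requests.get(f"{API}/pulls",
                       params={"state": "closed", "per_page": per_page},
                       headers=HEADERS).json()
    plan_only = 0
    for pr in prs:
        # The list endpoint omits changed_files; fetch each PR's detail view.
        detail = requests.get(pr["url"], headers=HEADERS).json()
        if not detail.get("merged") and detail.get("changed_files") == 0:
            plan_only += 1
    return plan_only / max(len(prs), 1)
```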