[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-05-02 #29739
Closed
Replies: 2 comments
💥 KA-BOOM! The smoke test agent was HERE! 🦸♂️ WHOOSH! Like a speeding bullet through the CI pipeline, the Claude engine blazed through all smoke tests at warp speed! ⚡ ZAP! ⚡ Run #25249938003 reporting for duty — all systems NOMINAL! The agentic workflows are strong with this one. POW! 🔥 To Be Continued...
This discussion was automatically closed because it expired on 2026-05-03T10:26:40.750Z.
Summary
Analysis Period: 2026-04-13 → 2026-05-02 (last 30 days)
Total PRs Analyzed: 996
Clusters Identified: 4
Overall Merge Success Rate: 77.8% (775 merged / 996 total)
Methodology: TF-IDF vectorization (300 features, bigrams) + K-means (k=4, chosen by silhouette score), applied to extracted task prompts and PR bodies. Charts were generated from the full dataset.
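For concreteness, a minimal scikit-learn sketch of this pipeline. Only the 300-feature cap, the bigrams, the k-means step, and the silhouette-based choice of k come from the summary above; the `documents` list, the English stop-word filter, and the k sweep range are illustrative assumptions.

```python
# Minimal sketch of the clustering pipeline (assumptions noted above).
# `documents` is assumed to be one string per PR: extracted task prompt + PR body.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_prompts(documents, k_range=range(2, 9)):
    # TF-IDF with unigrams + bigrams, capped at 300 features (per the summary).
    vectorizer = TfidfVectorizer(max_features=300, ngram_range=(1, 2),
                                 stop_words="english")
    X = vectorizer.fit_transform(documents)

    # Choose k by silhouette score; the summary reports k=4 won this sweep.
    best = max(
        (KMeans(n_clusters=k, n_init=10, random_state=42).fit(X) for k in k_range),
        key=lambda km: silhouette_score(X, km.labels_),
    )
    return vectorizer, best, X
```

The same scaffold, re-run on only the C2 rows with a larger k range, would implement the sub-clustering suggested in the Recommendations below.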
Cluster Analysis
Cluster 1 (C0): WIP / Abandoned Plans — 0% success
All 24 PRs carry `[WIP]` titles and were closed without merging. Each has exactly one commit ("Initial plan") and zero file changes — the agent created a plan but never produced any code. Examples: `[WIP] Bundle Dependabot updates for Go dependencies`, `[WIP] Investigate and fix 403 errors in GitHub MCP get_me tool`, `[WIP] Fix performance regression in BenchmarkFindIncludesInContent`.
Top signals: `date form`, `work started`, `pr description`, `asking work`, `plan make`
These are false-start tasks — either the agent got blocked by the firewall before writing code, or the task scope was too ambiguous for an initial plan to proceed.
Cluster 2 (C1): Code Review Response — 97.1% success ✅
The highest-success cluster by a large margin. These tasks arise directly from `@copilot` comment threads or `Prompt: > ...` instructions responding to prior reviews. The tasks are tightly scoped: address specific reviewer feedback, fix a named issue, add missing tests.
Top signals: `review comments`, `comments`, `review`, `fix`, `add`, `tests`, `remove`
Sample PRs: #26023 (refactor helpers), #26056 (retry jitter), #26137 (token propagation), #26148 (deterministic audit metrics), #26157 (upload_artifact fix)
Why it succeeds: The prompts come after a human has already reviewed the code. They are precise, bounded, and context-rich. The agent doesn't need to discover the problem — it just needs to implement the solution.
Cluster 3 (C2): General Workflow & Infrastructure — 76.0% success
The dominant cluster — nearly 4 in 5 copilot PRs fall here. This catch-all covers bug fixes, dependency bumps, CI configuration, docs, new features, and refactors — anything that doesn't fit the other patterns. 76% success means roughly 1 in 4 of these tasks doesn't merge.
Top signals: `workflow`, `run`, `workflows`, `pr`, `agent`, `update`, `file`, `files`
Sample PRs: #26037 (GH_HOST fix), #26046 (CI cleanup), #26059 (crypto bump), #26073 (docs guide), #26074 (Docker validation)
This cluster's lower success rate likely reflects scope variety: some tasks are well-defined (dependency bumps merge almost always), while others involve exploratory problem-solving with higher failure risk.
Cluster 4 (C3): Merge + Recompile Operations — 89.3% success
High-volume mechanical operations: merging `main` branch updates and recompiling hundreds of lock files. Despite touching the most files on average (67), these tasks succeed at 89.3% — confirming the agent handles bulk mechanical changes reliably.
Top signals: `merge main`, `recompile`, `merge`, `main recompile`, `review comments`
Sample PRs: #26060 (AWF proxy Gemini routing), #26113 (env field support), #26229 (model-not-supported detection), #26450 (architecture refactor), #26482 (annotated tag peeling)
The partial overlap with C1 keywords (`review comments`) suggests some of these are post-review recompile tasks — merge + recompile after review approval.
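The per-cluster "top signals" quoted throughout this analysis can be read straight off the fitted model; a sketch, assuming the `vectorizer` and k-means model returned by the pipeline sketch in the Summary:

```python
# Rank each cluster's TF-IDF features by centroid weight to get its "top signals".
import numpy as np

def top_signals(vectorizer, km, n_terms=8):
    terms = vectorizer.get_feature_names_out()
    # Sort each centroid row's feature weights in descending order,
    # then keep the first n_terms indices per cluster.
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]
    return {c: [terms[i] for i in order[c, :n_terms]]
            for c in range(km.n_clusters)}
```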
View Charts
PCA 2D Projection (all 996 PRs): [chart image]
PR Outcomes by Cluster: [chart image]
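The PCA projection can be regenerated from the same objects; a sketch, assuming `X` and the fitted model `km` from the pipeline sketch above (scikit-learn's PCA needs a dense array, hence `.toarray()`):

```python
# Project the 300-dim TF-IDF vectors to 2D and colour points by cluster.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2, random_state=42).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=km.labels_, s=8, cmap="tab10")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA 2D projection of 996 PR prompts")
plt.savefig("pca_projection.png", dpi=150)
```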
Success Rate Summary Table

| Cluster | Label | PRs | Merge Success |
|---------|-------|-----|---------------|
| C0 | WIP / Abandoned Plans | 24 | 0% |
| C1 | Code Review Response | n/a | 97.1% |
| C2 | General Workflow & Infrastructure | 779 | 76.0% |
| C3 | Merge + Recompile Operations | n/a | 89.3% |
| All | Full dataset | 996 | 77.8% |
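The table reduces to a one-line groupby; a sketch, where `merged_flags` (one boolean per PR, in the same order as `documents`) is an assumed input:

```python
# Per-cluster PR counts and merge success rates from the fitted labels.
import pandas as pd

df = pd.DataFrame({"cluster": km.labels_, "merged": merged_flags})
table = df.groupby("cluster")["merged"].agg(prs="size", success_rate="mean")
table["success_rate"] = (table["success_rate"] * 100).round(1)
print(table)
```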
Key Findings
Review-driven tasks are nearly infallible (97.1%): When a human has already identified the problem and the agent is explicitly told what to fix, success is near-certain. Tight scope + clear context = the highest-quality task type.
WIP PRs are 100% failures: all 24 plan-only PRs closed without merging. These are likely firewall-blocked tasks or over-ambitious scope that stalls in the planning phase. Identifying these early could save cycles.
Bulk mechanical changes succeed despite high file counts: The merge+recompile cluster (C3) averages 67 files and 921 lines but still achieves 89.3% success, confirming the agent's reliability for large-but-predictable operations.
76% of general tasks merge: The broad C2 cluster has a non-trivial 24% non-merge rate. These represent the most variable task quality in the dataset.
Recommendations
Prefer prompt patterns like C1: When writing new `@copilot` tasks, front-load with specific context — what already exists, what changed, what reviewer feedback must be addressed. Tasks phrased as review responses consistently produce the best outcomes.
Investigate WIP stalls (C0): The 24 `[WIP]` PRs that never produced code changes should be audited. If firewall blocks are the cause, pre-authorising the relevant endpoints in setup steps would recover these tasks.
Sub-cluster C2 for better insights: The 779-PR general cluster is too broad for targeted improvement. A follow-up pass at k=8–10 within C2 would separate dependency bumps (likely high success) from exploratory bug-fixes (likely lower success) and enable more targeted optimisations.
Track C0 rate as a health metric: The fraction of PRs that are "plan-only" (0 code changes) is a leading indicator of blocked or unclear tasks. Monitoring this weekly would surface recurring blockers early.
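A sketch of what that weekly check could look like against the GitHub REST API; `OWNER/REPO` is a placeholder, and a real version would also filter to Copilot-authored PRs and the reporting window:

```python
# Estimate the "plan-only" rate: closed, unmerged PRs with zero file changes.
import requests

API = "https://api.github.com/repos/OWNER/REPO"  # placeholder repo
HEADERS = {"Accept": "application/vnd.github+json"}  # add an auth token for real rate limits

def plan_only_rate(per_page=100):
    prs = requests.get(f"{API}/pulls",
                       params={"state": "closed", "per_page": per_page},
                       headers=HEADERS).json()
    plan_only = 0
    for pr in prs:
        # The list endpoint omits changed_files; fetch each PR's detail view.
        detail = requests.get(pr["url"], headers=HEADERS).json()
        if not detail.get("merged") and detail.get("changed_files") == 0:
            plan_only += 1
    return plan_only / max(len(prs), 1)
```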