[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-04-05 to 2026-04-25 #28491
This discussion has been marked as outdated by Copilot Agent Prompt Clustering Analysis. A newer discussion is available at Discussion #28629.
Summary
Analysis Period: 2026-04-05 → 2026-04-25 (20 days)
Total PRs Analyzed: 1,000
Clusters Identified: 8
Overall Merge Rate: 78.0%
Silhouette Score: 0.048 (NLP clustering of short code-change descriptions tends to produce low silhouette scores — the clusters are nonetheless semantically coherent)
The Copilot agent produced 1,000 pull requests over the 20-day analysis period. TF-IDF vectorization + K-Means (k=8) reveals eight distinct work categories, from code-quality improvements and workflow automation to security hardening and engine/CLI maintenance. Merge rates range from 59% to 84%, with validation/WIP tasks (C5) the least likely to merge and code-quality and documentation tasks (C1) the most likely.
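To make the TF-IDF + K-Means step concrete, here is a minimal, dependency-free sketch. It is not the pipeline that produced this report: the tokenizer, the smoothed IDF formula, the farthest-first initialisation, and the sample titles are all assumptions chosen for illustration, and a real run would more likely use a library implementation.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return one L2-normalised sparse vector ({term: weight}) per document."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption) so every term keeps a positive weight.
        vec = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in a.keys() | b.keys())


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = []
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean_vector(members)
    return labels


titles = [  # made-up titles, not PRs from the report
    "fix lint errors in tests",
    "fix lint errors in docs",
    "fix lint warnings in tests",
    "update engine cli version",
    "update engine version pin",
    "update gemini engine version",
]
labels = kmeans(tfidf(titles), k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The two toy groups share no vocabulary, so they separate cleanly. Real PR titles overlap heavily on words like fix, add, and feat, which is why the report's clusters are much less crisp.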
Cluster Overview
Cluster-by-Cluster Analysis
C1 — Code Quality & Documentation (195 PRs, 84% merged)
The largest cluster covers targeted improvements to tests, documentation, and code style. Tasks are well-scoped and almost always merge, indicating that the agent handles these precisely.
Representative PRs:
C2 — MCP / Gateway Tooling (101 PRs, 78% merged)
Tasks that add, fix, or deprecate MCP servers, the MCP gateway, and related CLI tooling. Several closed PRs in this cluster are duplicates (the agent retried the same feature task). The 78% merge rate equals the overall average, and the retried duplicates point to some instability in this rapidly evolving area.
Representative PRs:
C3 — Workflow & Agent Features (183 PRs, 81% merged)
The second-largest cluster encompasses new agent workflow features, audit tooling, cache-memory improvements, and daily analysis workflows. Healthy merge rate reflects core product development work.
Representative PRs:
C4 — Firewall / Rules / Migrations (150 PRs, 77% merged)
Large-scale migration tasks (e.g., migrating 24 workflows between patterns), firewall-rule updates, and dependency bumps. The slightly lower merge rate (77%) likely reflects the complexity of migration PRs that require manual coordination.
Representative PRs:
C5 — Validation & WIP Tasks (73 PRs, 59% merged)⚠️
The lowest-performing cluster. It captures tasks that involve pre-flight validation, GitHub Actions version updates, and work-in-progress features. The 59% merge rate indicates these tasks need iteration or are frequently superseded. Several PRs are explicitly tagged [WIP] or [actions] automation that is replaced before merging.
Representative PRs:
C6 — CI / Job / Step Fixes (106 PRs, 83% merged)
Fixes to GitHub Actions job definitions, step names, checkout behavior, and TypeScript type errors. High merge rate (83%) — focused bug fixes that the agent resolves accurately.
Representative PRs:
C7 — Safe Outputs & Security Hardening (89 PRs, 82% merged)
Security-focused tasks: protecting repository folders, sanitizing steganographic channels, fixing XPIA vectors, and improving safe-output handlers. 82% merge rate shows high-value security work that consistently lands.
Representative PRs:
C8 — Engine / CLI / Version Management (103 PRs, 68% merged)
Tasks that update engine versions (Copilot CLI, Gemini CLI, AWF), fix engine-specific runtime issues (node not found, npm EROFS, timeout configs), and manage multi-engine routing. The 68% merge rate is the second lowest — these tasks often require environmental fixes and multiple retries.
Representative PRs:
Timeline: Task Distribution Over Time
Cluster Selection Methodology (Elbow + Silhouette)
K=8 was selected based on the highest silhouette score (0.048) across k=3...11. TF-IDF on short code-change descriptions naturally yields low silhouette scores because many PRs share overlapping vocabulary (fix, add, feat); k=8 provides the best semantic granularity without over-fragmenting.
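The selection rule described above (score each candidate k, keep the one with the highest silhouette) can be sketched as follows. This is an illustrative, dependency-free version under stated assumptions: Euclidean distance, a score of 0 for singleton clusters, and precomputed label lists per k. The function names and toy points are hypothetical, and the toy compares k=2 and k=3 rather than the report's k=3...11 sweep.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient under Euclidean distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def pick_k(points, labelings):
    """Return the k whose labelling scores highest; labelings maps k -> labels."""
    return max(labelings, key=lambda k: silhouette_score(points, labelings[k]))


points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # toy data, two obvious groups
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
best_k = pick_k(points, candidates)
print(best_k)  # 2
```

A score near 1 means tight, well-separated clusters; a score near 0, as reported here, means clusters that overlap in feature space even if they read as coherent topics.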
Key Findings
Code Quality & Docs are the most common task type (C1 — 19.5% of PRs) and the highest-success cluster (84% merge rate). The agent excels at focused, well-scoped tasks like improving tests, adding docstrings, and fixing style issues.
Security hardening (C7) has a strong 82% merge rate despite its complexity. Tasks involving safe outputs protection, XPIA mitigation, and file sanitization consistently land — suggesting the security perimeter is well-defined enough for the agent to operate confidently.
Validation/WIP tasks (C5) are the weakest cluster at 59% merge rate. Many are [actions] automated updates or exploratory [WIP] branches that get closed without merging. These inflate the closed-PR count but represent normal exploratory work.
Engine & CLI version management (C8, 68% merge rate) involves environment-specific issues (GPU runners, npm permissions, model deprecations) that require iteration. Multi-retry patterns are visible in this cluster.
MCP tooling (C2, 78% merge rate) shows duplicate PRs — the same feature was attempted 2-3 times before landing. This suggests the agent benefits from clearer issue scoping when MCP server integration is involved.
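As a consistency check on the figures quoted in these findings, the per-cluster counts and merge rates from the cluster headings can be recombined into the headline numbers. The snippet below is only a cross-check. Because the per-cluster rates are rounded to whole percents, the overall rate it derives is an estimate, not the exact merged-PR count.

```python
# (PR count, merge rate) per cluster, copied from the cluster headings.
clusters = {
    "C1": (195, 0.84),
    "C2": (101, 0.78),
    "C3": (183, 0.81),
    "C4": (150, 0.77),
    "C5": (73, 0.59),
    "C6": (106, 0.83),
    "C7": (89, 0.82),
    "C8": (103, 0.68),
}

total_prs = sum(count for count, _ in clusters.values())
# Size-weighted average of the rounded per-cluster rates.
estimated_merged = sum(count * rate for count, rate in clusters.values())
overall_rate = estimated_merged / total_prs
c1_share = 100 * clusters["C1"][0] / total_prs

print(total_prs)              # 1000
print(f"{overall_rate:.1%}")  # 78.0%
print(f"{c1_share:.1f}%")     # 19.5%
```

The cluster sizes sum to the 1,000 PRs analyzed, and the size-weighted average reproduces the reported 78.0% overall merge rate.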
Recommendations
For C5 (Validation & WIP): Tag automated dependency-update PRs separately so they don't skew clustering. Consider a distinct workflow for [actions] version bumps that auto-merges when CI passes, reducing the noise of open→close cycles.
For C8 (Engine/CLI): Environment-specific failures (GPU runners, npm EROFS, missing binaries) cause multiple retry PRs. A lightweight pre-flight environment check step in the engine bootstrap could prevent common failures before the agent commits code.
For C2 (MCP Tooling): Duplicate PRs suggest the agent loses context between retries. Persisting a "PR already attempted" flag in cache-memory for active feature tasks would prevent redundant work.
Across all clusters: 78% overall merge rate is healthy. The 22% that don't merge are mostly WIP branches, exploratory probes, and retried duplicates — not outright failures. The agent's precision on scoped tasks (C1, C6, C7) is excellent.
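One way the pre-flight environment check recommended for C8 could look is sketched below. It is a hypothetical illustration, not part of the existing engine bootstrap: the function name `preflight`, the binaries checked, and the directories probed are all assumptions. It covers only two of the failure modes named above, missing binaries and unwritable (for example read-only) directories.

```python
import shutil
import tempfile


def preflight(required_binaries, writable_dirs):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for name in required_binaries:
        # shutil.which searches PATH the same way a shell would.
        if shutil.which(name) is None:
            problems.append(f"missing binary: {name}")
    for directory in writable_dirs:
        try:
            # Creating a throwaway file surfaces read-only mounts (EROFS),
            # permission errors, and missing directories alike.
            with tempfile.TemporaryFile(dir=directory):
                pass
        except OSError as exc:
            problems.append(f"cannot write to {directory}: {exc}")
    return problems


# Hypothetical requirements; adjust to whatever the engine really needs.
issues = preflight(["node", "npm"], [tempfile.gettempdir()])
for issue in issues:
    print(issue)
```

Returning a list rather than raising on the first failure lets a single run report every problem at once, which matters when each retry costs a new PR.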
Full Data Table (Top 8 PRs per cluster)
References: §24939388489