NLP clustering of 998 copilot-agent PRs opened in github/gh-aw over the last 30 days (data spans 2026-05-22 → 2026-06-09, ~18 active days). PR titles + bodies were cleaned, TF-IDF vectorized (unigrams+bigrams), and grouped with K-means. k=7 was selected within an interpretable 3–7 range (silhouette is near-flat ≈0.03, as expected for short text — themes were validated by manual inspection of representative PRs).
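For context on why a silhouette of about 0.03 counts as near-flat, the standard silhouette coefficient for a single point can be written as below. The symbols a(i) and b(i) are the usual textbook definitions, not notation taken from this report.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
% a(i): mean distance from point i to the other points in its own cluster
% b(i): smallest mean distance from point i to the points of any other cluster
% The reported score is the mean of s(i) over all clustered PRs.
```

A mean near 0 says that points sit about as close to a neighboring cluster as to their own, which is why the themes were checked by reading representative PRs.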
Total tasks analyzed: 998 (of 1,000 fetched; 2 too short to cluster)
Outcomes: 784 merged · 201 closed · 13 open
Overall success rate: 79.6% (merged ÷ decided)
Clusters identified: 7 coherent themes
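The success-rate definition above (merged ÷ decided) can be checked directly from the reported outcome counts. A minimal sketch; the variable names are illustrative only:

```python
# Outcome counts as reported above.
merged, closed, still_open = 784, 201, 13

total = merged + closed + still_open   # all PRs analyzed
decided = merged + closed              # open PRs have no outcome yet, so they are excluded

success_rate = merged / decided
print(f"{total} PRs, {decided} decided, success rate {success_rate:.1%}")
# prints "998 PRs, 985 decided, success rate 79.6%"
```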
Key Findings
Three themes account for 71% of all agent work — engine schema/validation (28%), CI & packaging (22%), and workflow/prompt authoring (21%). The agent is overwhelmingly used for internal gh-aw platform maintenance rather than feature work.
Model-pinning / SDK-driver tasks are the weakest spot (70% success, 30% closed). These PRs fix engine-model mismatches (e.g. "pin explicit Copilot model") and are disproportionately abandoned — a recurring, error-prone task class worth a durable fix rather than repeated agent patches.
Auto-generated "Fix failing GitHub Actions job" PRs are the least reliable (65% success, the lowest) and show almost no human discussion (0.3 comments). Many are [WIP] retries of the same failing job — high churn, low yield.
Dependency/firewall bumps are small in count (5%) but by far the most expensive — 141 files changed and 20.9 comments per PR on average (6× the mean). These are heavy, review-intensive PRs despite mostly mechanical intent.
Iteration count tracks task type, not success — tokens/footer and CI tasks converge in ~2.7–3.8 commits; schema and firewall work need 5–6.6. More iterations did not lower success, suggesting the agent iterates productively rather than thrashing.
Methodology & limitations
Source: pre-fetched copilot-prs.json (1,000 PRs, fresh 2026-06-09 state) joined with cached full-PR data (comments/commits/files). PR state was reconciled to the fresher snapshot.
Prompt signal: PR title + body, cleaned of code fences, checkbox task-lists, URLs, HTML, and numbers. The agent's original task prompt is not stored on the PR, so the PR description is used as a proxy.
Vectorization: TF-IDF, 1–2 grams, max_features=400, min_df=3, max_df=0.6, sublinear TF, with an extended domain stopword list (pr, copilot, agent, change, file, ...).
Clustering: K-means, k chosen over 3–7. Silhouette is near-flat (0.015→0.029) — inherent to short overlapping text — so themes were confirmed by inspecting representative PRs, not by the metric alone.
Iteration proxy: per-workflow turn/cost metrics (aw_info.json via gh-aw logs) were not fetched — these copilot PR numbers don't map to gh-aw engine runs. Commit count is used as an iteration proxy instead. This is the main limitation.
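The cleaning and vectorization steps described above can be sketched without third-party dependencies. This is an illustration under assumptions, not the pipeline that produced the report: the regular expressions, function names, and the smoothed IDF formula are my own choices, and the `max_features` cap, vector normalization, and the K-means step itself are left out for brevity.

```python
import math
import re
from collections import Counter


def clean_text(text: str) -> str:
    """Strip code fences, checkbox task-list items, URLs, HTML tags, and numbers."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)                        # fenced code
    text = re.sub(r"^[ \t]*[-*][ \t]*\[[ xX]\].*$", " ", text, flags=re.MULTILINE)  # task-list items
    text = re.sub(r"https?://\S+", " ", text)                                       # URLs
    text = re.sub(r"<[^>]+>", " ", text)                                            # HTML tags
    text = re.sub(r"\d+", " ", text)                                                # numbers
    return re.sub(r"\s+", " ", text).strip().lower()


def unigrams_and_bigrams(tokens):
    """Return every token plus every adjacent pair (the 1-2 gram setting)."""
    return list(tokens) + [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def tfidf(docs, stopwords=frozenset(), min_df=3, max_df=0.6):
    """Sublinear TF-IDF; returns one {term: weight} dict per document."""
    term_lists = []
    for doc in docs:
        tokens = [t for t in clean_text(doc).split() if t not in stopwords]
        term_lists.append(unigrams_and_bigrams(tokens))

    n = len(term_lists)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for terms in term_lists for term in set(terms))
    # Keep terms seen in at least min_df documents and at most a max_df share of them.
    vocab = {t for t, c in df.items() if c >= min_df and c / n <= max_df}

    vectors = []
    for terms in term_lists:
        counts = Counter(t for t in terms if t in vocab)
        vectors.append({
            # Sublinear TF (1 + ln tf) times a smoothed IDF (assumed formula).
            t: (1 + math.log(c)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in counts.items()
        })
    return vectors
```

Sublinear TF dampens terms that repeat many times in one long PR body, and the `max_df` ceiling drops terms so common that they cannot separate themes.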
Recommendations
Harden model/driver resolution (Cluster 4). 30% of model-pinning PRs are abandoned — recurring manual patches signal a systemic gap. Add a validation step that rejects unsupported engine/model combos at compile time so the agent doesn't keep re-fixing them.
Gate the "fix failing CI job" automation (Cluster 7). At 65% success with [WIP] retries and no human engagement, this loop produces low-value churn. Add a retry cap and require a root-cause summary before opening the PR.
Pre-stage dependency bumps (Cluster 6). These mechanical PRs cost 21 comments each. A templated bump checklist (schema sync + smoke domains pre-listed) would cut review back-and-forth.
Replicate the token/footer playbook (Cluster 5). The highest-success cluster is small, deterministic, and well-specified — a good template for scoping future agent tasks tightly.
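Two of the recommendations above (the compile-time engine/model check for Cluster 4 and the gated CI auto-fix for Cluster 7) lend themselves to small guard functions. The sketch below is hypothetical throughout: the engine and model names are placeholders rather than gh-aw's real support matrix, and the retry cap of 2 is an arbitrary example value.

```python
# Hypothetical allowlist; these names are placeholders, not real gh-aw engines or models.
SUPPORTED_MODELS = {
    "engine-a": {"model-x", "model-y"},
    "engine-b": {"model-z"},
}

MAX_FIX_ATTEMPTS = 2  # hypothetical retry cap for auto-fix PRs


class UnsupportedModelError(ValueError):
    """Raised when a workflow pins a model its engine cannot run."""


def validate_engine_model(engine: str, model: str) -> None:
    """Reject an unsupported engine/model combination before the workflow is compiled."""
    allowed = SUPPORTED_MODELS.get(engine)
    if allowed is None:
        raise UnsupportedModelError(f"unknown engine {engine!r}")
    if model not in allowed:
        raise UnsupportedModelError(
            f"engine {engine!r} does not support model {model!r}; "
            f"choose one of {sorted(allowed)}"
        )


def may_open_fix_pr(prior_attempts: int, root_cause_summary: str) -> bool:
    """Allow an auto-fix PR only under the retry cap and with a non-empty root-cause summary."""
    return prior_attempts < MAX_FIX_ATTEMPTS and bool(root_cause_summary.strip())
```

Failing fast with a message that lists the allowed models turns a recurring agent patch into a one-time authoring error, and the gate keeps a broken job from spawning repeated [WIP] retries.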
Cluster details & representative PRs
Cluster 1: Engine schema & validation (278 PRs, 79%)
Keywords: schema, behavior, validation, path, coverage. The largest theme: schema rendering, validation guards, and behavior coverage in the engine/AWF layer.
Representative PR: #34133, Create REQUEST_CHANGES review for create_pull_request threat-warning mode
Cluster 2: CI, packaging & analyzers (217 PRs, 81%)
Keywords: ci, package, fix, spec, analyzer. Large mechanical refactors (avg 7.7k additions): logger migrations, package consolidation, analyzer/spec fixes.
Representative PR: #34123, Consolidate workflow FieldLocation onto console ErrorPosition
Cluster 3: Workflow authoring & prompts (209 PRs, 83%)
Keywords: prompt, workflow, guidance, step, output. Smallest diffs (14 files): tuning .github/workflows/*.md agentic workflows, safe-output behavior, and prompt guidance.
Cluster 4: Engine/SDK driver & model pinning (123 PRs, 70% ⚠️)
Keywords: sdk, driver, model, workflow, engine. Lowest non-trivial success rate. Repeated fixes for model/driver mismatches and pinning supported models.
Representative PR: #34129, Fix Codex smoke workflow by preserving OPENAI_API_KEY in AWF container env
Cluster 5: AI tokens, cost & footers (105 PRs, 84% ✅)
Keywords: ai, effective, token, et, credits. Highest success rate. Well-scoped, deterministic work on token-usage footers and model-id labeling.
Cluster 6: Firewall & dependency bumps (46 PRs, 80%)
Keywords: domains, firewall, blocked, claude, smoke. Most expensive (141 files, 20.9 comments, 6.6 commits). Version bumps of CLIs/firewall + schema sync.
Cluster 7: Failing CI job auto-fixes (20 PRs, 65% ❌)
Keywords: actions, github actions, job, fix, failing. Lowest success, near-zero discussion: mostly [WIP] auto-retries of the same broken job.