Copilot agent work in github/gh-aw splits into one large general-engineering bucket plus seven tight domain clusters tracking gh-aw's own subsystems (safe-outputs, firewall, AI-credit guardrails, Copilot SDK, model resolution). Merge rates are remarkably uniform (78–82%) across seven of the eight themes, so outcome is driven less by what the task is than by execution. The one clear underperformer is WIP "fix failing Actions job" tasks at 68%.
Key Findings
Throughput is healthy and consistent. 80% of resolved agent PRs merge (790 of the 988 merged or closed; 12 are still open), and seven of eight clusters fall in a narrow 78–82% band. Task category is a weak predictor of success: the pipeline handles refactors, infra plumbing, and reporting workflows about equally well.
One catch-all dominates volume. Cluster C4 (general refactors / schemas / tests / docs) is 36% of all tasks. Its keywords (schema, docs, test, string, remove, coverage) show heavy investment in code hygiene and cleanup of reinvented helpers, e.g. the truncation/os.Stat consolidation in "[WIP] Refactor inline string-truncation reinventions and file-existence idioms" #41191.
The rest of the work mirrors gh-aw's architecture. Distinct, coherent clusters map 1:1 to subsystems: agent-context/prompt tuning (22%), AI-credit & token guardrails (13%), safe-outputs/MCP (9%), Copilot SDK/driver (8%), firewall/AWF network-isolation (6%), and model-resolution audits (4%).
Clustering: K-means, k chosen by silhouette over k∈[3,8] → k=8. Silhouette scores are low (0.01–0.03), expected for short engineering text — clusters are thematically coherent (see keywords) but not geometrically well-separated; treat boundaries as soft.
Limitation — no turn/cost metrics: these are GitHub Copilot coding-agent PRs, not gh-aw workflow runs, so per-task turn counts, duration, and AIC cost from aw_info.json do not map to individual PRs and were intentionally omitted rather than estimated. The stale pr-full-data/ cache (May, different PR numbers) was not used.
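For reference, the silhouette coefficient behind the 0.01–0.03 figures is conventionally defined as below. This is the standard textbook definition, not something spelled out in the analysis; the symbols are introduced here for illustration. For a task i assigned to cluster C_i, with d a distance between TF-IDF vectors:

```latex
a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\ j \neq i} d(i, j), \qquad
b(i) = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j), \qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}
```

The score for a given k is the mean of s(i) over all tasks and lies between -1 and 1. Values near 0, as here, mean a typical task is about as close to its nearest neighbouring cluster as to its own, which is why the boundaries are best treated as soft.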
Recommendations
Target CI-repair tasks (C0). This is the only cluster below the 78–82% band (68%), and its tasks are visibly repetitive. Give "fix failing job X" prompts the failing run's logs and a reproduction command up front instead of just the job name; these tasks appear to fail for lack of diagnostic context, not capability.
Sub-segment the C4 catch-all. At 36% it hides distinct work types (truncation/idiom refactors vs. schema vs. test coverage). A future run could re-cluster C4 alone for sharper prompt-pattern insights.
Keep the domain workflows as-is. Safe-outputs, firewall, AIC-guardrail and model-resolution clusters all sit at 78–82% — no category-level intervention warranted; gains there are about reducing the ~20% close rate via clearer acceptance criteria, not retargeting.
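The first recommendation (front-loading logs and a reproduction command into CI-repair prompts) could be implemented along these lines. This is a sketch only: the function name, the prompt wording, and the 200-line log cap are hypothetical, and fetching the failing run's log (for example via the GitHub CLI) is left to the caller.

```python
def build_ci_repair_prompt(job_name: str, failed_log: str, repro_command: str,
                           max_log_lines: int = 200) -> str:
    """Assemble a task prompt that carries diagnostic context, not just a job name."""
    # Keep only the tail of the log: the failure is usually reported near the end.
    tail = failed_log.strip().splitlines()[-max_log_lines:]
    return "\n".join([
        f"Fix the failing GitHub Actions job '{job_name}'.",
        "",
        "Reproduce locally with:",
        f"    {repro_command}",
        "",
        f"Last {len(tail)} lines of the failing run's log:",
        *tail,
    ])


# Illustrative inputs only; the job name, log text, and command are made up.
prompt = build_ci_repair_prompt(
    job_name="build",
    failed_log="step 1 ok\nstep 2 FAILED: exit 1",
    repro_command="make build",
)
print(prompt)
```

Capping the log keeps the prompt small while preserving the lines most likely to name the failing step.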
Summary
Analysis Period: last 30 days (2026-06-03 → 2026-06-24)
Tasks Analyzed: 1,000 Copilot coding-agent PRs (app/copilot-swe-agent)
Clusters Identified: 8 (TF-IDF + K-means, title weighted 3×)
Overall Merge (Success) Rate: 80.0% (790 merged · 198 closed · 12 open)
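The 80.0% headline matches the counts only if open PRs are left out of the denominator: 790 / (790 + 198) ≈ 80.0%, whereas dividing by all 1,000 PRs gives 79.0%. A minimal sketch of that arithmetic, assuming this is how the rate was computed (the source does not say so explicitly; the function name is illustrative):

```python
def merge_rate(merged: int, closed: int) -> float:
    """Share of resolved PRs (merged, or closed without merging) that merged."""
    resolved = merged + closed
    if resolved == 0:
        raise ValueError("no resolved PRs")
    return merged / resolved


merged, closed, open_prs = 790, 198, 12
rate_resolved = merge_rate(merged, closed)        # open PRs excluded
rate_all = merged / (merged + closed + open_prs)  # open PRs counted as not merged
print(f"{rate_resolved:.1%} of resolved, {rate_all:.1%} of all")
# prints "80.0% of resolved, 79.0% of all"
```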
Cluster Analysis (8 clusters, full detail)
Merge rate by cluster (sorted)
Themes, keywords & representative PRs
General refactors / schemas / tests / docs (C4): schema, docs, test, repo, string, remove, coverage · ex docs: spec audit — add github README, update fileutil/constants/timeutil/tty specs #38848; docs(cost-management): replace all tables with headers and lists #38224; Replace non-idiomatic len(string) == 0 checks flagged by lenstringzero #38015
Agent-context / prompt tuning: context, failure, guidance, prompt, agent, step, report · ex Remove redundant python-dataviz imports from daily reporting workflows #41158; Guard startup terminal probing on Windows when stderr is redirected #37823; Align mcp/secrets unknown-subcommand behavior with root CLI errors #37935
AI-credit & token guardrails: aic, credits, usage, token, daily, guardrail, max · ex Backfill firewall activity reports from usage artifact domain aggregates #41046; Lower daily credit-limit guardrail test to 1 AI credit #37631; Preserve agent AIC in create-issue footer breakdown #37464
Safe-outputs / MCP: safe outputs, safe output, outputs, mcp, tool · ex Add logging to publish-safe-outputs-node scripts #39085; [aw] Make spec-librarian reliably emit a safe output #37321; Tighten safe-outputs noop contract for prompt-omission scenarios #37122
Copilot SDK / driver: copilot, sdk, driver, permission, auth · ex Add multi-language Copilot SDK driver samples and wire daily workflows to exercise runtime installs #36734; Fix Copilot SDK sample driver BYOK session configuration in Daily Model Inventory workflow #37454; feat: add-wizard prompts Copilot users to choose copilot-requests (org billing) vs PAT #38449
Firewall / AWF network isolation: awf, firewall, domains, bump, blocked · ex chore: bump CLI tool versions (Claude 2.1.178, Copilot 1.0.63, Codex 0.140.0, Pi 0.79.4, GH MCP Server v1.3.0, Playwright v1.61.0) #39624; chore: bump Claude Code 2.1.178→2.1.179, Pi 0.79.4→0.79.6 #39772; Fix AIC usage cache always empty in activation job #39130
Model-resolution audits: model, experiment, alias, models, sub · ex Add summary_detail A/B experiment to dependabot campaign and support guardrail direction metadata #37563; Switch model refresh to models.dev catalog schema, consume native cost fields, and add dispatch-based refresh PR workflow #37055; Fix workflow test expectations for strict-mode deprecations and daily model inventory prompt text #37128
CI repair (C0): failing actions, actions job, wip failing, integration · ex [WIP] Fix failing GitHub Actions job 'Integration: Workflow Features' #41153; [WIP] Fix failing GitHub Actions job Integration: Workflow Misc Part 2 #38265; [WIP] Fix the failing GitHub Actions job build #37674
Methodology & limitations
Data: copilot-prs.json, 1,000 PRs authored by app/copilot-swe-agent, created 2026-06-03 → 2026-06-24. All had non-empty bodies (avg 1,562 chars).
Text: title ([WIP] prefixes stripped, weighted 3×) + body with code fences, inline code, URLs, HTML, markdown and checkboxes removed; letters only, 1–2 char tokens dropped.
Vectorization: TF-IDF with max_features=600, min_df=3, max_df=0.5, English + domain stop-words (github, workflow, gh, aw, pkg, fix, add, ...).
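The text preparation described in the methodology can be sketched with the standard library alone. This is a minimal illustration, not the analysis's actual code: the function names, the regular expressions, and the choice to implement the 3× title weight by repeating the title tokens are all assumptions.

```python
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this block

WIP_PREFIX = re.compile(r"^\s*\[WIP\]\s*", re.IGNORECASE)
CODE_FENCE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
URL = re.compile(r"https?://\S+")
HTML_TAG = re.compile(r"<[^>]+>")
CHECKBOX = re.compile(r"\[[ xX]\]")
NON_LETTER = re.compile(r"[^a-z]+")


def clean_body(body: str) -> str:
    """Blank out code fences, inline code, URLs, HTML tags and checkboxes."""
    for pattern in (CODE_FENCE, INLINE_CODE, URL, HTML_TAG, CHECKBOX):
        body = pattern.sub(" ", body)
    return body


def to_tokens(text: str) -> list[str]:
    """Lowercase, keep letters only (this also drops markdown punctuation),
    and discard tokens of one or two characters."""
    words = NON_LETTER.sub(" ", text.lower()).split()
    return [w for w in words if len(w) > 2]


def build_document(title: str, body: str, title_weight: int = 3) -> str:
    """Return one string per PR: title tokens repeated, then body tokens."""
    title = WIP_PREFIX.sub("", title)
    tokens = to_tokens(title) * title_weight + to_tokens(clean_body(body))
    return " ".join(tokens)
```

The resulting strings could then be passed to a TF-IDF vectorizer configured with the parameters listed above. Those names match scikit-learn's TfidfVectorizer, though the source does not name the library it used.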
References: §28092507871