[prompt-clustering] Copilot Agent Prompt Clustering Analysis — 2026-06-10 #38340

2026-06-10T11:26:58Z

github-actions[bot]
Bot Jun 10, 2026

Summary

Analysis period: last 30 days (2026-05-24 → 2026-06-10)
Tasks analyzed: 1,000 Copilot coding-agent PRs (app/copilot-swe-agent)
Clusters identified: 8 (KMeans on TF-IDF, 1–2 grams)
Overall success rate: 80% (731 merged / 915 decided; 85 still open)

Eight coherent task themes emerge. A single bucket, workflow infrastructure & JS (.cjs) test work, accounts for ~39% of all agent tasks, while docs/skills and AI-credit guardrail work are the highest-success themes (85% each), with docs/skills also among the least discussed. The genuinely hard work concentrates in two small clusters: firewall/domain allowlist PRs (avg 117 files touched, ~20 comments each) and Copilot SDK driver/harness PRs (5.4 commits and 5.7 comments per PR, second only to the firewall cluster).

Cluster overview

Cluster	Theme	PRs	%	Success	Open	Avg files	Avg comments	Avg commits
C1	Workflow infra & JS tests (`.cjs`)	389	39%	77%	35	27	3.4	4.2
C5	Docs, skills & markdown workflows	207	21%	85%	10	15	1.4	3.4
C4	Go `pkg` / CLI / linters refactors	126	13%	77%	21	64	1.9	3.3
C0	AI-credit / cost guardrails	86	9%	85%	5	44	2.8	3.8
C6	Safe-outputs features	70	7%	77%	6	15	2.5	3.7
C3	Copilot SDK driver & harness	56	6%	78%	2	28	5.7	5.4
C2	Firewall / domain allowlists	44	4%	84%	6	117	19.7	5.6
C7	CI failure fixes (WIP)	22	2%	73%	0	16	0.5	2.5

Success rate = merged / (merged + closed); "Open" PRs are excluded from that ratio. Avg comments/commits are used as iteration/complexity proxies — see Methodology & limitations.
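As a worked check of the ratio above, here is a minimal Python sketch using the overall counts from the Summary. The function name `success_rate` is illustrative and not part of the analysis workflow.

```python
def success_rate(merged: int, closed: int) -> float:
    """Merged share of decided PRs; open PRs are left out entirely."""
    decided = merged + closed
    if decided == 0:
        raise ValueError("no decided PRs to compute a rate from")
    return merged / decided


# Overall counts from the Summary: 731 merged out of 915 decided, 85 open.
merged, decided, still_open = 731, 915, 85
closed = decided - merged  # 184 closed without merging

overall = success_rate(merged, closed)
print(f"{overall:.1%}")  # 79.9%, reported as 80%
```

Because open PRs never enter the denominator, a cluster with many open PRs (C1 has 35) can still move once those PRs are decided.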

Cluster details, top keywords & representative PRs

C1 — Workflow infra & JS tests (`.cjs`) · 389 PRs (39%) · 77% success

The dominant catch-all: GitHub Actions setup, the .cjs runtime scripts, model/engine wiring, and their tests.

Keywords: cjs, workflow, test, setup, aw, model, gh, added
Examples: #36676 (open, 45 comments) Compat-based Copilot CLI install · #35694 expose authHeader in sandbox.agent.targets · #38152 fix assertTrustedCheckoutRuntime for bot/app actors

C5 — Docs, skills & markdown workflows · 207 PRs (21%) · 85% success

Documentation, .md workflow definitions, skills, and daily-workflow prompts. Second-lowest discussion (1.4 comments per PR, after C7's 0.5) and joint-highest success.

Keywords: md, workflow, workflows, prompt, github, daily, docs, skill
Examples: #36748 portable agentic-workflow-designer skill · #34874 inline skill extraction/runtime · #34941 body hash in lock metadata

C4 — Go `pkg` / CLI / linters refactors · 126 PRs (13%) · 77% success

Core Go code: pkg/cli, pkg/workflow, analyzers, linters, dedup/refactors. High file counts (avg 64) but low discussion — mechanical refactors land cleanly.

Keywords: pkg, linters, pkg cli, pkg workflow, function, analyzer, cli
Examples: #37162 consolidate import-path resolution · #36010 refactor workflow helpers · #36012 ParseWorkflowFile into phases

C0 — AI-credit / cost guardrails · 86 PRs (9%) · 85% success

aic, AI-credit resolution, max-daily-ai-credits guardrails, effective-multiplier work.

Keywords: ai, credits, ai credits, max, effective, daily, aic
Examples: #38197 (32 comments) enforce AI-credit resolution order · #37101 max-daily-ai-credits guardrail · #37936 detect guardrail exhaustion from firewall log

C6 — Safe-outputs features · 70 PRs (7%) · 77% success

The safe-outputs subsystem: normalization, targeting, temporary_id, noop.

Keywords: safe, safe output, safe outputs, outputs, target, noop, prompt
Examples: #36701 per-output opt-in normalization · #38237 PR-targeting for create_check_run · #37469 enforce required temporary_id

C3 — Copilot SDK driver & harness · 56 PRs (6%) · 78% success

The experimental copilot-sdk driver, harness, stdin wiring, permission/threat-detection. Second most iteration-heavy theme after C2 (5.4 commits, 5.7 comments per PR).

Keywords: sdk, copilot sdk, copilot, driver, permission, harness
Examples: #37133 (open, 47 comments) two-phase SDK driver for threat detection · #36358 fix harness stdin wiring · #36549 refactor driver into self-contained Node program

C2 — Firewall / domain allowlists · 44 PRs (4%) · 84% success

Network egress: firewall config, domain allowlists (googleapis.com, etc.), proxy/cache-mount handling. By far the heaviest PRs — avg 117 files and ~20 comments each.

Keywords: com, google com, v0, domains, firewall, googleapis
Examples: #35836 (72 comments) CLI proxy mode + relax Smoke-Pi restrictions · #34837 (65 comments) model alias/multiplier propagation · #35802 tool-cache mount + cache-mem fixes

C7 — CI failure fixes (WIP) · 22 PRs (2%) · 73% success

Narrow "fix the failing GitHub Actions job" tasks, almost all titled [WIP]. Lowest success (73%) and almost no discussion (0.5 comments) — fast, single-purpose, sometimes abandoned.

Keywords: job, actions, github actions, fix, failing, pr description
Examples: #37886 CLI Docker build · #38265 Integration Workflow Misc Part 2 · #37891 lint-go

Key findings

Work is heavily skewed to harness/infra, not product code. C1 + C4 + C6 (workflow infra, Go refactors, safe-outputs) = 59% of all agent tasks. The agent is mostly maintaining the gh-aw machinery itself.
Documentation & cost-guardrail tasks are the sweet spot (both 85% success, at 1.4 and 2.8 comments per PR respectively). Well-scoped, low-ambiguity prompts merge reliably with little back-and-forth.
Two clusters carry the complexity tail. Firewall/allowlist PRs (C2) average 117 files, ~20 comments, and the most commits per PR (5.6); SDK-driver PRs (C3) follow with 5.4 commits and 5.7 comments per PR. These are where prompts spawn the longest iteration loops, and 2 of C3's PRs are still open.
"Fix the failing CI job" prompts are the weakest pattern (C7, 73% success). They're terse, reactive, and [WIP]-tagged; a non-trivial fraction get superseded or abandoned rather than merged.
More discussion correlates with bigger, riskier changes, not with failure. C2 has the most comments yet an above-average 84% merge rate — review volume tracks blast radius, not quality.
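The share and discussion claims above can be re-derived from the overview table. The following is a small sketch, with the rows copied by hand from that table; the variable names are illustrative only.

```python
# (PRs, success %, avg comments) per cluster, copied from the overview table.
clusters = {
    "C1": (389, 77, 3.4),
    "C5": (207, 85, 1.4),
    "C4": (126, 77, 1.9),
    "C0": (86, 85, 2.8),
    "C6": (70, 77, 2.5),
    "C3": (56, 78, 5.7),
    "C2": (44, 84, 19.7),
    "C7": (22, 73, 0.5),
}
OVERALL_SUCCESS = 80  # percent, from the Summary

total_prs = sum(prs for prs, _, _ in clusters.values())
infra_share = sum(clusters[c][0] for c in ("C1", "C4", "C6")) / total_prs

# Cluster with the most comments per PR, and the one with the lowest success.
most_discussed = max(clusters, key=lambda c: clusters[c][2])
least_successful = min(clusters, key=lambda c: clusters[c][1])

print(f"C1 + C4 + C6 share: {infra_share:.1%}")
print(f"most discussed: {most_discussed} at {clusters[most_discussed][1]}% success")
print(f"least successful: {least_successful}")
```

On these rows the combined share comes to 58.5% (59% when the rounded per-cluster percentages are summed), and the most-discussed cluster sits above the overall success rate.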

Recommendations

Decompose firewall/allowlist tasks (C2). 117-file, 20-comment PRs are a smell. Prompts in this theme should be scoped to one subsystem (e.g. "add domain X to allowlist Y") rather than broad "relax restrictions" asks, to shrink review load and blast radius.
Give CI-fix prompts (C7) more context. Replace bare "fix the failing job" with the failing log excerpt + suspected root cause; the 73% success and near-zero discussion suggest the agent is often guessing. Drop the reflexive [WIP] so abandoned drafts are distinguishable.
Template the high-success themes (C5/C0). Docs and credit-guardrail prompts already work well — capture their structure (clear deliverable, explicit files, acceptance criteria) as a reusable prompt template and apply it to weaker clusters.
Watch the SDK-driver loop (C3). It has the second-highest commits and comments per PR (behind C2) and open PRs lingering; consider splitting "driver + harness + tests" into staged prompts instead of one omnibus task.
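One way to make the C7 recommendation concrete is a small prompt builder that refuses to produce a bare "fix the failing job" ask. This is a hypothetical sketch: the `CiFixPrompt` class, its field names, and the rendered wording are all assumptions for illustration, not an existing gh-aw or Copilot interface.

```python
from dataclasses import dataclass


@dataclass
class CiFixPrompt:
    """Hypothetical container for the context a CI-fix prompt should carry."""

    job_name: str
    log_excerpt: str
    suspected_cause: str
    acceptance: str

    def render(self, max_log_lines: int = 20) -> str:
        # Keep only the tail of the log, where the failure usually surfaces.
        lines = self.log_excerpt.strip().splitlines()
        tail = "\n".join(lines[-max_log_lines:])
        return (
            f"Fix the failing GitHub Actions job: {self.job_name}\n\n"
            f"Failing log excerpt (last {max_log_lines} lines at most):\n"
            f"{tail}\n\n"
            f"Suspected root cause: {self.suspected_cause}\n"
            f"Acceptance criteria: {self.acceptance}\n"
        )


prompt = CiFixPrompt(
    job_name="<job name>",
    log_excerpt="<paste the failing log lines here>",
    suspected_cause="<one-sentence hypothesis>",
    acceptance="<what must pass before the PR is ready>",
)
print(prompt.render())
```

The same structure (clear deliverable, explicit inputs, acceptance criteria) mirrors what the C5/C0 recommendation describes for the high-success themes.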

Methodology & limitations

Pipeline: PR title + body → strip HTML/code/URLs (and unwrap any <summary>Original prompt</summary> blockquote) → TF-IDF (max 600 features, 1–2 grams, min_df=3) → KMeans. k=8 was chosen by a silhouette sweep over k=3–8; the silhouette score is low (~0.04, expected for sparse text), so clusters were validated by thematic coherence of top terms and representative PRs, not by the score alone.
Only 8 of 1,000 PRs carried an explicit Original prompt block; for the rest the PR description stands in for the task prompt, so clustering reflects task topic, not verbatim user wording.
No agent turn/duration/cost metrics. These PRs come from copilot-swe-agent, which does not emit aw_info.json; gh-aw logs would not map onto them, so comments and commits are used as iteration proxies. Turn-level analysis remains a gap.
Cache updated: all 1,000 analyzed PR numbers written to analyzed-prs.txt for incremental future runs.
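The pipeline described above could look roughly like the following scikit-learn sketch. It is not the workflow's actual code: the regexes in `clean_text`, the function names, and the `n_init` and `seed` settings are assumptions, and only the TF-IDF parameters (600 features, 1–2 grams, min_df=3) and the k=3–8 sweep come from the description.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

FENCE = re.compile(r"```.*?```", re.DOTALL)  # fenced code blocks
URL = re.compile(r"https?://\S+")
TAG = re.compile(r"<[^>]+>")  # HTML tags; their inner text is kept


def clean_text(raw: str) -> str:
    """Drop fenced code, URLs and HTML tags, then collapse whitespace."""
    text = FENCE.sub(" ", raw)
    text = URL.sub(" ", text)
    text = TAG.sub(" ", text)
    return " ".join(text.split())


def sweep_k(texts, k_values=range(3, 9), min_df=3, max_features=600, seed=0):
    """Return (best_k, {k: silhouette}) for KMeans on TF-IDF 1-2 grams."""
    vectorizer = TfidfVectorizer(
        max_features=max_features, ngram_range=(1, 2), min_df=min_df
    )
    matrix = vectorizer.fit_transform([clean_text(t) for t in texts])
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Calling `sweep_k(pr_texts)` on a list of "title + body" strings returns the k with the highest silhouette along with the full score table, which makes it easy to see how flat the scores are before trusting any single k.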

References: §27271629188

Generated by 📊 Copilot Agent Prompt Clustering Analysis · 161.2 AIC · ⌖ 20.8 AIC · ⊞ 14K · ◷

expires on Jun 11, 2026, 3:26 AM UTC-08:00

2026-06-11T11:56:59Z

github-actions[bot]
Bot Jun 11, 2026
Author

This discussion was automatically closed because it expired on 2026-06-11T11:26:58.016Z.

Closed by Workflow

0 replies
