[prompt-clustering] Copilot Agent Prompt Clustering — 2026-06-30 (1,000 PRs, 7 clusters, 80.7% merged) #42465

2026-06-30T11:12:29Z

github-actions[bot]
Bot Jun 30, 2026

Summary

Analysis Period: Last 30 days (2026-06-08 → 2026-06-30) · Tasks Analyzed: 1,000 Copilot agent PRs · Clusters: 7 · Overall Merge Rate: 80.7%

NLP clustering (TF-IDF + K-means, k selected by silhouette) of the task prompts in 1,000 PRs authored by app/copilot-swe-agent surfaces 7 coherent task families. Volume is dominated by feature/schema additions and agentic-workflow tuning; the lowest merge rate sits in engine/harness runtime fixes (71.8%) — the hardest, most iteration-heavy category. Dependency bumps and firewall/smoke-test PRs are nearly auto-merged (96.4%).

Outcome	Count	Share
✅ Merged	807	80.7%
❌ Closed (unmerged)	183	18.3%
🔄 Open	10	1.0%

Full clustering report — themes, success rates, examples, data table, recommendations

Methodology

Corpus: PR body text (the task prompt) from all 1,000 Copilot-authored PRs in the window. All bodies were non-empty and ≥20 chars after cleaning.
Cleaning: stripped fenced/inline code, URLs, and punctuation; normalized whitespace.
Vectorization: TF-IDF, unigrams+bigrams, max_features=600, min_df=3, max_df=0.6.
Clustering: K-means; k=7 chosen by mean silhouette score over k∈[3,7]. The score rose monotonically from 0.022 (k=3) to 0.041 (k=7), so the selected k sits at the upper edge of the searched range. Silhouette is low in absolute terms (expected for short, vocabulary-overlapping engineering prompts), but inspection of top terms and example PRs confirms the clusters are thematically coherent.
Limitations: Per-task turn counts / cost / duration were not available. These PRs come from GitHub's standalone Copilot coding agent, not gh-aw workflow runs, so aw_info.json metrics don't map to them. The cached full-PR dataset (pr-full-data/) is stale (May, PR #30xxx) and does not cover the current #37750–#42454 range, so comment/review/file-change counts were not enriched. Analysis is therefore prompt-text + outcome + temporal only.
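
The cleaning step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the workflow's actual code: the regular expressions, the function names `clean_prompt` and `usable_prompts`, and the order of the substitutions are all assumptions. Only the list of things stripped and the 20-character floor come from the methodology above.

```python
import re

# Assumed patterns; the report only says code, URLs, and punctuation were stripped.
FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)  # fenced blocks, possibly multi-line
INLINE_CODE = re.compile(r"`[^`\n]*`")              # inline code spans
URL = re.compile(r"https?://\S+")                   # http(s) links
PUNCTUATION = re.compile(r"[^\w\s]")                # anything but word chars and whitespace
WHITESPACE = re.compile(r"\s+")

MIN_CHARS = 20  # bodies shorter than this after cleaning are dropped


def clean_prompt(body: str) -> str:
    """Strip code, URLs, and punctuation, then normalize whitespace."""
    text = FENCED_CODE.sub(" ", body)  # fenced blocks first, so their backticks are gone
    text = INLINE_CODE.sub(" ", text)
    text = URL.sub(" ", text)          # URLs before punctuation, while they are still intact
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def usable_prompts(bodies: list[str]) -> list[str]:
    """Clean every body and keep those with at least MIN_CHARS characters."""
    cleaned = (clean_prompt(body) for body in bodies)
    return [text for text in cleaned if len(text) >= MIN_CHARS]
```

Removing fenced blocks before inline spans matters: otherwise the inline pattern could consume part of a fence and leave code behind in the text that gets vectorized.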

Cluster Analysis

1. Schema / Manifest / Feature additions — 327 tasks (32.7%), 81.7% merged

Largest family. Adding fields/properties to JSON schemas, manifest support, new CLI commands, dashboard features, docs. Keywords: schema, adds, changes, updated, workflow, docs, files, test, command. Bread-and-butter additive work — solid, near-average merge rate.

2. Agentic workflow & prompt tuning — 218 tasks (21.8%), 79.4% merged

Authoring/editing .github/workflows agentic specs, prompt guidance, turn budgets, model selection, reviewer path-gating. Keywords: workflow, prompt, agent, guidance, tool, output. Meta-work on the agentic system itself; slightly below-average merge rate (prompt/policy choices are more contested → more closures).

3. Engine / harness runtime fixes — 181 tasks (18.1%), 71.8% merged (lowest)

The hardest category: API routing (Responses API, provider mapping), harness retry loops, sandbox EACCES, HTTP 400 surfacing, TPM exhaustion. Keywords: step, failure, job, copilot, env, detection, awf. Lowest merge rate and the most open PRs (4) — these are deep, stateful runtime bugs that need real reproduction.

4. Refactor / linters / code quality — 141 tasks (14.1%), 84.4% merged

Deduplication, custom linters/analyzers, function relocation, largefunc cleanup, panic→error. Keywords: error, analyzer, sites, flagged, function, helper. High merge rate — mechanical, well-scoped, low-risk.

5. Sous-chef generated PRs — 59 tasks (5.9%), 84.7% merged

PRs produced by the "sous-chef" agentic workflow (long, structured bodies; avg ~2,100 chars). Keywords: sous chef, chef, pr, aic, chef run. Self-generated improvements with above-average acceptance.

6. Dependency bumps & firewall/smoke-test — 56 tasks (5.6%), 96.4% merged (highest)

Version bumps (firewall, Claude Code, Codex, mcpg), firewall/smoke-test changesets. Longest bodies (avg ~3,800 chars — verbose smoke-test summaries). Keywords: claude, smoke, domains, firewall, blocked. Near-automatic acceptance — routine, low-judgment, high-confidence.

7. CI job auto-fixes (WIP) — 18 tasks (1.8%), 77.8% merged

Auto-generated "[WIP] Fix failing GitHub Actions job ..." PRs. Shortest bodies (avg ~410 chars). Keywords: actions, fix, job, github actions, logs, root cause. Small reactive category.

Merge Rate by Cluster

Cluster	Tasks	Share	Merge Rate	Avg body (chars)
Schema / Manifest / Feature adds	327	32.7%	81.7%	1,380
Agentic workflow & prompt tuning	218	21.8%	79.4%	1,286
Engine / harness runtime fixes	181	18.1%	71.8%	1,466
Refactor / linters / code quality	141	14.1%	84.4%	1,486
Sous-chef generated PRs	59	5.9%	84.7%	2,145
Dependency bumps & firewall/smoke	56	5.6%	96.4%	3,796
CI job auto-fixes (WIP)	18	1.8%	77.8%	413
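
As a consistency check, the per-cluster rows can be reconciled with the headline numbers. The sketch below copies the task counts and merge rates from the table; recovering merged counts by rounding `tasks × rate` is an assumption about how the percentages were produced, and the names `CLUSTERS` and `reconcile` are illustrative.

```python
# (cluster, tasks, merge rate in %) as listed in the Merge Rate by Cluster table
CLUSTERS = [
    ("Schema / Manifest / Feature adds", 327, 81.7),
    ("Agentic workflow & prompt tuning", 218, 79.4),
    ("Engine / harness runtime fixes", 181, 71.8),
    ("Refactor / linters / code quality", 141, 84.4),
    ("Sous-chef generated PRs", 59, 84.7),
    ("Dependency bumps & firewall/smoke", 56, 96.4),
    ("CI job auto-fixes (WIP)", 18, 77.8),
]


def merged_count(tasks: int, rate_pct: float) -> int:
    """Recover the merged-PR count from a rounded percentage (assumed rounding)."""
    return round(tasks * rate_pct / 100)


def reconcile(clusters=CLUSTERS) -> tuple[int, int, float]:
    """Return (total tasks, total merged, overall merge rate in %)."""
    total_tasks = sum(tasks for _, tasks, _ in clusters)
    total_merged = sum(merged_count(tasks, rate) for _, tasks, rate in clusters)
    return total_tasks, total_merged, round(100 * total_merged / total_tasks, 1)


print(reconcile())  # (1000, 807, 80.7)
```

The totals match the Summary (1,000 PRs, 807 merged, 80.7%), so the cluster rates use the same denominator as the overall rate: all PRs in the cluster, including those still open.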

Temporal Trend

Merge rate is stable across the window (78–85% per 5-day bucket), with a mild peak (85.0%) around 2026-06-21. No degradation or improvement trend; volume is heaviest in the final week.
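
A per-bucket merge rate of this kind could be computed along the following lines. This is a hedged sketch: the report does not say whether PRs were bucketed by creation date or close date, nor how open PRs were counted, so bucketing by creation date and counting every PR in the denominator are assumptions, as are the function name and the input shape.

```python
from collections import defaultdict
from datetime import date, timedelta

WINDOW_START = date(2026, 6, 8)  # start of the analysis window from the Summary
BUCKET_DAYS = 5


def merge_rate_by_bucket(prs, start=WINDOW_START, bucket_days=BUCKET_DAYS):
    """Group PRs into fixed-width date buckets and return {bucket start: merge rate %}.

    `prs` is an iterable of (created, outcome) pairs, where `created` is a
    datetime.date and `outcome` is "merged", "closed", or "open" (assumed shape).
    """
    totals = defaultdict(int)
    merged = defaultdict(int)
    for created, outcome in prs:
        index = (created - start).days // bucket_days
        bucket_start = start + timedelta(days=index * bucket_days)
        totals[bucket_start] += 1
        if outcome == "merged":
            merged[bucket_start] += 1
    return {b: round(100 * merged[b] / totals[b], 1) for b in sorted(totals)}
```

With a 5-day width starting on 2026-06-08, the last bucket (2026-06-28 onward) covers only three days of the window, so its rate rests on fewer PRs per day of coverage than the others.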

Representative PRs (3 per cluster)

PR #	Title	Cluster	Outcome
#42454	Allow `branding` field in `aw.yml` package manifests	Schema/Feature	✅ Merged
#42408	chore: bump gh-aw-firewall to v0.27.15	Schema/Feature	🔄 Open
#42397	Remove in-repo agentic-workflows dashboard extension and cl...	Schema/Feature	✅ Merged
#42426	Add frontmatter `skills` support with activation-time `gh s...	Workflow/Prompt	🔄 Open
#42411	[aw] Raise Daily yamllint Fixer turn budget to prevent max-...	Workflow/Prompt	✅ Merged
#42373	Downgrade read-only maintenance workflows from Sonnet to Haiku	Workflow/Prompt	❌ Closed
#42421	fix: route gpt-5.5 through OpenAI Responses API (wireApi=re...	Engine/Harness	🔄 Open
#42420	Stop Codex harness retry loops on TPM exhaustion and unfini...	Engine/Harness	🔄 Open
#42400	fix: reclaim non-writable /tmp/gh-aw/sandbox before AWF wri...	Engine/Harness	🔄 Open
#42431	Deduplicate glob-list validation across workflow validators	Refactor/Lint	🔄 Open
#42430	Deduplicate sandbox and MCP mount validation flow	Refactor/Lint	🔄 Open
#42412	refactor: eliminate all 13 largefunc lint violations in pkg...	Refactor/Lint	🔄 Open
#42323	feat(linters): add errortypeassertion analyzer for error-to...	Sous-chef	✅ Merged
#42322	Extract shared SafeOutputAllowBlockConfig across safe-outpu...	Sous-chef	✅ Merged
#42295	Scale MCP logs timeout for larger fetch windows	Sous-chef	✅ Merged
#41945	Bump gh-aw-firewall to v0.27.12 and gh-aw-mcpg to v0.3.31	Deps/Firewall	✅ Merged
#41912	Propagate enterprise host context into curated DIFC/CLI pro...	Deps/Firewall	✅ Merged
#41865	Bump Claude Code to 2.1.195 and Codex to 0.142.3	Deps/Firewall	✅ Merged
#42379	[WIP] Fix failing GitHub Actions job for CLI Completion	CI auto-fix	✅ Merged
#42113	[WIP] Fix failing GitHub Actions job Integration: Workflow ...	CI auto-fix	✅ Merged
#41153	[WIP] Fix failing GitHub Actions job 'Integration: Workflow...	CI auto-fix	✅ Merged

Key Findings

Two task families are ~55% of all Copilot work. Schema/feature additions (32.7%) and agentic workflow/prompt tuning (21.8%) dominate — the agent is used most for additive, well-bounded changes, which merge at ~80%.
Difficulty is concentrated in the engine/harness category. At 71.8% merged with the most open PRs, runtime/API/sandbox fixes are where the agent struggles most — these need reproduction and stateful debugging that single-shot prompts under-serve.
Routine, low-judgment work is nearly auto-accepted. Dependency bumps + firewall/smoke (96.4%) and refactor/lint (84.4%) show the agent is highly reliable on mechanical, low-ambiguity tasks. Body length is high for the dependency/firewall cluster (avg ~3,800 chars), but that reflects verbose auto-generated summaries, not task difficulty.
Body length does not predict success. Merged (avg 1,564) vs closed (1,510) bodies are nearly identical — prompt length is not a quality signal; task category is the stronger predictor.

Recommendations

Invest prompt-engineering effort in the engine/harness category. For runtime/API/sandbox fixes, prepend explicit reproduction steps, expected-vs-actual behavior, and links to the failing run/log. This is the lowest-merge, highest-iteration cluster and has the most headroom.
Templatize the two high-volume additive families. Schema/feature and workflow/prompt tasks (55% of volume) would benefit from a standard prompt scaffold (affected file, schema/contract, a required test case) to push their ~80% merge rate higher with minimal effort.
Keep routing mechanical work to the agent freely. Dependency bumps, firewall/smoke, and lint/refactor are near-solved (84–96% merged); these are safe to automate aggressively.
Add turn/cost telemetry to close the loop. The single biggest analysis gap is the absence of per-task turn/duration/cost. Tagging Copilot PRs with their session metrics would let future runs correlate iteration cost with cluster and outcome — likely confirming engine/harness as the most expensive.
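
The first recommendation (reproduction steps, expected-vs-actual behavior, and a link to the failing run) could take the form of a small prompt scaffold. The sketch below is purely illustrative: the class name `RuntimeFixPrompt`, its fields, and the rendered layout are hypothetical and not an existing template in the repository.

```python
from dataclasses import dataclass


@dataclass
class RuntimeFixPrompt:
    """Hypothetical scaffold for engine/harness fix prompts."""

    summary: str
    repro_steps: list[str]
    expected: str
    actual: str
    failing_run_url: str

    def render(self) -> str:
        """Render the prompt, refusing to emit one without reproduction steps."""
        if not self.repro_steps:
            raise ValueError("at least one reproduction step is required")
        steps = "\n".join(
            f"{number}. {step}" for number, step in enumerate(self.repro_steps, 1)
        )
        return (
            f"{self.summary}\n\n"
            f"Reproduction steps:\n{steps}\n\n"
            f"Expected: {self.expected}\n"
            f"Actual: {self.actual}\n\n"
            f"Failing run/log: {self.failing_run_url}\n"
        )
```

Making the reproduction steps mandatory is the point of the scaffold: a prompt that cannot say how to trigger the failure is rejected before it reaches the agent.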

References:

§28438505290

Generated by Prompt Clustering Analysis (Run: 28438505290)

Generated by 📊 Copilot Agent Prompt Clustering Analysis · 179.6 AIC · ⌖ 19 AIC · ⊞ 13.2K · ◷

expires on Jul 1, 2026, 3:12 AM UTC-08:00

2026-07-01T11:15:39Z

github-actions[bot]
Bot Jul 1, 2026
Author

This discussion has been marked as outdated by Copilot Agent Prompt Clustering Analysis.

A newer discussion is available at Discussion #42710.

0 replies

2026-07-01T11:31:52Z

github-actions[bot]
Bot Jul 1, 2026
Author

Smoke bot grunt. Run 28513491479 done poke.

Warning

Firewall blocked 6 domains

The following domains were blocked by the firewall during workflow execution:

accounts.google.com
android.clients.google.com
clients2.google.com
contentautofill.googleapis.com
safebrowsingohttpgateway.googleapis.com
www.google.com

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "accounts.google.com"
    - "android.clients.google.com"
    - "clients2.google.com"
    - "contentautofill.googleapis.com"
    - "safebrowsingohttpgateway.googleapis.com"
    - "www.google.com"

See Network Configuration for more information.

📰 BREAKING: Report filed by Smoke Copilot · 327.4 AIC · ⌖ 15.6 AIC · ⊞ 19.2K · ◷
Comment /smoke-copilot to run again
Add label smoke to run again

0 replies
