[prompt-clustering] Copilot Agent Prompt Clustering — 998 PRs, 7 themes (last 30d) #38091

2026-06-09T11:13:44Z

github-actions[bot]
Bot Jun 9, 2026

Summary

NLP clustering of 998 copilot-agent PRs opened in github/gh-aw over the last 30 days (data spans 2026-05-22 → 2026-06-09, ~18 active days). PR titles + bodies were cleaned, TF-IDF vectorized (unigrams+bigrams), and grouped with K-means. k=7 was selected within an interpretable 3–7 range (silhouette is near-flat ≈0.03, as expected for short text — themes were validated by manual inspection of representative PRs).
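A minimal sketch of the cleaning step, assuming plain regex stripping of the items the Methodology section lists (code fences, checkbox task-lists, URLs, HTML, numbers). The function name `clean_pr_text` and the exact patterns are illustrative assumptions; the report does not show the workflow's actual code.

```python
import re

# Build the fence marker programmatically so the pattern stays readable.
FENCE = "`" * 3

# Illustrative patterns (assumptions): the report names what was stripped, not how.
PATTERNS = [
    re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL),  # fenced code
    re.compile(r"^[ \t]*[-*][ \t]*\[[ xX]\].*$", re.MULTILINE),  # checkbox task-list items
    re.compile(r"https?://\S+"),  # URLs
    re.compile(r"<[^>]+>"),  # HTML tags
    re.compile(r"\d+"),  # numbers
]


def clean_pr_text(title: str, body: str = "") -> str:
    """Join title and body, strip noisy spans, and normalize whitespace and case."""
    text = f"{title}\n{body}"
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()
```

Order matters in this sketch: fenced code is removed first so that URLs or digits inside a fence do not leave fragments behind.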

Total tasks analyzed: 998 (of 1,000 fetched; 2 too short to cluster)
Outcomes: 784 merged · 201 closed · 13 open
Overall success rate: 79.6% (merged ÷ decided)
Clusters identified: 7 coherent themes
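For reference, the overall success rate above can be reproduced from the outcome counts. This is a small sketch; the helper name `success_rate` is an assumption, and open PRs are excluded because they are not yet decided.

```python
def success_rate(merged: int, closed: int) -> float:
    """Return merged / decided, where decided = merged + closed (open PRs excluded)."""
    decided = merged + closed
    if decided == 0:
        raise ValueError("no decided PRs")
    return merged / decided


rate = success_rate(merged=784, closed=201)
print(f"Overall success rate: {rate:.1%}")
```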

Key Findings

Three themes account for 71% of all agent work — engine schema/validation (28%), CI & packaging (22%), and workflow/prompt authoring (21%). The agent is overwhelmingly used for internal gh-aw platform maintenance rather than feature work.
Model-pinning / SDK-driver tasks are the weakest spot (70% success, 30% closed). These PRs fix engine-model mismatches (e.g. "pin explicit Copilot model") and are disproportionately abandoned — a recurring, error-prone task class worth a durable fix rather than repeated agent patches.
Auto-generated "Fix failing GitHub Actions job" PRs are the least reliable (65% success, the lowest) and show almost no human discussion (0.3 comments). Many are [WIP] retries of the same failing job — high churn, low yield.
Dependency/firewall bumps are small in count (5%) but by far the most expensive — 141 files changed and 20.9 comments per PR on average (6× the mean). These are heavy, review-intensive PRs despite mostly mechanical intent.
Iteration count tracks task type, not success — tokens/footer and CI tasks converge in ~2.7–3.8 commits; schema and firewall work need 5–6.6. More iterations did not lower success, suggesting the agent iterates productively rather than thrashing.
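One rough way to sanity-check the last finding is a Pearson correlation between average commits and success rate across the seven clusters, using the values from the table in this report. This is only a sketch: seven aggregated points cannot establish anything statistically, and the report does not say this check was part of the analysis.

```python
import math

# (avg commits, success %) per cluster, copied from the table in this report.
CLUSTERS = {
    "Engine schema & validation": (5.0, 79),
    "CI, packaging & analyzers": (2.7, 81),
    "Workflow authoring & prompts": (3.2, 83),
    "Engine/SDK driver & model pinning": (4.2, 70),
    "AI tokens, cost & footers": (3.8, 84),
    "Firewall & dependency bumps": (6.6, 80),
    "Failing CI job auto-fixes": (2.5, 65),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


commits = [c for c, _ in CLUSTERS.values()]
success = [s for _, s in CLUSTERS.values()]
r = pearson(commits, success)
print(f"Pearson r (avg commits vs success %): {r:.2f}")
```

A coefficient that is not clearly negative is consistent with the claim that more iterations did not lower success.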

Success Rate & Complexity by Cluster

#	Theme	PRs	%	Merged	Closed	Open	Success	Avg commits	Avg files	Avg comments
1	Engine schema & validation	278	28%	217	56	5	79%	5.0	30	4.5
2	CI, packaging & analyzers	217	22%	173	41	3	81%	2.7	44	1.2
3	Workflow authoring & prompts	209	21%	171	34	4	83%	3.2	14	1.1
4	Engine/SDK driver & model pinning	123	12%	85	37	1	70%	4.2	23	3.4
5	AI tokens, cost & footers	105	11%	88	17	0	84%	3.8	43	2.5
6	Firewall & dependency bumps	46	5%	37	9	0	80%	6.6	141	20.9
7	Failing CI job auto-fixes	20	2%	13	7	0	65%	2.5	14	0.3
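The table's derived figures can be cross-checked from its own columns. The sketch below recomputes each cluster's success rate as merged ÷ (merged + closed) and estimates how far Cluster 6's comment count sits above the PR-weighted mean. The weighted mean is approximate because it is rebuilt from rounded per-cluster averages, and the tuple layout is just a convenience for this example.

```python
# (theme, PRs, merged, closed, open, reported success %, avg comments), from the table.
ROWS = [
    ("Engine schema & validation", 278, 217, 56, 5, 79, 4.5),
    ("CI, packaging & analyzers", 217, 173, 41, 3, 81, 1.2),
    ("Workflow authoring & prompts", 209, 171, 34, 4, 83, 1.1),
    ("Engine/SDK driver & model pinning", 123, 85, 37, 1, 70, 3.4),
    ("AI tokens, cost & footers", 105, 88, 17, 0, 84, 2.5),
    ("Firewall & dependency bumps", 46, 37, 9, 0, 80, 20.9),
    ("Failing CI job auto-fixes", 20, 13, 7, 0, 65, 0.3),
]

# Collect rows whose counts or success rate do not match the reported values.
mismatches = []
for theme, prs, merged, closed, open_, reported, _ in ROWS:
    recomputed = round(100 * merged / (merged + closed))
    if merged + closed + open_ != prs or recomputed != reported:
        mismatches.append(theme)

total_prs = sum(row[1] for row in ROWS)
# PR-weighted mean of the per-cluster comment averages.
mean_comments = sum(row[1] * row[6] for row in ROWS) / total_prs
firewall_comments = ROWS[5][6]
ratio = firewall_comments / mean_comments

print(f"Rows with mismatches: {mismatches}")
print(f"Weighted mean comments per PR: {mean_comments:.1f}")
print(f"Cluster 6 vs mean: {ratio:.1f}x")
```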

Cluster details & representative PRs

Cluster 1 — Engine schema & validation (278 PRs, 79%)

Keywords: schema, behavior, validation, path, coverage. The largest theme: schema rendering, validation guards, and behavior coverage in the engine/AWF layer.

Render sandbox.firewall models.json in AWF step summaries #34088
Guard OTLP attribute merge against allocation-size overflow #34117
Create REQUEST_CHANGES review for create_pull_request threat-warning mode #34133

Cluster 2 — CI, packaging & analyzers (217 PRs, 81%)

Keywords: ci, package, fix, spec, analyzer. Large mechanical refactors (avg 7.7k additions): logger migrations, package consolidation, analyzer/spec fixes.

Validate logger migration completeness across targeted packages #34122
Consolidate workflow FieldLocation onto console ErrorPosition #34123
docs(reference): add non-Copilot engine examples to targeted reference pages #34121

Cluster 3 — Workflow authoring & prompts (209 PRs, 83%)

Keywords: prompt, workflow, guidance, step, output. Smallest diffs (14 files on average, tied with Cluster 7): tuning .github/workflows/*.md agentic workflows, safe-output behavior, and prompt guidance.

Fix daily-syntax-error-quality producing no safe outputs #34212
fix: set GH_AW_WORKFLOW_SOURCE_URL for local workflows in failure issues #34090
Increase audit workflow repo-memory patch budget to prevent push_repo_memory failures #34120

Cluster 4 — Engine/SDK driver & model pinning (123 PRs, 70% ⚠️)

Keywords: sdk, driver, model, workflow, engine. Lowest success rate among the larger clusters (only the 20-PR Cluster 7 scores lower). Repeated fixes for model/driver mismatches and pinning supported models.

Pin explicit Copilot model in Constraint Solving POTD workflow to avoid utility-model rate-limit failures #34208
Pin Matt Pocock reviewer to supported Copilot model #34148
Fix Codex smoke workflow by preserving OPENAI_API_KEY in AWF container env #34129

Cluster 5 — AI tokens, cost & footers (105 PRs, 84% ✅)

Keywords: ai, effective, token, et, credits. Highest success rate. Well-scoped, deterministic work on token-usage footers and model-id labeling.

fix: use actual model name from token-usage.jsonl in effective tokens footer prefix #34303
fix: use actual resolved model name in effective tokens footer, not user-provided alias #34300
Prefix effective-token footer values with deterministic 5-char model IDs #34291

Cluster 6 — Firewall & dependency bumps (46 PRs, 80%)

Keywords: domains, firewall, blocked, claude, smoke. Most expensive (141 files, 20.9 comments, 6.6 commits). Version bumps of CLIs/firewall + schema sync.

bump: Claude Code 2.1.150, Copilot CLI 1.0.51, GitHub MCP Server v1.0.5 #34307
chore(deps): bump default Claude/Copilot/Codex CLIs and GitHub MCP Server to latest patch/minor #34220
Bump gh-aw-firewall to v0.25.52 and sync embedded AWF schema #34114

Cluster 7 — Failing CI job auto-fixes (20 PRs, 65% ❌)

Keywords: actions, github actions, job, fix, failing. Lowest success, near-zero discussion — mostly [WIP] auto-retries of the same broken job.

[WIP] Fix failing GitHub Actions job agent #34119 / [WIP] Fix failing GitHub Actions job 'agent' #34639
[WIP] Fix failing GitHub Actions job Integration: CLI MCP Other #35200

Methodology & limitations

Source: pre-fetched copilot-prs.json (1,000 PRs, fresh 2026-06-09 state) joined with cached full-PR data (comments/commits/files). PR state was reconciled to the fresher snapshot.
Prompt signal: PR title + body, cleaned of code fences, checkbox task-lists, URLs, HTML, and numbers. The agent's original task prompt is not stored on the PR, so the PR description is used as a proxy.
Vectorization: TF-IDF, 1–2 grams, max_features=400, min_df=3, max_df=0.6, sublinear TF, with an extended domain stopword list (pr, copilot, agent, change, file, ...).
Clustering: K-means, k chosen over 3–7. Silhouette is near-flat (0.015→0.029) — inherent to short overlapping text — so themes were confirmed by inspecting representative PRs, not by the metric alone.
Iteration proxy: per-workflow turn/cost metrics (aw_info.json via gh-aw logs) were not fetched — these copilot PR numbers don't map to gh-aw engine runs. Commit count is used as an iteration proxy instead. This is the main limitation.
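The vectorization and clustering steps above can be sketched as follows. The parameter names in the report (`max_features`, `min_df`, `max_df`, sublinear TF) match scikit-learn's `TfidfVectorizer`, so this example assumes that library; the report does not confirm it. The function name `scan_k`, the fixed `random_state`, `n_init=10`, and the choice to extend scikit-learn's English stopword list are all assumptions, and only the five domain stopwords actually listed are included.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import silhouette_score

# Domain stopwords named in the report; the rest of the list is elided there.
DOMAIN_STOPWORDS = {"pr", "copilot", "agent", "change", "file"}


def scan_k(texts, k_values=range(3, 8), random_state=0):
    """Vectorize texts, then return {k: (silhouette, labels)} for each candidate k."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),  # unigrams + bigrams
        max_features=400,
        min_df=3,
        max_df=0.6,
        sublinear_tf=True,
        # Assumption: the domain words extend scikit-learn's English stopword list.
        stop_words=list(ENGLISH_STOP_WORDS | DOMAIN_STOPWORDS),
    )
    matrix = vectorizer.fit_transform(texts)

    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(matrix)
        results[k] = (silhouette_score(matrix, labels), labels)
    return results
```

Because the silhouette scores reported here are nearly flat, a scan like this would not pick k on its own; it only narrows the range before representative PRs are inspected by hand.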

Recommendations

Harden model/driver resolution (Cluster 4). About 30% of model-pinning PRs are closed without merging, and the recurring manual patches signal a systemic gap. Add a validation step that rejects unsupported engine/model combos at compile time so the agent doesn't keep re-fixing them.
Gate the "fix failing CI job" automation (Cluster 7). At 65% success, with [WIP] retries and almost no human engagement (0.3 comments per PR), this loop produces low-value churn. Add a retry cap and require a root-cause summary before opening the PR.
Pre-stage dependency bumps (Cluster 6). These mechanical PRs average about 21 comments each. A templated bump checklist (schema sync + smoke domains pre-listed) would cut review back-and-forth.
Replicate the token/footer playbook (Cluster 5). The highest-success cluster is small, deterministic, and well-specified — a good template for scoping future agent tasks tightly.
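As an illustration of the first recommendation, a compile-time check could compare each workflow's engine/model pair against an allow-list and fail early. Everything in this sketch is hypothetical: the engine and model names are placeholders, and `SUPPORTED_MODELS`, `UnsupportedModelError`, and `validate_engine_model` are not part of gh-aw.

```python
# Hypothetical allow-list (assumption): real engine and model identifiers
# would come from gh-aw's own configuration, not from this sketch.
SUPPORTED_MODELS = {
    "engine-a": {"model-small", "model-large"},
    "engine-b": {"model-x"},
}


class UnsupportedModelError(ValueError):
    """Raised when a workflow requests an engine/model pair that is not allowed."""


def validate_engine_model(engine, model, supported=SUPPORTED_MODELS):
    """Raise UnsupportedModelError unless (engine, model) is in the allow-list."""
    if engine not in supported:
        raise UnsupportedModelError(f"unknown engine {engine!r}")
    if model not in supported[engine]:
        allowed = ", ".join(sorted(supported[engine]))
        raise UnsupportedModelError(
            f"model {model!r} is not supported by engine {engine!r}; allowed: {allowed}"
        )


# Usage: a supported pair passes silently; an unsupported one fails before any run.
validate_engine_model("engine-a", "model-small")
try:
    validate_engine_model("engine-b", "model-small")
except UnsupportedModelError as err:
    print(f"compile error: {err}")
```

Failing at compile time with the list of allowed models would turn a recurring agent task into a one-line fix for the workflow author.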

References:

§27200898190

Generated by 📊 Copilot Agent Prompt Clustering Analysis · 161.7 AIC · ⌖ 9.38 AIC · ⊞ 14.1K · ◷

expires on Jun 10, 2026, 3:13 AM UTC-08:00
