Agent Performance Report — Week of 2026-07-01 #42767

2026-07-01T13:41:41Z

github-actions[bot]
Bot Jul 1, 2026

Executive Summary

Metric	Score	vs Last Run
Quality Score	61/100	↓1
Effectiveness Score	62/100	↓1
Ecosystem Health	75/100	↓3
Agents analyzed	25+ active workflows	—
Total outputs reviewed	43 issues, 30 PRs (recent window)	—
P1 open	7	+1 (PR Sous Chef recurrence)
P2 open	13	+5 new this week

Top performers: Copilot SWE Agent, Issue Monster, PR Triage, Auto-Triage Issues, Avenger
Needs improvement: PR Sous Chef, Sub-Agent Model Resolution Audit, PR Code Quality Reviewer, Daily Safe Output Integrator

New this run: #aw_model_lifecycle — Systemic model version lifecycle management issue filed.

Performance Rankings

Top Performing Agents

Copilot SWE Agent (Q: 92/100, E: 91/100)
- 80% PR merge rate on 85 PRs (68 merged) — highest-volume contributor
- Broad coverage: SDK refactors, test coverage, doc updates, dependency bumps
- Example outputs: Prefer skills: frontmatter for workflow skill installation in instructions #42756, Document skills: frontmatter with pinned refs, per-skill auth, and Matt Pocock example #42747, Add regression coverage for Copilot AWF chroot-home cleanup #42736, eslint-factory: add empty-string suggestion for null/undefined in no-core-setoutput-non-string #42723, Standardize two linters on Cursor traversal and add shared astutil.Root #42719
Issue Monster (Q: 88/100, E: 87/100)
- 100% success rate; consistent high-volume daily output
- Structured, actionable issue creation
PR Triage (Q: 88/100, E: 86/100)
- 100% success; structured daily reports (see [PR Triage Report] 🤖 PR Triage Report — 2026-07-01 (Run §28519998993) #42760)
- Clear, navigable triage summaries
Auto-Triage Issues (Q: 84/100, E: 82/100)
- 100% today (3/3 runs); 9/10 this week (1 transient PI crash, [aw] Auto-Triage Issues failed #42607 filed)
- Stable and reliable label/triage automation
Avenger (Q: 83/100, E: 82/100)
- 100% success; proactive maintenance without over-creation
Team Status (Q: 82/100, E: 81/100)
- Daily report filed ([team-status] 🌟 Team Daily Status Report — 2026-07-01 #42744); good breadth
Static Analysis (Q: 81/100, E: 80/100)
- 11+ days with zero High-severity findings; consistent clean runs
AB Advisor (Q: 78/100, E: 76/100)
- 2 actionable issues today ([ab-advisor] Daily A/B Testing Advisor - Issue Group #42732, [ab-advisor] Improve experiment infrastructure: schema, reporting & audit #42733); structured experiment insights
AIC Consumption Report (Q: 75/100, E: 75/100)
- Good observability; daily audit ([agentic-token-audit] 📊 Daily AIC Usage Audit — 2026-07-01 #42746) on-time
Content Moderation (Q: 74/100, E: 72/100)
- 67% today (4/6); occasional skips acceptable given event-driven triggers

Agents Needing Improvement

PR Sous Chef (Q: 38/100, E: 30/100)
- P1 RECURRING: HTTP 400 post-fix recurrence ([aw] PR Sous Chef hit HTTP 400 bad request #42652 OPEN)
- Fix [aw-failures] PR Sous Chef 100% red — Copilot main agent requests gpt-5.5, SDK returns 400 "not accessible via /chat/completio [Content truncated due to length] #42444 closed Jun 30 22:50 but problem persists on Jul 1
- Engine switch to pi (feat: switch pr-sous-chef to pi engine #42730 merged) is the active mitigation — monitor closely
Sub-Agent Model Resolution Audit (Q: 30/100, E: 25/100)
- 100% red since Jun 24; codex alpha 404 ([aw-failures] Daily Sub-Agent Model Resolution Audit 100% red — Codex gpt-5-codex-alpha-2025-11-07 404s (same alpha-snapshot d [Content truncated due to length] #42033 OPEN)
- Blocked on model availability; no self-recovery possible
PR Code Quality Reviewer (Q: 35/100, E: 30/100)
- Tier-unsupported model → SDK 400 ([aw-failures] PR Code Quality Reviewer red — Copilot general-purpose subagent requests tier-unsupported model → SDK 400 `model [Content truncated due to length] #42095 OPEN). Every run fails.
Daily Safe Output Integrator (Q: 40/100, E: 35/100)
- Tool denial 5/5 (4th recurrence) ([aw] Daily Safe Output Integrator exceeded tool denial limit #42333 OPEN)
Daily BYOK Ollama (Q: 35/100, E: 30/100)
- api-proxy 503 ([aw-failures] Daily BYOK Ollama Test 100% red for 8+ days — offline+BYOK api-proxy returns 503 on /v1/models, Copilot CLI gets H [Content truncated due to length] #41827 OPEN); infrastructure dependency
AI Moderator (Q: 55/100, E: 45/100)
- 17% today (1/6); recurring action_required / skipped pattern
Agentic Commands (Q: 60/100, E: 52/100)
- 42% today (5/12); high run volume with low success rate

Inactive / Zero-Output Agents

Daily Team Evolution Insights: Missing required tool ([aw] Daily Team Evolution Insights is missing required tool #42342 OPEN)
Daily Hippo Learn: hippo MCP tool unavailable ([aw] Daily Hippo Learn is missing required tool #42442 OPEN)
Smoke Copilot: Missing message input ([aw-failures] Smoke Copilot safe_outputs red — dispatch_workflow to haiku-printer omits required input message #41988 OPEN)
Smoke CI: EACCES mkdir /tmp/gh-aw ([aw-failures] Smoke CI hard-red at startup — EACCES mkdir /tmp/gh-aw/sandbox/firewall/logs, agent never invoked (rootless left [Content truncated due to length] #42398 OPEN)

Quality Analysis

Output Quality Distribution

Excellent (80-100): 7 agents (Copilot SWE, Issue Monster, PR Triage, Auto-Triage, Avenger, Team Status, Static Analysis)
Good (60-79): 6 agents (AB Advisor, AIC Report, Content Moderation, Agentic Maintenance, Bot Detection, AgentRx Optimizer)
Fair (40-59): 4 agents (AI Moderator, Agentic Commands, Daily Safe Output Integrator, Agentic Workflow Audit)
Poor (<40): 5 agents (PR Sous Chef, Sub-Agent Model Resolution Audit, PR Code Quality Reviewer, Daily BYOK Ollama, Go Logger Enhancement)
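
The four bands above are simple score thresholds. As a minimal sketch (the function name `quality_band` is hypothetical and not part of the analyzer), the mapping could be expressed like this, using a few Q scores quoted in the rankings:

```python
def quality_band(score: int) -> str:
    """Map a 0-100 quality score to the four bands used in this report."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 80:
        return "Excellent"
    if score >= 60:
        return "Good"
    if score >= 40:
        return "Fair"
    return "Poor"


# Q scores copied from the rankings section of this report.
scores = {
    "Copilot SWE Agent": 92,
    "AB Advisor": 78,
    "AI Moderator": 55,
    "PR Sous Chef": 38,
}
bands = {agent: quality_band(q) for agent, q in scores.items()}
for agent, band in bands.items():
    print(f"{agent}: {band}")
```

The boundaries are inclusive at the lower edge of each band, so a score of exactly 40 counts as Fair and exactly 80 as Excellent.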

Common Quality Issues

Model version mismatch (5 agents): Configured model unavailable or deprecated; no pre-flight validation to fail fast.
Tool denial at runtime (2 agents: Safe Output Integrator, Copilot Opt): Burns retries before reporting failure.
Incomplete/missing outputs (3 agents: Hippo Learn, Team Evolution, Smoke CI): Zero-length output due to missing dependency at start.
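
A pre-flight check of the kind the first item calls for could look roughly like the sketch below. It assumes the harness can obtain a set of currently available model names from somewhere (how is not specified in this report), and every name here (`preflight`, `blocked_workflows`, the workflow ids other than `pr-sous-chef`, `example-stable-model`) is hypothetical. The two failing model names are the ones quoted in the linked failure issues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreflightResult:
    workflow: str
    model: str
    available: bool


def preflight(configured, available_models):
    """Check each workflow's configured model against the available set."""
    return [
        PreflightResult(workflow, model, model in available_models)
        for workflow, model in sorted(configured.items())
    ]


def blocked_workflows(results):
    """Workflows that should fail fast instead of starting a run."""
    return [r.workflow for r in results if not r.available]


# Illustrative configuration; the mapping of workflow to model is an assumption.
configured = {
    "pr-sous-chef": "gpt-5.5",
    "sub-agent-model-resolution-audit": "gpt-5-codex-alpha-2025-11-07",
    "example-healthy-workflow": "example-stable-model",
}
available_models = {"example-stable-model"}  # assumed provider listing

results = preflight(configured, available_models)
blocked = blocked_workflows(results)
print(blocked)
```

Running such a check before the agent starts would turn a mid-run 400 or 404 into an immediate, clearly labeled configuration failure.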

Effectiveness Analysis

PR Merge Rate (copilot-swe-agent)

85 PRs submitted, 68 merged = 80% merge rate (unchanged from Jun 30)
4 currently open (active review), 5 closed without merge

Task Completion Rates (Jul 1)

Agent	Runs	Success	Rate
Auto-Triage Issues	3	3	100%
Avenger	1	1	100%
Bot Detection	1	1	100%
Doc Build Deploy	5	4	80%
CWI	5	4	80%
Content Moderation	6	4	67%
CGO	5	3	60%
Agentic Commands	12	5	42%
Smoke CI	5	2	40%
CJS	4	1	25%
AI Moderator	6	1	17%
Q	12	0	0%*

*Q and AI Moderator show action_required/skipped — likely deployment gates or event-filtering, not hard failures.
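
For reference, the Rate column is successes divided by runs, rounded to the nearest whole percent. The short sketch below recomputes it from the rows of the table; the helper name `rate` is illustrative only.

```python
# (agent, runs, successes) copied from the table above.
rows = [
    ("Auto-Triage Issues", 3, 3),
    ("Avenger", 1, 1),
    ("Bot Detection", 1, 1),
    ("Doc Build Deploy", 5, 4),
    ("CWI", 5, 4),
    ("Content Moderation", 6, 4),
    ("CGO", 5, 3),
    ("Agentic Commands", 12, 5),
    ("Smoke CI", 5, 2),
    ("CJS", 4, 1),
    ("AI Moderator", 6, 1),
    ("Q", 12, 0),
]


def rate(runs: int, successes: int) -> int:
    """Success rate as a whole percent; 0 when there were no runs."""
    if runs == 0:
        return 0
    return round(100 * successes / runs)


rates = {agent: rate(runs, ok) for agent, runs, ok in rows}
for agent, pct in rates.items():
    print(f"{agent}: {pct}%")
```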

Behavioral Patterns

Productive Patterns

Copilot SWE Agent + Avenger + Auto-Triage: High-quality PRs trigger triage/labeling automatically; smooth handoff
Issue Monster + PR Triage: Complementary daily outputs covering both issue hygiene and PR health
Static Analysis → Copilot SWE Agent: Zero findings prevent noisy issue creation

Problematic Patterns

Model deprecation cascade (NEW, growing): gpt-5.5, codex alpha, claude-sonnet-5 all deprecated within one week. 3 P1s + 1 P2 are model-related. No systemic prevention exists → #aw_model_lifecycle filed
Retry waste on non-retryable errors: Harness burns all 4 retries on HTTP 400 responses; wastes AIC credits and delays failure signal.
PR Sous Chef recurrence post-fix: Fix [aw-failures] PR Sous Chef 100% red — Copilot main agent requests gpt-5.5, SDK returns 400 "not accessible via /chat/completio [Content truncated due to length] #42444 closed Jun 30; [aw] PR Sous Chef hit HTTP 400 bad request #42652 opened Jul 1. Root cause likely misdiagnosed.
Tool denial repeat pattern: Safe Output Integrator ([aw] Daily Safe Output Integrator exceeded tool denial limit #42333) and Copilot Opt ([aw] Copilot Opt exceeded tool denial limit #42329) both hit 5/5 tool denials repeatedly.

Coverage Analysis

Well-Covered

PR lifecycle, Issue lifecycle, Daily reporting, Security scanning, Dependency management

Coverage Gaps

Proactive model version monitoring: No agent warns before model deprecation occurs → #aw_model_lifecycle
Stale PR detection: No agent flags PRs open >7 days (carry-forward)
AIC budget forecasting: Audit exists but no forward-looking alert ([aw] Daily Credit Limit Test exceeded daily AI credits budget #42610)
Automated recovery: No agent auto-closes issues for workflows failing >7 days with no fix
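
The stale PR gap is the easiest of these to prototype. The sketch below is only an illustration of the ">7 days open" rule: it works on plain dictionaries rather than any real GitHub API response, and the PR numbers and dates are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def stale_prs(prs, now, max_age=timedelta(days=7)):
    """Return numbers of open PRs that have been open longer than max_age."""
    return [
        pr["number"]
        for pr in prs
        if pr["state"] == "open" and now - pr["created_at"] > max_age
    ]


# Hypothetical sample data, not real PRs from this repository.
utc = timezone.utc
prs = [
    {"number": 101, "state": "open", "created_at": datetime(2026, 6, 20, tzinfo=utc)},
    {"number": 102, "state": "open", "created_at": datetime(2026, 6, 28, tzinfo=utc)},
    {"number": 103, "state": "closed", "created_at": datetime(2026, 6, 1, tzinfo=utc)},
]

now = datetime(2026, 7, 1, 13, 41, 41, tzinfo=utc)  # timestamp of this report
flagged = stale_prs(prs, now)
print(flagged)
```

A real workflow would also need to decide whether draft PRs or PRs with recent review activity count as stale; this sketch ignores both.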

Engine Distribution (257 workflows)

Engine	Count	%
copilot	158	61%
claude	60	23%
pi	20	8%
codex	15	6%
other	4	2%

Note: codex (6%) carries disproportionate failure risk. Consider migrating low-priority codex workflows to copilot.
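
The percentages in the table are each engine's count divided by the 257 total. A small sketch that recomputes the shares to one decimal place, using only the counts from the table:

```python
# Workflow counts per engine, copied from the table above.
engines = {"copilot": 158, "claude": 60, "pi": 20, "codex": 15, "other": 4}

total = sum(engines.values())
shares = {name: round(100 * count / total, 1) for name, count in engines.items()}

print(f"total workflows: {total}")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```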

Recommendations

High Priority

Systemic model version lifecycle management — #aw_model_lifecycle (filed this run)
- Implement pre-flight model availability check; extend daily-model-inventory.md to proactively flag retiring models
- Expected impact: Eliminates reactive P1/P2 model failure cycles
Fix PR Sous Chef root cause ([aw] PR Sous Chef hit HTTP 400 bad request #42652)
- Engine switch to pi (feat: switch pr-sous-chef to pi engine #42730) is live — validate next run
- If the failure recurs: audit gpt-5.5 endpoint routing in the harness directly
Harness retry guard for non-retryable errors
- Detect HTTP 400/401/403 and fail immediately (no retries); saves 3x AIC credits per failed run
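
The retry guard in the last item could be sketched as follows. This is not the harness's actual code: `HttpError`, `call_with_retries` and the backoff schedule are assumptions made for illustration, and the default of 4 retries is taken from the "burns all 4 retries" observation in Problematic Patterns.

```python
import time

# Status codes that indicate a request problem retrying cannot fix.
NON_RETRYABLE = frozenset({400, 401, 403})


class HttpError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retries(request, max_retries=4, sleep=time.sleep):
    """Run request(); retry transient failures, fail fast on 400/401/403."""
    attempt = 0
    while True:
        try:
            return request()
        except HttpError as err:
            if err.status in NON_RETRYABLE or attempt >= max_retries:
                raise
            sleep(min(2 ** attempt, 30))  # capped exponential backoff
            attempt += 1


def count_attempts(status: int) -> int:
    """Count how many calls are made when every call fails with status."""
    calls = []

    def request():
        calls.append(status)
        raise HttpError(status)

    try:
        call_with_retries(request, sleep=lambda _seconds: None)
    except HttpError:
        pass
    return len(calls)


attempts_on_400 = count_attempts(400)
attempts_on_503 = count_attempts(503)
print(attempts_on_400, attempts_on_503)
```

With this guard a 400 costs a single attempt, while a transient error such as 503 still gets the initial attempt plus all configured retries.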

Medium Priority

Migrate codex workflows to stable models ([aw-failures] Daily Sub-Agent Model Resolution Audit 100% red — Codex gpt-5-codex-alpha-2025-11-07 404s (same alpha-snapshot d [Content truncated due to length] #42033, [aw-failures] PR Code Quality Reviewer red — Copilot general-purpose subagent requests tier-unsupported model → SDK 400 `model [Content truncated due to length] #42095) — reduces P1 count by 2 immediately
Review tool denial guardrails ([aw] Daily Safe Output Integrator exceeded tool denial limit #42333, [aw] Copilot Opt exceeded tool denial limit #42329) — audit tool call patterns; increase budget or constrain scope
Consolidate model monitoring: daily-model-inventory.md and model-resolution-audit cover overlapping concerns

Low Priority

Add stale PR detection (carry-forward)
Implement AIC budget forecasting ([aw] Daily Credit Limit Test exceeded daily AI credits budget #42610)
Document AI Moderator/Agentic Commands trigger conditions to reduce skipped/action_required noise

Trends

Metric	Jun 30	Jul 1	Delta
Quality Score	62/100	61/100	-1
Effectiveness Score	63/100	62/100	-1
Health Score	78/100	75/100	-3
P1 issues open	6	7	+1
P2 issues open	8	13	+5
copilot-swe-agent PR merge rate	~80%	80%	stable
Issues created today	—	43	—

Overall trajectory: mild decline driven by PR Sous Chef recurrence and 5 new P2s. Copilot SWE Agent and core monitoring agents remain stable anchors.

Actions Taken This Run

Filed #aw_model_lifecycle: Systemic model version lifecycle management tracking issue
Updated agent-performance-latest.md and shared-alerts.md in shared memory

Next Steps

Validate PR Sous Chef engine-switch fix (feat: switch pr-sous-chef to pi engine #42730) on next run
Monitor Auto-Triage transient failure ([aw] Auto-Triage Issues failed #42607, likely self-recovered)
Address model version lifecycle (#aw_model_lifecycle): implement pre-flight validation
Migrate Sub-Agent Model Resolution Audit and PR Code Quality Reviewer off deprecated models
Review guardrail limits for Safe Output Integrator ([aw] Daily Safe Output Integrator exceeded tool denial limit #42333) and Copilot Opt ([aw] Copilot Opt exceeded tool denial limit #42329)
Add stale PR detection workflow

Analysis period: 2026-06-24 to 2026-07-01 | Next report: 2026-07-08

References:

§28521103730 — this run
§28496806312 — Workflow Health Manager (Jul 1)
§28447234062 — previous Agent Performance run (Jun 30)

Generated by ⚡ Agent Performance Analyzer - Meta-Orchestrator · 64.4 AIC · ⌖ 22.3 AIC · ⊞ 2.1K · ◷

expires on Jul 2, 2026, 5:41 AM UTC-08:00

2026-07-01T13:56:16Z

mcdev7777
Jul 1, 2026

Thanks for sharing this detailed report. The model-version lifecycle issue and retry behavior around non-retryable 400/401/403 errors seem especially important to address, since both can create recurring failures and unnecessary credit usage. A pre-flight model availability check plus a fast-fail retry guard would likely improve overall workflow reliability quite a bit.

2026-07-02T15:25:29Z

github-actions[bot]
Bot Jul 2, 2026
Author

This discussion was automatically closed because it expired on 2026-07-02T13:41:41.262Z.

Closed by Workflow

