[audit-workflows] Agentic Workflow Audit — 2026-06-13 (success 68.4%, lowest since 05-23; Avenger ×6 + PR Sous Chef ×6) #39155
Closed
Smoke test ping: Copilot reached this thread, left a breadcrumb, and kept the lights green. ✅

Warning: Firewall blocked 6 domains

The following domains were blocked by the firewall during workflow execution:

network:
  allowed:
    - defaults
    - "accounts.google.com"
    - "android.clients.google.com"
    - "clients2.google.com"
    - "contentautofill.googleapis.com"
    - "safebrowsingohttpgateway.googleapis.com"
    - "www.google.com"

See Network Configuration for more information.
This discussion has been marked as outdated by Agentic Workflow Audit Agent. A newer discussion is available at Discussion #39289.
Agentic Workflow Audit — 2026-06-13
Audited the last ~7h of agentic runs (window 14:52–21:35Z; the logs MCP tool timed out at 120s, so analysis ran against 99 locally-downloaded agentic run directories out of 207 total). Overall success was 67/98 = 68.4%, the lowest single-window rate since 2026-05-23 (41.6%), with prod-main at 66.7%. The drop is almost entirely two NEW failure clusters, Avenger ×6 and PR Sous Chef ×6, which together account for ~12 of the ~14 genuine main-branch failures. Encouragingly, maintainers already have fix branches in flight for both (copilot/aw-avenger-failed-fix, copilot/aw-fix-pr-sous-chef-fail, copilot/replace-opaque-copilot-error). All PR-event failures were by-design smoke-probe noise.
Summary
Engine health: copilot 48/61 (78.7%) · claude 11/21 (52.4%, dragged down by Avenger) · codex 4/6 · pi 2/2 · antigravity 2/2 · gemini 0/2.
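The percentages quoted above can be re-derived from the raw counts. The snippet below is a minimal sketch for checking that arithmetic; the dictionary layout and the `success_rate` helper are illustrative assumptions, not part of the audit tooling.

```python
# Counts copied from the audit text: (successful runs, total runs) per engine.
engine_runs = {
    "copilot": (48, 61),
    "claude": (11, 21),
    "codex": (4, 6),
    "pi": (2, 2),
    "antigravity": (2, 2),
    "gemini": (0, 2),
}


def success_rate(successes: int, total: int) -> float:
    """Return the success rate as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * successes / total, 1)


# Overall figure quoted in the audit: 67 successes out of 98 runs.
overall = success_rate(67, 98)

# Per-engine rates, computed the same way.
per_engine = {name: success_rate(ok, total) for name, (ok, total) in engine_runs.items()}

# Totals implied by the per-engine line alone.
engine_successes = sum(ok for ok, _ in engine_runs.values())
engine_total = sum(total for _, total in engine_runs.values())
```

Under these counts the per-engine successes add up to 67, matching the overall numerator, while the per-engine totals add up to 94 rather than 98. The source does not say how the remaining runs are classified.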
Critical Findings (NEW / escalating)
1. Avenger reddens ×6 with "ERR_CONFIG: no structured log entries were produced" (claude, HIGH)
The agent completes real work and emits a valid safe-output, then a follow-up engine invocation in the same job returns no structured logs, raising ERR_CONFIG, which hard-fails the agent job and reddens the run. Token counts vary (34k/69k/0), consistent with partial completion. First appearance; this is the single biggest contributor to today's trough.
2. PR Sous Chef reddens ×6 (copilot/gpt-5-mini, MEDIUM), two distinct modes:
- update_pull_request(update_branch: true) → "update pull request branch from base failed" on PR #38911, "[ARC/DinD] Emit chroot.binariesSourcePath and chroot.identity in AWF stdin-config" (non-fast-forward / conflict). Agent succeeded; the safe_outputs job reddened.
Other prod-main failures (recurring / known)
- copilot-sdk-driver-failures (day 7)
- copilot-sdk-driver-failures (day 7)
- copilot-sdk-driver-failures (day 7): .md + git checkout -b + git status + git diff; called missing_tool
- model-param-config-incompatibility (recur ×4): gpt-5-codex-alpha, same wf/signature as 06-11 & 06-05
- chroot-node-not-available (recur ×2)
- doc-unbloat-empty-output
The copilot-sdk tool-permission-lockout continues into day 7 (3× today, down from 4× on 06-12) and remains the dominant known class. Nothing new there; the standing fix recommendation still applies.
PR-event / smoke failures (all by-design probe noise)
Smoke Gemini ×2 (gemini pricing missing, recur), Smoke Claude ×2, Smoke Codex ×1, Smoke Copilot ×1, Smoke Copilot AOAI apikey+Entra ×2 (transient-error loop, recur), PR Code Quality Reviewer ×3 (copilot-sdk, 0-tok), Design Decision Gate ×1. Firewall blocks (373/6,453) were 100% smoke-probe telemetry — by-design.
Trend Charts
Workflow Health (30d)
Success rate held a healthy 84–96% band through early June but dropped sharply to 68.4% today — the deepest trough since the 41.6% day on 05-23. The decline is concentrated rather than systemic: two workflows (Avenger, PR Sous Chef) caused the bulk of the failures, so a rebound toward ~90% is likely once their in-flight fix branches land.
Token Usage (30d)
Daily tokens landed at 43.9M, near the 7-day moving average and well below the 06-12 spike (~127M). Note today is a partial ~7h window, so the true daily figure is higher; even so, consumption is in normal range. Top consumers: Matt Pocock Skills Reviewer (5.5M/5 runs), Daily Code Metrics (4.8M), Test Quality Sentinel (4.7M/5), PR Sous Chef (4.1M/11 — much of it wasted on the failing runs).
Recommendations
- ERR_CONFIG; or gate the 2nd invocation behind a "work remaining" check. Verify copilot/aw-avenger-failed-fix covers this case.
- update_pull_request(update_branch: true) should tolerate non-fast-forward/conflict (skip-and-warn instead of reddening the safe_outputs job); the 0-token startup mode shares the Avenger no-structured-logs root. Confirm copilot/aw-fix-pr-sous-chef-fail handles both modes.
Repo memory updated (audit-history, known-issues +4 new IDs, recommendations, anomalies, metrics-summary) and pushed.
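The two recommendations above (skip-and-warn on a failed branch update, and gating the second engine invocation behind a "work remaining" check) could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: `BranchUpdateConflict`, `StepResult`, `update_branch_tolerant`, and `run_followup_if_needed` are all hypothetical names introduced for illustration, and the real fix branches may take a different approach.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("safe_outputs_sketch")


class BranchUpdateConflict(Exception):
    """Hypothetical error for a non-fast-forward or conflicting branch update."""


@dataclass
class StepResult:
    ok: bool
    warning: Optional[str] = None


def update_branch_tolerant(update_branch: Callable[[], None]) -> StepResult:
    """Call update_branch(); downgrade a conflict to a warning instead of failing the job."""
    try:
        update_branch()
    except BranchUpdateConflict as exc:
        # Skip-and-warn: record the problem but keep the step green.
        logger.warning("skipping branch update: %s", exc)
        return StepResult(ok=True, warning=str(exc))
    return StepResult(ok=True)


def run_followup_if_needed(work_remaining: bool, invoke: Callable[[], None]) -> bool:
    """Run the follow-up invocation only when work remains; return whether it ran."""
    if not work_remaining:
        return False
    invoke()
    return True
```

In this sketch any other exception still propagates, so only the specific conflict case is softened; unexpected failures would continue to redden the job.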
References: §27470367219 (Avenger) · §27471203716 (PR Sous Chef) · §27476058710 (Safe Output Integrator)