[audit-workflows] Agentic Workflow Audit — 2026-05-31 (89.8% completion, 6 failures, 2 recurrences strengthening) #36085
Overview
Full 24h window (2026-05-31, 02:05–07:54Z): 62 runs — 59 completed, 3 still in-progress at audit time. Completion rate 89.8% (53 success / 6 failure), holding the recent high-80s/low-90s band. The six failures are five distinct classes — no single systemic regression — but two recurrence classes strengthened today and one is now the clear top priority.
- Safe-output partial-failure intolerance (HIGH) hit two more workflows today (Sub-Issue Closer, LintMonster); the class now spans 4 workflows since 05-26. One failed safe-output item red-fails the entire job even when others succeeded.
- Token-budget 429 (HIGH), tracked in #35661 ("[aw-failures] Token-budget exhaustion (25M effective-tokens cap) recurring across 6+ scheduled workflows"), recurred on a 2nd consecutive day with a 2nd workflow (Daily Firewall Logs Collector), with the same 25M effective-token-cap signature as 05-30's Linter Miner.
Key Metrics
1 Only the claude engine reports `EstimatedCost`; copilot/codex report $0, so total cost is claude-biased.
Critical Findings
🔴 1. Safe-output partial-failure intolerance — ESCALATING (top priority)
One invalid or failed safe-output item red-fails the whole `safe_outputs` job even when other items in the same batch succeeded. Two workflows today:
- `add_comment` items had `target=*` with no item_number on a schedule event → all 11 failed → entire job red despite 11 successful `update_issue` items.
- `assign_to_agent` for #36048 ("[community] Update community contributions in README") and #36049 ("[daily-compiler-quality] Daily Compiler Code Quality Report - 2026-05-31") returned GitHub API `Request failed` → 2 failed items red-failed the job despite 3 successful `create_issue` (#36048–#36050, the last titled "[lint-monster] [Lint] Fix pkg/workflow function length violations (286 issues)") + 1 `create_discussion`.
Fix (HIGH): make Process Safe Outputs treat an individual failed item as skipped-with-warning when ≥1 item in the batch succeeded. Plus: (a) validate `target=*` with no resolvable number at the MCP emit boundary so the agent self-corrects in-loop (prompt-only guardrails have repeatedly failed); (b) investigate why `assign_to_agent` to copilot is failing via the API (endpoint/permission/eventual-consistency right after `create_issue`).
🔴 2. Token-budget 429 (#35661): recurring, 2nd day / 2nd workflow
Daily Firewall Logs Collector (26702042593, copilot/sonnet): after 60 turns it hit `CAPIError: 429 Maximum effective tokens exceeded (25.41M / 25M)`, retried 4× with `--continue` (each re-hitting the cap), then exhausted all retries → exitCode=1 after 18m25s. Identical signature to 05-30's Linter Miner (25.13M/25M). Documentation Noob Tester reached 25.03M eff_tok and barely passed, right at the cliff edge.
Fix (HIGH, #35661): (a) chunk or reduce the scope of heavy daily-aggregation workflows to stay under 25M eff_tok; (b) the harness should fail fast on a budget-429: retrying with `--continue` cannot recover from a hard cap, it just burns ~90s × 4.
🟠 3. NEW: threat-detection prompt.txt missing
Code Simplifier (26703540010): the agent succeeded and produced a valid PR patch, but the detection job failed. The setup step copies `prompt.txt` with `cp ... 2>/dev/null || true`, which silently masked a missing source. The detection agent then hit `Prompt file not found`, exited 1, produced no `THREAT_DETECTION_RESULT`, and the parse step red-failed the run. Because the harness is shared, any `create_pull_request` workflow with a missing prompt at detection time is exposed.
Fix: don't mask the `cp` failure; verify the prompt exists (non-empty) before invoking the detection agent and fail with a clear, actionable error or regenerate it.
🟡 Transient & lower-priority failures (2)
PR-branch-deleted race (external cause). Two PR-event runs fired on head branch `copilot/update-agentic-workflows-and-skill-file`, which no longer existed on the remote:
- `fatal: couldn't find remote ref ...` → git fetch exit 22, 0 turns.
- `push_to_pull_request_branch` failed at 21 turns.
Root cause is external: the Copilot PR branch was merged/deleted while these workflows were queued. Optional hardening: a checkout guard that exits neutral/skip on a deleted PR head ref instead of red-failing.
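The optional checkout guard could look roughly like the sketch below. It assumes the guard runs before checkout and probes the remote with `git ls-remote --exit-code`, which exits with status 2 when no matching ref exists. The function names and the three outcome labels (`proceed`, `skip`, `fail`) are hypothetical and not taken from the harness.

```python
import subprocess

# Exit status `git ls-remote --exit-code` returns when no matching ref is found.
LS_REMOTE_NO_MATCH = 2


def classify_head_ref(ls_remote_returncode: int) -> str:
    """Map the ls-remote exit status to a (hypothetical) guard outcome."""
    if ls_remote_returncode == 0:
        return "proceed"  # head ref still exists, run the workflow normally
    if ls_remote_returncode == LS_REMOTE_NO_MATCH:
        return "skip"     # branch merged/deleted while queued: neutral exit
    return "fail"         # network/auth/other git error: keep the red failure


def check_head_ref(branch: str, remote: str = "origin") -> str:
    """Probe the remote for the PR head branch and classify the result."""
    result = subprocess.run(
        ["git", "ls-remote", "--exit-code", remote, f"refs/heads/{branch}"],
        capture_output=True,
        text=True,
    )
    return classify_head_ref(result.returncode)
```

Keeping the classification separate from the `git` call makes the skip-versus-fail decision easy to unit test without a repository.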
Trend Charts (30-day / 13-window)
Workflow Health
Completion rate sits at 89.8%, essentially flat vs 05-30's 91.1% and well above the 05-23 trough (41.6%). The trend has been stable in the high-80s/low-90s for eight consecutive windows — today's 6 failures are spread across 5 distinct classes rather than a single failing workflow, so the dip is noise, not a regression.
Token & Cost
Daily tokens rose to 68.8M (highest since 05-17) and claude-measured cost to $31.63, both pulled up by a single outlier: Go Logger Enhancement ($8.30 / 41.9M eff_tok / 117 turns) — the most expensive run of the day and a new daily high, but it succeeded, so this is value-for-cost rather than waste. The 7-day moving-average cost line stays in its usual ~$22–25 band.
Cost & token leaderboard
Top cost (claude-measured): Go Logger Enhancement $8.30 (success, new high) · Safe Output Health Monitor $5.24 · Sergo $4.29 · Static Analysis Report $3.11 · Design Decision Gate $1.62 (failed).
Top effective tokens: Go Logger Enhancement 41.9M · Daily Firewall Logs 25.4M (429-failed) · Documentation Noob Tester 25.0M (at-cap success) · Safe Output Health Monitor 23.8M · Copilot CLI Deep Research 23.1M · Daily Compiler Quality Check 22.9M.
Note the cluster of heavy daily-aggregation workflows brushing the 25M effective-token cap — the same population at risk of the #35661 budget-429.
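One way to make this cluster visible ahead of a failure is a simple headroom check against the cap. The sketch below uses the effective-token figures from the leaderboard above; the 90% warning threshold and the function names are assumptions for illustration, and the report does not state whether the 25M cap applies to every engine.

```python
CAP_M = 25.0         # effective-token cap in millions, as reported
WARN_FRACTION = 0.9  # hypothetical warning threshold (90% of the cap)

# Effective tokens in millions, copied from the leaderboard above.
EFF_TOKENS_M = {
    "Go Logger Enhancement": 41.9,
    "Daily Firewall Logs": 25.4,
    "Documentation Noob Tester": 25.0,
    "Safe Output Health Monitor": 23.8,
    "Copilot CLI Deep Research": 23.1,
    "Daily Compiler Quality Check": 22.9,
}


def classify(eff_tokens_m: float, cap_m: float = CAP_M,
             warn_fraction: float = WARN_FRACTION) -> str:
    """Label a run by how close its effective tokens are to the cap."""
    if eff_tokens_m > cap_m:
        return "over-cap"
    if eff_tokens_m >= warn_fraction * cap_m:
        return "at-risk"
    return "ok"


def headroom_report(tokens_m: dict) -> dict:
    """Classify every workflow in the mapping."""
    return {name: classify(value) for name, value in tokens_m.items()}


if __name__ == "__main__":
    for name, label in headroom_report(EFF_TOKENS_M).items():
        print(f"{name}: {label}")
```

With these figures, Daily Firewall Logs lands in `over-cap` and the remaining sub-25M workflows in `at-risk`. Go Logger Enhancement is also labelled `over-cap` even though it succeeded, which suggests the cap does not bind every run; a real check would need to know which engines are subject to it.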
Network / firewall
Block rate 16.7% (702/4,204), flat vs 05-30's 16.6% — stable band. Highest pressure: Documentation Noob Tester 125/247 (51%), dominated by browser/Playwright + Google telemetry (accounts.google.com, content-autofill.googleapis.com, etc.) — not workflow-affecting. No firewall block caused any of the 6 failures.
Carried / watch items
- `--max-turns 25` fix still unverified (the recommendation to raise to ~50 stays open).
Recommendations (priority order)
1. Tolerate partial safe-output failures; validate `target=*` at the emit boundary + investigate `assign_to_agent` API failures. (safe-output partial-failure intolerance, 4 workflows since 05-26)
2. Fail fast on a budget-429 instead of retrying with `--continue`.
3. Don't mask `cp` with `|| true`; verify the prompt exists before running the detection agent.
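The top-priority fix, treating a failed item as skipped-with-warning when at least one item in the batch succeeded, can be sketched as a small batch evaluator. This is a minimal illustration under assumed names: `ItemResult`, `evaluate_batch`, and the status strings are hypothetical and do not reflect the actual Process Safe Outputs implementation.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    kind: str        # safe-output type, e.g. "add_comment" or "update_issue"
    ok: bool         # whether this individual item succeeded
    error: str = ""  # error message for failed items


def evaluate_batch(results):
    """Return (status, warnings) for a batch of safe-output item results.

    The job only fails when items failed and none succeeded; otherwise
    failed items are downgraded to warnings.
    """
    failed = [r for r in results if not r.ok]
    succeeded = len(results) - len(failed)
    warnings = [f"{r.kind}: {r.error}" for r in failed]
    if failed and succeeded == 0:
        return "failure", warnings
    if failed:
        return "success_with_warnings", warnings
    return "success", []


if __name__ == "__main__":
    # Shape of today's first case: 11 failed add_comment, 11 successful update_issue.
    batch = (
        [ItemResult("add_comment", False, "target=* with no item number")] * 11
        + [ItemResult("update_issue", True)] * 11
    )
    status, warnings = evaluate_batch(batch)
    print(status, len(warnings))
```

Under this rule the 11-failed / 11-succeeded batch would finish as `success_with_warnings` with 11 warnings rather than a red job, while a batch in which every item fails still fails.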