[audit-workflows] Daily Agentic Workflow Audit — 2026-06-09 #38221
This discussion has been marked as outdated by Agentic Workflow Audit Agent. A newer discussion is available at Discussion #38321.
Window: 2026-06-09 19:13–21:41Z (~2.5h evening cluster, 38 in-window runs — logs tool timed out at 97s so this is a partial window). Engines: copilot 27, claude 9, codex 2.
🟢 Headline: yesterday's CRITICAL production incident is RESOLVED. The copilot-pat-not-supported-400 incident ("Personal Access Tokens are not supported for this endpoint"), which failed ~15 runs yesterday, is closed. Four runs still carry the signature in their logs, but all four concluded success on attempt 1: the harness now tolerates/retries past the token-check 400 (fix branch pelikhan/fix-pat-400-retry landed). Zero PAT-400-attributable failures this window.
proxy.golang.org, by-design)
Every one of the 6 failures was copilot-engine. Claude ran 9/9 clean and codex 2/2 clean.
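The per-engine split above can be restated as simple arithmetic. The sketch below uses only the run and failure counts reported for this window; the dictionary layout and the `success_rate` helper are illustrative, not part of the audit tooling.

```python
# Counts come from this audit window (38 in-window runs, 6 failures, all copilot).
runs = {"copilot": 27, "claude": 9, "codex": 2}
failures = {"copilot": 6, "claude": 0, "codex": 0}


def success_rate(engine: str) -> float:
    """Fraction of an engine's in-window runs that did not fail."""
    total = runs[engine]
    return (total - failures[engine]) / total


for engine, total in runs.items():
    ok = total - failures[engine]
    print(f"{engine}: {ok}/{total} = {success_rate(engine):.1%}")
# copilot: 21/27 = 77.8%
# claude: 9/9 = 100.0%
# codex: 2/2 = 100.0%
```

Because the window is partial, these ratios describe only the 38 observed runs, not the full 24h.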
Critical / New This Window
daily-ai-credits-cap-429: Daily Ambient Context Optimizer (prod-main, run §27233269431) ran 57 turns / 3.0M tokens, then the Copilot API returned CAPIError 429 Maximum AI credits exceeded (1026.78 / 1000) after 5 retries → failure. This is the account daily-credit cap (~1000), distinct from the 25M effective-token cap (which was absent). The heavy aggregator crosses the cap late in its run, wasting all prior work.
copilot-harness-15min-action-timeout: Test Quality Sentinel (PR branch, run §27234356749) produced 20 turns / 649k tokens of valid review, then Execute GitHub Copilot CLI timed out after 15 minutes → reddened with no emitted output. Sibling TQS runs on other branches succeeded.
Recurring Issues
View 4 recurring failures
cli-proxy-difc-liveness-probe-failed (recur day 2, infra): liveness probe failed for localhost:18443 (gh api exit=0); agent never starts (turns=0). Hit Issue Monster (§27229956123) and PR Sous Chef (§27229990045). Intermittent: other Sous Chef main runs succeeded in the same window.
copilot-sdk-driver-failures / tool-perm-lockout (recur): Daily Safe Output Integrator (§27230190112): 11 permission-denied, turns=1. This was also the window's single missing_tool (tool/permission).
copilot-sdk-driver-failures / session.idle (recur): PR Code Quality Reviewer (§27233924083) PR branch, turns=1.
📊 Trends
Workflow Health (30 days)
Success rate sits at 83.8%, essentially on the 30-day mean of 83.6% and recovering from yesterday's PAT-incident dip. The failure stack is now small and dominated by copilot infra/timeout classes rather than a single systemic incident — a healthier shape than the past week.
Token Usage (30 days)
Daily tokens (23.8M) are below the 7-day moving average and well under the ~37M 30-day average — partly because this is a short partial window. No effective-token cap pressure this window.
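The daily-credit cap is a separate limit from the token figures above: the credits-cap failure reported in this audit hit 1026.78 / 1000 only after 57 turns. One way a long-running workflow might avoid losing its work is to stop while it still has headroom. The sketch below is an assumption-heavy illustration: `CreditBudget`, `run_turns`, the 50-credit reserve, and the uniform 18-credits-per-turn cost are all hypothetical, and nothing here reflects how the harness actually meters credits.

```python
from dataclasses import dataclass


@dataclass
class CreditBudget:
    """Hypothetical guard tracking credits spent against a daily cap."""
    cap: float = 1000.0      # approximate daily cap reported in the audit
    reserve: float = 50.0    # assumed safety margin, not from the audit
    spent: float = 0.0

    def record(self, credits: float) -> None:
        self.spent += credits

    def can_continue(self, next_turn_estimate: float) -> bool:
        # Allow the next turn only if it keeps us under cap minus reserve.
        return self.spent + next_turn_estimate <= self.cap - self.reserve


def run_turns(turn_costs, budget: CreditBudget) -> int:
    """Run turns until the budget guard says stop; return turns completed."""
    completed = 0
    for cost in turn_costs:
        if not budget.can_continue(cost):
            break  # stop early so partial output can still be emitted
        budget.record(cost)
        completed += 1
    return completed


# Illustration: 57 turns at an assumed flat 18 credits each (about 1026 total).
budget = CreditBudget()
done = run_turns([18.0] * 57, budget)
print(done, budget.spent)  # 52 936.0
```

Under these assumed numbers the run stops after 52 turns with 936 credits spent, leaving room to write out results instead of failing on a 429.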
Recommendations
localhost:18443 liveness probe before declaring the cli-proxy dead: it is intermittently flaky and zeroes out otherwise-healthy runs.
rec-pat-400-rollback-or-fix and keep pelikhan/fix-pat-400-retry merged on main.
Notes
All 6 failures were copilot-engine; claude and codex were 100% clean. The window is partial (logs tool timeout), so absolute counts undercount the full 24h. Repo memory updated: PAT-400 marked RESOLVED, two new issue classes recorded, recurrence counters bumped for the cli-proxy and sdk-driver families.
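Assuming the liveness-probe recommendation means retrying the check before giving up on the cli-proxy, a minimal sketch could look like the following. `probe_with_retries`, its parameters, and the exponential backoff schedule are hypothetical; the `probe` callable stands in for whatever check the harness actually runs against localhost:18443.

```python
import time
from typing import Callable


def probe_with_retries(
    probe: Callable[[], bool],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, backing off between failures.

    Returns True on the first successful probe, False if every attempt fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** attempt)
    return False


# Usage with a stand-in probe that fails twice, then succeeds.
results = iter([False, False, True])
delays = []
ok = probe_with_retries(lambda: next(results), sleep=delays.append)
print(ok, delays)  # True [1.0, 2.0]
```

Injecting `sleep` keeps the example testable without real waiting; a real probe would only be declared dead after all attempts fail.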
References: §27233269431 · §27234356749 · §27230190112