[audit-workflows] Daily Agentic Workflow Audit — 2026-06-10 (87.0% health; NEW Docker Hub pull timeout + cli-proxy day-3 recur) #38321
Daily Agentic Workflow Audit — 2026-06-10
Observed a PARTIAL ~4.2h morning cluster (05:33–09:46Z; 48 runs with full summaries before the logs MCP tool timed out at 67s; ~20h of the 24h window unobserved). Health held at 87.0% (40 success / 6 failure of 46 completed; 2 in-progress, including this audit). The 06-08 PAT-400 production incident remains resolved, the 06-09 daily-AI-credits-429 cap was absent, and the 25M effective-token cap stayed quiet. All six failures landed on main, and all agent-execution failures were copilot-engine; claude was 15/15 and codex 1/1 on the agent step. The headline this window is a new Docker Hub image-pull timeout that reddened two prod workflows before the agent could even start.
Summary
Critical Issues
🆕 Docker Hub registry pull timeout (NEW, dominant; 2 prod-main runs): download_docker_images.sh pulled every ghcr.io image fine but timed out on the single Docker Hub image, node:lts-alpine: Get "(registry1.docker.io/redacted) context deadline exceeded / dial tcp 54.166.120.201:443: i/o timeout across all 3 retries → exit code 123, agent never started (turns=0). Hit Daily Windows Terminal Integration Builder and Daily Hippo Learn in a ~1.7h band (06:09–07:49Z). Likely a Docker Hub anonymous-pull rate limit or a transient network flake. Fix: mirror node:lts-alpine into ghcr.io (or pin a ghcr-hosted node image) so agent setup no longer depends on Docker Hub.
🔁 cli-proxy DIFC liveness probe (RECUR, day 3): awf-cli-proxy exits (1), liveness probe to localhost:18443 gets connection-refused → fail-fast, turns=0. Newly hit Sub-Issue Closer and Auto-Triage Issues today (06-08: PR Description Updater / Issue Monster; 06-09: Issue Monster / PR Sous Chef). Intermittent (sibling runs succeed) ⇒ container start-order/readiness race. Fix: gate agent start on :18443 readiness with bounded retry instead of fail-fast.
Other Findings
🆕 Degenerate tool-loop → 5-min action timeout (3.7M tokens wasted)
GitHub Remote MCP Authentication Test (copilot, run 27258574627) ran the identical command date -u +"%Y-%m-%d %H:%M:%S UTC" 130+ times across 136 turns, burning 3.7M tokens with zero progress, then hit ##[error]The action 'Execute GitHub Copilot CLI' has timed out after 5 minutes. This shares the action-timeout mechanism with the 06-09 TQS 15-min timeout, but the root cause here is a degenerate self-loop, not a legitimate overrun. A repeated-identical-tool-call / no-progress guard would have aborted early and saved the tokens.
🆕 repo-memory push: orphan branch needs signed-commit seeding (agent succeeded, run still reddened)
Daily Safe Outputs Git Simulator (claude, run 27256126843): the agent, detection, and safe_outputs jobs all succeeded, but the post-agent push_repo_memory job failed 4/4 attempts: remote: error: GH013: Repository rule violations found for refs/heads/memory/git-simulator ... push declined. The orphan branch memory/git-simulator has never existed and the repo requires signed commits, but the workflow's first commit to a brand-new orphan branch is unsigned (patch size was fine, 3.4KB/10KB). This reddens the whole run despite a successful task. Fix: seed memory/* branches with a signed first commit, sign the orphan-branch first commit, or exempt memory/* from the signed-commit ruleset.
Full failure roster (6, all on main)
Observability signals & quiet/resolved classes
- EffectiveTokens unpopulated for all runs again (recurring since 06-08) → 25M eff-cap proximity unverifiable this window.
- index.crates.io, proxy.golang.org, mtalk.google.com:5228, all by-design.
📊 Trend Charts
Workflow Health (observed windows, ~21 days)
Success rate sits at 87.0%, right around the ~84% 30-day baseline and well clear of the 05-23 41.6% trough. Failures stay in the low single digits per window; today's six are spread across four distinct classes (two of them brand new) rather than one systemic break, so the rate held despite the new Docker Hub disruption.
Token Usage (daily + 7-window moving average)
Daily tokens came in at 43.4M, above the recent 24M evening-cluster windows because this morning cluster caught more heavy schedulers — and inflated by the 3.7M tokens the Remote MCP Auth Test burned in its degenerate loop. The 7-window moving average remains in the stable 30–50M band with no runaway trend.
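The 3.7M tokens lost to the degenerate loop are the kind of waste a repeated-identical-tool-call guard could cap. The following is only a minimal sketch of such a guard, not the workflow engine's actual mechanism: the class name `RepeatedCallGuard`, the `record` method, and the threshold of 5 are all hypothetical choices made for illustration.

```python
class RepeatedCallGuard:
    """Flag a run whose agent issues the same tool call many times in a row.

    Hypothetical helper: the name, the threshold, and the (tool, command)
    call shape are illustrative assumptions, not part of the audited system.
    """

    def __init__(self, max_repeats: int = 5) -> None:
        self.max_repeats = max_repeats
        self._last_call = None  # most recent (tool, command) pair
        self._streak = 0        # consecutive occurrences of that pair

    def record(self, tool: str, command: str) -> bool:
        """Record one tool call; return True when the run should abort."""
        call = (tool, command)
        if call == self._last_call:
            self._streak += 1
        else:
            self._last_call = call
            self._streak = 1
        return self._streak >= self.max_repeats


# Replay of the observed pattern: one command repeated on every turn.
guard = RepeatedCallGuard(max_repeats=5)
aborted_at = None
for turn in range(1, 137):
    if guard.record("bash", 'date -u +"%Y-%m-%d %H:%M:%S UTC"'):
        aborted_at = turn
        break
print(aborted_at)
```

Under these assumptions the replay stops at turn 5 rather than turn 136. A stricter variant could also count near-identical calls or require evidence of progress between calls; the consecutive-repeat check is simply the smallest version of the idea.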
Recommendations
- Mirror node:lts-alpine to ghcr.io (or authenticate the pull) so agent setup never depends on registry-1.docker.io. Two prod-main reds in one window.
- Gate on :18443 readiness before starting the agent (bounded retry): 3 consecutive days of intermittent liveness-probe reds.
- Seed a signed first commit on memory/* branches (or exempt them from the signed-commit ruleset) so a successful agent run is not reddened by the push step.
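The bounded-retry readiness gate recommended for the cli-proxy race could be sketched as follows. This is an illustration under stated assumptions, not the actual probe: the function name `wait_for_port` and its default attempt count and delays are hypothetical, and a plain TCP connect stands in for whatever liveness check the workflow really performs.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 10,
                  delay_s: float = 1.0, connect_timeout_s: float = 1.0) -> bool:
    """Return True once a TCP connect succeeds, False after `attempts` tries.

    Hypothetical helper: name and defaults are illustrative assumptions.
    """
    for attempt in range(1, attempts + 1):
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=connect_timeout_s):
                return True
        except OSError:
            # Connection refused or timed out: wait, then retry (bounded).
            if attempt < attempts:
                time.sleep(delay_s)
    return False
```

A setup step might call `wait_for_port("localhost", 18443)` and fail the job only if it returns False, so a proxy that comes up a few seconds late no longer reddens the run, while one that never starts still fails within a bounded time.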