[audit-workflows] Daily Agentic Workflow Audit — 2026-06-10 (87.0% health; NEW Docker Hub pull timeout + cli-proxy day-3 recur) #38321

2026-06-10T10:04:41Z

github-actions[bot]
Bot Jun 10, 2026

Daily Agentic Workflow Audit — 2026-06-10

Observed a PARTIAL ~4.2h morning cluster (05:33–09:46Z; 48 runs with full summaries before the logs MCP tool timed out at 67s), leaving ~20h of the 24h window unobserved. Health held at 87.0% (40 successes / 6 failures of 46 completed; 2 in-progress, including this audit). The 06-08 PAT-400 production incident remains resolved, the 06-09 daily-AI-credits-429 cap was absent, and the 25M effective-token cap stayed quiet. All six failures landed on main, and every agent-execution failure was on the copilot engine; claude was 15/15 and codex 1/1 on the agent step. The headline this window is a new Docker Hub image-pull timeout that reddened two prod workflows before the agent could even start.

Summary

Metric	Value
Runs observed	48 (46 completed, 2 in-progress)
Success / Failure	40 / 6 — 87.0%
Engines	copilot 32 · claude 15 · codex 1
Tokens	43.4M (AIC 7,487.8)
Est. cost (all engines)	$30.44
Turns / action-min	843 / 479
GitHub API calls	1,005
Firewall blocked	19 / 2,747 = 0.69% (all by-design)
Missing tools / data / MCP failures	0 / 0 / 0
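
The percentages in the table can be reproduced from the raw counts. A minimal sketch, assuming health is successes over completed runs (in-progress runs excluded) and the firewall rate is blocked over total requests; `rate_pct` is a hypothetical helper, not part of the audit tooling:

```python
def rate_pct(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, 2)


# Counts taken from the Summary table above.
successes, failures = 40, 6
completed = successes + failures          # 46; the 2 in-progress runs are excluded

health = rate_pct(successes, completed)   # 86.96, reported as 87.0%
firewall = rate_pct(19, 2747)             # 0.69

print(f"health={health:.1f}% firewall={firewall:.2f}%")
```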

Critical Issues

🆕 Docker Hub registry pull timeout (NEW, dominant — 2 prod-main runs) — download_docker_images.sh pulled every ghcr.io image fine but timed out on the single Docker Hub image node:lts-alpine: Get "(registry1.docker.io/redacted) context deadline exceeded / dial tcp 54.166.120.201:443: i/o timeout across all 3 retries → exit code 123, agent never started (turns=0). Hit Daily Windows Terminal Integration Builder and Daily Hippo Learn in a ~1.7h band (06:09–07:49Z). Likely a Docker Hub anonymous-pull rate-limit or transient network flake. Fix: mirror node:lts-alpine into ghcr.io (or pin a ghcr-hosted node image) so agent setup no longer depends on Docker Hub.
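
One way to apply the mirroring fix is to rewrite Docker Hub image references to a ghcr.io mirror before the pull step runs. The sketch below is illustrative only: the mirror path `ghcr.io/example-org/mirror/node` is a placeholder, and `MIRRORS`, `is_docker_hub`, and `rewrite_image` are hypothetical names, not part of download_docker_images.sh.

```python
# Hypothetical mirror map: the ghcr.io path is a placeholder, not a real repository.
MIRRORS = {"node": "ghcr.io/example-org/mirror/node"}

DOCKER_HUB_HOSTS = ("docker.io", "registry-1.docker.io")


def is_docker_hub(image: str) -> bool:
    """True if the reference resolves to Docker Hub (no registry host, or an explicit Hub host)."""
    first = image.split("/", 1)[0]
    # A first path component counts as a registry host if it looks like one.
    has_host = "/" in image and ("." in first or ":" in first or first == "localhost")
    return (not has_host) or first in DOCKER_HUB_HOSTS


def rewrite_image(image: str) -> str:
    """Return the mirrored reference for known Docker Hub images, else the input unchanged."""
    if not is_docker_hub(image):
        return image
    name = image
    # Strip optional explicit Hub prefixes so "docker.io/library/node" maps like "node".
    for prefix in ("registry-1.docker.io/", "docker.io/", "library/"):
        if name.startswith(prefix):
            name = name[len(prefix):]
    repo, sep, tag = name.partition(":")
    mirror = MIRRORS.get(repo)
    if mirror is None:
        return image  # no mirror configured; leave the reference as-is
    return f"{mirror}{sep}{tag}"


print(rewrite_image("node:lts-alpine"))
```

Images already hosted on ghcr.io pass through untouched, so the rewrite only affects the one reference that currently depends on Docker Hub.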

🔁 cli-proxy DIFC liveness probe (RECUR — day 3) — awf-cli-proxy exits(1), liveness probe to localhost:18443 connection-refused → fail-fast, turns=0. Newly hit Sub-Issue Closer and Auto-Triage Issues today (06-08 PR Description Updater/Issue Monster, 06-09 Issue Monster/PR Sous Chef). Intermittent (sibling runs succeed) ⇒ container start-order/readiness race. Fix: gate agent start on :18443 readiness with bounded retry instead of fail-fast.
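
The readiness gate proposed above could look roughly like this. It is a sketch under assumptions: a successful TCP connect to the port is treated as "ready" (the real liveness probe may check more), and `wait_for_port` with its default attempt count and delay is hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, attempts: int = 10,
                  delay_s: float = 1.0, timeout_s: float = 1.0) -> bool:
    """Try to open a TCP connection up to `attempts` times; return True on first success."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                return True
        except OSError:
            # Connection refused or timed out: the proxy may still be starting.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False


if __name__ == "__main__":
    # Port 18443 is the cli-proxy port named in the finding above.
    if not wait_for_port("localhost", 18443):
        raise SystemExit("cli-proxy not ready after bounded retry")
```

Bounding the retries keeps a genuinely dead proxy failing quickly while absorbing the start-order race that a single fail-fast probe cannot.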

Other Findings

🆕 Degenerate tool-loop → 5-min action timeout (3.7M tokens wasted)

GitHub Remote MCP Authentication Test (copilot, run 27258574627) ran the identical command `date -u +"%Y-%m-%d %H:%M:%S UTC"` 130+ times across 136 turns, burning 3.7M tokens with zero progress, then hit ##[error]The action 'Execute GitHub Copilot CLI' has timed out after 5 minutes. This shares the action-timeout mechanism with the 06-09 TQS 15-min timeout, but the root cause here is a degenerate self-loop, not a legitimate overrun. A guard against repeated identical tool calls with no progress would have aborted the run early and saved most of those tokens.
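
A repeated-identical-call guard of the kind described above might be sketched as follows. `RepeatGuard`, its threshold, and the `(tool, args)` key are assumptions for illustration; the actual engine hook where such a check would run is not described in this audit.

```python
class RepeatGuard:
    """Flags a run once the same tool call has been made `max_repeats` times in a row."""

    def __init__(self, max_repeats: int = 5):
        if max_repeats < 1:
            raise ValueError("max_repeats must be >= 1")
        self.max_repeats = max_repeats
        self._last = None   # (tool, args) of the most recent call
        self._count = 0     # length of the current streak of identical calls

    def observe(self, tool: str, args: str) -> bool:
        """Record one tool call; return True when the run should be aborted."""
        key = (tool, args)
        if key == self._last:
            self._count += 1
        else:
            self._last, self._count = key, 1
        return self._count >= self.max_repeats


guard = RepeatGuard(max_repeats=5)
cmd = 'date -u +"%Y-%m-%d %H:%M:%S UTC"'
aborted_at = None
for turn in range(1, 137):
    if guard.observe("bash", cmd):
        aborted_at = turn
        break
print(aborted_at)  # 5
```

This only catches consecutive identical calls; a loop that alternates between two commands would need a windowed or output-progress check instead.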

🆕 repo-memory push: orphan branch needs signed-commit seeding (agent succeeded, run still reddened)

Daily Safe Outputs Git Simulator (claude, run 27256126843) — the agent, detection, and safe_outputs jobs all succeeded, but the post-agent push_repo_memory job failed 4/4 attempts: remote: error: GH013: Repository rule violations found for refs/heads/memory/git-simulator ... push declined. The orphan branch memory/git-simulator has never existed and the repo requires signed commits, but the workflow's first commit to a brand-new orphan branch is unsigned (patch size was fine, 3.4KB/10KB). This reddens the whole run despite a successful task. Fix: seed memory/* branches with a signed first commit, sign the orphan-branch first commit, or exempt memory/* from the signed-commit ruleset.

Full failure roster (6, all on main)

Workflow	Engine	Class	Turns	Note
Daily Windows Terminal Integration Builder	copilot	docker-hub-pull-timeout 🆕	0	node:lts-alpine exit 123
Daily Hippo Learn	copilot	docker-hub-pull-timeout 🆕	0	same; had written hippo-store memory earlier
Sub-Issue Closer	copilot	cli-proxy-liveness-18443 🔁d3	0	connection refused, fail-fast
Auto-Triage Issues	copilot	cli-proxy-liveness-18443 🔁d3	0	connection refused
GitHub Remote MCP Authentication Test	copilot	action-timeout-5min + degenerate-loop 🆕	136	3.7M tok on repeated `date -u`
Daily Safe Outputs Git Simulator	claude	repo-memory-signed-commit-seed 🆕	18	agent succeeded; push step reddened run

Observability signals & quiet/resolved classes

Resolved/quiet: copilot-pat-not-supported-400 still resolved (no signature in 48 runs); daily-ai-credits-cap-429 absent; token-budget-429 / 25M eff-cap absent.
Hotspots (observability): auto-triage-issues 33% fail-rate; pr-sous-chef execution-drift 0→36 turns; daily-agentrx-trace-optimizer 12/71 (17%) firewall blocks (low absolute volume).
Data-quality: EffectiveTokens unpopulated for all runs again (recurring since 06-08) → 25M eff-cap proximity unverifiable this window.
Firewall: 0.69% blocked — only index.crates.io, proxy.golang.org, mtalk.google.com:5228, all by-design.

📊 Trend Charts

Workflow Health (observed windows, ~21 days)

Success rate sits at 87.0%, slightly above the ~84% 30-day baseline and well clear of the 05-23 41.6% trough. Failures stay in the low single digits per window; today's six are spread across four distinct classes (two of them brand new) rather than one systemic break, so the rate held despite the new Docker Hub disruption.

Token Usage (daily + 7-window moving average)

Daily tokens came in at 43.4M, above the recent 24M evening-cluster windows, for two reasons: this morning cluster caught more heavy schedulers, and the Remote MCP Auth Test burned 3.7M tokens in its degenerate loop. The 7-window moving average remains in the stable 30–50M band with no runaway trend.
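
For reference, a trailing moving average over observed windows can be computed as below. This is a generic sketch, not the audit's charting code: `moving_average` is a hypothetical helper, and averaging over fewer than 7 windows at the start of the series is an assumption about how early points are handled.

```python
from collections import deque


def moving_average(values, window=7):
    """Trailing moving average; early points average over however many values exist so far."""
    if window < 1:
        raise ValueError("window must be >= 1")
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]      # the oldest value is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Toy input only, not audit data.
print(moving_average([1, 2, 3, 4], window=2))  # [1.0, 1.5, 2.5, 3.5]
```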

Recommendations

HIGH — Remove the Docker Hub dependency: mirror/pin node:lts-alpine to ghcr.io (or authenticate the pull) so agent setup never depends on registry-1.docker.io. Two prod-main reds in one window.
MEDIUM — Add a degenerate-loop guard: detect N consecutive identical tool calls / no output progress over K turns and abort early — would have saved 3.7M tokens on the Remote MCP Auth Test.
MEDIUM — Harden cli-proxy startup: wait for :18443 readiness before starting the agent (bounded retry) — 3 consecutive days of intermittent liveness-probe reds.
MEDIUM — Fix repo-memory orphan-branch push: seed/sign the first commit on new memory/* branches (or exempt them from the signed-commit ruleset) so a successful agent run is not reddened by the push step.

References:

§27256826368 — Windows Terminal (docker-hub pull timeout)
§27258574627 — Remote MCP Auth Test (degenerate loop)
§27256126843 — Git Simulator (repo-memory signed-commit seed)

Generated by 🔍 Agentic Workflow Audit Agent · 463 AIC · ⌖ 14.6 AIC · ⊞ 6.2K · ◷

expires on Jun 11, 2026, 2:04 AM UTC-08:00
