The `awf-api-proxy` sidecar container fails its Docker health check before agent activation, blocking multiple workflows with a pre-inference crash:
```
Container awf-api-proxy  Started
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```
This was previously seen only on PR branches (tracked by #27688, classified as lower priority). It is now affecting main-branch production workflows, making this a P0 issue:
All affected runs: 0 turns, 0 tokens, 0 tool calls. Agent container never starts.
Root Cause Analysis
The `awf-squid` proxy starts and reaches `Healthy`, but `awf-api-proxy` stalls at the health-check polling phase (`Waiting`) and times out. The container startup log pattern is consistent across all 4 failures:
```
Container awf-squid      Healthy   ← succeeds
Container awf-api-proxy  Waiting   ← health check polling
Container awf-api-proxy  Error     ← health check timed out or failed
```
Failures are clustered across a ~34-minute window (15:01–15:35 UTC) and span three distinct engines (Claude Code, GitHub Copilot CLI, OpenCode) — ruling out per-engine config as the cause. Smoke CI succeeded again at 15:24 UTC, suggesting the problem was intermittent during that window.
Probable causes:
- A dependency the api-proxy needs (env var, secret, downstream endpoint) was transiently unavailable
- The health check timeout/grace period is too short under load — the proxy takes longer to initialize when concurrent runs are queued
- A recent api-proxy image change introduced a broken or slower health check
Proposed Remediation
- **Diagnose:** Read the api-proxy container logs from the failing runs (`/tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs` in run artifacts) to get the specific health-check failure reason
- **If timeout-related:** Increase `start_period` and/or `interval` in the Docker Compose health check config for `awf-api-proxy`
- **If image-related:** Identify recent changes to the api-proxy image and roll back if needed
- **If intermittent:** Add a retry/backoff mechanism at the `docker compose up` level before failing the run
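If the logs point to a timeout, the remediation is a more forgiving health check. A minimal sketch of what the Compose service definition might look like — the service name comes from this report, but the image tag, probe command, and specific timing values are assumptions, not the repository's actual config:

```yaml
services:
  awf-api-proxy:
    image: awf-api-proxy:latest   # hypothetical image tag
    healthcheck:
      # Hypothetical probe; the real check command may differ.
      test: ["CMD", "curl", "-fsS", "http://localhost:8080/healthz"]
      interval: 5s
      timeout: 3s
      retries: 10
      start_period: 60s   # grace window: failures here don't count against retries
```

Raising `start_period` is usually the safer lever, since it tolerates slow initialization under load without slowing detection of genuinely unhealthy containers the way a larger `interval` would.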
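For the intermittent case, a retry wrapper around `docker compose up` is straightforward. A sketch in shell — the function name, attempt counts, and the exact `docker compose` invocation are illustrative, not taken from the workflow:

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff before failing the run.
retry_with_backoff() {
  local max_attempts=$1; shift
  local delay=1 attempt=1
  while true; do
    "$@" && return 0            # success: stop retrying
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    delay=$((delay * 2))        # exponential backoff: 1s, 2s, 4s, ...
    attempt=$((attempt + 1))
  done
}

# Hypothetical usage in the workflow step:
# retry_with_backoff 3 docker compose up --wait --detach
```

This keeps the retry policy at the orchestration layer, so a transiently unavailable dependency gets a second chance without any change to the container image itself.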
Success Criteria
- DeepReport and Smoke CI complete without `awf-api-proxy is unhealthy` on the next scheduled/push triggers
- Root cause (timeout vs. broken health check vs. upstream dependency) confirmed from api-proxy logs
Parent Issue
Part of failure investigation report: #27729
New Recurrence — 2026-04-23 13:05 UTC
Run #24836877236 (Smoke CI, push to `main`) failed with the identical `awf-api-proxy is unhealthy` pattern.

Smoke CI ran successfully at 12:26, 12:31, and 12:44 UTC on the same day, confirming the failure is intermittent, not a permanent regression. Root cause and remediation options remain as documented above.
New Recurrence — 2026-04-23 18:51 UTC
Run #24852980557 (Design Decision Gate 🏗️, push to `main`) failed with the same `awf-api-proxy is unhealthy` pattern in the `detection` job (the agent job ran successfully — noop for PR #28146, "Fix 1px header nav gap at iPad 768px breakpoint").

This is the 3rd documented recurrence. The agent job succeeded, but the detection sidecar still intermittently fails health checks, causing the overall workflow to report failure despite valid output.