Summary
A downstream elastic/docs-content agentic workflow failed before the agent could run because the awf-api-proxy sidecar became unhealthy during docker compose up. The user reports seeing this flaky failure twice recently; this run is one concrete example.
Run: https://github.com/elastic/docs-content/actions/runs/25046213576
Failed job: Docs AI / triage / agent
Failed step: Execute GitHub Copilot CLI
Workflow: Docs AI menu, event issue_comment, branch main.
This looks like a recurrence of the failure pattern from #27888, but that issue is closed and this example occurred later on gh-aw v0.71.1 with firewall 0.25.28.
Observed failure
The agent never starts and no model/tool work is performed. The uploaded agent_output.json contains an empty items array.
The decisive lines are in c5bf1190-agent/agent-stdio.log:
[INFO] Starting containers...
Container awf-squid Starting
Container awf-api-proxy Starting
Container awf-api-proxy Started
Container awf-squid Started
Container awf-squid Waiting
Container awf-api-proxy Waiting
Container awf-squid Healthy
Container awf-api-proxy Error
dependency failed to start: container awf-api-proxy is unhealthy
[ERROR] Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never
The generated compose file defines the api-proxy health check as:
api-proxy:
  healthcheck:
    test:
      - CMD
      - curl
      - '-f'
      - http://localhost:10000/health
    interval: 1s
    timeout: 1s
    retries: 5
    start_period: 2s
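Under standard Docker healthcheck semantics those settings give the proxy only the 2s start_period plus up to 5 probes at 1s intervals, roughly 7 seconds in total, to start answering /health before compose marks it unhealthy. A more forgiving variant would look roughly like the following; the values are illustrative only, not a tested recommendation:
api-proxy:
  healthcheck:
    test:
      - CMD
      - curl
      - '-f'
      - http://localhost:10000/health
    interval: 2s
    timeout: 2s
    retries: 15
    start_period: 15s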
Environment from the failing artifact
- GH_AW_VERSION: v0.71.1
- COPILOT_MODEL: claude-sonnet-4.6
- Firewall images:
  - ghcr.io/github/gh-aw-firewall/agent:0.25.28@sha256:a8834e285807654bf680154faa710d43fe4365a0868142f5c20e48c85e137a7a
  - ghcr.io/github/gh-aw-firewall/api-proxy:0.25.28@sha256:93290f2393752252911bd7c39a047f776c0b53063575e7bde4e304962a9a61cb
  - ghcr.io/github/gh-aw-firewall/squid:0.25.28@sha256:844c18280f82cd1b06345eb2f4e91966b34185bfc51c9f237c3e022e848fb474
- Runner image: ubuntu24, ImageVersion: 20260426.100.1
- Triggering repo: elastic/docs-content
Diagnostic gap
The artifact says API proxy logs are available at /tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs, and the compose file mounts that path into /var/log/api-proxy, but the downloaded artifact only includes these firewall log files:
sandbox/firewall/logs/access.log
sandbox/firewall/logs/audit.jsonl
sandbox/firewall/logs/cache.log
There are no api-proxy-logs files in the artifact, so it is not possible to tell whether the proxy failed to bind, crashed, timed out during initialization, or failed its /health route for another reason.
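Based on that description, the mount presumably corresponds to a compose entry roughly like this (reconstructed from the prose above, not copied from the actual generated file):
api-proxy:
  volumes:
    - /tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs:/var/log/api-proxy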
Why this seems flaky
The symptoms match those described in awf-api-proxy sidecar unhealthy — now blocking main-branch workflows (DeepReport, Smoke CI, Test Quality Sentinel) #27888, including awf-squid becoming healthy while awf-api-proxy times out as unhealthy, and the failure has only been seen intermittently (twice recently) rather than on every run.
Requested follow-up
Could gh-aw either reopen/continue the previous investigation or track this as a new recurrence on v0.71.1?
Useful fixes would be:
- Capture awf-api-proxy container logs when docker compose up fails, before the containers are removed (see the sketch after this list).
- Include the mounted api-proxy-logs directory in uploaded artifacts even when it is empty, so its absence is explicit.
- Consider increasing the api proxy health check grace period/retries (along the lines of the sketch under the compose excerpt above), or retrying docker compose up when only the api proxy health check flakes.
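For the first item, one possible shape is sketched below as a workflow-style step purely for illustration; the step name, placement, and fallbacks are assumptions rather than existing gh-aw code, and in practice this would need to run wherever gh-aw handles the compose failure, before it tears the containers down:
- name: Dump awf-api-proxy state on firewall startup failure
  if: failure()
  run: |
    # Container name and log path taken from this run's logs and compose file
    docker logs awf-api-proxy || true
    docker inspect --format '{{json .State.Health}}' awf-api-proxy || true
    ls -la /tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs || true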