[aw-failures] awf-api-proxy sidecar unhealthy — now blocking main-branch workflows (DeepReport, Smoke CI, Test Quality Sentinel) #27888

@github-actions

Problem

The awf-api-proxy sidecar container fails its Docker health check before agent activation, blocking multiple workflows with a pre-inference crash:

Container awf-api-proxy  Started
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy

This was previously seen only on PR branches (tracked by #27688, classified as lower priority). It is now affecting main-branch production workflows, making this a P0 issue:

Workflow | Run ID | Trigger | Branch | Time (UTC)
DeepReport - Intelligence Gathering Agent | §24785716916 | schedule | main | 15:01
Smoke CI | §24786173739 | push | main | 15:10
Test Quality Sentinel | §24786862508 | push | main | 15:24
Smoke OpenCode | §24787456440 | pull_request | copilot/disable-shell-history-expansion | 15:35

All affected runs: 0 turns, 0 tokens, 0 tool calls. Agent container never starts.

Root Cause Analysis

The awf-squid proxy starts and reaches Healthy, but awf-api-proxy stalls in the health-check polling phase (Waiting) and then reports Error, meaning the health check either timed out or failed. The container startup log pattern is consistent across all 4 failures:

Container awf-squid   Healthy        ← succeeds
Container awf-api-proxy  Waiting     ← health check polling
Container awf-api-proxy  Error       ← health check timed out or failed

Failures are clustered across a ~34-minute window (15:01–15:35 UTC) and span three distinct engines (Claude Code, GitHub Copilot CLI, OpenCode) — ruling out per-engine config as the cause. Smoke CI succeeded again at 15:24 UTC, suggesting the problem was intermittent during that window.
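
The log pattern above can be checked mechanically when triaging many runs. The following is a minimal sketch, assuming the Docker Compose status lines look exactly like the excerpts quoted in this issue; the function names `last_container_states` and `failed_containers` are hypothetical helpers, not part of any existing tooling.

```python
import re

# Matches Docker Compose status lines such as "Container awf-api-proxy  Error".
# Assumption: the line format is the one quoted in this issue.
STATUS_LINE = re.compile(r"^\s*Container\s+(\S+)\s+(\w+)")


def last_container_states(log_text):
    """Return a dict mapping each container name to its last reported state."""
    states = {}
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            name, state = match.groups()
            states[name] = state  # later lines overwrite earlier states
    return states


def failed_containers(log_text):
    """Return the sorted names of containers whose final state is Error."""
    states = last_container_states(log_text)
    return sorted(name for name, state in states.items() if state == "Error")


sample = """\
Container awf-squid   Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
"""

print(failed_containers(sample))  # prints ['awf-api-proxy']
```

Run against the startup logs of each failing run, a helper like this would confirm whether awf-api-proxy is the only container ending in Error or whether other sidecars are affected as well.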

Probable causes:

  1. A dependency the api-proxy needs (env var, secret, downstream endpoint) was transiently unavailable
  2. The health check timeout/grace period is too short under load — the proxy takes longer to initialize when concurrent runs are queued
  3. A recent api-proxy image change introduced a broken or slower health check

Proposed Remediation

  1. Diagnose: Read api-proxy container logs from the failing runs (/tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs in run artifacts) to get the specific health check failure reason
  2. If timeout-related: Increase start_period and/or interval in the Docker Compose health check config for awf-api-proxy
  3. If image-related: Identify recent changes to the api-proxy image and roll back if needed
  4. If intermittent: Add a retry/backoff mechanism at the docker compose up level before failing the run
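
As an illustration of remediation step 4, the sketch below wraps a container start-up command in a retry loop with exponential backoff. It is only a sketch under stated assumptions: the helper names (`compose_up`, `retry_with_backoff`), the attempt count, and the delays are hypothetical, and it assumes Docker Compose v2, where `docker compose up --wait` blocks until services are running or healthy. How the firewall actually invokes Compose is not described in this issue.

```python
import subprocess
import time


def compose_up(compose_file):
    """Run `docker compose up -d --wait` and return its exit code.

    Assumption: Compose v2 is installed; compose_file is a hypothetical path.
    """
    result = subprocess.run(
        ["docker", "compose", "-f", compose_file, "up", "-d", "--wait"],
        check=False,
    )
    return result.returncode


def retry_with_backoff(action, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call action() until it returns 0 or the attempts are exhausted.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    Returns the exit code of the last attempt.
    """
    code = 1
    for attempt in range(attempts):
        code = action()
        if code == 0:
            return code
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return code


# Example usage (not executed here; the file name is a placeholder):
#   exit_code = retry_with_backoff(lambda: compose_up("docker-compose.yml"))
#   raise SystemExit(exit_code)
```

Passing `sleep` as a parameter keeps the backoff logic testable without real delays. A retry only masks the symptom, so it is probably best combined with the log diagnosis in step 1 rather than used instead of it.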

Success Criteria

  • DeepReport and Smoke CI complete without the "awf-api-proxy is unhealthy" error on their next scheduled/push triggers
  • Root cause (timeout vs. broken health check vs. upstream dependency) confirmed from api-proxy logs
  • Fix documented and validated

Parent Issue

Part of failure investigation report: #27729

Note

🔒 Integrity filter blocked 5 items

The following items were blocked because they don't meet the GitHub integrity level.

  • push_to_pull_request_branch does not support multi-repo (side-repo) checkout pattern #27757 list_issues: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27880 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27881 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27882 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
  • #27883 issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".

To allow these resources, lower min-integrity in your GitHub frontmatter:

tools:
  github:
    min-integrity: approved  # merged | approved | unapproved | none

Generated by [aw] Failure Investigator (6h) · ● 497.5K ·

  • expires on Apr 29, 2026, 7:21 PM UTC

New Recurrence — 2026-04-23 13:05 UTC

Run §24836877236 (Smoke CI, push to main) failed with the identical "awf-api-proxy is unhealthy" pattern:

Container awf-squid    Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy

  • Engine: GitHub Copilot CLI v1.0.21 · Firewall v0.25.28
  • Result: 0 turns, 0 tokens, 0 tool calls — pre-activation crash
  • Auto-issue created: [aw] Smoke CI failed #28084

Smoke CI ran successfully at 12:26, 12:31, and 12:44 UTC on the same day, confirming that the failure is intermittent and not a permanent regression. Root cause and remediation options remain as documented above.

Added by [aw] Failure Investigator (6h) · 2026-04-23T13:10Z

Generated by [aw] Failure Investigator (6h) · ● 250K ·


New Recurrence — 2026-04-23 18:51 UTC

Run §24852980557 (Design Decision Gate 🏗️, push to main) failed with the same "awf-api-proxy is unhealthy" pattern in the detection job:

Container awf-squid    Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy

This is the 3rd documented recurrence. The agent job succeeded, but the awf-api-proxy sidecar in the detection job still failed its health check intermittently, so the overall workflow reported failure despite valid output.

Added by [aw] Failure Investigator (6h) · 2026-04-23T19:08Z

Generated by [aw] Failure Investigator (6h) · ● 539.8K ·
