[aw-failures] `awf-api-proxy` sidecar unhealthy — now blocking main-branch workflows (DeepReport, Smoke CI, Test Quality Sentinel) #27888

### Problem

The `awf-api-proxy` sidecar container fails its Docker health check before agent activation, blocking multiple workflows with a pre-inference crash:

```
Container awf-api-proxy  Started
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```

This was previously seen only on PR branches (tracked by #27688, classified as lower priority). It is now affecting **main-branch production workflows**, making this a P0 issue:

| Workflow | Run ID | Trigger | Branch | Time (UTC) |
|----------|--------|---------|--------|------------|
| DeepReport - Intelligence Gathering Agent | [§24785716916](https://github.com/github/gh-aw/actions/runs/24785716916) | schedule | main | 15:01 |
| Smoke CI | [§24786173739](https://github.com/github/gh-aw/actions/runs/24786173739) | push | main | 15:10 |
| Test Quality Sentinel | [§24786862508](https://github.com/github/gh-aw/actions/runs/24786862508) | push | main | 15:24 |
| Smoke OpenCode | [§24787456440](https://github.com/github/gh-aw/actions/runs/24787456440) | pull_request | copilot/disable-shell-history-expansion | 15:35 |

All affected runs report **0 turns, 0 tokens, 0 tool calls**: the agent container never starts.

### Root Cause Analysis

The `awf-squid` proxy starts and reaches `Healthy`, but `awf-api-proxy` stalls at the health-check polling phase (`Waiting`) and times out. The container startup log pattern is consistent across all 4 failures:

```
Container awf-squid   Healthy        ← succeeds
Container awf-api-proxy  Waiting     ← health check polling
Container awf-api-proxy  Error       ← health check timed out or failed
```

Failures are clustered in a ~34-minute window (15:01–15:35 UTC) and span three distinct engines (Claude Code, GitHub Copilot CLI, OpenCode), which rules out per-engine configuration as the cause. Smoke CI succeeded again at 15:24 UTC, suggesting the problem was intermittent during that window.
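
To triage many runs for the startup pattern shown above, the Compose output can be checked mechanically. The following is a minimal sketch, assuming the log lines keep the `Container <name>  <State>` shape seen in these runs; the function names are hypothetical and not part of any gh-aw tooling.

```python
import re

# Matches progress lines such as "Container awf-api-proxy  Waiting".
STATE_RE = re.compile(r"^\s*Container\s+(\S+)\s+(\w+)", re.MULTILINE)
# Matches the final error line naming the unhealthy dependency.
UNHEALTHY_RE = re.compile(r"container\s+(\S+)\s+is unhealthy")


def container_states(log: str) -> dict[str, list[str]]:
    """Return the ordered list of states reported for each container."""
    states: dict[str, list[str]] = {}
    for name, state in STATE_RE.findall(log):
        states.setdefault(name, []).append(state)
    return states


def unhealthy_containers(log: str) -> list[str]:
    """Return the containers that Compose reported as unhealthy."""
    return UNHEALTHY_RE.findall(log)


# Sample copied from the failing runs in this issue.
SAMPLE = """\
Container awf-squid   Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
"""
```

Calling `unhealthy_containers(SAMPLE)` on the sample identifies `awf-api-proxy`, and `container_states(SAMPLE)` shows it moving from `Waiting` to `Error` while `awf-squid` reaches `Healthy`.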

Probable causes:
1. A dependency the api-proxy needs (env var, secret, downstream endpoint) was transiently unavailable
2. The health check timeout/grace period is too short under load — the proxy takes longer to initialize when concurrent runs are queued
3. A recent api-proxy image change introduced a broken or slower health check
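
Cause 2 can be sanity-checked with arithmetic. Under Docker's documented health check semantics (failures during `start_period` are not counted, a probe may run for up to `timeout`, the next probe starts `interval` after the previous one completes, and `retries` consecutive failures mark the container unhealthy), a container whose probes all fail is declared unhealthy no later than roughly `start_period + retries * (interval + timeout)`. The sketch below computes that bound. The values are hypothetical, since this issue does not state the actual `awf-api-proxy` health check settings.

```python
def max_seconds_until_unhealthy(
    start_period: float, interval: float, timeout: float, retries: int
) -> float:
    """Rough upper bound on when Docker marks an always-failing container unhealthy."""
    # Worst case: every probe hangs for the full `timeout`, and the next probe
    # starts `interval` seconds after the previous one finishes.
    return start_period + retries * (interval + timeout)


# Hypothetical values, not taken from the awf-api-proxy config.
budget = max_seconds_until_unhealthy(start_period=10, interval=5, timeout=3, retries=3)
print(budget)  # 34
```

If the proxy needs longer than this bound to initialize on a loaded runner, it cannot pass the health check, which would be consistent with the `Waiting` to `Error` transition seen in the logs.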

### Proposed Remediation

1. **Diagnose**: Read api-proxy container logs from the failing runs (`/tmp/gh-aw/sandbox/firewall/logs/api-proxy-logs` in run artifacts) to get the specific health check failure reason
2. **If timeout-related**: Increase `start_period`, `interval`, and/or `retries` (and `timeout`, if individual probes are slow) in the Docker Compose health check config for `awf-api-proxy`
3. **If image-related**: Identify recent changes to the api-proxy image and roll back if needed
4. **If intermittent**: Add a retry/backoff mechanism at the `docker compose up` level before failing the run
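
Step 4 could be sketched as follows. This is an illustration only: the exact `docker compose` invocation used by the firewall is not shown in this issue, so the `docker compose up -d` command, the function names, and the attempt and delay values are all assumptions.

```python
import subprocess
import time
from typing import Callable


def compose_up() -> bool:
    """Run `docker compose up -d` and report whether it exited cleanly.

    Assumed invocation: a dependency that fails its health check makes
    the command exit non-zero, as in the logs above.
    """
    result = subprocess.run(["docker", "compose", "up", "-d"], check=False)
    return result.returncode == 0


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_attempts: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `attempt` until it succeeds or the attempts run out.

    The delay doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for n in range(max_attempts):
        if attempt():
            return True
        # No point sleeping after the final failed attempt.
        if n < max_attempts - 1:
            sleep(base_delay * 2 ** n)
    return False
```

A caller would run `retry_with_backoff(compose_up)` and fail the run only when it returns `False`. Whether the stack should be torn down between attempts is left open here. A retry masks the symptom rather than fixing it, so it is best paired with the log diagnosis in step 1.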

### Success Criteria

- [ ] DeepReport and Smoke CI complete without `awf-api-proxy is unhealthy` on next scheduled/push triggers
- [ ] Root cause (timeout vs. broken health check vs. upstream dependency) confirmed from api-proxy logs
- [ ] Fix documented and validated

### Parent Issue

Part of failure investigation report: #27729







> [!NOTE]
> <details>
> <summary>🔒 Integrity filter blocked 5 items</summary>
>
> The following items were blocked because they don't meet the GitHub integrity level.
>
> - github/gh-aw#27757 `list_issues`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27880](https://github.com/github/gh-aw/issues/27880) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27881](https://github.com/github/gh-aw/issues/27881) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27882](https://github.com/github/gh-aw/issues/27882) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
> - [#27883](https://github.com/github/gh-aw/issues/27883) `issue_read`: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".
>
> To allow these resources, lower `min-integrity` in your GitHub frontmatter:
>
> ```yaml
> tools:
>   github:
>     min-integrity: approved  # merged | approved | unapproved | none
> ```
>
> </details>


> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24797384460/agentic_workflow) · ● 497.5K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on Apr 29, 2026, 7:21 PM UTC






---

### New Recurrence — 2026-04-23 13:05 UTC

Run [§24836877236](https://github.com/github/gh-aw/actions/runs/24836877236) (Smoke CI, push to `main`) failed with the identical `awf-api-proxy is unhealthy` pattern:

```
Container awf-squid    Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```

- **Engine:** GitHub Copilot CLI v1.0.21 · Firewall v0.25.28
- **Result:** 0 turns, 0 tokens, 0 tool calls — pre-activation crash
- **Auto-issue created:** #28084

Smoke CI ran successfully at 12:26, 12:31, and 12:44 UTC on the same day, confirming that the failure is **intermittent**, not a permanent regression. Root cause and remediation options remain as documented above.

> Added by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24837069045) · 2026-04-23T13:10Z

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24837069045/agentic_workflow) · ● 250K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)



---

### New Recurrence — 2026-04-23 18:51 UTC

Run [§24852980557](https://github.com/github/gh-aw/actions/runs/24852980557) (Design Decision Gate 🏗️, push to `main`) failed with the same `awf-api-proxy is unhealthy` pattern in the **detection job**:

```
Container awf-squid    Healthy
Container awf-api-proxy  Waiting
Container awf-api-proxy  Error
dependency failed to start: container awf-api-proxy is unhealthy
```

- **Engine:** Claude Code (claude-sonnet-4-6) · Firewall v0.25.28
- **Job failed:** `detection` (agent job ran successfully — noop for PR #28146)
- **Result:** detection pre-activation crash; main agent output valid

This is the third documented occurrence (the second recurrence since the original report). The agent job succeeded, but the `awf-api-proxy` sidecar in the detection job still intermittently fails its health check, so the overall workflow reports failure despite valid agent output.

> Added by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24853731640) · 2026-04-23T19:08Z

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/24853731640/agentic_workflow) · ● 539.8K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
