Skip to content

[WIP] Fix unhealthy awf-squid firewall container#35803

Closed
Copilot wants to merge 1 commit into
mainfrom
copilot/aw-failures-fix-squid-container
Closed

[WIP] Fix unhealthy awf-squid firewall container#35803
Copilot wants to merge 1 commit into
mainfrom
copilot/aw-failures-fix-squid-container

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 29, 2026

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.


This section details on the original issue you should resolve

<issue_title>[aw-failures] Squid firewall container (awf-squid) unhealthy → claude engine fails to start (0-turn run failures)</issue_title>
<issue_description>### Problem statement

A scheduled claude-engine workflow failed in the last 6h because the firewall egress proxy container (awf-squid) never became healthy, so the engine's container stack could not start and the agent produced 0 turns / no output. This is an infrastructure / engine-startup failure class — distinct from the token-budget exhaustion tracked in the parent (#35661) and not related to the agent prompt or logic.

The per-run auto-notification (#35756) captured the symptom, but those [aw] ... failed issues auto-expire within ~24h, so this cross-cutting infra failure mode would otherwise lose durable tracking. Filing this as a focused, durable fix item.

Affected workflow / run (6h window: 2026-05-29 ~13:38–19:38 UTC)

Workflow Run Engine Symptom
Documentation Unbloat §26652420396 claude Execute Claude Code CLI step failed at startup; agent ran 2.4m but recorded 0 turns, 0 tool calls, emitted nothing

Failed step (from repos/.../jobs): job agent → step Execute Claude Code CLI (and cascading Parse agent logs for step summary).

Evidence

Engine-failure details (from auto-issue #35756)
⚠️ Engine Failure: The `claude` engine terminated before producing output.

Error details:
- dependency failed to start: container awf-squid is unhealthy
- Failed to start containers: Error: Command failed with exit code 1: docker compose up -d --pull never
Audit corroboration (run 26652420396)
  • overview.conclusion = failure; jobs: pre_activation ✅ → activation ✅ → agent ❌ (2.4m) → detection/safe_outputs/update_cache_memory all skipped.
  • key_findings: "failed before agent activation — no error logs were available to analyze".
  • observability_insights: turns=0 tool_types=0, created_items=0 safe_items=0 (read-only posture, nothing produced).
  • Engine: claude (Claude Code v2.1.156), firewall_enabled=true, steps.firewall=squid.
  • MCP/safeoutputs logs for the run are clean — the failure is upstream of the agent, at container compose time.

Probable root cause

The firewall is provisioned via docker compose up -d --pull never with awf-squid (the squid egress proxy) declared as a healthcheck dependency of the engine container. The compose start aborted because awf-squid did not pass its healthcheck within the allotted window. Likely contributors:

  1. Squid healthcheck race / too-tight timeout — squid can take longer to become ready under runner load; if the compose healthcheck/depends_on: condition: service_healthy window is short, a slow start trips container ... is unhealthy.
  2. --pull never + a missing/partial local image layer — if the squid image wasn't fully present in the runner cache, the container can come up degraded and fail its healthcheck.
  3. No retry on transient compose failure — a single unhealthy result fails the whole run rather than restarting the proxy.

Proposed remediation

  1. Increase squid healthcheck robustness — raise start_period/retries/interval on the awf-squid service healthcheck so a slow-but-healthy squid is tolerated.
  2. Retry container startup once — wrap docker compose up -d --pull never with a single retry (and docker compose logs awf-squid capture) before declaring engine-startup failure, so a transient unhealthy state self-heals.
  3. Surface squid logs on failure — on container awf-squid is unhealthy, automatically emit the squid container logs into the run artifacts to make the next occurrence diagnosable (today the squid logs are not in the downloaded log bundle).
  4. Verify image availability before --pull never — fail fast with a clear message (or fall back to a pull) if the squid image layer is absent, instead of starting a degraded container.

Success criteria / verification

  • Over a 7-day window after rollout, zero claude-engine runs fail with container awf-squid is unhealthy / Failed to start containers.
  • When squid does fail a healthcheck, the run artifacts contain awf-squid container logs and the startup is retried at least once before the run is marked failed.

6h-window correlation (context for parent #35661)

Three runs failed in the window; this issue covers cluster (2). The others are already accounted for:

  • (1) Copilot Agent PR Analysis §26656668984 — git exit 128 in Setup cache-memory git repository + Commit cache-memory changes steps (cache-memory git failure). Already tracked by the auto-issue [aw] Copilot Agent PR Analysis failed #35770 (which currently has no root-cause analysis — the cache-memory orphan-branch git ops are the failing site).
  • (3) Daily Code Metrics §26657137373 — the agent succeeded (9.9m; create_discussion report + 4 assets emitted); only the downstream upload_assets job's Push assets step failed, so the published daily-metrics discussion likely has broken chart-image links (historical_trends.png, quality_score_breakdown.png). Currently untracked, low severity — noted here rather than filed separately to keep new issues to the minimum.

Note on the parent (#35661, token-budget 429): No effective_tokens_rate_limit_error / CAPI-429 failures occurred in this 6h window, and PR Sous Chef (a prior 429 victim) ran and succeeded. This is weak positive signal but not sufficient evidence the remediation landed, so #35661 is left open. Confidence: medium that the squid diagnosis above is correct (one occurrence; root cause taken verbatim from the engine error); low confidence on whether it is transient vs. recurring.

References: §26652420396 · §26656668984 · §26657137373
Related to #35661

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 3.4M ·

  • expires on Jun 5, 2026, 8:40 PM UTC

Comments on the Issue (you are @copilot in this section)

@lpcox ## Already fixed in AWF v0.25.57

The root cause (proposed remediation #1squid healthcheck race / too-tight timeout) was fixed by gh-aw-firewall#3936, which shipped in v0.25.57.

What changed:

  • start_period: 2s → 5s
  • retries: 5 → 10
  • interval: 1s → 2s
  • timeout: 1s → 2s
  • Total healthcheck window: ~7s → ~25s

Happy-path startup is still detected within 5–7s, but a slow start on a loaded runner now has significantly more headroom.

Dependency bump needed: gh-aw currently pins an older AWF version. Issue #35781 tracks the bump to v0.25.57 — once that lands, this failure class should not recur.

Recommend closing this after the v0.25.57 bump ships.</comment_new>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[aw-failures] Squid firewall container (awf-squid) unhealthy → claude engine fails to start (0-turn run failures)

3 participants