[aw-failures] Failure Analysis Report - 2026-05-04 (6h window) #30042

### Executive Summary

This report analyzes 44 workflow runs from the 6-hour window (2026-05-03T19:18Z – 2026-05-04T01:18Z). It identifies **15 failures** across 5 failure clusters, plus 1 expected cancellation. The dominant pattern is the **GitHub API Consumption Report Agent**, which fails before any job starts on every push (10 failures, no engine configured). Two agentic failures require immediate action: **Daily Model Inventory Checker** exits silently on startup (P0, 2 of 2 runs failed), and **Smoke Gemini** had 95% of its network requests blocked by the firewall (P1).

### Failure Clusters

| Priority | Workflow | Failures | Root Cause | Status |
|----------|----------|----------|------------|--------|
| P0 | Daily Model Inventory Checker | 2/2 (100%) | Copilot CLI silent exit code 1 (no output) | Untracked |
| P1 | Smoke Gemini | 1 | 95% firewall block rate (localhost:8080) | Untracked |
| P1 | Smoke Claude | 1 | APM bundle unpack failure (PR-specific) | PR merged |
| P2 | GitHub API Consumption Report Agent | 10 | Pre-flight failure, no engine configured | Likely known |
| P2 | Design Decision Gate | 1 | `push_to_pull_request_branch` status unknown | Single occurrence |
| INFO | Smoke CI | 1 cancelled | Superseded by newer push (normal) | Expected |

**Cost & Scale:** 7 errors, $7.02 total cost, 17.9M tokens across 44 runs, 215 action-minutes consumed.
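
The headline counts can be cross-checked against the cluster table. The following is a minimal sketch that uses only the numbers reported above; the dictionary layout and variable names are illustrative and not part of any existing tooling.

```python
# Failure counts per cluster, copied from the Failure Clusters table.
# The cancelled Smoke CI run is excluded because it is not a failure.
cluster_failures = {
    "Daily Model Inventory Checker": 2,
    "Smoke Gemini": 1,
    "Smoke Claude": 1,
    "GitHub API Consumption Report Agent": 10,
    "Design Decision Gate": 1,
}

total_runs = 44
total_failures = sum(cluster_failures.values())
failure_rate = total_failures / total_runs

# Share of failures attributable to the pre-flight cluster.
noise_share = cluster_failures["GitHub API Consumption Report Agent"] / total_failures

print(f"failures: {total_failures} of {total_runs} runs ({failure_rate:.1%})")
print(f"pre-flight cluster share: {noise_share:.1%}")
```

With these inputs, the pre-flight cluster accounts for 10 of the 15 failures, which is why it dominates the window even though it is ranked P2.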

### Evidence

<details>
<summary>P0: Daily Model Inventory Checker — Silent Startup Crash</summary>

**Affected runs:** [§25294739769](https://github.com/github/gh-aw/actions/runs/25294739769) (schedule), [§25294350506](https://github.com/github/gh-aw/actions/runs/25294350506) (workflow_dispatch)  
**Engine:** GitHub Copilot CLI v1.0.40, model `claude-sonnet-4.6`

- All data-collection jobs succeed (`collect_anthropic_models`, `collect_openai_models`, `collect_gemini_models`, `collect_copilot_models`)
- `models.json` produced: only **54 bytes** (effectively empty)
- Agent job duration: ~1 minute, then exits with code 1, zero stdout/zero stderr
- Harness message: *"no output produced — not retrying (possible causes: binary not found, permission denied, auth failure, or silent startup crash)"*
- 2 network requests to `api.githubcopilot.com:443` (allowed), 0 blocked — agent never made substantive calls
- 0 turns, 0 tool calls

**Pattern:** Consistent across both schedule and workflow_dispatch triggers on `main`. No code changes between the two failures.
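
This report does not establish whether the near-empty `models.json` is related to the crash. A guard that runs before the agent job would at least make the condition visible. The sketch below is one possible check; the `MIN_BYTES` threshold, the function name, and the assumption that the file is a JSON object or array of model entries are all hypothetical.

```python
import json
from pathlib import Path

MIN_BYTES = 100  # hypothetical threshold; the failing file was 54 bytes


def inspect_models_file(path: Path, min_bytes: int = MIN_BYTES) -> list[str]:
    """Return a list of problems found; an empty list means the file looks usable."""
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    size = path.stat().st_size
    if size < min_bytes:
        problems.append(f"{path} is only {size} bytes (threshold {min_bytes})")

    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        problems.append(f"{path} is not valid JSON: {exc}")
        return problems

    # An empty object or array, or a container whose values are all empty,
    # is treated as "the collectors produced no model data".
    values = list(data.values()) if isinstance(data, dict) else data
    if not isinstance(values, list) or not any(values):
        problems.append(f"{path} contains no model entries")
    return problems
```

Calling `inspect_models_file(Path("models.json"))` and failing the job when the returned list is non-empty would surface an empty inventory as an explicit error instead of a silent exit later on.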

</details>

<details>
<summary>P1: Smoke Gemini — Firewall Blocking MCP Bridge</summary>

**Affected run:** [§25295890959](https://github.com/github/gh-aw/actions/runs/25295890959) (pull_request on `copilot/add-default-agent-harness`)  
**Engine:** Google Gemini CLI

- 320 total network requests; **304 blocked (95% block rate)**
- `localhost:8080`: 288 blocked requests — Gemini harness attempting to reach a local MCP bridge/tool server
- `172.30.0.30:10003`: 15 blocked requests
- `play.googleapis.com:443`: 16 allowed requests
- Two Gemini client error JSON files captured (50 KB + 34 KB)
- 0 agent turns; agent job ran 5.9m before failing

**Pattern:** The Gemini engine requires `localhost:8080` for MCP tooling, and this endpoint is not in the firewall allowlist. Requests to `172.30.0.30:10003` are blocked as well.
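
The block rate and the per-host breakdown can be checked directly from the counts listed above. This is a small arithmetic sketch; the variable names are illustrative.

```python
# Request counts copied from the bullet list above.
total_requests = 320
blocked_total = 304
blocked_by_host = {
    "localhost:8080": 288,
    "172.30.0.30:10003": 15,
}
allowed_by_host = {
    "play.googleapis.com:443": 16,
}

block_rate = blocked_total / total_requests
itemized_blocked = sum(blocked_by_host.values())
unattributed_blocked = blocked_total - itemized_blocked
allowed_total = sum(allowed_by_host.values())

# Blocked plus allowed should account for every request.
assert blocked_total + allowed_total == total_requests

localhost_share = blocked_by_host["localhost:8080"] / blocked_total
print(f"block rate: {block_rate:.0%}")
print(f"localhost:8080 share of blocked requests: {localhost_share:.1%}")
print(f"blocked requests not itemized by host: {unattributed_blocked}")
```

The two itemized hosts account for 303 of the 304 blocked requests, so one blocked request is not attributed to a host in this report.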

</details>

<details>
<summary>P1: Smoke Claude — APM Bundle Unpack Failure</summary>

**Affected run:** [§25295890954](https://github.com/github/gh-aw/actions/runs/25295890954) (pull_request on `copilot/add-default-agent-harness`, now merged as `3a4fe48`)  
**Engine:** Claude Code v2.1.126

**Error:** `APM action failed: apm unpack failed for bundle 1 of 1 (path: /tmp/gh-aw/apm-bundles/apm-default.tar.gz, exit code: 1)`  
**Step:** "Restore APM packages (all bundles)" — agent never ran (0 turns, 0 tokens)

Audit diff against the last successful Smoke Claude run (25263690532, 2026-05-02):
- Duration: 12m 20s → 3m 14s (the agent never reached execution)
- Tokens: 1,873,538 → 0 (100% drop, because the agent never ran)
- All 21 MCP tools are absent in the failed run (the agent never started)

The `apm-prep` job succeeded; the bundle was fetched but failed to unpack in the `agent` job. The PR that triggered this was merged — **verify whether the APM unpack issue persists on `main`**.
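
When re-checking on `main`, it can help to confirm whether the fetched bundle is a readable archive at all before looking at the unpack step. The sketch below assumes the bundle is a standard gzip-compressed tar file, as its `.tar.gz` name suggests. It does not reproduce what `apm unpack` does internally, and `check_bundle` is a hypothetical helper name.

```python
import tarfile
import zlib
from pathlib import Path


def check_bundle(path: Path) -> tuple[bool, str]:
    """Try to read every member header of a .tar.gz bundle without extracting it."""
    if not path.is_file():
        return False, f"{path}: file not found"
    try:
        with tarfile.open(path, mode="r:gz") as archive:
            # getmembers() walks the archive headers, so many truncation
            # or corruption problems surface here as exceptions.
            members = archive.getmembers()
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        return False, f"{path}: unreadable archive ({exc})"
    if not members:
        return False, f"{path}: archive is empty"
    return True, f"{path}: {len(members)} members readable"
```

For example, `check_bundle(Path("/tmp/gh-aw/apm-bundles/apm-default.tar.gz"))` on the runner would separate a corrupt or truncated download from a failure inside the unpack logic itself.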

</details>

<details>
<summary>P2: GitHub API Consumption Report Agent — Pre-Flight Failures (10 runs)</summary>

**Pattern:** The workflow fires on every push to `copilot/*` and `main` branches. All 10 failures are instant (`created_at` = `updated_at`): no jobs ran, no engine was configured, and no artifacts were preserved.  
**Sample runs:** 25296428563, 25295436032, 25294865032, 25294795992, 25294764258

This workflow appears to be triggered broadly but fails before any job starts — likely missing required secrets, a misconfigured trigger condition, or intentionally disabled without removing the trigger. The high frequency (10 in 6h) generates noise and inflates the failure count.
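
The "instant failure" signature described above can be separated from genuine agentic failures mechanically. The sketch below assumes run records shaped like GitHub REST API workflow run objects, which carry `created_at`, `updated_at`, and `conclusion` fields. The example records are invented for illustration and are not the runs listed above.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that ends in 'Z'."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def is_instant_failure(run: dict) -> bool:
    """True when a run failed and its timestamps show it never progressed."""
    same_instant = parse_utc(run["created_at"]) == parse_utc(run["updated_at"])
    return run.get("conclusion") == "failure" and same_instant


# Hypothetical records for illustration only; ids and timestamps are invented.
example_runs = [
    {"id": 1, "conclusion": "failure",
     "created_at": "2026-05-04T00:10:00Z", "updated_at": "2026-05-04T00:10:00Z"},
    {"id": 2, "conclusion": "failure",
     "created_at": "2026-05-04T00:12:00Z", "updated_at": "2026-05-04T00:18:00Z"},
    {"id": 3, "conclusion": "success",
     "created_at": "2026-05-04T00:20:00Z", "updated_at": "2026-05-04T00:20:00Z"},
]

instant = [run["id"] for run in example_runs if is_instant_failure(run)]
print(instant)  # [1]
```

Reporting instant failures in a separate bucket would keep them from inflating the count of failures that need investigation.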

</details>

<details>
<summary>P2: Design Decision Gate — Safe Outputs Push Unconfirmed</summary>

**Affected run:** [§25293685460](https://github.com/github/gh-aw/actions/runs/25293685460) (pull_request on `copilot/mark-experiments-feature-as-experimental`)  
**Engine:** Claude Code v2.1.126, 21 turns, 899K tokens (~$0.52)

- Agent completed reasoning and attempted 3× `push_to_pull_request_branch` MCP calls
- All 3 calls returned `status: unknown`
- `safe_outputs` job was **skipped**
- Workflow concluded as failure despite agent believing it had written outputs
- Full stdio log available: `/tmp/gh-aw/aw-mcp/logs/run-25293685460/agent-stdio.log` (169 KB)

This is an isolated occurrence and may be a transient MCP connectivity issue.

</details>

### Existing Issue Correlation

The GitHub issue list was unavailable because the local API proxy returned 403 during this run, so issue correlation is based on failure pattern analysis only. Sub-issue #aw_fail504 was created for the P0 Daily Model Inventory Checker failure.

### Proposed Fix Roadmap

| Priority | Item | Owner Signal | Effort |
|----------|------|-------------|--------|
| P0 | Daily Model Inventory Checker: investigate Copilot CLI silent crash | Infra/harness | Medium |
| P1 | Smoke Gemini: add `localhost:8080` + `172.30.0.30:10003` to firewall allowlist | Platform/firewall | Low |
| P1 | Smoke Claude: verify APM unpack issue on `main` post-merge of `3a4fe48` | APM/harness | Low |
| P2 | GitHub API Consumption Report Agent: review trigger config, remove broad push triggers or add guard | Workflow owner | Low |
| P2 | Design Decision Gate: investigate transient safe_outputs MCP push failures | MCP/harness | Low |

### Sub-Issues Created

- #aw_fail504_p0 — Daily Model Inventory Checker: Copilot CLI silent startup crash

**References:**
- [§25294739769](https://github.com/github/gh-aw/actions/runs/25294739769) · [§25295890959](https://github.com/github/gh-aw/actions/runs/25295890959) · [§25295890954](https://github.com/github/gh-aw/actions/runs/25295890954)

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/25296429028/agentic_workflow) · ● 594.5K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on May 11, 2026, 1:33 AM UTC
