[aw-failures] Token-budget exhaustion (25M effective-tokens cap) recurring across 6+ scheduled workflows — 2026-05-29 02:00–07:32 UTC #35661

### Problem statement

Between 2026-05-29 02:00 UTC and 07:32 UTC, **at least 7 scheduled agentic workflow runs across 6 distinct workflows** failed with the same Copilot CLI error:

```
CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)
```
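
For triage, the used and cap figures can be pulled out of this message mechanically. The following is a minimal sketch, assuming the message format shown above is stable; the name `parse_token_429` is hypothetical and not part of the Copilot CLI or gh-aw.

```python
import re

# Matches the parenthesised "used / cap" pair in the 429 message quoted above.
# Assumption: the wording and number layout do not change between CLI versions.
_TOKEN_429 = re.compile(
    r"429 Maximum effective tokens exceeded "
    r"\((?P<used>\d+(?:\.\d+)?) / (?P<cap>\d+(?:\.\d+)?)\)"
)


def parse_token_429(line):
    """Return (used, cap, overshoot) for a matching log line, else None."""
    match = _TOKEN_429.search(line)
    if match is None:
        return None
    used = float(match.group("used"))
    cap = float(match.group("cap"))
    return used, cap, used - cap
```

Applied to the line above, the overshoot is 11,516 effective tokens, which suggests the failing request was only marginally over the cap.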

This is a material expansion of the P1 "token-budget loop" cluster first surfaced in the parent report [#35484](https://github.com/github/gh-aw/issues/35484) (which observed 2 affected workflows). The number of affected workflows has tripled (2 → 6) in 24 hours, and this is now the dominant failure mode in the 6h window.

### Affected workflows / runs (6h window)

| Workflow | Run ID | Time (UTC) | Symptom |
|---|---|---|---|
| PR Sous Chef | [§26623257736](https://github.com/github/gh-aw/actions/runs/26623257736) | 07:02 | 4 retries of CAPI 429 → action timed out @ 25m |
| PR Sous Chef | [§26620005382](https://github.com/github/gh-aw/actions/runs/26620005382) | 05:31 | `effective_tokens_rate_limit_error` set |
| Safe Output Health Monitor | [§26620239212](https://github.com/github/gh-aw/actions/runs/26620239212) | 05:38 | `effective_tokens_rate_limit_error` set |
| Step Name Alignment | [§26619561645](https://github.com/github/gh-aw/actions/runs/26619561645) | 05:18 | `effective_tokens_rate_limit_error` set (also tracked in [#35644](https://github.com/github/gh-aw/issues/35644)) |
| Copilot CLI Deep Research Agent | [§26619051030](https://github.com/github/gh-aw/actions/runs/26619051030) | 05:01 | `effective_tokens_rate_limit_error` set + 10 KB body limit hit on `create_discussion` |
| Go Logger Enhancement | [§26618323959](https://github.com/github/gh-aw/actions/runs/26618323959) | 04:39 | `effective_tokens_rate_limit_error` set |
| Daily Firewall Logs Collector and Reporter | [§26615980789](https://github.com/github/gh-aw/actions/runs/26615980789) | 03:22 | `effective_tokens_rate_limit_error` set |

### Probable root cause

The Copilot CLI harness retries the agent up to 4 times with `--continue` after partial failures. When the prior turn already accumulated ≥20 M effective tokens (large MCP tool descriptions + workflow body + tool output history), each retry re-sends the full conversation and crosses the 25 M cap on the next request. The job then either:

1. Loops through 4 retries each consuming ~1–2 minutes of 429 backoff (94 s total wait), then times out at the 25-minute step limit, or
2. Exits non-zero on attempt 1 and the conclusion step marks the run as failure.

Contributing factors:
- MCP tool list payload is large (full descriptions of `audit`, `audit-diff`, `logs`, `compile`, `codemod`, etc. each ~1 KB).
- Some workflow prompts (e.g. PR Sous Chef triaging 7 PRs) accumulate `gh pr view` JSON outputs across iterations.
- Cache hits are high in absolute tokens (3.4 M cached on the PR Sous Chef failure) but cached input still counts against the effective-tokens cap.
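
The retry arithmetic can be sketched with a deliberately simple model. It assumes each `--continue` attempt adds the full re-sent context to the running effective-token total, which is one reading of the behavior described above rather than confirmed accounting. The function name and the per-request context sizes are illustrative only.

```python
CAP = 25_000_000  # effective-tokens cap reported in the 429 message


def first_failing_attempt(accumulated, context_per_request, max_attempts=4):
    """Return the 1-based retry attempt that pushes the total past CAP, or None.

    Simplified model (assumption): every retry re-sends the whole conversation,
    so each attempt adds `context_per_request` effective tokens to the total.
    """
    total = accumulated
    for attempt in range(1, max_attempts + 1):
        total += context_per_request
        if total > CAP:
            return attempt
    return None


# Illustrative numbers, not measurements from the failing runs.
print(first_failing_attempt(20_000_000, 5_500_000))  # large re-sent context
print(first_failing_attempt(20_000_000, 1_000_000))  # trimmed context
```

Under this model a run that starts its retries at 20 M fails on the first retry if the re-sent context is large, while a trimmed context survives all four attempts.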

### Proposed remediation

1. **Cap per-workflow turn count** — set explicit `max-turns` on the affected scheduled workflows (suggested 30 for triage workflows, 60 for investigative). Today none of these workflows declare `max-turns`.
2. **Reduce MCP tool surface area per workflow** — most failing workflows have access to the full `agenticworkflows` tool catalog when they only need `logs` + `audit`. Use `allow-tool` lists to keep MCP description payload small.
3. **Trim conversation between retries** — when the harness detects 429 effective-tokens on attempt N, it should pass `--no-resume` (or compact) instead of `--continue` for attempt N+1, so the retry starts from a smaller context window.
4. **Pre-emptive guard** — emit a workflow warning when cumulative tokens cross 20 M (80 % of cap) so the agent can self-truncate with `noop` before failing on the next request.
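
Items 3 and 4 can be sketched as two small decisions in the harness. This is a hypothetical illustration, not the harness's actual code: the function names and the returned labels are assumptions, and only the 25 M cap, the 80 % threshold, and the error text come from this report.

```python
CAP = 25_000_000
WARN_AT = CAP * 80 // 100  # 20 M, the 80 % threshold from item 4


def budget_status(cumulative_tokens, cap=CAP, warn_at=WARN_AT):
    """Classify cumulative effective tokens as 'ok', 'warn', or 'exceeded'."""
    if cumulative_tokens > cap:
        return "exceeded"
    if cumulative_tokens >= warn_at:
        return "warn"
    return "ok"


def next_retry_mode(last_error):
    """Pick how attempt N+1 should start, given the error text from attempt N.

    'fresh-or-compact' stands in for whatever mechanism the harness ends up
    using to shrink the context; 'continue' is the current behavior.
    """
    if last_error and "Maximum effective tokens exceeded" in last_error:
        return "fresh-or-compact"
    return "continue"
```

Integer arithmetic is used for the threshold so that the 20 M boundary is exact.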

### Success criteria / verification

- Over a 24h window after rollout, agentic workflow runs failing with `effective_tokens_rate_limit_error` drop below **5 %** of completed runs (current rate: ~7 of ~25 scheduled completions in the 6h sample = ~28 %).
- No single workflow contributes >1 token-cap failure in any 6h window.
- PR Sous Chef, Safe Output Health Monitor, and Copilot CLI Deep Research Agent each run to completion in ≥ 80 % of scheduled invocations.
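
The first two criteria can be checked with a few lines of arithmetic. The sketch below is illustrative: it evaluates a single window (the report uses 24h for the rate and 6h for the per-workflow limit), the function names are hypothetical, and the completed-run count of 25 is the approximate figure quoted above.

```python
from collections import Counter


def failure_rate(token_cap_failures, completed_runs):
    """Fraction of completed runs that failed on the effective-tokens cap."""
    return token_cap_failures / completed_runs


def meets_criteria(failed_workflows, completed_runs,
                   max_rate=0.05, max_per_workflow=1):
    """failed_workflows holds one workflow name per token-cap failure."""
    rate = failure_rate(len(failed_workflows), completed_runs)
    worst = max(Counter(failed_workflows).values(), default=0)
    return rate < max_rate and worst <= max_per_workflow


# The 6h sample from the table of affected runs.
sample = [
    "PR Sous Chef",
    "PR Sous Chef",
    "Safe Output Health Monitor",
    "Step Name Alignment",
    "Copilot CLI Deep Research Agent",
    "Go Logger Enhancement",
    "Daily Firewall Logs Collector and Reporter",
]
print(round(failure_rate(len(sample), 25), 2))
print(meets_criteria(sample, 25))
```

For the current sample the rate is 7 / 25 = 0.28, and PR Sous Chef alone accounts for two failures, so both checks fail.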

### Related issues

- Parent: [#35484](https://github.com/github/gh-aw/issues/35484)
- Related single-workflow tracker: [#35644](https://github.com/github/gh-aw/issues/35644) (Step Name Alignment 80 % failure rate)
- Related but distinct cause (not token-budget): [#35441](https://github.com/github/gh-aw/issues/35441) (Daily Hippo Learn cache-memory git pack corruption; recurred at 07:37 on 2026-05-29, [§26624675283](https://github.com/github/gh-aw/actions/runs/26624675283), confirming that tracker is still live).

_Filed by [aw] Failure Investigator (6h) [§26625870457](https://github.com/github/gh-aw/actions/runs/26625870457)._







> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/26625870457) · opus47 11.4M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on Jun 5, 2026, 8:18 AM UTC






---

### Recurrence confirmed — 2026-05-30 17:43 UTC (still active)

The 25M effective-tokens cap fired again ~33h after this issue was opened, confirming the token-budget exhaustion pattern is **still active and not yet remediated**.

| Workflow | Run | Time (UTC) | Symptom |
|---|---|---|---|
| Linter Miner | [§26690626184](https://github.com/github/gh-aw/actions/runs/26690626184) | 2026-05-30 17:43 | `agent` → `Execute GitHub Copilot CLI` failed; `effective_tokens_rate_limit_error` set |

<details>
<summary>Exact 429 signature</summary>

```
429 Maximum effective tokens exceeded (25132364.10 / 25000000).
```

Run profile: 21.5 min, 59 turns, 25.13 M effective tokens. This is the same single-run-crosses-the-cap shape described above (no `--continue` retry needed; one long run exceeded 25 M on its own).

</details>

Keeping this issue **open**. (Investigated by the [aw] Failure Investigator for the 6h window ending 2026-05-30 ~19:10 UTC.)

> Generated by [🔍 [aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/26692427111) · opus48 4.1M · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
