[aw-failures] Token-budget exhaustion (25M effective-tokens cap) recurring across 6+ scheduled workflows — 2026-05-29 02:00–07:32 UTC #35661

@github-actions

Problem statement

Between 2026-05-29 02:00 UTC and 07:32 UTC, at least 7 scheduled agentic workflow runs across 6 distinct workflows failed with the same Copilot CLI error:

CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)

This is a material expansion of the P1 "token-budget loop" cluster first surfaced in the parent report #35484 (which observed 2 affected workflows). The number of affected workflows has tripled in 24 hours (from 2 to 6), and this is now the dominant failure mode in the 6h window.
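
For triage, the error signature quoted above can be parsed to see how far a run overshot the cap. The following is a minimal sketch, assuming the message keeps the exact "(used / cap)" shape shown in this issue; the function name `parse_token_cap_error` is hypothetical and not part of any existing tooling.

```python
import re

# Matches only the signature quoted in this issue; other message variants
# are not covered (assumption).
_SIGNATURE = re.compile(
    r"429 Maximum effective tokens exceeded "
    r"\((?P<used>[\d.]+) / (?P<cap>[\d.]+)\)"
)


def parse_token_cap_error(line):
    """Return (used, cap, overshoot) as floats, or None if the line does not match."""
    match = _SIGNATURE.search(line)
    if match is None:
        return None
    used = float(match.group("used"))
    cap = float(match.group("cap"))
    return used, cap, used - cap


if __name__ == "__main__":
    sample = "CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)"
    print(parse_token_cap_error(sample))  # (25011516.0, 25000000.0, 11516.0)
```

For the sample line the overshoot is only 11,516 tokens, which suggests the run was rejected on the first request after crossing the cap.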

Affected workflows / runs (6h window)

| Workflow | Run ID | Time (UTC) | Symptom |
| --- | --- | --- | --- |
| PR Sous Chef | §26623257736 | 07:02 | 4 retries of CAPI 429 → action timed out @ 25m |
| PR Sous Chef | §26620005382 | 05:31 | effective_tokens_rate_limit_error set |
| Safe Output Health Monitor | §26620239212 | 05:38 | effective_tokens_rate_limit_error set |
| Step Name Alignment | §26619561645 | 05:18 | effective_tokens_rate_limit_error set (also tracked in #35644) |
| Copilot CLI Deep Research Agent | §26619051030 | 05:01 | effective_tokens_rate_limit_error set + 10 KB body limit hit on create_discussion |
| Go Logger Enhancement | §26618323959 | 04:39 | effective_tokens_rate_limit_error set |
| Daily Firewall Logs Collector and Reporter | §26615980789 | 03:22 | effective_tokens_rate_limit_error set |

Probable root cause

The Copilot CLI harness retries the agent up to 4 times with --continue after partial failures. When the prior turn already accumulated ≥20 M effective tokens (large MCP tool descriptions + workflow body + tool output history), each retry re-sends the full conversation and crosses the 25 M cap on the next request. The job then either:

  1. Loops through 4 retries each consuming ~1–2 minutes of 429 backoff (94 s total wait), then times out at the 25-minute step limit, or
  2. Exits non-zero on attempt 1 and the conclusion step marks the run as failure.

Contributing factors:

  • MCP tool list payload is large (full descriptions of audit, audit-diff, logs, compile, codemod, etc., each ~1 KB).
  • Some workflow prompts (e.g. PR Sous Chef triaging 7 PRs) accumulate gh pr view JSON outputs across iterations.
  • Cache hits are high in absolute tokens (3.4 M cached on the PR Sous Chef failure) but cached input still counts against the effective-tokens cap.

Proposed remediation

  1. Cap per-workflow turn count — set explicit max-turns on the affected scheduled workflows (suggested 30 for triage workflows, 60 for investigative). Today none of these workflows declare max-turns.
  2. Reduce MCP tool surface area per workflow — most failing workflows have access to the full agenticworkflows tool catalog when they only need logs + audit. Use allow-tool lists to keep MCP description payload small.
  3. Trim conversation between retries — when the harness detects 429 effective-tokens on attempt N, it should pass --no-resume (or compact) instead of --continue for attempt N+1, so the retry starts from a smaller context window.
  4. Pre-emptive guard — emit a workflow warning when cumulative tokens cross 20 M (80 % of cap) so the agent can self-truncate with noop before failing on the next request.
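
Items 3 and 4 can be sketched as two small decision functions. This is an illustrative sketch only: the function names and the returned mode labels are hypothetical, the harness's real retry interface is not shown in this issue, and the labels are not real CLI flags.

```python
CAP_TOKENS = 25_000_000  # cap quoted in the 429 message
WARN_PERCENT = 80        # item 4: warn at 80 % of the cap (20 M)


def budget_status(cumulative_tokens, cap=CAP_TOKENS, warn_percent=WARN_PERCENT):
    """Classify cumulative effective tokens as 'ok', 'warn', or 'exceeded'."""
    warn_at = cap * warn_percent // 100  # integer math avoids float rounding
    if cumulative_tokens >= cap:
        return "exceeded"
    if cumulative_tokens >= warn_at:
        return "warn"
    return "ok"


def next_attempt_mode(last_error):
    """Item 3: pick how attempt N+1 should start, given attempt N's error text.

    The returned strings are hypothetical labels for the harness to map onto
    whatever mechanism it actually supports.
    """
    if last_error is not None and "Maximum effective tokens exceeded" in last_error:
        return "fresh-or-compacted"  # do not re-send the full conversation
    return "continue"


if __name__ == "__main__":
    print(budget_status(19_999_999))   # ok
    print(budget_status(20_000_000))   # warn
    print(budget_status(25_011_516))   # exceeded
    print(next_attempt_mode("CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)"))
```

Keeping the guard separate from the retry decision lets a workflow emit the 80 % warning even when no retry ever happens, which is the case for a single long run.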

Success criteria / verification

  • Over a 24h window after rollout, agentic workflow runs failing with effective_tokens_rate_limit_error drop below 5 % of completed runs (current rate: ~7 of ~25 scheduled completions in the 6h sample = ~28 %).
  • No single workflow contributes >1 token-cap failure in any 6h window.
  • PR Sous Chef, Safe Output Health Monitor, and Copilot CLI Deep Research Agent run to completion in ≥ 80 % of scheduled invocations.
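
The criteria above reduce to simple arithmetic over run records. The sketch below applies them to the 6h sample from this issue as a baseline; the helper names are hypothetical, the completed-run count is the approximate "~25" from the text, and the 5 % target is defined over a 24h window, so the 6h figure is only indicative.

```python
from collections import Counter

# Token-cap failures in the 6h sample, one entry per failed run
# (taken from the affected-runs table in this issue).
FAILED_RUNS = [
    "PR Sous Chef",
    "PR Sous Chef",
    "Safe Output Health Monitor",
    "Step Name Alignment",
    "Copilot CLI Deep Research Agent",
    "Go Logger Enhancement",
    "Daily Firewall Logs Collector and Reporter",
]
COMPLETED_RUNS = 25  # approximate ("~25") per the issue text


def failure_rate(failed, completed):
    """Fraction of completed runs that failed on the token cap."""
    return failed / completed


def repeat_offenders(failed_runs, max_per_workflow=1):
    """Workflows with more token-cap failures than allowed in the window."""
    counts = Counter(failed_runs)
    return sorted(name for name, n in counts.items() if n > max_per_workflow)


def completion_ok(completed, scheduled, minimum=0.80):
    """True if a workflow ran to completion in at least `minimum` of invocations."""
    return completed / scheduled >= minimum


if __name__ == "__main__":
    rate = failure_rate(len(FAILED_RUNS), COMPLETED_RUNS)
    print(f"failure rate: {rate:.0%} (target: below 5%)")  # failure rate: 28% (target: below 5%)
    print("repeat offenders:", repeat_offenders(FAILED_RUNS))  # ['PR Sous Chef']
```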

Related issues

  • Parent: #35484
  • Related single-workflow tracker: #35644 (Step Name Alignment 80 % failure rate)
  • Related but distinct cause (not token-budget): #35441 (Daily Hippo Learn cache-memory git pack corruption — recurred at 07:37 today, §26624675283, confirming that tracker is still live).

Filed by [aw] Failure Investigator (6h) §26625870457.

Generated by 🔍 [aw] Failure Investigator (6h) · opus47 11.4M ·

  • expires on Jun 5, 2026, 8:18 AM UTC

Recurrence confirmed — 2026-05-30 17:43 UTC (still active)

The 25M effective-tokens cap fired again ~33h after this issue was opened, confirming the token-budget exhaustion pattern is still active and not yet remediated.

| Workflow | Run | Time (UTC) | Symptom |
| --- | --- | --- | --- |
| Linter Miner | §26690626184 | 2026-05-30 17:43 | agent / Execute GitHub Copilot CLI failed; effective_tokens_rate_limit_error set |

Exact 429 signature:

429 Maximum effective tokens exceeded (25132364.10 / 25000000).

Run profile: 21.5 min, 59 turns, 25.13M effective tokens. Same single-run-crosses-the-cap shape described above: no --continue retry was needed; one long run exceeded 25M on its own.
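
The run profile above gives a rough per-turn cost, which is useful when choosing a max-turns value. This is a back-of-the-envelope sketch that assumes a flat average per turn; in practice later turns probably cost more because the conversation grows, so the real crossing point may come earlier. The function names are hypothetical.

```python
def average_tokens_per_turn(total_tokens, turns):
    """Flat average of effective tokens per turn (a simplifying assumption)."""
    return total_tokens / turns


def turns_under(threshold_tokens, avg_per_turn):
    """Number of whole turns that fit under the threshold at a flat average."""
    return int(threshold_tokens // avg_per_turn)


if __name__ == "__main__":
    # Figures from the Linter Miner run: 25132364.10 effective tokens, 59 turns.
    avg = average_tokens_per_turn(25_132_364.10, 59)
    print(f"average per turn: {avg:,.0f}")
    print("turns under the 20 M warning line:", turns_under(20_000_000, avg))  # 46
    print("turns under the 25 M cap:", turns_under(25_000_000, avg))           # 58
```

Under this flat-average assumption the run averaged roughly 426 K effective tokens per turn and would have crossed the 20 M warning line on turn 47.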

Keeping this issue open. (Investigated by the [aw] Failure Investigator 6h window ending 2026-05-30 ~19:10 UTC.)

Generated by 🔍 [aw] Failure Investigator (6h) · opus48 4.1M ·
