Problem statement
Between 2026-05-29 02:00 UTC and 07:32 UTC, at least 7 scheduled agentic workflow runs across 6 distinct workflows failed with the same Copilot CLI error:
CAPIError: 429 Maximum effective tokens exceeded (25011516.00 / 25000000)
This is a material expansion of the P1 "token-budget loop" cluster first surfaced in the parent report #35484 (which observed 2 affected workflows). The number of affected workflows has tripled (2 → 6) in 24 hours, and this is now the dominant failure mode in the 6h window.
Affected workflows / runs (6h window)
| Workflow | Run ID | Time (UTC) | Symptom |
|---|---|---|---|
| PR Sous Chef | §26623257736 | 07:02 | 4 retries of CAPI 429 → action timed out @ 25m |
| PR Sous Chef | §26620005382 | 05:31 | effective_tokens_rate_limit_error set |
| Safe Output Health Monitor | §26620239212 | 05:38 | effective_tokens_rate_limit_error set |
| Step Name Alignment | §26619561645 | 05:18 | effective_tokens_rate_limit_error set (also tracked in #35644) |
| Copilot CLI Deep Research Agent | §26619051030 | 05:01 | effective_tokens_rate_limit_error set + 10 KB body limit hit on create_discussion |
| Go Logger Enhancement | §26618323959 | 04:39 | effective_tokens_rate_limit_error set |
| Daily Firewall Logs Collector and Reporter | §26615980789 | 03:22 | effective_tokens_rate_limit_error set |
Probable root cause
The Copilot CLI harness retries the agent up to 4 times with --continue after partial failures. When the prior turn already accumulated ≥20 M effective tokens (large MCP tool descriptions + workflow body + tool output history), each retry re-sends the full conversation and crosses the 25 M cap on the next request. The job then either:
- Loops through 4 retries each consuming ~1–2 minutes of 429 backoff (94 s total wait), then times out at the 25-minute step limit, or
- Exits non-zero on attempt 1 and the conclusion step marks the run as failure.
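The retry arithmetic above can be sketched as a toy model. This is an illustration only: it assumes each retry adds a fixed, hypothetical `resend_cost` to the running total, which is a simplification of how the service actually meters effective tokens. The function name and parameters are invented for this example.

```python
from typing import Optional

CAP = 25_000_000  # effective-token cap quoted in the 429 error


def first_failing_retry(accumulated: int, resend_cost: int,
                        max_retries: int = 4) -> Optional[int]:
    """Return the 1-based retry that first pushes the total over CAP.

    accumulated -- effective tokens already consumed before the first retry
    resend_cost -- hypothetical tokens charged each time the conversation
                   is re-sent (an assumption, not a measured value)
    Returns None if every retry stays within the cap.
    """
    total = accumulated
    for attempt in range(1, max_retries + 1):
        total += resend_cost
        if total > CAP:
            return attempt
    return None


print(first_failing_retry(20_000_000, 5_100_000))  # 1
print(first_failing_retry(20_000_000, 2_000_000))  # 3
print(first_failing_retry(20_000_000, 1_000_000))  # None
```

Under this model, a run that enters the retry phase at 20 M only survives all four retries if each re-send costs about 1.25 M tokens or less, which is why shrinking the re-sent context matters more than adjusting the retry count.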
Contributing factors:
- MCP tool list payload is large (full descriptions of audit, audit-diff, logs, compile, codemod, etc., each ~1 KB).
- Some workflow prompts (e.g. PR Sous Chef triaging 7 PRs) accumulate gh pr view JSON outputs across iterations.
- Cache hits are high in absolute tokens (3.4 M cached on the PR Sous Chef failure) but cached input still counts against the effective-tokens cap.
Proposed remediation
- Cap per-workflow turn count: set explicit max-turns on the affected scheduled workflows (suggested 30 for triage workflows, 60 for investigative). Today none of these workflows declare max-turns.
- Reduce MCP tool surface area per workflow: most failing workflows have access to the full agenticworkflows tool catalog when they only need logs + audit. Use allow-tool lists to keep the MCP description payload small.
- Trim conversation between retries: when the harness detects a 429 effective-tokens error on attempt N, it should pass --no-resume (or compact) instead of --continue for attempt N+1, so the retry starts from a smaller context window.
- Pre-emptive guard: emit a workflow warning when cumulative tokens cross 20 M (80 % of cap) so the agent can self-truncate with noop before failing on the next request.
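The last two remediation items can be sketched as two small decision functions. This is a minimal sketch, not the harness implementation: `budget_status` and `retry_flag` are hypothetical names, and `--no-resume` is the flag name proposed in this issue rather than a confirmed Copilot CLI option.

```python
CAP = 25_000_000                  # effective-token cap from the 429 error
WARN_THRESHOLD = CAP * 80 // 100  # 20 M, the 80 % guard proposed above

TOKEN_CAP_MARKER = "Maximum effective tokens exceeded"


def budget_status(cumulative_tokens: int) -> str:
    """Classify a run's cumulative effective tokens against the cap."""
    if cumulative_tokens > CAP:
        return "exceeded"
    if cumulative_tokens >= WARN_THRESHOLD:
        return "warn"  # the agent should wrap up (e.g. emit noop) now
    return "ok"


def retry_flag(last_error: str) -> str:
    """Pick the flag for the next attempt based on the previous error.

    A token-cap 429 means re-sending the same conversation will fail
    again, so the retry should start from a smaller context.
    """
    if TOKEN_CAP_MARKER in last_error:
        return "--no-resume"  # proposed flag, assumed for illustration
    return "--continue"


print(budget_status(3_400_000))   # ok
print(budget_status(20_000_000))  # warn
print(budget_status(25_011_516))  # exceeded
print(retry_flag("CAPIError: 429 Maximum effective tokens exceeded "
                 "(25011516.00 / 25000000)"))  # --no-resume
```

Matching on the error text is brittle. If the harness already sets effective_tokens_rate_limit_error as a structured signal, keying the retry decision on that would be the sturdier choice.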
Success criteria / verification
- Over a 24h window after rollout, agentic workflow runs failing with effective_tokens_rate_limit_error drop below 5 % of completed runs (current rate: ~7 of ~25 scheduled completions in the 6h sample = ~28 %).
- No single workflow contributes >1 token-cap failure in any 6h window.
- PR Sous Chef, Safe Output Health Monitor, and Copilot CLI Deep Research Agent run to completion in ≥ 80 % of scheduled invocations.
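The first two criteria can be checked mechanically. The sketch below uses the seven failures from the table above as sample input. The helper names are hypothetical, and the completed-run count of 25 is the approximate figure quoted in this issue.

```python
from collections import Counter
from typing import List

TARGET_RATE = 0.05  # failures must drop below 5 % of completed runs


def failure_rate(failed: int, completed: int) -> float:
    """Fraction of completed runs that failed on the token cap."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    return failed / completed


def repeat_offenders(failed_workflows: List[str], limit: int = 1) -> List[str]:
    """Workflows with more than `limit` token-cap failures in the window."""
    counts = Counter(failed_workflows)
    return sorted(name for name, n in counts.items() if n > limit)


# One entry per failed run in the 6h window (from the table above).
failures = [
    "PR Sous Chef",
    "PR Sous Chef",
    "Safe Output Health Monitor",
    "Step Name Alignment",
    "Copilot CLI Deep Research Agent",
    "Go Logger Enhancement",
    "Daily Firewall Logs Collector and Reporter",
]

rate = failure_rate(len(failures), 25)
print(f"{rate:.0%}")               # 28%
print(rate < TARGET_RATE)          # False
print(repeat_offenders(failures))  # ['PR Sous Chef']
```

On the current sample both checks fail: the rate is well above 5 %, and PR Sous Chef alone accounts for two token-cap failures in the window.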
Related issues
- Parent: #35484
- Related single-workflow tracker: #35644 (Step Name Alignment 80 % failure rate)
- Related but distinct cause (not token-budget): #35441 (Daily Hippo Learn cache-memory git pack corruption — recurred at 07:37 today, §26624675283, confirming that tracker is still live).
Filed by [aw] Failure Investigator (6h) §26625870457.
Generated by 🔍 [aw] Failure Investigator (6h) · opus47 11.4M
Recurrence confirmed — 2026-05-30 17:43 UTC (still active)
The 25M effective-tokens cap fired again ~33h after this issue was opened, confirming the token-budget exhaustion pattern is still active and not yet remediated.
| Workflow | Run | Time (UTC) | Symptom |
|---|---|---|---|
| Linter Miner | §26690626184 | 2026-05-30 17:43 | agent → Execute GitHub Copilot CLI failed; effective_tokens_rate_limit_error set |
Exact 429 signature
429 Maximum effective tokens exceeded (25132364.10 / 25000000).
Run profile: 21.5m, 59 turns, 25.13M effective tokens — same single-run-crosses-the-cap shape described above (no --continue retry needed; one long run exceeded 25M on its own).
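A quick back-of-the-envelope check on this run profile, assuming tokens accrue evenly across turns (a simplification, since later turns usually carry more context):

```python
CAP = 25_000_000
WARN_THRESHOLD = 20_000_000   # 80 % guard proposed in the remediation list

effective_tokens = 25_132_364.10  # from the 429 signature above
turns = 59

avg_per_turn = effective_tokens / turns
overshoot_pct = (effective_tokens - CAP) / CAP * 100
turn_at_warning = WARN_THRESHOLD / avg_per_turn

print(f"average per turn: {avg_per_turn:,.0f} effective tokens")
print(f"overshoot: {overshoot_pct:.2f} % of cap")
print(f"80 % threshold reached near turn {turn_at_warning:.0f}")
```

At 59 turns the run stayed just under the max-turns of 60 suggested for investigative workflows, so if that class applies to Linter Miner, a turn cap alone would not have prevented this failure. Under the even-accrual assumption, the 20 M warning would have fired roughly a dozen turns before the cap was hit.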
Keeping this issue open. (Investigated by the [aw] Failure Investigator in the 6h window ending 2026-05-30 ~19:10 UTC.)
Generated by 🔍 [aw] Failure Investigator (6h) · opus48 4.1M