Problem Statement
On 2026-04-30 between 11:05 and 11:41 UTC, 10 agentic workflow runs failed due to GitHub App installation rate limit exhaustion. The failures were caused by a concentrated burst of scheduled workflows all starting around 11:00–11:10 UTC, collectively saturating the GitHub App API call quota within minutes.
This is the dominant failure cluster in the 6-hour window (10/13 failures traced to this cause).
Affected Workflows & Runs
| Workflow | Engine | Run ID | Failure Point |
|---|---|---|---|
| AI Moderator | N/A | §25163385305 | Lock issue |
| AI Moderator | N/A | §25163332906 | Lock issue |
| AI Moderator | N/A | §25162561867 | Lock issue |
| Instructions Janitor | Claude Code | §25162158074 | PR creation + fallback issue |
| Daily AstroStyleLite Markdown Spellcheck | Claude Code | §25162377257 | PR creation |
| Daily Community Attribution Updater | Copilot CLI | §25161944698 | PR creation + fallback issue |
| Developer Documentation Consolidator | Claude Code | §25162557165 | PR creation + fallback issue |
| Daily AW Cross-Repo Compile Check | Claude Code | §25162144631 | create_issue in safe_outputs (after 4 retries) |
| Daily Rendering Scripts Verifier | Claude Code | §25162697356 | Auto guard policy determination |
| Daily Fact About gh-aw | Codex | §25163325014 | Auto guard policy determination |
Root Cause
The GitHub App installation (used for all workflow API calls) has a rate limit that was exhausted during the burst. Representative error:
API rate limit exceeded for installation.
Request ID: 1200:1EDC6B:765CDF5:1D4E1C81:69F34022
Timestamp: 2026-04-30 11:20:46 UTC
The rate limit hit multiple API operations:
- POST /repos/{owner}/{repo}/issues (safe_outputs create_issue; 4 retries, all failed)
- POST /repos/{owner}/{repo}/pulls (PR creation with fallback to issue; both failed)
- PATCH /repos/{owner}/{repo}/issues/{number} (lock issue)
- GitHub MCP guard policy determination (GET on installation context)
- GET /repos/{owner}/{repo} (fetch default branch before push)
- GraphQL signed push fallback
The burst window corresponds to the 11:00–11:10 UTC scheduled trigger time for approximately 15–20 workflows running concurrently.
Impact
- 10 workflow failures in a 36-minute window; all work lost for those runs
- Workflows that pushed code (Instructions Janitor, Markdown Spellcheck, etc.) could not deliver their PRs
- The rate limit self-recovered after ~40 minutes; subsequent runs in the same hour succeeded
- Daily AW Cross-Repo Compile Check agent completed its full analysis (33 turns, $1.75) but could not post results — work was discarded
Proposed Remediation
Option A (Recommended): Stagger scheduled workflow start times
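- Spread the 11:00 UTC batch across a 20–30 minute window by giving each workflow its own minute offset (jitter) in its `cron:` expression
- This reduces the API burst from ~20 concurrent triggers to ~2–3 per minute
- Example: change `0 11 * * *` schedules to fixed or random per-workflow minute offsets such as `3 11 * * *`, `6 11 * * *`, `9 11 * * *` (not `*/3 11 * * *`, which would fire the same workflow every three minutes throughout the hour)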
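One way to derive the staggered offsets is to spread the workflows evenly over the window. The following is a minimal Python sketch under stated assumptions, not the project's tooling: the function name `stagger_schedules` and its parameters are hypothetical, and it assumes each workflow runs once a day at a fixed hour.

```python
def stagger_schedules(workflows, hour=11, window_minutes=30):
    """Assign each workflow a distinct-as-possible minute within the window.

    Hypothetical helper: names are sorted so the assignment is stable
    between runs, then spread evenly over [0, window_minutes).
    Returns a dict mapping workflow name -> daily cron expression.
    """
    if not 1 <= window_minutes <= 60:
        raise ValueError("window_minutes must be between 1 and 60")
    names = sorted(workflows)
    count = len(names)
    return {
        # Integer division keeps every offset inside the window.
        name: f"{(index * window_minutes) // count} {hour} * * *"
        for index, name in enumerate(names)
    }


if __name__ == "__main__":
    sample = ["AI Moderator", "Instructions Janitor", "Daily Fact About gh-aw"]
    for name, cron in stagger_schedules(sample).items():
        print(f"{cron:<14} {name}")
```

With up to `window_minutes` workflows, every workflow gets its own minute; beyond that, offsets start to repeat, so a wider window or a second hour would be needed.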
Option B: Increase safe_outputs retry count + backoff for rate limit errors
- Current retry: 4 attempts with ~1 minute gap — not sufficient for installation-level rate resets
- Extend to 8 retries with exponential backoff (up to 10 minutes) specifically for rate-limit errors, which GitHub can return as HTTP 403 as well as 429
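The retry policy in Option B could look roughly like the sketch below. It is illustrative only and does not reflect the actual safe_outputs implementation: `RateLimitError`, `backoff_delays`, and `call_with_backoff` are hypothetical names, the 30-second base delay is an assumption, and "up to 10 minutes" is read here as a cap on each individual wait.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised by the caller's API wrapper on a rate-limit response."""


def backoff_delays(max_retries=8, base_seconds=30.0, cap_seconds=600.0):
    """Exponentially growing waits, each capped at cap_seconds."""
    return [min(cap_seconds, base_seconds * 2 ** attempt) for attempt in range(max_retries)]


def call_with_backoff(operation, max_retries=8, base_seconds=30.0,
                      cap_seconds=600.0, jitter=0.1, sleep=time.sleep):
    """Run operation(), retrying only on RateLimitError.

    Makes one initial attempt plus up to max_retries retries. A small
    random jitter keeps concurrent workflows from retrying in lockstep.
    """
    for delay in backoff_delays(max_retries, base_seconds, cap_seconds):
        try:
            return operation()
        except RateLimitError:
            sleep(delay + random.uniform(0, delay * jitter))
    # Final attempt: any RateLimitError now propagates to the caller.
    return operation()


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays, "total seconds:", sum(delays))
```

Under these assumed defaults the waits add up to 2730 seconds (about 45 minutes), which is longer than the ~40-minute recovery observed in this incident; with a much smaller base delay the retries would be exhausted before the quota reset.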
Option C: App-level investigation
- Verify the GitHub App installation's rate limit tier is appropriate for the current number of concurrent workflows
- Check if the App can be upgraded to a higher rate limit tier or if request batching is possible
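As a starting point for the investigation in Option C, the installation's quota can be read from the GitHub REST `GET /rate_limit` endpoint. The sketch below only summarizes an already-fetched response body; `summarize_rate_limit` is a hypothetical helper, and the sample numbers are illustrative, not values recorded during this incident.

```python
from datetime import datetime, timezone


def summarize_rate_limit(payload, resource="core"):
    """Summarize one bucket of a GET /rate_limit response body.

    Expects the documented shape: payload["resources"][resource] with
    "limit", "remaining", and "reset" (epoch seconds, UTC).
    """
    bucket = payload["resources"][resource]
    limit = bucket["limit"]
    remaining = bucket["remaining"]
    reset_at = datetime.fromtimestamp(bucket["reset"], tz=timezone.utc)
    return {
        "limit": limit,
        "remaining": remaining,
        "used_fraction": (limit - remaining) / limit if limit else 0.0,
        "reset_at": reset_at.isoformat(),
    }


if __name__ == "__main__":
    # Illustrative payload only; real values come from the API response.
    reset = int(datetime(2026, 4, 30, 12, 0, tzinfo=timezone.utc).timestamp())
    sample = {"resources": {"core": {"limit": 5000, "remaining": 0, "reset": reset}}}
    print(summarize_rate_limit(sample))
```

Sampling this shortly before and after the 11:00 UTC batch would show how much of the hourly quota the burst consumes and whether a higher tier is actually needed.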
Success Criteria
- Zero rate-limit failures in the next 6-hour monitoring window after schedule staggering is applied
- OR: safe_outputs retry succeeds under load without workflow failure
References
Parent: #29232
References:
- §25162158074 (Instructions Janitor — rate limit on PR + fallback)
- §25162144631 (Daily AW Cross-Repo Compile Check — safe_outputs rate limit after 4 retries)
- §25163385305 (AI Moderator — lock issue rate limit)
Generated by [aw] Failure Investigator (6h) · ● 464.2K · ◷