Problem Statement
The GitHub installation token is hitting API rate limits during burst windows, causing safe-output write operations (`create_issue`, `add_labels`, `lock issue`) to fail across multiple concurrent agentic workflows. The agent completes its work successfully, but the results are silently lost because the safe-outputs step cannot write them to GitHub.
Affected workflows (2026-05-01T11:32–12:28Z):
- Daily Skill Optimizer Improvements → `create_issue` failed after 3 retries
- Step Name Alignment → `create_issue` failed after 3 retries
- AI Moderator (PR event, run 25213666728) → `add_labels` rate-limited
- AI Moderator (issue_comment, run 25214243935) → `lock issue` rate-limited
Pattern: 12 workflows launched nearly simultaneously (schedule jobs at ~12:05 UTC). The burst of installation-token API calls from concurrent safe-outputs jobs exhausted the rate limit quota. A second hit at 12:28Z confirms the token was still constrained 20+ minutes later.
Agent work lost:
- [skill-optimizer] Daily Skill Optimizer Improvements - 2026-05-01 (issue body ready, not created)
- [step-names] Improve compiler-generated step names for precision and clarity (issue body ready, not created)
- `ai-inspected` not applied to PR "[FAQ] Update: off-platform admission control for safe outputs" #29535
Probable Root Cause
The `safe_outputs` step uses the GitHub installation token for all writes. When >10 workflows complete simultaneously and attempt to write to GitHub, the 60 req/minute installation-token quota fills immediately. The current retry logic (3 attempts over ~90 s) is insufficient when the token is completely drained — later attempts hit the same wall.
Proposed Remediation
- **Stagger scheduled workflow start times** — Add ±random(0, 5 min) jitter to the scheduled daily workflows (for example via a short random sleep at job start, since a cron expression itself cannot express randomness) so the burst of concurrent launches is spread over a wider window. This is the highest-leverage, lowest-risk fix.
- **Increase retry count and exponential backoff in `safe_outputs`** — The current 3-retry / ~90 s total budget is too short for installation-token exhaustion. Consider 5 retries with exponential backoff (30 s → 60 s → 120 s → 240 s), giving an ~8 min total window.
- **Respect the `Retry-After` header** — GitHub's rate-limit responses include a `Retry-After` header. The `safe_outputs` handler should parse it and sleep for exactly that long instead of using a fixed interval.
- **Monitor rate-limit headroom** — Add a pre-check step that reads `x-ratelimit-remaining` before each safe-outputs write and logs a warning when headroom drops below a threshold (e.g., <20% remaining).
Success Criteria / Verification
- No `create_issue`/`add_labels` rate-limit failures in the next 7-day window for the same workflows.
- Jitter added to the `lock.yml` files for the affected daily workflows.
- Daily Skill Optimizer Improvements and Step Name Alignment issues created successfully on the next run.
References
- Daily Skill Optimizer Improvements run: `create_issue` failed
- Step Name Alignment run: `create_issue` failed
- AI Moderator run 25213666728: `add_labels` failed
- AI Moderator run 25214243935: `lock issue` failed
Generated by [aw] Failure Investigator (6h) · 427.4K