[aw-failures] [P0] Installation token rate limiting causes silent loss of safe-output writes in concurrent burst windows #29541

@github-actions

Problem Statement

The GitHub installation token is hitting API rate limits during burst windows, causing safe-output write operations (create_issue, add_labels, lock issue) to fail across multiple concurrent agentic workflows. The agent completes its work successfully, but results are silently lost because the safe-outputs step cannot write to GitHub.

Affected workflows (2026-05-01T11:32–12:28Z):

  • Daily Skill Optimizer Improvements → create_issue failed after 3 retries
  • Step Name Alignment → create_issue failed after 3 retries
  • AI Moderator (PR event, §25213666728) → add_labels rate-limited
  • AI Moderator (issue_comment, §25214243935) → lock issue rate-limited

Pattern: 12 workflows launched nearly simultaneously (schedule jobs at ~12:05 UTC). The burst of installation-token API calls from concurrent safe-outputs jobs exhausted the rate limit quota. A second hit at 12:28Z confirms the token was still constrained 20+ minutes later.

Agent work lost:


Probable Root Cause

The safe_outputs step uses the GitHub installation token for all writes. When >10 workflows complete simultaneously and attempt to write to GitHub, the 60 req/minute installation token quota is exhausted immediately. The current retry logic (3 attempts over ~90 s) is insufficient when the quota is completely drained, because every later attempt falls inside the same exhausted window and hits the same wall.


Proposed Remediation

  1. Stagger scheduled workflow start times: Offset the cron expression of each scheduled daily workflow by a random amount of up to ±5 min so the burst of concurrent launches is spread over a wider window. This is the highest-leverage, lowest-risk fix.

  2. Increase retry count and exponential backoff in safe_outputs: The current 3-retry / 90 s total budget is too short for installation token exhaustion. Consider 5 attempts with exponential backoff between them (30 s → 60 s → 120 s → 240 s), giving a total window of ~7.5 min.

  3. Respect Retry-After header: GitHub's rate-limit responses can include a Retry-After header. When it is present, the safe_outputs handler should parse it and sleep at least that long instead of using a fixed interval. When it is absent and x-ratelimit-remaining is 0, the handler should wait until the time given by x-ratelimit-reset.

  4. Monitor rate limit headroom: Add a pre-check step that reads x-ratelimit-remaining and x-ratelimit-limit before each safe-outputs write and logs a warning when the remaining share is below a threshold (e.g., <20% remaining).
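
Items 2 to 4 could be combined into a single delay policy along the following lines. This is a minimal sketch, not the actual safe_outputs handler: the function names `retry_delay` and `headroom_warning`, the constants, and the assumption that response headers arrive as a mapping with lower-cased keys are all hypothetical. The header names themselves (`retry-after`, `x-ratelimit-remaining`, `x-ratelimit-limit`, `x-ratelimit-reset`) are the ones GitHub documents for its REST API.

```python
import time
from typing import Mapping, Optional

BASE_DELAY_S = 30.0  # first backoff step, taken from the proposal above
MAX_ATTEMPTS = 5     # 5 attempts leave 4 waits: 30, 60, 120, 240 s


def retry_delay(attempt: int, headers: Mapping[str, str],
                now: Optional[float] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based).

    Assumes the caller has lower-cased the header names.
    """
    now = time.time() if now is None else now

    # 1. Prefer the server's explicit hint. Only the integer-seconds form
    #    of Retry-After is handled here, not the HTTP-date form.
    retry_after = headers.get("retry-after", "").strip()
    if retry_after.isdigit():
        return float(retry_after)

    # 2. Quota fully drained: wait until the advertised reset time
    #    (UTC epoch seconds), if that time is still in the future.
    reset = headers.get("x-ratelimit-reset", "").strip()
    if headers.get("x-ratelimit-remaining", "").strip() == "0" and reset.isdigit():
        wait = float(reset) - now
        if wait > 0:
            return wait

    # 3. No usable hint: fall back to exponential backoff.
    return BASE_DELAY_S * 2 ** (attempt - 1)


def headroom_warning(headers: Mapping[str, str],
                     threshold: float = 0.20) -> Optional[str]:
    """Return a warning message when remaining quota is below `threshold`."""
    try:
        remaining = int(headers["x-ratelimit-remaining"])
        limit = int(headers["x-ratelimit-limit"])
    except (KeyError, ValueError):
        return None  # headers missing or malformed: nothing to report
    if limit > 0 and remaining / limit < threshold:
        return f"rate limit headroom low: {remaining}/{limit} requests remaining"
    return None
```

With no hints from the server, the four waits between five attempts add up to 450 s (7.5 min). A caller would sleep for `retry_delay(attempt, headers)` after each failed write and log the result of `headroom_warning(headers)` before each write whenever it is not `None`.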


Success Criteria / Verification

  • Zero create_issue/add_labels rate-limit failures in the next 7-day window for the same workflows
  • Staggered cron schedules visible in .lock.yml files for affected daily workflows
  • At minimum, Daily Skill Optimizer Improvements and Step Name Alignment issues created successfully on next run
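
The staggering criterion above could be produced and checked mechanically. The sketch below is one possible approach, not how the workflows are actually compiled: it derives a stable per-workflow minute offset by hashing the workflow name (a deterministic stand-in for random jitter, since a cron expression is static), and reports workflows that still share a start slot. The helper names `minute_offset`, `stagger`, and `collisions` are hypothetical, and the code assumes 5-field cron expressions with literal numeric minute and hour fields.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def minute_offset(workflow_name: str, window_min: int = 5) -> int:
    """Stable offset in [0, window_min] derived from the workflow name."""
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (window_min + 1)


def stagger(cron: str, workflow_name: str, window_min: int = 5) -> str:
    """Shift the minute field of a cron expression by the stable offset.

    Raises ValueError if minute or hour is not a literal number
    (e.g. "*/5"). A shift that wraps past 23:59 would also need the
    day fields adjusted, which this sketch does not do.
    """
    minute, hour, *rest = cron.split()
    total = int(minute) + minute_offset(workflow_name, window_min)
    new_minute = total % 60
    new_hour = (int(hour) + total // 60) % 24  # carry overflow into the hour
    return " ".join([str(new_minute), str(new_hour), *rest])


def collisions(crons: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each "hour:minute" slot used by more than one workflow to its users."""
    slots: Dict[str, List[str]] = defaultdict(list)
    for name, cron in crons.items():
        minute, hour, *_ = cron.split()
        slots[f"{hour}:{minute}"].append(name)
    return {slot: names for slot, names in slots.items() if len(names) > 1}
```

A 0 to 5 min window has only six distinct minute slots, so with 12 workflows some will still share a slot. `collisions` makes that visible, and a wider `window_min` reduces it.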

References

Generated by [aw] Failure Investigator (6h) · ● 427.4K

  • expires on May 8, 2026, 1:19 PM UTC
