[aw-failures] [P0] Installation token rate limiting causes silent loss of safe-output writes in concurrent burst windows #29541

### Problem Statement

The GitHub installation token is hitting API rate limits during burst windows, causing safe-output write operations (`create_issue`, `add_labels`, `lock issue`) to fail across multiple concurrent agentic workflows. **The agent completes its work successfully, but results are silently lost** because the safe-outputs step cannot write to GitHub.

**Affected workflows (2026-05-01T11:32–12:28Z):**
- Daily Skill Optimizer Improvements → `create_issue` failed after 3 retries
- Step Name Alignment → `create_issue` failed after 3 retries
- AI Moderator (PR event, [§25213666728](https://github.com/github/gh-aw/actions/runs/25213666728)) → `add_labels` rate-limited
- AI Moderator (issue_comment, [§25214243935](https://github.com/github/gh-aw/actions/runs/25214243935)) → `lock issue` rate-limited

**Pattern:** 12 workflows launched nearly simultaneously (scheduled jobs at ~12:05 UTC). The burst of installation-token API calls from concurrent safe-outputs jobs exhausted the rate limit quota. A second hit at 12:28Z confirms the token was still constrained 20+ minutes later.

**Agent work lost:**
- `[skill-optimizer] Daily Skill Optimizer Improvements - 2026-05-01` (issue body ready, not created)
- `[step-names] Improve compiler-generated step names for precision and clarity` (issue body ready, not created)
- AI Moderator label `ai-inspected` not applied to PR #29535

---

### Probable Root Cause

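The `safe_outputs` step uses the GitHub installation token for all writes. When >10 workflows complete simultaneously and attempt to write to GitHub, the token's rate-limit quota is exhausted almost immediately. This report assumes a quota of 60 req/minute; that figure is unverified, since GitHub documents a primary limit of at least 5,000 requests/hour for installation tokens plus separate secondary limits on content-creating requests, so the response headers from the failed runs should be checked to confirm which limit was hit. The current retry logic (3 retries over ~90 s) is insufficient when the quota is completely drained, because later retries hit the same wall.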

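The mismatch between the retry budget and the length of the constrained window can be checked with a few lines of arithmetic. This is only a sketch: the timestamps are the approximate values quoted in this report (~12:05 UTC burst, 12:28Z second hit), and treating the token as constrained for the whole interval is an assumption.

```python
from datetime import datetime, timezone

# Approximate timestamps taken from this report (assumed, not measured precisely).
burst_start = datetime(2026, 5, 1, 12, 5, tzinfo=timezone.utc)
second_hit = datetime(2026, 5, 1, 12, 28, tzinfo=timezone.utc)

# How long the token appears to have stayed constrained, in seconds.
constrained_s = (second_hit - burst_start).total_seconds()

# Current total retry window in safe_outputs, per this report.
retry_budget_s = 90

print(f"token constrained for at least {constrained_s / 60:.0f} min")
print(f"retry budget covers {retry_budget_s / constrained_s:.1%} of that window")
```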
---

### Proposed Remediation

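1. **Stagger scheduled workflow start times**: Give each scheduled daily workflow a different start minute, spread across a window of about 5 minutes, so the burst of concurrent launches is spread out. Cron expressions are static and cannot express `random(0, 5min)` directly, so the offset has to be fixed per workflow when the schedule is compiled, or applied as a randomized delay at job start. This is the highest-leverage, lowest-risk fix.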

2. **Increase retry count and use exponential backoff in safe_outputs**: The current 3-retry / 90 s total budget is too short for installation token exhaustion. Consider 5 attempts (the initial call plus 4 retries) with exponential backoff (30 s → 60 s → 120 s → 240 s), giving a total window of 450 s (~7.5 min).

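3. **Respect the `Retry-After` header**: GitHub's rate-limit responses may include a `Retry-After` header (typically on secondary rate limits). When it is present, the safe_outputs handler should wait at least that many seconds instead of using a fixed interval; when it is absent and `x-ratelimit-remaining` is `0`, it should wait until the time given by `x-ratelimit-reset`.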

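4. **Monitor rate limit headroom**: Before each safe-outputs write, check the remaining quota (from the `x-ratelimit-remaining` and `x-ratelimit-limit` headers of the previous API response, or from the `GET /rate_limit` endpoint) and log a warning when it falls below a threshold (e.g., <20% of the limit).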
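
One possible way to realize the staggering in item 1 is to derive a stable minute offset from each workflow's name, so the generated schedule does not change between recompiles. The sketch below is illustrative only and is not the gh-aw compiler's implementation: `staggered_cron` is a hypothetical helper, and the 12:05 UTC base time and 5-minute window are assumptions taken from the figures in this report.

```python
import hashlib


def staggered_cron(workflow_name: str, base_hour: int = 12, base_minute: int = 5,
                   window_minutes: int = 5) -> str:
    """Return a daily cron expression shifted by a stable per-workflow offset.

    Hypothetical helper: the offset is derived from a hash of the workflow
    name, so the same workflow always gets the same start minute.
    """
    digest = hashlib.sha256(workflow_name.encode("utf-8")).digest()
    # Map the hash onto an offset in [0, window_minutes).
    offset = int.from_bytes(digest[:4], "big") % window_minutes
    total = base_hour * 60 + base_minute + offset
    hour, minute = (total // 60) % 24, total % 60
    # Standard five-field cron: minute hour day-of-month month day-of-week.
    return f"{minute} {hour} * * *"


for name in ("Daily Skill Optimizer Improvements", "Step Name Alignment"):
    print(name, "->", staggered_cron(name))
```

Because cron has one-minute granularity, a 5-minute window offers only five distinct start minutes, so 12 workflows would still share slots; a larger `window_minutes` would be needed to give each workflow its own minute.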

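Items 2 to 4 could be combined in the retry path roughly as follows. This is a minimal sketch under stated assumptions, not the existing safe_outputs code: `retry_delay` and `headroom_low` are hypothetical helpers, `Retry-After` is assumed to carry a number of seconds, and the header names are GitHub's documented `retry-after`, `x-ratelimit-limit`, `x-ratelimit-remaining`, and `x-ratelimit-reset`. A server-provided wait takes precedence over the fixed schedule, because the fixed schedule may be shorter than the time the token stays constrained.

```python
import time
from typing import Mapping, Optional

# Backoff schedule proposed in this issue (seconds between attempts).
BACKOFF_DELAYS_S = (30, 60, 120, 240)


def retry_delay(headers: Mapping[str, str], retry_index: int,
                now: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retry `retry_index` (0-based), or None to give up."""
    if retry_index >= len(BACKOFF_DELAYS_S):
        return None
    h = {k.lower(): v for k, v in headers.items()}
    # Prefer the server's explicit instruction (assumed to be in seconds).
    if "retry-after" in h:
        return float(h["retry-after"])
    # Quota drained: wait until the reset time (epoch seconds).
    if h.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in h:
        now = time.time() if now is None else now
        return max(0.0, float(h["x-ratelimit-reset"]) - now)
    # No guidance from the server: fall back to exponential backoff.
    return float(BACKOFF_DELAYS_S[retry_index])


def headroom_low(headers: Mapping[str, str], threshold: float = 0.2) -> bool:
    """True when the remaining quota is below `threshold` of the limit."""
    h = {k.lower(): v for k, v in headers.items()}
    try:
        limit = int(h["x-ratelimit-limit"])
        remaining = int(h["x-ratelimit-remaining"])
    except (KeyError, ValueError):
        return False  # headers missing or malformed: nothing to report
    return limit > 0 and remaining / limit < threshold
```
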
---

### Success Criteria / Verification

- [ ] Zero `create_issue`/`add_labels`/`lock issue` rate-limit failures in the next 7-day window for the same workflows
- [ ] Staggered cron schedules visible in `.lock.yml` files for affected daily workflows
- [ ] At minimum, `Daily Skill Optimizer Improvements` and `Step Name Alignment` issues created successfully on next run

---

### References

- Parent report: #29540 (Failure Investigation Report 2026-05-01)
- Run [§25213299148](https://github.com/github/gh-aw/actions/runs/25213299148) — Daily Skill Optimizer: `create_issue` failed
- Run [§25213669352](https://github.com/github/gh-aw/actions/runs/25213669352) — Step Name Alignment: `create_issue` failed
- Run [§25213666728](https://github.com/github/gh-aw/actions/runs/25213666728) — AI Moderator: `add_labels` failed
- Run [§25214243935](https://github.com/github/gh-aw/actions/runs/25214243935) — AI Moderator: `lock issue` failed

> Generated by [[aw] Failure Investigator (6h)](https://github.com/github/gh-aw/actions/runs/25215416435/agentic_workflow) · ● 427.4K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Faw-failure-investigator%22&type=issues)
> - [x] expires on May 8, 2026, 1:19 PM UTC
