Fix silent safe-output write failures under installation token rate limiting #29543

Merged
pelikhan merged 4 commits into main from copilot/aw-failures-fix-token-rate-limiting
May 1, 2026

Conversation

Contributor

Copilot AI commented May 1, 2026

When 10+ scheduled workflows complete concurrently, the burst of installation-token API calls from safe-output jobs exhausts the 60 req/min quota. The current 3-retry / ~90s window is too short — the token stays exhausted for 20+ minutes, causing create_issue, add_labels, and lock-issue writes to fail after all retries without the agent knowing.

Changes

Retry-After header support (error_recovery.cjs)

  • Added getRetryAfterMs(error): extracts the server-supplied wait time from Retry-After (seconds) or x-ratelimit-reset (Unix timestamp) headers, but only when the response status indicates a genuine rate-limit condition (HTTP 429, or HTTP 403 with x-ratelimit-remaining: 0). Non-rate-limit errors (e.g. 502) continue to use normal exponential backoff.
  • withRetry() now uses the server-dictated delay instead of calculated backoff when the header is present; logs when capping occurs
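
The status gate can be sketched as follows; this is a simplified illustration of the rule described above, not code copied from error_recovery.cjs:

```javascript
// Simplified sketch of the gating rule: only treat a failed response as a
// genuine rate-limit condition on HTTP 429, or HTTP 403 with
// x-ratelimit-remaining: 0. Everything else keeps normal exponential backoff.
function isRateLimitStatus(status, headers) {
  if (status === 429) return true; // explicit "too many requests"
  if (status === 403 && headers?.["x-ratelimit-remaining"] === "0") return true;
  return false;
}

console.log(isRateLimitStatus(429, {})); // true
console.log(isRateLimitStatus(403, { "x-ratelimit-remaining": "0" })); // true
// A 502 often still carries x-ratelimit-reset, but must not gate:
console.log(isRateLimitStatus(502, { "x-ratelimit-reset": "1767225600" })); // false
```

When this predicate is false, getRetryAfterMs() returns null and withRetry() falls back to its computed backoff.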

Dedicated rate-limit retry config (error_recovery.cjs)

New RATE_LIMIT_RETRY_CONFIG: 5 retries with ~30s→60s→120s→240s backoff (5s jitter per retry to spread concurrent retries), replacing the old 3-retry / 45s-cap inline configs:

const RATE_LIMIT_RETRY_CONFIG = {
  maxRetries: 5,
  initialDelayMs: 15000, // × backoffMultiplier(2) = 30s first retry
  maxDelayMs: 240000,    // 4-min cap
  backoffMultiplier: 2,
  jitterMs: 5000,
  shouldRetry: isTransientError,
};
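
As a quick sanity check on the schedule above: assuming retries are numbered from 1, the pre-jitter delay for retry n works out to min(initialDelayMs * 2^n, maxDelayMs). This derivation is illustrative, not code from the PR:

```javascript
const RATE_LIMIT_RETRY_CONFIG = {
  maxRetries: 5,
  initialDelayMs: 15000,
  maxDelayMs: 240000,
  backoffMultiplier: 2,
  jitterMs: 5000,
};

// Pre-jitter delay before retry n (1-based): 15s * 2^n, capped at 240s
function delayForRetry(n, cfg) {
  return Math.min(cfg.initialDelayMs * Math.pow(cfg.backoffMultiplier, n), cfg.maxDelayMs);
}

const schedule = [];
for (let n = 1; n <= RATE_LIMIT_RETRY_CONFIG.maxRetries; n++) {
  schedule.push(delayForRetry(n, RATE_LIMIT_RETRY_CONFIG));
}
console.log(schedule); // [ 30000, 60000, 120000, 240000, 240000 ]
```

Worst case, that is ~11.5 minutes of total wait before jitter, versus ~90s under the old 3-retry config — enough to span a 60 req/min quota window that stays exhausted for a while.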

Apply rate-limit retry to affected handlers

  • add_labels.cjs: previously had no retry at all — wraps addLabels with RATE_LIMIT_RETRY_CONFIG
  • lock-issue.cjs: wraps both issues.get and issues.lock
  • create_issue.cjs: upgrades from 3-retry/45s inline config to RATE_LIMIT_RETRY_CONFIG

Pre-flight headroom check (safe_output_handler_manager.cjs, rate_limit_helpers.cjs)

  • Added checkRateLimitHeadroom(): fetches current quota and emits a warning when remaining drops below 20% of the limit (uses Math.floor so values like 19.9% correctly trigger the warning)
  • Called before processing any safe-output messages, surfacing quota pressure in logs even when no individual write fails
  • Error details from the catch branch are included in the warning for easier diagnosis of permission/network issues
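
A minimal sketch of the threshold math (the real function in rate_limit_helpers.cjs calls github.rest.rateLimit.get() and emits core.warning; the shape below mirrors the description but is simplified):

```javascript
const LOW_RATE_LIMIT_THRESHOLD_PERCENT = 20;

// Sketch of the headroom calculation described above.
function headroom(remaining, limit) {
  // Math.floor so 19.9% reports as 19 and correctly falls below the threshold
  const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;
  return {
    remaining,
    limit,
    percentRemaining,
    low: percentRemaining < LOW_RATE_LIMIT_THRESHOLD_PERCENT,
  };
}

console.log(headroom(199, 1000)); // percentRemaining: 19, low: true
console.log(headroom(200, 1000)); // percentRemaining: 20, low: false
```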

Rate-limit observability via JSONL log file (github_rate_limit_logger.cjs)

  • Added logRetryEvent(error, operation, attempt, delayMs): writes a source: "retry" JSONL entry to github_rate_limits.jsonl before each retry sleep, capturing the attempt number, delay, HTTP status, and any x-ratelimit-* headers from the error response. This allows post-run analysis to correlate retry storms with quota exhaustion.
  • withRetry() calls logRetryEvent() on every retry attempt.
  • checkRateLimitHeadroom() now also calls logRateLimitFromResponse() on the rateLimit.get() response, so every pre-flight quota snapshot is written to the JSONL log file (single API call, no extra round-trip).

… 5-retry backoff, headroom monitoring

- Add RATE_LIMIT_RETRY_CONFIG (5 retries, 30s→60s→120s→240s backoff)
- Add getRetryAfterMs() to parse Retry-After/x-ratelimit-reset headers
- Respect Retry-After in withRetry delay calculation
- Add checkRateLimitHeadroom() warning at <20% quota remaining
- Wrap add_labels, lock-issue, create_issue API calls with rate-limit retry
- Pre-check rate-limit headroom before processing safe outputs
- Update tests

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/faaeb54c-db6a-44f0-b3a3-b25cd24bca1e

Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix installation token rate limiting causing silent loss of outputs" to "Fix silent safe-output write failures under installation token rate limiting" on May 1, 2026
Copilot AI requested a review from gh-aw-bot May 1, 2026 14:05
@pelikhan pelikhan marked this pull request as ready for review May 1, 2026 15:04
Copilot AI review requested due to automatic review settings May 1, 2026 15:04
Contributor

Copilot AI left a comment

Pull request overview

Improves reliability and observability of safe-output GitHub write operations under installation-token rate limiting by adding server-honored retry delays and a pre-flight rate-limit headroom check.

Changes:

  • Add Retry-After / x-ratelimit-reset header support and introduce a dedicated RATE_LIMIT_RETRY_CONFIG for longer backoff retries.
  • Apply the new rate-limit retry config to create_issue, add_labels, and lock-issue write paths.
  • Add a pre-flight checkRateLimitHeadroom() warning before safe-output processing, with accompanying unit tests.
File Description
actions/setup/js/safe_output_handler_manager.cjs Calls a new pre-flight rate-limit headroom check before processing safe-output messages.
actions/setup/js/rate_limit_helpers.cjs Adds LOW_RATE_LIMIT_THRESHOLD_PERCENT and checkRateLimitHeadroom() logging/return values.
actions/setup/js/rate_limit_helpers.test.cjs Adds unit tests for the new threshold constant and checkRateLimitHeadroom() behavior.
actions/setup/js/error_recovery.cjs Adds RATE_LIMIT_RETRY_CONFIG, getRetryAfterMs(), and updates withRetry() to honor server retry headers.
actions/setup/js/error_recovery.test.cjs Adds unit tests for the new retry config, header parsing, and withRetry() header behavior.
actions/setup/js/create_issue.cjs Switches issue creation retries to use RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/add_labels.cjs Wraps label application in withRetry() using RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/add_labels.test.cjs Updates mocks/assertions to account for retry logging behavior.
actions/setup/js/lock-issue.cjs Wraps issues.get and issues.lock in withRetry() using RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/lock-issue.test.cjs Loosens assertions to accommodate updated error messaging behavior.

Copilot's findings

  • Files reviewed: 10/10 changed files
  • Comments generated: 3

Comment on lines +102 to +136
/**
 * Extract the Retry-After delay in milliseconds from a GitHub API rate-limit error.
 *
 * GitHub returns one of two headers on 429 / 403-rate-limit responses:
 * - `retry-after` – integer seconds to wait (per RFC 6585)
 * - `x-ratelimit-reset` – Unix timestamp (seconds) when the quota resets
 *
 * @param {any} error - The error object from a failed GitHub API call
 * @returns {number|null} Milliseconds to wait, or null if no header is present
 */
function getRetryAfterMs(error) {
  // Octokit surfaces response headers via error.response.headers or error.headers
  const headers = error?.response?.headers ?? error?.headers ?? null;
  if (!headers) return null;

  // retry-after: number of seconds (highest priority)
  const retryAfter = headers["retry-after"];
  if (retryAfter != null) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds) && seconds > 0) {
      return seconds * 1000;
    }
  }

  // x-ratelimit-reset: Unix timestamp — derive wait time from clock delta
  const resetAt = headers["x-ratelimit-reset"];
  if (resetAt != null) {
    const resetTimestampMs = parseInt(resetAt, 10) * 1000;
    if (!isNaN(resetTimestampMs)) {
      const waitMs = resetTimestampMs - Date.now();
      if (waitMs > 0) {
        return waitMs;
      }
    }
  }

  return null;
}

Copilot AI May 1, 2026

getRetryAfterMs() falls back to x-ratelimit-reset whenever that header is present. GitHub includes x-ratelimit-reset on most API responses (including non-rate-limit failures like 5xx/502), so withRetry() will start honoring that reset time for unrelated transient errors and effectively override the intended exponential backoff (often immediately capping to maxDelayMs). Consider only using retry-after/x-ratelimit-reset when the response status indicates a rate-limit condition (e.g. 429, or 403 with x-ratelimit-remaining: 0 / message contains "rate limit"), otherwise return null so normal backoff applies.

Comment thread actions/setup/js/rate_limit_helpers.cjs Outdated
try {
  const { data } = await github.rest.rateLimit.get();
  const { remaining, limit } = data.rate;
  const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;

Copilot AI May 1, 2026

checkRateLimitHeadroom() uses Math.round((remaining / limit) * 100). Rounding can hide low-headroom cases right below the threshold (e.g. 19.6% rounds to 20% and won't warn even though it's below a 20% threshold). Using Math.floor (or comparing using the raw ratio) would make the warning behavior match the documented "when < 20% remains" intent.

Suggested change
const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;
const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;

Comment on lines +75 to +78
  return { remaining, limit, percentRemaining };
} catch {
  core.warning(`Could not check rate-limit headroom for ${operation}`);
  return { remaining: -1, limit: -1, percentRemaining: -1 };

Copilot AI May 1, 2026

The catch { ... } branch logs Could not check rate-limit headroom for ${operation} but drops the underlying error details. Since this is intended for observability, including getErrorMessage(err) (or similar) would make diagnosing permission/network/API issues much easier.

@github-actions github-actions Bot mentioned this pull request May 1, 2026
Contributor

github-actions Bot commented May 1, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 71/100

⚠️ Acceptable — with suggestions

Metric Value
New/modified tests analyzed 17
✅ Design tests (behavioral contracts) 12 (70.6%)
⚠️ Implementation tests (low value) 5 (29.4%)
Tests with error/edge cases 10 (58.8%)
Duplicate test clusters 1
Test inflation detected No
🚨 Coding-guideline violations None

Test Classification Details

Test File Classification Notes
RATE_LIMIT_RETRY_CONFIG / should have 5 retries error_recovery.test.cjs ⚠️ Implementation Asserts a config constant value
RATE_LIMIT_RETRY_CONFIG / should have initialDelayMs producing 30s first retry sleep error_recovery.test.cjs ✅ Design Verifies computed backoff behavior
RATE_LIMIT_RETRY_CONFIG / should cap delay at 240s error_recovery.test.cjs ⚠️ Implementation Asserts a config constant value
RATE_LIMIT_RETRY_CONFIG / should use isTransientError as shouldRetry error_recovery.test.cjs ⚠️ Implementation Asserts internal function reference
getRetryAfterMs / should return null when error has no response headers error_recovery.test.cjs ✅ Design Null/undefined edge cases
getRetryAfterMs / should extract retry-after seconds from response headers error_recovery.test.cjs ✅ Design Primary happy path
getRetryAfterMs / should extract retry-after seconds from top-level headers error_recovery.test.cjs ✅ Design Alternate input shape
getRetryAfterMs / should prefer response.headers over top-level headers error_recovery.test.cjs ✅ Design Priority / precedence contract
getRetryAfterMs / should return null for zero or negative retry-after error_recovery.test.cjs ✅ Design Boundary / edge case
getRetryAfterMs / should fall back to x-ratelimit-reset when retry-after is absent error_recovery.test.cjs ✅ Design Fallback behavioral contract
getRetryAfterMs / should return null when x-ratelimit-reset is in the past error_recovery.test.cjs ✅ Design Temporal edge case
getRetryAfterMs / should return null for non-numeric retry-after error_recovery.test.cjs ✅ Design Invalid input handling
withRetry with Retry-After header / should use Retry-After delay error_recovery.test.cjs ✅ Design End-to-end retry behavior
LOW_RATE_LIMIT_THRESHOLD_PERCENT / should be 20 rate_limit_helpers.test.cjs ⚠️ Implementation Asserts a config constant value
checkRateLimitHeadroom / should return remaining, limit, and percentRemaining rate_limit_helpers.test.cjs ✅ Design Verifies return shape and math
checkRateLimitHeadroom / should log info when headroom is above threshold rate_limit_helpers.test.cjs ✅ Design Observable side-effect (no warning emitted)
checkRateLimitHeadroom / should emit warning when headroom is below threshold rate_limit_helpers.test.cjs ✅ Design Edge case + observable side-effect
checkRateLimitHeadroom / should return -1 values and warn on error rate_limit_helpers.test.cjs ✅ Design Error path + fallback contract

Flagged Tests — Suggestions

⚠️ Config-constant tests in RATE_LIMIT_RETRY_CONFIG and LOW_RATE_LIMIT_THRESHOLD_PERCENT

Tests: should have 5 retries, should cap delay at 240s, should use isTransientError as shouldRetry, LOW_RATE_LIMIT_THRESHOLD_PERCENT / should be 20

Classification: Implementation — these four tests assert that exported config objects contain specific literal values or function references.

What design invariant do they enforce? They enforce that specific constants haven't changed, but not the observable behavior those constants produce. Deleting them would only be noticed if someone accidentally changed a constant — not if the retry logic itself broke.

Suggested improvement: Fold these constant checks into behavioral integration tests. For example, rather than asserting RATE_LIMIT_RETRY_CONFIG.maxRetries === 5, add a test that calls withRetry with a RATE_LIMIT_RETRY_CONFIG config and a mock that fails 5 times, then asserts the 6th call throws — proving the retry limit is actually enforced at runtime. This collapses multiple constant tests into one high-value contract test.
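
A self-contained sketch of such a contract test, using a hypothetical stand-in for withRetry (the real one in error_recovery.cjs also sleeps with backoff between attempts, which is omitted here to keep the test fast):

```javascript
// Minimal stand-in so the sketch runs on its own; not the real withRetry().
async function withRetry(fn, config) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= config.maxRetries || !config.shouldRetry(err)) throw err;
      // real code sleeps here (backoff + jitter)
    }
  }
}

// Behavioral contract: with maxRetries = 5, an always-failing call is
// attempted exactly 6 times (1 initial + 5 retries) before the error escapes.
async function demo() {
  let calls = 0;
  try {
    await withRetry(
      async () => { calls++; throw new Error("boom"); },
      { maxRetries: 5, shouldRetry: () => true }
    );
  } catch {
    console.log(`calls: ${calls}`); // calls: 6
  }
}
demo();
```

Deleting this test would be noticed the moment the retry limit stopped being enforced at runtime, which is exactly the property the constant assertions cannot catch.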


Score Breakdown

Component Weight Score
Behavioral Coverage (12/17 = 70.6%) 40 pts 28 pts
Error/Edge Case Coverage (10/17 = 58.8%) 30 pts 18 pts
Low Duplication (1 duplicate cluster) 20 pts 15 pts
Proportional Growth (max ratio 1.57, below 2:1) 10 pts 10 pts
Total 100 pts 71 pts

Inflation check: error_recovery.test.cjs +88 lines / error_recovery.cjs +75 lines = 1.17×; rate_limit_helpers.test.cjs +58 lines / rate_limit_helpers.cjs +37 lines = 1.57× — both within the 2:1 threshold.


Language Support

Tests analyzed:

  • 🟨 JavaScript (*.test.cjs): 17 tests (vitest)

No Go test files were added or modified in this PR.


Verdict

Check passed. 29.4% of new tests are implementation tests (threshold: 30%). The getRetryAfterMs and checkRateLimitHeadroom suites provide solid behavioral coverage with good edge-case handling. The main improvement opportunity is replacing the four config-constant assertions with end-to-end behavioral tests.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls or config constant values
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25219423361


Contributor

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: 71/100. Test quality is acceptable — 29.4% of new tests are implementation tests (threshold: 30%).

Collaborator

pelikhan commented May 1, 2026

@copilot review all comments

Collaborator

pelikhan commented May 1, 2026

@copilot make sure to log rate limit lookups and retry into the rate limit log file

…ror details in catch

- getRetryAfterMs() now only reads Retry-After/x-ratelimit-reset headers when the
  response status indicates a rate-limit condition (429, or 403 with
  x-ratelimit-remaining: 0), preventing unintended backoff override on 5xx errors
- Use Math.floor instead of Math.round for percent calculation so 19.9% correctly
  triggers the low-headroom warning
- Include getErrorMessage(err) in checkRateLimitHeadroom catch block for better
  diagnostic visibility
- Update tests to cover status-gated behavior and Math.floor edge case

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/32cbeba1-59fa-4768-a053-1a50b0efd8a5

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot review all comments

Addressed all three review comments in 11f4fd3:

  1. getRetryAfterMs status-gating — now returns null for any response that is not a genuine rate-limit condition (HTTP 429, or HTTP 403 with x-ratelimit-remaining: 0). This prevents the 5xx-transient-error path from accidentally overriding exponential backoff with an x-ratelimit-reset header that GitHub includes on most responses.

  2. Math.floor instead of Math.round — a 19.9% remaining quota now correctly triggers the warning (rounding up to 20% was silently skipping it).

  3. Error details in catch: checkRateLimitHeadroom now includes getErrorMessage(err) in the warning so API/network/permission failures are visible in the job log.

- Add logRetryEvent() to github_rate_limit_logger.cjs: writes a JSONL
  entry with source='retry', attempt number, delay, HTTP status, and any
  x-ratelimit-* headers extracted from the error response. This lets
  retry storms be correlated with quota exhaustion in post-run analysis.
- Import and call logRetryEvent() in withRetry() (error_recovery.cjs)
  right before sleeping on each retry, capturing the error state at the
  moment of failure.
- In checkRateLimitHeadroom() (rate_limit_helpers.cjs), call
  logRateLimitFromResponse() on the rateLimit.get() response so every
  pre-flight headroom snapshot is also written to the JSONL log file.
- Tests: logRetryEvent unit tests in github_rate_limit_logger.test.cjs,
  JSONL-write verification in error_recovery.test.cjs and
  rate_limit_helpers.test.cjs.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/37e5ae29-9669-453f-9568-f4b68d6f1e98

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot make sure to log rate limit lookups and retry into the rate limit log file

Done in 3038814. Two changes:

  1. Retry events — added logRetryEvent() to github_rate_limit_logger.cjs (new export). withRetry() now calls it right before each sleep, writing a source: "retry" JSONL entry with the attempt number, delay, HTTP status, and any x-ratelimit-* headers extracted from the error response.

  2. Headroom lookups: checkRateLimitHeadroom() now calls logRateLimitFromResponse(response, operation) on the rateLimit.get() response, so every pre-flight quota snapshot is also captured in github_rate_limits.jsonl.

@pelikhan pelikhan merged commit 1d6403c into main May 1, 2026
@pelikhan pelikhan deleted the copilot/aw-failures-fix-token-rate-limiting branch May 1, 2026 17:04

Development

Successfully merging this pull request may close these issues.

[aw-failures] [P0] Installation token rate limiting causes silent loss of safe-output writes in concurrent burst windows

4 participants