Fix silent safe-output write failures under installation token rate limiting #29543

Merged
pelikhan merged 4 commits into main from copilot/aw-failures-fix-token-rate-limiting
May 1, 2026

Conversation

Contributor

Copilot AI commented May 1, 2026

When 10+ scheduled workflows complete concurrently, the burst of installation-token API calls from safe-output jobs exhausts the 60 req/min quota. The current 3-retry / ~90s window is too short — the token stays exhausted for 20+ minutes, causing create_issue, add_labels, and lock-issue writes to fail after all retries without the agent knowing.

Changes

Retry-After header support (error_recovery.cjs)

  • Added getRetryAfterMs(error): extracts the server-supplied wait time from Retry-After (seconds) or x-ratelimit-reset (Unix timestamp) headers, but only when the response status indicates a genuine rate-limit condition (HTTP 429, or HTTP 403 with x-ratelimit-remaining: 0). Non-rate-limit errors (e.g. 502) continue to use normal exponential backoff.
  • withRetry() now uses the server-dictated delay instead of calculated backoff when the header is present; logs when capping occurs
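
The status gate can be sketched as follows; this is a simplified illustration of the rule described above, not code copied from error_recovery.cjs:

```javascript
// Simplified sketch of the gating rule: only treat a failed response as a
// genuine rate-limit condition on HTTP 429, or HTTP 403 with
// x-ratelimit-remaining: 0. Everything else keeps normal exponential backoff.
function isRateLimitStatus(status, headers) {
  if (status === 429) return true; // explicit "too many requests"
  if (status === 403 && headers?.["x-ratelimit-remaining"] === "0") return true;
  return false;
}

console.log(isRateLimitStatus(429, {})); // true
console.log(isRateLimitStatus(403, { "x-ratelimit-remaining": "0" })); // true
// A 502 often still carries x-ratelimit-reset, but must not gate:
console.log(isRateLimitStatus(502, { "x-ratelimit-reset": "1767225600" })); // false
```

When this predicate is false, getRetryAfterMs() returns null and withRetry() falls back to its computed backoff.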

Dedicated rate-limit retry config (error_recovery.cjs)

New RATE_LIMIT_RETRY_CONFIG: 5 retries with ~30s→60s→120s→240s backoff (5s jitter per retry to spread concurrent retries), replacing the old 3-retry / 45s-cap inline configs:

const RATE_LIMIT_RETRY_CONFIG = {
  maxRetries: 5,
  initialDelayMs: 15000, // × backoffMultiplier(2) = 30s first retry
  maxDelayMs: 240000,    // 4-min cap
  backoffMultiplier: 2,
  jitterMs: 5000,
  shouldRetry: isTransientError,
};
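
As a quick sanity check on the schedule above: assuming retries are numbered from 1, the pre-jitter delay for retry n works out to min(initialDelayMs * 2^n, maxDelayMs). This derivation is illustrative, not code from the PR:

```javascript
const RATE_LIMIT_RETRY_CONFIG = {
  maxRetries: 5,
  initialDelayMs: 15000,
  maxDelayMs: 240000,
  backoffMultiplier: 2,
  jitterMs: 5000,
};

// Pre-jitter delay before retry n (1-based): 15s * 2^n, capped at 240s
function delayForRetry(n, cfg) {
  return Math.min(cfg.initialDelayMs * Math.pow(cfg.backoffMultiplier, n), cfg.maxDelayMs);
}

const schedule = [];
for (let n = 1; n <= RATE_LIMIT_RETRY_CONFIG.maxRetries; n++) {
  schedule.push(delayForRetry(n, RATE_LIMIT_RETRY_CONFIG));
}
console.log(schedule); // [ 30000, 60000, 120000, 240000, 240000 ]
```

Worst case, that is ~11.5 minutes of total wait before jitter, versus ~90s under the old 3-retry config — enough to span a 60 req/min quota window that stays exhausted for a while.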

Apply rate-limit retry to affected handlers

  • add_labels.cjs: previously had no retry at all — wraps addLabels with RATE_LIMIT_RETRY_CONFIG
  • lock-issue.cjs: wraps both issues.get and issues.lock
  • create_issue.cjs: upgrades from 3-retry/45s inline config to RATE_LIMIT_RETRY_CONFIG

Pre-flight headroom check (safe_output_handler_manager.cjs, rate_limit_helpers.cjs)

  • Added checkRateLimitHeadroom(): fetches current quota and emits a warning when remaining drops below 20% of the limit (uses Math.floor so values like 19.9% correctly trigger the warning)
  • Called before processing any safe-output messages, surfacing quota pressure in logs even when no individual write fails
  • Error details from the catch branch are included in the warning for easier diagnosis of permission/network issues
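
A minimal sketch of the threshold math (the real function in rate_limit_helpers.cjs calls github.rest.rateLimit.get() and emits core.warning; the shape below mirrors the description but is simplified):

```javascript
const LOW_RATE_LIMIT_THRESHOLD_PERCENT = 20;

// Sketch of the headroom calculation described above.
function headroom(remaining, limit) {
  // Math.floor so 19.9% reports as 19 and correctly falls below the threshold
  const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;
  return {
    remaining,
    limit,
    percentRemaining,
    low: percentRemaining < LOW_RATE_LIMIT_THRESHOLD_PERCENT,
  };
}

console.log(headroom(199, 1000)); // percentRemaining: 19, low: true
console.log(headroom(200, 1000)); // percentRemaining: 20, low: false
```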

Rate-limit observability via JSONL log file (github_rate_limit_logger.cjs)

  • Added logRetryEvent(error, operation, attempt, delayMs): writes a source: "retry" JSONL entry to github_rate_limits.jsonl before each retry sleep, capturing the attempt number, delay, HTTP status, and any x-ratelimit-* headers from the error response. This allows post-run analysis to correlate retry storms with quota exhaustion.
  • withRetry() calls logRetryEvent() on every retry attempt.
  • checkRateLimitHeadroom() now also calls logRateLimitFromResponse() on the rateLimit.get() response, so every pre-flight quota snapshot is written to the JSONL log file (single API call, no extra round-trip).

… 5-retry backoff, headroom monitoring

- Add RATE_LIMIT_RETRY_CONFIG (5 retries, 30s→60s→120s→240s backoff)
- Add getRetryAfterMs() to parse Retry-After/x-ratelimit-reset headers
- Respect Retry-After in withRetry delay calculation
- Add checkRateLimitHeadroom() warning at <20% quota remaining
- Wrap add_labels, lock-issue, create_issue API calls with rate-limit retry
- Pre-check rate-limit headroom before processing safe outputs
- Update tests

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/faaeb54c-db6a-44f0-b3a3-b25cd24bca1e

Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix installation token rate limiting causing silent loss of outputs" to "Fix silent safe-output write failures under installation token rate limiting" on May 1, 2026
Copilot AI requested a review from gh-aw-bot May 1, 2026 14:05
@pelikhan pelikhan marked this pull request as ready for review May 1, 2026 15:04
Copilot AI review requested due to automatic review settings May 1, 2026 15:04
Contributor

Copilot AI left a comment

Pull request overview

Improves reliability and observability of safe-output GitHub write operations under installation-token rate limiting by adding server-honored retry delays and a pre-flight rate-limit headroom check.

Changes:

  • Add Retry-After / x-ratelimit-reset header support and introduce a dedicated RATE_LIMIT_RETRY_CONFIG for longer backoff retries.
  • Apply the new rate-limit retry config to create_issue, add_labels, and lock-issue write paths.
  • Add a pre-flight checkRateLimitHeadroom() warning before safe-output processing, with accompanying unit tests.
File Description
actions/setup/js/safe_output_handler_manager.cjs Calls a new pre-flight rate-limit headroom check before processing safe-output messages.
actions/setup/js/rate_limit_helpers.cjs Adds LOW_RATE_LIMIT_THRESHOLD_PERCENT and checkRateLimitHeadroom() logging/return values.
actions/setup/js/rate_limit_helpers.test.cjs Adds unit tests for the new threshold constant and checkRateLimitHeadroom() behavior.
actions/setup/js/error_recovery.cjs Adds RATE_LIMIT_RETRY_CONFIG, getRetryAfterMs(), and updates withRetry() to honor server retry headers.
actions/setup/js/error_recovery.test.cjs Adds unit tests for the new retry config, header parsing, and withRetry() header behavior.
actions/setup/js/create_issue.cjs Switches issue creation retries to use RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/add_labels.cjs Wraps label application in withRetry() using RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/add_labels.test.cjs Updates mocks/assertions to account for retry logging behavior.
actions/setup/js/lock-issue.cjs Wraps issues.get and issues.lock in withRetry() using RATE_LIMIT_RETRY_CONFIG.
actions/setup/js/lock-issue.test.cjs Loosens assertions to accommodate updated error messaging behavior.

Copilot's findings

  • Files reviewed: 10/10 changed files
  • Comments generated: 3

Comment on lines +102 to +136
/**
 * Extract the Retry-After delay in milliseconds from a GitHub API rate-limit error.
 *
 * GitHub returns one of two headers on 429 / 403-rate-limit responses:
 * - `retry-after` – integer seconds to wait (per RFC 6585)
 * - `x-ratelimit-reset` – Unix timestamp (seconds) when the quota resets
 *
 * @param {any} error - The error object from a failed GitHub API call
 * @returns {number|null} Milliseconds to wait, or null if no header is present
 */
function getRetryAfterMs(error) {
  // Octokit surfaces response headers via error.response.headers or error.headers
  const headers = error?.response?.headers ?? error?.headers ?? null;
  if (!headers) return null;

  // retry-after: number of seconds (highest priority)
  const retryAfter = headers["retry-after"];
  if (retryAfter != null) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds) && seconds > 0) {
      return seconds * 1000;
    }
  }

  // x-ratelimit-reset: Unix timestamp — derive wait time from clock delta
  const resetAt = headers["x-ratelimit-reset"];
  if (resetAt != null) {
    const resetTimestampMs = parseInt(resetAt, 10) * 1000;
    if (!isNaN(resetTimestampMs)) {
      const waitMs = resetTimestampMs - Date.now();
      if (waitMs > 0) {
        return waitMs;
      }
    }
  }

  return null;
}

Copilot AI May 1, 2026

getRetryAfterMs() falls back to x-ratelimit-reset whenever that header is present. GitHub includes x-ratelimit-reset on most API responses (including non-rate-limit failures like 5xx/502), so withRetry() will start honoring that reset time for unrelated transient errors and effectively override the intended exponential backoff (often immediately capping to maxDelayMs). Consider only using retry-after/x-ratelimit-reset when the response status indicates a rate-limit condition (e.g. 429, or 403 with x-ratelimit-remaining: 0 / message contains "rate limit"), otherwise return null so normal backoff applies.

Comment thread actions/setup/js/rate_limit_helpers.cjs Outdated
try {
  const { data } = await github.rest.rateLimit.get();
  const { remaining, limit } = data.rate;
  const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;

Copilot AI May 1, 2026

checkRateLimitHeadroom() uses Math.round((remaining / limit) * 100). Rounding can hide low-headroom cases right below the threshold (e.g. 19.6% rounds to 20% and won't warn even though it's below a 20% threshold). Using Math.floor (or comparing using the raw ratio) would make the warning behavior match the documented "when < 20% remains" intent.

Suggested change
const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;
const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;

Comment on lines +75 to +78
  return { remaining, limit, percentRemaining };
} catch {
  core.warning(`Could not check rate-limit headroom for ${operation}`);
  return { remaining: -1, limit: -1, percentRemaining: -1 };

Copilot AI May 1, 2026

The catch { ... } branch logs Could not check rate-limit headroom for ${operation} but drops the underlying error details. Since this is intended for observability, including getErrorMessage(err) (or similar) would make diagnosing permission/network/API issues much easier.

@github-actions github-actions Bot mentioned this pull request May 1, 2026
Contributor

github-actions Bot commented May 1, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 71/100

⚠️ Acceptable — with suggestions

Metric Value
New/modified tests analyzed 17
✅ Design tests (behavioral contracts) 12 (70.6%)
⚠️ Implementation tests (low value) 5 (29.4%)
Tests with error/edge cases 10 (58.8%)
Duplicate test clusters 1
Test inflation detected No
🚨 Coding-guideline violations None

Test Classification Details

Test File Classification Notes
RATE_LIMIT_RETRY_CONFIG / should have 5 retries error_recovery.test.cjs ⚠️ Implementation Asserts a config constant value
RATE_LIMIT_RETRY_CONFIG / should have initialDelayMs producing 30s first retry sleep error_recovery.test.cjs ✅ Design Verifies computed backoff behavior
RATE_LIMIT_RETRY_CONFIG / should cap delay at 240s error_recovery.test.cjs ⚠️ Implementation Asserts a config constant value
RATE_LIMIT_RETRY_CONFIG / should use isTransientError as shouldRetry error_recovery.test.cjs ⚠️ Implementation Asserts internal function reference
getRetryAfterMs / should return null when error has no response headers error_recovery.test.cjs ✅ Design Null/undefined edge cases
getRetryAfterMs / should extract retry-after seconds from response headers error_recovery.test.cjs ✅ Design Primary happy path
getRetryAfterMs / should extract retry-after seconds from top-level headers error_recovery.test.cjs ✅ Design Alternate input shape
getRetryAfterMs / should prefer response.headers over top-level headers error_recovery.test.cjs ✅ Design Priority / precedence contract
getRetryAfterMs / should return null for zero or negative retry-after error_recovery.test.cjs ✅ Design Boundary / edge case
getRetryAfterMs / should fall back to x-ratelimit-reset when retry-after is absent error_recovery.test.cjs ✅ Design Fallback behavioral contract
getRetryAfterMs / should return null when x-ratelimit-reset is in the past error_recovery.test.cjs ✅ Design Temporal edge case
getRetryAfterMs / should return null for non-numeric retry-after error_recovery.test.cjs ✅ Design Invalid input handling
withRetry with Retry-After header / should use Retry-After delay error_recovery.test.cjs ✅ Design End-to-end retry behavior
LOW_RATE_LIMIT_THRESHOLD_PERCENT / should be 20 rate_limit_helpers.test.cjs ⚠️ Implementation Asserts a config constant value
checkRateLimitHeadroom / should return remaining, limit, and percentRemaining rate_limit_helpers.test.cjs ✅ Design Verifies return shape and math
checkRateLimitHeadroom / should log info when headroom is above threshold rate_limit_helpers.test.cjs ✅ Design Observable side-effect (no warning emitted)
checkRateLimitHeadroom / should emit warning when headroom is below threshold rate_limit_helpers.test.cjs ✅ Design Edge case + observable side-effect
checkRateLimitHeadroom / should return -1 values and warn on error rate_limit_helpers.test.cjs ✅ Design Error path + fallback contract

Flagged Tests — Suggestions

⚠️ Config-constant tests in RATE_LIMIT_RETRY_CONFIG and LOW_RATE_LIMIT_THRESHOLD_PERCENT

Tests: should have 5 retries, should cap delay at 240s, should use isTransientError as shouldRetry, LOW_RATE_LIMIT_THRESHOLD_PERCENT / should be 20

Classification: Implementation — these four tests assert that exported config objects contain specific literal values or function references.

What design invariant do they enforce? They enforce that specific constants haven't changed, but not the observable behavior those constants produce. Deleting them would only be noticed if someone accidentally changed a constant — not if the retry logic itself broke.

Suggested improvement: Fold these constant checks into behavioral integration tests. For example, rather than asserting RATE_LIMIT_RETRY_CONFIG.maxRetries === 5, add a test that calls withRetry with a RATE_LIMIT_RETRY_CONFIG config and a mock that fails 5 times, then asserts the 6th call throws — proving the retry limit is actually enforced at runtime. This collapses multiple constant tests into one high-value contract test.
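
A self-contained sketch of such a contract test, using a hypothetical stand-in for withRetry (the real one in error_recovery.cjs also sleeps with backoff between attempts, which is omitted here to keep the test fast):

```javascript
// Minimal stand-in so the sketch runs on its own; not the real withRetry().
async function withRetry(fn, config) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= config.maxRetries || !config.shouldRetry(err)) throw err;
      // real code sleeps here (backoff + jitter)
    }
  }
}

// Behavioral contract: with maxRetries = 5, an always-failing call is
// attempted exactly 6 times (1 initial + 5 retries) before the error escapes.
async function demo() {
  let calls = 0;
  try {
    await withRetry(
      async () => { calls++; throw new Error("boom"); },
      { maxRetries: 5, shouldRetry: () => true }
    );
  } catch {
    console.log(`calls: ${calls}`); // calls: 6
  }
}
demo();
```

Deleting this test would be noticed the moment the retry limit stopped being enforced at runtime, which is exactly the property the constant assertions cannot catch.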


Score Breakdown

Component Weight Score
Behavioral Coverage (12/17 = 70.6%) 40 pts 28 pts
Error/Edge Case Coverage (10/17 = 58.8%) 30 pts 18 pts
Low Duplication (1 duplicate cluster) 20 pts 15 pts
Proportional Growth (max ratio 1.57, below 2:1) 10 pts 10 pts
Total 100 pts 71 pts

Inflation check: error_recovery.test.cjs +88 lines / error_recovery.cjs +75 lines = 1.17×; rate_limit_helpers.test.cjs +58 lines / rate_limit_helpers.cjs +37 lines = 1.57× — both within the 2:1 threshold.


Language Support

Tests analyzed:

  • 🟨 JavaScript (*.test.cjs): 17 tests (vitest)

No Go test files were added or modified in this PR.


Verdict

Check passed. 29.4% of new tests are implementation tests (threshold: 30%). The getRetryAfterMs and checkRateLimitHeadroom suites provide solid behavioral coverage with good edge-case handling. The main improvement opportunity is replacing the four config-constant assertions with end-to-end behavioral tests.


📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls or config constant values
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25219423361


Contributor

@github-actions github-actions Bot left a comment

✅ Test Quality Sentinel: 71/100. Test quality is acceptable — 29.4% of new tests are implementation tests (threshold: 30%).

Collaborator

pelikhan commented May 1, 2026

@copilot review all comments

Collaborator

pelikhan commented May 1, 2026

@copilot make sure to log rate limit lookups and retry into the rate limit log file

…ror details in catch

- getRetryAfterMs() now only reads Retry-After/x-ratelimit-reset headers when the
  response status indicates a rate-limit condition (429, or 403 with
  x-ratelimit-remaining: 0), preventing unintended backoff override on 5xx errors
- Use Math.floor instead of Math.round for percent calculation so 19.9% correctly
  triggers the low-headroom warning
- Include getErrorMessage(err) in checkRateLimitHeadroom catch block for better
  diagnostic visibility
- Update tests to cover status-gated behavior and Math.floor edge case

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/32cbeba1-59fa-4768-a053-1a50b0efd8a5

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot review all comments

Addressed all three review comments in 11f4fd3:

  1. getRetryAfterMs status-gating — now returns null for any response that is not a genuine rate-limit condition (HTTP 429, or HTTP 403 with x-ratelimit-remaining: 0). This prevents the 5xx-transient-error path from accidentally overriding exponential backoff with an x-ratelimit-reset header that GitHub includes on most responses.

  2. Math.floor instead of Math.round — a 19.9% remaining quota now correctly triggers the warning (rounding up to 20% was silently skipping it).

  3. Error details in catch: checkRateLimitHeadroom now includes getErrorMessage(err) in the warning so API/network/permission failures are visible in the job log.

- Add logRetryEvent() to github_rate_limit_logger.cjs: writes a JSONL
  entry with source='retry', attempt number, delay, HTTP status, and any
  x-ratelimit-* headers extracted from the error response. This lets
  retry storms be correlated with quota exhaustion in post-run analysis.
- Import and call logRetryEvent() in withRetry() (error_recovery.cjs)
  right before sleeping on each retry, capturing the error state at the
  moment of failure.
- In checkRateLimitHeadroom() (rate_limit_helpers.cjs), call
  logRateLimitFromResponse() on the rateLimit.get() response so every
  pre-flight headroom snapshot is also written to the JSONL log file.
- Tests: logRetryEvent unit tests in github_rate_limit_logger.test.cjs,
  JSONL-write verification in error_recovery.test.cjs and
  rate_limit_helpers.test.cjs.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/37e5ae29-9669-453f-9568-f4b68d6f1e98

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Contributor Author

Copilot AI commented May 1, 2026

@copilot make sure to log rate limit lookups and retry into the rate limit log file

Done in 3038814. Two changes:

  1. Retry events — added logRetryEvent() to github_rate_limit_logger.cjs (new export). withRetry() now calls it right before each sleep, writing a source: "retry" JSONL entry with the attempt number, delay, HTTP status, and any x-ratelimit-* headers extracted from the error response.

  2. Headroom lookups: checkRateLimitHeadroom() now calls logRateLimitFromResponse(response, operation) on the rateLimit.get() response, so every pre-flight quota snapshot is also captured in github_rate_limits.jsonl.

@pelikhan pelikhan merged commit 1d6403c into main May 1, 2026
@pelikhan pelikhan deleted the copilot/aw-failures-fix-token-rate-limiting branch May 1, 2026 17:04

Development

Successfully merging this pull request may close these issues.

[aw-failures] [P0] Installation token rate limiting causes silent loss of safe-output writes in concurrent burst windows

4 participants