Fix silent safe-output write failures under installation token rate limiting #29543
Conversation
… 5-retry backoff, headroom monitoring

- Add RATE_LIMIT_RETRY_CONFIG (5 retries, 30s→60s→120s→240s backoff)
- Add getRetryAfterMs() to parse Retry-After/x-ratelimit-reset headers
- Respect Retry-After in withRetry delay calculation
- Add checkRateLimitHeadroom() warning at <20% quota remaining
- Wrap add_labels, lock-issue, create_issue API calls with rate-limit retry
- Pre-check rate-limit headroom before processing safe outputs
- Update tests

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/faaeb54c-db6a-44f0-b3a3-b25cd24bca1e
Co-authored-by: gh-aw-bot <259018956+gh-aw-bot@users.noreply.github.com>
Pull request overview
Improves reliability and observability of safe-output GitHub write operations under installation-token rate limiting by adding server-honored retry delays and a pre-flight rate-limit headroom check.
Changes:
- Add `Retry-After`/`x-ratelimit-reset` header support and introduce a dedicated `RATE_LIMIT_RETRY_CONFIG` for longer backoff retries.
- Apply the new rate-limit retry config to the `create_issue`, `add_labels`, and `lock-issue` write paths.
- Add a pre-flight `checkRateLimitHeadroom()` warning before safe-output processing, with accompanying unit tests.
Summary per file:
| File | Description |
|---|---|
| actions/setup/js/safe_output_handler_manager.cjs | Calls a new pre-flight rate-limit headroom check before processing safe-output messages. |
| actions/setup/js/rate_limit_helpers.cjs | Adds LOW_RATE_LIMIT_THRESHOLD_PERCENT and checkRateLimitHeadroom() logging/return values. |
| actions/setup/js/rate_limit_helpers.test.cjs | Adds unit tests for the new threshold constant and checkRateLimitHeadroom() behavior. |
| actions/setup/js/error_recovery.cjs | Adds RATE_LIMIT_RETRY_CONFIG, getRetryAfterMs(), and updates withRetry() to honor server retry headers. |
| actions/setup/js/error_recovery.test.cjs | Adds unit tests for the new retry config, header parsing, and withRetry() header behavior. |
| actions/setup/js/create_issue.cjs | Switches issue creation retries to use RATE_LIMIT_RETRY_CONFIG. |
| actions/setup/js/add_labels.cjs | Wraps label application in withRetry() using RATE_LIMIT_RETRY_CONFIG. |
| actions/setup/js/add_labels.test.cjs | Updates mocks/assertions to account for retry logging behavior. |
| actions/setup/js/lock-issue.cjs | Wraps issues.get and issues.lock in withRetry() using RATE_LIMIT_RETRY_CONFIG. |
| actions/setup/js/lock-issue.test.cjs | Loosens assertions to accommodate updated error messaging behavior. |
Copilot's findings
- Files reviewed: 10/10 changed files
- Comments generated: 3
```js
/**
 * Extract the Retry-After delay in milliseconds from a GitHub API rate-limit error.
 *
 * GitHub returns one of two headers on 429 / 403-rate-limit responses:
 * - `retry-after` – integer seconds to wait (per RFC 6585)
 * - `x-ratelimit-reset` – Unix timestamp (seconds) when the quota resets
 *
 * @param {any} error - The error object from a failed GitHub API call
 * @returns {number|null} Milliseconds to wait, or null if no header is present
 */
function getRetryAfterMs(error) {
  // Octokit surfaces response headers via error.response.headers or error.headers
  const headers = error?.response?.headers ?? error?.headers ?? null;
  if (!headers) return null;

  // retry-after: number of seconds (highest priority)
  const retryAfter = headers["retry-after"];
  if (retryAfter != null) {
    const seconds = parseInt(retryAfter, 10);
    if (!isNaN(seconds) && seconds > 0) {
      return seconds * 1000;
    }
  }

  // x-ratelimit-reset: Unix timestamp — derive wait time from clock delta
  const resetAt = headers["x-ratelimit-reset"];
  if (resetAt != null) {
    const resetTimestampMs = parseInt(resetAt, 10) * 1000;
    if (!isNaN(resetTimestampMs)) {
      const waitMs = resetTimestampMs - Date.now();
      if (waitMs > 0) {
        return waitMs;
      }
    }
  }

  // No usable header present
  return null;
}
```
getRetryAfterMs() falls back to x-ratelimit-reset whenever that header is present. GitHub includes x-ratelimit-reset on most API responses (including non-rate-limit failures like 5xx/502), so withRetry() will start honoring that reset time for unrelated transient errors and effectively override the intended exponential backoff (often immediately capping to maxDelayMs). Consider only using retry-after/x-ratelimit-reset when the response status indicates a rate-limit condition (e.g. 429, or 403 with x-ratelimit-remaining: 0 / message contains "rate limit"), otherwise return null so normal backoff applies.
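A minimal sketch of the suggested gate, reusing the error shape from the snippet above (the helper name and exact condition are illustrative, not from the PR):

```js
// Sketch: only honor server retry headers for genuine rate-limit responses.
function isRateLimitResponse(error) {
  const status = error?.status ?? error?.response?.status;
  const headers = error?.response?.headers ?? error?.headers ?? {};
  if (status === 429) return true;
  // GitHub reports primary rate limiting as a 403 with an exhausted quota
  return status === 403 && String(headers["x-ratelimit-remaining"]) === "0";
}

// Then, at the top of getRetryAfterMs():
//   if (!isRateLimitResponse(error)) return null;
```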
```js
try {
  const { data } = await github.rest.rateLimit.get();
  const { remaining, limit } = data.rate;
  const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;
```
checkRateLimitHeadroom() uses Math.round((remaining / limit) * 100). Rounding can hide low-headroom cases right below the threshold (e.g. 19.6% rounds to 20% and won't warn even though it's below a 20% threshold). Using Math.floor (or comparing using the raw ratio) would make the warning behavior match the documented "when < 20% remains" intent.
```diff
- const percentRemaining = limit > 0 ? Math.round((remaining / limit) * 100) : 100;
+ const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;
```
```js
  return { remaining, limit, percentRemaining };
} catch {
  core.warning(`Could not check rate-limit headroom for ${operation}`);
  return { remaining: -1, limit: -1, percentRemaining: -1 };
```
The catch { ... } branch logs Could not check rate-limit headroom for ${operation} but drops the underlying error details. Since this is intended for observability, including getErrorMessage(err) (or similar) would make diagnosing permission/network/API issues much easier.
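A sketch of what the review suggests, assuming a `getErrorMessage()` helper is in scope (the follow-up commit references one):

```js
} catch (err) {
  // Keep the underlying cause for observability instead of dropping it
  core.warning(`Could not check rate-limit headroom for ${operation}: ${getErrorMessage(err)}`);
  return { remaining: -1, limit: -1, percentRemaining: -1 };
}
```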
🧪 Test Quality Sentinel Report

Test Quality Score: 71/100
Test Classification Details (17 tests classified)
Flagged Tests — Suggestions
| Component | Weight | Score |
|---|---|---|
| Behavioral Coverage (12/17 = 70.6%) | 40 pts | 28 pts |
| Error/Edge Case Coverage (10/17 = 58.8%) | 30 pts | 18 pts |
| Low Duplication (1 duplicate cluster) | 20 pts | 15 pts |
| Proportional Growth (max ratio 1.57, below 2:1) | 10 pts | 10 pts |
| Total | 100 pts | 71 pts |
Inflation check: `error_recovery.test.cjs` +88 lines / `error_recovery.cjs` +75 lines = 1.17×; `rate_limit_helpers.test.cjs` +58 lines / `rate_limit_helpers.cjs` +37 lines = 1.57× — both within the 2:1 threshold.
Language Support

Tests analyzed:
- 🟨 JavaScript (`*.test.cjs`): 17 tests (vitest)

No Go test files were added or modified in this PR.
Verdict
✅ Check passed. 29.4% of new tests are implementation tests (threshold: 30%). The `getRetryAfterMs` and `checkRateLimitHeadroom` suites provide solid behavioral coverage with good edge-case handling. The main improvement opportunity is replacing the four config-constant assertions with end-to-end behavioral tests (see the sketch below).
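A hedged sketch of what such a behavioral test might look like in a `.test.cjs` file. It assumes vitest with `globals: true` (so `describe`/`it`/`expect`/`vi` are ambient), a `withRetry(fn, config)` signature, setTimeout-based sleeps, and an illustrative require path; none of these are confirmed by the PR:

```js
// Hypothetical behavioral test: assert what withRetry does, not what the config contains.
const { withRetry, RATE_LIMIT_RETRY_CONFIG } = require("./error_recovery.cjs");

describe("withRetry with RATE_LIMIT_RETRY_CONFIG", () => {
  it("keeps retrying a 429 until the call succeeds", async () => {
    vi.useFakeTimers();
    let calls = 0;
    const flaky = async () => {
      calls += 1;
      if (calls < 3) {
        const err = new Error("API rate limit exceeded");
        err.status = 429; // mimic the Octokit error shape
        throw err;
      }
      return "ok";
    };
    const result = withRetry(flaky, RATE_LIMIT_RETRY_CONFIG);
    await vi.runAllTimersAsync(); // fast-forward through the long backoff sleeps
    await expect(result).resolves.toBe("ok");
    expect(calls).toBe(3); // behavioral contract: two failures, then success
    vi.useRealTimers();
  });
});
```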
📖 Understanding Test Classifications
Design Tests (High Value) verify what the system does:
- Assert on observable outputs, return values, or state changes
- Cover error paths and boundary conditions
- Would catch a behavioral regression if deleted
- Remain valid even after internal refactoring
Implementation Tests (Low Value) verify how the system does it:
- Assert on internal function calls or config constant values
- Only test the happy path with typical inputs
- Break during legitimate refactoring even when behavior is correct
- Give false assurance: they pass even when the system is wrong
Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.
References: §25219423361
🧪 Test quality analysis by Test Quality Sentinel
@copilot review all comments
@copilot make sure to log rate limit lookups and retries into the rate limit log file
…ror details in catch

- getRetryAfterMs() now only reads Retry-After/x-ratelimit-reset headers when the response status indicates a rate-limit condition (429, or 403 with x-ratelimit-remaining: 0), preventing unintended backoff override on 5xx errors
- Use Math.floor instead of Math.round for the percent calculation so 19.9% correctly triggers the low-headroom warning
- Include getErrorMessage(err) in the checkRateLimitHeadroom catch block for better diagnostic visibility
- Update tests to cover the status-gated behavior and the Math.floor edge case

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/32cbeba1-59fa-4768-a053-1a50b0efd8a5
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Addressed all three review comments in 11f4fd3:
- Add logRetryEvent() to github_rate_limit_logger.cjs: writes a JSONL entry with source='retry', attempt number, delay, HTTP status, and any x-ratelimit-* headers extracted from the error response. This lets retry storms be correlated with quota exhaustion in post-run analysis.
- Import and call logRetryEvent() in withRetry() (error_recovery.cjs) right before sleeping on each retry, capturing the error state at the moment of failure.
- In checkRateLimitHeadroom() (rate_limit_helpers.cjs), call logRateLimitFromResponse() on the rateLimit.get() response so every pre-flight headroom snapshot is also written to the JSONL log file.
- Tests: logRetryEvent unit tests in github_rate_limit_logger.test.cjs, JSONL-write verification in error_recovery.test.cjs and rate_limit_helpers.test.cjs.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/37e5ae29-9669-453f-9568-f4b68d6f1e98
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done in 3038814. Two changes: `withRetry()` now writes a retry event to the JSONL log via `logRetryEvent()` before each retry sleep, and `checkRateLimitHeadroom()` logs each pre-flight quota snapshot via `logRateLimitFromResponse()`.
When 10+ scheduled workflows complete concurrently, the burst of installation-token API calls from safe-output jobs exhausts the 60 req/min quota. The current 3-retry / ~90s window is too short — the token stays exhausted for 20+ minutes, causing `create_issue`, `add_labels`, and `lock-issue` writes to fail after all retries without the agent knowing.

Changes
Retry-After header support (`error_recovery.cjs`)

- `getRetryAfterMs(error)`: extracts the server-supplied wait time from `Retry-After` (seconds) or `x-ratelimit-reset` (Unix timestamp) headers, but only when the response status indicates a genuine rate-limit condition (HTTP 429, or HTTP 403 with `x-ratelimit-remaining: 0`). Non-rate-limit errors (e.g. 502) continue to use normal exponential backoff.
- `withRetry()` now uses the server-dictated delay instead of the calculated backoff when the header is present, and logs when capping occurs. A sketch follows this list.
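A rough sketch of that precedence logic. The config fields `maxRetries`, `initialDelayMs`, and `maxDelayMs` are assumptions, as is the assumption that `core` (the Actions logger) and `getRetryAfterMs()` from above are in scope; the real `withRetry()` may differ:

```js
// Sketch: prefer the server-supplied wait over computed exponential backoff.
async function withRetry(fn, config) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= config.maxRetries) throw err;
      // Doubling backoff, capped at maxDelayMs (jitter omitted for brevity)
      const backoffMs = Math.min(config.initialDelayMs * 2 ** attempt, config.maxDelayMs);
      const serverMs = getRetryAfterMs(err); // null unless a genuine rate-limit response
      let delayMs = serverMs ?? backoffMs;
      if (serverMs != null && serverMs > config.maxDelayMs) {
        core.info(`Capping server-requested delay of ${serverMs}ms to ${config.maxDelayMs}ms`);
        delayMs = config.maxDelayMs;
      }
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```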
Dedicated rate-limit retry config (`error_recovery.cjs`)

New `RATE_LIMIT_RETRY_CONFIG`: 5 retries with ~30s→60s→120s→240s backoff (5s jitter per retry to spread concurrent retries), replacing the old 3-retry / 45s-cap inline configs. A sketch of its likely shape follows.
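The field names below are assumptions; only the retry count and delay schedule come from the PR:

```js
// Sketch of the config shape described above.
const RATE_LIMIT_RETRY_CONFIG = {
  maxRetries: 5,          // up from the old 3-retry inline configs
  initialDelayMs: 30_000, // first wait ~30s
  backoffFactor: 2,       // 30s → 60s → 120s → 240s
  maxDelayMs: 240_000,    // cap (the old inline configs capped at 45s)
  jitterMs: 5_000,        // per-retry jitter to spread concurrent retries
};
```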
Apply rate-limit retry to affected handlers

- `add_labels.cjs`: previously had no retry at all — wraps `addLabels` with `RATE_LIMIT_RETRY_CONFIG`
- `lock-issue.cjs`: wraps both `issues.get` and `issues.lock`
- `create_issue.cjs`: upgrades from the 3-retry/45s inline config to `RATE_LIMIT_RETRY_CONFIG`

A usage sketch follows this list.
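For one handler, the wrapping might look like this. The `withRetry(fn, config)` signature and the surrounding variable names are assumptions; the Octokit call itself is standard:

```js
// Sketch: wrap a GitHub write so rate-limit failures get the long backoff schedule.
await withRetry(
  () => github.rest.issues.addLabels({ owner, repo, issue_number: issueNumber, labels }),
  RATE_LIMIT_RETRY_CONFIG
);
```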
Pre-flight headroom check (`safe_output_handler_manager.cjs`, `rate_limit_helpers.cjs`)

`checkRateLimitHeadroom()`: fetches the current quota and emits a warning when remaining drops below 20% of the limit (uses `Math.floor` so values like 19.9% correctly trigger the warning). A sketch follows.
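Assembling the review fragments above with the follow-up fixes, the check is roughly this. The signature, warning text, and the assumption that `github`, `core`, and `getErrorMessage()` are in scope are all illustrative; `LOW_RATE_LIMIT_THRESHOLD_PERCENT` comes from the file summary:

```js
const LOW_RATE_LIMIT_THRESHOLD_PERCENT = 20;

// Sketch: pre-flight quota check before processing safe outputs.
async function checkRateLimitHeadroom(operation) {
  try {
    const { data } = await github.rest.rateLimit.get();
    const { remaining, limit } = data.rate;
    // Math.floor so 19.9% stays below the 20% threshold and still warns
    const percentRemaining = limit > 0 ? Math.floor((remaining / limit) * 100) : 100;
    if (percentRemaining < LOW_RATE_LIMIT_THRESHOLD_PERCENT) {
      core.warning(`Low rate-limit headroom for ${operation}: ${remaining}/${limit} (${percentRemaining}%)`);
    }
    return { remaining, limit, percentRemaining };
  } catch (err) {
    core.warning(`Could not check rate-limit headroom for ${operation}: ${getErrorMessage(err)}`);
    return { remaining: -1, limit: -1, percentRemaining: -1 };
  }
}
```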
github_rate_limit_logger.cjs)logRetryEvent(error, operation, attempt, delayMs): writes asource: "retry"JSONL entry togithub_rate_limits.jsonlbefore each retry sleep, capturing the attempt number, delay, HTTP status, and anyx-ratelimit-*headers from the error response. This allows post-run analysis to correlate retry storms with quota exhaustion.withRetry()callslogRetryEvent()on every retry attempt.checkRateLimitHeadroom()now also callslogRateLimitFromResponse()on therateLimit.get()response, so every pre-flight quota snapshot is written to the JSONL log file (single API call, no extra round-trip).
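For illustration, one retry entry in `github_rate_limits.jsonl` might look like this. Only the fields named above come from the PR; the exact key names and nesting are assumptions:

```json
{"source":"retry","operation":"add_labels","attempt":2,"delayMs":60000,"status":429,"headers":{"x-ratelimit-remaining":"0","x-ratelimit-reset":"1717430400"}}
```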