Description
When the upstream LLM provider (in my case GitHub Copilot / api.githubcopilot.com) returns transient errors, the session processor enters an infinite retry loop with exponential backoff that grows unbounded. There is no maximum retry count, no circuit breaker, and no configurable limit. Sessions become effectively dead for hours.
Root Cause Analysis
In packages/opencode/src/session/processor.ts, the process() method has a while (true) loop. When a retryable error occurs in the catch block:
const retry = SessionRetry.retryable(error)
if (retry !== undefined) {
  attempt++
  const delay = SessionRetry.delay(attempt, ...)
  SessionStatus.set(input.sessionID, {
    type: "retry",
    attempt,
    message: retry,
    next: Date.now() + delay,
  })
  await SessionRetry.sleep(delay, input.abort).catch(() => {})
  continue // ← loops forever
}
The attempt counter increments without limit, and in retry.ts, the delay function has no cap when response headers ARE present:
// WITH headers → no max delay cap!
return RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1)
// 2000 * 2^(attempt-1) → attempt 10 = 1,024,000ms (17 min), attempt 15 = 32,768,000ms (9 hrs)
Only the NO-headers path has a 30-second cap (RETRY_MAX_DELAY_NO_HEADERS). When the upstream sends error responses WITH headers (as GitHub Copilot does), the backoff grows without bound.
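To make the growth concrete, here is a minimal standalone sketch. It assumes the constants implied by the comment above (an initial delay of 2000 ms and a backoff factor of 2); the actual values and function signature in retry.ts may differ. The helper names uncappedDelay and cappedDelay are hypothetical and exist only for illustration.

```typescript
// Assumed values, taken from the "2000 * 2^(attempt-1)" comment in this issue.
const RETRY_INITIAL_DELAY = 2000
const RETRY_BACKOFF_FACTOR = 2

// Hypothetical helper: the with-headers path as described above, with no upper bound.
function uncappedDelay(attempt: number): number {
  return RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1)
}

// Hypothetical helper: the same curve clamped to a maximum delay.
function cappedDelay(attempt: number, maxDelayMs: number): number {
  return Math.min(uncappedDelay(attempt), maxDelayMs)
}

// Print attempt number, uncapped delay, and delay with a 60-second cap.
for (const attempt of [1, 5, 10, 15]) {
  console.log(attempt, uncappedDelay(attempt), cappedDelay(attempt, 60_000))
}
```

With a 60-second cap, attempts 10 and 15 both wait 60,000 ms instead of 1,024,000 ms and 32,768,000 ms.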
Evidence
From my logs (~/.local/share/opencode/log/):
- 173 consecutive retry failures over 2.5 hours (16:46–19:27 UTC on 2026-03-15)
- Provider: github-copilot, model: claude-opus-4.6, endpoint: api.githubcopilot.com/chat/completions
- Error: AI_APICallError: Could not relay message upstream
- Backoff grew from ~2s (attempt #1) to 7+ minutes (attempt #10) and continued growing
- Session was completely unresponsive — had to manually kill and restart
Related Issues
- Endless loop of TimeoutOverflowWarning when exceeding Github Copilot monthly message limit #6430: fixed the TimeoutOverflowWarning when the delay exceeds a 32-bit int, but did NOT add a max retry count
- [Bug]: Model rate limit error causes immediate termination instead of auto-retry #16994: rate limit errors not being detected as retryable (the opposite problem: errors that SHOULD retry don't)
Expected Behavior
- Maximum retry count: After N retries (configurable, default ~10), stop retrying and surface the error to the user
- Maximum backoff cap: Even with headers, cap at e.g. 60 seconds (the 30s RETRY_MAX_DELAY_NO_HEADERS should apply universally)
- Circuit breaker: After repeated failures, stop attempting for a longer period and notify the user rather than silently blocking
- Configurable: Allow users to set maxRetries and maxBackoffMs in the opencode.json provider config
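The first two expectations (a retry limit and a universal backoff cap) can be sketched as a small standalone helper. This is an illustration under stated assumptions, not OpenCode's code: withBoundedRetry, RetryOptions, initialDelayMs, backoffFactor, isRetryable, and sleepFn are hypothetical names, and only maxRetries and maxBackoffMs come from the settings proposed in this issue.

```typescript
// Hypothetical options; maxRetries and maxBackoffMs mirror the proposed settings.
interface RetryOptions {
  maxRetries: number
  maxBackoffMs: number
  initialDelayMs: number
  backoffFactor: number
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

// Runs fn, retrying retryable failures at most maxRetries times with a capped
// exponential delay. isRetryable and sleepFn are injected so the sketch is testable.
async function withBoundedRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  options: RetryOptions,
  sleepFn: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let attempt = 0
  while (true) {
    try {
      return await fn()
    } catch (error) {
      // Non-retryable errors surface immediately.
      if (!isRetryable(error)) throw error
      attempt++
      // Retry budget exhausted: surface the last error instead of looping forever.
      if (attempt > options.maxRetries) throw error
      // Exponential backoff, always clamped to maxBackoffMs.
      const delay = Math.min(
        options.initialDelayMs * Math.pow(options.backoffFactor, attempt - 1),
        options.maxBackoffMs,
      )
      await sleepFn(delay)
    }
  }
}
```

With maxRetries set to 3, a call that always fails is attempted four times in total (the initial call plus three retries) before the error reaches the caller. A real implementation would also need to honor the abort signal that the existing sleep call receives.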
Proposed Fix
In retry.ts:
export const RETRY_MAX_ATTEMPTS = 10 // NEW: stop after 10 retries
export const RETRY_MAX_DELAY_WITH_HEADERS = 60_000 // NEW: cap at 60s even with headers
In processor.ts, inside the catch block:
if (retry !== undefined) {
  attempt++
  if (attempt > SessionRetry.RETRY_MAX_ATTEMPTS) {
    // Circuit breaker: stop retrying, surface error
    input.assistantMessage.error = error
    Bus.publish(Session.Event.Error, { sessionID: input.sessionID, error })
    SessionStatus.set(input.sessionID, { type: "idle" })
    break // ← EXIT the while(true) loop
  }
  const delay = SessionRetry.delay(attempt, ...)
  // ...
}
In the retry.ts delay function, apply the cap universally:
return Math.min(
  RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1),
  RETRY_MAX_DELAY_WITH_HEADERS // ← always cap
)
Environment
- OpenCode: Latest (installed via curl -fsSL https://opencode.ai/install | bash)
- OS: macOS Sequoia 15.x
- Terminal: iTerm2 / tmux
- Provider: GitHub Copilot (github-copilot)