[Bug]: Session processor retries indefinitely with unbounded exponential backoff — no max retries or circuit breaker #17648

@dawidbednarczyk

Description

When the upstream LLM provider (in my case GitHub Copilot / api.githubcopilot.com) returns transient errors, the session processor enters an infinite retry loop with exponential backoff that grows unbounded. There is no maximum retry count, no circuit breaker, and no configurable limit. Sessions become effectively dead for hours.

Root Cause Analysis

In packages/opencode/src/session/processor.ts, the process() method has a while (true) loop. When a retryable error occurs in the catch block:

const retry = SessionRetry.retryable(error)
if (retry !== undefined) {
    attempt++
    const delay = SessionRetry.delay(attempt, ...)
    SessionStatus.set(input.sessionID, {
        type: "retry",
        attempt,
        message: retry,
        next: Date.now() + delay,
    })
    await SessionRetry.sleep(delay, input.abort).catch(() => {})
    continue  // ← loops forever
}

The attempt counter increments without limit, and in retry.ts the delay function applies no cap when response headers ARE present:

// WITH headers → no max delay cap!
return RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1)
// 2000 * 2^(attempt-1) → attempt 10 = 1,024,000ms (17 min), attempt 15 = 32,768,000ms (9 hrs)

Only the NO-headers path has a 30-second cap (RETRY_MAX_DELAY_NO_HEADERS). When the upstream sends error responses WITH headers (as GitHub Copilot does), the backoff grows without bound.
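
The growth described above can be reproduced in isolation. What follows is a minimal sketch, not the actual retry.ts: the constant values are assumptions inferred from the comment in the snippet (2000 ms initial delay, factor 2) and from the 30-second no-headers cap mentioned in this report, and the helper names uncappedDelay and cappedDelay are hypothetical.

```typescript
// Assumed values inferred from this report; the real constants live in retry.ts.
const RETRY_INITIAL_DELAY = 2000 // ms
const RETRY_BACKOFF_FACTOR = 2
const RETRY_MAX_DELAY_NO_HEADERS = 30_000 // ms

// Hypothetical helper: mirrors the uncapped with-headers formula quoted above.
function uncappedDelay(attempt: number): number {
  return RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1)
}

// Hypothetical helper: same formula, clamped to a maximum delay.
function cappedDelay(attempt: number, maxDelay: number = RETRY_MAX_DELAY_NO_HEADERS): number {
  return Math.min(uncappedDelay(attempt), maxDelay)
}

// Print both delays side by side for a few attempts.
for (const attempt of [1, 5, 10, 15]) {
  console.log(`attempt ${attempt}: uncapped=${uncappedDelay(attempt)}ms capped=${cappedDelay(attempt)}ms`)
}
```

Under these assumed constants the uncapped value passes 30 seconds at attempt 5, so any cap in that range takes effect early in a failure streak.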

Evidence

From my logs (~/.local/share/opencode/log/):

  • 173 consecutive retry failures over 2.5 hours (16:46–19:27 UTC on 2026-03-15)
  • Provider: github-copilot, model: claude-opus-4.6, endpoint: api.githubcopilot.com/chat/completions
  • Error: AI_APICallError: Could not relay message upstream
  • Backoff grew from ~2s (attempt #1) to 7+ minutes (attempt #10) and continued growing
  • Session was completely unresponsive — had to manually kill and restart

Related Issues

Expected Behavior

  1. Maximum retry count — After N retries (configurable, default ~10), stop retrying and surface the error to the user
  2. Maximum backoff cap — Even with headers, cap the delay at e.g. 60 seconds (alternatively, apply the existing 30s RETRY_MAX_DELAY_NO_HEADERS universally)
  3. Circuit breaker — After repeated failures, stop attempting for a longer period and notify the user rather than silently blocking
  4. Configurable — Allow users to set maxRetries and maxBackoffMs in opencode.json provider config
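
To make item 3 concrete, here is one possible shape for a circuit breaker. This is a simplified two-state sketch (no half-open probing) under my own assumptions: the class, its method names, and its parameters are hypothetical and do not exist in opencode.

```typescript
// Hypothetical circuit breaker; none of these names exist in opencode.
type BreakerState = "closed" | "open"

class CircuitBreaker {
  private failures = 0
  private openedAt: number | undefined = undefined
  private readonly failureThreshold: number
  private readonly cooldownMs: number
  private readonly now: () => number

  constructor(failureThreshold: number, cooldownMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold // consecutive failures before opening
    this.cooldownMs = cooldownMs // how long to refuse calls once open
    this.now = now // injectable clock, useful for tests
  }

  // Current state; an open breaker closes again once the cooldown has elapsed.
  state(): BreakerState {
    if (this.openedAt !== undefined && this.now() - this.openedAt >= this.cooldownMs) {
      this.openedAt = undefined
      this.failures = 0
    }
    return this.openedAt === undefined ? "closed" : "open"
  }

  // A success resets the failure streak and closes the breaker.
  recordSuccess(): void {
    this.failures = 0
    this.openedAt = undefined
  }

  // A failure extends the streak; reaching the threshold opens the breaker.
  recordFailure(): void {
    this.failures++
    if (this.failures >= this.failureThreshold && this.openedAt === undefined) {
      this.openedAt = this.now()
    }
  }
}
```

A caller would check state() before each request and, while it reports "open", surface the error to the user instead of scheduling another retry.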

Proposed Fix

In retry.ts:

export const RETRY_MAX_ATTEMPTS = 10  // NEW: stop after 10 retries
export const RETRY_MAX_DELAY_WITH_HEADERS = 60_000  // NEW: cap at 60s even with headers

In processor.ts, inside the catch block:

if (retry !== undefined) {
    attempt++
    if (attempt > SessionRetry.RETRY_MAX_ATTEMPTS) {
        // Circuit breaker: stop retrying, surface error
        input.assistantMessage.error = error
        Bus.publish(Session.Event.Error, { sessionID: input.sessionID, error })
        SessionStatus.set(input.sessionID, { type: "idle" })
        break  // ← EXIT the while(true) loop
    }
    const delay = SessionRetry.delay(attempt, ...)
    // ...
}

In retry.ts delay function, apply cap universally:

return Math.min(
    RETRY_INITIAL_DELAY * Math.pow(RETRY_BACKOFF_FACTOR, attempt - 1),
    RETRY_MAX_DELAY_WITH_HEADERS  // ← always cap
)
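
Putting the two changes together, the intended behavior can be sketched as a standalone bounded retry loop. This is an illustration under my own assumptions, not the actual processor.ts: retryBounded, RetryOptions, sleep, and isRetryable are hypothetical names, and unlike the current code (which swallows the sleep rejection) this sketch lets an abort during the sleep propagate out of the loop.

```typescript
// Hypothetical, self-contained sketch; not the actual processor.ts or retry.ts.
interface RetryOptions {
  maxAttempts: number // retries allowed after the initial call
  initialDelayMs: number
  backoffFactor: number
  maxDelayMs: number // hard cap, applied on every path
}

// Resolves after `ms`, or rejects early if the signal aborts.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) return reject(new Error("aborted"))
    const onAbort = () => {
      clearTimeout(timer)
      reject(new Error("aborted"))
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort)
      resolve()
    }, ms)
    signal?.addEventListener("abort", onAbort, { once: true })
  })
}

// Calls `fn` until it succeeds, hits a non-retryable error, or exhausts the retry budget.
async function retryBounded<T>(
  fn: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  opts: RetryOptions,
  signal?: AbortSignal,
): Promise<T> {
  let attempt = 0
  while (true) {
    try {
      return await fn()
    } catch (error) {
      attempt++
      // Same check as the proposed fix: give up once attempt exceeds the limit.
      if (!isRetryable(error) || attempt > opts.maxAttempts) throw error
      const delay = Math.min(
        opts.initialDelayMs * Math.pow(opts.backoffFactor, attempt - 1),
        opts.maxDelayMs,
      )
      await sleep(delay, signal)
    }
  }
}
```

With maxAttempts set to 10 and maxDelayMs set to 60_000, a permanently failing call is attempted 11 times in total (one initial call plus ten retries) and then rethrows the last error, so the caller can surface it instead of waiting indefinitely.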

Environment

  • OpenCode: Latest (installed via curl -fsSL https://opencode.ai/install | bash)
  • OS: macOS Sequoia 15.x
  • Terminal: iTerm2 / tmux
  • Provider: GitHub Copilot (github-copilot)

Labels

  • core: Anything pertaining to core functionality of the application (opencode server stuff)
  • perf: Indicates a performance issue or need for optimization
