Skip to content

Conversation

@ammar-agent
Copy link
Collaborator

Problem

Auth errors and other stream failures caused unlimited retries with no backoff due to a stream lifecycle timing bug. The retry counter was reset on stream-start (when stream initiates) but auth errors happen after stream-start on the backend, preventing exponential backoff from progressing.

Solution

Two-part fix:

  1. Move retry reset from stream-start to stream-end - Only reset attempt counter when stream completes successfully
  2. Increment counter on stream-error event - Ensures backoff progresses even for errors that happen after stream initiation

Created src/utils/messages/retryState.ts with utility functions for state transitions:

  • createManualRetryState() - preserves attempt counter for manual retries
  • createFailedRetryState() - increments attempt counter on failure
  • createFreshRetryState() - resets on successful completion

Testing

Verified exponential backoff progression with invalid API key:

  • Attempt 0 → 1s delay
  • Attempt 1 → 2s delay
  • Attempt 2 → 4s delay
  • Continues up to 60s cap

All tests pass (997 pass, 1 skip, 0 fail).

Cleanup

Final commit cleaned up debug logging (65 lines removed) while preserving window.__CMUX_FORCE_ALL_RETRYABLE debug flag for testing.

Generated with cmux

The bug: Manual retries were clearing retry state entirely, causing
subsequent auto-retries to start from attempt 0 with no backoff.

Example bug scenario:
1. Auto-retry fails 3 times (should wait 8s for next retry)
2. User clicks manual retry
3. Manual retry fails
4. BUG: Next auto-retry waits only 1s instead of 16s

Root cause: RetryBarrier.tsx line 97 was setting retry state to null,
which caused useResumeManager to use default state with attempt: 0.

The fix:
- Manual retries now preserve attempt counter while making retry immediate
- Created utility functions (retryState.ts) to encapsulate state transitions
- All retry state mutations now use these utilities (DRY)

Design improvements:
- createManualRetryState(): immediate retry, preserves backoff
- createFailedRetryState(): increments attempt, stores error
- createFreshRetryState(): resets on successful stream start

This makes the bug impossible to reintroduce because:
1. Utility functions enforce correct state transitions
2. Tests verify backoff progression through manual retries
3. Single source of truth for INITIAL_DELAY constant

Generated with `cmux`
Removed duplication of backoff calculation across 3 files by creating
a single source of truth in retryState.ts.

Before:
- INITIAL_DELAY duplicated in useResumeManager + retryState.ts
- MAX_DELAY duplicated in useResumeManager + RetryBarrier
- Backoff formula (INITIAL_DELAY * 2^attempt) duplicated in 2 places

After:
- All constants and backoff calculation in retryState.ts
- calculateBackoffDelay() helper used by all consumers
- RetryBarrier simplified: removed getDelay callback, uses shared helper

Benefits:
- Single place to change backoff formula
- DRY: 13 lines of duplication removed
- Simpler RetryBarrier (no longer calculates, just displays)
- Added 3 tests for calculateBackoffDelay() function

Net change: -5 LoC in production code, +3 test cases

Generated with `cmux`
Removed redundant assertions per review feedback:
- Removed pointless MAX_DELAY constant test
- Consolidated createFailedRetryState tests (3 → 1)
- Simplified manual retry backoff test (removed redundant calculations)
- Simplified scenario tests (removed verbose comments and intermediate checks)

Result: 10 tests, 22 assertions (down from 13 tests, 35 assertions)
Same coverage, clearer intent, less noise.

Generated with `cmux`
Added window.__CMUX_FORCE_ALL_RETRYABLE flag for testing retry/backoff logic.

Usage in browser console:
  window.__CMUX_FORCE_ALL_RETRYABLE = true

This makes non-retryable errors (authentication, quota, model_not_found, etc.)
eligible for auto-retry, allowing easy testing of:
- Exponential backoff progression
- Manual retry preserving attempt counter
- Retry UI states without needing to simulate network conditions

Generated with `cmux`
When window.__CMUX_FORCE_ALL_RETRYABLE is set after an error already occurred,
the stored retry state may contain a non-retryable error. Clicking the
"Retry" button once will clear this state and enable auto-retry.

This is expected behavior - the flag affects eligibility checks going forward,
but doesn't retroactively modify stored state.

Generated with `cmux`
Added console.debug statements to trace retry eligibility decisions:
- isNonRetryableSendError: shows error type and debug flag status
- isEligibleForAutoRetry: shows stream error type and retryability
- RetryBarrier effectiveAutoRetry: shows all inputs to the decision

Usage:
1. Open browser console
2. Set window.__CMUX_FORCE_ALL_RETRYABLE = true
3. Trigger an error or click Retry
4. Watch console output to see why it's not auto-retrying

Generated with `cmux`
Changed console.debug → console.log so retry eligibility logs are always
visible without needing to enable verbose logging in DevTools.

Also added render-time log showing barrier state (effectiveAutoRetry, autoRetry, attempt, countdown, lastError).

Generated with `cmux`
Logs are now only shown when window.__CMUX_FORCE_ALL_RETRYABLE is set,
eliminating console spam from polling every 1 second.

Exception: Non-retryable errors still log to help diagnose issues.

Now console is clean by default, verbose only when debugging retries.

Generated with `cmux`
Root cause: resumeStream() returning success meant "stream initiated",
not "stream completed". useResumeManager was clearing retry state too
early, causing:

1. Retry N fails → stored as attempt N+1
2. Retry N+1 starts → resumeStream succeeds → state cleared to null
3. stream-start event → WorkspaceStore resets to attempt 0
4. Stream fails → stored as attempt 1 (should be N+2!)

Result: Backoff never progresses beyond attempt 1 → no exponential backoff.

Fix: Don't clear retry state in useResumeManager on resumeStream success.
Let stream-start/stream-end events handle clearing (they already do via
createFreshRetryState). This preserves attempt counter until stream
actually completes successfully.

Generated with `cmux`
Added minimal logging to trace the actual retry state transitions:
- attemptResume: shows current attempt when retry is triggered
- resumeStream result: shows attempt increment on failure
- stream-start: shows when retry state is reset to 0

This will help identify where backoff progression is breaking.

Generated with `cmux`
When using custom host (e.g., VITE_HOST=100.127.77.14), Vite's HMR
WebSocket was trying to connect to localhost instead of the configured
host, breaking hot module reload.

Added explicit hmr configuration to use the same host/port as the server.

Generated with `cmux`
Root cause: stream-start fires when stream initiates, not when it completes.
Auth errors happen AFTER stream-start, so resetting retry state there caused
backoff to reset every retry attempt.

Flow before fix:
1. Retry attempt N → stream-start fires → reset to attempt=0
2. Auth error happens → stored as attempt=1
3. Next retry → reset to attempt=0 again (infinite loop at attempt=0)

Flow after fix:
1. Retry attempt N → stream-start fires → attempt=N preserved
2. Auth error happens → stored as attempt=N+1
3. Next retry → uses attempt=N+1 → exponential backoff works!

stream-start = "stream initiated" (could still fail)
stream-end = "stream completed successfully" (actually succeeded)

Only reset retry state when stream actually completes, not when it starts.

Generated with `cmux`
Root cause: resumeStream() returns success when stream INITIATES, but
auth errors happen AFTER on the backend. When backend sends stream-error
event, we weren't incrementing the attempt counter.

Result: attempt counter stayed at 0 forever → no backoff → spam.

Fix: Increment attempt counter when stream-error event fires. Now:
1. Stream starts → attempt=0 preserved
2. Auth error on backend → stream-error event → attempt=0 → 1
3. Next retry → reads attempt=1 → waits 2s
4. Stream starts → attempt=1 preserved
5. Auth error → stream-error → attempt=1 → 2
6. Next retry → reads attempt=2 → waits 4s (exponential backoff works!)

Generated with `cmux`
Previous commit only added imports. Now actually incrementing the
attempt counter in the stream-error handler.

Generated with `cmux`
- Downgraded all retry logs from console.log to console.debug
- Removed verbose logging from RetryBarrier.tsx (31 lines)
- Simplified isNonRetryableSendError with direct returns
- Simplified isEligibleForAutoRetry - removed redundant logs
- Removed success log from useResumeManager (not needed)
- Updated comment: stream-end (not stream-start) resets retry state

Removed 65 lines of debug logging while preserving the debug flag
functionality. All typecheck and tests pass.

Generated with `cmux`
- Remove unused useCallback from RetryBarrier
- Remove unused MAX_DELAY from retryState tests

Generated with `cmux`
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Simplified stream-error retry counter increment to use the callback
pattern instead of read-then-write, making the update atomic.

Before:
- readPersistedState to get current state
- Create new state object
- updatePersistedState with new state

After:
- updatePersistedState with callback that increments atomically

Also removed unused RetryState import from WorkspaceStore.

Generated with `cmux`
Added typeof window !== "undefined" checks before accessing
window.__CMUX_FORCE_ALL_RETRYABLE in:
- isNonRetryableSendError
- isEligibleForAutoRetry

This prevents ReferenceError when these utilities are imported
in Node.js test environments (jest.config.js uses testEnvironment: "node").

Generated with `cmux`
Extract the duplicated window.__CMUX_FORCE_ALL_RETRYABLE check
into a helper function to follow DRY principles.

Before:
- typeof window !== "undefined" && window.__CMUX_FORCE_ALL_RETRYABLE
  duplicated in isNonRetryableSendError and isEligibleForAutoRetry

After:
- Single isForceAllRetryableEnabled() helper used by both functions

Generated with `cmux`
@ammario ammario changed the title 🤖 fix: implement exponential backoff for retry system 🤖 fix: prevent retry spam by waiting until stream-end to reset retry state Nov 10, 2025
@ammario ammario merged commit bb2c785 into main Nov 10, 2025
13 checks passed
@ammario ammario deleted the fix-retries branch November 10, 2025 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants