-
Notifications
You must be signed in to change notification settings - Fork 13
🤖 fix: prevent retry spam by waiting until stream-end to reset retry state #548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The bug: Manual retries were clearing retry state entirely, causing subsequent auto-retries to start from attempt 0 with no backoff. Example bug scenario: 1. Auto-retry fails 3 times (should wait 8s for next retry) 2. User clicks manual retry 3. Manual retry fails 4. BUG: Next auto-retry waits only 1s instead of 16s Root cause: RetryBarrier.tsx line 97 was setting retry state to null, which caused useResumeManager to use default state with attempt: 0. The fix: - Manual retries now preserve attempt counter while making retry immediate - Created utility functions (retryState.ts) to encapsulate state transitions - All retry state mutations now use these utilities (DRY) Design improvements: - createManualRetryState(): immediate retry, preserves backoff - createFailedRetryState(): increments attempt, stores error - createFreshRetryState(): resets on successful stream start This makes the bug impossible to reintroduce because: 1. Utility functions enforce correct state transitions 2. Tests verify backoff progression through manual retries 3. Single source of truth for INITIAL_DELAY constant Generated with `cmux`
Removed duplication of backoff calculation across 3 files by creating a single source of truth in retryState.ts. Before: - INITIAL_DELAY duplicated in useResumeManager + retryState.ts - MAX_DELAY duplicated in useResumeManager + RetryBarrier - Backoff formula (INITIAL_DELAY * 2^attempt) duplicated in 2 places After: - All constants and backoff calculation in retryState.ts - calculateBackoffDelay() helper used by all consumers - RetryBarrier simplified: removed getDelay callback, uses shared helper Benefits: - Single place to change backoff formula - DRY: 13 lines of duplication removed - Simpler RetryBarrier (no longer calculates, just displays) - Added 3 tests for calculateBackoffDelay() function Net change: -5 LoC in production code, +3 test cases Generated with `cmux`
Removed redundant assertions per review feedback: - Removed pointless MAX_DELAY constant test - Consolidated createFailedRetryState tests (3 → 1) - Simplified manual retry backoff test (removed redundant calculations) - Simplified scenario tests (removed verbose comments and intermediate checks) Result: 10 tests, 22 assertions (down from 13 tests, 35 assertions) Same coverage, clearer intent, less noise. Generated with `cmux`
Added window.__CMUX_FORCE_ALL_RETRYABLE flag for testing retry/backoff logic. Usage in browser console: window.__CMUX_FORCE_ALL_RETRYABLE = true This makes non-retryable errors (authentication, quota, model_not_found, etc.) eligible for auto-retry, allowing easy testing of: - Exponential backoff progression - Manual retry preserving attempt counter - Retry UI states without needing to simulate network conditions Generated with `cmux`
When window.__CMUX_FORCE_ALL_RETRYABLE is set after an error already occurred, the stored retry state may contain a non-retryable error. Clicking the "Retry" button once will clear this state and enable auto-retry. This is expected behavior - the flag affects eligibility checks going forward, but doesn't retroactively modify stored state. Generated with `cmux`
Added console.debug statements to trace retry eligibility decisions: - isNonRetryableSendError: shows error type and debug flag status - isEligibleForAutoRetry: shows stream error type and retryability - RetryBarrier effectiveAutoRetry: shows all inputs to the decision Usage: 1. Open browser console 2. Set window.__CMUX_FORCE_ALL_RETRYABLE = true 3. Trigger an error or click Retry 4. Watch console output to see why it's not auto-retrying Generated with `cmux`
Changed console.debug → console.log so retry eligibility logs are always visible without needing to enable verbose logging in DevTools. Also added render-time log showing barrier state (effectiveAutoRetry, autoRetry, attempt, countdown, lastError). Generated with `cmux`
Logs are now only shown when window.__CMUX_FORCE_ALL_RETRYABLE is set, eliminating console spam from polling every 1 second. Exception: Non-retryable errors still log to help diagnose issues. Now console is clean by default, verbose only when debugging retries. Generated with `cmux`
Root cause: resumeStream() returning success meant "stream initiated", not "stream completed". useResumeManager was clearing retry state too early, causing: 1. Retry N fails → stored as attempt N+1 2. Retry N+1 starts → resumeStream succeeds → state cleared to null 3. stream-start event → WorkspaceStore resets to attempt 0 4. Stream fails → stored as attempt 1 (should be N+2!) Result: Backoff never progresses beyond attempt 1 → no exponential backoff. Fix: Don't clear retry state in useResumeManager on resumeStream success. Let stream-start/stream-end events handle clearing (they already do via createFreshRetryState). This preserves attempt counter until stream actually completes successfully. Generated with `cmux`
Added minimal logging to trace the actual retry state transitions: - attemptResume: shows current attempt when retry is triggered - resumeStream result: shows attempt increment on failure - stream-start: shows when retry state is reset to 0 This will help identify where backoff progression is breaking. Generated with `cmux`
When using custom host (e.g., VITE_HOST=100.127.77.14), Vite's HMR WebSocket was trying to connect to localhost instead of the configured host, breaking hot module reload. Added explicit hmr configuration to use the same host/port as the server. Generated with `cmux`
Root cause: stream-start fires when stream initiates, not when it completes. Auth errors happen AFTER stream-start, so resetting retry state there caused backoff to reset every retry attempt. Flow before fix: 1. Retry attempt N → stream-start fires → reset to attempt=0 2. Auth error happens → stored as attempt=1 3. Next retry → reset to attempt=0 again (infinite loop at attempt=0) Flow after fix: 1. Retry attempt N → stream-start fires → attempt=N preserved 2. Auth error happens → stored as attempt=N+1 3. Next retry → uses attempt=N+1 → exponential backoff works! stream-start = "stream initiated" (could still fail) stream-end = "stream completed successfully" (actually succeeded) Only reset retry state when stream actually completes, not when it starts. Generated with `cmux`
Root cause: resumeStream() returns success when stream INITIATES, but auth errors happen AFTER on the backend. When backend sends stream-error event, we weren't incrementing the attempt counter. Result: attempt counter stayed at 0 forever → no backoff → spam. Fix: Increment attempt counter when stream-error event fires. Now: 1. Stream starts → attempt=0 preserved 2. Auth error on backend → stream-error event → attempt=0 → 1 3. Next retry → reads attempt=1 → waits 2s 4. Stream starts → attempt=1 preserved 5. Auth error → stream-error → attempt=1 → 2 6. Next retry → reads attempt=2 → waits 4s (exponential backoff works!) Generated with `cmux`
Previous commit only added imports. Now actually incrementing the attempt counter in the stream-error handler. Generated with `cmux`
- Downgraded all retry logs from console.log to console.debug - Removed verbose logging from RetryBarrier.tsx (31 lines) - Simplified isNonRetryableSendError with direct returns - Simplified isEligibleForAutoRetry - removed redundant logs - Removed success log from useResumeManager (not needed) - Updated comment: stream-end (not stream-start) resets retry state Removed 65 lines of debug logging while preserving the debug flag functionality. All typecheck and tests pass. Generated with `cmux`
- Remove unused useCallback from RetryBarrier - Remove unused MAX_DELAY from retryState tests Generated with `cmux`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Simplified stream-error retry counter increment to use the callback pattern instead of read-then-write, making the update atomic. Before: - readPersistedState to get current state - Create new state object - updatePersistedState with new state After: - updatePersistedState with callback that increments atomically Also removed unused RetryState import from WorkspaceStore. Generated with `cmux`
Added typeof window !== "undefined" checks before accessing window.__CMUX_FORCE_ALL_RETRYABLE in: - isNonRetryableSendError - isEligibleForAutoRetry This prevents ReferenceError when these utilities are imported in Node.js test environments (jest.config.js uses testEnvironment: "node"). Generated with `cmux`
Extract the duplicated window.__CMUX_FORCE_ALL_RETRYABLE check into a helper function to follow DRY principles. Before: - typeof window !== "undefined" && window.__CMUX_FORCE_ALL_RETRYABLE duplicated in isNonRetryableSendError and isEligibleForAutoRetry After: - Single isForceAllRetryableEnabled() helper used by both functions Generated with `cmux`
Problem
Auth errors and other stream failures caused unlimited retries with no backoff due to a stream lifecycle timing bug. The retry counter was reset on
stream-start(when stream initiates) but auth errors happen after stream-start on the backend, preventing exponential backoff from progressing.Solution
Two-part fix:
stream-starttostream-end- Only reset attempt counter when stream completes successfullystream-errorevent - Ensures backoff progresses even for errors that happen after stream initiationCreated
src/utils/messages/retryState.tswith utility functions for state transitions:createManualRetryState()- preserves attempt counter for manual retriescreateFailedRetryState()- increments attempt counter on failurecreateFreshRetryState()- resets on successful completionTesting
Verified exponential backoff progression with invalid API key:
All tests pass (997 pass, 1 skip, 0 fail).
Cleanup
Final commit cleaned up debug logging (65 lines removed) while preserving
window.__CMUX_FORCE_ALL_RETRYABLEdebug flag for testing.Generated with
cmux