🤖 fix: prevent infinite retry loop for non-retryable errors #485

ammar-agent · 2025-10-29T16:57:00Z

Problem

When users hit non-retryable errors (e.g., api_key_not_found, authentication, quota), the system would infinitely retry without user intervention. UI showed "Retrying..." indefinitely with no error message, and the manual retry button was non-functional.

Root Causes

Auto-retry didn't distinguish error types: Two error systems existed (SendMessageError from resumeStream failures, StreamErrorType in stream-error messages). isEligibleForAutoRetry() only checked the latter, missing api_key_not_found errors stored in RetryState.lastError.

RetryBarrier didn't receive updates: useResumeManager used localStorage.setItem() directly while RetryBarrier used usePersistedState with listener: true expecting custom events.

Manual retry was blocked: isEligibleForResume() treated manual retries the same as auto-retries, blocking user intent after fixing issues.

Solution

Split retry eligibility logic into three functions in src/utils/messages/retryEligibility.ts:

hasInterruptedStream() - Shows UI for ALL errors (user may have fixed it)
isEligibleForAutoRetry() - Filters non-retryable stream errors (authentication, quota, model_not_found, context_exceeded, aborted)
isNonRetryableSendError() - Checks SendMessageError types (api_key_not_found, provider_not_supported, invalid_model_string)

Fixed event synchronization: useResumeManager now uses updatePersistedState() which dispatches custom events that listeners receive.

Made RetryBarrier self-contained: Following AgentStatusIndicator pattern, accepts only workspaceId prop, fetches workspace state directly, manages own autoRetry state, computes effectiveAutoRetry internally. Removed ~37 lines from AIView.tsx.

Added manual retry bypass: RESUME_CHECK_REQUESTED events now include isManual flag. Manual retry bypasses eligibility checks, respecting user intent.

Centralized event type system: Created CustomEventPayloads interface and createCustomEvent() helper in src/constants/events.ts for type-safe event handling.

Centralized error formatting: Created formatSendMessageError() in src/utils/errors/formatSendError.ts, eliminating duplicate error message strings.

Behavior After Fix

Non-retryable errors: Auto-retry stops immediately, UI shows manual "Retry" button (not "Retrying..."), error message displayed clearly, user can fix and retry.

Retryable errors: Auto-retry with exponential backoff (1s → 2s → 4s → ... → 60s max), UI shows countdown, user can stop with Ctrl+C.

Testing

All 30 tests in retryEligibility.test.ts pass, verifying correct filtering of non-retryable errors and edge cases.

Generated with cmux

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

src/utils/messages/retryEligibility.ts

ibetitsmike · 2025-10-29T18:54:08Z

Nice. Had that happen today.

Fixes perpetual retry loop when authentication or other non-retryable errors occur. Previously, all stream errors would trigger auto-retry, causing the system to infinitely retry errors that require user action. ## Problem When a user provides a bad API key (or hits quota limits, model not found, context exceeded), the system would enter a perpetual retry loop: 1. User sets API key via /providers set 2. Request fails with authentication error 3. useResumeManager auto-retries every 1-60s with exponential backoff 4. Error repeats forever until user force-closes cmux This happened because hasInterruptedStream() returned true for ALL stream-error types, triggering auto-retry even when user action is needed. ## Solution Classify error types into retryable vs non-retryable: **Non-retryable (require user action):** - authentication - Bad API key - quota - Billing/usage limits - model_not_found - Invalid model selection - context_exceeded - Message too long - aborted - User cancelled **Retryable (transient issues):** - network - Temporary network issues - server_error - Provider issues (5xx) - rate_limit - Can retry with backoff - api - Generic API errors - unknown - Defensive retry hasInterruptedStream() now checks errorType and returns false for non-retryable errors, preventing useResumeManager from auto-retrying. ## Testing Added 8 new test cases covering: - Each non-retryable error type returns false - Retryable error types (network, server_error, rate_limit) return true - All 19 tests pass _Generated with `cmux`_

Allows users to specify which host and port to bind to when running in server mode, instead of hardcoding 0.0.0.0:3000. Usage: node dist/main.js server # default: 0.0.0.0:3000 node dist/main.js server -h localhost # bind to localhost only node dist/main.js server -h 0.0.0.0 -p 8080 # custom port Defaults to 0.0.0.0:3000 for network accessibility. _Generated with `cmux`_

Replaces manual argument parsing with commander library for better UX: - Automatic help generation with --help - Proper flag validation (errors on unknown flags) - Supports both short (-h, -p) and long (--host, --port) flags - Type-safe option parsing Usage: node dist/main.js server --help node dist/main.js server # 0.0.0.0:3000 node dist/main.js server --host localhost node dist/main.js server -h 127.0.0.1 -p 8080 Dependencies: Added commander@14.0.2 _Generated with `cmux`_

Fixes issue where errors (like api_key_not_found) are invisible on page reload. Previously, when resumeStream failed during auto-retry, the error was silently swallowed and users only saw "Retrying" without knowing why. ## Problem SendMessageError (validation errors before streaming starts) were never emitted as stream-error events, so they: - Were not persisted to partial.json - Were not visible on page reload - Left users confused during auto-retry loops ## Solution Emit all SendMessageErrors as stream-error events in streamWithHistory: - Both sendMessage and resumeStream now emit errors consistently - Errors persist via partial.json and survive page reloads - ChatInput still shows toast for immediate feedback (in addition to stream-error) ## Parity Before: - sendMessage: validation fails → returns error → toast shown (ephemeral) - resumeStream: validation fails → returns error → silently retries (no UI) After: - sendMessage: validation fails → returns error + emits stream-error → toast + persisted - resumeStream: validation fails → returns error + emits stream-error → persisted Both paths now emit errors for consistent UI feedback and persistence. _Generated with `cmux`_

Adds `make dev-server` for rapid iteration on server mode changes. Setup: - Watches TypeScript files in src/ and rebuilds main process - Watches dist/main.js and dist/main-server.js - Auto-restarts server on changes (500ms debounce) - Uses nodemon for process management Usage: make dev-server Server runs on http://localhost:3000 with automatic reload on file changes. Dependencies: Added nodemon@3.1.10 _Generated with `cmux`_

- Split retry eligibility logic: - hasInterruptedStream() shows UI for ALL errors - isEligibleForAutoRetry() filters non-retryable errors - isNonRetryableSendError() checks SendMessageError types - Fixed localStorage event synchronization: - useResumeManager now uses updatePersistedState() - Dispatches custom events for usePersistedState listeners - RetryBarrier receives updates immediately - Made RetryBarrier self-contained: - Accepts only workspaceId (like AgentStatusIndicator) - Manages own state and event handlers - Computes effectiveAutoRetry internally - Removed bloat from AIView.tsx - Fixed manual retry button: - Added isManual flag to bypass eligibility checks - User clicks = immediate retry regardless of error type - Auto-retry still respects eligibility rules - Centralized event type system: - CustomEventPayloads interface for all events - createCustomEvent() helper for type-safe dispatch - CustomEventType for type-safe listeners - Single source of truth for event payloads Tests: All 30 tests pass in retryEligibility.test.ts _Generated with `cmux`_

Created src/utils/errors/formatSendError.ts as single source of truth for SendMessageError formatting. Both RetryBarrier and ChatInputToasts now use formatSendMessageError() helper, eliminating duplicate logic for provider commands and error messages. - formatSendMessageError() returns message + optional providerCommand - RetryBarrier combines them: "message. Configure with command" - ChatInputToasts uses providerCommand in solution section - All provider-related strings now defined in one place

Added tests that explicitly demonstrate the key distinction: - hasInterruptedStream = Should RetryBarrier UI be shown? - isEligibleForAutoRetry = Should system auto-retry? For non-retryable errors (authentication, quota): - hasInterruptedStream returns true (show UI) - isEligibleForAutoRetry returns false (don't auto-retry) - Result: User sees manual Retry button, not 'Retrying...' For retryable errors (network, server): - Both return true - Result: User sees 'Retrying...' with countdown and exponential backoff This prevents confusion about why we have two similar-sounding functions.

Removed tests that only checked 'does stream-error mean interrupted?' without testing actual retry logic. These were misleading because: - They don't test retry behavior - They're trivial (of course errors mean interrupted) - Test names implied retry testing but just checked message detection Kept only tests that verify isEligibleForAutoRetry correctly filters non-retryable errors, which is the actual important logic.

chatgpt-codex-connector bot reviewed Oct 29, 2025

View reviewed changes

src/utils/messages/retryEligibility.ts Outdated Show resolved Hide resolved

ammar-agent force-pushed the fix-bad-retry branch 2 times, most recently from 603f688 to 0f883b5 Compare October 29, 2025 17:42

ammar-agent changed the title ~~🤖 fix: stop auto-retry for non-retryable error types~~ 🤖 fix: prevent infinite retry loop for non-retryable errors Oct 29, 2025

ammar-agent added 11 commits October 29, 2025 19:11

fix: lint errors and formatting

21b2195

fix: match expected error message in tests

8e3cf0e

ammar-agent force-pushed the fix-bad-retry branch from 25ca2a3 to 5a03fe0 Compare October 29, 2025 19:11

ammario merged commit cc13e60 into main Oct 29, 2025
13 checks passed

ammario deleted the fix-bad-retry branch October 29, 2025 19:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🤖 fix: prevent infinite retry loop for non-retryable errors #485

🤖 fix: prevent infinite retry loop for non-retryable errors #485

Uh oh!

ammar-agent commented Oct 29, 2025 •

edited

Loading

Uh oh!

chatgpt-codex-connector bot left a comment

Uh oh!

Uh oh!

ibetitsmike commented Oct 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

🤖 fix: prevent infinite retry loop for non-retryable errors #485

🤖 fix: prevent infinite retry loop for non-retryable errors #485

Uh oh!

Conversation

ammar-agent commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Causes

Solution

Behavior After Fix

Testing

Uh oh!

chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

ibetitsmike commented Oct 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ammar-agent commented Oct 29, 2025 •

edited

Loading