Skip to content

Conversation

@ammar-agent
Copy link
Collaborator

@ammar-agent ammar-agent commented Oct 29, 2025

Problem

When users hit non-retryable errors (e.g., api_key_not_found, authentication, quota), the system would infinitely retry without user intervention. UI showed "Retrying..." indefinitely with no error message, and the manual retry button was non-functional.

Root Causes

Auto-retry didn't distinguish error types: Two error systems existed (SendMessageError from resumeStream failures, StreamErrorType in stream-error messages). isEligibleForAutoRetry() only checked the latter, missing api_key_not_found errors stored in RetryState.lastError.

RetryBarrier didn't receive updates: useResumeManager used localStorage.setItem() directly while RetryBarrier used usePersistedState with listener: true expecting custom events.

Manual retry was blocked: isEligibleForResume() treated manual retries the same as auto-retries, blocking user intent after fixing issues.

Solution

Split retry eligibility logic into three functions in src/utils/messages/retryEligibility.ts:

  • hasInterruptedStream() - Shows UI for ALL errors (user may have fixed it)
  • isEligibleForAutoRetry() - Filters non-retryable stream errors (authentication, quota, model_not_found, context_exceeded, aborted)
  • isNonRetryableSendError() - Checks SendMessageError types (api_key_not_found, provider_not_supported, invalid_model_string)

Fixed event synchronization: useResumeManager now uses updatePersistedState() which dispatches custom events that listeners receive.

Made RetryBarrier self-contained: Following AgentStatusIndicator pattern, accepts only workspaceId prop, fetches workspace state directly, manages own autoRetry state, computes effectiveAutoRetry internally. Removed ~37 lines from AIView.tsx.

Added manual retry bypass: RESUME_CHECK_REQUESTED events now include isManual flag. Manual retry bypasses eligibility checks, respecting user intent.

Centralized event type system: Created CustomEventPayloads interface and createCustomEvent() helper in src/constants/events.ts for type-safe event handling.

Centralized error formatting: Created formatSendMessageError() in src/utils/errors/formatSendError.ts, eliminating duplicate error message strings.

Behavior After Fix

Non-retryable errors: Auto-retry stops immediately, UI shows manual "Retry" button (not "Retrying..."), error message displayed clearly, user can fix and retry.

Retryable errors: Auto-retry with exponential backoff (1s → 2s → 4s → ... → 60s max), UI shows countdown, user can stop with Ctrl+C.

Testing

All 30 tests in retryEligibility.test.ts pass, verifying correct filtering of non-retryable errors and edge cases.

Generated with cmux

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@ammar-agent ammar-agent force-pushed the fix-bad-retry branch 2 times, most recently from 603f688 to 0f883b5 Compare October 29, 2025 17:42
@ibetitsmike
Copy link
Contributor

Nice. Had that happen today.

@ammar-agent ammar-agent changed the title 🤖 fix: stop auto-retry for non-retryable error types 🤖 fix: prevent infinite retry loop for non-retryable errors Oct 29, 2025
Fixes perpetual retry loop when authentication or other non-retryable
errors occur. Previously, all stream errors would trigger auto-retry,
causing the system to infinitely retry errors that require user action.

## Problem

When a user provides a bad API key (or hits quota limits, model not
found, context exceeded), the system would enter a perpetual retry loop:

1. User sets API key via /providers set
2. Request fails with authentication error
3. useResumeManager auto-retries every 1-60s with exponential backoff
4. Error repeats forever until user force-closes cmux

This happened because hasInterruptedStream() returned true for ALL
stream-error types, triggering auto-retry even when user action is needed.

## Solution

Classify error types into retryable vs non-retryable:

**Non-retryable (require user action):**
- authentication - Bad API key
- quota - Billing/usage limits
- model_not_found - Invalid model selection
- context_exceeded - Message too long
- aborted - User cancelled

**Retryable (transient issues):**
- network - Temporary network issues
- server_error - Provider issues (5xx)
- rate_limit - Can retry with backoff
- api - Generic API errors
- unknown - Defensive retry

hasInterruptedStream() now checks errorType and returns false for
non-retryable errors, preventing useResumeManager from auto-retrying.

## Testing

Added 8 new test cases covering:
- Each non-retryable error type returns false
- Retryable error types (network, server_error, rate_limit) return true
- All 19 tests pass

_Generated with `cmux`_
Allows users to specify which host and port to bind to when running
in server mode, instead of hardcoding 0.0.0.0:3000.

Usage:
  node dist/main.js server             # default: 0.0.0.0:3000
  node dist/main.js server -h localhost  # bind to localhost only
  node dist/main.js server -h 0.0.0.0 -p 8080  # custom port

Defaults to 0.0.0.0:3000 for network accessibility.

_Generated with `cmux`_
Replaces manual argument parsing with commander library for better UX:
- Automatic help generation with --help
- Proper flag validation (errors on unknown flags)
- Supports both short (-h, -p) and long (--host, --port) flags
- Type-safe option parsing

Usage:
  node dist/main.js server --help
  node dist/main.js server                    # 0.0.0.0:3000
  node dist/main.js server --host localhost
  node dist/main.js server -h 127.0.0.1 -p 8080

Dependencies: Added commander@14.0.2

_Generated with `cmux`_
Fixes issue where errors (like api_key_not_found) are invisible on page
reload. Previously, when resumeStream failed during auto-retry, the error
was silently swallowed and users only saw "Retrying" without knowing why.

## Problem

SendMessageError (validation errors before streaming starts) were never
emitted as stream-error events, so they:
- Were not persisted to partial.json
- Were not visible on page reload
- Left users confused during auto-retry loops

## Solution

Emit all SendMessageErrors as stream-error events in streamWithHistory:
- Both sendMessage and resumeStream now emit errors consistently
- Errors persist via partial.json and survive page reloads
- ChatInput still shows toast for immediate feedback (in addition to stream-error)

## Parity

Before:
- sendMessage: validation fails → returns error → toast shown (ephemeral)
- resumeStream: validation fails → returns error → silently retries (no UI)

After:
- sendMessage: validation fails → returns error + emits stream-error → toast + persisted
- resumeStream: validation fails → returns error + emits stream-error → persisted

Both paths now emit errors for consistent UI feedback and persistence.

_Generated with `cmux`_
Adds `make dev-server` for rapid iteration on server mode changes.

Setup:
- Watches TypeScript files in src/ and rebuilds main process
- Watches dist/main.js and dist/main-server.js
- Auto-restarts server on changes (500ms debounce)
- Uses nodemon for process management

Usage:
  make dev-server

Server runs on http://localhost:3000 with automatic reload on file changes.

Dependencies: Added nodemon@3.1.10

_Generated with `cmux`_
- Split retry eligibility logic:
  - hasInterruptedStream() shows UI for ALL errors
  - isEligibleForAutoRetry() filters non-retryable errors
  - isNonRetryableSendError() checks SendMessageError types

- Fixed localStorage event synchronization:
  - useResumeManager now uses updatePersistedState()
  - Dispatches custom events for usePersistedState listeners
  - RetryBarrier receives updates immediately

- Made RetryBarrier self-contained:
  - Accepts only workspaceId (like AgentStatusIndicator)
  - Manages own state and event handlers
  - Computes effectiveAutoRetry internally
  - Removed bloat from AIView.tsx

- Fixed manual retry button:
  - Added isManual flag to bypass eligibility checks
  - User clicks = immediate retry regardless of error type
  - Auto-retry still respects eligibility rules

- Centralized event type system:
  - CustomEventPayloads interface for all events
  - createCustomEvent() helper for type-safe dispatch
  - CustomEventType for type-safe listeners
  - Single source of truth for event payloads

Tests: All 30 tests pass in retryEligibility.test.ts

_Generated with `cmux`_
Created src/utils/errors/formatSendError.ts as single source of truth
for SendMessageError formatting. Both RetryBarrier and ChatInputToasts
now use formatSendMessageError() helper, eliminating duplicate logic
for provider commands and error messages.

- formatSendMessageError() returns message + optional providerCommand
- RetryBarrier combines them: "message. Configure with command"
- ChatInputToasts uses providerCommand in solution section
- All provider-related strings now defined in one place
Added tests that explicitly demonstrate the key distinction:
- hasInterruptedStream = Should RetryBarrier UI be shown?
- isEligibleForAutoRetry = Should system auto-retry?

For non-retryable errors (authentication, quota):
- hasInterruptedStream returns true (show UI)
- isEligibleForAutoRetry returns false (don't auto-retry)
- Result: User sees manual Retry button, not 'Retrying...'

For retryable errors (network, server):
- Both return true
- Result: User sees 'Retrying...' with countdown and exponential backoff

This prevents confusion about why we have two similar-sounding functions.
Removed tests that only checked 'does stream-error mean interrupted?'
without testing actual retry logic. These were misleading because:
- They don't test retry behavior
- They're trivial (of course errors mean interrupted)
- Test names implied retry testing but just checked message detection

Kept only tests that verify isEligibleForAutoRetry correctly filters
non-retryable errors, which is the actual important logic.
@ammario ammario merged commit cc13e60 into main Oct 29, 2025
13 checks passed
@ammario ammario deleted the fix-bad-retry branch October 29, 2025 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants