@ammar-agent ammar-agent commented Oct 19, 2025

Problem: Stream Error Amnesia

After stream errors, the Agent would "forget" minutes of accumulated work and continue from an old state. This caused frustrating loss of progress during retries.

Root Cause: Errored partials were deleted without committing accumulated parts to history. The original logic conflated two separate concerns:

  • Error metadata (transient, UI-only)
  • Accumulated progress (valuable work that must be preserved)

Solution

Modified PartialService.commitToHistory to:

  1. Strip error metadata from partial messages
  2. Commit accumulated parts to chat.jsonl (preserves progress)
  3. Skip empty partials (prevents history pollution on immediate errors)

Error state remains in partial.json for UI display, but doesn't prevent progress persistence.
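The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `commitToHistory` body: the `PartialMessage`, `MessagePart`, and `Metadata` shapes and the `prepareForHistory` helper are hypothetical names made up for this example, since the real cmux types are not shown in this PR.

```typescript
// Hypothetical shapes for illustration only; the real cmux types may differ.
interface MessagePart {
  type: "text" | "tool-call" | "reasoning";
  text?: string;
}

type Metadata = { error?: string; errorType?: string; [key: string]: unknown };

interface PartialMessage {
  id: string;
  role: "assistant";
  parts: MessagePart[];
  metadata?: Metadata;
}

// Returns the message that should be appended to history,
// or null when there is nothing worth committing.
function prepareForHistory(partial: PartialMessage): PartialMessage | null {
  // Step 3: skip empty partials so immediate errors leave no empty message.
  if (partial.parts.length === 0) {
    return null;
  }

  // Step 1: strip the transient, UI-only error fields from a copy,
  // leaving the input untouched.
  const metadata: Metadata = { ...partial.metadata };
  delete metadata.error;
  delete metadata.errorType;

  // Step 2: the caller would append this cleaned message to chat.jsonl.
  return { ...partial, metadata };
}
```

Working on a copy keeps the error fields available on the original partial for UI display while the committed message carries only the accumulated parts.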

Changes

  • Modified commitToHistory to strip error/errorType before commit
  • Added guard to skip committing empty messages
  • Added comprehensive unit tests for error recovery scenarios

Impact

  • Amnesia fixed: Agent retains full context across error → retry cycles
  • No empty messages: Immediate errors don't pollute history
  • Net -4 LoC: Cleaner logic, single source of truth for commits

Verification

Before fix:

  • Error with parts → Progress LOST ❌ (amnesia bug)
  • Error without parts → Nothing committed ✅

After fix:

  • Error with parts → Progress PRESERVED ✅ (amnesia fixed!)
  • Error without parts → Nothing committed ✅ (guard prevents pollution)

Generated with cmux

When stream errors occur, accumulated parts (text, tool calls, reasoning)
are now committed to chat.jsonl before writing error state to partial.json.

This ensures resumptions/retries don't lose progress from failed streams.

Changes:
- Modified PartialService.commitToHistory to strip error metadata and commit
  accumulated parts instead of deleting errored partials
- Error metadata remains transient (UI-only), while progress is persisted
- Added tests verifying errored partials with parts are committed correctly

Net LoC: -5 (removed duplication, single source of truth for commit logic)

Prevents pollution of chat history with empty assistant messages when
streams error before any content is generated (e.g., immediate network
failures, auth errors, rate limits).

Changes:
- Added empty parts guard to commitToHistory shouldCommit logic
- Added test verifying empty errored partials are skipped
- Cleanup still happens (partial.json deleted)

This completes the error recovery fix - preserves progress when present,
skips empty placeholders when not.
@ammar-agent ammar-agent changed the title 🤖 Preserve partial progress on stream errors 🤖 Fix stream error amnesia: commit partial progress to history Oct 19, 2025
@ammario ammario merged commit bec7276 into main Oct 19, 2025
8 checks passed
@ammario ammario deleted the stream-error branch October 19, 2025 16:58
github-merge-queue bot pushed a commit that referenced this pull request Oct 19, 2025
## Summary

Adds integration test to verify that stream error recovery preserves
context (no amnesia bug).

## Changes

- **Debug IPC for testing**: Added `DEBUG_TRIGGER_STREAM_ERROR` IPC
channel
- **StreamManager debug method**: `debugTriggerStreamError()` triggers
artificial stream errors that follow the same code path as real errors
- **Integration test**: Single error + resume scenario verifies context
preservation via **structured markers**

## Test Design

**Structured-marker approach** for precise validation:

**Test Flow:**
1. Generate unique nonce for test run (random 10-char identifier)
2. Model counts 1-100 using structured format: `${nonce}-<n>: <word>`
(e.g. `ai7qcnc20g-1: one`)
3. Collect stream deltas until ≥10 complete markers detected
4. Trigger artificial network error mid-stream
5. Resume stream and wait for completion
6. Verify final message has **both** properties:
   - **(a) Prefix preservation**: Starts with exact pre-error streamed text
   - **(b) Exact continuation**: Contains next sequential marker
     (`${nonce}-11`) shortly after prefix
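
Steps 1 to 3 of the flow can be sketched as below. This is an illustration under assumptions, not the test's actual code: `makeNonce` and `countCompleteMarkers` are hypothetical helper names, and the sketch assumes each marker sits on its own newline-terminated line.

```typescript
// Hypothetical helper: random lowercase alphanumeric identifier for one test run.
function makeNonce(length: number = 10): string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
  }
  return out;
}

// Hypothetical helper: counts markers of the form `${nonce}-<n>: <word>`.
// A marker counts as complete only once its line ends with a newline,
// so a marker that is still streaming is ignored.
// The nonce is alphanumeric, so it needs no regex escaping.
function countCompleteMarkers(streamed: string, nonce: string): number {
  const pattern = new RegExp(`^${nonce}-(\\d+): [^\\n]+\\n`, "gm");
  return (streamed.match(pattern) ?? []).length;
}
```

A test loop would append each stream delta to a buffer and trigger the artificial error once `countCompleteMarkers(buffer, nonce)` reaches 10.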

**Validation:**
- Pre-error content captured from stream-delta events (user-visible data
path)
- Stable prefix truncated to last complete marker line (no partial
markers)
- Assertions directly prove both amnesia-prevention properties
- No coupling to internal storage formats or metadata
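
The two assertions could be expressed roughly as below. This is a sketch, not the test's real implementation: `stablePrefix` and `verifyNoAmnesia` are hypothetical names, and the 200-character `window` used to interpret "shortly after prefix" is an assumption made for this example.

```typescript
// Hypothetical helper: drop any trailing partial marker by cutting
// the streamed text at its last newline.
function stablePrefix(streamed: string): string {
  const lastNewline = streamed.lastIndexOf("\n");
  return lastNewline === -1 ? "" : streamed.slice(0, lastNewline + 1);
}

// Hypothetical check for both properties:
// (a) the final message starts with the exact pre-error prefix, and
// (b) the next sequential marker appears shortly after that prefix.
// The `window` size is an assumption, not a value taken from the test.
function verifyNoAmnesia(
  finalText: string,
  preErrorText: string,
  nonce: string,
  window: number = 200,
): boolean {
  const prefix = stablePrefix(preErrorText);
  if (prefix.length === 0 || !finalText.startsWith(prefix)) {
    return false; // (a) failed
  }

  // Find the number of the last complete marker in the prefix.
  const markerPattern = new RegExp(`^${nonce}-(\\d+): `, "gm");
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = markerPattern.exec(prefix)) !== null) {
    last = Number(match[1]);
  }
  if (last === 0) {
    return false; // no complete marker was captured before the error
  }

  // (b) the continuation must pick up at marker last + 1.
  const tail = finalText.slice(prefix.length, prefix.length + window);
  return tail.includes(`${nonce}-${last + 1}: `);
}
```

A restart from marker 1 satisfies neither property in general, and a restart appended after the preserved prefix still fails (b), which is what distinguishes exact continuation from "some work done".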

**Why this approach:**
- **Precise**: Detects exact continuation (not just "some work done")
- **Unambiguous**: Random nonce makes false positives virtually
impossible
- **Robust**: Structured format less likely to confuse model than
natural language
- **Fast**: Haiku 4.5 completes in ~18-21 seconds

## Bug Fix

Also fixed an event-collection bug in `collectStreamUntil`: it now
tracks consumed deltas so the same event is never returned more than
once. The previous logic returned the first matching event on every
poll, causing duplicate processing.
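
The idea behind the fix can be illustrated with a small cursor over the collected events. This is only a sketch of the technique (remembering how many events have been consumed); the `StreamEvent` shape and the `EventCursor` class are hypothetical, and the real `collectStreamUntil` signature is not shown in this PR.

```typescript
// Hypothetical event shape for illustration only.
interface StreamEvent {
  type: string;
  delta?: string;
}

// Hypothetical helper: hands out each event at most once by remembering
// how many entries of the shared event list were already consumed.
class EventCursor {
  private readonly events: StreamEvent[];
  private consumed = 0;

  constructor(events: StreamEvent[]) {
    this.events = events;
  }

  // Returns only events that arrived since the previous call and match
  // `type`, then advances the cursor past everything seen so far.
  takeNew(type: string): StreamEvent[] {
    const fresh = this.events
      .slice(this.consumed)
      .filter((event) => event.type === type);
    this.consumed = this.events.length;
    return fresh;
  }
}
```

Without the `consumed` index, every poll would scan from the start of the list and hand back the first matching event again, which is the duplicate-processing behavior described above.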

## Related

Follow-up to #331, which fixed the amnesia bug by preserving accumulated
parts on error.

## Test Results

✅ Test passes reliably in ~18-21 seconds
✅ Validates **exact** prefix preservation and continuation
✅ No flaky failures from timing issues
✅ Integration tests pass: 1 passed, 1 total

_Generated with `cmux`_