Skip to content
This repository was archived by the owner on Apr 14, 2026. It is now read-only.

fix: drain pending replay notifications before emitting replay_complete#163

Merged
matt2e merged 12 commits intomainfrom
empty-chats-saved
Apr 13, 2026
Merged

fix: drain pending replay notifications before emitting replay_complete#163
matt2e merged 12 commits intomainfrom
empty-chats-saved

Conversation

@matt2e
Copy link
Copy Markdown
Collaborator

@matt2e matt2e commented Apr 13, 2026

Summary

  • Add a wait_for_replay_drain helper that yields to the async runtime and waits for the replay event counter to stabilise before finalizing replay, fixing a race where replay_complete fired before all replay notifications were dispatched
  • Track replay event counts on SessionRoute via a new replay_events field, with a get_replay_event_count accessor on the dispatcher
  • Remove debug/perf logging from AppShell.tsx and useAcpStream.ts that was added during investigation
  • Add unit tests for the drain stabilisation logic including edge cases (late arrivals, runaway counter cap)
  • Update file size limits in check-file-sizes.mjs to account for the new code

Test plan

  • Unit tests for wait_for_replay_drain cover: immediate zero, stable count, spawned notifications, late arrival reset, iteration cap
  • Manual verification that loading a session with history correctly shows all replay messages before the "replay complete" indicator
  • Verify no regression in new-session flow (no replay events)

🤖 Generated with Claude Code

matt2e and others added 11 commits April 13, 2026 13:17
… creation triggers

Adds [session] prefixed console.log statements at key points:
- createDraftSession: when a frontend draft is created
- promoteDraft: when a draft is promoted for first message send
- acpPrepareSession: before/after backend session creation with trigger reason
- acp:session_bound: when backend confirms session binding
- setSessionAcpId: when frontend links to backend session ID
- loadSessions: summary of backend sessions including empty session count
- sendMessage error/abort: when message send fails after session creation

These logs help diagnose empty chat sessions by tracing what triggered
backend session creation and whether the subsequent message send succeeded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces [session] lifecycle logs with [replay] logs that trace the full
session history loading pipeline:

Frontend (useAcpStream/AppShell):
- Log when loadSessionMessages starts with goose session ID and messageCount
- Count replay events (user messages, assistant messages, total) as they arrive
- Log replay_complete with event counts and buffer size
- Log buffer flush with message role breakdown
- Log timeout with accumulated event counts to detect stalled replays

Backend (Rust dispatcher/session_ops):
- Log load_session start with session ID mapping and working dir
- Count replayed user/assistant messages in SessionRoute
- Log load_session completion with event counts and elapsed time

These logs help diagnose sessions that load with missing messages by
showing exactly how many events the backend emitted vs how many the
frontend received and buffered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After load_session().await resolves, the ACP RPC layer may still have
queued notification tasks (session/update with message chunks) that
haven't been processed yet on the single-threaded runtime. This caused
non-deterministic message loss when loading older sessions.

Yield repeatedly after load_session returns, waiting for the replay
event count to stabilize, before finalizing and emitting replay_complete.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests all 6 exported functions from replayBuffer.ts:
- ensureReplayBuffer: creation, identity, session isolation
- getBufferedMessage: lookup by id, missing cases
- getReplayBufferSize: unknown session, accurate count
- getAndDeleteReplayBuffer: returns and removes buffer
- clearReplayBuffer: removes without returning
- findLatestUnpairedToolRequest: edge cases for paired/unpaired tool requests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extract the yield-and-wait-for-stability loop from load_session_inner
into a generic wait_for_replay_drain() function that accepts an async
counter callback, making it testable in isolation.

Add 4 tests:
- replay_drain_returns_immediately_when_count_is_zero
- replay_drain_returns_stable_count (already-stable counter)
- replay_drain_waits_for_spawned_notifications (tokio::spawn simulating
  async ACP notifications arriving over multiple yields)
- replay_drain_resets_stability_on_late_arrival (counter bumps just
  before stability threshold, verifying the reset logic)

The last two tests fail when the drain logic is replaced with a
single poll (the pre-fix behavior), confirming the fix is necessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove [session] and [replay] info-level console.log/log::info
statements added during empty-chats investigation. Revert the
dotenv-load justfile addition. Simplify dispatcher replay counting
to a single total (only field the drain fix actually reads).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a MAX_DRAIN_ITERATIONS (100) safety cap to wait_for_replay_drain so
a runaway counter cannot spin-yield forever and freeze the UI. Log a
warning when the cap is hit, returning the partial count.

Strengthen the late-arrival test to assert poll_count >= 7, confirming
the stability window truly resets rather than coincidentally seeing the
bumped value. Add a new test for the iteration cap behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…path

Merge the two consecutive routes lock acquisitions in the
UserMessageChunk replay handler into a single lock scope. The first
lock finalized any in-progress assistant message, and the second
incremented replay_events — both on the same route entry. Combining
them reduces contention and simplifies the code.

Tighten the file size limit for dispatcher.rs accordingly (545 → 540).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@matt2e matt2e requested a review from wesbillman as a code owner April 13, 2026 07:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed559e808f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

let mut routes = self.routes.lock().await;
if let Some(route) = routes.get_mut(&goose_session_id) {
route.replay_message_id = Some(new_id.clone());
route.replay_events += 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Count every replay event before finalizing replay

wait_for_replay_drain waits for replay_events to stabilize, but this counter is only incremented when a new replay assistant message is created (plus user chunks), not for subsequent AgentMessageChunk/tool replay updates on the same message. That lets load_session_inner call finalize_replay and emit acp:replay_complete while replay chunks are still queued, and those late chunks can be dropped once the frontend marks the message completed (shouldTrackStreamingEvent in useAcpStream). Please increment the drain counter for every replay notification that can still mutate buffered replay state, or use a completion signal tied to the notification queue instead of message creation.

Useful? React with 👍 / 👎.

The debug logging cleanup commit (5fdab4e) accidentally removed
[perf:load] and [perf:stream] console.log instrumentation by Bradley
Axen and a "Failed to prepare ACP session" console.error by Taylor Ho.
These were pre-existing logs, not debug logs added during investigation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@matt2e matt2e merged commit 268eca8 into main Apr 13, 2026
8 checks passed
@matt2e matt2e deleted the empty-chats-saved branch April 13, 2026 07:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant