fix: safe agent recovery on submit failure and accurate disconnect messaging #7972

Closed
kyledef wants to merge 1 commit into block:main from kyledef:fix/sse-retry-and-reconnect

@kyledef kyledef commented Mar 18, 2026

Summary

When a /reply request fails (due to agent LRU eviction, server restart, or network loss), silently restore the agent so the session remains usable, and show accurate messaging about what happened.

Problem

Multiple users have reported that Goose "just hangs" when sending a message after not interacting for a while, requiring a full restart to recover. The root causes:

  • When the backend agent becomes unavailable (LRU eviction, server restart, macOS sleep/wake), the frontend had no recovery mechanism — resumeAgent was only called once on mount
  • The existing connection-lost toast encouraged users to "try sending your message again" with no warning that resending could duplicate tool actions (shell commands, file edits, MCP calls)

Why we do NOT retry /reply

The /reply endpoint is a one-shot turn submission, not a resumable stream. There is no protocol support for idempotent turn submission, reconnecting to an in-flight turn, or replaying buffered events. Retrying the SSE connection would re-submit the same turn and potentially duplicate side effects. True resumable streaming requires protocol-level changes (idempotent reply_id, event replay, Last-Event-ID support) tracked separately.
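To make the duplication risk concrete, here is a minimal sketch that models a turn-submission endpoint in memory. `FakeReplyServer`, `Turn`, and both `submit` methods are hypothetical and exist only for illustration; they are not part of the Goose server or its protocol. The first method models the behaviour described above (every submission runs the turn), and the second models an assumed future protocol in which a `reply_id` makes submission idempotent.

```typescript
// Hypothetical in-memory model; none of these names come from the Goose codebase.
type Turn = { replyId: string; toolAction: string };

class FakeReplyServer {
  executed: string[] = [];          // side effects actually performed
  private seen = new Set<string>(); // reply ids already processed

  // Models a one-shot endpoint: every submission runs the turn again.
  submit(turn: Turn): void {
    this.executed.push(turn.toolAction);
  }

  // Models an assumed idempotent protocol keyed on reply_id.
  submitIdempotent(turn: Turn): void {
    if (this.seen.has(turn.replyId)) return; // duplicate delivery, skip
    this.seen.add(turn.replyId);
    this.executed.push(turn.toolAction);
  }
}

const turn: Turn = { replyId: "r-1", toolAction: "append line to notes.txt" };

const current = new FakeReplyServer();
current.submit(turn); // original request; the stream drops before the client sees the result
current.submit(turn); // an automatic retry re-submits the same turn
console.log(current.executed.length); // 2: the tool action ran twice

const idempotent = new FakeReplyServer();
idempotent.submitIdempotent(turn);
idempotent.submitIdempotent(turn);
console.log(idempotent.executed.length); // 1: the duplicate was ignored
```

Without something like the second variant on the server, the client cannot tell whether a failed request already ran its tool calls, which is why the retry count stays at 0.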

Changes

  • Add explicit // Do not retry /reply comments at all 3 SSE call sites explaining why sseMaxRetryAttempts must stay at 0
  • Silent agent recovery on failure: when a submit fails with a non-abort error, attempt resumeAgent to restore the session (handles LRU eviction, server restart, extension reload) so the user can manually retry. Does NOT auto-resend the message.
  • Accurate disconnect messaging: update connection-lost toast to warn that the turn may have partially executed and resending may duplicate tool actions
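The recovery behaviour listed above could look roughly like the following sketch. It is not the code in useChatStream.ts: `submitOnce`, `SubmitDeps`, `SubmitOutcome`, and `isAbortError` are hypothetical names, the dependencies are injected so the flow can be read in isolation, and the abort check assumes cancellation surfaces as an error named `AbortError`.

```typescript
// Illustrative stand-ins only; the real hook wires these up differently.
type SubmitDeps = {
  reply: (message: string) => Promise<void>;         // one-shot turn submission
  resumeAgent: (sessionId: string) => Promise<void>; // restores the backend agent
  warn: (text: string) => void;                      // e.g. shows a toast
};

type SubmitOutcome =
  | "ok"
  | "aborted"
  | "failed-agent-restored"
  | "failed-agent-unavailable";

// Assumption: a user-initiated cancel is reported as an Error named "AbortError".
function isAbortError(err: unknown): boolean {
  return err instanceof Error && err.name === "AbortError";
}

async function submitOnce(
  sessionId: string,
  message: string,
  deps: SubmitDeps,
): Promise<SubmitOutcome> {
  try {
    await deps.reply(message);
    return "ok";
  } catch (err) {
    if (isAbortError(err)) return "aborted"; // user cancelled, nothing to recover

    deps.warn(
      "Connection lost. The turn may have partially executed; " +
        "resending may duplicate tool actions.",
    );

    // Restore the agent so a manual retry can work. The message is never re-sent here.
    // NOTE: awaiting is for brevity; if resumeAgent can hang, the UI should be
    // returned to idle before this call.
    try {
      await deps.resumeAgent(sessionId);
      return "failed-agent-restored";
    } catch {
      return "failed-agent-unavailable";
    }
  }
}
```

The property to check in a sketch like this is that `reply` runs exactly once per user action, whatever happens during recovery.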

What this does NOT do

  • Does not auto-retry /reply requests (unsafe without idempotency)
  • Does not auto-resend the user's message
  • Does not claim the connection was "restored" or "reconnected"

Testing

  • Verified syntax and balanced braces across all changes
  • All 3 sseMaxRetryAttempts call sites documented
  • Recovery logic added to both handleSubmit and submitElicitationResponse error handlers
  • No server-side changes required

Files Changed

  • ui/desktop/src/hooks/useChatStream.ts — agent recovery + messaging + comments
  • ui/desktop/package.json — patch version bump to 1.27.1

@kyledef kyledef force-pushed the fix/sse-retry-and-reconnect branch 2 times, most recently from b7351e4 to 5faabd4 Compare March 18, 2026 03:18
…ssaging

- Keep sseMaxRetryAttempts at 0 for /reply — the endpoint is not
  idempotent and retrying can duplicate tool calls (shell commands,
  file edits, MCP calls). Added explicit comment at all 3 call sites.
- On submit failure, attempt resumeAgent to restore the session
  (handles LRU eviction, server restart) so the user can manually
  retry. Does NOT auto-resend the message.
- Update connection-lost toast to warn about possible partial
  execution and duplicate tool actions on resend.
- Remove misleading 'connection restored' / 'reconnected' language.

True resumable streaming requires protocol-level changes (idempotent
turn submission, reply_id, event replay) tracked separately.
@kyledef kyledef force-pushed the fix/sse-retry-and-reconnect branch from 5faabd4 to ae5bce0 Compare March 18, 2026 04:48
@kyledef kyledef marked this pull request as ready for review March 18, 2026 04:48

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae5bce0525

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +648 to +652

    await resumeAgent({
      body: {
        session_id: sessionId,
        load_model_and_extensions: true,
      },

P1: Finish failed submit before attempting reconnect

When reply() fails (for example during the same network/backend outage this change is trying to recover from), this path now awaits resumeAgent() before calling onFinish. Because START_STREAMING has already been dispatched, a slow or hanging /agent/resume call keeps the UI stuck in streaming state (spinner/disabled input) instead of returning to idle with an error; previously the submit error finished immediately. Please call onFinish(...) first (or run reconnect in the background with a timeout), and apply the same fix to the mirrored logic in submitElicitationResponse.
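One possible shape for the suggested ordering is sketched below: finish the failed submit first, then attempt recovery without blocking, bounded by a timeout. `withTimeout` and `handleSubmitFailure` are hypothetical helpers, the callback signatures are assumptions rather than the actual ones in useChatStream.ts, and the 5000 ms default is an arbitrary placeholder.

```typescript
// Hypothetical helpers; the real handler may be structured differently.

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Returns a promise that resolves to true if the agent was restored in time.
function handleSubmitFailure(
  error: unknown,
  onFinish: (message: string) => void,
  resumeAgent: () => Promise<unknown>,
  timeoutMs = 5000, // placeholder value, not taken from the PR
): Promise<boolean> {
  // 1. Return the UI to idle immediately, before any network call.
  onFinish(error instanceof Error ? error.message : String(error));

  // 2. Attempt recovery in the background; a hang is cut off by the timeout.
  return withTimeout(resumeAgent(), timeoutMs).then(
    () => true,
    () => false,
  );
}
```

A caller would typically not await the returned promise, so the spinner and disabled input clear as soon as `onFinish` runs even if the resume call never settles. The same helper could be shared by the mirrored error handler.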

@kyledef kyledef changed the title from "fix: add SSE retry attempts and auto-reconnect on agent loss" to "fix: safe agent recovery on submit failure and accurate disconnect messaging" Mar 18, 2026
DOsinga (Collaborator) commented Mar 20, 2026

Thanks for the contribution @kyledef! Closing this one for now — a couple of things worth flagging:

Multiple PRs opened in quick succession: You've opened both #7972 and #7978 within a few hours of each other as a first-time contributor. Our CONTRIBUTING.md asks that you open one PR at a time and wait for it to land before opening more — this helps us give each change a proper review and keeps the feedback loop manageable. Please pick your preferred one (#7978 looks like the more complete approach) and let's work from there.

Unaddressed Codex comment: There's an unaddressed P1 comment from Codex pointing out a real bug. If the server is down (exactly the failure case this PR targets), await resumeAgent() can hang and keep the UI stuck in streaming state indefinitely, because onFinish is called only after it returns. The fix would be to call onFinish first (returning the UI to idle), then attempt recovery in the background. We'd want that resolved before merging anything in this area.

Feel free to reopen or incorporate the fix into #7978 — happy to keep the conversation going there!
