Sessions permanently stuck after server restart or stream interruption — no startup recovery for orphaned messages/tool parts #19023

@dzianisv

Description

When the OpenCode server restarts (or the process crashes) while a session is actively executing tool calls, the session gets permanently stuck in a "Thinking" state. The root cause is that there is no startup recovery that cleans up orphaned assistant messages and tool parts.

What happens

  1. A session is actively executing tool calls (e.g., bash commands)
  2. The server restarts or crashes
  3. The in-memory session state (SessionStatus) is lost — the session is no longer "busy"
  4. But the database state is stale: the last assistant message has time.completed = undefined (never completed) and tool parts remain in status: "running" forever
  5. The UI sees the incomplete assistant message and shows a permanent "Thinking" spinner
  6. The session cannot recover — sending a new message creates a new loop iteration, but the old orphaned message still exists
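The mismatch described in steps 3 to 5 can be sketched as a small predicate. This is only an illustration: the type shapes, the `busySessions` map (a stand-in for the in-memory SessionStatus), and `isOrphaned` are hypothetical simplifications, not the actual OpenCode definitions.

```typescript
// Hypothetical, simplified shapes; the real OpenCode types carry more fields.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
}

interface AssistantMessage {
  id: string;
  sessionID: string;
  role: "assistant";
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// Stand-in for the in-memory SessionStatus map; it is empty after a restart.
const busySessions = new Map<string, "busy">();

// A message is orphaned when it never completed and nothing in memory
// claims to be working on its session.
function isOrphaned(msg: AssistantMessage): boolean {
  return msg.time.completed === undefined && !busySessions.has(msg.sessionID);
}

// Database state as it might look after a kill -9 in the middle of a tool call.
const stale: AssistantMessage = {
  id: "msg_1",
  sessionID: "ses_example",
  role: "assistant",
  time: { created: 1 },
  parts: [{ id: "prt_1", status: "running" }],
};
```

While the process is alive, the session is marked busy and the predicate is false; after a restart the map is empty, so the same stored message reads as orphaned.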

Root cause analysis

The existing cleanup in processor.ts:402-417 correctly handles the normal case — when the stream ends (normally, via error, or abort), it force-sets any non-terminal tool parts to status: "error". However, this cleanup only runs if the process survives long enough to reach it.
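To illustrate why in-process cleanup cannot cover a crash, here is a minimal sketch. It assumes the cleanup is structured like a `try`/`finally` around the stream loop, which the issue does not confirm; `failNonTerminal`, `processStream`, and the error string are hypothetical names for illustration.

```typescript
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
  error?: string;
}

// Force every non-terminal part into the "error" state and report how many changed.
function failNonTerminal(parts: ToolPart[], reason: string): number {
  let changed = 0;
  for (const part of parts) {
    if (part.status === "pending" || part.status === "running") {
      part.status = "error";
      part.error = reason;
      changed++;
    }
  }
  return changed;
}

// Hypothetical stream loop: the finally block covers a normal end, a thrown
// error, and an abort, but it never runs if the process itself is killed.
async function processStream(
  parts: ToolPart[],
  consume: () => Promise<void>,
): Promise<void> {
  try {
    await consume();
  } finally {
    failNonTerminal(parts, "Tool execution aborted");
  }
}
```

A `kill -9` or a hard crash terminates the process before any `finally` block can execute, so whatever was last written to the database stays there unchanged.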

There is zero recovery at startup:

  • Session.initialize() does not scan for orphaned messages
  • SessionStatus (in-memory map) is empty after restart — no stale detection
  • No background watchdog checks for sessions stuck in busy state

The only defense is in toModelMessages() (message-v2.ts:740-746), which converts pending/running tool parts into "[Tool execution was interrupted]" when building the next LLM prompt. This helps contextual recovery if the user sends a new message, but the UI still shows the session as stuck because the orphaned assistant message has no time.completed.
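The limitation of that defense can be shown with a small sketch. It is not the code in message-v2.ts; `toolResultForPrompt` and the `ToolPart` shape are hypothetical, and only the placeholder string comes from the description above.

```typescript
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  tool: string;
  status: ToolStatus;
  output?: string;
  error?: string;
}

const INTERRUPTED = "[Tool execution was interrupted]";

// Hypothetical helper: the text the model would see for a given tool part.
function toolResultForPrompt(part: ToolPart): string {
  switch (part.status) {
    case "completed":
      return part.output ?? "";
    case "error":
      return part.error ?? "";
    case "pending":
    case "running":
      // Prompt-level repair only; the stored part is not modified.
      return INTERRUPTED;
  }
}
```

The conversion is read-only: the prompt is repaired, but the stored part keeps its "running" status and the message keeps its missing time.completed, which is what the UI reads.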

Observed in production

  • Session ses_2f4299f5cffeVZfxCt3ViZ7eVJ stuck for 3+ hours with a git log tool part permanently in "running" status
  • Session ses_2e9127723ffeKJ1JpjLNS35B4z showed a similar pattern (this one was actually still running a long k8s test, but it demonstrates the same vulnerability)

Relation to existing issues

This is the backend root cause behind several reported symptoms:

Open PRs #16907 and #17593 address frontend symptoms (making the UI more defensive about stale state), but neither fixes the backend root cause — orphaned messages and tool parts in the database.

Proposed fix

Startup recovery in Session or app bootstrap:

  1. On server start, query all messages where time.completed IS NULL and the message role = "assistant"
  2. For each orphaned message:
    • Set time.completed = Date.now()
    • Set all tool parts with status = "running" or status = "pending" to status = "error" with error = "Tool execution was interrupted (server restart)"
    • Emit Bus events so connected frontends update

This is a small, safe change — the cleanup logic already exists in processor.ts:402-417, it just needs to be callable from a recovery path at startup.
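The recovery pass proposed above could look roughly like the following. This is a minimal sketch over an in-memory array rather than the real storage layer; `recoverOrphans`, the `RecoveryEvent` names, and the `publish` callback (a stand-in for the Bus) are all hypothetical. Only the error string and the selection criteria are taken from the proposal.

```typescript
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  status: ToolStatus;
  error?: string;
}

interface Message {
  id: string;
  role: "user" | "assistant";
  time: { created: number; completed?: number };
  parts: ToolPart[];
}

// Hypothetical event shapes; the real Bus events may be named differently.
type RecoveryEvent =
  | { type: "message.updated"; messageID: string }
  | { type: "part.updated"; messageID: string; partID: string };

const INTERRUPTED = "Tool execution was interrupted (server restart)";

// Close every assistant message that never completed and fail its
// non-terminal tool parts. Returns the number of messages recovered.
function recoverOrphans(
  messages: Message[],
  publish: (event: RecoveryEvent) => void,
  now: number = Date.now(),
): number {
  let recovered = 0;
  for (const msg of messages) {
    if (msg.role !== "assistant" || msg.time.completed !== undefined) continue;
    for (const part of msg.parts) {
      if (part.status !== "pending" && part.status !== "running") continue;
      part.status = "error";
      part.error = INTERRUPTED;
      publish({ type: "part.updated", messageID: msg.id, partID: part.id });
    }
    msg.time.completed = now;
    publish({ type: "message.updated", messageID: msg.id });
    recovered++;
  }
  return recovered;
}
```

The pass is idempotent: a second run finds no message without time.completed and changes nothing. Running it only at startup, before any session loop begins, would also avoid touching a tool call that is legitimately still in progress, as in the long k8s test mentioned earlier.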

Steps to reproduce

  1. Start opencode serve
  2. Start a session that uses tool calls (e.g., ask it to run tests)
  3. Kill the server process while tools are executing (kill -9)
  4. Restart the server
  5. Open the session in the UI: it shows a permanent "Thinking" spinner
  6. Session status API returns {} (idle) but the UI is stuck

Environment

  • opencode serve (long-running, multiple sessions)
  • macOS / Linux
  • Any provider (observed with gpt-5.3-codex via github-copilot)

OpenCode version

Latest dev branch (commit 814a515a8)

Metadata

Labels

core: Anything pertaining to core functionality of the application (opencode server stuff)