feat(session): add session state machine — Phase 1+2 (write + observability) #348

Open
dimakis wants to merge 2 commits into main from feat/session-state-machine

@dimakis dimakis commented May 20, 2026

Summary

  • Phase 1 (write-only): Adds a 7-state session lifecycle (CREATED → STARTING → ACTIVE → DETACHED/SUSPENDED → CLOSING → ENDED) written to EventStore at every transition point. Existing behavior unchanged, invalid transitions logged as warnings.
  • Phase 2 (observability): Adds detectStateMismatch() that checks consistency between in-memory SessionRegistry and durable EventStore state on every send/interrupt. Logs mismatches as errors with full context. No behavior change.

Together these lay the groundwork for Phase 3 where state reads replace the dual source-of-truth (SessionRegistry + isActive flag) that causes the mobile reattach hang.

Changes

Phase 1

  • packages/protocol/src/types.ts: SessionState type, state fields on SessionMeta
  • packages/protocol/src/event-store.ts: VALID_TRANSITIONS, migration, setSessionState()/getSessionState() with warn-only validation
  • server/chat.ts: state writes in startChat, detachChat, reattachChat, closeoutSession
  • server/query-loop.ts: ACTIVE on first SDK event, ENDED in finally block
  • server/ws-handler-v2.ts + server/app.ts: SUSPENDED on iOS background
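
The warn-only validation described in the Phase 1 list can be sketched as follows. The state names come from the summary above; the exact contents of the transition table and the in-memory `InMemoryStateStore` class are assumptions made for illustration and are not taken from event-store.ts.

```typescript
// Hypothetical sketch of warn-only transition validation.
// State names are from the PR summary; the transition table is an assumption.
type SessionState =
  | "CREATED"
  | "STARTING"
  | "ACTIVE"
  | "DETACHED"
  | "SUSPENDED"
  | "CLOSING"
  | "ENDED";

const VALID_TRANSITIONS: Record<SessionState, readonly SessionState[]> = {
  CREATED: ["STARTING", "ENDED"],
  STARTING: ["ACTIVE", "ENDED"],
  ACTIVE: ["DETACHED", "SUSPENDED", "CLOSING", "ENDED"],
  DETACHED: ["ACTIVE", "SUSPENDED", "CLOSING", "ENDED"],
  SUSPENDED: ["ACTIVE", "DETACHED", "CLOSING", "ENDED"],
  CLOSING: ["ENDED"],
  ENDED: [],
};

// In-memory stand-in for the durable store, used only to show the control flow.
class InMemoryStateStore {
  private states = new Map<string, SessionState>();
  readonly warnings: string[] = [];

  setSessionState(
    sessionId: string,
    next: SessionState,
    opts?: { force?: boolean },
  ): void {
    const from = this.states.get(sessionId);
    if (from && !opts?.force && !VALID_TRANSITIONS[from].includes(next)) {
      // Warn-only: record the invalid transition but still apply the write.
      this.warnings.push(`invalid session state transition ${from} -> ${next}`);
    }
    this.states.set(sessionId, next);
  }

  getSessionState(sessionId: string): SessionState | undefined {
    return this.states.get(sessionId);
  }
}
```

Because validation only records a warning, an invalid write still lands in the store, which matches the "existing behavior unchanged" goal of Phase 1.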

Phase 2

  • server/ws-handler-v2.ts: detectStateMismatch(), wired into handleSendV2 + handleInterruptV2, storeState added to routing logs
  • server/__tests__/ws-handler-v2.test.ts: 14 new tests for mismatch detection
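
As a rough illustration of what a registry-versus-store consistency check might look like, here is a minimal pure-function sketch. The input shape, the `REGISTRY_REQUIRED` set, and the three rules are assumptions; the real detectStateMismatch() in ws-handler-v2.ts may use different inputs and rules.

```typescript
// Hypothetical sketch: compares durable store state with in-memory facts.
type SessionState =
  | "CREATED"
  | "STARTING"
  | "ACTIVE"
  | "DETACHED"
  | "SUSPENDED"
  | "CLOSING"
  | "ENDED";

interface MismatchInput {
  storeState: SessionState | undefined; // state read from the durable store
  registryHas: boolean; // the in-memory registry holds this session
  attached: boolean; // a client is currently attached
}

interface Mismatch {
  storeState: SessionState;
  registryHas: boolean;
  details: string;
}

// Assumption: states in which the registry is expected to hold the session.
const REGISTRY_REQUIRED: readonly SessionState[] = ["STARTING", "ACTIVE", "CLOSING"];

function detectStateMismatchSketch(input: MismatchInput): Mismatch | null {
  const { storeState, registryHas, attached } = input;
  // No durable state recorded yet, so there is nothing to compare against.
  if (storeState === undefined) return null;

  if (registryHas && storeState === "ENDED") {
    return { storeState, registryHas, details: "registry holds a session the store marks ENDED" };
  }
  if (!registryHas && REGISTRY_REQUIRED.includes(storeState)) {
    return { storeState, registryHas, details: "store marks session live but registry has no entry" };
  }
  if (attached && storeState === "DETACHED") {
    return { storeState, registryHas, details: "client attached but store marks DETACHED" };
  }
  return null;
}
```

Keeping the check a pure function of already-fetched values makes it straightforward to unit test and to log the result without changing routing behavior.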

Design doc

See PR #340 for the full design doc and review discussion.

Test plan

  • 13 new EventStore tests (Phase 1) — all pass
  • 14 new detectStateMismatch tests (Phase 2) — all pass
  • All 119 ws-handler-v2 tests pass
  • TypeScript compiles clean, lint passes
  • Deploy to dev, observe state transitions + mismatch logs
  • Verify no behavior change (both phases are observability-only)

🤖 Generated with Claude Code

Write 7-state lifecycle (CREATED→STARTING→ACTIVE→DETACHED/SUSPENDED→
CLOSING→ENDED) to EventStore at every transition point. Phase 1 is
write-only: existing behavior unchanged, invalid transitions logged
as warnings. This lays the groundwork for Phase 2 where reads replace
the dual source-of-truth (SessionRegistry + isActive flag).

- Add SessionState type and VALID_TRANSITIONS to protocol package
- Add setSessionState/getSessionState to EventStore with migration
- Write state in chat.ts (start/detach/reattach/closeout)
- Write state in query-loop.ts (ACTIVE on first event, ENDED in finally)
- Write state in ws-handler-v2.ts and app.ts (SUSPENDED)
- 13 new tests covering full lifecycle, edge cases, force flag

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@dimakis dimakis left a comment

Centaur Review

Found 6 issue(s) (3 warning).

server/query-loop.ts

The state machine schema and EventStore API are solid, but new (non-resume) sessions silently skip CREATED/STARTING/ACTIVE transitions because sessionId isn't resolved yet at the call sites — only resumed sessions get full lifecycle tracking in Phase 1.

  • 🟡 bugs (L411): For new (non-resume) sessions, resolvedSessionId is undefined at this point (set later at line 475 on the first assistant event), and registry.get(clientId)?.sessionId is also undefined (set at line 478 via setSessionId). So sid is undefined and setSessionState('ACTIVE') is silently skipped. New sessions never record CREATED, STARTING, or ACTIVE — only ENDED in the finally block. The state machine only tracks lifecycle for resumed sessions. [fixable]

server/chat.ts

  • 🟡 bugs (L688): For new sessions, options.resume is undefined and session.sessionId is not yet assigned (it's set later in query-loop.ts line 478 when the first assistant event arrives). So stateSessionId is undefined/falsy and the CREATED transition is skipped. Same issue at line 925 for STARTING. The Phase 1 state machine captures lifecycle only for resumed sessions. [fixable]
  • 🔵 missing_tests: No integration tests verify that the server-side call sites (detachChat, reattachChat, closeoutSession, suspend endpoints) actually produce the expected state transitions in the EventStore. The unit tests cover the EventStore API directly, but don't exercise the wiring in chat.ts, query-loop.ts, app.ts, or ws-handler-v2.ts. [fixable]
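
One possible way to avoid losing CREATED and STARTING for new sessions is to buffer state writes per client until the SDK assigns a session ID, then flush them in order. The sketch below is an assumption for illustration only: `PendingStateBuffer`, `record`, and `resolve` are hypothetical names, and this is not the fix adopted in the PR.

```typescript
// Hypothetical sketch: defer state writes until the session ID is known.
type SessionState =
  | "CREATED"
  | "STARTING"
  | "ACTIVE"
  | "DETACHED"
  | "SUSPENDED"
  | "CLOSING"
  | "ENDED";

type StateWriter = (sessionId: string, state: SessionState) => void;

class PendingStateBuffer {
  private readonly write: StateWriter;
  private pending = new Map<string, SessionState[]>();
  private resolved = new Map<string, string>();

  constructor(write: StateWriter) {
    this.write = write;
  }

  // Record a transition for a client; write through if the ID is already known.
  record(clientId: string, state: SessionState): void {
    const sessionId = this.resolved.get(clientId);
    if (sessionId !== undefined) {
      this.write(sessionId, state);
      return;
    }
    const queue = this.pending.get(clientId) ?? [];
    queue.push(state);
    this.pending.set(clientId, queue);
  }

  // Called once the SDK reports the session ID; flushes buffered states in order.
  resolve(clientId: string, sessionId: string): void {
    this.resolved.set(clientId, sessionId);
    for (const state of this.pending.get(clientId) ?? []) {
      this.write(sessionId, state);
    }
    this.pending.delete(clientId);
  }
}
```

With this shape, the call sites in chat.ts and query-loop.ts could call `record` unconditionally, and the point where the session ID is first assigned would call `resolve`. The trade-off is that buffered states carry the flush time rather than the time they were originally recorded, unless a timestamp is stored alongside each entry.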

packages/protocol/src/event-store.ts

  • 🟡 unsafe_assumptions (L551): setSessionState executes an UPDATE statement that silently does nothing if the session row doesn't exist in the database (no matching session_id). The method still logs 'session state transition' as if it succeeded. For new sessions where upsertSession hasn't been called yet, or if called with a stale/typo'd sessionId, the state write is silently lost. Consider checking changes on the run result. [fixable]
  • 🔵 style (L541): Invalid state transitions are logged at info level. In production, info logs are high-volume and these warnings could be lost. Consider using warn level (would require expanding the EventStoreLogger interface) or at minimum adding a distinguishing prefix/field (e.g., level: 'warn' in the meta object) so they're filterable. [fixable]
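
The suggestion to check `changes` on the run result could look roughly like the sketch below. It assumes the prepared statement's `run()` returns an object with a `changes` count (as better-sqlite3 does) and that the logger gains a `warn` method; both the `UpdateStatement` and `StateLogger` interfaces are hypothetical stand-ins, not the project's actual types.

```typescript
// Hypothetical sketch: detect an UPDATE that matched no rows.
interface RunResult {
  changes: number; // number of rows modified by the statement
}

interface UpdateStatement {
  run(state: string, updatedAt: number, sessionId: string): RunResult;
}

interface StateLogger {
  info(message: string, meta?: Record<string, unknown>): void;
  warn(message: string, meta?: Record<string, unknown>): void; // assumed extension
}

function writeSessionState(
  stmt: UpdateStatement,
  log: StateLogger,
  sessionId: string,
  newState: string,
  now: number = Date.now(),
): boolean {
  const result = stmt.run(newState, now, sessionId);
  if (result.changes === 0) {
    // No row with this session_id exists, so the write was lost.
    log.warn("session state write matched no rows", { sessionId, newState });
    return false;
  }
  log.info("session state transition", { sessionId, newState });
  return true;
}
```

Returning a boolean lets callers decide whether a lost write matters, and it gives a test for the nonexistent-session case something concrete to assert on.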

packages/protocol/__tests__/event-store.test.ts

  • 🔵 missing_tests: No test for calling setSessionState on a session that doesn't exist in the database. The UPDATE is a silent no-op, but the method still logs as if the transition succeeded. A test would document this edge case and catch regressions if the behavior changes. [fixable]

Comment thread server/query-loop.ts
firstEventReceived = true;
clearTimeout(firstEventTimer);
// Session state machine: mark ACTIVE on first SDK event (resume path)
const sid = resolvedSessionId || registry.get(clientId)?.sessionId;

🟡 bugs: For new (non-resume) sessions, resolvedSessionId is undefined at this point (set later at line 475 on the first assistant event), and registry.get(clientId)?.sessionId is also undefined (set at line 478 via setSessionId). So sid is undefined and setSessionState('ACTIVE') is silently skipped. New sessions never record CREATED, STARTING, or ACTIVE — only ENDED in the finally block. The state machine only tracks lifecycle for resumed sessions. [fixable]

Comment thread server/chat.ts
_onSessionChange?.(clientId, 'start');

// Session state machine: mark CREATED (Phase 1 — write only, no behavior change)
const stateSessionId = options.resume ?? session.sessionId;

🟡 bugs: For new sessions, options.resume is undefined and session.sessionId is not yet assigned (it's set later in query-loop.ts line 478 when the first assistant event arrives). So stateSessionId is undefined/falsy and the CREATED transition is skipped. Same issue at line 925 for STARTING. The Phase 1 state machine captures lifecycle only for resumed sessions. [fixable]

Comment thread packages/protocol/src/event-store.ts
}
}

this.stmts.setSessionState.run(newState, now, sessionId);

🟡 unsafe_assumptions: setSessionState executes an UPDATE statement that silently does nothing if the session row doesn't exist in the database (no matching session_id). The method still logs 'session state transition' as if it succeeded. For new sessions where upsertSession hasn't been called yet, or if called with a stale/typo'd sessionId, the state write is silently lost. Consider checking changes on the run result. [fixable]

Comment thread packages/protocol/src/event-store.ts
if (fromState && !opts?.force) {
const allowed = VALID_TRANSITIONS[fromState];
if (!allowed?.includes(newState)) {
this.log.info('invalid session state transition (warn-only)', {

🔵 style: Invalid state transitions are logged at info level. In production, info logs are high-volume and these warnings could be lost. Consider using warn level (would require expanding the EventStoreLogger interface) or at minimum adding a distinguishing prefix/field (e.g., level: 'warn' in the meta object) so they're filterable. [fixable]

Add detectStateMismatch() that checks consistency between in-memory
SessionRegistry and durable EventStore session state on every send
and interrupt. Logs mismatches as errors with full context (storeState,
registryHas, details) for debugging. No behavior change — observability
only, preparing for Phase 3 state-based routing.

- detectStateMismatch() checks registry↔state, attach↔state alignment
- Wired into handleSendV2 and handleInterruptV2 with span attributes
- State added to all routing decision log lines (storeState field)
- 14 new tests covering all mismatch scenarios

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dimakis dimakis changed the title from "feat(session): add session state machine — Phase 1 (write-only)" to "feat(session): add session state machine — Phase 1+2 (write + observability)" May 20, 2026

dimakis commented May 20, 2026

Centaur Review

Found 8 issue(s) (4 warning).

server/chat.ts

Solid Phase 1/2 implementation with good test coverage for the core state machine. Main gaps are missing state transitions in stopChat and closeSessionByUser paths, which will cause false positives in the Phase 2 mismatch detector and leave stale states in the durable store.

  • 🟡 bugs (L1250): closeSessionByUser doesn't set state to CLOSING when it has an active agent (the path at line 1283+). It injects the closeout prompt and sets a 2-minute abort timeout, but never calls setSessionState(sessionId, 'CLOSING', ...). The auto-closeout path in _closeoutSessionInner does set CLOSING. This means user-initiated closes will jump from ACTIVE → ENDED with no intermediate state, making the detectStateMismatch detector report a false mismatch during the closeout window — ACTIVE in the store while the session is winding down. [fixable]
  • 🟡 bugs (L1272): closeSessionByUser early-return path (no inputQueue, line 1258-1280) calls upsertSession({isActive: false}) and registry.remove() but never sets state to ENDED. The session will keep whatever state it had before (likely ACTIVE or CREATED). This creates a permanent state/isActive mismatch in the durable store. [fixable]
  • 🟡 bugs (L1321): stopChat aborts the session immediately via registry.abort(clientId) but never sets any state transition. The query-loop finally block will eventually set ENDED, but there's a race window where detectStateMismatch could fire between the abort and query loop cleanup. More importantly, if the session has no query loop running (no inputQueue/queryInstance), the finally block never fires and state is never set to ENDED. [fixable]
  • 🔵 bugs (L688): For new sessions (not resume), session.sessionId may be null at line 688 — it's populated later when the SDK returns the session ID in the query loop. The stateSessionId fallback uses options.resume ?? session.sessionId, but for new sessions both are null/undefined, so the CREATED state is never written. The STARTING state at line 927 has the same issue. The first state written will be ACTIVE in the query loop when resolvedSessionId is set. This means new sessions skip CREATED and STARTING states entirely — only resume sessions get the full lifecycle.
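
The missing transitions in the close and stop paths could be filled in along the lines of this sketch. `SessionHandle`, `hasQueryLoop`, and both function names are hypothetical stand-ins for the real session object and the inputQueue/queryInstance check; the real closeSessionByUser and stopChat do considerably more work.

```typescript
// Hypothetical sketch: write a state on every close/stop path.
type SessionState = "ACTIVE" | "CLOSING" | "ENDED";
type SetState = (sessionId: string, state: SessionState) => void;

interface SessionHandle {
  sessionId: string;
  hasQueryLoop: boolean; // stand-in for "an inputQueue/queryInstance exists"
}

function closeSessionByUserSketch(session: SessionHandle, setState: SetState): void {
  if (!session.hasQueryLoop) {
    // Early-return path: no finally block will run later, so write ENDED here.
    setState(session.sessionId, "ENDED");
    return;
  }
  // Active-agent path: mark CLOSING before the closeout prompt is injected.
  // The query loop's finally block is then expected to write ENDED.
  setState(session.sessionId, "CLOSING");
}

function stopChatSketch(
  session: SessionHandle,
  setState: SetState,
  abort: () => void,
): void {
  abort();
  if (!session.hasQueryLoop) {
    // Nothing else will record the end of this session.
    setState(session.sessionId, "ENDED");
  }
}
```

This does not remove the race window between the abort and the query loop's cleanup; it only ensures that sessions without a running query loop do not keep a stale state indefinitely.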

server/ws-handler-v2.ts

  • 🟡 unsafe_assumptions (L109): detectStateMismatch calls registry.findBySessionId which iterates all sessions, plus store.getSessionState which hits SQLite, on every send and interrupt message. This is the hot path — every user message and every interrupt triggers this. If session volume is high, the linear scan + SQLite read on every message could become a performance concern. Consider whether this should be sampled or rate-limited in Phase 2. [fixable]
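
If the per-message cost turns out to matter, one option is to rate-limit the check per session rather than run it on every send and interrupt. The sketch below is a hypothetical throttle with an injected clock; `MismatchCheckThrottle` and its interval are assumptions, and whether sampling is acceptable for Phase 2 is a judgment call for the author.

```typescript
// Hypothetical sketch: run the mismatch check at most once per interval per session.
class MismatchCheckThrottle {
  private readonly minIntervalMs: number;
  private lastRun = new Map<string, number>();

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Returns true when the caller should run the check now.
  shouldCheck(sessionId: string, now: number): boolean {
    const last = this.lastRun.get(sessionId);
    if (last !== undefined && now - last < this.minIntervalMs) {
      return false;
    }
    this.lastRun.set(sessionId, now);
    return true;
  }

  // Drop bookkeeping for a finished session so the map does not grow unbounded.
  forget(sessionId: string): void {
    this.lastRun.delete(sessionId);
  }
}
```

A caller would wrap the existing check, for example `if (throttle.shouldCheck(sessionId, Date.now())) { /* run mismatch detection */ }`. Throttling trades completeness for cost: a mismatch that appears and disappears inside one interval would go unreported.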

server/__tests__/ws-handler-v2.test.ts

  • 🔵 missing_tests: The detectStateMismatch tests cover the detection logic well, but there are no tests for the integration points — i.e., that handleSendV2 and handleInterruptV2 actually call detectStateMismatch and log errors when mismatches are found. The test for handleStopV2 doesn't verify state transitions either. [fixable]

packages/protocol/src/event-store.ts

  • 🔵 style (L530): setSessionState calls this.getSession(sessionId) (which fetches the full row) just to read the current state, even though this.stmts.getSessionState exists and is cheaper. Could use this.getSessionState(sessionId) instead for the validation check. [fixable]
  • 🔵 unsafe_assumptions (L549): this.log.info is called for invalid transitions with a message containing 'invalid'. The test at line 577 of the test file asserts messages.some(m => m.includes('invalid')), but log.info is called with an object as second arg ({ sessionId, fromState, ... }). The mock logger in the test only captures the first string arg. This works because the first arg is the string 'invalid session state transition (warn-only)', but the coupling between test assertion and log message string is fragile.
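
To loosen the coupling between the test and the log message text, a mock logger can capture the structured meta object and the test can assert on fields instead of substrings. This is a sketch under assumptions: `RecordingLogger` is hypothetical, and the `invalidTransition` flag is an invented field that the real log call does not necessarily include.

```typescript
// Hypothetical sketch: a mock logger that records message and meta together.
interface LogEntry {
  message: string;
  meta: Record<string, unknown> | undefined;
}

class RecordingLogger {
  readonly entries: LogEntry[] = [];

  info(message: string, meta?: Record<string, unknown>): void {
    this.entries.push({ message, meta });
  }
}

// Find a logged invalid transition by structured fields rather than message text.
function findInvalidTransition(
  entries: readonly LogEntry[],
  sessionId: string,
  fromState: string,
): LogEntry | undefined {
  return entries.find(
    (e) =>
      e.meta?.invalidTransition === true && // assumed marker field
      e.meta?.sessionId === sessionId &&
      e.meta?.fromState === fromState,
  );
}
```

With this in place, rewording the log message would not break the test, as long as the structured fields stay stable.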
