
Refactor ShapeStream state to be an explicit state machine #3816

Merged

kevin-dp merged 10 commits into main from kevin/state-machine on Feb 11, 2026

Conversation

@kevin-dp
Contributor

@kevin-dp kevin-dp commented Feb 9, 2026

Fixes #3785

This PR refactors the ShapeStream class into an explicit state machine, moving many state variables and code paths out of ShapeStream and into dedicated state classes.

Summary

Extracts the implicit sync state from ShapeStream into an explicit state machine (ShapeStreamState).

The original ShapeStream tracked sync state as ~12 flat private fields (#lastOffset, #shapeHandle, #isUpToDate, #liveCacheBuster, #schema, #lastSyncedAt, #lastSeenCursor, #consecutiveShortSseConnections, #sseFallbackToLongPolling, #staleCacheBuster, #staleCacheRetryCount, #state) with transition logic scattered across #onInitialResponse, #onMessages, #reset, #constructUrl, and #requestShape. This made it hard to reason about which fields were relevant in which phase of the sync lifecycle.

The new design replaces these with a single #syncState: ShapeStreamState field backed by an immutable state machine:

ShapeStreamState (abstract base)
  ├── ActiveState (abstract — shared field storage, response/message helpers)
  │   ├── FetchingState (abstract — shared Initial/Syncing/StaleRetry behavior)
  │   │   ├── InitialState
  │   │   ├── SyncingState
  │   │   └── StaleRetryState (+staleCacheBuster, +staleCacheRetryCount)
  │   ├── LiveState (+SSE tracking, live-specific response/up-to-date/URL handling)
  │   └── ReplayingState (+cursor, replay suppression logic)
  ├── PausedState (delegates to previousState)
  └── ErrorState (delegates to previousState)

Each state carries only the fields relevant to it and defines its own behavior:

  • Response handling — each active state has its own handleResponseMetadata (stale detection, field parsing, state-specific transitions)
  • Up-to-date handling — LiveState preserves SSE tracking, ReplayingState does cursor-based suppression, fetching states transition to LiveState
  • URL construction — applyUrlParams(url) lets each state add its own query parameters (offset, handle, cache busters) instead of the client branching on fields
  • SSE decisions — shouldUseSse() and handleSseConnectionClosed() live on LiveState where the tracking state is

ShapeStream is simplified to orchestration: it drives the request loop, handles errors, manages async coordination (pause/resume, snapshots, visibility), and delegates all sync state decisions to the state machine.
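
For illustration, here is a rough sketch of the delegation pattern (not the actual code in shape-stream-state.ts; names and details may differ): an abstract base plus the two wrapper states, showing how PausedState and ErrorState delegate to the state they wrap instead of copying fields.

type Offset = string

abstract class ShapeStreamState {
  abstract readonly kind: string
  abstract readonly offset: Offset
  abstract readonly handle: string | undefined
  abstract readonly isUpToDate: boolean

  // Each state contributes only the query parameters it owns (offset, handle,
  // cache busters), so ShapeStream no longer branches on flat fields.
  abstract applyUrlParams(url: URL): void

  // Wrapping transitions available from every state.
  pause(): PausedState {
    return new PausedState(this)
  }
  toErrorState(error: Error): ErrorState {
    return new ErrorState(this, error)
  }
}

class PausedState extends ShapeStreamState {
  readonly kind = `paused`
  constructor(readonly previousState: ShapeStreamState) {
    super()
  }
  get offset() { return this.previousState.offset }
  get handle() { return this.previousState.handle }
  get isUpToDate() { return this.previousState.isUpToDate }
  applyUrlParams(url: URL) { this.previousState.applyUrlParams(url) }
  resume(): ShapeStreamState { return this.previousState }
}

class ErrorState extends ShapeStreamState {
  readonly kind = `error`
  constructor(readonly previousState: ShapeStreamState, readonly error: Error) {
    super()
  }
  get offset() { return this.previousState.offset }
  get handle() { return this.previousState.handle }
  get isUpToDate() { return this.previousState.isUpToDate }
  applyUrlParams(url: URL) { this.previousState.applyUrlParams(url) }
  retry(): ShapeStreamState { return this.previousState }
}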

@kevin-dp kevin-dp force-pushed the kevin/state-machine branch from bd8df84 to 6631758 on February 9, 2026 at 16:10
@pkg-pr-new

pkg-pr-new bot commented Feb 10, 2026


npm i https://pkg.pr.new/@electric-sql/react@3816
npm i https://pkg.pr.new/@electric-sql/client@3816
npm i https://pkg.pr.new/@electric-sql/y-electric@3816

commit: 3f487f9

@netlify

netlify bot commented Feb 10, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit ba30fba
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/698af78045b29e0008f57f48
😎 Deploy Preview https://deploy-preview-3816--electric-next.netlify.app

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 88.51541% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.28%. Comparing base (091a232) to head (3f487f9).
⚠️ Report is 11 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ckages/typescript-client/src/shape-stream-state.ts 86.22% 35 Missing ⚠️
packages/typescript-client/src/client.ts 94.17% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3816      +/-   ##
==========================================
- Coverage   87.68%   87.28%   -0.40%     
==========================================
  Files          23       24       +1     
  Lines        2078     2305     +227     
  Branches      548      575      +27     
==========================================
+ Hits         1822     2012     +190     
- Misses        254      291      +37     
  Partials        2        2              
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 92.24% <88.51%> (-1.41%) ⬇️
packages/y-electric 56.05% <ø> (ø)
typescript 87.28% <88.51%> (-0.40%) ⬇️
unit-tests 87.28% <88.51%> (-0.40%) ⬇️



@KyleAMathews
Contributor

Test Review: Applying DSL & Property-Based Testing Ideas

The state machine extraction is a great structural move — pulling scattered mutable fields out of the 1800-line ShapeStream class into a pure, immutable state machine with explicit transitions. The current 42 tests verify individual transition edges competently, but I think we can dramatically improve coverage by applying ideas from this guide on building testing DSLs for complex systems.

Key ideas from the guide that apply here

The guide argues that when you have a well-defined state machine (which is exactly what this PR creates), example-based tests ("given this state and this input, expect that output") are the weakest form of verification. The stronger techniques are:

  1. Algebraic property testing — verify that operators satisfy mathematical properties (idempotence, round-trip, commutativity) across all states, not just the ones you remembered to test.
  2. Two-tier DSL design — a typed fluent builder for well-formed multi-step scenarios (making it easy to test journeys through the state graph), plus raw constructors for adversarial edge cases.
  3. History/trace-based verification — record execution traces and run invariant checkers at every step, shifting from "did I assert enough?" to "do my scenarios visit enough states?"
  4. Fuzz testing with shrinking — generate random event sequences with seeded RNGs for reproducibility, then verify invariants hold at every step. This explores the state space far beyond hand-written tests.
  5. Small-scope exhaustive exploration — for a state machine with 7 states and ~8 event types, you can exhaustively test all reachable (state, event) pairs within small bounds.

The core insight: you've built a pure, immutable, isolated state machine. This is the perfect candidate for these techniques — no mocking fetch or SSE, no async, just inputs → outputs. Don't leave that leverage on the table.


1. Scenario DSL (High Priority)

Bugs in state machines almost always come from unexpected sequences, not individual steps. The current tests check one transition at a time. A fluent scenario builder would let you test full journeys:

function scenario(initial?: Partial<SharedStateFields>) {
  let state: ShapeStreamState = createInitialState({ 
    offset: initial?.offset ?? `-1`,
    handle: initial?.handle 
  })
  const trace: Array<{ event: string; before: string; after: string }> = []

  const self = {
    response(input: Partial<ResponseMetadataInput>) {
      const before = state.kind
      const transition = state.handleResponseMetadata(makeResponseInput(input))
      state = transition.state
      trace.push({ event: `response`, before, after: state.kind })
      assertStateInvariants(state) // automatic invariant checking at every step
      return self
    },
    messages(input: Partial<MessageBatchInput>) { /* similar */ return self },
    pause() { /* ... */ return self },
    resume() { /* ... */ return self },
    error(msg: string) { /* ... */ return self },
    retry() { /* ... */ return self },
    reset(handle?: string) { /* ... */ return self },
    sseClose(input: Partial<SseCloseInput>) { /* ... */ return self },

    expectKind(kind: ShapeStreamStateKind) {
      expect(state.kind).toBe(kind)
      return self
    },
    expectUpToDate(expected: boolean) {
      expect(state.isUpToDate).toBe(expected)
      return self
    },
    expectHandle(h: string | undefined) {
      expect(state.handle).toBe(h)
      return self
    },

    done() { return { state, trace } }
  }
  return self
}

This enables readable multi-step tests:

it(`full lifecycle: initial → sync → live → pause → resume → error → retry`, () => {
  scenario()
    .response({ responseHandle: `h1`, responseOffset: `0_5` })
    .expectKind(`syncing`)
    .messages({ hasUpToDateMessage: true })
    .expectKind(`live`)
    .expectUpToDate(true)
    .pause()
    .expectKind(`paused`)
    .expectUpToDate(true)  // paused-from-live preserves isUpToDate
    .resume()
    .expectKind(`live`)
    .error(`connection lost`)
    .expectKind(`error`)
    .retry()
    .expectKind(`live`)
})

it(`stale CDN → retry → fresh response → sync → live`, () => {
  scenario()
    .response({ responseHandle: `stale-h`, expiredHandle: `stale-h` })
    .expectKind(`stale-retry`)
    .response({ responseHandle: `fresh-h`, responseOffset: `0_0` })
    .expectKind(`syncing`)
    .messages({ hasUpToDateMessage: true })
    .expectKind(`live`)
    .expectHandle(`fresh-h`)
})

The builder is the "well-formed scenario" tier. For adversarial testing, keep the raw constructors (new SyncingState(...)) to create states the builder wouldn't normally produce (e.g., PausedState wrapping PausedState).


2. Algebraic Property Tests (High Priority)

Pause/resume, error/retry, withHandle, and markMustRefetch should be verified for every state, not just the ones that happened to get a test:

const allStates = (): ShapeStreamState[] => {
  const shared = makeShared()
  return [
    createInitialState({ offset: `-1` }),
    new SyncingState(shared),
    new LiveState(shared),
    new ReplayingState({ ...shared, replayCursor: `c1` }),
    new StaleRetryState({ ...shared, staleCacheBuster: `cb`, staleCacheRetryCount: 1 }),
    new LiveState(shared).pause(),
    new SyncingState(shared).toErrorState(new Error(`test`)),
  ]
}

describe(`algebraic properties`, () => {
  it.each(allStates().map(s => [s.kind, s]))(
    `%s: pause().resume() round-trips`,
    (_kind, state) => {
      const roundTripped = state.pause().resume()
      expect(roundTripped.kind).toBe(state.kind)
      expect(roundTripped.handle).toBe(state.handle)
      expect(roundTripped.offset).toBe(state.offset)
      expect(roundTripped.isUpToDate).toBe(state.isUpToDate)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: toErrorState(e).retry() round-trips`,
    (_kind, state) => {
      const roundTripped = state.toErrorState(new Error(`x`)).retry()
      expect(roundTripped.kind).toBe(state.kind)
      expect(roundTripped.handle).toBe(state.handle)
      expect(roundTripped.offset).toBe(state.offset)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: markMustRefetch always → InitialState with offset=-1`,
    (_kind, state) => {
      const reset = state.markMustRefetch(`new-h`)
      expect(reset).toBeInstanceOf(InitialState)
      expect(reset.offset).toBe(`-1`)
      expect(reset.handle).toBe(`new-h`)
      expect(reset.schema).toBeUndefined()
      expect(reset.isUpToDate).toBe(false)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: withHandle changes only handle`,
    (_kind, state) => {
      const updated = state.withHandle(`changed`)
      expect(updated.handle).toBe(`changed`)
      expect(updated.offset).toBe(state.offset)
      expect(updated.kind).toBe(state.kind)
      expect(updated.isUpToDate).toBe(state.isUpToDate)
    }
  )
})

3. Random Sequence Fuzzing (Medium Priority)

Generate random event sequences and verify invariants hold at every step. A single fuzz run like this explores more of the state space than all 42 hand-written tests combined:

function applyEvent(state: ShapeStreamState, event: Event): ShapeStreamState {
  switch (event.type) {
    case `response`: return state.handleResponseMetadata(event.input).state
    case `messages`: return state.handleMessageBatch(event.input).state
    case `pause`: return state.pause()
    case `resume`: return state instanceof PausedState ? state.resume() : state
    case `error`: return state.toErrorState(new Error(`fuzz`))
    case `retry`: return state instanceof ErrorState ? state.retry() : state
    case `markMustRefetch`: return state.markMustRefetch()
    case `sseClose`: return state.handleSseConnectionClosed(event.input).state
  }
}

function checkInvariants(state: ShapeStreamState) {
  expect(state).toBeDefined()
  expect([`initial`,`syncing`,`live`,`replaying`,`stale-retry`,`paused`,`error`]).toContain(state.kind)
  expect(typeof state.offset).toBe(`string`)
  
  // Only LiveState (or delegates to it) should be up-to-date
  if ([`initial`, `syncing`, `stale-retry`, `replaying`].includes(state.kind)) {
    expect(state.isUpToDate).toBe(false)
  }
  
  // staleCacheBuster only present in StaleRetryState (or delegates)
  if (![`stale-retry`, `paused`, `error`].includes(state.kind)) {
    expect(state.staleCacheBuster).toBeUndefined()
  }
}

it(`survives 1000 random 50-step sequences without invariant violations`, () => {
  for (let seed = 0; seed < 1000; seed++) {
    let state: ShapeStreamState = createInitialState({ offset: `-1` })
    const rng = mulberry32(seed) // seeded PRNG for reproducibility
    for (let step = 0; step < 50; step++) {
      const event = randomEvent(rng)
      state = applyEvent(state, event)
      checkInvariants(state)
    }
  }
})

When a seed fails, you have a fully reproducible failing sequence you can minimize.
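
The fuzz loop above assumes two helpers that aren't defined in this review: a seeded PRNG and a random event generator. A possible sketch (mulberry32 is a well-known tiny seeded PRNG; the Event type mirrors what applyEvent expects, and makeResponseInput / makeMessageBatchInput / makeSseCloseInput are the same assumed test factories used in the examples above):

type Event =
  | { type: `response`; input: ResponseMetadataInput }
  | { type: `messages`; input: MessageBatchInput }
  | { type: `sseClose`; input: SseCloseInput }
  | { type: `pause` | `resume` | `error` | `retry` | `markMustRefetch` }

// Tiny 32-bit seeded PRNG: same seed → same sequence, so failures replay exactly.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

const EVENT_TYPES = [
  `response`, `messages`, `pause`, `resume`,
  `error`, `retry`, `markMustRefetch`, `sseClose`,
] as const

function randomEvent(rng: () => number): Event {
  const type = EVENT_TYPES[Math.floor(rng() * EVENT_TYPES.length)]
  switch (type) {
    // Inputs come from the assumed test factories; their exact shapes are placeholders.
    case `response`:
      return { type, input: makeResponseInput({}) }
    case `messages`:
      return { type, input: makeMessageBatchInput({ hasUpToDateMessage: rng() < 0.5 }) }
    case `sseClose`:
      return { type, input: makeSseCloseInput({}) }
    default:
      return { type }
  }
}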


4. Specific Missing Edge Cases (High Priority)

Even without the DSL/fuzzing infrastructure, these gaps should be filled now:

Double-pause nesting — potential bug if ShapeStream accidentally calls pause() twice:

it(`double pause creates nested PausedState — resume only unwraps one layer`, () => {
  const live = new LiveState(makeShared())
  const paused1 = live.pause()
  const paused2 = paused1.pause()
  expect(paused2).toBeInstanceOf(PausedState)
  const resumed1 = paused2.resume()
  expect(resumed1).toBeInstanceOf(PausedState) // still paused once
})

(Consider: should pause() be idempotent on PausedState? If so, that's a code change.)

204 response handling:

it(`204 response sets lastSyncedAt`, () => {
  const syncing = new SyncingState(makeShared({ lastSyncedAt: undefined }))
  const transition = syncing.handleResponseMetadata(
    makeResponseInput({ status: 204, now: 1700000000 })
  )
  expect(transition.state.lastSyncedAt).toBe(1700000000)
})

SSE vs. non-SSE offset handling:

it(`SSE up-to-date message updates offset`, () => {
  const syncing = new SyncingState(makeShared({ offset: `0_0` }))
  const transition = syncing.handleMessageBatch(
    makeMessageBatchInput({ isSse: true, upToDateOffset: `5_3` as Offset })
  )
  expect(transition.state.offset).toBe(`5_3`)
})

it(`non-SSE up-to-date message does NOT update offset`, () => {
  const syncing = new SyncingState(makeShared({ offset: `0_0` }))
  const transition = syncing.handleMessageBatch(
    makeMessageBatchInput({ isSse: false, upToDateOffset: `5_3` as Offset })
  )
  expect(transition.state.offset).toBe(`0_0`)
})

Schema set-once semantics:

it(`schema is only set once (first response wins)`, () => {
  const initial = createInitialState({ offset: `-1` })
  const t1 = initial.handleResponseMetadata(
    makeResponseInput({ responseSchema: { id: { type: `int4` } } })
  )
  const t2 = t1.state.handleResponseMetadata(
    makeResponseInput({ responseSchema: { name: { type: `text` } } })
  )
  expect(t2.state.schema).toEqual({ id: { type: `int4` } })
})

Events on Paused/Error states (defensive no-ops):

it(`PausedState.handleResponseMetadata returns ignored`, () => {
  const paused = new SyncingState(makeShared()).pause()
  const transition = paused.handleResponseMetadata(makeResponseInput())
  expect(transition.action).toBe(`ignored`)
})

it(`ErrorState.handleMessageBatch returns no-op`, () => {
  const errored = new SyncingState(makeShared()).toErrorState(new Error(`x`))
  const transition = errored.handleMessageBatch(makeMessageBatchInput())
  expect(transition.suppressBatch).toBe(false)
  expect(transition.state).toBe(errored)
})

Summary

| Priority | What | Why |
| --- | --- | --- |
| High | Scenario DSL builder | Tests sequences, not just individual transitions — most bugs hide in sequences |
| High | Algebraic property tests over all states | pause/resume, error/retry, withHandle, markMustRefetch should hold universally |
| High | Missing edge cases (double-pause, 204, schema set-once, SSE offset) | Direct gaps in the current suite |
| Medium | Random sequence fuzzing | Explores the state space far beyond hand-written tests |
| Medium | Invariant checker at every transition | Catches violations early, makes trace failures debuggable |

The current test suite is a solid "did I implement this correctly?" check. What these techniques add is a "can anything break this?" check. The state machine is pure and immutable — it's the perfect candidate for property-based and trace-based testing. That's the whole payoff of extracting it from ShapeStream.

@KyleAMathews
Contributor

Test Review Part 2: The Glue Layer Between State Machine and Real World

The first review focused on testing the state machine in isolation — algebraic properties, scenario DSLs, fuzz testing. This review addresses the other major risk surface: the adapter code in client.ts that connects the pure state machine to messy real-world events (HTTP headers, SSE messages, abort signals, visibility changes).

The state machine refactoring creates a clean seam, but that seam is also where the new risk concentrates. The state machine is now a pure function: inputs → (new state, transition metadata). But the glue code has two jobs that are both undertested:

  1. Extracting the right inputs from HTTP responses, SSE events, and abort signals
  2. Interpreting transition results into the right side effects (throw, console.warn, abort, notify subscribers, sleep for backoff)

The state machine unit tests verify the engine is correct; they say nothing about whether the steering wheel is connected to the wheels.


Mapping the Risk Surface

I traced every this.#syncState access in client.ts — 38 total: 9 writes (state transitions) and 29 reads. Each write is an on-ramp where a real-world event gets translated into a state machine call. Here are the six specific risk zones:


Risk 1: Input Extraction at #onInitialResponse (line ~1058)

This is the adapter between raw Response and handleResponseMetadata():

const transition = this.#syncState.handleResponseMetadata({
  status,
  responseHandle: shapeHandle,                                       // headers.get(SHAPE_HANDLE_HEADER)
  responseOffset: headers.get(CHUNK_LAST_OFFSET_HEADER) as Offset,   // cast!
  responseCursor: headers.get(LIVE_CACHE_BUSTER_HEADER),
  responseSchema: getSchemaFromHeaders(headers),
  expiredHandle,                                                     // from expiredShapesCache lookup
  now: Date.now(),
  maxStaleCacheRetries: this.#maxStaleCacheRetries,
  createCacheBuster: () => `${Date.now()}-${Math.random()...}`,
})

No existing test verifies these extractions are correct. The state machine unit tests use makeResponseInput() which constructs the input object directly — bypassing all header parsing. If someone changes a header constant, breaks the as Offset cast, or alters getSchemaFromHeaders, the state machine gets wrong inputs and its own unit tests won't catch it.

Risk 2: Transition Result Branching

After handleResponseMetadata() returns, client.ts branches on transition.action:

  • stale-retry → cancel body, maybe throw FetchError(502), else console.warn + throw StaleCacheError
  • ignored → console.warn + early return (skip body processing entirely)
  • accepted → continue to parse body

And after handleMessageBatch() returns, it branches on transition.suppressBatch:

  • true → return early, skipping subscriber notification AND upToDateTracker.recordUpToDate
  • false → publish to subscribers, record up-to-date

No test isolates these branches against specific transition outcomes. The integration tests in client.test.ts test overall behavior (error recovery, shape rotation), but they can't distinguish whether a bug is in the state machine or in the branching logic.

Risk 3: The #pauseRequested / PausedState Dual-State Protocol

This is the most complex glue in the file. There are two parallel pause mechanisms:

#pause():
  sets #pauseRequested = true
  aborts requestAbortController with PAUSE_STREAM
      ↓
#requestShape() entry: 
  if #pauseRequested → syncState.pause(), clear flag, return
      OR
catch FetchBackoffAbortError:
  if abort reason === PAUSE_STREAM && #pauseRequested → syncState.pause(), clear flag
      ↓
#resume():
  clears #pauseRequested
  calls #start()  (does NOT call syncState.resume() — see below)
      ↓
#requestShape() entry:
  if syncState instanceof PausedState → resumingFromPause = true, syncState.resume()

The comment at line 1356 explains why #resume() doesn't immediately transition the state machine — it defers to #requestShape() so it can detect resumingFromPause and avoid live long-polling. This is a deliberate split-brain design with a subtle invariant: the #pauseRequested flag and PausedState must stay coordinated across async boundaries.

Existing coverage: client.test.ts tests visibility-based pause/resume, but doesn't test:

  • Rapid pause→resume before the request loop ticks (flag cleared without PausedState ever being created)
  • Pause during active fetch vs. pause between fetches (two different catch paths)
  • Resume while #pauseRequested is true but not yet consumed

Risk 4: #isMidStream Parallel State

#isMidStream is managed outside the state machine but must stay consistent with it:

// In #onMessages:
this.#isMidStream = true        // line 1127 — set BEFORE state machine call

const transition = this.#syncState.handleMessageBatch({...})
this.#syncState = transition.state

if (hasUpToDateMessage) {
  this.#isMidStream = false      // line 1140 — set after state machine call
  
  if (transition.suppressBatch) {
    return                        // line 1145 — early return, subscribers NOT notified
  }
  // ... record up-to-date, publish to subscribers
}

When a batch is suppressed (replay mode with unchanged cursor), #isMidStream gets toggled true → false and the promise resolver fires, but subscribers aren't notified. Is this the intended behavior? The #isMidStream toggle plus promise resolution are side effects that happen regardless of suppression — if any code awaits the mid-stream promise expecting subscriber notification to follow, it won't.

Risk 5: Snapshot 409 — withHandle() vs markMustRefetch()

At line ~1707 (in fetchSnapshot), a 409 uses:

this.#syncState = this.#syncState.withHandle(nextHandle)  // handle only

While the main stream's 409 handler at line ~1578 uses:

this.#syncState = this.#syncState.markMustRefetch(handle)  // full reset

This distinction is critical — snapshot 409s should NOT reset offset/schema/etc because the main stream is paused and should not be disturbed. No test verifies this distinction. If someone "simplifies" the snapshot path to use markMustRefetch (thinking it's equivalent), the main stream state gets wiped.

Risk 6: SSE Close → Backoff Glue

The finally block at line ~1297:

const transition = this.#syncState.handleSseConnectionClosed({...})
this.#syncState = transition.state

if (transition.fellBackToLongPolling) {
  console.warn(...)
} else if (transition.wasShortConnection) {
  const maxDelay = Math.min(
    this.#sseBackoffMaxDelay,
    this.#sseBackoffBaseDelay * Math.pow(2, this.#syncState.consecutiveShortSseConnections)
    //                                       ^^^^^^^^^^^^^^^^ reads NEW state
  )
  const delayMs = Math.floor(Math.random() * maxDelay)
  await new Promise((resolve) => setTimeout(resolve, delayMs))
}

The backoff reads consecutiveShortSseConnections from the post-transition state. The state machine tests verify the counter increments correctly, but nothing verifies the glue code correctly uses that counter for the delay. No test covers this path.


What Existing Tests Cover

| Test file | Covers | Glue layer gaps |
| --- | --- | --- |
| shape-stream-state.test.ts | Pure state machine transitions | Doesn't touch client.ts at all |
| client.test.ts | Error recovery w/ onError, visibility pause/resume, shape rotation, isConnected, isLoading | No header extraction, no transition branching, no rapid pause/resume, no snapshot 409 distinction |
| integration.test.ts | End-to-end with real server | Can't isolate glue bugs from state machine bugs or server bugs |
| stream.test.ts | URL construction, column mapping | No state transitions |
| fetch.test.ts | Fetch wrapper retries, backoff, prefetch | No state machine interaction |

The gap: there are no tests sitting between the state machine unit tests and the full integration tests. Nothing tests the adapter layer in isolation.


Proposed Tests

A. Input Extraction Contract Tests (High Priority)

Verify that #onInitialResponse correctly maps HTTP headers to state machine inputs:

describe(`glue: #onInitialResponse header extraction`, () => {
  it(`maps HTTP headers to state machine input fields`, async () => {
    const { stream, nextFetch } = createMockShapeStream()

    nextFetch.respond({
      status: 200,
      headers: {
        [SHAPE_HANDLE_HEADER]: `test-handle`,
        [CHUNK_LAST_OFFSET_HEADER]: `5_3`,
        [LIVE_CACHE_BUSTER_HEADER]: `cursor-42`,
        // + schema headers
      },
      body: `[]`,
    })

    await stream.waitForNextTick()

    expect(stream.shapeHandle).toBe(`test-handle`)
    expect(stream.lastOffset).toBe(`5_3`)
  })

  it(`looks up expired handle from cache and triggers stale-retry`, async () => {
    // Pre-populate expiredShapesCache with a handle
    // Respond with that same handle
    // Verify: StaleCacheError thrown, console.warn emitted
  })

  it(`204 response sets lastSyncedAt via state machine`, async () => {
    // Respond with 204
    // Verify: lastSyncedAt() returns a recent timestamp
  })
})

B. Transition Branch Tests (High Priority)

For each transition.action value, verify the correct side effect:

describe(`glue: transition result branching`, () => {
  it(`stale-retry cancels body and throws StaleCacheError`, async () => {
    // Setup: respond with handle matching expired handle, no local handle
    // Verify: response.body.cancel() called, StaleCacheError thrown
  })

  it(`stale-retry exceeding max retries throws FetchError 502`, async () => {
    // Setup: trigger stale-retry more than maxStaleCacheRetries times
    // Verify: FetchError with status 502
  })

  it(`ignored stale response logs warning and skips body processing`, async () => {
    // Setup: local handle exists, respond with different expired handle
    // Verify: console.warn includes "Ignoring", no subscriber notification from this response
  })

  it(`suppressBatch skips subscriber notification but resolves midStream promise`, async () => {
    // Setup: enter replay mode, respond with up-to-date + unchanged cursor
    // Verify: subscriber NOT called, but midStream promise resolves
  })
})

C. Pause/Resume Protocol Tests (High Priority)

describe(`glue: pause/resume protocol`, () => {
  it(`pause during idle → next requestShape creates PausedState`, async () => {
    const { stream } = createLiveShapeStream()
    stream.triggerPause()
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(true)
  })

  it(`rapid pause→resume before request loop: no PausedState created`, async () => {
    const { stream } = createLiveShapeStream()
    stream.triggerPause()
    stream.triggerResume()  // immediately, before #requestShape runs
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(false)
    // Verify: state was never PausedState
  })

  it(`pause during active fetch: abort caught, transitions to PausedState`, async () => {
    const { stream, hangingFetch } = createMockShapeStream()
    hangingFetch()  // fetch that never resolves until aborted
    stream.triggerPause()
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(true)
  })

  it(`resume detects resumingFromPause, avoids live long-poll param`, async () => {
    const { stream, nextFetch, getLastFetchUrl } = createPausedLiveShapeStream()
    stream.triggerResume()
    nextFetch.respond({ /* up-to-date response */ })
    const url = getLastFetchUrl()
    expect(url.searchParams.has(`live`)).toBe(false)  // no long-poll!
  })

  it(`resume with aborted user signal: stays paused`, async () => {
    const controller = new AbortController()
    const { stream } = createPausedShapeStream({ signal: controller.signal })
    controller.abort()
    stream.triggerResume()
    expect(stream.isPaused()).toBe(true)
  })
})

D. Snapshot 409 Distinction Test (Medium Priority)

describe(`glue: snapshot 409 uses withHandle not markMustRefetch`, () => {
  it(`updates handle but preserves offset and schema`, async () => {
    const { stream, triggerSnapshot409 } = createLiveShapeStream({
      handle: `h1`, offset: `5_3`
    })

    triggerSnapshot409({ newHandle: `h2` })
    await stream.waitForNextTick()

    expect(stream.shapeHandle).toBe(`h2`)    // updated
    expect(stream.lastOffset).toBe(`5_3`)    // NOT reset to -1
    expect(stream.isUpToDate).toBe(true)     // NOT reset to false
  })
})

E. Dual-State Consistency Invariants (Medium Priority)

Add an invariant checker that can be wired into the mock harness:

function assertGlueConsistency(stream: ShapeStream) {
  // isLoading is the inverse of isUpToDate
  expect(stream.isLoading()).toBe(!stream.isUpToDate)

  // isPaused should only be true when syncState is PausedState
  // (not when #pauseRequested is true but not yet consumed)
  if (stream.isPaused()) {
    // state machine should be in PausedState
    // #pauseRequested should be false (consumed)
  }
}

Run this after every mock response and every pause/resume operation.

F. SSE Backoff Glue Test (Low Priority)

describe(`glue: SSE close → backoff computation`, () => {
  it(`short connection triggers sleep with exponential delay`, async () => {
    // Mock setTimeout to capture delay
    // Trigger SSE connection that closes after 50ms (< minSseConnectionDuration)
    // Verify: setTimeout called with delay based on 2^consecutiveShortSseConnections
  })

  it(`fallback to long polling emits warning`, async () => {
    // Trigger maxShortSseConnections consecutive short connections
    // Verify: console.warn about proxy buffering
    // Verify: next request does NOT include SSE params
  })
})

The Testing Harness

All of the above require a mock fetch harness that sits between the state machine unit tests and the full integration tests. The pattern already exists in stream.test.ts (with fetchWrapper), but needs to be extended to:

  1. Queue responses — nextFetch.respond({status, headers, body}) for sequencing multi-step scenarios
  2. Hang fetches — hangingFetch() returns a promise that never resolves (for testing pause during active fetch)
  3. Capture requests — getLastFetchUrl() to verify URL params the glue code constructed
  4. Expose internals — access to isPaused(), isUpToDate, lastOffset, shapeHandle for assertions

This harness would make it trivial to write targeted tests for each glue-layer risk zone without the overhead of a real server.
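
A minimal version of that harness could be as small as the following sketch, assuming ShapeStream can be handed a custom fetch implementation (as the existing fetchWrapper pattern suggests); the respond / getLastFetchUrl names are just the wish-list items above, not an existing API:

type QueuedResponse = { status?: number; headers?: Record<string, string>; body?: string }

function createMockFetchHarness() {
  const pending: Array<(res: QueuedResponse) => void> = []
  const requests: URL[] = []

  // Every fetch the stream makes is parked until the test answers it.
  const fetchClient = (input: RequestInfo | URL): Promise<Response> => {
    const url = input instanceof Request ? input.url : input.toString()
    requests.push(new URL(url))
    return new Promise<Response>((resolve) => {
      pending.push(({ status = 200, headers = {}, body = `[]` }) =>
        resolve(new Response(status === 204 ? null : body, { status, headers }))
      )
    })
  }

  return {
    fetchClient,
    // 1. Queue responses: answer the oldest in-flight request
    respond(res: QueuedResponse) {
      const next = pending.shift()
      if (!next) throw new Error(`no pending fetch to respond to`)
      next(res)
    },
    // 2. Hang fetches: simply never call respond(); the request stays parked
    pendingRequestCount: () => pending.length,
    // 3. Capture requests: inspect the URL the glue code constructed
    getLastFetchUrl: () => requests[requests.length - 1],
  }
}

Point 4 (exposing internals like isPaused() or the current handle) would still need test-only accessors on ShapeStream itself, or assertions phrased purely in terms of observable behavior (which requests were made, which subscribers were notified).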


Summary

| Priority | Risk Zone | What to Test | Current Coverage |
| --- | --- | --- | --- |
| High | Input extraction (#onInitialResponse) | HTTP headers → state machine input mapping | None — unit tests bypass this entirely |
| High | Transition branching | stale-retry / ignored / suppressBatch → correct side effects | Indirect only via integration tests |
| High | Pause/resume protocol | #pauseRequested ↔ PausedState synchronization, race conditions | Partial — only visibility-based pause/resume |
| Medium | Snapshot 409 distinction | withHandle() preserves offset/schema vs markMustRefetch() resets | None |
| Medium | Dual-state consistency | #isMidStream, #connected, isLoading stay in sync with #syncState | None |
| Low | SSE backoff glue | Duration → backoff delay computation, long-polling fallback | None |

The state machine extraction was the right move — it makes the core logic testable in isolation. The next step is testing the wiring harness that connects it to the real world. A mock fetch harness plus these targeted tests would close the gap between "state machine is correct" and "the system behaves correctly."

@KyleAMathews
Contributor

Test Review Part 3: Concurrency, Parallel State, and Delegation Depth

Parts 1 and 2 covered the state machine in isolation and the glue layer between the state machine and real-world events. This final review covers three remaining risk areas: concurrent pause/resume interactions, the parallel state that didn't move into the state machine, and delegation chain depth in PausedState/ErrorState.


Risk 7: Snapshot ↔ Visibility Pause/Resume Interaction

The snapshot request flow pauses the main stream, fetches data, then resumes:

// requestSnapshot (line ~1597)
this.#activeSnapshotRequests++
if (this.#activeSnapshotRequests === 1) {
  this.#pause()                          // pause main stream
}
const { metadata, data } = await this.fetchSnapshot(opts)
// ... inject data ...
finally {
  this.#activeSnapshotRequests--
  if (this.#activeSnapshotRequests === 0) {
    this.#resume()                       // resume main stream
  }
}

The visibility handler also calls #pause() / #resume():

const visibilityHandler = () => {
  if (document.hidden) this.#pause()
  else this.#resume()
}

These can interleave in a problematic way:

  1. Tab is visible, stream is Live
  2. requestSnapshot() → counter=1, calls #pause()
  3. Tab goes hidden → visibilityHandler calls #pause() → guard prevents double-pause ✓
  4. fetchSnapshot() completes
  5. Counter goes to 0, calls #resume() — stream resumes, even though tab is still hidden

#resume() doesn't check tab visibility — it unconditionally resumes if the state is paused or #pauseRequested is true. So the snapshot's finally block can override the visibility system's pause.

The reverse is also problematic:

  1. Snapshot in progress, stream paused for snapshot
  2. Tab goes hidden → #pause() no-ops (already paused)
  3. Tab goes visible → #resume() — stream resumes while snapshot is still in flight
  4. Now the main stream and snapshot are both running concurrently, both potentially writing #syncState

This isn't a new bug introduced by this PR — the pause/resume control flow is unchanged — but it IS an untested interaction that could produce state inconsistencies. The state machine refactoring makes it easier to expose via targeted tests.

Proposed test:

describe(`snapshot ↔ visibility interaction`, () => {
  it(`snapshot resume while tab hidden: stream should stay paused`, async () => {
    const { stream, mockVisibility, triggerSnapshot } = createMockShapeStream()
    
    mockVisibility.hide()  // tab goes hidden → stream paused
    await triggerSnapshot() // snapshot pauses (no-op), fetches, resumes
    
    // After snapshot completes, stream should STILL be paused
    // because tab is still hidden
    expect(stream.isPaused()).toBe(true)  // THIS LIKELY FAILS — revealing the bug
  })

  it(`visibility resume during snapshot: stream should stay paused until snapshot completes`, async () => {
    const { stream, mockVisibility, hangingSnapshot } = createMockShapeStream()
    
    hangingSnapshot()       // snapshot starts, pauses stream
    mockVisibility.show()   // tab visible → resume called
    
    // Stream should NOT resume while snapshot is in flight
    expect(stream.isPaused()).toBe(true)
  })
})

Risk 8: The #isRefreshing Flag and queueMicrotask Timing

#isRefreshing has two different lifecycle patterns:

Pattern A — forceDisconnectAndRefresh() (line ~1453):

this.#isRefreshing = true
this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
await this.#nextTick()          // async wait
this.#isRefreshing = false      // cleared after await

Pattern B — system wake detection (line ~1553):

this.#isRefreshing = true
this.#requestAbortController.abort(SYSTEM_WAKE)
queueMicrotask(() => {
  this.#isRefreshing = false    // cleared via microtask
})

If wake detection fires during a forceDisconnectAndRefresh():

  1. forceDisconnectAndRefresh sets #isRefreshing = true, aborts, awaits #nextTick()
  2. setInterval fires wake detection → sets #isRefreshing = true (already true, no-op)
  3. Queues microtask to clear → #isRefreshing = false
  4. #nextTick() resolves → forceDisconnectAndRefresh sets #isRefreshing = false (already false)

Between step 3 and 4, there's a window where #isRefreshing is false even though forceDisconnectAndRefresh hasn't completed. During this window, if a new fetch starts, applyUrlParams would see canLongPoll: true (because !this.#isRefreshing is true), which is wrong — we should NOT long-poll during a refresh.

This is a narrow timing window and may never manifest in practice, but it demonstrates the general fragility of managing #isRefreshing through two different async clearing mechanisms. No test covers this.

Proposed test:

it(`concurrent forceDisconnectAndRefresh + wake detection: isRefreshing stays true`, async () => {
  const { stream, triggerWake, getNextFetchUrl } = createMockShapeStream()
  
  const refreshPromise = stream.forceDisconnectAndRefresh()
  triggerWake()  // fire wake detection while refresh is in progress
  
  // Before refresh completes, any URL construction should see isRefreshing=true
  // (i.e., canLongPoll should be false)
  const url = getNextFetchUrl()
  expect(url.searchParams.has('live')).toBe(false)
  
  await refreshPromise
})

Risk 9: PausedState/ErrorState Delegation Chain Depth

PausedState wraps any state:

pause(): PausedState {
  return new PausedState(this)  // wraps current state
}

ErrorState also wraps:

toErrorState(error: Error): ErrorState {
  return new ErrorState(this, error)
}

There's nothing in the state machine preventing nesting:

  • state.pause().pause() → PausedState(PausedState(original))
  • state.toErrorState(e1).toErrorState(e2) → ErrorState(ErrorState(original, e1), e2)
  • state.pause().toErrorState(e) → ErrorState(PausedState(original), e)

The ShapeStream.#pause() guard (!(this.#syncState instanceof PausedState)) prevents double-pause at the ShapeStream level. But the state machine itself is unprotected — it's a library that could be used elsewhere or called from unexpected paths.

Specific concern with double-pause:

// If somehow pause() is called twice:
const paused2 = state.pause().pause()
paused2.resume()  // returns PausedState(original), NOT original
// Need resume() twice to get back to original

And with error-during-pause:

const errored = state.pause().toErrorState(err)
errored.retry()   // returns PausedState(original)
// Now we're in PausedState — is this intended?
// #resume() would need to be called to actually resume

That error-during-pause case is actually interesting — if the stream is paused and an error occurs (maybe from a snapshot request), retry() returns the paused state. Is this the right behavior? It preserves the pause, which seems correct. But it's an interaction that should be explicitly tested.

Proposed tests:

describe(`delegation chain edge cases`, () => {
  it(`error during pause: retry returns PausedState`, () => {
    const live = new LiveState(makeShared())
    const paused = live.pause()
    const errored = paused.toErrorState(new Error('snapshot failed'))
    
    const retried = errored.retry()
    expect(retried).toBeInstanceOf(PausedState)
    expect(retried.isUpToDate).toBe(true)  // paused-from-live
    
    // Resume from the paused state should get back to live
    expect((retried as PausedState).resume()).toBeInstanceOf(LiveState)
  })

  it(`markMustRefetch from error-during-pause: resets to InitialState (not PausedState)`, () => {
    const live = new LiveState(makeShared())
    const errored = live.pause().toErrorState(new Error('x'))
    
    const reset = errored.markMustRefetch('new-handle')
    expect(reset).toBeInstanceOf(InitialState)  // fully unwrapped
    expect(reset.handle).toBe('new-handle')
  })

  it(`double-wrap protection: pause() on PausedState creates nested wrapper`, () => {
    const live = new LiveState(makeShared())
    const paused1 = live.pause()
    const paused2 = paused1.pause()
    
    // This creates a double-wrapped PausedState
    expect(paused2).toBeInstanceOf(PausedState)
    expect(paused2.resume()).toBeInstanceOf(PausedState)  // only unwraps one layer
    expect((paused2.resume() as PausedState).resume()).toBeInstanceOf(LiveState)
    
    // Consider: should pause() be idempotent on PausedState?
    // If so, this is a design decision worth making explicit.
  })
})

Risk 10: The Parallel State That Didn't Move

The state machine extracted the sync-related fields, but ShapeStream still maintains its own parallel state:

| Field | Lifecycle | Used for |
| --- | --- | --- |
| #connected | Set true in #fetchShape, false in error paths + #reset() + normal completion | isConnected() public API |
| #isMidStream | true on messages, false on up-to-date, true on reset | #waitForStreamEnd() for snapshot coordination |
| #isRefreshing | true before abort, false after next tick / via microtask | shouldUseSse(), canLongPoll in URL params |
| #pauseRequested | true by #pause(), consumed by #requestShape() | Coordinates async pause with state machine |
| #activeSnapshotRequests | Incremented/decremented around snapshot requests, reset in #reset() | Coordinates pause/resume for concurrent snapshots |
| #started | true in #start(), false before retry | Guards against multiple starts |

These form their own implicit state machine with invariants that should hold:

  • #connected should be true only when a fetch/SSE connection is active
  • #isMidStream should be true only between receiving data messages and the up-to-date control message
  • #isRefreshing should be true only during a brief abort→reconnect window
  • #pauseRequested should be true only between #pause() and the next #requestShape() iteration
  • #activeSnapshotRequests should never go negative

None of these invariants are tested. And the interactions between them are subtle — for example, #reset() clears #activeSnapshotRequests to 0, but the comment at line 1696 explicitly says snapshot 409s DON'T call #reset() to avoid breaking the counter. This constraint is enforced by convention, not by tests.

Proposed: parallel state invariant checker

function assertParallelStateInvariants(stream: TestableShapeStream) {
  // activeSnapshotRequests is never negative
  expect(stream.activeSnapshotRequests).toBeGreaterThanOrEqual(0)
  
  // If stream is paused and not mid-snapshot, pauseRequested should be false
  // (it should have been consumed by #requestShape)
  if (stream.isPaused() && stream.activeSnapshotRequests === 0) {
    expect(stream.pauseRequested).toBe(false)
  }
  
  // If not started, connected must be false
  if (!stream.hasStarted()) {
    expect(stream.isConnected()).toBe(false)
  }
  
  // If isRefreshing, we should be in an abort→reconnect cycle
  // (hard to check directly, but can verify it doesn't persist)
}

Summary of All Three Reviews

| Review | Focus | Key Gaps |
| --- | --- | --- |
| Part 1 | State machine in isolation | No algebraic property tests, no scenario DSL, no fuzzing, missing edge cases |
| Part 2 | Glue layer (state machine ↔ real world) | No input extraction tests, no transition branch tests, no pause protocol tests, no snapshot 409 distinction test |
| Part 3 | Concurrency + parallel state | Snapshot↔visibility pause conflict (potential bug), #isRefreshing microtask race, delegation depth, parallel state invariants |

The three layers form a testing pyramid:

  • Bottom: Pure state machine (algebraic properties, fuzz, DSL) — cheapest to write, fastest to run
  • Middle: Glue layer (mock fetch harness, targeted adapter tests) — moderate cost, high value
  • Top: Concurrency interactions (snapshot↔visibility, wake↔refresh, delegation chains) — hardest to test, but where the most surprising bugs live

The state machine refactoring was the right move — it makes the bottom two layers testable for the first time. The concurrency issues in the top layer predate this PR, but are now more visible because the state machine makes it clear when writes to #syncState could conflict.

@KyleAMathews
Contributor

KyleAMathews commented Feb 10, 2026

Design Review: Parallel State & Pause Coordination

Following up on the testing reviews (Part 1, Part 2, Part 3) — this comment looks at the remaining transport/connection state that lives outside the state machine, and proposes a targeted fix for the coordination bugs identified in Part 3.


The Two Layers

The PR cleanly extracts the sync protocol into a state machine: offset, handle, cursor, schema, up-to-date status, replay mode, stale cache retry. This is a state progression (Initial → Syncing → Live) where each state carries different data and responds to events differently. State machine is the right abstraction here.

The transport/connection layer stays as fields on ShapeStream:

| Field | What it tracks | Contention? |
| --- | --- | --- |
| #started | Has subscribe() been called | No — simple lifecycle |
| #connected | Is a fetch/SSE physically active | No — set true on fetch start, false on end |
| #isMidStream | Between data messages and up-to-date | No — toggled by message processing |
| #isRefreshing | In an abort→reconnect window | Minor — two different clearing mechanisms |
| #pauseRequested | Pause intent not yet consumed by request loop | Yes — shared between visibility, snapshots, request loop |
| #activeSnapshotRequests | Concurrent snapshot counter | Yes — coordinates pause/resume with visibility |

The first three are simple lifecycle flags with no contention — they're fine as booleans. The last three are where the coordination complexity (and bugs) live.


The Coordination Problem

#pause() and #resume() are called from three independent sources:

  1. Visibility handler — tab hidden → pause, tab visible → resume
  2. Snapshot requests — first snapshot → pause, last snapshot completes → resume
  3. #requestShape() loop — consumes #pauseRequested, transitions sync state to PausedState

These callers don't know about each other. The current code uses #pauseRequested (boolean) + #activeSnapshotRequests (counter) + instanceof PausedState (sync state check) to coordinate them, which produces bugs:

Bug 1: Snapshot resume overrides visibility pause

  1. Tab visible, stream Live
  2. requestSnapshot() → counter=1, calls #pause()
  3. Tab goes hidden → #pause() no-ops (already paused/pause-requested)
  4. Snapshot completes → counter=0, calls #resume() — stream resumes while tab hidden

Bug 2: Visibility resume overrides snapshot pause

  1. Snapshot in progress, stream paused for snapshot
  2. Tab goes hidden → #pause() no-ops (already paused)
  3. Tab goes visible → #resume() — stream resumes while snapshot still in flight

Bug 3: Snapshot blocks on live long-poll, consuming browser connections

This one was reported separately and has two interacting issues:

Issue A: requestSnapshot calls #waitForStreamEnd() BEFORE #pause(). When the stream is in a live long-poll (which can hold for up to 20 seconds), #isMidStream is false so #waitForStreamEnd() returns immediately — but the long-poll is still active. The subsequent #pause() sets #pauseRequested and aborts the controller, but the snapshot fetch then competes with the dying long-poll for browser HTTP connections (especially on HTTP/1.1 with connection limits).

Issue B: In #requestShape(), the finally block sets this.#requestAbortController = undefined BEFORE the recursive call creates a new one via #createAbortListener. During this gap, #pause() can't abort anything because the controller is undefined. The request loop then proceeds to start a new long-poll despite #pauseRequested being true, because the pause check at the top of #requestShape already passed on the current iteration. This new long-poll blocks the snapshot POST.

All three bugs stem from the same root cause: pause coordination is split across multiple mechanisms (#pauseRequested boolean, #activeSnapshotRequests counter, abort controller, sync state instanceof checks) with no single source of truth.


Proposed Fix: Pause Lock

Replace #pauseRequested, #activeSnapshotRequests, and the pause-related logic with a counting lock that tracks pause reasons:

class PauseLock {
  #holders = new Set<string>()
  #onStateChange: (isPaused: boolean) => void

  constructor(onStateChange: (isPaused: boolean) => void) {
    this.#onStateChange = onStateChange
  }

  acquire(reason: string): void {
    if (this.#holders.has(reason)) {
      // Set-based lock is naturally idempotent — double acquire is safe
      // but likely indicates a caller bug (e.g., visibilitychange firing
      // 'hidden' twice without a 'visible' in between)
      console.warn(
        `[Electric] PauseLock: "${reason}" already held — ignoring duplicate acquire`
      )
      return
    }
    const wasEmpty = this.#holders.size === 0
    this.#holders.add(reason)
    if (wasEmpty) this.#onStateChange(true)
  }

  release(reason: string): void {
    this.#holders.delete(reason)
    if (this.#holders.size === 0) {
      this.#onStateChange(false)
    }
  }

  get isPaused(): boolean {
    return this.#holders.size > 0
  }

  /** Check if a specific reason is holding the lock */
  isHeldBy(reason: string): boolean {
    return this.#holders.has(reason)
  }
}

The Set-based design means double-acquire is safe (idempotent no-op), but the warning helps catch caller bugs early — if acquire('visibility') fires twice, something is wrong with the visibility handler. Different reasons coexisting is the whole point of the lock; the same reason appearing twice is likely a bug.

Usage in ShapeStream:

// In constructor:
this.#pauseLock = new PauseLock((isPaused) => {
  if (isPaused) {
    this.#requestAbortController?.abort(PAUSE_STREAM)
  } else {
    if (this.options.signal?.aborted) return
    this.#start()
  }
})

// Visibility handler — simple, no guards needed:
if (document.hidden) this.#pauseLock.acquire('visibility')
else this.#pauseLock.release('visibility')

// Snapshot requests — acquire BEFORE waitForStreamEnd:
async requestSnapshot(opts) {
  this.#pauseLock.acquire(`snapshot-${snapshotId}`)
  // acquire() immediately aborts the live long-poll via onStateChange,
  // so waitForStreamEnd() resolves fast instead of blocking 20s
  await this.#waitForStreamEnd()
  try {
    return await this.fetchSnapshot(opts)
  } finally {
    this.#pauseLock.release(`snapshot-${snapshotId}`)
  }
}

// Wake detection — doesn't need pause at all, just refresh:
// (unchanged)

// #requestShape — check the lock instead of #pauseRequested:
if (this.#pauseLock.isPaused) {
  this.#syncState = this.#syncState.pause()
  return
}

What this fixes:

  • Bug 1 — Snapshot resume while tab hidden: snapshot releases its lock, but visibility lock is still held → stream stays paused ✓
  • Bug 2 — Visibility resume during snapshot: visibility releases its lock, but snapshot lock is still held → stream stays paused ✓
  • Bug 3 — Snapshot blocks on long-poll: acquire() immediately kills the long-poll (via onStateChange → abort). And the #requestShape loop checks pauseLock.isPaused on every iteration — lock state is always consistent regardless of abort controller lifecycle, so no new long-poll can sneak through the gap. ✓
  • Rapid pause→resume: acquire + release before request loop ticks → lock is empty → no pause transition ✓
  • Multiple concurrent snapshots: each holds its own named lock, stream resumes only when all release ✓

What this eliminates:

  • #pauseRequested boolean — lock acquisition IS the request
  • #activeSnapshotRequests counter — each snapshot holds a named lock
  • The two-phase pause protocol (set flag → consume flag) — lock state is always consistent
  • The guard conditions in #pause() (!this.#pauseRequested && !(this.#syncState instanceof PausedState)) — the lock handles idempotency
  • The abort controller gap race — lock doesn't depend on the controller existing

What stays the same:

  • The sync state machine's PausedState — #requestShape still transitions #syncState to PausedState when the lock is held, and back when it's not
  • The resumingFromPause detection for avoiding live long-polling
  • All the external behavior (subscribers, URL params, etc.)

The #isRefreshing Simplification

Separately, #isRefreshing has two different clearing mechanisms that can race:

// Pattern A: forceDisconnectAndRefresh
this.#isRefreshing = true
this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
await this.#nextTick()
this.#isRefreshing = false

// Pattern B: wake detection
this.#isRefreshing = true
this.#requestAbortController.abort(SYSTEM_WAKE)
queueMicrotask(() => { this.#isRefreshing = false })

If both fire concurrently, the queueMicrotask can clear #isRefreshing before forceDisconnectAndRefresh's await completes. The simplest fix is to use the same mechanism everywhere — either always await #nextTick() or use a counter:

#refreshCount = 0

get #isRefreshing() { return this.#refreshCount > 0 }

async forceDisconnectAndRefresh() {
  this.#refreshCount++
  try {
    this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
    await this.#nextTick()
  } finally {
    this.#refreshCount--
  }
}

// Wake detection: same pattern with increment/try/finally/decrement

This eliminates the microtask race entirely.


What NOT to Change

The remaining transport flags are fine as simple booleans:

  • #connected — set true on fetch start, false on end. No contention, no coordination needed.
  • #isMidStream — toggled by message processing. Used by #waitForStreamEnd() with a promise resolver. Simple and correct.
  • #started — lifecycle guard. Simple and correct.

These don't need a state machine or any coordination abstraction. A state machine is the right tool for state progressions (the sync protocol). A lock is the right tool for coordination (pause/resume). Simple flags are the right tool for independent lifecycle tracking.


Summary

| Component | Current | Proposed | Why |
| --- | --- | --- | --- |
| Sync protocol | State machine ✓ | Keep as-is | Clean state progression, already well-designed |
| Pause coordination | #pauseRequested + #activeSnapshotRequests + guards | Pause lock | Fixes bugs 1–3, eliminates two-phase protocol and abort controller race |
| Refresh flag | #isRefreshing with two clearing mechanisms | Counter or single clearing mechanism | Eliminates microtask race |
| Connection/lifecycle | #connected, #isMidStream, #started | Keep as-is | Simple, independent, no contention |

The pause lock is ~20 lines, trivially testable in isolation, and directly fixes all three coordination bugs. It's a much smaller and more targeted change than a full transport state machine — right tool for the right problem.
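
For what it's worth, that isolated test is short enough to sketch here (vitest-style, matching the examples above; it assumes the PauseLock class exactly as written, imported from wherever it ends up living):

import { describe, expect, it, vi } from 'vitest'

describe(`PauseLock`, () => {
  it(`stays paused until every holder releases`, () => {
    const onStateChange = vi.fn()
    const lock = new PauseLock(onStateChange)

    lock.acquire(`visibility`)
    lock.acquire(`snapshot-1`)
    expect(onStateChange).toHaveBeenCalledTimes(1)        // only the first acquire pauses
    expect(onStateChange).toHaveBeenLastCalledWith(true)

    lock.release(`snapshot-1`)
    expect(lock.isPaused).toBe(true)                      // visibility still holds the lock

    lock.release(`visibility`)
    expect(lock.isPaused).toBe(false)
    expect(onStateChange).toHaveBeenLastCalledWith(false) // resumes only when no holders remain
  })
})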

KyleAMathews pushed a commit that referenced this pull request Feb 10, 2026
Fix a concurrency bug where the visibility handler and snapshot
requests could override each other's pause state. Without this fix:

1. A snapshot completing while the tab is hidden would resume the
   stream, wasting bandwidth on a long-poll the user can't see.
2. A tab becoming visible during an active snapshot would resume
   the stream, causing concurrent writes from both the main stream
   and the snapshot.

Each resume path now checks whether the other pause reason still
holds, inspired by the PauseLock concept from PR #3816 review.

https://claude.ai/code/session_01UGPdwB6UpFkkQi9p4sjPRj
KyleAMathews pushed a commit that referenced this pull request Feb 10, 2026
Replace the #pauseRequested boolean + #activeSnapshotRequests counter
with a set-based PauseLock that tracks *why* the stream is paused.

This fixes three concurrency bugs identified in PR #3816 review:

1. Snapshot resume while tab hidden: snapshot completes and resumes
   the stream even though the tab is still hidden, wasting bandwidth.
   Fix: snapshot releases its lock reason, but 'visibility' reason
   remains held — stream stays paused.

2. Visibility resume during active snapshot: tab becomes visible and
   resumes the stream while a snapshot is in flight, causing both the
   main stream and snapshot to write concurrently.
   Fix: visibility releases its lock reason, but 'snapshot-N' reason
   remains held — stream stays paused.

3. Snapshot blocks on live long-poll: requestSnapshot called
   #waitForStreamEnd BEFORE #pause, blocking up to 20s waiting for
   the long-poll to complete.
   Fix: PauseLock.acquire() is called BEFORE waitForStreamEnd,
   immediately aborting the long-poll via the onAcquired callback.

Also fixes the #isRefreshing microtask race by replacing the boolean
flag with a counter + getter pattern. forceDisconnectAndRefresh and
wake detection both increment/decrement in try/finally blocks,
eliminating the window where concurrent operations could clear the
flag prematurely.

https://claude.ai/code/session_01UGPdwB6UpFkkQi9p4sjPRj
@kevin-dp
Contributor Author

@KyleAMathews tl;dr of those reviews? ;D

@KyleAMathews
Contributor

@kevin-dp and I chatted and we'll work on my suggestions as follow up PRs.

Contributor

@KyleAMathews KyleAMathews left a comment

:shipit: huge improvement! The code is way more readable and reliable feeling now.

This PR reproduces a bug where the schema becomes `undefined` after
handling a stale response which may lead to parse errors ([see CI test
failure](https://github.com/electric-sql/electric/actions/runs/21870136703/job/63122684063)).
This bug was found by Claude during a review:

**Issue: schema undefined + ignored stale response → crash on `schema!`**

The original code has the exact same flow:
1. Stale response with local handle → `return` from `#onInitialResponse` (line 1129 on main), skipping `this.#schema = this.#schema ?? getSchemaFromHeaders(headers)` (line 1144 on main)
2. Control returns to `#requestShapeLongPoll` which does `const schema = this.#schema!`
3. If schema was undefined (fresh session resuming from persisted handle/offset), this crashes

The refactored code does the same thing: ignored transition → return early → `this.#syncState.schema!`. Identical behavior.

This PR
[reproduces](https://github.com/electric-sql/electric/actions/runs/21897631349/job/63217468839?pr=3828)
and fixes a bug related to stale shape handles.

### Bug: stale cache detection fails when client's own handle is the
expired handle

When a shape handle is marked as expired (e.g. after a 409 response),
the client is supposed to retry with a cache buster query parameter to
bypass stale CDN/proxy caches. However, this only works when the client
has **no handle** (fresh start) or a **different handle** than the
expired one.

When the client resumes with a persisted handle that happens to be the
same as the expired handle (`localHandle === expiredHandle`), the stale
detection logic sees that the client already has a handle and returns
`ignored` instead of `stale-retry`. The client logs a warning ("Ignoring
the stale response") but never adds a cache buster — so it just keeps
receiving the same stale cached response in an infinite loop.

### How the test reproduces it

1. Marks handle `expired-H1` as expired in the `ExpiredShapesCache`
2. Creates a `ShapeStream` that resumes with `handle: expired-H1`
(simulating a client that persisted this handle from a previous session)
3. The mock backend always returns responses with that same expired
handle (mimics CDN behaviour)
4. Asserts that the client should use a `cache_buster` query parameter
to escape the stale cache — which currently fails because the client
takes the `ignored` path instead of `stale-retry`

### Root cause

In `checkStaleResponse` (lines 311-344), the condition at line 322 is:

```typescript
if (this.#shared.handle === undefined) {
  // enter stale retry
}
// else: "We have a valid local handle — ignore this stale response"
```

This assumes that if the client has a local handle, it's a *different*
handle from the expired one, so the stale response can be safely
ignored. But that assumption is wrong when `localHandle ===
expiredHandle` — the client resumed with the same handle that was marked
expired.

At this point in the code, we already know `responseHandle ===
expiredHandle` (line 317). The missing check is whether
`this.#shared.handle` is *also* the expired handle.

### Fix

Change the condition at line 322 from:
```typescript
if (this.#shared.handle === undefined) {
```
to:
```typescript
if (this.#shared.handle === undefined || this.#shared.handle === expiredHandle) {
```
That's it — one condition added. When the client's own handle matches
the expired handle, it enters `stale-retry` (gets a cache buster)
instead of falling through to `ignored`. The rest of the stale-retry
machinery already handles everything correctly from there.
@kevin-dp kevin-dp merged commit b0cbe75 into main Feb 11, 2026
42 checks passed
@kevin-dp kevin-dp deleted the kevin/state-machine branch February 11, 2026 16:44
@K-Mistele

amazing, when should we expect a new tag?

@github-actions
Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @electric-sql/client@1.5.3

Thanks for contributing to Electric!
