
Refactor ShapeStream state to be an explicit state machine #3816

Merged

kevin-dp merged 10 commits into main from kevin/state-machine on Feb 11, 2026

Conversation

@kevin-dp
Contributor

@kevin-dp kevin-dp commented Feb 9, 2026

Fixes #3785

This PR refactors the ShapeStream class into an explicit state machine, moving many state variables and code paths out of ShapeStream and into dedicated state classes.

Summary

Extracts the implicit sync state from ShapeStream into an explicit state machine (ShapeStreamState).

The original ShapeStream tracked sync state as ~12 flat private fields (#lastOffset, #shapeHandle, #isUpToDate, #liveCacheBuster, #schema, #lastSyncedAt, #lastSeenCursor, #consecutiveShortSseConnections, #sseFallbackToLongPolling, #staleCacheBuster, #staleCacheRetryCount, #state) with transition logic scattered across #onInitialResponse, #onMessages, #reset, #constructUrl, and #requestShape. This made it hard to reason about which fields were relevant in which phase of the sync lifecycle.

The new design replaces these with a single #syncState: ShapeStreamState field backed by an immutable state machine:

ShapeStreamState (abstract base)
  ├── ActiveState (abstract — shared field storage, response/message helpers)
  │   ├── FetchingState (abstract — shared Initial/Syncing/StaleRetry behavior)
  │   │   ├── InitialState
  │   │   ├── SyncingState
  │   │   └── StaleRetryState (+staleCacheBuster, +staleCacheRetryCount)
  │   ├── LiveState (+SSE tracking, live-specific response/up-to-date/URL handling)
  │   └── ReplayingState (+cursor, replay suppression logic)
  ├── PausedState (delegates to previousState)
  └── ErrorState (delegates to previousState)

Each state carries only the fields relevant to it and defines its own behavior:

  • Response handling — each active state has its own handleResponseMetadata (stale detection, field parsing, state-specific transitions)
  • Up-to-date handling — LiveState preserves SSE tracking, ReplayingState does cursor-based suppression, fetching states transition to LiveState
  • URL construction — applyUrlParams(url) lets each state add its own query parameters (offset, handle, cache busters) instead of the client branching on fields
  • SSE decisions — shouldUseSse() and handleSseConnectionClosed() live on LiveState where the tracking state is

ShapeStream is simplified to orchestration: it drives the request loop, handles errors, manages async coordination (pause/resume, snapshots, visibility), and delegates all sync state decisions to the state machine.
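
For illustration, here is a rough sketch of the delegation pattern (not the actual code in shape-stream-state.ts; names and details may differ): an abstract base plus the two wrapper states, showing how PausedState and ErrorState delegate to the state they wrap instead of copying fields.

type Offset = string

abstract class ShapeStreamState {
  abstract readonly kind: string
  abstract readonly offset: Offset
  abstract readonly handle: string | undefined
  abstract readonly isUpToDate: boolean

  // Each state contributes only the query parameters it owns (offset, handle,
  // cache busters), so ShapeStream no longer branches on flat fields.
  abstract applyUrlParams(url: URL): void

  // Wrapping transitions available from every state.
  pause(): PausedState {
    return new PausedState(this)
  }
  toErrorState(error: Error): ErrorState {
    return new ErrorState(this, error)
  }
}

class PausedState extends ShapeStreamState {
  readonly kind = `paused`
  constructor(readonly previousState: ShapeStreamState) {
    super()
  }
  get offset() { return this.previousState.offset }
  get handle() { return this.previousState.handle }
  get isUpToDate() { return this.previousState.isUpToDate }
  applyUrlParams(url: URL) { this.previousState.applyUrlParams(url) }
  resume(): ShapeStreamState { return this.previousState }
}

class ErrorState extends ShapeStreamState {
  readonly kind = `error`
  constructor(readonly previousState: ShapeStreamState, readonly error: Error) {
    super()
  }
  get offset() { return this.previousState.offset }
  get handle() { return this.previousState.handle }
  get isUpToDate() { return this.previousState.isUpToDate }
  applyUrlParams(url: URL) { this.previousState.applyUrlParams(url) }
  retry(): ShapeStreamState { return this.previousState }
}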

@kevin-dp kevin-dp force-pushed the kevin/state-machine branch from bd8df84 to 6631758 on February 9, 2026 at 16:10
@pkg-pr-new

pkg-pr-new bot commented Feb 10, 2026


npm i https://pkg.pr.new/@electric-sql/react@3816
npm i https://pkg.pr.new/@electric-sql/client@3816
npm i https://pkg.pr.new/@electric-sql/y-electric@3816

commit: 3f487f9

@netlify

netlify bot commented Feb 10, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit ba30fba
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/698af78045b29e0008f57f48
😎 Deploy Preview https://deploy-preview-3816--electric-next.netlify.app

@codecov

codecov bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 88.51541% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.28%. Comparing base (091a232) to head (3f487f9).
⚠️ Report is 11 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ckages/typescript-client/src/shape-stream-state.ts 86.22% 35 Missing ⚠️
packages/typescript-client/src/client.ts 94.17% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3816      +/-   ##
==========================================
- Coverage   87.68%   87.28%   -0.40%     
==========================================
  Files          23       24       +1     
  Lines        2078     2305     +227     
  Branches      548      575      +27     
==========================================
+ Hits         1822     2012     +190     
- Misses        254      291      +37     
  Partials        2        2              
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 92.24% <88.51%> (-1.41%) ⬇️
packages/y-electric 56.05% <ø> (ø)
typescript 87.28% <88.51%> (-0.40%) ⬇️
unit-tests 87.28% <88.51%> (-0.40%) ⬇️



@KyleAMathews
Contributor

Test Review: Applying DSL & Property-Based Testing Ideas

The state machine extraction is a great structural move — pulling scattered mutable fields out of the 1800-line ShapeStream class into a pure, immutable state machine with explicit transitions. The current 42 tests verify individual transition edges competently, but I think we can dramatically improve coverage by applying ideas from this guide on building testing DSLs for complex systems.

Key ideas from the guide that apply here

The guide argues that when you have a well-defined state machine (which is exactly what this PR creates), example-based tests ("given this state and this input, expect that output") are the weakest form of verification. The stronger techniques are:

  1. Algebraic property testing — verify that operators satisfy mathematical properties (idempotence, round-trip, commutativity) across all states, not just the ones you remembered to test.
  2. Two-tier DSL design — a typed fluent builder for well-formed multi-step scenarios (making it easy to test journeys through the state graph), plus raw constructors for adversarial edge cases.
  3. History/trace-based verification — record execution traces and run invariant checkers at every step, shifting from "did I assert enough?" to "do my scenarios visit enough states?"
  4. Fuzz testing with shrinking — generate random event sequences with seeded RNGs for reproducibility, then verify invariants hold at every step. This explores the state space far beyond hand-written tests.
  5. Small-scope exhaustive exploration — for a state machine with 7 states and ~8 event types, you can exhaustively test all reachable (state, event) pairs within small bounds.

The core insight: you've built a pure, immutable, isolated state machine. This is the perfect candidate for these techniques — no mocking fetch or SSE, no async, just inputs → outputs. Don't leave that leverage on the table.


1. Scenario DSL (High Priority)

Bugs in state machines almost always come from unexpected sequences, not individual steps. The current tests check one transition at a time. A fluent scenario builder would let you test full journeys:

function scenario(initial?: Partial<SharedStateFields>) {
  let state: ShapeStreamState = createInitialState({ 
    offset: initial?.offset ?? `-1`,
    handle: initial?.handle 
  })
  const trace: Array<{ event: string; before: string; after: string }> = []

  const self = {
    response(input: Partial<ResponseMetadataInput>) {
      const before = state.kind
      const transition = state.handleResponseMetadata(makeResponseInput(input))
      state = transition.state
      trace.push({ event: `response`, before, after: state.kind })
      assertStateInvariants(state) // automatic invariant checking at every step
      return self
    },
    messages(input: Partial<MessageBatchInput>) { /* similar */ return self },
    pause() { /* ... */ return self },
    resume() { /* ... */ return self },
    error(msg: string) { /* ... */ return self },
    retry() { /* ... */ return self },
    reset(handle?: string) { /* ... */ return self },
    sseClose(input: Partial<SseCloseInput>) { /* ... */ return self },

    expectKind(kind: ShapeStreamStateKind) {
      expect(state.kind).toBe(kind)
      return self
    },
    expectUpToDate(expected: boolean) {
      expect(state.isUpToDate).toBe(expected)
      return self
    },
    expectHandle(h: string | undefined) {
      expect(state.handle).toBe(h)
      return self
    },

    done() { return { state, trace } }
  }
  return self
}

This enables readable multi-step tests:

it(`full lifecycle: initial → sync → live → pause → resume → error → retry`, () => {
  scenario()
    .response({ responseHandle: `h1`, responseOffset: `0_5` })
    .expectKind(`syncing`)
    .messages({ hasUpToDateMessage: true })
    .expectKind(`live`)
    .expectUpToDate(true)
    .pause()
    .expectKind(`paused`)
    .expectUpToDate(true)  // paused-from-live preserves isUpToDate
    .resume()
    .expectKind(`live`)
    .error(`connection lost`)
    .expectKind(`error`)
    .retry()
    .expectKind(`live`)
})

it(`stale CDN → retry → fresh response → sync → live`, () => {
  scenario()
    .response({ responseHandle: `stale-h`, expiredHandle: `stale-h` })
    .expectKind(`stale-retry`)
    .response({ responseHandle: `fresh-h`, responseOffset: `0_0` })
    .expectKind(`syncing`)
    .messages({ hasUpToDateMessage: true })
    .expectKind(`live`)
    .expectHandle(`fresh-h`)
})

The builder is the "well-formed scenario" tier. For adversarial testing, keep the raw constructors (new SyncingState(...)) to create states the builder wouldn't normally produce (e.g., PausedState wrapping PausedState).


2. Algebraic Property Tests (High Priority)

Pause/resume, error/retry, withHandle, and markMustRefetch should be verified for every state, not just the ones that happened to get a test:

const allStates = (): ShapeStreamState[] => {
  const shared = makeShared()
  return [
    createInitialState({ offset: `-1` }),
    new SyncingState(shared),
    new LiveState(shared),
    new ReplayingState({ ...shared, replayCursor: `c1` }),
    new StaleRetryState({ ...shared, staleCacheBuster: `cb`, staleCacheRetryCount: 1 }),
    new LiveState(shared).pause(),
    new SyncingState(shared).toErrorState(new Error(`test`)),
  ]
}

describe(`algebraic properties`, () => {
  it.each(allStates().map(s => [s.kind, s]))(
    `%s: pause().resume() round-trips`,
    (_kind, state) => {
      const roundTripped = state.pause().resume()
      expect(roundTripped.kind).toBe(state.kind)
      expect(roundTripped.handle).toBe(state.handle)
      expect(roundTripped.offset).toBe(state.offset)
      expect(roundTripped.isUpToDate).toBe(state.isUpToDate)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: toErrorState(e).retry() round-trips`,
    (_kind, state) => {
      const roundTripped = state.toErrorState(new Error(`x`)).retry()
      expect(roundTripped.kind).toBe(state.kind)
      expect(roundTripped.handle).toBe(state.handle)
      expect(roundTripped.offset).toBe(state.offset)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: markMustRefetch always → InitialState with offset=-1`,
    (_kind, state) => {
      const reset = state.markMustRefetch(`new-h`)
      expect(reset).toBeInstanceOf(InitialState)
      expect(reset.offset).toBe(`-1`)
      expect(reset.handle).toBe(`new-h`)
      expect(reset.schema).toBeUndefined()
      expect(reset.isUpToDate).toBe(false)
    }
  )

  it.each(allStates().map(s => [s.kind, s]))(
    `%s: withHandle changes only handle`,
    (_kind, state) => {
      const updated = state.withHandle(`changed`)
      expect(updated.handle).toBe(`changed`)
      expect(updated.offset).toBe(state.offset)
      expect(updated.kind).toBe(state.kind)
      expect(updated.isUpToDate).toBe(state.isUpToDate)
    }
  )
})

3. Random Sequence Fuzzing (Medium Priority)

Generate random event sequences and verify invariants hold at every step. A single fuzz run like this explores more of the state space than all 42 hand-written tests combined:

function applyEvent(state: ShapeStreamState, event: Event): ShapeStreamState {
  switch (event.type) {
    case `response`: return state.handleResponseMetadata(event.input).state
    case `messages`: return state.handleMessageBatch(event.input).state
    case `pause`: return state.pause()
    case `resume`: return state instanceof PausedState ? state.resume() : state
    case `error`: return state.toErrorState(new Error(`fuzz`))
    case `retry`: return state instanceof ErrorState ? state.retry() : state
    case `markMustRefetch`: return state.markMustRefetch()
    case `sseClose`: return state.handleSseConnectionClosed(event.input).state
  }
}

function checkInvariants(state: ShapeStreamState) {
  expect(state).toBeDefined()
  expect([`initial`,`syncing`,`live`,`replaying`,`stale-retry`,`paused`,`error`]).toContain(state.kind)
  expect(typeof state.offset).toBe(`string`)
  
  // Only LiveState (or delegates to it) should be up-to-date
  if ([`initial`, `syncing`, `stale-retry`, `replaying`].includes(state.kind)) {
    expect(state.isUpToDate).toBe(false)
  }
  
  // staleCacheBuster only present in StaleRetryState (or delegates)
  if (![`stale-retry`, `paused`, `error`].includes(state.kind)) {
    expect(state.staleCacheBuster).toBeUndefined()
  }
}

it(`survives 1000 random 50-step sequences without invariant violations`, () => {
  for (let seed = 0; seed < 1000; seed++) {
    let state: ShapeStreamState = createInitialState({ offset: `-1` })
    const rng = mulberry32(seed) // seeded PRNG for reproducibility
    for (let step = 0; step < 50; step++) {
      const event = randomEvent(rng)
      state = applyEvent(state, event)
      checkInvariants(state)
    }
  }
})

When a seed fails, you have a fully reproducible failing sequence you can minimize.
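
The fuzz loop above assumes two helpers that aren't defined in this review: a seeded PRNG and a random event generator. A possible sketch (mulberry32 is a well-known tiny seeded PRNG; the Event type mirrors what applyEvent expects, and makeResponseInput / makeMessageBatchInput / makeSseCloseInput are the same assumed test factories used in the examples above):

type Event =
  | { type: `response`; input: ResponseMetadataInput }
  | { type: `messages`; input: MessageBatchInput }
  | { type: `sseClose`; input: SseCloseInput }
  | { type: `pause` | `resume` | `error` | `retry` | `markMustRefetch` }

// Tiny 32-bit seeded PRNG: same seed → same sequence, so failures replay exactly.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = Math.imul(a ^ (a >>> 15), a | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

const EVENT_TYPES = [
  `response`, `messages`, `pause`, `resume`,
  `error`, `retry`, `markMustRefetch`, `sseClose`,
] as const

function randomEvent(rng: () => number): Event {
  const type = EVENT_TYPES[Math.floor(rng() * EVENT_TYPES.length)]
  switch (type) {
    // Inputs come from the assumed test factories; their exact shapes are placeholders.
    case `response`:
      return { type, input: makeResponseInput({}) }
    case `messages`:
      return { type, input: makeMessageBatchInput({ hasUpToDateMessage: rng() < 0.5 }) }
    case `sseClose`:
      return { type, input: makeSseCloseInput({}) }
    default:
      return { type }
  }
}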


4. Specific Missing Edge Cases (High Priority)

Even without the DSL/fuzzing infrastructure, these gaps should be filled now:

Double-pause nesting — potential bug if ShapeStream accidentally calls pause() twice:

it(`double pause creates nested PausedState — resume only unwraps one layer`, () => {
  const live = new LiveState(makeShared())
  const paused1 = live.pause()
  const paused2 = paused1.pause()
  expect(paused2).toBeInstanceOf(PausedState)
  const resumed1 = paused2.resume()
  expect(resumed1).toBeInstanceOf(PausedState) // still paused once
})

(Consider: should pause() be idempotent on PausedState? If so, that's a code change.)

204 response handling:

it(`204 response sets lastSyncedAt`, () => {
  const syncing = new SyncingState(makeShared({ lastSyncedAt: undefined }))
  const transition = syncing.handleResponseMetadata(
    makeResponseInput({ status: 204, now: 1700000000 })
  )
  expect(transition.state.lastSyncedAt).toBe(1700000000)
})

SSE vs. non-SSE offset handling:

it(`SSE up-to-date message updates offset`, () => {
  const syncing = new SyncingState(makeShared({ offset: `0_0` }))
  const transition = syncing.handleMessageBatch(
    makeMessageBatchInput({ isSse: true, upToDateOffset: `5_3` as Offset })
  )
  expect(transition.state.offset).toBe(`5_3`)
})

it(`non-SSE up-to-date message does NOT update offset`, () => {
  const syncing = new SyncingState(makeShared({ offset: `0_0` }))
  const transition = syncing.handleMessageBatch(
    makeMessageBatchInput({ isSse: false, upToDateOffset: `5_3` as Offset })
  )
  expect(transition.state.offset).toBe(`0_0`)
})

Schema set-once semantics:

it(`schema is only set once (first response wins)`, () => {
  const initial = createInitialState({ offset: `-1` })
  const t1 = initial.handleResponseMetadata(
    makeResponseInput({ responseSchema: { id: { type: `int4` } } })
  )
  const t2 = t1.state.handleResponseMetadata(
    makeResponseInput({ responseSchema: { name: { type: `text` } } })
  )
  expect(t2.state.schema).toEqual({ id: { type: `int4` } })
})

Events on Paused/Error states (defensive no-ops):

it(`PausedState.handleResponseMetadata returns ignored`, () => {
  const paused = new SyncingState(makeShared()).pause()
  const transition = paused.handleResponseMetadata(makeResponseInput())
  expect(transition.action).toBe(`ignored`)
})

it(`ErrorState.handleMessageBatch returns no-op`, () => {
  const errored = new SyncingState(makeShared()).toErrorState(new Error(`x`))
  const transition = errored.handleMessageBatch(makeMessageBatchInput())
  expect(transition.suppressBatch).toBe(false)
  expect(transition.state).toBe(errored)
})

Summary

| Priority | What | Why |
| --- | --- | --- |
| High | Scenario DSL builder | Tests sequences, not just individual transitions — most bugs hide in sequences |
| High | Algebraic property tests over all states | pause/resume, error/retry, withHandle, markMustRefetch should hold universally |
| High | Missing edge cases (double-pause, 204, schema set-once, SSE offset) | Direct gaps in the current suite |
| Medium | Random sequence fuzzing | Explores the state space far beyond hand-written tests |
| Medium | Invariant checker at every transition | Catches violations early, makes trace failures debuggable |

The current test suite is a solid "did I implement this correctly?" check. What these techniques add is a "can anything break this?" check. The state machine is pure and immutable — it's the perfect candidate for property-based and trace-based testing. That's the whole payoff of extracting it from ShapeStream.

@KyleAMathews
Contributor

Test Review Part 2: The Glue Layer Between State Machine and Real World

The first review focused on testing the state machine in isolation — algebraic properties, scenario DSLs, fuzz testing. This review addresses the other major risk surface: the adapter code in client.ts that connects the pure state machine to messy real-world events (HTTP headers, SSE messages, abort signals, visibility changes).

The state machine refactoring creates a clean seam, but that seam is also where the new risk concentrates. The state machine is now a pure function: inputs → (new state, transition metadata). But the glue code has two jobs that are both undertested:

  1. Extracting the right inputs from HTTP responses, SSE events, and abort signals
  2. Interpreting transition results into the right side effects (throw, console.warn, abort, notify subscribers, sleep for backoff)

The state machine unit tests verify the engine is correct; they say nothing about whether the steering wheel is connected to the wheels.


Mapping the Risk Surface

I traced every this.#syncState access in client.ts — 38 total: 9 writes (state transitions) and 29 reads. Each write is an on-ramp where a real-world event gets translated into a state machine call. Here are the six specific risk zones:


Risk 1: Input Extraction at #onInitialResponse (line ~1058)

This is the adapter between raw Response and handleResponseMetadata():

const transition = this.#syncState.handleResponseMetadata({
  status,
  responseHandle: shapeHandle,                                       // headers.get(SHAPE_HANDLE_HEADER)
  responseOffset: headers.get(CHUNK_LAST_OFFSET_HEADER) as Offset,   // cast!
  responseCursor: headers.get(LIVE_CACHE_BUSTER_HEADER),
  responseSchema: getSchemaFromHeaders(headers),
  expiredHandle,                                                     // from expiredShapesCache lookup
  now: Date.now(),
  maxStaleCacheRetries: this.#maxStaleCacheRetries,
  createCacheBuster: () => `${Date.now()}-${Math.random()...}`,
})

No existing test verifies these extractions are correct. The state machine unit tests use makeResponseInput() which constructs the input object directly — bypassing all header parsing. If someone changes a header constant, breaks the as Offset cast, or alters getSchemaFromHeaders, the state machine gets wrong inputs and its own unit tests won't catch it.

Risk 2: Transition Result Branching

After handleResponseMetadata() returns, client.ts branches on transition.action:

  • stale-retry → cancel body, maybe throw FetchError(502), else console.warn + throw StaleCacheError
  • ignored → console.warn + early return (skip body processing entirely)
  • accepted → continue to parse body

And after handleMessageBatch() returns, it branches on transition.suppressBatch:

  • true → return early, skipping subscriber notification AND upToDateTracker.recordUpToDate
  • false → publish to subscribers, record up-to-date

No test isolates these branches against specific transition outcomes. The integration tests in client.test.ts test overall behavior (error recovery, shape rotation), but they can't distinguish whether a bug is in the state machine or in the branching logic.

Risk 3: The #pauseRequested / PausedState Dual-State Protocol

This is the most complex glue in the file. There are two parallel pause mechanisms:

#pause():
  sets #pauseRequested = true
  aborts requestAbortController with PAUSE_STREAM
      ↓
#requestShape() entry: 
  if #pauseRequested → syncState.pause(), clear flag, return
      OR
catch FetchBackoffAbortError:
  if abort reason === PAUSE_STREAM && #pauseRequested → syncState.pause(), clear flag
      ↓
#resume():
  clears #pauseRequested
  calls #start()  (does NOT call syncState.resume() — see below)
      ↓
#requestShape() entry:
  if syncState instanceof PausedState → resumingFromPause = true, syncState.resume()

The comment at line 1356 explains why #resume() doesn't immediately transition the state machine — it defers to #requestShape() so it can detect resumingFromPause and avoid live long-polling. This is a deliberate split-brain design with a subtle invariant: the #pauseRequested flag and PausedState must stay coordinated across async boundaries.

Existing coverage: client.test.ts tests visibility-based pause/resume, but doesn't test:

  • Rapid pause→resume before the request loop ticks (flag cleared without PausedState ever being created)
  • Pause during active fetch vs. pause between fetches (two different catch paths)
  • Resume while #pauseRequested is true but not yet consumed

Risk 4: #isMidStream Parallel State

#isMidStream is managed outside the state machine but must stay consistent with it:

// In #onMessages:
this.#isMidStream = true        // line 1127 — set BEFORE state machine call

const transition = this.#syncState.handleMessageBatch({...})
this.#syncState = transition.state

if (hasUpToDateMessage) {
  this.#isMidStream = false      // line 1140 — set after state machine call
  
  if (transition.suppressBatch) {
    return                        // line 1145 — early return, subscribers NOT notified
  }
  // ... record up-to-date, publish to subscribers
}

When a batch is suppressed (replay mode with unchanged cursor), #isMidStream gets toggled true → false and the promise resolver fires, but subscribers aren't notified. Is this the intended behavior? The #isMidStream toggle plus promise resolution are side effects that happen regardless of suppression — if any code awaits the mid-stream promise expecting subscriber notification to follow, it won't.

Risk 5: Snapshot 409 — withHandle() vs markMustRefetch()

At line ~1707 (in fetchSnapshot), a 409 uses:

this.#syncState = this.#syncState.withHandle(nextHandle)  // handle only

While the main stream's 409 handler at line ~1578 uses:

this.#syncState = this.#syncState.markMustRefetch(handle)  // full reset

This distinction is critical — snapshot 409s should NOT reset offset/schema/etc because the main stream is paused and should not be disturbed. No test verifies this distinction. If someone "simplifies" the snapshot path to use markMustRefetch (thinking it's equivalent), the main stream state gets wiped.

Risk 6: SSE Close → Backoff Glue

The finally block at line ~1297:

const transition = this.#syncState.handleSseConnectionClosed({...})
this.#syncState = transition.state

if (transition.fellBackToLongPolling) {
  console.warn(...)
} else if (transition.wasShortConnection) {
  const maxDelay = Math.min(
    this.#sseBackoffMaxDelay,
    this.#sseBackoffBaseDelay * Math.pow(2, this.#syncState.consecutiveShortSseConnections)
    //                                       ^^^^^^^^^^^^^^^^ reads NEW state
  )
  const delayMs = Math.floor(Math.random() * maxDelay)
  await new Promise((resolve) => setTimeout(resolve, delayMs))
}

The backoff reads consecutiveShortSseConnections from the post-transition state. The state machine tests verify the counter increments correctly, but nothing verifies the glue code correctly uses that counter for the delay. No test covers this path.


What Existing Tests Cover

| Test file | Covers | Glue layer gaps |
| --- | --- | --- |
| shape-stream-state.test.ts | Pure state machine transitions | Doesn't touch client.ts at all |
| client.test.ts | Error recovery w/ onError, visibility pause/resume, shape rotation, isConnected, isLoading | No header extraction, no transition branching, no rapid pause/resume, no snapshot 409 distinction |
| integration.test.ts | End-to-end with real server | Can't isolate glue bugs from state machine bugs or server bugs |
| stream.test.ts | URL construction, column mapping | No state transitions |
| fetch.test.ts | Fetch wrapper retries, backoff, prefetch | No state machine interaction |

The gap: there are no tests sitting between the state machine unit tests and the full integration tests. Nothing tests the adapter layer in isolation.


Proposed Tests

A. Input Extraction Contract Tests (High Priority)

Verify that #onInitialResponse correctly maps HTTP headers to state machine inputs:

describe(`glue: #onInitialResponse header extraction`, () => {
  it(`maps HTTP headers to state machine input fields`, async () => {
    const { stream, nextFetch } = createMockShapeStream()

    nextFetch.respond({
      status: 200,
      headers: {
        [SHAPE_HANDLE_HEADER]: `test-handle`,
        [CHUNK_LAST_OFFSET_HEADER]: `5_3`,
        [LIVE_CACHE_BUSTER_HEADER]: `cursor-42`,
        // + schema headers
      },
      body: `[]`,
    })

    await stream.waitForNextTick()

    expect(stream.shapeHandle).toBe(`test-handle`)
    expect(stream.lastOffset).toBe(`5_3`)
  })

  it(`looks up expired handle from cache and triggers stale-retry`, async () => {
    // Pre-populate expiredShapesCache with a handle
    // Respond with that same handle
    // Verify: StaleCacheError thrown, console.warn emitted
  })

  it(`204 response sets lastSyncedAt via state machine`, async () => {
    // Respond with 204
    // Verify: lastSyncedAt() returns a recent timestamp
  })
})

B. Transition Branch Tests (High Priority)

For each transition.action value, verify the correct side effect:

describe(`glue: transition result branching`, () => {
  it(`stale-retry cancels body and throws StaleCacheError`, async () => {
    // Setup: respond with handle matching expired handle, no local handle
    // Verify: response.body.cancel() called, StaleCacheError thrown
  })

  it(`stale-retry exceeding max retries throws FetchError 502`, async () => {
    // Setup: trigger stale-retry more than maxStaleCacheRetries times
    // Verify: FetchError with status 502
  })

  it(`ignored stale response logs warning and skips body processing`, async () => {
    // Setup: local handle exists, respond with different expired handle
    // Verify: console.warn includes "Ignoring", no subscriber notification from this response
  })

  it(`suppressBatch skips subscriber notification but resolves midStream promise`, async () => {
    // Setup: enter replay mode, respond with up-to-date + unchanged cursor
    // Verify: subscriber NOT called, but midStream promise resolves
  })
})

C. Pause/Resume Protocol Tests (High Priority)

describe(`glue: pause/resume protocol`, () => {
  it(`pause during idle → next requestShape creates PausedState`, async () => {
    const { stream } = createLiveShapeStream()
    stream.triggerPause()
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(true)
  })

  it(`rapid pause→resume before request loop: no PausedState created`, async () => {
    const { stream } = createLiveShapeStream()
    stream.triggerPause()
    stream.triggerResume()  // immediately, before #requestShape runs
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(false)
    // Verify: state was never PausedState
  })

  it(`pause during active fetch: abort caught, transitions to PausedState`, async () => {
    const { stream, hangingFetch } = createMockShapeStream()
    hangingFetch()  // fetch that never resolves until aborted
    stream.triggerPause()
    await stream.waitForNextTick()
    expect(stream.isPaused()).toBe(true)
  })

  it(`resume detects resumingFromPause, avoids live long-poll param`, async () => {
    const { stream, nextFetch, getLastFetchUrl } = createPausedLiveShapeStream()
    stream.triggerResume()
    nextFetch.respond({ /* up-to-date response */ })
    const url = getLastFetchUrl()
    expect(url.searchParams.has(`live`)).toBe(false)  // no long-poll!
  })

  it(`resume with aborted user signal: stays paused`, async () => {
    const controller = new AbortController()
    const { stream } = createPausedShapeStream({ signal: controller.signal })
    controller.abort()
    stream.triggerResume()
    expect(stream.isPaused()).toBe(true)
  })
})

D. Snapshot 409 Distinction Test (Medium Priority)

describe(`glue: snapshot 409 uses withHandle not markMustRefetch`, () => {
  it(`updates handle but preserves offset and schema`, async () => {
    const { stream, triggerSnapshot409 } = createLiveShapeStream({
      handle: `h1`, offset: `5_3`
    })

    triggerSnapshot409({ newHandle: `h2` })
    await stream.waitForNextTick()

    expect(stream.shapeHandle).toBe(`h2`)    // updated
    expect(stream.lastOffset).toBe(`5_3`)    // NOT reset to -1
    expect(stream.isUpToDate).toBe(true)     // NOT reset to false
  })
})

E. Dual-State Consistency Invariants (Medium Priority)

Add an invariant checker that can be wired into the mock harness:

function assertGlueConsistency(stream: ShapeStream) {
  // isLoading is the inverse of isUpToDate
  expect(stream.isLoading()).toBe(!stream.isUpToDate)

  // isPaused should only be true when syncState is PausedState
  // (not when #pauseRequested is true but not yet consumed)
  if (stream.isPaused()) {
    // state machine should be in PausedState
    // #pauseRequested should be false (consumed)
  }
}

Run this after every mock response and every pause/resume operation.

F. SSE Backoff Glue Test (Low Priority)

describe(`glue: SSE close → backoff computation`, () => {
  it(`short connection triggers sleep with exponential delay`, async () => {
    // Mock setTimeout to capture delay
    // Trigger SSE connection that closes after 50ms (< minSseConnectionDuration)
    // Verify: setTimeout called with delay based on 2^consecutiveShortSseConnections
  })

  it(`fallback to long polling emits warning`, async () => {
    // Trigger maxShortSseConnections consecutive short connections
    // Verify: console.warn about proxy buffering
    // Verify: next request does NOT include SSE params
  })
})

The Testing Harness

All of the above require a mock fetch harness that sits between the state machine unit tests and the full integration tests. The pattern already exists in stream.test.ts (with fetchWrapper), but needs to be extended to:

  1. Queue responses — nextFetch.respond({status, headers, body}) for sequencing multi-step scenarios
  2. Hang fetches — hangingFetch() returns a promise that never resolves (for testing pause during active fetch)
  3. Capture requests — getLastFetchUrl() to verify URL params the glue code constructed
  4. Expose internals — access to isPaused(), isUpToDate, lastOffset, shapeHandle for assertions

This harness would make it trivial to write targeted tests for each glue-layer risk zone without the overhead of a real server.
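
A minimal version of that harness could be as small as the following sketch, assuming ShapeStream can be handed a custom fetch implementation (as the existing fetchWrapper pattern suggests); the respond / getLastFetchUrl names are just the wish-list items above, not an existing API:

type QueuedResponse = { status?: number; headers?: Record<string, string>; body?: string }

function createMockFetchHarness() {
  const pending: Array<(res: QueuedResponse) => void> = []
  const requests: URL[] = []

  // Every fetch the stream makes is parked until the test answers it.
  const fetchClient = (input: RequestInfo | URL): Promise<Response> => {
    const url = input instanceof Request ? input.url : input.toString()
    requests.push(new URL(url))
    return new Promise<Response>((resolve) => {
      pending.push(({ status = 200, headers = {}, body = `[]` }) =>
        resolve(new Response(status === 204 ? null : body, { status, headers }))
      )
    })
  }

  return {
    fetchClient,
    // 1. Queue responses: answer the oldest in-flight request
    respond(res: QueuedResponse) {
      const next = pending.shift()
      if (!next) throw new Error(`no pending fetch to respond to`)
      next(res)
    },
    // 2. Hang fetches: simply never call respond(); the request stays parked
    pendingRequestCount: () => pending.length,
    // 3. Capture requests: inspect the URL the glue code constructed
    getLastFetchUrl: () => requests[requests.length - 1],
  }
}

Point 4 (exposing internals like isPaused() or the current handle) would still need test-only accessors on ShapeStream itself, or assertions phrased purely in terms of observable behavior (which requests were made, which subscribers were notified).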


Summary

| Priority | Risk Zone | What to Test | Current Coverage |
| --- | --- | --- | --- |
| High | Input extraction (#onInitialResponse) | HTTP headers → state machine input mapping | None — unit tests bypass this entirely |
| High | Transition branching | stale-retry / ignored / suppressBatch → correct side effects | Indirect only via integration tests |
| High | Pause/resume protocol | #pauseRequested ↔ PausedState synchronization, race conditions | Partial — only visibility-based pause/resume |
| Medium | Snapshot 409 distinction | withHandle() preserves offset/schema vs markMustRefetch() resets | None |
| Medium | Dual-state consistency | #isMidStream, #connected, isLoading stay in sync with #syncState | None |
| Low | SSE backoff glue | Duration → backoff delay computation, long-polling fallback | None |

The state machine extraction was the right move — it makes the core logic testable in isolation. The next step is testing the wiring harness that connects it to the real world. A mock fetch harness plus these targeted tests would close the gap between "state machine is correct" and "the system behaves correctly."

@KyleAMathews
Contributor

Test Review Part 3: Concurrency, Parallel State, and Delegation Depth

Parts 1 and 2 covered the state machine in isolation and the glue layer between the state machine and real-world events. This final review covers three remaining risk areas: concurrent pause/resume interactions, the parallel state that didn't move into the state machine, and delegation chain depth in PausedState/ErrorState.


Risk 7: Snapshot ↔ Visibility Pause/Resume Interaction

The snapshot request flow pauses the main stream, fetches data, then resumes:

// requestSnapshot (line ~1597)
this.#activeSnapshotRequests++
if (this.#activeSnapshotRequests === 1) {
  this.#pause()                          // pause main stream
}
const { metadata, data } = await this.fetchSnapshot(opts)
// ... inject data ...
finally {
  this.#activeSnapshotRequests--
  if (this.#activeSnapshotRequests === 0) {
    this.#resume()                       // resume main stream
  }
}

The visibility handler also calls #pause() / #resume():

const visibilityHandler = () => {
  if (document.hidden) this.#pause()
  else this.#resume()
}

These can interleave in a problematic way:

  1. Tab is visible, stream is Live
  2. requestSnapshot() → counter=1, calls #pause()
  3. Tab goes hidden → visibilityHandler calls #pause() → guard prevents double-pause ✓
  4. fetchSnapshot() completes
  5. Counter goes to 0, calls #resume() — stream resumes, even though tab is still hidden

#resume() doesn't check tab visibility — it unconditionally resumes if the state is paused or #pauseRequested is true. So the snapshot's finally block can override the visibility system's pause.

The reverse is also problematic:

  1. Snapshot in progress, stream paused for snapshot
  2. Tab goes hidden → #pause() no-ops (already paused)
  3. Tab goes visible → #resume() — stream resumes while snapshot is still in flight
  4. Now the main stream and snapshot are both running concurrently, both potentially writing #syncState

This isn't a new bug introduced by this PR — the pause/resume control flow is unchanged — but it IS an untested interaction that could produce state inconsistencies. The state machine refactoring makes it easier to expose via targeted tests.

Proposed test:

describe(`snapshot ↔ visibility interaction`, () => {
  it(`snapshot resume while tab hidden: stream should stay paused`, async () => {
    const { stream, mockVisibility, triggerSnapshot } = createMockShapeStream()
    
    mockVisibility.hide()  // tab goes hidden → stream paused
    await triggerSnapshot() // snapshot pauses (no-op), fetches, resumes
    
    // After snapshot completes, stream should STILL be paused
    // because tab is still hidden
    expect(stream.isPaused()).toBe(true)  // THIS LIKELY FAILS — revealing the bug
  })

  it(`visibility resume during snapshot: stream should stay paused until snapshot completes`, async () => {
    const { stream, mockVisibility, hangingSnapshot } = createMockShapeStream()
    
    hangingSnapshot()       // snapshot starts, pauses stream
    mockVisibility.show()   // tab visible → resume called
    
    // Stream should NOT resume while snapshot is in flight
    expect(stream.isPaused()).toBe(true)
  })
})

Risk 8: The #isRefreshing Flag and queueMicrotask Timing

#isRefreshing has two different lifecycle patterns:

Pattern A — forceDisconnectAndRefresh() (line ~1453):

this.#isRefreshing = true
this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
await this.#nextTick()          // async wait
this.#isRefreshing = false      // cleared after await

Pattern B — system wake detection (line ~1553):

this.#isRefreshing = true
this.#requestAbortController.abort(SYSTEM_WAKE)
queueMicrotask(() => {
  this.#isRefreshing = false    // cleared via microtask
})

If wake detection fires during a forceDisconnectAndRefresh():

  1. forceDisconnectAndRefresh sets #isRefreshing = true, aborts, awaits #nextTick()
  2. setInterval fires wake detection → sets #isRefreshing = true (already true, no-op)
  3. Queues microtask to clear → #isRefreshing = false
  4. #nextTick() resolves → forceDisconnectAndRefresh sets #isRefreshing = false (already false)

Between step 3 and 4, there's a window where #isRefreshing is false even though forceDisconnectAndRefresh hasn't completed. During this window, if a new fetch starts, applyUrlParams would see canLongPoll: true (because !this.#isRefreshing is true), which is wrong — we should NOT long-poll during a refresh.

This is a narrow timing window and may never manifest in practice, but it demonstrates the general fragility of managing #isRefreshing through two different async clearing mechanisms. No test covers this.

Proposed test:

it(`concurrent forceDisconnectAndRefresh + wake detection: isRefreshing stays true`, async () => {
  const { stream, triggerWake, getNextFetchUrl } = createMockShapeStream()
  
  const refreshPromise = stream.forceDisconnectAndRefresh()
  triggerWake()  // fire wake detection while refresh is in progress
  
  // Before refresh completes, any URL construction should see isRefreshing=true
  // (i.e., canLongPoll should be false)
  const url = getNextFetchUrl()
  expect(url.searchParams.has('live')).toBe(false)
  
  await refreshPromise
})

Risk 9: PausedState/ErrorState Delegation Chain Depth

PausedState wraps any state:

pause(): PausedState {
  return new PausedState(this)  // wraps current state
}

ErrorState also wraps:

toErrorState(error: Error): ErrorState {
  return new ErrorState(this, error)
}

There's nothing in the state machine preventing nesting:

  • state.pause().pause() → PausedState(PausedState(original))
  • state.toErrorState(e1).toErrorState(e2) → ErrorState(ErrorState(original, e1), e2)
  • state.pause().toErrorState(e) → ErrorState(PausedState(original), e)

The ShapeStream.#pause() guard (!(this.#syncState instanceof PausedState)) prevents double-pause at the ShapeStream level. But the state machine itself is unprotected — it's a library that could be used elsewhere or called from unexpected paths.

Specific concern with double-pause:

// If somehow pause() is called twice:
const paused2 = state.pause().pause()
paused2.resume()  // returns PausedState(original), NOT original
// Need resume() twice to get back to original

And with error-during-pause:

const errored = state.pause().toErrorState(err)
errored.retry()   // returns PausedState(original)
// Now we're in PausedState — is this intended?
// #resume() would need to be called to actually resume

That error-during-pause case is actually interesting — if the stream is paused and an error occurs (maybe from a snapshot request), retry() returns the paused state. Is this the right behavior? It preserves the pause, which seems correct. But it's an interaction that should be explicitly tested.

Proposed tests:

describe(`delegation chain edge cases`, () => {
  it(`error during pause: retry returns PausedState`, () => {
    const live = new LiveState(makeShared())
    const paused = live.pause()
    const errored = paused.toErrorState(new Error('snapshot failed'))
    
    const retried = errored.retry()
    expect(retried).toBeInstanceOf(PausedState)
    expect(retried.isUpToDate).toBe(true)  // paused-from-live
    
    // Resume from the paused state should get back to live
    expect((retried as PausedState).resume()).toBeInstanceOf(LiveState)
  })

  it(`markMustRefetch from error-during-pause: resets to InitialState (not PausedState)`, () => {
    const live = new LiveState(makeShared())
    const errored = live.pause().toErrorState(new Error('x'))
    
    const reset = errored.markMustRefetch('new-handle')
    expect(reset).toBeInstanceOf(InitialState)  // fully unwrapped
    expect(reset.handle).toBe('new-handle')
  })

  it(`double-wrap protection: pause() on PausedState creates nested wrapper`, () => {
    const live = new LiveState(makeShared())
    const paused1 = live.pause()
    const paused2 = paused1.pause()
    
    // This creates a double-wrapped PausedState
    expect(paused2).toBeInstanceOf(PausedState)
    expect(paused2.resume()).toBeInstanceOf(PausedState)  // only unwraps one layer
    expect((paused2.resume() as PausedState).resume()).toBeInstanceOf(LiveState)
    
    // Consider: should pause() be idempotent on PausedState?
    // If so, this is a design decision worth making explicit.
  })
})

Risk 10: The Parallel State That Didn't Move

The state machine extracted the sync-related fields, but ShapeStream still maintains its own parallel state:

| Field | Lifecycle | Used for |
| --- | --- | --- |
| #connected | Set true in #fetchShape, false in error paths + #reset() + normal completion | isConnected() public API |
| #isMidStream | true on messages, false on up-to-date, true on reset | #waitForStreamEnd() for snapshot coordination |
| #isRefreshing | true before abort, false after next tick / via microtask | shouldUseSse(), canLongPoll in URL params |
| #pauseRequested | true by #pause(), consumed by #requestShape() | Coordinates async pause with state machine |
| #activeSnapshotRequests | Incremented/decremented around snapshot requests, reset in #reset() | Coordinates pause/resume for concurrent snapshots |
| #started | true in #start(), false before retry | Guards against multiple starts |

These form their own implicit state machine with invariants that should hold:

  • #connected should be true only when a fetch/SSE connection is active
  • #isMidStream should be true only between receiving data messages and the up-to-date control message
  • #isRefreshing should be true only during a brief abort→reconnect window
  • #pauseRequested should be true only between #pause() and the next #requestShape() iteration
  • #activeSnapshotRequests should never go negative

None of these invariants are tested. And the interactions between them are subtle — for example, #reset() clears #activeSnapshotRequests to 0, but the comment at line 1696 explicitly says snapshot 409s DON'T call #reset() to avoid breaking the counter. This constraint is enforced by convention, not by tests.

Proposed: parallel state invariant checker

function assertParallelStateInvariants(stream: TestableShapeStream) {
  // activeSnapshotRequests is never negative
  expect(stream.activeSnapshotRequests).toBeGreaterThanOrEqual(0)
  
  // If stream is paused and not mid-snapshot, pauseRequested should be false
  // (it should have been consumed by #requestShape)
  if (stream.isPaused() && stream.activeSnapshotRequests === 0) {
    expect(stream.pauseRequested).toBe(false)
  }
  
  // If not started, connected must be false
  if (!stream.hasStarted()) {
    expect(stream.isConnected()).toBe(false)
  }
  
  // If isRefreshing, we should be in an abort→reconnect cycle
  // (hard to check directly, but can verify it doesn't persist)
}

Summary of All Three Reviews

| Review | Focus | Key Gaps |
| --- | --- | --- |
| Part 1 | State machine in isolation | No algebraic property tests, no scenario DSL, no fuzzing, missing edge cases |
| Part 2 | Glue layer (state machine ↔ real world) | No input extraction tests, no transition branch tests, no pause protocol tests, no snapshot 409 distinction test |
| Part 3 | Concurrency + parallel state | Snapshot↔visibility pause conflict (potential bug), #isRefreshing microtask race, delegation depth, parallel state invariants |

The three layers form a testing pyramid:

  • Bottom: Pure state machine (algebraic properties, fuzz, DSL) — cheapest to write, fastest to run
  • Middle: Glue layer (mock fetch harness, targeted adapter tests) — moderate cost, high value
  • Top: Concurrency interactions (snapshot↔visibility, wake↔refresh, delegation chains) — hardest to test, but where the most surprising bugs live

The state machine refactoring was the right move — it makes the bottom two layers testable for the first time. The concurrency issues in the top layer predate this PR, but are now more visible because the state machine makes it clear when writes to #syncState could conflict.

@KyleAMathews
Contributor

KyleAMathews commented Feb 10, 2026

Design Review: Parallel State & Pause Coordination

Following up on the testing reviews (Part 1, Part 2, Part 3) — this comment looks at the remaining transport/connection state that lives outside the state machine, and proposes a targeted fix for the coordination bugs identified in Part 3.


The Two Layers

The PR cleanly extracts the sync protocol into a state machine: offset, handle, cursor, schema, up-to-date status, replay mode, stale cache retry. This is a state progression (Initial → Syncing → Live) where each state carries different data and responds to events differently. State machine is the right abstraction here.

The transport/connection layer stays as fields on ShapeStream:

| Field | What it tracks | Contention? |
| --- | --- | --- |
| #started | Has subscribe() been called | No — simple lifecycle |
| #connected | Is a fetch/SSE physically active | No — set true on fetch start, false on end |
| #isMidStream | Between data messages and up-to-date | No — toggled by message processing |
| #isRefreshing | In an abort→reconnect window | Minor — two different clearing mechanisms |
| #pauseRequested | Pause intent not yet consumed by request loop | Yes — shared between visibility, snapshots, request loop |
| #activeSnapshotRequests | Concurrent snapshot counter | Yes — coordinates pause/resume with visibility |

The first three are simple lifecycle flags with no contention — they're fine as booleans. The last three are where the coordination complexity (and bugs) live.


The Coordination Problem

#pause() and #resume() are called from three independent sources:

  1. Visibility handler — tab hidden → pause, tab visible → resume
  2. Snapshot requests — first snapshot → pause, last snapshot completes → resume
  3. #requestShape() loop — consumes #pauseRequested, transitions sync state to PausedState

These callers don't know about each other. The current code uses #pauseRequested (boolean) + #activeSnapshotRequests (counter) + instanceof PausedState (sync state check) to coordinate them, which produces bugs:

Bug 1: Snapshot resume overrides visibility pause

  1. Tab visible, stream Live
  2. requestSnapshot() → counter=1, calls #pause()
  3. Tab goes hidden → #pause() no-ops (already paused/pause-requested)
  4. Snapshot completes → counter=0, calls #resume() — stream resumes while tab hidden

Bug 2: Visibility resume overrides snapshot pause

  1. Snapshot in progress, stream paused for snapshot
  2. Tab goes hidden → #pause() no-ops (already paused)
  3. Tab goes visible → #resume() — stream resumes while snapshot still in flight

Bug 3: Snapshot blocks on live long-poll, consuming browser connections

This one was reported separately and has two interacting issues:

Issue A: requestSnapshot calls #waitForStreamEnd() BEFORE #pause(). When the stream is in a live long-poll (which can hold for up to 20 seconds), #isMidStream is false so #waitForStreamEnd() returns immediately — but the long-poll is still active. The subsequent #pause() sets #pauseRequested and aborts the controller, but the snapshot fetch then competes with the dying long-poll for browser HTTP connections (especially on HTTP/1.1 with connection limits).

Issue B: In #requestShape(), the finally block sets this.#requestAbortController = undefined BEFORE the recursive call creates a new one via #createAbortListener. During this gap, #pause() can't abort anything because the controller is undefined. The request loop then proceeds to start a new long-poll despite #pauseRequested being true, because the pause check at the top of #requestShape already passed on the current iteration. This new long-poll blocks the snapshot POST.

All three bugs stem from the same root cause: pause coordination is split across multiple mechanisms (#pauseRequested boolean, #activeSnapshotRequests counter, abort controller, sync state instanceof checks) with no single source of truth.


Proposed Fix: Pause Lock

Replace #pauseRequested, #activeSnapshotRequests, and the pause-related logic with a counting lock that tracks pause reasons:

class PauseLock {
  #holders = new Set<string>()
  #onStateChange: (isPaused: boolean) => void

  constructor(onStateChange: (isPaused: boolean) => void) {
    this.#onStateChange = onStateChange
  }

  acquire(reason: string): void {
    if (this.#holders.has(reason)) {
      // Set-based lock is naturally idempotent — double acquire is safe
      // but likely indicates a caller bug (e.g., visibilitychange firing
      // 'hidden' twice without a 'visible' in between)
      console.warn(
        `[Electric] PauseLock: "${reason}" already held — ignoring duplicate acquire`
      )
      return
    }
    const wasEmpty = this.#holders.size === 0
    this.#holders.add(reason)
    if (wasEmpty) this.#onStateChange(true)
  }

  release(reason: string): void {
    this.#holders.delete(reason)
    if (this.#holders.size === 0) {
      this.#onStateChange(false)
    }
  }

  get isPaused(): boolean {
    return this.#holders.size > 0
  }

  /** Check if a specific reason is holding the lock */
  isHeldBy(reason: string): boolean {
    return this.#holders.has(reason)
  }
}

The Set-based design means double-acquire is safe (idempotent no-op), but the warning helps catch caller bugs early — if acquire('visibility') fires twice, something is wrong with the visibility handler. Different reasons coexisting is the whole point of the lock; the same reason appearing twice is likely a bug.

Usage in ShapeStream:

// In constructor:
this.#pauseLock = new PauseLock((isPaused) => {
  if (isPaused) {
    this.#requestAbortController?.abort(PAUSE_STREAM)
  } else {
    if (this.options.signal?.aborted) return
    this.#start()
  }
})

// Visibility handler — simple, no guards needed:
if (document.hidden) this.#pauseLock.acquire('visibility')
else this.#pauseLock.release('visibility')

// Snapshot requests — acquire BEFORE waitForStreamEnd:
async requestSnapshot(opts) {
  this.#pauseLock.acquire(`snapshot-${snapshotId}`)
  // acquire() immediately aborts the live long-poll via onStateChange,
  // so waitForStreamEnd() resolves fast instead of blocking 20s
  await this.#waitForStreamEnd()
  try {
    return await this.fetchSnapshot(opts)
  } finally {
    this.#pauseLock.release(`snapshot-${snapshotId}`)
  }
}

// Wake detection — doesn't need pause at all, just refresh:
// (unchanged)

// #requestShape — check the lock instead of #pauseRequested:
if (this.#pauseLock.isPaused) {
  this.#syncState = this.#syncState.pause()
  return
}

What this fixes:

  • Bug 1 — Snapshot resume while tab hidden: snapshot releases its lock, but visibility lock is still held → stream stays paused ✓
  • Bug 2 — Visibility resume during snapshot: visibility releases its lock, but snapshot lock is still held → stream stays paused ✓
  • Bug 3 — Snapshot blocks on long-poll: acquire() immediately kills the long-poll (via onStateChange → abort). And the #requestShape loop checks pauseLock.isPaused on every iteration — lock state is always consistent regardless of abort controller lifecycle, so no new long-poll can sneak through the gap. ✓
  • Rapid pause→resume: acquire + release before request loop ticks → lock is empty → no pause transition ✓
  • Multiple concurrent snapshots: each holds its own named lock, stream resumes only when all release ✓

What this eliminates:

  • #pauseRequested boolean — lock acquisition IS the request
  • #activeSnapshotRequests counter — each snapshot holds a named lock
  • The two-phase pause protocol (set flag → consume flag) — lock state is always consistent
  • The guard conditions in #pause() (!this.#pauseRequested && !(this.#syncState instanceof PausedState)) — the lock handles idempotency
  • The abort controller gap race — lock doesn't depend on the controller existing

What stays the same:

  • The sync state machine's PausedState — #requestShape still transitions #syncState to PausedState when the lock is held, and back when it's not
  • The resumingFromPause detection for avoiding live long-polling
  • All the external behavior (subscribers, URL params, etc.)

The #isRefreshing Simplification

Separately, #isRefreshing has two different clearing mechanisms that can race:

// Pattern A: forceDisconnectAndRefresh
this.#isRefreshing = true
this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
await this.#nextTick()
this.#isRefreshing = false

// Pattern B: wake detection
this.#isRefreshing = true
this.#requestAbortController.abort(SYSTEM_WAKE)
queueMicrotask(() => { this.#isRefreshing = false })

If both fire concurrently, the queueMicrotask can clear #isRefreshing before forceDisconnectAndRefresh's await completes. The simplest fix is to use the same mechanism everywhere — either always await #nextTick() or use a counter:

#refreshCount = 0

get #isRefreshing() { return this.#refreshCount > 0 }

async forceDisconnectAndRefresh() {
  this.#refreshCount++
  try {
    this.#requestAbortController?.abort(FORCE_DISCONNECT_AND_REFRESH)
    await this.#nextTick()
  } finally {
    this.#refreshCount--
  }
}

// Wake detection: same pattern with increment/try/finally/decrement

This eliminates the microtask race entirely.


What NOT to Change

The remaining transport flags are fine as simple booleans:

  • #connected — set true on fetch start, false on end. No contention, no coordination needed.
  • #isMidStream — toggled by message processing. Used by #waitForStreamEnd() with a promise resolver. Simple and correct.
  • #started — lifecycle guard. Simple and correct.

These don't need a state machine or any coordination abstraction. A state machine is the right tool for state progressions (the sync protocol). A lock is the right tool for coordination (pause/resume). Simple flags are the right tool for independent lifecycle tracking.


Summary

| Component | Current | Proposed | Why |
| --- | --- | --- | --- |
| Sync protocol | State machine ✓ | Keep as-is | Clean state progression, already well-designed |
| Pause coordination | #pauseRequested + #activeSnapshotRequests + guards | Pause lock | Fixes bugs 1–3, eliminates two-phase protocol and abort controller race |
| Refresh flag | #isRefreshing with two clearing mechanisms | Counter or single clearing mechanism | Eliminates microtask race |
| Connection/lifecycle | #connected, #isMidStream, #started | Keep as-is | Simple, independent, no contention |

The pause lock is ~20 lines, trivially testable in isolation, and directly fixes all three coordination bugs. It's a much smaller and more targeted change than a full transport state machine — right tool for the right problem.
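
For what it's worth, that isolated test is short enough to sketch here (vitest-style, matching the examples above; it assumes the PauseLock class exactly as written, imported from wherever it ends up living):

import { describe, expect, it, vi } from 'vitest'

describe(`PauseLock`, () => {
  it(`stays paused until every holder releases`, () => {
    const onStateChange = vi.fn()
    const lock = new PauseLock(onStateChange)

    lock.acquire(`visibility`)
    lock.acquire(`snapshot-1`)
    expect(onStateChange).toHaveBeenCalledTimes(1)        // only the first acquire pauses
    expect(onStateChange).toHaveBeenLastCalledWith(true)

    lock.release(`snapshot-1`)
    expect(lock.isPaused).toBe(true)                      // visibility still holds the lock

    lock.release(`visibility`)
    expect(lock.isPaused).toBe(false)
    expect(onStateChange).toHaveBeenLastCalledWith(false) // resumes only when no holders remain
  })
})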

KyleAMathews pushed a commit that referenced this pull request Feb 10, 2026
Fix a concurrency bug where the visibility handler and snapshot
requests could override each other's pause state. Without this fix:

1. A snapshot completing while the tab is hidden would resume the
   stream, wasting bandwidth on a long-poll the user can't see.
2. A tab becoming visible during an active snapshot would resume
   the stream, causing concurrent writes from both the main stream
   and the snapshot.

Each resume path now checks whether the other pause reason still
holds, inspired by the PauseLock concept from PR #3816 review.

https://claude.ai/code/session_01UGPdwB6UpFkkQi9p4sjPRj
KyleAMathews pushed a commit that referenced this pull request Feb 10, 2026
Replace the #pauseRequested boolean + #activeSnapshotRequests counter
with a set-based PauseLock that tracks *why* the stream is paused.

This fixes three concurrency bugs identified in PR #3816 review:

1. Snapshot resume while tab hidden: snapshot completes and resumes
   the stream even though the tab is still hidden, wasting bandwidth.
   Fix: snapshot releases its lock reason, but 'visibility' reason
   remains held — stream stays paused.

2. Visibility resume during active snapshot: tab becomes visible and
   resumes the stream while a snapshot is in flight, causing both the
   main stream and snapshot to write concurrently.
   Fix: visibility releases its lock reason, but 'snapshot-N' reason
   remains held — stream stays paused.

3. Snapshot blocks on live long-poll: requestSnapshot called
   #waitForStreamEnd BEFORE #pause, blocking up to 20s waiting for
   the long-poll to complete.
   Fix: PauseLock.acquire() is called BEFORE waitForStreamEnd,
   immediately aborting the long-poll via the onAcquired callback.

Also fixes the #isRefreshing microtask race by replacing the boolean
flag with a counter + getter pattern. forceDisconnectAndRefresh and
wake detection both increment/decrement in try/finally blocks,
eliminating the window where concurrent operations could clear the
flag prematurely.

https://claude.ai/code/session_01UGPdwB6UpFkkQi9p4sjPRj
@kevin-dp
Contributor Author

@KyleAMathews tl;dr of those reviews? ;D

@KyleAMathews
Contributor

@kevin-dp and I chatted and we'll work on my suggestions as follow up PRs.

Contributor

@KyleAMathews KyleAMathews left a comment

:shipit: huge improvement! The code is way more readable and reliable feeling now.

This PR reproduces a bug where the schema becomes `undefined` after
handling a stale response which may lead to parse errors ([see CI test
failure](https://github.com/electric-sql/electric/actions/runs/21870136703/job/63122684063)).
This bug was found by Claude during a review:

**Issue: schema undefined + ignored stale response → crash on `schema!`**

The original code has the exact same flow:
1. Stale response with local handle → `return` from `#onInitialResponse` (line 1129 on main), skipping `this.#schema = this.#schema ?? getSchemaFromHeaders(headers)` (line 1144 on main)
2. Control returns to `#requestShapeLongPoll` which does `const schema = this.#schema!`
3. If schema was undefined (fresh session resuming from persisted handle/offset), this crashes

The refactored code does the same thing: ignored transition → return early → `this.#syncState.schema!`. Identical behavior.

This PR
[reproduces](https://github.com/electric-sql/electric/actions/runs/21897631349/job/63217468839?pr=3828)
and fixes a bug related to stale shape handles.

### Bug: stale cache detection fails when client's own handle is the
expired handle

When a shape handle is marked as expired (e.g. after a 409 response),
the client is supposed to retry with a cache buster query parameter to
bypass stale CDN/proxy caches. However, this only works when the client
has **no handle** (fresh start) or a **different handle** than the
expired one.

When the client resumes with a persisted handle that happens to be the
same as the expired handle (`localHandle === expiredHandle`), the stale
detection logic sees that the client already has a handle and returns
`ignored` instead of `stale-retry`. The client logs a warning ("Ignoring
the stale response") but never adds a cache buster — so it just keeps
receiving the same stale cached response in an infinite loop.

### How the test reproduces it

1. Marks handle `expired-H1` as expired in the `ExpiredShapesCache`
2. Creates a `ShapeStream` that resumes with `handle: expired-H1`
(simulating a client that persisted this handle from a previous session)
3. The mock backend always returns responses with that same expired
handle (mimics CDN behaviour)
4. Asserts that the client should use a `cache_buster` query parameter
to escape the stale cache — which currently fails because the client
takes the `ignored` path instead of `stale-retry`

### Root cause

In `checkStaleResponse` (lines 311-344), the condition at line 322 is:

```typescript
if (this.#shared.handle === undefined) {
  // enter stale retry
}
// else: "We have a valid local handle — ignore this stale response"
```

This assumes that if the client has a local handle, it's a *different*
handle from the expired one, so the stale response can be safely
ignored. But that assumption is wrong when `localHandle ===
expiredHandle` — the client resumed with the same handle that was marked
expired.

At this point in the code, we already know `responseHandle ===
expiredHandle` (line 317). The missing check is whether
`this.#shared.handle` is *also* the expired handle.

### Fix

Change the condition at line 322 from:
```typescript
if (this.#shared.handle === undefined) {
```
to:
```typescript
if (this.#shared.handle === undefined || this.#shared.handle === expiredHandle) {
```
That's it — one condition added. When the client's own handle matches
the expired handle, it enters `stale-retry` (gets a cache buster)
instead of falling through to `ignored`. The rest of the stale-retry
machinery already handles everything correctly from there.
@kevin-dp kevin-dp merged commit b0cbe75 into main Feb 11, 2026
42 checks passed
@kevin-dp kevin-dp deleted the kevin/state-machine branch February 11, 2026 16:44
@K-Mistele

amazing, when should we expect a new tag?

@github-actions
Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @electric-sql/client@1.5.3

Thanks for contributing to Electric!
