fix(agents-server): release pull-wake claim row even when in-memory token is missing by kevin-dp · Pull Request #4346 · electric-sql/electric

kevin-dp · 2026-05-18T13:09:08Z

Summary

Fixes #4340 — pull-wake claims leaking in `consumer_claims` after dispatch, regardless of whether the agent run succeeded or failed.

The release path in `callback-forward` was gated by `stillOwnsClaim` — an in-memory check that fails after server restart or when a newer wake on the same stream evicts the token. When that happened, the DB row stayed at `status='active'` indefinitely and the entity remained stuck at `status='running'` long after `done` arrived. Live testing observed this consistently: `active_count` would grow by 1 per dispatched wake and never decrement, even though the agent visibly completed in the UI.

Root cause

`packages/agents-server/src/routing/internal-router.ts` had all three release actions behind the same in-memory gate:

```ts
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...) // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...) // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
```

Any path that lost the in-memory token (server restart, parallel wakes evicting each other, retries after a transient failure) skipped `materializeReleasedClaim` entirely, leaving the row in the table forever. The in-memory token is a write-authorization concern; it shouldn't gate the durable-row release, which is keyed by `(consumerId, epoch)` — sufficient and authoritative DB identity.
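
To make the two loss modes concrete, here is a minimal sketch of an in-memory token registry. It is not the agents-server implementation: `TokenRegistry`, `ClaimToken`, and the assumption that tokens are keyed by stream path are all hypothetical, chosen only to show why `stillOwnsClaim` can return false for a claim whose DB row is still active.

```typescript
// Hypothetical model of an in-memory write-token registry.
// Names and the "one token per stream" keying are assumptions for illustration.
type ClaimToken = { consumerId: string; epoch: number }

class TokenRegistry {
  private tokens = new Map<string, ClaimToken>()

  // Minting a token for a stream replaces whatever token was there before.
  mint(stream: string, token: ClaimToken): void {
    this.tokens.set(stream, token)
  }

  // True only if the registry still holds exactly this (consumerId, epoch).
  stillOwnsClaim(stream: string, token: ClaimToken): boolean {
    const current = this.tokens.get(stream)
    return (
      current !== undefined &&
      current.consumerId === token.consumerId &&
      current.epoch === token.epoch
    )
  }
}

const wake1: ClaimToken = { consumerId: 'consumer-a', epoch: 1 }
const wake2: ClaimToken = { consumerId: 'consumer-a', epoch: 2 }

// Eviction: a newer wake on the same stream overwrites wake-1's token.
const registry = new TokenRegistry()
registry.mint('/stream/x', wake1)
registry.mint('/stream/x', wake2)
const ownsAfterEviction = registry.stillOwnsClaim('/stream/x', wake1)

// Restart: a fresh process starts with an empty map, so no token is owned.
const afterRestart = new TokenRegistry()
const ownsAfterRestart = afterRestart.stillOwnsClaim('/stream/x', wake1)
```

In both cases the check fails even though nothing in the database has changed, which is why it is a poor gate for the durable release.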

The fix

Three concerns, three different gates:

DB row release (`materializeReleasedClaim`): runs whenever `epoch` is defined. `(consumerId, epoch)` is the DB primary key, which is enough to identify and release the row safely.
Entity status → idle + `onEntityChanged`: runs when `entityCleared || stillOwnsClaim`. `entityCleared` is a new return field from `materializeReleasedClaim` indicating whether our `(consumerId, epoch)` was the active dispatch (and we just cleared it). The OR covers two cases: (a) a server restart, where the in-memory token is gone but our row was still the active dispatch (`entityCleared` is true); (b) a retry after a failed `updateStatus`, where the token is still owned but the dispatch state was already cleared by the first attempt (`stillOwnsClaim` is true). When both are false (a newer wake has taken over, or the legacy "stale done" test where no claim was materialized in the DB and the token was evicted), the status is left alone. See Test cases below; the combined gate handles all five scenarios.
In-memory token cleanup (`clearStream`): remains gated by `stillOwnsClaim` so we never clear a newer consumer's token from under it.
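
The three gates can be sketched as a single handler with injected dependencies. This is an illustration under stated assumptions, not the code in `internal-router.ts`: `handleDone`, `ReleaseDeps`, and `DoneContext` are hypothetical names, and the argument lists of `materializeReleasedClaim` and `clearStream` are guesses because the PR elides them.

```typescript
// Hypothetical dependency surface; real signatures are elided in the PR.
interface ReleaseDeps {
  materializeReleasedClaim(
    consumerId: string,
    epoch: number
  ): Promise<{ entityCleared: boolean }>
  updateStatus(entityUrl: string, status: 'idle'): Promise<void>
  onEntityChanged(entityUrl: string): Promise<void>
  clearStream(stream: string): void
}

// Hypothetical view of what the done callback knows about the claim.
interface DoneContext {
  consumerId: string
  epoch: number | undefined
  entityUrl: string | undefined
  stream: string
  stillOwnsClaim: boolean
}

async function handleDone(ctx: DoneContext, deps: ReleaseDeps): Promise<void> {
  // Gate 1: the durable row release needs only the DB identity.
  let entityCleared = false
  if (ctx.epoch !== undefined) {
    const result = await deps.materializeReleasedClaim(ctx.consumerId, ctx.epoch)
    entityCleared = result.entityCleared
  }

  // Gate 2: mark the entity idle if we cleared the active dispatch,
  // or if we still hold the token (retry after a failed updateStatus).
  if (ctx.entityUrl !== undefined && (entityCleared || ctx.stillOwnsClaim)) {
    await deps.updateStatus(ctx.entityUrl, 'idle')
    await deps.onEntityChanged(ctx.entityUrl)
  }

  // Gate 3: only the current token owner may clear the in-memory token.
  // This runs after updateStatus so a thrown error leaves the token in place
  // and a retried done still passes gate 2.
  if (ctx.stillOwnsClaim) {
    deps.clearStream(ctx.stream)
  }
}
```

Placing the token cleanup last is a choice made in this sketch so that the retry scenario keeps working; the ordering in the actual patch may differ.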

`materializeReleasedClaim` API change

```diff

): Promise<ConsumerClaim | null> {

): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {
```

There is only one production caller (`internal-router.ts`); both it and the test mock are updated. The `.returning()` on the `entityDispatchState` UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim has taken over).
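
The semantics of the new return value can be modelled without a database. The sketch below is an in-memory stand-in, not the real query: `InMemoryClaimStore` is hypothetical, it tracks a single entity for brevity, and reusing one `consumerId` with two epochs is an arbitrary choice for the example.

```typescript
type ConsumerClaim = {
  consumerId: string
  epoch: number
  status: 'active' | 'released'
}

// Hypothetical in-memory model of consumer_claims plus one entity's
// dispatch state. The real code issues SQL; this only mirrors the outcome.
class InMemoryClaimStore {
  claims: ConsumerClaim[] = []
  activeDispatch: { consumerId: string; epoch: number } | null = null

  // Records a new active claim and makes it the entity's active dispatch.
  materializeActiveClaim(consumerId: string, epoch: number): void {
    this.claims.push({ consumerId, epoch, status: 'active' })
    this.activeDispatch = { consumerId, epoch }
  }

  materializeReleasedClaim(
    consumerId: string,
    epoch: number
  ): { claim: ConsumerClaim | null; entityCleared: boolean } {
    // Release the row identified by (consumerId, epoch), if it exists.
    const claim =
      this.claims.find((c) => c.consumerId === consumerId && c.epoch === epoch) ??
      null
    if (claim !== null) claim.status = 'released'

    // Clear the dispatch state only if it still points at our claim.
    const active = this.activeDispatch
    const entityCleared =
      active !== null && active.consumerId === consumerId && active.epoch === epoch
    if (entityCleared) this.activeDispatch = null

    return { claim, entityCleared }
  }
}

// wake-2 takes over the entity before wake-1's done arrives.
const store = new InMemoryClaimStore()
store.materializeActiveClaim('consumer-a', 1)
store.materializeActiveClaim('consumer-a', 2)
const stale = store.materializeReleasedClaim('consumer-a', 1)
const current = store.materializeReleasedClaim('consumer-a', 2)
```

Here `stale` releases wake-1's row but reports `entityCleared: false`, while `current` releases wake-2's row and reports `entityCleared: true`.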

Test cases

Five scenarios, all behaving correctly with the new gating:

| Scenario | `entityCleared` | `stillOwnsClaim` | DB row released? | Entity → idle? |
| --- | --- | --- | --- | --- |
| A. Happy path (mint + done) | true | true | ✅ | ✅ |
| B. Server restart (no in-memory token, but DB row active) | true | false | ✅ | ✅ |
| C. Newer wake (wake-1 done after wake-2 took over the stream) | false | false | ✅ (wake-1's row) | ❌ (wake-2 is in flight) |
| D. Retry (first done's `updateStatus` threw; same done retried) | false | true | ✅ (no-op) | ✅ |
| E. Legacy stale-done test (no `materializeActiveClaim` in test setup, token evicted) | false | false | n/a | ❌ |
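
The "Entity → idle?" column follows mechanically from the combined gate. As a quick check, the table can be encoded as data and compared against `entityCleared || stillOwnsClaim`; the helper name `shouldMarkIdle` and the `Scenario` type are made up for this sketch.

```typescript
type Scenario = {
  name: string
  entityCleared: boolean
  stillOwnsClaim: boolean
  expectIdle: boolean
}

// The five scenarios, transcribed from the table above.
const scenarios: Scenario[] = [
  { name: 'A. Happy path', entityCleared: true, stillOwnsClaim: true, expectIdle: true },
  { name: 'B. Server restart', entityCleared: true, stillOwnsClaim: false, expectIdle: true },
  { name: 'C. Newer wake', entityCleared: false, stillOwnsClaim: false, expectIdle: false },
  { name: 'D. Retry', entityCleared: false, stillOwnsClaim: true, expectIdle: true },
  { name: 'E. Legacy stale-done', entityCleared: false, stillOwnsClaim: false, expectIdle: false },
]

// The combined gate described in "The fix".
function shouldMarkIdle(entityCleared: boolean, stillOwnsClaim: boolean): boolean {
  return entityCleared || stillOwnsClaim
}

// Names of scenarios where the gate disagrees with the table (expected: none).
const mismatches = scenarios
  .filter((s) => shouldMarkIdle(s.entityCleared, s.stillOwnsClaim) !== s.expectIdle)
  .map((s) => s.name)
```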

New tests live in `packages/agents-server/test/webhook-forward-routing.test.ts` under `claim release on done callback (regression for #4340)`. Existing tests (`server-claim-write-token.test.ts > stale done does not mark a newer active claim idle`, `> done retries still transition to idle when updateStatus fails on first attempt`) continue to pass.

Verified manually

Tested in the desktop app against a local agents-server:

Baseline: one pre-existing stuck claim (`active_count: 1`) from a prior orphan that this fix cannot retroactively clean up.
Sent two messages (one in an existing session, one in a new session) → `active_count: 1 → 3`.
Both agents finished → `active_count: 3 → 2 → 1` (each release fires once the runtime's `sendDone` reaches the server).

The pre-existing orphan from before the fix remains stuck — that needs a separate reaper job or manual cleanup (out of scope here).

Not addressed in this PR

Pre-existing orphan rows: rows that leaked under the old code can't be released because no fresh `done` callback is coming for them. Would need a reaper job or admin command to clean up.
`lease_expires_at: null` issue (#4341, "Pull-wake: materialized claim has null lease_expires_at when upstream omits lease_ttl_ms"): independent issue. Without a lease, even a reaper job can't time out claims safely.
`sendDone` latency: in testing, `done` arrives ~minutes after the agent visibly completes. That's a runtime-side concern, not part of this fix.
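
To illustrate why the null-lease issue blocks a reaper, here is a rough sketch of the selection step such a job might perform. No reaper exists in this PR; `partitionForReaping`, `ClaimRow`, and the camelCase `leaseExpiresAt` field (standing in for `lease_expires_at`) are hypothetical.

```typescript
// Hypothetical row shape; leaseExpiresAt stands in for lease_expires_at.
type ClaimRow = {
  consumerId: string
  epoch: number
  status: 'active' | 'released'
  leaseExpiresAt: Date | null
}

// Splits active claims into those whose lease has expired (safe to reap)
// and those with no lease at all (cannot be timed out safely).
function partitionForReaping(
  rows: ClaimRow[],
  now: Date
): { reapable: ClaimRow[]; undecidable: ClaimRow[] } {
  const reapable: ClaimRow[] = []
  const undecidable: ClaimRow[] = []
  for (const row of rows) {
    if (row.status !== 'active') continue
    if (row.leaseExpiresAt === null) {
      undecidable.push(row)
    } else if (row.leaseExpiresAt.getTime() <= now.getTime()) {
      reapable.push(row)
    }
  }
  return { reapable, undecidable }
}

const now = new Date('2026-05-18T13:00:00Z')
const rows: ClaimRow[] = [
  // Lease expired an hour ago.
  { consumerId: 'a', epoch: 1, status: 'active', leaseExpiresAt: new Date('2026-05-18T12:00:00Z') },
  // Lease still valid for another hour.
  { consumerId: 'b', epoch: 1, status: 'active', leaseExpiresAt: new Date('2026-05-18T14:00:00Z') },
  // Active with no lease: the orphan-or-long-running question cannot be answered.
  { consumerId: 'c', epoch: 1, status: 'active', leaseExpiresAt: null },
  // Already released, ignored.
  { consumerId: 'd', epoch: 1, status: 'released', leaseExpiresAt: null },
]
const { reapable, undecidable } = partitionForReaping(rows, now)
```

Only row `a` ends up reapable; row `c` lands in `undecidable`, which is the situation #4341 describes.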

Base branch note

This PR targets `fix-pull-wake` (#4339), not `main`, because `materializeReleasedClaim` was introduced in #4308 which is part of the `fix-pull-wake` lineage but not yet in `main`. Merge order: this → fix-pull-wake → main.

🤖 Generated with Claude Code

…ver dev fallback

The desktop's default owner_principal was `system:local-desktop`, but the agents-server falls back to `system:dev-local` when no auth header is present. The principal mismatch caused runner registration to fail with 403 UNAUTHORIZED for unauthenticated local development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oken is missing

The release path in callback-forward was gated by `stillOwnsClaim`, an in-memory check that fails after server restart or when a newer wake on the same stream evicts the token. When that happened, the consumer_claims row stayed at status=active indefinitely and the entity remained stuck at status=running long after `done` arrived.

Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is sufficient to release the row). Now returns `{ claim, entityCleared }` where entityCleared is true iff our (consumerId, epoch) was the active dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared || stillOwnsClaim`, which covers happy path, server restart, and retry-after-failed-updateStatus, while still leaving status=running when a newer wake holds the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another consumer's token from under it.

Regression tests in test/webhook-forward-routing.test.ts cover the three failure modes (lost token, evicted token, retry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-18T13:11:55Z

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 223 | 1 | 222 | 2 |

View the full list of 1 ❄️ flaky test(s)

test/horton-pull-wake-e2e.test.ts > pull-wake Horton e2e with mocked LLM > dispatches explicit runner-policy wakes and Horton writes mocked responses

Flake rate in main: 100.00% (Passed 0 times, Failed 8 times)

Stack Traces | 0.0775s run time

AssertionError: expected 500 to be 204 // Object.is equality

- Expected
+ Received

- 204
+ 500

 ❯ test/horton-pull-wake-e2e.test.ts:183:28


kevin-dp and others added 2 commits May 18, 2026 10:04

kevin-dp wants to merge 2 commits into `fix-pull-wake` from `fix-claim-release-after-dispatch`
