fix(agents-server): release pull-wake claim row even when in-memory token is missing #4346

Open
kevin-dp wants to merge 2 commits into fix-pull-wake from fix-claim-release-after-dispatch

@kevin-dp
Contributor

Summary

Fixes #4340 — pull-wake claims leaking in `consumer_claims` after dispatch, regardless of whether the agent run succeeded or failed.

The release path in `callback-forward` was gated by `stillOwnsClaim` — an in-memory check that fails after server restart or when a newer wake on the same stream evicts the token. When that happened, the DB row stayed at `status='active'` indefinitely and the entity remained stuck at `status='running'` long after `done` arrived. Live testing observed this consistently: `active_count` would grow by 1 per dispatched wake and never decrement, even though the agent visibly completed in the UI.

Root cause

`packages/agents-server/src/routing/internal-router.ts` had all three release actions behind the same in-memory gate:

```ts
if (entity && stillOwnsClaim) {
  await materializeReleasedClaim(...) // DB row release
  await updateStatus(entity.url, 'idle') // entity status
  clearStream(...) // in-memory token cleanup
  await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
  clearStream(...)
} else if (entity) {
  log.info('done ignored for stale claim ...')
}
```

Any path that lost the in-memory token (server restart, parallel wakes evicting each other, retries after a transient failure) skipped `materializeReleasedClaim` entirely, leaving the row in the table forever. The in-memory token is a write-authorization concern; it shouldn't gate the durable-row release, which is keyed by `(consumerId, epoch)` — sufficient and authoritative DB identity.
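To make the failure mode concrete, here is a minimal sketch of how an in-memory token can be lost. `TokenRegistry`, `ClaimToken`, and the stream name are hypothetical and not taken from the agents-server code; the sketch also assumes, for illustration only, that both wakes share a `consumerId` and differ by `epoch`.

```typescript
// Hypothetical in-memory token registry keyed by stream (illustrative only).
type ClaimToken = { consumerId: string; epoch: number };

class TokenRegistry {
  private byStream = new Map<string, ClaimToken>();

  // A new wake on the same stream overwrites (evicts) the previous token.
  mint(stream: string, token: ClaimToken): void {
    this.byStream.set(stream, token);
  }

  // Models a `stillOwnsClaim`-style check: true only if the stored token matches.
  stillOwnsClaim(stream: string, token: ClaimToken): boolean {
    const current = this.byStream.get(stream);
    return (
      current !== undefined &&
      current.consumerId === token.consumerId &&
      current.epoch === token.epoch
    );
  }
}

const wake1: ClaimToken = { consumerId: "c1", epoch: 1 };
const wake2: ClaimToken = { consumerId: "c1", epoch: 2 };

const registry = new TokenRegistry();
registry.mint("stream-a", wake1);
const ownsBeforeEviction = registry.stillOwnsClaim("stream-a", wake1);

// A newer wake on the same stream evicts wake-1's token.
registry.mint("stream-a", wake2);
const ownsAfterEviction = registry.stillOwnsClaim("stream-a", wake1);

// A server restart is modelled as a fresh, empty registry: every token is gone,
// even though the corresponding DB rows would still be active.
const afterRestart = new TokenRegistry();
const ownsAfterRestart = afterRestart.stillOwnsClaim("stream-a", wake2);
```

In both the eviction and restart cases the check returns `false`, which under the old gating meant the durable row was never released.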

The fix

Three concerns, three different gates:

  1. DB row release (`materializeReleasedClaim`): runs whenever `epoch` is defined. `(consumerId, epoch)` is the DB primary key; it's enough to identify and release the row safely.
  2. Entity status → idle + `onEntityChanged`: runs when `entityCleared || stillOwnsClaim`. `entityCleared` is a new return field from `materializeReleasedClaim` indicating whether our `(consumerId, epoch)` was the active dispatch (and we just cleared it). The OR covers the retry after a failed `updateStatus`: the token is still owned, but the dispatch state was already cleared by the first attempt, so `entityCleared` alone would be false. When both are false (a newer wake holds the entity's active dispatch, or the legacy stale-done test setup where nothing was materialized in the DB and the token was evicted), the status is left alone. See Test cases below; the combined gate handles all five scenarios.
  3. In-memory token cleanup (`clearStream`): remains gated by `stillOwnsClaim` so we never clear a newer consumer's token from under it.
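The three gates above can be sketched as a pure decision function. This is not the code in `internal-router.ts`; `planRelease`, `ReleaseInputs`, and `entityClearedIfReleased` are hypothetical names, and the sketch assumes the entity exists and that `entityCleared` is only known once the row release has run.

```typescript
// Hypothetical inputs to the release decision (names are illustrative).
interface ReleaseInputs {
  epoch: number | undefined;
  stillOwnsClaim: boolean;
  // Stand-in for the `entityCleared` value the DB release would report.
  entityClearedIfReleased: boolean;
}

interface ReleasePlan {
  releaseRow: boolean; // call materializeReleasedClaim
  markIdle: boolean; // updateStatus(idle) + onEntityChanged
  clearToken: boolean; // clearStream
}

function planRelease(inputs: ReleaseInputs): ReleasePlan {
  // Gate 1: DB identity (consumerId, epoch) is enough to release the row.
  const releaseRow = inputs.epoch !== undefined;
  // entityCleared can only be true if the release actually ran.
  const entityCleared = releaseRow && inputs.entityClearedIfReleased;
  return {
    releaseRow,
    // Gate 2: idle when we cleared the active dispatch OR still own the token.
    markIdle: entityCleared || inputs.stillOwnsClaim,
    // Gate 3: only the current token owner may clear the in-memory token.
    clearToken: inputs.stillOwnsClaim,
  };
}

// Server restart: token lost, but our row was the active dispatch.
const restartPlan = planRelease({ epoch: 7, stillOwnsClaim: false, entityClearedIfReleased: true });
// Newer wake took over: release our row, leave status and token alone.
const newerWakePlan = planRelease({ epoch: 7, stillOwnsClaim: false, entityClearedIfReleased: false });
// Retry after a failed updateStatus: state already cleared, token still owned.
const retryPlan = planRelease({ epoch: 7, stillOwnsClaim: true, entityClearedIfReleased: false });
```

Keeping the decision separate from the side effects makes each gate checkable on its own, which is the same separation of concerns the fix is after.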

`materializeReleasedClaim` API change

```diff
- ): Promise<ConsumerClaim | null> {
+ ): Promise<{ claim: ConsumerClaim | null; entityCleared: boolean }> {
```

There is only one production caller (`internal-router.ts`); both the production caller and the test mock are updated. The `.returning()` on the `entityDispatchState` UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim has taken over).
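As a rough illustration of what `entityCleared` means, the sketch below models the two tables as plain in-memory objects and performs a conditional clear. The shapes `ClaimRow` and `DispatchState`, their field names, and `releaseClaimSketch` are assumptions for illustration; the real function runs SQL and its schema is not shown in this PR description.

```typescript
// Hypothetical row shapes; the real schema may differ.
type ClaimRow = { consumerId: string; epoch: number; status: "active" | "released" };
type DispatchState = { activeConsumerId: string | null; activeEpoch: number | null };

function releaseClaimSketch(
  claims: ClaimRow[],
  dispatch: DispatchState,
  consumerId: string,
  epoch: number
): { claim: ClaimRow | null; entityCleared: boolean } {
  // Release the durable row, identified purely by (consumerId, epoch).
  const claim = claims.find((c) => c.consumerId === consumerId && c.epoch === epoch) ?? null;
  if (claim) claim.status = "released";

  // Conditional clear, analogous to an UPDATE whose WHERE clause matches only
  // our dispatch; an empty result set corresponds to entityCleared = false.
  const ours = dispatch.activeConsumerId === consumerId && dispatch.activeEpoch === epoch;
  if (ours) {
    dispatch.activeConsumerId = null;
    dispatch.activeEpoch = null;
  }
  return { claim, entityCleared: ours };
}

// wake-2 has taken over the entity's active dispatch; wake-1's done arrives late.
const claims: ClaimRow[] = [
  { consumerId: "c1", epoch: 1, status: "active" },
  { consumerId: "c1", epoch: 2, status: "active" },
];
const dispatch: DispatchState = { activeConsumerId: "c1", activeEpoch: 2 };

const staleRelease = releaseClaimSketch(claims, dispatch, "c1", 1);
const currentRelease = releaseClaimSketch(claims, dispatch, "c1", 2);
```

Here the late `done` for wake-1 releases its own row without touching wake-2's dispatch state, while the release for wake-2 clears it.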

Test cases

Five scenarios, all behaving correctly with the new gating:

| Scenario | `entityCleared` | `stillOwnsClaim` | DB row released? | Entity → idle? |
| --- | --- | --- | --- | --- |
| A. Happy path (mint + done) | true | true | ✅ | ✅ |
| B. Server restart (no in-memory token, but DB row active) | true | false | ✅ | ✅ |
| C. Newer wake (wake-1 done after wake-2 took over the stream) | false | false | ✅ (wake-1's row) | ❌ (wake-2 is in flight) |
| D. Retry (first done's `updateStatus` threw; same done retried) | false | true | ✅ (no-op) | ✅ |
| E. Legacy stale-done test (no `materializeActiveClaim` in test setup, token evicted) | false | false | n/a | ❌ |

New tests live in `packages/agents-server/test/webhook-forward-routing.test.ts` under `claim release on done callback (regression for #4340)`. Existing tests (`server-claim-write-token.test.ts > stale done does not mark a newer active claim idle`, `> done retries still transition to idle when updateStatus fails on first attempt`) continue to pass.

Verified manually

Tested in the desktop app against a local agents-server:

  1. Baseline: one pre-existing stuck claim (`active_count: 1`) from a prior orphan that this fix cannot retroactively clean up.
  2. Sent two messages (one in an existing session, one in a new session) → `active_count: 1 → 3`.
  3. Both agents finished → `active_count: 3 → 2 → 1` (each release fires once the runtime's `sendDone` reaches the server).

The pre-existing orphan from before the fix remains stuck — that needs a separate reaper job or manual cleanup (out of scope here).

Not addressed in this PR

  • Pre-existing orphan rows: rows that leaked under the old code can't be released because no fresh `done` callback is coming for them. Would need a reaper job or admin command to clean up.
  • `lease_expires_at: null` issue (Pull-wake: materialized claim has null lease_expires_at when upstream omits lease_ttl_ms #4341): independent issue. Without a lease, even a reaper job can't time out claims safely.
  • `sendDone` latency: in testing, `done` arrives ~minutes after the agent visibly completes. That's a runtime-side concern, not part of this fix.

Base branch note

This PR targets `fix-pull-wake` (#4339), not `main`, because `materializeReleasedClaim` was introduced in #4308 which is part of the `fix-pull-wake` lineage but not yet in `main`. Merge order: this → fix-pull-wake → main.

🤖 Generated with Claude Code

kevin-dp and others added 2 commits May 18, 2026 10:04
…ver dev fallback

The desktop's default owner_principal was `system:local-desktop`, but the
agents-server falls back to `system:dev-local` when no auth header is
present. The principal mismatch caused runner registration to fail with
403 UNAUTHORIZED for unauthenticated local development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oken is missing

The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.

Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
  sufficient to release the row). Now returns `{ claim, entityCleared }`
  where entityCleared is true iff our (consumerId, epoch) was the active
  dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
  stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
  updateStatus, while still leaving status=running when a newer wake holds
  the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
  consumer's token from under it.

Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 18, 2026

❌ 1 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 223 | 1 | 222 | 2 |
View the full list of 1 ❄️ flaky test(s)
test/horton-pull-wake-e2e.test.ts > pull-wake Horton e2e with mocked LLM > dispatches explicit runner-policy wakes and Horton writes mocked responses

Flake rate in main: 100.00% (Passed 0 times, Failed 8 times)

Stack Traces | 0.0775s run time
AssertionError: expected 500 to be 204 // Object.is equality

- Expected
+ Received

- 204
+ 500

 ❯ test/horton-pull-wake-e2e.test.ts:183:28

