fix(agents-server): release pull-wake claim row even when in-memory token is missing #4346
Open
kevin-dp wants to merge 2 commits into
…ver dev fallback The desktop's default owner_principal was `system:local-desktop`, but the agents-server falls back to `system:dev-local` when no auth header is present. The principal mismatch caused runner registration to fail with 403 UNAUTHORIZED for unauthenticated local development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oken is missing
The release path in callback-forward was gated by `stillOwnsClaim`, an
in-memory check that fails after server restart or when a newer wake on
the same stream evicts the token. When that happened, the consumer_claims
row stayed at status=active indefinitely and the entity remained stuck at
status=running long after `done` arrived.
Decouple the concerns:
- materializeReleasedClaim runs whenever epoch is defined (DB identity is
sufficient to release the row). Now returns `{ claim, entityCleared }`
where entityCleared is true iff our (consumerId, epoch) was the active
dispatch and we just cleared it.
- updateStatus(idle) and onEntityChanged fire when `entityCleared ||
stillOwnsClaim` — covers happy path, server restart, retry-after-failed-
updateStatus, while still leaving status=running when a newer wake holds
the entity's active dispatch.
- clearStream remains gated by stillOwnsClaim so we never clear another
consumer's token from under it.
Regression tests in test/webhook-forward-routing.test.ts cover the three
failure modes (lost token, evicted token, retry).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
❌ 1 test failed (reported as 1 ❄️ flaky test).
Summary
Fixes #4340 — pull-wake claims leaking in `consumer_claims` after dispatch, regardless of whether the agent run succeeded or failed.
The release path in `callback-forward` was gated by `stillOwnsClaim` — an in-memory check that fails after server restart or when a newer wake on the same stream evicts the token. When that happened, the DB row stayed at `status='active'` indefinitely and the entity remained stuck at `status='running'` long after `done` arrived. Live testing observed this consistently: `active_count` would grow by 1 per dispatched wake and never decrement, even though the agent visibly completed in the UI.
Root cause
`packages/agents-server/src/routing/internal-router.ts` had all three release actions behind the same in-memory gate:

```ts
if (entity && stillOwnsClaim) {
await materializeReleasedClaim(...) // DB row release
await updateStatus(entity.url, 'idle') // entity status
clearStream(...) // in-memory token cleanup
await onEntityChanged(entity.url)
} else if (stillOwnsClaim) {
clearStream(...)
} else if (entity) {
log.info('done ignored for stale claim ...')
}
```
Any path that lost the in-memory token (server restart, parallel wakes evicting each other, retries after a transient failure) skipped `materializeReleasedClaim` entirely, leaving the row in the table forever. The in-memory token is a write-authorization concern; it shouldn't gate the durable-row release, which is keyed by `(consumerId, epoch)` — sufficient and authoritative DB identity.
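To illustrate why the in-memory gate is fragile, here is a minimal sketch, assuming the token registry behaves like a per-process map holding one token per stream. `TokenRegistry`, `register`, and `stillOwns` are hypothetical names chosen for illustration, not the actual agents-server API.

```ts
// Hypothetical model of a per-process claim-token registry.
// This is NOT the real agents-server code; names and shapes are assumptions.
type ClaimToken = { consumerId: string; epoch: number };

class TokenRegistry {
  // Assumption: one token per stream, so a newer wake overwrites the old token.
  private tokens = new Map<string, ClaimToken>();

  register(stream: string, token: ClaimToken): void {
    this.tokens.set(stream, token);
  }

  // Plays the role of `stillOwnsClaim`: true only if this exact token is still in memory.
  stillOwns(stream: string, token: ClaimToken): boolean {
    const held = this.tokens.get(stream);
    return (
      held !== undefined &&
      held.consumerId === token.consumerId &&
      held.epoch === token.epoch
    );
  }
}

const first: ClaimToken = { consumerId: "c1", epoch: 1 };
const second: ClaimToken = { consumerId: "c1", epoch: 2 };

// Happy path: the token is still held when `done` arrives.
const registry = new TokenRegistry();
registry.register("stream-a", first);
const ownsBeforeEviction = registry.stillOwns("stream-a", first);

// Eviction: a newer wake on the same stream replaces the token.
registry.register("stream-a", second);
const ownsAfterEviction = registry.stillOwns("stream-a", first);

// Restart: a fresh process starts with an empty registry.
const afterRestart = new TokenRegistry();
const ownsAfterRestart = afterRestart.stillOwns("stream-a", first);
```

In both the eviction and restart cases the check returns `false` even though the durable row for `(consumerId, epoch)` may still be active, which is the situation the old gate turned into a leaked claim.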
The fix
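Three concerns, three different gates:

- `materializeReleasedClaim` (DB row release) runs whenever `epoch` is defined; the DB identity `(consumerId, epoch)` is sufficient to release the row.
- `updateStatus(entity.url, 'idle')` and `onEntityChanged` fire when `entityCleared || stillOwnsClaim`.
- `clearStream` (in-memory token cleanup) remains gated by `stillOwnsClaim`, so it never clears another consumer's token from under it.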
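The decoupled gating can be sketched as a pure decision function. This is an illustrative model only: `planRelease` and its input and output field names are hypothetical, and the sketch ignores the entity-existence check that the real handler also performs.

```ts
// Hypothetical decision model for the `done` callback; not the actual router code.
type ReleaseInputs = {
  epochDefined: boolean; // do we have a DB identity (consumerId, epoch)?
  entityClearedIfReleased: boolean; // would the row release clear the entity's active dispatch?
  stillOwnsClaim: boolean; // is our token still held in memory?
};

type ReleasePlan = {
  releaseRow: boolean; // call materializeReleasedClaim
  markIdle: boolean; // call updateStatus(idle) and onEntityChanged
  clearToken: boolean; // call clearStream
};

function planRelease(input: ReleaseInputs): ReleasePlan {
  // Gate 1: DB identity alone is enough to release the durable row.
  const releaseRow = input.epochDefined;
  // entityCleared can only be true if the release actually ran.
  const entityCleared = releaseRow && input.entityClearedIfReleased;
  // Gate 2: mark idle if we cleared the active dispatch or still own the token.
  const markIdle = entityCleared || input.stillOwnsClaim;
  // Gate 3: only touch the in-memory token if it is ours.
  const clearToken = input.stillOwnsClaim;
  return { releaseRow, markIdle, clearToken };
}

// Lost token after a server restart: row released, entity goes idle, no token to clear.
const lostToken = planRelease({
  epochDefined: true,
  entityClearedIfReleased: true,
  stillOwnsClaim: false,
});

// Stale done while a newer wake holds the active dispatch: row released, entity stays running.
const staleDone = planRelease({
  epochDefined: true,
  entityClearedIfReleased: false,
  stillOwnsClaim: false,
});
```

Keeping the three decisions separate makes it easy to see that a stale `done` still releases its own row without disturbing a newer claim.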
`materializeReleasedClaim` API change
```diff
```
There is only one production caller (`internal-router.ts`); both it and the test mock are updated. The `.returning()` on the `entityDispatchState` UPDATE now reports whether our row was actually cleared (vs. a no-op because a newer claim has taken over).
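As a rough illustration of the new return shape, the sketch below models the release against in-memory data instead of the database. Everything here is an assumption made for the example: `releaseClaim`, the `ClaimRow` and `DispatchState` shapes, and the `'released'` status value are hypothetical, and the real function uses an UPDATE with `.returning()` rather than array lookups.

```ts
// Hypothetical in-memory model of the { claim, entityCleared } contract.
// Not the real materializeReleasedClaim; shapes and status names are assumed.
type ClaimRow = { consumerId: string; epoch: number; status: "active" | "released" };
type DispatchState = { activeConsumerId: string | null; activeEpoch: number | null };
type ReleaseResult = { claim: ClaimRow | undefined; entityCleared: boolean };

function releaseClaim(
  claims: ClaimRow[],
  dispatch: DispatchState,
  consumerId: string,
  epoch: number,
): ReleaseResult {
  // Release the durable row identified by (consumerId, epoch), if it exists.
  const claim = claims.find((c) => c.consumerId === consumerId && c.epoch === epoch);
  if (claim) claim.status = "released";

  // Clear the entity's active dispatch only if it still points at this claim.
  const isActiveDispatch =
    dispatch.activeConsumerId === consumerId && dispatch.activeEpoch === epoch;
  if (isActiveDispatch) {
    dispatch.activeConsumerId = null;
    dispatch.activeEpoch = null;
  }
  return { claim, entityCleared: isActiveDispatch };
}

// A newer claim (epoch 2) holds the active dispatch; the stale one (epoch 1) completes.
const claims: ClaimRow[] = [
  { consumerId: "c1", epoch: 1, status: "active" },
  { consumerId: "c1", epoch: 2, status: "active" },
];
const dispatch: DispatchState = { activeConsumerId: "c1", activeEpoch: 2 };

const staleResult = releaseClaim(claims, dispatch, "c1", 1);
const currentResult = releaseClaim(claims, dispatch, "c1", 2);
```

In this model the stale release marks its own row released but reports `entityCleared: false`, while the current release reports `entityCleared: true`, which is the signal the caller uses to decide whether to mark the entity idle.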
Test cases
Five scenarios, all behaving correctly with the new gating:
New tests live in `packages/agents-server/test/webhook-forward-routing.test.ts` under `claim release on done callback (regression for #4340)`. Existing tests (`server-claim-write-token.test.ts > stale done does not mark a newer active claim idle`, `> done retries still transition to idle when updateStatus fails on first attempt`) continue to pass.
Verified manually
Tested in the desktop app against a local agents-server:
The pre-existing orphan from before the fix remains stuck — that needs a separate reaper job or manual cleanup (out of scope here).
Not addressed in this PR
Base branch note
This PR targets `fix-pull-wake` (#4339), not `main`, because `materializeReleasedClaim` was introduced in #4308 which is part of the `fix-pull-wake` lineage but not yet in `main`. Merge order: this → fix-pull-wake → main.
🤖 Generated with Claude Code