Summary
When the agents-server is restarted while a connected runtime (agents-runtime) is still running, the runtime appears to "bug out" rather than transparently reconnecting. The runtime side has no retry/reconnect logic for its HTTP calls to the server, and several recovery code paths on the server side rely on Electric shape-stream cursors / handles that may not be valid post-restart.
This issue is filed from a code review pass — it documents specific failure modes in the reconnect path that are visible in the source today, rather than a single confirmed reproduction. Anyone hitting this in practice: please attach logs from both the runtime and the server during the restart window.
Symptom (reported)
After bouncing agents-server, the runtime that was already attached doesn't fully recover — wakes don't fire, entity updates appear stuck, or HTTP calls into the server start failing without recovering.
Code-level evidence
1. Runtime has no retry/reconnect on HTTP calls
packages/agents-runtime/src/runtime-server-client.ts:138-140 is the only request path; it's a plain fetch() with no retry, no backoff, and no idempotency-aware re-issue:
    const request = (path: string, init?: RequestInit): Promise<Response> => {
      return track(fetchImpl(`${config.baseUrl}${path}`, init))
    }
Any call in flight while the server is down (spawnEntity, sendEntityMessage, registerWake, upsertCronSchedule, etc.) fails with a network error and is surfaced to the caller. Type registration (packages/agents-runtime/src/create-handler.ts:355-472) is also only invoked once at runtime startup — there's no mechanism to re-register if the server later forgets us.
In practice this is mostly fine, because most server-side state is persisted (entity types, subscription webhooks, wake registrations, scheduled tasks all live in Postgres). But the runtime never finds out the server came back, so any in-flight request during the restart window is lost forever from the caller's perspective.
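A minimal sketch of what retry-with-backoff around the request path could look like. Everything here (retryIdempotent, RetryOptions, the delay values) is hypothetical and does not exist in runtime-server-client.ts today; it assumes the caller only wraps operations that are safe to re-issue. Note that fetch() rejects only on network-level failures, so this retries exactly the "server is down" case and leaves HTTP error statuses to the caller.

```typescript
// Hypothetical sketch: none of these names exist in agents-runtime today.
type RetryOptions = {
  maxAttempts: number
  baseDelayMs: number
  maxDelayMs: number
  // Injectable so tests can skip real waiting.
  sleep?: (ms: number) => Promise<void>
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms))

// Exponential backoff capped at maxDelayMs: base, 2*base, 4*base, ...
const backoffDelayMs = (attempt: number, opts: RetryOptions): number =>
  Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt))

// Retries `operation` whenever it rejects (e.g. a network error while the
// server is restarting). Only safe for idempotent calls; the caller decides
// which calls qualify.
async function retryIdempotent<T>(
  operation: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep
  let lastError: unknown
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await operation()
    } catch (err) {
      lastError = err
      // No point sleeping after the final attempt.
      if (attempt < opts.maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, opts))
      }
    }
  }
  throw lastError
}
```

Wrapping the existing request path would look roughly like retryIdempotent(() => request(path, init), { maxAttempts: 5, baseDelayMs: 200, maxDelayMs: 5000 }). Calls that may not be idempotent would need an idempotency key or an explicit opt-out rather than this wrapper.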
2. Entity bridge resumes from a possibly-stale shape handle
packages/agents-server/src/entity-bridge-manager.ts:153-160 (in start()):
    if (this.initialShapeHandle && this.initialShapeOffset) {
      const initialOffset = parseElectricOffset(this.initialShapeOffset)
      if (initialOffset) {
        this.startLiveStream(initialOffset, this.initialShapeHandle)
        return
      }
    }
    await this.resync(`startup`)
On restart we try to resume an Electric shape subscription using the last persisted shape_handle + shape_offset. If Electric has rotated/compacted the shape since then, the handle is invalid. The error path:
- createShapeStream onError at entity-bridge-manager.ts:302-311 only logs a warning and returns {}; it does not trigger a resync.
- must-refetch is handled correctly at entity-bridge-manager.ts:332-342 (clears cursor, rescans).
- The subscription-error callback at entity-bridge-manager.ts:373-388 does call requestResync(`subscription-error`).
So whether we recover depends on which of those paths Electric drives us down. A silent staleness in the live stream (no must-refetch, no subscription error, just no new messages) would leave the bridge as a zombie — in-memory but not receiving updates. Wake conditions on entity changes would then never fire.
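To make the "silent staleness" case concrete, here is a rough sketch of a liveness tick that would catch it. All names (BridgeProbe, checkBridgeLiveness, the "liveness-check" reason) are hypothetical, and it assumes there is some out-of-band way to read the shape's current head offset, which I have not verified. Offsets are only compared for equality here, not ordered.

```typescript
// Hypothetical sketch: these names are illustrative, not the
// entity-bridge-manager API.
type BridgeProbe = {
  // Offset of the last message the live stream delivered to the bridge.
  lastSeenOffset: () => string
  // Current head offset for the same shape (assumed to be obtainable).
  fetchHeadOffset: () => Promise<string>
  // The same recovery hook the explicit error paths already use.
  requestResync: (reason: string) => void
}

type StalenessState = {
  offsetAtLastTick: string | null
  stalledSinceMs: number | null
}

// One liveness tick. The bridge counts as stale only if it is behind the head
// AND its own offset has not moved for at least graceMs, so a stream that is
// merely catching up is left alone.
async function checkBridgeLiveness(
  probe: BridgeProbe,
  state: StalenessState,
  nowMs: number,
  graceMs: number
): Promise<"ok" | "stalled" | "resynced"> {
  const seen = probe.lastSeenOffset()
  const head = await probe.fetchHeadOffset()
  const progressed = seen !== state.offsetAtLastTick
  state.offsetAtLastTick = seen

  if (seen === head || progressed) {
    state.stalledSinceMs = null
    return "ok"
  }
  if (state.stalledSinceMs === null) {
    state.stalledSinceMs = nowMs
  }
  if (nowMs - state.stalledSinceMs < graceMs) {
    return "stalled"
  }
  // Behind the head with no progress for the whole grace period: zombie.
  state.stalledSinceMs = null
  probe.requestResync("liveness-check")
  return "resynced"
}
```

The grace period is there because a bridge that is behind but still advancing is healthy; only "behind and not moving" is treated as the zombie state described above.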
3. Wake registry recovery is best-effort
packages/agents-server/src/wake-registry.ts:238-278 (recoverSync) handles shape-stream errors by stopping, reloading registrations from Postgres, and re-subscribing. This is solid for the catastrophic error case, but the same caveat as above applies: a silent live-stream staleness would not be caught.
4. Wake delivery during the restart window
Wake delivery is split across:
- Wakes are appended to the subscriber's durable stream (electric-agents-manager.ts:946-952): durable, OK.
- Subscribers are notified via subscription_webhooks lookup at /_electric/webhook-forward/<id> (server.ts:1478-1520).
If durable-streams tries to deliver a webhook while agents-server is restarting, success depends on durable-streams' retry behaviour. Worth confirming this doesn't drop events.
5. What does work
For balance — these are all explicitly covered:
- Scheduled / delayed sends survive restart: packages/agents-server/test/scheduler-integration.test.ts:117 (delayed_send survives server restart and lands exactly once).
- Re-registering an entity type after restart updates it: scheduler-integration.test.ts:158.
- Entity bridge handles must-refetch from Electric: entity-bridge-manager.test.ts:362.
- Wake registrations and webhook subscriptions are persisted in Postgres and rebuilt on startup (server.ts:405-409).
There is no test exercising the runtime side of restart — i.e. a runtime that's already attached when agents-server bounces.
Suggested repro
packages/agents-server/docker-compose.dev.yml provides Postgres + Electric. To exercise the runtime path:
- Start the dev stack (Postgres + Electric + durable-streams + agents-server).
- Start a runtime (e.g. via packages/agents) pointing at the agents-server. Register an entity type that has a wake on entity changes from another tag-filtered entity.
- Trigger something from the runtime (spawn entity, register wake) and confirm it works.
- Restart only the agents-server process (leave Electric, Postgres, durable-streams, and the runtime running).
- From the runtime, retry the same operations, and additionally trigger entity changes that should fire wakes. Observe:
  - Does any HTTP call from the runtime fail without recovering?
  - Do wakes still fire after a few seconds?
  - Do entity bridge updates still propagate?
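To answer the "fail without recovering" question with something better than eyeballing logs, a small probe loop could record outcomes across the restart. This is a hypothetical helper, not part of any package; the operation passed in would be whatever runtime call is being exercised (spawn entity, register wake, ...).

```typescript
// Hypothetical repro helper: not part of agents-runtime or agents-server.
type ProbeOutcome = { atMs: number; ok: boolean }
type RestartVerdict = "never-failed" | "recovered" | "still-failing"

// Runs `operation` `attempts` times, intervalMs apart, and records whether
// each call resolved or rejected.
async function probeAcrossRestart(
  operation: () => Promise<unknown>,
  attempts: number,
  intervalMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<ProbeOutcome[]> {
  const outcomes: ProbeOutcome[] = []
  for (let i = 0; i < attempts; i++) {
    const atMs = Date.now()
    try {
      await operation()
      outcomes.push({ atMs, ok: true })
    } catch (err) {
      outcomes.push({ atMs, ok: false })
    }
    if (i < attempts - 1) await sleep(intervalMs)
  }
  return outcomes
}

// Classifies the timeline: did calls ever fail, and did they come back?
function classifyTimeline(outcomes: ProbeOutcome[]): RestartVerdict {
  if (outcomes.every((o) => o.ok)) return "never-failed"
  const last = outcomes[outcomes.length - 1]
  return last !== undefined && last.ok ? "recovered" : "still-failing"
}

// Time from the first failure to the first success after it, or null if the
// calls never failed or never recovered.
function recoveryTimeMs(outcomes: ProbeOutcome[]): number | null {
  const firstFail = outcomes.findIndex((o) => !o.ok)
  if (firstFail === -1) return null
  const firstOkAfter = outcomes.slice(firstFail).find((o) => o.ok)
  return firstOkAfter ? firstOkAfter.atMs - outcomes[firstFail].atMs : null
}
```

A "still-failing" verdict well after the server is back up would be the confirmed reproduction this issue is currently missing.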
Ideas for the fix
- Add retry-with-backoff to runtime-server-client.ts for idempotent operations (and handle non-idempotent ones explicitly).
- Add a periodic health-probe / heartbeat from runtime → server so the runtime can re-register types if the server returns a "you don't know me" response.
- For entity bridges, add a periodic liveness check (e.g. compare last-seen offset to Electric's current head) to catch silent staleness, not just explicit errors.
- Add an integration test mirroring delayed_send survives server restart but covering a live runtime: keep the runtime up across the restart and assert wakes still fire and HTTP calls succeed.
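The heartbeat idea from the list above could be sketched roughly as follows. This is an assumption-heavy illustration: the probe, the "unknown-runtime" signal, and the function names are all hypothetical, since no health endpoint or "you don't know me" response exists in the code today as far as I can tell.

```typescript
// Hypothetical sketch: the probe and the "unknown-runtime" signal are
// assumptions, not existing agents-server behaviour.
type HeartbeatResult = "known" | "unknown-runtime" | "unreachable"

type HeartbeatDeps = {
  // Probes the server; assumed to map a "you don't know me" response to
  // "unknown-runtime". May also reject on network errors.
  probeServer: () => Promise<HeartbeatResult>
  // Re-runs the type registration that today only happens once at startup.
  registerTypes: () => Promise<void>
}

// One heartbeat tick. Returns what happened so the caller can log it.
async function heartbeatTick(
  deps: HeartbeatDeps
): Promise<"healthy" | "re-registered" | "server-down"> {
  let result: HeartbeatResult
  try {
    result = await deps.probeServer()
  } catch (err) {
    // A rejected probe is treated the same as an unreachable server.
    result = "unreachable"
  }
  if (result === "unreachable") return "server-down"
  if (result === "unknown-runtime") {
    await deps.registerTypes()
    return "re-registered"
  }
  return "healthy"
}

// Runs heartbeatTick on an interval; returns a function that stops it.
function startHeartbeat(deps: HeartbeatDeps, intervalMs: number): () => void {
  const timer = setInterval(() => {
    heartbeatTick(deps).catch(() => {
      // Registration failed; the next tick will try again.
    })
  }, intervalMs)
  return () => clearInterval(timer)
}
```

Keeping the tick separate from the timer makes the re-registration logic testable without waiting on real intervals.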
Notes
This issue is opened from static analysis — symptom matches but I have not bisected to a concrete failure log. Anyone with logs from a live recurrence: please attach.