Commit 8de0ce3

Enable alarm-backed APIs in sub-agents (#1418)
* Add sub-agent alarm recovery support Made-with: Cursor
* Tighten facet cleanup bookkeeping Made-with: Cursor
* Hide internal schedule storage fields Made-with: Cursor
* Stabilize destroy cleanup schema test Made-with: Cursor
1 parent cdbef87 commit 8de0ce3

42 files changed

Lines changed: 3596 additions & 539 deletions

.changeset/facet-schedules.md

Lines changed: 7 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,7 @@
1+
---
2+
"agents": patch
3+
---
4+
5+
Allow sub-agents to use alarm-backed APIs by delegating the physical Durable Object alarm to the top-level parent while executing logical work inside the owning sub-agent. This enables `schedule()`, `scheduleEvery()`, `cancelSchedule()`, `getScheduleById()`, `listSchedules()`, `keepAlive()`, `keepAliveWhile()`, `runFiber()`, and Think chat recovery inside sub-agents.
6+
7+
Sub-agent schedules are scoped to the calling child, so sibling sub-agents cannot cancel each other's schedules by id. The deprecated synchronous `getSchedule()` and `getSchedules()` APIs now throw inside sub-agents; use the async alternatives instead. Destroying a sub-agent now delegates cleanup through the parent so parent-owned schedules and descendant fiber recovery leases are removed consistently.
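
The ownership rule described in this changeset can be illustrated with a small in-memory model. This is a minimal sketch under stated assumptions, not the SDK's storage code: `RootScheduleTable`, the `ownerPath` string format (`"Worker/a"`), and the method names are all hypothetical and exist only to show how an owner path keeps siblings from cancelling each other's schedules and how destroying a sub-agent can clear a whole descendant tree.

```typescript
// Hypothetical model of a root-side schedule table; not the agents SDK source.
interface ScheduleRow {
  id: string;
  ownerPath: string; // "" = top-level agent, "Worker/a" = a child (format assumed)
  callback: string;
  time: number; // due time in epoch milliseconds
}

class RootScheduleTable {
  private rows = new Map<string, ScheduleRow>();
  private nextId = 1;

  // Every row records which agent in the facet tree owns it.
  add(ownerPath: string, callback: string, time: number): string {
    const id = `s${this.nextId++}`;
    this.rows.set(id, { id, ownerPath, callback, time });
    return id;
  }

  // A caller only sees rows it owns.
  list(callerPath: string): ScheduleRow[] {
    return Array.from(this.rows.values()).filter(
      (row) => row.ownerPath === callerPath
    );
  }

  // Cancelling by id succeeds only for the owner, so neither a sibling
  // nor the parent can cancel a child's schedule by id.
  cancel(callerPath: string, id: string): boolean {
    const row = this.rows.get(id);
    if (!row || row.ownerPath !== callerPath) return false;
    return this.rows.delete(id);
  }

  // Bulk removal for a destroyed sub-agent and all of its descendants.
  cancelTree(prefix: string): number {
    let removed = 0;
    this.rows.forEach((row, id) => {
      if (row.ownerPath === prefix || row.ownerPath.startsWith(prefix + "/")) {
        this.rows.delete(id);
        removed++;
      }
    });
    return removed;
  }
}
```

In this model, `cancel("Worker/b", id)` returns `false` for a row owned by `"Worker/a"`, while `cancelTree("Worker/a")` removes that row together with any rows owned by its descendants.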

design/rfc-helper-sub-agent-orchestration.md

Lines changed: 31 additions & 11 deletions
@@ -128,6 +128,13 @@ The framework-provided runner owns the protocol bridge:
128128
- read stored chunks, final text, and stream errors for replay/result synthesis
129129
- prevent concurrent framework-driven turns on one agent tool instance
130130

131+
Sub-agent scheduling is now available through the normal `Agent` scheduling
132+
APIs. Facets still do not own independent physical alarm slots, but the
133+
top-level parent stores child-owned schedule rows with an owner path and routes
134+
callbacks back into the owning child when the alarm fires. This matters for
135+
agent-tool recovery: a child can schedule recovered continuations from inside
136+
the facet, and that callback runs with the child as `this`.
137+
131138
There should not be a separate public base class for agent tools unless the
132139
implementation later proves it needs one. The shared base class extracted in
133140
`examples/agents-as-tools` is prototype structure, not proposed public API.
@@ -655,11 +662,19 @@ call as interrupted. Imperative `runAgentTool(...)` calls have a cleaner
655662
recovery story because application code can inspect the run later without
656663
reconstructing an in-flight LLM turn.
657664

658-
Today, facets still have lifecycle limits: no independent alarms, and
659-
`keepAlive()` is a soft no-op inside facets. The first implementation should
660-
therefore guarantee durable replay and honest terminal/interrupted state, while
661-
designing the run metadata and observer API so `detached` and full live reattach
662-
can be added when facet recovery improves.
665+
Sub-agent schedules and delegated keepAlive remove the earlier blockers for
666+
child-side recovery. A facet still does not own an independent physical alarm,
667+
but a child can schedule logical callbacks through the top-level parent's alarm,
668+
hold a root-owned heartbeat ref while work is active, and register facet fibers
669+
in a small root-side index. Think chat recovery and `runFiber()` therefore work
670+
for long-lived agent-tool facets even when the child is otherwise idle: the root
671+
alarm routes recovery checks back into the child that owns the fiber row.
672+
673+
The remaining V1 limitation is not "facets cannot recover"; it is "the parent
674+
observer may be gone." V1 should still guarantee durable replay and honest
675+
terminal/interrupted state. `detached` and full live reattach are observer
676+
features that can be added once `tailAgentToolRun` / live-tail support exists,
677+
without changing the public `runAgentTool` / `agentTool` surface.
663678

664679
### Imperative API: `runAgentTool`
665680

@@ -951,10 +966,12 @@ leave an orphaned child transcript that is no longer reachable through replay or
951966
drill-in. If a future API allows registry-only deletion, it must make that
952967
orphaning behavior explicit.
953968

954-
V1 should defer automatic TTL, count-based GC, background alarms, and
955-
`retain: false` on `runAgentTool`. Those are useful policy knobs, but the first
956-
implementation only needs the reliable primitive that applications can call from
957-
their own lifecycle code.
969+
V1 should defer automatic TTL, count-based GC, and `retain: false` on
970+
`runAgentTool`. Those are useful policy knobs, but the first implementation only
971+
needs the reliable primitive that applications can call from their own lifecycle
972+
code. Sub-agent scheduling means the framework is no longer blocked on a
973+
mechanism for future background GC; the remaining question is what retention
974+
policy to choose.
958975

959976
### Think and AIChatAgent
960977

@@ -1083,7 +1100,10 @@ retaining the child facet and parent registry row. Cleanup should be explicit.
10831100
Deferred. Time-based and count-based cleanup are useful, but they are policy
10841101
decisions and may depend on account, workspace, or chat-history lifecycle.
10851102
Shipping `clearAgentToolRuns(...)` first gives applications a reliable primitive
1086-
without committing to scheduler behavior or default retention windows.
1103+
without committing to default retention windows. Sub-agent scheduling means a
1104+
future TTL/GC implementation can run from either the parent or a retained
1105+
agent-tool facet, but the retention policy should still be explicit rather than
1106+
hidden in V1 defaults.
10871107

10881108
### Ship only protocol types, no client hook
10891109

@@ -1151,7 +1171,7 @@ The implementation does not need to land all at once. A reasonable order is:
11511171
docs and examples.
11521172
4. Ergonomics: `defineAgentTool(...)` or class-level reusable contracts;
11531173
richer error shape; structured-output convenience.
1154-
5. Live-tail reattach when facet recovery improves; `detached` observer state;
1174+
5. Live-tail reattach via `tailAgentToolRun`; `detached` observer state;
11551175
tracing/cost integrations.
11561176

11571177
After Phase 1 lands, `examples/agents-as-tools` should be rewritten on top of

design/sub-agent-routing.md

Lines changed: 26 additions & 5 deletions
@@ -17,7 +17,11 @@ They are implemented on top of workerd facets (`ctx.facets`) and have:
1717
- their own WebSocket clients (once addressed through `/sub/...`)
1818
- colocation with the parent on the same machine
1919

20-
They do **not** have independent alarms today — `schedule()` is unsupported on facets, and `keepAlive()` is a soft no-op.
20+
They do **not** have independent alarm slots today. Sub-agent `schedule()` and
21+
`scheduleEvery()` calls are logical child schedules stored in the top-level
22+
parent's scheduler table with an owner path. When the parent alarm fires, the
23+
SDK routes the due callback back through the facet tree and executes it inside
24+
the owning sub-agent.
2125

2226
## Addressing
2327

@@ -115,9 +119,23 @@ RPC if you need parent-side side effects.
115119

116120
## Lifecycle caveats
117121

118-
- `schedule()` / `scheduleEvery()` / `cancelSchedule()` are unsupported on facets.
119-
- `keepAlive()` is a soft no-op on facets.
120-
- `deleteSubAgent()` is idempotent.
122+
- `schedule()` / `scheduleEvery()` / `cancelSchedule()` work on facets, but the
123+
top-level parent owns the physical alarm.
124+
- `getScheduleById()` / `listSchedules()` work on facets by delegating to the
125+
top-level parent.
126+
- `getSchedule()` / `getSchedules()` are deprecated synchronous storage reads
127+
and throw on facets.
128+
- `keepAlive()` and `keepAliveWhile()` work on facets by delegating their
129+
heartbeat ref to the top-level parent. Facets still do not get an independent
130+
physical alarm slot.
131+
- `runFiber()` works on facets. Fiber rows and snapshots live in the child
132+
SQLite database, while the root parent keeps a small index of active facet
133+
fibers so alarm housekeeping can route recovery checks back into idle
134+
children.
135+
- Think chat recovery works on facets; recovered continuations can schedule from
136+
the child and are routed through the top-level parent's alarm.
137+
- `deleteSubAgent()` is idempotent and removes pending schedules for that
138+
descendant tree before deleting the facet.
121139
- Class names whose kebab-case equals `"sub"` are rejected (e.g. `Sub`, `SUB`,
122140
`Sub_`) because they collide with the `/sub/` URL separator.
123141

@@ -126,6 +144,9 @@ RPC if you need parent-side side effects.
126144
- **Good:** direct child connections, low-latency parent↔child RPC, clean
127145
parent/index + child/leaf app structure.
128146
- **Good:** parent-owned registry gives us strict gating and enumeration for free.
129-
- **Tradeoff:** no independent alarms on facets yet.
147+
- **Good:** sub-agent code can use the normal scheduling API even though the
148+
parent owns the runtime alarm.
149+
- **Tradeoff:** no independent physical alarms on facets yet; the root parent
150+
multiplexes schedules for the whole facet tree.
130151
- **Tradeoff:** `parentAgent(Cls)` only does the one-hop case; deeper ancestor
131152
lookup stays explicit.
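
The multiplexing described in this document (one physical alarm on the root, callbacks routed back through the facet tree) can be sketched as follows. This is an illustrative model only: `SketchAgent`, `RootAlarm`, `dispatch`, and the array form of the owner path are assumptions made for the example, and real facets are reached through the SDK rather than an in-memory map.

```typescript
// Hypothetical sketch; class and method names are illustrative, not SDK API.
type Callback = (this: SketchAgent, payload: unknown) => void;

class SketchAgent {
  readonly children = new Map<string, SketchAgent>();
  readonly log: string[] = [];

  constructor(
    readonly name: string,
    private callbacks: Record<string, Callback> = {}
  ) {}

  child(name: string, callbacks: Record<string, Callback> = {}): SketchAgent {
    const created = new SketchAgent(name, callbacks);
    this.children.set(name, created);
    return created;
  }

  // Walk one hop per path segment; at the owner, run the callback
  // with the owning agent as `this`.
  dispatch(path: string[], callback: string, payload: unknown): void {
    const [head, ...rest] = path;
    if (head === undefined) {
      const fn = this.callbacks[callback];
      if (!fn) throw new Error(`no callback "${callback}" on ${this.name}`);
      fn.call(this, payload);
      return;
    }
    const next = this.children.get(head);
    if (!next) throw new Error(`unknown child "${head}" under ${this.name}`);
    next.dispatch(rest, callback, payload);
  }
}

interface DueRow {
  ownerPath: string[]; // [] = the root itself
  callback: string;
  payload: unknown;
  time: number; // due time in epoch milliseconds
}

class RootAlarm {
  private rows: DueRow[] = [];

  constructor(private root: SketchAgent) {}

  schedule(row: DueRow): void {
    this.rows.push(row);
  }

  // The single physical alarm is set to the earliest due time in the tree.
  nextAlarmTime(): number | undefined {
    if (this.rows.length === 0) return undefined;
    return Math.min(...this.rows.map((row) => row.time));
  }

  // Alarm handler: run every due row inside its owner, keep the rest.
  fire(now: number): void {
    const due = this.rows
      .filter((row) => row.time <= now)
      .sort((a, b) => a.time - b.time);
    this.rows = this.rows.filter((row) => row.time > now);
    for (const row of due) {
      this.root.dispatch(row.ownerPath, row.callback, row.payload);
    }
  }
}
```

The design point the sketch tries to capture is that the root only stores rows and picks the next wake-up time; the callback body always executes inside the agent named by the owner path.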

docs/agent-class.md

Lines changed: 2 additions & 2 deletions
@@ -293,7 +293,7 @@ Tasks are stored in the `cf_agents_queues` SQL table and are automatically flush
293293

294294
### `this.schedule` and friends
295295

296-
Agents support scheduled execution of methods by wrapping the Durable Object's `alarm()`. The available methods are `this.schedule`, `this.getSchedule`, `this.getSchedules`, `this.cancelSchedule`. Schedules can be one-time, delayed, or recurring (using cron expressions).
296+
Agents support scheduled execution of methods by wrapping the Durable Object's `alarm()`. The available methods are `this.schedule`, `this.getScheduleById`, `this.listSchedules`, `this.cancelSchedule`, and the deprecated synchronous `this.getSchedule` / `this.getSchedules`. Schedules can be one-time, delayed, or recurring (using cron expressions).
297297

298298
Since DOs only allow one alarm at a time, the `Agent` class works around this by managing multiple schedules in SQL and using a single alarm.
299299

@@ -450,7 +450,7 @@ class MyAgent extends Agent {
450450

451451
### `this.keepAlive`
452452

453-
`this.keepAlive()` prevents the Durable Object from being evicted due to inactivity by creating a 30-second heartbeat schedule. Returns a disposer function to stop the heartbeat. For scoped work, use `this.keepAliveWhile(fn)` which automatically cleans up when the function completes. See [Keeping the Agent Alive](./scheduling.md#keeping-the-agent-alive) for full documentation.
453+
`this.keepAlive()` prevents the Durable Object from being evicted due to inactivity by holding an alarm-backed heartbeat ref. Returns a disposer function to stop the heartbeat. For scoped work, use `this.keepAliveWhile(fn)` which automatically cleans up when the function completes. See [Keeping the Agent Alive](./scheduling.md#keeping-the-agent-alive) for full documentation.
454454

455455
### Routing
456456

docs/durable-execution.md

Lines changed: 9 additions & 1 deletion
@@ -81,7 +81,7 @@ try {
8181

8282
While any `keepAlive` ref is held, an alarm fires every 30 seconds that resets the inactivity timer. When all disposers are called, alarms stop and the DO can go idle naturally.
8383

84-
The heartbeat is invisible to `getSchedules()` — no schedule rows are created. It does not conflict with your own schedules; the alarm system multiplexes all schedules and the keepAlive heartbeat through a single alarm slot.
84+
The heartbeat is invisible to `listSchedules()` — no schedule rows are created. It does not conflict with your own schedules; the alarm system multiplexes all schedules and the keepAlive heartbeat through a single alarm slot.
8585

8686
### Configurable interval
8787

@@ -169,6 +169,14 @@ runFiber("work", fn)
169169

170170
Both recovery paths call the same hook. The alarm path is critical for background agents that have no incoming client connections — the persisted alarm wakes the agent on its own.
171171

172+
#### Sub-agents
173+
174+
Fibers also work inside sub-agents. The fiber row and snapshots are stored in the sub-agent's own SQLite database, and `onFiberRecovered()` runs with the sub-agent as `this`.
175+
176+
Sub-agents do not have independent alarm slots, so the top-level parent owns the physical heartbeat. When a sub-agent starts a fiber, the parent stores a small root-side index entry for that facet and fiber ID. Root alarm housekeeping uses that index to route recovery checks back into the owning sub-agent, even if the child has no client connection or incoming RPC.
177+
178+
This keeps recovery local to the child while preserving the single physical alarm slot owned by the parent. A recovered continuation can use `schedule()` from inside the facet; the parent owns the physical alarm and routes the callback back to the child.
179+
172180
#### Error during execution
173181

174182
```

docs/long-running-agents.md

Lines changed: 4 additions & 2 deletions
@@ -514,7 +514,9 @@ export class ProjectManager extends Agent<ProjectState> {
514514
}
515515
```
516516

517-
Sub-agents have their own state and lifecycle, but not full parity with top-level Durable Objects yet. In particular, they do **not** have their own alarms today — `schedule()` / `scheduleEvery()` are not supported on facets yet (support is coming soon). Put scheduled work on the parent and let it dispatch into children by RPC. The parent also does not need to stay awake while the child handles request-scoped work; once the child is woken it can complete the current turn independently.
517+
Sub-agents have their own state and lifecycle. They can schedule their own logical callbacks and run durable fibers; the top-level parent owns the physical alarm and routes scheduled work back into the child. Recovery rows live in the child's SQLite database, so `onFiberRecovered()` and Think `chatRecovery` run with the child as `this`.
518+
519+
Sub-agents still do not have independent physical alarm slots. The root parent keeps a small index of active child fibers, and its alarm routes recovery checks back into idle children. The parent does not need to stay awake while the child handles request-scoped work; once the child is woken it can complete the current turn independently.
518520

519521
For a full user-facing guide to the routing primitive (`subAgent`, `onBeforeSubAgent`, `useAgent({ sub })`, `parentAgent`, `hasSubAgent`, `listSubAgents`), see [Sub-agents](./sub-agents.md).
520522

@@ -622,7 +624,7 @@ A long-running agent eventually completes its purpose. The project ships, the in
622624
export class ProjectManager extends Agent<ProjectState> {
623625
async completeProject() {
624626
// Cancel remaining schedules
625-
const schedules = this.getSchedules();
627+
const schedules = await this.listSchedules();
626628
for (const schedule of schedules) {
627629
await this.cancelSchedule(schedule.id);
628630
}

docs/scheduling.md

Lines changed: 36 additions & 12 deletions
@@ -269,7 +269,7 @@ async syncData() {
269269

270270
Durable Objects are evicted after a period of inactivity (typically 70-140 seconds with no incoming requests, WebSocket messages, or alarms). During long-running operations — streaming LLM responses, waiting on external APIs, running multi-step computations — the agent can be evicted mid-flight.
271271

272-
`keepAlive()` prevents this by creating a 30-second heartbeat schedule that keeps the agent active until you are done:
272+
`keepAlive()` prevents this by holding an alarm-backed heartbeat ref that keeps the agent active until you are done:
273273

274274
```typescript
275275
const dispose = await this.keepAlive();
@@ -299,10 +299,12 @@ This is the recommended approach since you cannot forget to dispose the heartbea
299299

300300
### How it works
301301

302-
`keepAlive()` uses an in-memory reference count and the Durable Object alarm system directly. Each call increments the count; the disposer decrements it. While the count is above zero, `_scheduleNextAlarm()` ensures an alarm fires every 30 seconds, which resets the inactivity timer. No schedule rows are created and no observability events are emitted — the heartbeat is invisible to `getSchedules()` and the `agents:schedule` diagnostics channel.
302+
`keepAlive()` uses an in-memory reference count and the Durable Object alarm system directly. Each call increments the count; the disposer decrements it. While the count is above zero, `_scheduleNextAlarm()` ensures an alarm fires every 30 seconds, which resets the inactivity timer. No schedule rows are created and no observability events are emitted — the heartbeat is invisible to `listSchedules()` and the `agents:schedule` diagnostics channel.
303303

304304
The heartbeat does not conflict with your own schedules — the alarm system multiplexes all schedules and the keepAlive heartbeat through a single alarm slot.
305305

306+
Inside sub-agents, `keepAlive()` delegates that heartbeat ref to the top-level parent because facets do not have independent alarm slots. `keepAliveWhile()` works the same way because it calls `keepAlive()` and automatically disposes the delegated ref when the scoped work completes.
307+
306308
### Multiple concurrent callers
307309

308310
Each `keepAlive()` call returns an independent disposer:
@@ -340,7 +342,7 @@ dispose2(); // Ref count reaches 0 — agent can go idle
340342
Retrieve a scheduled task by its ID:
341343

342344
```typescript
343-
const schedule = await this.getSchedule(scheduleId);
345+
const schedule = await this.getScheduleById(scheduleId);
344346

345347
if (schedule) {
346348
console.log(
@@ -359,24 +361,24 @@ Query scheduled tasks with optional filters:
359361

360362
```typescript
361363
// Get all scheduled tasks
362-
const allSchedules = this.getSchedules();
364+
const allSchedules = await this.listSchedules();
363365

364366
// Get only cron jobs
365-
const cronJobs = this.getSchedules({ type: "cron" });
367+
const cronJobs = await this.listSchedules({ type: "cron" });
366368

367369
// Get tasks in the next hour
368-
const upcoming = this.getSchedules({
370+
const upcoming = await this.listSchedules({
369371
timeRange: {
370372
start: new Date(),
371373
end: new Date(Date.now() + 60 * 60 * 1000)
372374
}
373375
});
374376

375377
// Get a specific task by ID
376-
const specific = this.getSchedules({ id: "abc123" });
378+
const specific = await this.listSchedules({ id: "abc123" });
377379

378380
// Combine filters
379-
const upcomingCronJobs = this.getSchedules({
381+
const upcomingCronJobs = await this.listSchedules({
380382
type: "cron",
381383
timeRange: {
382384
start: new Date(),
@@ -399,6 +401,8 @@ if (cancelled) {
399401
}
400402
```
401403

404+
`cancelSchedule(id)` only matches schedules owned by the agent it is called on. A top-level agent cannot cancel a sub-agent's schedules by id, and a sub-agent cannot reach a sibling's schedules. To clear every schedule under a sub-agent (and any of its descendants), call `parent.deleteSubAgent(Cls, name)` from the parent — that bulk-cancels the prefix and tears the sub-agent down.
405+
402406
**Example: Cancellable reminders**
403407

404408
```typescript
@@ -504,7 +508,7 @@ class PollingAgent extends Agent {
504508

505509
async stopPolling() {
506510
// Cancel all polling schedules
507-
const schedules = this.getSchedules({ type: "delayed" });
511+
const schedules = await this.listSchedules({ type: "delayed" });
508512
for (const schedule of schedules) {
509513
if (schedule.callback === "poll") {
510514
await this.cancelSchedule(schedule.id);
@@ -815,13 +819,33 @@ Schedule a task to run repeatedly at a fixed interval.
815819
- If callback throws an error, the interval continues
816820
- Cancel with `cancelSchedule(id)` to stop the entire interval
817821

822+
### getScheduleById()
823+
824+
```typescript
825+
async getScheduleById(id: string): Promise<Schedule<unknown> | undefined>
826+
```
827+
828+
Get a scheduled task by ID. This method works in both top-level agents and sub-agents.
829+
830+
### listSchedules()
831+
832+
```typescript
833+
async listSchedules(criteria?: {
834+
id?: string;
835+
type?: "scheduled" | "delayed" | "cron" | "interval";
836+
timeRange?: { start?: Date; end?: Date };
837+
}): Promise<Schedule<unknown>[]>
838+
```
839+
840+
Get scheduled tasks matching the criteria. This method works in both top-level agents and sub-agents.
841+
818842
### getSchedule()
819843

820844
```typescript
821845
getSchedule<T = string>(id: string): Schedule<T> | undefined
822846
```
823847
824-
Get a scheduled task by ID. This method is synchronous.
848+
Deprecated. Get a scheduled task by ID synchronously. This method only works in top-level agents; use `await this.getScheduleById(id)` instead.
825849
826850
### getSchedules()
827851
@@ -833,7 +857,7 @@ getSchedules<T = string>(criteria?: {
833857
}): Schedule<T>[]
834858
```
835859
836-
Get scheduled tasks matching the criteria. This method is synchronous.
860+
Deprecated. Get scheduled tasks matching the criteria synchronously. This method only works in top-level agents; use `await this.listSchedules(criteria)` instead.
837861
838862
### cancelSchedule()
839863
@@ -849,7 +873,7 @@ Cancel a scheduled task. Returns `true` if cancelled, `false` if not found.
849873
async keepAlive(): Promise<() => void>
850874
```
851875
852-
Create a 30-second heartbeat schedule that prevents the Durable Object from being evicted due to inactivity. Returns a disposer function that cancels the heartbeat when called. The disposer is idempotent — calling it multiple times is safe.
876+
Create an alarm-backed heartbeat that prevents the Durable Object from being evicted due to inactivity. Returns a disposer function that cancels the heartbeat when called. The disposer is idempotent — calling it multiple times is safe.
853877
854878
See [Keeping the Agent Alive](#keeping-the-agent-alive) for usage details.
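
The reference-counted heartbeat and its delegation from a sub-agent to the top-level parent can be pictured with the sketch below. It is a simplified model under assumptions, not the SDK's code: `HeartbeatRefs` and `SketchFacet` are hypothetical names, and the sketch's `keepAlive()` is synchronous for brevity, whereas the documented method returns `Promise<() => void>`.

```typescript
// Hypothetical sketch of a ref-counted heartbeat; names are illustrative.
class HeartbeatRefs {
  private count = 0;

  // Each call takes one ref and returns an idempotent disposer.
  acquire(): () => void {
    this.count++;
    let released = false;
    return () => {
      if (released) return; // calling the disposer twice is safe
      released = true;
      this.count--;
    };
  }

  // While this is true, the owner keeps re-arming its alarm.
  get active(): boolean {
    return this.count > 0;
  }
}

// A facet has no alarm slot of its own, so it takes refs on the root's counter.
class SketchFacet {
  constructor(private rootRefs: HeartbeatRefs) {}

  keepAlive(): () => void {
    return this.rootRefs.acquire();
  }

  // Scoped variant: the delegated ref is released even if fn throws.
  async keepAliveWhile<T>(fn: () => Promise<T>): Promise<T> {
    const dispose = this.keepAlive();
    try {
      return await fn();
    } finally {
      dispose();
    }
  }
}
```

Because every facet shares the root's counter in this model, the root heartbeat stays active until the last child ref is released, which mirrors the "single alarm slot" behavior described above.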
855879
