docs: add voice-graph integration guide to VOICE_PIPELINE.md
Add a comprehensive "Voice-Graph Integration" section covering: voiceNode()
builder usage and the GraphNode properties it sets, voice transport mode for
whole-workflow call flows using VoiceTransportAdapter, YAML syntax for both
per-step voice config and top-level transport block (with a full field
reference table), barge-in routing with all exit condition reasons and
example loopback edges, the full set of voice-related GraphEvent types with
a consumption example, and checkpoint support including VoiceNodeCheckpoint
shape and how to resume turn counts across graph runs.
### No Call Recording or Transcript Persistence
Call transcripts are held in memory during the call but are not persisted to storage after the call ends. Future: integrate with AgentOS storage/memory system.
---
## Voice-Graph Integration
AgentOS lets you embed voice I/O directly inside an orchestration graph. There are two complementary integration modes: **voice nodes** (one step in a larger graph is a voice session) and **voice transport** (the entire graph runs inside a phone call or real-time voice session).
### Voice as a Graph Node Type
Use the `voiceNode()` builder to create a `GraphNode` of type `'voice'`. The node manages a full multi-turn STT/TTS session and exits when one of its configured exit conditions fires.

Among the `GraphNode` properties the builder sets:

| Property | Value |
|---|---|
| `checkpoint` | `'before'` — snapshot taken before the session starts |
Exit reasons map to the next node via `.on(exitReason, targetNodeId)`. The `.on()` chain is order-independent; the voice executor resolves the correct edge after the session ends.
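
Putting the pieces together, here is a minimal, self-contained sketch of the builder pattern. The `voiceNode` below is a local stand-in for illustration, not the real AgentOS export; only `type: 'voice'`, `checkpoint: 'before'`, and the order-independent `.on()` edge map come from this guide.

```typescript
// Hypothetical stand-in for the voiceNode() builder, for illustration only.
type EdgeMap = Record<string, string>;

interface VoiceGraphNode {
  id: string;
  type: 'voice';
  checkpoint: 'before';
  config: Record<string, unknown>;
  edges: EdgeMap;
  on(exitReason: string, targetNodeId: string): VoiceGraphNode;
}

function voiceNode(id: string, config: Record<string, unknown> = {}): VoiceGraphNode {
  const node: VoiceGraphNode = {
    id,
    type: 'voice',
    checkpoint: 'before', // snapshot is taken before the session starts
    config,
    edges: {},
    on(exitReason, targetNodeId) {
      // Order-independent: edges live in a plain map, resolved after the session ends.
      this.edges[exitReason] = targetNodeId;
      return this;
    },
  };
  return node;
}

// Usage: a conversation step with one edge per exit reason.
const intake = voiceNode('intake', { mode: 'conversation', maxTurns: 5 })
  .on('keyword:confirmed', 'summarize')
  .on('hangup', 'end')
  .on('interrupted', 'intake'); // loopback restarts the listen cycle
```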
### Voice Transport Mode
When the entire workflow should run inside a single phone call, declare a `transport` at the workflow level. All nodes in the graph then receive input from STT and deliver output to TTS via a `VoiceTransportAdapter`.

The `VoiceTransportAdapter` bridges the graph I/O cycle:
- `getNodeInput(nodeId)` — waits for the user's next speech turn (resolves on `turn_complete`).
- `deliverNodeOutput(nodeId, text)` — sends the node's response to TTS and emits a `voice_audio` graph event.
- `init(state)` — injects `state.scratch.voiceTransport` so voice nodes can access the transport.
- `dispose()` — emits `voice_session ended` and tears down the adapter.
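
To see how those four methods fit one graph I/O cycle, here is an in-memory fake. Everything below is a demonstration stand-in, not the real AgentOS class; only the method names and the `state.scratch.voiceTransport` injection point are taken from the list above.

```typescript
// In-memory stand-in for the VoiceTransportAdapter contract (illustrative).
interface GraphState {
  scratch: Record<string, unknown>;
}

class FakeVoiceTransportAdapter {
  private pendingTurns: string[] = [];
  readonly spoken: string[] = [];

  init(state: GraphState): void {
    // The real adapter injects itself here so voice nodes can reach the transport.
    state.scratch.voiceTransport = this;
  }

  // Test hook standing in for an STT `turn_complete` event.
  pushUserTurn(text: string): void {
    this.pendingTurns.push(text);
  }

  async getNodeInput(nodeId: string): Promise<string> {
    // The real adapter waits for the user's next speech turn.
    const turn = this.pendingTurns.shift();
    if (turn === undefined) throw new Error(`no pending turn for node ${nodeId}`);
    return turn;
  }

  async deliverNodeOutput(_nodeId: string, text: string): Promise<void> {
    // The real adapter sends this to TTS and emits a `voice_audio` graph event.
    this.spoken.push(text);
  }

  async dispose(): Promise<void> {
    // The real adapter emits `voice_session ended` and tears down resources.
    this.pendingTurns.length = 0;
  }
}

// One I/O cycle: read a user turn, deliver a spoken reply.
async function runEchoNode(adapter: FakeVoiceTransportAdapter): Promise<void> {
  const input = await adapter.getNodeInput('intake');
  await adapter.deliverNodeOutput('intake', `You said: ${input}`);
}
```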
### YAML Syntax
#### Voice step in a YAML workflow
```yaml
name: phone-intake
steps:
  - id: greet
    voice:
      mode: speak-only
      tts: openai
      voice: alloy

  - id: collect-info
    voice:
      mode: conversation
      stt: deepgram
      endpointing: heuristic
      bargeIn: hard-cut
      maxTurns: 5
      exitOn: keyword
      exitKeywords:
        - confirmed
        - cancel
```
#### Voice transport at workflow level
```yaml
name: phone-intake
transport:
  type: voice
  stt: deepgram
  tts: elevenlabs
  voice: nova
  bargeIn: hard-cut
  endpointing: heuristic

steps:
  - id: greet
    voice:
      mode: speak-only

  - id: intake
    voice:
      mode: conversation
      maxTurns: 3
      exitOn: keyword
      exitKeywords: [confirmed, done]
```
When `transport.type: voice` is present, `compileWorkflowYaml()` attaches the config to `compiled._transport` so the caller can detect that the workflow expects a `VoiceTransportAdapter` at runtime.
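
On the caller side, detection might look like the sketch below. The `CompiledWorkflow` and `VoiceTransportConfig` shapes are assumptions for illustration; only the `_transport` property and `type: 'voice'` come from this guide.

```typescript
// Illustrative shapes; only `_transport` and `type: 'voice'` come from the docs.
interface VoiceTransportConfig {
  type: 'voice';
  stt?: string;
  tts?: string;
  voice?: string;
  bargeIn?: string;
  endpointing?: string;
}

interface CompiledWorkflow {
  name: string;
  _transport?: VoiceTransportConfig;
}

// Returns true when the workflow expects a VoiceTransportAdapter at runtime.
function needsVoiceTransport(compiled: CompiledWorkflow): boolean {
  return compiled._transport?.type === 'voice';
}
```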

### Barge-In Routing and Exit Conditions

The `VoiceNodeExecutor` races multiple exit conditions simultaneously via `Promise.race`. The first condition to fire determines the `exitReason` string, which is then looked up in the node's edge map to resolve the `routeTarget`.
| `exitReason` | Trigger | Typical edge target |
|---|---|---|
| `hangup` | Transport emits `close` or `disconnected` | `end` / cleanup node |
| `keyword:<word>` | `final_transcript` contains a phrase from `exitKeywords` | intent-specific handler |
| `silence-timeout` | No speech for 30 s when `exitOn: silence-timeout` | timeout handler / retry |
| `interrupted` | `AbortController` fired with a `VoiceInterruptError` (barge-in) | re-listen / cancel TTS |

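The race itself is ordinary `Promise.race` over promises that each resolve to a reason string. A self-contained sketch, with timers standing in for transport and STT events:

```typescript
// The first exit condition to resolve wins; its string becomes the exitReason.
function raceExitConditions(conditions: Array<Promise<string>>): Promise<string> {
  return Promise.race(conditions);
}

// Timer-based stand-ins for real transport/STT events.
const silenceTimeout = (ms: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve('silence-timeout'), ms));

const keywordHeard = (word: string, ms: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`keyword:${word}`), ms));
```

When a keyword arrives before the silence window elapses, the race resolves to `keyword:<word>` and routing proceeds through the edge map.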
When a barge-in occurs, the executor catches the `VoiceInterruptError` and returns `exitReason: 'interrupted'`. Wire a loopback edge `.on('interrupted', 'listen')` to restart the listen cycle.
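
A minimal sketch of that catch-and-route behavior; the session runner below is a local stand-in, not the real `VoiceNodeExecutor`:

```typescript
// Local stand-in error class; the real one is exported by AgentOS.
class VoiceInterruptError extends Error {}

// Normal completion resolves to an exitReason string; a barge-in throws,
// and the catch converts it to the 'interrupted' reason.
async function runSession(session: () => Promise<string>): Promise<string> {
  try {
    return await session();
  } catch (err) {
    if (err instanceof VoiceInterruptError) return 'interrupted';
    throw err;
  }
}

// Edge map with the loopback: 'interrupted' routes back to the listen node.
const voiceEdges: Record<string, string> = {
  interrupted: 'listen',
  hangup: 'end',
};
```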

### Checkpoint Support

Voice nodes use `checkpoint: 'before'`, so the runtime takes a state snapshot before each voice session starts. If the process crashes mid-call, the graph can be resumed from the beginning of that voice node.
In addition, the `VoiceNodeExecutor` writes a `VoiceNodeCheckpoint` to `scratchUpdate[nodeId]` after every execution:
```typescript
interface VoiceNodeCheckpoint {
  turnIndex: number;              // total turns completed (inclusive of prior runs)
  transcript: TranscriptEntry[];  // full buffered transcript
  lastExitReason: string | null;
  speakerMap: Record<string, string>;
  sessionConfig: VoiceNodeConfig;
}
```
Pass `state.scratch[nodeId].turnIndex` back as the `initialTurnCount` when constructing a `VoiceTurnCollector` to resume the turn counter where the previous run left off. This lets a call that spans multiple graph runs (e.g. after a human-approval pause) count turns continuously instead of resetting to zero.
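
The resume arithmetic can be sketched with a local counter (the real `VoiceTurnCollector` lives in AgentOS; only the `initialTurnCount` semantics described above are assumed):

```typescript
// Stand-in for the turn-counting part of VoiceTurnCollector (illustrative).
class TurnCounter {
  constructor(private turnIndex: number = 0) {}

  completeTurn(): number {
    return ++this.turnIndex;
  }

  get count(): number {
    return this.turnIndex;
  }
}

// Checkpoint left by a previous graph run: three turns already completed.
const scratch: Record<string, { turnIndex: number } | undefined> = {
  intake: { turnIndex: 3 },
};

// Seed the counter from the checkpoint so turns continue, not restart.
const collector = new TurnCounter(scratch['intake']?.turnIndex ?? 0);
collector.completeTurn(); // this is the fourth turn overall
```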