Commit fdeab10

docs: add voice-graph integration guide to VOICE_PIPELINE.md

Add a comprehensive "Voice-Graph Integration" section covering: voiceNode() builder usage and the GraphNode properties it sets, voice transport mode for whole-workflow call flows using VoiceTransportAdapter, YAML syntax for both per-step voice config and top-level transport block (with a full field reference table), barge-in routing with all exit condition reasons and example loopback edges, the full set of voice-related GraphEvent types with a consumption example, and checkpoint support including VoiceNodeCheckpoint shape and how to resume turn counts across graph runs.

1 parent b39539a commit fdeab10

1 file changed: docs/VOICE_PIPELINE.md (200 additions & 0 deletions)
### No Call Recording or Transcript Persistence

Call transcripts are held in memory during the call but are not persisted to storage after the call ends. Future: integrate with AgentOS storage/memory system.

---

## Voice-Graph Integration

AgentOS lets you embed voice I/O directly inside an orchestration graph. There are two complementary integration modes: **voice nodes** (one step in a larger graph is a voice session) and **voice transport** (the entire graph runs inside a phone call or real-time voice session).

### Voice as a Graph Node Type

Use the `voiceNode()` builder to create a `GraphNode` of type `'voice'`. The node manages a full multi-turn STT/TTS session and exits when one of its configured exit conditions fires.

```typescript
import { voiceNode } from '@framers/agentos/orchestration';

const listenNode = voiceNode('intake', {
  mode: 'conversation',
  stt: 'deepgram',
  tts: 'elevenlabs',
  maxTurns: 5,
  exitOn: 'keyword',
  exitKeywords: ['confirmed', 'cancel'],
})
  .on('keyword:confirmed', 'process-intake')
  .on('keyword:cancel', 'goodbye')
  .on('hangup', 'end')
  .on('turns-exhausted', 'fallback')
  .build();
```

The builder produces a `GraphNode` with:

| Property | Value |
|----------|-------|
| `type` | `'voice'` |
| `executorConfig.type` | `'voice'` |
| `executionMode` | `'react_bounded'` — models the multi-turn loop |
| `effectClass` | `'external'` — touches real-world audio I/O |
| `checkpoint` | `'before'` — snapshot taken before the session starts |

Exit reasons map to the next node via `.on(exitReason, targetNodeId)`. The `.on()` chain is order-independent; the voice executor resolves the correct edge after the session ends.

### Voice Transport Mode

When the entire workflow should run inside a single phone call, declare a `transport` at the workflow level. All nodes in the graph then receive input from STT and deliver output to TTS via a `VoiceTransportAdapter`.

```typescript
import { workflow } from '@framers/agentos/orchestration';
import { VoiceTransportAdapter } from '@framers/agentos/orchestration/runtime/VoiceTransportAdapter';

const callFlow = workflow('phone-intake')
  .input(inputSchema)
  .returns(outputSchema)
  .transport('voice', { stt: 'deepgram', tts: 'openai', voice: 'alloy' })
  .step('greet', { voice: { mode: 'speak-only' } })
  .step('listen', { voice: { mode: 'conversation', maxTurns: 3 } })
  .step('confirm', { voice: { mode: 'conversation', exitOn: 'keyword', exitKeywords: ['yes', 'no'] } })
  .step('process', { tool: 'crm_update' })
  .compile();
```

The `VoiceTransportAdapter` bridges the graph I/O cycle:

- `getNodeInput(nodeId)` — waits for the user's next speech turn (resolves on `turn_complete`).
- `deliverNodeOutput(nodeId, text)` — sends the node's response to TTS and emits a `voice_audio` graph event.
- `init(state)` — injects `state.scratch.voiceTransport` so voice nodes can access the transport.
- `dispose()` — emits a `voice_session` ended event and tears down the adapter.
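
The four-method cycle above can be exercised with an in-memory stand-in. Everything below (`FakeVoiceTransport`, `pushUserTurn`, `runIntakeStep`) is illustrative scaffolding, not AgentOS API; only the `getNodeInput`/`deliverNodeOutput` call shapes mirror the adapter:

```typescript
// Minimal in-memory stand-in for the adapter's I/O cycle (illustrative only;
// the real VoiceTransportAdapter wires these calls to live STT/TTS streams).
type SpokenTurn = { nodeId: string; text: string };

class FakeVoiceTransport {
  private pendingTurns: string[] = [];
  readonly delivered: SpokenTurn[] = [];

  // Simulate the caller finishing a speech turn (STT turn_complete).
  pushUserTurn(text: string): void {
    this.pendingTurns.push(text);
  }

  // Mirrors getNodeInput(nodeId): resolves with the next user turn.
  async getNodeInput(nodeId: string): Promise<string> {
    const turn = this.pendingTurns.shift();
    if (turn === undefined) throw new Error(`no turn queued for ${nodeId}`);
    return turn;
  }

  // Mirrors deliverNodeOutput(nodeId, text): the real adapter streams to TTS.
  async deliverNodeOutput(nodeId: string, text: string): Promise<void> {
    this.delivered.push({ nodeId, text });
  }
}

// One node's turn: read the user's speech, then speak a response.
async function runIntakeStep(transport: FakeVoiceTransport): Promise<string> {
  const heard = await transport.getNodeInput('intake');
  await transport.deliverNodeOutput('intake', `You said: ${heard}`);
  return heard;
}
```

This kind of stand-in is also a convenient way to unit-test voice workflows without opening a real call.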

### YAML Syntax

#### Voice step in a YAML workflow

```yaml
name: phone-intake
steps:
  - id: greet
    voice:
      mode: speak-only
      tts: openai
      voice: alloy

  - id: collect-info
    voice:
      mode: conversation
      stt: deepgram
      endpointing: heuristic
      bargeIn: hard-cut
      maxTurns: 5
      exitOn: keyword
      exitKeywords:
        - confirmed
        - cancel
```

#### Voice transport at workflow level

```yaml
name: phone-intake
transport:
  type: voice
  stt: deepgram
  tts: elevenlabs
  voice: nova
  bargeIn: hard-cut
  endpointing: heuristic
steps:
  - id: greet
    voice:
      mode: speak-only
  - id: intake
    voice:
      mode: conversation
      maxTurns: 3
      exitOn: keyword
      exitKeywords: [confirmed, done]
```

When `transport.type: voice` is present, `compileWorkflowYaml()` attaches the config to `compiled._transport` so the caller can detect that the workflow expects a `VoiceTransportAdapter` at runtime.
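
That caller-side detection can be sketched as follows; only the `compiled._transport` shape comes from the behavior described above, while the `needsVoiceTransport` helper and the inline `CompiledWorkflow` type are hypothetical illustrations:

```typescript
// Sketch of the runtime-side check (the `_transport` field comes from
// compileWorkflowYaml(); the types here are simplified stand-ins).
type TransportConfig = { type: string; stt?: string; tts?: string; voice?: string };
type CompiledWorkflow = { name: string; _transport?: TransportConfig };

// Hypothetical helper: does this compiled workflow expect a voice transport?
function needsVoiceTransport(compiled: CompiledWorkflow): boolean {
  return compiled._transport?.type === 'voice';
}

const compiled: CompiledWorkflow = {
  name: 'phone-intake',
  _transport: { type: 'voice', stt: 'deepgram', tts: 'elevenlabs' },
};

if (needsVoiceTransport(compiled)) {
  // construct and init() a VoiceTransportAdapter here before running the graph
}
```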

#### YAML voice step fields

| Field | Type | Description |
|-------|------|-------------|
| `mode` | `conversation` \| `listen-only` \| `speak-only` | **Required.** Session direction. |
| `stt` | string | STT provider override (e.g. `deepgram`, `openai`). |
| `tts` | string | TTS provider override (e.g. `openai`, `elevenlabs`). |
| `voice` | string | TTS voice name. |
| `endpointing` | `acoustic` \| `heuristic` \| `semantic` | Endpoint detection mode. |
| `bargeIn` | `hard-cut` \| `soft-fade` \| `disabled` | Barge-in handling. |
| `diarization` | boolean | Enable speaker diarization. |
| `language` | string | BCP-47 language tag (e.g. `en-US`). |
| `maxTurns` | number | Maximum turns before `turns-exhausted` exit. `0` = unlimited. |
| `exitOn` | string | Primary exit condition: `hangup`, `silence-timeout`, `keyword`, `turns-exhausted`, `manual`. |
| `exitKeywords` | string[] | Phrases that trigger keyword exit. Case-insensitive substring match. |
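
The `exitKeywords` matching rule (case-insensitive substring) can be sketched in a few lines; the `matchExitKeyword` helper below is illustrative, not the executor's actual code:

```typescript
// Illustrative sketch of the documented matching rule for `exitKeywords`:
// case-insensitive substring match against a final transcript.
function matchExitKeyword(transcript: string, exitKeywords: string[]): string | null {
  const lower = transcript.toLowerCase();
  for (const keyword of exitKeywords) {
    // Substring, not word-boundary: 'confirmed' matches 'it is CONFIRMED.'
    if (lower.includes(keyword.toLowerCase())) return keyword;
  }
  return null;
}
```

For example, `matchExitKeyword('OK, CONFIRMED then.', ['confirmed', 'cancel'])` returns `'confirmed'`, which the executor would surface as the `keyword:confirmed` exit reason.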

### Barge-in Routing with Exit Conditions

The `VoiceNodeExecutor` races multiple exit conditions simultaneously via a `Promise.race`. The first condition to fire determines the `exitReason` string, which is then looked up in the node's edge map to resolve the `routeTarget`.

| `exitReason` | Trigger | Typical edge target |
|---|---|---|
| `hangup` | Transport emits `close` or `disconnected` | `end` / cleanup node |
| `turns-exhausted` | `turn_complete` fires and `turnCount >= maxTurns` | summarize / fallback node |
| `keyword:<word>` | `final_transcript` contains a phrase from `exitKeywords` | intent-specific handler |
| `silence-timeout` | No speech for 30 s when `exitOn: silence-timeout` | timeout handler / retry |
| `interrupted` | `AbortController` fired with a `VoiceInterruptError` (barge-in) | re-listen / cancel TTS |

When a barge-in occurs, the executor catches the `VoiceInterruptError` and returns `exitReason: 'interrupted'`. Wire a loopback edge `.on('interrupted', 'listen')` to restart the listen cycle:

```typescript
voiceNode('listen', { mode: 'conversation' })
  .on('interrupted', 'listen') // barge-in → re-listen
  .on('turns-exhausted', 'summarize')
  .on('hangup', 'end')
  .build();
```
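
The race itself can be sketched in isolation. The timer-based condition sources below are simulated stand-ins for the real transport/STT listeners; only the `Promise.race` shape and the reason-to-edge lookup follow the description above:

```typescript
// Self-contained sketch of racing exit conditions. Timers simulate event
// sources; the real executor wires these to transport and STT events.
type ExitReason = string;

function conditionFires(reason: ExitReason, afterMs: number): Promise<ExitReason> {
  return new Promise((resolve) => setTimeout(() => resolve(reason), afterMs));
}

async function raceExitConditions(
  edges: Record<ExitReason, string>,
): Promise<{ exitReason: ExitReason; routeTarget: string | undefined }> {
  // Whichever condition settles first wins; the losers are simply ignored.
  const exitReason = await Promise.race([
    conditionFires('hangup', 50),            // e.g. transport 'close'
    conditionFires('keyword:confirmed', 10), // e.g. matching final_transcript
    conditionFires('turns-exhausted', 100),  // e.g. turnCount >= maxTurns
  ]);
  // Resolve the route the same way the edge map lookup is described above.
  return { exitReason, routeTarget: edges[exitReason] };
}
```

Here the keyword condition fires first (10 ms), so with edges `{ 'keyword:confirmed': 'process-intake', hangup: 'end' }` the resolved `routeTarget` is `'process-intake'`.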

### Graph Events for Voice

Voice nodes emit the following `GraphEvent` values in causal order:

| Event type | When |
|---|---|
| `voice_session` (action: `started`) | Immediately on `execute()` entry |
| `voice_transcript` (isFinal: false) | Each `interim_transcript` from STT |
| `voice_transcript` (isFinal: true) | Each confirmed `final_transcript` |
| `voice_turn_complete` | Each `turn_complete` from the endpoint detector |
| `voice_audio` (direction: `outbound`) | When TTS delivery is triggered by `VoiceTransportAdapter.deliverNodeOutput()` |
| `voice_barge_in` | Each `barge_in` event from the pipeline session |
| `voice_session` (action: `ended`) | On node exit, with `exitReason` |

Consume events via the `GraphRuntime` stream:

```typescript
for await (const event of runtime.stream(graph, input)) {
  if (event.type === 'voice_transcript' && event.isFinal) {
    console.log(`[${event.speaker}] ${event.text}`);
  }
  if (event.type === 'voice_session' && event.action === 'ended') {
    console.log('Session exit reason:', event.exitReason);
  }
}
```

### Checkpoint Support

Voice nodes use `checkpoint: 'before'`, so the runtime takes a state snapshot before each voice session starts. If the process crashes mid-call, the graph can be resumed from the beginning of that voice node.

In addition, the `VoiceNodeExecutor` writes a `VoiceNodeCheckpoint` to `scratchUpdate[nodeId]` after every execution:

```typescript
interface VoiceNodeCheckpoint {
  turnIndex: number;              // total turns completed (inclusive of prior runs)
  transcript: TranscriptEntry[];  // full buffered transcript
  lastExitReason: string | null;
  speakerMap: Record<string, string>;
  sessionConfig: VoiceNodeConfig;
}
```

Pass `state.scratch[nodeId].turnIndex` back as the `initialTurnCount` when constructing a `VoiceTurnCollector` to resume the turn counter from where the previous run left off. This lets a call that spans multiple graph runs (e.g. after a human-approval pause) count turns continuously rather than resetting to zero.
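
A minimal sketch of that handoff, using a local `TurnCounter` as a stand-in for the runtime's `VoiceTurnCollector` (the class body here is hypothetical; only the `initialTurnCount` seeding follows the behavior described above):

```typescript
// Hypothetical stand-in showing the resume path: seed the counter from the
// prior run's checkpoint instead of starting at zero.
interface ScratchCheckpoint {
  turnIndex: number; // mirrors VoiceNodeCheckpoint.turnIndex
}

class TurnCounter {
  private turns: number;
  constructor(initialTurnCount = 0) {
    this.turns = initialTurnCount;
  }
  completeTurn(): number {
    return ++this.turns; // total turns across all runs
  }
}

function resumeCounter(
  scratch: Record<string, ScratchCheckpoint | undefined>,
  nodeId: string,
): TurnCounter {
  // No checkpoint (first run) falls back to 0.
  return new TurnCounter(scratch[nodeId]?.turnIndex ?? 0);
}
```

With a checkpoint of `{ turnIndex: 3 }`, the next completed turn counts as turn 4, so `maxTurns` limits keep applying across the pause.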
