Commit 99dfabd

docs: add voice pipeline architecture and configuration guide
1 parent f37a25b commit 99dfabd

1 file changed

Lines changed: 166 additions & 0 deletions

docs/VOICE_PIPELINE.md

@@ -0,0 +1,166 @@
1+
---
2+
title: "Voice Pipeline"
3+
sidebar_position: 8
4+
---
5+
6+
# Streaming Voice Pipeline
7+
8+
AgentOS provides a real-time streaming voice pipeline for building conversational voice agents. The pipeline handles bidirectional audio streaming, speech-to-text, turn-taking, text-to-speech, speaker diarization, and barge-in detection.
9+
10+
## Architecture
11+
12+
The pipeline consists of six core interfaces, wired together by the `VoicePipelineOrchestrator`:
13+
14+
```mermaid
15+
graph LR
16+
Client[Browser/App] -->|WebSocket| Transport[IStreamTransport]
17+
Transport -->|Audio Frames| STT[IStreamingSTT]
18+
Transport -->|Audio Frames| Diarization[IDiarizationEngine]
19+
STT -->|Transcripts| Endpoint[IEndpointDetector]
20+
Endpoint -->|Turn Complete| LLM[Agent LLM]
21+
LLM -->|Token Stream| TTS[IStreamingTTS]
22+
TTS -->|Audio Chunks| Transport
23+
STT -->|Speech Start| Bargein[IBargeinHandler]
24+
Bargein -->|Cancel/Pause| TTS
25+
```
26+
27+
## State Machine
28+
29+
The orchestrator manages a conversational loop through these states:
30+
31+
```mermaid
32+
stateDiagram-v2
33+
[*] --> IDLE
34+
IDLE --> LISTENING: startSession()
35+
LISTENING --> PROCESSING: turn_complete
36+
PROCESSING --> SPEAKING: LLM starts streaming
37+
SPEAKING --> LISTENING: TTS complete
38+
SPEAKING --> INTERRUPTING: barge-in detected
39+
INTERRUPTING --> LISTENING: TTS cancelled
40+
LISTENING --> CLOSED: disconnect
41+
SPEAKING --> CLOSED: disconnect
42+
```
43+
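The diagram can be read as a transition table. The sketch below is illustrative only: the state names come from the diagram, while the event identifiers (`llm_stream_start`, `tts_complete`, `barge_in`, `tts_cancelled`) and the `nextState` helper are hypothetical names, not the orchestrator's actual API.

```typescript
// Hypothetical sketch: states mirror the diagram; event names and this
// helper are assumptions made for illustration.
type VoiceState =
  | "IDLE" | "LISTENING" | "PROCESSING" | "SPEAKING" | "INTERRUPTING" | "CLOSED";

type VoiceEvent =
  | "startSession" | "turn_complete" | "llm_stream_start"
  | "tts_complete" | "barge_in" | "tts_cancelled" | "disconnect";

// One row per state; only the edges drawn in the diagram are listed.
const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { startSession: "LISTENING" },
  LISTENING: { turn_complete: "PROCESSING", disconnect: "CLOSED" },
  PROCESSING: { llm_stream_start: "SPEAKING" },
  SPEAKING: { tts_complete: "LISTENING", barge_in: "INTERRUPTING", disconnect: "CLOSED" },
  INTERRUPTING: { tts_cancelled: "LISTENING" },
  CLOSED: {},
};

// Returns the next state, or throws if the diagram has no such edge.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  const target = TRANSITIONS[state][event];
  if (target === undefined) {
    throw new Error(`Invalid transition: ${event} while ${state}`);
  }
  return target;
}
```

Note that the diagram only draws `disconnect` edges from `LISTENING` and `SPEAKING`, so this table rejects a disconnect in any other state; a real implementation may well handle it everywhere.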
44+
## Quick Start
45+
46+
### CLI
47+
48+
```bash
49+
# Basic voice mode (Whisper STT + OpenAI TTS)
50+
wunderland chat --voice
51+
52+
# SOTA setup with Deepgram + ElevenLabs
53+
wunderland start my-agent --voice \
54+
--voice-stt deepgram \
55+
--voice-tts elevenlabs \
56+
--voice-endpointing semantic \
57+
--voice-diarization
58+
```
59+
60+
### Configuration
61+
62+
In `agent.config.json`:
63+
64+
```json
65+
{
66+
"voice": {
67+
"enabled": true,
68+
"pipeline": "streaming",
69+
"stt": "deepgram",
70+
"tts": "elevenlabs",
71+
"ttsVoice": "nova",
72+
"endpointing": "heuristic",
73+
"diarization": {
74+
"enabled": true,
75+
"expectedSpeakers": 2
76+
},
77+
"bargeIn": "hard-cut",
78+
"language": "en-US",
79+
"server": {
80+
"port": 8765,
81+
"host": "127.0.0.1"
82+
}
83+
}
84+
}
85+
```
86+
87+
CLI flags override config file values.
88+
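As a rough illustration of "CLI flags override config file values", the sketch below merges a partial set of overrides onto the file config. The field names follow the JSON example above; the `VoiceConfig` and `CliOverrides` types and the `applyCliOverrides` function are hypothetical, and how flags actually map to fields is an assumption.

```typescript
// Field names follow the agent.config.json example; the types and merge
// logic are assumptions for illustration, not the CLI's implementation.
interface VoiceConfig {
  enabled: boolean;
  pipeline: string;
  stt: string;
  tts: string;
  ttsVoice: string;
  endpointing: string;
  diarization: { enabled: boolean; expectedSpeakers: number };
  bargeIn: string;
  language: string;
  server: { port: number; host: string };
}

// Every field is optional on the CLI side, including nested ones.
type CliOverrides = Partial<Omit<VoiceConfig, "diarization" | "server">> & {
  diarization?: Partial<VoiceConfig["diarization"]>;
  server?: Partial<VoiceConfig["server"]>;
};

// Later spreads win, so CLI values replace file values. Nested objects are
// merged one level deep so that overriding `port` keeps the file's `host`.
function applyCliOverrides(file: VoiceConfig, cli: CliOverrides): VoiceConfig {
  return {
    ...file,
    ...cli,
    diarization: { ...file.diarization, ...cli.diarization },
    server: { ...file.server, ...cli.server },
  };
}
```

Under this sketch, passing `{ stt: "whisper", server: { port: 9000 } }` would change only those two values and leave the rest of the file config untouched.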
89+
## Core Interfaces
90+
91+
| Interface | Purpose |
92+
|-----------|---------|
93+
| `IStreamTransport` | Bidirectional audio pipe (WebSocket today; WebRTC planned) |
94+
| `IStreamingSTT` | Real-time speech-to-text with interim results |
95+
| `IEndpointDetector` | Turn-taking: decides when the user is done speaking |
96+
| `IDiarizationEngine` | Speaker identification and labeling |
97+
| `IStreamingTTS` | Token-stream to audio synthesis |
98+
| `IBargeinHandler` | Handles user interruption during agent speech |
99+
100+
## Endpointing Modes
101+
102+
| Mode | How it works | Latency | Cost |
103+
|------|-------------|---------|------|
104+
| `acoustic` | Pure energy-based VAD + silence timeout | Highest (~3s) | Free |
105+
| `heuristic` | Punctuation/syntax analysis + silence fallback | Low (~0.5s for `. ? !`) | Free |
106+
| `semantic` | LLM classifier for ambiguous pauses | Lowest (smart) | LLM API call per ambiguous turn |
107+
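The `heuristic` row can be sketched as a single decision function. The 0.5s wait after `. ? !` comes from the table; reusing the 3s figure from the `acoustic` row as the silence fallback is an assumption, and `isTurnComplete` is a hypothetical name rather than part of `IEndpointDetector`.

```typescript
// Wait time after terminal punctuation (from the table: about 0.5s).
const PUNCTUATION_WAIT_MS = 500;
// Silence fallback. ASSUMPTION: reuses the ~3s figure from the acoustic row.
const SILENCE_TIMEOUT_MS = 3000;

// Decides whether the user's turn is over, given the latest transcript and
// how long the user has been silent.
function isTurnComplete(transcript: string, silenceMs: number): boolean {
  const endsSentence = /[.?!]$/.test(transcript.trim());
  const requiredSilence = endsSentence ? PUNCTUATION_WAIT_MS : SILENCE_TIMEOUT_MS;
  return silenceMs >= requiredSilence;
}
```

A transcript that trails off mid-sentence therefore waits for the full fallback, while one ending in a question mark completes after half a second of silence.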
108+
## Barge-in Modes
109+
110+
| Mode | Behavior |
111+
|------|----------|
112+
| `hard-cut` | Immediately cancel TTS after 300ms of user speech. Injects `[interrupted]` marker into conversation history. |
113+
| `soft-fade` | Fade TTS over 200ms. If the user speaks for less than 2s (a backchannel), resume; if longer than 2s, cancel. |
114+
| `disabled` | Agent speaks to completion regardless of user speech. |
115+
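The three modes reduce to a small decision function. This is a sketch under stated assumptions: the 300ms and 2s thresholds come from the table, but the `BargeInAction` values, the `decideBargeIn` name, and the behavior at exactly 2s (treated here as still a backchannel) are hypothetical.

```typescript
type BargeInMode = "hard-cut" | "soft-fade" | "disabled";
type BargeInAction = "continue" | "fade" | "resume" | "cancel";

const HARD_CUT_MS = 300;         // from the table
const BACKCHANNEL_MAX_MS = 2000; // from the table

// speechMs: how long the user has spoken over the agent so far.
// userStillSpeaking: whether that speech is still ongoing.
function decideBargeIn(
  mode: BargeInMode,
  speechMs: number,
  userStillSpeaking: boolean,
): BargeInAction {
  if (mode === "disabled" || speechMs <= 0) return "continue";
  if (mode === "hard-cut") {
    return speechMs >= HARD_CUT_MS ? "cancel" : "continue";
  }
  // soft-fade: long speech cancels; short speech fades while it lasts and
  // resumes once the user stops. ASSUMPTION: exactly 2s is not a cancel.
  if (speechMs > BACKCHANNEL_MAX_MS) return "cancel";
  return userStillSpeaking ? "fade" : "resume";
}
```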
116+
## Extension Packs
117+
118+
| Pack | npm Package | Provider | Env Var |
119+
|------|------------|----------|---------|
120+
| Deepgram STT | `@framers/agentos-ext-streaming-stt-deepgram` | Deepgram Nova-2 | `DEEPGRAM_API_KEY` |
121+
| Whisper STT | `@framers/agentos-ext-streaming-stt-whisper` | OpenAI Whisper | `OPENAI_API_KEY` |
122+
| OpenAI TTS | `@framers/agentos-ext-streaming-tts-openai` | OpenAI TTS-1 | `OPENAI_API_KEY` |
123+
| ElevenLabs TTS | `@framers/agentos-ext-streaming-tts-elevenlabs` | ElevenLabs | `ELEVENLABS_API_KEY` |
124+
| Diarization | `@framers/agentos-ext-diarization` | Local x-vector | None listed |
125+
| Semantic Endpoint | `@framers/agentos-ext-endpoint-semantic` | Any LLM | LLM API key |
126+
127+
## WebSocket Protocol
128+
129+
The voice server communicates via WebSocket:
130+
131+
- **Binary messages**: Raw audio (client→server: PCM Float32 mono; server→client: encoded mp3/opus)
132+
- **Text messages**: JSON control/metadata
133+
134+
### Client → Server
135+
136+
```typescript
137+
// Text messages
138+
{ type: 'config', sampleRate: 16000, voice: 'nova', language: 'en-US' }
139+
{ type: 'control', action: 'mute' | 'unmute' | 'stop' }
140+
141+
// Binary messages: raw PCM Float32 mono audio
142+
```
143+
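Many capture sources produce 16-bit or stereo audio, so a client typically has to convert before sending Float32 mono frames. The helpers below are a minimal sketch; their names are hypothetical, and the normalization range and channel layout the server expects are assumptions, since the protocol description above only says "PCM Float32 mono".

```typescript
// Convert 16-bit signed PCM samples to Float32 in the range [-1, 1).
function int16ToFloat32(samples: Int16Array): Float32Array {
  const out = new Float32Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    out[i] = samples[i] / 32768;
  }
  return out;
}

// Downmix interleaved stereo (L, R, L, R, ...) to mono by averaging pairs.
function stereoToMono(interleaved: Float32Array): Float32Array {
  const mono = new Float32Array(Math.floor(interleaved.length / 2));
  for (let i = 0; i < mono.length; i++) {
    mono[i] = (interleaved[2 * i] + interleaved[2 * i + 1]) / 2;
  }
  return mono;
}

// Build the JSON `config` text message shown in the example above.
function buildConfigMessage(sampleRate: number, voice: string, language: string): string {
  return JSON.stringify({ type: "config", sampleRate, voice, language });
}
```

In a browser, the resulting frame could then be sent as a binary message with `socket.send(frame.buffer)` on a standard `WebSocket`, after the `config` text message has been sent.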
144+
### Server → Client
145+
146+
```typescript
147+
{ type: 'session_started', sessionId: '...', config: { sampleRate: 24000, format: 'opus' } }
148+
{ type: 'transcript', text: 'Hello', isFinal: false, speaker: 'Speaker_0' }
149+
{ type: 'agent_thinking' }
150+
{ type: 'agent_speaking', text: 'Hi there!' }
151+
{ type: 'agent_done' }
152+
{ type: 'barge_in', action: 'cancelled' }
153+
{ type: 'session_ended', reason: 'disconnect' }
154+
155+
// Binary messages: encoded audio (mp3/opus) in negotiated format
156+
```
157+
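On the client, the server's text messages map naturally onto a discriminated union. The sketch below copies the message shapes from the examples above; which fields are always present (for example `speaker` when diarization is off) is an assumption, and `parseServerMessage` and `summarize` are hypothetical helpers.

```typescript
// Shapes copied from the examples above; field optionality is an assumption.
type ServerMessage =
  | { type: "session_started"; sessionId: string; config: { sampleRate: number; format: string } }
  | { type: "transcript"; text: string; isFinal: boolean; speaker: string }
  | { type: "agent_thinking" }
  | { type: "agent_speaking"; text: string }
  | { type: "agent_done" }
  | { type: "barge_in"; action: string }
  | { type: "session_ended"; reason: string };

const KNOWN_TYPES = new Set<string>([
  "session_started", "transcript", "agent_thinking", "agent_speaking",
  "agent_done", "barge_in", "session_ended",
]);

// Parses a text frame. Returns null for invalid JSON or an unknown `type`.
// Only the `type` tag is checked; other fields are trusted as-is.
function parseServerMessage(raw: string): ServerMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const type = (data as { type?: unknown }).type;
  if (typeof type !== "string" || !KNOWN_TYPES.has(type)) return null;
  return data as ServerMessage;
}

// Produces a one-line summary; the switch narrows `msg` by its `type` tag.
function summarize(msg: ServerMessage): string {
  switch (msg.type) {
    case "transcript":
      return `${msg.speaker}: ${msg.text}${msg.isFinal ? "" : " ..."}`;
    case "agent_speaking":
      return `agent: ${msg.text}`;
    case "session_ended":
      return `ended (${msg.reason})`;
    default:
      return msg.type;
  }
}
```

Binary frames carry audio rather than JSON, so a client would route them to its audio decoder instead of passing them through this parser.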
158+
## Error Recovery
159+
160+
| Failure | Recovery |
161+
|---------|----------|
162+
| STT connection drops | Auto-reconnect with exponential backoff (100ms → 5s). Audio frames buffered during reconnect. |
163+
| TTS connection drops | Cancel current utterance, re-create session, re-send buffered text. |
164+
| Transport disconnects | Tear down all sessions. Client must reconnect. |
165+
| Endpoint stuck | 30s watchdog timer forces `turn_complete`. |
166+
| Diarization lag | Non-blocking. Transcript sent to LLM immediately; speaker labels backfilled. |
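
The STT reconnect row gives only the endpoints of the backoff (100ms → 5s). The sketch below assumes the delay doubles on each attempt, which the table does not state; `backoffDelayMs` and `backoffSchedule` are hypothetical helpers.

```typescript
// Delay before reconnect attempt `attempt` (0-based). Starts at baseMs and
// is capped at maxMs. ASSUMPTION: the delay doubles on every attempt.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Lists the delays for the first `attempts` reconnect attempts.
function backoffSchedule(attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(backoffDelayMs(i));
  }
  return delays;
}

// backoffSchedule(7) → [100, 200, 400, 800, 1600, 3200, 5000]
```

With doubling, the cap is reached on the seventh attempt; production code often adds random jitter so that many clients do not reconnect in lockstep.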
