| 1 | +--- |
| 2 | +title: "Voice Pipeline" |
| 3 | +sidebar_position: 8 |
| 4 | +--- |
| 5 | + |
| 6 | +# Streaming Voice Pipeline |
| 7 | + |
| 8 | +AgentOS provides a real-time streaming voice pipeline for building conversational voice agents. The pipeline handles bidirectional audio streaming, speech-to-text, turn-taking, text-to-speech, speaker diarization, and barge-in detection. |
| 9 | + |
| 10 | +## Architecture |
| 11 | + |
| 12 | +The pipeline consists of six core interfaces, wired together by the `VoicePipelineOrchestrator`: |
| 13 | + |
| 14 | +```mermaid |
| 15 | +graph LR |
| 16 | + Client[Browser/App] -->|WebSocket| Transport[IStreamTransport] |
| 17 | + Transport -->|Audio Frames| STT[IStreamingSTT] |
| 18 | + Transport -->|Audio Frames| Diarization[IDiarizationEngine] |
| 19 | + STT -->|Transcripts| Endpoint[IEndpointDetector] |
| 20 | + Endpoint -->|Turn Complete| LLM[Agent LLM] |
| 21 | + LLM -->|Token Stream| TTS[IStreamingTTS] |
| 22 | + TTS -->|Audio Chunks| Transport |
| 23 | + STT -->|Speech Start| Bargein[IBargeinHandler] |
| 24 | + Bargein -->|Cancel/Pause| TTS |
| 25 | +``` |
| 26 | + |
| 27 | +## State Machine |
| 28 | + |
| 29 | +The orchestrator manages a conversational loop through these states: |
| 30 | + |
| 31 | +```mermaid |
| 32 | +stateDiagram-v2 |
| 33 | + [*] --> IDLE |
| 34 | + IDLE --> LISTENING: startSession() |
| 35 | + LISTENING --> PROCESSING: turn_complete |
| 36 | + PROCESSING --> SPEAKING: LLM starts streaming |
| 37 | + SPEAKING --> LISTENING: TTS complete |
| 38 | + SPEAKING --> INTERRUPTING: barge-in detected |
| 39 | + INTERRUPTING --> LISTENING: TTS cancelled |
| 40 | + LISTENING --> CLOSED: disconnect |
| 41 | + SPEAKING --> CLOSED: disconnect |
| 42 | +``` |
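The diagram above can be read as a transition table. The following is a minimal sketch, not the orchestrator's actual implementation: the state labels come from the diagram, while the event names and the `nextState` helper are hypothetical and exist only for illustration.

```typescript
// State labels are taken from the diagram; event names are assumed.
type VoiceState =
  | 'IDLE' | 'LISTENING' | 'PROCESSING' | 'SPEAKING' | 'INTERRUPTING' | 'CLOSED';

type VoiceEvent =
  | 'start_session' | 'turn_complete' | 'llm_streaming' | 'tts_complete'
  | 'barge_in' | 'tts_cancelled' | 'disconnect';

// One entry per arrow in the diagram. CLOSED has no outgoing arrows.
const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { start_session: 'LISTENING' },
  LISTENING: { turn_complete: 'PROCESSING', disconnect: 'CLOSED' },
  PROCESSING: { llm_streaming: 'SPEAKING' },
  SPEAKING: { tts_complete: 'LISTENING', barge_in: 'INTERRUPTING', disconnect: 'CLOSED' },
  INTERRUPTING: { tts_cancelled: 'LISTENING' },
  CLOSED: {},
};

// Returns the next state, or throws if the diagram has no such arrow.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  const target = TRANSITIONS[state][event];
  if (target === undefined) {
    throw new Error(`Invalid transition: ${state} on ${event}`);
  }
  return target;
}
```

Keeping the transitions in a single table makes it easy to spot states that lack an exit, such as a disconnect arriving during `PROCESSING`, which the diagram does not cover.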
| 43 | + |
| 44 | +## Quick Start |
| 45 | + |
| 46 | +### CLI |
| 47 | + |
| 48 | +```bash |
| 49 | +# Basic voice mode (Whisper STT + OpenAI TTS) |
| 50 | +wunderland chat --voice |
| 51 | + |
| 52 | +# State-of-the-art setup: Deepgram STT + ElevenLabs TTS |
| 53 | +wunderland start my-agent --voice \ |
| 54 | + --voice-stt deepgram \ |
| 55 | + --voice-tts elevenlabs \ |
| 56 | + --voice-endpointing semantic \ |
| 57 | + --voice-diarization |
| 58 | +``` |
| 59 | + |
| 60 | +### Configuration |
| 61 | + |
| 62 | +In `agent.config.json`: |
| 63 | + |
| 64 | +```json |
| 65 | +{ |
| 66 | + "voice": { |
| 67 | + "enabled": true, |
| 68 | + "pipeline": "streaming", |
| 69 | + "stt": "deepgram", |
| 70 | + "tts": "elevenlabs", |
| 71 | + "ttsVoice": "nova", |
| 72 | + "endpointing": "heuristic", |
| 73 | + "diarization": { |
| 74 | + "enabled": true, |
| 75 | + "expectedSpeakers": 2 |
| 76 | + }, |
| 77 | + "bargeIn": "hard-cut", |
| 78 | + "language": "en-US", |
| 79 | + "server": { |
| 80 | + "port": 8765, |
| 81 | + "host": "127.0.0.1" |
| 82 | + } |
| 83 | + } |
| 84 | +} |
| 85 | +``` |
| 86 | + |
| 87 | +CLI flags override config file values. |
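The override rule can be sketched as a field-by-field merge. This is an illustration under assumptions: `VoiceSettings` covers only a subset of the config keys shown above, and `resolveVoiceSettings` is a hypothetical helper, not part of the AgentOS or wunderland API.

```typescript
// A subset of the `voice` config block, flattened for illustration.
type VoiceSettings = {
  stt: string;
  tts: string;
  endpointing: 'acoustic' | 'heuristic' | 'semantic';
  bargeIn: 'hard-cut' | 'soft-fade' | 'disabled';
};

// A flag that was not passed (undefined) falls back to the file value;
// any flag that was passed wins over the file.
function resolveVoiceSettings(
  file: VoiceSettings,
  flags: Partial<VoiceSettings>,
): VoiceSettings {
  return {
    stt: flags.stt ?? file.stt,
    tts: flags.tts ?? file.tts,
    endpointing: flags.endpointing ?? file.endpointing,
    bargeIn: flags.bargeIn ?? file.bargeIn,
  };
}
```

For example, with the config file above and `--voice-endpointing semantic` on the command line, the resolved settings would keep `deepgram` and `elevenlabs` from the file but use `semantic` endpointing.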
| 88 | + |
| 89 | +## Core Interfaces |
| 90 | + |
| 91 | +| Interface | Purpose | |
| 92 | +|-----------|---------| |
| 93 | +| `IStreamTransport` | Bidirectional audio pipe (WebSocket today; WebRTC planned) | |
| 94 | +| `IStreamingSTT` | Real-time speech-to-text with interim results | |
| 95 | +| `IEndpointDetector` | Turn-taking: decides when the user is done speaking | |
| 96 | +| `IDiarizationEngine` | Speaker identification and labeling | |
| 97 | +| `IStreamingTTS` | Token-stream to audio synthesis | |
| 98 | +| `IBargeinHandler` | Handles user interruption during agent speech | |
| 99 | + |
| 100 | +## Endpointing Modes |
| 101 | + |
| 102 | +| Mode | How it works | Latency | Cost | |
| 103 | +|------|-------------|---------|------| |
| 104 | +| `acoustic` | Pure energy-based VAD + silence timeout | Highest (~3s) | Free | |
| 105 | +| `heuristic` | Punctuation/syntax analysis + silence fallback | Low (~0.5s for `. ? !`) | Free | |
| 106 | +| `semantic` | LLM classifier for ambiguous pauses | Lowest (smart) | LLM API call per ambiguous turn | |
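
To make the `heuristic` row concrete, here is a rough sketch of punctuation-aware endpointing. It assumes the ~0.5s and ~3s figures from the table as thresholds; the function names and the regular expression are hypothetical and are not taken from the actual `IEndpointDetector` implementation.

```typescript
// Matches a transcript that ends in . ? or !, optionally followed by
// closing quotes/brackets and trailing whitespace.
const TERMINAL_PUNCTUATION = /[.?!]["')\]]*\s*$/;

// Assumed thresholds: 500ms after terminal punctuation, otherwise the
// 3000ms silence fallback listed for acoustic detection.
function silenceThresholdMs(transcript: string): number {
  return TERMINAL_PUNCTUATION.test(transcript) ? 500 : 3000;
}

// A turn is complete once there is some text and the observed silence
// has reached the threshold for that text.
function isTurnComplete(transcript: string, silenceMs: number): boolean {
  return transcript.trim().length > 0 && silenceMs >= silenceThresholdMs(transcript);
}
```

With these assumed thresholds, "How are you?" followed by 600ms of silence ends the turn, while "I was thinking that" keeps the detector waiting until the longer fallback elapses.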
| 107 | + |
| 108 | +## Barge-in Modes |
| 109 | + |
| 110 | +| Mode | Behavior | |
| 111 | +|------|----------| |
| 112 | +| `hard-cut` | Cancel TTS immediately once 300ms of user speech has been detected. Injects an `[interrupted]` marker into the conversation history. |
| 113 | +| `soft-fade` | Fade TTS out over 200ms. If the user speaks for less than 2s (a backchannel), resume playback; if longer than 2s, cancel. |
| 114 | +| `disabled` | Agent speaks to completion regardless of user speech. | |
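
The three modes can be summarized as a small decision function. This is a sketch under stated assumptions: the thresholds (300ms, 2s) come from the table, the `decideBargeIn` helper and its action names are hypothetical, and speech lasting exactly 2s is treated as a backchannel because the table leaves that boundary unspecified.

```typescript
type BargeInMode = 'hard-cut' | 'soft-fade' | 'disabled';
type BargeInAction = 'none' | 'fade' | 'resume' | 'cancel';

// userSpeechMs: how long the user has been (or was) speaking over the agent.
// userStillSpeaking: whether that speech is still ongoing.
function decideBargeIn(
  mode: BargeInMode,
  userSpeechMs: number,
  userStillSpeaking: boolean,
): BargeInAction {
  if (mode === 'disabled' || userSpeechMs <= 0) {
    return 'none';
  }
  if (mode === 'hard-cut') {
    // Ignore blips shorter than 300ms, then cancel outright.
    return userSpeechMs >= 300 ? 'cancel' : 'none';
  }
  // soft-fade: long speech cancels, short finished speech resumes,
  // short ongoing speech keeps TTS faded while we wait.
  if (userSpeechMs > 2000) {
    return 'cancel';
  }
  return userStillSpeaking ? 'fade' : 'resume';
}
```

The `fade` result covers the window in which the handler cannot yet tell a backchannel ("uh-huh") from a real interruption.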
| 115 | + |
| 116 | +## Extension Packs |
| 117 | + |
| 118 | +| Pack | npm Package | Provider | Env Var | |
| 119 | +|------|------------|----------|---------| |
| 120 | +| Deepgram STT | `@framers/agentos-ext-streaming-stt-deepgram` | Deepgram Nova-2 | `DEEPGRAM_API_KEY` | |
| 121 | +| Whisper STT | `@framers/agentos-ext-streaming-stt-whisper` | OpenAI Whisper | `OPENAI_API_KEY` | |
| 122 | +| OpenAI TTS | `@framers/agentos-ext-streaming-tts-openai` | OpenAI TTS-1 | `OPENAI_API_KEY` | |
| 123 | +| ElevenLabs TTS | `@framers/agentos-ext-streaming-tts-elevenlabs` | ElevenLabs | `ELEVENLABS_API_KEY` | |
| 124 | +| Diarization | `@framers/agentos-ext-diarization` | Local x-vector | — | |
| 125 | +| Semantic Endpoint | `@framers/agentos-ext-endpoint-semantic` | Any LLM | LLM API key | |
| 126 | + |
| 127 | +## WebSocket Protocol |
| 128 | + |
| 129 | +The voice server communicates via WebSocket: |
| 130 | + |
| 131 | +- **Binary messages**: Raw audio (client→server: PCM Float32 mono; server→client: encoded mp3/opus) |
| 132 | +- **Text messages**: JSON control/metadata |
| 133 | + |
| 134 | +### Client → Server |
| 135 | + |
| 136 | +```typescript |
| 137 | +// Text messages |
| 138 | +{ type: 'config', sampleRate: 16000, voice: 'nova', language: 'en-US' } |
| 139 | +{ type: 'control', action: 'mute' | 'unmute' | 'stop' } |
| 140 | + |
| 141 | +// Binary messages: raw PCM Float32 mono audio |
| 142 | +``` |
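A client might build these messages along the following lines. This is a sketch only: the message shapes follow the examples above, but `encodeText` and `encodeAudioFrame` are hypothetical helpers, and little-endian byte order for the Float32 samples is an assumption, since the protocol description does not state it.

```typescript
// Text messages the client can send, as shown in the examples above.
type ClientTextMessage =
  | { type: 'config'; sampleRate: number; voice: string; language: string }
  | { type: 'control'; action: 'mute' | 'unmute' | 'stop' };

// Text frames are plain JSON strings.
function encodeText(message: ClientTextMessage): string {
  return JSON.stringify(message);
}

// Packs mono Float32 samples into a binary frame, 4 bytes per sample.
// Byte order is assumed to be little-endian.
function encodeAudioFrame(samples: Float32Array): Uint8Array {
  const view = new DataView(new ArrayBuffer(samples.length * 4));
  samples.forEach((sample, i) => view.setFloat32(i * 4, sample, true));
  return new Uint8Array(view.buffer);
}
```

Both return values can be passed directly to `WebSocket.send`, which accepts strings as text frames and typed arrays as binary frames.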
| 143 | + |
| 144 | +### Server → Client |
| 145 | + |
| 146 | +```typescript |
| 147 | +{ type: 'session_started', sessionId: '...', config: { sampleRate: 24000, format: 'opus' } } |
| 148 | +{ type: 'transcript', text: 'Hello', isFinal: false, speaker: 'Speaker_0' } |
| 149 | +{ type: 'agent_thinking' } |
| 150 | +{ type: 'agent_speaking', text: 'Hi there!' } |
| 151 | +{ type: 'agent_done' } |
| 152 | +{ type: 'barge_in', action: 'cancelled' } |
| 153 | +{ type: 'session_ended', reason: 'disconnect' } |
| 154 | + |
| 155 | +// Binary messages: encoded audio (mp3/opus) in negotiated format |
| 156 | +``` |
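On the receiving side, the text messages map naturally onto a discriminated union. The sketch below assumes the field names shown in the examples above; `parseServerMessage` and `summarize` are hypothetical helpers, and the parser checks only the `type` field rather than validating every payload.

```typescript
type ServerMessage =
  | { type: 'session_started'; sessionId: string; config: { sampleRate: number; format: string } }
  | { type: 'transcript'; text: string; isFinal: boolean; speaker: string }
  | { type: 'agent_thinking' }
  | { type: 'agent_speaking'; text: string }
  | { type: 'agent_done' }
  | { type: 'barge_in'; action: string }
  | { type: 'session_ended'; reason: string };

const KNOWN_TYPES: string[] = [
  'session_started', 'transcript', 'agent_thinking', 'agent_speaking',
  'agent_done', 'barge_in', 'session_ended',
];

// Returns null for invalid JSON or an unrecognized `type`.
function parseServerMessage(raw: string): ServerMessage | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch (err) {
    return null;
  }
  if (typeof value !== 'object' || value === null) return null;
  const type = (value as { type?: unknown }).type;
  if (typeof type !== 'string' || KNOWN_TYPES.indexOf(type) === -1) return null;
  return value as ServerMessage;
}

// Switching on `type` lets the compiler narrow each branch's fields.
function summarize(message: ServerMessage): string {
  switch (message.type) {
    case 'session_started': return `session ${message.sessionId} (${message.config.format})`;
    case 'transcript': return `${message.speaker} (${message.isFinal ? 'final' : 'interim'}): ${message.text}`;
    case 'agent_thinking': return 'agent is thinking';
    case 'agent_speaking': return `agent: ${message.text}`;
    case 'agent_done': return 'agent finished';
    case 'barge_in': return `barge-in: ${message.action}`;
    case 'session_ended': return `session ended: ${message.reason}`;
  }
}
```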
| 157 | + |
| 158 | +## Error Recovery |
| 159 | + |
| 160 | +| Failure | Recovery | |
| 161 | +|---------|----------| |
| 162 | +| STT connection drops | Auto-reconnect with exponential backoff (100ms → 5s). Audio frames buffered during reconnect. | |
| 163 | +| TTS connection drops | Cancel current utterance, re-create session, re-send buffered text. | |
| 164 | +| Transport disconnects | Tear down all sessions. Client must reconnect. | |
| 165 | +| Endpoint stuck | 30s watchdog timer forces `turn_complete`. | |
| 166 | +| Diarization lag | Non-blocking. Transcript sent to LLM immediately; speaker labels backfilled. | |
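
The STT reconnect schedule can be illustrated with a capped exponential backoff. The 100ms starting delay and 5s cap come from the table; the doubling factor and the `backoffDelayMs` helper are assumptions made for this sketch, since the table does not state the growth rate or whether jitter is applied.

```typescript
// Delay before reconnect attempt `attempt` (0-based): 100ms, 200ms, 400ms, ...
// capped at 5000ms. Doubling per attempt is an assumed growth factor.
function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
  const exponent = Math.max(0, Math.floor(attempt));
  return Math.min(capMs, baseMs * Math.pow(2, exponent));
}
```

Under these assumptions the delay reaches the 5s cap on the seventh attempt (index 6), so the audio buffer used during reconnect has to hold at least several seconds of frames.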