A server-based voice agent system that improves upon LiveKit's baseline VAD through enhanced turn detection, natural backchannel responses, and better handling of pauses and interruptions.
- Multi-signal fusion: Combines silence duration (40%), linguistic completeness (35%), and conversation context (25%)
- Reduces false interruptions: Won't cut off users mid-sentence
- Adaptive thresholds: Learns user speaking patterns over time
- 5 types: "mm-hmm", "okay", "yeah", "I see", "right"
- Context-aware selection: Chooses appropriate responses based on conversation
- Safe zone timing: 300ms delay with abort capability if user resumes speaking
- Anti-repetition: Varies backchannel types naturally
- Event-driven design: Loosely coupled components via event bus
- Chunked transcription: 1.5s chunks with 0.5s overlap for continuous STT
- Multi-channel audio: Separate channels for agent speech and backchannels
- WebRTC transport: Real-time audio streaming
- Framework: FastAPI + asyncio
- Audio Transport: WebRTC (aiortc) with STUN server
- VAD: Silero VAD (ONNX model)
- STT: OpenAI Whisper API
- LLM: OpenAI GPT-4o-mini
- TTS: OpenAI TTS (alloy voice)
- Audio Format: 16kHz mono PCM
- Python 3.9 or higher
- OpenAI API key
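The chunked transcription mentioned in the feature list (1.5s chunks with 0.5s overlap on 16kHz mono PCM) can be pictured with a minimal sketch. The function name `chunk_spans` and the handling of the final short chunk are assumptions for illustration, not the project's actual code in `audio_pipeline.py`.

```python
from typing import List, Tuple


def chunk_spans(
    total_samples: int,
    sample_rate: int = 16_000,   # 16kHz mono PCM, per the tech stack
    chunk_s: float = 1.5,        # chunk length from the feature list
    overlap_s: float = 0.5,      # overlap from the feature list
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for overlapping STT chunks.

    Hypothetical helper: the last chunk is simply truncated at the end
    of the buffer, which is an assumption.
    """
    chunk = round(chunk_s * sample_rate)
    hop = chunk - round(overlap_s * sample_rate)  # advance per chunk
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    spans = []
    start = 0
    while start < total_samples:
        spans.append((start, min(start + chunk, total_samples)))
        if start + chunk >= total_samples:
            break  # this chunk already reaches the end of the audio
        start += hop
    return spans


if __name__ == "__main__":
    # 3.5 seconds of audio at 16kHz
    for start, end in chunk_spans(56_000):
        print(f"{start / 16_000:.1f}s -> {end / 16_000:.1f}s")
```

With these defaults each chunk starts 1.0s after the previous one, so the last 0.5s of one chunk is sent again as the first 0.5s of the next.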
# Navigate to project directory
cd voice-agent
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

Download the Silero VAD ONNX model:
- Visit: https://github.com/snakers4/silero-vad/releases
- Download silero_vad.onnx (latest version)
- Place it in the models/ directory: voice-agent/models/silero_vad.onnx
Copy the example environment file and add your API keys:
# Copy example file
# On Windows:
copy .env.example .env
# On macOS/Linux:
cp .env.example .env
# Edit .env and add your key:
# OPENAI_API_KEY=sk-your-openai-api-key-here

API Key Required:
- OPENAI_API_KEY: Used for Whisper Speech-to-Text, TTS (Text-to-Speech), and LLM (GPT-4o-mini)
You have two options for creating the backchannel audio files:
Option A: Generate with OpenAI TTS (Recommended)
Run the backchannel generation script (requires OpenAI API key):
python generate_backchannels.py

This will create 5 natural-sounding WAV files in the backchannels/ directory using OpenAI's TTS API.
Option B: Create Placeholder Audio (No API Key Required)
If you don't have an OpenAI API key yet:
python create_placeholder_backchannels.py

This creates simple tone-based placeholders. You can replace them later with real audio using Option A.
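To give a rough idea of what a tone-based placeholder could look like, here is a minimal sketch that writes a short sine tone as a 16kHz mono 16-bit WAV using only the standard library. It is not the contents of `create_placeholder_backchannels.py`. The function name `write_tone`, the frequencies, the duration, and the fade envelope are all assumptions. The file names come from the project structure.

```python
import math
import struct
import wave
from pathlib import Path

SAMPLE_RATE = 16_000  # matches the project's 16kHz mono PCM format


def write_tone(path, freq_hz, duration_s=0.3, amplitude=0.3):
    """Write a short sine tone to `path`; returns the number of frames.

    Hypothetical helper for illustration only.
    """
    n_frames = round(duration_s * SAMPLE_RATE)
    fade = max(1, n_frames // 10)  # short fade in/out to avoid clicks
    frames = bytearray()
    for i in range(n_frames):
        envelope = min(1.0, i / fade, (n_frames - 1 - i) / fade)
        sample = amplitude * envelope * math.sin(
            2 * math.pi * freq_hz * i / SAMPLE_RATE
        )
        frames += struct.pack("<h", int(sample * 32767))  # 16-bit little-endian

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)           # mono
        wav.setsampwidth(2)           # 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(bytes(frames))
    return n_frames


if __name__ == "__main__":
    # Assumed pitches, one per backchannel file listed in the project tree.
    tones = {"mmhmm": 220, "okay": 262, "yeah": 294, "i_see": 330, "right": 349}
    for name, freq in tones.items():
        write_tone(Path("backchannels") / f"{name}.wav", freq)
```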
# Start the server
python -m server.main
# Or use uvicorn directly
uvicorn server.main:app --host 0.0.0.0 --port 8000

The server will start on http://localhost:8000
Open your browser and navigate to:
http://localhost:8000
Click "Start Conversation" and grant microphone permissions.
voice-agent/
├── server/                              # Server-side Python code
│   ├── main.py                          # FastAPI entry point
│   ├── config.py                        # Configuration management
│   ├── event_bus.py                     # Event system
│   ├── conversation_manager.py          # State management
│   ├── audio_pipeline.py                # Audio buffering
│   ├── vad_processor.py                 # Voice activity detection
│   ├── stt_client.py                    # Whisper STT client
│   ├── transcription_coordinator.py     # STT coordination
│   ├── linguistic_analyzer.py           # Text analysis
│   ├── turn_detector.py                 # Turn detection logic
│   ├── backchannel_*.py                 # Backchannel system (5 files)
│   ├── llm_client.py                    # OpenAI LLM client
│   ├── tts_client.py                    # OpenAI TTS client
│   ├── response_coordinator.py          # Response generation
│   ├── audio_mixer.py                   # Multi-channel mixing
│   └── webrtc_handler.py                # WebRTC connections
├── models/
│   └── silero_vad.onnx                  # VAD model (download separately)
├── backchannels/                        # Generated backchannel audio
│   ├── mmhmm.wav
│   ├── okay.wav
│   ├── yeah.wav
│   ├── i_see.wav
│   └── right.wav
├── static/                              # Web interface
│   ├── index.html
│   └── app.js
├── requirements.txt                     # Python dependencies
├── .env.example                         # Environment template
├── .env                                 # Your API keys (create this)
├── generate_backchannels.py             # Backchannel generator
└── README.md                            # This file
All tunable parameters are in the .env file. Key settings:
- TURN_END_SCORE_THRESHOLD=65 - Threshold for turn ending (0-100)
- SHORT_PAUSE_MS=400 - Short pause duration (ms)
- LONG_PAUSE_MS=1500 - Long pause duration (ms)
- BACKCHANNEL_BASE_PROBABILITY=0.4 - Base 40% chance
- BACKCHANNEL_MIN_INTERVAL_S=5 - Minimum 5s between backchannels
- BACKCHANNEL_SAFE_ZONE_MS=300 - 300ms wait before playing
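One way these three backchannel settings could interact is sketched below: a gate that enforces the minimum interval and the base probability, followed by a safe-zone wait that is abandoned if the user resumes speaking. This is an illustration under assumptions, not the logic in `backchannel_trigger.py`. The function names, the `user_resumed` event, and the `play` callback are hypothetical.

```python
import asyncio
import random

BASE_PROBABILITY = 0.4   # BACKCHANNEL_BASE_PROBABILITY
MIN_INTERVAL_S = 5.0     # BACKCHANNEL_MIN_INTERVAL_S
SAFE_ZONE_S = 0.3        # BACKCHANNEL_SAFE_ZONE_MS / 1000


def should_attempt(now, last_played_at, rng=random.random):
    """Gate a backchannel on the minimum interval, then on the base chance."""
    if last_played_at is not None and now - last_played_at < MIN_INTERVAL_S:
        return False
    return rng() < BASE_PROBABILITY


async def play_after_safe_zone(user_resumed, play, delay_s=SAFE_ZONE_S):
    """Wait out the safe zone, then play unless the user resumed speaking.

    `user_resumed` is an asyncio.Event assumed to be set by the VAD side.
    Returns True if the backchannel was played, False if it was aborted.
    """
    try:
        await asyncio.wait_for(user_resumed.wait(), timeout=delay_s)
    except asyncio.TimeoutError:
        play()          # the user stayed silent for the whole safe zone
        return True
    return False        # the user resumed speaking, so abort


async def _demo():
    resumed = asyncio.Event()
    played = await play_after_safe_zone(resumed, lambda: print("mm-hmm"))
    print("played:", played)


if __name__ == "__main__":
    asyncio.run(_demo())
```

Waiting on an event with a timeout keeps the abort path cheap: nothing has been sent to the audio channel yet when the user resumes, so there is nothing to cancel.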
- SILENCE_WEIGHT=0.4 - 40% weight for silence
- LINGUISTIC_WEIGHT=0.35 - 35% weight for linguistics
- CONTEXT_WEIGHT=0.25 - 25% weight for context
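The weights and the turn-end threshold can be read as a weighted sum compared against a cutoff. The sketch below assumes each signal is already scored on a 0-100 scale and that the turn ends when the fused score reaches the threshold. Both points are assumptions, as are the names `FusionWeights`, `turn_end_score`, and `is_turn_end`. The actual scoring lives in `turn_detector.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionWeights:
    silence: float = 0.40      # SILENCE_WEIGHT
    linguistic: float = 0.35   # LINGUISTIC_WEIGHT
    context: float = 0.25      # CONTEXT_WEIGHT


def turn_end_score(silence, linguistic, context, weights=FusionWeights()):
    """Fuse three 0-100 signal scores into one 0-100 turn-end score."""
    for name, value in (
        ("silence", silence),
        ("linguistic", linguistic),
        ("context", context),
    ):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} score must be within 0-100, got {value}")
    return (
        weights.silence * silence
        + weights.linguistic * linguistic
        + weights.context * context
    )


def is_turn_end(score, threshold=65.0):  # TURN_END_SCORE_THRESHOLD
    """Assumed rule: the turn ends once the score reaches the threshold."""
    return score >= threshold


if __name__ == "__main__":
    # Long pause and a complete-sounding sentence, weak context signal:
    # 0.40*80 + 0.35*70 + 0.25*40 is roughly 66.5
    score = turn_end_score(silence=80, linguistic=70, context=40)
    print(round(score, 1), is_turn_end(score))
```

Because the weights sum to 1.0, the fused score stays on the same 0-100 scale as the threshold. If you change one weight in .env, adjust the others to keep that property.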
- GET / - Web interface
- POST /offer - WebRTC offer/answer exchange
- GET /health - Health check
- GET /status - Current conversation status
- Download silero_vad.onnx and place it in the models/ directory
- Run python generate_backchannels.py to create the backchannel audio files
- Add OPENAI_API_KEY to the .env file; this one key is used for STT, the LLM, and TTS
- Check firewall settings
- Ensure STUN server is accessible
- Try using HTTPS (required for some browsers)
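The file and key checks from the troubleshooting items can also be run in one pass before starting the server. The following sketch is a hypothetical helper, not part of the project. It inspects the process environment rather than parsing .env, so it assumes the key has already been loaded or exported.

```python
import os
from pathlib import Path

# File names taken from the project structure.
BACKCHANNEL_FILES = ("mmhmm.wav", "okay.wav", "yeah.wav", "i_see.wav", "right.wav")


def preflight(root=".", env=None):
    """Return a list of human-readable setup problems (empty if none)."""
    env = os.environ if env is None else env
    root = Path(root)
    problems = []

    if not (root / "models" / "silero_vad.onnx").is_file():
        problems.append("models/silero_vad.onnx is missing")

    missing = [
        name for name in BACKCHANNEL_FILES
        if not (root / "backchannels" / name).is_file()
    ]
    if missing:
        problems.append("missing backchannel audio: " + ", ".join(missing))

    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")

    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("-", issue)
    if not issues:
        print("Setup looks complete.")
```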
To modify the system:
- Adjust turn detection: Edit the scoring logic in turn_detector.py
- Change backchannel behavior: Modify the probability calculation in backchannel_trigger.py
- Add new backchannels: Add them to generate_backchannels.py and regenerate
- Tune parameters: Edit the values in the .env file
MIT License - See LICENSE file for details
- Silero VAD: https://github.com/snakers4/silero-vad
- OpenAI Whisper & TTS: https://openai.com/
- Google Gemini: https://deepmind.google/technologies/gemini/
- aiortc: https://github.com/aiortc/aiortc