Give your AI agent a voice.
Mod³ is a Python MCP server that provides text-to-speech for Claude Code, Cursor, and other MCP-compatible AI tools. It runs four TTS engines locally on Apple Silicon, generates speech faster than realtime, and returns immediately so the agent keeps working while audio plays.
- Non-blocking speech -- speak() returns immediately with a job ID. Audio plays in the background. The agent writes code while it talks.
- Queue-aware output -- Every speak() return includes queue position, estimated wait time, and active job state. The agent knows what's playing without making a separate status call.
- Barge-in detection -- VAD (voice activity detection) monitors the microphone. If the user starts talking, playback stops and the agent is notified. No talking over people.
- Turn-taking -- Bidirectional awareness of who's speaking. The agent can check user state before deciding to speak or wait.
- Multi-model routing -- Four TTS engines behind one interface. Voice name determines which engine handles the request.
- Adaptive buffering -- EMA-based arrival rate tracking with dynamic startup threshold. Gapless playback under normal load, graceful degradation under GPU contention.
- Structured metrics -- Every call returns TTFA, RTF, per-chunk timing, buffer health, underrun counts, and memory usage. The agent can diagnose its own audio quality.
| Engine | Model | Size | TTFA | Control Surfaces |
|---|---|---|---|---|
| Kokoro | Kokoro-82M-bf16 | 82M | ~60ms | Speed, emphasis (ALL CAPS), pacing (punctuation) |
| Voxtral | Voxtral-4B-TTS-mlx-4bit | 4B | ~500ms | 20 voice presets, multi-language |
| Chatterbox | chatterbox-4bit | ~1B | ~60ms | Emotion/exaggeration (0-1), voice cloning |
| Spark | Spark-TTS-0.5B-bf16 | 0.5B | ~1s | Pitch (5-level), speed, gender |
Models are downloaded on first use via HuggingFace Hub.
```bash
git clone https://github.com/cogos-dev/mod3.git
cd mod3
./setup.sh
```

Then add to your project's .mcp.json:
```json
{
  "mcpServers": {
    "mod3": {
      "command": "/path/to/mod3/.venv/bin/python",
      "args": ["/path/to/mod3/server.py"]
    }
  }
}
```

Synthesize text and play through speakers. Returns immediately with a job ID, queue state, and estimated wait time.
```
speak("Hello world")                                  → default voice (bm_lewis @ 1.25x)
speak("Hello world", voice="casual_male")             → Voxtral
speak("Hello world", voice="chatterbox", emotion=0.8) → Chatterbox with high emotion
speak("Hello world", voice="am_michael", speed=1.4)   → Kokoro fast
```
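As a rough illustration of the queue-aware return described above, a speak() result might carry fields like the following. The field names here are assumptions inferred from the feature list, not Mod³'s exact schema:

```python
# Hypothetical speak() return payload; field names are illustrative,
# not Mod³'s actual schema.
result = {
    "job_id": "job-42",        # handle for later status/interrupt calls
    "queue_position": 1,       # jobs ahead of this one
    "estimated_wait_s": 2.3,   # seconds until this job starts playing
    "active_job": "job-41",    # job currently playing, if any
}
```

Because all of this arrives in the speak() return itself, the agent can decide whether to keep talking or wait without a separate status round-trip.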
Check if speech is still playing, or get metrics from the last completed job. Pass verbose=True for per-chunk detail.
Interrupt current speech immediately.
Check microphone for voice activity. Returns whether the user is currently speaking, enabling the agent to wait for a natural pause before responding.
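A typical turn-taking pattern polls voice activity before speaking. The sketch below is a hedged client-side illustration: the tool names follow this README, but the `client.call` shape and the `speak_when_clear` helper are hypothetical, not part of Mod³:

```python
import time

def speak_when_clear(client, text, poll_s=0.25, timeout_s=10.0):
    """Wait for a pause in user speech, then speak (illustrative sketch).

    `client.call` is an assumed MCP-client method, not Mod3's API.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = client.call("check_voice_activity")
        if not state.get("user_speaking", False):
            # Non-blocking: returns immediately with a job ID.
            return client.call("speak", text=text)
        time.sleep(poll_s)
    return None  # user kept talking; fall back to text output
```

The timeout matters: if the user never pauses, the agent should degrade to text rather than interrupt.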
List all available voices grouped by engine, with control surface tags.
List audio output devices, or switch the active one mid-session.
Show loaded engines, active jobs, output device, and last generation metrics.
Two files:
- server.py -- MCP tool definitions, multi-model registry, sentence chunking, non-blocking job management, queue-aware returns
- adaptive_player.py -- Callback-based audio playback with EMA arrival rate tracking, adaptive startup threshold, and structured metrics collection
The adaptive player is model-agnostic. Any TTS engine that produces audio chunks feeds the same pipeline.
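The adaptive startup logic can be sketched roughly as follows. This is a minimal illustration of EMA arrival-rate tracking with a dynamic startup threshold; the class, parameter names, and constants are assumptions, not the actual adaptive_player.py code:

```python
import time

class AdaptiveBuffer:
    """Illustrative sketch: start playback only once buffered audio
    covers the predicted time to produce the next few chunks."""

    def __init__(self, alpha=0.2, safety=1.5, min_start_chunks=2):
        self.alpha = alpha                  # EMA smoothing factor
        self.safety = safety                # headroom multiplier
        self.min_start_chunks = min_start_chunks
        self.ema_interval = None            # smoothed chunk arrival interval (s)
        self.last_arrival = None
        self.buffered = []                  # durations (s) of queued chunks

    def on_chunk(self, duration_s, now=None):
        now = time.monotonic() if now is None else now
        if self.last_arrival is not None:
            interval = now - self.last_arrival
            if self.ema_interval is None:
                self.ema_interval = interval
            else:
                # Exponential moving average of inter-chunk arrival time.
                self.ema_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.ema_interval)
        self.last_arrival = now
        self.buffered.append(duration_s)

    def ready_to_start(self):
        if len(self.buffered) < self.min_start_chunks:
            return False
        if self.ema_interval is None:
            return True
        # Dynamic threshold: slower arrivals (e.g. GPU contention)
        # demand a deeper buffer before playback starts.
        need = self.ema_interval * self.safety * self.min_start_chunks
        return sum(self.buffered) >= need
```

When chunks arrive faster than realtime the threshold stays small and playback starts almost immediately; under contention the EMA interval grows, the buffer fills deeper before starting, and underruns are avoided at the cost of a later start.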
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- espeak-ng (brew install espeak-ng) -- required for Kokoro's phonemizer
See skills/voice/SKILL.md for the full guide on dual-modal communication -- when to speak vs write, non-blocking patterns, reading metrics, and anti-patterns.
Voice carries the ephemeral (context, intent, tone). Text carries the persistent (code, data, decisions). Both channels stay active simultaneously.
Mod³ is the voice layer in the CogOS ecosystem. It integrates as a modality channel -- the kernel routes intents to Mod³ when voice output is appropriate. Works standalone without CogOS.
| Repo | Purpose |
|---|---|
| cogos | The daemon |
| mod3 | Voice -- this repo |
| constellation | Distributed identity and trust |
| skills | Agent skill library |
| charts | Helm charts for deployment |
| desktop | macOS dashboard app |
MIT