Español · English
Desktop TTS app for streaming. Reads text with cloned voices using four engines: XTTSv2, Piper TTS, F5-TTS and Chatterbox TTS. Integrates with SAMMI and other systems via REST API and MCP.
- Four TTS engines — pick the right trade-off for each voice:
- XTTSv2 — multilingual voice cloning (17 languages), high quality, requires GPU
- Piper TTS — lightweight neural voices from a catalogue, no GPU needed
- F5-TTS — voice cloning from a 3-12 s WAV, optimised for English and Chinese (~3 GB download)
- Chatterbox — multilingual cloning (23 languages), very fast, includes imperceptible watermark
- Voice presets: combine a voice with speed, pitch, language and radio effect; save under a name
- Per-preset language: XTTS and Chatterbox respect the language set in the preset; F5-TTS works best for English/Chinese regardless of preset language
- Simple REST API: just voice + text — no technical parameters
- Saved phrases: library of texts attached to a preset; playable by name via API; saving with an existing name updates the phrase (upsert)
- Audio export: synthesize and download in WAV, MP3 or OGG; or grab the last played audio without re-synthesis
- Help tab: built-in workflow diagram (clone voice → preset → test/save/API)
- Radio effect: bandpass 400–3400 Hz + soft clipping + noise
- Test panel: pick a preset, listen and download the result
- Splash screen: animated progress bar during startup while the model loads
- Log viewer: activity log with auto-refresh, level/origin filters, and columns — caller (MCP/API/UI), voice preset, text preview and synthesis duration
- Priority queue: TTS requests are serialised through an asyncio.PriorityQueue; MCP/API calls take priority over UI and saved-phrase requests
- Webhook notifications: register HTTP endpoints to receive speak_end events with voice, text, caller and duration; manageable from the UI
- Verbose mode: toggle DEBUG logging + full tracebacks via the UI, API or MCP tool; also available at launch via MYVOICES_VERBOSE=1
- Diagnostic endpoint: /api/diagnostics (and MCP tool get_diagnostics) returns per-engine availability, import errors, and installed package versions
- Tests: 206 unit and integration tests (DB, utils, API CRUD, UI markup, MCP), runnable without GPU
- CI: GitHub Actions runs ruff + pytest on every push and PR
- MCP server (built-in): a Model Context Protocol endpoint mounted at /mcp/, toggleable from the UI, with Bearer-token auth. Lets an LLM (Claude Desktop, Claude Code, Cursor, Gemini CLI, ChatGPT…) list voices, speak text, and play saved phrases. A legacy mcp_server.py (stdio) is also shipped for clients that need it
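The priority-queue behaviour can be sketched with a plain asyncio.PriorityQueue. This is a simplified illustration, not the app's actual code; the caller names and priority values here are assumptions:

```python
import asyncio

# Lower value = served first; MCP/API jump ahead of UI and saved phrases.
PRIORITY = {"MCP": 0, "API": 0, "UI": 1, "PHRASE": 1}

async def drain(queue: asyncio.PriorityQueue) -> list[tuple[str, str]]:
    """Serve queued requests one at a time, highest priority first."""
    served = []
    while not queue.empty():
        _, _, caller, text = await queue.get()
        served.append((caller, text))  # the real worker would synthesize here
        queue.task_done()
    return served

async def main() -> list[tuple[str, str]]:
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for seq, (caller, text) in enumerate([("UI", "a"), ("MCP", "b"), ("UI", "c")]):
        # seq breaks ties so equal-priority requests stay FIFO
        queue.put_nowait((PRIORITY[caller], seq, caller, text))
    return await drain(queue)

order = asyncio.run(main())
print(order)  # [('MCP', 'b'), ('UI', 'a'), ('UI', 'c')]
```

The tuple ordering (priority, sequence) is what lets an MCP request queued last still play before earlier UI requests.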
| Engine | Voice cloning | Languages | GPU | Notes |
|---|---|---|---|---|
| XTTSv2 | WAV 10–30 s | 17 | Required for speed | Best multilingual quality |
| Piper | No (catalogue voices) | Per-model | Not required | Fastest, lowest VRAM |
| F5-TTS | WAV 3–12 s | EN/ZH best | ≥12 GB VRAM | English/Chinese; other languages may sound English-accented |
| Chatterbox | WAV 5+ s | 23 | 4–6 GB VRAM | Adds imperceptible watermark (Perth/Resemble AI) |
Required to compile native TTS dependencies.
- Download Visual Studio Build Tools
- Select "Desktop development with C++"
- Install and reboot if prompted
# 1. Clone the repository
git clone https://github.com/dataeschema/MyVoices.git
cd MyVoices
# 2. Create a virtual environment
python -m venv venv
venv\Scripts\activate
# 3. Install PyTorch with CUDA support (pick by GPU)
# RTX 50xx (Blackwell) — CUDA 12.8:
pip install --upgrade --force-reinstall torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
# RTX 40xx / 30xx / 20xx — CUDA 12.4:
pip install --upgrade --force-reinstall torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu124
# 4. Install the rest of the dependencies (XTTSv2 + Piper included)
pip install -r requirements.txt

F5-TTS and Chatterbox are not in requirements.txt because they have heavy optional dependencies. Install them only if you intend to use them:
# F5-TTS — requires >=1.1.20 to avoid pydantic conflict
pip install "f5-tts>=1.1.20"
# Chatterbox — install without deps to avoid torch version conflict
pip install chatterbox-tts --no-deps
pip install resemble-enhance  # audio enhancement used by Chatterbox

No GPU? Piper TTS works without a GPU. XTTSv2 is very slow on CPU. F5-TTS and Chatterbox require a CUDA GPU.
venv\Scripts\activate
python main.py

A splash screen with an animated progress bar shows up while the XTTSv2 model loads. Once it's ready, the main window opens.
The first run downloads the XTTSv2 model (~2 GB) — takes several minutes. F5-TTS (~3 GB) and Chatterbox (~1-2 GB) are downloaded on first use.
The web panel is also available at: http://localhost:8000
venv\Scripts\activate
pip install -r requirements-dev.txt # first time only
pytest --cov

206 tests across five suites (DB, utils, API CRUD, UI markup, MCP). No GPU and no downloaded models are required (the server boots in test mode without loading TTS).
build.bat is fully self-contained:
build.bat

The script:
- Verifies Python 3.10+
- Creates the venv if it doesn't exist
- Asks which GPU you have (menu 1/2/3) and picks the right CUDA build
- Installs PyTorch, the requirements.txt dependencies and PyInstaller automatically
- Builds with PyInstaller
The final executable lives in dist\MyVoices\MyVoices.exe.
See BUILD_GUIDE.md for details and troubleshooting.
POST http://localhost:8000/api/speak
Content-Type: application/json
{
"voice": "preset_name",
"text": "Hi chat, welcome to the stream!"
}
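The same call from Python, using only the standard library — a minimal sketch that assumes a MyVoices instance running on the default port:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/speak"  # default MyVoices address

def build_payload(voice: str, text: str) -> bytes:
    # The API needs only these two fields -- no technical parameters.
    return json.dumps({"voice": voice, "text": text}).encode("utf-8")

def speak(voice: str, text: str) -> int:
    """POST voice + text to the running app; returns the HTTP status code."""
    req = urllib.request.Request(
        API_URL,
        data=build_payload(voice, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Usage (with the app running):
#   speak("preset_name", "Hi chat, welcome to the stream!")
```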
The synthesized WAV is cached server-side so you can grab the exact audio that played:
GET http://localhost:8000/api/speak/last → returns the last WAV
POST http://localhost:8000/api/speak/download?format=mp3
Content-Type: application/json
{
"voice": "preset_name",
"text": "Text to synthesize"
}
format accepts wav (default), mp3 or ogg. WAV is passthrough; MP3/OGG
require ffmpeg on PATH (mp3 at 192 kbps, ogg via libvorbis).
To download the last played audio without re-synthesis:
GET http://localhost:8000/api/speak/last?format=mp3
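Both export routes can be driven from a short standard-library script. A sketch, assuming the default port and ffmpeg available on the server for mp3/ogg:

```python
import json
import urllib.request

BASE = "http://localhost:8000"
FORMATS = ("wav", "mp3", "ogg")  # wav is passthrough; mp3/ogg need ffmpeg

def export_url(fmt: str, last: bool = False) -> str:
    """Build the export URL: fresh synthesis, or the cached last audio."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}")
    path = "/api/speak/last" if last else "/api/speak/download"
    return f"{BASE}{path}?format={fmt}"

def download(voice: str, text: str, fmt: str = "mp3") -> bytes:
    """Synthesize text and return the encoded audio bytes."""
    req = urllib.request.Request(
        export_url(fmt),
        data=json.dumps({"voice": voice, "text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (with the app running):
#   open("clip.mp3", "wb").write(download("preset_name", "Text to synthesize"))
```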
POST http://localhost:8000/api/phrases/{name}/play
Register an HTTP endpoint to receive events when synthesis completes:
GET http://localhost:8000/api/webhooks → list registered webhooks
POST http://localhost:8000/api/webhooks → add a webhook
DELETE http://localhost:8000/api/webhooks/{id} → remove a webhook
POST http://localhost:8000/api/webhooks/test/{id} → fire a test event
Add a webhook:
POST /api/webhooks
{ "url": "https://your-server/hook", "events": "speak_end" }Payload sent on speak_end:
{
"event": "speak_end",
"job_id": "a1b2c3d4",
"voice": "preset_name",
"text": "first 120 chars of the text",
"caller": "MCP",
"duration_ms": 1240
}

caller is one of MCP, API or UI.
events can be speak_end or * (all events).
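A minimal receiver for these events, using only the standard library. This is a sketch; the port and the log format are arbitrary choices:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def format_event(event: dict) -> str:
    # Render a speak_end payload as a one-line log entry.
    return (f'[{event["caller"]}] {event["voice"]}: '
            f'{event["text"]!r} ({event["duration_ms"]} ms)')

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        if event.get("event") == "speak_end":
            print(format_event(event))
        self.send_response(200)  # MyVoices only needs a 2xx back
        self.end_headers()

# To listen on port 9000 (blocks), register http://<this-host>:9000/
# in the Webhooks panel and run:
#   HTTPServer(("0.0.0.0", 9000), HookHandler).serve_forever()
```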
MyVoices exposes a Model Context Protocol endpoint so an LLM (Claude, Cursor, Gemini, ChatGPT…) can list voices, speak text and trigger saved phrases via tool calls.
There are three transports — pick whichever fits best:
- Start MyVoices, go to the Main tab → 🤖 Servidor MCP card and click 📦 Descargar MyVoices.dxt.
- Drag the downloaded MyVoices.dxt onto Claude Desktop (or double-click it).
- When prompted, select the dist\MyVoices\ folder (the one that contains MyVoices.exe and mcp_server.exe).
Claude Desktop will launch mcp_server.exe automatically on each session. No token or manual JSON editing required. The mcp_server.exe is built by build.bat and ships inside dist\MyVoices\.
Dev mode: run python make_dxt.py to generate MyVoices.dxt without the full build. Install it the same way, but point to the project root — then configure ${user_config.myvoices_dir} to any folder containing a mcp_server.exe you have already compiled.
- Open MyVoices, go to the Main tab → 🤖 Servidor MCP card and flip the toggle.
- The card shows the URL (http://localhost:8000/mcp/) and a Bearer token (auto-generated on first activation).
- Open the Help tab, pick your client from the buttons, and copy the auto-rendered config snippet — URL, token and absolute paths are filled in for you.
The endpoint is gated by the toggle (returns 503 when off) and by Authorization: Bearer <token> (returns 401 on mismatch).
Run python mcp_server.py as a subprocess from your client config. Requires Python + the MyVoices venv. The app must be running.
| Tool | What it does |
|---|---|
| get_status | Server health: TTS engine, device, voice/preset counts |
| list_voices | Registered voices (all engines) |
| list_presets | Voice presets (voice + speed/pitch/lang/radio) |
| list_phrases | Saved phrases with their attached preset |
| speak(voice, text) | Synthesize and play text with the named preset |
| play_phrase(name) | Play a saved phrase by name |
| download_last_audio | Metadata for the last cached WAV |
| get_logs | Last N logs filtered by level, caller and substring |
| get_diagnostics | Full state: engines, import errors, package versions |
| load_model | Lazy-load a TTS engine (xtts/f5tts/chatterbox) |
| set_verbose | Toggle verbose mode (DEBUG + full tracebacks) |
| Client | Transport | Where to put the snippet |
|---|---|---|
| Claude Desktop | .dxt (recommended) | Drag MyVoices.dxt onto Claude Desktop |
| Claude Desktop | HTTP or stdio | %APPDATA%\Claude\claude_desktop_config.json |
| Claude Code (CLI) | HTTP | claude mcp add myvoices --transport http … |
| Cursor | HTTP | .cursor/mcp.json (project) or ~/.cursor/mcp.json (global) |
| Gemini CLI | HTTP | ~/.gemini/mcp.json |
| ChatGPT (Connectors) | HTTP | Settings → Connectors → Add MCP Server (plan-dependent) |
| Cline | stdio | Cline settings UI |
| Generic HTTP | HTTP | URL + Authorization: Bearer <token> header |
The Help tab inside MyVoices shows a copy-paste-ready snippet for each client, with URL, token and paths already substituted.
# Activate MCP from the UI first, then grab the token from the card.
TOKEN="<paste here>"
curl -X POST http://localhost:8000/mcp/ \
-H "Accept: application/json, text/event-stream" \
-H "Authorization: Bearer $TOKEN" \
-d '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"smoke","version":"1"}}}'

- XTTS2 tab → upload a reference WAV (10–30 s) → voice registered with an ID
- Piper tab → download a voice from the catalogue → registered automatically
- F5-TTS tab → upload a reference WAV (3–12 s) → voice registered (English/Chinese recommended)
- Chatterbox tab → upload a reference WAV (5+ s) → voice registered (23 languages)
- Main tab → pick a voice, tune speed/pitch/language/radio → save as preset
- Call the API with {"voice": "preset_name", "text": "..."} from SAMMI or any other system
- (Optional) Register webhooks in the Webhooks panel to receive speak_end events in external systems (OBS, Home Assistant, n8n…)
The Help tab inside the app contains the same workflow as a visual diagram.
When a TTS engine fails to load or you hit an opaque error:
# Enable verbose mode (DEBUG level + full tracebacks)
curl -X POST http://localhost:8000/api/verbose/true
# Or from an MCP client:
# tool: set_verbose(enabled=true)
# Inspect the full engine state and any import errors
curl http://localhost:8000/api/diagnostics
# MCP equivalent: tool: get_diagnostics
# Read the last 50 errors
curl 'http://localhost:8000/api/logs?level=ERROR&limit=50'

get_diagnostics returns per-engine: availability, status, and the import_error (with traceback) if the import failed. Also the installed versions of torch, transformers, TTS, f5-tts and chatterbox.
You can also enable verbose at launch via env var MYVOICES_VERBOSE=1.
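From Python, the same troubleshooting flow might look like this. A sketch: the exact JSON field names ("engines", "import_error") are assumptions based on the description above, not documented API contracts:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def get_diagnostics() -> dict:
    """Fetch the full diagnostics payload from the running app."""
    with urllib.request.urlopen(f"{BASE}/api/diagnostics") as resp:
        return json.loads(resp.read())

def failing_engines(diag: dict) -> list[str]:
    # Field names here are assumed; adjust to the actual response shape.
    return [
        name
        for name, info in diag.get("engines", {}).items()
        if info.get("import_error")
    ]

# Usage (with the app running):
#   print(failing_engines(get_diagnostics()))
```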
Everything persists across upgrades under %APPDATA%\MyVoices\:
%APPDATA%\MyVoices\
├── myvoices.db ← DB with voices, presets, phrases and logs
├── voices\ ← WAV files for cloned voices (XTTS, F5-TTS, Chatterbox)
└── piper_voices\ ← Piper models (.onnx + .onnx.json)
The XTTSv2 model is stored in:
%USERPROFILE%\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2\
F5-TTS and Chatterbox models are cached in the default Hugging Face cache
(%USERPROFILE%\.cache\huggingface\hub\).