A FastAPI-based TTS server using Microsoft's VibeVoice model for high-quality text-to-speech synthesis with voice cloning capabilities.
- High-quality text-to-speech synthesis
- Voice cloning from reference audio
- Support for multiple voice presets
- Voice conversion (change voice of existing audio)
- Docker support with RunPod compatibility
- RTX 5090 (Blackwell) GPU support
| Method | Endpoint | Description |
|---|---|---|
| GET | `/base_tts/` | TTS with default voice |
| GET | `/synthesize_speech/` | TTS with custom voice |
| POST | `/upload_audio/` | Upload reference audio for voice cloning |
| POST | `/change_voice/` | Voice conversion on existing audio |
```
GET /base_tts/?text=Hello%20world&speed=1.0
```

- `text` (required): Text to synthesize
- `speed` (optional, default=1.0): Speech speed (0.8-1.2)
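A minimal client sketch for this endpoint, standard library only. The base URL `http://localhost:7860` is an assumption taken from the Docker port mapping, and `base_tts_url` is an illustrative helper, not part of the server:

```python
# Build the /base_tts/ query URL (stdlib only). SERVER_URL is an
# assumption based on the `-p 7860:7860` Docker mapping.
from urllib.parse import urlencode
from urllib.request import urlopen

SERVER_URL = "http://localhost:7860"

def base_tts_url(text, speed=1.0):
    # speed outside 0.8-1.2 is clamped server-side
    return f"{SERVER_URL}/base_tts/?" + urlencode({"text": text, "speed": speed})

print(base_tts_url("Hello world"))
# http://localhost:7860/base_tts/?text=Hello+world&speed=1.0

# To fetch and save the audio (server must be running):
# with urlopen(base_tts_url("Hello world")) as resp, open("out.wav", "wb") as f:
#     f.write(resp.read())
```

`/synthesize_speech/` follows the same pattern with an extra `voice` parameter in the query string.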
```
GET /synthesize_speech/?text=Hello%20world&voice=my_voice&speed=1.0
```

- `text` (required): Text to synthesize
- `voice` (required): Voice label (must match uploaded audio)
- `speed` (optional, default=1.0): Speech speed (0.8-1.2)
Upload a reference audio file for voice cloning.
- `audio_file_label` (form): Label for the voice
- `file` (file): Audio file (wav, mp3, flac, ogg, max 5MB)
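A stdlib-only sketch of the `multipart/form-data` body this endpoint expects. The field names come from the parameter list above; the server URL and file names are assumptions:

```python
# Hedged sketch: encode one form field plus one file part as
# multipart/form-data using only the standard library.
import io
import uuid
from urllib.request import Request, urlopen

def multipart_body(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Encode plain form fields plus one file part; returns (boundary, body)."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n'.encode()
    )
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()

boundary, body = multipart_body(
    {"audio_file_label": "my_voice"}, "file", "reference.wav", b"RIFF..."
)
# To actually upload (server must be running):
# req = Request(
#     "http://localhost:7860/upload_audio/", data=body,
#     headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
# )
# print(urlopen(req).read())
```

`/change_voice/` takes the same request shape, with a `reference_speaker` form field instead of `audio_file_label`.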
Convert the voice of an existing audio file.
- `reference_speaker` (form): Voice label to use
- `file` (file): Audio file to convert
Run the pre-built image:
```
docker run -p 7860:7860 \
  -v /path/to/models:/workspace/models/vibevoice \
  --gpus all \
  valyriantech/vibevoice_server:latest
```

Models are automatically downloaded on first start. To persist models across container restarts, mount a volume at `/workspace/models/vibevoice`.
```
docker build -t vibevoice_server .
```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

- Clone and install VibeVoice:

  ```
  git clone https://github.com/vibevoice-community/VibeVoice.git
  cd VibeVoice
  pip install -e .
  ```

- Download models:

  ```
  ./install_models.sh
  ```

- Set environment variables:

  ```
  export VIBEVOICE_MODEL_PATH=/path/to/VibeVoice-Large
  export VIBEVOICE_TOKENIZER_PATH=/path/to/tokenizer
  ```

- Run the server:

  ```
  ./start.sh
  ```

| Variable | Default | Description |
|---|---|---|
| `VIBEVOICE_MODEL_PATH` | `/workspace/models/vibevoice/VibeVoice-Large` | Path to VibeVoice model |
| `VIBEVOICE_TOKENIZER_PATH` | `/workspace/models/vibevoice/tokenizer` | Path to Qwen tokenizer |
- VibeVoice-Large: ~18.7GB, requires ~20GB VRAM
- Tokenizer: Qwen2.5-1.5B tokenizer
For lower VRAM, consider using quantized models:
- VibeVoice-Large-Q8: ~12GB VRAM
- VibeVoice-Large-Q4: ~8GB VRAM
- Use clear audio with minimal background noise
- Recommended: 10-30 seconds of speech
- Audio is automatically resampled to 24kHz
- Speed parameter: Clamped to 0.8-1.2 range
- Voice cloning: Uses audio prefill for natural voice reproduction
- Voice conversion: Uses Whisper for transcription (installed by default)
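The reference-audio recommendations above can be sanity-checked locally before uploading. A stdlib-only sketch (the 10-30 s window and the 24 kHz resampling are taken from the notes above; `check_reference` and the file name are illustrative, and this only handles WAV, while the server also accepts mp3, flac, and ogg):

```python
# Check a reference clip against the recommendations above: 10-30 s of
# speech, any common sample rate (the server resamples to 24 kHz anyway).
import wave

def check_reference(path):
    """Return (duration_seconds, sample_rate, within_recommended_length)."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
        return duration, w.getframerate(), 10.0 <= duration <= 30.0

# Demo with a generated 15 s silent mono 16-bit clip at 24 kHz:
with wave.open("ref.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)  # 16-bit samples
    w.setframerate(24000)
    w.writeframes(b"\x00\x00" * 24000 * 15)

print(check_reference("ref.wav"))  # (15.0, 24000, True)
```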
MIT License (same as VibeVoice)