VibeVoice Server

A FastAPI-based TTS server using Microsoft's VibeVoice model for high-quality text-to-speech synthesis with voice cloning capabilities.

Features

  • High-quality text-to-speech synthesis
  • Voice cloning from reference audio
  • Support for multiple voice presets
  • Voice conversion (change voice of existing audio)
  • Docker support with RunPod compatibility
  • RTX 5090 (Blackwell) GPU support

API Endpoints

Method  Endpoint             Description
GET     /base_tts/           TTS with default voice
GET     /synthesize_speech/  TTS with custom voice
POST    /upload_audio/       Upload reference audio for voice cloning
POST    /change_voice/       Voice conversion on existing audio

Endpoint Details

GET /base_tts/

GET /base_tts/?text=Hello%20world&speed=1.0
  • text (required): Text to synthesize
  • speed (optional, default=1.0): Speech speed (0.8-1.2)

GET /synthesize_speech/

GET /synthesize_speech/?text=Hello%20world&voice=my_voice&speed=1.0
  • text (required): Text to synthesize
  • voice (required): Voice label (must match uploaded audio)
  • speed (optional, default=1.0): Speech speed (0.8-1.2)
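As a rough illustration of the two GET endpoints above, the sketch below builds the request URLs with Python's standard library. The base URL `http://localhost:7860` is an assumption taken from the port in the Docker example, and the helper names `build_tts_url` and `fetch_speech` are hypothetical. It also assumes the response body is the audio file itself, which this README does not state.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Assumption: the server is reachable on the port used in the Docker example.
BASE_URL = "http://localhost:7860"


def build_tts_url(text, voice=None, speed=1.0, base_url=BASE_URL):
    """Build a URL for /base_tts/, or /synthesize_speech/ when a voice is given."""
    params = {"text": text}
    if voice is not None:
        params["voice"] = voice
    params["speed"] = speed
    endpoint = "/synthesize_speech/" if voice is not None else "/base_tts/"
    # quote_via=quote encodes spaces as %20, matching the examples above.
    return f"{base_url}{endpoint}?{urlencode(params, quote_via=quote)}"


def fetch_speech(url, out_path):
    """Save the response body to out_path (assumes the body is audio data)."""
    with urlopen(url) as response, open(out_path, "wb") as out:
        out.write(response.read())


if __name__ == "__main__":
    print(build_tts_url("Hello world"))
    print(build_tts_url("Hello world", voice="my_voice", speed=1.1))
```

With a server running, `fetch_speech(build_tts_url("Hello world"), "hello.wav")` would write the result to disk. The output file extension here is a guess, since the response format is not documented in this README.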

POST /upload_audio/

Upload a reference audio file for voice cloning.

  • audio_file_label (form): Label for the voice
  • file (file): Audio file (wav, mp3, flac, ogg, max 5MB)
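The format and size limits above can be checked on the client before uploading. The following is a minimal sketch, not the server's own validation: `check_reference_audio` is a hypothetical helper, it only looks at the file extension rather than the actual audio content, and it assumes "5MB" means 5 × 1024 × 1024 bytes.

```python
import os

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}
# Assumption: the 5MB limit is interpreted as 5 MiB.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def check_reference_audio(path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue with a file at once.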

POST /change_voice/

Convert the voice of an existing audio file.

  • reference_speaker (form): Voice label to use
  • file (file): Audio file to convert
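Both POST endpoints take a form field plus a file, which suggests a multipart/form-data request. The sketch below builds such a request with only the standard library. The field names `audio_file_label`, `reference_speaker`, and `file` come from the descriptions above. The base URL, the helper names, and the generic `application/octet-stream` content type for the file part are assumptions.

```python
import uuid
from urllib.request import Request

# Assumption: the server is reachable on the port used in the Docker example.
BASE_URL = "http://localhost:7860"


def encode_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_post(endpoint, fields, filename, file_bytes, base_url=BASE_URL):
    """Return a urllib Request for /upload_audio/ or /change_voice/."""
    body, content_type = encode_multipart(fields, "file", filename, file_bytes)
    return Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

For example, `build_post("/upload_audio/", {"audio_file_label": "my_voice"}, "ref.wav", data)` prepares an upload, and `build_post("/change_voice/", {"reference_speaker": "my_voice"}, "input.wav", data)` prepares a conversion. Passing the result to `urllib.request.urlopen` sends it.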

Installation

Option 1: Docker (Recommended)

Run the pre-built image:

docker run -p 7860:7860 \
  -v /path/to/models:/workspace/models/vibevoice \
  --gpus all \
  valyriantech/vibevoice_server:latest

Models are automatically downloaded on first start. To persist models across container restarts, mount a volume to /workspace/models/vibevoice.

Building from source (optional)

docker build -t vibevoice_server .

Option 2: Local Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Clone and install VibeVoice:
git clone https://github.com/vibevoice-community/VibeVoice.git
cd VibeVoice
pip install -e .
  3. Download models:
./install_models.sh
  4. Set environment variables:
export VIBEVOICE_MODEL_PATH=/path/to/VibeVoice-Large
export VIBEVOICE_TOKENIZER_PATH=/path/to/tokenizer
  5. Run the server:
./start.sh

Environment Variables

Variable                  Default                                      Description
VIBEVOICE_MODEL_PATH      /workspace/models/vibevoice/VibeVoice-Large  Path to VibeVoice model
VIBEVOICE_TOKENIZER_PATH  /workspace/models/vibevoice/tokenizer        Path to Qwen tokenizer
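To illustrate how these variables and their defaults interact, the sketch below resolves both paths the way a script might. It is not the server's actual startup code, and `resolve_paths` is a hypothetical helper name.

```python
import os

# Defaults as listed in the table above.
DEFAULTS = {
    "VIBEVOICE_MODEL_PATH": "/workspace/models/vibevoice/VibeVoice-Large",
    "VIBEVOICE_TOKENIZER_PATH": "/workspace/models/vibevoice/tokenizer",
}


def resolve_paths(environ=None):
    """Return each path from the environment, falling back to its default."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, path in resolve_paths().items():
        status = "found" if os.path.exists(path) else "missing"
        print(f"{name}={path} ({status})")
```

Running this before starting the server is a quick way to see which path a misconfigured environment would actually use.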

Model Requirements

  • VibeVoice-Large: ~18.7GB, requires ~20GB VRAM
  • Tokenizer: Qwen2.5-1.5B tokenizer

For lower VRAM, consider using quantized models:

  • VibeVoice-Large-Q8: ~12GB VRAM
  • VibeVoice-Large-Q4: ~8GB VRAM
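The VRAM figures above can be turned into a simple selection rule. The sketch below picks the largest variant that fits a given amount of VRAM. It treats the approximate figures as hard thresholds, which is a simplification, and `pick_variant` is a hypothetical helper.

```python
# Approximate VRAM needs in GB, from the figures above, largest model first.
VARIANTS = [
    ("VibeVoice-Large", 20),
    ("VibeVoice-Large-Q8", 12),
    ("VibeVoice-Large-Q4", 8),
]


def pick_variant(vram_gb):
    """Return the largest variant that fits in vram_gb, or None if none fits."""
    for name, needed_gb in VARIANTS:
        if vram_gb >= needed_gb:
            return name
    return None


if __name__ == "__main__":
    for vram in (24, 16, 8, 6):
        print(vram, "GB ->", pick_variant(vram))
```

In practice some headroom is needed beyond the model weights, so a card sitting exactly on a threshold may still run out of memory.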

Voice Cloning Tips

  • Use clear audio with minimal background noise
  • Recommended: 10-30 seconds of speech
  • Audio is automatically resampled to 24kHz
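The recommended clip length can be checked before uploading. The sketch below uses Python's built-in `wave` module, so it only handles uncompressed WAV files. The helper names are hypothetical, and the 10 and 30 second bounds are the recommendation above, not a limit the server is known to enforce.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_advice(seconds, low=10.0, high=30.0):
    """Compare a clip length against the recommended 10-30 second range."""
    if seconds < low:
        return "shorter than recommended"
    if seconds > high:
        return "longer than recommended"
    return "ok"
```

No client-side resampling is needed, because the server resamples to 24kHz itself.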

Notes

  • Speed parameter: Clamped to 0.8-1.2 range
  • Voice cloning: Uses audio prefill for natural voice reproduction
  • Voice conversion: Uses Whisper for transcription (installed by default)
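For the speed note above, clamping presumably means that out-of-range values are pulled to the nearest bound rather than rejected. A minimal sketch of that behavior, under that assumption and with a hypothetical function name:

```python
def clamp_speed(speed, low=0.8, high=1.2):
    """Pull speed into the [low, high] range, leaving in-range values unchanged."""
    return max(low, min(high, speed))


if __name__ == "__main__":
    for requested in (0.5, 1.0, 2.0):
        print(requested, "->", clamp_speed(requested))
```

Under this reading, a request with `speed=2.0` would be synthesized at 1.2 instead of returning an error.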

License

MIT License (same as VibeVoice)
