ZONOS2 is our latest text-to-speech model trained on more than 6 million hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers at low latency with MoE. ZONOS2 excels at high-fidelity and naturalistic voice cloning.
During inference we use nemo TN normalized UTF-8 bytes and an ECAPA-TDNN embedding to generate DAC tokens with our MoE backbone. An inference overview can be seen below.
Language support is as follows.
| Tier | Languages |
|---|---|
| Tier 1 | English, Mandarin Chinese, Japanese |
| Tier 2 | Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch |
| Tier 3 | Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian |
For local inference we provide a high-performance TTS inference server built on Mini-SGLang.
For more details and speech samples, check out our blog.
We also have a hosted version available at cloud.zyphra.com/audio-playground.
Platform Support: Linux only (x86_64). Requires NVIDIA GPU with CUDA toolkit matching your driver version (
nvidia-smito check).
Requires uv.
git clone https://github.com/Zyphra/Zonos2.git
cd Zonos2
uv syncuv run python -m zonos2 --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/uv run always uses the project environment, so no venv activation is needed.
The server starts on http://localhost:1919 by default. TTS mode is auto-detected for zonos2 models.
--tts-default-voices-dir <folder> pre-populates the web UI with voice-clone
speakers from disk; the folder is scanned recursively for speaker audio
(.wav, .mp3, .flac, .m4a, .ogg, .opus, .aac, .webm) and saved
embeddings (.npy, .npz). The newest voice is selected automatically on
startup.
curl:
curl -X POST http://localhost:1919/tts/generate \
-H "Content-Type: application/json" \
-d '{"text": "Hello world", "stream": true}' \
--output output.pcm
# Convert to WAV
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wavWeb UI: Open http://localhost:1919/ in your browser.
You can also run the engine directly in a Python script, without starting a
server, via TTSLLM:
from zonos2.message import TTSSamplingParams
from zonos2.tts import TTSLLM
tts = TTSLLM(model_path="Zyphra/ZONOS2")
results = tts.generate(
["Hello from the offline Python API.", "Batched prompts work too."],
TTSSamplingParams(seed=42),
)
for i, result in enumerate(results):
print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
tts.save_audio(result["audio"], f"output_{i}.wav")Full-featured TTS endpoint with streaming support.
Request body:
| Parameter | Type | Default | Description |
|---|---|---|---|
text |
string | required | Text to synthesize |
language |
string | en_us |
Text-normalization language: en_us, en_gb, fr_fr, de, es, it, pt_br, ja, cmn, ko |
text_normalization |
bool | true |
Verbalize numbers, dates and currency before synthesis (false = raw byte tokenization) |
temperature |
float | 1.15 |
Sampling temperature |
topk |
int | 106 |
Top-k sampling |
top_p |
float | 0.0 |
Nucleus (top-p) sampling threshold; 0 disables |
min_p |
float | 0.18 |
Min-p probability filter; 0 disables |
max_tokens |
int | null | model max | Maximum audio tokens. Omit or set null to use the model context limit; long prompts are clamped to remaining context. |
fade_out_ms |
float | 0.0 |
Cosine fade-out applied to the audio tail; 0 disables |
repetition_window |
int | 50 |
Recent generated frames to check per codebook; 0 disables |
repetition_penalty |
float | 1.2 |
Per-codebook repetition penalty strength; 1.0 disables |
repetition_codebooks |
int | 8 |
Number of codebooks from CB0 upward to penalize; negative means all |
seed |
int | null | null |
Random seed for reproducibility |
speaking_rate_enabled |
bool | false |
Set true to use model speaking-rate conditioning when another speaking-rate field is present |
speaking_rate_bucket |
int | null | null |
Exact model speaking-rate bucket to prepend before text |
speaking_rate |
float | null | null |
Target speaking rate in cleaned UTF-8 bytes per second; mapped to a bucket |
speed |
float | null | null |
OpenAI-style multiplier; 1.0 maps to the model's neutral speaking-rate bucket |
quality_enabled |
bool | true |
Enable quality-bin conditioning on supported models |
quality_buckets |
object | list | null | {"trailing_silence_s": 3} |
Per-feature quality bucket indices (keyed by feature name, or a list in feature order) |
quality_values |
object | list | null | null |
Raw quality metric values, mapped to buckets server-side (alternative to quality_buckets) |
clean_speaker_background |
bool | false |
Mark the reference voice as having a clean background (supported models) |
accurate_mode |
bool | true |
true = accurate mode (closer voice match), false = expressive mode |
stream |
bool | true |
Stream audio chunks |
Response: Raw PCM audio (audio/pcm, float32, 44.1 kHz, mono). Headers include X-Audio-Sample-Rate, X-Audio-Channels, X-Audio-Format.
OpenAI-compatible endpoint.
Request body:
{
"model": "zonos2",
"input": "Hello world",
"voice": "alloy",
"response_format": "pcm"
}For speaking-rate-enabled checkpoints, set speaking_rate_enabled to true
and use speaking_rate_bucket for exact bucket control, speaking_rate for
bytes-per-second control, or speed for OpenAI-style multiplier control.
If you find this model useful in an academic context please cite as:
@misc{zyphra2025zonos,
title = {Zonos V2 Technical Report},
author = {Gabriel Clark, Sofian Mejjoute, Mohamed Osman, George Close, Beren Millidge},
year = {2026},
}
