Skip to content

Zyphra/ZONOS2

Repository files navigation

ZONOS2

ZONOS2 title card


ZONOS2 is our latest text-to-speech model trained on more than 6 million hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers at low latency with MoE. ZONOS2 excels at high-fidelity and naturalistic voice cloning.

During inference we use nemo TN normalized UTF-8 bytes and an ECAPA-TDNN embedding to generate DAC tokens with our MoE backbone. An inference overview can be seen below.

ZONOS2 title card

Language support is as follows.

Tier Languages
Tier 1 English, Mandarin Chinese, Japanese
Tier 2 Korean, Russian, Italian, Portuguese, French, Spanish, Vietnamese, German, Hebrew, Dutch
Tier 3 Swedish, Hindi, Tamil, Telugu, Thai, Norwegian, Bengali, Tagalog, Arabic, Danish, Indonesian, Polish, Ukrainian, Romanian, Finnish, Hungarian, Lithuanian, Estonian, Slovak, Croatian, Latvian

For local inference we provide a high-performance TTS inference server built on Mini-SGLang.

For more details and speech samples, check out our blog.

We also have a hosted version available at cloud.zyphra.com/audio-playground.


Quick Start

Platform Support: Linux only (x86_64). Requires NVIDIA GPU with CUDA toolkit matching your driver version (nvidia-smi to check).

1. Installation

Requires uv.

git clone https://github.com/Zyphra/Zonos2.git
cd Zonos2
uv sync

2. Launch the TTS Server

uv run python -m zonos2 --model-path Zyphra/ZONOS2 --tts-default-voices-dir ./default_voices/

uv run always uses the project environment, so no venv activation is needed.

The server starts on http://localhost:1919 by default. TTS mode is auto-detected for zonos2 models. --tts-default-voices-dir <folder> pre-populates the web UI with voice-clone speakers from disk; the folder is scanned recursively for speaker audio (.wav, .mp3, .flac, .m4a, .ogg, .opus, .aac, .webm) and saved embeddings (.npy, .npz). The newest voice is selected automatically on startup.

3. Generate Speech

curl:

curl -X POST http://localhost:1919/tts/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "stream": true}' \
  --output output.pcm

# Convert to WAV
ffmpeg -f f32le -ar 44100 -ac 1 -i output.pcm output.wav

Web UI: Open http://localhost:1919/ in your browser.

Python API (offline inference)

You can also run the engine directly in a Python script, without starting a server, via TTSLLM:

from zonos2.message import TTSSamplingParams
from zonos2.tts import TTSLLM

tts = TTSLLM(model_path="Zyphra/ZONOS2")

results = tts.generate(
    ["Hello from the offline Python API.", "Batched prompts work too."],
    TTSSamplingParams(seed=42),
)

for i, result in enumerate(results):
    print(f"frames={len(result['audio_tokens'])}, eos_frame={result['eos_frame']}")
    tts.save_audio(result["audio"], f"output_{i}.wav")

API Reference

POST /tts/generate

Full-featured TTS endpoint with streaming support.

Request body:

Parameter Type Default Description
text string required Text to synthesize
language string en_us Text-normalization language: en_us, en_gb, fr_fr, de, es, it, pt_br, ja, cmn, ko
text_normalization bool true Verbalize numbers, dates and currency before synthesis (false = raw byte tokenization)
temperature float 1.15 Sampling temperature
topk int 106 Top-k sampling
top_p float 0.0 Nucleus (top-p) sampling threshold; 0 disables
min_p float 0.18 Min-p probability filter; 0 disables
max_tokens int | null model max Maximum audio tokens. Omit or set null to use the model context limit; long prompts are clamped to remaining context.
fade_out_ms float 0.0 Cosine fade-out applied to the audio tail; 0 disables
repetition_window int 50 Recent generated frames to check per codebook; 0 disables
repetition_penalty float 1.2 Per-codebook repetition penalty strength; 1.0 disables
repetition_codebooks int 8 Number of codebooks from CB0 upward to penalize; negative means all
seed int | null null Random seed for reproducibility
speaking_rate_enabled bool false Set true to use model speaking-rate conditioning when another speaking-rate field is present
speaking_rate_bucket int | null null Exact model speaking-rate bucket to prepend before text
speaking_rate float | null null Target speaking rate in cleaned UTF-8 bytes per second; mapped to a bucket
speed float | null null OpenAI-style multiplier; 1.0 maps to the model's neutral speaking-rate bucket
quality_enabled bool true Enable quality-bin conditioning on supported models
quality_buckets object | list | null {"trailing_silence_s": 3} Per-feature quality bucket indices (keyed by feature name, or a list in feature order)
quality_values object | list | null null Raw quality metric values, mapped to buckets server-side (alternative to quality_buckets)
clean_speaker_background bool false Mark the reference voice as having a clean background (supported models)
accurate_mode bool true true = accurate mode (closer voice match), false = expressive mode
stream bool true Stream audio chunks

Response: Raw PCM audio (audio/pcm, float32, 44.1 kHz, mono). Headers include X-Audio-Sample-Rate, X-Audio-Channels, X-Audio-Format.

POST /v1/audio/speech

OpenAI-compatible endpoint.

Request body:

{
  "model": "zonos2",
  "input": "Hello world",
  "voice": "alloy",
  "response_format": "pcm"
}

For speaking-rate-enabled checkpoints, set speaking_rate_enabled to true and use speaking_rate_bucket for exact bucket control, speaking_rate for bytes-per-second control, or speed for OpenAI-style multiplier control.

Citation

If you find this model useful in an academic context please cite as:

@misc{zyphra2025zonos,
  title     = {Zonos V2 Technical Report},
  author    = {Gabriel Clark, Sofian Mejjoute, Mohamed Osman, George Close, Beren Millidge},
  year      = {2026},
}

About

Zonos2 is a leading open-weight text-to-speech MoE.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors