Add Qwen3 TTS architecture support #20752
Add GGUF support for the Qwen3-TTS model family:

- **gguf-py**: define `QWEN3TTS` and `QWEN3TTS_CP` architectures, TTS-specific KV keys (`text_vocab_size`, `text_embedding_length`, `num_code_groups`, `position_id_per_seconds`), and tensor name mappings
- **convert_hf_to_gguf.py**: add `Qwen3TTSTalkerModel` (28-layer Talker with interleaved MRoPE) and `Qwen3TTSCodePredictorModel` (5-layer Code Predictor with standard RoPE), including speaker-encoder tensor remapping
- **tools/tts/convert_qwen3tts.py**: wrapper script that converts both the Talker and the Code Predictor to GGUF from a single HF model directory
- **tools/tts/convert_qwen3tts_tokenizer.py**: tokenizer conversion helper
Add C++ inference support for the Qwen3-TTS Talker and Code Predictor:

- **llama-arch**: register `LLM_ARCH_QWEN3TTS` (Talker, 28 layers with MRoPE) and `LLM_ARCH_QWEN3TTS_CP` (Code Predictor, 5 layers with standard RoPE), including KV keys, tensor names, and tensor info entries
- **llama-hparams**: add TTS-specific fields (`text_vocab_size`, `text_embd_size`, `num_code_groups`, `position_id_per_s`)
- **llama-model**: implement `load_hparams` and `load_tensors` for both architectures, with TTS tensor pointers (text/codec embeddings, projection layers, per-codebook heads)
- **llama-graph**: guard `build_inp_embd` against a null `tok_embd` to support the embedding-only input path used by TTS
- **models/qwen3tts.cpp**: `llm_build_qwen3tts` graph builder with text projection, codec embedding, and MRoPE attention
- **models/qwen3tts_cp.cpp**: `llm_build_qwen3tts_cp` graph builder for the Code Predictor sub-model
Add the end-user-facing components for Qwen3-TTS:

- **tools/tts/qwen3tts.cpp**: main CLI tool implementing the full TTS pipeline: text tokenization, Talker prefill/decode with MRoPE, Code Predictor for multi-codebook generation, and vocoder (DAC-based decoder with VQ, strided ConvTranspose1d, and Snake activations)
- **tools/tts/speaker-encoder**: standalone ECAPA-TDNN speaker encoder for voice cloning (extracts speaker embeddings from reference audio)
- **tools/tts/CMakeLists.txt**: build targets for `llama-qwen3tts` and `speaker-encoder`
- **examples/qwen3-tts/**: Python usage examples (basic TTS, voice cloning, multilingual, Gradio app, benchmarking)
Hi @Acceldium, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Note that this branch adds initial support for running Qwen3-TTS models in llama.cpp. Qwen3-TTS uses a multi-stage pipeline (language model + audio decoder/tokenizer) that requires executing multiple independent compute graphs in sequence — a pattern llama.cpp does not currently support natively.