# Voice Cloning with Qwen3-TTS

Clone any voice from a short reference clip using [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) (0.6B model, faster than real-time on GPU).

Provide a 10–30 second WAV file and a transcript of what's said in it, then type any text to generate new speech in that voice.

> **⚠️ Important:** Only use reference audio you own or have explicit rights to clone. This notebook is for personal, non-commercial use.

**Before running:** Go to **Runtime → Change runtime type → T4 GPU** to enable GPU acceleration.

## Setup

In [None]:
# Install Qwen3-TTS — see https://github.com/QwenLM/Qwen3-TTS for the latest instructions
!pip install -q git+https://github.com/QwenLM/Qwen3-TTS.git soundfile

In [None]:
import torch
import soundfile as sf
from IPython.display import Audio, display

if not torch.cuda.is_available():
    print("⚠️  No GPU detected. Go to Runtime → Change runtime type → T4 GPU.")
    print("    Generation will be extremely slow on CPU.")
    device = "cpu"
    dtype = torch.float32
else:
    device = "cuda:0"
    # bfloat16 is supported on Ampere and newer (compute capability >= 8.0, e.g. A100, L4)
    # T4 (capability 7.5) doesn't support bfloat16 — use float32 instead of float16,
    # since float16's limited range can cause NaN logits in the acoustic code predictor.
    dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] >= 8 else torch.float32
    print(f"✓ GPU: {torch.cuda.get_device_name(0)}")
    print(f"  dtype: {dtype}")

## Step 1: Upload your reference audio

Upload a short WAV clip (10–30 seconds) of the voice you want to clone. Clean speech with minimal background noise works best.

In [None]:
from google.colab import files

print("Select a WAV file to upload...")
uploaded = files.upload()

if not uploaded:
    raise ValueError("No file uploaded.")

# files.upload() returns a dict keyed by filename; if several files were selected, use the first.
ref_audio_path = next(iter(uploaded))
print(f"\nReference clip: {ref_audio_path}")
display(Audio(ref_audio_path))
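The 10–30 second guideline can be checked before moving on. The sketch below is a minimal example using only the standard-library `wave` module, which reads uncompressed PCM WAV files; the helper names `wav_duration_seconds` and `check_reference_clip` are hypothetical and not part of Qwen3-TTS. For other WAV encodings, `soundfile.info(path).duration` is a possible alternative.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def check_reference_clip(path, min_seconds=10.0, max_seconds=30.0):
    """Return (duration, ok); ok is True when the clip is within the recommended range."""
    duration = wav_duration_seconds(path)
    return duration, min_seconds <= duration <= max_seconds
```

After uploading, `duration, ok = check_reference_clip(ref_audio_path)` reports whether the clip falls inside the recommended range; a clip outside it may still work, so this is a warning rather than a hard requirement.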

## Step 2: Set reference transcript and text to generate

- **Reference transcript**: Type exactly what is said in your reference clip. A verbatim transcript gives significantly better cloning quality than providing no transcript.
- **Text to generate**: What you want the cloned voice to say.

In [None]:
ref_text = "Type the exact words spoken in your reference clip here." # @param {type:"string"}
text_to_generate = "Hello. I am functioning within normal parameters." # @param {type:"string"}
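Because the form ships with placeholder text, it is easy to run generation with the transcript left unfilled. The following is a small sketch of a guard against that; `validate_inputs` and `PLACEHOLDER_REF_TEXT` are hypothetical names introduced here for illustration, and the placeholder string is copied from the cell above.

```python
# Placeholder shipped in the form field above (copied verbatim).
PLACEHOLDER_REF_TEXT = "Type the exact words spoken in your reference clip here."


def validate_inputs(ref_text, text_to_generate):
    """Return both strings stripped, or raise ValueError if either is unusable."""
    ref_text = ref_text.strip()
    text_to_generate = text_to_generate.strip()
    if not ref_text:
        raise ValueError("ref_text is empty; type the words spoken in the reference clip.")
    if ref_text == PLACEHOLDER_REF_TEXT:
        raise ValueError("ref_text still holds the placeholder; replace it with the real transcript.")
    if not text_to_generate:
        raise ValueError("text_to_generate is empty; enter the text the cloned voice should say.")
    return ref_text, text_to_generate
```

Calling `ref_text, text_to_generate = validate_inputs(ref_text, text_to_generate)` at the end of this cell would stop the notebook early instead of producing a poor clone from the wrong transcript.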

## Step 3: Load the model

Downloads ~620 MB of weights on first run. Later runs in the same Colab session reuse the cached weights; a fresh runtime downloads them again.

In [None]:
from qwen_tts import Qwen3TTSModel

model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map=device,
    dtype=dtype,
)
print("✓ Model loaded")

## Step 4: Generate speech

Run this cell to generate. To try different text, update `text_to_generate` in Step 2, re-run that cell so the variable picks up the new value, then re-run this cell. There is no need to reload the model.

In [None]:
wavs, sr = model.generate_voice_clone(
    text=text_to_generate,
    language="English",
    ref_audio=ref_audio_path,
    ref_text=ref_text,
)

output_path = "cloned_output.wav"
sf.write(output_path, wavs[0], sr)
print(f"Saved: {output_path}")
display(Audio(output_path))
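To see whether generation is faster than real time on your runtime, you can compare wall-clock time against the length of the generated audio. This is a minimal sketch; `timed_call` and `real_time_factor` are hypothetical helpers, not part of Qwen3-TTS, and it assumes `wavs[0]` is a one-dimensional array of samples, as the `sf.write` call above suggests.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) measured with a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def real_time_factor(num_samples, sample_rate, elapsed_seconds):
    """Return elapsed time divided by audio duration; values below 1.0 are faster than real time."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    return elapsed_seconds / audio_seconds


# Usage in this notebook (assumes the variables defined in the cells above):
# (wavs, sr), elapsed = timed_call(
#     model.generate_voice_clone,
#     text=text_to_generate, language="English",
#     ref_audio=ref_audio_path, ref_text=ref_text,
# )
# print(f"RTF: {real_time_factor(len(wavs[0]), sr, elapsed):.2f}")
```

The first call after loading the model may include one-time warm-up cost, so a second run usually gives a more representative figure.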

## Tips

- **Transcript accuracy matters**: A verbatim transcript significantly outperforms x_vector-only mode (no transcript).
- **Clip quality matters**: Clean speech with no background noise or music gives the best results. The [`data_extractor.py`](https://github.com/esherma/CharacterVoiceCloning/blob/main/data_extractor.py) script in this repo automates downloading and filtering clips from YouTube.
- **Generating multiple lines**: Use `model.create_voice_clone_prompt()` once and pass the result as `voice_clone_prompt=` to avoid re-encoding the reference audio on every call. See the [Qwen3-TTS docs](https://github.com/QwenLM/Qwen3-TTS) for details.
- **Download output**: Run `files.download('cloned_output.wav')` to save the result locally.
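
The "Generating multiple lines" tip can be sketched as follows. The method name `create_voice_clone_prompt` and the `voice_clone_prompt=` keyword come from the tip itself; the assumption that `create_voice_clone_prompt` accepts the same `ref_audio` and `ref_text` keywords as `generate_voice_clone` is not confirmed here, so check the Qwen3-TTS docs before relying on it. `generate_lines` is a hypothetical helper.

```python
def generate_lines(model, lines, ref_audio, ref_text, language="English"):
    """Encode the reference clip once, then generate one waveform per line of text.

    Returns (waveforms, sample_rate). sample_rate is None if lines is empty.
    """
    # Assumption: keyword names mirror generate_voice_clone; verify against the docs.
    prompt = model.create_voice_clone_prompt(ref_audio=ref_audio, ref_text=ref_text)

    waveforms = []
    sample_rate = None
    for line in lines:
        wavs, sample_rate = model.generate_voice_clone(
            text=line,
            language=language,
            voice_clone_prompt=prompt,  # reuse the encoded reference instead of re-encoding
        )
        waveforms.append(wavs[0])
    return waveforms, sample_rate
```

In this notebook it would be called as `waveforms, sr = generate_lines(model, ["First line.", "Second line."], ref_audio_path, ref_text)`, with each waveform then written out via `sf.write`.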

---

Built with [Qwen3-TTS](https://github.com/QwenLM/Qwen3-TTS) by Alibaba Qwen. See [CharacterVoiceCloning](https://github.com/esherma/CharacterVoiceCloning) for the full pipeline including YouTube extraction, clip filtering, and Claude Code hook integration.