# Zonos TTS Model - For Google Colab Notebook
## MUST USE A100 GPU
This notebook clones the [Zonos repository](https://github.com/Zyphra/Zonos), installs the required system and Python dependencies, and then runs a sample text-to-speech generation. The generated audio is saved as `sample.wav` and is also playable directly from the notebook.
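Since the notebook expects an A100, it can help to confirm which GPU the runtime actually received before installing anything. The following is a minimal sketch, not part of the Zonos repository: the helper names `gpu_name` and `is_a100` are hypothetical, and it assumes `nvidia-smi` is on the `PATH`, as it normally is on Colab GPU runtimes.

```python
import shutil
import subprocess


def gpu_name():
    """Return the name of the first visible NVIDIA GPU, or None if there is none."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools available (e.g. CPU-only runtime)
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    names = result.stdout.strip().splitlines()
    return names[0].strip() if names else None


def is_a100(name):
    """Hypothetical check: treat any device whose name contains 'A100' as suitable."""
    return name is not None and "A100" in name


name = gpu_name()
if is_a100(name):
    print(f"OK: running on {name}")
else:
    print(f"Warning: expected an A100 GPU, found {name!r}")
```

If the warning appears, switch the runtime type in Colab before continuing.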

In [None]:
# Update package lists and install eSpeak NG (required for phonemization)
!apt update && apt install -y espeak-ng

In [None]:
# Clone the Zonos repository from GitHub
!git clone https://github.com/Zyphra/Zonos.git
%cd Zonos

In [None]:
# Install uv (recommended in the README), then install Zonos and its optional compile extras with pip
!pip install -U uv
!pip install -e .
!pip install --no-build-isolation -e .[compile]

In [None]:
!pip install numpy==1.24.4
!pip install scipy==1.13.3
!pip install scikit-learn==1.6.1
!pip install triton

# IMPORTANT: restart the session (Runtime > Restart session) before running the cells below,
# so that the pinned package versions are the ones actually loaded.
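After the restart, a quick check that the pinned versions are the ones actually installed can save debugging time later. This is a minimal sketch using only the standard library's `importlib.metadata`; the helper name `check_pins` is hypothetical, and the `PINS` dictionary simply mirrors the versions installed in the cell above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip install cell above.
PINS = {"numpy": "1.24.4", "scipy": "1.13.3", "scikit-learn": "1.6.1"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


problems = check_pins(PINS)
for name, (expected, installed) in problems.items():
    print(f"{name}: expected {expected}, found {installed}")
if not problems:
    print("All pinned versions are installed.")
```

An empty result only confirms what is installed on disk; the session restart is still needed so that already-imported modules are reloaded.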

## Running the TTS Sample

The code below loads an example speaker sample, generates speech from the text **"Hello, world!"**, and saves the generated output to a `.wav` file.

In [None]:
import torch
import torchaudio

from zonos.model import Zonos
from zonos.conditioning import make_cond_dict

# Load the pretrained Zonos model (using the transformer model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)

# Load the speaker reference audio (upload your sample to the Colab file system and update the path)
wav, sampling_rate = torchaudio.load("/content/Replace_With_Original_Example.mp3")  # <-- IMPORTANT: point this at your original speaker sample

# Generate the speaker embedding from the example audio
speaker = model.make_speaker_embedding(wav, sampling_rate)

# Create the conditioning dictionary with the desired text
cond_dict = make_cond_dict(text="Hello, world!", speaker=speaker, language="en-us")
conditioning = model.prepare_conditioning(cond_dict)

# Generate the audio codes from the model
codes = model.generate(conditioning)

# Decode the generated codes to waveform(s)
wavs = model.autoencoder.decode(codes).cpu()

# Save the first generated waveform to a .wav file
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

print("Audio saved to sample.wav")
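Because the speaker path in the cell above is a placeholder, it may be worth validating it before calling `torchaudio.load`, so that a missing upload fails with a clear message. The following is a sketch under stated assumptions: `require_speaker_sample` is a hypothetical helper, and the accepted suffix list is an illustrative guess, not a statement of what torchaudio or Zonos supports.

```python
from pathlib import Path

# Assumption: a small set of common audio extensions, for illustration only.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac", ".ogg"}


def require_speaker_sample(path):
    """Return the path as a Path object, or raise if the file is missing or looks wrong."""
    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Speaker sample not found at {candidate}. "
            "Upload the audio file to Colab and update the path."
        )
    if candidate.suffix.lower() not in AUDIO_SUFFIXES:
        raise ValueError(f"Unexpected audio file extension: {candidate.suffix!r}")
    return candidate
```

A typical use would be to pass `str(require_speaker_sample("/content/Replace_With_Original_Example.mp3"))` to `torchaudio.load` in place of the bare string.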

In [None]:
# Play the generated audio directly in the notebook
from IPython.display import Audio
Audio("sample.wav")