# VITS TTS - Fast Text-to-Speech

This notebook demonstrates VITS text-to-speech with the `facebook/mms-tts-eng` checkpoint, a lightweight model that can run on the Apple Silicon GPU through PyTorch's MPS backend.

**Features:**
- Fast generation (~10 seconds for a ~5.6-second clip in the run recorded here)
- Apple Silicon MPS GPU acceleration
- Lightweight model (40M parameters)
- Good quality (though less natural than Bark)

In [1]:
# Load the VITS TTS model and use the Apple Silicon GPU (MPS) when available
from transformers import VitsModel, VitsTokenizer
import torch
import time

# Load VITS model (lightweight, Apple Silicon friendly)
model_name = "facebook/mms-tts-eng"  # English VITS model
tokenizer = VitsTokenizer.from_pretrained(model_name)
model = VitsModel.from_pretrained(model_name)

# Move to MPS if available
device = "mps" if torch.backends.mps.is_available() else "cpu"
model = model.to(device)

print(f"VITS model loaded on {device}")
print("Model size: ~40M parameters (much smaller than Bark)")  # no placeholders, so no f-string needed

VITS model loaded on mps
Model size: ~40M parameters (much smaller than Bark)
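The "~40M parameters" figure above is printed as a fixed string rather than measured. A minimal sketch of how it could be checked is shown here; `count_parameters` is a hypothetical helper name, not part of `transformers`, and it only assumes the model exposes a PyTorch-style `parameters()` iterator.

```python
def count_parameters(model, trainable_only=False):
    """Sum the element counts of a model's parameters.

    Hypothetical helper: works with any object whose parameters() yields
    items that have numel() and requires_grad, as torch.nn.Module does.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if p.requires_grad or not trainable_only
    )

# Possible usage with this notebook's `model` (not executed here):
# print(f"Model size: {count_parameters(model) / 1e6:.1f}M parameters")
```

Printing the measured count instead of a hard-coded string keeps the message accurate if `model_name` is later changed to another checkpoint.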


In [2]:
# Generate speech with VITS
text = "Hello, this is VITS TTS. I generate speech quickly using Apple Silicon GPU acceleration."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt")

# Move inputs to device
if device == "mps":
    inputs = {k: v.to(device) for k, v in inputs.items()}

print("Generating audio with VITS...")

start_time = time.time()
with torch.no_grad():
    # VITS uses forward() not generate()
    outputs = model(**inputs)
    audio_array = outputs.waveform
end_time = time.time()

print(f"Audio generated in {end_time - start_time:.2f} seconds on {device}")
print(f"Audio shape: {audio_array.shape}")

# Check the model's actual sample rate
model_sample_rate = model.config.sampling_rate
print(f"Model's sampling rate: {model_sample_rate}")

# Convert to numpy for saving
vits_audio = audio_array.cpu().numpy().squeeze()

# Store speech data
vits_speech = {"audio": vits_audio, "sampling_rate": model_sample_rate}

Generating audio with VITS...
Audio generated in 10.80 seconds on mps
Audio shape: torch.Size([1, 90112])
Model's sampling rate: 16000
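The timing above wraps a single call with `time.time()`. On a GPU backend, work may be queued asynchronously, and a first call may include one-time setup costs, so a single unsynchronized measurement can be misleading. The sketch below shows one way to time a call more carefully; `time_call` is a hypothetical helper, and the `sync` argument is assumed to be a function such as `torch.mps.synchronize` that blocks until queued GPU work has finished.

```python
import time

def time_call(fn, sync=None, warmup=1, repeats=3):
    """Return (best_seconds, last_result) for fn(), excluding warm-up runs."""
    def run():
        if sync is not None:
            sync()  # drain any work queued before the timer starts
        start = time.perf_counter()
        result = fn()
        if sync is not None:
            sync()  # wait for queued GPU work before stopping the timer
        return time.perf_counter() - start, result

    for _ in range(warmup):
        run()  # the first call may pay one-time setup costs

    timings = []
    result = None
    for _ in range(repeats):
        elapsed, result = run()
        timings.append(elapsed)
    return min(timings), result

# Possible usage with this notebook's objects (not executed here):
# sync = torch.mps.synchronize if device == "mps" else None
# with torch.no_grad():
#     seconds, outputs = time_call(lambda: model(**inputs), sync=sync)
```

Reporting the best of several runs after a warm-up separates steady-state speed from startup overhead, which is usually the number that matters when comparing devices.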


In [3]:
# Save VITS audio as MP3
import soundfile as sf
from pydub import AudioSegment
import numpy as np
import os

# Create output directory
os.makedirs("output", exist_ok=True)

# Save as WAV first
wav_file = "output/vits_output.wav"
mp3_file = "output/vits_output.mp3"

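# Peak-normalize to [-1, 1]; skip the division for silent audio to avoid dividing by zero
peak = np.max(np.abs(vits_speech["audio"]))
audio_normalized = vits_speech["audio"] / peak if peak > 0 else vits_speech["audio"]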

# Save as WAV
sf.write(wav_file, audio_normalized, vits_speech["sampling_rate"])

# Convert to MP3 (pydub needs ffmpeg installed for MP3 export)
audio_segment = AudioSegment.from_wav(wav_file)
audio_segment.export(mp3_file, format="mp3")

print(f"VITS audio saved as {mp3_file}")
print(f"Duration: {len(audio_segment)/1000:.2f} seconds")
print(f"Generation time: {end_time - start_time:.2f} seconds")
print("\n🚀 VITS TTS: Fast generation with Apple Silicon GPU!")

VITS audio saved as output/vits_output.mp3
Duration: 5.63 seconds
Generation time: 10.80 seconds

🚀 VITS TTS: Fast generation with Apple Silicon GPU!
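To put the two numbers above side by side, the real-time factor (generation time divided by audio duration) can be computed from values already printed in this notebook: 90112 samples at 16000 Hz, generated in 10.80 seconds. The function name `real_time_factor` is illustrative only.

```python
def real_time_factor(generation_seconds, num_samples, sampling_rate):
    """Return (audio_seconds, rtf); rtf below 1 means faster than real time."""
    audio_seconds = num_samples / sampling_rate
    return audio_seconds, generation_seconds / audio_seconds

# Values taken from the outputs recorded in this notebook
audio_seconds, rtf = real_time_factor(10.80, 90112, 16000)
print(f"Audio length: {audio_seconds:.2f} s, real-time factor: {rtf:.2f}")
# prints: Audio length: 5.63 s, real-time factor: 1.92
```

In this particular run the model took roughly twice as long as the clip it produced. That single measurement may include first-call overhead, so it is worth re-running generation before drawing conclusions about speed.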
