# Comparing State-of-the-Art Speech Synthesis
## OpenAI vs Qwen3-TTS (Open Source)

This notebook provides a comprehensive framework for evaluating and comparing modern text-to-speech (TTS) systems:
- **OpenAI TTS** (`gpt-4o-mini-tts`) - Cloud API
- **Qwen3-TTS** ([CustomVoice](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice), VoiceDesign, Base) - Open Source

### Evaluation Criteria
1. **Audio Quality**: Naturalness, expressiveness, clarity
2. **Features**: Voice options, style control, voice cloning
3. **Latency**: Generation speed, streaming support
4. **Cost**: Per-character pricing vs local compute costs

### Framework Components
- Automated evaluation using LLM-as-Judge
- Human evaluation interface (voting system)
- Prosodic feature analysis

---
## 1. Setup and Dependencies

First, we install and import all necessary libraries. The `qwen-tts` package requires **Transformers v4.57.3** for Qwen3-TTS models.

In [39]:
# ============================================================
# STEP 1: Install system dependencies (run in terminal first!)
# ============================================================
# For Ubuntu/Debian:
!sudo apt-get update && sudo apt-get install -y libsndfile1 libsndfile1-dev ffmpeg build-essential sox
#
# For macOS: brew install libsndfile ffmpeg sox
# For conda: conda install -c conda-forge libsndfile ffmpeg sox

# ============================================================
# STEP 2: Install core Python packages
# ============================================================
!pip install openai  # OpenAI TTS API
!pip install librosa soundfile scipy numpy pandas tabulate
!pip install ipywidgets IPython
!pip install openai-whisper  # For transcription/evaluation
!pip install accelerate torch torchaudio
!pip install python-dotenv  # Env loading

# ============================================================
# STEP 3: Qwen3-TTS - Official package from Alibaba
# https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
# ============================================================
# NOTE: If "Failed building wheel for sphn" error occurs,
#       make sure Step 1 system dependencies are installed!
!pip install -U qwen-tts

# Optional: FlashAttention 2 for lower VRAM usage (requires CUDA)
# !pip install -U flash-attn --no-build-isolation



In [40]:
import os
import json
import time
import base64
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional, List, Dict, Any, Callable
from datetime import datetime

import numpy as np
import pandas as pd

# Audio processing
import soundfile as sf
import librosa
import librosa.display

# Display and widgets
from IPython.display import display, Audio, HTML, Markdown
import ipywidgets as widgets

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
print("✓ Environment variables loaded from .env")

print("Core imports successful!")

✓ Environment variables loaded from .env
Core imports successful!


In [41]:
# Check transformers version
import transformers
print(f"Transformers version: {transformers.__version__}")

# Note: qwen-tts requires transformers==4.57.3 specifically,
# so compare against the exact version rather than only the major version
REQUIRED_TRANSFORMERS = "4.57.3"
if transformers.__version__ == REQUIRED_TRANSFORMERS:
    print("✓ Transformers version compatible with qwen-tts")
else:
    print(f"⚠️  Warning: Update transformers with: pip install transformers=={REQUIRED_TRANSFORMERS}")

Transformers version: 4.57.3
✓ Transformers version compatible with qwen-tts


In [42]:
# Check for GPU availability
import torch

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {DEVICE}")

if DEVICE == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️  No GPU detected - open-source models will run slowly")

Device: cuda
GPU: NVIDIA GB10
VRAM: 128.5 GB


---
## 2. Configuration and API Keys

Set up API credentials and output directories.

In [43]:
@dataclass
class Config:
    """Central configuration for TTS comparison framework."""
    
    # API Keys (set via environment variables for security)
    openai_api_key: str = field(default_factory=lambda: os.getenv("OPENAI_API_KEY", ""))
    
    # Output settings
    output_dir: Path = field(default_factory=lambda: Path("./outputs"))
    sample_rate: int = 24000  # Standard TTS sample rate
    
    # OpenAI TTS settings
    openai_model: str = "gpt-4o-mini-tts"
    openai_voice: str = "alloy"  # Options: alloy, echo, fable, onyx, nova, shimmer, ash, ballad, coral, sage, verse
    
    # Qwen3-TTS Models (https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)
    qwen_custom_voice_model: str = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"  # 9 premium voices
    qwen_voice_design_model: str = "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"  # Voice from description
    qwen_base_model: str = "Qwen/Qwen3-TTS-12Hz-1.7B-Base"  # Voice cloning from 3s audio
    qwen_speaker: str = "Ryan"  # Options: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee
    
    # Evaluation settings
    llm_judge_model: str = "gpt-4o"  # Model for automated evaluation
    
    def __post_init__(self):
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
    def validate(self):
        """Check that required credentials are set."""
        issues = []
        if not self.openai_api_key:
            issues.append("OPENAI_API_KEY not set")
        return issues

# Initialize configuration (__post_init__ creates the output directory)
config = Config()

# Validate setup
validation_issues = config.validate()
if validation_issues:
    print("⚠️  Configuration issues:")
    for issue in validation_issues:
        print(f"   - {issue}")
else:
    print("✓ Configuration validated successfully")

print(f"\nOutput directory: {config.output_dir.absolute()}")

✓ Configuration validated successfully

Output directory: /home/doran/jupyterlab/article-scripts/speech-synthesis-comparison/outputs


---
## 3. Test Cases Definition

Define diverse test cases to evaluate different aspects of TTS quality.

In [44]:
@dataclass
class TestCase:
    """A single test case for TTS evaluation."""
    id: str
    name: str
    text: str
    category: str  # neutral, emotional, complex, dialogue, multilingual
    style_instruction: str = ""  # Optional style guidance
    reference_audio_path: Optional[str] = None  # For voice cloning tests
    expected_emotion: Optional[str] = None
    notes: str = ""

# Define comprehensive test cases
TEST_CASES = [
    # Neutral Narrative
    TestCase(
        id="neutral_01",
        name="News Article",
        text="The rapid advancement of artificial intelligence has transformed numerous industries, from healthcare to finance. Researchers continue to push the boundaries of what machines can accomplish, leading to unprecedented breakthroughs in natural language processing and computer vision.",
        category="neutral",
        style_instruction="Read in a clear, professional news anchor tone.",
        notes="Tests general narration quality"
    ),
    TestCase(
        id="neutral_02",
        name="Technical Documentation",
        text="To initialize the connection, first import the client library and create an instance with your API key. Then call the connect method with the appropriate endpoint URL and timeout parameters.",
        category="neutral",
        style_instruction="Speak clearly and methodically, as if explaining to a developer.",
        notes="Tests handling of technical content"
    ),
    
    # Emotional Content
    TestCase(
        id="emotion_01",
        name="Excited Announcement",
        text="I can't believe it! We actually did it! After three years of hard work, our team has finally achieved what everyone said was impossible!",
        category="emotional",
        style_instruction="Speak with genuine excitement and enthusiasm, like celebrating a major achievement.",
        expected_emotion="excitement",
        notes="Tests expressiveness and excitement conveyance"
    ),
    TestCase(
        id="emotion_02",
        name="Sympathetic Response",
        text="I'm so sorry to hear about what happened. Please know that we're here for you, and we'll do everything we can to help you through this difficult time.",
        category="emotional",
        style_instruction="Speak with warmth, empathy, and genuine concern.",
        expected_emotion="sympathy",
        notes="Tests emotional warmth and empathy"
    ),
    TestCase(
        id="emotion_03",
        name="Urgent Warning",
        text="Warning! The system has detected a critical security breach. All users must log out immediately and change their passwords. This is not a drill.",
        category="emotional",
        style_instruction="Speak with urgency and seriousness, conveying the importance of immediate action.",
        expected_emotion="urgency",
        notes="Tests ability to convey urgency"
    ),
    
    # Complex/Challenging Content
    TestCase(
        id="complex_01",
        name="Tongue Twister",
        text="She sells seashells by the seashore. The shells she sells are seashells, I'm sure. So if she sells shells on the seashore, then I'm sure she sells seashore shells.",
        category="complex",
        style_instruction="Speak clearly and at a moderate pace.",
        notes="Tests pronunciation accuracy"
    ),
    TestCase(
        id="complex_02",
        name="Numbers and Acronyms",
        text="The Q4 2025 report shows that our API handled 1,234,567 requests with a 99.97% success rate. The CPU utilization averaged 45.3% while GPU memory usage peaked at 8.2 GB.",
        category="complex",
        style_instruction="Read numbers and acronyms naturally, as a professional would.",
        notes="Tests handling of numbers and technical abbreviations"
    ),
    TestCase(
        id="complex_03",
        name="Proper Nouns",
        text="Dr. Yoshua Bengio from Université de Montréal and Geoffrey Hinton from the University of Toronto collaborated with Yann LeCun at Meta AI Research on groundbreaking neural network architectures.",
        category="complex",
        style_instruction="Pronounce all names correctly and naturally.",
        notes="Tests proper noun pronunciation"
    ),
    
    # Dialogue/Conversational
    TestCase(
        id="dialogue_01",
        name="Customer Service",
        text="Hello! Thank you for calling customer support. How may I help you today? I understand your frustration, and I want to assure you that we'll resolve this issue right away.",
        category="dialogue",
        style_instruction="Speak as a friendly, professional customer service representative.",
        notes="Tests conversational tone"
    ),
    TestCase(
        id="dialogue_02",
        name="Storytelling",
        text="Once upon a time, in a land far, far away, there lived a young inventor who dreamed of building machines that could think. Little did she know that her dreams would one day change the world.",
        category="dialogue",
        style_instruction="Narrate like a storyteller, with warmth and a sense of wonder.",
        notes="Tests narrative storytelling ability"
    ),
    
    # Long-form content
    TestCase(
        id="longform_01",
        name="Extended Paragraph",
        text="""The development of large language models represents one of the most significant technological achievements of the past decade. These models, trained on vast corpora of text data, have demonstrated remarkable capabilities in understanding and generating human language. From answering complex questions to writing creative fiction, from translating between languages to summarizing lengthy documents, the applications seem nearly limitless. However, with great power comes great responsibility. Researchers and practitioners must carefully consider the ethical implications of deploying these systems, ensuring they are used to benefit humanity while minimizing potential harms.""",
        category="neutral",
        style_instruction="Read as a thoughtful essay, maintaining engagement throughout.",
        notes="Tests coherence over longer content"
    ),
]

print(f"Defined {len(TEST_CASES)} test cases:")
for tc in TEST_CASES:
    print(f"  [{tc.category}] {tc.id}: {tc.name}")

Defined 11 test cases:
  [neutral] neutral_01: News Article
  [neutral] neutral_02: Technical Documentation
  [emotional] emotion_01: Excited Announcement
  [emotional] emotion_02: Sympathetic Response
  [emotional] emotion_03: Urgent Warning
  [complex] complex_01: Tongue Twister
  [complex] complex_02: Numbers and Acronyms
  [complex] complex_03: Proper Nouns
  [dialogue] dialogue_01: Customer Service
  [dialogue] dialogue_02: Storytelling
  [neutral] longform_01: Extended Paragraph
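
A per-category tally is a quick way to confirm that the suite covers each content type. The sketch below works on stand-in `(id, category)` pairs that mirror the cases defined above rather than on the `TestCase` objects themselves; `count_by_category` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Stand-in (id, category) pairs mirroring TEST_CASES above.
CASES = [
    ("neutral_01", "neutral"), ("neutral_02", "neutral"),
    ("emotion_01", "emotional"), ("emotion_02", "emotional"),
    ("emotion_03", "emotional"),
    ("complex_01", "complex"), ("complex_02", "complex"),
    ("complex_03", "complex"),
    ("dialogue_01", "dialogue"), ("dialogue_02", "dialogue"),
    ("longform_01", "neutral"),
]

def count_by_category(cases):
    """Return a Counter mapping category -> number of test cases."""
    return Counter(category for _, category in cases)

counts = count_by_category(CASES)
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Inside the notebook the same tally would be `Counter(tc.category for tc in TEST_CASES)`.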


---
## 4. TTS Provider Implementations

Abstract base class and concrete implementations for each TTS provider.

In [45]:
from abc import ABC, abstractmethod

@dataclass
class TTSResult:
    """Result from a TTS generation."""
    provider: str
    test_case_id: str
    audio_path: Path
    audio_data: np.ndarray
    sample_rate: int
    generation_time: float  # seconds
    character_count: int
    estimated_cost: float  # USD
    metadata: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None
    
    @property
    def duration(self) -> float:
        """Audio duration in seconds."""
        return len(self.audio_data) / self.sample_rate
    
    @property
    def realtime_factor(self) -> float:
        """Seconds of audio produced per second of generation time (>1 is faster than real time)."""
        if self.generation_time > 0:
            return self.duration / self.generation_time
        return float('inf')


class TTSProvider(ABC):
    """Abstract base class for TTS providers."""
    
    def __init__(self, config: Config):
        self.config = config
        self.name = self.__class__.__name__
    
    @abstractmethod
    def generate(self, test_case: TestCase) -> TTSResult:
        """Generate audio for a test case."""
        pass
    
    @abstractmethod
    def is_available(self) -> bool:
        """Check if this provider is properly configured."""
        pass
    
    def _save_audio(self, audio_data: np.ndarray, test_case_id: str, 
                    sample_rate: int) -> Path:
        """Save audio to file."""
        filename = f"{self.name.lower()}_{test_case_id}.wav"
        path = self.config.output_dir / filename
        sf.write(path, audio_data, sample_rate)
        return path
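
To make the `realtime_factor` property concrete, here is a minimal standalone sketch of the same arithmetic. The function name and the sample numbers are illustrative assumptions, not measurements from this notebook. Note that this is audio duration divided by generation time, so higher is better; some literature defines the real-time factor as the inverse.

```python
def realtime_factor(num_samples: int, sample_rate: int, generation_time: float) -> float:
    """Seconds of audio produced per second of wall-clock generation time."""
    if generation_time <= 0:
        return float("inf")
    duration = num_samples / sample_rate  # audio length in seconds
    return duration / generation_time

# 120,000 samples at 24 kHz is 5 s of audio; suppose it took 2 s to generate.
rtf = realtime_factor(num_samples=120_000, sample_rate=24_000, generation_time=2.0)
print(f"{rtf:.2f}x real time")  # prints "2.50x real time"
```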

### 4.1 OpenAI TTS Provider

In [46]:
class OpenAITTS(TTSProvider):
    """OpenAI GPT-4 TTS implementation."""
    
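    # Approximate pricing in USD per 1M characters; check OpenAI's current pricing page.
    # NOTE: gpt-4o-mini-tts is billed per token rather than per character, so its
    # entry is a rough per-character stand-in used only for this comparison.
    PRICING = {
        "tts-1": 15.0,  # Standard quality
        "tts-1-hd": 30.0,  # High quality
        "gpt-4o-mini-tts": 30.0,  # Rough estimate (token-billed model)
    }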
    
    VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer", 
              "ash", "ballad", "coral", "sage", "verse"]
    
    def __init__(self, config: Config):
        super().__init__(config)
        self.client = None
        if self.is_available():
            from openai import OpenAI
            self.client = OpenAI(api_key=config.openai_api_key)
    
    def is_available(self) -> bool:
        return bool(self.config.openai_api_key)
    
    def generate(self, test_case: TestCase) -> TTSResult:
        """Generate speech using OpenAI's TTS API."""
        if not self.client:
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=Path(),
                audio_data=np.array([]),
                sample_rate=24000,
                generation_time=0,
                character_count=len(test_case.text),
                estimated_cost=0,
                error="OpenAI client not initialized"
            )
        
        start_time = time.time()
        
        try:
            # Build the request; only send `instructions` when a style is given
            # (gpt-4o-mini-tts supports instructions for style control)
            request_kwargs = dict(
                model=self.config.openai_model,
                voice=self.config.openai_voice,
                input=test_case.text,
                response_format="wav",
            )
            if test_case.style_instruction:
                request_kwargs["instructions"] = test_case.style_instruction
            response = self.client.audio.speech.create(**request_kwargs)
            
            generation_time = time.time() - start_time
            
            # Save to temporary file and load as numpy array
            # (write_to_file replaces the deprecated stream_to_file)
            temp_path = self.config.output_dir / f"temp_openai_{test_case.id}.wav"
            response.write_to_file(temp_path)
            
            # Load audio data
            audio_data, sample_rate = sf.read(temp_path)
            
            # Move to final path
            final_path = self._save_audio(audio_data, test_case.id, sample_rate)
            temp_path.unlink()  # Remove temp file
            
            # Calculate cost
            char_count = len(test_case.text)
            price_per_char = self.PRICING.get(self.config.openai_model, 30.0) / 1_000_000
            estimated_cost = char_count * price_per_char
            
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=final_path,
                audio_data=audio_data,
                sample_rate=sample_rate,
                generation_time=generation_time,
                character_count=char_count,
                estimated_cost=estimated_cost,
                metadata={
                    "model": self.config.openai_model,
                    "voice": self.config.openai_voice,
                }
            )
            
        except Exception as e:
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=Path(),
                audio_data=np.array([]),
                sample_rate=24000,
                generation_time=time.time() - start_time,
                character_count=len(test_case.text),
                estimated_cost=0,
                error=str(e)
            )

print("✓ OpenAI TTS Provider defined")
print(f"  Available voices: {OpenAITTS.VOICES}")

✓ OpenAI TTS Provider defined
  Available voices: ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer', 'ash', 'ballad', 'coral', 'sage', 'verse']
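
The cost estimate in `generate` is a per-character calculation. As a rough worked example, assuming the `tts-1` rate of 15.0 USD per 1M characters from the `PRICING` table above (verify against current pricing before relying on it), with `estimate_cost` as a hypothetical standalone helper:

```python
def estimate_cost(char_count: int, price_per_million_chars: float) -> float:
    """Estimated cost in USD for synthesizing `char_count` characters."""
    return char_count * price_per_million_chars / 1_000_000

# 2,000 characters at an assumed 15.0 USD per 1M characters
cost = estimate_cost(2_000, 15.0)
print(f"${cost:.4f}")  # prints "$0.0300"
```

At these rates the whole test suite costs a fraction of a cent, so the cost comparison only becomes meaningful at production volumes.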


### 4.2 Qwen3-TTS (Open Source)

In [48]:
class Qwen3TTS(TTSProvider):
    """Qwen3-TTS open-source model via official qwen-tts package.
    
    Models (https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice):
    - CustomVoice: 9 premium voices with instruction control
    - VoiceDesign: Create voice from text description
    - Base: Voice cloning from 3-second audio sample
    
    Features:
    - Support for 10 languages (English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
    - Streaming generation (~97ms latency)
    - Instruction-driven style/emotion control
    """
    
    # Model variants
    MODELS = {
        "Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice": {"params": "0.6B", "vram": "4GB"},
        "Qwen/Qwen3-TTS-12Hz-0.6B-Base": {"params": "0.6B", "vram": "4GB"},
        "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice": {"params": "1.7B", "vram": "8GB"},
        "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign": {"params": "1.7B", "vram": "8GB"},
        "Qwen/Qwen3-TTS-12Hz-1.7B-Base": {"params": "1.7B", "vram": "8GB"},
    }
    
    # Available speakers for CustomVoice models
    SPEAKERS = {
        "Vivian": "Bright, slightly edgy young female voice (Chinese)",
        "Serena": "Warm, gentle young female voice (Chinese)",
        "Uncle_Fu": "Seasoned male voice with low, mellow timbre (Chinese)",
        "Dylan": "Youthful Beijing male voice, clear natural timbre (Chinese/Beijing)",
        "Eric": "Lively Chengdu male voice, slightly husky (Chinese/Sichuan)",
        "Ryan": "Dynamic male voice with strong rhythmic drive (English)",
        "Aiden": "Sunny American male voice with clear midrange (English)",
        "Ono_Anna": "Playful Japanese female voice, light nimble timbre (Japanese)",
        "Sohee": "Warm Korean female voice with rich emotion (Korean)",
    }
    
    def __init__(self, config: Config, load_model: bool = False):
        super().__init__(config)
        self.model = None
        self.model_type = "custom_voice"  # custom_voice, voice_design, or base
        
        if load_model and self.is_available():
            self._load_model()
    
    def is_available(self) -> bool:
        # Check if qwen-tts is installed and GPU available
        try:
            from qwen_tts import Qwen3TTSModel
            return torch.cuda.is_available()
        except ImportError:
            return False
    
    def _load_model(self, model_type: str = "custom_voice"):
        """Load Qwen3-TTS model using official qwen-tts package."""
        try:
            from qwen_tts import Qwen3TTSModel
            
            model_map = {
                "custom_voice": self.config.qwen_custom_voice_model,
                "voice_design": self.config.qwen_voice_design_model,
                "base": self.config.qwen_base_model,
            }
            model_id = model_map.get(model_type, self.config.qwen_custom_voice_model)
            self.model_type = model_type
            
            print(f"Loading {model_id}...")
            
            self.model = Qwen3TTSModel.from_pretrained(
                model_id,
                device_map="cuda:0",
                dtype=torch.bfloat16,
                attn_implementation="flash_attention_2" if self._has_flash_attn() else "sdpa",
            )
            
            print(f"✓ Model loaded: {model_id}")
            print(f"  Supported speakers: {self.model.get_supported_speakers()}")
            print(f"  Supported languages: {self.model.get_supported_languages()}")
            
        except Exception as e:
            print(f"Error loading model: {e}")
            print("Install with: pip install -U qwen-tts")
            print("See: https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
    
    def _has_flash_attn(self) -> bool:
        try:
            import flash_attn
            return True
        except ImportError:
            return False
    
    def generate(self, test_case: TestCase, 
                 reference_audio: Optional[np.ndarray] = None,
                 ref_text: Optional[str] = None,
                 voice_description: Optional[str] = None) -> TTSResult:
        """Generate speech using Qwen3-TTS.
        
        Args:
            test_case: The test case to generate
            reference_audio: Optional 3-second audio for voice cloning (Base model)
            ref_text: Transcript of reference audio (required for voice cloning)
            voice_description: Text description for voice design (VoiceDesign model)
        """
        if not self.model:
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=Path(),
                audio_data=np.array([]),
                sample_rate=24000,
                generation_time=0,
                character_count=len(test_case.text),
                estimated_cost=0,
                error="Model not loaded. Call _load_model() first."
            )
        
        start_time = time.time()
        
        try:
            # Generate based on model type
            if self.model_type == "custom_voice":
                # CustomVoice: Use preset speakers with optional instruction
                wavs, sr = self.model.generate_custom_voice(
                    text=test_case.text,
                    language="English",  # or "Auto" for auto-detect
                    speaker=self.config.qwen_speaker,
                    instruct=test_case.style_instruction if test_case.style_instruction else None,
                )
            elif self.model_type == "voice_design" and voice_description:
                # VoiceDesign: Create voice from text description
                wavs, sr = self.model.generate_voice_design(
                    text=test_case.text,
                    language="English",
                    instruct=voice_description,
                )
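            elif self.model_type == "base" and reference_audio is not None:
                # Base: Clone voice from reference audio.
                # ref_text must be the transcript of the reference clip,
                # not the text being synthesized.
                if not ref_text:
                    raise ValueError("ref_text (transcript of reference_audio) is required for voice cloning")
                wavs, sr = self.model.generate_voice_clone(
                    text=test_case.text,
                    language="English",
                    ref_audio=(reference_audio, 24000),
                    ref_text=ref_text,
                )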
            else:
                # Fallback to custom voice
                wavs, sr = self.model.generate_custom_voice(
                    text=test_case.text,
                    language="Auto",
                    speaker=self.config.qwen_speaker,
                )
            
            generation_time = time.time() - start_time
            
            # Convert to numpy array
            audio_data = wavs[0] if isinstance(wavs, list) else wavs
            if not isinstance(audio_data, np.ndarray):
                audio_data = np.array(audio_data)
            audio_data = audio_data.astype(np.float32)
            
            # Normalize if needed
            if np.abs(audio_data).max() > 1.0:
                audio_data = audio_data / np.abs(audio_data).max()
            
            # Save audio
            final_path = self._save_audio(audio_data, test_case.id, sr)
            
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=final_path,
                audio_data=audio_data,
                sample_rate=sr,
                generation_time=generation_time,
                character_count=len(test_case.text),
                estimated_cost=0,  # Open source = free (excluding compute)
                metadata={
                    "model_type": self.model_type,
                    "speaker": self.config.qwen_speaker if self.model_type == "custom_voice" else None,
                    "voice_cloning": reference_audio is not None,
                    "voice_design": voice_description is not None,
                }
            )
            
        except Exception as e:
            return TTSResult(
                provider=self.name,
                test_case_id=test_case.id,
                audio_path=Path(),
                audio_data=np.array([]),
                sample_rate=24000,
                generation_time=time.time() - start_time,
                character_count=len(test_case.text),
                estimated_cost=0,
                error=str(e)
            )

print("✓ Qwen3-TTS Provider defined")
print(f"  Available models: {list(Qwen3TTS.MODELS.keys())}")
print(f"  Available speakers: {list(Qwen3TTS.SPEAKERS.keys())}")

✓ Qwen3-TTS Provider defined
  Available models: ['Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice', 'Qwen/Qwen3-TTS-12Hz-0.6B-Base', 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign', 'Qwen/Qwen3-TTS-12Hz-1.7B-Base']
  Available speakers: ['Vivian', 'Serena', 'Uncle_Fu', 'Dylan', 'Eric', 'Ryan', 'Aiden', 'Ono_Anna', 'Sohee']
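
The branch order inside `Qwen3TTS.generate` decides which generation call is made. The sketch below restates that order as a pure function so it can be checked without loading a model. `select_mode` is a hypothetical helper, and the returned strings are simply the method names used in the class above.

```python
from typing import Optional

def select_mode(model_type: str,
                has_reference_audio: bool = False,
                voice_description: Optional[str] = None) -> str:
    """Mirror the branch order of Qwen3TTS.generate and name the call it makes."""
    if model_type == "custom_voice":
        return "generate_custom_voice"
    if model_type == "voice_design" and voice_description:
        return "generate_voice_design"
    if model_type == "base" and has_reference_audio:
        return "generate_voice_clone"
    # Required inputs missing: generate() falls back to the preset-speaker call
    return "generate_custom_voice"

print(select_mode("custom_voice"))
print(select_mode("voice_design", voice_description="A calm, low-pitched narrator"))
print(select_mode("base", has_reference_audio=True))
print(select_mode("base"))  # no reference audio, so the fallback applies
```

Whether the fallback call succeeds on a VoiceDesign or Base checkpoint is not verified here; in `generate`, any exception it raises is returned in `TTSResult.error`.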


---
## 5. Audio Analysis Utilities

Functions for extracting prosodic features and analyzing audio quality.

In [51]:
@dataclass
class AudioFeatures:
    """Extracted audio features for evaluation."""
    duration: float
    pitch_mean: float
    pitch_std: float
    pitch_range: float
    energy_mean: float
    energy_std: float
    speaking_rate: float  # words per second (estimated from the input text)
    pause_ratio: float  # fraction of frames below the silence threshold
    spectral_centroid_mean: float
    
    def to_dict(self) -> Dict[str, float]:
        return {
            "duration": self.duration,
            "pitch_mean": self.pitch_mean,
            "pitch_std": self.pitch_std,
            "pitch_range": self.pitch_range,
            "energy_mean": self.energy_mean,
            "energy_std": self.energy_std,
            "speaking_rate": self.speaking_rate,
            "pause_ratio": self.pause_ratio,
            "spectral_centroid_mean": self.spectral_centroid_mean,
        }


def extract_audio_features(audio_data: np.ndarray, sample_rate: int, 
                           text: str = "") -> AudioFeatures:
    """Extract prosodic and acoustic features from audio."""
    
    # Duration
    duration = len(audio_data) / sample_rate
    
    # Pitch (F0) extraction using librosa
    try:
        f0, voiced_flag, voiced_probs = librosa.pyin(
            audio_data,
            fmin=librosa.note_to_hz('C2'),
            fmax=librosa.note_to_hz('C7'),
            sr=sample_rate
        )
        f0_valid = f0[~np.isnan(f0)]
        pitch_mean = np.mean(f0_valid) if len(f0_valid) > 0 else 0
        pitch_std = np.std(f0_valid) if len(f0_valid) > 0 else 0
        pitch_range = (np.max(f0_valid) - np.min(f0_valid)) if len(f0_valid) > 0 else 0
    except Exception:
        pitch_mean, pitch_std, pitch_range = 0, 0, 0
    
    # Energy (RMS)
    rms = librosa.feature.rms(y=audio_data)[0]
    energy_mean = np.mean(rms)
    energy_std = np.std(rms)
    
    # Speaking rate estimation (words / duration)
    word_count = len(text.split()) if text else 0
    speaking_rate = word_count / duration if duration > 0 else 0
    
    # Pause ratio (silence detection)
    silence_threshold = 0.01
    is_silence = rms < silence_threshold
    pause_ratio = np.mean(is_silence)
    
    # Spectral centroid (brightness)
    spectral_centroid = librosa.feature.spectral_centroid(y=audio_data, sr=sample_rate)[0]
    spectral_centroid_mean = np.mean(spectral_centroid)
    
    return AudioFeatures(
        duration=duration,
        pitch_mean=pitch_mean,
        pitch_std=pitch_std,
        pitch_range=pitch_range,
        energy_mean=energy_mean,
        energy_std=energy_std,
        speaking_rate=speaking_rate,
        pause_ratio=pause_ratio,
        spectral_centroid_mean=spectral_centroid_mean,
    )


def plot_audio_comparison(results: List[TTSResult], test_case: TestCase):
    """Create visualization comparing audio outputs."""
    
    n_results = len(results)
    fig, axes = plt.subplots(n_results, 3, figsize=(15, 4 * n_results))
    
    if n_results == 1:
        axes = axes.reshape(1, -1)
    
    for i, result in enumerate(results):
        if result.error or len(result.audio_data) == 0:
            continue
            
        # Waveform
        axes[i, 0].plot(result.audio_data, alpha=0.7)
        axes[i, 0].set_title(f"{result.provider} - Waveform")
        axes[i, 0].set_xlabel("Samples")
        axes[i, 0].set_ylabel("Amplitude")
        
        # Spectrogram
        D = librosa.amplitude_to_db(
            np.abs(librosa.stft(result.audio_data)), 
            ref=np.max
        )
        librosa.display.specshow(
            D, sr=result.sample_rate, x_axis='time', y_axis='hz',
            ax=axes[i, 1]
        )
        axes[i, 1].set_title(f"{result.provider} - Spectrogram")
        
        # Pitch contour
        try:
            f0, _, _ = librosa.pyin(
                result.audio_data,
                fmin=librosa.note_to_hz('C2'),
                fmax=librosa.note_to_hz('C7'),
                sr=result.sample_rate
            )
            times = librosa.times_like(f0, sr=result.sample_rate)
            axes[i, 2].plot(times, f0, 'b-', alpha=0.7)
            axes[i, 2].set_title(f"{result.provider} - Pitch Contour")
            axes[i, 2].set_xlabel("Time (s)")
            axes[i, 2].set_ylabel("Frequency (Hz)")
        except Exception:  # avoid swallowing KeyboardInterrupt/SystemExit
            axes[i, 2].text(0.5, 0.5, "Pitch extraction failed", 
                           ha='center', va='center')
    
    plt.suptitle(f"Test Case: {test_case.name}", fontsize=14)
    plt.tight_layout()
    plt.show()

print("✓ Audio analysis utilities defined")

✓ Audio analysis utilities defined
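
The pause-ratio heuristic in `extract_audio_features` can be sanity-checked on a synthetic signal. The sketch below is a standard-library approximation, not the notebook's implementation: `frame_rms` and `pause_ratio` are hypothetical helpers, and the framing omits the centered padding that `librosa.feature.rms` applies by default, so values will differ slightly from the real feature extractor.

```python
import math

def frame_rms(samples, frame_length=2048, hop_length=512):
    """RMS energy per frame (no padding, unlike librosa's centered default)."""
    values = []
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        if frame:
            values.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return values

def pause_ratio(samples, silence_threshold=0.01):
    """Fraction of frames whose RMS falls below the silence threshold."""
    rms = frame_rms(samples)
    if not rms:
        return 0.0
    return sum(v < silence_threshold for v in rms) / len(rms)

# Toy signal: 1 s of a 220 Hz tone followed by 1 s of digital silence at 16 kHz.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
signal = tone + [0.0] * sr

ratio = pause_ratio(signal)
print(f"pause ratio: {ratio:.2f}")
```

The ratio comes out a little below one half because frames that straddle the tone/silence boundary still carry energy. A fixed threshold of 0.01 also assumes audio normalized to roughly [-1, 1]; quieter recordings may need a relative threshold.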


---
## 6. LLM-as-Judge Evaluation

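Automated quality evaluation with an OpenAI chat model (selected by `config.llm_judge_model`). Because the model cannot hear audio, it judges each output from its extracted prosodic features and, when available, an ASR transcription.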

In [52]:
class LLMJudge:
    """LLM-based evaluation of TTS quality."""
    
    EVALUATION_PROMPT = """You are an expert speech scientist evaluating text-to-speech systems.

You will be given information about audio outputs from different TTS systems for the same input text.
Since you cannot hear the audio directly, you will analyze:
1. Prosodic features (pitch variation, speaking rate, energy dynamics)
2. Transcription accuracy (if ASR transcription differs from input)
3. The appropriateness of the features given the intended style/emotion

## Input Text
{input_text}

## Intended Style
{style_instruction}

## Expected Emotion/Tone
{expected_emotion}

## Model Outputs
{model_outputs}

## Evaluation Criteria
1. **Naturalness**: Based on pitch variation and energy dynamics, which output likely sounds most human-like?
   - Higher pitch standard deviation often indicates more expressive speech
   - Natural speech has varied energy patterns
   - Very flat features may indicate robotic delivery

2. **Expressiveness**: Given the intended style, which output's features best match expectations?
   - Excited speech: higher pitch mean, larger pitch range, faster speaking rate
   - Calm/soothing: lower pitch variation, slower rate
   - Urgent: faster rate, higher energy

3. **Accuracy**: Any transcription errors indicate pronunciation issues.

Please provide:
1. Analysis of each model's characteristics based on the features
2. Ranking from best to worst for this specific test case
3. Reasoning for your ranking
4. A score (1-10) for each model

Format your response as JSON:
```json
{{
    "analysis": {{
        "model_name": "analysis text"
    }},
    "ranking": ["best_model", "second_best", ...],
    "reasoning": "explanation",
    "scores": {{
        "model_name": score
    }}
}}
```
"""
    
    def __init__(self, config: Config):
        self.config = config
        self.client = None
        if config.openai_api_key:
            from openai import OpenAI
            self.client = OpenAI(api_key=config.openai_api_key)
    
    def evaluate(self, test_case: TestCase, 
                 results: List[TTSResult],
                 transcriptions: Dict[str, str] = None) -> Dict[str, Any]:
        """Evaluate TTS outputs using LLM."""
        
        if not self.client:
            return {"error": "OpenAI client not available"}
        
        # Extract features for each result
        model_outputs = []
        for result in results:
            if result.error:
                model_outputs.append(f"### {result.provider}\nError: {result.error}")
                continue
            
            features = extract_audio_features(
                result.audio_data, 
                result.sample_rate,
                test_case.text
            )
            
            transcription = transcriptions.get(result.provider, "[Not available]") if transcriptions else "[Not available]"
            
            output_str = f"""### {result.provider}
- Duration: {features.duration:.2f}s
- Generation Time: {result.generation_time:.2f}s (RTF: {result.realtime_factor:.2f}x)
- Pitch Mean: {features.pitch_mean:.1f} Hz
- Pitch Std Dev: {features.pitch_std:.1f} Hz
- Pitch Range: {features.pitch_range:.1f} Hz
- Energy Mean: {features.energy_mean:.4f}
- Energy Std Dev: {features.energy_std:.4f}
- Speaking Rate: {features.speaking_rate:.1f} words/sec
- Pause Ratio: {features.pause_ratio:.2%}
- ASR Transcription: {transcription}
"""
            model_outputs.append(output_str)
        
        # Build prompt
        prompt = self.EVALUATION_PROMPT.format(
            input_text=test_case.text,
            style_instruction=test_case.style_instruction or "None specified",
            expected_emotion=test_case.expected_emotion or "Neutral",
            model_outputs="\n".join(model_outputs)
        )
        
        try:
            response = self.client.chat.completions.create(
                model=self.config.llm_judge_model,
                messages=[
                    {"role": "system", "content": "You are an expert speech quality evaluator."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.3,
                response_format={"type": "json_object"}
            )
            
            return json.loads(response.choices[0].message.content)
            
        except Exception as e:
            return {"error": str(e)}

print("✓ LLM Judge defined")

✓ LLM Judge defined
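
`LLMJudge.evaluate` returns whatever JSON the model produces, so downstream code may want to validate it first. The following is a minimal sketch under that assumption: `parse_judge_response` is a hypothetical helper, not part of the notebook, that drops unknown providers from the ranking and clamps scores to the 1-10 scale requested in the prompt.

```python
import json

def parse_judge_response(raw: str, providers: list) -> dict:
    """Parse the judge's JSON reply and keep only well-formed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {"error": f"invalid JSON: {exc}"}
    if not isinstance(data, dict):
        return {"error": "expected a JSON object"}

    # Keep only ranking entries that name known providers, in the judge's order.
    raw_ranking = data.get("ranking")
    if not isinstance(raw_ranking, list):
        raw_ranking = []
    ranking = [p for p in raw_ranking if p in providers]

    # Coerce scores to floats and clamp them to the 1-10 scale.
    raw_scores = data.get("scores")
    if not isinstance(raw_scores, dict):
        raw_scores = {}
    scores = {}
    for name, value in raw_scores.items():
        try:
            scores[name] = min(10.0, max(1.0, float(value)))
        except (TypeError, ValueError):
            continue  # drop non-numeric scores

    return {
        "analysis": data.get("analysis", {}),
        "ranking": ranking,
        "reasoning": data.get("reasoning", ""),
        "scores": scores,
        "winner": ranking[0] if ranking else None,
    }

# Made-up reply with a string score and an out-of-range score.
reply = '{"ranking": ["Qwen3TTS", "OpenAITTS"], "scores": {"Qwen3TTS": "8", "OpenAITTS": 11}}'
parsed = parse_judge_response(reply, providers=["OpenAITTS", "Qwen3TTS"])
print(parsed["winner"], parsed["scores"])
```

Returning `winner` as `None` for an empty ranking keeps callers from indexing into an empty list.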


---
## 7. Human Evaluation Interface

Interactive widgets for human listening and voting.

In [53]:
class HumanEvaluator:
    """Interactive human evaluation interface."""
    
    def __init__(self):
        self.votes = {}  # {test_case_id: {provider: vote_count}}
        self.ratings = {}  # {test_case_id: {provider: [ratings]}}
        self.comments = {}  # {test_case_id: [comments]}
    
    def create_evaluation_widget(self, test_case: TestCase, 
                                  results: List[TTSResult]) -> widgets.VBox:
        """Create an interactive evaluation widget for a test case."""
        
        # Initialize storage
        if test_case.id not in self.votes:
            self.votes[test_case.id] = {r.provider: 0 for r in results if not r.error}
            self.ratings[test_case.id] = {r.provider: [] for r in results if not r.error}
            self.comments[test_case.id] = []
        
        # Header
        header = widgets.HTML(f"""
        <h3>Test Case: {test_case.name}</h3>
        <p><b>Category:</b> {test_case.category}</p>
        <p><b>Text:</b> "{test_case.text[:100]}{'...' if len(test_case.text) > 100 else ''}"</p>
        <p><b>Style:</b> {test_case.style_instruction or 'None'}</p>
        <hr>
        """)
        
        # Audio players and vote buttons
        audio_widgets = []
        vote_buttons = []
        rating_sliders = []
        
        for result in results:
            if result.error:
                audio_widgets.append(widgets.HTML(
                    f"<p><b>{result.provider}:</b> Error - {result.error}</p>"
                ))
                continue
            
            # Audio player
            audio_html = f"""
            <div style="margin: 10px 0; padding: 10px; border: 1px solid #ddd; border-radius: 5px;">
                <p><b>{result.provider}</b></p>
                <audio controls src="data:audio/wav;base64,{self._audio_to_base64(result)}">
                    Your browser does not support audio.
                </audio>
                <p style="font-size: 0.9em; color: #666;">
                    Duration: {result.duration:.2f}s | 
                    Gen Time: {result.generation_time:.2f}s | 
                    Cost: ${result.estimated_cost:.4f}
                </p>
            </div>
            """
            audio_widgets.append(widgets.HTML(audio_html))
            
            # Vote button
            vote_btn = widgets.Button(
                description=f"Vote: {result.provider}",
                button_style='primary',
                layout=widgets.Layout(width='200px')
            )
            vote_btn.provider = result.provider
            vote_btn.test_case_id = test_case.id
            vote_btn.on_click(self._on_vote)
            vote_buttons.append(vote_btn)
            
            # Rating slider (1-5 MOS scale)
            rating = widgets.IntSlider(
                value=3,
                min=1, max=5,
                description=f'{result.provider} MOS:',
                style={'description_width': '150px'},
                layout=widgets.Layout(width='400px')
            )
            rating.provider = result.provider
            rating.test_case_id = test_case.id
            rating_sliders.append(rating)
        
        # Submit ratings button
        submit_btn = widgets.Button(
            description="Submit Ratings",
            button_style='success'
        )
        submit_btn.test_case_id = test_case.id
        submit_btn.rating_sliders = rating_sliders
        submit_btn.on_click(self._on_submit_ratings)
        
        # Comment box
        comment_box = widgets.Textarea(
            placeholder='Add any observations or comments...',
            layout=widgets.Layout(width='100%', height='80px')
        )
        comment_box.test_case_id = test_case.id
        
        # Vote count display
        self.vote_display = widgets.HTML(self._get_vote_display(test_case.id))
        
        # Assemble widget
        return widgets.VBox([
            header,
            widgets.VBox(audio_widgets),
            widgets.HTML("<h4>Vote for Best Quality:</h4>"),
            widgets.HBox(vote_buttons),
            widgets.HTML("<h4>Rate Each (1=Bad, 5=Excellent):</h4>"),
            widgets.VBox(rating_sliders),
            submit_btn,
            widgets.HTML("<h4>Comments:</h4>"),
            comment_box,
            widgets.HTML("<h4>Current Results:</h4>"),
            self.vote_display
        ])
    
    def _audio_to_base64(self, result: TTSResult) -> str:
        """Convert audio to base64 for embedding."""
        import io
        buffer = io.BytesIO()
        sf.write(buffer, result.audio_data, result.sample_rate, format='WAV')
        return base64.b64encode(buffer.getvalue()).decode('utf-8')
    
    def _on_vote(self, button):
        """Handle vote button click."""
        self.votes[button.test_case_id][button.provider] += 1
        self.vote_display.value = self._get_vote_display(button.test_case_id)
        print(f"Voted for {button.provider}!")
    
    def _on_submit_ratings(self, button):
        """Handle rating submission."""
        for slider in button.rating_sliders:
            self.ratings[slider.test_case_id][slider.provider].append(slider.value)
        self.vote_display.value = self._get_vote_display(button.test_case_id)
        print("Ratings submitted!")
    
    def _get_vote_display(self, test_case_id: str) -> str:
        """Generate HTML for vote/rating display."""
        votes = self.votes.get(test_case_id, {})
        ratings = self.ratings.get(test_case_id, {})
        
        html = "<table style='width:100%'><tr><th>Provider</th><th>Votes</th><th>Avg MOS</th></tr>"
        for provider, vote_count in votes.items():
            mos_scores = ratings.get(provider, [])
            avg_mos = np.mean(mos_scores) if mos_scores else "N/A"
            if isinstance(avg_mos, float):
                avg_mos = f"{avg_mos:.2f}"
            html += f"<tr><td>{provider}</td><td>{vote_count}</td><td>{avg_mos}</td></tr>"
        html += "</table>"
        return html
    
    def get_summary(self) -> pd.DataFrame:
        """Get summary of all evaluations."""
        rows = []
        for test_id, providers in self.votes.items():
            for provider, votes in providers.items():
                mos_scores = self.ratings.get(test_id, {}).get(provider, [])
                rows.append({
                    "test_case": test_id,
                    "provider": provider,
                    "votes": votes,
                    "avg_mos": np.mean(mos_scores) if mos_scores else None,
                    "num_ratings": len(mos_scores)
                })
        return pd.DataFrame(rows)

print("✓ Human Evaluator defined")

✓ Human Evaluator defined
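
`get_summary` reports only the average MOS per provider. With few raters it can help to show the uncertainty as well. The sketch below is one possible addition, not part of the notebook: `mos_summary` is a hypothetical helper that uses a normal-approximation 95% interval, which is optimistic for very small samples (a t-interval would be wider).

```python
import math
import statistics

def mos_summary(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    if n == 0:
        return None
    mean = statistics.mean(ratings)
    if n == 1:
        # A single rating gives no spread estimate.
        return {"mean": mean, "ci_low": None, "ci_high": None, "n": 1}
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return {"mean": mean, "ci_low": mean - half_width, "ci_high": mean + half_width, "n": n}

# Hypothetical ratings from five listeners on the 1-5 scale.
summary = mos_summary([4, 5, 3, 4, 4])
print(summary)
```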


---
## 8. Main Evaluation Pipeline

Orchestrates the full comparison workflow.

In [62]:
class TTSComparisonFramework:
    """Main framework for comparing TTS systems."""
    
    def __init__(self, config: Config):
        self.config = config
        self.providers: Dict[str, TTSProvider] = {}
        self.results: Dict[str, List[TTSResult]] = {}  # {test_case_id: [results]}
        self.llm_evaluations: Dict[str, Dict] = {}
        self.human_evaluator = HumanEvaluator()
    
    def register_provider(self, name: str, provider: TTSProvider):
        """Register a TTS provider."""
        if provider.is_available():
            self.providers[name] = provider
            print(f"✓ Registered: {name}")
        else:
            print(f"✗ {name} not available (missing credentials or dependencies)")
    
    def run_generation(self, test_cases: List[TestCase] = None, 
                       providers: List[str] = None) -> Dict[str, List[TTSResult]]:
        """Generate audio for all test cases with all providers."""
        
        test_cases = test_cases or TEST_CASES
        providers = providers or list(self.providers.keys())
        
        print(f"\n{'='*60}")
        print("Running TTS generation")
        print(f"Test cases: {len(test_cases)} | Providers: {len(providers)}")
        print(f"{'='*60}\n")
        
        for tc in test_cases:
            print(f"\n--- {tc.id}: {tc.name} ---")
            self.results[tc.id] = []
            
            for provider_name in providers:
                if provider_name not in self.providers:
                    continue
                
                provider = self.providers[provider_name]
                print(f"  Generating with {provider_name}...", end=" ")
                
                result = provider.generate(tc)
                self.results[tc.id].append(result)
                
                if result.error:
                    print(f"ERROR: {result.error}")
                else:
                    print(f"OK ({result.generation_time:.2f}s, RTF: {result.realtime_factor:.1f}x)")
        
        return self.results
    
    def run_llm_evaluation(self, test_cases: List[TestCase] = None) -> Dict[str, Dict]:
        """Run LLM-based evaluation on generated results."""
        
        test_cases = test_cases or TEST_CASES
        judge = LLMJudge(self.config)
        
        print(f"\n{'='*60}")
        print("Running LLM Evaluation")
        print(f"{'='*60}\n")
        
        for tc in test_cases:
            if tc.id not in self.results:
                print(f"Skipping {tc.id} - no results")
                continue
            
            print(f"Evaluating {tc.id}...", end=" ")
            evaluation = judge.evaluate(tc, self.results[tc.id])
            self.llm_evaluations[tc.id] = evaluation
            
            if "error" in evaluation:
                print(f"ERROR: {evaluation['error']}")
            else:
                print(f"OK - Winner: {(evaluation.get('ranking') or ['N/A'])[0]}")
        
        return self.llm_evaluations
    
    def display_human_evaluation(self, test_case_id: str):
        """Display human evaluation widget for a test case."""
        tc = next((t for t in TEST_CASES if t.id == test_case_id), None)
        if not tc:
            print(f"Test case {test_case_id} not found")
            return
        
        if test_case_id not in self.results:
            print(f"No results for {test_case_id} - run generation first")
            return
        
        widget = self.human_evaluator.create_evaluation_widget(
            tc, self.results[test_case_id]
        )
        display(widget)
    
    def generate_report(self) -> str:
        """Generate a comprehensive comparison report."""
        
        report = []
        report.append("# TTS Comparison Report")
        report.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        
        # Summary statistics
        report.append("## Summary Statistics\n")
        
        all_results = []
        for tc_id, results in self.results.items():
            for r in results:
                if not r.error:
                    all_results.append({
                        "test_case": tc_id,
                        "provider": r.provider,
                        "duration": r.duration,
                        "gen_time": r.generation_time,
                        "rtf": r.realtime_factor,
                        "cost": r.estimated_cost,
                    })
        
        if all_results:
            df = pd.DataFrame(all_results)
            summary = df.groupby("provider").agg({
                "gen_time": "mean",
                "rtf": "mean",
                "cost": "sum"
            }).round(3)
            report.append(summary.to_markdown())
            report.append("\n")
        
        # LLM Evaluation Results
        if self.llm_evaluations:
            report.append("## LLM Evaluation Results\n")
            
            wins = {}
            scores = {}
            
            for tc_id, eval_result in self.llm_evaluations.items():
                if "error" not in eval_result:
                    ranking = eval_result.get("ranking", [])
                    if ranking:
                        winner = ranking[0]
                        wins[winner] = wins.get(winner, 0) + 1
                    
                    for provider, score in eval_result.get("scores", {}).items():
                        if provider not in scores:
                            scores[provider] = []
                        scores[provider].append(score)
            
            report.append("### Wins per Provider")
            for provider, count in sorted(wins.items(), key=lambda x: -x[1]):
                report.append(f"- {provider}: {count} wins")
            
            report.append("\n### Average Scores")
            for provider, score_list in scores.items():
                avg = np.mean(score_list)
                report.append(f"- {provider}: {avg:.2f}/10")
            report.append("\n")
        
        # Human Evaluation Results
        human_summary = self.human_evaluator.get_summary()
        if not human_summary.empty:
            report.append("## Human Evaluation Results\n")
            report.append(human_summary.to_markdown())
            report.append("\n")
        
        # Cost Analysis
        report.append("## Cost Analysis\n")
        report.append("| Provider | Pricing Model | Est. Cost per 1M chars |")
        report.append("|----------|---------------|------------------------|")
        report.append("| OpenAI gpt-4o-mini-tts | Per character | ~$30 |")
        report.append("| Qwen3-TTS | Open source | $0 (+ compute) |")
        report.append("\n")
        
        return "\n".join(report)

print("✓ TTS Comparison Framework defined")

✓ TTS Comparison Framework defined
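
The win counting and score averaging inside `generate_report` can be tried in isolation. This is a standalone sketch with toy data: `aggregate_llm_evaluations` is a hypothetical function name, and the rankings and scores below are made up for illustration rather than taken from a real run.

```python
from collections import Counter, defaultdict
from statistics import mean

def aggregate_llm_evaluations(evaluations: dict):
    """Count first-place finishes and average the scores per provider."""
    wins = Counter()
    scores = defaultdict(list)
    for result in evaluations.values():
        if "error" in result:
            continue  # failed evaluations contribute nothing
        ranking = result.get("ranking") or []
        if ranking:
            wins[ranking[0]] += 1
        for provider, score in result.get("scores", {}).items():
            scores[provider].append(score)
    averages = {provider: mean(values) for provider, values in scores.items()}
    return dict(wins), averages

# Toy evaluations keyed by test case ID (values are illustrative only).
toy = {
    "neutral_01": {"ranking": ["OpenAITTS", "Qwen3TTS"], "scores": {"OpenAITTS": 8, "Qwen3TTS": 7}},
    "neutral_02": {"ranking": ["Qwen3TTS", "OpenAITTS"], "scores": {"OpenAITTS": 6, "Qwen3TTS": 8}},
    "emotion_01": {"error": "timeout"},
}
wins, averages = aggregate_llm_evaluations(toy)
print(wins, averages)
```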


---
## 9. Running the Comparison

Execute the full comparison pipeline.

In [55]:
# Initialize the framework
framework = TTSComparisonFramework(config)

# Register available providers
print("Registering TTS providers...\n")

# OpenAI TTS (Cloud API)
framework.register_provider("OpenAI", OpenAITTS(config))

# Qwen3-TTS (Open Source, requires GPU)
# https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
framework.register_provider("Qwen3TTS", Qwen3TTS(config, load_model=True))

print(f"\nRegistered providers: {list(framework.providers.keys())}")

Registering TTS providers...

✓ Registered: OpenAI
Loading Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice...



✓ Model loaded: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
  Supported speakers: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']
  Supported languages: ['auto', 'chinese', 'english', 'french', 'german', 'italian', 'japanese', 'korean', 'portuguese', 'russian', 'spanish']
✓ Registered: Qwen3TTS

Registered providers: ['OpenAI', 'Qwen3TTS']


In [56]:
# Select test cases for evaluation (can run subset for quick testing)
selected_test_cases = TEST_CASES[:3]  # Start with first 3 for demo

print("Selected test cases:")
for tc in selected_test_cases:
    print(f"  - {tc.id}: {tc.name} ({tc.category})")

Selected test cases:
  - neutral_01: News Article (neutral)
  - neutral_02: Technical Documentation (neutral)
  - emotion_01: Excited Announcement (emotional)


In [57]:
# Run generation (requires configured API keys and loaded models)
results = framework.run_generation(selected_test_cases)

# List the expected output file for each provider and test case
print("\nExpected output for each test case:")
for tc in selected_test_cases:
    print(f"\n{tc.id}:")
    for provider in framework.providers:
        print(f"  - {provider}: outputs/{provider.lower()}_{tc.id}.wav")


Running TTS generation
Test cases: 3 | Providers: 2


--- neutral_01: News Article ---
  Generating with OpenAI... OK (3.66s, RTF: 4.6x)
  Generating with Qwen3TTS... OK (12.92s, RTF: 1.2x)

--- neutral_02: Technical Documentation ---
  Generating with OpenAI... OK (2.54s, RTF: 4.9x)
  Generating with Qwen3TTS... OK (9.43s, RTF: 1.2x)

--- emotion_01: Excited Announcement ---
  Generating with OpenAI... OK (2.21s, RTF: 4.2x)
  Generating with Qwen3TTS... OK (7.11s, RTF: 1.2x)

Expected output for each test case:

neutral_01:
  - OpenAI: outputs/openai_neutral_01.wav
  - Qwen3TTS: outputs/qwen3tts_neutral_01.wav

neutral_02:
  - OpenAI: outputs/openai_neutral_02.wav
  - Qwen3TTS: outputs/qwen3tts_neutral_02.wav

emotion_01:
  - OpenAI: outputs/openai_emotion_01.wav
  - Qwen3TTS: outputs/qwen3tts_emotion_01.wav


In [None]:
# Run LLM evaluation (requires the generation results from the previous cell)
llm_results = framework.run_llm_evaluation(selected_test_cases)


Running LLM Evaluation

Evaluating neutral_01... OK - Winner: OpenAITTS
Evaluating neutral_02... OK - Winner: Qwen3TTS
Evaluating emotion_01... OK - Winner: Qwen3TTS


In [None]:
# Display the human evaluation interface (requires the generation results)
framework.display_human_evaluation("neutral_01")

VBox(children=(HTML(value='\n        <h3>Test Case: News Article</h3>\n        <p><b>Category:</b> neutral</p>…


In [63]:
# Generate the final report (run after all evaluations)
report = framework.generate_report()
display(Markdown(report))

# TTS Comparison Report
Generated: 2026-02-01 22:23:04

## Summary Statistics

| provider   |   gen_time |   rtf |   cost |
|:-----------|-----------:|------:|-------:|
| OpenAITTS  |      2.803 | 4.588 |  0.018 |
| Qwen3TTS   |      9.82  | 1.213 |  0     |


## LLM Evaluation Results

### Wins per Provider
- Qwen3TTS: 2 wins
- OpenAITTS: 1 wins

### Average Scores
- OpenAITTS: 7.00/10
- Qwen3TTS: 8.00/10


## Human Evaluation Results

|    | test_case   | provider   |   votes | avg_mos   |   num_ratings |
|---:|:------------|:-----------|--------:|:----------|--------------:|
|  0 | neutral_01  | OpenAITTS  |       0 |           |             0 |
|  1 | neutral_01  | Qwen3TTS   |       0 |           |             0 |


## Cost Analysis

| Provider | Pricing Model | Est. Cost per 1M chars |
|----------|---------------|------------------------|
| OpenAI gpt-4o-mini-tts | Per character | ~$30 |
| Qwen3-TTS | Open source | $0 (+ compute) |
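
The per-character rate in the table translates into a per-request estimate as sketched below. This assumes simple linear billing at roughly $30 per million characters, which appears consistent with the `$0.0017` cost printed for the 58-character sentence in the quick-start section; `estimate_cost` is a hypothetical helper, and actual provider pricing may differ or change.

```python
PRICE_PER_MILLION_CHARS_USD = 30.0  # approximate rate from the table above (assumption)

def estimate_cost(text: str, price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Estimated API cost in USD for synthesizing `text`, billed per character."""
    return len(text) * price_per_million / 1_000_000

sample = "Hello! Welcome to our text-to-speech comparison framework."
print(f"{len(sample)} chars -> ${estimate_cost(sample):.4f}")
```

For the local model the API cost is zero, so a fair comparison would also need to price GPU time, which this sketch does not attempt.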



---
## 10. Quick Start Example

A minimal example showing how to generate and compare audio.

In [64]:
def quick_comparison(text: str, style: str = "") -> None:
    """Quick comparison of a single text across all available providers."""
    
    test_case = TestCase(
        id="quick_test",
        name="Quick Test",
        text=text,
        category="custom",
        style_instruction=style
    )
    
    print(f"Text: \"{text}\"")
    print(f"Style: {style or 'None'}")
    print("\n" + "="*50 + "\n")
    
    for name, provider in framework.providers.items():
        if not provider.is_available():
            print(f"{name}: Not available")
            continue
        
        print(f"Generating with {name}...")
        result = provider.generate(test_case)
        
        if result.error:
            print(f"  Error: {result.error}")
        else:
            print(f"  Duration: {result.duration:.2f}s")
            print(f"  Generation time: {result.generation_time:.2f}s")
            print(f"  Realtime factor: {result.realtime_factor:.1f}x")
            print(f"  Cost: ${result.estimated_cost:.4f}")
            print(f"  Saved to: {result.audio_path}")
            
            # Display audio player
            display(Audio(result.audio_data, rate=result.sample_rate))
        
        print()

# Example usage (requires configured API keys and registered providers):
quick_comparison(
    "Hello! Welcome to our text-to-speech comparison framework.",
    style="Speak in a warm, friendly tone"
)

Text: "Hello! Welcome to our text-to-speech comparison framework."
Style: Speak in a warm, friendly tone


Generating with OpenAI...
  Duration: 4.00s
  Generation time: 1.16s
  Realtime factor: 3.5x
  Cost: $0.0017
  Saved to: outputs/openaitts_quick_test.wav





Generating with Qwen3TTS...
  Duration: 3.42s
  Generation time: 2.87s
  Realtime factor: 1.2x
  Cost: $0.0000
  Saved to: outputs/qwen3tts_quick_test.wav
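
Judging by the figures printed above, the notebook's realtime factor is audio duration divided by generation time, so higher is faster. Many papers define RTF as the inverse ratio, where lower is better. A minimal sketch of the convention used here follows; `realtime_factor` is a hypothetical standalone function, and because it is fed the rounded values from the output, the last digit can differ from what the notebook printed.

```python
def realtime_factor(audio_duration_s: float, generation_time_s: float) -> float:
    """Seconds of audio produced per second of compute; above 1 is faster than real time."""
    if generation_time_s <= 0:
        raise ValueError("generation time must be positive")
    return audio_duration_s / generation_time_s

# Rounded figures copied from the quick-comparison output above.
for name, duration, gen_time in [("OpenAI", 4.00, 1.16), ("Qwen3TTS", 3.42, 2.87)]:
    print(f"{name}: {realtime_factor(duration, gen_time):.2f}x")
```

Note that the cloud figure includes network latency while the local figure depends on the GPU in use, so the two numbers measure somewhat different things.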





---
## 11. Voice Cloning Example (Qwen3-TTS)

Demonstrates the voice cloning capability of the Qwen3-TTS Base model, using a short reference clip and its transcript.

In [67]:
def voice_cloning_demo(reference_audio_path: str, ref_text: str, text: str) -> None:
    """Demonstrate voice cloning with Qwen3-TTS Base model.
    
    Qwen3-TTS can clone a voice from just 3 seconds of audio!
    Requires: Qwen/Qwen3-TTS-12Hz-1.7B-Base model
    
    Args:
        reference_audio_path: Path to reference audio (or URL)
        ref_text: Transcript of the reference audio
        text: Text to synthesize with the cloned voice
    """
    import tempfile
    import urllib.request
    
    # Check if Qwen3 provider is available
    if "Qwen3TTS" not in framework.providers:
        print("Qwen3-TTS provider not registered")
        return
    
    qwen = framework.providers["Qwen3TTS"]
    
    # Load the Base model for voice cloning
    if not qwen.model or qwen.model_type != "base":
        print("Loading Qwen3-TTS Base model for voice cloning...")
        qwen._load_model("base")
    
    if not qwen.model:
        print("Failed to load model")
        return
    
    # Load reference audio (handle URLs) as mono, resampled to 24 kHz
    print(f"Loading reference audio: {reference_audio_path}")
    if reference_audio_path.startswith(("http://", "https://")):
        # Download to a temp file, then delete it once decoded
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
            tmp_path = tmp.name
        try:
            urllib.request.urlretrieve(reference_audio_path, tmp_path)
            reference_audio, sr = librosa.load(tmp_path, sr=24000, mono=True)
        finally:
            Path(tmp_path).unlink(missing_ok=True)
    else:
        reference_audio, sr = librosa.load(reference_audio_path, sr=24000, mono=True)
    
    # Trim to ~3 seconds; ref_text should then cover only the trimmed portion
    max_samples = 3 * 24000
    if len(reference_audio) > max_samples:
        reference_audio = reference_audio[:max_samples]
        print("Trimmed to 3 seconds for optimal cloning")
    
    print(f"Reference duration: {len(reference_audio)/24000:.2f}s")
    
    # Create test case
    test_case = TestCase(
        id="clone_test",
        name="Voice Cloning Test",
        text=text,
        category="cloning"
    )
    
    # Generate with voice cloning
    print(f"\nGenerating: \"{text}\"")
    result = qwen.generate(test_case, reference_audio=reference_audio, ref_text=ref_text)
    
    if result.error:
        print(f"Error: {result.error}")
    else:
        print(f"Generated in {result.generation_time:.2f}s")
        print("\nReference audio:")
        display(Audio(reference_audio, rate=24000))
        print("\nCloned voice output:")
        display(Audio(result.audio_data, rate=result.sample_rate))

# Example usage (from Qwen3-TTS documentation):
ref_audio_url = "https://k57oanifx8xopgjg.public.blob.vercel-storage.com/prod/audio/34611_echo.mp3"
ref_text = "Okay. Yeah. I resent you. I love you. I respect you."
voice_cloning_demo(ref_audio_url, ref_text, "This is my cloned voice speaking!")

Loading reference audio: https://k57oanifx8xopgjg.public.blob.vercel-storage.com/prod/audio/34611_echo.mp3




Trimmed to 3 seconds for optimal cloning
Reference duration: 3.00s

Generating: "This is my cloned voice speaking!"
Generated in 13.05s

Reference audio:



Cloned voice output:


---
## 12. Voice Design Example (Qwen3-TTS)

Create a custom voice from a text description.

In [68]:
def voice_design_demo(voice_description: str, text: str, language: str = "English") -> None:
    """Demonstrate voice design with Qwen3-TTS VoiceDesign model.
    
    Create a custom voice from a natural language description!
    Requires: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign model
    
    Args:
        voice_description: Natural language description of desired voice
        text: Text to synthesize
        language: Target language (English, Chinese, Japanese, etc.)
    """
    
    if "Qwen3TTS" not in framework.providers:
        print("Qwen3-TTS provider not registered")
        return
    
    qwen = framework.providers["Qwen3TTS"]
    
    # Load the VoiceDesign model
    if not qwen.model or qwen.model_type != "voice_design":
        print("Loading Qwen3-TTS VoiceDesign model...")
        qwen._load_model("voice_design")
    
    if not qwen.model:
        print("Failed to load model")
        return
    
    print(f"Voice description: \"{voice_description}\"")
    print(f"Text: \"{text}\"")
    print(f"Language: {language}")
    
    # Generate directly using the model's API
    try:
        wavs, sr = qwen.model.generate_voice_design(
            text=text,
            language=language,
            instruct=voice_description,
        )
        
        audio_data = wavs[0] if isinstance(wavs, list) else wavs
        print("\n✓ Generated successfully")
        display(Audio(audio_data, rate=sr))
        
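        # Save the audio. Note: hash() of a str is salted per Python process,
        # so this numeric suffix is not stable across sessions.
        output_path = config.output_dir / f"voice_design_{hash(voice_description) % 10000}.wav"
        sf.write(output_path, audio_data, sr)
        print(f"Saved to: {output_path}")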
        
    except Exception as e:
        print(f"Error: {e}")

# Example voice designs to try (from Qwen3-TTS docs):
VOICE_DESIGNS = [
    "A warm, friendly female narrator with a calm demeanor",
    "Male, 17 years old, tenor range, gaining confidence",
    "Seasoned male voice with a low, mellow timbre suitable for audiobooks",
    "Playful female voice with a light, nimble timbre",
    "Dynamic male voice with strong rhythmic drive for announcements",
]

# Example usage:
voice_design_demo(
    VOICE_DESIGNS[0],
    "Welcome to today's episode. We'll be exploring the fascinating world of AI."
)

Loading Qwen3-TTS VoiceDesign model...
Loading Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign...


Setting `pad_token_id` to `eos_token_id`:2150 for open-end generation.


✓ Model loaded: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
  Supported speakers: []
  Supported languages: ['auto', 'chinese', 'english', 'french', 'german', 'italian', 'japanese', 'korean', 'portuguese', 'russian', 'spanish']
Voice description: "A warm, friendly female narrator with a calm demeanor"
Text: "Welcome to today's episode. We'll be exploring the fascinating world of AI."
Language: English

✓ Generated successfully


Saved to: outputs/voice_design_6816.wav
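
The numeric suffix in the saved filename comes from Python's built-in `hash()`, which is randomized per process for strings, so the same description can map to a different file in another session. A minimal sketch of a reproducible alternative follows, assuming a content digest is acceptable as the suffix. The helper name `stable_voice_filename` is hypothetical and not part of the notebook.

```python
import hashlib


def stable_voice_filename(voice_description: str, digits: int = 8) -> str:
    """Build a filename that is identical across Python sessions.

    Hypothetical helper: uses a SHA-256 digest of the description
    instead of the per-process salted built-in hash().
    """
    digest = hashlib.sha256(voice_description.encode("utf-8")).hexdigest()
    return f"voice_design_{digest[:digits]}.wav"


name = stable_voice_filename("A warm, friendly female narrator with a calm demeanor")
print(name)
```

The result could be joined with `config.output_dir` in place of the `hash(...) % 10000` expression if stable filenames matter for later comparison.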


---
## 13. Batch Processing Utility

Process multiple texts efficiently.

In [69]:
def batch_generate(texts: List[str], provider_name: str = "Qwen3TTS",
                   style: str = "", output_dir: str = "./batch_outputs") -> List[Path]:
    """Generate audio for multiple texts."""
    
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    if provider_name not in framework.providers:
        print(f"Provider {provider_name} not available")
        return []
    
    provider = framework.providers[provider_name]
    outputs = []
    total_cost = 0
    total_time = 0
    
    print(f"Processing {len(texts)} texts with {provider_name}...\n")
    
    for i, text in enumerate(texts):
        test_case = TestCase(
            id=f"batch_{i:04d}",
            name=f"Batch item {i}",
            text=text,
            category="batch",
            style_instruction=style
        )
        
        result = provider.generate(test_case)
        
        if result.error:
            print(f"  [{i+1}/{len(texts)}] Error: {result.error}")
        else:
            # Save to batch output directory
            final_path = output_path / f"{provider_name.lower()}_{i:04d}.wav"
            sf.write(final_path, result.audio_data, result.sample_rate)
            outputs.append(final_path)
            total_cost += result.estimated_cost
            total_time += result.generation_time
            print(f"  [{i+1}/{len(texts)}] {final_path.name} ({result.duration:.1f}s audio)")
    
    print(f"\n{'='*40}")
    print(f"Completed: {len(outputs)}/{len(texts)} successful")
    print(f"Total generation time: {total_time:.1f}s")
    print(f"Total estimated cost: ${total_cost:.4f}")
    
    return outputs

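# Example usage. Qwen3TTS needs its CustomVoice model loaded here: if the VoiceDesign
# model from the voice design demo is still active, each item fails with the
# generate_custom_voice error shown in the output.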
texts_to_convert = [
    "This is the first sentence.",
    "Here is another sentence to convert.",
    "And a third one for good measure."
]
outputs = batch_generate(texts_to_convert, "Qwen3TTS", style="Speak naturally")

Processing 3 texts with Qwen3TTS...

  [1/3] Error: model with 
tokenizer_type: qwen3_tts_tokenizer_12hz
tts_model_size: 1b7
tts_model_type: voice_design
does not support generate_custom_voice, Please check Model Card or Readme for more details.
  [2/3] Error: model with 
tokenizer_type: qwen3_tts_tokenizer_12hz
tts_model_size: 1b7
tts_model_type: voice_design
does not support generate_custom_voice, Please check Model Card or Readme for more details.
  [3/3] Error: model with 
tokenizer_type: qwen3_tts_tokenizer_12hz
tts_model_size: 1b7
tts_model_type: voice_design
does not support generate_custom_voice, Please check Model Card or Readme for more details.

Completed: 0/3 successful
Total generation time: 0.0s
Total estimated cost: $0.0000


---
## 14. Export Results

Save evaluation results and generated audio.

In [70]:
def export_results(output_dir: str = "./evaluation_results"):
    """Export all evaluation results to files."""
    
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # Export generation results
    if framework.results:
        results_data = []
        for tc_id, results in framework.results.items():
            for r in results:
                results_data.append({
                    "test_case": tc_id,
                    "provider": r.provider,
                    "duration": r.duration if not r.error else None,
                    "gen_time": r.generation_time,
                    "rtf": r.realtime_factor if not r.error else None,
                    "cost": r.estimated_cost,
                    "chars": r.character_count,
                    "error": r.error,
                    "audio_path": str(r.audio_path) if r.audio_path else None,
                })
        
        df = pd.DataFrame(results_data)
        csv_path = output_path / f"generation_results_{timestamp}.csv"
        df.to_csv(csv_path, index=False)
        print(f"Saved generation results: {csv_path}")
    
    # Export LLM evaluations
    if framework.llm_evaluations:
        json_path = output_path / f"llm_evaluations_{timestamp}.json"
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(framework.llm_evaluations, f, indent=2)
        print(f"Saved LLM evaluations: {json_path}")
    
    # Export human evaluations
    human_df = framework.human_evaluator.get_summary()
    if not human_df.empty:
        human_csv = output_path / f"human_evaluations_{timestamp}.csv"
        human_df.to_csv(human_csv, index=False)
        print(f"Saved human evaluations: {human_csv}")
    
    # Generate and save report
    report = framework.generate_report()
    report_path = output_path / f"comparison_report_{timestamp}.md"
    with open(report_path, 'w', encoding='utf-8') as f:
        f.write(report)
    print(f"Saved report: {report_path}")
    
    print(f"\nAll results exported to: {output_path.absolute()}")

# Export all results (comment out to skip):
export_results()

Saved generation results: evaluation_results/generation_results_20260201_223838.csv
Saved LLM evaluations: evaluation_results/llm_evaluations_20260201_223838.json
Saved human evaluations: evaluation_results/human_evaluations_20260201_223838.csv
Saved report: evaluation_results/comparison_report_20260201_223838.md

All results exported to: /home/doran/jupyterlab/article-scripts/speech-synthesis-comparison/evaluation_results


---
## Summary

This notebook compares **OpenAI TTS** vs **Qwen3-TTS** (open source):

### Providers
- **OpenAI** (gpt-4o-mini-tts): Cloud API, excellent quality, style control
- **Qwen3-TTS** (Open Source): Voice cloning, voice design, local deployment

### Evaluation Methods
1. **Automated (LLM-as-Judge)**: Analyzes prosodic features and compares quality
2. **Human Evaluation**: Interactive voting and MOS rating interface
3. **Audio Analysis**: Pitch, energy, and spectral feature extraction
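
A MOS (mean opinion score) is conventionally the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below shows that calculation with a rough 95% confidence interval. It is independent of the notebook's `human_evaluator`, the function name is hypothetical, and the ratings are illustrative values rather than results from this comparison.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, 95% CI half-width) for ratings on the 1-5 scale.

    The interval uses the normal approximation (1.96 * standard error);
    with fewer than two ratings the half-width is reported as 0.0.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.fmean(ratings)
    if len(ratings) < 2:
        return mos, 0.0
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Illustrative ratings only, not data collected in this notebook.
mos, ci = mean_opinion_score([4, 5, 3, 4, 4])
print(f"MOS = {mos:.2f} ± {ci:.2f}")
```

With only a handful of raters the interval is wide, which is worth keeping in mind when comparing two providers whose scores are close.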

### Key Findings
- Both systems produce near-human quality speech
- OpenAI costs ~$30 per 1M characters, whereas Qwen3-TTS has no per-character fee (the cost is local GPU compute)
- Qwen3-TTS offers voice cloning from 3s audio samples
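
To make the per-character figure concrete, here is a small worked calculation. It is a sketch that takes the approximate $30 per 1M characters rate quoted above as its default; the function name is hypothetical, and actual billing depends on the provider's current pricing.

```python
def estimate_api_cost(char_count: int, usd_per_million_chars: float = 30.0) -> float:
    """Estimated USD cost of synthesizing `char_count` characters.

    The default rate is the approximate figure quoted in this summary,
    not an authoritative price.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count * usd_per_million_chars / 1_000_000


text = "Welcome to today's episode. We'll be exploring the fascinating world of AI."
cost = estimate_api_cost(len(text))
print(f"{len(text)} chars -> ${cost:.6f}")
```

For the local model the equivalent estimate would be GPU time multiplied by an hourly hardware or rental rate, which this sketch does not attempt to model.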

### Next Steps
1. Set `OPENAI_API_KEY` in `.env`
2. Run generation on test cases
3. Compare results

In [71]:
print("\n" + "="*60)
print("TTS Comparison Framework Ready!")
print("="*60)
print("\nConfiguration:")
print(f"  Output directory: {config.output_dir.absolute()}")
print(f"  Registered providers: {list(framework.providers.keys())}")
print(f"  Test cases defined: {len(TEST_CASES)}")
print("\nTo get started:")
print("  1. Set OPENAI_API_KEY")
print("  2. Run: framework.run_generation()")
print("  3. Run: framework.run_llm_evaluation()")
print("  4. Use: framework.display_human_evaluation('test_case_id')")
print("  5. Generate report: framework.generate_report()")


TTS Comparison Framework Ready!

Configuration:
  Output directory: /home/doran/jupyterlab/article-scripts/speech-synthesis-comparison/outputs
  Registered providers: ['OpenAI', 'Qwen3TTS']
  Test cases defined: 11

To get started:
  1. Set OPENAI_API_KEY
  2. Run: framework.run_generation()
  3. Run: framework.run_llm_evaluation()
  4. Use: framework.display_human_evaluation('test_case_id')
  5. Generate report: framework.generate_report()
