# 🎤 Enhanced Voice Cloning with Zonos TTS - Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Wamp1re-Ai/Zonos/blob/main/Enhanced_Voice_Cloning_Colab.ipynb)

This Google Colab notebook provides an **enhanced voice cloning system** using Zonos TTS. It's designed for ease of use within the Colab environment and offers several improvements over standard voice cloning approaches, focusing on naturalness, consistency, and control.

**Key improvements include:**
- ✅ Smooth, natural speech flow (reduced unnatural pauses and timing issues).
- ✅ Consistent speaking rate.
- ✅ Clear, intelligible speech (reduced gibberish generation).
- ✅ Stable voice reproduction.

## 🚀 Features:
- 🔧 **Advanced Audio Preprocessing**: Automatic silence removal and normalization for uploaded voice samples.
- 📊 **Voice Quality Analysis**: SNR estimation and quality scoring for your voice samples.
- ⚙️ **Optimized & Customizable Parameters**: Choose from Quality Presets for balanced results or fine-tune for specific needs. Includes options for faster generation (lower CFG Scales) and emotional expressiveness.
- 🎯 **Adaptive Settings**: Parameters automatically adjust based on the quality of your voice sample and chosen preset.
- 🔄 **Reproducible Results**: Seed support for consistent audio generation.
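
As a rough illustration of the adaptive settings, the sketch below mirrors the clamping arithmetic that Cell 5 of this notebook applies for the "Balanced" preset. The helper names `clamp` and `adapt_balanced_params` are hypothetical and exist only for this example; the base values and clamp ranges are copied from Cell 5.

```python
def clamp(value, low, high):
    """Limit value to the closed range [low, high]."""
    return max(low, min(high, value))


def adapt_balanced_params(quality_score=0.7, snr_estimate=20.0):
    """Scale the 'Balanced' base parameters by voice-sample quality.

    Hypothetical helper: the defaults (0.7, 20.0) are the fallbacks
    Cell 5 uses when no quality metrics are available.
    """
    base_pitch, base_rate, base_min_p, base_temp = 12.0, 12.0, 0.04, 0.75

    # A better quality score widens pitch variation and sampling freedom.
    quality_factor = clamp(quality_score * 1.2, 0.8, 1.2)
    # A cleaner recording (higher SNR) allows a slightly faster speaking rate.
    snr_factor = clamp((snr_estimate - 15.0) / 20.0 + 1.0, 0.9, 1.1)

    return {
        "pitch_std": clamp(base_pitch * quality_factor, 5.0, 25.0),
        "speaking_rate": clamp(base_rate * snr_factor, 8.0, 18.0),
        "min_p": clamp(base_min_p * quality_factor, 0.01, 0.15),
        "temperature": clamp(base_temp * quality_factor, 0.5, 1.0),
    }


if __name__ == "__main__":
    print(adapt_balanced_params(quality_score=0.9, snr_estimate=25.0))
```

Because both factors are clamped to a narrow band (0.8 to 1.2 and 0.9 to 1.1), a poor sample only nudges the preset toward more conservative values rather than overriding it.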

---

## 📋 Instructions:
1. **Run Cell 1 (Setup)**: Clones the Zonos repository and sets up the Colab environment.
2. **Run Cell 2 (Install Dependencies)**: Installs necessary Python packages using UV for speed.
3. **Run Cell 3 (Load Model)**: Loads the Zonos TTS model. **IMPORTANT:** If you modify underlying model code (e.g., `zonos/model.py`), you MUST re-run this cell for changes to take effect.
4. **Run Cell 4 (Upload Voice Sample)**: Upload a 10-30 second audio file of the voice you want to clone.
5. **Run Cell 5 (Generate Speech)**: Generate speech using your cloned voice and selected Quality Preset.
6. **Run Cell 6 (Run Benchmarks - Optional)**: Test generation speed and quality with different CFG Scales.
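
To check a voice sample before uploading it in step 4, something like the following sketch could be used. It applies the same thresholds Cell 4 reports on (under 5 seconds is flagged as short, over 30 seconds as long); the function name `classify_sample_duration` and the labels it returns are assumptions made for this example.

```python
def classify_sample_duration(num_samples, sample_rate):
    """Return (duration_seconds, verdict) for a voice sample.

    Hypothetical helper; thresholds follow the messages printed in Cell 4.
    """
    duration = num_samples / sample_rate
    if duration < 5:
        verdict = "short"  # Cell 4 suggests 10-20 seconds instead
    elif duration > 30:
        verdict = "long"   # Cell 4 says a portion is selected automatically
    else:
        verdict = "ok"
    return duration, verdict


if __name__ == "__main__":
    # 20 seconds of audio sampled at 16 kHz
    print(classify_sample_duration(16000 * 20, 16000))
```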

**Troubleshooting Note**: If you encounter NumPy-related errors, especially after installing dependencies, try restarting the Colab Runtime (`Runtime` > `Restart runtime` or `Factory reset runtime`) and then re-run cells from Cell 1. This usually resolves such issues.
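
A quick way to tell whether a restart is likely needed is to look at the major version of the NumPy that is actually loaded, which is what Cells 2 and 3 do. The sketch below isolates that check; `numpy_needs_restart` is a hypothetical name, and it assumes (as the notebook does by pinning `numpy==1.26.4`) that any 2.x release is the problematic case.

```python
def numpy_needs_restart(version_string):
    """Return True if the given NumPy version string is 2.x or newer.

    Hypothetical helper mirroring the major-version check in Cells 2 and 3.
    """
    major = int(version_string.split(".")[0])
    return major >= 2


if __name__ == "__main__":
    # In the notebook this would be called as numpy_needs_restart(np.__version__).
    print(numpy_needs_restart("1.26.4"))
```

If the check still reports a 2.x version after Cell 2 has run, the old NumPy is most likely still loaded in memory, which is why a runtime restart is the suggested fix.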

In [None]:
#@title 1. 📥 Setup and Clone Repository
import os
import subprocess
import sys

print("🚀 Enhanced Voice Cloning Setup")
print("=" * 40)

# Check if we're in Colab
try:
    import google.colab
    IN_COLAB = True
    print("✅ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("⚠️ Not running in Google Colab")

# Clone the repository if it doesn't exist
if not os.path.exists('Zonos'):
    print("\n📥 Cloning Zonos repository...")
    !git clone https://github.com/Wamp1re-Ai/Zonos.git
    print("✅ Repository cloned successfully!")
else:
    print("\n✅ Repository already exists!")

# Change to the Zonos directory
%cd Zonos

# Install system dependencies
print("\n🔧 Installing system dependencies...")
!apt-get update -qq
!apt-get install -y espeak-ng git-lfs -qq
!git lfs install
print("✅ System dependencies installed!")

# Check for enhanced files
if os.path.exists('enhanced_voice_cloning.py'):
    print("\n🚀 Enhanced voice cloning files detected!")
    print("You have access to all the latest improvements.")
else:
    print("\n⚠️ Enhanced files not found. Using standard voice cloning.")

print("\n✅ Setup complete! Continue to Cell 2.")

In [None]:
#@title 2. ⚡ Install Dependencies with UV (Ultra-Fast Installation)
import subprocess
import sys
import os
import time

print("⚡ Ultra-Fast Dependency Installation with UV")
print("=" * 50)

start_time = time.time()

# Step 1: Install UV for ultra-fast package management
print("\n🚀 Step 1: Installing UV (Rust-based package manager)...")
try:
    # Check if uv is already installed
    result = subprocess.run(['uv', '--version'], capture_output=True, text=True)
    if result.returncode == 0:
        print(f"✅ UV already installed: {result.stdout.strip()}")
    else:
        raise FileNotFoundError
except (FileNotFoundError, subprocess.CalledProcessError):
    print("📦 Installing UV...")
    !curl -LsSf https://astral.sh/uv/install.sh | sh
    # Add uv to PATH for the current session. Newer installers place the binary in
    # ~/.local/bin and older ones in ~/.cargo/bin, so include both locations.
    os.environ['PATH'] = f"/root/.local/bin:/root/.cargo/bin:{os.environ.get('PATH', '')}"
    print("✅ UV installed successfully!")

# Step 2: Fix NumPy compatibility FIRST
print("\n🔧 Step 2: Fixing NumPy compatibility (ultra-fast)...")
!uv pip install "numpy==1.26.4" --force-reinstall --system

# Verify NumPy installation
try:
    import numpy as np
    print(f"✅ NumPy {np.__version__} installed successfully")

    # Double-check version
    numpy_major = int(np.__version__.split('.')[0])
    if numpy_major >= 2:
        print("⚠️ NumPy 2.x still detected. This may require a runtime restart.")
        print("If you get errors in Cell 3, restart runtime and try again.")
    else:
        print("✅ NumPy version is now compatible with transformers")

except Exception as e:
    print(f"⚠️ NumPy verification failed: {e}")
    print("Continuing with installation...")

# Step 3: Install core dependencies with UV (much faster)
print("\n⚡ Step 3: Installing core dependencies with UV...")

# Check PyTorch (usually pre-installed in Colab)
try:
    import torch
    import torchaudio
    print(f"✅ PyTorch {torch.__version__} already available")
    print(f"✅ TorchAudio {torchaudio.__version__} already available")
except ImportError:
    print("📦 Installing PyTorch with UV...")
    !uv pip install torch torchaudio --system

# Install all other packages in one UV command (much faster than pip)
print("⚡ Installing all dependencies with UV (10x faster than pip)...")
!uv pip install "transformers>=4.45.0,<4.50.0" "huggingface-hub>=0.20.0" "soundfile>=0.12.1" "phonemizer>=3.2.0" "inflect>=7.0.0" "scipy" "ipywidgets>=8.0.0" --system

print("\n⚡ Step 4: Installing Zonos package with UV...")
# Shell magics (!cmd) do not raise Python exceptions on failure,
# so run uv through subprocess and check its exit code instead.
install_result = subprocess.run(['uv', 'pip', 'install', '-e', '.', '--system'])
if install_result.returncode == 0:
    print("✅ Zonos package installed successfully!")
else:
    print("⚠️ Package installation failed, adding to Python path...")
    current_dir = os.getcwd()
    if current_dir not in sys.path:
        sys.path.insert(0, current_dir)
    print(f"✅ Added {current_dir} to Python path")

installation_time = time.time() - start_time
print(f"\n🎉 All dependencies installed successfully in {installation_time:.1f} seconds!")
print(f"⚡ UV is ~10x faster than pip for package installation")
print("\n🚀 Ready for Cell 3: Load Model")
print("\n💡 Note: If Cell 3 gives NumPy errors:")
print("   1. Runtime → Restart runtime")
print("   2. Re-run Cell 1 and Cell 2")
print("   3. Then run Cell 3 again")
print("   This is normal and fixes the NumPy compatibility issue.")

In [None]:
#@title 3. 🤖 Load Enhanced Zonos Model
# IMPORTANT: If you have modified the underlying Zonos Python files (e.g., zonos/model.py),
# you MUST re-run this cell for those changes to take effect in the model.
import sys
import os

print("🤖 Loading Enhanced Zonos Model")
print("=" * 40)

# Make sure we can import zonos modules
current_dir = os.getcwd()
if current_dir not in sys.path:
    sys.path.insert(0, current_dir)

# Check NumPy version (should be fixed by Cell 2)
print("🔧 Verifying NumPy compatibility...")
try:
    import numpy as np
    numpy_version = np.__version__
    numpy_major = int(numpy_version.split('.')[0])
    print(f"NumPy version: {numpy_version}")
    
    if numpy_major >= 2:
        print("\n⚠️ WARNING: NumPy 2.x detected!")
        print("This may cause issues. If you get errors below:")
        print("1. Runtime → Restart runtime")
        print("2. Re-run Cell 1 and Cell 2")
        print("3. Try Cell 3 again")
        print("\nContinuing anyway...")
    else:
        print("✅ NumPy version is compatible")
        
except ImportError:
    print("❌ NumPy not found! Please run Cell 2 first.")
    raise

# Import PyTorch
print("\n📦 Loading PyTorch...")
try:
    import torch
    import torchaudio
    print(f"✅ PyTorch {torch.__version__}")
    print(f"✅ TorchAudio {torchaudio.__version__}")
except Exception as e:
    print(f"❌ PyTorch error: {e}")
    print("Please run Cell 2 to install dependencies.")
    raise

# Import transformers with better error handling
print("\n🤗 Loading Transformers...")
try:
    import transformers
    print(f"✅ Transformers {transformers.__version__}")
except Exception as e:
    error_msg = str(e)
    print(f"❌ Transformers error: {e}")

    if "numpy" in error_msg.lower() or "_center" in error_msg:
        print("\n🔧 This is the NumPy 2.x compatibility issue!")
        print("\n📋 SOLUTION:")
        print("1. Runtime → Restart runtime")
        print("2. Run Cell 1 (Setup)")
        print("3. Run Cell 2 (Dependencies)")
        print("4. Run Cell 3 (this cell) again")
        print("\nThis will fix the NumPy compatibility issue.")
    else:
        print("Please check your dependencies in Cell 2.")
    raise

# Try to import enhanced voice cloning modules
print("\n🚀 Loading Enhanced Voice Cloning...")
ENHANCED_AVAILABLE = False
try:
    # First check if the file exists
    import os
    if os.path.exists('enhanced_voice_cloning.py'):
        print("✓ Enhanced voice cloning file found")
        
        # Try importing the enhanced modules
        from enhanced_voice_cloning import (
            EnhancedVoiceCloner, 
            create_enhanced_voice_cloner, 
            quick_voice_clone
        )
        print("✅ Enhanced Voice Cloning modules loaded successfully!")
        ENHANCED_AVAILABLE = True

    else:
        print("⚠️ enhanced_voice_cloning.py not found in current directory")
        ENHANCED_AVAILABLE = False
        
except ImportError as e:
    print(f"⚠️ Enhanced modules import failed: {e}")
    print("This might be due to missing dependencies in the enhanced module.")
    print("Using standard voice cloning instead.")
    ENHANCED_AVAILABLE = False
except Exception as e:
    print(f"⚠️ Unexpected error loading enhanced modules: {e}")
    print("Using standard voice cloning instead.")
    ENHANCED_AVAILABLE = False

# Import standard Zonos modules
print("\n🎵 Loading Zonos modules...")
try:
    from zonos.model import Zonos
    from zonos.conditioning import make_cond_dict, supported_language_codes
    from zonos.utils import DEFAULT_DEVICE
    print("✅ Zonos modules loaded successfully!")
except ImportError as e:
    print(f"❌ Zonos import error: {e}")
    print("Make sure Cell 2 completed successfully.")
    raise

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\n🖥️ Using device: {device}")

if device.type == 'cuda':
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name} ({gpu_memory:.1f} GB)")
    torch.cuda.empty_cache()

# Load the model
model_name = "Zyphra/Zonos-v0.1-transformer"
print(f"\n📥 Loading model: {model_name}")
print("This may take 2-5 minutes for the first time...")

try:
    model = Zonos.from_pretrained(model_name, device=device)
    model.requires_grad_(False).eval()
    print("✅ Model loaded successfully!")

    # Model info
    total_params = sum(p.numel() for p in model.parameters())
    print(f"\n📊 Model Info:")
    print(f"  - Parameters: {total_params:,}")
    print(f"  - Device: {next(model.parameters()).device}")
    print(f"  - Enhanced features: {'✅ Available' if ENHANCED_AVAILABLE else '❌ Standard only'}")
    print(f"  - Languages: {len(supported_language_codes)} supported")
    
    # Create enhanced cloner if available
    if ENHANCED_AVAILABLE:
        print("\n🚀 Creating Enhanced Voice Cloner...")
        try:
            # Attempt to pass the model to the enhanced cloner if it accepts it
            try:
                enhanced_cloner = create_enhanced_voice_cloner(model=model, device=device)
            except TypeError:
                print("  (Enhanced cloner does not accept model directly, creating with device only)")
                enhanced_cloner = create_enhanced_voice_cloner(device=device)
            print("✅ Enhanced Voice Cloner ready!")
            globals()['enhanced_cloner'] = enhanced_cloner
        except Exception as e:
            print(f"⚠️ Failed to create enhanced cloner: {e}")
            print("Will create fallback enhanced functions...")
            ENHANCED_AVAILABLE = False # Fallback to simple if cloner fails
    
    # Create fallback enhanced functions using zonos.speaker_cloning
    if not ENHANCED_AVAILABLE:
        print("\n🔧 Creating fallback enhanced voice cloning functions...")
        try:
            from zonos.speaker_cloning import (
                preprocess_audio_for_cloning,
                analyze_voice_quality,
                get_voice_cloning_conditioning_params,
                get_voice_cloning_sampling_params
            )
            
            def simple_enhanced_clone_voice(wav, sr, **kwargs):
                processed_wav = preprocess_audio_for_cloning(
                    wav, sr,
                    target_length_seconds=kwargs.get('target_length_seconds', 20.0),
                    normalize=kwargs.get('normalize', True),
                    remove_silence=kwargs.get('remove_silence', True)
                )
                quality_metrics = analyze_voice_quality(processed_wav, sr)
                speaker_embedding = model.make_speaker_embedding(processed_wav, sr)
                speaker_embedding = speaker_embedding.to(device, dtype=torch.bfloat16)
                return speaker_embedding, quality_metrics
            
            # Modified simple_enhanced_generate_speech to accept emotion_vector
            def simple_enhanced_generate_speech(text, speaker_embedding=None, language='en-us', 
                                               voice_quality=None, seed=None, cfg_scale=2.0, 
                                               custom_conditioning_params=None, custom_sampling_params=None, 
                                               emotion_vector=None, **kwargs):
                if seed is not None:
                    torch.manual_seed(seed)
                conditioning_params = get_voice_cloning_conditioning_params(voice_quality)
                sampling_params = get_voice_cloning_sampling_params(voice_quality)
                if custom_conditioning_params:
                    conditioning_params.update(custom_conditioning_params)
                if custom_sampling_params:
                    sampling_params.update(custom_sampling_params)
                
                cond_dict_extra_args = {}
                if emotion_vector is not None:
                    cond_dict_extra_args['emotion'] = emotion_vector
                    
                cond_dict = make_cond_dict(
                    text=text, language=language, speaker=speaker_embedding,
                    device=device, **conditioning_params, **cond_dict_extra_args
                )
                conditioning = model.prepare_conditioning(cond_dict, cfg_scale=cfg_scale)
                sampling_dict = {k: v for k, v in sampling_params.items() if k in ['min_p', 'top_k', 'top_p', 'temperature', 'repetition_penalty']}
                
                tokens_per_char = 20
                estimated_tokens = len(text) * tokens_per_char
                min_tokens = 1000
                max_tokens = max(min_tokens, min(estimated_tokens, 86 * 120))
                
                codes = model.generate(
                    prefix_conditioning=conditioning,
                    max_new_tokens=max_tokens,
                    cfg_scale=cfg_scale, 
                    batch_size=1, 
                    progress_bar=True,
                    sampling_params=sampling_dict
                )
                audio = model.autoencoder.decode(codes).cpu().detach()
                return audio
            
            globals()['enhanced_clone_voice_from_audio'] = simple_enhanced_clone_voice
            globals()['enhanced_generate_speech'] = simple_enhanced_generate_speech
            print("✅ Fallback enhanced functions created (now emotion-aware)!")
            ENHANCED_AVAILABLE = True # Mark as available because we have the fallback

        except Exception as e:
            print(f"⚠️ Failed to create fallback functions: {e}")
            print("Using standard voice cloning only.")
            ENHANCED_AVAILABLE = False # Ensure it's false if creation fails
    
    globals()['model'] = model
    globals()['device'] = device
    globals()['ENHANCED_AVAILABLE'] = ENHANCED_AVAILABLE 
    
    print("\n🎉 Setup complete! Ready for voice cloning.")
    print("\n🚀 Next: Run Cell 4 to upload your voice sample.")

except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("\n🔧 Troubleshooting:")
    print("1. Check internet connection")
    print("2. Restart runtime if NumPy issues persist")
    print("3. Re-run all cells from the beginning")
    raise

In [None]:
#@title 4. 🎤 Upload Voice Sample for Cloning
from google.colab import files
import torchaudio
import torch
import IPython.display as ipd

print("🎤 Voice Cloning - Upload Your Audio File")
print("Upload an audio file (10-30 seconds) to clone the speaker's voice")
print("Supported formats: WAV, MP3, FLAC, etc.")
print("")

# Upload audio file
uploaded = files.upload()

if uploaded:
    # Get the uploaded file
    audio_file = list(uploaded.keys())[0]
    print(f"\n📁 Processing: {audio_file}")
    
    try:
        # Load and process the audio
        wav, sr = torchaudio.load(audio_file)
        
        # Convert to mono if needed
        if wav.shape[0] > 1:
            wav = wav.mean(0, keepdim=True)
        
        # Show audio info
        duration = wav.shape[1] / sr
        print(f"📊 Audio Info:")
        print(f"  - Duration: {duration:.1f} seconds")
        print(f"  - Sample rate: {sr} Hz")
        print(f"  - Channels: {wav.shape[0]}")
        
        # Quality recommendations
        if duration < 5:
            print("\n⚠️ Audio is quite short (< 5s). Consider using 10-20 seconds for better results.")
        elif duration > 30:
            print("\n💡 Audio is long (> 30s). The system will use the best portion automatically.")
        else:
            print("\n✅ Audio duration is optimal for voice cloning!")

        # Play the audio
        print("\n🔊 Preview of your audio:")
        ipd.display(ipd.Audio(wav.numpy(), rate=sr))
        
        # Create speaker embedding
        print("\n🧠 Creating voice embedding...")

        if ENHANCED_AVAILABLE and 'enhanced_cloner' in globals():
            print("🚀 Using Enhanced Voice Cloner class...")
            speaker_embedding, quality_metrics = enhanced_cloner.clone_voice_from_audio(
                wav, sr,
                target_length_seconds=min(20.0, duration),
                normalize=True,
                remove_silence=True,
                analyze_quality=True
            )
            print(f"\n📈 Voice Quality Analysis:")
            print(f"  - Quality Score: {quality_metrics['quality_score']:.3f} / 1.000")
            print(f"  - SNR Estimate: {quality_metrics['snr_estimate']:.1f} dB")
            globals()['voice_quality_metrics'] = quality_metrics
        elif ENHANCED_AVAILABLE and 'enhanced_clone_voice_from_audio' in globals():
            print("🚀 Using fallback enhanced_clone_voice_from_audio function...")
            speaker_embedding, quality_metrics = enhanced_clone_voice_from_audio(
                wav, sr,
                target_length_seconds=min(20.0, duration),
                normalize=True,
                remove_silence=True
            )
            print(f"\n📈 Voice Quality Analysis (from fallback):")
            print(f"  - Quality Score: {quality_metrics['quality_score']:.3f} / 1.000")
            print(f"  - SNR Estimate: {quality_metrics['snr_estimate']:.1f} dB")
            globals()['voice_quality_metrics'] = quality_metrics
        else:
            print("📢 Using standard Zonos model.make_speaker_embedding...")
            speaker_embedding = model.make_speaker_embedding(wav, sr)
            speaker_embedding = speaker_embedding.to(device, dtype=torch.bfloat16)
            globals()['voice_quality_metrics'] = {} # No specific quality metrics for standard
        
        globals()['cloned_voice'] = speaker_embedding
        globals()['original_audio_file'] = audio_file
        
        print("\n✅ Voice cloning successful!")
        print("Your cloned voice is ready to use in Cell 5.")

    except Exception as e:
        print(f"❌ Error processing audio: {e}")
        print("Please try a different audio file or check the format.")
else:
    print("No file uploaded. You can still use the default voice in Cell 5.")

In [None]:
#@title 5. 🎤 Generate Speech with Enhanced Voice Cloning
import IPython.display as ipd
import torch
import time

#@markdown ### Text and Settings
text = "Hello! This is an enhanced voice cloning demonstration using Zonos TTS. The new system provides much better consistency and naturalness." #@param {type:"string"}
language = "en-us" #@param ["en-us", "en-gb", "fr-fr", "es-es", "de-de", "it-it", "ja-jp", "zh-cn"]
seed = 42 #@param {type:"integer"}

#@markdown ### Voice Quality Preset
#@markdown Select a preset to balance speed, quality, and expressiveness. Advanced settings are optimized based on your choice.
quality_preset = "Balanced" #@param ["Conservative", "Balanced", "Fast (Less Expressive)", "Expressive", "Creative"]

#@markdown **Quality Preset Descriptions:**
#@markdown - **Conservative**: Safe, stable output with minimal artifacts. Good for challenging audio or when maximum clarity is needed.
#@markdown - **Balanced**: Good balance of quality, naturalness, and speed (recommended starting point).
#@markdown - **Fast (Less Expressive)**: Prioritizes generation speed by using a lower CFG Scale (1.5). Output may be flatter or less expressive but is significantly faster.
#@markdown - **Expressive**: More dynamic and expressive speech, with an adjusted emotional profile for liveliness (e.g., slightly happier/more surprised).
#@markdown - **Creative**: Experimental, most expressive but may have artifacts, with a unique, diverse emotional profile.
#@markdown 
#@markdown *Underlying parameters like CFG Scale, pitch variation, speaking rate, and sampling settings (min_p, temperature) are automatically adjusted based on your voice sample's quality and the chosen preset. The 'Expressive' and 'Creative' presets also apply specific emotion vectors.*

print("🎤 Enhanced Voice Cloning Generation")
print("=" * 40)

torch.manual_seed(seed)
speaker_embedding = globals().get('cloned_voice', None)
if speaker_embedding is not None:
    print("🎭 Using your cloned voice!")
    if 'original_audio_file' in globals(): print(f"📁 Voice source: {original_audio_file}")
else:
    print("🎤 Using default voice (upload audio in Cell 4 to use your own voice)")

print(f"\n🎵 Generating speech...")
print(f"📝 Text: {text[:100]}{'...' if len(text) > 100 else ''}")
print(f"🌍 Language: {language}")
print(f"🎲 Seed: {seed}")

start_time = time.time()

try:
    use_enhanced_cloner_class = ENHANCED_AVAILABLE and 'enhanced_cloner' in globals()
    use_fallback_enhanced_func = ENHANCED_AVAILABLE and not use_enhanced_cloner_class and 'enhanced_generate_speech' in globals()

    if use_enhanced_cloner_class or use_fallback_enhanced_func:
        print(f"🚀 Using Enhanced Voice Cloning system...")
        voice_quality = globals().get('voice_quality_metrics', None)
        print(f"🎯 Using {quality_preset} preset with automatic optimization...")
        quality_score = voice_quality.get('quality_score', 0.7) if voice_quality else 0.7
        snr_estimate = voice_quality.get('snr_estimate', 20.0) if voice_quality else 20.0
        
        emotion_vector_override = None 

        if quality_preset == "Conservative":
            base_pitch = 8.0; base_rate = 10.0; base_min_p = 0.02; base_temp = 0.6; cfg_scale_notebook = 2.5
        elif quality_preset == "Fast (Less Expressive)":
            base_pitch = 8.0; base_rate = 10.0; base_min_p = 0.03; base_temp = 0.7; cfg_scale_notebook = 1.5
        elif quality_preset == "Expressive":
            base_pitch = 18.0; base_rate = 14.0; base_min_p = 0.06; base_temp = 0.85; cfg_scale_notebook = 2.0
            emotion_vector_override = [0.6, 0.05, 0.05, 0.05, 0.1, 0.05, 0.05, 0.05] 
        elif quality_preset == "Creative":
            base_pitch = 22.0; base_rate = 16.0; base_min_p = 0.08; base_temp = 0.95; cfg_scale_notebook = 1.8
            emotion_vector_override = [0.2, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1] 
        else:  # Balanced (default)
            base_pitch = 12.0; base_rate = 12.0; base_min_p = 0.04; base_temp = 0.75; cfg_scale_notebook = 2.2
        
        quality_factor = min(1.2, max(0.8, quality_score * 1.2))
        snr_factor = min(1.1, max(0.9, (snr_estimate - 15.0) / 20.0 + 1.0))
        pitch_std = max(5.0, min(25.0, base_pitch * quality_factor))
        speaking_rate = max(8.0, min(18.0, base_rate * snr_factor))
        min_p = max(0.01, min(0.15, base_min_p * quality_factor))
        temperature = max(0.5, min(1.0, base_temp * quality_factor))
        cfg_scale_notebook = max(1.0, min(3.0, cfg_scale_notebook)) 
        
        custom_conditioning = {'pitch_std': pitch_std, 'speaking_rate': speaking_rate}
        custom_sampling = {'min_p': min_p, 'temperature': temperature}
        
        print(f"📊 Automatically optimized parameters (for this cell's run):")
        if voice_quality: print(f"  - Voice quality score: {quality_score:.3f}")
        if voice_quality: print(f"  - SNR estimate: {snr_estimate:.1f} dB")
        print(f"  - Pitch variation: {pitch_std:.1f}")
        print(f"  - Speaking rate: {speaking_rate:.1f}")
        print(f"  - Sampling min_p: {min_p:.3f}")
        print(f"  - Temperature: {temperature:.2f}")
        print(f"  - CFG Scale (from preset): {cfg_scale_notebook:.1f}")
        if emotion_vector_override: print(f"  - Emotion Vector Override: {emotion_vector_override}")

        if use_fallback_enhanced_func:
            print("🚀 Using fallback enhanced_generate_speech function (now emotion-aware)...")
            audio = enhanced_generate_speech(
                text=text, speaker_embedding=speaker_embedding, language=language,
                voice_quality=voice_quality, custom_conditioning_params=custom_conditioning,
                custom_sampling_params=custom_sampling, cfg_scale=cfg_scale_notebook, seed=seed,
                emotion_vector=emotion_vector_override 
            )
            sample_rate = model.autoencoder.sampling_rate
        elif use_enhanced_cloner_class:
            print("🚀 Using EnhancedVoiceCloner class...")
            audio = enhanced_cloner.generate_speech(
                text=text, speaker_embedding=speaker_embedding, language=language,
                voice_quality=voice_quality, custom_conditioning_params=custom_conditioning,
                custom_sampling_params=custom_sampling, cfg_scale=cfg_scale_notebook, seed=seed,
                emotion_vector=emotion_vector_override 
            )
            sample_rate = enhanced_cloner.model.autoencoder.sampling_rate 
        else:
            raise Exception("Logic error: No valid enhanced generation function determined.")
        print(f"✅ Enhanced generation completed!")

    else:
        print("📢 Using standard Zonos model.generate (no enhanced features or emotion override)...")
        cfg_scale_notebook = 2.2
        cond_dict = make_cond_dict(text=text, language=language, speaker=speaker_embedding, device=device)
        conditioning = model.prepare_conditioning(cond_dict, cfg_scale=cfg_scale_notebook)
        tokens_per_char = 20
        estimated_tokens = len(text) * tokens_per_char
        min_tokens = 1000
        max_tokens = max(min_tokens, min(estimated_tokens, 86 * 120))
        codes = model.generate(
            prefix_conditioning=conditioning, max_new_tokens=max_tokens,
            cfg_scale=cfg_scale_notebook, batch_size=1, progress_bar=True
        )
        audio = model.autoencoder.decode(codes).cpu().detach()
        sample_rate = model.autoencoder.sampling_rate
        print(f"✅ Standard generation completed!")
    
    if audio.dim() == 2 and audio.size(0) > 1: audio = audio[0:1, :]
    generation_time = time.time() - start_time
    duration = audio.shape[-1] / sample_rate
    
    print(f"\n📊 Generation Stats:")
    print(f"  - Generation time: {generation_time:.2f} seconds")
    print(f"  - Audio duration: {duration:.2f} seconds")
    print(f"  - Sample rate: {sample_rate} Hz")

    print(f"\n🔊 Generated Audio:")
    wav_numpy = audio.squeeze().numpy()
    ipd.display(ipd.Audio(wav_numpy, rate=sample_rate))
    globals()['last_generated_audio'] = (wav_numpy, sample_rate)
    
    if use_enhanced_cloner_class or use_fallback_enhanced_func:
        print("\n🎉 Enhanced voice cloning features used.")
    print(f"\n✅ Success! Your voice clone is ready.")

except Exception as e:
    print(f"❌ Error during audio generation: {e}")
    print("\n🔧 Troubleshooting:")
    print("- Try shorter text (under 200 characters)")
    print("- Check GPU memory usage")
    print("- Restart runtime if NumPy issues persist")
    import traceback
    traceback.print_exc()

In [None]:
#@title 6. 📊 Run CFG Scale Benchmarks
#@markdown This cell runs benchmarks with different CFG scales (1.0, 1.5, 2.2). Other generation parameters (pitch, rate, sampling) are based on the 'Balanced' preset to isolate the impact of CFG Scale.
#@markdown - **CFG Scale 1.0**: Typically offers the fastest generation but may result in the least expressive or most robotic audio. 
#@markdown - **CFG Scale 1.5**: Used by the "Fast (Less Expressive)" preset in Cell 5. Aims for a balance between speed and quality, though still less expressive than higher CFG scales.
#@markdown - **CFG Scale 2.2**: Default for the "Balanced" preset in Cell 5, offering a good blend of quality and naturalness.
#@markdown Results will show Real-Time Factor (RTF), audio duration, generation time, and allow you to listen to each sample.
#@markdown 
#@markdown **IMPORTANT:** If `zonos/model.py` (or other underlying model code) has been changed due to updates or local modifications, you **MUST re-run Cell 3 (Load Model)** to load the new model code *before* running these benchmarks or generating audio in Cell 5.

import time
import torchaudio
import IPython.display as ipd
import numpy as np
import os

benchmark_audio_dir = "/content/Zonos/benchmark_audio"
# Create the output directory if it does not exist yet
os.makedirs(benchmark_audio_dir, exist_ok=True)

def run_benchmark_trial(text_input, language_code, seed_value, cfg_scale_to_test, quality_preset_value, 
                        speaker_embedding_tensor, voice_quality_data, 
                        zonos_model, torch_device, 
                        run_warmup=False):
    print(f"\n--- Benchmarking Trial ---")
    print(f"Text: '{text_input[:50]}...' ({len(text_input)} chars)")
    print(f"CFG Scale: {cfg_scale_to_test}, Preset (base for other params): {quality_preset_value}")

    torch.manual_seed(seed_value)

    quality_score = voice_quality_data.get('quality_score', 0.7) if voice_quality_data else 0.7
    snr_estimate = voice_quality_data.get('snr_estimate', 20.0) if voice_quality_data else 20.0

    # Use base parameters from the 'Balanced' preset for consistency in benchmark, CFG is overridden
    base_pitch, base_rate, base_min_p, base_temp = 12.0, 12.0, 0.04, 0.75

    quality_factor = min(1.2, max(0.8, quality_score * 1.2))
    snr_factor = min(1.1, max(0.9, (snr_estimate - 15.0) / 20.0 + 1.0))
    
    pitch_std = max(5.0, min(25.0, base_pitch * quality_factor))
    speaking_rate = max(8.0, min(18.0, base_rate * snr_factor))
    min_p_val = max(0.01, min(0.15, base_min_p * quality_factor))
    temperature_val = max(0.5, min(1.0, base_temp * quality_factor))

    current_custom_conditioning = {'pitch_std': pitch_std, 'speaking_rate': speaking_rate}
    current_custom_sampling = {'min_p': min_p_val, 'temperature': temperature_val}

    if run_warmup:
        print("Running warmup...")
        warmup_text = "Warmup."
        warmup_cond_dict = make_cond_dict(
            text=warmup_text, language=language_code, speaker=speaker_embedding_tensor,
            device=torch_device, **current_custom_conditioning
        )
        warmup_conditioning = zonos_model.prepare_conditioning(warmup_cond_dict, cfg_scale=cfg_scale_to_test)
        _ = zonos_model.generate(
            prefix_conditioning=warmup_conditioning, max_new_tokens=30, cfg_scale=cfg_scale_to_test,
            batch_size=1, sampling_params=current_custom_sampling, progress_bar=False
        )
        print("Warmup complete.")

    generation_start_time = time.time()
    cond_dict = make_cond_dict(
        text=text_input, language=language_code, speaker=speaker_embedding_tensor,
        device=torch_device, **current_custom_conditioning 
    )
    prepared_conditioning = zonos_model.prepare_conditioning(cond_dict, cfg_scale=cfg_scale_to_test)
    
    tokens_per_char = 15 
    estimated_tokens = len(text_input) * tokens_per_char
    min_gen_tokens = 200
    max_gen_tokens = max(min_gen_tokens, min(estimated_tokens, 86 * 100))

    codes = zonos_model.generate(
        prefix_conditioning=prepared_conditioning, max_new_tokens=max_gen_tokens,
        cfg_scale=cfg_scale_to_test, batch_size=1, 
        sampling_params=current_custom_sampling, progress_bar=True
    )
    audio_output = zonos_model.autoencoder.decode(codes).cpu().detach()
    generation_time = time.time() - generation_start_time
    sample_rate = zonos_model.autoencoder.sampling_rate
    
    # Keep only the first row (channel) when the decoded audio is a multi-row 2-D tensor.
    if audio_output.dim() == 2 and audio_output.size(0) > 1:
        audio_output = audio_output[0:1, :]
    audio_duration = audio_output.shape[-1] / sample_rate
    rtf = generation_time / audio_duration if audio_duration > 0 else float('inf')
    
    print(f"  Generated {audio_duration:.2f}s audio in {generation_time:.2f}s. RTF: {rtf:.2f}")

    clean_text_for_filename = text_input[:20].replace(' ', '_').replace('.', '').replace('!', '').replace('?', '')
    audio_filename = f"benchmark_cfg_{cfg_scale_to_test}_seed_{seed_value}_text_{clean_text_for_filename}.wav"
    audio_filepath = os.path.join(benchmark_audio_dir, audio_filename)
    torchaudio.save(audio_filepath, audio_output.squeeze(0), sample_rate)
    print(f"  Saved audio to: {audio_filepath}")
    return rtf, audio_duration, generation_time, audio_filepath

texts_to_benchmark = [
    "Hello world.",
    "This is a test of the emergency broadcast system.",
    "The quick brown fox jumps over the lazy dog, and other fables are often used for typing practice."
]
cfg_scales_to_benchmark = [1.0, 1.5, 2.2]
benchmark_language = "en-us"
benchmark_seed = 42 
benchmark_quality_preset_for_other_params = "Balanced" 
benchmark_results_list = [] 

if 'model' not in globals() or 'device' not in globals():
    print("⚠️ Model or device not found. Please run previous cells (1-3) to load the model.")
elif 'make_cond_dict' not in globals():
    print("⚠️ make_cond_dict not found. Please ensure Cell 3 (model loading) has run successfully.")
else:
    current_speaker_embedding = globals().get('cloned_voice', None)
    if current_speaker_embedding is None:
        print("🎤 No cloned voice found. Using the default speaker if the model supports it.")
    current_voice_quality_metrics = globals().get('voice_quality_metrics', {})
    
    print("\n🔥 Running a single warm-up generation before benchmark loop (using CFG 2.2 from preset)...")
    run_benchmark_trial(
        "Warmup text.", benchmark_language, benchmark_seed, 2.2, 
        benchmark_quality_preset_for_other_params, current_speaker_embedding, current_voice_quality_metrics,
        model, device, run_warmup=False 
    )
    print("🔥 Warm-up finished.\n")

    for cfg_val in cfg_scales_to_benchmark:
        for text_sample in texts_to_benchmark:
            rtf, audio_dur, gen_time, audio_file = run_benchmark_trial(
                text_sample, benchmark_language, benchmark_seed, cfg_val,
                benchmark_quality_preset_for_other_params, current_speaker_embedding, current_voice_quality_metrics,
                model, device
            )
            benchmark_results_list.append({
                "text": text_sample,
                "cfg_scale": cfg_val,
                "rtf": rtf,
                "audio_duration": audio_dur,
                "generation_time": gen_time,
                "audio_file": audio_file
            })

    print("\n\n--- Benchmark Summary ---")
    table_header = f"{'CFG':<5} | {'Text Len':<8} | {'RTF':<5} | {'Audio (s)':<10} | {'Gen Time (s)':<12} | {'File':<70}"
    print(table_header)
    print("-" * len(table_header))
    for res in benchmark_results_list:
        text_len_desc = "Short" if len(res['text']) < 20 else "Medium" if len(res['text']) < 70 else "Long"
        print(f"{res['cfg_scale']:<5.1f} | {text_len_desc:<8} | {res['rtf']:<5.2f} | {res['audio_duration']:<10.2f} | {res['generation_time']:<12.2f} | {os.path.basename(res['audio_file']):<70}")
        ipd.display(ipd.HTML(f"<b>Text:</b> {res['text']}<br><b>CFG:</b> {res['cfg_scale']}, <b>File:</b> {res['audio_file']}"))
        ipd.display(ipd.Audio(res['audio_file']))
        print("-" * 70)

    globals()['benchmark_run_results_list'] = benchmark_results_list

print("\n✅ Benchmarking cell execution complete.")
print("Reminder: If you've updated zonos/model.py or other core files, ensure you've re-run Cell 3 to load changes before running Cell 5 or this benchmark cell.")

---
## 🎉 Enhanced Voice Cloning Complete!

You've successfully used the enhanced voice cloning system with Zonos TTS. This notebook provides a comprehensive suite for voice cloning, generation, and performance benchmarking.

### 🚀 What's Enhanced & Key Features:
- **Improved Speech Quality**: Significant reductions in gibberish, better timing consistency, and more natural speech flow.
- **Advanced Audio Preprocessing**: Automatic silence removal and normalization for uploaded voice samples.
- **Voice Quality Analysis**: SNR estimation and quality scoring for your voice samples to guide parameter choices.
- **Flexible Quality Presets**: 
    - Choose from presets like "Conservative", "Balanced", "Fast (Less Expressive)", "Expressive", and "Creative".
    - "Fast" preset uses a lower CFG Scale (1.5) for quicker generation with a trade-off in expressiveness.
    - "Expressive" and "Creative" presets now incorporate specific emotion vectors for more vivid speech.
- **Adaptive Settings**: Parameters automatically adjust based on your voice sample's quality and chosen preset.
- **CFG Scale Control**: Support for `cfg_scale=1.0` (and other values) in `zonos/model.py` allows for fine-tuning the balance between speed and expressiveness. This is benchmarked in Cell 6.
- **Reproducible Results**: Seed support for consistent audio generation.
- **Google Colab Compatibility**: Streamlined setup and dependency management within the Colab environment.
- **Benchmarking Tools**: Cell 6 allows for systematic testing of different CFG Scales to understand performance and quality trade-offs.
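
To compare CFG Scales at a glance, the per-trial results from Cell 6 can be averaged per scale. The sketch below assumes each result is a dict with `cfg_scale` and `rtf` keys, as stored in `benchmark_run_results_list` by Cell 6. The helper name `summarize_rtf_by_cfg` is hypothetical, and the numbers in `example_results` are placeholders, not measurements.

```python
from collections import defaultdict
from statistics import mean


def summarize_rtf_by_cfg(results):
    """Group benchmark rows by CFG scale and return the mean RTF per scale."""
    grouped = defaultdict(list)
    for row in results:
        grouped[row["cfg_scale"]].append(row["rtf"])
    # Sort by CFG scale so the summary prints in ascending order.
    return {cfg: mean(rtfs) for cfg, rtfs in sorted(grouped.items())}


# Placeholder rows for illustration only; real values come from Cell 6.
example_results = [
    {"cfg_scale": 2.2, "rtf": 1.0},
    {"cfg_scale": 1.0, "rtf": 0.5},
    {"cfg_scale": 1.0, "rtf": 0.7},
]

summary = summarize_rtf_by_cfg(example_results)
for cfg, avg_rtf in summary.items():
    print(f"CFG {cfg:.1f}: mean RTF {avg_rtf:.2f}")
```

In the notebook, passing `benchmark_run_results_list` instead of `example_results` should give one averaged RTF per tested CFG Scale. An RTF below 1.0 means audio was generated faster than its playback duration.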

### 💡 Tips for Best Results:
- Use clean, high-quality audio (16kHz+ sample rate, minimal background noise/music) for voice cloning.
- Provide 10-20 seconds of clear speech for optimal cloning.
- Experiment with different Quality Presets in Cell 5 to find the best match for your needs.
- If modifying underlying code (like `zonos/model.py`), always re-run Cell 3 (Load Model) to apply changes.
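
The sample-rate and duration tips above can be checked before uploading. This is a minimal sketch that uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file. The function name `check_voice_sample` and its default thresholds (taken from the tips above) are illustrative and not part of the notebook.

```python
import wave


def check_voice_sample(path, min_rate=16000, min_seconds=10.0, max_seconds=20.0):
    """Return (sample_rate, duration_seconds, warnings) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate

    warnings = []
    if sample_rate < min_rate:
        warnings.append(f"Sample rate {sample_rate} Hz is below {min_rate} Hz.")
    if duration < min_seconds:
        warnings.append(f"Clip is {duration:.1f}s; at least {min_seconds:.0f}s is recommended.")
    elif duration > max_seconds:
        warnings.append(f"Clip is {duration:.1f}s; consider trimming to about {max_seconds:.0f}s.")
    return sample_rate, duration, warnings


# Example usage (the file name is a placeholder):
# rate, seconds, issues = check_voice_sample("my_voice.wav")
# print(rate, round(seconds, 1), issues or "Looks fine")
```

This only inspects the file header. It does not measure background noise, which the Voice Quality Analysis step in the notebook estimates separately.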

### 🔧 If You Encountered Issues:
- **NumPy or other dependency errors**: Try `Runtime` > `Restart runtime` (or `Factory reset runtime`) then re-run cells from the beginning (Cell 1 onwards).
- **Model loading errors after code changes**: Ensure you've re-run Cell 3.
- **Memory errors**: Try shorter text for generation or restart the runtime.
- **Audio quality issues**: Use cleaner source audio for cloning. Experiment with different presets in Cell 5.
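
For the memory-error case above, one way to shorten the text is to split it at sentence boundaries and generate each piece separately. The sketch below is an illustration, not part of the notebook: the helper `split_text_into_chunks` and the `max_chars=200` default are assumptions, and a suitable limit depends on the available GPU memory.

```python
import re


def split_text_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_text_into_chunks("One. Two. Three.", max_chars=9)
print(chunks)  # ['One. Two.', 'Three.']
```

Each chunk can then be passed to Cell 5 in turn. Reusing the same seed for every chunk may help keep the voice consistent across pieces.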

---

**🎤 Thank you for using Enhanced Voice Cloning with Zonos TTS!**

For more information, visit: [Zonos GitHub Repository](https://github.com/Wamp1re-Ai/Zonos)