<a href="https://colab.research.google.com/github/brunombo/Python/blob/master/long_TTS_xtts_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
text_to_speech_to_synthetise = "Ce document présente les Manifold-Constrained Hyper-Connections (mHC), une architecture novatrice conçue par DeepSeek-AI pour stabiliser l'entraînement des grands modèles de langage. Bien que les Hyper-Connections (HC) classiques améliorent les performances en élargissant le flux résiduel, leur nature non contrainte provoque souvent une instabilité numérique et des problèmes de divergence du signal. Pour remédier à cela, les auteurs utilisent l'algorithme de Sinkhorn-Knopp afin de projeter les connexions sur une variété de matrices doublement stochastiques, préservant ainsi la propriété de mappage d'identité. Cette approche garantit une propagation saine du signal tout en optimisant l'efficacité matérielle grâce à la fusion de noyaux et à des stratégies de mémorisation sélective. Les résultats expérimentaux démontrent que mHC surpasse les méthodes existantes en termes de scalabilité et de capacités de raisonnement sur divers tests de référence. En intégrant ces contraintes géométriques rigoureuses, le cadre mHC offre une solution robuste pour l'évolution des architectures neuronales à grande échelle."

voice_gender = 'female_fr'
# Available presets: 'female_fr', 'male_fr'

In [None]:
# Install dependencies
!pip install -q scipy noisereduce
!pip install -q numpy==2.0.2

# Install the maintained fork (supports Python 3.12+)
!pip install -q coqui-tts
!pip install -q torchcodec
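
After the install cell, it can help to confirm which versions actually landed in the runtime before loading the model. The sketch below uses only the standard library; `installed_versions` is a hypothetical helper name, not part of the notebook, and the package list simply mirrors the `pip install` lines above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip commands in the install cell
for name, version in installed_versions(
    ["scipy", "noisereduce", "numpy", "coqui-tts", "torchcodec"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

A `None` entry means the corresponding `pip install` did not take effect in the current runtime.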

In [None]:
# -*- coding: utf-8 -*-
"""
TTS XTTS v2 - Long Audio Version (> 1 hour)
===========================================

High-quality speech synthesis module built on Coqui XTTS v2.
Optimized for generating long audio, with:
- enable_text_splitting=True for automatic splitting
- Sentence-aware chunking for very long texts
- Audio concatenation with crossfade
- Progress bar and remaining-time estimate
- Optimized memory management
- Fix for the 'language' argument bug in the synthesizer API

Author: Bruno
Date: January 2025
Fix: Gemini
"""

# ==============================================================================
# IMPORTS
# ==============================================================================

from __future__ import annotations

import os
import re
import gc
import wave
import time
import hashlib
import warnings
from pathlib import Path
from typing import Optional, Union, List, Callable
from dataclasses import dataclass
from enum import Enum

import numpy as np

warnings.filterwarnings("ignore", category=UserWarning)

# ==============================================================================
# INSTALLATION (Colab)
# ==============================================================================

def install_dependencies():
    """Install missing dependencies (Colab)."""
    import subprocess
    import sys

    # (import name, pip package name)
    packages = [
        ("scipy", "scipy"),
        ("noisereduce", "noisereduce"),
        ("TTS", "coqui-tts"),
    ]

    for module, package in packages:
        try:
            __import__(module)
        except ImportError:
            print(f"📦 Installing {package}...")
            subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

    # Pin a compatible numpy version; a failed pin is not fatal
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "numpy==2.0.2"])
    except subprocess.CalledProcessError:
        pass

# ==============================================================================
# CONFIGURATION
# ==============================================================================

@dataclass
class TTSConfig:
    """Global configuration for the TTS module."""
    MODEL_NAME: str = "tts_models/multilingual/multi-dataset/xtts_v2"
    SAMPLE_RATE: int = 24000
    DEFAULT_LANGUAGE: str = "fr"
    GDRIVE_FOLDER: str = "/content/drive/MyDrive/TTS_Output"

    # Long-audio settings
    MAX_CHARS_PER_CHUNK: int = 500  # Max characters per chunk for very long texts
    CROSSFADE_DURATION: float = 0.05  # Crossfade duration in seconds
    ENABLE_TEXT_SPLITTING: bool = True  # Enable native XTTS text splitting

    # Filled in by __post_init__ (a dataclass field cannot default to a mutable dict)
    PRESET_VOICES: Optional[dict] = None

    def __post_init__(self):
        self.PRESET_VOICES = {
            "female_fr": "https://huggingface.co/spaces/coqui/xtts/resolve/main/examples/female.wav",
            "male_fr": "https://huggingface.co/spaces/coqui/xtts/resolve/main/examples/male.wav",
        }

Config = TTSConfig()

# ==============================================================================
# DEVICE MANAGEMENT
# ==============================================================================

_device = None
_device_name = "cpu"

def detect_device():
    """Detect the best available device."""
    global _device, _device_name
    import torch

    # Try TPU
    try:
        import torch_xla.core.xla_model as xm
        _device = xm.xla_device()
        _device_name = "tpu"
        print("⚙️ Device: TPU")
        return
    except Exception:
        pass

    # Try CUDA
    if torch.cuda.is_available():
        _device = torch.device("cuda")
        _device_name = f"cuda ({torch.cuda.get_device_name(0)})"
        print(f"⚙️ Device: {_device_name}")
        return

    # Fall back to CPU
    _device = torch.device("cpu")
    _device_name = "cpu"
    print("⚙️ Device: CPU")

# ==============================================================================
# TEXT SPLITTING UTILITIES
# ==============================================================================

class TextSplitter:
    """
    Utility for splitting long texts intelligently.
    Keeps sentences and paragraphs intact.
    """

    @staticmethod
    def estimate_audio_duration(text: str, chars_per_second: float = 15.0) -> float:
        """
        Estimate the audio duration in seconds for a given text,
        assuming a constant speaking rate in characters per second.
        """
        return len(text) / chars_per_second

    @staticmethod
    def split_into_sentences(text: str) -> List[str]:
        """Split the text into sentences."""
        # Split on whitespace that follows sentence-ending punctuation
        pattern = r'(?<=[.!?])\s+'
        sentences = re.split(pattern, text)
        return [s.strip() for s in sentences if s.strip()]

    @staticmethod
    def split_into_paragraphs(text: str) -> List[str]:
        """Split the text into paragraphs (separated by blank lines)."""
        paragraphs = re.split(r'\n\s*\n', text)
        return [p.strip() for p in paragraphs if p.strip()]

    @classmethod
    def split_for_long_audio(
        cls,
        text: str,
        max_chars: int = 500,
        preserve_sentences: bool = True
    ) -> List[str]:
        """
        Split a long text into chunks sized for synthesis.
        """
        # Short text: return it unchanged
        if len(text) <= max_chars:
            return [text]

        chunks = []

        if preserve_sentences:
            sentences = cls.split_into_sentences(text)
            current_chunk = ""

            for sentence in sentences:
                # A sentence longer than max_chars is split on its own
                if len(sentence) > max_chars:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                        current_chunk = ""
                    # Split the long sentence on word boundaries
                    words = sentence.split()
                    sub_chunk = ""
                    for word in words:
                        if len(sub_chunk) + len(word) + 1 <= max_chars:
                            sub_chunk += " " + word if sub_chunk else word
                        else:
                            if sub_chunk:
                                chunks.append(sub_chunk.strip())
                            sub_chunk = word
                    # The leftover words start the next chunk
                    if sub_chunk:
                        current_chunk = sub_chunk
                elif len(current_chunk) + len(sentence) + 1 <= max_chars:
                    current_chunk += " " + sentence if current_chunk else sentence
                else:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = sentence

            if current_chunk:
                chunks.append(current_chunk.strip())
        else:
            # Plain fixed-size split by character count
            for i in range(0, len(text), max_chars):
                chunks.append(text[i:i + max_chars])

        return chunks


# ==============================================================================
# AUDIO PROCESSING
# ==============================================================================

class AudioProcessor:
    """Audio processor for post-processing and concatenation."""

    @staticmethod
    def normalize(audio: np.ndarray, target_db: float = -3.0) -> np.ndarray:
        """Peak-normalize the audio to the target level (in dBFS)."""
        if audio.dtype == np.int16:
            audio = audio.astype(np.float32) / 32768.0

        peak = np.max(np.abs(audio))
        if peak > 0:
            target_linear = 10 ** (target_db / 20)
            audio = audio * (target_linear / peak)

        return np.clip(audio, -1.0, 1.0)

    @staticmethod
    def crossfade(
        audio1: np.ndarray,
        audio2: np.ndarray,
        sample_rate: int,
        duration: float = 0.05
    ) -> np.ndarray:
        """
        Concatenate two audio segments with a linear crossfade.
        """
        # Convert to float if needed
        if audio1.dtype == np.int16:
            audio1 = audio1.astype(np.float32) / 32768.0
        if audio2.dtype == np.int16:
            audio2 = audio2.astype(np.float32) / 32768.0

        fade_samples = int(sample_rate * duration)

        # No fade requested, or a segment too short to crossfade: plain concatenation
        if fade_samples <= 0 or len(audio1) < fade_samples or len(audio2) < fade_samples:
            return np.concatenate([audio1, audio2])

        # Build the fade curves
        fade_out = np.linspace(1.0, 0.0, fade_samples)
        fade_in = np.linspace(0.0, 1.0, fade_samples)

        # Apply the crossfade
        audio1_end = audio1[-fade_samples:] * fade_out
        audio2_start = audio2[:fade_samples] * fade_in

        # Assemble
        result = np.concatenate([
            audio1[:-fade_samples],
            audio1_end + audio2_start,
            audio2[fade_samples:]
        ])

        return result

    @classmethod
    def concatenate_chunks(
        cls,
        audio_chunks: List[np.ndarray],
        sample_rate: int,
        crossfade_duration: float = 0.05
    ) -> np.ndarray:
        """
        Concatenate several audio chunks, crossfading at each joint.
        """
        if not audio_chunks:
            return np.array([], dtype=np.float32)

        if len(audio_chunks) == 1:
            audio = audio_chunks[0]
            if audio.dtype == np.int16:
                audio = audio.astype(np.float32) / 32768.0
            return audio

        result = audio_chunks[0]
        if result.dtype == np.int16:
            result = result.astype(np.float32) / 32768.0

        for chunk in audio_chunks[1:]:
            result = cls.crossfade(result, chunk, sample_rate, crossfade_duration)

        return result

    @staticmethod
    def enhance(
        audio: np.ndarray,
        sample_rate: int,
        normalize: bool = True,
        warmth: bool = True
    ) -> np.ndarray:
        """Enhance the audio quality."""
        if audio.dtype == np.int16:
            audio = audio.astype(np.float32) / 32768.0

        if warmth:
            try:
                from scipy import signal
                # Add warmth: mix in a low-passed (300 Hz) copy of the signal at 15% gain
                nyquist = sample_rate / 2
                cutoff = min(300, nyquist * 0.9) / nyquist
                b, a = signal.butter(2, cutoff, btype='low')
                bass = signal.filtfilt(b, a, audio)
                audio = audio + 0.15 * bass
            except ImportError:
                pass

        if normalize:
            peak = np.max(np.abs(audio))
            if peak > 0:
                target = 10 ** (-3.0 / 20)
                audio = audio * (target / peak)

        audio = np.clip(audio, -1.0, 1.0)
        return audio


# ==============================================================================
# PROGRESS TRACKER
# ==============================================================================

class ProgressTracker:
    """Progress tracking with a remaining-time estimate."""

    def __init__(self, total: int, description: str = ""):
        self.total = total
        self.current = 0
        self.description = description
        self.start_time = time.time()
        self.chunk_times = []

    def update(self, chunk_duration: Optional[float] = None):
        """Advance the progress by one step."""
        self.current += 1
        if chunk_duration is not None:
            self.chunk_times.append(chunk_duration)
        self._display()

    def _display(self):
        """Print the progress bar."""
        elapsed = time.time() - self.start_time
        percent = (self.current / self.total) * 100

        # Remaining-time estimate
        if self.chunk_times:
            avg_time = np.mean(self.chunk_times)
            remaining = avg_time * (self.total - self.current)
            eta_str = self._format_time(remaining)
        else:
            eta_str = "..."

        # Progress bar
        bar_length = 30
        filled = int(bar_length * self.current / self.total)
        bar = "█" * filled + "░" * (bar_length - filled)

        elapsed_str = self._format_time(elapsed)

        print(f"\r{self.description} [{bar}] {self.current}/{self.total} "
              f"({percent:.1f}%) | Elapsed: {elapsed_str} | ETA: {eta_str}", end="")

        if self.current >= self.total:
            print()  # Newline once finished

    @staticmethod
    def _format_time(seconds: float) -> str:
        """Format a duration in seconds as MM:SS, or HH:MM:SS from one hour up."""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)

        if hours > 0:
            return f"{hours:02d}:{minutes:02d}:{secs:02d}"
        return f"{minutes:02d}:{secs:02d}"


# ==============================================================================
# TTS ENGINE
# ==============================================================================

_tts_model = None
_voices_cache = {}
os.environ["COQUI_TOS_AGREED"] = "1"

def get_model():
    """Load the XTTS v2 model, caching it after the first call."""
    global _tts_model

    if _tts_model is None:
        print("🔄 Loading XTTS v2 model...")
        from TTS.api import TTS

        _tts_model = TTS(Config.MODEL_NAME)

        if _device is not None and _device_name.startswith("cuda"):
            _tts_model = _tts_model.to(_device)

        print("✓ Model loaded")

    return _tts_model


def get_voice_path(voice: str) -> str:
    """Return the path to a voice file, downloading preset voices on first use."""
    global _voices_cache
    import urllib.request

    if voice in _voices_cache:
        return _voices_cache[voice]

    if os.path.isfile(voice):
        _voices_cache[voice] = voice
        return voice

    if voice in Config.PRESET_VOICES:
        url = Config.PRESET_VOICES[voice]
        path = f"/tmp/{voice}.wav"

        if not os.path.exists(path):
            print(f"📥 Downloading voice '{voice}'...")
            urllib.request.urlretrieve(url, path)

        _voices_cache[voice] = path
        return path

    raise FileNotFoundError(f"Voice '{voice}' not found")


# ==============================================================================
# MAIN SYNTHESIS FUNCTIONS
# ==============================================================================

def synthesize_chunk(
    text: str,
    voice_path: str,
    language: str = "fr",
    enable_text_splitting: bool = True
) -> np.ndarray:
    """
    Synthesize one text chunk to audio via direct (low-level) inference.
    Bypasses the SpeakerManager entirely to avoid the FileNotFoundError on the .pth file.
    """
    model_wrapper = get_model()

    # 1. Reach directly into the internal XTTS model
    # It does the actual work, without the buggy file-handling layer
    if hasattr(model_wrapper, 'synthesizer'):
        xtts_model = model_wrapper.synthesizer.tts_model
    else:
        # Rare case or different structure: try direct access
        xtts_model = model_wrapper.tts_model

    # 2. Compute the conditioning latents (voice fingerprint) by hand
    # This turns the reference WAV file into embedding vectors
    try:
        gpt_cond_latent, speaker_embedding = xtts_model.get_conditioning_latents(
            audio_path=[voice_path],
            gpt_cond_len=30,
            max_ref_length=60
        )
    except Exception as e:
        print(f"⚠️ Error computing latents: {e}")
        raise

    # 3. Direct inference
    # Call the raw generation function without going through tts()
    try:
        out = xtts_model.inference(
            text=text,
            language=language,
            gpt_cond_latent=gpt_cond_latent,
            speaker_embedding=speaker_embedding,
            temperature=0.7,        # Sampling temperature (controls output variability)
            length_penalty=1.0,     # Length penalty
            repetition_penalty=2.0, # Discourages stuttering
            top_k=50,
            top_p=0.8,
            enable_text_splitting=enable_text_splitting
        )

        # The result is usually a dict with the audio under the 'wav' key
        if isinstance(out, dict) and 'wav' in out:
            wav = out['wav']
        else:
            wav = out

        # Make sure we return a numpy array on the CPU
        if hasattr(wav, 'cpu'):
            wav = wav.cpu().numpy()
        if isinstance(wav, list):
            wav = np.array(wav, dtype=np.float32)

        return wav

    except Exception as e:
        print(f"⚠️ Error during direct inference: {e}")
        raise


def text_to_speech_long(
    text: str,
    voice: str = "female_fr",
    language: str = "fr",
    output_path: Optional[str] = None,
    enhance: bool = False,
    use_gdrive: bool = False,
    gdrive_folder: Optional[str] = None,
    max_chars_per_chunk: Optional[int] = None,
    show_progress: bool = True,
    enable_text_splitting: bool = True
) -> dict:
    """
    Generate a long audio file (> 1 hour) from text.
    """
    import torch

    # Configuration
    max_chars = max_chars_per_chunk or Config.MAX_CHARS_PER_CHUNK
    voice_path = get_voice_path(voice)

    # Initial estimate
    estimated_duration = TextSplitter.estimate_audio_duration(text)
    print(f"\n📝 Text: {len(text):,} characters")
    print(f"⏱️  Estimated duration: {ProgressTracker._format_time(estimated_duration)}")

    # Split the text
    chunks = TextSplitter.split_for_long_audio(text, max_chars=max_chars)
    print(f"📦 Chunks: {len(chunks)}")

    # Set up progress tracking
    progress = None
    if show_progress:
        progress = ProgressTracker(len(chunks), "🎙️ Synthesis")

    # Generate the audio chunk by chunk
    audio_chunks = []

    for i, chunk in enumerate(chunks):
        chunk_start = time.time()

        try:
            wav = synthesize_chunk(
                text=chunk,
                voice_path=voice_path,
                language=language,
                enable_text_splitting=enable_text_splitting
            )
            audio_chunks.append(wav)

        except Exception as e:
            print(f"\n⚠️ Error in chunk {i+1}: {e}")
            # Count the failed chunk, then carry on with the others
            if progress:
                progress.update()
            continue

        # Free GPU memory periodically
        if _device_name.startswith("cuda") and (i + 1) % 10 == 0:
            torch.cuda.empty_cache()

        chunk_duration = time.time() - chunk_start
        if progress:
            progress.update(chunk_duration)

    if not audio_chunks:
        raise RuntimeError("No audio generated")

    print("\n🔗 Concatenating chunks...")

    # Concatenate with crossfade
    final_audio = AudioProcessor.concatenate_chunks(
        audio_chunks,
        Config.SAMPLE_RATE,
        Config.CROSSFADE_DURATION
    )

    # Release the chunks from memory
    del audio_chunks
    gc.collect()
    if _device_name.startswith("cuda"):
        torch.cuda.empty_cache()

    # Post-processing
    if enhance:
        print("✨ Post-processing...")
        final_audio = AudioProcessor.enhance(
            final_audio,
            Config.SAMPLE_RATE,
            normalize=True,
            warmth=True
        )
    else:
        final_audio = AudioProcessor.normalize(final_audio)

    # Convert to int16
    final_audio = (final_audio * 32767).astype(np.int16)

    # Build the file name
    if output_path is None:
        h = hashlib.md5(text[:100].encode()).hexdigest()[:8]
        output_path = f"tts_long_{voice}_{h}.wav"

    # Output folder
    if use_gdrive:
        folder = Path(gdrive_folder or Config.GDRIVE_FOLDER)
        folder.mkdir(parents=True, exist_ok=True)
        final_path = folder / Path(output_path).name
    else:
        final_path = Path(output_path)

    # Save
    print(f"💾 Saving: {final_path}")
    with wave.open(str(final_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(Config.SAMPLE_RATE)
        wav_file.writeframes(final_audio.tobytes())

    # Compute the actual duration
    duration = len(final_audio) / Config.SAMPLE_RATE

    print("\n✅ Audio generated successfully!")
    print(f"   📁 File: {final_path}")
    print(f"   ⏱️  Duration: {ProgressTracker._format_time(duration)}")
    print(f"   📦 Chunks: {len(chunks)}")
    print(f"   🎤 Voice: {voice}")

    return {
        'path': str(final_path),
        'sample_rate': Config.SAMPLE_RATE,
        'duration_seconds': duration,
        'duration_formatted': ProgressTracker._format_time(duration),
        'audio_data': final_audio,
        'voice': voice,
        'language': language,
        'device': _device_name,
        'chunks_count': len(chunks),
        'text_length': len(text)
    }


def text_to_speech(
    text: str,
    voice: str = "female_fr",
    language: str = "fr",
    output_path: Optional[str] = None,
    enhance: bool = False,
    use_gdrive: bool = False,
    gdrive_folder: Optional[str] = None,
    enable_text_splitting: bool = True
) -> dict:
    """
    Generate an audio file from text with XTTS v2.
    """
    # Switch automatically to the long version for texts over 10,000 characters
    if len(text) > 10000:
        print("📢 Long text detected - using text_to_speech_long()")
        return text_to_speech_long(
            text=text,
            voice=voice,
            language=language,
            output_path=output_path,
            enhance=enhance,
            use_gdrive=use_gdrive,
            gdrive_folder=gdrive_folder,
            enable_text_splitting=enable_text_splitting
        )

    voice_path = get_voice_path(voice)

    # Generate the audio (XTTS splits the text natively when enable_text_splitting is set)
    wav = synthesize_chunk(
        text=text,
        voice_path=voice_path,
        language=language,
        enable_text_splitting=enable_text_splitting
    )

    # Post-processing
    if enhance:
        audio = AudioProcessor.enhance(wav, Config.SAMPLE_RATE)
    else:
        audio = AudioProcessor.normalize(wav)

    audio = (audio * 32767).astype(np.int16)

    # File name
    if output_path is None:
        h = hashlib.md5(text.encode()).hexdigest()[:8]
        output_path = f"tts_{voice}_{h}.wav"

    # Output folder
    if use_gdrive:
        folder = Path(gdrive_folder or Config.GDRIVE_FOLDER)
        folder.mkdir(parents=True, exist_ok=True)
        final_path = folder / Path(output_path).name
    else:
        final_path = Path(output_path)

    # Save
    with wave.open(str(final_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(Config.SAMPLE_RATE)
        wav_file.writeframes(audio.tobytes())

    duration = len(audio) / Config.SAMPLE_RATE

    print(f"✓ Audio generated: {final_path}")
    print(f"  Duration: {duration:.2f}s | Voice: {voice}")

    return {
        'path': str(final_path),
        'sample_rate': Config.SAMPLE_RATE,
        'duration_seconds': duration,
        'audio_data': audio,
        'voice': voice,
        'language': language,
        'device': _device_name
    }


# ==============================================================================
# UTILITIES
# ==============================================================================

def preview_audio(result: dict) -> None:
    """Preview the audio in the notebook."""
    from IPython.display import Audio, display

    audio = result['audio_data']
    if audio.dtype == np.int16:
        audio = audio.astype(np.float32) / 32768.0

    display(Audio(audio, rate=result['sample_rate']))


def list_voices() -> list:
    """List the available voices."""
    return list(Config.PRESET_VOICES.keys())


def list_languages() -> list:
    """List the supported languages."""
    return ["en", "es", "fr", "de", "it", "pt", "pl", "tr",
            "ru", "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi"]


def clear_cache():
    """Release the model and free memory."""
    global _tts_model
    import torch

    _tts_model = None
    gc.collect()

    if _device_name.startswith("cuda"):
        torch.cuda.empty_cache()

    print("✓ Cache cleared")


def estimate_duration(text: str) -> dict:
    """
    Estimate the audio duration and chunk count for a text.
    """
    duration = TextSplitter.estimate_audio_duration(text)
    chunks = len(TextSplitter.split_for_long_audio(text))

    return {
        'chars': len(text),
        'estimated_seconds': duration,
        'estimated_formatted': ProgressTracker._format_time(duration),
        'chunks_estimate': chunks
    }


# ==============================================================================
# ALIASES
# ==============================================================================

tts = text_to_speech
tts_long = text_to_speech_long


# ==============================================================================
# INITIALIZATION
# ==============================================================================

def init():
    """Initialize the module."""
    detect_device()
    print("✅ XTTS v2 Long Audio module loaded")
    print(f"   Device: {_device_name}")
    print(f"   Voices: {list_voices()}")
    print("   enable_text_splitting: enabled by default")


# Auto-init when imported as a module
if __name__ != "__main__":
    try:
        detect_device()
    except Exception:
        pass


# ==============================================================================
# EXAMPLE USAGE
# ==============================================================================

if __name__ == "__main__":
    # Install dependencies if needed
    install_dependencies()

    # Initialization
    init()

    # Example with a short text
    print("\n" + "="*60)
    print("EXAMPLE 1: Short text")
    print("="*60)

    short_text = """
    Ce document présente les Manifold-Constrained Hyper-Connections,
    une architecture novatrice conçue par DeepSeek-AI pour stabiliser
    l'entraînement des grands modèles de langage.
    """

    result = text_to_speech(
        text=short_text.strip(),
        voice="female_fr",
        enhance=True
    )

    print(f"\nResult: {result['duration_seconds']:.2f}s")

    # Example with a (simulated) long text
    print("\n" + "="*60)
    print("EXAMPLE 2: Estimate for a long text")
    print("="*60)

    # Simulate roughly 1 hour of text (about 55,000 characters)
    long_text = short_text.strip() * 300  # ~55,000 chars ≈ 1 hour

    estimation = estimate_duration(long_text)
    print(f"\nEstimate for {estimation['chars']:,} characters:")
    print(f"  Duration: {estimation['estimated_formatted']}")
    print(f"  Chunks: {estimation['chunks_estimate']}")

    # To actually generate the audio:
    # result = text_to_speech_long(long_text, voice="female_fr", show_progress=True)

⚙️ Device: cuda (Tesla T4)
✅ XTTS v2 Long Audio module loaded
   Device: cuda (Tesla T4)
   Voices: ['female_fr', 'male_fr']
   enable_text_splitting: enabled by default

EXAMPLE 1: Short text
🔄 Loading XTTS v2 model...
✓ Model loaded
✓ Audio generated: tts_female_fr_151473ed.wav
  Duration: 9.79s | Voice: female_fr

Result: 9.79s

EXAMPLE 2: Estimate for a long text

Estimate for 55,200 characters:
  Duration: 01:01:20
  Chunks: 108
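
The duration shown above follows directly from the fixed rate of 15 characters per second used by `estimate_audio_duration`. As a minimal standalone sketch, the arithmetic can be reproduced without loading the model; `format_time` copies the logic of `ProgressTracker._format_time`, while `estimate` and `CHARS_PER_SECOND` are illustrative names that do not exist in the notebook.

```python
CHARS_PER_SECOND = 15.0  # same default as estimate_audio_duration


def format_time(seconds: float) -> str:
    """Format seconds as MM:SS, or HH:MM:SS once an hour is reached."""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    if hours > 0:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def estimate(n_chars: int) -> str:
    """Estimated audio duration for a text of n_chars characters."""
    return format_time(n_chars / CHARS_PER_SECOND)


print(estimate(55_200))  # 55,200 / 15 = 3,680 s -> 01:01:20
print(estimate(6_715))   # 6,715 / 15 = 447.67 s -> 07:27
```

The estimate depends only on the character count, so it ignores pauses, punctuation, and the speaking pace of the chosen voice.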

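The chunk count comes from greedily packing whole sentences into chunks of at most `max_chars` characters. The sketch below isolates that packing step under a simplifying assumption: unlike `split_for_long_audio`, it keeps an oversized sentence as a single chunk instead of splitting it on word boundaries. `pack_sentences` is a hypothetical name used only for illustration.

```python
import re
from typing import List


def pack_sentences(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Simplification: a sentence longer than max_chars becomes its own chunk.
    """
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not current:
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_chars:
            # The sentence still fits, joined by a single space
            current += " " + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "One. Two two. Three three three. Four."
print(pack_sentences(sample, max_chars=15))
# ['One. Two two.', 'Three three three.', 'Four.']
```

Packing by sentence keeps each synthesized chunk ending on natural punctuation, which is where a crossfade joint is least audible.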

In [None]:
# Read the file
with open("mon_texte_long.txt", "r", encoding="utf-8") as f:
    texte_complet = f.read()

# Start the generation
text_to_speech_long(
    text=texte_complet,
    voice="female_fr",
    language="fr"
)


📝 Text: 6,715 characters
⏱️  Estimated duration: 07:27
📦 Chunks: 16
🎙️ Synthesis [██████████████████████████████] 16/16 (100.0%) | Elapsed: 03:10 | ETA: 00:00

🔗 Concatenating chunks...
💾 Saving: tts_long_female_fr_8aba435b.wav

✅ Audio generated successfully!
   📁 File: tts_long_female_fr_8aba435b.wav
   ⏱️  Duration: 06:37
   📦 Chunks: 16
   🎤 Voice: female_fr


{'path': 'tts_long_female_fr_8aba435b.wav',
 'sample_rate': 24000,
 'duration_seconds': 397.586,
 'duration_formatted': '06:37',
 'audio_data': array([28, 21, 31, ...,  3,  7,  1], dtype=int16),
 'voice': 'female_fr',
 'language': 'fr',
 'device': 'cuda (Tesla T4)',
 'chunks_count': 16,
 'text_length': 6715}
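
This run was estimated at 07:27 but produced 06:37 of audio, which suggests the default rate of 15 characters per second is somewhat conservative for this voice and text. A minimal sketch for measuring the rate actually achieved, using the `text_length` and `duration_seconds` values from the result dictionary above; `observed_chars_per_second` is a hypothetical helper, not part of the notebook.

```python
def observed_chars_per_second(text_length: int, duration_seconds: float) -> float:
    """Speaking rate actually achieved by a finished synthesis run."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return text_length / duration_seconds


# Values copied from the result dictionary above
rate = observed_chars_per_second(text_length=6715, duration_seconds=397.586)
print(f"{rate:.1f} characters per second")  # 16.9 characters per second
```

The measured value could then be passed as `chars_per_second` to `TextSplitter.estimate_audio_duration` to tighten later estimates for the same voice, keeping in mind that a single run is a small sample.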