# Text-to-Speech Podcast Generator

This notebook completes our podcast creation pipeline by converting our structured script into a fully-voiced podcast. The workflow includes:

1. **Environment setup**: Installing required packages and downloading FFmpeg
2. **Voice selection**: Configuring different voices for each speaker
3. **Text processing**: Handling special characters and expressions for TTS compatibility
4. **Audio generation**: Converting text to speech using Kokoro TTS
5. **Podcast assembly**: Combining individual segments with proper pauses and transitions
6. **Audio export**: Creating distribution-ready files in various formats

By the end of this notebook, you'll have a complete podcast audio file with a distinct voice for each speaker, exported in formats ready for distribution.

This is the final part of our four-part series:
1. Text extraction and cleaning
2. Podcast script generation
3. TTS-optimized formatting
4. Audio generation (this notebook)

## Package Installation

First, we need to install the required packages:
- **torch and torchaudio**: The PyTorch runtime and its audio utilities, used for model inference and device detection
- **pydub**: For manipulating audio segments
- **soundfile**: For reading and writing audio files
- **kokoro**: A high-quality open-source TTS library
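
After installation, it can help to confirm that each package is importable in the current kernel. The snippet below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this notebook or any of the listed libraries.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names used later in this notebook
required = ["torch", "torchaudio", "pydub", "soundfile", "kokoro"]
missing = missing_packages(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are importable.")
```

If anything is reported missing, rerun the installation cell and restart the kernel before continuing.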

In [None]:
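# Install the required packages
!pip install -q torch torchaudio pydub soundfile numpy tqdm requests
# Quote the version specifier so the shell does not treat ">" as an output redirect
!pip install -q "kokoro>=0.7.11"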

## FFmpeg Setup

pydub relies on FFmpeg to encode compressed formats such as MP3 and OGG; WAV export works without it. The function below targets Windows and will:
- Download FFmpeg for Windows
- Extract it to a local directory
- Add it to the system PATH for use with pydub
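
Because the download targets Windows, a quick check for an existing FFmpeg binary avoids unnecessary work on any platform. The following is a sketch, assuming FFmpeg is discoverable through `PATH`; `find_ffmpeg` is a hypothetical helper name.

```python
import shutil

def find_ffmpeg():
    """Return the path of an ffmpeg executable on PATH, or None if it is absent."""
    return shutil.which("ffmpeg")

ffmpeg_path = find_ffmpeg()
if ffmpeg_path:
    print(f"Found ffmpeg at {ffmpeg_path}")
else:
    print("ffmpeg not found on PATH; MP3 export is unavailable until it is installed.")
```

On Linux or macOS, FFmpeg is typically installed through the system package manager (for example `apt install ffmpeg` or `brew install ffmpeg`) rather than downloaded by this notebook.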

In [None]:
import os
import tempfile
import zipfile
import requests
import shutil

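def setup_ffmpeg():
    """Download and set up FFmpeg for Windows. Returns True on success."""
    if os.name != "nt":
        # The build below is a Windows binary, so skip the download elsewhere
        print("Automatic FFmpeg download is only supported on Windows.")
        return False
    os.makedirs("ffmpeg", exist_ok=True)
    
    print("Downloading ffmpeg...")
    ffmpeg_url = "https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-win64-gpl.zip"
    r = requests.get(ffmpeg_url, stream=True, timeout=60)
    r.raise_for_status()  # Fail early on HTTP errors instead of saving an error page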
    zip_path = os.path.join(tempfile.gettempdir(), "ffmpeg.zip")
    
    with open(zip_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    
    print("Extracting ffmpeg...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall("ffmpeg")
    
    # Find the ffmpeg.exe and add its directory to PATH
    for root, dirs, files in os.walk("ffmpeg"):
        if "ffmpeg.exe" in files:
            ffmpeg_path = os.path.join(root, "ffmpeg.exe")
            os.environ["PATH"] += os.pathsep + os.path.dirname(ffmpeg_path)
            print(f"ffmpeg set up at {ffmpeg_path}")
            return True
    
    return False

# Set up ffmpeg only if it is not already available on PATH
try:
    ffmpeg_setup = shutil.which("ffmpeg") is not None or setup_ffmpeg()
except Exception as e:
    print(f"Error setting up ffmpeg: {e}")
    ffmpeg_setup = False

In [3]:
import torch
import ast
from pydub import AudioSegment
from tqdm import tqdm
import numpy as np

import os
import io
import soundfile as sf
from IPython.display import Audio, display
from kokoro import KPipeline

In [None]:
# Create output directory
os.makedirs("podcast_segments", exist_ok=True)

# Load podcast transcript
with open('./new_generated_podcast.txt', 'r', encoding='utf-8') as file:
    PODCAST_TEXT = file.read()

# Parse the podcast script
podcast_segments = ast.literal_eval(PODCAST_TEXT)
print(f"Loaded {len(podcast_segments)} segments")

# Detect the compute device (informational; this variable is not passed to Kokoro below)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
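
`ast.literal_eval` only guarantees that the file holds a valid Python literal, not that it has the shape the rest of the notebook expects. As a sketch, assuming the script is a list of `(speaker, text)` pairs with the labels `"Speaker 1"` and `"Speaker 2"` (which is how the generation loop later unpacks it), a hypothetical `validate_segments` helper could catch malformed input early:

```python
import ast

def validate_segments(segments, allowed_speakers=("Speaker 1", "Speaker 2")):
    """Raise ValueError unless segments is a list of (speaker, text) string pairs."""
    if not isinstance(segments, list):
        raise ValueError("Script must be a list of (speaker, text) pairs")
    for index, item in enumerate(segments):
        if not (isinstance(item, (tuple, list)) and len(item) == 2):
            raise ValueError(f"Segment {index} is not a (speaker, text) pair")
        speaker, text = item
        if speaker not in allowed_speakers:
            raise ValueError(f"Segment {index} has unknown speaker {speaker!r}")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"Segment {index} has empty or non-string text")
    return segments

# Illustrative script text, not taken from the real transcript
example = '[("Speaker 1", "Welcome to the show!"), ("Speaker 2", "Glad to be here.")]'
segments = validate_segments(ast.literal_eval(example))
print(f"Validated {len(segments)} segments")
```

Rejecting unknown speaker labels is stricter than the notebook's own loop, which treats every label other than `"Speaker 1"` as the second speaker; relax that check if your script uses different names.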

In [None]:
# Initialize Kokoro pipelines for each speaker
print("Loading Kokoro models...")

# Initialize separate pipelines for each speaker with different voices
# Using American English as the base language
speaker1_pipeline = KPipeline(lang_code='a')  # American English
speaker2_pipeline = KPipeline(lang_code='a')  # American English

# Helper function to convert audio to AudioSegment
def convert_to_audio_segment(audio_array, sample_rate=24000):
    """Convert numpy array to AudioSegment for easier manipulation"""
    # Write WAV data to an in-memory buffer (no file is created on disk)
    temp_wav = io.BytesIO()
    sf.write(temp_wav, audio_array, sample_rate, format='wav')
    temp_wav.seek(0)
    
    # Load as AudioSegment
    audio_segment = AudioSegment.from_wav(temp_wav)
    return audio_segment
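
To make the conversion step concrete, here is a standard-library sketch of what writing float samples to an in-memory WAV involves. It assumes mono samples in the range [-1.0, 1.0], the 24 kHz rate used as the default in `convert_to_audio_segment`, and 16-bit PCM output. The notebook itself delegates this work to `soundfile`; `floats_to_wav_bytes` is a hypothetical name used only for illustration.

```python
import io
import math
import struct
import wave

def floats_to_wav_bytes(samples, sample_rate=24000):
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clip to the valid range, then scale to signed 16-bit integers
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    return buffer.getvalue()

# One second of a 440 Hz sine tone as test input
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
wav_bytes = floats_to_wav_bytes(tone)

# Read the header back to confirm the rate and frame count
with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    print(reader.getframerate(), reader.getnframes())
```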

## Voice Selection and Configuration

Kokoro offers multiple high-quality voices across various languages. For our podcast, we'll select voices from American English options:

### Selected Voices

- **Speaker 1**: `af_heart` - Female American English voice with excellent quality (Grade A)
- **Speaker 2**: `am_fenrir` - Male American English voice with good quality (Grade C+)

These voices provide a clear distinction between speakers, enhancing listener engagement.

### Available American English Voices
| Voice ID | Gender | Quality Grade | Description |
|----------|--------|--------------|-------------|
| af_heart | Female | A | High-quality professional female voice |
| af_bella | Female | A- | Energetic female voice with extensive training data |
| af_nicole | Female | B- | Good quality female voice with headphone recording |
| am_fenrir | Male | C+ | Good quality male voice |
| am_michael | Male | C+ | Clear male voice with adequate training |
| am_puck | Male | C+ | Natural male voice with good articulation |

You can change the voice assignments in the `generate_speech_kokoro` function below.

For more voices or other languages, see the [Kokoro voice documentation](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/VOICES.md).
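
One way to keep voice assignments easy to change is to hold them in a single mapping. The sketch below uses the voice IDs selected above and the speeds set in the generation function; the `VOICE_CONFIG` name and `get_voice_settings` helper are hypothetical and not part of Kokoro.

```python
# Voice and speed per speaker ID
VOICE_CONFIG = {
    "speaker1": {"voice": "af_heart", "speed": 1.0},
    "speaker2": {"voice": "am_fenrir", "speed": 1.1},
}

def get_voice_settings(speaker):
    """Return (voice, speed) for a speaker ID; raises KeyError for unknown IDs."""
    settings = VOICE_CONFIG[speaker]
    return settings["voice"], settings["speed"]

voice, speed = get_voice_settings("speaker2")
print(voice, speed)
```

Swapping a voice then means editing one dictionary entry instead of the branching logic inside the synthesis function.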

In [6]:
# Function to generate speech using Kokoro
def generate_speech_kokoro(text, speaker="speaker1"):
    """Generate speech using Kokoro TTS
    
    Args:
        text: Text to synthesize
        speaker: 'speaker1' for Speaker 1, 'speaker2' for Speaker 2
    
    Returns:
        audio_segment: AudioSegment of the generated audio
    """
    try:
        # Clean up text: strip double quotes and surrounding whitespace
        text = text.replace('"', '').strip()
        
        # Handle expressions
        text = text.replace("[laughs]", ", hahaha, ")
        text = text.replace("[sigh]", ", *sigh*, ")
        
        # Select the appropriate pipeline and voice
        if speaker == "speaker1":
            # Use a female voice for Speaker 1
            pipeline = speaker1_pipeline
            voice = 'af_heart'  # Female voice
            speed = 1.0
        else:
            # Use a male voice for Speaker 2
            pipeline = speaker2_pipeline
            voice = 'am_fenrir'  # Male voice
            speed = 1.1  # Slightly faster
        
        # Generate the audio
        full_audio = None
        
        # Split the text at punctuation boundaries so it is synthesized in short chunks
        generator = pipeline(text, voice=voice, speed=speed, split_pattern=r'[,.!?;]\s+')
        
        for _, _, audio in generator:
            if full_audio is None:
                full_audio = audio
            else:
                full_audio = np.concatenate((full_audio, audio))
        
        # Convert to AudioSegment
        audio_segment = convert_to_audio_segment(full_audio)
        
        return audio_segment
    
    except Exception as e:
        print(f"Error generating speech with Kokoro: {e}")
        return None
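
The expression handling above covers two tags with exact string matches. A slightly more general sketch could use a regular expression, assuming expression tags always appear in square brackets and that unknown tags should simply be dropped. Both of those are assumptions, as are the `EXPRESSION_MAP` table and the `clean_for_tts` name.

```python
import re

# Hypothetical mapping from bracketed tags to speakable text
EXPRESSION_MAP = {
    "laughs": "hahaha",
    "sigh": "sigh",
}

def clean_for_tts(text):
    """Strip double quotes and replace [tag] expressions with speakable text."""
    text = text.replace('"', "")

    def replace_tag(match):
        spoken = EXPRESSION_MAP.get(match.group(1).strip().lower())
        # Known tags are set off by commas; unknown tags are removed
        return f", {spoken}, " if spoken else " "

    text = re.sub(r"\[([^\]]+)\]", replace_tag, text)
    # Collapse whitespace runs and repeated commas left by the substitutions
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"(\s*,)+", ",", text)
    return text.strip(" ,")

print(clean_for_tts('That is "great" [laughs] really [music] great.'))
```

New expressions can then be supported by adding entries to the mapping rather than more `replace` calls.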

In [7]:
def export_podcast(podcast_audio, base_filename="final_podcast"):
    """Export the podcast in multiple formats"""
    # Create output directory
    os.makedirs("podcast_export", exist_ok=True)
    
    # Export formats
    formats = {
        "mp3": {"format": "mp3", "bitrate": "192k", "tags": {"album": "AI Podcast Series", "title": "The Role of Data in LLMs"}},
        "wav": {"format": "wav"},
        "ogg": {"format": "ogg", "bitrate": "128k"},
    }
    
    results = {}
    
    for fmt, params in formats.items():
        try:
            output_path = f"podcast_export/{base_filename}.{fmt}"
            podcast_audio.export(output_path, **params)
            file_size = os.path.getsize(output_path) / (1024 * 1024)  # MB
            results[fmt] = {"path": output_path, "size": file_size}
            print(f"Exported {fmt.upper()} file: {output_path} ({file_size:.2f} MB)")
        except Exception as e:
            print(f"Error exporting {fmt} format: {e}")
    
    return results
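
For planning purposes, the size of a compressed export can be estimated from its bitrate and duration: size ≈ bitrate × duration / 8. The sketch below treats the bitrate as a constant average and ignores metadata and container overhead, so the reported sizes will differ slightly; `estimate_size_mb` is a hypothetical helper and the 10-minute duration is an illustrative value.

```python
def estimate_size_mb(duration_seconds, bitrate_kbps):
    """Approximate the size in MB (1024 * 1024 bytes) of audio at a given average bitrate."""
    total_bits = bitrate_kbps * 1000 * duration_seconds
    return total_bits / 8 / (1024 * 1024)

# A 10-minute episode at the two bitrates used in export_podcast
for label, kbps in [("mp3", 192), ("ogg", 128)]:
    print(f"{label}: ~{estimate_size_mb(600, kbps):.2f} MB")
```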

In [None]:
# Generate the podcast
final_podcast = AudioSegment.empty()
segment_paths = []

# Generate each segment, append it to the podcast, and save it (WAV if ffmpeg is not available)
for i, (speaker, text) in enumerate(tqdm(podcast_segments, desc="Generating podcast")):
    speaker_id = "speaker1" if speaker == "Speaker 1" else "speaker2"
    print(f"Processing segment {i+1}/{len(podcast_segments)}: {speaker}")
    
    # Generate audio for this segment
    audio_segment = generate_speech_kokoro(text, speaker_id)
    
    if audio_segment is not None:
        # Add slight pause between segments
        if i > 0:
            final_podcast += AudioSegment.silent(duration=500)  # 500ms pause
        
        # Add to podcast
        final_podcast += audio_segment
        
        # Save individual segment - use wav format if ffmpeg is not available
        if ffmpeg_setup:
            segment_path = f"podcast_segments/segment_{i+1:02d}_{speaker_id}.mp3"
            audio_segment.export(segment_path, format="mp3", bitrate="192k")
        else:
            segment_path = f"podcast_segments/segment_{i+1:02d}_{speaker_id}.wav"
            audio_segment.export(segment_path, format="wav")
            
        segment_paths.append(segment_path)
    else:
        print(f"Warning: Failed to generate segment {i+1}")

# Export the full podcast - use wav format if ffmpeg is not available
if ffmpeg_setup:
    final_podcast.export("final_podcast_kokoro.mp3", format="mp3", bitrate="192k")
    print("Podcast generated successfully as MP3!")
    audio_file = "final_podcast_kokoro.mp3"
else:
    final_podcast.export("final_podcast_kokoro.wav", format="wav")
    print("Podcast generated successfully as WAV (ffmpeg not available for MP3 conversion)!")
    audio_file = "final_podcast_kokoro.wav"
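
Because segments are joined with a fixed 500 ms pause, the start time of each segment in the final file follows directly from the segment durations. The sketch below computes those offsets, which could serve as chapter markers. It assumes every segment was generated successfully; the durations are made-up illustrative values and `segment_start_times` is a hypothetical helper.

```python
def segment_start_times(durations_ms, pause_ms=500):
    """Return the start offset (ms) of each segment and the total length (ms)."""
    starts = []
    position = 0
    for index, duration in enumerate(durations_ms):
        if index > 0:
            position += pause_ms  # a pause precedes every segment except the first
        starts.append(position)
        position += duration
    return starts, position

# Illustrative durations in milliseconds (with pydub, len(audio_segment) gives this value)
starts, total = segment_start_times([4200, 3100, 5000])
print(starts, total)
```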

In [None]:
# Play a preview of the final podcast
display(Audio(audio_file))