# 🎤 Presentation Transcript Generator

> **Professional AI-Powered Slide-to-Speech Tool** - Transform presentation slides into natural, fluent speech transcripts using AI Agent technology

---

## ✨ Features

- 📊 **Smart Slide Analysis** - Automatically parse PDF presentation content
- 🎙️ **Speech Rate Detection** - Upload 20-second audio to automatically calculate speaking speed
- 🎭 **Multiple Speech Styles** - Supports lively, serious, motivational, educational, and conversational styles
- 🌐 **Multilingual Support** - Traditional Chinese, English, Simplified Chinese, Japanese, Korean, Spanish, French, and German
- 👨‍🏫 **Expert Role Playing** - AI generates content from domain expert perspectives
- 📥 **One-Click Download** - Export in multiple formats
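
The speech rate detection feature comes down to dividing a count of spoken units by the clip length. The sketch below shows that arithmetic, assuming the transcribed text and the clip duration are already known. The name `speech_rate` and its `unit` parameter are illustrative and not part of the notebook. The notebook itself counts non-whitespace characters; the `"words"` mode is an assumption added here for space-delimited languages.

```python
def speech_rate(text: str, duration_seconds: float, unit: str = "chars") -> float:
    """Return spoken units per minute for a transcribed clip (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if unit == "chars":
        # Count every non-whitespace character (suits Chinese, Japanese, Korean)
        count = sum(1 for ch in text if not ch.isspace())
    elif unit == "words":
        # Assumption: whitespace-delimited tokens approximate spoken words
        count = len(text.split())
    else:
        raise ValueError("unit must be 'chars' or 'words'")
    return count / duration_seconds * 60
```

A character-based rate and a word-based rate are not interchangeable, so whichever unit is measured should also be the unit used when sizing the transcript.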

---

## 🚀 Workflow

1. **Environment Setup** - Install necessary packages
2. **Configure Parameters** - Set speech duration, style, language, etc.
3. **Upload Files** - Upload presentation PDF and audio (optional)
4. **Generate Transcript** - AI automatically generates professional transcript
5. **Download Results** - Get complete transcript file
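
Steps 2 and 4 are linked by simple arithmetic: the duration in minutes multiplied by the speaking rate gives a target length, which is then split evenly across the slides. The sketch below mirrors that calculation. `word_budget` is a hypothetical helper name used only for illustration.

```python
from typing import Tuple


def word_budget(duration_minutes: int, words_per_minute: float, slide_count: int) -> Tuple[int, int]:
    """Return (total target words, target words per slide)."""
    if slide_count <= 0:
        raise ValueError("slide_count must be positive")
    # Total length the speaker can cover in the allotted time
    total = int(duration_minutes * words_per_minute)
    # Even split; the prompt lets the model shift words between slides
    per_slide = total // slide_count
    return total, per_slide
```

For a 10-minute talk at 200 words per minute over 8 slides, this gives 2000 words in total and 250 per slide.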

---

**Last Updated**: December 2025


In [None]:
%%capture
# Install required packages
!pip install -q pymupdf pillow
!pip install -q pydub
!pip install -q openai
!pip install -q ipywidgets
!apt-get install -qq ffmpeg

print("✅ All packages installed!")


In [None]:
# Import standard libraries
import os
import io
import json
import base64
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional, Tuple

# Import third-party libraries
import fitz  # PyMuPDF
from PIL import Image
from pydub import AudioSegment
from openai import OpenAI
from IPython.display import display, HTML, clear_output
import ipywidgets as widgets

# Configure warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Modules imported successfully!")
print("📌 Please set your API Key in the next step")

In [None]:
# Configure CSS styles - Use Noto Sans TC font
custom_css = """
<style>
@import url('https://fonts.googleapis.com/css2?family=Noto+Sans+TC:wght@300;400;500;700&display=swap');

* {
    font-family: 'Noto Sans TC', 'Segoe UI', Arial, sans-serif !important;
}

.widget-label {
    font-weight: 500 !important;
    color: #2c3e50 !important;
    font-size: 14px !important;
}

.widget-text input, .widget-textarea textarea, .widget-dropdown select {
    border: 2px solid #e0e0e0 !important;
    border-radius: 8px !important;
    padding: 10px !important;
    font-size: 14px !important;
    transition: all 0.3s ease !important;
}

.widget-text input:focus, .widget-textarea textarea:focus, .widget-dropdown select:focus {
    border-color: #4CAF50 !important;
    box-shadow: 0 0 0 3px rgba(76, 175, 80, 0.1) !important;
    outline: none !important;
}

.widget-button {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%) !important;
    color: white !important;
    border: none !important;
    border-radius: 8px !important;
    padding: 12px 24px !important;
    font-weight: 500 !important;
    font-size: 14px !important;
    cursor: pointer !important;
    transition: all 0.3s ease !important;
}

.widget-button:hover {
    transform: translateY(-2px) !important;
    box-shadow: 0 5px 15px rgba(102, 126, 234, 0.4) !important;
}

.success-box {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    padding: 20px;
    border-radius: 12px;
    margin: 20px 0;
    box-shadow: 0 4px 15px rgba(0,0,0,0.1);
}

.info-box {
    background: #f8f9fa;
    border-left: 4px solid #667eea;
    padding: 15px;
    border-radius: 8px;
    margin: 15px 0;
}

.transcript-output {
    background: white;
    border: 2px solid #e0e0e0;
    border-radius: 12px;
    padding: 25px;
    margin: 20px 0;
    box-shadow: 0 2px 10px rgba(0,0,0,0.05);
    max-height: 500px;
    overflow-y: auto;
}

.slide-header {
    background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
    color: white;
    padding: 12px 20px;
    border-radius: 8px;
    font-weight: 600;
    font-size: 16px;
    margin-top: 20px;
    margin-bottom: 10px;
}

.slide-content {
    color: #2c3e50;
    line-height: 1.8;
    font-size: 15px;
    padding: 10px 20px;
}

.progress-indicator {
    background: #f8f9fa;
    border-radius: 12px;
    padding: 20px;
    margin: 15px 0;
    border: 2px solid #e0e0e0;
}

h1, h2, h3 {
    color: #2c3e50 !important;
    font-weight: 600 !important;
}
</style>
"""

display(HTML(custom_css))
print("✅ UI style configuration complete! Using Noto Sans TC font")

In [None]:
class PDFProcessor:
    """Class for processing PDF slides"""
    
    def __init__(self):
        self.slides_content = []
    
    def extract_slides(self, pdf_path: str) -> List[Dict[str, str]]:
        """
        Extract content from each page of the PDF
        
        Args:
            pdf_path: Path to the PDF file
            
        Returns:
            List of page content [{"page": 1, "text": "...", "image": "..."}, ...]
        """
        try:
            doc = fitz.open(pdf_path)
            
            # Check if PDF is empty
            if len(doc) == 0:
                doc.close()
                raise Exception("This PDF file does not contain any pages")
            
            slides = []
            
            for page_num in range(len(doc)):
                page = doc[page_num]
                
                # Extract text
                text = page.get_text().strip()
                
                # Convert to image (for visualization or OCR)
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
                img_data = pix.tobytes("png")
                
                slides.append({
                    "page": page_num + 1,
                    "text": text if text else "[No text content on this page]",
                    "image": base64.b64encode(img_data).decode()
                })
            
            doc.close()
            self.slides_content = slides
            return slides
            
        except Exception as e:
            if "PDF" in str(e):
                raise Exception(f"PDF processing error: {str(e)}")
            else:
                raise Exception(f"PDF processing error: Unable to read file, please check if the file format is correct")
    
    def get_slide_summary(self) -> str:
        """Get slide summary"""
        if not self.slides_content:
            return "No slides loaded yet"
        
        summary = f"Total {len(self.slides_content)} slides\n\n"
        for slide in self.slides_content[:3]:  # Show preview of first 3 slides
            summary += f"📄 Page {slide['page']}:\n{slide['text'][:100]}...\n\n"
        
        if len(self.slides_content) > 3:
            summary += f"...and {len(self.slides_content) - 3} other pages"
        
        return summary


class AudioAnalyzer:
    """Analyze audio and calculate speech rate using the OpenAI Whisper transcription API"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.words_per_minute = None
    
    def _convert_m4a_to_mp3(self, audio_path: str) -> str:
        """Convert m4a format to mp3"""
        try:
            audio = AudioSegment.from_file(audio_path)
            mp3_path = "/tmp/converted_audio.mp3"
            audio.export(mp3_path, format="mp3", bitrate="128k")
            return mp3_path
        except Exception as e:
            raise Exception(f"Audio format conversion error: {str(e)}")
    
    def analyze_audio(self, audio_path: str) -> float:
        """Analyze audio and calculate speech rate using the OpenAI Whisper transcription API"""
        try:
            audio = AudioSegment.from_file(audio_path)
            duration_seconds = len(audio) / 1000.0
            
            # Check audio duration
            if duration_seconds < 5:
                raise Exception("Audio duration too short (less than 5 seconds), suggest uploading around 20 seconds for more accurate results")
            if duration_seconds > 120:
                raise Exception("Audio duration too long (over 2 minutes), please upload a 20-60 second audio sample")
            
            # Convert m4a to mp3 if necessary
            if audio_path.lower().endswith('.m4a'):
                print("🔄 m4a format detected, converting to mp3...")
                audio_path = self._convert_m4a_to_mp3(audio_path)
            
            # Transcribe using the OpenAI transcription API (whisper-1 model)
            print("🎙️ Transcribing with the Whisper API...")
            
            with open(audio_path, 'rb') as audio_file:
                transcription = self.client.audio.transcriptions.create(
                    model="whisper-1",
                    file=audio_file,
                    language="zh"
                )
            
            text = transcription.text
            
            # Check for transcription content
            if not text or len(text.strip()) == 0:
                raise Exception("Unable to recognize audio content, please ensure audio is clear and contains speech")
            
            # Count non-whitespace characters (each Chinese character counts as one unit)
            char_count = sum(1 for c in text if not c.isspace())
            
            # Convert to a per-minute rate; with language="zh" this is characters per minute
            wpm = (char_count / duration_seconds) * 60
            self.words_per_minute = wpm
            
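            # Clean up only the temporary mp3 created by the m4a conversion,
            # so the uploaded file stays available for re-analysis
            if audio_path == "/tmp/converted_audio.mp3" and os.path.exists(audio_path):
                os.remove(audio_path)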
            
            return wpm
            
        except Exception as e:
            # Clean up the converted temporary file (even if an error occurs)
            if audio_path == "/tmp/converted_audio.mp3" and os.path.exists(audio_path):
                os.remove(audio_path)
            raise Exception(f"Audio analysis error: {str(e)}")


class TranscriptGenerator:
    """Generate speech transcript using OpenAI Vision models (Supports GPT-5.1/o3/GPT-4o/GPT-4o-mini)"""
    
    def __init__(self, api_key: str):
        self.client = OpenAI(api_key=api_key)
        self.transcript = ""
        self.use_vision = True
    
    def generate_transcript(
        self,
        slides: List[Dict[str, str]],
        target_duration: int,
        words_per_minute: float,
        style: str,
        topic: str,
        audience: str,
        language: str,
        model_name: str = "gpt-5.1",
        expert_role: Optional[str] = None,
        include_tips: bool = False
    ) -> str:
        """Generate speech transcript (supports multi-language, multi-model, and speech tips)"""
        try:
            # Check if slides are empty
            if not slides or len(slides) == 0:
                raise Exception("No slide content, please upload a PDF file first")
            
            # Calculate target word count
            target_words = int(target_duration * words_per_minute)
            words_per_slide = target_words // len(slides)
            
            # Create system prompt
            system_prompt = self._create_system_prompt(
                style, topic, audience, language, expert_role, words_per_slide, include_tips
            )
            
            # Display generation info
            model_names = {
                'gpt-5.1': 'GPT-5.1 (Strongest Multimodal Understanding)',
                'o3': 'o3 (Strong Reasoning Model)',
                'gpt-4o': 'GPT-4o (Balanced All-rounder)',
                'gpt-4o-mini': 'GPT-4o-mini (Fast and Economical)'
            }
            print(f"🤖 Generating transcript using {model_names.get(model_name, model_name)}...")
            print(f"📊 Target word count: {target_words} words")
            print(f"📄 Number of slides: {len(slides)} pages")
            print(f"🌐 Output language: {language}")
            if include_tips:
                print("💡 Including speech tips (gestures, tone, pauses, etc.)")
            
            # Build prompt content
            tips_instruction = ""
            if include_tips:
                tips_instruction = """

【Speech Tips Suggestions】
Please include the following speech tips in appropriate places within the transcript (marked with [square brackets]):
- [Gesture: Open arms] - When emphasizing a key point
- [Gesture: Point to slide] - When explaining a chart
- [Tone: Raise volume] - For key messages
- [Tone: Slow down] - For important concepts
- [Pause 2-3 seconds] - During section transitions
- [Eye contact] - When interacting with the audience
- [Movement: Move to center stage] - During opening or closing
"""
            
            user_content = [
                {
                    "type": "text",
                    "text": f"""
Please generate a complete speech transcript based on the following slide images.

Speech Parameters:
- Total Duration: {target_duration} minutes
- Speech Rate: Approximately {int(words_per_minute)} words per minute
- Target Total Word Count: Approximately {target_words} words
- Suggested Words per Slide: Approximately {words_per_slide} words
- Output Language: {language}{tips_instruction}

Output Format Requirements:
Slide 1
[Speech content for slide 1]

Slide 2
[Speech content for slide 2]

...and so on

Please ensure:
1. Carefully observe the visual elements, charts, and text on each slide
2. The transcript for each page is natural and smooth, explaining the key points on the slide
3. Content flows smoothly with a clear opening and closing
4. Matches the specified speech style and tone
5. Total word count is around {target_words} words (allow 10% variance)
"""
                }
            ]
            
            # Add all slide images
            for slide in slides:
                user_content.append({
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{slide['image']}",
                        "detail": "high"
                    }
                })
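            # Call OpenAI Vision API (Supports GPT-5.1/o3/GPT-4o/GPT-4o-mini)
            request_kwargs = {
                "model": model_name,
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_content}
                ],
                "max_completion_tokens": 4000
            }
            # Reasoning models such as o3 accept only the default temperature,
            # so a custom value is set only for the GPT-4o family
            if model_name.startswith("gpt-4o"):
                request_kwargs["temperature"] = 0.7
            response = self.client.chat.completions.create(**request_kwargs)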
            
            transcript = response.choices[0].message.content
            self.transcript = transcript
            
            return transcript
            
        except Exception as e:
            error_msg = str(e)
            if "API key" in error_msg or "authentication" in error_msg.lower():
                raise Exception("❌ API Key error, please check if your OpenAI API Key is correct")
            elif "rate limit" in error_msg.lower():
                raise Exception("❌ API request limit reached, please try again later")
            elif "quota" in error_msg.lower():
                raise Exception("❌ API quota exceeded, please check your OpenAI account balance")
            else:
                raise Exception(f"Transcript generation error: {error_msg}")
    
    def _create_system_prompt(
        self,
        style: str,
        topic: str,
        audience: str,
        language: str,
        expert_role: Optional[str],
        words_per_slide: int,
        include_tips: bool = False
    ) -> str:
        """Create system prompt"""
        
        style_descriptions = {
            "Lively": "Use a relaxed, lively tone with appropriate interactive and humorous elements",
            "Serious": "Use a formal, professional tone maintaining academic rigor",
            "Motivational": "Use inspiring language full of positive energy and motivation",
            "Educational": "Use clear, easy-to-understand explanations, as if teaching students",
            "Conversational": "Use a conversational tone, as if talking face-to-face with the audience"
        }
        
        language_instructions = {
            "Traditional Chinese": "Output in Traditional Chinese",
            "English": "Output in English",
            "Simplified Chinese": "Output in Simplified Chinese",
            "Japanese": "Output in Japanese",
            "Korean": "Output in Korean",
            "Spanish": "Output in Spanish",
            "French": "Output in French",
            "German": "Output in German"
        }
        
        role_intro = ""
        if expert_role:
            role_intro = f"You are a {expert_role}, "
        
        style_desc = style_descriptions.get(style, "Use a natural and smooth tone")
        lang_inst = language_instructions.get(language, "Output in Traditional Chinese")
        
        tips_requirement = ""
        if include_tips:
            tips_requirement = """
7. Include speech tips suggestions in appropriate places, marked with [square brackets], including:
   - Gesture suggestions (e.g., open arms, point to slide, clench fist for emphasis)
   - Tone suggestions (e.g., raise volume, slow down, emphasize)
   - Pause timing (e.g., [Pause 2-3 seconds])
   - Body language (e.g., eye contact, movement, lean forward)
   These suggestions should blend naturally into the transcript to help the speaker better convey the message
"""
        
        return f"""
{role_intro}You are an experienced speaker and content creation expert.

Speech Topic: {topic}
Target Audience: {audience}
Speech Style: {style_desc}
Language Requirement: {lang_inst}

Your task is to create a natural, smooth, and engaging speech transcript based on the provided slide content.

Requirements:
1. Content must be faithful to the slides but expressed in spoken language
2. Approximately {words_per_slide} words per page, adjustable based on content importance
3. Opening must be attractive, closing must be powerful
4. Add transition phrases appropriately to ensure smooth flow
5. Match the specified speech style and target audience
6. Ensure content is professional and accurate, yet easy to understand{tips_requirement}
"""

print("✅ Core functionality classes created!")
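
The prompt in `generate_transcript` asks the model to label each part of its answer with a `Slide N` line. Assuming the model follows that layout, the returned text could be split back into per-slide sections roughly as follows. `split_by_slide` is a hypothetical helper that is not part of the notebook, and real model output may deviate from the requested format.

```python
import re
from typing import Dict, List

# Matches a line that contains only "Slide <number>", optionally followed by a colon
SLIDE_HEADER = re.compile(r"^\s*Slide\s+(\d+)\s*:?\s*$", re.IGNORECASE)


def split_by_slide(transcript: str) -> Dict[int, str]:
    """Group transcript lines under the most recent 'Slide N' header."""
    sections: Dict[int, List[str]] = {}
    current = None
    for line in transcript.splitlines():
        match = SLIDE_HEADER.match(line)
        if match:
            current = int(match.group(1))
            sections.setdefault(current, [])
        elif current is not None:
            # Text before the first header is ignored
            sections[current].append(line)
    return {number: "\n".join(lines).strip() for number, lines in sections.items()}
```

Speech tips such as `[Pause 2-3 seconds]` stay inside the section they appear in, because only lines consisting solely of a slide label are treated as headers.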

In [None]:
import getpass

# Try to load from Colab Secrets; the import sits inside the try so the cell also runs outside Colab
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get('GPT_API_KEY')
    print("✅ API Key loaded from Colab Secrets")
except Exception:
    # Manual input
    print("🔑 Please enter your OpenAI API Key:")
    print("💡 Hint: You can store the API Key in Colab's 'Secrets' feature")
    OPENAI_API_KEY = getpass.getpass("API Key: ")
    
if OPENAI_API_KEY:
    print("✅ API Key set successfully!")
else:
    print("⚠️ Warning: API Key not set, AI generation features unavailable")
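
The cell above tries one key source and falls back to another. The same idea can be written as an ordered list of loaders, which makes it easy to add an environment variable as a further source. This is a sketch under assumptions: `first_available_key` and `from_environment` are hypothetical names, and the notebook itself does not read the `OPENAI_API_KEY` environment variable.

```python
import os
from typing import Callable, Iterable, Optional


def first_available_key(loaders: Iterable[Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the first non-empty key produced by the loaders, skipping any that fail."""
    for load in loaders:
        try:
            key = load()
        except Exception:
            # A failing source (e.g. no Colab Secrets) just means: try the next one
            continue
        if key and key.strip():
            return key.strip()
    return None


def from_environment(name: str = "OPENAI_API_KEY") -> Optional[str]:
    """Read the key from an environment variable (assumed fallback source)."""
    return os.environ.get(name)
```

In the notebook, the loader list could hold the Colab Secrets lookup first, then `from_environment`, then the `getpass` prompt as the last resort.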

In [None]:
class TranscriptGeneratorUI:
    """Interactive User Interface"""
    
    def __init__(self):
        self.pdf_processor = PDFProcessor()
        self.audio_analyzer = None  # Delayed initialization, requires API key
        self.transcript_generator = None
        
        # Store uploaded files
        self.pdf_path = None
        self.audio_path = None
        self.current_wpm = 200  # Current speech rate
        
        # Create UI widgets
        self._create_widgets()
    
    def _create_widgets(self):
        """Create all UI widgets"""
        
        # Title
        display(HTML("""
        <div class="success-box">
            <h2 style="color: white; margin: 0;">🎤 Speech Transcript Generator</h2>
            <p style="color: white; margin: 10px 0 0 0; opacity: 0.9;">
                Easily convert slides into professional speech transcripts
            </p>
        </div>
        """))
        
        # 1. Upload PDF
        display(HTML("<div class='info-box'><h3>📄 Step 1: Upload Slide PDF</h3></div>"))
        self.pdf_upload = widgets.FileUpload(
            accept='.pdf',
            multiple=False,
            description='Select PDF'
        )
        self.pdf_status = widgets.HTML(value="<p style='color: #666;'>Not uploaded</p>")
        display(self.pdf_upload, self.pdf_status)
        
        # 2. Set Speech Duration
        display(HTML("<div class='info-box'><h3>⏱️ Step 2: Set Speech Duration</h3></div>"))
        # BoundedIntText enforces min/max; plain IntText has no such limits
        self.duration_input = widgets.BoundedIntText(
            value=10,
            description='Duration',
            min=1,
            max=180,
            style={'description_width': 'initial'}
        )
        display(widgets.HBox([self.duration_input, widgets.Label('Minutes')]))
        
        # 3. Speech Rate Settings
        display(HTML("<div class='info-box'><h3>🎙️ Step 3: Set Speech Rate</h3></div>"))
        
        # Speech rate selection dropdown
        self.speed_preset = widgets.Dropdown(
            options=[
                ('Slow (150 wpm)', 150),
                ('Medium (200 wpm)', 200),
                ('Fast (250 wpm)', 250),
                ('Auto Analysis (Upload 20s audio)', 0)
            ],
            value=200,
            description='Speech Rate',
            style={'description_width': 'initial'}
        )
        display(self.speed_preset)
        
        # Audio upload area (used when "Auto Analysis" is selected)
        self.audio_container = widgets.VBox([
            widgets.HTML("<p style='color: #666; font-size: 13px; margin: 10px 0;'>💡 After selecting 'Auto Analysis', please upload 20 seconds of audio</p>")
        ])
        
        self.audio_upload = widgets.FileUpload(
            accept='.m4a,.mp3,.wav',
            multiple=False,
            description='Upload Audio',
            layout=widgets.Layout(display='none')  # Hidden by default
        )
        self.audio_status = widgets.HTML(value="")
        self.analyze_button = widgets.Button(
            description='🎵 Start Analysis',
            button_style='info',
            layout=widgets.Layout(display='none')  # Hidden by default
        )
        
        self.audio_container.children = self.audio_container.children + (self.audio_upload, self.audio_status, self.analyze_button)
        display(self.audio_container)
        
        # Monitor speech rate selection changes
        self.speed_preset.observe(self._on_speed_change, names='value')
        
        # 🆕 4. AI Model Selection
        display(HTML("""
        <div class='info-box'>
            <h3>🤖 Step 4: Select AI Model</h3>
            <p style='color: #666; font-size: 13px; margin: 5px 0;'>
                💡 <strong>GPT-5.1</strong> possesses the strongest multimodal understanding capabilities, enabling deep analysis of images and text
            </p>
        </div>
        """))
        self.model_dropdown = widgets.Dropdown(
            options=[
                ('GPT-5.1 - Strongest Multimodal (Deep understanding of text & images, Recommended) ⭐', 'gpt-5.1'),
                ('o3 - Strong Reasoning (Complex logic analysis)', 'o3'),
                ('GPT-4o - Balanced All-rounder (Speed & Quality)', 'gpt-4o'),
                ('GPT-4o-mini - Fast & Economical (Basic needs)', 'gpt-4o-mini')
            ],
            value='gpt-5.1',
            description='AI Model',
            style={'description_width': 'initial'}
        )
        display(self.model_dropdown)
        
        # 5. Speech Style
        display(HTML("<div class='info-box'><h3>🎭 Step 5: Select Speech Style</h3></div>"))
        self.style_dropdown = widgets.Dropdown(
            options=['Lively', 'Serious', 'Motivational', 'Educational', 'Conversational'],
            value='Lively',
            description='Speech Style',
            style={'description_width': 'initial'}
        )
        display(self.style_dropdown)
        
        # 6. Speech Information
        display(HTML("<div class='info-box'><h3>📝 Step 6: Fill in Speech Info</h3></div>"))
        
        self.topic_input = widgets.Text(
            value='',
            placeholder='e.g., Application of AI in Education',
            description='Speech Topic',
            style={'description_width': 'initial'}
        )
        
        self.audience_input = widgets.Text(
            value='',
            placeholder='e.g., University Students, Teachers, Tech Enthusiasts',
            description='Target Audience',
            style={'description_width': 'initial'}
        )
        
        # Language options
        self.language_dropdown = widgets.Dropdown(
            options=['Traditional Chinese', 'English', 'Simplified Chinese', 'Japanese', 'Korean', 'Spanish', 'French', 'German'],
            value='Traditional Chinese',
            description='Output Language',
            style={'description_width': 'initial'}
        )
        
        # 🆕 Expert Role - Add detailed description
        display(HTML("""
        <div style='background: #E3F2FD; border-left: 4px solid #2196F3; padding: 12px; border-radius: 6px; margin: 10px 0;'>
            <strong style='color: #1976D2;'>💡 What is an Expert Role?</strong>
            <p style='color: #555; font-size: 13px; margin: 8px 0 0 0; line-height: 1.6;'>
                AI will <strong>act as the expert you specify</strong> to write the transcript, making the content more professional and persuasive.<br>
                • e.g., "Senior AI Researcher" → Explain from a technical expert's perspective<br>
                • e.g., "PhD in Educational Psychology" → Explain from an educational expert's perspective<br>
                • e.g., "Startup Mentor" → Share insights from practical experience<br>
                <em>Leave blank to use a generic speaker persona</em>
            </p>
        </div>
        """))
        
        self.expert_role = widgets.Text(
            value='',
            placeholder='e.g., Senior AI Researcher, PhD in Educational Psychology (Optional)',
            description='Expert Role',
            style={'description_width': 'initial'},
            layout=widgets.Layout(width='500px')
        )
        
        # Speech tips suggestion option
        self.include_tips = widgets.Checkbox(
            value=True,
            description='Include speech tips suggestions (gestures, tone, pauses, etc.)',
            style={'description_width': 'initial'}
        )
        
        display(self.topic_input, self.audience_input, self.language_dropdown, self.expert_role, self.include_tips)
        
        # 7. Generate Button
        display(HTML("<div style='margin-top: 30px;'></div>"))
        self.generate_button = widgets.Button(
            description='🚀 Generate Transcript',
            button_style='success',
            layout=widgets.Layout(width='200px', height='50px')
        )
        display(self.generate_button)
        
        # 8. Output Area
        self.output_area = widgets.Output()
        display(self.output_area)
        
        # Bind events
        self.pdf_upload.observe(self._on_pdf_upload, names='value')
        self.audio_upload.observe(self._on_audio_upload, names='value')
        self.analyze_button.on_click(self._analyze_audio)
        self.generate_button.on_click(self._generate_transcript)
    
    def _on_speed_change(self, change):
        """Handle speech rate selection changes"""
        if change['new'] == 0:  # "Auto Analysis" selected
            self.audio_upload.layout.display = 'block'
            self.analyze_button.layout.display = 'block'
            self.audio_status.value = "<p style='color: #2196F3;'>📤 Please upload a 20-second audio file</p>"
        else:
            self.audio_upload.layout.display = 'none'
            self.analyze_button.layout.display = 'none'
            self.audio_status.value = ""
            self.current_wpm = change['new']
    
    def _on_pdf_upload(self, change):
        """Handle PDF upload"""
        if change['new']:
            try:
                uploaded_file = list(change['new'].values())[0]
                filename = uploaded_file['metadata']['name']
                file_size = len(uploaded_file['content'])
                
                # Check file size (suggested < 50MB)
                if file_size > 50 * 1024 * 1024:
                    self.pdf_status.value = "<div style='color: #f44336;'>❌ File too large (over 50MB), please compress and upload again</div>"
                    return
                
                # Save PDF
                self.pdf_path = "/tmp/presentation.pdf"
                with open(self.pdf_path, 'wb') as f:
                    f.write(uploaded_file['content'])
                
                # Parse PDF
                slides = self.pdf_processor.extract_slides(self.pdf_path)
                
                self.pdf_status.value = f"""
                <div style='color: #4CAF50; font-weight: 500;'>
                    ✅ Uploaded: {filename}<br>
                    📊 Total {len(slides)} slides
                </div>
                """
                
            except Exception as e:
                self.pdf_status.value = f"<div style='color: #f44336;'>❌ Error: {str(e)}</div>"
    
    def _on_audio_upload(self, change):
        """Handle audio upload"""
        if change['new']:
            try:
                uploaded_file = list(change['new'].values())[0]
                
                # Save audio
                filename = uploaded_file['metadata']['name']
                ext = os.path.splitext(filename)[1]
                self.audio_path = f"/tmp/audio{ext}"
                
                with open(self.audio_path, 'wb') as f:
                    f.write(uploaded_file['content'])
                
                self.audio_status.value = f"""
                <div style='color: #4CAF50; font-weight: 500;'>
                    ✅ Uploaded: {filename}<br>
                    👉 Please click the "Start Analysis" button
                </div>
                """
                
            except Exception as e:
                self.audio_status.value = f"<div style='color: #f44336;'>❌ Error: {str(e)}</div>"
    
    def _analyze_audio(self, button):
        """Analyze audio speech rate using the OpenAI Whisper transcription API"""
        if not self.audio_path:
            self.audio_status.value = "<div style='color: #f44336;'>❌ Please upload an audio file first</div>"
            return
        
        try:
            self.audio_status.value = "<div style='color: #2196F3;'>⏳ Analyzing with the Whisper API...</div>"
            
            # Initialize AudioAnalyzer (requires API key)
            if self.audio_analyzer is None:
                self.audio_analyzer = AudioAnalyzer(OPENAI_API_KEY)
            
            wpm = self.audio_analyzer.analyze_audio(self.audio_path)
            
            # Update current speech rate
            self.current_wpm = int(wpm)
            
            self.audio_status.value = f"""
            <div style='color: #4CAF50; font-weight: 500;'>
                ✅ Analysis Complete!<br>
                🎤 Your speech rate: {self.current_wpm} words/min
            </div>
            """
            
        except Exception as e:
            self.audio_status.value = f"<div style='color: #f44336;'>❌ {str(e)}</div>"
    
    def _generate_transcript(self, button):
        """Generate transcript"""
        with self.output_area:
            clear_output()
            
            # Validate input
            if not self.pdf_path:
                display(HTML("<div style='color: #f44336;'>❌ Please upload PDF slides first</div>"))
                return
            
            if not self.topic_input.value:
                display(HTML("<div style='color: #f44336;'>❌ Please fill in the speech topic</div>"))
                return
            
            if not self.audience_input.value:
                display(HTML("<div style='color: #f44336;'>❌ Please fill in the target audience</div>"))
                return
            
            try:
                display(HTML("""
                <div class='progress-indicator'>
                    <h3>🔄 Generating transcript...</h3>
                    <p>Please wait, this may take some time.</p>
                </div>
                """))
                
                # Initialize generator
                self.transcript_generator = TranscriptGenerator(OPENAI_API_KEY)
                
                # Determine speech rate to use
                if self.speed_preset.value == 0:  # Auto Analysis
                    # 200 wpm doubles as the "not analyzed yet" sentinel, so an
                    # analysis result of exactly 200 cannot be told apart from it
                    if self.current_wpm == 200:
                        clear_output()
                        display(HTML("<div style='color: #f44336;'>❌ Please upload audio and complete the analysis first, or select a preset speech rate</div>"))
                        return
                    wpm = self.current_wpm
                else:
                    wpm = self.speed_preset.value
                
                # 🆕 Generate transcript (pass model selection)
                transcript = self.transcript_generator.generate_transcript(
                    slides=self.pdf_processor.slides_content,
                    target_duration=self.duration_input.value,
                    words_per_minute=wpm,
                    style=self.style_dropdown.value,
                    topic=self.topic_input.value,
                    audience=self.audience_input.value,
                    language=self.language_dropdown.value,
                    model_name=self.model_dropdown.value,
                    expert_role=self.expert_role.value or None,
                    include_tips=self.include_tips.value
                )
                
                clear_output()
                
                # Display results
                display(HTML("""
                <div class='success-box'>
                    <h2 style="color: white; margin: 0;">✅ Transcript Generation Complete!</h2>
                </div>
                """))
                
                # Format output
                formatted_transcript = self._format_transcript(transcript)
                display(HTML(f"<div class='transcript-output'>{formatted_transcript}</div>"))
                
                # Download buttons
                self._create_download_buttons(transcript)
                
            except Exception as e:
                clear_output()
                display(HTML(f"""
                <div style='color: #f44336; padding: 20px; border: 2px solid #f44336; border-radius: 8px;'>
                    <h3>❌ Generation Failed</h3>
                    <p>{str(e)}</p>
                </div>
                """))
    
    def _format_transcript(self, transcript: str) -> str:
        """Format the transcript as HTML, highlighting slide titles and [bracketed] speech tips"""
        import re
        from html import escape
        
        paragraph_style = "line-height: 1.8; margin: 10px 0;"
        tip_span = r'<span style="background: #FFF3E0; color: #F57C00; padding: 2px 6px; border-radius: 3px; font-weight: 500;">[\1]</span>'
        parts = []
        
        for line in transcript.split('\n'):
            line = line.strip()
            if not line:
                continue
            
            # Escape model output so a stray '<' or '&' cannot break the HTML
            safe = escape(line, quote=False)
            
            # Slide title
            if line.startswith('Slide'):
                parts.append(f"<h3 style='color: #2196F3; margin-top: 20px; border-left: 4px solid #2196F3; padding-left: 10px;'>📄 {safe}</h3>")
            # Speech tips (marked with brackets) are highlighted;
            # lines without brackets pass through the substitution unchanged
            else:
                highlighted = re.sub(r'\[([^\]]+)\]', tip_span, safe)
                parts.append(f"<p style='{paragraph_style}'>{highlighted}</p>")
        
        return "".join(parts)
    
    def _create_download_buttons(self, transcript: str):
        """Create download buttons"""
        
        # TXT Download
        txt_content = transcript.encode('utf-8')
        txt_b64 = base64.b64encode(txt_content).decode()
        
        filename = f"transcript_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        
        display(HTML(f"""
        <div style='margin-top: 20px; text-align: center;'>
            <a href="data:text/plain;base64,{txt_b64}" 
               download="{filename}.txt"
               style="display: inline-block; padding: 12px 30px; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
                      color: white; text-decoration: none; border-radius: 8px; font-weight: 500; margin: 10px;">
                📥 Download TXT Format
            </a>
        </div>
        """))

print("✅ Interactive UI is ready!")
print("👇 Please scroll down to start using")
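
The speech-rate selection in `_generate_transcript` compares `current_wpm` with 200 to detect that no analysis has run yet. A minimal sketch of one alternative is shown here, assuming the analyzed rate is stored as `None` until an analysis completes. The names `resolve_wpm` and `SpeechRateError` are hypothetical and are not part of the notebook.

```python
from typing import Optional


class SpeechRateError(ValueError):
    """Raised when Auto Analysis is selected but no audio has been analyzed."""


def resolve_wpm(preset: int, analyzed_wpm: Optional[int]) -> int:
    """Return the words-per-minute value to pass to the generator.

    preset == 0 means "Auto Analysis", as in the UI; any other value is used
    as a fixed speech rate. analyzed_wpm stays None until an analysis has
    completed (an assumed convention, not the notebook's current behavior).
    """
    if preset != 0:
        return preset
    if analyzed_wpm is None:
        raise SpeechRateError(
            "Upload audio and complete the analysis first, "
            "or select a preset speech rate"
        )
    return analyzed_wpm
```

With this convention, a measured rate of exactly 200 words/min is accepted like any other value, because "not analyzed" is represented by `None` rather than by a number.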

In [None]:
# Display startup message
print("\n" + "="*80)
print("🎉 Application Starting...")
print("="*80)
print("\n📋 Usage Steps:")
print("1️⃣  Upload your slides PDF")
print("2️⃣  Set speech duration")
print("3️⃣  Select speech rate (Slow/Medium/Fast/Auto Analysis)")
print("4️⃣  Select AI Model (GPT-5.1 Recommended ⭐)")
print("5️⃣  Select speech style")
print("6️⃣  Fill in speech information")
print("7️⃣  Click 'Generate Transcript' button")
print("8️⃣  Download generated transcript")
print("\n✨ Key Features:")
print("   • GPT-5.1 Model - Strongest multimodal understanding, deep analysis of slide content (text & images)")
print("   • GPT-4o Audio - Accurately calculates your speech rate")
print("   • 8 Languages Supported - Traditional Chinese/English/Simplified Chinese/Japanese/Korean/Spanish/French/German")
print("   • Speech Tips Suggestions - AI provides professional advice on gestures, tone, pauses, etc.")
print("   • Expert Role Play - AI writes as a specified expert, making content more professional and persuasive")
print("\n💡 Expert Role Description:")
print("   After filling in the 'Expert Role' field, AI will write the transcript acting as that persona")
print("   e.g., 'Senior AI Researcher' explains from a technical expert's perspective")
print("        'PhD in Educational Psychology' explains from an educational expert's perspective")
print("="*80 + "\n")

# Start the application
app = TranscriptGeneratorUI()
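
The TXT download offered by the UI is a link whose `href` is a base64 data URI. The sketch below builds such a link in isolation so that the encoding can be checked without running the interface. The function name `build_txt_download_link` and the explicit `charset=utf-8` parameter are illustrative assumptions, not the notebook's code.

```python
import base64
from datetime import datetime
from html import escape


def build_txt_download_link(text: str, stem: str = "transcript") -> str:
    """Return an <a> tag that downloads `text` as a timestamped .txt file."""
    # Encode as UTF-8 first so multilingual transcripts survive the round trip
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    filename = f"{stem}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"
    return (
        f'<a href="data:text/plain;charset=utf-8;base64,{payload}" '
        f'download="{escape(filename)}">Download TXT</a>'
    )


link = build_txt_download_link("你好 hello")
encoded = link.split("base64,")[1].split('"')[0]
decoded = base64.b64decode(encoded).decode("utf-8")
```

Decoding the payload back to the original string is a quick way to confirm that non-ASCII text, such as Chinese or Japanese transcripts, is preserved.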