# Audio Understanding

## 🎯 AI-Powered Audio Understanding with Transcription & Spectrogram Analysis

### What is Audio Understanding in Video?

Audio understanding means teaching AI to "hear" and comprehend spoken content just like humans do. When you listen to a video, you naturally understand:
- **What's being said** - the actual words and sentences
- **How it sounds** - the tone, tempo, energy, and emotional characteristics
- **Context** - how the conversation flows
- **Key topics** - the main subjects being discussed
- **Timing** - when important things are said

### Live Stream Audio Understanding

Now imagine applying audio understanding to **live streams** - audio content that's happening in real-time. This could be:

- **Live broadcasts** - news, interviews, podcasts happening now
- **Video calls** - meetings, presentations, conversations
- **Streaming content** - live shows, tutorials, commentary
- **Events** - conferences, speeches, announcements

Live stream audio understanding means AI can transcribe and analyze both spoken content and acoustic characteristics **as it happens**, providing real-time insights about what's being said and how it sounds. This opens up powerful possibilities like automatic captions, emotion detection, content moderation, key moment detection, and live content summarization.

### Our Solution: Real-Time Audio Processing Pipeline

The pipeline processes live audio in five stages:

- **🎤 Audio Extraction**: Extract audio streams from video content
- **📝 Real-Time Transcription**: Use Amazon Transcribe for real-time speech-to-text
- **🔤 Smart Sentence Building**: Combine words into complete sentences with timing
- **📊 Spectrogram Analysis**: Extract acoustic features like tempo, energy, and frequency characteristics
- **🧠 AI Content Analysis**: Use Claude to organize content into chapters and topics with audio insights

**Let's see how this works in practice!** 🎧

## 1. Import Required Libraries

**Load required modules** for audio processing, transcription, and AI analysis.

In [None]:
import asyncio
import io
import wave
from pathlib import Path
from IPython.display import Audio, display, HTML, JSON
from datetime import datetime
import time
import json
import boto3

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent
sample_rate = 16000  # Hz; used for both the FFmpeg output and Amazon Transcribe

# Load shared configuration from prerequisites notebook
%store -r AUDIO_MODEL_ID
%store -r AWS_REGION

stream_duration = 120  # Process 120 seconds

# NOTE: this assignment overrides the AUDIO_MODEL_ID value restored by %store above
AUDIO_MODEL_ID = "global.anthropic.claude-sonnet-4-20250514-v1:0"
print("✅ Libraries imported successfully!")


## 2. Preview Source Audio

**Listen to the audio** we'll analyze for transcription and spectrogram features.

### Source Video
**Meridian, 2016**, Mystery from [Netflix](https://opencontent.netflix.com/#h.fzfk5hndrb9w) - This video is available under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode)

We'll extract and analyze a 2-minute audio track from the video for both spoken content and acoustic characteristics.

<div class="alert alert-block alert-success">
Using the same video source as visual understanding allows you to see how audio and visual analysis complement each other.
</div>


In [None]:
import subprocess
import os
from IPython.display import HTML, display
import IPython
import base64

def create_audio_from_video(video_path, start_time=0, duration=120):
    """Extract audio from video and create MP3 file with playback"""
    
    # Create output MP3 file
    output_path = "extracted_audio.mp3"
    
    try:
        # Extract audio using FFmpeg to MP3
        ffmpeg_command = [
            'ffmpeg', '-y',
            '-ss', str(start_time),
            '-i', video_path,
            '-t', str(duration),
            '-vn',
            '-acodec', 'mp3',
            '-ab', '128k',
            output_path
        ]
        
        print(f"🎵 Extracting audio: {start_time}s to {start_time + duration}s")
        result = subprocess.run(ffmpeg_command, capture_output=True, text=True)
        
        if result.returncode != 0:
            print(f"❌ Error extracting audio: {result.stderr}")
            return None
            
        print(f"✅ Audio extracted successfully: {output_path}")
        return output_path
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return None

# Video configuration
video_path = "../sample_videos/Netflix_Open_Content_Meridian.mp4"
start_time = 0  # Start from beginning
duration = 120  # Extract 2 minutes

# Extract and display audio (skip playback if extraction failed)
audio_file = create_audio_from_video(video_path, start_time, duration)
if audio_file:
    display(IPython.display.Audio(audio_file))

## 3. Extract Audio Stream From Source Video

**Set up audio extraction** from the video file using FFmpeg to simulate real-time streaming.

### Audio Processing Setup

- **Real-time streaming** using FFmpeg with `-re` flag
- **16kHz sample rate** optimized for speech recognition
- **Mono audio** to reduce processing overhead
- **PCM format** for direct streaming to Amazon Transcribe

This creates a continuous audio stream that mimics live broadcast conditions.
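The settings above fix the data rate of the stream, which is useful when sizing read buffers or estimating progress. A minimal sketch of that arithmetic follows; the helper names `pcm_bytes_per_second` and `chunk_duration_seconds` are illustrative and not part of the notebook's components.

```python
def pcm_bytes_per_second(sample_rate: int = 16000,
                         channels: int = 1,
                         bytes_per_sample: int = 2) -> int:
    """Bytes of raw PCM produced per second of audio."""
    return sample_rate * channels * bytes_per_sample


def chunk_duration_seconds(chunk_bytes: int,
                           sample_rate: int = 16000,
                           channels: int = 1,
                           bytes_per_sample: int = 2) -> float:
    """Seconds of audio represented by a chunk of raw PCM bytes."""
    return chunk_bytes / pcm_bytes_per_second(sample_rate, channels, bytes_per_sample)


# 16 kHz, mono, 16-bit PCM (the settings listed above)
print(pcm_bytes_per_second())        # bytes per second of audio
print(chunk_duration_seconds(2048))  # seconds covered by one full 2 KB chunk
```

At these settings one second of audio is 32,000 bytes, so a full 2 KB read covers 64 ms of audio.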

**Initialize FFmpeg audio streaming** from the video file with real-time processing.

In [None]:
async def cleanup_ffmpeg_processes():
    """Kill any existing FFmpeg processes to prevent conflicts"""
    import subprocess
    import signal
    import os
    
    try:
        # Find all FFmpeg processes
        result = subprocess.run(
            ['pgrep', '-f', 'ffmpeg'],
            capture_output=True,
            text=True
        )
        
        if result.returncode == 0 and result.stdout.strip():
            pids = result.stdout.strip().split('\n')
            print(f"🧹 Found {len(pids)} existing FFmpeg process(es), cleaning up...")
            
            for pid in pids:
                try:
                    pid_int = int(pid)
                    os.kill(pid_int, signal.SIGTERM)
                    print(f"   ✅ Terminated FFmpeg process (PID: {pid_int})")
                except (ValueError, ProcessLookupError) as e:
                    print(f"   ⚠️ Could not terminate PID {pid}: {e}")
            
            # Wait a moment for processes to terminate
            await asyncio.sleep(1)
            print("🧹 FFmpeg cleanup complete")
        else:
            print("✅ No existing FFmpeg processes found")
            
    except Exception as e:
        print(f"⚠️ Error during FFmpeg cleanup: {e}")

async def stream_audio_from_video(video_file, start_time, duration, sample_rate=16000):
    """Stream audio directly from video file using FFmpeg - with cleanup"""
    
    # Clean up any existing FFmpeg processes first
    await cleanup_ffmpeg_processes()
    
    # FFmpeg command for real-time audio streaming
    # ('-tune zerolatency' was dropped: it is an x264 video encoder option and does not apply to audio)
    ffmpeg_command = [
        'ffmpeg',
        '-hide_banner',           # Suppress the startup banner
        '-loglevel', 'error',     # Keep stderr quiet so the unread stderr pipe cannot fill up
        '-re',                    # Read input at native rate to simulate a live stream
        '-ss', str(start_time),   # Start time
        '-i', video_file,         # Input video
        '-t', str(duration),      # Duration
        '-vn',                    # Drop the video stream; only audio is needed
        '-f', 'wav',              # WAV container (header followed by PCM samples)
        '-ac', '1',               # Mono
        '-ar', str(sample_rate),  # Sample rate
        '-c:a', 'pcm_s16le',      # 16-bit little-endian PCM
        '-'                       # Output to stdout
    ]
    
    print(f"🎵 Starting FFmpeg audio stream: {start_time}s to {start_time + duration}s")
    print("📊 Using real-time streaming with -re flag")
    
    # Start FFmpeg process
    process = await asyncio.create_subprocess_exec(
        *ffmpeg_command, 
        stdout=asyncio.subprocess.PIPE, 
        stderr=asyncio.subprocess.PIPE
    )
    
    print("✅ FFmpeg process started, ready for streaming")
    return process

# Start FFmpeg streaming process
ffmpeg_process = await stream_audio_from_video(video_path, start_time, stream_duration, sample_rate)

## 4. Create Real-Time Transcription Processor

**Set up the transcription processor** to send the real-time audio stream through Amazon Transcribe for speech-to-text conversion as the audio arrives.

### Smart Sentence Building

The transcription handler:
- **Buffers incoming words** from the streaming API
- **Detects sentence boundaries** using punctuation markers
- **Combines comma-separated phrases** into complete sentences
- **Tracks precise timing** for each sentence start and end
- **Handles real-time processing** of partial and final results

This creates complete, timestamped sentences from continuous speech recognition.
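To make the buffering idea concrete, here is a simplified, self-contained sketch of sentence building from word and punctuation items. It is not the `SentenceBuilder` implementation from `components`; the `WordItem` class and `build_sentences` function are hypothetical stand-ins, and the sketch assumes that only `.`, `?`, and `!` end a sentence while commas stay inside it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

SENTENCE_ENDINGS = {".", "?", "!"}  # assumption: commas never end a sentence


@dataclass
class WordItem:
    """Hypothetical stand-in for a transcript item."""
    content: str
    item_type: str          # "pronunciation" or "punctuation"
    start_time: float = 0.0
    end_time: float = 0.0


def build_sentences(items: Iterable[WordItem]) -> List[Dict]:
    """Group word items into sentences with start and end times."""
    sentences, words = [], []
    start = end = None
    for item in items:
        if item.item_type == "pronunciation":
            words.append(item.content)
            if start is None:
                start = item.start_time   # first word sets the sentence start
            end = item.end_time           # last word so far sets the sentence end
        elif item.item_type == "punctuation" and words:
            words[-1] += item.content     # attach punctuation to the previous word
            if item.content in SENTENCE_ENDINGS:
                sentences.append({"text": " ".join(words),
                                  "start_time": start, "end_time": end})
                words, start, end = [], None, None
    if words:  # flush a trailing fragment that never received end punctuation
        sentences.append({"text": " ".join(words),
                          "start_time": start, "end_time": end})
    return sentences


demo = [
    WordItem("That's", "pronunciation", 4.63, 4.90),
    WordItem("right", "pronunciation", 4.90, 5.14),
    WordItem(".", "punctuation"),
]
print(build_sentences(demo))
```

The handler defined below does more than this sketch: it also deduplicates items that appear in both partial and final results.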

In [None]:
# Import the sentence-building and transcript-processing helpers
# 📁 Implementation: sentence_builder.py - handles sentence building and comma combination logic
# 📁 Implementation: transcript_processor.py - handles complex transcript processing operations
from components import SentenceBuilder, TranscriptItemProcessor, SentenceFormatter, TranscriptEventValidator

print("✅ SentenceBuilder and transcript processing components imported successfully!")

In [None]:
class TranscriptionHandler(TranscriptResultStreamHandler):
    """Transcribe stream handler that builds complete, timestamped sentences"""
    def __init__(self, transcript_result_stream):
        super().__init__(transcript_result_stream)
        self.sentences = []
        self.sentence_builder = SentenceBuilder()
        # Restore essential buffering for proper sentence formation
        self.partial_buffer = {}
        self.last_processed_stable_key = None
        self.sentence_processed_keys = set()
    
    def _create_sentence_from_buffer(self):
        """Build complete sentences from buffered word items with comma handling"""
        if not self.partial_buffer:
            return
        
        sorted_items = sorted(self.partial_buffer.values(), key=lambda x: x.start_time)
        sentence_words = []
        sentence_start = None
        sentence_end = None
        punctuation = ""
        items_to_remove = []
        
        for item in sorted_items:
            item_key = f"{item.start_time}_{item.end_time}_{item.content}"
            
            if item.item_type == "pronunciation":
                sentence_words.append(item.content)
                if sentence_start is None:
                    sentence_start = item.start_time
                sentence_end = item.end_time
                items_to_remove.append(item_key)
            elif item.item_type == "punctuation":
                punctuation = item.content.strip()
                items_to_remove.append(item_key)
                break
        
        if sentence_words:
            # Use SentenceBuilder for comma handling
            sentence_data = self.sentence_builder.add_sentence_fragment(
                sentence_words, sentence_start, sentence_end, punctuation
            )
            
            if sentence_data:
                self._finalize_sentence(sentence_data)
            
            # Clean up processed items
            for key in items_to_remove:
                if key in self.partial_buffer:
                    del self.partial_buffer[key]
                    self.sentence_processed_keys.add(key)
    
    def _finalize_sentence(self, sentence_data):
        """Output a completed sentence"""
        if sentence_data:
            timestamp = datetime.now().strftime("%H:%M:%S")
            print(f"📝 SENTENCE: {sentence_data['text']}")
            print(f"⏱️ Time: {sentence_data['start_time']:.3f}s-{sentence_data['end_time']:.3f}s")
            
            self.sentences.append({
                'text': sentence_data['text'],
                'start_time': sentence_data['start_time'],
                'end_time': sentence_data['end_time'],
                'timestamp': timestamp
            })
    
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        """Restored transcript processing with proper partial and final result handling"""
        try:
            if not transcript_event or not hasattr(transcript_event, 'transcript'):
                return
            
            if not transcript_event.transcript or not hasattr(transcript_event.transcript, 'results'):
                return
                
            results = transcript_event.transcript.results
            if not results:
                return
            
            for result in results:
                if not result or not hasattr(result, 'alternatives') or not result.alternatives:
                    continue
                    
                alt = result.alternatives[0]
                if not alt:
                    continue
                
                if result.is_partial:
                    # Process partial results for real-time sentence building
                    if hasattr(alt, 'items') and alt.items:
                        found_last_processed = self.last_processed_stable_key is None
                        
                        for item in alt.items:
                            if not item or not hasattr(item, 'item_type'):
                                continue
                                
                            if not all(hasattr(item, attr) for attr in ['start_time', 'end_time', 'content']):
                                continue
                                
                            item_key = f"{item.start_time}_{item.end_time}_{item.content}"
                            
                            if not found_last_processed:
                                if item_key == self.last_processed_stable_key:
                                    found_last_processed = True
                                continue
                            
                            if hasattr(item, 'stable') and item.stable:
                                if item_key not in self.partial_buffer:
                                    self.partial_buffer[item_key] = item
                                    self.last_processed_stable_key = item_key
                                    
                                    if item.item_type == "punctuation":
                                        self._create_sentence_from_buffer()
                else:
                    # Process final results
                    if hasattr(alt, 'items') and alt.items:
                        for item in alt.items:
                            if not item or not hasattr(item, 'item_type'):
                                continue
                                
                            if not all(hasattr(item, attr) for attr in ['start_time', 'end_time', 'content']):
                                continue
                                
                            item_key = f"{item.start_time}_{item.end_time}_{item.content}"
                            
                            if item_key in self.sentence_processed_keys:
                                continue
                            
                            if item_key not in self.partial_buffer:
                                self.partial_buffer[item_key] = item
                                
                                if item.item_type == "punctuation":
                                    self._create_sentence_from_buffer()
                    
                    # Clean up processed items
                    remaining_items = {k: v for k, v in self.partial_buffer.items() if k not in self.sentence_processed_keys}
                    self.partial_buffer = remaining_items
                    self.sentence_processed_keys.clear()
                    
                    if not remaining_items:
                        self.last_processed_stable_key = None
                        
        except Exception as e:
            print(f"❌ Transcript event error: {e}")
    
    def finalize_sentences(self):
        """Force completion of any pending sentences at end of stream"""
        sentence_data = self.sentence_builder.finalize_pending()
        if sentence_data:
            print(f"📝 FINAL SENTENCE: {sentence_data['text']}")
            print(f"⏱️ Time: {sentence_data['start_time']:.3f}s-{sentence_data['end_time']:.3f}s")
            timestamp = datetime.now().strftime("%H:%M:%S")
            self.sentences.append({
                'text': sentence_data['text'],
                'start_time': sentence_data['start_time'],
                'end_time': sentence_data['end_time'],
                'timestamp': timestamp
            })
print("✅ Transcription Handler initialized successfully")

## 4.5. Execute Real-Time Transcription Processor

**Start the real-time transcription process** - Stream audio to Amazon Transcribe and build complete sentences.

### What Happens Now

1. **Initialize Transcribe client** with English language settings
2. **Start streaming audio** from FFmpeg to Amazon Transcribe
3. **Process transcription results** in real-time
4. **Build complete sentences** with precise timestamps
5. **Display results** as they are detected

You'll see sentences appear like:
- 📝 SENTENCE: That's right.
- ⏱️ Time: 4.630s-5.140s
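The streaming cell below reads with a short per-read timeout and gives up only after a longer idle window. The following self-contained sketch isolates that pattern against an in-memory `asyncio.StreamReader`; the function name `read_until_idle` and its parameters are illustrative, and the sketch only collects chunks rather than forwarding them to Amazon Transcribe.

```python
import asyncio
import time


async def read_until_idle(reader, chunk_size=2048, poll_timeout=1.0, idle_limit=10.0):
    """Collect chunks until EOF, or until no data arrives for idle_limit seconds."""
    chunks = []
    last_data = time.monotonic()
    while True:
        try:
            data = await asyncio.wait_for(reader.read(chunk_size), timeout=poll_timeout)
        except asyncio.TimeoutError:
            # No data within poll_timeout; stop only once the idle window is exceeded
            if time.monotonic() - last_data >= idle_limit:
                break
            continue
        if not data:  # empty bytes means the stream reached EOF
            break
        last_data = time.monotonic()
        chunks.append(data)
    return chunks


async def _demo():
    # Simulate FFmpeg stdout with an in-memory stream of 5000 bytes
    reader = asyncio.StreamReader()
    reader.feed_data(b"\x00" * 5000)
    reader.feed_eof()
    return await read_until_idle(reader)


chunks = asyncio.run(_demo())
print(f"{len(chunks)} chunks, {sum(len(c) for c in chunks)} bytes")
```

Inside a notebook, where an event loop is already running, the last lines would be `chunks = await _demo()` instead of `asyncio.run(_demo())`.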

In [None]:
# Set up the Transcribe client and handler
print("🔧 Setting up Transcribe client...")

transcribe_client = TranscribeStreamingClient(region=AWS_REGION)

transcribe_stream = await transcribe_client.start_stream_transcription(
    language_code='en-US',
    media_sample_rate_hz=sample_rate,
    media_encoding='pcm'
)

# Set up the handler; keep a reference so the task is not garbage-collected
handler = TranscriptionHandler(transcribe_stream.output_stream)
handler_task = asyncio.create_task(handler.handle_events())

print("✅ Transcribe stream ready")
print("📡 Starting real-time FFmpeg → Transcribe streaming...")

# Real-time streaming loop
try:
    chunks_sent = 0
    total_bytes = 0
    last_data_time = time.time()
    
    while True:
        try:
            # Read up to 2 KB of audio data from FFmpeg stdout
            data = await asyncio.wait_for(ffmpeg_process.stdout.read(1024 * 2), timeout=1.0)
            
            if data:
                last_data_time = time.time()
                
                # Send directly to Transcribe
                await transcribe_stream.input_stream.send_audio_event(audio_chunk=data)
                
                chunks_sent += 1
                total_bytes += len(data)
                
                # Progress every 50 chunks (about 3.2 seconds of audio when chunks are full)
                if chunks_sent % 50 == 0:
                    elapsed = total_bytes / (sample_rate * 2)
                    #print(f"📊 Streamed {elapsed:.1f}s ({chunks_sent} chunks)")
            else:
                print("📡 FFmpeg stream ended")
                break
                
        except asyncio.TimeoutError:
            time_since_data = time.time() - last_data_time
            if time_since_data >= 10:  # 10 second timeout
                print("⚠️ No audio data for 10 seconds, stopping...")
                break
    
    # End streams
    await transcribe_stream.input_stream.end_stream()
    await ffmpeg_process.wait()
    
    # Give Transcribe a moment to deliver its final results before finalizing
    await asyncio.sleep(3)
    
    # Finalize any pending sentences
    handler.finalize_sentences()
    
    # Final stats
    total_seconds = total_bytes / (sample_rate * 2)
    print(f"\n✅ Streaming complete: {total_seconds:.1f}s ({chunks_sent} chunks)")
    print(f"📝 Sentences detected: {len(handler.sentences)}")
    
    # Final cleanup to ensure no processes remain
    await cleanup_ffmpeg_processes()
    
except Exception as e:
    print(f"❌ Streaming error: {e}")
    if ffmpeg_process.returncode is None:
        ffmpeg_process.terminate()
        await ffmpeg_process.wait()
    
    # Cleanup on error as well
    await cleanup_ffmpeg_processes()

## 5. Display Consolidated Transcript

**Review the complete transcription** results from the real-time processing.


In [None]:
# Create buffer and display transcript
sentence_json = [{'sentence': s['text'], 'start_time': s['start_time'], 'end_time': s['end_time']} 
                       for s in handler.sentences] if handler.sentences else []

print(f"\n📚 TRANSCRIPT ({len(sentence_json)} sentences)")
print("=" * 50)
if sentence_json:
    print(" ".join(s['sentence'] for s in sentence_json))
    print(f"📊 {sentence_json[0]['start_time']:.1f}s - {sentence_json[-1]['end_time']:.1f}s")
else:
    print("No sentences detected")
print("=" * 50)

## 6. Perform Audio Feature Extraction

**Extract audio characteristics** to enhance content analysis with acoustic insights.

### Audio Features Extracted

- **Spectral Centroid** - Audio "brightness" measurement
- **RMS Energy** - Overall loudness and dynamic range
- **Zero Crossing Rate** - Speech vs music distinction
- **Tempo Detection** - Rhythmic patterns in speech
- **MFCC Features** - Speech analysis coefficients
- **Spectral Rolloff** - Frequency distribution characteristics

These features provide additional context that complements the transcribed text for better AI analysis.
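Two of the listed features are simple enough to compute by hand, which helps when interpreting their values. The sketch below uses only the standard library and whole-signal definitions; it is an illustration, not the `AudioSpectrogramAnalyzer` implementation, whose framing and normalization may differ. The function names are illustrative.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude of the whole signal."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


# One second of a 440 Hz sine tone sampled at 16 kHz
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

print(f"RMS energy: {rms_energy(tone):.4f}")
print(f"Zero crossing rate: {zero_crossing_rate(tone):.4f}")
```

For a pure tone the zero crossing rate is roughly twice the frequency divided by the sample rate. Noisy or fricative-heavy audio crosses zero far more often.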

In [None]:
# Import AudioSpectrogramAnalyzer for comprehensive audio analysis
# 📁 Implementation: audio_spectrogram_analyzer.py - handles mel-spectrogram generation, waveform visualization, and comprehensive audio feature extraction
from components import AudioSpectrogramAnalyzer

print("✅ AudioSpectrogramAnalyzer imported successfully!")

In [None]:
# Generate comprehensive spectrogram analysis and extract audio features for AI model enhancement
print("🎵 Starting spectrogram analysis...")

# Initialize the analyzer
spectrogram_analyzer = AudioSpectrogramAnalyzer(sample_rate=sample_rate)

# Extract audio from video for analysis
audio_data, sr = spectrogram_analyzer.extract_audio_from_video(
    video_path, 
    start_time=start_time, 
    duration=stream_duration
)

if audio_data is not None:
    # Generate mel-spectrogram for frequency analysis
    mel_spec = spectrogram_analyzer.generate_spectrogram(audio_data, sr)
    
    # Extract comprehensive audio features that will be provided to Amazon Bedrock
    audio_features = spectrogram_analyzer.analyze_audio_features(audio_data, sr)
    
    # Create detailed visualization (waveform, spectrogram, RMS energy)
    spectrogram_image = spectrogram_analyzer.create_spectrogram_visualization(start_time_offset=start_time)
    
    # Generate human-readable audio characteristics description
    audio_description = spectrogram_analyzer.get_audio_description()
    
    print("\n📊 AUDIO ANALYSIS SUMMARY (Features for AI Model)")
    print("=" * 50)
    print(f"🎵 Duration: {audio_features.get('duration', 0):.1f}s")
    print(f"🎼 Tempo: {audio_features.get('tempo', 0):.1f} BPM (rhythmic patterns)")
    print(f"📈 Spectral Centroid: {audio_features.get('spectral_centroid_mean', 0):.1f} Hz (audio brightness)")
    print(f"🔊 RMS Energy: {audio_features.get('rms_mean', 0):.4f} (loudness/dynamics)")
    print(f"🎙️ Zero Crossing Rate: {audio_features.get('zero_crossing_rate_mean', 0):.4f} (speech vs music)")
    print(f"📝 Audio Characteristics: {audio_description}")
    print("\n💡 These features will be provided to Amazon Bedrock for enhanced content analysis")
    print("=" * 50)
    
else:
    print("❌ Could not extract audio for spectrogram analysis")
    audio_features = {}
    spectrogram_image = None
    audio_description = "Audio analysis unavailable"

## 7. Define Audio Content Analyzer Leveraging Claude

**Define the enhanced audio content analyzer** to analyze transcribed content with audio insights using Claude for creating structured chapters and topics.

This transforms raw transcription and audio insights into organized chapters and topics.
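Model responses sometimes wrap the requested JSON in extra text, so the parsing step benefits from a tolerant fallback. The sketch below shows one way to do this with the standard library's `json.JSONDecoder.raw_decode`. It is an alternative to the greedy regular expression used in the analyzer class below, not a description of it, and the function name `extract_json_object` is illustrative.

```python
import json
import re


def extract_json_object(text: str) -> dict:
    """Return the first JSON object that can be decoded from text."""
    decoder = json.JSONDecoder()
    # Try decoding from each opening brace until one yields a valid object
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    raise ValueError("No JSON object found in response text")


response_text = (
    "Here is the analysis:\n"
    '{"chapters": [], "overall_summary": "Short clip", "total_duration": 120.0}\n'
    "Let me know if you need changes."
)
analysis = extract_json_object(response_text)
print(analysis["total_duration"])
```

Because `raw_decode` stops at the end of the first complete object, trailing commentary that happens to contain braces does not break parsing.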

In [None]:
# Enhanced AudioContentAnalyzer with spectrogram integration
class EnhancedAudioContentAnalyzer:
    """Enhanced analyzer that includes spectrogram data in content analysis"""
    
    def __init__(self, model_id=None, region=None):
        self.bedrock_client = boto3.client('bedrock-runtime', region_name=region or AWS_REGION)
        self.model_id = model_id or AUDIO_MODEL_ID
    
    def create_analysis_prompt(self, sentences, audio_features=None, audio_description=None):
        """Create a structured prompt for chapter and topic analysis with spectrogram data"""
        transcript_text = " ".join([s['sentence'] for s in sentences])
        
        # Build audio analysis section if available
        audio_analysis_section = ""
        if audio_features and audio_description:
            audio_analysis_section = f"""

            AUDIO SPECTROGRAM ANALYSIS:
            Duration: {audio_features.get('duration', 0):.1f} seconds
            Tempo: {audio_features.get('tempo', 0):.1f} BPM
            Spectral Centroid: {audio_features.get('spectral_centroid_mean', 0):.1f} Hz
            RMS Energy: {audio_features.get('rms_mean', 0):.4f}
            Zero Crossing Rate: {audio_features.get('zero_crossing_rate_mean', 0):.4f}
            Audio Characteristics: {audio_description}
            
            MFCC Features (Speech Analysis, first 5 coefficients):
            {json.dumps(audio_features.get('mfcc_means', [])[:5], indent=2)}
            
            AUDIO INSIGHTS FOR CONTENT ANALYSIS:
            - Spectral Centroid ({audio_features.get('spectral_centroid_mean', 0):.1f} Hz): {'High energy/animated' if audio_features.get('spectral_centroid_mean', 0) > 2000 else 'Calm/conversational'} delivery
            - RMS Energy ({audio_features.get('rms_mean', 0):.4f}): {'Dynamic/emphatic' if audio_features.get('rms_mean', 0) > 0.1 else 'Steady/measured'} speaking style
            - Zero Crossing Rate ({audio_features.get('zero_crossing_rate_mean', 0):.4f}): {'Clear speech focus' if audio_features.get('zero_crossing_rate_mean', 0) > 0.1 else 'Background elements present'}
            - Tempo ({audio_features.get('tempo', 0):.1f} BPM): {'Rapid/urgent' if audio_features.get('tempo', 0) > 120 else 'Deliberate/thoughtful'} pacing
            - MFCC patterns indicate speech clarity and vocal characteristics"""
            
        prompt = f"""Analyze the following transcript and organize it into chapters and topics. MANDATORY: Incorporate the audio spectrogram analysis into your chapter titles, summaries, and descriptions to reflect the speaker's delivery style and emotional tone.
        
        TRANSCRIPT:
        {transcript_text}
        
        SENTENCE TIMING DATA:
        {json.dumps(sentences, indent=2)}{audio_analysis_section}
        
        Please create a structured analysis with the following format:
        
        {{
          "chapters": [
            {{
              "title": "Chapter Title (must reflect audio energy/tone)",
              "start_time": 0.0,
              "end_time": 30.0,
              "summary": "Brief chapter summary incorporating audio delivery characteristics",
              "audio_tone": "Description of speaker's energy and delivery style in this section",
              "topics": [
                {{
                  "title": "Topic Title",
                  "start_time": 0.0,
                  "end_time": 15.0,
                  "description": "Topic description enhanced with audio context",
                  "key_points": ["Point 1", "Point 2"]
                }}
              ]
            }}
          ],
          "overall_summary": "Overall content summary that includes speaker's delivery style and audio characteristics",
          "audio_delivery_analysis": "Summary of how the speaker's vocal patterns and energy levels enhance the content",
          "total_duration": 120.0
        }}
        
        CRITICAL REQUIREMENTS:
        - Chapter titles MUST include descriptors based on audio energy (e.g., "Energetic Opening", "Measured Technical Discussion")
        - All summaries MUST reference the speaker's delivery style using the spectrogram data
        - Include "audio_tone" field for each chapter describing vocal characteristics
        - Add "audio_delivery_analysis" field summarizing overall speaking patterns
        - Use tempo to identify pacing changes between sections
        - Use RMS energy to identify emphasis and key moments
        - Use spectral centroid to gauge speaker engagement and excitement levels
        - Descriptions should reflect whether content is delivered with high energy, calmly, urgently, etc.
        - Do NOT mention the numeric values of the audio characteristics in the summaries.
        - Focus primarily on the spoken content, adding only brief emotional cues identified from the audio analysis.
        
        Return only the JSON structure, no additional text."""
        
        return prompt
    
    def analyze_content(self, sentences, audio_features=None, audio_description=None):
        """Send transcript to Bedrock for chapter and topic analysis with spectrogram data"""
        if not sentences:
            return {"chapters": [], "overall_summary": "No content to analyze", "total_duration": 0.0}
        
        prompt = self.create_analysis_prompt(sentences, audio_features, audio_description)
        
        try:
            # Prepare the request for Claude
            request_body = {
                'anthropic_version': 'bedrock-2023-05-31',
                'max_tokens': 4096,
                "messages": [
                    {
                        "role": "user",
                        "content": [
                            {
                                "type": "text",
                                "text": prompt
                            }
                        ]
                    }
                ],
                "temperature": 0.3
            }
            
            print("🤖 Analyzing content with Amazon Bedrock (including spectrogram data)...")
            
            response = self.bedrock_client.invoke_model(
                modelId=self.model_id,
                body=json.dumps(request_body)
            )
            
            response_body = json.loads(response['body'].read())
            analysis_text = response_body['content'][0]['text']
            
            # Parse the JSON response
            try:
                analysis = json.loads(analysis_text)
                return analysis
            except json.JSONDecodeError:
                # If JSON parsing fails, extract JSON from the response
                import re
                json_match = re.search(r'\{.*\}', analysis_text, re.DOTALL)
                if json_match:
                    return json.loads(json_match.group())
                else:
                    raise ValueError("Could not parse JSON from response")
                    
        except Exception as e:
            print(f"❌ Error analyzing content: {e}")
            return {
                "chapters": [{
                    "title": "Content Analysis",
                    "start_time": sentences[0]['start_time'] if sentences else 0.0,
                    "end_time": sentences[-1]['end_time'] if sentences else 0.0,
                    "summary": "Analysis failed - showing raw transcript",
                    "topics": [{
                        "title": "Transcript Content",
                        "start_time": sentences[0]['start_time'] if sentences else 0.0,
                        "end_time": sentences[-1]['end_time'] if sentences else 0.0,
                        "description": " ".join([s['sentence'] for s in sentences[:5]]) + "...",
                        "key_points": ["Analysis unavailable"]
                    }]
                }],
                "overall_summary": f"Content analysis failed: {str(e)}",
                "total_duration": sentences[-1]['end_time'] if sentences else 0.0
            }

print("✅ Enhanced AudioContentAnalyzer with spectrogram integration ready!")

## 8. Execute Audio Content Analysis

**Analyze content using both transcription and audio features**, with Claude organizing the result into chapters and topics.

In [None]:
# Analyze the transcribed content with spectrogram data
if sentence_json:
    print("🔍 Starting enhanced content analysis with spectrogram data...")
    
    # Initialize the enhanced analyzer
    analyzer = EnhancedAudioContentAnalyzer()
    
    # Perform the analysis with both transcript and spectrogram data
    content_analysis = analyzer.analyze_content(sentence_json, audio_features, audio_description)
    
    print("\n📋 ENHANCED CONTENT ANALYSIS RESULTS")
    print("=" * 60)
    
    # Display overall summary
    print(f"📖 Overall Summary: {content_analysis.get('overall_summary', 'No summary available')}")
    print(f"⏱️ Total Duration: {content_analysis.get('total_duration', 0):.1f}s")
    print(f"📚 Chapters Found: {len(content_analysis.get('chapters', []))}")
    
    # Display chapters and topics
    for i, chapter in enumerate(content_analysis.get('chapters', []), 1):
        print(f"\n📖 Chapter {i}: {chapter.get('title', 'Untitled')}")
        print(f"   ⏱️ Time: {chapter.get('start_time', 0):.1f}s - {chapter.get('end_time', 0):.1f}s")
        print(f"   📝 Summary: {chapter.get('summary', 'No summary')}")
        
        topics = chapter.get('topics', [])
        print(f"   🏷️ Topics ({len(topics)}):")
        
        for j, topic in enumerate(topics, 1):
            print(f"      {j}. {topic.get('title', 'Untitled Topic')}")
            print(f"         ⏱️ {topic.get('start_time', 0):.1f}s - {topic.get('end_time', 0):.1f}s")
            print(f"         📄 {topic.get('description', 'No description')}")
            
            key_points = topic.get('key_points', [])
            if key_points:
                print(f"         🔑 Key Points: {', '.join(key_points)}")
    
    print("\n" + "=" * 60)
    
    # content_analysis stays in scope and is reused by the comparison in the next section
    print(f"\n✅ Enhanced analysis complete! Found {len(content_analysis.get('chapters', []))} chapters with detailed topics.")
    
else:
    print("❌ No sentences available for analysis")
    content_analysis = None
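
The display code above formats timestamps with `:.1f`, which raises a `ValueError` if the model returns a time as a string such as `"12.5"`. A minimal sketch, assuming the analysis follows the `chapters` / `topics` layout shown above, coerces times to floats and flattens the nested structure into one row per topic. The names `to_seconds` and `flatten_topics` are hypothetical helpers, not part of the notebook.

```python
def to_seconds(value, default=0.0):
    """Coerce a model-supplied timestamp to float, falling back to default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def flatten_topics(analysis):
    """Return one dict per topic, sorted by start time.

    Assumes `analysis` has the shape {"chapters": [{"title", "topics": [...]}]};
    missing keys fall back to the same placeholders the display code uses.
    """
    rows = []
    for chapter in analysis.get("chapters", []):
        for topic in chapter.get("topics", []):
            rows.append({
                "chapter": chapter.get("title", "Untitled"),
                "topic": topic.get("title", "Untitled Topic"),
                "start_time": to_seconds(topic.get("start_time")),
                "end_time": to_seconds(topic.get("end_time")),
            })
    return sorted(rows, key=lambda row: row["start_time"])


# Illustrative input only; these values are made up for the example.
sample = {
    "chapters": [{
        "title": "Intro",
        "topics": [
            {"title": "Agenda", "start_time": "12.5", "end_time": 20},
            {"title": "Welcome", "start_time": 0, "end_time": 12.5},
        ],
    }]
}
for row in flatten_topics(sample):
    print(f"{row['start_time']:.1f}s - {row['end_time']:.1f}s  {row['chapter']} / {row['topic']}")
```

A flat, time-sorted list like this is also a convenient shape for building a clickable table of contents or exporting to CSV.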

## 9. Comparison: With vs Without Spectrogram Analysis

**See the difference** between basic transcript analysis and enhanced spectrogram-powered analysis.

### Why Spectrogram Analysis Matters

**Spectrogram analysis adds acoustic context that a transcript alone lacks, such as how loudly, how quickly, and in what tone something is said:**

**🎯 Key Use Cases:**
- **Key Moment Detection**: Identify excitement peaks, emphasis, and emotional highlights
- **Speaker Engagement**: Detect when speakers are animated vs. calm/measured
- **Content Pacing**: Understand rushed vs. deliberate delivery for better segmentation
- **Emotional Context**: Capture tone that text alone cannot convey
- **Quality Assessment**: Identify clear speech vs. background noise/music
- **Audience Targeting**: Match content energy to appropriate audience segments
- **Highlight Generation**: Create clips based on vocal energy and engagement levels

**📊 Acoustic Features That Make a Difference:**
- **Spectral Centroid**: Brightness of the sound → Excitement vs. calm delivery
- **RMS Energy**: Loudness dynamics → Emphasis and key moments
- **Tempo**: Speech pacing → Urgency vs. thoughtful discussion
- **Zero Crossing Rate**: Noisiness of the signal → Clear voiced speech vs. noise or unvoiced sounds
- **MFCC**: Voice characteristics → Speaker identification and emotion
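
To make three of these features concrete, here is a minimal pure-Python sketch of RMS energy, zero crossing rate, and spectral centroid on a single frame of samples. It is meant for intuition only: it uses a naive DFT and synthetic sine tones, and it is an assumption that a real pipeline would instead rely on an audio library for framing, windowing, and speed. All function names here are hypothetical.

```python
import cmath
import math


def rms_energy(frame):
    """Root-mean-square amplitude: a simple loudness measure."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency in Hz (naive DFT, fine for short frames)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        magnitudes.append(abs(coeff))
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total


def sine(freq, amplitude, sample_rate=8000, n=256):
    """Synthetic test tone standing in for a frame of real audio."""
    return [amplitude * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


calm = sine(freq=250, amplitude=0.2)       # quiet, low-pitched
excited = sine(freq=2000, amplitude=0.8)   # loud, high-pitched

for name, frame in [("calm", calm), ("excited", excited)]:
    print(name,
          f"rms={rms_energy(frame):.3f}",
          f"zcr={zero_crossing_rate(frame):.3f}",
          f"centroid={spectral_centroid(frame, 8000):.0f}Hz")
```

The louder, higher tone scores higher on all three measures, which is the kind of contrast the prompt can use to distinguish animated from measured delivery.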

In [None]:
# Create a basic analyzer without spectrogram data for comparison
from components import BasicAudioContentAnalyzer, ComparisonUIBuilder

# Run comparison analysis with collapsible display
if sentence_json:
    # Basic analysis (transcript only)
    basic_analyzer = BasicAudioContentAnalyzer(model_id=AUDIO_MODEL_ID, region=AWS_REGION)
    basic_analysis = basic_analyzer.analyze_content(sentence_json)
    
    # Create UI builder and display comparison
    ui_builder = ComparisonUIBuilder(video_path)
    ui_builder.display_comparison(basic_analysis, content_analysis)
        
else:
    print("❌ No transcript available for comparison")

**🎉 Congratulations! You now understand how to perform audio understanding with AI!**