# Day 2, Session 5 - Demo: Voice-Enabled Invoice Assistant

## Voice Completes the Multimodal Experience

- Users can describe complex requirements verbally
- Hands-free operation for warehouse/field workers
- Accessibility for visually impaired users
- Natural conversation flow with invoices

**Today: Architecture overview, not full implementation!**

This demo shows the complete architecture for integrating voice capabilities with our invoice processing agent. We'll explore how to combine speech-to-text, our LangGraph agent, and text-to-speech into a seamless conversational experience.

**Duration: 15 minutes**

**Note**: Due to time constraints, we focus on architecture and integration patterns rather than full implementation. The concepts shown here can be extended for production systems.

In [None]:
# Global configuration - Instructor will fill these
OLLAMA_URL = "http://XX.XX.XX.XX"  # Course server IP (port 80)
API_TOKEN = "YOUR_TOKEN_HERE"      # Instructor provides token
MODEL = "qwen3:8b"                  # Default model on server

In [None]:
# Show required libraries (not all will be installed)
"""
Required packages for production voice system:
- livekit-agents: WebRTC signaling and rooms
- openai-whisper: Speech-to-text
- elevenlabs: Text-to-speech
- pyaudio: Audio streaming
- webrtcvad: Voice activity detection
"""

import asyncio
from dataclasses import dataclass
from typing import Optional, Dict, Any, List
import json
import time
import requests

## 1. Voice-Enabled Invoice Assistant Architecture

Let's start by understanding the complete system architecture. This diagram shows how all components work together to create a seamless voice experience.

In [None]:
architecture = """
VOICE-ENABLED INVOICE ASSISTANT ARCHITECTURE
=============================================

[User Microphone] 
        ↓ (WebRTC)
[LiveKit Server] ← Signaling → [Browser/App]
        ↓
[Audio Stream Buffer]
        ↓
[Voice Activity Detection]
        ↓ (when speech detected)
[Whisper STT] → "Show me invoice INV-001"
        ↓
[LangGraph Agent]
    ├→ [Vision Node] → Process invoice image
    ├→ [LLM Node] → Reasoning
    └→ [Tool Nodes] → API calls
        ↓
[Response Generator] → "The total is $1,250"
        ↓
[ElevenLabs TTS]
        ↓
[Audio Stream]
        ↓ (WebRTC)
[User Speaker]

KEY COMPONENTS:
- WebRTC: Real-time audio streaming
- VAD: Detects when user is speaking
- STT: Converts speech to text
- LangGraph: Our existing invoice agent
- TTS: Converts response back to speech
- Streaming: Low-latency audio delivery
"""

print(architecture)
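The flow in the diagram can be sketched as a chain of async stages. This is a minimal sketch, not the course implementation: `transcribe`, `run_agent`, and `synthesize` are hypothetical stand-ins for Whisper, the LangGraph agent, and ElevenLabs, and they return canned values.

```python
import asyncio
from typing import Dict

# Hypothetical stand-ins for the real stages (Whisper, LangGraph, ElevenLabs)
async def transcribe(audio: bytes) -> str:
    return "show me invoice INV-001"

async def run_agent(query: str) -> str:
    return f"Handled request: {query}"

async def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for encoded audio

async def handle_utterance(audio: bytes) -> Dict[str, object]:
    """Run one utterance through STT -> agent -> TTS, in diagram order."""
    text = await transcribe(audio)
    answer = await run_agent(text)
    speech = await synthesize(answer)
    return {"transcript": text, "answer": answer, "audio": speech}

result = asyncio.run(handle_utterance(b"\x00\x01"))
print(result["answer"])
```

Inside a notebook an event loop is already running, so replace `asyncio.run(...)` with `await handle_utterance(...)`.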

## 2. Voice Pipeline Components

Now let's examine each component in detail. Understanding these building blocks is crucial for implementing a production voice system.

In [None]:
@dataclass
class VoiceConfig:
    """Configuration for voice processing pipeline"""
    sample_rate: int = 16000  # Standard rate for speech processing
    chunk_duration_ms: int = 30  # For VAD processing
    whisper_model: str = "base.en"  # Balance of speed vs accuracy
    elevenlabs_voice: str = "Rachel"  # Natural sounding voice
    silence_threshold_ms: int = 1000  # When to stop listening
    max_audio_length_s: int = 30  # Prevent infinite recording

from collections import deque

class AudioBuffer:
    """Bounded FIFO buffer for audio chunks; drops the oldest chunk when full"""
    
    def __init__(self, max_size: int = 100):
        self.buffer = deque()  # deque gives O(1) removal from the front
        self.max_size = max_size
        self.total_duration = 0
    
    def add(self, chunk: bytes, duration_ms: int):
        """Add audio chunk to buffer with sliding window"""
        if len(self.buffer) >= self.max_size:
            # Remove oldest chunk
            _, old_duration = self.buffer.popleft()
            self.total_duration -= old_duration
        
        self.buffer.append((chunk, duration_ms))
        self.total_duration += duration_ms
    
    def get_audio_data(self) -> bytes:
        """Concatenate all chunks for processing"""
        return b''.join(chunk for chunk, _ in self.buffer)
    
    def clear(self):
        """Clear buffer after processing"""
        self.buffer.clear()
        self.total_duration = 0

class VoiceActivityDetector:
    """Detect when user is speaking vs silence"""
    
    def __init__(self, config: VoiceConfig):
        self.config = config
        self.is_speaking = False
        self.silence_start = None
        # In production: Initialize WebRTC VAD
        # import webrtcvad
        # self.vad = webrtcvad.Vad(2)  # Aggressiveness level 0-3
    
    def process_chunk(self, audio_chunk: bytes) -> Dict[str, bool]:
        """Process audio chunk and detect speech state changes"""
        # In production, use actual VAD:
        # is_speech = self.vad.is_speech(audio_chunk, self.config.sample_rate)
        
        # Mock implementation for demo
        import random
        is_speech = random.random() > 0.7  # Simulate 30% speech detection
        
        current_time = time.time()
        
        # State machine for speech detection
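        if is_speech:
            # Any speech frame cancels a pending silence timer, so a brief
            # mid-utterance pause does not count toward the end-of-speech threshold
            self.silence_start = None
            if not self.is_speaking:
                # Speech started
                self.is_speaking = True
                return {'speech_started': True, 'speech_ended': False}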
        else:
            if self.is_speaking:
                # Potential speech end - start silence timer
                if self.silence_start is None:
                    self.silence_start = current_time
                elif current_time - self.silence_start > (self.config.silence_threshold_ms / 1000):
                    # Speech ended after sufficient silence
                    self.is_speaking = False
                    self.silence_start = None
                    return {'speech_started': False, 'speech_ended': True}
        
        return {'speech_started': False, 'speech_ended': False}

# Test the components
print("🔧 Voice pipeline components initialized")
print(f"   Config: {VoiceConfig().sample_rate}Hz, {VoiceConfig().whisper_model} model")
print(f"   Buffer: Max {AudioBuffer().max_size} chunks")
print(f"   VAD: {VoiceConfig().silence_threshold_ms}ms silence threshold")
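The `chunk_duration_ms = 30` setting matters because WebRTC VAD only accepts frames of 10, 20, or 30 ms of 16-bit mono PCM. The sketch below shows how a raw PCM stream could be cut into such frames. The RMS energy check is only an illustrative stand-in for a real VAD, and the threshold of 500 is an arbitrary demo value.

```python
import math
import struct
from typing import Iterator

BYTES_PER_SAMPLE = 2  # assumes 16-bit mono PCM

def frame_size_bytes(sample_rate: int, frame_ms: int) -> int:
    """Number of bytes in one frame of 16-bit mono PCM."""
    return sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE

def split_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> Iterator[bytes]:
    """Yield complete fixed-size frames; a trailing partial frame is dropped."""
    size = frame_size_bytes(sample_rate, frame_ms)
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]

def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit samples."""
    count = len(frame) // BYTES_PER_SAMPLE
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[:count * BYTES_PER_SAMPLE])
    return math.sqrt(sum(s * s for s in samples) / count)

# One second of silence followed by one second of a constant-amplitude signal
silence = struct.pack("<16000h", *([0] * 16000))
loud = struct.pack("<16000h", *([8000] * 16000))

frames = list(split_frames(silence + loud))
speech_flags = [rms(frame) > 500 for frame in frames]  # arbitrary demo threshold
print(f"{len(frames)} frames, {sum(speech_flags)} flagged as speech")
```

At 16 kHz, a 30 ms frame is 480 samples, or 960 bytes, which is the chunk size the `AudioBuffer` would receive in this setup.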

## 3. Whisper Speech-to-Text Integration

Whisper is OpenAI's speech recognition model. Understanding how to integrate it effectively is key to good voice experiences.

In [None]:
class WhisperSTT:
    """Speech-to-text integration with OpenAI Whisper"""
    
    def __init__(self, model_size: str = "base"):
        """
        Whisper model sizes and their characteristics (speed is relative to large):
        
        Model    | Parameters | Speed  | Accuracy | Use Case
        ---------|------------|--------|----------|----------
        tiny     | 39M        | ~32x   | Fair     | Real-time, mobile
        base     | 74M        | ~16x   | Good     | Balanced performance
        small    | 244M       | ~6x    | Better   | Server deployment
        medium   | 769M       | ~2x    | Very good| High accuracy needs
        large    | 1550M      | ~1x    | Excellent| Best possible accuracy
        
        For production: Consider faster-whisper or WhisperX for better performance
        """
        self.model_size = model_size
        self.model = None  # Would load actual model in production
        print(f"🎤 Initialized Whisper STT with '{model_size}' model")
    
    async def transcribe_audio_buffer(self, audio_buffer: AudioBuffer) -> Dict[str, Any]:
        """
        Process complete audio buffer when user stops speaking
        
        Key challenge: Whisper works best with complete utterances
        Streaming approaches exist but add complexity
        """
        if audio_buffer.total_duration < 500:  # Too short
            return {'text': '', 'confidence': 0.0, 'language': 'en'}
        
        # Get concatenated audio data
        audio_data = audio_buffer.get_audio_data()
        
        print(f"   🔄 Transcribing {audio_buffer.total_duration}ms of audio...")
        
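        # In production (transcribe() accepts a file path or a float32 NumPy
        # array, not raw bytes; this assumes 16-bit mono PCM at 16 kHz):
        # import numpy as np
        # audio = np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
        # result = self.model.transcribe(
        #     audio,
        #     language='en',  # Omit to auto-detect
        #     task='transcribe',  # vs 'translate'
        #     temperature=0.0  # Deterministic output
        # )
        # segments = result.get('segments', [])
        # # avg_logprob is reported per segment (a log-probability, not a 0-1 score)
        # avg_logprob = sum(s['avg_logprob'] for s in segments) / len(segments) if segments else 0.0
        # return {
        #     'text': result['text'].strip(),
        #     'confidence': avg_logprob,
        #     'language': result.get('language', 'en'),
        #     'segments': segments
        # }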
        
        # Mock transcription for demo
        mock_transcriptions = [
            "Show me invoice INV-001",
            "What's the total amount for the recent TechCorp invoice?",
            "Extract all line items from this invoice",
            "Who is the vendor on invoice number ABC-123?",
            "Calculate the tax amount on this document"
        ]
        
        import random
        transcription = random.choice(mock_transcriptions)
        
        # Simulate processing time
        await asyncio.sleep(0.5)  # Simulated latency; real values depend on model size and hardware
        
        return {
            'text': transcription,
            'confidence': 0.95,
            'language': 'en',
            'duration_ms': audio_buffer.total_duration
        }
    
    def optimize_for_domain(self, invoice_vocabulary: List[str]):
        """
        Optimization strategies for invoice-specific speech:
        
        1. Custom vocabulary: Common invoice terms
        2. Prompt engineering: Guide model context (e.g. Whisper's initial_prompt)
        3. Post-processing: Fix common STT errors (e.g. misheard invoice numbers)
        """
        self.invoice_terms = invoice_vocabulary
        print(f"   📝 Loaded {len(invoice_vocabulary)} domain-specific terms")

# Test Whisper integration
print("🧪 Testing Whisper STT integration...")

whisper_stt = WhisperSTT("base")

# Add domain optimization
invoice_vocabulary = [
    "invoice", "vendor", "total", "subtotal", "tax", "line item",
    "quantity", "unit price", "amount", "due date", "invoice number"
]
whisper_stt.optimize_for_domain(invoice_vocabulary)

# Simulate audio processing
test_buffer = AudioBuffer()
test_buffer.add(b"mock_audio_data", 2000)  # 2 second recording

print(f"\n   📊 Simulated audio buffer: {test_buffer.total_duration}ms")
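Two of the strategies listed in `optimize_for_domain` can be sketched with the standard library alone. The first helper builds a context string that could be passed as the `initial_prompt` argument of Whisper's `transcribe()`. The second normalizes invoice IDs after transcription. Both helper names are hypothetical, and the `INV-<digits>` format is an assumption based on the demo's examples.

```python
import re
from typing import List

def build_initial_prompt(terms: List[str]) -> str:
    """Join domain terms into a short context string for the recognizer."""
    return "Invoice terms: " + ", ".join(terms) + "."

# Matches "inv 001", "Inv001", "INV-001"; does not match the word "invoice"
_INVOICE_ID = re.compile(r"\binv[\s-]*(\d+)\b", re.IGNORECASE)

def normalize_invoice_ids(text: str) -> str:
    """Rewrite spoken or mis-transcribed IDs to the canonical 'INV-001' form."""
    return _INVOICE_ID.sub(lambda m: f"INV-{m.group(1)}", text)

prompt = build_initial_prompt(["invoice", "vendor", "subtotal"])
fixed = normalize_invoice_ids("show me inv 001 and Inv002")
print(prompt)
print(fixed)
```

Post-processing like this is cheap and deterministic, so it is a reasonable first step before trying heavier options such as fine-tuning.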

## 4. Bridging Voice to LangGraph Agent

This section connects voice input to our existing invoice processing agent. The key is maintaining conversation context across turns and handling multimodal inputs.
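As a minimal sketch of what "maintaining context" can mean, the class below keeps a bounded window of recent turns and resolves the phrase "that invoice" to the last referenced ID. `ConversationContext` and its method names are hypothetical and only illustrate the idea.

```python
import re
from collections import deque
from typing import Deque, Dict, Optional

class ConversationContext:
    """Keep the last few exchanges plus the most recently referenced invoice."""

    def __init__(self, max_turns: int = 5) -> None:
        # deque(maxlen=...) silently discards the oldest turn when full
        self.turns: Deque[Dict[str, str]] = deque(maxlen=max_turns)
        self.current_invoice_id: Optional[str] = None

    def record(self, user_text: str, agent_text: str,
               invoice_id: Optional[str] = None) -> None:
        self.turns.append({"user": user_text, "agent": agent_text})
        if invoice_id:
            self.current_invoice_id = invoice_id

    def resolve(self, query: str) -> str:
        """Replace 'that invoice' with the last referenced invoice ID, if any."""
        if not self.current_invoice_id:
            return query
        replacement = f"invoice {self.current_invoice_id}"
        return re.sub(r"\bthat invoice\b", lambda _m: replacement,
                      query, flags=re.IGNORECASE)

context = ConversationContext(max_turns=5)
for i in range(7):
    context.record(f"question {i}", f"answer {i}", invoice_id="INV-001")

resolved = context.resolve("What is the total on that invoice?")
print(len(context.turns), resolved)
```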

In [None]:
class VoiceToGraphBridge:
    """Bridge between voice pipeline and LangGraph invoice agent"""
    
    def __init__(self, langgraph_app, llm_config: Dict[str, str]):
        self.graph = langgraph_app
        self.llm_config = llm_config
        self.conversation_history = []
        self.current_invoice_context = None
        
        print(f"🔗 Voice-to-Graph bridge initialized")
        print(f"   LLM: {llm_config.get('url', 'Not configured')}/{llm_config.get('model', 'unknown')}")
    
    async def process_voice_command(self, 
                                   transcription: str, 
                                   invoice_image: Optional[bytes] = None,
                                   user_context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
        """
        Process voice command through LangGraph agent
        
        This is where we adapt voice input for our existing agent:
        1. Clean and enhance transcription
        2. Maintain conversation context
        3. Route to appropriate graph nodes
        4. Format response for voice output
        """
        print(f"   🎯 Processing voice command: '{transcription}'")
        
        # Step 1: Enhance transcription with context
        enhanced_query = self._enhance_voice_query(transcription)
        
        # Step 2: Prepare state for LangGraph
        graph_state = {
            "user_query": enhanced_query,
            "original_transcription": transcription,
            "modality": "voice",
            "conversation_history": self.conversation_history[-5:],  # Last 5 exchanges
            "require_voice_response": True,
            "response_style": "conversational",  # vs "formal"
            "max_response_length": 150  # Keep voice responses concise
        }
        
        # Step 3: Add invoice image if provided
        if invoice_image:
            graph_state["invoice_image"] = invoice_image
            graph_state["has_visual_input"] = True
        elif self.current_invoice_context:
            # Use previously loaded invoice
            graph_state["invoice_data"] = self.current_invoice_context
            graph_state["has_context"] = True
        
        # Step 4: Execute graph (simulated)
        print(f"   ⚙️ Executing LangGraph with voice-optimized state...")
        
        # In production:
        # result = await self.graph.ainvoke(graph_state)
        
        # Mock graph execution
        result = await self._simulate_graph_execution(graph_state)
        
        # Step 5: Store conversation history
        self.conversation_history.append({
            "user_input": transcription,
            "agent_response": result["final_answer"],
            "timestamp": time.time(),
            "invoice_referenced": result.get("invoice_id")
        })
        
        # Step 6: Optimize response for voice
        voice_response = self._optimize_for_voice_output(result["final_answer"])
        
        return {
            "success": True,
            "voice_response": voice_response,
            "detailed_data": result.get("structured_data"),
            "confidence": result.get("confidence", 0.9),
            "processing_time_ms": result.get("processing_time", 800)
        }
    
    def _enhance_voice_query(self, transcription: str) -> str:
        """
        Clean and enhance voice transcription for better LLM processing
        
        Common voice query issues:
        - Informal language: "what's" → "what is"
        - Missing context: "show me the total" → "show me the total amount for the current invoice"
        - Ambiguous references: "that invoice" → "invoice INV-001"
        """
        import re
        
        enhanced = transcription.strip()
        invoice_id = (self.current_invoice_context or {}).get('id', '')
        
        # Basic cleanup; matching is case-insensitive so identifiers such as
        # "INV-001" keep their original casing
        replacements = {
            "what's": "what is",
            "show me": "please extract",
            "tell me": "provide information about",
            "that invoice": f"the current invoice {invoice_id}".strip()
        }
        
        for old, new in replacements.items():
            enhanced = re.sub(re.escape(old), lambda _m, new=new: new, enhanced, flags=re.IGNORECASE)
        
        # Add context if missing
        if invoice_id and "invoice" not in enhanced.lower():
            enhanced += f" for invoice {invoice_id}"
        
        return enhanced
    
    async def _simulate_graph_execution(self, state: Dict[str, Any]) -> Dict[str, Any]:
        """Simulate LangGraph execution for demo purposes"""
        # Simulate processing time
        await asyncio.sleep(0.8)
        
        query = state["user_query"].lower()  # Keyword checks below are lowercase
        
        # Generate appropriate response based on query type
        if "total" in query or "amount" in query:
            response = "The total amount for invoice INV-001 is $1,250.00 including tax."
            structured_data = {"total_amount": 1250.00, "currency": "USD", "includes_tax": True}
        elif "vendor" in query or "company" in query:
            response = "The vendor for this invoice is TechSupplies Corporation."
            structured_data = {"vendor_name": "TechSupplies Corporation"}
        elif "line item" in query or "items" in query:
            response = "This invoice contains 3 line items: Laptop computers, software licenses, and consulting services."
            structured_data = {"line_items_count": 3, "categories": ["hardware", "software", "services"]}
        else:
            response = "I can help you analyze invoice data. Try asking about totals, vendors, or line items."
            structured_data = {"suggestion": "ask_specific_question"}
        
        return {
            "final_answer": response,
            "structured_data": structured_data,
            "confidence": 0.92,
            "processing_time": 800,
            "invoice_id": "INV-001"
        }
    
    def _optimize_for_voice_output(self, text_response: str) -> str:
        """
        Optimize text response for voice output
        
        Voice-specific considerations:
        - Break up long numbers: "1250.00" → "one thousand two hundred fifty dollars"
        - Add pauses: "The total is... one thousand dollars"
        - Remove visual formatting: No bullet points, tables
        - Simplify complex sentences
        """
        # Collapse newlines and repeated whitespace
        cleaned = " ".join(text_response.split())
        
        # Add a natural pause before dollar amounts (keep the currency symbol)
        cleaned = cleaned.replace("is $", "is... $")
        
        # Limit length for voice: keep only the first sentence
        if len(cleaned) > 150:
            first_sentence = cleaned.split(". ")[0]
            cleaned = first_sentence.rstrip(".") + "."
        
        return cleaned

# Test the bridge
print("🧪 Testing Voice-to-Graph bridge...")

# Mock LangGraph app
mock_graph = None  # Would be actual LangGraph application

bridge = VoiceToGraphBridge(
    langgraph_app=mock_graph,
    llm_config={"url": OLLAMA_URL, "model": MODEL, "token": API_TOKEN}
)

# Set current invoice context
bridge.current_invoice_context = {"id": "INV-001", "vendor": "TechSupplies Corp"}

print(f"   📋 Invoice context set: {bridge.current_invoice_context['id']}")
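The docstring of `_optimize_for_voice_output` mentions reading amounts as words, which the demo method does not implement. A minimal sketch of that step follows. It assumes US-dollar amounts below one million, and the function names are hypothetical.

```python
from decimal import Decimal

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def _under_thousand(n: int) -> str:
    """Spell out an integer in the range 0..999."""
    parts = []
    if n >= 100:
        parts.append(f"{ONES[n // 100]} hundred")
        n %= 100
    if n >= 20:
        word = TENS[n // 10]
        if n % 10:
            word += f"-{ONES[n % 10]}"
        parts.append(word)
    elif n > 0 or not parts:
        parts.append(ONES[n])
    return " ".join(parts)

def integer_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999,999."""
    if not 0 <= n < 1_000_000:
        raise ValueError("this sketch only supports 0..999,999")
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(f"{_under_thousand(thousands)} thousand")
    if rest or not parts:
        parts.append(_under_thousand(rest))
    return " ".join(parts)

def dollars_to_words(amount: str) -> str:
    """Convert a string such as '$1,250.00' into a speakable phrase."""
    value = Decimal(amount.replace("$", "").replace(",", ""))
    dollars = int(value)
    cents = int((value - dollars) * 100)  # extra decimal places are truncated
    spoken = f"{integer_to_words(dollars)} {'dollar' if dollars == 1 else 'dollars'}"
    if cents:
        spoken += f" and {integer_to_words(cents)} {'cent' if cents == 1 else 'cents'}"
    return spoken

print(dollars_to_words("$1,250.00"))
print(dollars_to_words("99.95"))
```

Using `Decimal` instead of `float` avoids rounding surprises when splitting dollars from cents.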

## 5. Text-to-Speech with ElevenLabs

Converting our agent's responses back to natural-sounding speech is the final piece of the voice experience.

In [None]:
class ElevenLabsTTS:
    """Text-to-speech integration with ElevenLabs API"""
    
    def __init__(self, api_key: str, voice_id: str = "Rachel"):
        """
        ElevenLabs voice model options (latencies are rough figures that
        vary by region and load):
        
        Model         | Latency | Quality  | Use Case
        --------------|---------|----------|----------
        Turbo v2.5    | 32ms    | Good     | Real-time conversation
        Multilingual  | 200ms   | Excellent| Multi-language support
        Voice Cloning | 500ms   | Custom   | Branded voice experience
        
        Voice characteristics (display names; the API URL expects the voice's ID):
        - Rachel: Professional, clear
        - Adam: Friendly, conversational
        - Bella: Warm, approachable
        """
        self.api_key = api_key
        self.voice_id = voice_id
        self.model = "eleven_turbo_v2_5"  # Optimized for low latency
        print(f"🔊 Initialized ElevenLabs TTS with '{voice_id}' voice")
    
    async def synthesize_streaming(self, text: str) -> Dict[str, Any]:
        """
        Generate speech with streaming for low latency
        
        Streaming approach:
        1. Send text to ElevenLabs API
        2. Receive audio chunks as they're generated
        3. Start playback immediately (don't wait for completion)
        4. Handle user interruptions gracefully
        """
        print(f"   🎵 Synthesizing: '{text[:50]}{'...' if len(text) > 50 else ''}'")
        
        # In production:
        # import httpx
        # 
        # url = f"https://api.elevenlabs.io/v1/text-to-speech/{self.voice_id}/stream"
        # headers = {
        #     "Accept": "audio/mpeg",
        #     "Content-Type": "application/json",
        #     "xi-api-key": self.api_key
        # }
        # 
        # data = {
        #     "text": text,
        #     "model_id": self.model,
        #     "voice_settings": {
        #         "stability": 0.5,      # Voice consistency
        #         "similarity_boost": 0.75,  # Voice similarity
        #         "style": 0.5,         # Expressiveness
        #         "use_speaker_boost": True
        #     }
        # }
        # 
        # async with httpx.AsyncClient() as client:
        #     async with client.stream('POST', url, headers=headers, json=data) as response:
        #         response.raise_for_status()
        #         async for chunk in response.aiter_bytes():
        #             yield chunk  # Stream each chunk to playback immediately
        # 
        # Note: with `yield`, this method becomes an async generator, so it cannot
        # also `return` the joined bytes; collect chunks in the caller if needed.
        
        # Mock implementation for demo
        estimated_duration = len(text) * 0.05  # ~50ms per character
        await asyncio.sleep(0.1)  # Simulate API latency
        
        return {
            "audio_data": b"mock_audio_mp3_data",
            "duration_seconds": estimated_duration,
            "sample_rate": 22050,
            "format": "mp3",
            "voice_used": self.voice_id,
            "model_used": self.model
        }
    
    def optimize_for_conversation(self, text: str) -> str:
        """
        Optimize text for conversational speech synthesis
        
        Techniques:
        1. Add SSML for natural pauses (tag support varies by TTS provider and model)
        2. Phonetic spelling for technical terms
        3. Emphasize important information
        """
        # Add natural pauses
        optimized = text.replace(", ", ", <break time='0.3s'/> ")
        optimized = optimized.replace(". ", ". <break time='0.5s'/> ")
        
        # Emphasize monetary amounts (the pattern stops before trailing punctuation)
        import re
        optimized = re.sub(r'\$(\d+(?:,\d{3})*(?:\.\d+)?)', r'<emphasis level="strong">$\1</emphasis>', optimized)
        
        # Slow down invoice numbers for clarity
        optimized = re.sub(r'(INV-[A-Z0-9]+)', r'<prosody rate="slow">\1</prosody>', optimized)
        
        return optimized
    
    async def handle_interruption(self):
        """
        Handle user interruptions (barge-in)
        
        When user starts speaking while agent is talking:
        1. Stop current TTS immediately
        2. Clear audio buffer
        3. Switch to listening mode
        """
        print("   ⏹️ User interruption detected - stopping TTS")
        # In production: Stop streaming audio
        # await self.stop_current_synthesis()

# Test TTS integration
print("🧪 Testing ElevenLabs TTS integration...")

# Mock API key for demo
tts = ElevenLabsTTS(api_key="demo_key", voice_id="Rachel")

test_response = "The total amount for invoice INV-001 is $1,250.00 including tax."
optimized_text = tts.optimize_for_conversation(test_response)

print(f"\n   📝 Original: {test_response}")
print(f"   🎭 Optimized: {optimized_text[:100]}...")
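The streaming and barge-in steps described above can be sketched with an async generator and an `asyncio.Event`. This is a simplified illustration under stated assumptions: `fake_tts_stream` is a hypothetical stand-in for the ElevenLabs stream, and it sets the event itself at a fixed chunk index to simulate the VAD detecting user speech.

```python
import asyncio
from typing import AsyncIterator, List

async def fake_tts_stream(n_chunks: int, interrupted: asyncio.Event,
                          barge_in_at: int) -> AsyncIterator[bytes]:
    """Hypothetical TTS stream; flags a barge-in when chunk `barge_in_at` arrives."""
    for i in range(n_chunks):
        await asyncio.sleep(0)  # yield control, as a network read would
        if i == barge_in_at:
            interrupted.set()  # simulates the VAD detecting user speech
        yield f"chunk-{i}".encode()

async def play_until_interrupted(stream: AsyncIterator[bytes],
                                 interrupted: asyncio.Event) -> List[bytes]:
    """'Play' chunks as they arrive and stop as soon as a barge-in is flagged."""
    played: List[bytes] = []
    async for chunk in stream:
        if interrupted.is_set():
            break  # drop the rest of the response
        played.append(chunk)  # a real system would write to the audio device here
    return played

async def demo() -> List[bytes]:
    interrupted = asyncio.Event()
    stream = fake_tts_stream(10, interrupted, barge_in_at=3)
    return await play_until_interrupted(stream, interrupted)

played = asyncio.run(demo())
print(f"Played {len(played)} of 10 chunks before the interruption")
```

In a notebook, use `played = await demo()` instead of `asyncio.run(demo())`. Checking the event before each chunk keeps the stop latency at roughly one chunk of audio.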

## 6. Conversation Flow and Turn-Taking

Managing natural conversation flow is one of the most challenging aspects of voice interfaces. This includes handling interruptions, managing silence, and ensuring smooth turn-taking.

In [None]:
class ConversationManager:
    """Manage conversation state and turn-taking in voice interactions"""
    
    def __init__(self, config: VoiceConfig):
        self.config = config
        self.user_speaking = False
        self.agent_speaking = False
        self.last_speech_time = None
        self.conversation_state = "listening"  # listening, processing, speaking, waiting
        self.interruption_count = 0
        
        print(f"💬 Conversation manager initialized")
        print(f"   Silence threshold: {config.silence_threshold_ms}ms")
    
    def handle_user_speech_start(self) -> Dict[str, Any]:
        """
        User starts talking - manage interruptions and state transitions
        
        Scenarios:
        1. User speaks while agent is quiet → Normal turn
        2. User interrupts agent speaking → Barge-in behavior
        3. User continues after pause → Extended input
        """
        current_time = time.time()
        
        # Record the speech start first so every branch leaves the manager
        # in a consistent state (handle_silence relies on user_speaking)
        self.user_speaking = True
        self.last_speech_time = current_time
        
        if self.agent_speaking:
            # User is interrupting - handle barge-in
            print("   🔄 User interruption detected (barge-in)")
            self.interruption_count += 1
            
            # Stop agent speech immediately
            stop_actions = self._stop_agent_speech()
            
            # Switch to listening mode
            self.conversation_state = "listening"
            self.agent_speaking = False
            
            return {
                "action": "handle_interruption",
                "previous_state": "agent_speaking",
                "stop_tts": True,
                "interruption_count": self.interruption_count,
                **stop_actions
            }
        
        elif self.conversation_state == "waiting":
            # Normal user turn
            print("   🎤 User started speaking")
            self.conversation_state = "listening"
            
            return {
                "action": "start_listening",
                "previous_state": "waiting",
                "start_recording": True
            }
        
        return {"action": "continue_listening"}
    
    def handle_silence(self, duration_ms: int) -> Dict[str, Any]:
        """
        Handle silence periods - decide when to process speech
        
        Silence handling strategy:
        - Short pause: Continue waiting
        - Medium pause: End of utterance, process speech
        - Long pause: Timeout, prompt user
        """
        if not self.user_speaking:
            return {"action": "no_change"}
        
        if duration_ms >= self.config.silence_threshold_ms:
            # End of user utterance
            print(f"   ⏸️ Silence detected ({duration_ms}ms) - processing speech")
            
            self.user_speaking = False
            self.conversation_state = "processing"
            
            return {
                "action": "process_speech",
                "silence_duration": duration_ms,
                "should_transcribe": True
            }
        
        return {"action": "continue_listening"}
    
    def handle_agent_response_start(self, estimated_duration_s: float) -> Dict[str, Any]:
        """
        Agent starts speaking - update state and prepare for potential interruptions
        """
        print(f"   🤖 Agent starting to speak ({estimated_duration_s:.1f}s)")
        
        self.agent_speaking = True
        self.conversation_state = "speaking"
        
        return {
            "action": "start_speaking",
            "estimated_duration": estimated_duration_s,
            "monitor_for_interruption": True
        }
    
    def handle_agent_response_complete(self) -> Dict[str, Any]:
        """
        Agent finished speaking - return to listening state
        """
        print("   ✅ Agent finished speaking - waiting for user")
        
        self.agent_speaking = False
        self.conversation_state = "waiting"
        
        return {
            "action": "wait_for_user",
            "ready_for_input": True
        }
    
    def _stop_agent_speech(self) -> Dict[str, Any]:
        """
        Immediately stop agent speech due to interruption
        """
        return {
            "stop_tts_immediately": True,
            "clear_audio_buffer": True,
            "interruption_handled": True
        }
    
    def get_conversation_stats(self) -> Dict[str, Any]:
        """
        Get conversation flow statistics for optimization
        """
        return {
            "current_state": self.conversation_state,
            "user_speaking": self.user_speaking,
            "agent_speaking": self.agent_speaking,
            "interruption_count": self.interruption_count,
            "last_speech_time": self.last_speech_time
        }

# Test conversation management
print("🧪 Testing conversation flow management...")

config = VoiceConfig()
conversation = ConversationManager(config)

# Simulate conversation flow
print("\n   📊 Simulating conversation scenarios:")

# Scenario 1: Normal user turn
result1 = conversation.handle_user_speech_start()
print(f"   1. User starts speaking: {result1['action']}")

# Scenario 2: User finishes speaking
result2 = conversation.handle_silence(1200)  # 1.2 seconds
print(f"   2. Silence detected: {result2['action']}")

# Scenario 3: Agent responds
result3 = conversation.handle_agent_response_start(3.5)
print(f"   3. Agent speaking: {result3['action']}")

# Scenario 4: User interrupts
result4 = conversation.handle_user_speech_start()
print(f"   4. User interrupts: {result4['action']}")

# Show stats
stats = conversation.get_conversation_stats()
print(f"\n   📈 Conversation stats: State={stats['current_state']}, Interruptions={stats['interruption_count']}")
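The turn-taking logic can also be written as an explicit transition table, which makes the allowed state changes easy to audit. The state names follow `ConversationManager`; the event names are hypothetical labels chosen for this sketch.

```python
from typing import Dict, Tuple

# (current state, event) -> next state
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("waiting", "user_speech_start"): "listening",
    ("listening", "silence_timeout"): "processing",
    ("processing", "agent_response_start"): "speaking",
    ("speaking", "agent_response_complete"): "waiting",
    ("speaking", "user_speech_start"): "listening",  # barge-in
}

def next_state(state: str, event: str) -> str:
    """Return the next state, or stay in place if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

state = "waiting"
trace = [state]
for event in ["user_speech_start", "silence_timeout",
              "agent_response_start", "user_speech_start"]:
    state = next_state(state, event)
    trace.append(state)

print(" -> ".join(trace))
```

Events that are not listed for a state are ignored, so speech detected while the agent is still processing does not change the state in this sketch.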

## 7. Latency Optimization Strategies

Voice interfaces are extremely sensitive to latency. Let's examine where delays occur and how to minimize them.
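Before optimizing, it helps to measure where time is actually spent. The sketch below is one possible way to time each stage with a context manager. `StageTimer` is a hypothetical helper, and the `time.sleep` calls are stand-ins for real STT, agent, and TTS work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

class StageTimer:
    """Collect wall-clock durations for named pipeline stages."""

    def __init__(self) -> None:
        self.timings_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed
            self.timings_ms[name] = (time.perf_counter() - start) * 1000

    def total_ms(self) -> float:
        return sum(self.timings_ms.values())

timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.02)   # stand-in for transcription
with timer.stage("agent"):
    time.sleep(0.01)   # stand-in for the LangGraph call
with timer.stage("tts"):
    time.sleep(0.005)  # stand-in for speech synthesis

for name, ms in timer.timings_ms.items():
    print(f"{name:<6} {ms:6.1f} ms")
print(f"total  {timer.total_ms():6.1f} ms")
```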

In [None]:
class LatencyOptimizer:
    """Analyze and optimize voice pipeline latency"""
    
    def __init__(self):
        self.latency_breakdown = {
            "vad_processing": 30,      # Voice Activity Detection
            "stt_whisper_base": 2000,  # Whisper speech-to-text
            "langgraph_processing": 1500,  # Invoice agent
            "tts_elevenlabs": 500,     # Text-to-speech
            "network_roundtrip": 100,  # Network delays
            "audio_buffering": 50      # Audio pipeline delays
        }
        
        self.optimizations = {
            "vad_processing": 30,      # Already optimized
            "stt_whisper_base": 500,   # faster-whisper or WhisperX
            "langgraph_processing": 800,   # Caching + prompt optimization
            "tts_elevenlabs": 75,      # Streaming TTS
            "network_roundtrip": 50,   # Edge deployment
            "audio_buffering": 30      # Optimized audio pipeline
        }
    
    def analyze_current_latency(self) -> Dict[str, Any]:
        """Calculate baseline and optimized end-to-end latency (stages assumed sequential)"""
        total_baseline = sum(self.latency_breakdown.values())
        total_optimized = sum(self.optimizations.values())
        improvement = total_baseline - total_optimized
        
        return {
            "baseline_ms": total_baseline,
            "optimized_ms": total_optimized,
            "improvement_ms": improvement,
            "improvement_percent": (improvement / total_baseline) * 100,
            "breakdown": self.latency_breakdown,
            "optimized_breakdown": self.optimizations
        }
    
    def get_optimization_recommendations(self) -> List[Dict[str, Any]]:
        """Get specific recommendations for latency reduction"""
        recommendations = [
            {
                "component": "Speech-to-Text",
                "current_ms": 2000,
                "optimized_ms": 500,
                "improvement": "75%",
                "method": "faster-whisper or WhisperX",
                "description": "Replace standard Whisper with optimized implementations",
                "difficulty": "Medium",
                "impact": "High"
            },
            {
                "component": "LangGraph Processing",
                "current_ms": 1500,
                "optimized_ms": 800,
                "improvement": "47%",
                "method": "Prompt caching + model optimization",
                "description": "Cache frequent queries and optimize prompt templates",
                "difficulty": "Low",
                "impact": "High"
            },
            {
                "component": "Text-to-Speech",
                "current_ms": 500,
                "optimized_ms": 75,
                "improvement": "85%",
                "method": "Streaming TTS with ElevenLabs Turbo",
                "description": "Start audio playback while generating rest of speech",
                "difficulty": "Medium",
                "impact": "High"
            },
            {
                "component": "Network",
                "current_ms": 100,
                "optimized_ms": 50,
                "improvement": "50%",
                "method": "Edge deployment",
                "description": "Deploy closer to users with CDN or edge compute",
                "difficulty": "High",
                "impact": "Medium"
            }
        ]
        
        return sorted(recommendations, key=lambda x: x["current_ms"], reverse=True)
    
    def simulate_optimization_impact(self, optimizations_applied: List[str]) -> Dict[str, Any]:
        """Simulate the impact of applying specific optimizations"""
        current_total = sum(self.latency_breakdown.values())
        optimized_total = current_total
        
        # Apply selected optimizations
        for opt in optimizations_applied:
            if opt in self.latency_breakdown:
                current_latency = self.latency_breakdown[opt]
                optimized_latency = self.optimizations[opt]
                reduction = current_latency - optimized_latency
                optimized_total -= reduction
        
        improvement = current_total - optimized_total
        
        # Classify result
        if optimized_total < 1000:
            quality = "Excellent (<1s)"
        elif optimized_total < 2000:
            quality = "Good (<2s)"
        elif optimized_total < 3000:
            quality = "Acceptable (<3s)"
        else:
            quality = "Poor (>=3s)"
        
        return {
            "current_ms": current_total,
            "optimized_ms": optimized_total,
            "improvement_ms": improvement,
            "improvement_percent": (improvement / current_total) * 100,
            "user_experience": quality,
            "optimizations_applied": optimizations_applied
        }

# Analyze current latency
print("⚡ Voice Pipeline Latency Analysis")
print("=" * 50)

optimizer = LatencyOptimizer()
analysis = optimizer.analyze_current_latency()

print(f"\n📊 LATENCY BREAKDOWN:")
print(f"{'Component':<25} | {'Current':<8} | {'Optimized':<10} | {'Improvement':<12}")
print("-" * 70)

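for component, current_ms in analysis['breakdown'].items():
    optimized_ms = analysis['optimized_breakdown'][component]
    improvement = current_ms - optimized_ms
    improvement_pct = (improvement / current_ms) * 100 if current_ms > 0 else 0
    
    # Attach units before padding so each cell matches the header column width
    name = component.replace('_', ' ').title()
    current_cell = f"{current_ms}ms"
    optimized_cell = f"{optimized_ms}ms"
    pct_cell = f"{improvement_pct:.1f}%"
    print(f"{name:<25} | {current_cell:<8} | {optimized_cell:<10} | {pct_cell:<12}")

print("-" * 70)
total_current = f"{analysis['baseline_ms']}ms"
total_optimized = f"{analysis['optimized_ms']}ms"
total_pct = f"{analysis['improvement_percent']:.1f}%"
print(f"{'TOTAL':<25} | {total_current:<8} | {total_optimized:<10} | {total_pct:<12}")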

# Show target performance
print(f"\n🎯 PERFORMANCE TARGETS:")
print(f"   Current: {analysis['baseline_ms']}ms ({analysis['baseline_ms']/1000:.1f}s)")
print(f"   Optimized: {analysis['optimized_ms']}ms ({analysis['optimized_ms']/1000:.1f}s)")
print(f"   Target: <2000ms (<2s) for good user experience")

if analysis['optimized_ms'] < 2000:
    print(f"   ✅ Target achieved with optimizations!")
else:
    print(f"   ⚠️ Additional optimizations needed")

# Show top recommendations
recommendations = optimizer.get_optimization_recommendations()
print(f"\n🚀 TOP OPTIMIZATION OPPORTUNITIES:")

for i, rec in enumerate(recommendations[:3], 1):
    print(f"\n   {i}. {rec['component']} ({rec['improvement']} improvement)")
    print(f"      Method: {rec['method']}")
    print(f"      Impact: {rec['impact']}, Difficulty: {rec['difficulty']}")

# Simulate applying all optimizations
all_components = list(optimizer.latency_breakdown.keys())
simulation = optimizer.simulate_optimization_impact(all_components)

print(f"\n🎉 WITH ALL OPTIMIZATIONS:")
print(f"   End-to-end latency: {simulation['optimized_ms']}ms ({simulation['optimized_ms']/1000:.1f}s)")
print(f"   Improvement: {simulation['improvement_percent']:.1f}%")
print(f"   User experience: {simulation['user_experience']}")
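
Percentages alone can hide where the time actually goes, so it can help to rank components by absolute milliseconds saved. The sketch below reuses only the four figures listed in `get_optimization_recommendations()`; `STAGES` and `rank_by_savings` are hypothetical names for illustration, and the totals inside `LatencyOptimizer` may differ if its breakdown tracks additional stages.

```python
from typing import Dict, List, Tuple

# (current_ms, optimized_ms) per component, copied from the recommendations list
STAGES: Dict[str, Tuple[int, int]] = {
    "Speech-to-Text": (2000, 500),
    "LangGraph Processing": (1500, 800),
    "Text-to-Speech": (500, 75),
    "Network": (100, 50),
}

def rank_by_savings(stages: Dict[str, Tuple[int, int]]) -> List[Tuple[str, int, float]]:
    """Return (component, saved_ms, share_of_total_savings), largest saving first."""
    savings = {name: cur - opt for name, (cur, opt) in stages.items()}
    total_saved = sum(savings.values())
    ranked = sorted(savings.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (name, saved, saved / total_saved if total_saved else 0.0)
        for name, saved in ranked
    ]

for name, saved, share in rank_by_savings(STAGES):
    print(f"{name:<22} saves {saved:>5}ms ({share:.0%} of total savings)")
```

For these four components the sum drops from 4100ms to 1425ms, and more than half of that reduction comes from speech-to-text alone.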

## 8. Complete Voice Pipeline Demonstration

Let's put it all together and simulate the complete voice-enabled invoice processing flow.

In [None]:
async def voice_invoice_pipeline_demo():
    """
    Complete demonstration of voice-enabled invoice processing
    
    This simulates the full pipeline from user speech to agent response
    showing all the integration points and timing.
    """
    print("🎬 COMPLETE VOICE PIPELINE DEMONSTRATION")
    print("=" * 60)
    
    # Initialize all components
    config = VoiceConfig()
    audio_buffer = AudioBuffer()
    vad = VoiceActivityDetector(config)
    whisper = WhisperSTT("base")
    bridge = VoiceToGraphBridge(None, {"url": OLLAMA_URL, "model": MODEL})
    tts = ElevenLabsTTS("demo_key", "Rachel")
    conversation = ConversationManager(config)
    
    print(f"\n🔧 All components initialized")
    
    # Set invoice context
    bridge.current_invoice_context = {
        "id": "INV-001",
        "vendor": "TechSupplies Corp",
        "total": 1250.00
    }
    
    print(f"📋 Invoice context: {bridge.current_invoice_context['id']}")
    
    # Simulate complete interaction
    print(f"\n" + "=" * 60)
    print(f"🎭 SIMULATING VOICE INTERACTION")
    print(f"=" * 60)
    
    # Step 1: User starts speaking
    print(f"\n1️⃣ USER STARTS SPEAKING")
    conversation_action = conversation.handle_user_speech_start()
    print(f"   Action: {conversation_action['action']}")
    
    # Step 2: Simulate audio capture
    print(f"\n2️⃣ AUDIO CAPTURE & VAD")
    for i in range(5):  # Simulate 5 audio chunks
        mock_audio_chunk = f"audio_chunk_{i}".encode()
        vad_result = vad.process_chunk(mock_audio_chunk)
        
        if vad_result['speech_started']:
            print(f"   🎤 Speech detected in chunk {i}")
        
        audio_buffer.add(mock_audio_chunk, 100)  # 100ms chunks
        await asyncio.sleep(0.05)  # Real-time simulation
    
    print(f"   📊 Captured {audio_buffer.total_duration}ms of audio")
    
    # Step 3: Detect end of speech
    print(f"\n3️⃣ SILENCE DETECTION")
    silence_action = conversation.handle_silence(1200)
    print(f"   Action: {silence_action['action']}")
    
    if silence_action['action'] == 'process_speech':
        # Step 4: Speech-to-text
        print(f"\n4️⃣ SPEECH-TO-TEXT (WHISPER)")
        start_time = time.time()
        
        transcription_result = await whisper.transcribe_audio_buffer(audio_buffer)
        stt_time = time.time() - start_time
        
        print(f"   📝 Transcription: '{transcription_result['text']}'")
        print(f"   🎯 Confidence: {transcription_result['confidence']:.2f}")
        print(f"   ⏱️ Processing time: {stt_time:.2f}s")
        
        # Step 5: LangGraph processing
        print(f"\n5️⃣ LANGGRAPH AGENT PROCESSING")
        start_time = time.time()
        
        agent_result = await bridge.process_voice_command(
            transcription_result['text']
        )
        
        agent_time = time.time() - start_time
        
        print(f"   🤖 Agent response: '{agent_result['voice_response']}'")
        print(f"   🎯 Confidence: {agent_result['confidence']:.2f}")
        print(f"   ⏱️ Processing time: {agent_time:.2f}s")
        
        # Step 6: Text-to-speech
        print(f"\n6️⃣ TEXT-TO-SPEECH (ELEVENLABS)")
        start_time = time.time()
        
        # Optimize text for speech
        optimized_text = tts.optimize_for_conversation(agent_result['voice_response'])
        
        tts_result = await tts.synthesize_streaming(optimized_text)
        tts_time = time.time() - start_time
        
        print(f"   🔊 Audio generated: {tts_result['duration_seconds']:.1f}s duration")
        print(f"   🎵 Voice: {tts_result['voice_used']}")
        print(f"   ⏱️ Generation time: {tts_time:.2f}s")
        
        # Step 7: Play response
        print(f"\n7️⃣ AUDIO PLAYBACK")
        conversation.handle_agent_response_start(tts_result['duration_seconds'])
        
        # Simulate playback
        print(f"   🔊 Playing audio response...")
        await asyncio.sleep(tts_result['duration_seconds'])  # Simulate playback time
        
        conversation.handle_agent_response_complete()
        print(f"   ✅ Playback complete - ready for next user input")
        
        # Step 8: Calculate total latency
        print(f"\n8️⃣ PERFORMANCE SUMMARY")
        total_processing_time = stt_time + agent_time + tts_time
        
        print(f"   📊 Latency breakdown:")
        print(f"      STT (Whisper): {stt_time:.2f}s")
        print(f"      Agent (LangGraph): {agent_time:.2f}s")
        print(f"      TTS (ElevenLabs): {tts_time:.2f}s")
        print(f"      Total processing: {total_processing_time:.2f}s")
        
        # User experience assessment
        if total_processing_time < 2.0:
            experience = "Excellent - feels conversational"
        elif total_processing_time < 3.0:
            experience = "Good - acceptable for most users"
        elif total_processing_time < 5.0:
            experience = "Fair - users may notice delay"
        else:
            experience = "Poor - feels unnatural"
        
        print(f"      User experience: {experience}")
    
    print(f"\n" + "=" * 60)
    print(f"🎉 VOICE PIPELINE DEMONSTRATION COMPLETE")
    print(f"=" * 60)

# Run the complete demonstration
await voice_invoice_pipeline_demo()
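
The demo repeats the same start/stop timing pattern for each stage. One possible way to factor that out is a small context manager that records a duration per stage. This is a sketch only: `StageTimer` is a hypothetical helper, not part of the notebook's classes, and it uses `time.perf_counter()` because a monotonic clock is better suited to measuring intervals than `time.time()`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator

class StageTimer:
    """Collects wall-clock durations (in seconds) per named pipeline stage."""

    def __init__(self) -> None:
        self.durations: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raises
            self.durations[name] = time.perf_counter() - start

    @property
    def total(self) -> float:
        return sum(self.durations.values())

timer = StageTimer()
with timer.stage("stt"):
    time.sleep(0.01)   # stand-in for the speech-to-text call
with timer.stage("agent"):
    time.sleep(0.01)   # stand-in for the agent call

for name, seconds in timer.durations.items():
    print(f"{name}: {seconds:.3f}s")
print(f"total: {timer.total:.3f}s")
```

The same `with timer.stage(...)` block also works around an `await` inside a coroutine, so it could wrap each step of the pipeline above.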

## 9. Production Deployment Considerations

Building a demo is one thing; deploying a production voice system raises additional concerns around infrastructure, security, performance, reliability, and cost.

In [None]:
production_considerations = """
🏭 PRODUCTION DEPLOYMENT CHECKLIST
=====================================

📡 INFRASTRUCTURE:
   ✓ WebRTC media server (LiveKit/Agora/Twilio)
   ✓ Load balancers for voice processing servers
   ✓ CDN for audio content delivery
   ✓ Auto-scaling based on concurrent sessions
   ✓ Geographic distribution for latency

🔒 SECURITY & PRIVACY:
   ✓ Encryption in transit for audio streams (DTLS-SRTP)
   ✓ Audio data retention policies
   ✓ GDPR/CCPA compliance for voice data
   ✓ User consent for voice recording
   ✓ Secure API authentication

⚡ PERFORMANCE OPTIMIZATION:
   ✓ GPU acceleration for Whisper
   ✓ Model quantization for faster inference
   ✓ Response caching for common queries
   ✓ Streaming optimizations
   ✓ Circuit breakers for service failures

📊 MONITORING & ANALYTICS:
   ✓ Real-time latency monitoring
   ✓ Speech recognition accuracy tracking
   ✓ User satisfaction metrics
   ✓ Error rate dashboards
   ✓ Resource utilization alerts

🔧 RELIABILITY:
   ✓ Graceful degradation (fallback to text)
   ✓ Service health checks
   ✓ Automated failover
   ✓ Data backup and recovery
   ✓ Disaster recovery procedures

👥 USER EXPERIENCE:
   ✓ Onboarding and voice training
   ✓ Accessibility features
   ✓ Multi-language support
   ✓ Background noise handling
   ✓ User feedback collection

💰 COST OPTIMIZATION:
   ✓ Usage-based pricing models
   ✓ Resource scheduling (scale down off-hours)
   ✓ Model efficiency optimizations
   ✓ Bandwidth optimization
   ✓ Cost monitoring and alerts

🧪 TESTING STRATEGY:
   ✓ Automated voice quality testing
   ✓ Load testing with realistic audio
   ✓ Accent and language variety testing
   ✓ Noise robustness testing
   ✓ Integration testing across components
"""

print(production_considerations)

# Architecture alternatives
architecture_options = {
    "STT Options": {
        "OpenAI Whisper": {"pros": "Open source, high accuracy", "cons": "Higher latency"},
        "Deepgram API": {"pros": "Low latency, streaming", "cons": "Cost per minute"},
        "Google Speech": {"pros": "Robust, multi-language", "cons": "Vendor lock-in"},
        "Azure Speech": {"pros": "Enterprise features", "cons": "Complex pricing"}
    },
    "TTS Options": {
        "ElevenLabs": {"pros": "Natural voices, streaming", "cons": "Cost per character"},
        "Cartesia": {"pros": "Ultra-low latency", "cons": "Limited voices"},
        "Azure Neural": {"pros": "Enterprise grade", "cons": "Setup complexity"},
        "AWS Polly": {"pros": "AWS integration", "cons": "Voice quality"}
    },
    "WebRTC Platforms": {
        "LiveKit": {"pros": "Open source, feature-rich", "cons": "Self-hosting adds ops work (managed cloud available)"},
        "Agora": {"pros": "Global infrastructure", "cons": "Higher costs"},
        "Twilio": {"pros": "Simple integration", "cons": "Limited customization"},
        "Daily.co": {"pros": "Developer-friendly", "cons": "Smaller ecosystem"}
    }
}

print(f"\n🔧 ARCHITECTURE COMPONENT OPTIONS:")
print(f"=" * 50)

for category, options in architecture_options.items():
    print(f"\n{category}:")
    for option, details in options.items():
        print(f"   • {option}")
        print(f"     Pros: {details['pros']}")
        print(f"     Cons: {details['cons']}")

print(f"\n💡 RECOMMENDATION FOR PRODUCTION:")
print(f"   STT: Deepgram API (for low latency) + Whisper (for high accuracy)")
print(f"   TTS: ElevenLabs Turbo (for quality) + fallback to faster options")
print(f"   WebRTC: LiveKit (for control) or Daily.co (for simplicity)")
print(f"   LLM: Keep existing Ollama setup with caching optimizations")
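
The recommendation pairs a primary provider with a fallback. A minimal sketch of that pattern follows, assuming each provider can be wrapped as a plain callable that takes audio bytes and returns text. The providers here are stand-ins for illustration; no vendor SDK is called, and `transcribe_with_fallback` is a hypothetical helper.

```python
from typing import Callable, List, Optional, Tuple

Transcriber = Callable[[bytes], str]

def transcribe_with_fallback(
    audio: bytes,
    providers: List[Tuple[str, Transcriber]],
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Try providers in order; return (provider_name, text, errors)."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio), errors
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{name}: {exc}")
    # Every provider failed: the caller can degrade to the text interface
    return None, None, errors

# Stand-in providers (assumptions for this example only)
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("primary STT timed out")

def local_fallback(audio: bytes) -> str:
    return "show me invoice INV-001"

provider, text, errors = transcribe_with_fallback(
    b"\x00\x01", [("primary", flaky_primary), ("fallback", local_fallback)]
)
print(provider, "->", text, "| errors:", errors)
```

Returning the collected errors alongside the result keeps the failure visible to monitoring even when the fallback succeeds.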

## Key Learnings

### Voice Interface Complexity

1. **Multi-Component Integration**
   - Voice systems require coordination of 6+ components
   - Each component adds latency and potential failure points
   - Integration complexity grows quickly as features are added, because each new feature touches several components

2. **Latency is Critical**
   - Users expect <2 second response times for natural conversation
   - Each 500ms of delay significantly degrades user experience
   - Optimization must happen at every layer of the stack

3. **State Management Challenges**
   - Turn-taking requires sophisticated state machines
   - Interruption handling is complex but essential
   - Conversation context must be maintained across turns
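
To make the turn-taking point concrete, here is a minimal sketch of a state machine with interruption (barge-in) handling. The states, events, and transition table are assumptions chosen for illustration and are not the notebook's `ConversationManager`.

```python
from enum import Enum, auto

class TurnState(Enum):
    LISTENING = auto()   # waiting for or receiving user speech
    PROCESSING = auto()  # STT and the agent are running
    SPEAKING = auto()    # TTS audio is playing

class TurnEvent(Enum):
    SILENCE_TIMEOUT = auto()  # user stopped talking
    RESPONSE_READY = auto()   # agent audio is ready to play
    PLAYBACK_DONE = auto()    # agent finished speaking
    USER_SPEECH = auto()      # user started talking (possible interruption)

# Allowed transitions; any pair not listed leaves the state unchanged
TRANSITIONS = {
    (TurnState.LISTENING, TurnEvent.SILENCE_TIMEOUT): TurnState.PROCESSING,
    (TurnState.PROCESSING, TurnEvent.RESPONSE_READY): TurnState.SPEAKING,
    (TurnState.SPEAKING, TurnEvent.PLAYBACK_DONE): TurnState.LISTENING,
    # Barge-in: user speech while the agent talks or thinks returns to listening
    (TurnState.SPEAKING, TurnEvent.USER_SPEECH): TurnState.LISTENING,
    (TurnState.PROCESSING, TurnEvent.USER_SPEECH): TurnState.LISTENING,
}

def next_state(state: TurnState, event: TurnEvent) -> TurnState:
    return TRANSITIONS.get((state, event), state)

state = TurnState.LISTENING
for event in [TurnEvent.SILENCE_TIMEOUT, TurnEvent.RESPONSE_READY, TurnEvent.USER_SPEECH]:
    state = next_state(state, event)
    print(f"{event.name:<16} -> {state.name}")
```

Keeping the transitions in a table makes it easy to review which interruptions are allowed and to add states later without rewriting branching logic.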

### Architecture Patterns

1. **Modular Design**
   - Each component should be swappable
   - Abstract interfaces allow technology upgrades
   - Microservices approach enables scaling individual components

2. **Streaming Everything**
   - Stream audio capture for real-time VAD
   - Stream TTS generation for perceived performance
   - Stream LLM responses when possible

3. **Graceful Degradation**
   - Fall back to text interface if voice fails
   - Continue with partial results if some components fail
   - Provide user feedback about system state
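
Streaming LLM output and streaming TTS combine naturally: each complete sentence can be handed to TTS as soon as it appears, instead of waiting for the full response. The sketch below shows one way to cut a token stream into sentences. `sentences_from_tokens` is a hypothetical helper, and the token list is made up for the example.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (so "1250.00" is not split)
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream

tokens = [
    "Invoice INV", "-001 is from Tech", "Supplies Corp. ",
    "The total is 1250", ".00 dollars. ", "Anything else?",
]
for sentence in sentences_from_tokens(tokens):
    print("-> TTS:", sentence)
```

This simple rule would also split after an abbreviation such as "Corp." in the middle of a sentence, so a production system would likely need a more careful boundary detector.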

### Production Considerations

1. **Privacy First**
   - Voice data is highly sensitive
   - Implement data minimization principles
   - Obtain clear user consent and define data retention policies

2. **Monitor Everything**
   - Track latency at each component
   - Monitor speech recognition accuracy
   - Measure user satisfaction and completion rates

3. **Plan for Scale**
   - Voice processing is compute-intensive
   - WebRTC requires specialized infrastructure
   - Geographic distribution is essential for global deployment
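
For "track latency at each component", averages tend to hide the slow tail that users actually notice, so percentiles are a common choice. The sketch below is one possible in-memory tracker built on the standard library; `LatencyTracker` and the sample values are hypothetical, and a production system would more likely export these measurements to a metrics backend.

```python
import statistics
from collections import defaultdict
from typing import Dict, List

class LatencyTracker:
    """Keeps per-component latency samples and reports percentiles."""

    def __init__(self) -> None:
        self._samples: Dict[str, List[float]] = defaultdict(list)

    def record(self, component: str, latency_ms: float) -> None:
        self._samples[component].append(latency_ms)

    def summary(self, component: str) -> Dict[str, float]:
        samples = self._samples.get(component, [])
        if len(samples) < 2:
            raise ValueError("need at least two samples for percentiles")
        # quantiles(n=20) returns 19 cut points: index 9 is p50, index 18 is p95
        cuts = statistics.quantiles(samples, n=20, method="inclusive")
        return {
            "count": len(samples),
            "p50_ms": cuts[9],
            "p95_ms": cuts[18],
            "max_ms": max(samples),
        }

tracker = LatencyTracker()
for latency in [400, 450, 500, 550, 2000]:  # made-up samples with one slow outlier
    tracker.record("stt", latency)

print(tracker.summary("stt"))
```

With these made-up samples the median stays at 500ms while the 95th percentile is pulled far higher by the single slow request, which is the kind of gap worth alerting on.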

### Next Steps for Implementation

1. **Start Simple**: Begin with push-to-talk rather than continuous listening
2. **Optimize Incrementally**: Focus on one latency bottleneck at a time
3. **Test Extensively**: Voice interfaces have many edge cases
4. **Gather Feedback**: User testing reveals real-world usage patterns
5. **Plan for Costs**: Voice APIs can be expensive at scale

### Integration with Existing Agent

The beauty of this architecture is that your existing LangGraph invoice agent requires minimal changes:
- Add voice-optimized response formatting
- Handle conversation context in state
- Optimize prompts for voice queries

The core reasoning and tool calling remain unchanged.

Voice adds a powerful interface layer while preserving all your existing agent capabilities.
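
As an illustration of voice-optimized response formatting, the sketch below strips visual markup, spells out dollar amounts, and keeps only the first sentences of a text response. `format_for_voice` is a hypothetical helper written for this example; it is not the `optimize_for_conversation` method used earlier in the notebook, and the rules are assumptions about what tends to read well aloud.

```python
import re

def format_for_voice(text: str, max_sentences: int = 2) -> str:
    """Strip visual markup and shorten a text response for speech."""
    text = re.sub(r"[*`#]+", "", text)                   # drop markdown emphasis and heading marks
    text = re.sub(r"^\s*[-•]\s+", "", text, flags=re.M)  # drop bullet markers at line starts
    text = re.sub(r"\$(\d[\d,]*(?:\.\d+)?)", r"\1 dollars", text)  # "$5" becomes "5 dollars"
    text = re.sub(r"\s+", " ", text).strip()             # collapse newlines and extra spaces
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:max_sentences])

agent_text = (
    "**Invoice INV-001** is from TechSupplies Corp. "
    "The total is $1,250.00. "
    "Payment terms are listed in the attached PDF. "
    "See section 3."
)
print(format_for_voice(agent_text))
```

Truncating to a couple of sentences is a deliberate trade-off: spoken answers that run long are harder to follow than the same text on screen, and the user can always ask a follow-up question.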