# Voice Assistant with Speech-to-Text and LLM Integration

This notebook demonstrates how to build a complete voice assistant that:
1. Records speech input and transcribes it using Whisper
2. Processes the text with a Large Language Model
3. Responds with synthesized speech using Bark
4. Provides a web interface using Streamlit

## Features
- Speech recognition from short microphone recordings
- Multilingual transcription via Whisper (the LLM and voice preset used here are English)
- Contextual conversations
- Natural voice synthesis
- Web-based interface

## 1. Setup and Installation

In [None]:
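# Install required packages (Whisper needs the ffmpeg command-line tool to read audio files)
# The Whisper package on PyPI is `openai-whisper`; Suno's Bark is installed from GitHub
# because the PyPI package named `bark` is an unrelated project
!pip install -q openai-whisper git+https://github.com/suno-ai/bark.git transformers torch torchaudio sounddevice soundfile streamlit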

In [None]:
import whisper
import torch
import sounddevice as sd
import soundfile as sf
import numpy as np
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from bark import SAMPLE_RATE, generate_audio, preload_models
import threading
import queue
import time
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Initialize Models

In [None]:
class VoiceAssistant:
    def __init__(self):
        self.is_listening = False
        self.conversation_history = []
        self.audio_queue = queue.Queue()
        
        print("Loading Whisper model...")
        self.whisper_model = whisper.load_model("base")
        
        print("Loading LLM...")
        self.llm_pipeline = pipeline(
            "text-generation",
            model="microsoft/DialoGPT-medium",
            tokenizer="microsoft/DialoGPT-medium",
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device=0 if torch.cuda.is_available() else -1
        )
        
        print("Loading Bark TTS models...")
        preload_models()
        
        print("Voice Assistant initialized!")
    
    def record_audio(self, duration=5, sample_rate=16000):
        """Record audio from microphone."""
        print(f"Recording for {duration} seconds...")
        audio_data = sd.rec(int(duration * sample_rate), 
                           samplerate=sample_rate, 
                           channels=1, 
                           dtype=np.float32)
        sd.wait()
        return audio_data.flatten()
    
    def transcribe_audio(self, audio_data, sample_rate=16000):
        """Transcribe audio to text using Whisper."""
        # Save temporary audio file
        temp_file = "temp_audio.wav"
        sf.write(temp_file, audio_data, sample_rate)
        
        # Transcribe
        result = self.whisper_model.transcribe(temp_file)
        return result["text"].strip()
    
    def generate_response(self, user_input):
        """Generate response using LLM."""
        # Add context from conversation history
        # Each exchange is two entries (user + assistant), so keep the last 6 entries
        context = "\n".join(self.conversation_history[-6:])  # Last 3 exchanges
        if context:
            prompt = f"{context}\nUser: {user_input}\nAssistant:"
        else:
            prompt = f"User: {user_input}\nAssistant:"
        
        # Generate response
        # max_new_tokens limits the reply itself; max_length would count prompt tokens,
        # and len(prompt) measures characters rather than tokens
        response = self.llm_pipeline(
            prompt,
            max_new_tokens=100,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True,
            pad_token_id=self.llm_pipeline.tokenizer.eos_token_id
        )
        
        # Extract assistant response and cut off any follow-on "User:" turn the model invents
        full_response = response[0]['generated_text']
        assistant_response = full_response.split("Assistant:")[-1].split("User:")[0].strip()
        
        # Update conversation history
        self.conversation_history.append(f"User: {user_input}")
        self.conversation_history.append(f"Assistant: {assistant_response}")
        
        return assistant_response
    
    def text_to_speech(self, text, voice_preset="v2/en_speaker_6"):
        """Convert text to speech using Bark."""
        print(f"Generating speech: {text[:50]}...")
        audio_array = generate_audio(text, history_prompt=voice_preset)
        return audio_array
    
    def play_audio(self, audio_array):
        """Play audio array."""
        sd.play(audio_array, SAMPLE_RATE)
        sd.wait()
    
    def process_voice_input(self, duration=5):
        """Complete voice processing pipeline."""
        try:
            # Record audio
            audio_data = self.record_audio(duration)
            
            # Transcribe to text
            user_text = self.transcribe_audio(audio_data)
            print(f"You said: {user_text}")
            
            if not user_text.strip():
                print("No speech detected")
                return None, None
            
            # Generate response
            response_text = self.generate_response(user_text)
            print(f"Assistant: {response_text}")
            
            # Convert to speech
            response_audio = self.text_to_speech(response_text)
            
            # Play response
            self.play_audio(response_audio)
            
            return user_text, response_text
            
        except Exception as e:
            print(f"Error in voice processing: {e}")
            return None, None

# Initialize the voice assistant
assistant = VoiceAssistant()
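
The context handling in `generate_response` can be pulled out into a small helper, which makes the history window easy to check without loading any models. The sketch below uses a hypothetical `build_prompt` function that is not part of the class above. It assumes the history is a flat list of alternating `User: ...` and `Assistant: ...` strings, as in `conversation_history`, and counts one exchange as two entries.

```python
def build_prompt(history, user_input, max_exchanges=3):
    """Build a chat prompt from the most recent exchanges.

    `history` is a flat list of "User: ..." / "Assistant: ..." strings,
    so one exchange (a user turn plus its reply) spans two entries.
    """
    if max_exchanges > 0:
        recent = history[-2 * max_exchanges:]
    else:
        recent = []  # history[-0:] would return the whole list, so guard it
    lines = list(recent)
    lines.append(f"User: {user_input}")
    lines.append("Assistant:")
    return "\n".join(lines)


history = [
    "User: Hi",
    "Assistant: Hello!",
    "User: What is Whisper?",
    "Assistant: A speech recognition model.",
]
prompt = build_prompt(history, "And Bark?", max_exchanges=1)
print(prompt)
```

Keeping the window in exchanges rather than raw entries means the prompt never starts with an assistant reply whose question has been cut off.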

## 3. Test Individual Components

In [None]:
# Test speech-to-text
print("Testing Speech-to-Text...")
print("Say something when prompted!")

# Uncomment to test with actual microphone input
# audio = assistant.record_audio(duration=3)
# transcription = assistant.transcribe_audio(audio)
# print(f"Transcribed: {transcription}")

# For demo purposes, simulate with text
test_input = "Hello, how are you today?"
print(f"Simulated input: {test_input}")
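
`record_audio` always captures a fixed number of seconds, and an empty recording is only noticed after Whisper has already run. A rough energy check could filter out silent clips beforehand. The following is a minimal sketch of such a check, not a real voice activity detector: `frame_rms` and `contains_speech` are hypothetical helpers, and the threshold and frame counts are assumed values that would need tuning for an actual microphone.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each consecutive, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def contains_speech(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.01, min_voiced_frames=5):
    """Heuristic check: are enough frames louder than `threshold`?

    The threshold and frame counts are illustrative guesses, not
    calibrated values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    voiced = sum(1 for e in frame_rms(samples, frame_len) if e > threshold)
    return voiced >= min_voiced_frames


# Synthetic check: one second of silence vs. one second of a 440 Hz tone
silence = [0.0] * 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
print(contains_speech(silence), contains_speech(tone))
```

The helpers accept any sequence of floats, so the flattened array returned by `record_audio` should work as input. For long clips a vectorized NumPy version would be considerably faster than this pure-Python loop.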

In [None]:
# Test LLM response generation
print("Testing LLM Response Generation...")
response = assistant.generate_response(test_input)
print(f"Generated response: {response}")

In [None]:
# Test text-to-speech
print("Testing Text-to-Speech...")
speech_audio = assistant.text_to_speech(response)
print(f"Generated speech audio with shape: {speech_audio.shape}")

# Play the audio (uncomment to hear)
# assistant.play_audio(speech_audio)
print("TTS test completed (audio playback commented out)")

## 4. Interactive Voice Session

In [None]:
def interactive_voice_session(num_interactions=3):
    """Run an interactive voice session."""
    print("\n=== Interactive Voice Assistant Session ===")
    print("Note: Microphone and audio playback are disabled in this demo")
    print("Simulating voice interactions...\n")
    
    # Simulate conversation with text inputs
    simulated_inputs = [
        "What's the weather like?",
        "Tell me a joke",
        "What can you help me with?"
    ]
    
    for i, simulated_input in enumerate(simulated_inputs[:num_interactions]):
        print(f"--- Interaction {i+1} ---")
        print(f"Simulated user input: {simulated_input}")
        
        # Generate and display response
        response = assistant.generate_response(simulated_input)
        print(f"Assistant response: {response}")
        
        # Simulate TTS (without actual audio playback)
        print("[Speech synthesis completed]")
        print()
    
    print("Session completed!")
    
    # Display conversation history
    print("\n=== Conversation History ===")
    for entry in assistant.conversation_history:
        print(entry)

# Run interactive session
interactive_voice_session()

## 5. Advanced Features

In [None]:
class AdvancedVoiceAssistant(VoiceAssistant):
    def __init__(self):
        super().__init__()
        self.wake_word = "assistant"
        self.is_wake_word_active = False
        self.user_preferences = {
            "voice": "v2/en_speaker_6",
            "language": "en",
            "response_length": "medium"
        }
    
    def detect_wake_word(self, text):
        """Detect wake word in transcribed text."""
        return self.wake_word.lower() in text.lower()
    
    def extract_intent(self, text):
        """Extract user intent from text."""
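        # Remove the wake word first: "assistant" contains "assist", which would
        # otherwise make every wake-word utterance match the 'help' intent
        text_lower = text.lower().replace(self.wake_word.lower(), "")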
        
        if any(word in text_lower for word in ['weather', 'temperature', 'forecast']):
            return 'weather'
        elif any(word in text_lower for word in ['joke', 'funny', 'laugh']):
            return 'joke'
        elif any(word in text_lower for word in ['time', 'clock', 'hour']):
            return 'time'
        elif any(word in text_lower for word in ['help', 'assist', 'support']):
            return 'help'
        else:
            return 'general'
    
    def generate_intent_based_response(self, user_input, intent):
        """Generate response based on detected intent."""
        if intent == 'weather':
            return "I don't have access to real-time weather data, but I'd recommend checking a weather app or website for current conditions."
        elif intent == 'joke':
            jokes = [
                "Why don't scientists trust atoms? Because they make up everything!",
                "What do you call a fake noodle? An impasta!",
                "Why did the scarecrow win an award? He was outstanding in his field!"
            ]
            import random
            return random.choice(jokes)
        elif intent == 'time':
            from datetime import datetime
            current_time = datetime.now().strftime("%I:%M %p")
            return f"The current time is {current_time}."
        elif intent == 'help':
            return "I'm a voice assistant that can help with general questions, tell jokes, provide the time, and have conversations. What would you like to know?"
        else:
            return self.generate_response(user_input)
    
    def process_advanced_voice_input(self, duration=5):
        """Advanced voice processing with intent detection."""
        # Simulated inputs for the demo; only the first contains the wake word,
        # so the other two are ignored
        simulated_inputs = [
            "Assistant, what's the weather like?",
            "Tell me a joke please",
            "What time is it?"
        ]
        
        for user_text in simulated_inputs:
            print(f"\nProcessing: {user_text}")
            
            # Check for wake word
            if self.detect_wake_word(user_text):
                print("Wake word detected!")
                
                # Extract intent
                intent = self.extract_intent(user_text)
                print(f"Detected intent: {intent}")
                
                # Generate response based on intent
                response = self.generate_intent_based_response(user_text, intent)
                print(f"Response: {response}")
                
                # Simulate TTS
                print("[Speech synthesis completed]")
            else:
                print("Wake word not detected, ignoring...")

# Test advanced features
advanced_assistant = AdvancedVoiceAssistant()
advanced_assistant.process_advanced_voice_input()
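
`extract_intent` and `detect_wake_word` both rely on substring checks, so a keyword such as `time` also matches inside "sometimes". One possible refinement is to match whole words with the standard `re` module and to strip a leading wake word before looking for intents. In this sketch the keyword lists are copied from the class above, while `INTENT_KEYWORDS`, `match_intent` and `strip_wake_word` are hypothetical names introduced for illustration.

```python
import re

# Keyword lists copied from extract_intent; dictionary order decides ties
INTENT_KEYWORDS = {
    "weather": ["weather", "temperature", "forecast"],
    "joke": ["joke", "funny", "laugh"],
    "time": ["time", "clock", "hour"],
    "help": ["help", "assist", "support"],
}


def match_intent(text):
    """Return the first intent whose keyword appears as a whole word."""
    for intent, keywords in INTENT_KEYWORDS.items():
        pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return intent
    return "general"


def strip_wake_word(text, wake_word="assistant"):
    """Return the command after a leading wake word, or None if absent."""
    pattern = r"^\s*" + re.escape(wake_word) + r"\b[\s,.:!]*"
    match = re.match(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return None
    return text[match.end():]


command = strip_wake_word("Assistant, what's the weather like?")
print(command, "->", match_intent(command))
print(match_intent("Sometimes I wonder"))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "jokes" or "hours"; these would have to be listed explicitly or handled with a stemmer.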

## 6. Streamlit Web Interface

In [None]:
# Create Streamlit app file
streamlit_app_code = '''
import streamlit as st
import whisper
from bark import generate_audio, SAMPLE_RATE
from transformers import pipeline
import torch
import soundfile as sf
import numpy as np
import io

@st.cache_resource
def load_models():
    """Load and cache models."""
    whisper_model = whisper.load_model("base")
    llm_pipeline = pipeline(
        "text-generation",
        model="microsoft/DialoGPT-medium",
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        device=0 if torch.cuda.is_available() else -1
    )
    return whisper_model, llm_pipeline

def main():
    st.title("🎤 Voice Assistant with GenAI")
    st.write("A complete voice assistant powered by Whisper, LLM, and Bark")
    
    # Load models
    with st.spinner("Loading AI models..."):
        whisper_model, llm_pipeline = load_models()
    
    # Initialize session state
    if "conversation" not in st.session_state:
        st.session_state.conversation = []
    
    # Text input option
    st.subheader("💬 Text Conversation")
    user_input = st.text_input("Type your message:")
    
    if st.button("Send Message"):
        if user_input:
            # Generate response
            with st.spinner("Generating response..."):
                response = llm_pipeline(
                    f"User: {user_input}\\nAssistant:",
                    max_new_tokens=100,
                    temperature=0.7,
                    do_sample=True
                )[0]['generated_text']
                
                assistant_response = response.split("Assistant:")[-1].strip()
            
            # Add to conversation
            st.session_state.conversation.append(("You", user_input))
            st.session_state.conversation.append(("Assistant", assistant_response))
            
            # Generate and provide audio
            with st.spinner("Generating speech..."):
                audio_array = generate_audio(assistant_response)
                
                # Convert to audio file
                audio_buffer = io.BytesIO()
                sf.write(audio_buffer, audio_array, SAMPLE_RATE, format='WAV')
                audio_buffer.seek(0)
                
                st.audio(audio_buffer.read(), format='audio/wav')
    
    # Display conversation history
    if st.session_state.conversation:
        st.subheader("📝 Conversation History")
        for speaker, message in st.session_state.conversation:
            if speaker == "You":
                st.write(f"**{speaker}:** {message}")
            else:
                st.write(f"*{speaker}:* {message}")
    
    # Audio upload option
    st.subheader("🎵 Audio Upload")
    uploaded_audio = st.file_uploader(
        "Upload an audio file", 
        type=['wav', 'mp3', 'm4a']
    )
    
    if uploaded_audio:
        if st.button("Process Audio"):
            with st.spinner("Transcribing audio..."):
                # Save uploaded file temporarily
                with open("temp_audio.wav", "wb") as f:
                    f.write(uploaded_audio.read())
                
                # Transcribe
                result = whisper_model.transcribe("temp_audio.wav")
                transcription = result["text"]
                
                st.write(f"**Transcription:** {transcription}")
                
                # Generate response
                response = llm_pipeline(
                    f"User: {transcription}\\nAssistant:",
                    max_new_tokens=100,
                    temperature=0.7,
                    do_sample=True
                )[0]['generated_text']
                
                assistant_response = response.split("Assistant:")[-1].strip()
                st.write(f"**Response:** {assistant_response}")
    
    # Clear conversation
    if st.button("Clear Conversation"):
        st.session_state.conversation = []
        st.rerun()

if __name__ == "__main__":
    main()
'''

# Save Streamlit app
with open("voice_assistant_app.py", "w") as f:
    f.write(streamlit_app_code)

print("Streamlit app saved as 'voice_assistant_app.py'")
print("To run: streamlit run voice_assistant_app.py")
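
The app above writes every upload to a fixed `temp_audio.wav`, whatever its actual type, so simultaneous sessions would overwrite each other's file. A possible alternative, sketched here with only the standard library, gives each upload its own temporary file and keeps the original extension. `save_upload` is a hypothetical helper; in the Streamlit app its arguments would presumably come from `uploaded_audio.getvalue()` and `uploaded_audio.name`.

```python
import os
import tempfile


def save_upload(data: bytes, filename: str) -> str:
    """Write uploaded bytes to a unique temp file and return its path.

    The original extension is kept so tools that look at the file name
    still see the right type. The caller is responsible for deleting it.
    """
    suffix = os.path.splitext(filename)[1].lower()
    # delete=False so the file can be reopened by path after it is closed
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        return tmp.name


path = save_upload(b"RIFF....WAVE", "Recording.MP3")
try:
    print(os.path.basename(path).endswith(".mp3"), os.path.getsize(path))
finally:
    os.remove(path)  # clean up once transcription is done
```

Wrapping the transcription call in the same `try`/`finally` pattern keeps temporary files from piling up when Whisper raises an error.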

## 7. Performance Optimization

In [None]:
def optimize_voice_assistant():
    """Optimization tips and techniques for voice assistant."""
    
    optimization_tips = """
    === VOICE ASSISTANT OPTIMIZATION ===
    
    1. MODEL OPTIMIZATION:
       - Use smaller Whisper models (tiny, base) for faster inference
       - Implement model quantization for reduced memory usage
       - Cache models in memory to avoid reloading
       - Use GPU acceleration when available
    
    2. AUDIO PROCESSING:
       - Implement VAD (Voice Activity Detection) to reduce processing
       - Use streaming audio processing for real-time responses
       - Optimize audio sample rates (16kHz is sufficient for speech)
       - Implement noise reduction preprocessing
    
    3. RESPONSE GENERATION:
       - Use smaller, faster LLMs for quicker responses
       - Implement response caching for common queries
       - Limit response length to reduce TTS time
       - Use async processing for multiple components
    
    4. TTS OPTIMIZATION:
       - Pre-load Bark models to reduce initialization time
       - Use Bark's smaller model variants (set SUNO_USE_SMALL_MODELS=True before importing bark) for faster generation
       - Implement streaming TTS for long responses
       - Cache common responses as audio files
    
    5. SYSTEM OPTIMIZATION:
       - Use multithreading for parallel processing
       - Implement proper error handling and recovery
       - Monitor memory usage and implement cleanup
       - Use connection pooling for API calls
    """
    
    return optimization_tips

print(optimize_voice_assistant())
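
The tips on async processing and multithreading can be illustrated with a chain of worker threads connected by queues, so that transcription of the next clip can start while the previous reply is still being generated or synthesized. This is a minimal sketch under stated assumptions: `run_pipeline` and `_stage` are hypothetical names, the three stage functions are stand-ins for the Whisper, LLM and Bark calls, and the real speed-up depends on how much time the models spend outside Python's global interpreter lock.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the next stage to shut down


def _stage(func, inbox, outbox):
    """Apply `func` to items from `inbox` and forward results to `outbox`."""
    try:
        while True:
            item = inbox.get()
            if item is _STOP:
                break
            outbox.put(func(item))
    finally:
        outbox.put(_STOP)  # always propagate shutdown, even after an error


def run_pipeline(clips, transcribe, respond, speak):
    """Run three stages in separate threads; results keep input order."""
    q_audio, q_text, q_reply, q_out = (queue.Queue() for _ in range(4))
    stages = [(transcribe, q_audio, q_text),
              (respond, q_text, q_reply),
              (speak, q_reply, q_out)]
    threads = [threading.Thread(target=_stage, args=s, daemon=True)
               for s in stages]
    for t in threads:
        t.start()
    for clip in clips:
        q_audio.put(clip)
    q_audio.put(_STOP)

    results = []
    while (item := q_out.get()) is not _STOP:
        results.append(item)
    for t in threads:
        t.join()
    return results


# Stand-in stage functions; the real ones would call Whisper, the LLM and Bark
replies = run_pipeline(
    ["clip-1", "clip-2"],
    transcribe=lambda clip: f"text({clip})",
    respond=lambda text: f"reply({text})",
    speak=lambda reply: f"audio({reply})",
)
print(replies)
```

Each stage has a single worker and the queues are first-in, first-out, so replies come back in the order the clips were submitted.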

## 8. Memory Cleanup and Resource Management

In [None]:
def cleanup_resources():
    """Clean up memory and resources."""
    import gc
    
    # Drop the global references. globals().pop is used because `del name`
    # inside a function makes the name local and raises UnboundLocalError
    globals().pop('assistant', None)
    globals().pop('advanced_assistant', None)
    
    # Force garbage collection
    gc.collect()
    
    # Clear CUDA cache if available
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
    
    print("Resources cleaned up")

# Clean up
cleanup_resources()

print("Voice Assistant notebook completed!")

## Conclusion

This notebook demonstrated a complete voice assistant implementation featuring:

### Key Components:
1. **Speech Recognition** - Whisper for accurate transcription
2. **Language Understanding** - LLM for contextual responses
3. **Speech Synthesis** - Bark for natural voice generation
4. **Web Interface** - Streamlit for user interaction

### Advanced Features:
- Wake word detection
- Intent recognition
- Conversation history
- Multi-modal input (text and audio)
- End-to-end processing from recorded audio to spoken reply

### Applications:
- **Personal Assistant**: Schedule management, reminders
- **Customer Service**: Automated support systems
- **Accessibility**: Voice-controlled interfaces
- **Education**: Interactive learning companions
- **Smart Home**: Voice-controlled IoT devices

### Next Steps:
1. **Integration**: Connect with external APIs (weather, calendar)
2. **Personalization**: User-specific voice and preferences
3. **Multilingual**: Support for multiple languages
4. **Mobile**: Deploy on mobile platforms
5. **IoT**: Integration with smart home devices

**Note**: This implementation provides a foundation for voice assistant development. For production use, consider privacy, security, and performance optimizations.