# Microphone Test, Voice Recognition, and Voice Synthesis App

This Jupyter notebook demonstrates a simple application that tests microphone input, recognizes speech with Vosk, and synthesizes speech with Piper. It is a basic example of how to integrate these components into a real-time voice interaction application within a notebook environment.

**Important note: this notebook is intended to run only in a local environment.**

## Setup

To run this notebook, set up the environment and download the necessary models as described in the '*Local Setup Instructions*' section of the README.md file.

## Running the Application

1.  Execute all the code cells sequentially from top to bottom.
2.  After running the last code cell, you will see a "Start App" button and an output area.
3.  Click the "Start App" button to begin recording from your microphone.
4.  Speak into your microphone. The recognized text will appear in the output area.
5.  The application will also synthesize the recognized text using the Piper voice.
6.  To stop the application, say the termination keyword (default is "goodbye") or click the "Stop App" button that appears after starting.

### Piper Voice Selection
For this application, we download a voice named 'Amy'. Piper offers many other voices that you can use instead.

In [None]:
# # List the available Piper voices (uncomment the next line to run it)
# !python -m piper.download_voices

# Download a specific Piper voice
!python -m piper.download_voices en_US-amy-medium

# Models are stored in the Open Neural Network Exchange (ONNX) format
piper_voice_name = "en_US-amy-medium.onnx"

### Libraries Import

In [None]:
import json
import time
import sys
from queue import Queue
from threading import Thread, Event, Lock

import ipywidgets as widgets
from IPython.display import display, HTML, Javascript
import webrtcvad
from vosk import Model, KaldiRecognizer
from piper.voice import PiperVoice
import sounddevice as sd
import numpy as np

### Configuration Variables

In [None]:
# Audio settings
sample_rate = 16000
audio_chunk_size = 320 # Size of audio chunks (20ms at 16kHz)
channels = 1 # Number of audio channels (mono)
consecutive_silent_seconds_limit = 1 # Seconds of continuous silence that mark the end of an utterance

# Initialize WebRTC VAD in its most aggressive mode (3), which is the strictest about filtering out non-speech
vad = webrtcvad.Vad()
vad.set_mode(3)

# Silence prepended to each audio buffer before recognition.
# With the values above this is 5120 16-bit samples, i.e. 320 ms at 16 kHz.
audio_buffer_padding = b'\x00\x00' * int(sample_rate * audio_chunk_size / 1000)

# Keyword to terminate the application
termination_keyword = "goodbye"
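
The relationships between these settings can be checked with a little arithmetic. The sketch below is illustrative only: the helper names (`frame_duration_ms`, `frames_per_second`) and `bytes_per_sample` are not part of the notebook, and it assumes 16-bit mono PCM, which matches `dtype='int16'` and `channels = 1` used later. WebRTC VAD accepts only 10, 20, or 30 ms frames, so the chunk size has to correspond to one of those durations.

```python
# Illustrative arithmetic; the values mirror the configuration cell above.
sample_rate = 16000
audio_chunk_size = 320
consecutive_silent_seconds_limit = 1
bytes_per_sample = 2  # assumption: 16-bit PCM (int16)


def frame_duration_ms(chunk_size, rate):
    """Duration of one audio chunk in milliseconds."""
    return chunk_size * 1000 / rate


def frames_per_second(chunk_size, rate):
    """Number of chunks delivered per second of audio."""
    return rate / chunk_size


duration = frame_duration_ms(audio_chunk_size, sample_rate)
# Number of consecutive silent chunks that add up to the silence limit
silent_frames = int(consecutive_silent_seconds_limit
                    * frames_per_second(audio_chunk_size, sample_rate))
# Size in bytes of one mono chunk as it is placed on the queue
frame_bytes = audio_chunk_size * bytes_per_sample

print(duration, silent_frames, frame_bytes)
```

If `audio_chunk_size` is changed, the frame duration should remain 10, 20, or 30 ms for the VAD call to accept the frame.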

### Queues and Events

In [None]:
recordings_queue = Queue()
recognized_texts_queue = Queue()

stop_event = Event()
audio_playback_interrupt_event = Event()

output_lock = Lock()
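
The queues pass data between threads, and the events tell those threads when to stop. A minimal sketch of that pattern follows; the names `drain`, `demo_queue`, `demo_stop`, and `received` are hypothetical and exist only for this illustration. Unlike the loops in this notebook, which exit as soon as the stop event is set, this consumer also drains any remaining items so that the demo is deterministic.

```python
from queue import Queue, Empty
from threading import Thread, Event


def drain(q, stop, sink):
    """Move items from `q` into `sink` until `stop` is set and `q` is empty."""
    while not (stop.is_set() and q.empty()):
        try:
            # A short timeout lets the loop re-check the stop condition regularly
            item = q.get(timeout=0.05)
        except Empty:
            continue
        sink.append(item)


demo_queue = Queue()
demo_stop = Event()
received = []

worker = Thread(target=drain, args=(demo_queue, demo_stop, received))
worker.start()

for i in range(3):
    demo_queue.put(i)

demo_stop.set()         # ask the consumer to finish
worker.join(timeout=2)  # wait for it to exit
```

The `get(timeout=...)` call is what keeps a consumer responsive: without a timeout, a thread blocked on an empty queue would never notice the stop event.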

### Loading Models

In [None]:
# Load the Piper voice model
try:
    voice_model = PiperVoice.load(piper_voice_name)
except FileNotFoundError:
    print(f"Error: Piper voice model file not found. Please ensure '{piper_voice_name}' is in the correct directory.", file=sys.stderr)
    voice_model = None
except Exception as e:
    print(f"An unexpected error occurred while loading the Piper model: {e}", file=sys.stderr)
    voice_model = None

# Load the Vosk speech recognition model.
# Here we use a relatively small model. You can download a larger,
# much more accurate speech recognition model.
try:
    speech_model = Model(model_name="vosk-model-en-us-0.22-lgraph")
except Exception as e:
    print(f"Error loading Vosk model: {e}. Please ensure the model is downloaded and accessible.", file=sys.stderr)
    speech_model = None
    speech_recognizer = None

if speech_model:
    try:
        speech_recognizer = KaldiRecognizer(speech_model, sample_rate)
        speech_recognizer.SetWords(True)
    except Exception as e:
        print(f"Error creating Vosk recognizer: {e}", file=sys.stderr)
        speech_recognizer = None
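
With `SetWords(True)`, the JSON returned by the recognizer is expected to carry a per-word `result` list alongside the `text` field. The sketch below shows one way such a payload could be inspected. The sample payload is hand-written for illustration and is not real recognizer output; the field names (`conf`, `start`, `end`, `word`) reflect my understanding of Vosk's output and should be treated as an assumption, and `low_confidence_words` is a hypothetical helper.

```python
import json

# Hand-written sample payload (assumption: shape of a Vosk result with words enabled)
sample_result = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.12, "end": 0.45, "word": "hello"},
        {"conf": 0.62, "start": 0.45, "end": 0.90, "word": "world"},
    ],
    "text": "hello world",
})


def low_confidence_words(result_json, threshold=0.8):
    """Return the words whose confidence is below `threshold`."""
    data = json.loads(result_json)
    # "result" may be absent when nothing was recognized, hence the default
    return [w["word"] for w in data.get("result", []) if w["conf"] < threshold]


print(low_confidence_words(sample_result))
```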

### Functions

In [None]:
def callback(indata, frames, time_info, status):  # `time_info` avoids shadowing the imported `time` module
    """Callback for sounddevice input stream."""
    if status:
        print(status, file=sys.stderr)
    try:
        recordings_queue.put(indata[:,0].tobytes())
    except Exception as e:
        print(f"Error putting data into recordings queue: {e}", file=sys.stderr)

In [None]:
def record_microphone():
    """Continuously records audio from the microphone and puts it into a queue.

    Uses sounddevice to create an input stream from the default microphone.
    The `callback` function is used to process each audio block. The recording
    continues until the `stop_event` is set.
    """
    try:
        with sd.InputStream(samplerate=sample_rate, channels=channels,
                            callback=callback, dtype='int16',
                            blocksize=audio_chunk_size):
            while not stop_event.is_set():
                sd.sleep(100)
    except Exception as e:
        print(f"Error during microphone recording: {e}", file=sys.stderr)
        # Signal other threads to stop in case of a recording error
        stop_event.set()

In [None]:
def speech_recognition(html_display_area):
    """Processes the audio buffer, performs VAD, recognizes speech, and triggers synthesis.

    Reads audio data from the `recordings_queue` queue, uses WebRTC VAD to detect speech,
    buffers speech segments, and uses Vosk to recognize the spoken text. Recognized
    text is displayed in the provided output widget and put into another queue
    for voice synthesis. It also checks for a termination keyword to stop the application.

    Args:
        html_display_area (IPython.display.DisplayHandle): The display handle to update with recognized text.
    """
    if not speech_recognizer:
        print("Speech recognition not available due to model loading error.", file=sys.stderr)
        return

    buffer = audio_buffer_padding
    in_speech = False
    silence_threshold = 0

    try:
        while not stop_event.is_set():
            try:
                frames = recordings_queue.get(timeout=0.1)
            except Exception:  # queue.Empty is raised when the timeout expires
                continue

            try:
                # Use WebRTC VAD to check if the current frames contain speech
                is_speech = vad.is_speech(frames, sample_rate=sample_rate)
            except Exception as e:
                print(f"Error during VAD processing: {e}", file=sys.stderr)
                continue

            if is_speech:
                # If speech is detected
                in_speech = True
                buffer += frames
                silence_threshold = 0

            elif in_speech:
                # Check if the silence duration has exceeded the limit
                if silence_threshold < consecutive_silent_seconds_limit * (sample_rate / audio_chunk_size):
                    silence_threshold += 1
                    buffer += frames
                else:
                    # If silence limit reached, process the buffered speech
                    try:
                        speech_recognizer.AcceptWaveform(buffer)
                        result = json.loads(speech_recognizer.Result())
                        recognized_text = result['text']
                        if recognized_text:
                            recognized_text = f"{recognized_text.capitalize()}."
                            with output_lock:
                                # json.dumps quotes and escapes the text for use as a JavaScript string
                                html_display_area.update(Javascript(f'addMessage({json.dumps(recognized_text)});'))
                            recognized_texts_queue.put(recognized_text)
                            audio_playback_interrupt_event.clear()
                            
                            # Check if the termination keyword is in the recognized text
                            if termination_keyword in recognized_text.lower():
                                message = f"Termination keyword detected: '{termination_keyword}'. Stopping..."
                                with output_lock:
                                    html_display_area.update(Javascript(f'addMessage("{message}", true);'))
                                    stop_application(stop_button)
                    except Exception as e:
                        print(f"Error during speech recognition processing: {e}", file=sys.stderr)

                    in_speech = False
                    silence_threshold = 0
                    buffer = audio_buffer_padding
    except Exception as e:
        print(f"An unexpected error occurred in speech_recognition thread: {e}", file=sys.stderr)
        stop_event.set()
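
The end-of-speech rule used in the loop above can be isolated from the audio and recognizer code. The sketch below applies the same rule to a list of booleans standing in for per-frame VAD decisions; `segment_speech` and `max_silent_frames` are hypothetical names introduced for this illustration, and frame indices stand in for the audio bytes.

```python
def segment_speech(vad_flags, max_silent_frames):
    """Group frame indices into utterances.

    A segment starts at the first speech frame and is closed by the first
    silent frame that arrives after `max_silent_frames` consecutive silent
    frames have already been buffered. That closing frame is discarded.
    """
    segments = []
    current = []
    in_speech = False
    silent = 0

    for index, is_speech in enumerate(vad_flags):
        if is_speech:
            in_speech = True
            current.append(index)
            silent = 0
        elif in_speech:
            if silent < max_silent_frames:
                # Still within the allowed pause: keep buffering
                silent += 1
                current.append(index)
            else:
                # Pause exceeded the limit: close the segment
                segments.append(current)
                current = []
                in_speech = False
                silent = 0
    return segments


flags = [False, True, True, False, False, False, True, False, False, False]
print(segment_speech(flags, max_silent_frames=2))
```

As in the loop above, speech that is not followed by enough silence stays in the buffer and is not emitted.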

In [None]:
def voice_synthesis():
    """Generates voice output based on recognized text and plays the audio.

    Reads recognized text from the `recognized_texts_queue` queue, synthesizes
    speech using the Piper voice model, and plays the resulting audio using
    sounddevice. Audio playback is interrupted if the `audio_playback_interrupt_event`
    is set (in this notebook, `stop_application` sets it).
    """
    if not voice_model:
        print("Voice synthesis not available due to model loading error.", file=sys.stderr)
        return

    sd.default.samplerate = voice_model.config.sample_rate

    # Add a small padding of silence between synthesized speech segments
    voice_padding = b'\x00\x00' * int(voice_model.config.sample_rate * 0.1)

    try:
        while not stop_event.is_set():
            if audio_playback_interrupt_event.is_set():
                sd.stop()
            try:
                text = recognized_texts_queue.get(timeout=0.1)
            except Exception:  # queue.Empty is raised when the timeout expires
                continue

            if text and not audio_playback_interrupt_event.is_set(): # Skip synthesis if playback has been interrupted
                try:
                    gen = voice_model.synthesize(text)
                    audio_chunks_int16 = []
                    for chunk in gen:
                        if audio_playback_interrupt_event.is_set():
                            # If interrupted, close the generator and stop playback
                            gen.close()
                            sd.stop()
                            break
                        audio_chunks_int16.append(chunk.audio_int16_array)
                    
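                    # Do not play partially collected audio after an interruption
                    if audio_chunks_int16 and not audio_playback_interrupt_event.is_set():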
                        concatenated_audio = np.concatenate(audio_chunks_int16)
                        sd.play(concatenated_audio)
                except Exception as e:
                    print(f"Error during voice synthesis or playback: {e}", file=sys.stderr)
        sd.stop()
    except Exception as e:
        print(f"An unexpected error occurred in voice_synthesis thread: {e}", file=sys.stderr)
        stop_event.set()
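
The interruption check inside the synthesis loop can be sketched without Piper or an audio device. In the example below, `fake_chunks` is a hypothetical generator standing in for a synthesizer, and `collect_until_interrupted` is a hypothetical helper; neither is part of the notebook. The helper reports whether collection completed, so the caller can decide not to play partial audio.

```python
from threading import Event


def collect_until_interrupted(chunks, interrupt):
    """Collect chunks until `interrupt` is set.

    Returns a tuple (collected, completed), where `completed` is False
    if collection stopped early because of the interrupt event.
    """
    collected = []
    for chunk in chunks:
        if interrupt.is_set():
            return collected, False
        collected.append(chunk)
    return collected, True


def fake_chunks(interrupt, trip_after):
    """Yield five 2-byte chunks; set `interrupt` before yielding chunk `trip_after`."""
    for i in range(5):
        if i == trip_after:
            interrupt.set()
        yield bytes([i]) * 2


interrupt = Event()
audio, completed = collect_until_interrupted(fake_chunks(interrupt, 2), interrupt)
print(len(audio), completed)
```

Playing the collected audio only when `completed` is true avoids starting playback of a fragment that the user has already cut off.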

### Button Click Handlers

In [None]:
def start_application(button):
    """Starts recording, recognition, and synthesis threads.
    Updates the UI to indicate the application is starting and listening.
    """

    stop_event.clear()
    
    button.layout.display = 'none'
    stop_button.layout = widgets.Layout(display='')
    listening_indicator.layout = widgets.Layout(display='')
    html_display_area.update(Javascript('addMessage("Starting...", true);'))
    record_thread = Thread(target=record_microphone)
    transcribe_thread = Thread(target=speech_recognition, args=(html_display_area,))
    voice_synthesis_thread = Thread(target=voice_synthesis)

    record_thread.start()
    transcribe_thread.start()
    voice_synthesis_thread.start()
    html_display_area.update(Javascript('addMessage("Listening...", true);'))


def stop_application(button):
    """Stops application threads and updates UI.
    Sets the stop event to signal threads to terminate and updates the UI
    to indicate that the application has stopped.
    """
    button.layout.display = 'none'
    listening_indicator.layout = widgets.Layout(display='none')
    stop_event.set()
    audio_playback_interrupt_event.set()
    html_display_area.update(Javascript('''
    addMessage("Application stopped.", true);
    addMessage("Run the cell to start the app again.", true);
    '''))
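
The calls to `addMessage` in this notebook are built by interpolating text into a JavaScript snippet. One way to quote arbitrary text for that purpose is `json.dumps`, since a JSON string is also accepted as a JavaScript string literal in current browsers. The helper below, `add_message_call`, is a hypothetical name used only for this sketch.

```python
import json


def add_message_call(content, system=False):
    """Build an addMessage(...) call with the text quoted for JavaScript."""
    # json.dumps escapes quotes, backslashes, and newlines in `content`
    # and renders the Python bool as the JavaScript literal true/false
    return f"addMessage({json.dumps(content)}, {json.dumps(system)});"


print(add_message_call('He said "hi".'))
print(add_message_call("Listening...", True))
```

The resulting string could be passed to `Javascript(...)` in the same way as the hand-written calls above.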

### Starting The Application

In [None]:
displayArea = """<html>

<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Repeat after me test app</title>
    <style>
        body {
            margin: 0;
            padding: 20px;
        }

        #chatContainer {
            margin: 20px auto;
            background-color: white;
            padding: 15px;
            border-radius: 10px;
            box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
            overflow-y: auto;
            height: 70vh;
            display: flex;
            flex-direction: column;
        }

        .message {
            display: flex;
            align-items: flex-start;
            margin: 2px 0;
            justify-content: flex-start;
        }

        .message-content {
            display: flex;
            flex-direction: column;
            max-width: 70%;
        }

        .outgoing .message-content {
            /* Align text to the right for outgoing */
            align-items: flex-end;
        }

        .message .text {
            padding: 5px;
            border-radius: 5px;
            /* Allow text to take full width of message-content */
            max-width: 100%;
            /* Break long words */
            word-wrap: break-word;
            /* Needed for the pseudo-elements */
            position: relative;
        }

        .outgoing .text {
            /* Blue for outgoing */
            background-color: #007bff;
            color: white;
        }

        .logging .text {
            /* Orange for logging */
            background-color: #ffb74d;
            color: black;
        }
    </style>
</head>

<body>
    <div id="chatContainer">
    </div>
    <script>
        function addMessage(content, system = false) {
            console.log("message: ", content);

            var chatContainer = document.getElementById("chatContainer");
            var messageDiv = document.createElement("DIV");
            var contentDiv = document.createElement("DIV");
            var textDiv = document.createElement("DIV");

            textDiv.className = "text";
            textDiv.innerText = system ? "System: " : "You said: ";
            textDiv.innerText += content;

            contentDiv.className = "message-content";
            contentDiv.appendChild(textDiv);
            messageDiv.appendChild(contentDiv);
            chatContainer.appendChild(messageDiv);
            chatContainer.scrollTop = chatContainer.scrollHeight;

            messageDiv.className = system ? "message logging" : "message outgoing";

        }
    </script>
</body>

</html>
"""

start_button = widgets.Button(
    description='Start App',
    disabled=False,
    button_style='success',
    tooltip='Start App',
    icon='microphone'
)

stop_button = widgets.Button(
    description='Stop App',
    disabled=False,
    button_style='danger',
    tooltip='Stop App',
    icon='stop',
    layout=widgets.Layout(display='none') # Initially hidden
)

listening_indicator = widgets.HTML(
    value="<span style='margin-left: 10px; color: #dc3545; font-weight: bold;'>Listening...</span>",
    placeholder='App is running',
    description='',
    layout=widgets.Layout(display='none') # Initially hidden
)

control_panel = widgets.HBox([start_button, stop_button, listening_indicator])

# Output widget to display messages and recognized text
app_output_widget = widgets.Output()
# Display object to update the HTML output area
html_display_area = display(HTML(""), display_id=True)

# Link button clicks to the respective functions
start_button.on_click(start_application)
stop_button.on_click(stop_application)

# Display the buttons and the output area
display(control_panel, app_output_widget)
# Add the HTML structure for the message display area
app_output_widget.append_display_data(HTML(displayArea))