# Voice Capture and Response

In this notebook, we will practice voice capture and response through speech-to-text (STT) and text-to-speech (TTS) techniques. We divide the entire project into several steps:
1. Save and play voice created by Google text-to-speech.
2. Use microphone to record voice.
3. Convert the recorded voice to text through speech-to-text (STT).
4. Convert the text to voice through text-to-speech (TTS).
5. Make a voice-to-voice stream.
6. Integrate an LLM to respond to voice input with voice output.
7. Build a Web interface for the LLM-supported voice assistant.

## Create a Folder to Store the Recorded Voice

In [1]:
path = "../data/voice/"
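
The cell above only names the folder; it does not create it. A minimal sketch of one way to create it, assuming the notebook's working directory allows writing to `../data/voice/` (the helper name `ensure_voice_dir` is hypothetical and not part of the original notebook):

```python
from pathlib import Path


def ensure_voice_dir(path: str) -> Path:
    """Create the folder (and any missing parents) if it does not exist yet."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return folder
```

Calling `ensure_voice_dir(path)` once before the first `tts.save(...)` should avoid a missing-directory error on a fresh checkout.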

## 1. Save and play voice created by Google text-to-speech

### Test Google Text-to-Speech (gTTS)

In [72]:
from gtts import gTTS
tts = gTTS('Hello! Do you know honey never spoils?')
tts.save(path + 'hello.mp3')

### Play the converted voice

In [4]:
from IPython.display import Audio
from io import BytesIO

In [5]:
# Save the audio object into a buffer
audio_buffer = BytesIO()

tts.write_to_fp(audio_buffer)

# Set the buffer's position to the start
audio_buffer.seek(0)

# Create an Audio object with the buffer
# The buffer holds encoded MP3 bytes, so no sampling rate is needed (rate only applies to raw arrays)
audio = Audio(data=audio_buffer.read(), autoplay=True)  # autoplay is optional

# Display the audio player in the notebook
audio

## 2. Use microphone to record voice

### Listen to the microphone and capture the audio
Let's start by using Python code to listen to the microphone and capture the audio. The speech_recognition module makes this a breeze:

In [6]:
import speech_recognition as sr
import os

### Ensure FLAC in PATH

SpeechRecognition requires a FLAC encoder to encode the audio data. On my MacBook Pro, flac is installed in /opt/homebrew/bin, so make sure /opt/homebrew/bin is in the PATH environment variable.

In [10]:
print(os.environ['PATH'])

/opt/homebrew/bin:/Users/yuanan/google-cloud-sdk/bin:/opt/local/bin:/opt/local/sbin:/Users/yuanan/.nvm/versions/node/v21.4.0/bin:/Users/yuanan/anaconda3/envs/drtutor/bin:/Users/yuanan/anaconda3/condabin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Library/TeX/texbin:/Users/yuanan/bin


In [11]:
# Get the current PATH
current_path = os.environ.get('PATH', '')
current_path

'/opt/homebrew/bin:/Users/yuanan/google-cloud-sdk/bin:/opt/local/bin:/opt/local/sbin:/Users/yuanan/.nvm/versions/node/v21.4.0/bin:/Users/yuanan/anaconda3/envs/drtutor/bin:/Users/yuanan/anaconda3/condabin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Library/TeX/texbin:/Users/yuanan/bin'

In [12]:
# Add /opt/homebrew/bin to PATH
new_path = f"/opt/homebrew/bin:{current_path}"
os.environ['PATH'] = new_path

In [13]:
os.environ['PATH']

'/opt/homebrew/bin:/opt/homebrew/bin:/Users/yuanan/google-cloud-sdk/bin:/opt/local/bin:/opt/local/sbin:/Users/yuanan/.nvm/versions/node/v21.4.0/bin:/Users/yuanan/anaconda3/envs/drtutor/bin:/Users/yuanan/anaconda3/condabin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Library/TeX/texbin:/Users/yuanan/bin'
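
The output above lists `/opt/homebrew/bin` twice, because the directory was already on the PATH before it was prepended. A small sketch of a guard that prepends a directory only when it is missing, and then checks whether a `flac` executable can be found. The helper names `prepend_to_path` and `flac_location` are hypothetical, and `/opt/homebrew/bin` is simply the location on the author's machine:

```python
import os
import shutil


def prepend_to_path(directory: str) -> str:
    """Prepend `directory` to PATH unless it is already listed; return the new PATH."""
    entries = [entry for entry in os.environ.get("PATH", "").split(os.pathsep) if entry]
    if directory not in entries:
        entries.insert(0, directory)
    os.environ["PATH"] = os.pathsep.join(entries)
    return os.environ["PATH"]


def flac_location():
    """Return the full path of the `flac` executable, or None if it is not on PATH."""
    return shutil.which("flac")
```

A typical use would be `prepend_to_path("/opt/homebrew/bin")` followed by `print(flac_location())`; a result of `None` suggests that FLAC still needs to be installed or that it lives in a different directory.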

### Recording Voice From Microphone for Several Seconds

In [28]:
recognizer = sr.Recognizer()

try:
    # List available microphones (optional)
    print("Available microphones:")
    print(sr.Microphone.list_microphone_names())

    # Select a specific microphone (optional)
    # with sr.Microphone(device_index=1) as source:

    with sr.Microphone() as source:
        print("Adjusting noise...")
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("Recording for 4 seconds...")
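        # record() captures a fixed duration; listen(timeout=4) would only limit the wait for speech to start
        recorded_audio = recognizer.record(source, duration=4)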
        print("Done recording.")
except Exception as ex:
    print("Something wrong during recording:", ex)

Available microphones:
['MacBook Pro Microphone', 'MacBook Pro Speakers', 'Microsoft Teams Audio']
Adjusting noise...
Recording for 4 seconds...
Done recording.


### Play the recorded voice

In [32]:
import simpleaudio as sa

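# Play the recording from its raw PCM frames (WaveObject expects headerless audio),
# using the recording's own format; SpeechRecognition microphone audio is mono.
wave_obj = sa.WaveObject(recorded_audio.get_raw_data(), num_channels=1,
                         bytes_per_sample=recorded_audio.sample_width,
                         sample_rate=recorded_audio.sample_rate)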
play_obj = wave_obj.play()
play_obj.wait_done()  # Wait until sound has finished playing
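
Since the folder defined at the top is meant to hold recorded voice, it may also be useful to write the recording to disk. A minimal sketch using only the standard-library `wave` module; it assumes mono audio, which is what `sr.Microphone` recordings contain, and the helper name `save_pcm_as_wav` is hypothetical:

```python
import wave


def save_pcm_as_wav(raw_pcm: bytes, sample_rate: int, sample_width: int, filename: str) -> None:
    """Write raw PCM frames to a WAV file (mono is assumed)."""
    with wave.open(filename, "wb") as wav_file:
        wav_file.setnchannels(1)             # assumption: single-channel recording
        wav_file.setsampwidth(sample_width)  # bytes per sample, e.g. 2 for 16-bit audio
        wav_file.setframerate(sample_rate)   # samples per second
        wav_file.writeframes(raw_pcm)
```

With the recording above, this could be called as `save_pcm_as_wav(recorded_audio.get_raw_data(), recorded_audio.sample_rate, recorded_audio.sample_width, path + "recording.wav")`, where the file name `recording.wav` is only an example.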

## 3. Convert the recorded voice to text through speech-to-text (STT)

In [33]:
try:
    print("Recognizing the text...")
    text = recognizer.recognize_google(recorded_audio, language="en-US")
    print("Decoded Text: {}".format(text))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand the audio.")
except sr.RequestError:
    print("Could not request results from Google Speech Recognition service.")
except Exception as ex:
    print("Error during recognition:", ex)

Recognizing the text...
Decoded Text: did you know that honey never spoils


## 4. Convert the text to voice through text-to-speech (TTS)

In [34]:
text_tts = gTTS(text)

### Play the Generated Voice

In [35]:
# Save the audio object into a buffer
audio_buffer = BytesIO()
text_tts.write_to_fp(audio_buffer)

# Set the buffer's position to the start
audio_buffer.seek(0)

# Create an Audio object with the buffer
# The buffer holds encoded MP3 bytes, so no sampling rate is needed (rate only applies to raw arrays)
audio = Audio(data=audio_buffer.read(), autoplay=True)  # autoplay is optional

# Display the audio player in the notebook
audio

## 5. Make a voice-to-voice stream

In [47]:
import speech_recognition as sr

# define a function to listen to the microphone
def listen_to_microphone(sr):
    """Record one phrase from the default microphone; `sr` is the speech_recognition module."""

    recognizer = sr.Recognizer()
    
    try:
        with sr.Microphone() as source:
            print("Adjusting noise...")
            recognizer.adjust_for_ambient_noise(source, duration=1)
            print("Listening to microphone...")
            recorded_audio = recognizer.listen(source)
            print("Done Listening.")

            return recorded_audio
            
    except Exception as ex:
        print("Something wrong during listening:", ex)
        return None

In [48]:
# define a function to recognize voice
def audio_to_text(audio, sr):
    """Transcribe `audio` with Google Speech Recognition; `sr` is the speech_recognition module."""

    recognizer = sr.Recognizer()

    text = "Sorry, I can't hear you!"
    
    if audio:
        try:
            print("Recognizing the text...")
            text = recognizer.recognize_google(audio, language="en-US")
            print("Decoded Text: {}".format(text))
            
        except sr.UnknownValueError as ex:
            print("Google Speech Recognition could not understand the audio:", ex)
    
            text = "Google Speech Recognition could not understand the audio"
            
        except sr.RequestError as ex:
            print("Could not request results from Google Speech Recognition service:", ex)
    
            text = "Could not request results from Google Speech Recognition service"
            
        except Exception as ex:
            print("Error during recognition:", ex)
    
            text = "Error during recognition"

    return text
        

In [53]:
from pydub import AudioSegment
from pydub.playback import play

# create a function converting text to voice
def text_to_speech(text):

    try:
        tts = gTTS(text)
    
        # Save the audio object into a buffer
        audio_buffer = BytesIO()
        tts.write_to_fp(audio_buffer)

        # Set the buffer's position to the start
        audio_buffer.seek(0)

        # Load MP3 data into an AudioSegment
        audio_segment = AudioSegment.from_file(audio_buffer, format="mp3")

        # Play the audio
        play(audio_segment)       

    except Exception as ex:
        print("Error during text to speech:", ex)

In [52]:
# Listen to microphone
recorded_audio = listen_to_microphone(sr)

text = audio_to_text(recorded_audio, sr)

text_to_speech(text)

Adjusting noise...
Listening to microphone...
Done Listening.
Recognizing the text...
Decoded Text: listening to the microphone info 323
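
The three calls above form a fixed listen, transcribe, speak sequence. A sketch of wrapping them in one function that receives each step as a callable, which makes the round trip easy to try without a microphone; the name `voice_to_voice` and its parameters are hypothetical:

```python
from typing import Any, Callable


def voice_to_voice(
    listen: Callable[[], Any],
    transcribe: Callable[[Any], str],
    speak: Callable[[str], None],
) -> str:
    """Run one listen -> transcribe -> speak round trip and return the recognized text."""
    audio = listen()          # capture audio (may be None if listening failed)
    text = transcribe(audio)  # turn the audio into text
    speak(text)               # read the text back aloud
    return text
```

With the functions defined in this section, one round trip could be run as `voice_to_voice(lambda: listen_to_microphone(sr), lambda audio: audio_to_text(audio, sr), text_to_speech)`.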


## 6. Integrate an LLM to respond to voice input with voice output
We will use Google Gemini Pro for responding to our voice input.

### First, set up GOOGLE_API_KEY

In [63]:
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
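
`os.getenv` returns `None` when the variable is missing, and that problem would likely only surface later, when the first request is made. A small sketch of failing early instead; the helper name `require_env` is hypothetical:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file.")
    return value
```

After `load_dotenv()`, the key could then be read with `GOOGLE_API_KEY = require_env("GOOGLE_API_KEY")`.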

### Import Google genai and Configure Gemini Pro

In [65]:
import google.generativeai as genai

genai.configure(api_key=GOOGLE_API_KEY)

gemini_pro = genai.GenerativeModel(model_name="models/gemini-pro")

### Respond to Input by Gemini LLM

In [69]:
role = '''
        You are an intelligent assistant to chat on the topic:
        `{}`.    
    '''

topic = '''
        The future of artificial intelligence
    '''

role_text = role.format(topic)

instructions = '''
        Respond to the INPUT_TEXT briefly in chat style.
        Respond based on your knowledge about `{}` in brief chat style. 
    '''

instructions_text = instructions.format(topic)

# create a function to respond to input text with Gemini Pro
def respond_by_gemini(input_text, role_text, instructions_text):

    final_prompt = [
        "ROLE: " + role_text,
        "INPUT_TEXT: " + input_text,
        instructions_text,
    ]

    response = gemini_pro.generate_content(
            final_prompt,
            stream=True,
        )

    response_list = []
    for chunk in response:
        response_list.append(chunk.text)
        
    response_text = "".join(response_list)

    return response_text
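
The triple-quoted templates above carry their leading newlines and indentation into the prompt, so the topic ends up wrapped in extra whitespace inside the backticks. A sketch of assembling the same three prompt parts with that whitespace removed, assuming the wording of the templates stays as written; the helper name `build_prompt` is hypothetical:

```python
from textwrap import dedent


def build_prompt(input_text: str, topic: str) -> list[str]:
    """Assemble the three prompt parts with surrounding whitespace removed."""
    topic = topic.strip()
    role_text = f"You are an intelligent assistant to chat on the topic: `{topic}`."
    instructions_text = dedent(
        f"""
        Respond to the INPUT_TEXT briefly in chat style.
        Respond based on your knowledge about `{topic}` in brief chat style.
        """
    ).strip()
    return [
        "ROLE: " + role_text,
        "INPUT_TEXT: " + input_text.strip(),
        instructions_text,
    ]
```

The returned list has the same shape as `final_prompt`, so it could be passed to `gemini_pro.generate_content(...)` in its place.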


### LLM response

In [71]:
# Listen to microphone
recorded_audio = listen_to_microphone(sr)

text = audio_to_text(recorded_audio, sr)

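# Pass the formatted instructions (instructions_text), not the raw template with its `{}` placeholder
response_text = respond_by_gemini(text, role_text, instructions_text)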

text_to_speech(response_text)

Adjusting noise...
Listening to microphone...
Done Listening.
Recognizing the text...
Decoded Text: when AI will replace humans
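
The cell above handles a single question. A sketch of extending it to a short multi-turn conversation that ends when a stop word is heard; the stop words, the turn limit, and the name `chat_loop` are all assumptions made for illustration:

```python
from typing import Callable, Iterable


def chat_loop(
    get_text: Callable[[], str],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
    stop_words: Iterable[str] = ("stop", "goodbye"),
    max_turns: int = 10,
) -> list[tuple[str, str]]:
    """Repeat listen -> respond -> speak until a stop word is heard or max_turns is reached."""
    stops = {word.lower() for word in stop_words}
    history = []
    for _ in range(max_turns):
        heard = get_text().strip()
        if heard.lower() in stops:
            break  # the user asked to end the conversation
        reply = respond(heard)
        speak(reply)
        history.append((heard, reply))
    return history
```

With the functions from this notebook, the loop could be started as `chat_loop(lambda: audio_to_text(listen_to_microphone(sr), sr), lambda t: respond_by_gemini(t, role_text, instructions_text), text_to_speech)`.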


## 7. Build a Web interface for the LLM-supported voice assistant
We will use Streamlit (https://streamlit.io/) to create a web interface for the voice assistant. Stay tuned for the next article.