## Real-time speech to text in Python

This project makes no API calls: real-time speech-to-text conversion runs entirely on the local machine. The project has three parts:
1. Create widgets that start and stop recording
2. Use pyaudio to record the microphone in the background
3. Use the vosk library for speech recognition, then add the output to the Jupyter widget
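
The hand-off between the recording and transcription threads can be tried out without a microphone or a speech model. The following is a minimal sketch, not the notebook's actual code: `producer`, `consumer`, and `run_demo` are hypothetical names, and byte strings stand in for audio chunks. It assumes the same two-queue design used in this project, where one queue acts as a run flag and the other carries batches of audio.

```python
from queue import Queue, Empty
from threading import Thread


def producer(messages, recordings, chunks, batch_size):
    """Stand-in for the recorder: batches chunks while the run flag is set."""
    batch = []
    for chunk in chunks:
        if messages.empty():  # the flag queue was emptied, so stop
            break
        batch.append(chunk)
        if len(batch) >= batch_size:
            recordings.put(batch.copy())
            batch = []


def consumer(messages, recordings, results):
    """Stand-in for the transcriber: joins each batch into one byte string."""
    # Keep going while the flag is set or batches are still waiting
    while not messages.empty() or not recordings.empty():
        try:
            batch = recordings.get(timeout=0.1)
        except Empty:
            continue  # nothing yet, re-check the flag
        results.append(b"".join(batch))


def run_demo():
    messages, recordings = Queue(), Queue()
    results = []
    messages.put(True)  # set the run flag

    chunks = [bytes([i]) * 4 for i in range(6)]  # six fake 4-byte chunks
    rec_thread = Thread(target=producer, args=(messages, recordings, chunks, 2))
    con_thread = Thread(target=consumer, args=(messages, recordings, results))
    rec_thread.start()
    con_thread.start()

    rec_thread.join()
    with messages.mutex:  # clear the flag while holding the queue's lock
        messages.queue.clear()
    con_thread.join()
    return results
```

Calling `run_demo()` returns three joined batches of two chunks each. The `timeout` on `get` is a design choice in this sketch: it lets the consumer notice that the flag was cleared instead of blocking forever on an empty queue.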

In [4]:
## Part 1 Import Modules and Create Widgets

import pyaudio
import json
from vosk import Model, KaldiRecognizer
from queue import Queue
from ipywidgets import Button, Output
from IPython.display import display
from threading import Thread

# Create the output widget
output = Output()
# Create the start and stop buttons
start_button = Button(description='Start Recording', icon='microphone')
stop_button = Button(description='Stop Recording', icon='stop')

# Add the click event handlers to the buttons
start_button.on_click(lambda _: start_recording())
stop_button.on_click(lambda _: stop_recording())

# Create queues for communication between threads
'''
Two threads run in the background at the same time. messages acts as a run flag: both worker loops keep going while it holds an item and stop once it has been emptied.
recordings holds the batches of audio produced by record_microphone until the speech recognition function pulls them for transcription.
'''
messages = Queue()
recordings = Queue()

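# Function to start recording and transcribing
def start_recording():
    # Ignore repeated clicks while a session is already running
    if not messages.empty():
        return
    messages.put(True)

    with output:
        print("Starting...")

    # Daemon threads so a stuck worker cannot keep the kernel from shutting down
    record = Thread(target=record_microphone, daemon=True)
    record.start()

    transcribe = Thread(target=speech_recognition, args=(output,), daemon=True)
    transcribe.start()


# Function to stop recording
def stop_recording():
    # Emptying the queue is the stop signal that both worker loops check
    with messages.mutex:
        messages.queue.clear()
    with output:
        print("Stopped.")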


## Part 2 Recording from Microphone using Pyaudio

# Define the constants for speech recognition
CHANNELS = 1
FRAME_RATE = 16000
RECORD_SECONDS = 8  # seconds of audio collected before each batch is sent for transcription; changing this also changes CPU usage
AUDIO_FORMAT = pyaudio.paInt16

'''
The commented-out code below prints every audio device connected to your computer. Run it once to find the index of your microphone,
then pass that index as input_device_index in record_microphone further down.
'''
# p = pyaudio.PyAudio()
# #we need to check how many audio devices are connected to our system

# for i in range(p.get_device_count()):
#     print(p.get_device_info_by_index(i))

# p.terminate()


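# Function for recording the microphone
def record_microphone(chunk=1024):
    p = pyaudio.PyAudio()

    stream = p.open(
        format=AUDIO_FORMAT,
        channels=CHANNELS,
        rate=FRAME_RATE,
        input=True,
        input_device_index=1,  # index of your microphone; list the devices with the snippet above
        frames_per_buffer=chunk
    )
    frames = []
    # Number of chunks that make up RECORD_SECONDS of audio (125 with the defaults)
    frames_per_recording = (FRAME_RATE * RECORD_SECONDS) / chunk

    try:
        while not messages.empty():
            # Do not raise if the input buffer overflowed between reads
            data = stream.read(chunk, exception_on_overflow=False)
            frames.append(data)
            if len(frames) >= frames_per_recording:
                recordings.put(frames.copy())
                frames = []
    finally:
        # Release the audio device even if reading fails
        stream.stop_stream()
        stream.close()
        p.terminate()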

## Part 3 Recognizing live speech with vosk
          
# Create the Vosk model and recognizer
model = Model(model_name='vosk-model-en-us-0.22')  # US English model; swap in another Vosk model name for a different language
rec = KaldiRecognizer(model, FRAME_RATE)
rec.SetWords(True)


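# Function for speech recognition
from queue import Empty  # raised by Queue.get() when the timeout expires

def speech_recognition(output):
    while not messages.empty():
        try:
            # Wait briefly so the loop can notice a stop request instead of blocking forever
            frames = recordings.get(timeout=1)
        except Empty:
            continue
        rec.AcceptWaveform(b''.join(frames))
        result = rec.Result()
        text = json.loads(result)['text']
        if text:  # skip batches in which nothing was recognized
            output.append_stdout(f'{text}\n')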
            


# Display the buttons and the output widget
display(start_button, stop_button, output)


LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:11:12:13:14:15
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /Users/akashnikam/.cache/vosk/vosk-model-en-us-0.22/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:279) Loading HCLG from /Users/akashnikam/.cache/vosk/vosk-model-en-us-0.22/graph/HCLG.fst
LOG (VoskAPI:ReadDataFiles():model.cc:294) Loading words from /Users/akashnikam/.cache/vosk/vosk-model-en-us-0.22/graph/words.txt
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /Users/akashnikam/.cach

Button(description='Start Recording', icon='microphone', style=ButtonStyle())

Button(description='Stop Recording', icon='stop', style=ButtonStyle())

Output()


It's important to note that CPython has a Global Interpreter Lock (GIL), which means that only one thread can execute Python bytecode at a time. Threads therefore suit work that mostly waits, such as blocking on a microphone read, but they do not speed up CPU-bound tasks that require heavy computation in Python. For CPU-bound tasks, consider multiprocessing or another approach to concurrency.
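
As a rough illustration of the multiprocessing route, here is a minimal sketch that spreads a CPU-bound function over worker processes with the standard library's `concurrent.futures.ProcessPoolExecutor`. It is not part of this project: `count_primes` and `count_primes_parallel` are hypothetical names chosen for the example, and prime counting simply stands in for any heavy pure-Python computation.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def count_primes_parallel(limits, workers=2):
    """Run count_primes for each limit in its own worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    # Each limit is handled by a separate process, so the GIL of one
    # interpreter does not serialize the work of the others.
    print(count_primes_parallel([10_000, 20_000]))
```

One caveat when trying this from a notebook: on platforms that start worker processes with the spawn method (macOS and Windows by default), functions defined in a notebook cell usually cannot be loaded by the workers, so the worker function typically has to live in an importable `.py` module.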
