# Speech Recognition with Python Libraries - vosk, SpeechRecognition and Pocketsphinx

<div style="border: 5px ridge; padding:5%"> 
<h1>Table of Contents</h1>
<hr style="border:2px solid black" />

<h3><a href="#online">1. Online Speech Recognition with SpeechRecognition</a></h3>
<h3><a href="#vosk">2. Offline Speech Recognition with Vosk</a></h3>
<h3><a href="#pocketsphinx">3. Offline Speech Recognition with SpeechRecognition and Pocketsphinx</a></h3>
<h3><a href="#results">4. Results</a></h3>
</div>

<hr style="border:2px solid black" />
<h1><a id="online">Online Speech Recognition with SpeechRecognition</a></h1>

## Import Libraries

In [1]:
import os
import sys
import time
import speech_recognition as sr
# install with `pip install SpeechRecognition`

## Specify the file name to recognize and language

In [2]:
# name of the audio file to recognize (wav preferably)
audio_filename = "audio/speech_recognition_systems.wav"
# name of the text file to write recognized text
text_filename = "audio/speech_recognition_systems_online.txt"
# language of speech
language = 'en-US'

## Reading a file

In [3]:
if not os.path.exists(audio_filename):
    print(f"File '{audio_filename}' doesn't exist")
    sys.exit()

print(f"Reading your file '{audio_filename}'...")
audio_file = sr.AudioFile(audio_filename)
r = sr.Recognizer()

with audio_file as af:
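    # calibrates the recognizer's energy threshold; note that this call
    # consumes the first second of the file, so record() starts after it
    r.adjust_for_ambient_noise(af)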
    audio = r.record(af)

print(f"'{audio_filename}' file was successfully read and cleaned from noise")

Reading your file 'audio/speech_recognition_systems.wav'...
'audio/speech_recognition_systems.wav' file was successfully read and cleaned from noise


## Recognize

In [4]:
print('Start converting to text. It may take some time...')
start_time = time.time()

Start converting to text. It may take some time...


In [5]:
# recognize speech using Google API
try:
    text = r.recognize_google(audio, language=language)
except sr.UnknownValueError:
    print("Google could not understand audio")
    sys.exit()
except sr.RequestError as e:
    print("Google error; {0}".format(e))
    sys.exit()

In [6]:
time_elapsed = time.strftime(
    '%H:%M:%S', time.gmtime(time.time() - start_time))
print(f'Done! Elapsed time = {time_elapsed}')

Done! Elapsed time = 00:00:08
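
The timing above is split across several cells, so the measured interval also includes any pause between running them. A minimal sketch of an alternative, assuming you are free to wrap the recognition call in one block: the `stopwatch` and `format_elapsed` helpers are hypothetical names introduced here, not part of any library used in this notebook.

```python
import time
from contextlib import contextmanager


def format_elapsed(seconds):
    """Format a duration in seconds as HH:MM:SS (hours are not wrapped at 24)."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


@contextmanager
def stopwatch(label):
    """Print how long the enclosed block took, even if it raises."""
    start = time.perf_counter()  # monotonic clock, unaffected by system time changes
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: elapsed time = {format_elapsed(elapsed)}")


# stand-in for the recognition call; replace the sleep with the real work
with stopwatch("demo"):
    time.sleep(0.1)
```

In the notebook, the body of the `with` block would be the `r.recognize_google(...)` call, which keeps the measurement independent of how quickly the cells are executed.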


In [7]:
print("\tGoogle thinks you said:\n")
print(text)

	Google thinks you said:

dumb speech recognition systems required training also called enrolment for an individual speaker with taxed or isolated vocabulary into the system the system analyzes the person specific voice and use it to find you in the recognition of that person's speech resulting in increased security systems that do not use training account speaker independent system systems that use training occult speaker dependant


In [8]:
print(f"Saving text to '{text_filename}'...")
with open(text_filename, "w") as text_file:
    text_file.write(text)
print("Text successfully saved")

Saving text to 'audio/speech_recognition_systems_online.txt'...
Text successfully saved


<hr style="border:2px solid black" />
<h1><a id="vosk">Offline Speech Recognition with Vosk</a></h1>

## Import Libraries

In [9]:
import os
import sys
import time
import wave
import json
from vosk import Model, KaldiRecognizer, SetLogLevel
# install with `pip install vosk`

SetLogLevel(0)

## Specify the file name to recognize and the path to the vosk model

In [10]:
# name of the audio file to recognize (wav preferably)
audio_filename = "audio/speech_recognition_systems.wav"
# name of the text file to write recognized text
text_filename = "audio/speech_recognition_systems_vosk.txt"

# path to vosk model downloaded from
# https://alphacephei.com/vosk/models
model_path = "models/vosk-model-en-us-0.21"

## Reading a file and a model

In [11]:
if not os.path.exists(audio_filename):
    print(f"File '{audio_filename}' doesn't exist")
    sys.exit()

print(f"Reading your file '{audio_filename}'...")
wf = wave.open(audio_filename, "rb")
print(f"'{audio_filename}' file was successfully read")

# check that the audio is a mono, 16-bit, uncompressed PCM WAV file
if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
    print("Audio file must be WAV format mono PCM.")
    sys.exit()

Reading your file 'audio/speech_recognition_systems.wav'...
'audio/speech_recognition_systems.wav' file was successfully read
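
The format check above reports only that something is wrong, not what. As a sketch using only the standard `wave` module, the helper below lists each mismatch separately. `wav_problems` and `write_silence` are hypothetical names introduced for this illustration, and the two temporary files contain generated silence so the example runs without the notebook's audio.

```python
import os
import tempfile
import wave


def wav_problems(path):
    """Return a list of reasons why the file is not mono 16-bit PCM WAV."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected 1 channel, found {wf.getnchannels()}")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wf.getsampwidth()}-bit")
        if wf.getcomptype() != "NONE":
            problems.append(f"expected uncompressed PCM, found {wf.getcomptype()}")
    return problems


def write_silence(path, channels, sample_width, rate=16000, seconds=0.1):
    """Write a short silent WAV file with the given layout (test helper)."""
    n_frames = int(rate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(rate)
        wf.writeframes(b"\x00" * (n_frames * channels * sample_width))


with tempfile.TemporaryDirectory() as tmp:
    mono = os.path.join(tmp, "mono16.wav")
    stereo = os.path.join(tmp, "stereo16.wav")
    write_silence(mono, channels=1, sample_width=2)
    write_silence(stereo, channels=2, sample_width=2)
    mono_report = wav_problems(mono)      # no problems expected
    stereo_report = wav_problems(stereo)  # channel count mismatch expected

print(mono_report)
print(stereo_report)
```

A file that fails the check can be converted with an external tool before recognition; the conversion itself is outside the scope of this sketch.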


In [12]:
if not os.path.exists(model_path):
    print(f"Please download the model from https://alphacephei.com/vosk/models and unpack as {model_path}")
    sys.exit()

print(f"Reading your vosk model '{model_path}'...")
model = Model(model_path)
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)
print(f"'{model_path}' model was successfully read")

Reading your vosk model 'models/vosk-model-en-us-0.21'...
'models/vosk-model-en-us-0.21' model was successfully read


## Recognize

In [13]:
print('Start converting to text. It may take some time...')
start_time = time.time()

Start converting to text. It may take some time...


In [14]:
results = []

# recognize speech using vosk model
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        part_result = json.loads(rec.Result())
        results.append(part_result)

part_result = json.loads(rec.FinalResult())
results.append(part_result)

`results` is a list of JSON dictionaries, each with the following structure:

```
{'result': [{'conf': 0.849133,  # confidence
             'end': 4.5,        # end time, in seconds
             'start': 4.05,     # start time, in seconds
             'word': 'test'}],  # recognized word
 'text': 'test'}
```
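
Because every word carries a confidence and timestamps, the structure above can be used to flag the parts of the audio that deserve a manual check. The following is a sketch under stated assumptions: `low_confidence_words` is a hypothetical helper, the threshold of 0.8 is arbitrary, and the sample data is a hand-copied subset of the output shown in this notebook. The segment without a `'result'` key is an assumption about how silent segments are reported, which is why the code uses `.get()`.

```python
def low_confidence_words(results, threshold=0.8):
    """Collect (word, start, end, conf) for words recognized below the threshold."""
    flagged = []
    for part in results:
        # assumed: segments with no recognized words may lack the 'result' key
        for word in part.get("result", []):
            if word["conf"] < threshold:
                flagged.append((word["word"], word["start"], word["end"], word["conf"]))
    return flagged


# a small subset of the recognizer output, copied from this notebook
sample_results = [
    {
        "result": [
            {"conf": 1.0, "end": 5.67, "start": 4.92, "word": "training"},
            {"conf": 1.0, "end": 6.9, "start": 6.03, "word": "alphago"},
            {"conf": 0.720008, "end": 7.95, "start": 6.93, "word": "enrollment"},
            {"conf": 0.739668, "end": 8.7, "start": 8.34, "word": "burn"},
            {"conf": 1.0, "end": 9.42, "start": 8.7, "word": "individual"},
        ],
        "text": "training alphago enrollment burn individual",
    },
    {"text": ""},  # hypothetical empty segment
]

flagged = low_confidence_words(sample_results)
for word, start, end, conf in flagged:
    print(f"{start:6.2f}-{end:6.2f}s  {word:<12} conf={conf:.2f}")
```

Note that confidence is only a hint: in the sample, "alphago" is a misrecognition of "also called" yet is reported with a confidence of 1.0, so a low threshold will not catch every error.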

In [15]:
results

[{'result': [{'conf': 1.0, 'end': 1.92, 'start': 1.47, 'word': 'some'},
   {'conf': 1.0, 'end': 2.4, 'start': 1.92, 'word': 'speech'},
   {'conf': 1.0, 'end': 3.09, 'start': 2.4, 'word': 'recognition'},
   {'conf': 1.0, 'end': 4.02, 'start': 3.09, 'word': 'systems'},
   {'conf': 1.0, 'end': 4.8, 'start': 4.08, 'word': 'require'},
   {'conf': 1.0, 'end': 5.67, 'start': 4.92, 'word': 'training'},
   {'conf': 1.0, 'end': 6.9, 'start': 6.03, 'word': 'alphago'},
   {'conf': 0.720008, 'end': 7.95, 'start': 6.93, 'word': 'enrollment'},
   {'conf': 0.739668, 'end': 8.7, 'start': 8.34, 'word': 'burn'},
   {'conf': 1.0, 'end': 9.42, 'start': 8.7, 'word': 'individual'},
   {'conf': 1.0, 'end': 10.17, 'start': 9.42, 'word': 'speaker'},
   {'conf': 0.670045, 'end': 10.68, 'start': 10.23, 'word': 'reads'},
   {'conf': 1.0, 'end': 11.34, 'start': 10.689899, 'word': 'text'},
   {'conf': 0.987625, 'end': 11.7, 'start': 11.37, 'word': 'or'},
   {'conf': 1.0, 'end': 12.48, 'start': 11.73, 'word': 'isolat

In [16]:
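# build the final string from the text of each partial result
# (the loop variable is not named `r`, which is used for the Recognizer elsewhere)
text = ''
for part in results:
    text += part['text'] + ' '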

In [17]:
time_elapsed = time.strftime(
    '%H:%M:%S', time.gmtime(time.time() - start_time))
print(f'Done! Elapsed time = {time_elapsed}')

Done! Elapsed time = 00:00:07


In [18]:
print("\tVosk thinks you said:\n")
print(text)

	Vosk thinks you said:

some speech recognition systems require training alphago enrollment burn individual speaker reads text or isolated vocabulary into the system the system analyzes the person specific voice and use it to fine tune the recognition of the person's speech resulting in increased accuracy systems that do not use training are called speaker independent systems systems that use training are called speaker dependent 


In [19]:
print(f"Saving text to '{text_filename}'...")
with open(text_filename, "w") as text_file:
    text_file.write(text)
print("Text successfully saved")

Saving text to 'audio/speech_recognition_systems_vosk.txt'...
Text successfully saved


<hr style="border:2px solid black" />
<h1><a id="pocketsphinx">Offline Speech Recognition with SpeechRecognition and Pocketsphinx</a></h1>

## Import Libraries

In [20]:
import os
import sys
import time
import speech_recognition as sr
# install with `pip install SpeechRecognition`

## Specify the file name to recognize and language

In [21]:
# name of the audio file to recognize (wav preferably)
audio_filename = "audio/speech_recognition_systems.wav"
# name of the text file to write recognized text
text_filename = "audio/speech_recognition_systems_sphinx.txt"
# language of speech
language = 'en-US'

## Reading a file

In [22]:
if not os.path.exists(audio_filename):
    print(f"File '{audio_filename}' doesn't exist")
    sys.exit()
    
print(f"Reading your file '{audio_filename}'...")
audio_file = sr.AudioFile(audio_filename)
r = sr.Recognizer()

with audio_file as af:
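    # calibrates the recognizer's energy threshold; note that this call
    # consumes the first second of the file, so record() starts after it
    r.adjust_for_ambient_noise(af)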
    audio = r.record(af)

print(f"'{audio_filename}' file was successfully read and cleaned from noise")

Reading your file 'audio/speech_recognition_systems.wav'...
'audio/speech_recognition_systems.wav' file was successfully read and cleaned from noise


## Recognize

In [23]:
print('Start converting to text. It may take some time...')
start_time = time.time()

Start converting to text. It may take some time...


In [24]:
# recognize speech using Sphinx
try:
    text = r.recognize_sphinx(audio, language=language)
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
    sys.exit()
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))
    sys.exit()

In [25]:
time_elapsed = time.strftime(
    '%H:%M:%S', time.gmtime(time.time() - start_time))
print(f'Done! Elapsed time = {time_elapsed}')

Done! Elapsed time = 00:00:10


In [26]:
print("\tSphinx thinks you said:\n")
print(text)

	Sphinx thinks you said:

sounds a bit sugar ignition systems requires training i'm also called enrollment where an individual speaker newt spratt clip are isolated look terrible read into this system the system of analyzes the person specific war is in the use it to find him as arrogant nation called a person speech will result in an ways to accuracy systems and cannot use training and pa called speaker newt dependencies the system has and he was training and pop called speaker independent


In [27]:
print(f"Saving text to '{text_filename}'...")
with open(text_filename, "w") as text_file:
    text_file.write(text)
print("Text successfully saved")

Saving text to 'audio/speech_recognition_systems_sphinx.txt'...
Text successfully saved


<hr style="border:2px solid black" />
<h1><a id="results">Results</a></h1>

In [32]:
from IPython.display import Audio
Audio(filename="audio/speech_recognition_systems.wav")

The table below compares the recognition results on an excerpt from the [speech recognition article on Wikipedia](https://en.wikipedia.org/wiki/Speech_recognition).

Vosk (using its largest model) and the Google API produce comparable results, while Sphinx is clearly less accurate and also slower: about 10 s in the run above, versus 7 s for Vosk and 8 s for Google.


| Method | Recognised Text |
| ----------- | ----------- |
| **Initial Text** | Some speech recognition systems require "training" (also called "enrollment") where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent" systems. Systems that use training are called "speaker dependent". |
| Google API - SpeechRecognition with `recognize_google()` | dumb speech recognition systems required training also called enrolment for an individual speaker with taxed or isolated vocabulary into the system the system analyzes the person specific voice and use it to find you in the recognition of that person's speech resulting in increased security systems that do not use training account speaker independent system systems that use training occult speaker dependant |
| vosk | some speech recognition systems require training alphago enrollment burn individual speaker reads text or isolated vocabulary into the system the system analyzes the person specific voice and use it to fine tune the recognition of the person's speech resulting in increased accuracy systems that do not use training are called speaker independent systems systems that use training are called speaker dependent   |
| SpeechRecognition with `recognize_sphinx()` | sounds a bit sugar ignition systems requires training i'm also called enrollment where an individual speaker newt spratt clip are isolated look terrible read into this system the system of analyzes the person specific war is in the use it to find him as arrogant nation called a person speech will result in an ways to accuracy systems and cannot use training and pa called speaker newt dependencies the system has and he was training and pop called speaker independent |

A second test on a shorter phrase. Again, Sphinx performed the worst.

| Method | Recognised Text |
| ----------- | ----------- |
| Initial Text | Test Vosk - speech recognition library |
| Google API with `recognize_google()` | best bossk speech recognition Library |
| vosk | deus vos speech recognition library  |
| SpeechRecognition with `recognize_sphinx()` | that's bosco speech from the ignition library |
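
The comparisons above are by eye. One common way to put a number on them is the word error rate (WER): the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below is not part of the original notebook; `tokenize` and `word_error_rate` are hypothetical helpers, and the normalization (lowercasing, dropping punctuation, keeping apostrophes) is a simplifying assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference text contains no words")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "Test Vosk - speech recognition library"
hypotheses = {
    "google": "best bossk speech recognition Library",
    "vosk": "deus vos speech recognition library",
    "sphinx": "that's bosco speech from the ignition library",
}

scores = {name: word_error_rate(reference, text) for name, text in hypotheses.items()}
for name, score in scores.items():
    print(f"{name:<7} WER = {score:.2f}")
```

On this short phrase Google and Vosk both score 0.40 (two substituted words out of five) and Sphinx scores 1.00, which matches the qualitative ranking above. The same function can be applied to the longer excerpt by passing the initial text and each recognized text from the first table.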