
# Vosk Colab Demo (Zombie Gone Commands)

Vosk is an open-source, offline speech recognition toolkit. It supports more than 20 languages and dialects, including English, German, Russian, Chinese, and Czech. Model sizes range from tens of megabytes to several gigabytes; larger models are generally more accurate. For more information, see https://alphacephei.com/vosk/.



# Install module and prepare the file

First, install the `vosk` module (together with `pydub`) using the following command:

In [2]:
!pip3 install vosk pydub



## Importing the necessary modules

Next, import the modules required for all the examples below:

In [3]:
from vosk import Model, KaldiRecognizer
import wave
import json

## Download example audio file

The code below downloads our example audio file and plays it back. To use your own audio file, replace the example URL with your own.

In [4]:
!wget -q -O /content/command_1.wav https://github.com/Zom-be-gone/ASR/raw/refs/heads/main/audio_samples/command_1.wav

In [5]:
import IPython
IPython.display.Audio("/content/command_1.wav")

# Recognition examples



By default, Vosk uses the `vosk-model-small-en-us-0.15` model, which is selected by the `lang="en-us"` option. The other options, `model_path` and `model_name`, let you load a model from a specific path or by a specific name.

When a model is requested for the first time, it is downloaded automatically and saved; later requests reuse the already downloaded copy.

Initializing the model by language:


In [6]:
model = Model(lang="en-us")

Open the downloaded file in 'read bytes' mode as a wave object:

In [7]:
wf = wave.open('/content/command_1.wav', 'rb')
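
Before handing a file to the recognizer, it can help to confirm its format. Vosk's own sample scripts expect mono, 16-bit, uncompressed PCM WAV input. The sketch below checks those three properties with the standard `wave` module. The helper names `is_mono_pcm16` and `write_silence` are illustrative and not part of Vosk, and the two silent files are generated only so the check can be tried without real audio.

```python
import os
import tempfile
import wave


def is_mono_pcm16(path):
    """Return True if the WAV file is mono, 16-bit, uncompressed PCM."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == 1
            and wf.getsampwidth() == 2
            and wf.getcomptype() == "NONE"
        )


def write_silence(path, channels, framerate=16000, seconds=0.1):
    """Write a short silent 16-bit WAV file (for demonstration only)."""
    frames = int(framerate * seconds)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(framerate)
        wf.writeframes(b"\x00\x00" * frames * channels)


tmpdir = tempfile.mkdtemp()
mono_path = os.path.join(tmpdir, "mono.wav")
stereo_path = os.path.join(tmpdir, "stereo.wav")
write_silence(mono_path, channels=1)
write_silence(stereo_path, channels=2)

print(is_mono_pcm16(mono_path))    # True
print(is_mono_pcm16(stereo_path))  # False
```

In the notebook, the same check could be applied to `/content/command_1.wav` before creating the recognizer.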

The `KaldiRecognizer` class provides the methods used here, such as `SetWords`, `SetPartialWords`, `AcceptWaveform`, and others.

The first parameter of `KaldiRecognizer` is the model object. The second is the sample rate. It can be passed directly as a number in Hz, such as 8000 or 16000 (as in the real-time example below), or read from the file with the `getframerate` method, as in the following code fragment.

Creating a KaldiRecognizer object with model and sample rate arguments:

In [8]:
rec = KaldiRecognizer(model, wf.getframerate())

The previous commands are the same for most of the examples; the following ones differ.

Enable timestamps for recognized words in both partial and final results using the `SetWords` and `SetPartialWords` methods:

In [9]:
rec.SetWords(True)
rec.SetPartialWords(True)
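
With word timestamps enabled, a final result carries a `result` list whose entries have the fields `conf`, `start`, `end`, and `word`. The sketch below formats such a list into readable lines. The sample string is trimmed to the first two words that appear in this notebook's output, and `format_words` is an illustrative helper, not a Vosk function.

```python
import json

# Sample final result, trimmed to the first two words shown in this notebook.
sample = '''{
  "result": [
    {"conf": 0.887898, "end": 1.53, "start": 1.05, "word": "shoot"},
    {"conf": 0.702576, "end": 1.92, "start": 1.56, "word": "every"}
  ]
}'''


def format_words(result_json):
    """Return one 'start-end word (conf)' line per recognized word."""
    words = json.loads(result_json).get("result", [])
    return [
        f"{w['start']:.2f}-{w['end']:.2f}s  {w['word']}  (conf {w['conf']:.2f})"
        for w in words
    ]


for line in format_words(sample):
    print(line)
```

A partial result without a `result` list simply yields an empty list here.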

The `AcceptWaveform` method returns true when it detects a pause after a speech fragment in the audio, which means the recognized fragment can be retrieved from the recognizer and printed.

The `KaldiRecognizer` class also provides methods that return recognition results: `Result`, `PartialResult`, and `FinalResult`.


> The `PartialResult` method of the `KaldiRecognizer` class returns a JSON string with the key "partial", whose value is the text recognized so far in the current, still unfinished fragment.

> The `Result` method of the `KaldiRecognizer` class returns a JSON string with the key "text", whose value is the recognized fragment that ended with a pause, such as a phrase or a sentence.

> The `FinalResult` method of the `KaldiRecognizer` class returns a JSON string with the key "text", whose value is the text recognized from the audio that remains when the stream ends. It is normally called once, after the last chunk.
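
Because all three methods return JSON strings, the two payload shapes can be told apart by their key. The sketch below shows one way to do that; `extract_text` is an illustrative helper rather than part of the Vosk API, and the sample strings mirror the output shapes seen in this notebook.

```python
import json


def extract_text(result_json):
    """Return (kind, text) from a recognizer result string.

    kind is "partial" for PartialResult() output and "final" for
    Result() / FinalResult() output.
    """
    payload = json.loads(result_json)
    if "partial" in payload:
        return "partial", payload["partial"]
    return "final", payload.get("text", "")


print(extract_text('{"partial": "shoot every"}'))
print(extract_text('{"text": "shoot every one"}'))
```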

Run recognition process:

In [10]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())

{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "partial" : ""
}
{
  "result" : [{
      "conf" : 0.887898,
      "end" : 1.530000,
      "start" : 1.050000,
      "word" : "shoot"
    }, {
      "conf" : 0.702576,
      "end" : 1.920000,
      "start" : 1.560000,
      "word" : "every"
    }, {
      "conf"

## Recognition with alternatives

Run the setup code described above:

In [11]:
wf = wave.open('/content/command_1.wav', 'rb')
model = Model(lang="en-us")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)

The `SetMaxAlternatives(n)` method of the `KaldiRecognizer` class makes the recognizer return up to `n` alternative hypotheses for the recognized result. Alternatives can arise, for example, when the audio quality is low.

In [12]:
rec.SetMaxAlternatives(10)

The recognition result is converted from a string to a dictionary using the `json.loads` method, which makes further processing more convenient.
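
Once the result is a dictionary, picking a single hypothesis is straightforward. The following sketch selects the alternative with the highest `confidence` value. Only the first alternative is taken from this notebook's output; the second entry is a hypothetical stand-in, because the original output is cut off, and `best_alternative` is an illustrative helper.

```python
import json

# First alternative copied from this notebook's output; the second entry is a
# HYPOTHETICAL stand-in, because the original output is truncated.
sample = json.dumps({
    "alternatives": [
        {"confidence": 190.736359, "text": "shoot every one"},
        {"confidence": 180.0, "text": "shoot everyone"},  # hypothetical
    ]
})


def best_alternative(result_json):
    """Return the text of the highest-confidence alternative, or '' if none."""
    alternatives = json.loads(result_json).get("alternatives", [])
    if not alternatives:
        return ""
    return max(alternatives, key=lambda alt: alt["confidence"])["text"]


print(best_alternative(sample))
```

Using `max` avoids relying on the order in which the alternatives are listed. Partial results have no `alternatives` key, so the helper returns an empty string for them.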

Run recognition process:

In [13]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result()))
    else:
        print(json.loads(rec.PartialResult()))

print(json.loads(rec.FinalResult()))

{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': 'shoot a'}
{'partial': 'shoot a'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'alternatives': [{'confidence': 190.736359, 'result': [{'end': 1.53, 'start': 1.05, 'word': 'shoot'}, {'end': 1.92, 'start': 1.56, 'word': 'every'}, {'end': 2.28, 'start': 1.92, 'word': 'one'}], 'text': 'shoot every one'}, {'confidence': 189.394882, 're

## Grammar recognizer


Now let's demonstrate how an online grammar can improve accuracy.

In [14]:
wf = wave.open('/content/command_1.wav', "rb")
rec = KaldiRecognizer(model, wf.getframerate(), '["shoot every one"]')

Using this recognizer, we can get more accurate results because we have already specified the expected input.
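
The grammar argument is a JSON list written as a string, so it can be generated from a Python list instead of being typed by hand. The sketch below is one possible way to do this. `build_grammar` is an illustrative helper, "hold fire" is a hypothetical second command, and the lowercasing assumes the model's vocabulary is lowercase. The `[unk]` entry, used in Vosk's sample scripts, gives out-of-grammar speech somewhere to go.

```python
import json


def build_grammar(phrases, allow_unknown=True):
    """Serialize command phrases into the JSON list string used as a grammar."""
    items = [p.strip().lower() for p in phrases if p.strip()]
    if allow_unknown:
        items.append("[unk]")
    return json.dumps(items)


# "hold fire" is a hypothetical command added for illustration.
grammar = build_grammar(["shoot every one", "hold fire"])
print(grammar)
# The string could then be passed as the third argument:
# rec = KaldiRecognizer(model, wf.getframerate(), grammar)
```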

In [15]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        jres = json.loads(rec.PartialResult())
        print(jres)


{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'one'}
{'partial': 'shoot'}
{'partial': 'shoot'}
{'partial': 'shoot'}
{'partial': 'shoot'}
{'partial': 'shoot'}
{'partial': 'shoot'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}
{'partial': 'shoot every one'}


## Real-Time ASR (not available in Colab; shown as a code example only)

In [21]:
#sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev
!apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Suggested packages:
  portaudio19-doc
The following NEW packages will be installed:
  libportaudio2 libportaudiocpp0 portaudio19-dev
0 upgraded, 3 newly installed, 0 to remove and 49 not upgraded.
Need to get 188 kB of archives.
After this operation, 927 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudiocpp0 amd64 19.6.0-1.1 [16.1 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 portaudio19-dev amd64 19.6.0-1.1 [106 kB]
Fetched 188 kB in 1s (127 kB/s)
Selecting previously unselected package libportaudio2:amd64.
(Reading database ... 123629 files and directories currently installed.)
Preparing to unpack .../libportaudio2_19.6.0-1.1_amd64.deb ...
Unpacking libportaudio2:amd64 (19.6.0-1.1) ...
Selecting previously un

In [22]:
!pip install sounddevice



In [26]:
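import sounddevice as sd

# query_devices() lists every audio device; an empty list means no mic either
devices = sd.query_devices()
if devices:
    print(f"Mic Detected : {devices}")
else:
    print("No mic detected")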

No mic detected
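
A non-empty device list does not by itself guarantee a microphone, since output-only devices are listed too. Each device entry reported by `sounddevice` is a dictionary that includes a `max_input_channels` field, so a stricter check can filter on it. The sketch below uses a hypothetical device list so that it runs without audio hardware; `input_devices` is an illustrative helper.

```python
def input_devices(devices):
    """Keep only devices that can record (at least one input channel)."""
    return [d for d in devices if d.get("max_input_channels", 0) > 0]


# HYPOTHETICAL device list shaped like the dictionaries in sd.query_devices().
devices = [
    {"name": "Built-in Output", "max_input_channels": 0, "max_output_channels": 2},
    {"name": "USB Microphone", "max_input_channels": 1, "max_output_channels": 0},
]

mics = input_devices(devices)
print([d["name"] for d in mics])
# On a machine with audio hardware: input_devices(sd.query_devices())
```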


In [23]:
import os
import queue
import sys
import sounddevice as sd
import vosk
import json

# Set the path to your Vosk model
model_path = "path_to_your_model_directory"

if not os.path.exists(model_path):
    print(f"Model not found at {model_path}")
    sys.exit(1)

# Initialize the Vosk model
model = vosk.Model(model_path)
samplerate = 16000  # Sampling rate for the audio

# Create a queue to hold audio data
q = queue.Queue()

def callback(indata, frames, time, status):
    if status:
        print(status, file=sys.stderr)
    q.put(bytes(indata))

# Make sure the mic is recognized
print(sd.query_devices())

# Open the audio stream
with sd.RawInputStream(samplerate=samplerate, blocksize=8000, dtype='int16',
                       channels=1, callback=callback):
    print('#' * 80)
    print('Press Ctrl+C to stop the recording')
    print('#' * 80)

    rec = vosk.KaldiRecognizer(model, samplerate)
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            result = rec.Result()
            text = json.loads(result).get('text', '')
            if text:
                print(f"Recognized: {text}")
        else:
            partial_result = rec.PartialResult()
            partial_text = json.loads(partial_result).get('partial', '')
            if partial_text:
                print(f"Partial: {partial_text}")

PortAudioError: Error querying device -1
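
Since Colab has no audio device, the callback-to-queue hand-off used above can still be tried in isolation. In the sketch below, a thread plays the role of the audio callback and pushes byte blocks into a queue, while the main loop drains it. The `STOP` sentinel and the names `fake_callback` and `consume` are assumptions made for this illustration; the real code relies on Ctrl+C instead of a sentinel.

```python
import queue
import threading

q = queue.Queue()
STOP = None  # sentinel that tells the consumer to finish


def fake_callback(chunks):
    """Stand-in for the audio callback: push raw byte chunks into the queue."""
    for chunk in chunks:
        q.put(bytes(chunk))
    q.put(STOP)


def consume():
    """Drain the queue until the sentinel arrives; return the chunks seen."""
    received = []
    while True:
        data = q.get()
        if data is STOP:
            break
        received.append(data)  # a real loop would call rec.AcceptWaveform(data)
    return received


chunks = [b"\x00\x00" * 8000 for _ in range(3)]  # three silent 16-bit blocks
producer = threading.Thread(target=fake_callback, args=(chunks,))
producer.start()
received = consume()
producer.join()
print(len(received), len(received[0]))
```

The queue decouples the audio thread from recognition, so a slow call to the recognizer does not block the capture callback.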

## Finetune to recognize specific commands more precisely (T.B.D)
