To use a pre-trained model for speech recognition, we need to import the necessary Vosk classes, load the model, and initialize a recognizer.

When initializing a recognizer, we have to provide, in addition to the selected model, a sampling rate (also called frame rate): the number of audio samples per second. It has to match the rate of the audio we later pass to the recognizer. For speech recognition tasks, 16,000 Hz is a common choice, and it is the value we use here.
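
To make the sampling rate concrete, here is a minimal sketch of how it relates to the amount of raw audio data. It assumes uncompressed mono PCM with 16-bit (2-byte) samples; the helper name `pcm_duration_seconds` is hypothetical and not part of Vosk or pydub.

```python
def pcm_duration_seconds(num_bytes, frame_rate=16000, sample_width=2, channels=1):
    """Duration of raw PCM audio, given its size in bytes.

    Assumes uncompressed PCM; sample_width is in bytes (2 = 16-bit).
    """
    bytes_per_second = frame_rate * sample_width * channels
    return num_bytes / bytes_per_second


# 32,000 bytes of 16 kHz mono 16-bit audio correspond to one second
print(pcm_duration_seconds(32000))
```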

In addition, we may want not only the complete transcript of our audio file but also the individual words in it, along with the model's confidence in each word. This can be helpful if we want to correct possible mistakes. We enable this word-level output by calling `SetWords(True)` on the recognizer.

In [1]:
from vosk import Model, KaldiRecognizer
SAMPLE_RATE = 16000
model = Model(lang="en-us")
rec = KaldiRecognizer(model, SAMPLE_RATE)

LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from /Users/jacobzhao/.cache/vosk/vosk-model-small-en-us-0.15/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:282) Loading HCL and G from /Users/jacobzhao/.cache/vosk/vosk-model-small-en-us-0.15/graph/HCLr.fst /Users/jacobzhao/.cache/vosk/vosk-model-small-en-us-0.15/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:303) Loading winfo /Users/jacobzhao/.cache/vosk/vosk-model-small-en-us-0.15/graph/phones/wo

In [2]:
rec.SetWords(True)

In [3]:
from pydub import AudioSegment
import os

In [4]:
marketplace_path = os.path.join(os.getcwd(), 'marketplace.mp3')
marketplace_full_path = os.path.join(os.getcwd(), 'marketplace_full.mp3')

In [5]:
marketplace = AudioSegment.from_mp3(marketplace_path)
marketplace_full = AudioSegment.from_mp3(marketplace_full_path)

In [6]:
CHANNELS = 1
FRAME_RATE = 16000
marketplace = marketplace.set_channels(CHANNELS)
marketplace = marketplace.set_frame_rate(FRAME_RATE)
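
Before passing the raw bytes to the recognizer, it can be worth checking that the audio really is in the expected format. The sketch below assumes the recognizer wants mono, 16-bit PCM at the rate it was created with; the helper `pcm_format_problems` is a hypothetical name used only for illustration.

```python
def pcm_format_problems(channels, sample_width, frame_rate, expected_rate=16000):
    """Return a list of reasons the audio is not mono 16-bit PCM at expected_rate."""
    problems = []
    if channels != 1:
        problems.append(f"expected 1 channel, got {channels}")
    if sample_width != 2:  # sample width is in bytes; 2 bytes = 16 bits
        problems.append(f"expected 16-bit samples, got {8 * sample_width}-bit")
    if frame_rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, got {frame_rate} Hz")
    return problems


print(pcm_format_problems(2, 2, 44100))  # stereo at 44.1 kHz: two problems
print(pcm_format_problems(1, 2, 16000))  # already in the expected format
```

With the segment from this notebook, the call would be `pcm_format_problems(marketplace.channels, marketplace.sample_width, marketplace.frame_rate)`. If the sample width turns out to differ, pydub's `set_sample_width(2)` can convert it.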

In [15]:
rec.AcceptWaveform(marketplace.raw_data)
marketplace_result = rec.Result()

In [16]:
marketplace_result

'{\n  "result" : [{\n      "conf" : 0.521135,\n      "end" : 137.910000,\n      "start" : 137.760000,\n      "word" : "who"\n    }, {\n      "conf" : 0.829375,\n      "end" : 138.060000,\n      "start" : 138.000000,\n      "word" : "are"\n    }, {\n      "conf" : 1.000000,\n      "end" : 138.150000,\n      "start" : 138.060000,\n      "word" : "the"\n    }, {\n      "conf" : 1.000000,\n      "end" : 138.540000,\n      "start" : 138.150000,\n      "word" : "funny"\n    }, {\n      "conf" : 1.000000,\n      "end" : 138.960000,\n      "start" : 138.540000,\n      "word" : "thing"\n    }, {\n      "conf" : 1.000000,\n      "end" : 139.200000,\n      "start" : 138.960000,\n      "word" : "about"\n    }, {\n      "conf" : 1.000000,\n      "end" : 139.290000,\n      "start" : 139.200000,\n      "word" : "the"\n    }, {\n      "conf" : 1.000000,\n      "end" : 139.680000,\n      "start" : 139.290000,\n      "word" : "big"\n    }, {\n      "conf" : 1.000000,\n      "end" : 140.220000,\n      "s
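
Here the whole recording is passed to `AcceptWaveform` in a single call. A common alternative is to feed the audio in smaller chunks and collect one result per utterance. The following is a minimal sketch under that assumption; the function name `transcribe_in_chunks` and the chunk size are illustrative choices, and the recognizer only needs to provide `AcceptWaveform`, `Result`, and `FinalResult`.

```python
import json


def transcribe_in_chunks(recognizer, raw_data, chunk_bytes=8000):
    """Feed raw PCM bytes to a Vosk-style recognizer chunk by chunk.

    Returns a list of parsed result dicts, one per detected utterance,
    followed by the final result for any remaining buffered audio.
    chunk_bytes should be even so 16-bit samples are not split.
    """
    results = []
    for start in range(0, len(raw_data), chunk_bytes):
        chunk = raw_data[start:start + chunk_bytes]
        # AcceptWaveform returns a truthy value once an utterance is complete
        if recognizer.AcceptWaveform(chunk):
            results.append(json.loads(recognizer.Result()))
    # FinalResult flushes whatever audio is still buffered
    results.append(json.loads(recognizer.FinalResult()))
    return results
```

In this notebook it could be called as `segments = transcribe_in_chunks(rec, marketplace.raw_data)`, and the full transcript assembled with `" ".join(s.get("text", "") for s in segments)`.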

In [17]:
import json
marketplace_json = json.loads(marketplace_result)

In [28]:
marketplace_json['text']

"who are the funny thing about the big economic news of the day the fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news and the interest rate increase wasn't it you know it was common i know it was common wall street news common businesses knew it was common so on this fed day on this program something a little bit different j powell in his own words five of i'm his most used economic words from today's press conference where number one of course it's the biggie two percent inflation flesh and inflation inflation inflation place in english dealing with inflation bells big worry that thing keeping him up at night price stability is the feds whole ballgame right now pal basically said as much to day or number two"

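The word-level entries are what make it possible to review likely mistakes. As a minimal sketch, the code below lists words whose confidence falls under a threshold. The sample is a shortened copy of the first entries of the output shown earlier (its `text` field is shortened to match), and the 0.9 threshold and the helper name `low_confidence_words` are assumptions for illustration.

```python
import json

# Shortened sample in the same shape as the recognizer output shown earlier
sample_result = '''{
  "result": [
    {"conf": 0.521135, "end": 137.91, "start": 137.76, "word": "who"},
    {"conf": 0.829375, "end": 138.06, "start": 138.00, "word": "are"},
    {"conf": 1.0, "end": 138.15, "start": 138.06, "word": "the"},
    {"conf": 1.0, "end": 138.54, "start": 138.15, "word": "funny"}
  ],
  "text": "who are the funny"
}'''


def low_confidence_words(result_json, threshold=0.9):
    """Return (word, start, conf) for every word below the confidence threshold."""
    words = json.loads(result_json).get("result", [])
    return [(w["word"], w["start"], w["conf"]) for w in words if w["conf"] < threshold]


for word, start, conf in low_confidence_words(sample_result):
    print(f"{start:8.2f}s  {word!r}  conf={conf:.2f}")
```

With the real output, `low_confidence_words(marketplace_result)` would return the words worth checking by ear, together with their start times in the audio.
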
Now we will add punctuation and capitalization using Vosk's vosk-recasepunc-en-0.22 model, available at https://alphacephei.com/vosk/models

In [32]:
import subprocess
import sys

# The unpunctuated transcript produced by the recognizer
text_without_punctuation = marketplace_json['text']

# Run recasepunc with the current Python interpreter; it reads the text from stdin.
# Passing the arguments as a list avoids the need for shell=True.
command = [sys.executable, 'recasepunc/recasepunc.py', 'predict', 'recasepunc/checkpoint']

# Run the command and capture its standard output
try:
    result = subprocess.check_output(
        command,
        text=True,
        input=text_without_punctuation
    )
    print(result)
except subprocess.CalledProcessError as e:
    print("Error occurred:", e)



Downloading: 100%|██████████| 28.0/28.0 [00:00<00:00, 41.2kB/s]
Downloading: 100%|██████████| 226k/226k [00:00<00:00, 1.32MB/s]
Downloading: 100%|██████████| 455k/455k [00:00<00:00, 1.80MB/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 3.63MB/s]
Downloading: 100%|██████████| 420M/420M [00:08<00:00, 55.0MB/s] 
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from

Who are The funny thing about the big economic news of the day, the Fed raising interest rates have a percentage point was that there was only really one bit of actual news in the news. And the interest rate increase wasn ' t it. You know, it was common. I know it was common Wall Street news. Common businesses knew it was common. So on this Fed day on this program, something a little bit different. J Powell, in his own words, five of. I ' m his most used economic words from today ' s press conference, where number one, Of course, it ' s the biggie Two percent inflation flesh and inflation inflation inflation place in English. Dealing with inflation bells. Big worry, that thing keeping him up at night, Price stability is the Feds whole ballgame right now, Pal. Basically said as much to day or number two.
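
The punctuated text contains apostrophes surrounded by spaces, as in "wasn ' t" and "today ' s". A small cosmetic clean-up can rejoin them. This is a minimal sketch and not part of recasepunc; the helper name `fix_apostrophe_spacing` is hypothetical, and the pattern assumes that a spaced apostrophe between two word characters always belongs to a contraction or possessive.

```python
import re


def fix_apostrophe_spacing(text):
    """Rejoin contractions and possessives split as "wasn ' t" or "today ' s"."""
    return re.sub(r"(\w) ' (\w)", r"\1'\2", text)


sample = "And the interest rate increase wasn ' t it. I ' m his most used economic words from today ' s press conference"
print(fix_apostrophe_spacing(sample))
```

Applied to the notebook's output, this would be `fix_apostrophe_spacing(result)`. Text that uses spaced single quotes as quotation marks would also be joined, so the pattern may need tightening for other inputs.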

