# Speech Recognition

Date: July 2021

The NVIDIA NeMo toolkit supports numerous speech recognition (ASR) models, which convert audio to text.

NeMo comes with pretrained models that can be downloaded and used immediately to transcribe speech.

The following example transcribes a short audio snippet from one of my favorite movies, "The Godfather" (https://en.wikipedia.org/wiki/The_Godfather), and then adds punctuation and capitalization to the transcript.

Read more: https://catalog.ngc.nvidia.com/orgs/nvidia/collections/nemo_tts

In [None]:
BRANCH = 'r1.10.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:
# Ignore pre-production warnings
import warnings
warnings.filterwarnings('ignore')
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

In [None]:
# Download the audio sample from 'The Godfather' that we'll transcribe
Audio_sample = 'bada-bing.wav'
!wget http://www.rosswalker.co.uk/movie_sounds/sounds_files_20150201_1096714/godfather/bada-bing.wav
# Listen to it
IPython.display.Audio(Audio_sample)
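The `!wget` line above assumes that `wget` is installed on the machine running the notebook. As a minimal sketch of an alternative, the download can be done with the Python standard library and skipped when the file is already present. The helper name `fetch_if_missing` is hypothetical and not part of NeMo.

```python
import os
import urllib.request


def fetch_if_missing(url: str, dest: str) -> str:
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Hypothetical usage with the sample from the cell above:
# Audio_sample = fetch_if_missing(
#     "http://www.rosswalker.co.uk/movie_sounds/sounds_files_20150201_1096714/godfather/bada-bing.wav",
#     "bada-bing.wav",
# )
```

Skipping the download when the file exists keeps repeated runs of the cell from fetching the sample again.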

In [None]:
# Instantiate pre-trained NeMo models.
# The QuartzNet ASR model converts the audio sample to text, and the punctuation model restores punctuation and capitalization.
# To convert text back to audio, we first generate a spectrogram with FastPitch and then convert it to an audio signal using the HiFi-GAN vocoder.

# Speech Recognition model - QuartzNet
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5").cuda()

# Punctuation and capitalization model
punctuation = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(model_name='punctuation_en_distilbert').cuda()

# Spectrogram generator that takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model that takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()
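Every model above is moved to the GPU with `.cuda()`, which fails on a machine without CUDA. A minimal sketch of a fallback, assuming you are willing to run (more slowly) on CPU: choose the device once and pass it to `.to(device)`. The helper name `pick_device` is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # NeMo depends on PyTorch, so this is normally installed
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Models will be placed on: {device}")

# Hypothetical usage, replacing the trailing .cuda() in the cell above:
# quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(
#     model_name="stt_en_quartznet15x5").to(device)
```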

In [None]:
# Convert audio sample to text
files = [Audio_sample]
raw_text = ''
text = ''
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
  raw_text = transcription

# Add capitalization and punctuation
res = punctuation.add_punctuation_capitalization(queries=[raw_text])
text = res[0]
print(f'\nRaw recognized text: {raw_text}. \nText with capitalization and punctuation: {text}')

Transcribing: 100%
1/1 [00:01<00:00, 1.49s/it]
[NeMo E 2021-07-07 20:25:31 segment:154] Loading bada-bing.wav via SoundFile raised RuntimeError: `Error opening 'bada-bing.wav': Error in WAV/W64/RF64 file. Malformed 'fmt ' chunk.`. NeMo will fallback to loading via pydub.
[NeMo I 2021-07-07 20:25:32 punctuation_capitalization_model:1091] Using batch size 1 for inference
[NeMo I 2021-07-07 20:25:32 punctuation_capitalization_infer_dataset:91] Max length: 64
[NeMo I 2021-07-07 20:25:32 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2021-07-07 20:25:32 data_preprocessing:410] Min: 86 |                  Max: 86 |                  Mean: 86.0 |                  Median: 86.0
[NeMo I 2021-07-07 20:25:32 data_preprocessing:412] 75 percentile: 86.00
[NeMo I 2021-07-07 20:25:32 data_preprocessing:413] 99 percentile: 86.00
100%|██████████| 4/4 [00:00<00:00, 106.57batch/s]
Raw recognized text: what are you gonna do nice college boy yeah they wantnto get mixed up in a family business now you wanna goin down te police captain one cauze you slap you her face a little bit i what do you think is ust the army whele you shoot them a mile away you gotta get them closhe like this bet bing you blow their brains all over your nights come in  you're thaking thisv person. 
Text with capitalization and punctuation: What are you gonna do? nice? college boy? Yeah, they wantnto get mixed up in a family business. Now you wanna goin down Te police. Captain One cauze. You slap you her face a little bit. I. what do you think is Ust the army whele? You shoot them a mile away. You gotta get them closhe like this bet Bing you blow their brains all over your nights. Come in, You're thaking thisv person.
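The SoundFile error in the log above points at the `fmt ` chunk of the WAV header. The sketch below reads what a file declares in that chunk, assuming the standard RIFF/WAVE layout. It is a diagnostic aid only, and the function name `read_fmt_chunk` is hypothetical. It does not reproduce what SoundFile or pydub do internally.

```python
import os
import struct


def read_fmt_chunk(path):
    """Return the fields declared in a WAV file's 'fmt ' chunk, or None if absent."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12 or header[:4] != b"RIFF" or header[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                return None  # end of file reached without finding a 'fmt ' chunk
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                body = f.read(chunk_size)
                if len(body) < 16:
                    raise ValueError(f"'fmt ' chunk has only {len(body)} bytes")
                tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                    "<HHIIHH", body[:16]
                )
                return {
                    "format_tag": tag,  # 1 means uncompressed PCM
                    "channels": channels,
                    "sample_rate": rate,
                    "bits_per_sample": bits,
                    "declared_size": chunk_size,
                }
            # Skip this chunk; chunk bodies are padded to an even number of bytes
            f.seek(chunk_size + (chunk_size & 1), os.SEEK_CUR)


# Inspect the downloaded sample if it is present in the working directory
if os.path.exists("bada-bing.wav"):
    print(read_fmt_chunk("bada-bing.wav"))
```

A format tag other than 1, or a declared chunk size that does not match the fields that follow, are the kinds of mismatches that can cause some readers to reject a file.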


In [None]:
# Results

# Original audio sample
IPython.display.Audio(Audio_sample)

# This is what was recognized by the ASR model
print(raw_text)

what are you gonna do nice college boy yeah they wantnto get mixed up in a family business now you wanna goin down te police captain one cauze you slap you her face a little bit i what do you think is ust the army whele you shoot them a mile away you gotta get them closhe like this bet bing you blow their brains all over your nights come in  you're thaking thisv person

# This is how the punctuation model changed it
print(text)
What are you gonna do? nice? college boy? Yeah, they wantnto get mixed up in a family business. Now you wanna goin down Te police. Captain One cauze. You slap you her face a little bit. I. what do you think is Ust the army whele? You shoot them a mile away. You gotta get them closhe like this bet Bing you blow their brains all over your nights. Come in, You're thaking thisv person.
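The transcript above clearly contains recognition errors. A common way to quantify them is the word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python version. The function name `word_error_rate` is hypothetical, and the reference phrase in the usage lines is a hand-written assumption, not a verified transcript of the clip.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Assumed reference for one fragment (hand-written, not a verified transcript)
reference = "you gotta get them close like this"
# The same fragment as it appears in the model output above
hypothesis = "you gotta get them closhe like this"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Here the only difference is one substituted word out of seven reference words. Computing WER over the whole clip would require a trusted reference transcript.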


# Next steps

A demo like this, built on the PyTorch-based NeMo toolkit, is great for prototyping and experimentation. For production deployment, however, NVIDIA Riva is recommended: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/model-overview.html

For more examples, see the NeMo GitHub repository: https://github.com/NVIDIA/NeMo