# Speech Recognition

Date: July 2021

The NVIDIA NeMo toolkit supports numerous speech recognition models, which convert audio to text, as well as speech synthesis models, which convert text to audio.

NeMo comes with pretrained models that can be downloaded and used immediately.

The following example transcribes a short audio snippet from one of my favorite movies, "The Godfather": https://en.wikipedia.org/wiki/The_Godfather

Read more: https://catalog.ngc.nvidia.com/orgs/nvidia/collections/nemo_tts

If running in Colab, set the runtime type to GPU (Runtime > Change runtime type).
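
The later cells call `.cuda()` on every model, so it can help to confirm that a GPU is actually visible before installing anything. A minimal sketch, assuming PyTorch is already present in the runtime (as it normally is on Colab); `pick_device` is a hypothetical helper name, not part of NeMo:

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    # Fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

If this prints `cpu`, the `.cuda()` calls further down are likely to fail until the runtime type is changed.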

In [23]:
BRANCH = 'r1.11.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH  #egg=nemo_toolkit[all]
!python -m pip install git+https://github.com/NVIDIA/apex.git
!pip install pynini==2.1.5

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/NVIDIA/NeMo.git@r1.11.0
  Cloning https://github.com/NVIDIA/NeMo.git (to revision r1.11.0) to /tmp/pip-req-build-cc55puco
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/NeMo.git /tmp/pip-req-build-cc55puco
  Running command git checkout -b r1.11.0 --track origin/r1.11.0
  Switched to a new branch 'r1.11.0'
  Branch 'r1.11.0' set up to track remote branch 'r1.11.0' from 'origin'.
  Resolved https://github.com/NVIDIA/NeMo.git to commit e856e9732af79a6ed4bffaa3d709bfa387799587
  Preparing metadata (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/NVIDIA/apex.git
  Cloning https://github.com/NVIDIA/apex.git to /tmp/pip-req-build-w6ymg_wa
  Running command git clone --filter=blob:none --quiet https://github.com/NV

In [8]:
# Ignore pre-production warnings
import warnings
warnings.filterwarnings('ignore')
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

In [9]:
# Download an audio sample from the movie 'The Godfather'
Audio_sample = 'bada-bing.wav'
!wget http://www.rosswalker.co.uk/movie_sounds/sounds_files_20150201_1096714/godfather/bada-bing.wav
# Listen to it
IPython.display.Audio(Audio_sample)

--2023-02-04 22:38:45--  http://www.rosswalker.co.uk/movie_sounds/sounds_files_20150201_1096714/godfather/bada-bing.wav
Resolving www.rosswalker.co.uk (www.rosswalker.co.uk)... 192.31.21.192
Connecting to www.rosswalker.co.uk (www.rosswalker.co.uk)|192.31.21.192|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 46350 (45K) [audio/x-wav]
Saving to: ‘bada-bing.wav.1’


2023-02-04 22:38:46 (58.8 KB/s) - ‘bada-bing.wav.1’ saved [46350/46350]
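
The wget log shows the file being saved as `bada-bing.wav.1`. This is what wget does when the target name already exists, presumably from an earlier run of the cell, so `Audio_sample` keeps pointing at the first copy. One way to avoid accumulating duplicates is to download only when the file is missing. A minimal sketch using the standard library; `download_if_missing` is a hypothetical helper, not something the notebook or NeMo provides:

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: str) -> Path:
    """Download url to dest unless dest already exists; return the local path."""
    path = Path(dest)
    if not path.exists():
        # Only touch the network when there is no local copy yet.
        urlretrieve(url, str(path))
    return path


# Assumed usage in the notebook, with the URL from the cell above:
# download_if_missing(
#     "http://www.rosswalker.co.uk/movie_sounds/sounds_files_20150201_1096714/godfather/bada-bing.wav",
#     Audio_sample,
# )
```

Re-running the cell then leaves the existing file untouched instead of creating `.1`, `.2`, and so on.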



In [17]:
# Instantiate pre-trained NeMo models
# Load Audio_sample and convert it to text with the QuartzNet ASR model.
# To convert text back to audio, first generate a spectrogram with FastPitch, then convert it to an audio signal with the HiFiGAN vocoder.

# Speech Recognition model - QuartzNet
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5").cuda()

# Punctuation and capitalization model
punctuation = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(model_name='punctuation_en_distilbert').cuda()

# Spectrogram generator, which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model, which takes a spectrogram and produces audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()

[NeMo I 2023-02-04 22:48:14 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.11.0/stt_en_quartznet15x5/16661021d16e679bdfd97a2a03944c49/stt_en_quartznet15x5.nemo.
[NeMo I 2023-02-04 22:48:14 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.11.0/stt_en_quartznet15x5/16661021d16e679bdfd97a2a03944c49/stt_en_quartznet15x5.nemo
[NeMo I 2023-02-04 22:48:14 common:910] Instantiating model from pre-trained checkpoint


[NeMo W 2023-02-04 22:48:15 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /data2/voices/train_1k.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    batch_size: 32
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: /asr_set_1.2/train/train_{0..1023}.tar
    num_workers: 20
    
[NeMo W 2023-02-04 22:48:15 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
   

[NeMo I 2023-02-04 22:48:15 features:223] PADDING: 16
[NeMo I 2023-02-04 22:48:17 audio_preprocessing:491] Numba CUDA SpecAugment kernel is being used
[NeMo I 2023-02-04 22:48:18 save_restore_connector:243] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.11.0/stt_en_quartznet15x5/16661021d16e679bdfd97a2a03944c49/stt_en_quartznet15x5.nemo.
[NeMo I 2023-02-04 22:48:18 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.11.0/punctuation_en_distilbert/613c4ee780c6fc158f49d3566cbd6636/punctuation_en_distilbert.nemo.
[NeMo I 2023-02-04 22:48:18 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.11.0/punctuation_en_distilbert/613c4ee780c6fc158f49d3566cbd6636/punctuation_en_distilbert.nemo
[NeMo I 2023-02-04 22:48:18 common:910] Instantiating model from pre-trained checkpoint
[NeMo I 2023-02-04 22:48:21 tokenizer_utils:130] Getting HuggingFace AutoTokenizer with pretrained_model_name: distilbert-base-uncased, vocab_file: /tmp/tmpsvifqger/

Using eos_token, but it is not set yet.
Using bos_token, but it is not set yet.
[NeMo W 2023-02-04 22:48:22 modelPT:217] You tried to register an artifact under config key=tokenizer.vocab_file but an artifact for it has already been registered.
[NeMo W 2023-02-04 22:48:22 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    use_tarred_dataset: false
    label_info_save_dir: null
    text_file: text_train.txt
    labels_file: labels_train.txt
    tokens_in_batch: null
    max_seq_length: 128
    num_samples: -1
    use_cache: true
    cache_dir: null
    get_label_frequences: false
    verbose: true
    n_jobs: 0
    tar_metadata_file: null
    tar_shuffle_n: 1
    shard_strategy: scatter
    shuffle: true
    drop_last: false
    pin_memory: true
    num_workers: 8
    persistent_workers: true
    ds_item: punct_dataset_complete
    

[NeMo I 2023-02-04 22:48:24 save_restore_connector:243] Model PunctuationCapitalizationModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.11.0/punctuation_en_distilbert/613c4ee780c6fc158f49d3566cbd6636/punctuation_en_distilbert.nemo.
[NeMo I 2023-02-04 22:48:24 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.11.0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo.
[NeMo I 2023-02-04 22:48:24 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.11.0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo
[NeMo I 2023-02-04 22:48:24 common:910] Instantiating model from pre-trained checkpoint
[NeMo I 2023-02-04 22:48:27 tokenize_and_classify:87] Creating ClassifyFst grammars.


[NeMo W 2023-02-04 22:48:50 experimental:27] Module <class 'nemo.collections.tts.torch.g2ps.IPAG2P'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-02-04 22:48:51 g2ps:86] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2023-02-04 22:48:51 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matri

[NeMo I 2023-02-04 22:48:51 features:223] PADDING: 1
[NeMo I 2023-02-04 22:48:52 save_restore_connector:243] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.11.0/tts_en_fastpitch_align/26d7e09971f1d611e24df90c7a9d9b38/tts_en_fastpitch_align.nemo.
[NeMo I 2023-02-04 22:48:52 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.11.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2023-02-04 22:49:15 common:910] Instantiating model from pre-trained checkpoint


[NeMo W 2023-02-04 22:49:18 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2023-02-04 22:49:18 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2023-02-04 22:49:18 features:223] PADDING: 0


[NeMo W 2023-02-04 22:49:18 features:200] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2023-02-04 22:49:18 features:223] PADDING: 0
[NeMo I 2023-02-04 22:49:20 save_restore_connector:243] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.11.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


In [20]:
# Convert audio sample to text
files = [Audio_sample]
# transcribe() returns one transcription per input file, in the same order
raw_text = quartznet.transcribe(paths2audio_files=files)[0]

# Add capitalization and punctuation
res = punctuation.add_punctuation_capitalization(queries=[raw_text])
text = res[0]
print(f'\nRaw recognized text: {raw_text}. \nText with capitalization and punctuation: {text}')



Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

[NeMo E 2023-02-04 22:50:48 segment:184] Loading bada-bing.wav via SoundFile raised RuntimeError: `Error opening 'bada-bing.wav': Error in WAV/W64/RF64 file. Malformed 'fmt ' chunk.`. NeMo will fallback to loading via pydub.
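
The log line above shows SoundFile rejecting the file because of a malformed `fmt ` chunk, with NeMo falling back to pydub. To see what the header actually declares, the chunk can be read directly. A minimal sketch using only the standard library; `read_wav_fmt` is a hypothetical helper, and it reads only the first 16 bytes of the chunk (the basic PCM fields):

```python
import struct


def read_wav_fmt(path):
    """Return the basic fields of a WAV file's 'fmt ' chunk, or None if it has none."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12 or head[:4] != b"RIFF" or head[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                return None  # end of file reached without finding 'fmt '
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                data = f.read(chunk_size)
                if len(data) < 16:
                    raise ValueError("'fmt ' chunk is shorter than 16 bytes")
                fmt_tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                    "<HHIIHH", data[:16]
                )
                return {
                    "format_tag": fmt_tag,      # 1 means uncompressed PCM
                    "channels": channels,
                    "sample_rate": rate,
                    "bits_per_sample": bits,
                    "chunk_size": chunk_size,   # 16 for plain PCM; other sizes carry extra fields
                }
            # Skip this chunk; RIFF chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)


# Assumed usage in the notebook:
# print(read_wav_fmt(Audio_sample))
```

The QuartzNet training config printed earlier lists `sample_rate: 16000`, so the declared sample rate and channel count are the first things worth comparing.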


[NeMo I 2023-02-04 22:50:57 punctuation_capitalization_model:1135] Using batch size 1 for inference
[NeMo I 2023-02-04 22:50:57 punctuation_capitalization_infer_dataset:91] Max length: 64
[NeMo I 2023-02-04 22:50:57 data_preprocessing:404] Some stats of the lengths of the sequences:
[NeMo I 2023-02-04 22:50:57 data_preprocessing:406] Min: 86 |                  Max: 86 |                  Mean: 86.0 |                  Median: 86.0
[NeMo I 2023-02-04 22:50:57 data_preprocessing:412] 75 percentile: 86.00
[NeMo I 2023-02-04 22:50:57 data_preprocessing:413] 99 percentile: 86.00


100%|██████████| 4/4 [00:00<00:00, 46.37batch/s]


Raw recognized text: what are you gonna do nice college boy yeah they wantnto get mixed up in a family business now you wanna goin down te police captain one cauze you slap you her face a little bit i what do you think is ust the army whele you shoot them a mile away you gotta get them closhe like this bet bing you blow their brains all over your nights come in  you're thaking thisv person. 
Text with capitalization and punctuation: What are you gonna do? nice? college boy? Yeah, they wantnto get mixed up in a family business. Now you wanna goin down Te police. Captain One cauze. You slap you her face a little bit. I. what do you think is Ust the army whele? You shoot them a mile away. You gotta get them closhe like this bet Bing you blow their brains all over your nights. Come in, You're thaking thisv person.
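
The model-loading cell mentions converting text back to audio with FastPitch and HiFiGAN, but the notebook stops at the transcript. A minimal sketch of that step, assuming the `parse`, `generate_spectrogram`, and `convert_spectrogram_to_audio` methods of the NeMo 1.x TTS models. `synthesize` is a hypothetical helper, and the models are passed in as arguments so the function itself needs no NeMo import:

```python
def synthesize(text, spectrogram_generator, vocoder):
    """Turn text into an audio waveform via a spectrogram generator and a vocoder."""
    # Text -> token ids understood by the spectrogram generator
    tokens = spectrogram_generator.parse(text)
    # Token ids -> mel spectrogram
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
    # Mel spectrogram -> waveform
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)


# Assumed usage in the notebook (22050 Hz is the sample rate shown in the
# FastPitch config log above):
# audio = synthesize(text, spectrogram_generator, vocoder)
# IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=22050)
```

Keeping the two stages behind one function makes it easy to swap in a different spectrogram generator or vocoder later.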





In [22]:
# Results

# Original audio sample (display() is needed because this is not the last expression in the cell)
IPython.display.display(IPython.display.Audio(Audio_sample))

# This is what was recognized by the ASR model
print(raw_text)


what are you gonna do nice college boy yeah they wantnto get mixed up in a family business now you wanna goin down te police captain one cauze you slap you her face a little bit i what do you think is ust the army whele you shoot them a mile away you gotta get them closhe like this bet bing you blow their brains all over your nights come in  you're thaking thisv person
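
The transcript contains several misrecognized words ("wantnto", "cauze", "closhe"). A common way to quantify this is word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch, assuming a hand-written reference is available; the reference phrase below is an illustrative assumption, not something the notebook provides:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference for one phrase, compared with the matching part of the ASR output
reference = "you gotta get them close like this"
hypothesis = "you gotta get them closhe like this"
print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Computing WER over the full clip would require a complete reference transcript, which is not included here.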


# Next steps

A demo like this, built on NeMo and PyTorch, is great for prototyping and experimentation. For production deployment, however, NVIDIA Riva is recommended: https://docs.nvidia.com/deeplearning/riva/user-guide/docs/model-overview.html

NeMo GitHub for more examples: https://github.com/NVIDIA/NeMo