<img src="http://developer.download.nvidia.com/notebooks/dlsw-notebooks/riva_asr_asr-python-advanced-finetune-am-conformer-ctc-tao-finetuning/nvidia_logo.png" style="width: 90px; float: right;">

# How to Fine-Tune a Riva ASR Acoustic Model with NeMo Toolkit
This tutorial walks you through how to fine-tune an NVIDIA Riva ASR acoustic model with NeMo.

## NVIDIA Riva Overview

NVIDIA Riva is a GPU-accelerated SDK for building speech AI applications that are customized for your use case and deliver real-time performance. <br/>
Riva offers a rich set of speech and natural language understanding services such as:

- Automatic speech recognition (ASR).
- Text-to-speech synthesis (TTS).
- A collection of natural language processing (NLP) services, such as named entity recognition (NER), punctuation, and intent classification.

In this tutorial, we will fine-tune a Riva ASR acoustic model with NeMo Toolkit. <br> 
To understand the basics of Riva ASR APIs, refer to [Getting started with Riva ASR in Python](https://github.com/nvidia-riva/tutorials/blob/stable/asr-python-basics.ipynb). <br>

For more information about Riva, refer to the [Riva developer documentation](https://developer.nvidia.com/riva).

## NeMo (Neural Modules)
[NVIDIA NeMo](https://developer.nvidia.com/nvidia-nemo) is an open-source framework for building, training, and fine-tuning GPU-accelerated speech AI and natural language understanding (NLU) models with a simple Python interface. To set up NeMo, follow the instructions on the [NeMo GitHub page](https://github.com/NVIDIA/NeMo).

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""

# Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install "matplotlib>=3.3.2"

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9674 sha256=b1227f053186ef1781a3087061f8c967d64bdb9a18918a2f899c2f141051c2b8
  Stored in directory: /root/.cache/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
libsndfile1 is already the newest version (1.0.28-4ubuntu0.18.04.2).
ffmpeg is already the newest version (7:3.4.11-0ubuntu0.1).
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'apt autoremove' to rem


---
## Fine-tuning ASR model using NeMo

### Download Data

In this tutorial we will use the popular AN4 dataset. Let's download it.

In [None]:
! wget https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz  # for the original source, please visit http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz

--2023-01-11 21:02:20--  https://dldata-public.s3.us-east-2.amazonaws.com/an4_sphere.tar.gz
Resolving dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)... 52.219.104.243
Connecting to dldata-public.s3.us-east-2.amazonaws.com (dldata-public.s3.us-east-2.amazonaws.com)|52.219.104.243|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64327561 (61M) [application/x-gzip]
Saving to: ‘an4_sphere.tar.gz’


2023-01-11 21:02:27 (9.75 MB/s) - ‘an4_sphere.tar.gz’ saved [64327561/64327561]



After downloading, untar the dataset and move it to the correct directory.

In [None]:
%env DATA_DIR=.
! tar -xvf an4_sphere.tar.gz 
! mv an4 $DATA_DIR

env: DATA_DIR=.
an4/
an4/README
an4/etc/
an4/etc/an4_test.fileids
an4/etc/an4.ug.lm
an4/etc/an4.ug.lm.DMP
an4/etc/an4_train.fileids
an4/etc/an4_train.transcription
an4/etc/an4_test.transcription
an4/etc/an4.dic
an4/etc/an4.phone
an4/etc/an4.filler
an4/wav/
an4/wav/an4_clstk/
an4/wav/an4_clstk/fash/
an4/wav/an4_clstk/fash/an251-fash-b.sph
an4/wav/an4_clstk/fash/an253-fash-b.sph
an4/wav/an4_clstk/fash/an254-fash-b.sph
an4/wav/an4_clstk/fash/an255-fash-b.sph
an4/wav/an4_clstk/fash/cen1-fash-b.sph
an4/wav/an4_clstk/fash/cen2-fash-b.sph
an4/wav/an4_clstk/fash/cen4-fash-b.sph
an4/wav/an4_clstk/fash/cen5-fash-b.sph
an4/wav/an4_clstk/fash/cen7-fash-b.sph
an4/wav/an4_clstk/fbbh/
an4/wav/an4_clstk/fbbh/an86-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an87-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an88-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an89-fbbh-b.sph
an4/wav/an4_clstk/fbbh/an90-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen1-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen2-fbbh-b.sph
an4/wav/an4_clstk/fbbh/cen3-fbbh-b.sph
an4/wav/a

### Pre-Processing

This step converts the `.sph` (SPHERE) audio files into `.wav` files and builds separate training and testing sets from the AN4 transcript files. It also generates a manifest ("metadata") file for each set, which the data loader consumes during training and testing.
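
To make the manifest format concrete, here is a minimal sketch of how one AN4 transcript line maps to one manifest entry. It mirrors the string handling in the cell that follows, but `an4_line_to_entry` is a hypothetical helper, the transcript text is made up for illustration, and the duration is a placeholder rather than a measured value.

```python
import json


def an4_line_to_entry(line, wav_dir, duration):
    """Turn one AN4 transcript line into a manifest entry (hypothetical helper)."""
    # Transcript lines look like: <s> transcript </s> (fileID)
    open_paren = line.rfind("(")
    transcript = line[:open_paren].replace("<s>", "").replace("</s>", "")
    transcript = transcript.strip().lower()
    file_id = line[open_paren + 1:].strip().rstrip(")")
    return {
        "audio_filepath": f"{wav_dir}/{file_id}.wav",
        "duration": duration,  # placeholder; the tutorial measures this with librosa
        "text": transcript,
    }


# The transcript below is invented; only the line layout matches AN4.
entry = an4_line_to_entry(
    "<s> GO FORWARD TEN METERS </s> (an251-fash-b)\n", "an4_converted/wavs", 2.5
)
print(json.dumps(entry))
```

Each entry is written to the manifest as one JSON object per line, which is the layout the NeMo data loader reads.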

In [None]:
import json, librosa, os, glob
import subprocess


DATA_DIR = os.environ["DATA_DIR"]
source_data_dir = f"{DATA_DIR}/an4"
target_data_dir = f"{DATA_DIR}/an4_converted"

def an4_build_manifest(transcripts_path, manifest_path, target_wavs_dir):
    """Build an AN4 manifest from a given transcript file."""
    with open(transcripts_path, 'r') as fin:
        with open(manifest_path, 'w') as fout:
            for line in fin:
                # Lines look like this:
                # <s> transcript </s> (fileID)
                transcript = line[: line.find('(') - 1].lower()
                transcript = transcript.replace('<s>', '').replace('</s>', '')
                transcript = transcript.strip()

                file_id = line[line.find('(') + 1 : -2]  # e.g. "cen4-fash-b"
                audio_path = os.path.join(target_wavs_dir, file_id + '.wav')

                duration = librosa.core.get_duration(filename=audio_path)

                # Write the metadata to the manifest
                metadata = {"audio_filepath": audio_path, "duration": duration, "text": transcript}
                json.dump(metadata, fout)
                fout.write('\n')

"""Process AN4 dataset."""
if not os.path.exists(source_data_dir):
    link = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    raise ValueError(
        f"Data not found at `{source_data_dir}`. Please download the AN4 dataset from `{link}` "
        f"and extract it into the folder specified by the `source_data_dir` argument."
    )

# Conversion from SPH files to WAV files
sph_list = glob.glob(os.path.join(source_data_dir, '**/*.sph'), recursive=True)
target_wavs_dir = os.path.join(target_data_dir, 'wavs')
if not os.path.exists(target_wavs_dir):
    print(f"Creating directories for {target_wavs_dir}.")
    os.makedirs(os.path.join(target_data_dir, 'wavs'))

for sph_path in sph_list:
    wav_path = os.path.join(target_wavs_dir, os.path.splitext(os.path.basename(sph_path))[0] + '.wav')
    cmd = ["sox", sph_path, wav_path]
    subprocess.run(cmd, check=True)

# Build AN4 manifests
train_transcripts = os.path.join(source_data_dir, 'etc/an4_train.transcription')
train_manifest = os.path.join(target_data_dir, 'train_manifest.json')
an4_build_manifest(train_transcripts, train_manifest, target_wavs_dir)

test_transcripts = os.path.join(source_data_dir, 'etc/an4_test.transcription')
test_manifest = os.path.join(target_data_dir, 'test_manifest.json')
an4_build_manifest(test_transcripts, test_manifest, target_wavs_dir)


Creating directories for ./an4_converted/wavs.


Let's listen to a sample audio file.

In [None]:
# change path of the file here
import os
import IPython.display as ipd
path = os.environ["DATA_DIR"] + '/an4_converted/wavs/an268-mbmg-b.wav'
ipd.Audio(path)
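
Besides listening, it can help to confirm the format of a converted file. The sketch below uses Python's standard `wave` module and assumes that `sox` wrote plain PCM WAV output; `describe_wav` is a hypothetical helper, not part of NeMo or Riva.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, channels, duration


# Example call, using the sample file from the cell above (adjust to your setup):
# rate, channels, duration = describe_wav("./an4_converted/wavs/an268-mbmg-b.wav")
# print(rate, channels, round(duration, 2))
```

Pretrained speech models are typically trained on 16 kHz mono audio, so an unexpected sample rate or channel count here is worth investigating before training.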

### Training 

#### Create Tokenizer

Before training, we need to create a tokenizer because this ASR model uses word-piece (subword) encoding. Character-based models do not need this step, since their vocabulary consists of single characters only. We can use NeMo's `process_asr_text_tokenizer.py` script to build the tokenizer and generate the subword vocabulary used in training. The vocabulary size (`vocab_size`) should match the vocabulary size of the ASR model. We will clone the NeMo repository from GitHub to use the scripts and examples available there.
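
Once the tokenizer cell that follows has run, you may want to check how many tokens were actually produced. SentencePiece writes a `.vocab` file next to its model file, with one token per line. The sketch below simply counts those lines; `count_vocab_entries` is a hypothetical helper, and the path in the comment is an assumption based on the `--data_root` and tokenizer options used in this tutorial.

```python
from pathlib import Path


def count_vocab_entries(vocab_path):
    """Count the tokens listed in a SentencePiece `.vocab` file (one per line)."""
    with Path(vocab_path).open(encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Assumed output location for this tutorial's settings:
# print(count_vocab_entries("./an4/tokenizer_spe_unigram_v128/tokenizer.vocab"))
```

The tokenizer log shows `--hard_vocab_limit=false`, which makes `vocab_size` a soft limit, so the count can come out smaller than the requested 128 on a small corpus such as AN4.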


In [9]:
# clone NeMo locally
! git clone https://github.com/NVIDIA/NeMo

# create the tokenizer
!python NeMo/scripts/tokenizers/process_asr_text_tokenizer.py \
         --manifest=$DATA_DIR/an4_converted/train_manifest.json \
         --data_root=$DATA_DIR/an4 \
         --vocab_size=128 \
         --tokenizer=spe \
         --spe_type=unigram

fatal: destination path 'NeMo' already exists and is not an empty directory.
[NeMo W 2023-01-11 21:07:37 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo I 2023-01-11 21:07:38 sentencepiece_tokenizer:315] Processing ./an4/text_corpus/document.txt and store at ./an4/tokenizer_spe_unigram_v128
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=./an4/text_corpus/document.txt --model_prefix=./an4/tokenizer_spe_unigram_v128/tokenizer --vocab_size=128 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=unigram --character_coverage=1.0 --bos_id=-1 --eos_id=-1 --normalization_rule_name=nmt_nfkc_cf
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: ./an4/text_corpus/document.txt
  input_format: 
  model_prefix: ./an4/tokenizer_spe_unigram_v128/tokenizer
  model_type: UNIGRAM
  vocab_size: 128
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 0
  shuffle_input_

#### Training Conformer-CTC

NeMo uses configuration files to set the training parameters. You can change them either by editing the configuration file directly or by passing overrides on the command line. For example, to change both the number of epochs and the learning rate, add `trainer.max_epochs=100` and `model.optim.lr=0.02` to the training command.
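
NeMo's example scripts take these overrides through Hydra, where each dotted key names a path into the nested configuration. The sketch below illustrates that idea with plain dictionaries. It is not Hydra's implementation, `apply_overrides` is a hypothetical helper, and the `base` dictionary holds only a few keys copied from the training command in this tutorial.

```python
import copy


def _parse_scalar(text):
    """Interpret a value as int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nesting, creating intermediate levels if missing
            node = node.setdefault(name, {})
        node[leaf] = _parse_scalar(raw_value)
    return result


base = {"trainer": {"max_epochs": 1}, "model": {"optim": {"name": "adamw", "lr": 1.0}}}
updated = apply_overrides(base, ["trainer.max_epochs=100", "model.optim.lr=0.02"])
print(updated["trainer"]["max_epochs"], updated["model"]["optim"]["lr"])
```

The optimizer settings sit under `model.optim`, which is why the learning rate override carries the `model.` prefix while the epoch count lives under `trainer`.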

The following sample command uses the `speech_to_text_ctc_bpe.py` script in the examples folder to fine-tune a Conformer-CTC ASR model for one epoch, starting from the pretrained `stt_en_conformer_ctc_large` checkpoint. For other ASR models such as Citrinet, you can find the appropriate config files under `NeMo/examples/asr/conf/`.


In [14]:
# To fully train the model from scratch, you'll need to increase trainer.max_epochs from 1
# Empirical evidence suggests that around 200 epochs should suffice
!python ./NeMo/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
    --config-path=../conf/conformer/ --config-name=conformer_ctc_bpe \
    +init_from_pretrained_model=stt_en_conformer_ctc_large \
    model.train_ds.manifest_filepath=$DATA_DIR/an4_converted/train_manifest.json \
    model.validation_ds.manifest_filepath=$DATA_DIR/an4_converted/test_manifest.json \
    model.tokenizer.dir=$DATA_DIR/an4/tokenizer_spe_unigram_v128 \
    trainer.devices=1 \
    trainer.max_epochs=1 \
    model.optim.name="adamw" \
    model.optim.lr=1.0 \
    model.optim.weight_decay=0.001 \
    model.optim.sched.warmup_steps=2000 \
    exp_manager.exp_dir=./checkpoints/


[NeMo W 2023-01-11 21:15:56 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-01-11 21:15:57 experimental:27] Module <class 'nemo.collections.asr.models.audio_to_audio_model.AudioToAudioModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-01-11 21:15:59 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    
[NeMo W 2023-01-11 21:15:59 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.BaseAudioDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-01-11 21:15:59 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.AudioToTargetDataset'> is experimental, not ready for production and is not fully supported. Use at your own ris

In [15]:
!ls nemo_experiments/Conformer-CTC-BPE/2023-01-11_21-07-50


checkpoints
cmd-args.log
events.out.tfevents.1673471281.ce9ee7b0019d
hparams.yaml
lightning_logs.txt
nemo_error_log.txt
nemo_log_globalrank-0_localrank-0.txt


### ASR Evaluation

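Now that we have a trained model, we need to check how well it performs. The command below passes `pretrained_name=stt_en_conformer_ctc_large`, so it evaluates the original pretrained checkpoint, which gives a useful baseline. To evaluate your fine-tuned model instead, replace that argument with `model_path=` pointing to your saved `.nemo` file.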
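
ASR quality is commonly summarized as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. The pure-Python sketch below shows that calculation on a made-up sentence pair. `word_error_rate` is a hypothetical helper for illustration, not the function NeMo uses internally.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words
print(word_error_rate("go forward ten meters", "go forward then meters"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the prediction is much longer than the reference.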

In [None]:
! python ./NeMo/examples/asr/speech_to_text_eval.py \
    pretrained_name=stt_en_conformer_ctc_large \
    dataset_manifest=$DATA_DIR/an4_converted/test_manifest.json \
    output_filename=./test_manifest_predictions.json \
    batch_size=32 \
    amp=True


[NeMo W 2023-01-11 20:50:59 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-01-11 20:50:59 experimental:27] Module <class 'nemo.collections.asr.models.audio_to_audio_model.AudioToAudioModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-01-11 20:51:01 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    
[NeMo W 2023-01-11 20:51:02 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.BaseAudioDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-01-11 20:51:02 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.AudioToTargetDataset'> is experimental, not ready for production and is not fully supported. Use at your own ris

### ASR Model Export

With NeMo, you can also export your model in a format that can be deployed with NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs. The same command for exporting to ONNX can be used here. The only small variation is the configuration for `export_format` in the spec file.

#### Install the packages

We will now install NeMo and `nemo2riva`. `nemo2riva` is available on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/resources/riva_quickstart/files?version=2.8.1). Make sure you install the NGC CLI before running the following commands.

In [None]:
!pip install nvidia-pyindex
!pip install nemo_toolkit['all']
!ngc registry resource download-version "nvidia/riva/riva_quickstart:2.8.1"
!pip install "riva_quickstart_v2.8.1/nemo2riva-2.8.1-py3-none-any.whl"
!pip install protobuf==3.20.0

#### Convert to Riva

Convert the `.nemo` model to the `.riva` format. We will use the encryption key `nemotoriva`; change it when generating `.riva` models for production.
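
The cell that follows derives the output name by slicing the last five characters off the `.nemo` path. A slightly more defensive variant is sketched below with `pathlib`; `riva_path_for` is a hypothetical helper and the example file name is a placeholder, not a file produced by this tutorial.

```python
from pathlib import Path


def riva_path_for(nemo_path):
    """Swap a trailing `.nemo` extension for `.riva`, rejecting other inputs."""
    path = Path(nemo_path)
    if path.suffix != ".nemo":
        raise ValueError(f"expected a .nemo file, got: {nemo_path}")
    return str(path.with_suffix(".riva"))


# Placeholder file name for illustration only
print(riva_path_for("checkpoints/my_model.nemo"))
```

Checking the suffix first avoids silently producing a mangled output name if the input path does not end in `.nemo`.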

In [None]:
nemo_file_path = FIXME
riva_file_path = nemo_file_path[:-5]+".riva"
!nemo2riva --out {riva_file_path} --key=nemotoriva {nemo_file_path}


## More Resources
You can find more information on working with NeMo's ASR models in the tutorials here:

[NeMo Tutorials](https://github.com/NVIDIA/NeMo/tree/main/tutorials/asr)

## What's Next?

You can use NeMo to build custom models for your own applications and deploy them to NVIDIA Riva! To try deploying these models to Riva, use `text-to-speech-deployment.ipynb` as a quick sample.