# NeMo ASR Tutorial

## Preliminary Setup
This notebook was designed to run in the Docker container defined in the root of the NeMo git repo.  To set up this environment, first **clone NeMo**:
```
git clone https://github.com/NVIDIA/NeMo.git
```
Next, **download a dataset** for training.  We like to keep this *outside* the NeMo tree to avoid a huge Docker build context.  In this example, we use `$HOME/nemo-data/LibriSpeech`.  Adjust paths as needed.
```
cd NeMo
mkdir -p $HOME/nemo-data/LibriSpeech
python scripts/get_librispeech_data.py \
    --data_root=$HOME/nemo-data/LibriSpeech \
    --data_set=dev_clean,train_clean_100
# This requires 26GB of disk space and takes a while, go grab a coffee...
```
Now **build the Docker image**:
```
docker build -t nemo-demo .
```
And then **run the container**, mapping in the data directory used above.  For example:
```
docker run --runtime=nvidia --rm -it --ipc=host \
    -v $HOME/nemo-data/LibriSpeech:/workspace/nemo/data \
    nemo-demo
```
Once inside the running Docker container, use the `start-jupyter.sh` script to start JupyterLab.  Then, using the JupyterLab file browser, open `examples/asr/NeMo-ASR-Tutorial-blog.ipynb`.


## Introduction
This Automatic Speech Recognition (ASR) tutorial focuses on the Jasper model [[2]](#ref2). Jasper is a CTC-based [[1]](#ref1) end-to-end model. The model is called “end-to-end” because it transcribes speech samples without any additional alignment information; the CTC loss lets the model learn an alignment between audio and text during training. A CTC-ASR training pipeline consists of the following blocks:
1. audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)
2. neural acoustic model (which, given the input features, predicts a probability distribution P_t(c) over vocabulary characters c at each time step t)
3. CTC loss function

![CTC training pipeline](https://raw.githubusercontent.com/NVIDIA/NeMo/master/docs/sources/source/asr/jasper.png)
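
To make the per-time-step distribution P_t(c) concrete, here is a minimal sketch of greedy CTC decoding: take the most likely symbol at each time step, merge consecutive repeats, then drop the blank symbol. The toy vocabulary, the position of the blank, and the probabilities are all made up for illustration and are not NeMo's actual label set or decoder implementation.

```python
# Hypothetical toy vocabulary; index 0 is used here for the CTC blank symbol.
BLANK = "_"
VOCAB = [BLANK, "a", "c", "t"]


def greedy_ctc_decode(probs, vocab=VOCAB, blank=BLANK):
    """Pick the most likely symbol per time step, merge repeats, drop blanks."""
    best_path = [vocab[max(range(len(p)), key=p.__getitem__)] for p in probs]
    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous step
        # and is not the blank.
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


# Six time steps of made-up probabilities P_t(c) over VOCAB.
toy_probs = [
    [0.1, 0.1, 0.7, 0.1],  # c
    [0.1, 0.1, 0.7, 0.1],  # c again (merged with the previous step)
    [0.6, 0.2, 0.1, 0.1],  # blank
    [0.1, 0.7, 0.1, 0.1],  # a
    [0.1, 0.1, 0.1, 0.7],  # t
    [0.7, 0.1, 0.1, 0.1],  # blank
]
print(greedy_ctc_decode(toy_probs))
```

The blank is what allows a genuinely doubled letter to survive the merge step: the path `a, _, a` decodes to `aa`, whereas `a, a` decodes to `a`.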

## Getting the data
We will be using the open-source LibriSpeech [[3]](#ref3) dataset.  A script to download and prepare the LibriSpeech dataset is included in the `<nemo_git_repo_root>/scripts` directory.  For example:
```
$ mkdir /path/to/data
$ python <nemo_git_repo_root>/scripts/get_librispeech_data.py \
    --data_root=/path/to/data \
    --data_set=dev_clean,train_clean_100
```


> **A word of caution** - *LibriSpeech is a large dataset.  Using
> `--data_set=dev_clean,train_clean_100` requires at least 26GB
> of disk space, and `--data_set=ALL` at least 110GB.  In the example
> above, the `/path/to/data` directory should preferably be outside the
> nemo git root directory to avoid a huge Docker build context.
> It is also best to run the download ahead of time, outside the notebook,
> and map the data into your running Docker container via
> `docker run -v /path/to/data:/data ...`.*


After download and conversion, your data folder should contain two JSON files:
* `dev_clean.json`
* `train_clean_100.json`

In this tutorial we will use `train_clean_100.json` for training and `dev_clean.json` for evaluation. Each line in a JSON file describes one sample: `audio_filepath` contains the path to the wav file, `duration` is its duration in seconds, and `text` is its transcript.
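
As a quick illustration of this one-JSON-object-per-line layout, the following sketch parses a manifest and sums the audio duration. The two entries are made up and held in memory so the snippet runs without the dataset; the helper name `read_manifest` is hypothetical and not part of NeMo.

```python
import io
import json


def read_manifest(lines):
    """Parse a manifest in which every non-empty line is one JSON object."""
    return [json.loads(line) for line in lines if line.strip()]


# Two made-up entries standing in for a real manifest file.
sample_manifest = io.StringIO(
    '{"audio_filepath": "/data/a.wav", "duration": 2.5, "text": "hello world"}\n'
    '{"audio_filepath": "/data/b.wav", "duration": 4.0, "text": "good morning"}\n'
)
samples = read_manifest(sample_manifest)
total_seconds = sum(sample["duration"] for sample in samples)
print(len(samples), "samples,", total_seconds, "seconds of audio")
```

For a real manifest, pass an open file object instead, for example `read_manifest(open("/workspace/nemo/data/dev_clean.json"))`.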

## Training
We will train a small model from the Jasper family [[2]](#ref2). Jasper (“Just Another SPeech Recognizer”) is a deep time delay neural network (TDNN) composed of blocks of 1D-convolutional layers. Models in the Jasper family are denoted Jasper_[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within a block. Each sub-block contains a 1D convolution, batch normalization, ReLU, and dropout:

![Jasper Model](https://raw.githubusercontent.com/NVIDIA/NeMo/master/docs/sources/source/asr/jasper.png)
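
The BxR naming can be unpacked with a few lines of Python. This is only a sketch of the arithmetic implied by the notation: the helper names are hypothetical, and the count covers just the sub-blocks inside the B blocks, not any other layers the full model may contain.

```python
import re


def parse_jasper_name(name):
    """Split a name such as "12x1" into (blocks B, sub-blocks per block R)."""
    match = re.fullmatch(r"(\d+)x(\d+)", name)
    if match is None:
        raise ValueError(f"expected a name like '12x1', got {name!r}")
    return int(match.group(1)), int(match.group(2))


def count_sub_blocks(name):
    """Total sub-blocks inside the B blocks (other layers are not counted)."""
    blocks, repeats = parse_jasper_name(name)
    return blocks * repeats


# 12x1 is the model trained in this tutorial; 15x5 is the larger
# configuration used in the multi-GPU section.
for name in ("12x1", "15x5"):
    print(name, "->", count_sub_blocks(name), "sub-blocks")
```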

In this tutorial we will use the [12x1] model with separable convolutions. In the code that follows, we run both training (on `train_clean_100.json`) and evaluation (on `dev_clean.json`) on a single GPU.

### Setup

> **NOTE** - *The following code blocks assume you are running in a Docker
> container based on the example `<nemo git root>/Dockerfile`, with
> `/workspace/nemo` being the path to `<nemo git root>` inside the
> container.  We also assume you have mounted the top level of the
> LibriSpeech dataset in `/workspace/nemo/data` in the container.*
>    
> *If you are running with a different configuration, please adjust
> paths accordingly in the code blocks below.*

    
First we import the NeMo core and ASR packages and define our datasets and model.

In [None]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection
import nemo_asr

# Path to our training manifest.  Here we assume the data directory is mounted at /workspace/nemo/data
train_dataset = "/workspace/nemo/data/train_clean_100.json"
# Path to our validation manifest
eval_datasets = "/workspace/nemo/data/dev_clean.json"

# To read the Jasper Model definition yaml
from ruamel.yaml import YAML

# Here we will be using separable convolutions
# with 12 blocks (k=12 repeated once r=1 from the picture above)
yaml = YAML(typ="safe")
# Change path to ./configs/jasper12x1SEP.yaml once NeMo-ASR-Tutorial.ipynb is moved into the examples folder.
with open("/workspace/nemo/examples/asr/configs/jasper12x1SEP.yaml") as f:
    jasper_model_definition = yaml.load(f)
labels = jasper_model_definition['labels']

Now we can create a NeuralFactory to manage training and begin instantiating the Neural Modules required for the Jasper model. 

In [None]:
# Create a Neural Factory
# It creates log files and tensorboard writers for us among other functions
nf = nemo.core.NeuralModuleFactory(
    log_dir='jasper12x1SEP',
    create_tb_writer=True)
tb_writer = nf.tb_writer
logger = nf.logger

# Instantiate neural modules
data_layer = nemo_asr.AudioToTextDataLayer(
    manifest_filepath=train_dataset,
    labels=labels, batch_size=32)

data_layer_val = nemo_asr.AudioToTextDataLayer(
    manifest_filepath=eval_datasets,
    labels=labels, batch_size=32, shuffle=False)

...and audio preprocessors and encoders/decoders for the Jasper model.

In [None]:
data_preprocessor = nemo_asr.AudioPreprocessing()
spec_augment = nemo_asr.SpectrogramAugmentation(rect_masks=5)

jasper_encoder = nemo_asr.JasperEncoder(
    feat_in=64,
    **jasper_model_definition['JasperEncoder'])
jasper_decoder = nemo_asr.JasperDecoderForCTC(
    feat_in=1024, num_classes=len(labels))
ctc_loss = nemo_asr.CTCLossNM(num_classes=len(labels))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

Now we define the DAG for training the Jasper model.

In [None]:
# Training DAG (Model)
audio_signal, audio_signal_len, transcript, transcript_len = data_layer()
processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal, length=audio_signal_len)
aug_signal = spec_augment(input_spec=processed_signal)
encoded, encoded_len = jasper_encoder(
    audio_signal=aug_signal, length=processed_signal_len)
log_probs = jasper_decoder(encoder_output=encoded)
predictions = greedy_decoder(log_probs=log_probs)
loss = ctc_loss(
    log_probs=log_probs, targets=transcript,
    input_length=encoded_len, target_length=transcript_len)

And similarly for the validation model.

In [None]:
# Validation DAG (Model)
# We need to instantiate additional data layer neural module
# for validation data
audio_signal_v, audio_signal_len_v, transcript_v, transcript_len_v = data_layer_val()
processed_signal_v, processed_signal_len_v = data_preprocessor(
    input_signal=audio_signal_v, length=audio_signal_len_v)
# Note that we are not using data-augmentation in validation DAG
encoded_v, encoded_len_v = jasper_encoder(
    audio_signal=processed_signal_v, length=processed_signal_len_v)
log_probs_v = jasper_decoder(encoder_output=encoded_v)
predictions_v = greedy_decoder(log_probs=log_probs_v)
loss_v = ctc_loss(
    log_probs=log_probs_v, targets=transcript_v,
    input_length=encoded_len_v, target_length=transcript_len_v)


We also need some helper functions to monitor training.

In [None]:
# These helper functions are needed to print and compute various metrics
# such as word error rate and log them into tensorboard
# they are domain-specific and are provided by NeMo's collections
from nemo_asr.helpers import monitor_asr_train_progress, \
    process_evaluation_batch, process_evaluation_epoch

from functools import partial
# Callback to track loss and print predictions during training
train_callback = nemo.core.SimpleLossLoggerCallback(
    tb_writer=tb_writer,
    # Define the tensors that you want SimpleLossLoggerCallback to
    # operate on
    # Here we want to print our loss, and our word error rate which
    # is a function of our predictions, transcript, and transcript_len
    tensors=[loss, predictions, transcript, transcript_len],
    # To print logs to screen, define a print_func
    print_func=partial(
        monitor_asr_train_progress,
        labels=labels,
        logger=logger
    ))

saver_callback = nemo.core.CheckpointCallback(
    folder="./",
    # Set how often we want to save checkpoints
    step_freq=100)

# PRO TIP: while you can only have 1 train DAG, you can have as many
# val DAGs and callbacks as you want. This is useful if you want to monitor
# progress on more than one val dataset at once (say LibriSpeech dev clean
# and dev other)
eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[loss_v, predictions_v, transcript_v, transcript_len_v],
    # how to process evaluation batch - e.g. compute WER
    user_iter_callback=partial(
        process_evaluation_batch,
        labels=labels
        ),
    # how to aggregate statistics (e.g. WER) for the evaluation epoch
    user_epochs_done_callback=partial(
        process_evaluation_epoch, tag="DEV-CLEAN", logger=logger
        ),
    eval_step=500,
    tb_writer=tb_writer)
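
The helper functions above come from NeMo. The following is not their implementation but a minimal stand-alone sketch of how word error rate (WER) is commonly defined: the word-level edit distance between hypothesis and reference, divided by the number of reference words. The function name `word_error_rate` is made up for this illustration.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous_row[j] is the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current_row.append(min(
                previous_row[j] + 1,                           # missing word
                current_row[j - 1] + 1,                        # extra word
                previous_row[j - 1] + (ref_word != hyp_word),  # substitution
            ))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


print(word_error_rate("the cat sat", "the cat sat on the mat"))
```

Because extra words in the hypothesis count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.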


> **NOTE** - *One potential gotcha is the audio_filepath encoded
> in the `train_clean_100.json` and `dev_clean.json` manifest files.
> These will contain the path used when originally executing the
> `get_librispeech_data.py` script which is likely different from
> the path mapped into the docker container.*
>
> *In our example Dockerfile, we have `WORKDIR` set to the nemo git
> root in `/workspace/nemo`.  When running the container, we map the
> data directory into `/workspace/nemo/data`.*
>
> *Assuming we used `get_librispeech_data.py --data_root=/path/to/data`,
> we need to replace the existing /path/to/data prefix with the
> container path as follows:*


In [None]:
# Let's look at the first entry in the train_clean_100.json file to see the path:
!head -n1 /workspace/nemo/data/train_clean_100.json
# Now replace this path with the path inside the container:
!sed -i 's,/path/to/data,/workspace/nemo/data,g' /workspace/nemo/data/train_clean_100.json
!sed -i 's,/path/to/data,/workspace/nemo/data,g' /workspace/nemo/data/dev_clean.json
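
The same prefix replacement can also be sketched in Python, which only touches the `audio_filepath` field rather than every occurrence of the string in the file. The function name `rewrite_manifest_prefix` is hypothetical, and the demo runs on a throwaway manifest with one made-up entry so that it does not modify your data.

```python
import json
import os
import tempfile


def rewrite_manifest_prefix(manifest_path, old_prefix, new_prefix):
    """Replace old_prefix at the start of each audio_filepath, in place."""
    with open(manifest_path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    for entry in entries:
        path = entry["audio_filepath"]
        if path.startswith(old_prefix):
            entry["audio_filepath"] = new_prefix + path[len(old_prefix):]
    with open(manifest_path, "w") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
    return entries


# Demonstrate on a temporary manifest containing a single made-up entry.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_manifest = os.path.join(tmp_dir, "demo.json")
    with open(demo_manifest, "w") as f:
        f.write(json.dumps({"audio_filepath": "/path/to/data/x.wav",
                            "duration": 1.0, "text": "hi"}) + "\n")
    fixed = rewrite_manifest_prefix(
        demo_manifest, "/path/to/data", "/workspace/nemo/data")
    print(fixed[0]["audio_filepath"])
```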

Now we can train the model! (Training for 50 epochs takes approximately 4 hours on an RTX 8000.)

In [None]:
# Run training using your Neural Factory
# Once this "action" is called data starts flowing along train and eval DAGs
# and computations start to happen
nf.train(
    # Specify the loss to optimize for
    tensors_to_optimize=[loss],
    # Specify which callbacks you want to run
    callbacks=[train_callback, eval_callback, saver_callback],
    # Specify what optimizer to use
    optimizer="novograd",
    # Specify optimizer parameters such as num_epochs and lr
    optimization_params={
        "num_epochs": 50, "lr": 0.02, "weight_decay": 1e-4
        }
    )

To improve word error rates:
* Train longer
* Train on more data
* Use a larger model
* Train on several GPUs and use mixed precision (on NVIDIA Volta and Turing GPUs)
* Start with pre-trained checkpoints

## Mixed Precision Training
Mixed precision and distributed training in NeMo are based on <a href="https://github.com/NVIDIA/apex">NVIDIA’s APEX library</a>. APEX is installed in NVIDIA's NGC PyTorch container, and the example Dockerfile shows how to update it.

> **Note** -  _Because mixed precision requires Tensor Cores, it
> only works on NVIDIA Volta- and Turing-based GPUs._

To train with mixed precision, all you need to do is set the `optimization_level` parameter of `nemo.core.NeuralModuleFactory` to `nemo.core.Optimization.mxprO1`.  For example:

In [None]:
local_rank = None
nf = nemo.core.NeuralModuleFactory(
    backend=nemo.core.Backend.PyTorch,
    optimization_level=nemo.core.Optimization.mxprO1,
    local_rank=local_rank,
    placement=nemo.core.DeviceType.AllGpu,
    cudnn_benchmark=True)
# Here we define some additional features of the NeuralModuleFactory that
# enable multi-GPU training.  We'll discuss these more below.

In [None]:
# Note, if you completed the training steps above, you'll have
# checkpoints saved in the working directory.  We'll need to
# remove these to restart training.
! rm JasperEncoder-STEP-*

In [None]:
# Now we can re-run training, this time only a few epochs to test.
nf.train(
    # Specify the loss to optimize for
    tensors_to_optimize=[loss],
    # Specify which callbacks you want to run
    callbacks=[train_callback, eval_callback, saver_callback],
    # Specify what optimizer to use
    optimizer="novograd",
    # Specify optimizer parameters such as num_epochs and lr
    optimization_params={
        "num_epochs": 5, "lr": 0.02, "weight_decay": 1e-4
        }
    )

In the above training run, you will notice some additional output describing the optimization level used:
```
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
```
You will also see output signaling Gradient overflow:
```
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 16384.0
```
This is expected behavior, and an example of how NVIDIA's APEX extension tracks gradients and scales the loss so that gradients remain representable in mixed precision.
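
The idea behind the "Skipping step" message can be illustrated with a toy dynamic loss scaler. This is a simplified sketch and not APEX's actual code: the class name, the `growth_interval` parameter, and the halve/double policy are assumptions made for illustration. When any gradient overflows, the step is skipped and the scale is reduced; after a run of clean steps, the scale is raised again.

```python
import math


class ToyDynamicLossScaler:
    """Illustrative loss scaler: halve on overflow, double after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run, False if skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Gradients are unusable: skip this step and shrink the scale.
            self.scale /= 2.0
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.growth_interval == 0:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= 2.0
        return True


# A short growth interval keeps the demo small.
scaler = ToyDynamicLossScaler(init_scale=32768.0, growth_interval=2)
for grads in ([0.5, float("inf")], [0.5, 0.25], [0.1, 0.2]):
    stepped = scaler.update(grads)
    print("step taken:", stepped, "loss scale:", scaler.scale)
```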

## Multi-GPU Training
Enabling multi-GPU training with NeMo is easy:
1. First set placement to `nemo.core.DeviceType.AllGpu` in NeuralModuleFactory and in your Neural Modules
2. Have your script accept a `local_rank` argument and do not set it yourself: `parser.add_argument("--local_rank", default=None, type=int)`
3. Use the `torch.distributed.launch` package to run your script.

An example of this can be seen in the `<nemo git root>/examples/asr/jasper.py` script.
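
A minimal sketch of step 2, assuming a hypothetical training script: the launcher supplies `--local_rank` to each spawned process, so the script only has to declare the argument and pass the parsed value on. The `--batch_size` option and the explicit argument lists are included only so the snippet runs stand-alone; this is not a copy of `jasper.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of a hypothetical multi-GPU training script."""
    parser = argparse.ArgumentParser(description="Toy multi-GPU training script")
    # Filled in by the launcher for each process; left as None for
    # single-process runs.
    parser.add_argument("--local_rank", default=None, type=int)
    parser.add_argument("--batch_size", default=32, type=int)
    return parser.parse_args(argv)


# Simulate what one spawned process would receive on its command line.
args = parse_args(["--local_rank=1", "--batch_size=64"])
print("local_rank:", args.local_rank, "batch_size:", args.batch_size)

# Without the launcher, local_rank keeps its default of None.
single_process_args = parse_args([])
print("local_rank:", single_process_args.local_rank)
```

The parsed value would then be passed on unchanged, as in the `local_rank=local_rank` argument to `nemo.core.NeuralModuleFactory` shown in the mixed precision section.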

Unfortunately, we can't launch this interactively in a notebook because of the way `torch.distributed` spawns Python processes.  Instead, we can use a JupyterLab Terminal (File -> New -> Terminal) to launch multi-GPU training directly.  In the example below, `--nproc_per_node` should be set to the number of GPUs on the node.  The arguments to the `jasper.py` script mirror those used in the example above; the script uses them to define the NeuralModuleFactory and Neural Modules in the same way, with multiple processes spawned to run on multiple GPUs.  Here we use two GPUs to train the Jasper 15x5 model on the same LibriSpeech train_clean_100 dataset:

```
python -m torch.distributed.launch --nproc_per_node=2 \
    /workspace/nemo/examples/asr/jasper.py \
    --batch_size=64 \
    --num_epochs=100 \
    --lr=0.015 \
    --warmup_steps=8000 \
    --weight_decay=0.001 \
    --train_dataset=/workspace/nemo/data/train_clean_100.json \
    --eval_datasets=/workspace/nemo/data/dev_clean.json \
    --model_config=/workspace/nemo/examples/asr/configs/jasper15x5SEP.yaml \
    --exp_name=MultiNodeExperiment
```

## References
<a name="ref1"></a>[1] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In _Proceedings of the 23rd international conference on Machine learning_, 369–376. ACM, 2006.

<a name="ref2"></a>[2] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. Jasper: an end-to-end convolutional neural acoustic model. _arXiv preprint arXiv:1904.03288_, 2019.

<a name="ref3"></a>[3] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In _Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on_, 5206–5210. IEEE, 2015.


