# End-To-End Automatic Speech Recognition with NeMo
A basic tutorial on Automatic Speech Recognition (ASR) concepts, introduced with code snippets that use the [NeMo framework](https://github.com/NVIDIA/NeMo).

## What is ASR?

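- ASR, or **Automatic Speech Recognition**, is the task of automatically transcribing spoken language (i.e., speech-to-text).
- Our goal is usually to have a model that minimizes the **Word Error Rate (WER)** metric when transcribing speech input. WER is the number of word substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the number of words in the reference.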
    - Given some audio file (e.g. a WAV file) containing speech, how do we transform this into the corresponding text with as few errors as possible?
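
To make the WER metric concrete, here is a minimal sketch that computes it with a word-level edit distance. The function names `edit_distance` and `simple_wer` are illustrative and are not part of NeMo; NeMo's own helper, used later in this tutorial, may differ in details.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def simple_wer(references, hypotheses):
    """Total word errors divided by total reference words over a corpus."""
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words


# One substituted word out of four reference words.
print(simple_wer(["enter six two four"], ["enter six too four"]))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, because the denominator only counts reference words.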

## Taking a Look at Our Data (AN4)

- The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. 
- It consists of recordings of people spelling out addresses, names, telephone numbers, etc., one letter or number at a time, as well as their corresponding transcripts. 
- We choose to use AN4 for this tutorial because it is relatively small, with 948 training and 130 test utterances, and so it trains quickly.

In [None]:
# Download the AN4 data and convert the .sph files to .wav
def download_an4_data(data_dir):
    import glob
    import os
    import subprocess
    import tarfile
    import wget
    import warnings
    warnings.simplefilter("ignore")

    # Make sure the target directory exists before downloading into it.
    os.makedirs(data_dir, exist_ok=True)

    # Download the dataset. This will take a few moments...
    print("******")
    an4_path = os.path.join(data_dir, 'an4_sphere.tar.gz')
    if not os.path.exists(an4_path):
        an4_url = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
        an4_path = wget.download(an4_url, data_dir)
        print(f"Dataset downloaded at: {an4_path}")
    else:
        print("Tarfile already exists.")

    # Untar and convert .sph to .wav (requires the sox command-line tool)
    with tarfile.open(an4_path) as tar:
        tar.extractall(path=data_dir)

    print("Converting .sph to .wav...")
    sph_list = glob.glob(os.path.join(data_dir, 'an4', '**', '*.sph'), recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        # check=True raises an error if sox fails instead of continuing silently.
        subprocess.run(["sox", sph_path, wav_path], check=True)
    print("Finished conversion.\n******")

In [None]:
# This is where the an4/ directory will be placed.
# Change this if you want the data to be extracted somewhere else.
data_dir = '/workspace/data2'
download_an4_data(data_dir)

In [None]:
#!! sed -n '10,20p' {data_dir}/an4/etc/an4_train.transcription
['<s> C Z D Z W EIGHT </s> (an86-fbbh-b)',
 '<s> ENTER SIX TWO FOUR </s> (an87-fbbh-b)',
 '<s> ERASE O T H F I FIVE ZERO </s> (an88-fbbh-b)',
 '<s> RUBOUT T G J W B SEVENTY NINE FIFTY NINE </s> (an89-fbbh-b)',
 '<s> NO </s> (an90-fbbh-b)',
 '<s> H O W E L L </s> (cen1-fbbh-b)',
 '<s> B E V E R L Y </s> (cen2-fbbh-b)',
 '<s> FIFTY ONE FIFTY SIX </s> (cen3-fbbh-b)',
 '<s> P R I N C E </s> (cen4-fbbh-b)',
 '<s> G I B S O N I A </s> (cen5-fbbh-b)',
 '<s> ONE FIVE OH FOUR FOUR </s> (cen6-fbbh-b)']

### The Jasper Model

- We will be putting together a small [Jasper (Just Another SPeech Recognizer) model](https://arxiv.org/abs/1904.03288).
- Jasper architectures consist of a repeated block structure that utilizes 1D convolutions.
- In a Jasper_BxR model:
    - `R` sub-blocks (consisting of a 1D convolution, batch norm, ReLU, and dropout) are grouped into a single block, which is then repeated `B` times.
- We also have one extra block at the beginning and a few more at the end that are independent of `B` and `R`, and we train with CTC loss.

A Jasper model looks roughly like this:

![Jasper with CTC](https://raw.githubusercontent.com/NVIDIA/NeMo/master/docs/sources/source/asr/jasper_vertical.png)
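
The `BxR` naming can be illustrated with a small sketch that only lists the layers and builds no actual network. The function `jasper_layout` is hypothetical, the default of three closing blocks is an assumption (the text above only says "a few"), and residual connections are left out for simplicity.

```python
def jasper_layout(num_blocks, num_sub_blocks, num_epilogue_blocks=3):
    """Return a list of (block_name, sub_blocks) describing a Jasper_BxR model.

    num_epilogue_blocks is an assumed value, not taken from the tutorial.
    """
    sub_block = ["conv1d", "batch_norm", "relu", "dropout"]

    # One extra block at the beginning, independent of B and R.
    layout = [("prologue", [list(sub_block)])]

    # B repeated blocks, each made of R identical sub-blocks.
    for b in range(1, num_blocks + 1):
        layout.append((f"block_{b}",
                       [list(sub_block) for _ in range(num_sub_blocks)]))

    # A few extra blocks at the end, also independent of B and R.
    for e in range(1, num_epilogue_blocks + 1):
        layout.append((f"epilogue_{e}", [list(sub_block)]))
    return layout


# Jasper_4x1: B=4 blocks with R=1 sub-block each.
for name, sub_blocks in jasper_layout(num_blocks=4, num_sub_blocks=1):
    print(name, len(sub_blocks))
```

Changing `num_blocks` or `num_sub_blocks` only affects the middle of the model; the opening and closing blocks stay the same.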

## Building a Simple ASR Pipeline in NeMo

We'll be using the **Neural Modules (NeMo) toolkit** for this part:
- [GitHub page](https://github.com/NVIDIA/NeMo)
- [Documentation](https://nvidia.github.io/NeMo/)

NeMo lets us easily hook together the components (modules) of our model, such as the data layer, intermediate layers, and various losses, without worrying too much about implementation details of individual parts or connections between modules.

In [None]:
# NeMo's "core" package
import nemo
# NeMo's ASR collection
import nemo.collections.asr as nemo_asr

### Creating Data Manifests

- We need to create manifests for our training and evaluation data, which will contain the metadata of our audio files.
- NeMo data layers take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.

Here's an example of what one line in a NeMo-compatible manifest might look like:
```
{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"}
```
- We can build our training and evaluation manifests using the transcription files.
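
One way to build such a manifest from an AN4 transcription file is sketched below. This is a minimal sketch under assumptions: the helper names are hypothetical, the duration is read with Python's standard `wave` module (which assumes plain PCM WAV files), and the audio is assumed to be stored as `<wav_dir>/<speaker>/<file_id>.wav`, where the speaker is the middle field of an ID such as `an87-fbbh-b`.

```python
import json
import os
import wave


def wav_duration_seconds(wav_path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def parse_transcription_line(line):
    """Split '<s> TEXT </s> (file-id)' into (lower-cased text, file id)."""
    text_part, _, id_part = line.strip().rpartition("(")
    file_id = id_part.rstrip(")")
    text = text_part.replace("<s>", "").replace("</s>", "").strip().lower()
    return text, file_id


def build_manifest(transcripts_path, manifest_path, wav_dir):
    """Write one JSON object per utterance to manifest_path."""
    with open(transcripts_path) as fin, open(manifest_path, "w") as fout:
        for line in fin:
            if not line.strip():
                continue
            text, file_id = parse_transcription_line(line)
            # Assumed layout: <wav_dir>/<speaker>/<file_id>.wav
            speaker = file_id.split("-")[1]
            audio_path = os.path.join(wav_dir, speaker, file_id + ".wav")
            entry = {
                "audio_filepath": audio_path,
                "duration": wav_duration_seconds(audio_path),
                "text": text,
            }
            fout.write(json.dumps(entry) + "\n")
```

A call would then look like `build_manifest(transcripts_path, 'train_manifest.json', wav_dir)`, with `transcripts_path` pointing at a file such as `an4/etc/an4_train.transcription` and `wav_dir` at the matching directory of converted audio.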

In [None]:
!! sed -n '10,20p' train_manifest.json

### Building Training and Evaluation DAGs with NeMo

In [None]:
# Create our NeuralModuleFactory, the main engine that drives the pipeline:
# it oversees the neural modules and handles checkpoints, callbacks, logs,
# and other details for training and inference.
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir=data_dir+'/an4_tutorial/')  # where model logs and outputs will be written

logger = nemo.logging

#### Specifying Configuration of the Model

We'll build a *Jasper_4x1 model*, with `B=4` blocks of `R=1` sub-block each, and a *greedy CTC decoder*, using the configuration found in `jasper_an4.yaml`.

Using a YAML config such as this is helpful for getting a quick and human-readable overview of what your architecture looks like, and allows you to swap out model and run configurations easily without needing to change your code.
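
As background for the greedy CTC decoder mentioned above, the idea can be sketched in plain Python: pick the highest-scoring class in every frame, collapse consecutive repeats, and drop the blank symbol. This is an illustrative sketch and not NeMo's implementation; the function name `greedy_ctc_decode` is hypothetical, and placing the blank symbol at index `len(labels)` is an assumption.

```python
def greedy_ctc_decode(frame_scores, labels, blank_id=None):
    """Decode per-frame class scores into a string with greedy CTC decoding.

    frame_scores: one list of scores per time step, each of length
    len(labels) + 1 (the extra class is the CTC blank).
    """
    if blank_id is None:
        blank_id = len(labels)  # assumption: blank is the last class

    # 1. Best class per frame (argmax).
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]

    # 2. Collapse repeated classes, then 3. remove blanks.
    decoded = []
    previous = None
    for idx in best_path:
        if idx != previous and idx != blank_id:
            decoded.append(labels[idx])
        previous = idx
    return "".join(decoded)


labels = ["a", "b", " "]
frame_scores = [
    [0.90, 0.05, 0.00, 0.05],  # a
    [0.80, 0.10, 0.00, 0.10],  # a (repeat, collapsed)
    [0.10, 0.10, 0.00, 0.80],  # blank separates the two a's
    [0.70, 0.10, 0.10, 0.10],  # a
    [0.10, 0.70, 0.10, 0.10],  # b
    [0.20, 0.60, 0.10, 0.10],  # b (repeat, collapsed)
]
print(greedy_ctc_decode(frame_scores, labels))  # aab
```

The blank frame is what allows the same character to be emitted twice in a row; without it, the two runs of `a` would collapse into one.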

In [None]:
# --- Config Information ---#
from ruamel.yaml import YAML

config_path = 'jasper_an4.yaml'

yaml = YAML(typ='safe')
with open(config_path) as f:
    params = yaml.load(f)
labels = params['labels'] # Vocab

train_manifest = 'train_manifest.json'
test_manifest = 'test_manifest.json'

!! cat jasper_an4.yaml

#### Initialize Neural Modules

##### Training

In [None]:
# Create the training data layer (which loads data) and the data preprocessor
data_layer_train = nemo_asr.AudioToTextDataLayer.import_from_config(
    config_path,
    "AudioToTextDataLayer_train",
    overwrite_params={"manifest_filepath": train_manifest}
) # Training datalayer

data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor.import_from_config(
    config_path, "AudioToMelSpectrogramPreprocessor"
)

# Create the Jasper_4x1 encoder as specified, and a CTC decoder
encoder = nemo_asr.JasperEncoder.import_from_config(
    config_path, "JasperEncoder"
)

decoder = nemo_asr.JasperDecoderForCTC.import_from_config(
    config_path, "JasperDecoderForCTC",
    overwrite_params={"num_classes": len(labels)}
)

ctc_loss = nemo_asr.CTCLossNM(num_classes=len(labels))
greedy_decoder = nemo_asr.GreedyCTCDecoder()

##### Inference

In [None]:
data_layer_test = nemo_asr.AudioToTextDataLayer.import_from_config(
    config_path,
    "AudioToTextDataLayer_eval",
    overwrite_params={"manifest_filepath": test_manifest}
) # Eval datalayer

#### Wire up the training pipeline

The next step is to assemble our training DAG by specifying the inputs to each neural module.

In [None]:
# --- Assemble Training DAG --- #
audio_signal, audio_signal_len, transcript, transcript_len = data_layer_train()

processed_signal, processed_signal_len = data_preprocessor(
    input_signal=audio_signal,
    length=audio_signal_len)

encoded, encoded_len = encoder(
    audio_signal=processed_signal,
    length=processed_signal_len)

log_probs = decoder(encoder_output=encoded)

preds = greedy_decoder(log_probs=log_probs)  # Training predictions
loss = ctc_loss(
    log_probs=log_probs,
    targets=transcript,
    input_length=encoded_len,
    target_length=transcript_len)



#### Wire up the evaluation pipeline

Our evaluation DAG will reuse most of the parts of the training DAG with the exception of the data layer, since we are loading the evaluation data from a different file but evaluating on the same model.

In [None]:
# --- Assemble Validation DAG --- #
(audio_signal_test, audio_len_test,
 transcript_test, transcript_len_test) = data_layer_test()

processed_signal_test, processed_len_test = data_preprocessor(
    input_signal=audio_signal_test,
    length=audio_len_test)

encoded_test, encoded_len_test = encoder(
    audio_signal=processed_signal_test,
    length=processed_len_test)

log_probs_test = decoder(encoder_output=encoded_test)

preds_test = greedy_decoder(log_probs=log_probs_test)  # Test predictions
loss_test = ctc_loss(
    log_probs=log_probs_test,
    targets=transcript_test,
    input_length=encoded_len_test,
    target_length=transcript_len_test)

### Running the Model
- We would like to be able to monitor our model while it's training, so we use **callbacks**. 

#### Create callbacks

In [None]:
# These helper functions are passed to the callbacks to handle more complex processing.
from nemo.collections.asr.helpers import monitor_asr_train_progress, \
    process_evaluation_batch, process_evaluation_epoch
from functools import partial

train_callback = nemo.core.SimpleLossLoggerCallback(
    # Notice that we pass in loss, predictions, and the transcript info.
    # Of course we would like to see our training loss, but we need the
    # other arguments to calculate the WER.
    tensors=[loss, preds, transcript, transcript_len],
    # The print_func defines what gets printed.
    print_func=partial(
        monitor_asr_train_progress,
        labels=labels),
    tb_writer=neural_factory.tb_writer
    )

# We can create as many evaluation DAGs and callbacks as we want,
# which is useful in the case of having more than one evaluation dataset.
# In this case, we only have one.
eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[loss_test, preds_test, transcript_test, transcript_len_test],
    user_iter_callback=partial(
        process_evaluation_batch, labels=labels),
    user_epochs_done_callback=process_evaluation_epoch,
    eval_step=500,  # How often we evaluate the model on the test set
    tb_writer=neural_factory.tb_writer
    )

checkpoint_saver_callback = nemo.core.CheckpointCallback(
    folder=data_dir+'/an4_checkpoints',
    step_freq=1000  # How often checkpoints are saved
    )

# Make sure the checkpoint directory exists.
import os
os.makedirs(data_dir+'/an4_checkpoints', exist_ok=True)

#### Training the model

Once we create our neural factory and the callbacks for the information that we want to see, we can **start training** by simply calling the train function on the tensors we want to optimize and our callbacks!

In [None]:
# --- Start Training! --- #
neural_factory.train(
    tensors_to_optimize=[loss],
    callbacks=[train_callback, eval_callback, checkpoint_saver_callback],
    optimizer='novograd',
    optimization_params={
        # 100 epochs have already been run; here we do another 10.
        "num_epochs": 110, "lr": 0.001, "weight_decay": 1e-4
    })

### Inference

In [None]:
# --- Inference Only --- #

# We've already built the inference DAG above, so all we need is to call infer().
evaluated_tensors = neural_factory.infer(
    # These are the tensors we want to get from the model.
    tensors=[loss_test, preds_test, transcript_test, transcript_len_test],
    # checkpoint_dir specifies where the model params are loaded from.
    checkpoint_dir=(data_dir+'/an4_checkpoints')
    )

# Process the results to get WER
from nemo.collections.asr.helpers import word_error_rate, \
    post_process_predictions, post_process_transcripts

greedy_hypotheses = post_process_predictions(evaluated_tensors[1], labels)

references = post_process_transcripts(
    evaluated_tensors[2], evaluated_tensors[3], labels)

wer = word_error_rate(hypotheses=greedy_hypotheses, references=references)

print("*** WER: {:.2f} ***".format(wer * 100))

In [None]:
print(greedy_hypotheses[10])
print(references[10])

In [None]:
print(greedy_hypotheses[20])
print(references[20])

In [None]:
!!nvidia-smi

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Further reading/watching:
- [Stanford Lecture on ASR](https://www.youtube.com/watch?v=3MjIkWxXigM)
- ["An Intuitive Explanation of Connectionist Temporal Classification"](https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c)
- [Explanation of CTC with Prefix Beam Search](https://medium.com/corti-ai/ctc-networks-and-language-models-prefix-beam-search-explained-c11d1ee23306)
- [Listen Attend and Spell Paper (seq2seq ASR model)](https://arxiv.org/abs/1508.01211)
- [Explanation of the mel spectrogram in more depth](https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0)
- [Jasper Paper](https://arxiv.org/abs/1904.03288)
- [SpecAugment Paper](https://arxiv.org/abs/1904.08779)
- [Explanation and visualization of SpecAugment](https://towardsdatascience.com/state-of-the-art-audio-data-augmentation-with-google-brains-specaugment-and-pytorch-d3d1a3ce291e)
- [Cutout Paper](https://arxiv.org/pdf/1708.04552.pdf)