# Catskills Research Company application of the NVIDIA NeMo QuartzNet 15x5 model trained from scratch at a 16000 Hz sample rate for Somali

Lars Ericson

Quantitative Analytics Specialist

Catskills Research Company

1334 Hudson Place
Davidson, NC 28036

lars.ericson@wellsfargo.com

12 November 2020

## Abstract

We describe the Catskills Research Company system for NIST OpenASR20. In the EVAL Constrained condition, this system scored a WER of 1.13849, placing 4th out of 5 on the Leaderboard.

## Core algorithmic approach

We used the NVIDIA NeMo ASR package [1] and followed its instructions [2] for training a new language from scratch (Constrained condition) using the QuartzNet 15x5 model [3]. This involved creating a YAML file for Somali by modifying the example YAML file [4].

The main modifications were to:

* Input the **grapheme set** for Somali

* Decide on the **maximum duration** in seconds of an input sample. We chose 10 seconds and limited our training samples to BUILD set transcript segments of 10 seconds or less.

* Decide on the **sample rate**. Because we initially worked with the pretrained model (Unconstrained condition), which uses a 16000 Hz sample rate, we kept 16000 Hz for the Constrained condition. This was probably a mistake: the source audio is 8000 Hz, so upsampling increases the input size for no added value.

* Decide on the initial **learning rate**. We chose a relatively high rate of 0.02, double the usual recommended Novograd starting rate of 0.01, because the training samples were heavily augmented.

* Decide on the **batch size**. We chose a batch size of 180, which fit our GPU, because we felt that a larger batch size would minimize overfitting.

The entries corresponding to these choices were changed in the base YAML file to produce the Somali YAML file.
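The choices listed above can be sketched as a set of overrides applied to a base configuration. This is only an illustration, not the original YAML: the key names (`model.train_ds.*`, `model.optim.lr`) are assumed to follow the layout of the NeMo example config, the base values are illustrative, and the Somali grapheme list is left as a placeholder because it is not given here.

```python
import copy

# Hypothetical excerpt of the base QuartzNet 15x5 config. Key names are assumed
# to follow the NeMo example YAML; the base values are illustrative only.
BASE_CONFIG = {
    "model": {
        "train_ds": {
            "sample_rate": 16000,
            "labels": ["<base graphemes>"],
            "batch_size": 32,
            "max_duration": 16.7,
        },
        "optim": {"name": "novograd", "lr": 0.01},
    }
}

# Values stated in the text. The grapheme list is a placeholder, not real data.
SOMALI_OVERRIDES = {
    "model.train_ds.sample_rate": 16000,
    "model.train_ds.labels": ["<Somali graphemes>"],
    "model.train_ds.batch_size": 180,
    "model.train_ds.max_duration": 10.0,
    "model.optim.lr": 0.02,
}


def apply_overrides(config, overrides):
    """Return a deep copy of config with dotted-key overrides applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node[key]
        node[leaf] = value
    return result


somali_config = apply_overrides(BASE_CONFIG, SOMALI_OVERRIDES)
```

Keeping the overrides in one flat mapping makes it easy to see exactly which entries differ from the base file.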

A new model from scratch is created by instantiating the `nemo_asr.models.EncDecCTCModel` class in NeMo.

## Additional features and tools used, including software packages and publicly available external resources

We used:

* Python 3.7.9
* `sph2pipe_v2.5` for SPH to WAV conversion [5]
* Python modules `IPython`, `Levenshtein`, `OpenASR_convert_reference_transcript`, `argparse`, `audioread`, `csv`, `datetime`, `glob`, `itertools`, `json`, `librosa`, `logging`, `matplotlib`, `multiprocessing`, `nemo`, `numpy`, `omegaconf`, `operator`, `os`, `pandas`, `pathlib`, `pickle`, `pprint`, `pytorch_lightning`, `random`, `re`, `ruamel`, `scipy`, `shutil`, `soundfile`, `sys`, `tarfile`, `torch`, `torchtext`, `tqdm`, `unidecode`, `warnings`



## Other data used (outside provided data)

Only NIST BABEL Somali BUILD samples were used for training.

## Significant data pre-/post-processing

### Data augmentation

Training audio was split into smaller samples according to the timecodes in the transcripts.
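A minimal sketch of this splitting step is shown below. It assumes the transcript has already been parsed into hypothetical `(start_seconds, end_seconds, text)` tuples, and it applies the 10 second limit described earlier; the function name and the input format are illustrative, not the original code.

```python
MAX_DURATION_S = 10.0  # limit stated in the text


def split_by_timecodes(samples, sample_rate, segments, max_duration=MAX_DURATION_S):
    """Cut one recording into (clip, text) pairs using transcript timecodes.

    samples: sequence of audio samples for the whole recording.
    segments: iterable of (start_seconds, end_seconds, text) tuples, a
              hypothetical already-parsed form of the transcript.
    Segments longer than max_duration, or with empty text, are skipped.
    """
    clips = []
    for start_s, end_s, text in segments:
        duration = end_s - start_s
        if duration <= 0 or duration > max_duration or not text.strip():
            continue
        first = int(round(start_s * sample_rate))
        last = min(int(round(end_s * sample_rate)), len(samples))
        clips.append((samples[first:last], text.strip()))
    return clips


# Usage with a synthetic 30 second recording at 8000 Hz.
recording = [0.0] * (30 * 8000)
transcript = [
    (0.0, 4.5, "first utterance"),   # kept
    (4.5, 16.0, "too long"),         # skipped: 11.5 s exceeds the limit
    (16.0, 18.0, " "),               # skipped: no text
]
clips = split_by_timecodes(recording, 8000, transcript)
```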

Each sample was then augmented with random variations using NeMo-provided perturbations [6][7].

In particular, we applied the following 3 perturbations in sequence, 10 times per sample, to get 10 new samples:

* **Time stretch** from 0.8 to 1.2.  (Pitch preserving.)
* **Speed change** from 0.8 to 1.2.  (Not pitch preserving.)
* **White noise** from -70 dB to -35 dB.

These perturbations are implemented together in a single augmentation class.
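The sketch below illustrates the idea with the standard library only; it is not the NeMo implementation and not the original class. It draws a random speed factor in [0.8, 1.2] and a random noise level in [-70, -35] dB for each of 10 copies. The time stretch step is omitted for brevity, the speed change is approximated by linear-interpolation resampling, and all function names are hypothetical.

```python
import math
import random


def speed_change(samples, rate):
    """Resample by linear interpolation so the clip plays `rate` times faster."""
    if not samples:
        return []
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_white_noise(samples, level_db, rng):
    """Add Gaussian noise with standard deviation 10 ** (level_db / 20)."""
    sigma = 10.0 ** (level_db / 20.0)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def augment(samples, n_copies=10, seed=0):
    """Return n_copies randomly perturbed versions of one clip."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        rate = rng.uniform(0.8, 1.2)          # speed factor range from the text
        level_db = rng.uniform(-70.0, -35.0)  # noise level range from the text
        copies.append(add_white_noise(speed_change(samples, rate), level_db, rng))
    return copies


# Usage: a 0.1 second, 440 Hz tone at 16000 Hz.
tone = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
copies = augment(tone)
```

Seeding the generator makes the augmented set reproducible, which helps when comparing training runs.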

### Speech activity detection and transcription

To reduce the unlabelled DEV and EVAL data to clips of at most 10 seconds in length, a Speech Activity Detection function is needed. We explored NeMo templates for training a neural network for this purpose, but that approach resulted in a very slow function. We chose instead to implement an ad hoc, manually tuned method that uses the absolute value of the dB level of the mel spectrogram to find periods of silence long enough to cut the clips at.
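As an illustration of this kind of silence-based cutting, the sketch below uses per-frame RMS level in dB as a stand-in for the mel spectrogram level. The threshold, frame length, minimum silence length, and all function names are assumptions chosen for the example, not the tuned values or the original code.

```python
import math


def frame_levels_db(samples, frame_len):
    """RMS level of each non-overlapping frame, in dB relative to full scale."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        levels.append(20.0 * math.log10(max(rms, 1e-10)))
    return levels


def silence_midpoints(levels_db, threshold_db, min_frames):
    """Frame index at the middle of each silent run of at least min_frames."""
    cuts, run_start = [], None
    # The trailing +inf sentinel closes a silent run that reaches the end.
    for i, level in enumerate(levels_db + [float("inf")]):
        if level < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                cuts.append((run_start + i) // 2)
            run_start = None
    return cuts


def cut_clips(samples, sample_rate, max_duration=10.0, frame_s=0.02,
              threshold_db=-40.0, min_silence_s=0.3):
    """Return (start_sample, end_sample) clips no longer than max_duration."""
    frame_len = max(1, round(frame_s * sample_rate))
    min_frames = max(1, round(min_silence_s / frame_s))
    levels = frame_levels_db(samples, frame_len)
    candidates = [c * frame_len
                  for c in silence_midpoints(levels, threshold_db, min_frames)]
    max_len = round(max_duration * sample_rate)
    clips, start = [], 0
    while len(samples) - start > max_len:
        usable = [c for c in candidates if start < c <= start + max_len]
        end = usable[-1] if usable else start + max_len  # hard cut if no silence
        clips.append((start, end))
        start = end
    if start < len(samples):
        clips.append((start, len(samples)))
    return clips
```

Each clip ends at the latest usable silence within the 10 second window, so clips stay as long as possible while still breaking in a pause.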

The transcription of a clip of 10 seconds or less in duration is performed by the function `predicted_segment_transcript`. It uses similar logic to break the clip into silent and speech components, and then allocates the words of the predicted transcript to the speech components in proportion to word size and speech component size.
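One way to realize such a proportional allocation is sketched below. It is an assumption about the rule, not the original function: each word is weighted by its character count, and a segment receives words until the cumulative character share reaches that segment's cumulative share of speech time. The function name and the tuple format are hypothetical.

```python
def allocate_words(words, speech_segments):
    """Assign words, in order, to speech segments in proportion to duration.

    words: list of predicted words for the whole clip.
    speech_segments: list of (start_seconds, end_seconds) speech intervals.
    Returns a list of (start, end, text) tuples, one per segment.
    """
    total_time = sum(end - start for start, end in speech_segments)
    total_chars = sum(len(w) for w in words)
    if total_time <= 0 or total_chars == 0:
        return [(start, end, "") for start, end in speech_segments]

    result, word_idx, chars_used, time_used = [], 0, 0, 0.0
    for seg_idx, (start, end) in enumerate(speech_segments):
        time_used += end - start
        is_last = seg_idx == len(speech_segments) - 1
        # Character budget reached by the end of this segment.
        target = total_chars * time_used / total_time
        chunk = []
        # A word joins this segment if its midpoint falls inside the budget;
        # the last segment takes every remaining word.
        while word_idx < len(words) and (
            is_last or chars_used + len(words[word_idx]) / 2 <= target
        ):
            chars_used += len(words[word_idx])
            chunk.append(words[word_idx])
            word_idx += 1
        result.append((start, end, " ".join(chunk)))
    return result


# Usage: 4 equal-length placeholder words over segments lasting 1 s and 3 s.
allocation = allocate_words(["aa", "bb", "cc", "dd"], [(0.0, 1.0), (2.0, 5.0)])
```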

This in turn relies on a function that calls the model to transcribe the audio into graphemes, and a function that allocates the predicted text to speech segments.
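For the grapheme transcription step, the model is a CTC model (`EncDecCTCModel`), so its per-frame outputs have to be collapsed into text. The sketch below shows greedy CTC decoding in plain Python as an illustration of that step; it does not call NeMo, the function name is hypothetical, and placing the blank symbol at the last index is an assumption.

```python
def ctc_greedy_decode(frame_scores, labels):
    """Collapse per-frame scores into text with greedy CTC decoding.

    frame_scores: list of per-frame score lists, each of length
                  len(labels) + 1; the extra last index is assumed to be
                  the CTC blank.
    labels: list of graphemes.
    """
    blank = len(labels)
    decoded, previous = [], blank
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        # Emit a grapheme only when it differs from the previous frame's
        # choice and is not the blank symbol.
        if best != previous and best != blank:
            decoded.append(labels[best])
        previous = best
    return "".join(decoded)


def one_hot(index, size):
    """Helper for the usage example: a score vector peaking at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Usage with a toy 3-grapheme alphabet; index 3 is the blank.
toy_labels = [" ", "a", "b"]
best_path = [1, 1, 3, 1, 2, 2, 3, 0, 1]
text = ctc_greedy_decode([one_hot(i, 4) for i in best_path], toy_labels)
```

Repeated frames collapse into one grapheme, and a blank between two identical graphemes keeps both, which is why the toy path above yields a doubled "a".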

## Features that were the most novel or unusual and/or led to the biggest improvements in system performance

The open source NVIDIA NeMo project claims that the QuartzNet 15x5 model is novel in the sense of using 10X fewer parameters than other models in the Encoder/Decoder class [8].

It did not train well in the time we had available and performed poorly. This may be the fault of overly aggressive or poorly thought-out augmentation choices on our part. We did not understand the model well enough to attempt configuration changes that would increase its learning capacity.

### System configuration

Our system configuration was:

* Intel Core i9 processor
* 3 TB SSD
* 64 GB RAM
* NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM
* Ubuntu 20.04 LTS operating system
* Python 3.7.9

## Minimal required hardware specs to run your system

Evaluation required less than 1 GB of GPU VRAM and less than 5 GB of system RAM. Evaluation was fast, taking less than 5 minutes for a DEV or EVAL run.

## Minimal required time and amount of data to train/tune your system

In training we were able to keep the GPU about 88% loaded, using 11.2 GB out of 11.6 GB of available GPU memory. With the augmented data, the training loss did not fall below 50 after more than 48 hours of training. This may be due to a bad choice of augmentations, and also to upsampling to 16 kHz when the source audio was 8 kHz, which may introduce unnecessary artifacts. About 24 GB of system RAM was used during training. We allocated 8 cores for data loading. These cores periodically reached 100% load but generally stayed in a lower range.

## Diagram giving a visual representation of our system’s workflow

### Training

![training](training_flow.gv.png)

### Evaluation

![evaluation](eval_flow.gv.png)

## References

[1] https://github.com/NVIDIA/NeMo

[2] https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb

[3] https://arxiv.org/abs/1910.10261

[4] https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/config

[5] https://www.openslr.org/3/

[6] https://docs.nvidia.com/deeplearning/nemo/neural_mod_bp_guide/index.html#data_augmentation

[7] https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/05_Online_Noise_Augmentation.ipynb

[8] https://arxiv.org/pdf/1910.10261.pdf