In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
try:
    import nemo.collections.asr as nemo_asr
except ModuleNotFoundError:
    ! python -m pip install --upgrade git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

In [None]:
import json
import os
import wget

from IPython.display import Audio
import numpy as np
import scipy.io.wavfile as wav

! pip install pandas

# optional
! pip install plotly
from plotly import graph_objects as go

# Introduction
End-to-end Automatic Speech Recognition (ASR) systems have surpassed traditional systems in performance, but they require large amounts of labeled data for training.

This tutorial shows how to use a pre-trained ASR model trained with Connectionist Temporal Classification (CTC), such as the [QuartzNet model](https://arxiv.org/abs/1910.10261), to split long audio files and the corresponding transcripts into shorter fragments that are suitable for training an ASR model.

We're going to use the [ctc-segmentation](https://github.com/lumaku/ctc-segmentation) Python package, which is based on the algorithm described in [CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition](https://arxiv.org/pdf/2007.09127.pdf).

In [None]:
! pip install ctc_segmentation==1.1.0

In [None]:
# If you're running the notebook locally, update the TOOLS_DIR path below.
# In Colab, a few required scripts will be downloaded from the NeMo GitHub repository.

TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/ctc_segmentation/scripts'

if 'google.colab' in str(get_ipython()):
    TOOLS_DIR = 'scripts/'
    os.makedirs(TOOLS_DIR, exist_ok=True)

    required_files = ['prepare_data.py',
                      'normalization_helpers.py',
                      'run_ctc_segmentation.py',
                      'verify_segments.py',
                      'cut_audio.py',
                      'process_manifests.py',
                      'utils.py']
    for file in required_files:
        if not os.path.exists(os.path.join(TOOLS_DIR, file)):
            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/main/tools/ctc_segmentation/' + TOOLS_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DIR)
elif not os.path.exists(TOOLS_DIR):
    raise ValueError('Update TOOLS_DIR to point to the scripts directory under the NeMo root directory')

`TOOLS_DIR` should now contain the scripts that we're going to need in the next steps. All the necessary scripts can be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/ctc_segmentation/scripts).

In [None]:
print(TOOLS_DIR)
! ls -l $TOOLS_DIR

# Data Download
First, let's download an audio file from [https://librivox.org/](https://librivox.org/).

In [None]:
## create data directory and download an audio file
WORK_DIR = 'WORK_DIR'
DATA_DIR = WORK_DIR + '/DATA'
os.makedirs(DATA_DIR, exist_ok=True)
audio_file = 'childrensshortworks019_06acarriersdog_am_128kb.mp3'
if not os.path.exists(os.path.join(DATA_DIR, audio_file)):
    print('Downloading audio file')
    wget.download('http://archive.org/download/childrens_short_works_vol_019_1310_librivox/' + audio_file, DATA_DIR)

Next, we need to get the corresponding transcript.

Note that the text file and the audio file should have the same base name. For example, an audio file `example.wav` or `example.mp3` should have its corresponding text stored in `example.txt`.
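
The naming rule above can be checked before running any of the scripts. The snippet below is a minimal sketch, not part of the NeMo tools: the helper name `find_unpaired_audio` and the set of audio extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: only these audio extensions are of interest for this tutorial.
AUDIO_EXTENSIONS = {'.mp3', '.wav'}


def find_unpaired_audio(data_dir):
    """Return the names of audio files in data_dir that lack a same-named .txt file."""
    unpaired = []
    for path in sorted(Path(data_dir).iterdir()):
        is_audio = path.suffix.lower() in AUDIO_EXTENSIONS
        if is_audio and not path.with_suffix('.txt').exists():
            unpaired.append(path.name)
    return unpaired
```

Calling `find_unpaired_audio(DATA_DIR)` after the transcript has been written should return an empty list if every audio file has a matching text file.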

In [None]:
# text source: http://www.gutenberg.org/cache/epub/24263/pg24263.txt
text =  """
    A carrier on his way to a market town had occasion to stop at some houses
    by the road side, in the way of his business, leaving his cart and horse
    upon the public road, under the protection of a passenger and a trusty
    dog. Upon his return he missed a led horse, belonging to a gentleman in
    the neighbourhood, which he had tied to the end of the cart, and likewise
    one of the female passengers. On inquiry he was informed that during his
    absence the female, who had been anxious to try the mettle of the pony,
    had mounted it, and that the animal had set off at full speed. The carrier
    expressed much anxiety for the safety of the young woman, casting at the
    same time an expressive look at his dog. Oscar observed his master's eye,
    and aware of its meaning, instantly set off in pursuit of the pony, which
    coming up with soon after, he made a sudden spring, seized the bridle, and
    held the animal fast. Several people having observed the circumstance, and
    the perilous situation of the girl, came to relieve her. Oscar, however,
    notwithstanding their repeated endeavours, would not quit his hold, and
    the pony was actually led into the stable with the dog, till such time as
    the carrier should arrive. Upon the carrier entering the stable, Oscar
    wagged his tail in token of satisfaction, and immediately relinquished the
    bridle to his master.
    """

with open(os.path.join(DATA_DIR, audio_file.replace('.mp3', '.txt')), 'w') as f:
    f.write(text)

The `DATA_DIR` should now contain both audio and text files:

In [None]:
!ls -l $DATA_DIR

Listen to the audio:

In [None]:
Audio(os.path.join(DATA_DIR, audio_file))

As you probably noticed, the audio file contains a prologue and an epilogue that are missing from the corresponding text. The segmentation algorithm can handle extra audio fragments at the beginning and the end of the audio, but prolonged untranscribed audio segments in the middle of the file can degrade the segmentation results. For the same reason, to improve the segmentation quality, it is recommended to normalize the text so that the transcript contains the spoken equivalents of abbreviations and numbers.

# Prepare Text and Audio

We're going to use the `prepare_data.py` script to prepare both text and audio data for segmentation.

Text preprocessing:
* the text will be split into sentences and stored in `$OUTPUT_DIR/processed/*.txt`, with each sentence on a new line (we're going to find alignments for these sentences in the next steps)
* out-of-vocabulary words will be removed based on the vocabulary of the pre-trained ASR model, and (optionally) the text will be changed to lowercase
* sentences for alignment with the original punctuation and capitalization will be stored in `$OUTPUT_DIR/processed/*_with_punct.txt`
* numbers will be normalized in a naive way, for example, `12` will be replaced with `one two`. Such normalization is usually enough for proper segmentation, but to build a high-quality training dataset, all out-of-vocabulary symbols should be replaced with their actual spoken representations.
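
To make the naive normalization concrete, here is a minimal sketch of digit-by-digit replacement and of dropping symbols outside a vocabulary. It is an illustration only and not the code in `prepare_data.py`: the function names are hypothetical, and a character-level vocabulary is assumed.

```python
import re

DIGIT_NAMES = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']


def spell_out_digits(text):
    """Replace every digit with its name, so that '12' becomes 'one two'."""
    spelled = re.sub(r'\d', lambda m: f' {DIGIT_NAMES[int(m.group())]} ', text)
    return ' '.join(spelled.split())  # collapse the extra spaces


def remove_oov(text, vocabulary):
    """Lowercase the text and replace characters outside the vocabulary with spaces."""
    kept = ''.join(ch if ch in vocabulary else ' ' for ch in text.lower())
    return ' '.join(kept.split())


# Assumption: a character-level vocabulary of lowercase letters, apostrophe and space.
vocabulary = set("abcdefghijklmnopqrstuvwxyz' ")
print(spell_out_digits('chapter 12'))              # chapter one two
print(remove_oov('Oscar, however!', vocabulary))   # oscar however
```

Reading `12` as `one two` rather than `twelve` is exactly the kind of shortcut that is acceptable for alignment but not for a final training transcript.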

Audio preprocessing:
* `.mp3` files will be converted to `.wav` files
* audio files will be resampled to use the same sampling rate as was used to pre-train the ASR model we're using for alignment
* stereo tracks will be converted to mono
* since librivox.org audio contains relatively long prologues, we're also cutting a few seconds from the beginning of the audio files (optional step, see the `--cut_prefix` argument). In some cases, if an audio file contains a very long untranscribed prologue, increasing the `--cut_prefix` value might help improve segmentation quality.


The `prepare_data.py` script will preprocess all `.txt` files found in `--in_text=$DATA_DIR` and all `.mp3` files located in `--audio_dir=$DATA_DIR`.


In [None]:
MODEL = 'QuartzNet15x5Base-En'
OUTPUT_DIR = WORK_DIR + '/output'

! python $TOOLS_DIR/prepare_data.py \
--in_text=$DATA_DIR \
--output_dir=$OUTPUT_DIR/processed/ \
--language='eng' \
--cut_prefix=3 \
--model=$MODEL \
--audio_dir=$DATA_DIR

The following three files should be generated and stored in the `$OUTPUT_DIR/processed` folder:
* childrensshortworks019_06acarriersdog_am_128kb.txt
* childrensshortworks019_06acarriersdog_am_128kb.wav
* childrensshortworks019_06acarriersdog_am_128kb_with_punct.txt

In [None]:
! ls -l $OUTPUT_DIR/processed

The `.txt` file without punctuation contains the preprocessed text phrases that we're going to align with the audio file. Here, the text was split into sentences, and each line contains one text snippet for alignment.

In [None]:
with open(os.path.join(OUTPUT_DIR, 'processed', audio_file.replace('.mp3', '.txt')), 'r') as f:
    for line in f:
        print(line)

# Run CTC-Segmentation

In this step, we're going to use the [`ctc-segmentation`](https://github.com/lumaku/ctc-segmentation) package to find the start and end timestamps for the segments we created in the previous step.


As described in [CTC-Segmentation of Large Corpora for German End-to-end Speech Recognition](https://arxiv.org/pdf/2007.09127.pdf), the algorithm relies on a CTC-based ASR model to extract utterance segments with exact time-wise alignments. For this tutorial, we're using the pre-trained 'QuartzNet15x5Base-En' model.
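
A CTC model emits one probability distribution over the labels (plus a blank symbol) for every audio frame, and this per-frame output is what makes time alignment possible. The toy sketch below shows only the standard CTC greedy decoding rule (take the best label per frame, merge repeats, drop blanks). The probabilities and labels are made up, and this is not the alignment procedure of the `ctc-segmentation` package, which works on the frame-level probabilities themselves.

```python
def ctc_greedy_decode(frame_probs, labels, blank_id):
    """Pick the best label per frame, merge repeated labels, then drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_probs]
    decoded = []
    previous = None
    for idx in best_ids:
        # A label is emitted only when it differs from the previous frame's label.
        if idx != previous and idx != blank_id:
            decoded.append(labels[idx])
        previous = idx
    return ''.join(decoded)


# Toy example: two labels plus a blank (index 2), six frames.
labels = ['a', 'b']
frame_probs = [
    [0.90, 0.05, 0.05],  # a
    [0.80, 0.10, 0.10],  # a (repeat, merged)
    [0.10, 0.10, 0.80],  # blank (separates the two a's)
    [0.70, 0.20, 0.10],  # a
    [0.10, 0.80, 0.10],  # b
    [0.20, 0.70, 0.10],  # b (repeat, merged)
]
print(ctc_greedy_decode(frame_probs, labels, blank_id=2))  # aab
```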

In [None]:
WINDOW = 8000

! python $TOOLS_DIR/run_ctc_segmentation.py \
--output_dir=$OUTPUT_DIR \
--data=$OUTPUT_DIR/processed \
--model=$MODEL \
--window_len=$WINDOW \
--no_parallel

The `WINDOW` parameter might need to be adjusted depending on the length of the utterance one wants to align. The default value should work in most cases.

Let's take a look at the generated alignments.
The expected output for our audio sample with the 'QuartzNet15x5Base-En' model looks like this:

```
<PATH_TO>/processed/childrensshortworks019_06acarriersdog_am_128kb.wav
16.03 32.39 -4.5911999284929115 | a carrier on ... a trusty dog. | ...
33.31 45.01 -0.22886803973405373 | upon his ... passengers. | ...
46.17 58.57 -0.3523662826061572 | on inquiry ... at full speed. | ...
59.75 69.43 -0.04128918756038118 | the carrier ... dog. | ...
69.93 85.31 -0.3595261826390344 | oscar observed ... animal fast. | ...
85.95 93.43 -0.04447770533708611 | several people ... relieve her. | ...
93.61 105.95 -0.07326174931639003 | oscar however ... arrive. | ...
106.65 116.91 -0.14680841514778062 | upon the carrier ... his master. | ...
```

Details of the file content:
- the first line of the file contains the path to the original audio file
- all subsequent lines contain:
  * the start of the segment (in seconds)
  * the end of the segment (in seconds)
  * the alignment confidence score (in log space)
  * the text fragment corresponding to the timestamps
  * the original text without pre-processing
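
The layout described above can be read back into Python with a few lines. This is a sketch that assumes the fields are separated by `|` exactly as in the sample output and that the text itself contains no `|` character. The `Segment` class and `parse_segments` function are hypothetical names, not part of the NeMo scripts.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float         # segment start, in seconds
    end: float           # segment end, in seconds
    score: float         # alignment confidence, in log space
    text: str            # pre-processed text fragment
    original_text: str   # text without pre-processing


def parse_segments(lines):
    """Return (audio_path, segments) parsed from the lines of a segments file."""
    lines = [line.strip() for line in lines if line.strip()]
    audio_path, segments = lines[0], []
    for line in lines[1:]:
        fields = [field.strip() for field in line.split('|')]
        start, end, score = (float(value) for value in fields[0].split())
        segments.append(Segment(start, end, score, fields[1], fields[2]))
    return audio_path, segments
```

A file can be parsed with `with open(path) as f: audio_path, segments = parse_segments(f)`, after which `segments[0].end - segments[0].start` gives the duration of the first clip.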

In [None]:
alignment_file = str(WINDOW) + '_' + audio_file.replace('.mp3', '_segments.txt')
! cat $OUTPUT_DIR/segments/$alignment_file

Finally, we're going to split the original audio file into segments based on the found alignments. We're going to create three subsets and three corresponding manifests:
* high scored clips (segments with the segmentation score above the threshold value, default threshold value = -5)
* low scored clips (segments with the segmentation score below the threshold)
* deleted segments (segments that were excluded during the alignment). For example, in our sample audio file, the prologue and epilogue that don't have a corresponding transcript were excluded. Oftentimes, deleted segments also contain such things as clapping, music, or heavy breathing.

The alignment score values depend on the quality of the pre-trained model and on the dataset, so the `THRESHOLD` parameter might be worth adjusting based on an analysis of the low and high scored clips.

Also note that the `OFFSET` parameter is something one might want to experiment with, since timestamps have a delay (offset) that depends on the model.
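
The effect of the threshold can be explored on the scores alone before cutting any audio. The sketch below partitions `(start, end, score)` tuples into high and low scored groups. It is an illustration rather than the logic of `cut_audio.py`: the function name is hypothetical, treating a score equal to the threshold as high is an assumption, and the offset is not modelled.

```python
def split_by_score(segments, threshold=-5.0):
    """Split (start, end, score) tuples into high and low scored lists."""
    # Assumption: a score exactly equal to the threshold counts as high.
    high = [segment for segment in segments if segment[2] >= threshold]
    low = [segment for segment in segments if segment[2] < threshold]
    return high, low


# Scores rounded from the first two lines of the sample alignment output.
segments = [(16.03, 32.39, -4.59), (33.31, 45.01, -0.23)]

high, low = split_by_score(segments, threshold=-5)
print(len(high), len(low))  # 2 0

high, low = split_by_score(segments, threshold=-1)
print(len(high), len(low))  # 1 1
```

With the default threshold of -5 both sample segments are kept, while a stricter threshold of -1 moves the first segment to the low scored group.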


In [None]:
OFFSET = 0
THRESHOLD = -5

! python $TOOLS_DIR/cut_audio.py \
--output_dir=$OUTPUT_DIR \
--model=$MODEL \
--alignment=$OUTPUT_DIR/segments/ \
--threshold=$THRESHOLD \
--offset=$OFFSET

A `manifests` folder should be created under `OUTPUT_DIR`, and it should contain
the corresponding manifests for the three groups of clips described above:

In [None]:
! ls -l $OUTPUT_DIR/manifests

In [None]:
def plot_signal(signal, sample_rate):
    """ Plot the signal in time domain """
    fig_signal = go.Figure(
        go.Scatter(x=np.arange(signal.shape[0])/sample_rate,
                   y=signal, line={'color': 'green'},
                   name='Waveform',
                   hovertemplate='Time: %{x:.2f} s<br>Amplitude: %{y:.2f}<br><extra></extra>'),
        layout={
            'height': 200,
            'xaxis': {'title': 'Time, s'},
            'yaxis': {'title': 'Amplitude'},
            'title': 'Audio Signal',
            'margin': dict(l=0, r=0, t=40, b=0, pad=0),
        }
    )
    fig_signal.show()
    
def display_samples(manifest):
    """ Display audio and reference text."""
    with open(manifest, 'r') as f:
        for line in f:
            sample = json.loads(line)
            sample_rate, signal = wav.read(sample['audio_filepath'])
            plot_signal(signal, sample_rate)
            display(Audio(sample['audio_filepath']))
            display('Reference text:       ' + sample['text_no_preprocessing'])
            display('ASR transcript: ' + sample['transcript'])
            print('\n' + '-' * 110)

Let's examine the high scored segments we obtained.

The `Reference text` in the next cell is the original text without pre-processing, while `ASR transcript` is the ASR model prediction obtained with greedy decoding. Also notice that the `ASR transcript` in some cases contains errors that could decrease the alignment score, but this usually doesn't hurt the quality of the aligned segments.
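
One way to quantify how far an ASR transcript is from its reference is the word error rate, that is, the word-level edit distance divided by the number of reference words. The sketch below is an add-on illustration and not something the tutorial scripts compute here. Both strings are assumed to be normalized the same way (lowercase, no punctuation) before the comparison.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over one row at a time: previous[j] is the distance
    # between the reference words seen so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / max(len(ref), 1)


print(word_error_rate('a trusty dog', 'a trusty dog'))  # 0.0
```

Segments with a high word error rate but a good alignment score are worth listening to, since the text may still be correctly aligned even though the model misrecognized some words.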

In [None]:
high_score_manifest = str(WINDOW) + '_' + audio_file.replace('.mp3', '_high_score_manifest.json')
display_samples(os.path.join(OUTPUT_DIR, 'manifests', high_score_manifest))

# Multiple files alignment

Up until now, we have been processing only one file at a time, but when creating a large dataset, processing multiple files simultaneously can speed things up considerably.

Let's download another audio file and corresponding text.

In [None]:
# https://librivox.org/frost-to-night-by-edith-m-thomas/
audio_file_2 = 'frosttonight_thomas_bk_128kb.mp3'
if not os.path.exists(os.path.join(DATA_DIR, audio_file_2)):
    print('Downloading audio file')
    wget.download('http://www.archive.org/download/frost_to-night_1710.poem_librivox/frosttonight_thomas_bk_128kb.mp3', DATA_DIR)


# text source: https://www.bartleby.com/267/151.html
text =  """
    APPLE-GREEN west and an orange bar,	
    And the crystal eye of a lone, one star …	
    And, “Child, take the shears and cut what you will,	
    Frost to-night—so clear and dead-still.”	
    
    Then, I sally forth, half sad, half proud,	        
    And I come to the velvet, imperial crowd,	
    The wine-red, the gold, the crimson, the pied,—	
    The dahlias that reign by the garden-side.	
    
    The dahlias I might not touch till to-night!	
    A gleam of the shears in the fading light,	        
    And I gathered them all,—the splendid throng,	
    And in one great sheaf I bore them along.
    .    .    .    .    .    .
    
    In my garden of Life with its all-late flowers	
    I heed a Voice in the shrinking hours:	
    “Frost to-night—so clear and dead-still” …	        
    Half sad, half proud, my arms I fill.	
    """

with open(os.path.join(DATA_DIR, audio_file_2.replace('.mp3', '.txt')), 'w') as f:
    f.write(text)

`DATA_DIR` should now contain two .mp3 files and two .txt files:

In [None]:
! ls -l $DATA_DIR

In [None]:
Audio(os.path.join(DATA_DIR, audio_file_2))

Finally, we need to download a script that performs all the above steps, from text and audio preprocessing to segmentation and manifest creation, in a single step.

In [None]:
if 'google.colab' in str(get_ipython()) and not os.path.exists('run_sample.sh'):
    wget.download('https://raw.githubusercontent.com/NVIDIA/NeMo/main/tools/ctc_segmentation/run_sample.sh', '.')

In [None]:
! ls -l

Next, we're going to execute the `run_sample.sh` script to find alignments for the two audio/text samples. By default, if no alignment is found for the initial `WINDOW` size, the window size is doubled a few times to re-attempt the alignment.
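
The retry behaviour can be pictured as a small loop. The following is a sketch under stated assumptions, not the code used by the NeMo scripts: `align` is a hypothetical callable that returns `None` when no alignment is found, and the limit of four attempts is an arbitrary stand-in for "a few times".

```python
def align_with_growing_window(align, initial_window=8000, max_attempts=4):
    """Call align(window), doubling the window after each failed attempt."""
    window = initial_window
    for _ in range(max_attempts):
        result = align(window)
        if result is not None:
            return window, result
        window *= 2  # retry with a window twice as large
    return None, None


def fake_align(window):
    """Stand-in aligner that only succeeds once the window is large enough."""
    return 'aligned' if window >= 30000 else None


print(align_with_growing_window(fake_align))  # (32000, 'aligned')
```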

`run_sample.sh` applies two initial `WINDOW` sizes, 8000 and 12000, and then adds the segments that were aligned similarly with both window sizes to the `verified_segments` folder. This can reduce the amount of manual work needed to check the alignment quality.
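
The idea of keeping only segments on which two runs agree can be sketched as follows. This is an illustration, not the logic of `verify_segments.py`: it assumes that both runs list the same sentences in the same order, and the 0.5-second tolerance is a hypothetical value.

```python
def verified_segments(segments_a, segments_b, tolerance=0.5):
    """Keep segments whose start and end agree in both runs within the tolerance (seconds)."""
    verified = []
    for (start_a, end_a, text), (start_b, end_b, _) in zip(segments_a, segments_b):
        if abs(start_a - start_b) <= tolerance and abs(end_a - end_b) <= tolerance:
            verified.append((start_a, end_a, text))
    return verified


# Made-up timestamps for two runs with different window sizes.
run_8000 = [(16.0, 32.4, 'first sentence'), (33.3, 45.0, 'second sentence')]
run_12000 = [(16.1, 32.3, 'first sentence'), (36.0, 47.5, 'second sentence')]
print(verified_segments(run_8000, run_12000))  # [(16.0, 32.4, 'first sentence')]
```

Segments that fail the check are not necessarily wrong, but they are the ones most worth checking by ear.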

In [None]:
OUTPUT_DIR_2 = os.path.join(WORK_DIR, 'output_multiple_files')

! bash $TOOLS_DIR/../run_sample.sh \
--MODEL_NAME_OR_PATH=$MODEL \
--DATA_DIR=$DATA_DIR \
--OUTPUT_DIR=$OUTPUT_DIR_2 \
--SCRIPTS_DIR=$TOOLS_DIR \
--CUT_PREFIX=3 \
--MIN_SCORE=$THRESHOLD

The high scored manifests for the data samples are aggregated into `all_manifest.json` under `OUTPUT_DIR_2`.
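
Since each manifest stores one JSON object per line, aggregation amounts to concatenating the lines. The sketch below shows this under that assumption. `merge_manifests` is a hypothetical helper and not the code that `run_sample.sh` runs.

```python
import json


def merge_manifests(manifest_paths, output_path):
    """Concatenate JSON-lines manifests into one file and return the entry count."""
    count = 0
    with open(output_path, 'w', encoding='utf-8') as out:
        for path in manifest_paths:
            with open(path, 'r', encoding='utf-8') as f:
                for line in f:
                    if not line.strip():
                        continue  # skip blank lines
                    entry = json.loads(line)  # raises early on a malformed line
                    out.write(json.dumps(entry) + '\n')
                    count += 1
    return count
```

Parsing and re-serializing every line costs a little time but guarantees that the merged file contains only valid JSON objects.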

In [None]:
display_samples(os.path.join(OUTPUT_DIR_2, 'all_manifest.json'))

# Next Steps

Check out the [NeMo Speech Data Explorer tool](https://github.com/NVIDIA/NeMo/tree/main/tools/speech_data_explorer#speech-data-explorer) to interactively evaluate the aligned segments.

# References
Kürzinger, Ludwig, et al. ["CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition."](https://arxiv.org/abs/2007.09127) International Conference on Speech and Computer. Springer, Cham, 2020.