In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect


NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

In [None]:
import json
import torch
import os
from nemo.collections.asr.metrics.wer import word_error_rate
from nemo.collections.asr.parts.utils.vad_utils import stitch_segmented_asr_output, construct_manifest_eval

# Offline ASR+VAD

In this tutorial, we demonstrate how to use offline voice activity detection (VAD) to extract speech segments and transcribe them with CTC models. This excludes non-speech audio and can save computation by removing unnecessary input to the ASR system.

The pipeline includes the following steps.

0. [Prepare data and script for demonstration](#Prepare-data-and-script-for-demonstration)
1. [Use offline VAD to extract speech segments](#Use-offline-VAD-to-extract-speech-segments)
2. [Transcribe speech segments with CTC models](#Transcribe-speech-segments-with-CTC-models)
3. [Stitch the prediction text of speech segments](#Stitch-the-prediction-text-of-speech-segments)
4. [Evaluate the performance of offline VAD with ASR](#Evaluate-the-performance-of-offline-VAD-with-ASR)

## Prepare data and script for demonstration


In [None]:
!mkdir -p data
!wget -P data/ https://nemo-public.s3.us-east-2.amazonaws.com/chris-sample01_02.wav
!wget -P data/ https://nemo-public.s3.us-east-2.amazonaws.com/chris-sample03.wav
!wget https://nemo-public.s3.us-east-2.amazonaws.com/chris_demo.json

In [None]:
input_manifest="chris_demo.json"
vad_out_manifest_filepath="vad_out.json"
vad_model="vad_multilingual_marblenet" # vad_multilingual_marblenet is used as an example; you can choose another VAD model.

In [None]:
!head -n 10 $input_manifest

In [None]:
# This cell is mainly for Colab.
# You can skip it when running locally, but make sure to change the file paths of the scripts and config file in the cells below.
!mkdir -p scripts
if not os.path.exists("scripts/vad_infer.py"):
    !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/speech_classification/vad_infer.py
if not os.path.exists("scripts/transcribe_speech.py"):
    !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/transcribe_speech.py

!mkdir -p conf/vad
if not os.path.exists("conf/vad/vad_inference_postprocessing.yaml"):
    !wget -P conf/vad/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/vad/vad_inference_postprocessing.yaml

## Use offline VAD to extract speech segments

Here we are using very simple parameters to demonstrate the process. 

Please choose or tune your own postprocessing parameters. 
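
To make the `onset`, `offset`, `min_duration_on`, and `min_duration_off` parameters more concrete, here is a minimal sketch of hysteresis-style postprocessing applied to per-frame speech probabilities. It is an illustration under simplifying assumptions, not NeMo's implementation: the function name `binarize`, the `frame_shift` argument, and the order in which the two duration filters are applied are hypothetical choices.

```python
def binarize(probs, frame_shift=0.01, onset=0.7, offset=0.4,
             min_duration_on=0.0, min_duration_off=0.0):
    """Turn per-frame speech probabilities into (start, end) segments in seconds.

    Hypothetical helper for illustration only; not the NeMo implementation.
    """
    segments = []
    in_speech = False
    start = 0.0
    for i, p in enumerate(probs):
        t = i * frame_shift
        if not in_speech and p >= onset:
            # Speech starts only when the probability reaches the higher threshold.
            in_speech, start = True, t
        elif in_speech and p < offset:
            # Speech ends only when the probability falls below the lower threshold.
            in_speech = False
            segments.append((start, t))
    if in_speech:
        segments.append((start, len(probs) * frame_shift))

    # Merge segments separated by a non-speech gap shorter than min_duration_off.
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Drop speech segments shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]


# Made-up probabilities, one frame per second to keep the arithmetic simple.
example_probs = [0.1, 0.8, 0.5, 0.3, 0.9, 0.9, 0.2, 0.1, 0.1, 0.1, 0.75, 0.1]
print(binarize(example_probs, frame_shift=1.0))
print(binarize(example_probs, frame_shift=1.0, min_duration_on=2.0, min_duration_off=2.0))
```

Using two thresholds means a segment is not cut as soon as the probability dips slightly, which is why `onset` is typically set higher than `offset`.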

You can find more details in
```
<NeMo_git_root>/tutorials/asr/Online_Offline_Microphone_VAD_Demo.ipynb and
<NeMo_git_root>/scripts/voice_activity_detection/vad_tune_threshold.py
```

The <code>vad_infer.py</code> script will help you generate speech segments. See more details in the script below.

In [None]:
# if run locally, vad_infer.py is located in <NeMo_git_root>/examples/asr/speech_classification/vad_infer.py
%run -i scripts/vad_infer.py --config-path="../conf/vad" --config-name="vad_inference_postprocessing.yaml" \
dataset=$input_manifest \
vad.model_path=$vad_model \
frame_out_dir="chris_demo" \
vad.parameters.window_length_in_sec=0.63 \
vad.parameters.postprocessing.onset=0.7 \
vad.parameters.postprocessing.offset=0.4 \
vad.parameters.postprocessing.min_duration_on=1 \
vad.parameters.postprocessing.min_duration_off=0.5 \
out_manifest_filepath=$vad_out_manifest_filepath

Let's have a look at the VAD output. If a sample contains no speech segments, it will not appear in the VAD output.
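
Each line of the manifest is a JSON object describing one speech segment. As a rough sketch of how such a file can be inspected programmatically, the snippet below totals the speech duration per audio file. It assumes the usual NeMo manifest keys `audio_filepath`, `offset`, and `duration`; the helper name and the sample lines are made up for illustration and are not taken from the actual output.

```python
import json
from collections import defaultdict


def speech_seconds_per_file(manifest_lines):
    """Sum segment durations per audio file from JSON-lines manifest entries."""
    totals = defaultdict(float)
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        totals[entry["audio_filepath"]] += entry["duration"]
    return dict(totals)


# Made-up entries for illustration only.
example_lines = [
    '{"audio_filepath": "data/a.wav", "offset": 0.5, "duration": 2.0}',
    '{"audio_filepath": "data/a.wav", "offset": 4.0, "duration": 1.5}',
    '{"audio_filepath": "data/b.wav", "offset": 0.0, "duration": 3.0}',
]
print(speech_seconds_per_file(example_lines))
```

With the real file, the same function can be called on an open file handle, for example `speech_seconds_per_file(open(vad_out_manifest_filepath, encoding="utf-8"))`.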

In [None]:
!head -n 10 $vad_out_manifest_filepath

## Transcribe speech segments with CTC models

In [None]:
segmented_output_manifest="asr_segmented_output_manifest.json"
asr_model="stt_en_citrinet_1024_gamma_0_25" # Citrinet is used as an example; you can choose another CTC model.

The <code>transcribe_speech.py</code> script will help you transcribe each speech segment. See more details in the script below.

In [None]:
# if run locally, transcribe_speech.py is located in <NeMo_git_root>/examples/asr/transcribe_speech.py
%run -i scripts/transcribe_speech.py \
    pretrained_name=$asr_model \
    dataset_manifest=$vad_out_manifest_filepath \
    batch_size=32 \
    amp=True \
    output_filename=$segmented_output_manifest

Let's have a look at the segmented ASR transcript.

In [None]:
!head -n 5 $segmented_output_manifest

## Stitch the prediction text of speech segments

You can also evaluate the whole ASR output by stitching the segmented outputs together.

Note that better stitching methods may exist. Here, we demonstrate only the simplest one: concatenation.
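
The idea of stitching by concatenation can be sketched as follows. This is not the code of `stitch_segmented_asr_output`; it is a simplified, hypothetical helper that assumes each segment entry carries `audio_filepath`, `offset`, and `pred_text`, groups the segments by file, orders them by offset, and joins their texts with spaces.

```python
from collections import defaultdict


def stitch_by_concatenation(segments):
    """Join the per-segment transcripts of each audio file in time order.

    `segments` is a list of dicts with keys `audio_filepath`, `offset`, `pred_text`.
    Hypothetical helper for illustration only.
    """
    by_file = defaultdict(list)
    for seg in segments:
        by_file[seg["audio_filepath"]].append(seg)

    stitched = {}
    for path, segs in by_file.items():
        segs.sort(key=lambda s: s["offset"])
        # Skip empty predictions so they do not introduce double spaces.
        stitched[path] = " ".join(s["pred_text"] for s in segs if s["pred_text"])
    return stitched


# Made-up segments, deliberately out of order.
example_segments = [
    {"audio_filepath": "a.wav", "offset": 3.0, "pred_text": "world"},
    {"audio_filepath": "a.wav", "offset": 0.5, "pred_text": "hello"},
    {"audio_filepath": "b.wav", "offset": 0.0, "pred_text": "good morning"},
]
print(stitch_by_concatenation(example_segments))
```

Plain concatenation ignores words that VAD may have cut at segment boundaries, which is one reason a more careful stitching method could do better.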

In [None]:
stitched_output_manifest="stitched_asr_output_manifest.json"

In [None]:
stitched_output_manifest = stitch_segmented_asr_output(segmented_output_manifest)

Let's have a look at the stitched output and the stored speech segments of the first sample.

In [None]:
stitched_output = []
# Read the stitched manifest (one JSON object per line) and close the file afterwards.
with open(stitched_output_manifest, 'r', encoding='utf-8') as f:
    for line in f:
        stitched_output.append(json.loads(line))

print(stitched_output[0])
print(f"\nThe speech segments of the above file are\n{torch.load(stitched_output[0]['speech_segments_filepath'])}")

## Evaluate the performance of offline VAD with ASR

If the input manifest contains ground-truth <code>'text'</code>, we can evaluate the stitched output. First, we align the <code>'text'</code> in the input manifest with the <code>'pred_text'</code> in the stitched ASR output, since some samples in the input manifest may be pure noise and would therefore have been removed by VAD and excluded from ASR inference.
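
The alignment step can be pictured with the sketch below. It is not the code of `construct_manifest_eval`; it is a hypothetical helper built on the assumption that a sample dropped by VAD should be scored against an empty prediction, and that entries can be matched by their `audio_filepath`.

```python
def align_for_eval(input_entries, stitched_entries):
    """Pair each input sample with its prediction, using "" when VAD removed it.

    Hypothetical helper for illustration only.
    """
    pred_by_file = {e["audio_filepath"]: e["pred_text"] for e in stitched_entries}
    aligned = []
    for entry in input_entries:
        merged = dict(entry)  # copy so the input entry is left unchanged
        merged["pred_text"] = pred_by_file.get(entry["audio_filepath"], "")
        aligned.append(merged)
    return aligned


# Made-up entries: noise.wav has no counterpart in the stitched output.
example_input = [
    {"audio_filepath": "a.wav", "text": "hello world"},
    {"audio_filepath": "noise.wav", "text": ""},
]
example_stitched = [{"audio_filepath": "a.wav", "pred_text": "hello word"}]
print(align_for_eval(example_input, example_stitched))
```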

In [None]:
aligned_vad_asr_output_manifest = construct_manifest_eval(input_manifest, stitched_output_manifest)

In [None]:
!head -n 10 $aligned_vad_asr_output_manifest

In [None]:
predicted_text, ground_truth_text = [], []
# Collect predictions and references from the aligned manifest, closing the file afterwards.
with open(aligned_vad_asr_output_manifest, 'r', encoding='utf-8') as f:
    for line in f:
        sample = json.loads(line)
        predicted_text.append(sample['pred_text'])
        ground_truth_text.append(sample['text'])

In [None]:
metric_value = word_error_rate(hypotheses=predicted_text, references=ground_truth_text, use_cer=False)
print(f"WER is {metric_value}")
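
For intuition about the number printed above, word error rate can be computed as the total word-level edit distance divided by the total number of reference words. The sketch below is a plain-Python illustration of that definition; `simple_wer` is a hypothetical name, and the function is not a substitute for NeMo's `word_error_rate`, whose text handling may differ.

```python
def simple_wer(hypotheses, references):
    """Corpus-level WER: total word edit distance / total reference words.

    Hypothetical helper for illustration only.
    """
    total_errors, total_words = 0, 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        # Levenshtein distance over words, keeping only the previous and current rows.
        prev = list(range(len(h) + 1))
        for i, r_word in enumerate(r, start=1):
            curr = [i]
            for j, h_word in enumerate(h, start=1):
                cost = 0 if r_word == h_word else 1
                curr.append(min(prev[j] + 1,          # word missing from hypothesis
                                curr[j - 1] + 1,      # extra word in hypothesis
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_errors += prev[-1]
        total_words += len(r)
    return total_errors / total_words if total_words else float("inf")


# Made-up example: one missing word, plus one sample with an empty prediction.
print(simple_wer(["the cat sat", ""], ["the cat sat down", "hello world"]))
```

In this example, the empty prediction counts every reference word as an error, which is how a speech sample wrongly removed by VAD would affect the score.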

# Further Reading

There are two ways to incorporate VAD into an ASR pipeline. The first strategy is to drop the frames that VAD predicts as `non-speech`, as demonstrated in this tutorial. The second strategy is to keep all the frames and mask the `non-speech` frames with zero-signal values. Also, instead of the segment-VAD shown in this tutorial, a frame-VAD model can be used for faster inference and better accuracy. For more information, please refer to the script [speech_to_text_with_vad.py](https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_vad/speech_to_text_with_vad.py).
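
The difference between the two strategies can be sketched on a toy signal. The helpers below are hypothetical and operate on a plain list of raw samples for simplicity; the referenced script may work on feature frames instead, so treat this only as an illustration of dropping versus masking.

```python
def drop_non_speech(samples, segments, sample_rate):
    """Strategy 1: keep only the samples inside speech segments (output is shorter)."""
    kept = []
    for start, end in segments:
        kept.extend(samples[int(start * sample_rate):int(end * sample_rate)])
    return kept


def mask_non_speech(samples, segments, sample_rate):
    """Strategy 2: keep the original length and zero out everything outside speech."""
    masked = [0.0] * len(samples)
    for start, end in segments:
        lo, hi = int(start * sample_rate), int(end * sample_rate)
        masked[lo:hi] = samples[lo:hi]
    return masked


# Toy signal: 8 samples at 4 samples per second, with speech from 0.5 s to 1.0 s.
toy_samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
toy_segments = [(0.5, 1.0)]
print(drop_non_speech(toy_samples, toy_segments, sample_rate=4))
print(mask_non_speech(toy_samples, toy_segments, sample_rate=4))
```

Masking preserves the original timing of the audio, whereas dropping shortens the input and therefore reduces the amount of audio passed to the ASR model.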