In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell.
!pip install wget
!pip install git+https://github.com/NVIDIA/apex.git
!pip install nemo-toolkit
!pip install nemo-asr
!pip install unidecode

!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v1.yaml
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/master/examples/asr/configs/quartznet_speech_commands_3x1_v2.yaml

In [2]:
# Import some necessary libraries
import os
import argparse
import copy
import math
import glob
from functools import partial
from datetime import datetime
from ruamel.yaml import YAML

# Introduction

This Speech Command recognition tutorial is based on the QuartzNet model from the paper "[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)" with a modified decoder head to suit classification tasks.

The notebook will follow the steps below:

 - Dataset preparation: Preparing Google Speech Commands dataset

 - Audio preprocessing (feature extraction): signal normalization, windowing, (log) spectrogram (or mel scale spectrogram, or MFCC)

 - Data augmentation using SpecAugment ("[SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)") to increase the variety of training samples.
 
 - Develop a small Neural classification model which can be trained efficiently.
 
 - Model training on the Google Speech Commands dataset in NeMo.
 
 - Evaluation of the model's error cases by listening to the misclassified samples
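To make the augmentation step in the list above more concrete, here is a minimal sketch of the frequency and time masking idea behind SpecAugment. It is not NeMo's `SpectrogramAugmentation` module: the function name `mask_spectrogram`, its parameters, and the toy input are all assumptions made for illustration, and the spectrogram is represented as a plain list of rows (frequency bins by time steps).

```python
import random

def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, seed=None):
    """Return a copy of `spec` (frequency bins x time steps) with one random
    frequency band and one random time band set to 0.0 (hypothetical helper)."""
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    masked = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency masking: zero out `f_width` consecutive frequency bins.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        masked[f] = [0.0] * n_time

    # Time masking: zero out `t_width` consecutive time steps in every bin.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for row in masked:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return masked

# A toy 4 x 8 "spectrogram" filled with ones.
toy_spec = [[1.0] * 8 for _ in range(4)]
augmented = mask_spectrogram(toy_spec, seed=0)
for row in augmented:
    print(row)
```

Because the masks are drawn at random on every call, the same utterance looks slightly different each time it is seen during training, while its label stays the same.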

In [3]:
# This is where the Google Speech Commands directory will be placed.
# Change this if you don't want the data to be extracted in the current directory.
# Select the version of the dataset required as well (can be 1 or 2)
DATASET_VER = 1
data_dir = './google_dataset_v{0}/'.format(DATASET_VER)

# Data Preparation

We will be using the open-source Google Speech Commands Dataset. This tutorial uses V1 of the dataset, but only very minor changes are required to support V2. The scripts below download the dataset and convert it to a format suitable for use with `nemo_asr`.

## Download the dataset

The dataset must be prepared using the scripts provided under the `{NeMo root directory}/scripts` sub-directory. 

Run the commands below to download the data processing script and execute it.

**NOTE**: You need at least 4GB of free disk space for `--data_version=1` and at least 6GB for `--data_version=2`. Downloading and processing also take some time, so go grab a coffee.

**NOTE**: You may additionally append a `--rebalance` flag to the `process_speech_commands_data.py` command to rebalance the class samples in the manifest.

In [4]:
!wget https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/process_speech_commands_data.py

--2020-02-27 16:14:08--  https://raw.githubusercontent.com/NVIDIA/NeMo/master/scripts/process_speech_commands_data.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6872 (6.7K) [text/plain]
Saving to: ‘process_speech_commands_data.py.1’


2020-02-27 16:14:08 (58.8 MB/s) - ‘process_speech_commands_data.py.1’ saved [6872/6872]



In [5]:
!mkdir -p {data_dir}
!python process_speech_commands_data.py --data_root={data_dir} --data_version={DATASET_VER}
print("Dataset ready !")

Dataset ready !


## Prepare the path to manifest files

In [6]:
dataset_path = 'google_speech_recognition_v{0}'.format(DATASET_VER)
dataset_basedir = os.path.join(data_dir, dataset_path)

train_dataset = os.path.join(dataset_basedir, 'train_manifest.json')
val_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')
# NOTE: the validation manifest is reused here as the test set
test_dataset = os.path.join(dataset_basedir, 'validation_manifest.json')

## Read a few rows of the manifest file 

Manifest files are the data structure NeMo uses to declare a few important details about the data:

1) `audio_filepath`: Refers to the path to the raw audio file <br>
2) `command`: The class label (or speech command) of this sample <br>
3) `duration`: The length of the audio file, in seconds.
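As a rough illustration of the three fields listed above, the sketch below writes two made-up manifest rows to a temporary file and reads them back. The one-JSON-object-per-line layout matches how the `parse_manifest` helper later in this notebook reads the file; the file paths and values are invented for the example.

```python
import json
import os
import tempfile

# Two made-up manifest rows; the file paths and values are illustrative only.
rows = [
    {"audio_filepath": "/data/yes/0001.wav", "duration": 1.0, "command": "yes"},
    {"audio_filepath": "/data/no/0002.wav", "duration": 0.98, "command": "no"},
]

# Write one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")
    manifest_path = f.name

# Read the manifest back line by line, skipping blank lines.
with open(manifest_path) as f:
    parsed = [json.loads(line) for line in f if line.strip()]
os.remove(manifest_path)

for entry in parsed:
    print(entry["command"], entry["duration"], entry["audio_filepath"])
```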

In [7]:
!head -n 5 {train_dataset}



# Training - Preparation

We will be training a QuartzNet model from the paper "[QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions](https://arxiv.org/pdf/1910.10261.pdf)". The benefit of QuartzNet models over Jasper models is that they use separable convolutions, which greatly reduce the number of parameters required to reach good accuracy.

QuartzNet models generally follow the model definition pattern QuartzNet-[BxR], where B is the number of blocks and R is the number of convolutional sub-blocks within each block. Each sub-block contains a 1-D masked convolution, batch normalization, ReLU, and dropout.
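To see why separable convolutions reduce the parameter count, the sketch below compares the weight count of a regular 1-D convolution with that of a depthwise convolution followed by a pointwise (kernel size 1) convolution. Biases are ignored, and the kernel size and channel counts are illustrative values rather than numbers taken from the QuartzNet config.

```python
def conv1d_params(kernel_size, c_in, c_out):
    """Weights in a regular 1-D convolution (bias ignored)."""
    return kernel_size * c_in * c_out

def separable_conv1d_params(kernel_size, c_in, c_out):
    """Weights in a depthwise + pointwise 1-D convolution (bias ignored)."""
    depthwise = kernel_size * c_in  # one filter of length `kernel_size` per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes the channels
    return depthwise + pointwise

# Illustrative layer shape, not taken from the notebook's config file.
k, c_in, c_out = 33, 256, 256
regular = conv1d_params(k, c_in, c_out)
separable = separable_conv1d_params(k, c_in, c_out)

print(f"regular:   {regular} weights")
print(f"separable: {separable} weights")
print(f"reduction: {regular / separable:.1f}x")
```

The saving grows with the kernel size, because the kernel length multiplies only the input channels in the separable case instead of the full input-by-output channel product.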


In [8]:
# Let's load the config file for the QuartzNet 3x1 model
# Here we will be using separable convolutions
# with 3 blocks, each repeated once (B=3, R=1)
yaml = YAML(typ="safe")
with open("configs/quartznet_speech_commands_3x1_v{0}.yaml".format(DATASET_VER)) as f:
    jasper_params = yaml.load(f)

# Pre-define a set of labels that this model must learn to predict
labels = jasper_params['labels']

# Get the sampling rate of the data
sample_rate = jasper_params['sample_rate']

In [9]:
# Import NeMo core functionality
# NeMo's "core" package
import nemo
# NeMo's ASR collection
import nemo.collections.asr as nemo_asr
# NeMo's learning rate policy
from nemo.utils.lr_policies import CosineAnnealing
from nemo.collections.asr.helpers import (
    monitor_classification_training_progress,
    process_classification_evaluation_batch,
    process_classification_evaluation_epoch,
)
from nemo.collections.asr.metrics import classification_accuracy

logging = nemo.logging

## Define some model hyperparameters

In [10]:
# Let's define some hyperparameters
lr = 0.05
num_epochs = 5
batch_size = 128
weight_decay = 0.001

## Define the NeMo components

In [11]:
# Create a Neural Factory
# It creates log files and tensorboard writers for us among other functions
neural_factory = nemo.core.NeuralModuleFactory(
    log_dir='./{0}/quartznet-3x1-v{1}'.format(dataset_basedir, DATASET_VER),
    create_tb_writer=True)
tb_writer = neural_factory.tb_writer

[NeMo W 2020-02-27 16:14:20 deprecated:68] Function ``_get_trainer`` is deprecated. It is going to be removed in the future version.


In [12]:
# Check if data augmentation such as white noise and time shift augmentation should be used
audio_augmentor = jasper_params.get('AudioAugmentor', None)

# Build the input data layer and the preprocessing layers for the train set
train_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=train_dataset,
    labels=labels,
    sample_rate=sample_rate,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    augmentor=audio_augmentor,
    shuffle=True
)

# Build the input data layer and the preprocessing layers for the test set
eval_data_layer = nemo_asr.AudioToSpeechLabelDataLayer(
    manifest_filepath=test_dataset,
    sample_rate=sample_rate,
    labels=labels,
    batch_size=batch_size,
    num_workers=os.cpu_count(),
    shuffle=False,
)

# We will convert the raw audio data into MelSpectrogram Features to feed as input to our model
data_preprocessor = nemo_asr.AudioToMelSpectrogramPreprocessor(
    sample_rate=sample_rate, **jasper_params["AudioToMelSpectrogramPreprocessor"],
)

# Compute the total number of samples and the number of training steps per epoch
N = len(train_data_layer)
steps_per_epoch = math.ceil(N / float(batch_size) + 1)

logging.info("Steps per epoch : {0}".format(steps_per_epoch))
logging.info('Have {0} examples to train on.'.format(N))

# Here we begin defining all of the augmentations we want
# We will pad the preprocessed spectrogram image to have a certain number of timesteps
# This centers the generated spectrogram and adds black boundaries to either side
# of the padded image.
crop_pad_augmentation = nemo_asr.CropOrPadSpectrogramAugmentation(audio_length=128)

# We also optionally add `SpecAugment` augmentations based on the config file
# SpecAugment has various possible augmentations to the generated spectrogram
# 1) Frequency band masking
# 2) Time band masking
# 3) Rectangular cutout
spectr_augment_config = jasper_params.get('SpectrogramAugmentation', None)
if spectr_augment_config:
    data_spectr_augmentation = nemo_asr.SpectrogramAugmentation(**spectr_augment_config)

# Build the QuartzNet Encoder model
# The config defines the layers as a list of dictionaries
# The first and last two blocks are not considered when we say QuartzNet-[BxR]
# B is counted as the number of blocks after the first layer and before the penultimate layer.
# R is defined as the number of repetitions of each block in B.
# Note: We can scale the convolution kernel sizes with the float parameter `kernel_size_factor`
jasper_encoder = nemo_asr.JasperEncoder(**jasper_params["JasperEncoder"])

# We then define the QuartzNet decoder.
# This decoder head is specialized for the task of classification: it
# accepts a set of `N-feat` features per timestep of the model, averages these features
# over all the timesteps, and then applies a linear classification layer to the result.
jasper_decoder = nemo_asr.JasperDecoderForClassification(
    feat_in=jasper_params["JasperEncoder"]["jasper"][-1]["filters"],
    num_classes=len(labels),
    **jasper_params['JasperDecoderForClassification'],
)

# We can easily apply cross entropy loss to train this model
ce_loss = nemo_asr.CrossEntropyLossNM()

[NeMo I 2020-02-27 16:14:21 collections:215] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-02-27 16:14:21 collections:215] Filtered duration for loading collection is 0.000000.
[NeMo I 2020-02-27 16:14:21 features:144] PADDING: 16
[NeMo I 2020-02-27 16:14:21 features:152] STFT using conv
[NeMo I 2020-02-27 16:14:24 <ipython-input-12-242cb97ccf7d>:34] Steps per epoch : 401
[NeMo I 2020-02-27 16:14:24 <ipython-input-12-242cb97ccf7d>:35] Have 51088 examples to train on.




In [13]:
# Let's print out the number of parameters of this model
logging.info('================================')
logging.info(f"Number of parameters in encoder: {jasper_encoder.num_weights}")
logging.info(f"Number of parameters in decoder: {jasper_decoder.num_weights}")
logging.info(
    f"Total number of parameters in model: " f"{jasper_decoder.num_weights + jasper_encoder.num_weights}"
)
logging.info('================================')

[NeMo I 2020-02-27 16:14:24 <ipython-input-13-6805b5462cf6>:3] Number of parameters in encoder: 73344
[NeMo I 2020-02-27 16:14:24 <ipython-input-13-6805b5462cf6>:4] Number of parameters in decoder: 3870
[NeMo I 2020-02-27 16:14:24 <ipython-input-13-6805b5462cf6>:6] Total number of parameters in model: 77214


## Compile the Training Graph for NeMo

In [14]:
# Now we have all of the components that are required to build the NeMo execution graph!
## Build the training data loaders and preprocessors first
audio_signal, audio_signal_len, commands, command_len = train_data_layer()
processed_signal, processed_signal_len = data_preprocessor(input_signal=audio_signal, length=audio_signal_len)
processed_signal, processed_signal_len = crop_pad_augmentation(
    input_signal=processed_signal,
    length=processed_signal_len
)

## Augment the dataset for training
if spectr_augment_config:
    processed_signal = data_spectr_augmentation(input_spec=processed_signal)

## Define the model
encoded, encoded_len = jasper_encoder(audio_signal=processed_signal, length=processed_signal_len)
decoded = jasper_decoder(encoder_output=encoded)

## Obtain the train loss
train_loss = ce_loss(logits=decoded, labels=commands)

## Compile the Test Graph for NeMo

In [15]:
# Now we build the test graph in a similar way, reusing the above components
## Build the test data loader and preprocess same way as train graph
## But note, we do not add the spectrogram augmentation to the test graph !
test_audio_signal, test_audio_signal_len, test_commands, test_command_len = eval_data_layer()
test_processed_signal, test_processed_signal_len = data_preprocessor(
    input_signal=test_audio_signal, length=test_audio_signal_len
)
test_processed_signal, test_processed_signal_len = crop_pad_augmentation(
    input_signal=test_processed_signal, length=test_processed_signal_len
)

# Pass the test data through the model encoder and decoder
test_encoded, test_encoded_len = jasper_encoder(
    audio_signal=test_processed_signal, length=test_processed_signal_len
)
test_decoded = jasper_decoder(encoder_output=test_encoded)

# Compute test loss for visualization
test_loss = ce_loss(logits=test_decoded, labels=test_commands)

## Setting up callbacks for training and test set evaluation, and checkpoint saving

In [16]:
# Now that we have our training and evaluation graphs built,
# we can focus on a few callbacks to help us save the model checkpoints
# during training, as well as display train and test metrics

# Callbacks needed to print train info to console and Tensorboard
train_callback = nemo.core.SimpleLossLoggerCallback(
    # Notice that we pass in loss, predictions, and the labels.
    # Of course we would like to see our training loss, but we need the
    # other arguments to calculate the accuracy.
    tensors=[train_loss, decoded, commands],
    # The print_func defines what gets printed.
    print_func=partial(monitor_classification_training_progress, eval_metric=None),
    get_tb_values=lambda x: [("loss", x[0])],
    tb_writer=neural_factory.tb_writer,
)

# Callbacks needed to print test info to console and Tensorboard
tagname = 'TestSet'
eval_callback = nemo.core.EvaluatorCallback(
    eval_tensors=[test_loss, test_decoded, test_commands],
    user_iter_callback=partial(process_classification_evaluation_batch, top_k=1),
    user_epochs_done_callback=partial(process_classification_evaluation_epoch, eval_metric=1, tag=tagname),
    eval_step=200,  # How often we evaluate the model on the test set
    tb_writer=neural_factory.tb_writer,
)

# Callback to save model checkpoints
chpt_callback = nemo.core.CheckpointCallback(
    folder=neural_factory.checkpoint_dir,
    step_freq=1000,
)

# Prepare a list of callbacks to pass to the engine
callbacks = [train_callback, eval_callback, chpt_callback]

# Training the model

Even with such a small model (77k parameters) and just 5 epochs (which should take only a few minutes to train), you should be able to get a test set accuracy in the range of 85-90%. Not bad for a 30-way (v1) or 35-way (v2) classification problem!

Experiment with increasing the number of epochs or with batch size to see how much you can improve the score!
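The training cell that follows uses a `CosineAnnealing` learning rate policy with a warmup ratio and a minimum learning rate. As a rough picture of what such a schedule looks like, here is a minimal sketch of linear warmup followed by cosine decay. It is an assumption-based illustration, not NeMo's implementation: the function `cosine_lr` and its exact warmup formula are hypothetical, while the values `0.05`, `0.001`, `0.05`, and `5 * 401` steps are taken from this notebook's settings and logged run.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0, warmup_ratio=0.0):
    """Hypothetical schedule: linear warmup, then cosine decay to `min_lr`."""
    warmup_steps = int(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup towards base_lr over the first `warmup_steps` steps.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay from base_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 5 * 401  # num_epochs * steps_per_epoch
schedule = [
    cosine_lr(s, total, base_lr=0.05, min_lr=0.001, warmup_ratio=0.05)
    for s in range(total + 1)
]
for s in (0, 50, 100, 1000, total):
    print(s, round(schedule[s], 5))
```

The learning rate ramps up quickly during the first few percent of training, then falls smoothly so that the final steps make only small updates.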

In [17]:
# Now we have all the components required to train the model

# Define a learning rate schedule
lr_policy = CosineAnnealing(
    total_steps=num_epochs * steps_per_epoch,
    warmup_ratio=0.05,
    min_lr=0.001,
)

logging.info(f"Using `{lr_policy}` Learning Rate Scheduler")

# Finally, let's train this model!
neural_factory.train(
    tensors_to_optimize=[train_loss],
    callbacks=callbacks,
    lr_policy=lr_policy,
    optimizer="novograd",
    optimization_params={
        "num_epochs": num_epochs,
        "max_steps": None,
        "lr": lr,
        "momentum": 0.95,
        "betas": (0.98, 0.5),
        "weight_decay": weight_decay,
        "grad_norm_clip": None,
    },
    batches_per_step=1,
)

[NeMo I 2020-02-27 16:14:24 <ipython-input-17-e865cc4031ec>:11] Using `<nemo.utils.lr_policies.CosineAnnealing object at 0x7f547c65e210>` Learning Rate Scheduler
[NeMo I 2020-02-27 16:14:24 callbacks:179] Starting .....
[NeMo I 2020-02-27 16:14:24 callbacks:343] Found 2 modules with weights:
[NeMo I 2020-02-27 16:14:24 callbacks:345] JasperEncoder
[NeMo I 2020-02-27 16:14:24 callbacks:345] JasperDecoderForClassification
[NeMo I 2020-02-27 16:14:24 callbacks:346] Total model parameters: 77214
[NeMo I 2020-02-27 16:14:24 callbacks:301] Restoring checkpoint from folder ././google_dataset_v1/google_speech_recognition_v1/quartznet-3x1-v1/checkpoints ...
[NeMo I 2020-02-27 16:14:24 callbacks:186] Done in 0.025618553161621094
[NeMo I 2020-02-27 16:14:24 callbacks:432] Final Evaluation ..............................
[NeMo I 2020-02-27 16:14:27 callbacks:437] Evaluation time: 2.7541420459747314 seconds
[NeMo I 2020-02-27 16:14:27 callbacks:293] Saved checkpoint: ././google_dataset_v1/google_spe

# Evaluation of incorrectly predicted samples

Given that we have a trained model that performs reasonably well, let's listen to the samples where the model is least confident in its predictions.

**NOTE**: The following code depends on the librosa library. To install it, run the following code block first.

In [18]:
!pip install librosa



In [19]:
# Path to the checkpoint directory
model_path = neural_factory.checkpoint_dir

## Extract the predictions from the model

We want the actual logits of the model rather than just the final evaluation score, so we use `neural_factory.infer(...)` to extract the logits for each batch of samples.

In [20]:
# --- Inference Only --- #
# We've already built the inference DAG above, so all we need is to call infer().
evaluated_tensors = neural_factory.infer(
    # These are the tensors we want to get from the model.
    tensors=[test_loss, test_decoded, test_commands],
    # checkpoint_dir specifies where the model params are loaded from.
    checkpoint_dir=model_path
    )

[NeMo I 2020-02-27 16:14:28 actions:1453] Restoring JasperEncoder from ././google_dataset_v1/google_speech_recognition_v1/quartznet-3x1-v1/checkpoints/JasperEncoder-STEP-2000.pt
[NeMo I 2020-02-27 16:14:28 actions:1453] Restoring JasperDecoderForClassification from ././google_dataset_v1/google_speech_recognition_v1/quartznet-3x1-v1/checkpoints/JasperDecoderForClassification-STEP-2000.pt
[NeMo I 2020-02-27 16:14:29 actions:726] Evaluating batch 0 out of 54
[NeMo I 2020-02-27 16:14:29 actions:726] Evaluating batch 5 out of 54
[NeMo I 2020-02-27 16:14:29 actions:726] Evaluating batch 10 out of 54
[NeMo I 2020-02-27 16:14:29 actions:726] Evaluating batch 15 out of 54
[NeMo I 2020-02-27 16:14:30 actions:726] Evaluating batch 20 out of 54
[NeMo I 2020-02-27 16:14:30 actions:726] Evaluating batch 25 out of 54
[NeMo I 2020-02-27 16:14:30 actions:726] Evaluating batch 30 out of 54
[NeMo I 2020-02-27 16:14:30 actions:726] Evaluating batch 35 out of 54
[NeMo I 2020-02-27 16:14:30 actions:726] Eva

## Accuracy calculation

In [21]:
correct_count = 0
total_count = 0

for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    acc = classification_accuracy(
        logits=logits,
        targets=labels,
        top_k=[1]
    )

    # Select top 1 accuracy only
    acc = acc[0]

    # Since accuracy here is "per batch", we simply denormalize it by multiplying
    # by batch size to recover the count of correct samples.
    correct_count += int(acc * logits.size(0))
    total_count += logits.size(0)

logging.info(f"Total correct / Total count : {correct_count} / {total_count}")
logging.info(f"Final accuracy : {correct_count / float(total_count)}")

[NeMo I 2020-02-27 16:14:31 <ipython-input-21-674fb7de9132>:19] Total correct / Total count : 6094 / 6798
[NeMo I 2020-02-27 16:14:31 <ipython-input-21-674fb7de9132>:20] Final accuracy : 0.8964401294498382


## Filtering out incorrect samples
Let us now pick out the samples in the test set that the model predicted incorrectly.
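The idea can be sketched without any framework: turn each sample's logits into probabilities with a softmax, take the most probable class as the prediction, and keep the samples whose prediction differs from the label. The helper names `softmax` and `find_errors` and the toy logits below are assumptions for illustration; the notebook's own cells do the same thing on batched tensors with `torch`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def find_errors(all_logits, all_labels):
    """Return (index, predicted, label, confidence) for each wrong prediction,
    sorted so that the least confident mistakes come first."""
    errors = []
    for idx, (logits, label) in enumerate(zip(all_logits, all_labels)):
        probs = softmax(logits)
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax
        if pred != label:
            errors.append((idx, pred, label, probs[pred]))
    return sorted(errors, key=lambda e: e[-1])

# Three toy samples over three classes; sample 0 is classified correctly.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 0.1], [0.1, 3.0, 0.2]]
toy_labels = [0, 2, 0]
errors = find_errors(toy_logits, toy_labels)
for err in errors:
    print(err)
```

Sorting by the probability of the predicted class separates near-ties, where the model was unsure, from confident mistakes, which may point to mislabeled or unusually noisy audio.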

In [22]:
import torch
import librosa
import json
import IPython.display as ipd

In [23]:
# First, let's create a utility class to remap the integer class labels to the actual string labels
class ReverseMapLabel:
    def __init__(self, data_layer: nemo_asr.AudioToSpeechLabelDataLayer):
        self.label2id = dict(data_layer._dataset.label2id)
        self.id2label = dict(data_layer._dataset.id2label)

    def __call__(self, pred_idx, label_idx):
        return self.id2label[pred_idx], self.id2label[label_idx]

In [24]:
# Next, let's get the indices of all the incorrectly predicted samples
sample_idx = 0
incorrect_preds = []
rev_map = ReverseMapLabel(eval_data_layer)

# Remember, evaluated_tensors = (loss, logits, labels)
for batch_idx, (logits, labels) in enumerate(zip(evaluated_tensors[1], evaluated_tensors[2])):
    probs = torch.softmax(logits, dim=-1)
    probas, preds = torch.max(probs, dim=-1)

    incorrect_ids = (preds != labels).nonzero()
    for idx in incorrect_ids:
        proba = float(probas[idx][0])
        pred = int(preds[idx][0])
        label = int(labels[idx][0])
        idx = int(idx[0]) + sample_idx

        incorrect_preds.append((idx, *rev_map(pred, label), proba))

    sample_idx += labels.size(0)

logging.info(f"Num test samples : {total_count}")
logging.info(f"Num errors : {len(incorrect_preds)}")

# Sort by prediction confidence, least confident first
incorrect_preds = sorted(incorrect_preds, key=lambda x: x[-1], reverse=False)

[NeMo I 2020-02-27 16:14:31 <ipython-input-24-3ed571e8b863>:22] Num test samples : 6798
[NeMo I 2020-02-27 16:14:31 <ipython-input-24-3ed571e8b863>:23] Num errors : 704


## Examine a subset of incorrect samples
Let's print the (test id, predicted label, ground truth label, confidence) tuple of the 20 incorrectly predicted samples with the lowest confidence.

In [25]:
for incorrect_sample in incorrect_preds[:20]:
    logging.info(str(incorrect_sample))

[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (3184, 'up', 'two', 0.13125509023666382)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (1966, 'wow', 'no', 0.13236339390277863)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (1415, 'up', 'yes', 0.13250434398651123)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (5428, 'nine', 'up', 0.13804833590984344)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (1837, 'up', 'zero', 0.1411990523338318)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (3083, 'four', 'two', 0.14131611585617065)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (885, 'one', 'eight', 0.143906369805336)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (5584, 'go', 'cat', 0.14928434789180756)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9>:2] (6056, 'dog', 'sheila', 0.1584177315235138)
[NeMo I 2020-02-27 16:14:31 <ipython-input-25-631305d430a9

## Define a threshold below which we designate a model's prediction as "low confidence"

In [26]:
# Count how many such samples exist
low_confidence_threshold = 0.25
count_low_confidence = len(list(filter(lambda x: x[-1] <= low_confidence_threshold, incorrect_preds)))
logging.info(f"Number of low confidence predictions : {count_low_confidence}")

[NeMo I 2020-02-27 16:14:31 <ipython-input-26-a1b4199a519e>:4] Number of low confidence predictions : 39


# Let's hear the samples in which the model has the least confidence!

In [27]:
# First, let's create a helper function to parse the manifest files
def parse_manifest(manifest):
    data = []
    for line in manifest:
        line = json.loads(line)
        data.append(line)

    return data

In [28]:
# Next, let's create a helper function to actually listen to certain samples
def listen_to_file(sample_id, pred=None, label=None, proba=None):
    # Load the audio waveform using librosa
    filepath = test_samples[sample_id]['audio_filepath']
    # sr=None keeps the file's native sampling rate instead of resampling
    audio, sample_rate = librosa.load(filepath, sr=None)

    if pred is not None and label is not None and proba is not None:
        logging.info(f"Sample : {sample_id} Prediction : {pred} Label : {label} Confidence = {proba: 0.4f}")
    else:
        logging.info(f"Sample : {sample_id}")

    return ipd.Audio(audio, rate=sample_rate)


In [30]:
# Now let's load the test manifest into memory
test_samples = []
with open(test_dataset, 'r') as test_f:
    test_samples = test_f.readlines()

test_samples = parse_manifest(test_samples)

In [None]:
# Finally, let's listen to the low-confidence audio samples where the model made a mistake
# Note: The full list of incorrect samples may be quite large, so we only play the first `count_low_confidence` entries of `incorrect_preds`
for sample_id, pred, label, proba in incorrect_preds[:count_low_confidence]:
    ipd.display(listen_to_file(sample_id, pred=pred, label=label, proba=proba))