# Introduction

The goal of this tutorial is to demonstrate the basic steps required to set up and train a simple single-channel speech enhancement model in NeMo.

This notebook covers the following steps:

* Download speech and noise data
* Prepare the training data by mixing speech and noise
* Configure and train a simple single-output model
* Configure and train a simple dual-output model

Note that this tutorial is only for demonstration purposes.
To achieve best performance for a particular use case, carefully prepared data and more advanced models should be used.

*Disclaimer:*
Users are responsible for checking the content of the datasets and the applicable licenses, and for determining whether they are suitable for the intended use.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""

GIT_USER, GIT_BRANCH = 'NVIDIA', 'main'

if 'google.colab' in str(get_ipython()):

    # Install dependencies
    !pip install wget
    !apt-get install sox libsndfile1 ffmpeg
    !pip install text-unidecode
    # Quote the requirement so the shell does not treat '>' as an output redirection
    !pip install "matplotlib>=3.3.2"

    ## Install NeMo
    !python -m pip install git+https://github.com/{GIT_USER}/NeMo.git@{GIT_BRANCH}#egg=nemo_toolkit[all]

    ## Install TorchAudio
    # Quote the requirement so the shell does not treat '>' as an output redirection
    !pip install "torchaudio>=0.13.0" -f https://download.pytorch.org/whl/torch_stable.html

The following cell will take care of the necessary imports and prepare utility functions used throughout the notebook.

In [None]:
import glob
import librosa
import librosa.display  # required for librosa.display.specshow on older librosa versions
import os
import torch
import tqdm

import IPython.display as ipd
import matplotlib.pyplot as plt
import numpy as np
import pytorch_lightning as pl
import soundfile as sf

from omegaconf import OmegaConf, open_dict
from pathlib import Path
from torchmetrics.functional.audio import signal_distortion_ratio, scale_invariant_signal_distortion_ratio

from nemo.utils.notebook_utils import download_an4
from nemo.collections.asr.parts.preprocessing.segment import AudioSegment
from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest


# Utility functions for displaying signals and metrics
def show_signal(signal: np.ndarray, sample_rate: int = 16000, tag: str = 'Signal'):
    """Show the time-domain signal and its spectrogram.
    """
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 2.5))

    # show waveform
    t = np.arange(0, len(signal)) / sample_rate

    ax[0].plot(t, signal)
    ax[0].set_xlim(0, t.max())
    ax[0].grid()
    ax[0].set_xlabel('time / s')
    ax[0].set_ylabel('amplitude')
    ax[0].set_title(tag)

    n_fft = 1024
    hop_length = 256

    D = librosa.amplitude_to_db(np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)), ref=np.max)
    img = librosa.display.specshow(D, y_axis='linear', x_axis='time', sr=sample_rate, n_fft=n_fft, hop_length=hop_length, ax=ax[1])
    ax[1].set_title(tag)
    
    plt.tight_layout()
    plt.colorbar(img, format="%+2.f dB", ax=ax)

def show_metrics(signal: np.ndarray, reference: np.ndarray, sample_rate: int = 16000, tag: str = 'Signal'):
    """Show metrics for the time-domain signal and the reference signal.
    """
    sdr = signal_distortion_ratio(preds=torch.tensor(signal), target=torch.tensor(reference))
    sisdr = scale_invariant_signal_distortion_ratio(preds=torch.tensor(signal), target=torch.tensor(reference))
    print(tag)
    print('\tsdr:  ', sdr.item())
    print('\tsisdr:', sisdr.item())

### Data preparation

In this notebook, it is assumed that all audio will be resampled to 16 kHz and that the data and configuration will be stored under `root_dir` as defined below.

In [None]:
# sample rate used throughout the notebook
sample_rate = 16000

# root directory for data preparation, configurations, etc
root_dir = Path('./')

# data directory
data_dir = root_dir / 'data'
data_dir.mkdir(exist_ok=True)

# scripts directory
scripts_dir = root_dir / 'scripts'
scripts_dir.mkdir(exist_ok=True)

Clean speech data is used to prepare datasets used for training a simple speech enhancement model.

In this tutorial, a subset of the LibriSpeech dataset [1] will be downloaded and used as the speech material.
The following cell will download and prepare the speech data.

In [None]:
speech_dir = data_dir / 'speech'
speech_data_set = 'mini'

# Copy script
get_librispeech_script = os.path.join(scripts_dir, 'get_librispeech_data.py')
if not os.path.exists(get_librispeech_script):
    !wget -P $scripts_dir https://raw.githubusercontent.com/{GIT_USER}/NeMo/{GIT_BRANCH}/scripts/dataset_processing/get_librispeech_data.py

# Download the data
if not speech_dir.is_dir():
    speech_dir.mkdir(exist_ok=True)
    !python {get_librispeech_script} --data_root={speech_dir} --data_set={speech_data_set}
else:
    print('Speech dataset already exists in:', speech_dir)

# Reduce the size of the datasets for this tutorial: at most 1000 clean utterances for train and 100 clean utterances for test
train_metadata = read_manifest(speech_dir / 'train_clean_5.json')
write_manifest(speech_dir / 'train.json', train_metadata[:1000])

test_metadata = read_manifest(speech_dir / 'dev_clean_2.json')
write_manifest(speech_dir / 'test.json', test_metadata[:100])

# Speech manifests
speech_manifest = {
    'train': speech_dir / 'train.json',
    'test': speech_dir / 'test.json',
}

Noise data will be mixed with the downloaded speech data to prepare a noisy dataset.

The following cell will download and prepare the noise data. A subset of the DEMAND dataset [2] will be used as the noise material.

In [None]:
noise_dir = data_dir / 'noise'
noise_data_set = 'STRAFFIC,PSTATION'

# Copy script
get_demand_script = os.path.join(scripts_dir, 'get_demand_data.py')
if not os.path.exists(get_demand_script):
    !wget -P $scripts_dir https://raw.githubusercontent.com/{GIT_USER}/NeMo/{GIT_BRANCH}/scripts/dataset_processing/get_demand_data.py

if not noise_dir.is_dir():
    noise_dir.mkdir(exist_ok=True)
    !python {get_demand_script} --data_root={noise_dir} --data_sets={noise_data_set}
else:
    print('Noise directory already exists in:', noise_dir)


def create_noise_manifest(base_dir, subset, offset=0, duration=None):
    """Split the noise data set into train and test subsets.
    """
    complete_noise_manifests = glob.glob(str(base_dir / 'manifests' / '*.json'))
    subset_noise_manifest = base_dir / f'{subset}_manifest.json'
    
    subset_metadata = []

    for noise_manifest in complete_noise_manifests:
        complete_metadata = read_manifest(noise_manifest)
    
        for item in complete_metadata:
            new_item = item.copy()
            new_item['offset'] = offset
            new_item['duration'] = duration
            subset_metadata.append(new_item)

    write_manifest(subset_noise_manifest.as_posix(), subset_metadata)

    return subset_noise_manifest

noise_manifest = {
    'train': create_noise_manifest(noise_dir, 'train', offset=0, duration=200),
    'test': create_noise_manifest(noise_dir, 'test', offset=200, duration=100),
}
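To make the split concrete, the following sketch shows how an `offset` and a `duration` (both in seconds) select a segment of a longer recording at a given sample rate. The helper `segment_bounds` is hypothetical and only illustrates the arithmetic; it is not part of NeMo, and it assumes that each noise recording is long enough to cover both segments.

```python
def segment_bounds(offset: float, duration: float, sample_rate: int = 16000):
    """Return (start, end) sample indices for a segment given in seconds."""
    start = int(round(offset * sample_rate))
    end = start + int(round(duration * sample_rate))
    return start, end


# Same offsets and durations as in the noise manifests above
train_start, train_end = segment_bounds(offset=0, duration=200)
test_start, test_end = segment_bounds(offset=200, duration=100)

# The test segment starts where the train segment ends, so the two do not overlap
print('train samples:', (train_start, train_end))
print('test samples: ', (test_start, test_end))
```

Using disjoint segments of the same recordings keeps the test noise unseen during training while still matching the noise type.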

For this tutorial, a single-channel noisy dataset is constructed by adding speech and noise.

The following block will add speech and noise and save the noisy data. The noisy data is created by mixing speech and noise at a few pre-defined signal-to-noise ratios (SNRs). Note that a separate manifest will be created for each SNR.

In [None]:
%%capture
# Suppress output of this cell, since the script used below is relatively verbose.

# Copy script
add_noise_script = os.path.join(scripts_dir, 'add_noise.py')
if not os.path.exists(add_noise_script):
    !wget -P $scripts_dir https://raw.githubusercontent.com/{GIT_USER}/NeMo/{GIT_BRANCH}/scripts/dataset_processing/add_noise.py

# Generate noisy datasets and save the noise component as well.
noisy_dir = data_dir / 'noisy'
noisy_dir.mkdir(exist_ok=True)

for subset in ['train', 'test']:
    noisy_subset_dir = noisy_dir / subset

    if not noisy_subset_dir.is_dir():
        noisy_subset_dir.mkdir(exist_ok=True)
        !python {add_noise_script} --input_manifest={speech_manifest[subset]} --noise_manifest={noise_manifest[subset]} --out_dir={noisy_subset_dir} --snrs 0 5 10 15 20 --num_workers 4 --save_noise
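For intuition, mixing at a given SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target, and then adding the two signals. The following is a minimal sketch of that idea on synthetic signals; the function `mix_at_snr` is hypothetical, and the script used above may differ in details such as how the noise segment is selected or how power is measured.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float):
    """Scale noise to reach the target SNR in dB, then add it to speech."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the power required by the target SNR
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # synthetic stand-in for speech
noise = rng.standard_normal(16000)  # synthetic stand-in for recorded noise

for snr_db in [0, 5, 10, 15, 20]:
    noisy, scaled_noise = mix_at_snr(speech, noise, snr_db)
    measured = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
    print(f'target SNR: {snr_db:2d} dB, measured SNR: {measured:.2f} dB')
```

Returning the scaled noise alongside the mixture mirrors the idea of saving the noise component, which is needed later as a training target for the dual-output model.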

Training a model requires an input dataset which includes information about the noisy input signal and the desired (target) output signal.

In this tutorial, train and test manifests are created by combining the information from the speech manifests and each noisy manifest generated in the previous step. Note that the final manifests include `noisy_filepath`, `speech_filepath` and `noise_filepath`. These keys can be used to define the input signal and the output signal for the model.

In [None]:
dataset_manifest = {
    'train': data_dir / 'dataset_train.json',
    'test': data_dir / 'dataset_test.json',
}

for subset in ['train', 'test']:
    # Load clean manifest
    speech_metadata = read_manifest(speech_manifest[subset])

    # Load noisy manifests
    noisy_manifests = glob.glob(str(noisy_dir / subset / 'manifests/*.json'))
    noisy_manifests.sort()

    subset_metadata = []

    for noisy_manifest in noisy_manifests:
        noisy_metadata = read_manifest(noisy_manifest)

        for speech_item, noisy_item in tqdm.tqdm(zip(speech_metadata, noisy_metadata), total=len(noisy_metadata)):
            # Check that the file matches
            assert os.path.basename(speech_item['audio_filepath']) == os.path.basename(noisy_item['audio_filepath']), f'Speech: {speech_item}. Noisy: {noisy_item}'

            # Create a new item for the subset manifest
            subset_item = {
                'noisy_filepath': noisy_item['audio_filepath'],
                'speech_filepath': speech_item['audio_filepath'],
                'noise_filepath': noisy_item['noise_filepath'],
                'duration': noisy_item['duration'],
                'offset': noisy_item.get('offset', 0)
            }

            subset_metadata.append(subset_item)

    # Save the subset manifest
    write_manifest(dataset_manifest[subset].as_posix(), subset_metadata)
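The resulting manifest is a text file with one JSON dictionary per line. The sketch below writes and reads such a file using only the standard library, to show the layout that `write_manifest` and `read_manifest` are expected to handle. The file paths and durations are placeholders for illustration, not real files.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical entries with the same keys as the dataset manifests created above
example_items = [
    {'noisy_filepath': 'noisy/a.wav', 'speech_filepath': 'speech/a.wav',
     'noise_filepath': 'noise/a.wav', 'duration': 3.2, 'offset': 0},
    {'noisy_filepath': 'noisy/b.wav', 'speech_filepath': 'speech/b.wav',
     'noise_filepath': 'noise/b.wav', 'duration': 5.1, 'offset': 0},
]

manifest_path = Path(tempfile.mkdtemp()) / 'example_manifest.json'

# Write one JSON dictionary per line
with open(manifest_path, 'w') as f:
    for item in example_items:
        f.write(json.dumps(item) + '\n')

# Read the manifest back, skipping empty lines
with open(manifest_path) as f:
    loaded_items = [json.loads(line) for line in f if line.strip()]

for item in loaded_items:
    print(item['noisy_filepath'], '->', item['speech_filepath'], f"({item['duration']} s)")
```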

### Model configuration

Here, a simple encoder-mask-decoder model will be used to process the noisy input signal and produce an enhanced output signal.

In general, an encoder-mask-decoder model can be configured using the `EncMaskDecAudioToAudioModel` class, which is depicted in the following block diagram.

<img src="https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/encmaskdecoder_model.png" alt="encmaskdecoder_model" style="width: 800px;"/>

The model structure can briefly be described as follows:
* Input to the model is a time-domain signal.
* Encoder transforms the input signal to the analysis domain.
* Mask estimator estimates a mask used to generate the output signal.
* Mask processor combines the estimated mask and the encoded input to produce the encoded output.
* Decoder transforms the encoded output into a time-domain signal.
* Output is a time-domain signal.

For this example, the model will be configured to use a fixed short-time Fourier transform-based encoder and decoder, and the mask will be estimated using a recurrent neural network. The model used here is depicted in the following block diagram.

<img src="https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/single_output_example_model.png" alt="single_output_example_model" style="width: 1000px;"/>

In this particular configuration, the model structure can be described as follows:
* `AudioToSpectrogram` implements the analysis STFT transform.
* `MaskEstimatorRNN` is a mask estimator using RNNs.
* `MaskReferenceChannel` is a simple processor which applies the estimated mask on the reference channel. In this tutorial, the input signal has only a single channel, so the reference channel will be set to `0`.
* `SpectrogramToAudio` implements the synthesis STFT transform.
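The signal flow described above can be sketched with plain NumPy. The example below is illustrative only and does not use the NeMo modules: `encode` and `decode` are hypothetical stand-ins for the analysis and synthesis STFT, the STFT parameters are arbitrary, and the mask is an oracle mask computed from the known speech and noise, whereas the model in this tutorial estimates the mask from the noisy input with an RNN.

```python
import numpy as np

N_FFT, HOP = 512, 256
# Square-root periodic Hann window, used for both analysis and synthesis
WINDOW = np.sqrt(np.hanning(N_FFT + 1)[:-1])


def encode(x: np.ndarray) -> np.ndarray:
    """Analysis STFT: time-domain signal -> complex spectrogram (frames, bins)."""
    x = np.pad(x, (N_FFT, N_FFT))  # pad so that every sample is covered by two frames
    num_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW for i in range(num_frames)])
    return np.fft.rfft(frames, axis=-1)


def decode(spec: np.ndarray, length: int) -> np.ndarray:
    """Synthesis STFT: complex spectrogram -> time-domain signal via overlap-add."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=-1) * WINDOW
    out = np.zeros(HOP * (len(frames) - 1) + N_FFT)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out[N_FFT:N_FFT + length]  # remove the padding added in encode


def enhance(noisy: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Encoder -> mask processor -> decoder, with the mask supplied by the caller."""
    encoded_input = encode(noisy)
    encoded_output = mask * encoded_input  # real-valued mask per time-frequency bin
    return decode(encoded_output, length=len(noisy))


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # synthetic stand-in for speech
noise = 0.5 * rng.standard_normal(16000)
noisy = speech + noise

# Oracle mask from the known components (not available in practice)
speech_mag, noise_mag = np.abs(encode(speech)), np.abs(encode(noise))
oracle_mask = speech_mag / (speech_mag + noise_mag + 1e-8)

enhanced = enhance(noisy, oracle_mask)
print('noisy MSE:   ', np.mean((noisy - speech) ** 2))
print('enhanced MSE:', np.mean((enhanced - speech) ** 2))
```

With a mask of all ones, this encoder and decoder pair reconstructs the input, so any change in the output is due to the mask alone.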

The following cell will load and show the default configuration for the model depicted above.

In [None]:
config_dir = root_dir / 'conf'
config_dir.mkdir(exist_ok=True)

config_path = config_dir / 'masking.yaml'

if not config_path.is_file():
    !wget https://raw.githubusercontent.com/{GIT_USER}/NeMo/{GIT_BRANCH}/examples/audio_tasks/conf/masking.yaml -P {config_dir.as_posix()}

config = OmegaConf.load(config_path)
config = OmegaConf.to_container(config, resolve=True)
config = OmegaConf.create(config)

print('Loaded config')
print(OmegaConf.to_yaml(config))

The training dataset is configured with the following parameters:
* `manifest_filepath` points to a manifest file, with each line containing a dictionary corresponding to a single example
* `input_key` is the key corresponding to the input audio signal in the example dictionary
* `target_key` is the key corresponding to the desired output (target) audio signal in the example dictionary
* `min_duration` can be used to filter out short examples

In [None]:
# Setup training dataset
config.model.train_ds.manifest_filepath = dataset_manifest['train'].as_posix()
config.model.train_ds.input_key = 'noisy_filepath'
config.model.train_ds.target_key = 'speech_filepath'
config.model.train_ds.min_duration = 0 # load all audio files, without filtering short ones

print("Train dataset config:")
print(OmegaConf.to_yaml(config.model.train_ds))

Validation and test datasets can be configured in the same way as the training dataset. Here, we use the same dataset for validation and testing purposes for simplicity.

In [None]:
# Use test manifest for validation and test sets
config.model.validation_ds.manifest_filepath = dataset_manifest['test'].as_posix()
config.model.validation_ds.input_key = 'noisy_filepath'
config.model.validation_ds.target_key = 'speech_filepath'

config.model.test_ds.manifest_filepath = dataset_manifest['test'].as_posix()
config.model.test_ds.input_key = 'noisy_filepath'
config.model.test_ds.target_key = 'speech_filepath'

print("Validation dataset config:")
print(OmegaConf.to_yaml(config.model.validation_ds))

print("Test dataset config:")
print(OmegaConf.to_yaml(config.model.test_ds))

Metrics for validation and test set are configured in the following cell.

In this tutorial, signal-to-distortion ratio (SDR) and scale-invariant SDR (SI-SDR) from TorchMetrics are used [4].

In [None]:
# Setup metrics to compute on validation and test sets
metrics = OmegaConf.create({
    'sisdr': {
        '_target_': 'torchmetrics.audio.ScaleInvariantSignalDistortionRatio',
    },
    'sdr': {
        '_target_': 'torchmetrics.audio.SignalDistortionRatio',
    }
})
config.model.metrics.val = metrics
config.model.metrics.test = metrics

print("Metrics config:")
print(OmegaConf.to_yaml(config.model.metrics))
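For intuition about the scale-invariant metric, the following sketch computes SI-SDR by hand for one-dimensional signals. It is a simplified illustration, not the TorchMetrics implementation, which provides additional options that are not reproduced here. The function name `si_sdr` is hypothetical.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, target: np.ndarray) -> float:
    """Scale-invariant SDR in dB for 1-D signals (no mean removal)."""
    # Project the estimate onto the target to find the optimally scaled target
    alpha = np.dot(estimate, target) / np.dot(target, target)
    scaled_target = alpha * target
    # Everything in the estimate that is not explained by the scaled target
    residual = estimate - scaled_target
    return float(10 * np.log10(np.sum(scaled_target ** 2) / np.sum(residual ** 2)))


rng = np.random.default_rng(0)
target = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # synthetic reference
estimate = target + 0.1 * rng.standard_normal(16000)  # reference plus a small error

print('SI-SDR of the estimate:        ', si_sdr(estimate, target))
print('SI-SDR of the rescaled estimate:', si_sdr(3.0 * estimate, target))
```

Rescaling the estimate leaves the value unchanged, which is the property that distinguishes SI-SDR from a plain energy ratio between the reference and the error.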

### Trainer configuration
NeMo models are primarily PyTorch Lightning modules and therefore are entirely compatible with the PyTorch Lightning ecosystem.

In [None]:
print("Trainer config:")
print(OmegaConf.to_yaml(config.trainer))

We can modify some trainer configs for this tutorial.
Most importantly, the number of epochs is set to a small value, to limit the runtime for the purpose of this example.

In [None]:
# Checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config.trainer.devices = 1
config.trainer.accelerator = accelerator

# Reduces maximum number of epochs for quick demonstration
config.trainer.max_epochs = 10

# Remove distributed training flags
config.trainer.strategy = 'auto'

# Instantiate the trainer
trainer = pl.Trainer(**config.trainer)

### Experiment manager

NeMo has an experiment manager that handles logging and checkpointing.

In [None]:
from nemo.utils.exp_manager import exp_manager

exp_dir = exp_manager(trainer, config.get("exp_manager", None))
# The exp_dir provides a path to the current experiment for easy access

print("Experiment directory:")
print(exp_dir)

### Model instantiation

In [None]:
from nemo.collections import asr as nemo_asr

enhancement_model = nemo_asr.models.EncMaskDecAudioToAudioModel(cfg=config.model, trainer=trainer)

### Training
Create a TensorBoard visualization to monitor progress.

In [None]:
try:
    from google import colab
    COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
    COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
    %load_ext tensorboard
    %tensorboard --logdir {exp_dir}
else:
    print("To use tensorboard, please use this notebook in a Google Colab environment.")

Training can be started using `trainer.fit`:

In [None]:
trainer.fit(enhancement_model)

After the training is completed, the configured metrics can be easily computed on the test set as follows:

In [None]:
trainer.test(enhancement_model, ckpt_path=None)

### Inference

The following cell provides an example of inference on a single audio file.
For simplicity, the audio file information is taken from the test dataset.

In [None]:
# Load a single audio example from the test set
test_metadata = read_manifest(dataset_manifest['test'].as_posix())

# Path to audio files
noisy_filepath = test_metadata[-1]['noisy_filepath'] # noisy audio
speech_filepath = test_metadata[-1]['speech_filepath'] # clean speech
noise_filepath = test_metadata[-1]['noise_filepath'] # corresponding noise

# Load audio
noisy_signal = AudioSegment.from_file(noisy_filepath, target_sr=sample_rate).samples
speech_signal = AudioSegment.from_file(speech_filepath, target_sr=sample_rate).samples

# Move to device
device = 'cuda' if accelerator == 'gpu' else 'cpu'
enhancement_model = enhancement_model.to(device)

# Process using the model
noisy_tensor = torch.tensor(noisy_signal).reshape(1, 1, -1).to(device) # (batch, channel, time)
with torch.no_grad():
    output_tensor, _ = enhancement_model(input_signal=noisy_tensor)
output_signal = output_tensor[0][0].detach().cpu().numpy()

Signals can be easily plotted and signal metrics can be calculated for the given example.

In [None]:
# Show noisy and clean signals
show_metrics(signal=noisy_signal, reference=speech_signal, tag='Noisy signal', sample_rate=sample_rate)
show_metrics(signal=output_signal, reference=speech_signal, tag='Output signal', sample_rate=sample_rate)

# Show signals
show_signal(speech_signal, tag='Speech signal')
show_signal(noisy_signal, tag='Noisy signal')
show_signal(output_signal, tag='Output signal')

# Play audio
print('Speech signal')
ipd.display(ipd.Audio(speech_signal, rate=sample_rate))

print('Noisy signal')
ipd.display(ipd.Audio(noisy_signal, rate=sample_rate))

print('Output signal')
ipd.display(ipd.Audio(output_signal, rate=sample_rate))

If necessary, it is easy to limit the amount of suppression by setting the value of `mask_min` of the `mask_processor`.

This will add a lower-bound (limit) on the applied mask, thereby limiting the amount of signal suppression that the model can achieve.

In [None]:
from nemo.collections.asr.parts.utils.audio_utils import db2mag

# Limit suppression to 10dB
min_mask_db = -10
enhancement_model.mask_processor.mask_min = db2mag(min_mask_db)

with torch.no_grad():
    output_tensor_min_mask, _ = enhancement_model(input_signal=noisy_tensor)
output_signal_min_mask = output_tensor_min_mask[0][0].detach().cpu().numpy()

print('Noisy signal')
ipd.display(ipd.Audio(noisy_signal, rate=sample_rate))

print(f'Output signal with min_mask = {min_mask_db}dB')
ipd.display(ipd.Audio(output_signal_min_mask, rate=sample_rate))

show_metrics(signal=output_signal_min_mask, reference=speech_signal, tag=f'Output signal with min_mask = {min_mask_db}dB', sample_rate=sample_rate)
show_signal(output_signal_min_mask, tag=f'Output signal with min_mask = {min_mask_db}dB')
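The effect of the lower bound can be illustrated on a handful of mask values. The sketch below assumes the usual amplitude convention for converting dB to a linear magnitude and assumes that the bound is applied by clamping the mask from below. The helper `db_to_magnitude` and the mask values are hypothetical.

```python
import numpy as np


def db_to_magnitude(db: float) -> float:
    """Convert a level in dB to a linear magnitude (amplitude) factor."""
    return 10 ** (db / 20)


mask_min = db_to_magnitude(-10)
print(f'-10 dB corresponds to a magnitude of {mask_min:.4f}')

# Hypothetical estimated mask values for a few time-frequency bins
estimated_mask = np.array([0.0, 0.05, 0.3, 0.8, 1.0])

# Assumption: the lower bound is applied by clamping the mask from below
limited_mask = np.maximum(estimated_mask, mask_min)

print('limited mask:    ', np.round(limited_mask, 4))
print('attenuation / dB:', np.round(20 * np.log10(limited_mask), 2))
```

Bins that the model would suppress completely are now attenuated by at most 10 dB, while bins with a mask above the bound are left unchanged.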

### Training a multi-output model

The simple model used in this tutorial can be easily configured to generate two outputs.
This can be useful for either performing speech enhancement or source separation.

For example, it may be beneficial to enforce noise reconstruction in the loss, as has been demonstrated in [3].

This can be achieved with small changes to the model configuration.

<img src="https://github.com/NVIDIA/NeMo/releases/download/v1.18.0/dual_output_example_model.png" alt="dual_output_example_model" style="width: 1000px;"/>

To start, load the same config as earlier and set the number of outputs to two.

In [None]:
# Dual output model
config_dual_output = OmegaConf.load(config_path)

# Set model to have two outputs
config_dual_output.model.num_outputs = 2
config_dual_output = OmegaConf.to_container(config_dual_output, resolve=True)
config_dual_output = OmegaConf.create(config_dual_output)

print('Loaded config for dual output model')
print(OmegaConf.to_yaml(config_dual_output))

Now, datasets need to be configured to generate a single-channel input signal and a two-channel output signal.
This can be easily achieved by providing multiple keys for `target_key`.

In [None]:
# Use the same dataset as before
config_dual_output.model.train_ds.manifest_filepath = dataset_manifest['train'].as_posix()

# The input signal is the same as before
config_dual_output.model.train_ds.input_key = 'noisy_filepath'

# The target signal is now two-channel:
#   the first channel is the clean speech signal
#   the second channel is the noise signal
config_dual_output.model.train_ds.target_key = 'speech_filepath,noise_filepath'
config_dual_output.model.train_ds.min_duration = 0 # load all audio files, without filtering out short files

# Validation and test datasets are configured in the same way
config_dual_output.model.validation_ds.manifest_filepath = dataset_manifest['test'].as_posix()
config_dual_output.model.validation_ds.input_key = 'noisy_filepath'
config_dual_output.model.validation_ds.target_key = 'speech_filepath,noise_filepath'

config_dual_output.model.test_ds.manifest_filepath = dataset_manifest['test'].as_posix()
config_dual_output.model.test_ds.input_key = 'noisy_filepath'
config_dual_output.model.test_ds.target_key = 'speech_filepath,noise_filepath'
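The shapes involved can be sketched with synthetic signals. The example below assumes that the signals listed in `target_key` are stacked along the channel dimension in the given order, so that the first target channel is speech and the second is noise. The signals themselves are placeholders.

```python
import numpy as np

num_samples = 16000
rng = np.random.default_rng(0)

speech = np.sin(2 * np.pi * 440 * np.arange(num_samples) / 16000)  # stand-in for 'speech_filepath'
noise = 0.3 * rng.standard_normal(num_samples)  # stand-in for 'noise_filepath'
noisy = speech + noise  # stand-in for 'noisy_filepath'

# Single-channel input and two-channel target, both shaped (channel, time)
input_signal = noisy[np.newaxis, :]
target_signal = np.stack([speech, noise])

print('input shape: ', input_signal.shape)
print('target shape:', target_signal.shape)
```

Because the noisy signal was created by adding speech and noise, the two target channels sum to the input channel.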

The loss can be easily configured by assigning weights for each output of the model.
For example, speech- and noise-related losses with equal weights can be configured as follows:

In [None]:
# Assign equal weights to speech and noise loss
#   total_loss = 0.5 * speech_loss + 0.5 * noise_loss
config_dual_output.model.loss.weight = [0.5, 0.5]
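The weighting can be sketched as a weighted sum of per-channel losses. In the example below, mean squared error is used as a stand-in for the per-channel loss, which may differ from the loss configured in the model, and the function names are hypothetical.

```python
import numpy as np


def mse(estimate: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error, used here as a stand-in for the per-channel loss."""
    return float(np.mean((estimate - target) ** 2))


def weighted_loss(estimate: np.ndarray, target: np.ndarray, weights) -> tuple:
    """Combine per-channel losses for signals shaped (channel, time)."""
    channel_losses = np.array([mse(e, t) for e, t in zip(estimate, target)])
    total = float(np.dot(weights, channel_losses))
    return total, channel_losses


# Toy two-channel example: channel 0 plays the role of speech, channel 1 of noise
target = np.zeros((2, 4))
estimate = np.array([[1.0, 1.0, 1.0, 1.0],
                     [2.0, 2.0, 2.0, 2.0]])

total_loss, channel_losses = weighted_loss(estimate, target, weights=[0.5, 0.5])
print('speech loss:', channel_losses[0])
print('noise loss: ', channel_losses[1])
print('total loss: ', total_loss)
```

Changing the weights shifts the emphasis between speech and noise reconstruction, for example a larger first weight prioritizes the speech estimate.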

A mixture consistency layer can be added to enforce the estimated sources (speech and noise, in this case) to be consistent with the input mixture [5].

In [None]:
# Add a mixture consistency projection
with open_dict(config_dual_output):
    config_dual_output.model.mixture_consistency = OmegaConf.create({
        '_target_': 'nemo.collections.asr.modules.audio_modules.MixtureConsistencyProjection',
        'weighting': 'power',
    })
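The idea behind the projection can be sketched as follows: the difference between the mixture and the sum of the estimates is distributed back over the estimates, using weights that sum to one. The function `mixture_consistency` below is a hypothetical illustration on time-domain arrays, with an `'equal'` option added for comparison. The NeMo module may operate in a different domain and differ in details such as regularization.

```python
import numpy as np


def mixture_consistency(mixture: np.ndarray, estimates: np.ndarray,
                        weighting: str = 'power', eps: float = 1e-8) -> np.ndarray:
    """Project estimates shaped (source, time) so that they sum to the mixture."""
    num_sources = estimates.shape[0]
    if weighting == 'equal':
        # Every source receives the same share of the residual
        weights = np.full(estimates.shape, 1.0 / num_sources)
    elif weighting == 'power':
        # Sources with more power receive a larger share of the residual
        power = np.abs(estimates) ** 2 + eps
        weights = power / power.sum(axis=0, keepdims=True)
    else:
        raise ValueError(f'Unknown weighting: {weighting}')
    residual = mixture - estimates.sum(axis=0)
    return estimates + weights * residual


rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # synthetic stand-in for speech
noise = 0.3 * rng.standard_normal(16000)
mixture = speech + noise

# Imperfect estimates that do not add up to the mixture
estimates = np.stack([0.8 * speech, 0.6 * noise])
projected = mixture_consistency(mixture, estimates, weighting='power')

print('max error before:', np.max(np.abs(estimates.sum(axis=0) - mixture)))
print('max error after: ', np.max(np.abs(projected.sum(axis=0) - mixture)))
```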

Metrics can be calculated for each output channel separately.
If a channel parameter is not provided, the configured metrics are averaged across all channels.
For example, this can be configured as follows:

In [None]:
# Setup metrics
metrics = OmegaConf.create({
    # Calculate speech metric using the first channel
    'speech_sisdr': {
        '_target_': 'torchmetrics.audio.ScaleInvariantSignalDistortionRatio',
        'channel': 0,
    },
    'speech_sdr': {
        '_target_': 'torchmetrics.audio.SignalDistortionRatio',
        'channel': 0,
    },
    # Calculate noise metric using the second channel
    'noise_sisdr': {
        '_target_': 'torchmetrics.audio.ScaleInvariantSignalDistortionRatio',
        'channel': 1,
    },
    'noise_sdr': {
        '_target_': 'torchmetrics.audio.SignalDistortionRatio',
        'channel': 1,
    },
})
config_dual_output.model.metrics.val = metrics
config_dual_output.model.metrics.test = metrics

The trainer and the experiment manager are set up in exactly the same way as earlier.

In [None]:
# Checks if we have GPU available and uses it
accelerator = 'gpu' if torch.cuda.is_available() else 'cpu'
config_dual_output.trainer.devices = 1
config_dual_output.trainer.accelerator = accelerator

# Reduces maximum number of epochs for quick demonstration
config_dual_output.trainer.max_epochs = 10

# Remove distributed training flags
config_dual_output.trainer.strategy = 'auto'

# Instantiate the trainer
trainer = pl.Trainer(**config_dual_output.trainer)
exp_dir = exp_manager(trainer, config_dual_output.get("exp_manager", None))

print("Experiment directory:")
print(exp_dir)

As earlier, training can be started using `trainer.fit`:

In [None]:
dual_output_model = nemo_asr.models.EncMaskDecAudioToAudioModel(cfg=config_dual_output.model, trainer=trainer)
trainer.fit(dual_output_model)

As earlier, metrics on the test set can be easily calculated as follows:

In [None]:
trainer.test(dual_output_model, ckpt_path=None)

As earlier, it is easy to run inference on an input signal, and the output will contain multiple channels.

In [None]:
dual_output_model = dual_output_model.to(device)

# (batch, channel, time)
noisy_tensor = torch.tensor(noisy_signal).reshape(1, 1, -1).to(device)

with torch.no_grad():
    processed_tensor, _ = dual_output_model(input_signal=noisy_tensor)

# First output channel is the speech estimate
output_speech = processed_tensor[0][0].detach().cpu().numpy()

# The second output channel is the noise estimate
output_noise = processed_tensor[0][1].detach().cpu().numpy()

show_metrics(signal=output_speech, reference=speech_signal, tag='Output speech', sample_rate=sample_rate)

show_signal(noisy_signal, tag='Noisy input')
show_signal(output_speech, tag='Output speech')
show_signal(output_noise, tag='Output noise')

print('Noisy input')
ipd.display(ipd.Audio(noisy_signal, rate=sample_rate))

print('Output speech')
ipd.display(ipd.Audio(output_speech, rate=sample_rate))

print('Output noise')
ipd.display(ipd.Audio(output_noise, rate=sample_rate))

## Next steps
This is a simple tutorial which can serve as a starting point for prototyping and experimentation with audio-to-audio models.
A processed audio output can be used, for example, for ASR or TTS.

For more details about NeMo models and applications in ASR and TTS, we recommend that you check out the following tutorials next:

* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)
* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)
* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)
* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)

## References

[1] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," ICASSP 2015.

[2] J. Thiemann, N. Ito, E. Vincent, "DEMAND: collection of multi-channel recordings of acoustic noise in diverse environments," ICA 2013.

[3] K. Kinoshita, T. Ochiai, M. Delcroix, T. Nakatani, "Improving noise robust automatic speech recognition with single-channel time-domain enhancement network," ICASSP 2020.

[4] https://github.com/Lightning-AI/torchmetrics

[5] S. Wisdom et al., "Differentiable consistency constraints for improved deep speech enhancement," ICASSP 2019.