# Train Adapt Optimize (TAO) Toolkit

Train Adapt Optimize (TAO) Toolkit is a Python-based AI toolkit for taking purpose-built pre-trained AI models and customizing them with your own data.

Transfer learning carries learned features over from an existing neural network to a new one. It is often used when creating a large training dataset is not feasible.

Developers, researchers, and software partners building intelligent AI apps and services can bring their own data to fine-tune pre-trained models instead of going through the hassle of training from scratch.

![Train Adapt Optimize (TAO) Toolkit](https://developer.nvidia.com/sites/default/files/akamai/embedded-transfer-learning-toolkit-software-stack-1200x670px.png)

The goal of this toolkit is to reduce an 80-hour workload to an 8-hour workload, which enables data scientists to run considerably more train-test iterations in the same time frame.

Let's see this in action with a use case for Speech Synthesis!

## Text to Speech

Text to Speech (TTS) is often the last step in building a Conversational AI model. A TTS model converts text into audible speech. The main objective is to synthesize reasonable and natural speech for a given text. Since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained.

In TAO Toolkit, TTS is made up of two models: [FastPitch](https://arxiv.org/pdf/2006.06873.pdf) for spectrogram generation and [HiFiGAN](https://arxiv.org/pdf/2010.05646.pdf) as the vocoder.

---
## Let's Dig in: TTS using TAO

This notebook assumes that you are already familiar with TTS Training using TAO, as described in the [text-to-speech-training](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/texttospeech_notebook) notebook, and that you have a pretrained TTS model.

### Installing and setting up TAO

For ease of use, please install TAO inside a Python virtual environment. We recommend performing this step first and then launching the notebook from the virtual environment.

In addition to installing the TAO Python package, please make sure the following software requirements are met:

1. python 3.6.9
2. docker-ce > 19.03.5
3. docker-API 1.40
4. nvidia-container-toolkit > 1.3.0-1
5. nvidia-container-runtime > 3.4.0-1
6. nvidia-docker2 > 2.5.0-1
7. nvidia-driver >= 455.23

Running the cell below installs TAO Toolkit.

In [None]:
! pip3 install wheel
! pip3 install --force-reinstall nvidia-pyindex
! pip3 install --force-reinstall nvidia-tao
#! sudo apt install --reinstall nvidia-container-toolkit nvidia-container-runtime nvidia-docker2

In [None]:
!pip3 install librosa
!pip3 install matplotlib
!pip3 install --upgrade hydra-core hydra
#! pip install numba==0.48
#! pip install librosa==0.7
#! pip install soundfile

After installing TAO, the next step is to set up the mounts for TAO. The TAO launcher uses docker containers under the hood, and **for our data and results directories to be visible to the docker container, they need to be mapped**. The launcher can be configured using the config file `~/.tao_mounts.json`. Apart from the mounts, you can also configure additional options like the environment variables and the amount of shared memory available to the TAO launcher. <br>

Replace the paths in the cell below with the required paths on your machine, each enclosed in `""` as a string.

`IMPORTANT NOTE:` The code below creates a sample `~/.tao_mounts.json`  file. Here, we can map directories in which we save the data, specs, results and cache. You should configure it for your specific case so these directories are correctly visible to the docker container.

In [None]:
# please define these paths on your local host machine
import os
from pathlib import Path

os.environ["HOST_DATA_DIR"] = "/home/davesarmoury/ws/glados_ws/TAO/tmp/data"
os.environ["HOST_SPECS_DIR"] = "/home/davesarmoury/ws/glados_ws/TAO/tmp/specs"
os.environ["HOST_RESULTS_DIR"] = "/home/davesarmoury/ws/glados_ws/TAO/tmp/results"

In [None]:
! mkdir -p $HOST_DATA_DIR
! mkdir -p $HOST_SPECS_DIR
! mkdir -p $HOST_RESULTS_DIR

In [None]:
# Mapping up the local directories to the TAO docker.
import json
import os
mounts_file = os.path.expanduser("~/.tao_mounts.json")
tlt_configs = {
   "Mounts":[
       {
           "source": os.environ["HOST_DATA_DIR"],
           "destination": "/data"
       },
       {
           "source": os.environ["HOST_SPECS_DIR"],
           "destination": "/specs"
       },
       {
           "source": os.environ["HOST_RESULTS_DIR"],
           "destination": "/results"
       },
       {
           "source": os.path.expanduser("~/.cache"),
           "destination": "/root/.cache"
       },
   ],
   "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
         }
   }
}
# Writing the mounts file.
with open(mounts_file, "w") as mfile:
    json.dump(tlt_configs, mfile, indent=4)
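
Because the rest of the notebook switches between host paths (`HOST_DATA_DIR`) and container paths (`/data`), it can help to translate one into the other. The following is a minimal sketch, not part of TAO: `host_to_container` and `example_mounts` are hypothetical names, and the host paths are made up for illustration. It only assumes the `"Mounts"` structure written above.

```python
import posixpath


def host_to_container(host_path, mounts):
    """Map a host path to the path the container sees, using "Mounts" entries."""
    host_path = posixpath.normpath(host_path)
    for mount in mounts:
        source = posixpath.normpath(mount["source"])
        if host_path == source:
            return mount["destination"]
        if host_path.startswith(source + "/"):
            relative = host_path[len(source) + 1:]
            return posixpath.join(mount["destination"], relative)
    raise ValueError(f"{host_path} is not under any mounted source directory")


# Hypothetical mounts, shaped like the "Mounts" list in ~/.tao_mounts.json.
example_mounts = [
    {"source": "/home/user/tao/data", "destination": "/data"},
    {"source": "/home/user/tao/results", "destination": "/results"},
]
print(host_to_container("/home/user/tao/data/GLaDOS/manifest.json", example_mounts))
```

A path that is not under any mounted source raises `ValueError`, which is usually a sign that the file will not be visible inside the container.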

You can check the docker image versions and the tasks they perform with `tao --help` or with the command below:

In [None]:
! tao info --verbose

### Set Relevant Paths

In [None]:
# NOTE: The following paths are set from the perspective of the TAO Docker.

import os

# The data is saved here
DATA_DIR = "/data"
SPECS_DIR = "/specs"
RESULTS_DIR = "/results"

# Set your encryption key, and use the same key for all commands
KEY = 'tlt_encode'

os.environ["DATA_DIR"] = DATA_DIR
os.environ["SPECS_DIR"] = SPECS_DIR
os.environ["RESULTS_DIR"] = RESULTS_DIR

Now that everything is set up, we would like to take a bit of time to explain the `tao` interface for ease of use. The command structure can be broken down as follows: `tao <task name> <subcommand>` <br> 

Let's see this in further detail.
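
As an illustration of that structure, here is a small sketch that assembles a command line from its parts. `build_tao_command` is a hypothetical helper, not part of the TAO launcher; it only strings together the task, subcommand, flags, and `key=value` spec overrides in the form used by the cells in this notebook.

```python
import shlex


def build_tao_command(task, subcommand, flags=None, overrides=None):
    """Assemble `tao <task> <subcommand> [flags] [key=value overrides]` as an argv list."""
    argv = ["tao", task, subcommand]
    for flag, value in (flags or {}).items():
        argv += [flag, str(value)]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


# Mirrors the download_specs cell for FastPitch, using container paths.
argv = build_tao_command(
    "spectro_gen",
    "download_specs",
    flags={"-r": "/results/spectro_gen", "-o": "/specs/spectro_gen"},
)
print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list could be handed to `subprocess.run` outside a notebook; inside the notebook, the `!tao ...` cells do the same job.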


### Downloading Specs
TAO's Conversational AI Toolkit works off of spec files which make it easy to edit hyperparameters on the fly. We can proceed to downloading the spec files. The user may choose to modify/rewrite these specs, or even individually override them through the launcher. You can download the default spec files by using the `download_specs` command. <br>

The `-o` argument indicates the folder where the default specification files will be downloaded, and `-r` tells the script where to save the logs. **Make sure `-o` points to an empty folder!**
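
A quick way to check the "empty folder" requirement from the host side is sketched below. `is_missing_or_empty` is a hypothetical helper; the demo runs on a temporary directory, and in practice you would point it at the host folder mapped to the `-o` path (for example `$HOST_SPECS_DIR/spectro_gen`, assuming the mounts defined earlier).

```python
import os
import tempfile


def is_missing_or_empty(path):
    """Return True if path does not exist yet or is an empty directory."""
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


checks = []
with tempfile.TemporaryDirectory() as tmp:
    specs_dir = os.path.join(tmp, "spectro_gen")  # stand-in for the real specs folder
    checks.append(is_missing_or_empty(specs_dir))  # folder not created yet
    os.makedirs(specs_dir)
    checks.append(is_missing_or_empty(specs_dir))  # created but still empty
    with open(os.path.join(specs_dir, "finetune.yaml"), "w") as spec_file:
        spec_file.write("# placeholder\n")
    checks.append(is_missing_or_empty(specs_dir))  # now contains a file
print(checks)
```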

In [None]:
# download spec files for FastPitch
!tao spectro_gen download_specs \
    -r $RESULTS_DIR/spectro_gen \
    -o $SPECS_DIR/spectro_gen

In [None]:
# download spec files for HiFiGAN
!tao vocoder download_specs \
    -r $RESULTS_DIR/vocoder \
    -o $SPECS_DIR/vocoder

In [None]:
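# Scrape the GLaDOS voice lines and their transcripts from the Portal wiki,
# download the audio clips, and write a manifest.json for finetuning.
import requests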
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool
import shutil
import os
from bs4 import BeautifulSoup
import soundfile as sf
import string
import json
import re
import num2words

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

blocklist = ["potato", "_ding_", "00_part1_entry-6", "_escape_"]
audio_dir = 'tmp/data/GLaDOS'
download_threads = 64

def prep(args, overwrite=False):
    already_exists = os.path.exists(audio_dir)
    
    if already_exists and not overwrite:
        print("Data already downloaded")
        return
    
    if already_exists:
        print("Deleting previously downloaded audio")
        shutil.rmtree(audio_dir)

    # makedirs also creates missing parent folders (os.mkdir would raise)
    os.makedirs(audio_dir)
    download_parallel(args)

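def remove_punctuation(text):
    # Parameter renamed so it no longer shadows the built-in str
    return text.translate(str.maketrans('', '', string.punctuation))

def audio_duration(fn):
    # The with-block closes the file handle after reading the header
    with sf.SoundFile(fn) as f:
        return f.frames / f.samplerate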

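def download_file(args):
    url, filename = args

    try:
        response = requests.get(url, timeout=60)
        response.raise_for_status()  # treat HTTP errors as failed downloads
        with open(os.path.join(audio_dir, filename), "wb") as audio_file:
            audio_file.write(response.content)
        return filename, True
    except (requests.RequestException, OSError):
        return filename, False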

def download_parallel(args):
    results = ThreadPool(download_threads).imap_unordered(download_file, args)
    for result in results:
        if result[1]:
            print(bcolors.OKGREEN + "[" + u'\u2713' + "] " + bcolors.ENDC + result[0])
        else:
            print(bcolors.FAIL + "[" + u'\u2715' + "] " + bcolors.ENDC + result[0])

def main():
    r = requests.get("https://theportalwiki.com/wiki/GLaDOS_voice_lines")

    urls = []
    filenames = []
    texts = []

    soup = BeautifulSoup(r.text.encode('utf-8').decode('ascii', 'ignore'), 'html.parser')
    for link_item in soup.find_all('a'):
        url = link_item.get("href", None)
        if url:
            if "https:" in url and ".wav" in url:
                list_item = link_item.find_parent("li")
                ital_item = list_item.find_all('i')
                if ital_item:
                    text = ital_item[0].text
                    text = text.replace('"', '')
                    filename = url[url.rindex("/")+1:]

                    if "[" not in text and "]" not in text and "$" not in text:
                        if url not in urls:
                            for s in blocklist:
                                if s in url:
                                    break
                            else:
                                urls.append(url)
                                filenames.append(filename)
                                text = text.replace('*', '')
                                texts.append(text)

    print("Found " + str(len(urls)) + " urls")

    args = zip(urls, filenames)

    prep(args)
    
    total_audio_time = 0
    # Write one JSON record per line; the with-block closes the file on exit.
    with open(os.path.join(audio_dir, "manifest.json"), 'w') as out_file:
        for text, filename in zip(texts, filenames):
            item = {}
            item["audio_filepath"] = filename
            item["text_normalized"] = re.sub(r"(\d+)", lambda x: num2words.num2words(int(x.group(0))), text)
            item["text"] = text.lower()
            item["duration"] = audio_duration(os.path.join(audio_dir, filename))
            total_audio_time = total_audio_time + item["duration"]
            out_file.write(json.dumps(item, ensure_ascii=True, sort_keys=True) + "\n")

    print(str(total_audio_time/60.0) + " min")

main()
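
Before moving on, it can be worth checking that every manifest record has the fields written above and that the total duration looks plausible. The sketch below is an illustration only: `summarize_manifest` is a hypothetical helper, and the two sample records (file names and text) are made up. To check the real file, pass an open handle to `manifest.json` instead of the `io.StringIO` sample.

```python
import io
import json

REQUIRED_KEYS = ("audio_filepath", "text", "duration")


def summarize_manifest(lines):
    """Count records and sum durations (seconds) in a JSON-lines manifest."""
    count = 0
    total_seconds = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise KeyError(f"record {count} is missing {missing}")
        count += 1
        total_seconds += float(record["duration"])
    return count, total_seconds


# Made-up records shaped like the ones written by main() above.
sample = io.StringIO(
    '{"audio_filepath": "clip_a.wav", "duration": 2.5, "text": "hello"}\n'
    '{"audio_filepath": "clip_b.wav", "duration": 3.5, "text": "goodbye"}\n'
)
count, total_seconds = summarize_manifest(sample)
print(f"{count} clips, {total_seconds / 60.0:.2f} min")
```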

In [None]:
! wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
! tar -xf LJSpeech-1.1.tar.bz2
! mv LJSpeech-1.1 tmp/data/LJSpeech
! rm LJSpeech-1.1.tar.bz2

In [None]:
! tao spectro_gen dataset_convert \
      -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml \
      -r $RESULTS_DIR/spectro_gen/dataset_convert \
      data_dir=$DATA_DIR/LJSpeech \
      dataset_name=ljspeech

In [None]:
original_data_csv = os.path.join(os.environ["HOST_DATA_DIR"], "LJSpeech/metadata.csv")
original_data_json = os.path.join(os.environ["HOST_DATA_DIR"], "LJSpeech/ljspeech_train.json")

original_data_name = "LJSpeech"
os.environ["finetune_data_name"] = original_data_name

os.environ["original_data_json"] = original_data_json
print(original_data_json)

Let's now download the data from the NVIDIA Custom Voice Recorder tool and place it in `$HOST_DATA_DIR`.

In [None]:
import os

# Name of the untarred dataset from the NVIDIA Custom Voice Recorder.
finetune_data_name = "GLaDOS"
os.environ["finetune_data_name"] = finetune_data_name
finetune_data_path = os.path.join(os.getenv("HOST_DATA_DIR"), finetune_data_name)
print(finetune_data_path)

Now that you have downloaded the data, let's make sure that the audio clips are sampled at the same sampling frequency as the clips used to train the pretrained model. For the course of this notebook, NVIDIA recommends using a model trained on the LJSpeech dataset. The sampling rate for this model is 22.05 kHz.

In [None]:
import soundfile
import librosa
import json
import os

print(librosa.__version__)
def resample_audio(input_file_path, output_path, target_sampling_rate=22050):
    """Resample a single audio file.
    
    Args:
        input_file_path (str): Path to the input audio file.
        output_path (str): Path to the output directory for the resampled file.
        target_sampling_rate (int): Sampling rate for output audio file.

    Returns:
        str: File name of the resampled audio file written to output_path.
    """
    if not input_file_path.endswith(".wav"):
        raise NotImplementedError("Loading only implemented for wav files.")
    if not os.path.exists(input_file_path):
        raise FileNotFoundError(f"Cannot find input file at {input_file_path}")
    audio, sampling_rate = librosa.load(
      input_file_path,
      sr=target_sampling_rate
    )
    filename = os.path.basename(input_file_path)
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    soundfile.write(
        os.path.join(output_path, filename),
        audio,
        samplerate=target_sampling_rate,
        format="wav"
    )
    return filename

In [None]:
from tqdm.notebook import tqdm

relative_path = f"{finetune_data_name}/clips_resampled"
resampled_manifest_file = os.path.join(
    os.environ["HOST_DATA_DIR"],
    f"{finetune_data_name}/manifest_resampled.json"
)
input_manifest_file = os.path.join(
    os.environ["HOST_DATA_DIR"],
    f"{finetune_data_name}/manifest.json"
)
sampling_rate = 22050
output_path = os.path.join(os.environ["HOST_DATA_DIR"], relative_path)

print("########################################")
print("resampled_manifest_file: " + resampled_manifest_file)
print("input_manifest_file: " + input_manifest_file)
print("output_path: " + output_path)
print("########################################")

# Resampling the audio clip.
with open(input_manifest_file, "r") as finetune_file:
    with open(resampled_manifest_file, "w") as resampled_file:
        for line in tqdm(finetune_file.readlines()):
            data = json.loads(line)
            filename = resample_audio(
                os.path.join(
                    os.environ["HOST_DATA_DIR"],
                    finetune_data_name,
                    data["audio_filepath"]
                ),
                output_path,
                target_sampling_rate=sampling_rate
            )
            data["audio_filepath"] = os.path.join(
                os.environ["DATA_DIR"],
                relative_path, filename
            )
            resampled_file.write(f"{json.dumps(data)}\n")

assert resampled_file.closed, "Output file wasn't closed properly"
assert finetune_file.closed, "Input file wasn't closed properly"
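
To confirm that a clip really reports the target rate, you can read it back from the file header. This is a minimal sketch using only the standard library; `wav_sample_rate` is a hypothetical helper, and the demo writes a silent one-second clip to a temporary folder rather than touching your data. Note that Python's `wave` module only reads uncompressed PCM WAV files; for other encodings, `soundfile.info(path).samplerate` is an alternative.

```python
import os
import tempfile
import wave


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    # Write one second of 16-bit mono silence at 22050 Hz.
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(22050)
        wav_file.writeframes(b"\x00\x00" * 22050)
    demo_rate = wav_sample_rate(demo_path)

print(demo_rate == 22050)
```

Running `wav_sample_rate` over the files in `clips_resampled` should give 22050 for every clip if the resampling step worked.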

In [None]:
# Splitting the dataset to train and val set.
!cat $HOST_DATA_DIR/$finetune_data_name/manifest_resampled.json | tail -n 2 > $HOST_DATA_DIR/$finetune_data_name/manifest_val.json
!cat $HOST_DATA_DIR/$finetune_data_name/manifest_resampled.json | head -n -2 > $HOST_DATA_DIR/$finetune_data_name/manifest_train.json
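
The shell commands above keep the last two records for validation and everything else for training. Note that `head -n -2` relies on GNU `head`; the negative count is not available in every `head` implementation. A portable sketch of the same split in Python follows; `split_manifest` is a hypothetical helper and the records are placeholders.

```python
def split_manifest(lines, num_val=2):
    """Split manifest lines into (train, val), holding out the last num_val lines."""
    if num_val <= 0 or num_val >= len(lines):
        raise ValueError("num_val must be positive and smaller than the manifest")
    return lines[:-num_val], lines[-num_val:]


# Placeholder records standing in for lines of manifest_resampled.json.
records = [f'{{"audio_filepath": "clip_{i}.wav"}}' for i in range(5)]
train_lines, val_lines = split_manifest(records, num_val=2)
print(len(train_lines), len(val_lines))
```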

In [None]:
from pathlib import Path

finetune_data_json = Path(DATA_DIR) / f'{finetune_data_name}/manifest_train.json'
original_data_json = Path(DATA_DIR) / f'{original_data_name}/ljspeech_train.json'

The first step is to create a json that contains data from both the original data and the finetuning data. We can do this using dataset_convert:

In [None]:
! tao spectro_gen dataset_convert \
    dataset_name=merge \
    original_json=$original_data_json \
    finetune_json=$finetune_data_json \
    save_path=$DATA_DIR/$finetune_data_name/merged_train.json \
    -r $DATA_DIR/dataset_convert/merge \
    -e $SPECS_DIR/spectro_gen/dataset_convert_ljs.yaml

In [None]:
#! sed -i 's/"speaker":/"speaker_id":/g' $HOST_DATA_DIR/$finetune_data_name/merged_train.json

### Getting Pitch Statistics

Training FastPitch requires you to set 4 values for pitch extraction:
  - `fmin`: The minimum frequency value in Hz used to estimate the fundamental frequency (f0)
  - `fmax`: The maximum frequency value in Hz used to estimate the fundamental frequency (f0)
  - `avg`: The average used to normalize the pitch
  - `std`: The standard deviation used to normalize the pitch

In order to get these, we first find a good `fmin` and `fmax` which are hyperparameters to librosa's pyin function.
After we set those, we can iterate over the finetuning dataset to extract the pitch mean and standard deviation.
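
To make the role of `avg` and `std` concrete, here is a toy sketch of how such statistics could be computed from extracted f0 tracks. It is an illustration, not the `pitch_stats` implementation: it assumes unvoiced frames are marked as NaN (as librosa's `pyin` does by default), uses the population standard deviation, and assumes a plain z-score normalization. `pitch_mean_std`, `normalize_pitch`, and the f0 values are all made up for the example.

```python
import math
import statistics


def pitch_mean_std(f0_tracks):
    """Mean and (population) standard deviation of f0 over voiced frames only."""
    voiced = [f0 for track in f0_tracks for f0 in track if not math.isnan(f0)]
    if not voiced:
        raise ValueError("no voiced frames found")
    return statistics.mean(voiced), statistics.pstdev(voiced)


def normalize_pitch(f0, avg, std):
    """Z-score normalization of a single f0 value (assumed form)."""
    return (f0 - avg) / std


nan = float("nan")
# Two toy f0 tracks in Hz; NaN marks unvoiced frames.
toy_tracks = [[nan, 100.0, 200.0], [100.0, nan, 200.0]]
avg, std = pitch_mean_std(toy_tracks)
print(avg, std, normalize_pitch(200.0, avg, std))
```

Skipping the unvoiced frames matters: treating them as 0 Hz would pull the mean down and inflate the standard deviation.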

#### Obtain fmin and fmax

To get fmin and fmax, we start with some defaults, and iterate through random samples of the dataset to ensure that pyin is correctly extracting the pitch.

We look at the plotted spectrogram as well as the predicted fundamental frequency, f0. We want the predicted f0 (the cyan line) to match the lowest-frequency energy band in the spectrogram. Here is an example of a good match between the predicted f0 and the spectrogram:

![good_pitch.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/noise.png)

Here is an example of a bad match between the f0 and the spectrogram. The fmin was likely set too high. The f0 algorithm is missing the first two vocalizations, and is correctly matching the last half of speech. To fix this, the fmin should be set lower.

![bad_pitch.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/noise.png)

Here is an example of samples that have low-frequency noise. In order to eliminate the effects of noise, you have to set fmin above the noise frequency. Unfortunately, this will result in degraded TTS quality. It would be best to re-record the data in an environment with less noise.

![noise.png](https://github.com/vpraveen-nv/model_card_images/raw/main/conv_ai/samples/texttospeech/noise.png)


*Note: You will have to run the below cell multiple times with different hyperparameters before you are able to find a good value for fmin and fmax.*

*We set the `num_files` parameter to 10, so as to visualize only 10 plots from the dataset. You may choose to increase or decrease this value to generate more or fewer plots.*

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import os
from math import ceil
from IPython.display import Image

valid_image_ext = ['.jpg', '.png', '.jpeg', '.ppm']

def visualize_images(image_dir, num_cols=2, num_images=10):
    """Visualize images in the notebook.
    
    Args:
        image_dir (str): Path to the directory containing images.
        num_cols (int): Number of columns.
        num_images (int): Number of images.

    """
    output_path = os.path.join(os.environ['HOST_RESULTS_DIR'], image_dir)
    num_rows = int(ceil(float(num_images) / float(num_cols)))
    # squeeze=False keeps axarr 2-D so axarr[row, col] also works for a single row
    f, axarr = plt.subplots(num_rows, num_cols, figsize=[240,90], squeeze=False)
    f.tight_layout()
    a = [os.path.join(output_path, image) for image in os.listdir(output_path) 
         if os.path.splitext(image)[1].lower() in valid_image_ext]
    for idx, img_path in enumerate(a[:num_images]):
        col_id = idx % num_cols
        row_id = idx // num_cols
        img = plt.imread(img_path)
        axarr[row_id, col_id].imshow(img)

In [None]:
# Holy wow this takes FOREVER
!tao spectro_gen pitch_stats num_files=10 \
     pitch_fmin=80 \
     pitch_fmax=2048 \
     output_path=/results/spectro_gen/pitch_stats \
     compute_stats=false \
     render_plots=true \
     manifest_filepath=$DATA_DIR/$finetune_data_name/manifest_train.json \
     --results_dir $RESULTS_DIR/spectro_gen/pitch_stats

In [None]:
visualize_images("spectro_gen/pitch_stats", num_cols=5, num_images=10)

### Finetuning

For finetuning TTS models in TAO, we use the `tao spectro_gen finetune` and `tao vocoder finetune` commands with the following args:
<ul>
    <li> <b>-m</b> : Path to the model weights we want to finetune from </li>
    <li> <b>-e</b> : Path to the spec file </li>
    <li> <b>-g</b> : Number of GPUs to use </li>
    <li> <b>-r</b> : Path to the results folder </li>
    <li> <b>-k</b> : User specified encryption key to use while saving/loading the model </li>
    <li> Any overrides to the spec file </li>
</ul>

In order to get a finetuned TTS pipeline, you need to finetune FastPitch. For best results, you need to finetune HiFiGAN as well.

#### Finetuning FastPitch

In [None]:
# A prior is needed for FastPitch training. If an empty folder is provided, the prior is generated on the fly.
! mkdir -p $HOST_RESULTS_DIR/spectro_gen/finetune/prior_folder

In [None]:
# Please set the fmin, fmax, pitch_mean and pitch_std values based on
# the output from the tao spectro_gen pitch_stats task.
pitch_fmin = 80.0
pitch_fmax = 2048.0
pitch_mean = 165.458
pitch_std = 40.1891

print(f"pitch fmin:{pitch_fmin}")
print(f"pitch fmax:{pitch_fmax}")
print(f"pitch mean:{pitch_mean}")
print(f"pitch std:{pitch_std}")

os.environ["pitch_fmin"] = str(pitch_fmin)
os.environ["pitch_fmax"] = str(pitch_fmax)
os.environ["pitch_mean"] = str(pitch_mean)
os.environ["pitch_std"] = str(pitch_std)

assert pitch_fmin < pitch_fmax, f"pitch_fmin [{pitch_fmin}] must be less than pitch_fmax [{pitch_fmax}]"

Please be patient, especially if you provided an empty prior folder.

Please update the `-m` parameter to the path of your pre-trained checkpoint. This can be a previously trained `.tlt` or `.nemo` file.

NVIDIA recommends using these [FastPitch](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch) and [HiFiGAN](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_hifigan) checkpoints on [NGC](https://ngc.nvidia.com).

In [None]:
# Installing NGC CLI to pull the models.
## Download and install
import os

%env CLI=ngccli_cat_linux.zip
!mkdir -p $HOST_RESULTS_DIR/ngccli

# Remove any previously existing CLI installations
!rm -rf $HOST_RESULTS_DIR/ngccli/*
!wget "https://ngc.nvidia.com/downloads/$CLI" -P $HOST_RESULTS_DIR/ngccli
!unzip -u "$HOST_RESULTS_DIR/ngccli/$CLI" -d $HOST_RESULTS_DIR/ngccli/
!rm $HOST_RESULTS_DIR/ngccli/*.zip 
os.environ["PATH"]="{}/ngccli/ngc-cli:{}".format(os.getenv("HOST_RESULTS_DIR", ""), os.getenv("PATH", ""))

#!ngc registry model download-version "nvidia/nemo/tts_en_fastpitch:1.0.0" --dest $HOST_DATA_DIR/
!ngc registry model download-version "nvidia/nemo/tts_en_fastpitch:1.4.0" --dest $HOST_DATA_DIR/
#!ngc registry model download-version "nvidia/nemo/tts_en_fastpitch:1.8.1" --dest $HOST_DATA_DIR/
!ngc registry model download-version "nvidia/nemo/tts_hifigan:1.0.0rc1" --dest $HOST_DATA_DIR/

In [None]:
pretrained_fastpitch_model = os.path.join(os.environ["DATA_DIR"], "tts_en_fastpitch_v1.4.0/tts_en_fastpitch_align.nemo")
os.environ["pretrained_fastpitch_model"] = pretrained_fastpitch_model
pretrained_hifigan_model = os.path.join(os.environ["DATA_DIR"], "tts_hifigan_v1.0.0rc1/tts_hifigan.nemo")
os.environ["pretrained_hifigan_model"] = pretrained_hifigan_model
os.environ["HYDRA_FULL_ERROR"]="1"

In [None]:
!tao spectro_gen finetune \
     -e $SPECS_DIR/spectro_gen/finetune.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/spectro_gen/finetune \
     -m $pretrained_fastpitch_model \
     train_dataset=$DATA_DIR/$finetune_data_name/merged_train.json \
     validation_dataset=$DATA_DIR/$finetune_data_name/manifest_val.json \
     prior_folder=$RESULTS_DIR/spectro_gen/finetune/prior_folder \
     trainer.max_epochs=200 \
     n_speakers=2 \
     pitch_fmin=$pitch_fmin \
     pitch_fmax=$pitch_fmax \
     pitch_avg=$pitch_mean \
     pitch_std=$pitch_std \
     trainer.precision=16

#### Finetuning HiFiGAN

In order to get the best audio from HiFiGAN, we need to finetune it:
  - on the new speaker
  - using mel spectrograms from our finetuned FastPitch Model

Let's first generate mels from our FastPitch model, and save them to a new .json manifest for use with HiFiGAN.

In [None]:
import json
import os

def infer_and_save_json(infer_json, save_json, subdir="train"):
    # Get records from the training manifest
    host_manifest_path = os.path.join(os.environ["HOST_DATA_DIR"], infer_json)
    tao_manifest_path = os.path.join(DATA_DIR, infer_json)
    host_save_json = os.path.join(os.environ["HOST_DATA_DIR"], save_json)
    records = []
    text = {"input_batch": []}
    print("Appending mel spectrogram paths to the dataset.")
    with open(host_manifest_path, "r") as f:
        for i, line in enumerate(f):
            manifest_info = json.loads(line)
            manifest_info["mel_filepath"] = f"{RESULTS_DIR}/spectro_gen/infer/spectro/{subdir}/{i}.npy"
            records.append(manifest_info)
            text["input_batch"].append(manifest_info["text"])

    !tao spectro_gen infer \
         -e $SPECS_DIR/spectro_gen/infer.yaml \
         -g 1 \
         -k $KEY \
         -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
         -r $RESULTS_DIR/spectro_gen/infer \
         output_path=$RESULTS_DIR/spectro_gen/infer/spectro/$subdir \
         speaker=1 \
         mode="infer_hifigan_ft" \
         input_json=$tao_manifest_path

    # Save to a new json
    with open(host_save_json, "w") as f:
        for r in records:
            f.write(json.dumps(r) + '\n')

# Infer for train
infer_and_save_json(f"{finetune_data_name}/manifest_train.json", f"{finetune_data_name}/hifigan_train_ft.json")
# Infer for dev
infer_and_save_json(f"{finetune_data_name}/manifest_val.json", f"{finetune_data_name}/hifigan_dev_ft.json", "dev")

Now let's finetune HiFiGAN.

Please update the `-m` parameter to the path of your pre-trained checkpoint.

In [None]:
!tao vocoder finetune \
     -e $SPECS_DIR/vocoder/finetune.yaml \
     -g 1 \
     -k $KEY \
     -r $RESULTS_DIR/vocoder/finetune \
     -m $pretrained_hifigan_model \
     train_dataset=$DATA_DIR/$finetune_data_name/hifigan_train_ft.json \
     validation_dataset=$DATA_DIR/$finetune_data_name/hifigan_dev_ft.json \
     trainer.max_epochs=200

### TTS Inference

As mentioned earlier, since there is no universal standard for measuring the quality of synthesized speech, you will need to listen to some inferred speech to tell whether a TTS model is well trained. Therefore, TAO Toolkit does not provide `evaluate` functionality for TTS, only `infer` functionality.

#### Generate spectrogram

The first step for inference is generating a spectrogram. This is a numpy array (saved as a `.npy` file) for a sentence, which a vocoder can convert to voice. We use the FastPitch model we just trained to generate the spectrogram.

You might have to edit the infer.yaml file to set the texts you want for inference.

In [None]:
!tao spectro_gen infer \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/spectro_gen/infer_output \
     output_path=$RESULTS_DIR/spectro_gen/infer_output/spectro \
     speaker=1

#### Generate sound file

The second step for inference is generating a wav sound file from the spectrogram you generated in the last step.

In [None]:
!tao vocoder infer \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/vocoder/infer_output \
     input_path=$RESULTS_DIR/spectro_gen/infer_output/spectro \
     output_path=$RESULTS_DIR/vocoder/infer_output/wav

In [None]:
import os
import IPython.display as ipd
# change path of the file here
ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_output/wav/0.wav')
# ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_output/wav/1.wav')
# ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_output/wav/2.wav')

#### Debug

The data provided is only meant to be a sample to understand how finetuning works in TAO. In order to generate better speech quality, we recommend recording at least 30 minutes of audio and increasing the number of finetuning steps, for example from `trainer.max_steps=1000` to `trainer.max_steps=5000`, for both models. Note that the finetuning cells above limit training with `trainer.max_epochs=200` instead.

### TTS model export

With TAO, you can also export your model in a format that can be deployed using NVIDIA Riva, a highly performant application framework for multi-modal conversational AI services using GPUs! The same command for exporting to ONNX can be used here. The only small variation is the configuration for `export_format` in the spec file!

#### Export to ONNX

Executing the cells below generates an `.eonnx` model file for each of the spectrogram generator and vocoder models that were trained in the preceding cells. These models are required to generate a complete Text-To-Speech pipeline.

In [None]:
!tao spectro_gen export \
     -e $SPECS_DIR/spectro_gen/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/spectro_gen/export \
     export_format=ONNX \
     export_to=spectro_gen.eonnx

In [None]:
!tao vocoder export \
     -e $SPECS_DIR/vocoder/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/vocoder/export \
     export_format=ONNX \
     export_to=vocoder.eonnx

#### Export to RIVA

Executing the cells below generates a `.riva` model file for each of the spectrogram generator and vocoder models that were trained in the preceding cells. These models are required to generate a complete Text-To-Speech pipeline.


In [None]:
!tao spectro_gen export \
     -e $SPECS_DIR/spectro_gen/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/spectro_gen/export \
     export_format=RIVA \
     export_to=spectro_gen.riva

In [None]:
!tao vocoder export \
     -e $SPECS_DIR/vocoder/export.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/finetune/checkpoints/finetuned-model.tlt \
     -r $RESULTS_DIR/vocoder/export \
     export_format=RIVA \
     export_to=vocoder.riva

### TTS Inference using ONNX

TAO provides the capability to use the exported .eonnx model for inference. The commands are very similar to the inference commands for .tlt models. Again, the inputs in the spec file are just for demo purposes; you may choose to try out your own custom input!

#### Generate spectrogram

The first step for inference is generating a spectrogram. This is a numpy array (saved as a `.npy` file) for a sentence, which a vocoder can convert to voice. We use the FastPitch model we just trained and exported to generate the spectrogram.

You might have to edit the infer.yaml file to set the texts you want for inference.

In [None]:
!tao spectro_gen infer_onnx \
     -e $SPECS_DIR/spectro_gen/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/spectro_gen/export/spectro_gen.eonnx \
     -r $RESULTS_DIR/spectro_gen/infer_onnx \
     output_path=$RESULTS_DIR/spectro_gen/infer_onnx/spectro \
     speaker=1

#### Generate sound file

The second step for inference is generating a wav sound file from the spectrogram you generated in the last step.

In [None]:
!tao vocoder infer_onnx \
     -e $SPECS_DIR/vocoder/infer.yaml \
     -g 1 \
     -k $KEY \
     -m $RESULTS_DIR/vocoder/export/vocoder.eonnx \
     -r $RESULTS_DIR/vocoder/infer_onnx \
     input_path=$RESULTS_DIR/spectro_gen/infer_onnx/spectro \
     output_path=$RESULTS_DIR/vocoder/infer_onnx/wav

If everything works properly, the wav file below should sound the same as the wav file in the previous section.

In [None]:
import os
import IPython.display as ipd

# change path of the file here
ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_onnx/wav/0.wav')
# ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_onnx/wav/1.wav')
# ipd.Audio(os.environ["HOST_RESULTS_DIR"] + '/vocoder/infer_onnx/wav/2.wav')
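
Listening is the real test, but a numeric comparison can flag gross mismatches between the `.tlt` and ONNX outputs. The sketch below is a rough illustration under stated assumptions: it assumes both files are 16-bit PCM WAV (an assumption about the vocoder output, not something confirmed here), and `read_pcm16`, `max_abs_difference`, and `write_demo_wav` are hypothetical helpers. The demo compares two tiny synthetic clips instead of real inference results; small non-zero differences between backends would not be surprising.

```python
import array
import os
import struct
import sys
import tempfile
import wave


def read_pcm16(path):
    """Read a 16-bit PCM WAV file; return (sample_rate, array of samples)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError(f"{path} is not 16-bit PCM")
        rate = wav_file.getframerate()
        samples = array.array("h")
        samples.frombytes(wav_file.readframes(wav_file.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return rate, samples


def max_abs_difference(path_a, path_b):
    """Largest absolute sample difference between two equally long clips."""
    rate_a, samples_a = read_pcm16(path_a)
    rate_b, samples_b = read_pcm16(path_b)
    if rate_a != rate_b or len(samples_a) != len(samples_b):
        raise ValueError("clips differ in sample rate or length")
    return max((abs(a - b) for a, b in zip(samples_a, samples_b)), default=0)


def write_demo_wav(path, values, rate=22050):
    """Write a short mono 16-bit clip from a list of integer samples."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(struct.pack(f"<{len(values)}h", *values))


with tempfile.TemporaryDirectory() as tmp:
    tlt_wav = os.path.join(tmp, "tlt.wav")    # stand-in for infer_output/wav/0.wav
    onnx_wav = os.path.join(tmp, "onnx.wav")  # stand-in for infer_onnx/wav/0.wav
    write_demo_wav(tlt_wav, [0, 1000, -1000])
    write_demo_wav(onnx_wav, [0, 990, -1000])
    difference = max_abs_difference(tlt_wav, onnx_wav)

print(difference)
```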

### What's Next?

You could use TAO to build custom models for your own applications and deploy them to NVIDIA Riva! To try deploying these models to Riva, use the [text-to-speech-deployment.ipynb](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/resources/texttospeech_notebook) as a quick sample.