# Finetuning FastPitch for a new speaker

In this tutorial, we will finetune a single-speaker FastPitch (with alignment) model on 5 minutes of a new speaker's data. We will finetune the model parameters only on the new speaker's text and speech pairs.

We will download the training data, then generate and run a training command to finetune FastPitch on 5 minutes of data, and synthesize audio from the trained checkpoint.

A final section describes approaches to improve audio quality beyond what this notebook covers.

## License

> Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License");
> you may not use this file except in compliance with the License.
> You may obtain a copy of the License at
>
>     http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software
> distributed under the License is distributed on an "AS IS" BASIS,
> WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> See the License for the specific language governing permissions and
> limitations under the License.

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.
Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
BRANCH = 'main'
# # If you're using Google Colab and not running locally, uncomment and run this cell.
# !apt-get install sox libsndfile1 ffmpeg
# !pip install wget unidecode
# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]

## Downloading Data
___

Download and untar the data.

The data contains a 5 minute subset of audio from speaker 6097 from the HiFiTTS dataset.

In [None]:
!wget https://nemo-public.s3.us-east-2.amazonaws.com/6097_5_mins.tar.gz  # Contains 10MB of data
!tar -xzf 6097_5_mins.tar.gz

Looking at manifest.json, we see a standard NeMo manifest: one JSON object per line, containing the audio filepath, text, and duration. Note that the audio paths in manifest.json are relative.

```
{"audio_filepath": "audio/presentpictureofnsw_02_mann_0532.wav", "text": "not to stop more than ten minutes by the way", "duration": 2.6, "text_no_preprocessing": "not to stop more than ten minutes by the way,", "text_normalized": "not to stop more than ten minutes by the way,"}
```
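
To get a feel for the manifest, a short sketch like the following can load it and total the durations. The helper names `parse_manifest_lines`, `load_manifest`, and `total_minutes` are illustrative and not part of NeMo, and the path assumes the archive was extracted into the current directory.

```python
import json
import os


def parse_manifest_lines(lines):
    """Parse manifest lines (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_manifest(path):
    """Read a manifest file from disk and return a list of record dicts."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_manifest_lines(f)


def total_minutes(records):
    """Sum the `duration` field (in seconds) and convert to minutes."""
    return sum(r["duration"] for r in records) / 60.0


# Assumed location of the extracted manifest; adjust if you untarred elsewhere.
manifest = "./6097_5_mins/manifest.json"
if os.path.exists(manifest):
    records = load_manifest(manifest)
    print(f"{len(records)} utterances, {total_minutes(records):.2f} minutes of audio")
```

This is only a sanity check on the data size; the finetuning script reads the manifest on its own.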

Let's take the last 2 samples from the dataset and split them off into a validation set. All other samples go into the training set.

In [None]:
!cat ./6097_5_mins/manifest.json | tail -n 2 > ./6097_manifest_dev_ns_all_local.json
!cat ./6097_5_mins/manifest.json | head -n -2 > ./6097_manifest_train_dur_5_mins_local.json
!ln -s ./6097_5_mins/audio audio
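
If `head` and `tail` are not available (for example, outside a Unix-like shell), the same split can be sketched in Python. The function name `split_manifest` is hypothetical; the output file names are the ones used by the shell commands above.

```python
import os


def split_manifest(lines, num_val=2):
    """Return (train_lines, val_lines), holding out the last `num_val` entries."""
    lines = [ln for ln in lines if ln.strip()]
    if not 0 < num_val < len(lines):
        raise ValueError("num_val must be positive and smaller than the manifest")
    return lines[:-num_val], lines[-num_val:]


# Assumed location of the extracted manifest.
src = "./6097_5_mins/manifest.json"
if os.path.exists(src):
    with open(src, "r", encoding="utf-8") as f:
        train, val = split_manifest(f.readlines())
    with open("./6097_manifest_train_dur_5_mins_local.json", "w", encoding="utf-8") as f:
        f.writelines(train)
    with open("./6097_manifest_dev_ns_all_local.json", "w", encoding="utf-8") as f:
        f.writelines(val)
```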

## Finetuning FastPitch
___



Let's first download the pretrained checkpoint that we want to finetune from. NeMo saves downloaded checkpoints under ~/.cache, so let's copy the checkpoint to our current directory.

In [None]:
import os
import json

import torch
import IPython.display as ipd
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

from nemo.collections.tts.models import FastPitchModel
FastPitchModel.from_pretrained("tts_en_fastpitch")

from pathlib import Path
nemo_files = [p for p in Path("/root/.cache/torch/NeMo/").glob("**/tts_en_fastpitch_align.nemo")]
print(f"Copying {nemo_files[0]} to ./")
Path("./tts_en_fastpitch_align.nemo").write_bytes(nemo_files[0].read_bytes())
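
The cell above hardcodes `/root/.cache`, which matches Colab, where the home directory is `/root`. When running locally, a sketch along these lines may be more portable. It assumes the NeMo cache lives under the current user's home directory, and `find_cached_file` is a hypothetical helper, not a NeMo function.

```python
import shutil
from pathlib import Path


def find_cached_file(cache_root, filename):
    """Return the first file called `filename` anywhere under `cache_root`, or None."""
    root = Path(cache_root)
    if not root.is_dir():
        return None
    matches = sorted(root.glob(f"**/{filename}"))
    return matches[0] if matches else None


# Assumption: the cache is at ~/.cache/torch/NeMo for the current user.
cache_root = Path.home() / ".cache" / "torch" / "NeMo"
ckpt = find_cached_file(cache_root, "tts_en_fastpitch_align.nemo")
if ckpt is None:
    print(f"No checkpoint found under {cache_root}; run from_pretrained first.")
else:
    shutil.copy2(ckpt, "./tts_en_fastpitch_align.nemo")
    print(f"Copied {ckpt} to ./")
```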

To finetune the FastPitch model on the manifests created above, we use the `examples/tts/fastpitch2_finetune.py` script with the `fastpitch_align.yaml` configuration.

Let's grab those files.

In [None]:
!wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/fastpitch2_finetune.py
!mkdir -p conf && cd conf && wget https://raw.githubusercontent.com/nvidia/NeMo/$BRANCH/examples/tts/conf/fastpitch_align.yaml && cd ..

We can now train our model with the following command:

**NOTE: This will take about 50 minutes on Colab's K80 GPUs.**

`python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam trainer.devices=1 trainer.strategy=null`

In [None]:
!python fastpitch2_finetune.py --config-name=fastpitch_align.yaml train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json +init_from_nemo_model=./tts_en_fastpitch_align.nemo +trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25 prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24 exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins model.n_speakers=1 model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512 model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam trainer.devices=1 trainer.strategy=null

Let's take a closer look at the training command:

* `python fastpitch2_finetune.py --config-name=fastpitch_align.yaml`
  * `--config-name` tells the script what config to use.

* `train_dataset=./6097_manifest_train_dur_5_mins_local.json validation_datasets=./6097_manifest_dev_ns_all_local.json`
  * We tell the script which manifest files we want to train and evaluate on.

* `+init_from_nemo_model=./tts_en_fastpitch_align.nemo`
  * We tell the script what checkpoint to finetune from.

* `+trainer.max_steps=1000 ~trainer.max_epochs trainer.check_val_every_n_epoch=25`
  * For this experiment, we tell the script to train for 1000 steps (iterations). We remove the epoch limit with `~trainer.max_epochs` so that training length is controlled by `max_steps`, and we run validation every 25 epochs.

* `prior_folder=./Priors6097 model.train_ds.dataloader_params.batch_size=24 model.validation_ds.dataloader_params.batch_size=24`
  * Some dataset parameters. The dataset does some online processing and stores the processing steps to the `prior_folder`.

* `exp_manager.exp_dir=./ljspeech_to_6097_no_mixing_5_mins`
  * Where we want to save our log files, tensorboard file, checkpoints, and more

* `model.n_speakers=1`
  * The number of speakers in the data. There is only 1 for now, but we will revisit this parameter later in the notebook

* `model.pitch_avg=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512`
  * For the new speaker, we need to define new pitch hyperparameters for better audio quality.
  * These parameters work for speaker 6097 from the HiFiTTS dataset
  * For speaker 92, we suggest `model.pitch_avg=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`
  * fmin and fmax are hyperparameters to librosa's pyin function. We recommend tweaking these per speaker.
  * After fmin and fmax are defined, pitch mean and std can be easily extracted

* `model.optim.lr=2e-4 ~model.optim.sched model.optim.name=adam`
  * For fine-tuning, we lower the learning rate
  * We use a fixed learning rate of 2e-4
  * We switch from the lamb optimizer to the adam optimizer

* `trainer.devices=1 trainer.strategy=null`
  * For this notebook, we default to 1 GPU, which means that we do not need DDP
  * If you have the compute resources, feel free to scale this up to the number of free GPUs you have available
  * Please remove the `trainer.strategy=null` section if you intend on multi-GPU training
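
As a rough illustration of how pitch mean and standard deviation could be extracted once `fmin` and `fmax` are chosen, the sketch below pools per-frame F0 values and ignores unvoiced frames. It assumes the statistics are taken over voiced frames only and uses the population standard deviation; the exact procedure behind the values quoted above is not stated here, so treat this as an approximation. `pitch_stats` is a hypothetical helper, and the F0 values in the toy example are made up.

```python
import math
import statistics


def pitch_stats(f0_tracks):
    """Mean and standard deviation of F0 (Hz) over the voiced frames of all tracks.

    Each track is a sequence of per-frame F0 values in which unvoiced frames
    are NaN, which is how librosa.pyin reports them by default.
    """
    voiced = [float(v) for track in f0_tracks for v in track if not math.isnan(v)]
    if not voiced:
        raise ValueError("no voiced frames found")
    return statistics.fmean(voiced), statistics.pstdev(voiced)


# Hypothetical usage with librosa (not executed here):
#   import librosa
#   y, sr = librosa.load("audio/some_file.wav", sr=22050)
#   f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=30, fmax=512, sr=sr)
#   tracks.append(f0)
#   pitch_avg, pitch_std = pitch_stats(tracks)

# Toy example: two short tracks, NaN marks unvoiced frames.
toy_tracks = [[100.0, float("nan"), 120.0], [110.0, float("nan")]]
avg, std = pitch_stats(toy_tracks)
print(f"model.pitch_avg={avg:.1f} model.pitch_std={std:.1f}")
```

The printed values can then be passed as `model.pitch_avg` and `model.pitch_std` overrides in the training command.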

## Synthesize Samples from Finetuned Checkpoints

---



Once we have finetuned our FastPitch model, we can synthesize the audio samples for given text using the following inference steps. We use a HiFiGAN vocoder trained on LJSpeech.

We define some helper functions as well.

In [None]:
from nemo.collections.tts.models import HifiGanModel
from nemo.collections.tts.models import FastPitchModel

vocoder = HifiGanModel.from_pretrained("tts_hifigan")
vocoder = vocoder.eval().cuda()

In [None]:
def infer(spec_gen_model, vocoder_model, str_input, speaker = None):
    """
    Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.
    
    Arguments:
    spec_gen_model -- Instance of FastPitch model
    vocoder_model -- Instance of a vocoder model (HiFiGAN in our case)
    str_input -- Text input for the synthesis
    speaker -- Speaker number (in the case of a multi-speaker model -- in the mixing case)
    
    Returns:
    spectrogram, waveform of the synthesized audio.
    """
    parser_model = spec_gen_model
    with torch.no_grad():
        parsed = parser_model.parse(str_input)
        if speaker is not None:
            speaker = torch.tensor([speaker]).long().cuda()
        spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker = speaker)
        audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)
        
    if spectrogram is not None:
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
    if isinstance(audio, torch.Tensor):
        audio = audio.to('cpu').numpy()
    return spectrogram, audio

def get_best_ckpt(experiment_base_dir, new_speaker_id, duration_mins, mixing_enabled, original_speaker_id):
    """
    Gives the model checkpoint paths of an experiment we ran.
    
    Arguments:
    experiment_base_dir -- Base experiment directory (the parent of exp_manager.exp_dir; "./" in this notebook)
    new_speaker_id -- Speaker id of new HiFiTTS speaker we finetuned FastPitch on
    duration_mins -- total minutes of the new speaker data
    mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not
    original_speaker_id -- id of the original speaker or dataset ("ljspeech" in this notebook)
    
    Returns:
    List of all checkpoint paths sorted by validation error, Last checkpoint path
    """
    if not mixing_enabled:
        exp_dir = "{}/{}_to_{}_no_mixing_{}_mins".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)
    else:
        exp_dir = "{}/{}_to_{}_mixing_{}_mins".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)
    
    ckpt_candidates = []
    last_ckpt = None
    for root, dirs, files in os.walk(exp_dir):
        for file in files:
            if file.endswith(".ckpt"):
                val_error = float(file.split("v_loss=")[1].split("-epoch")[0])
                if "last" in file:
                    last_ckpt = os.path.join(root, file)
                ckpt_candidates.append( (val_error, os.path.join(root, file)))
    ckpt_candidates.sort()
    
    return ckpt_candidates, last_ckpt
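
`get_best_ckpt` relies on the validation loss being embedded in the checkpoint file name. The sketch below isolates that parsing step. The file names are hypothetical and only illustrate the pattern the function expects (`v_loss=<value>-epoch=<n>`, with `last` marking the final checkpoint); if your NeMo version names checkpoints differently, the split strings need to be adapted.

```python
def parse_val_loss(filename):
    """Extract the validation loss from a name like
    '<prefix>v_loss=<float>-epoch=<int>[-last].ckpt'."""
    return float(filename.split("v_loss=")[1].split("-epoch")[0])


# Hypothetical checkpoint names, for illustration only.
names = [
    "FastPitch--v_loss=0.9150-epoch=49.ckpt",
    "FastPitch--v_loss=0.8311-epoch=99.ckpt",
    "FastPitch--v_loss=0.8311-epoch=99-last.ckpt",
]

# Sort by (loss, name) so the lowest validation loss comes first.
ranked = sorted((parse_val_loss(n), n) for n in names)
last = [n for n in names if "last" in n]

print("best loss:", ranked[0][0])
print("last checkpoint:", last)
```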

Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint from the exp_base_dir and compare the synthesized audio with validation samples of the new speaker.

In [None]:
new_speaker_id = 6097
duration_mins = 5
mixing = False
original_speaker_id = "ljspeech"

_, last_ckpt = get_best_ckpt("./", new_speaker_id, duration_mins, mixing, original_speaker_id)
print(last_ckpt)

spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)
spec_model.eval().cuda()
_speaker=None
if mixing:
    _speaker = 1

num_val = 2

manifest_path = os.path.join("./", "{}_manifest_dev_ns_all_local.json".format(new_speaker_id))
val_records = []
with open(manifest_path, "r") as f:
    for i, line in enumerate(f):
        val_records.append( json.loads(line) )
        if len(val_records) >= num_val:
            break
            
for val_record in val_records:
    print ("Real validation audio")
    ipd.display(ipd.Audio(val_record['audio_filepath'], rate=22050))
    print ("SYNTHESIZED FOR -- Speaker: {} | Dataset size: {} mins | Mixing:{} | Text: {}".format(new_speaker_id, duration_mins, mixing, val_record['text']))
    spec, audio = infer(spec_model, vocoder, val_record['text'], speaker = _speaker)
    ipd.display(ipd.Audio(audio, rate=22050))
    %matplotlib inline
    #if spec is not None:
    imshow(spec, origin="lower", aspect = "auto")
    plt.show()

## Improving Speech Quality
___

After fine-tuning FastPitch, we can generate audio in a male voice, but the audio quality is not as good as we would like. We recommend two steps to improve audio quality:

* Finetuning HiFiGAN
* Adding more data

Both of these steps are outside the scope of this notebook due to the limited compute available on Colab.

### Finetuning HiFiGAN
The synthesized samples may contain audible crackling. To fix this, we need to finetune HiFiGAN on the new speaker's data. HiFiGAN finetuning works better on synthesized mel spectrograms, so the first step is to generate mel spectrograms with our finetuned FastPitch model.

```python
import json

import numpy as np
import torch

# Get records from the training manifest
manifest_path = "./6097_manifest_train_dur_5_mins_local.json"
records = []
with open(manifest_path, "r") as f:
    for line in f:
        records.append(json.loads(line))

# Generate a spectrogram for each item with the finetuned FastPitch model
# (`spec_model`, loaded from the last checkpoint in the synthesis section)
for i, r in enumerate(records):
    with torch.no_grad():
        parsed = spec_model.parse(r['text'])
        spectrogram = spec_model.generate_spectrogram(tokens=parsed)
    if isinstance(spectrogram, torch.Tensor):
        spectrogram = spectrogram.to('cpu').numpy()
    if len(spectrogram.shape) == 3:
        spectrogram = spectrogram[0]
    np.save(f"mel_{i}", spectrogram)  # np.save appends the .npy extension
    r["mel_filepath"] = f"mel_{i}.npy"

# Save to a new json
with open("hifigan_train_ft.json", "w") as f:
    for r in records:
        f.write(json.dumps(r) + '\n')

# Please do the same for the validation json. Code is omitted.
```

We can then finetune HiFiGAN in the same way as FastPitch, using NeMo's [hifigan_finetune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan_finetune.py) and [hifigan.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml):

`python examples/tts/hifigan_finetune.py --config-name=hifigan.yaml model.train_ds.dataloader_params.batch_size=32 model.max_steps=1000 ~model.sched model.optim.lr=0.0001 train_dataset=./hifigan_train_ft.json validation_datasets=./hifigan_val_ft.json exp_manager.exp_dir=hifigan_ft +init_from_nemo_model=tts_hifigan.nemo trainer.check_val_every_n_epoch=10`

### Improving TTS by Adding More Data
We can add more data in two ways, which can be combined for the best effect:

* Add more training data from the new speaker

The entire notebook can be repeated from the top after a new .json is defined for the additional data. Modify your finetuning commands to point to the new json. Be sure to increase the number of steps for both FastPitch and HiFiGAN finetuning as more data is added. We recommend 1000 steps per minute of audio for FastPitch and 500 steps per minute of audio for HiFiGAN.

* Mix new speaker data with old speaker data

We recommend training FastPitch on both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the .json used for FastPitch finetuning to include speaker information:

```
{"audio_filepath": "new_speaker.wav", "text": "sample", "duration": 2.6, "speaker": 1}
{"audio_filepath": "old_speaker.wav", "text": "LJSpeech sample", "duration": 2.6, "speaker": 0}
```

5 hours of data from the old speaker should be sufficient. Since we should have less data from the new speaker, we need to ensure that the model sees a similar amount of new data and old data. For each sample from the old speaker, please add a sample from the new speaker in the .json. The samples from the new speaker will be repeated.
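
One way to build such a balanced manifest is to pair every old-speaker record with a new-speaker record and cycle through the smaller set. The sketch below shows the idea on toy records. `mix_manifests` is a hypothetical helper, and the speaker ids (0 for the old speaker, 1 for the new one) follow the example above.

```python
import itertools
import json


def mix_manifests(old_records, new_records, old_id=0, new_id=1):
    """Pair every old-speaker record with one new-speaker record.

    New-speaker records are cycled, so they repeat once the smaller set is
    exhausted. Each output record gets a `speaker` field.
    """
    if not new_records:
        raise ValueError("new_records must not be empty")
    mixed = []
    for old, new in zip(old_records, itertools.cycle(new_records)):
        mixed.append({**old, "speaker": old_id})
        mixed.append({**new, "speaker": new_id})
    return mixed


# Toy records, for illustration only.
old = [{"audio_filepath": f"old_{i}.wav", "text": "LJSpeech sample", "duration": 2.6}
       for i in range(3)]
new = [{"audio_filepath": "new_0.wav", "text": "sample", "duration": 2.6}]

mixed = mix_manifests(old, new)
for record in mixed:
    print(json.dumps(record))
```

Writing each printed line to a file gives a manifest in which both speakers appear equally often.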

Modify the FastPitch training command to point to the new training and validation .json files, and update `model.n_speakers=1` to `model.n_speakers=2`. Ensure the pitch statistics correspond to the new speaker.

For HiFiGAN finetuning, the training should be done on the new speaker data.