## Improving Speech Quality

Fine-tuning FastPitch let us generate audio in a male voice, but the audio quality is lower than we would like. We recommend two steps to improve it:

* Finetuning HiFi-GAN
* Adding more data

**Note that both of these steps are outside the scope of the notebook due to the limited compute available on SageMaker/Colab, but the code is included below for you to use outside of this notebook.**

### Finetuning HiFi-GAN
The synthesized samples may contain audible crackling. To fix this, we need to finetune HiFi-GAN on the new speaker's data. HiFi-GAN shows improvement when finetuned on **synthesized mel spectrograms**, so the first step is to generate mel spectrograms with our finetuned FastPitch model to use as input.

The code below uses our finetuned model to generate synthesized mels for the training set we have been using, then does the same for the validation set by calling the same function with different paths.

In [1]:
import json
import numpy as np
import torch
import soundfile as sf
import librosa  # used by load_wav below when resampling is requested

import wandb

from pathlib import Path

import pytorch_lightning as pl
from hydra import compose, initialize

from nemo.utils.exp_manager import exp_manager
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
from nemo.collections.tts.torch.helpers import BetaBinomialInterpolator
from nemo.collections.common.callbacks import LogEpochTimeCallback

[NeMo W 2022-12-07 18:13:06 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2022-12-07 18:13:07 experimental:27] Module <class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-07 18:13:07 experimental:27] Module <class 'nemo.collections.tts.models.radtts.RadTTSModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.


In [2]:
SPEAKER_ID = 9017
MODEL_NAME = "tts_en_fastpitch"

WANDB_PROJECT = "tts-workshop"
WANDB_ENTITY = "capecape" # replace with your wandb username or team

Load the finetuned FastPitch model:

In [3]:
# which split we are using
validation_split_artifact = f'{WANDB_ENTITY}/{WANDB_PROJECT}/9017_5_split:v0'

# which model
model_artifact = f'{WANDB_ENTITY}/{WANDB_PROJECT}/model-2022-12-07_17-56-27:v2'

In [4]:
wandb.init(entity=WANDB_ENTITY, project=WANDB_PROJECT, job_type="hifigan_finetune")

split_artifact = wandb.use_artifact(validation_split_artifact)
split_artifact_dir = split_artifact.download()

model_artifact = wandb.use_artifact(model_artifact, type='model')
model_artifact_dir = model_artifact.download()

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mcapecape[0m. Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m:   2 of 2 files downloaded.  
[34m[1mwandb[0m: Downloading large artifact model-2022-12-07_17-56-27:v2, 524.07MB. 1 files... 
[34m[1mwandb[0m:   1 of 1 files downloaded.  
Done. 0:0:0.0


In [5]:
def ls(path): return list(Path(path).iterdir())

In [6]:
ls(split_artifact_dir)

[PosixPath('artifacts/9017_5_split:v0/9017_manifest_valid_local.json'),
 PosixPath('artifacts/9017_5_split:v0/9017_manifest_train_local.json')]

In [7]:
ls(model_artifact_dir)

[PosixPath('artifacts/model-2022-12-07_17-56-27:v2/model.ckpt')]

In [8]:
last_ckpt = str(ls(model_artifact_dir)[0])
print(last_ckpt)

spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)
spec_model.eval().cuda();

artifacts/model-2022-12-07_17-56-27:v2/model.ckpt
[NeMo I 2022-12-07 18:13:21 tokenize_and_classify:87] Creating ClassifyFst grammars.


[NeMo W 2022-12-07 18:13:53 experimental:27] Module <class 'nemo_text_processing.g2p.modules.IPAG2P'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2022-12-07 18:13:53 modules:95] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2022-12-07 18:13:53 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: 9017_manifest_train_local.json
      sample_rate: 22050
      sup_data_path: ./fastpitch_sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      n

[NeMo I 2022-12-07 18:13:54 features:267] PADDING: 1


In [9]:
train_manifest = Path(split_artifact_dir)/f"{SPEAKER_ID}_manifest_train_local.json"
valid_manifest = Path(split_artifact_dir)/f"{SPEAKER_ID}_manifest_valid_local.json"

In [10]:
def load_wav(audio_file, target_sr=None):
    with sf.SoundFile(audio_file, 'r') as f:
        samples = f.read(dtype='float32')
        sample_rate = f.samplerate
        if target_sr is not None and target_sr != sample_rate:
            samples = librosa.core.resample(samples, orig_sr=sample_rate, target_sr=target_sr)
    return samples.transpose()

In [11]:
def process_manifest(manifest_path, hifigan_manifest_path="hifigan_train_ft.json"):
    "Generate a synthesized mel for every record in a manifest and write a HiFi-GAN manifest"
    records = []
    with open(manifest_path, "r") as f:
        for i, line in enumerate(f):
            records.append(json.loads(line))

    beta_binomial_interpolator = BetaBinomialInterpolator()
    spec_model.eval()

    device = spec_model.device

    save_dir = Path(manifest_path.name.split(".")[0] + "_mels")
    save_dir.mkdir(exist_ok=True, parents=True)

    # Generate spectrograms (we need the ground-truth alignment for correct matching between audio and mels)
    for i, r in enumerate(records):
        audio = load_wav(r["audio_filepath"])
        audio = torch.from_numpy(audio).unsqueeze(0).to(device)
        audio_len = torch.tensor(audio.shape[1], dtype=torch.long, device=device).unsqueeze(0)

        # Again, our finetuned FastPitch model doesn't use multiple speakers,
        # but we keep the code to support it here for reference
        if spec_model.fastpitch.speaker_emb is not None and "speaker" in r:
            speaker = torch.tensor([r['speaker']]).to(device)
        else:
            speaker = None

        with torch.no_grad():
            if "normalized_text" in r:
                text = spec_model.parse(r["normalized_text"], normalize=False)
            else:
                text = spec_model.parse(r['text'])

            text_len = torch.tensor(text.shape[-1], dtype=torch.long, device=device).unsqueeze(0)

            spect, spect_len = spec_model.preprocessor(input_signal=audio, length=audio_len)

            # Generate attention prior and spectrogram inputs for HiFi-GAN
            attn_prior = torch.from_numpy(
              beta_binomial_interpolator(spect_len.item(), text_len.item())
            ).unsqueeze(0).to(text.device)

            spectrogram = spec_model.forward(
              text=text, 
              input_lens=text_len, 
              spec=spect, 
              mel_lens=spect_len, 
              attn_prior=attn_prior,
              speaker=speaker,
            )[0]

            save_path = save_dir / f"mel_{i}.npy"
            np.save(save_path, spectrogram[0].to('cpu').numpy())
            r["mel_filepath"] = str(save_path)

    with open(hifigan_manifest_path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + '\n')

In [12]:
process_manifest(train_manifest, "hifigan_train_ft.json")
process_manifest(valid_manifest, "hifigan_valid_ft.json")
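Before starting HiFi-GAN finetuning, it can help to confirm that every manifest entry points at a mel file that was actually written. The helper below is a minimal sketch, not part of NeMo: `find_missing_mels` is a hypothetical name, and it only assumes the `mel_filepath` field that `process_manifest` adds to each record.

```python
import json
from pathlib import Path


def find_missing_mels(manifest_path):
    """Return manifest records whose `mel_filepath` is absent or not on disk."""
    missing = []
    with open(manifest_path, "r") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the manifest
            record = json.loads(line)
            mel_path = record.get("mel_filepath")
            if mel_path is None or not Path(mel_path).is_file():
                missing.append(record)
    return missing
```

Calling `find_missing_mels("hifigan_train_ft.json")` should return an empty list if mel generation completed for every record.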

Just as before, we can finetune using the underlying PyTorch Lightning code (or the `hifigan_finetune.py` script):

In [18]:
with initialize(config_path="./conf/hifigan"):
    cfg = compose(config_name="hifigan.yaml")

In [19]:
# datasets
cfg.train_dataset = "hifigan_train_ft.json"
cfg.validation_datasets = "hifigan_valid_ft.json"

# model ckpt
cfg.init_from_pretrained_model = "tts_hifigan"

# iterate fast
cfg.model.max_steps = 300
cfg.model.optim.lr = 0.00001

In [20]:
trainer = pl.Trainer(**cfg.trainer, enable_progress_bar=False)
exp_manager(trainer, cfg.get("exp_manager", None))

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


[NeMo I 2022-12-07 18:15:07 exp_manager:343] Experiments will be logged at exp_dir_hifi/HifiGan/2022-12-07_18-14-37
[NeMo I 2022-12-07 18:15:07 exp_manager:733] WandBLogger has been set up


[NeMo W 2022-12-07 18:15:07 exp_manager:988] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 300. Please ensure that max_steps will run for at least 10 epochs to ensure that checkpointing will not error out.


PosixPath('exp_dir_hifi/HifiGan/2022-12-07_18-14-37')

In [21]:
model = HifiGanModel(cfg=cfg.model, trainer=trainer)
model.maybe_init_from_pretrained_checkpoint(cfg=cfg)
lr_logger = pl.callbacks.LearningRateMonitor()
epoch_time_logger = LogEpochTimeCallback()
trainer.callbacks.extend([lr_logger, epoch_time_logger])

[NeMo I 2022-12-07 18:15:09 data:930] Loading dataset from hifigan_train_ft.json.


76it [00:00, 62307.88it/s]

[NeMo I 2022-12-07 18:15:09 data:954] Loaded dataset with 76 files.
[NeMo I 2022-12-07 18:15:09 data:956] Dataset contains 0.08 hours.
[NeMo I 2022-12-07 18:15:09 data:358] Pruned 2 files. Final dataset contains 74 files
[NeMo I 2022-12-07 18:15:09 data:360] Pruned 0.00 hours. Final dataset contains 0.08 hours.
[NeMo I 2022-12-07 18:15:09 data:930] Loading dataset from hifigan_valid_ft.json.



2it [00:00, 3508.41it/s]

[NeMo I 2022-12-07 18:15:09 data:954] Loaded dataset with 2 files.
[NeMo I 2022-12-07 18:15:09 data:956] Dataset contains 0.00 hours.
[NeMo I 2022-12-07 18:15:09 data:358] Pruned 0 files. Final dataset contains 2 files
[NeMo I 2022-12-07 18:15:09 data:360] Pruned 0.00 hours. Final dataset contains 0.00 hours.
[NeMo I 2022-12-07 18:15:09 features:267] PADDING: 0
[NeMo I 2022-12-07 18:15:09 features:275] STFT using exact pad
[NeMo I 2022-12-07 18:15:09 features:267] PADDING: 0
[NeMo I 2022-12-07 18:15:09 features:275] STFT using exact pad





[NeMo I 2022-12-07 18:15:10 cloud:56] Found existing object /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2022-12-07 18:15:10 cloud:62] Re-using file from: /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2022-12-07 18:15:10 common:912] Instantiating model from pre-trained checkpoint


[NeMo W 2022-12-07 18:15:13 modelPT:142] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2022-12-07 18:15:13 modelPT:149] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2022-12-07 18:15:13 features:267] PADDING: 0


[NeMo W 2022-12-07 18:15:13 features:244] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2022-12-07 18:15:13 features:267] PADDING: 0
[NeMo I 2022-12-07 18:15:14 save_restore_connector:243] Model HifiGanModel was successfully restored from /home/tcapelle/.cache/torch/NeMo/NeMo_1.14.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2022-12-07 18:15:14 modelPT:1115] Model checkpoint restored from pretrained checkpoint with name : `tts_hifigan`


In [22]:
trainer.fit(model)

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name                       | Type                     | Params
------------------------------------------------------------------------
0 | audio_to_melspec_precessor | FilterbankFeatures       | 0     
1 | trg_melspec_fn             | FilterbankFeatures       | 0     
2 | generator                  | Generator                | 13.9 M
3 | mpd                        | MultiPeriodDiscriminator | 41.1 M
4 | msd                        | MultiScaleDiscriminator  | 29.6 M
5 | feature_loss               | FeatureMatchingLoss      | 0     
6 | discriminator_loss         | DiscriminatorLoss        | 0     
7 | generator_loss             | GeneratorLoss            | 0     
------------------------------------------------------------------------
84.7 M    Trainable params
0         Non-trainable params
84.7 M    Total params
338.643   Total estimated model params size (MB)
Epoch 9, global step 100: 'val_loss' reached 0.43742 (best 0.43742), saving mo

In [23]:
wandb.finish()

Run history:
d_loss,█▂▅▃▃▃▄▅█▂▂▁▄▂▄
d_loss_mpd,▆▄█▅▁▁▄▄▂▃▁▃▄▃▇
d_loss_msd,▆▂▂▂▃▄▃▄█▂▂▁▃▂▂
epoch,▁▁▂▃▃▃▃▄▅▅▅▅▆▇▇▇██
epoch_time,▁▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁█
g_loss,▆▄▂█▄▄▃▃▃▁▃▂▃▄▄
g_loss_fm_mpd,▆▄▁▃█▇▆▃▄▄▅▃▆▅▅
g_loss_fm_msd,█▄▃▅▇▅▅▅▆▄▃▁▆▃▅
g_loss_gen_mpd,▂▄▁▅▅▇▅▄▆▆██▅▇▆
g_loss_gen_msd,▆▆▃█▃▇▂▆▁▃▆▃█▂▄

Run summary:
d_loss,3.42249
d_loss_mpd,1.99808
d_loss_msd,1.42442
epoch,29.0
epoch_time,9.63934
g_loss,42.46178
g_loss_fm_mpd,8.54431
g_loss_fm_msd,5.5821
g_loss_gen_mpd,2.38173
g_loss_gen_msd,1.69788


### Running as a Python script from the shell

We can then finetune HiFi-GAN similarly to FastPitch using NeMo's [hifigan_finetune.py](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/hifigan_finetune.py) and [hifigan.yaml](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/hifigan/hifigan.yaml):

```bash
python examples/tts/hifigan_finetune.py \
--config-name=hifigan.yaml \
model.train_ds.dataloader_params.batch_size=32 \
model.max_steps=1000 \
model.optim.lr=0.00001 \
~model.optim.sched \
train_dataset=./hifigan_train_ft.json \
validation_datasets=./hifigan_valid_ft.json \
exp_manager.exp_dir=hifigan_ft \
+init_from_pretrained_model=tts_hifigan \
trainer.check_val_every_n_epoch=10 \
model/train_ds=train_ds_finetune \
model/validation_ds=val_ds_finetune
```

As when finetuning FastPitch, we lower the learning rate and remove the optimizer schedule. The validation manifest passed to `validation_datasets`, along with the synthesized mels it refers to, must exist before you run the command.

As mentioned, the above command is not runnable in SageMaker/Colab due to limited compute resources, but you are free to finetune HiFi-GAN on your own machine.

### Adding more data

We can add more data in two ways. They can be combined for the best effect:

* **Add more training data from the new speaker**

The entire notebook can be repeated from the top after a new JSON manifest is defined that includes the additional data. Modify your finetuning commands to point to the new manifest. Be sure to increase the number of steps as more data is added to both the FastPitch and HiFi-GAN finetuning.

We recommend **1000 steps per minute of audio for FastPitch and 500 steps per minute of audio for HiFi-GAN**.
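The step budget can be derived from the manifest itself. The sketch below sums the `duration` field of each entry (assumed here to be in seconds, as in the manifests used in this notebook) and applies the two rates above. The function names are hypothetical.

```python
import json

FASTPITCH_STEPS_PER_MINUTE = 1000
HIFIGAN_STEPS_PER_MINUTE = 500


def total_minutes(manifest_path):
    """Sum the `duration` field (assumed to be seconds) over all manifest entries."""
    total_seconds = 0.0
    with open(manifest_path) as f:
        for line in f:
            if line.strip():
                total_seconds += json.loads(line)["duration"]
    return total_seconds / 60.0


def recommended_steps(manifest_path):
    """Apply the per-minute step recommendations to a manifest."""
    minutes = total_minutes(manifest_path)
    return {
        "fastpitch": round(minutes * FASTPITCH_STEPS_PER_MINUTE),
        "hifigan": round(minutes * HIFIGAN_STEPS_PER_MINUTE),
    }
```

The resulting values can be passed as `model.max_steps` in the respective finetuning commands.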

* **Mix new speaker data with old speaker data**

We recommend finetuning FastPitch (but not HiFi-GAN) using both old speaker data (LJSpeech in this notebook) and the new speaker data. In this case, please modify the JSON manifests when finetuning FastPitch to include speaker information by adding a `speaker` field to each entry:

```
{"audio_filepath": "new_speaker.wav", "text": "sample", "duration": 2.6, "speaker": 1}
{"audio_filepath": "old_speaker.wav", "text": "LJSpeech sample", "duration": 2.6, "speaker": 0}
```
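One way to add the `speaker` field is to rewrite each manifest line by line. This is a minimal sketch; `tag_speaker` is a hypothetical helper and the file names in the usage note are placeholders.

```python
import json


def tag_speaker(in_manifest, out_manifest, speaker_id):
    """Copy a JSON-lines manifest, setting `speaker` on every entry.

    Returns the number of entries written.
    """
    count = 0
    with open(in_manifest) as src, open(out_manifest, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            record["speaker"] = speaker_id
            dst.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, `tag_speaker("old_speaker_manifest.json", "old_speaker_tagged.json", 0)` would mark every old-speaker entry with speaker ID 0, matching the example entries above.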

5 hours of data from the old speaker should be sufficient for training; it's up to you how much data from the old speaker to use in validation.

For the training manifest, since we likely have less data from the new speaker, we need to ensure that the model sees a similar amount of new data and old data. We can do this by repeating samples from the new speaker until we have an equal number of samples from the old and new speakers. For each sample from the old speaker, add one sample from the new speaker to the JSON manifest.

As a toy example, if we use 4 samples of the old speaker and only 2 samples of the new speaker, we would want the order of samples in our manifest to look something like this:

```
old_speaker_sample_0
new_speaker_sample_0
old_speaker_sample_1
new_speaker_sample_1
old_speaker_sample_2
new_speaker_sample_0  # Start repeat of new speaker samples
old_speaker_sample_3
new_speaker_sample_1
```
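The ordering above can be produced by pairing each old-speaker record with the next new-speaker record, cycling back to the start of the new-speaker list when it runs out. The following is a minimal sketch under that assumption; `interleave_speakers` is a hypothetical helper that works on any lists of records (for example, parsed manifest entries).

```python
from itertools import cycle


def interleave_speakers(old_records, new_records):
    """Alternate old and new speaker records, repeating new ones as needed."""
    if not new_records:
        raise ValueError("need at least one new-speaker record to repeat")
    mixed = []
    # zip stops when the old-speaker records run out; cycle repeats the new ones
    for old, new in zip(old_records, cycle(new_records)):
        mixed.append(old)
        mixed.append(new)
    return mixed
```

Note that the output length is driven by the old-speaker list: if there are more new-speaker records than old-speaker records, the surplus new-speaker records are not included.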

Once the manifests are created, we can modify the FastPitch training command to point to the new training and validation JSON files.

We also need to update `model.n_speakers=1` to `model.n_speakers=2`, and update the `sup_data_types` specified in the config file to include `speaker_id` (`sup_data_types=[align_prior_matrix,pitch,speaker_id]`). Updating these two fields is very important: otherwise the model will not recognize that there is more than one speaker!

Ensure the pitch statistics correspond to the new speaker rather than the old speaker for best results.
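As a rough illustration of what "pitch statistics" means here, the sketch below computes the mean and (population) standard deviation of F0 over voiced frames only. It assumes you already have per-utterance F0 contours for the new speaker in Hz, with unvoiced frames stored as 0 or NaN; how those contours are extracted, and the exact config keys the values go into, are outside this sketch. `pitch_stats` is a hypothetical helper.

```python
import math


def pitch_stats(f0_contours):
    """Mean and population std of F0 (Hz) over voiced frames.

    `f0_contours` is an iterable of sequences, one per utterance;
    frames with value 0 or NaN are treated as unvoiced and ignored.
    """
    voiced = [
        f for contour in f0_contours for f in contour
        if not math.isnan(f) and f > 0
    ]
    if not voiced:
        raise ValueError("no voiced frames found")
    mean = sum(voiced) / len(voiced)
    variance = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    return mean, math.sqrt(variance)
```

Computing these values on new-speaker audio only, rather than on the mixed manifest, keeps the statistics from being dominated by the larger old-speaker set.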

**For HiFi-GAN finetuning, train on the new speaker data only.**