# Tacotron2: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

<img src="./figures/tacotron2_arch.png" title="Tacotron2" width="50%"/>

### Model Description

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesise natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow (also available via torch.hub) is a flow-based model that consumes the mel spectrograms to generate speech.

This implementation of the Tacotron 2 model differs from the model described in the paper: it uses Dropout instead of Zoneout to regularize the LSTM layers.
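To make the difference between the two regularizers concrete, here is a minimal pure-Python sketch applied to a single hidden-state vector. It is illustrative only and is not taken from this implementation: the function names `dropout` and `zoneout`, the example vectors, and the probability `0.5` are all assumptions. Dropout zeroes units at random (rescaling the survivors), whereas Zoneout randomly keeps units at their previous time-step value.

```python
import random

def dropout(h_new, p, rng):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if p >= 1.0:
        return [0.0 for _ in h_new]
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in h_new]

def zoneout(h_prev, h_new, p, rng):
    """Zoneout: each unit keeps its previous value with probability p."""
    return [prev if rng.random() < p else new for prev, new in zip(h_prev, h_new)]

rng = random.Random(0)             # seeded only to make the demo repeatable
h_prev = [0.5, -0.2, 0.1, 0.9]     # hypothetical hidden state at step t-1
h_new = [0.7, 0.3, -0.4, 0.2]      # hypothetical hidden state at step t

print("dropout:", dropout(h_new, 0.5, rng))
print("zoneout:", zoneout(h_prev, h_new, 0.5, rng))
```

With `p = 0` both functions return the new state unchanged. With `p = 1` dropout returns all zeros, while zoneout returns the previous state.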

### Example

In the example below:
- pretrained Tacotron2 and WaveGlow models are loaded from torch.hub
- given a tensor representation of the input text ("Hello world, I missed you so much."), Tacotron2 generates a mel spectrogram as shown in the illustration
- WaveGlow generates sound given the mel spectrogram
- the output sound is saved in an 'audio.wav' file

To run the example, you need some extra Python packages installed.
These are needed for preprocessing the text and audio, as well as for display and input/output.

In [2]:
# %%bash
# pip install numpy scipy librosa unidecode inflect
# apt-get update
# apt-get install -y libsndfile1

In [1]:
import os
# Select the GPU device
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
# Proxy settings for the original environment; adjust or remove them for your own network
os.environ['http_proxy']  = 'http://192.41.170.23:3128'
os.environ['https_proxy'] = 'http://192.41.170.23:3128'

In [2]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

Load the Tacotron2 model pre-trained on [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/) and prepare it for inference:

In [3]:
import torch
tacotron2 = torch.hub.load(
    repo_or_dir = 'NVIDIA/DeepLearningExamples:torchhub', 
    model = 'nvidia_tacotron2', 
    model_math='fp16'
)
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Using cache found in /home/todsavadt/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
  ckpt = torch.load(ckpt_file)


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0-2): 3 x Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=80, out_features=256, bias=False)
        )
        (1): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=256, bias=False)
        )
      )
    )
    (attention_rnn): LSTMCell(768, 1024)
    (attention_layer): Attention(
      (query_layer): LinearNorm(
        (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
      )
      (memory_layer): LinearNorm(
        (linear_layer): Linear(in_fea

Load the pretrained WaveGlow model and prepare it for inference:

In [4]:
waveglow = torch.hub.load(
    repo_or_dir = 'NVIDIA/DeepLearningExamples:torchhub', 
    model = 'nvidia_waveglow', 
    model_math='fp16'
)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /home/todsavadt/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
  ckpt = torch.load(ckpt_file)
  WeightNorm.apply(module, name, dim)


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0-3): 4 x WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0-6): 7 x Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (7
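The `in_layers` printed above use kernel size 3 with dilations that double at every layer, and the padding always equals the dilation. The sketch below reproduces that pattern and estimates how many input time steps one such stack of eight layers can see. The helper names `same_padding` and `receptive_field` are hypothetical, and the calculation assumes the eight convolutions are applied one after another.

```python
kernel_size = 3
dilations = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128, as in the printed in_layers

def same_padding(kernel_size, dilation):
    """Padding that keeps the output length equal to the input length."""
    return dilation * (kernel_size - 1) // 2

def receptive_field(kernel_size, dilations):
    """Input time steps seen by one output step of the stacked dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print([same_padding(kernel_size, d) for d in dilations])  # [1, 2, 4, 8, 16, 32, 64, 128]
print(receptive_field(kernel_size, dilations))            # 511
```

Doubling the dilation lets the receptive field grow exponentially with depth while each layer still has only three taps.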

Now, let's make the model say:

In [5]:
text = "Hello world, I missed you so much."

Format the input using the utility methods:

In [10]:
utils = torch.hub.load(
    repo_or_dir = 'NVIDIA/DeepLearningExamples:torchhub', 
    model = 'nvidia_tts_utils'
)
sequences, lengths = utils.prepare_input_sequence([text])
sequences.shape, lengths.shape

Using cache found in /home/todsavadt/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub


(torch.Size([1, 34]), torch.Size([1]))
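The sequence length of 34 matches the number of characters in the input text, which is consistent with a one-ID-per-character encoding. The sketch below illustrates that idea with a toy symbol table. The table, the lowercasing step, and the `encode` helper are assumptions made for illustration; they are not the symbol set or cleaning pipeline used by `nvidia_tts_utils`.

```python
text = "Hello world, I missed you so much."

# Hypothetical toy symbol table built from the text itself.
symbols = sorted(set(text.lower()))
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def encode(text):
    """Map each character to an integer ID (one ID per character)."""
    return [symbol_to_id[ch] for ch in text.lower()]

sequence = encode(text)
length = len(sequence)
print(length)  # 34
```

In the real pipeline a batch of such sequences is returned as a tensor together with a tensor of lengths, which is what the shapes above show.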

Run the chained models:

In [11]:
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
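A quick way to sanity-check the result is to convert sizes into seconds. The sketch below assumes that each mel frame corresponds to 256 audio samples, inferred from the `stride=(256,)` of the WaveGlow upsample layer printed earlier. The helper names and the frame count of 200 are hypothetical.

```python
HOP_LENGTH = 256      # assumed samples per mel frame (stride of the WaveGlow upsample layer)
SAMPLE_RATE = 22050   # same value as `rate` in the cell above

def mel_frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration produced from a number of mel frames."""
    return n_frames * hop_length / sample_rate

def samples_to_seconds(n_samples, sample_rate=SAMPLE_RATE):
    """Duration of a waveform with n_samples samples."""
    return n_samples / sample_rate

print(round(mel_frames_to_seconds(200), 3))  # 2.322
print(samples_to_seconds(44100))             # 2.0
```

In the notebook, `samples_to_seconds(len(audio_numpy))` would give the duration of the generated clip.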

You can write the audio to a file and listen to it:

In [12]:
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
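`scipy.io.wavfile.write` stores the samples in the data type of the array it is given. If a plain 16-bit PCM file is preferred, for example for players that do not handle floating-point WAV files, the samples can be converted first. The following is a minimal sketch using only the standard library. The `write_pcm16` helper and the output file name are hypothetical, and it assumes mono samples in the range [-1, 1].

```python
import os
import struct
import tempfile
import wave

def write_pcm16(path, samples, sample_rate=22050):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    clipped = [max(-1.0, min(1.0, float(s))) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    frames = struct.pack("<%dh" % len(ints), *ints)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(frames)

# Tiny demo signal; in the notebook you would pass audio_numpy and rate instead.
out_path = os.path.join(tempfile.gettempdir(), "audio_pcm16.wav")
write_pcm16(out_path, [0.0, 0.5, -0.5, 1.0, -1.0])
```

For long clips, a vectorized NumPy conversion would be faster. The list-based version is kept here only to avoid extra dependencies.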

Alternatively, play it right away in a notebook with IPython widgets

In [13]:
from IPython.display import Audio
Audio(audio_numpy, rate=rate)

## Training 

## Inference

### Details
For detailed information on model input and output, training recipes, inference, and performance, visit [GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) and/or [NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch).

### References

 - [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
 - [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002)
 - [Tacotron2 and WaveGlow on NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch)
 - [Tacotron2 and WaveGlow on GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)