# ESPnet2-TTS realtime demonstration

This notebook demonstrates real-time end-to-end text-to-speech (E2E-TTS) with ESPnet2-TTS and ParallelWaveGAN (+ MelGAN).

- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN

Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi)), Airenas Vaičiūnas

## Installation

In [None]:
# NOTE: pip may report incompatibility errors caused by preinstalled libraries; these can be ignored
!pip install -q espnet==0.9.5 parallel_wavegan==0.4.8

## Single speaker model demo

### Model Selection

In [None]:
###################################
#          SABINA MODELS          #
###################################
fs, lang = 22050, "Lithuanian"
# Local experiment directory and checkpoint of the trained TTS model (machine-specific)
tag_dir = "/home/avaiciunas/gfs/tts/espnet/egs2/sabina/tts1/p-03/exp/tts_train_raw_phn_none"
tag = "1epoch.pth"

# vocoder_tag is only needed when downloading a pretrained vocoder instead of using vocoder_path
vocoder_tag = "ljspeech_parallel_wavegan.v1"
vocoder_path = (
    "/home/avaiciunas/gfs/tts/ParallelWaveGAN/egs/sabina/voc1/exp/"
    "train_nodev_ljspeech_parallel_wavegan.v3/checkpoint-710000steps.pkl"
)



### Model Setup

In [15]:
import time

import torch
from espnet2.bin.tts_inference import Text2Speech
from parallel_wavegan.utils import download_pretrained_model
from parallel_wavegan.utils import load_model

# Training config and checkpoint of the TTS model selected above
mf = {
    "train_config": tag_dir + "/config-run.yaml",
    "model_file": tag_dir + "/" + tag,
}
dev = "cpu"
text2speech = Text2Speech(
    **mf,
    device=dev,
    # Only for Tacotron 2
    threshold=0.5,
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2
    speed_control_alpha=1.0,
)
text2speech.spc2wav = None  # Disable Griffin-Lim; the neural vocoder below is used instead
# NOTE: Downloading a pretrained vocoder sometimes fails with "Permission denied".
#   This is a Google Drive limitation; please retry after several hours.
# vocoder = load_model(download_pretrained_model(vocoder_tag)).to(dev).eval()
vocoder = load_model(vocoder_path).to(dev).eval()
vocoder.remove_weight_norm()
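
The setup cell above reads three local files: the training config, the model checkpoint, and the vocoder checkpoint. Because these paths are specific to the author's machine, checking that they exist before loading can give a clearer error than a failed load. The following is a minimal sketch; `find_missing` is a hypothetical helper name and not part of ESPnet or ParallelWaveGAN.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Example usage with the variables from the Model Selection cell (assumed to be defined):
# missing = find_missing(
#     [tag_dir + "/config-run.yaml", tag_dir + "/" + tag, vocoder_path]
# )
# if missing:
#     raise FileNotFoundError(f"Missing model files: {missing}")
```

An empty list means all files were found, so the check can run before the setup cell without changing it.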

### Synthesis

In [16]:
# Input sentence as a space-separated phoneme sequence; replace it with your own
x = ("sil s t ai g \"a k a Z' \"i N.' k' ie n ^o: r a ^N. k o: s \"i S \"u S p a k a l' io:" 
     " u s p \"au d' E: j \"e: m. a k' \"i s i ^r. t ^ai p' n' e t' i k' \"E: t ai ,"
     " k \"a t' j \"i s' n' ^e: t k r \"u: p' t' e l' E: j io: . sil")
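# Synthesis: the TTS model predicts acoustic features (c), the vocoder turns them into a waveform
with torch.no_grad():
    start = time.time()
    wav, c, *_ = text2speech(x)
    wav = vocoder.inference(c)
# Real-time factor: processing time divided by the duration of the generated audio
rtf = (time.time() - start) / (len(wav) / fs)
print(f"RTF = {rtf:5f}")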

# Listen to the generated sample
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=fs))

RTF = 2.223634
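
The printed RTF is the real-time factor: the wall-clock synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. The value 2.223634 above therefore means that synthesis on CPU took about 2.2 times as long as the audio it produced. As a minimal sketch, the same arithmetic can be written as a standalone function; `real_time_factor` is a hypothetical helper name, and the numbers in the example are illustrative, not measurements.

```python
def real_time_factor(elapsed_s, num_samples, fs):
    """Processing time divided by audio duration (num_samples / fs seconds)."""
    if num_samples <= 0 or fs <= 0:
        raise ValueError("num_samples and fs must be positive")
    return elapsed_s / (num_samples / fs)


# Illustrative numbers: 3 s of audio at 22050 Hz synthesized in 1.5 s
rtf = real_time_factor(1.5, 3 * 22050, 22050)
print(f"RTF = {rtf:.2f}")  # RTF = 0.50
```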
