[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/tts_realtime_demo.ipynb)

# ESPnet LT real-time E2E-TTS demonstration

This notebook demonstrates real-time E2E-TTS using ESPnet-TTS and ParallelWaveGAN (+ MelGAN).

- ESPnet: https://github.com/airenas/espnet
- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN

Author: Airenas Vaičiūnas ([airenass@gmail.com](https://github.com/airenas))

## Install

In [2]:
# install minimal components
!pip install -q parallel_wavegan PyYaml unidecode ConfigArgparse
#!git clone -q https://github.com/espnet/espnet.git
#!cd espnet && git fetch && git checkout -b v.0.6.1 1e8b6ce88d57b53d1b60cbb3f306652468b0ab63

ERROR: espnet 0.6.2 has requirement h5py==2.9.0, but you'll have h5py 2.10.0 which is incompatible.


In [1]:
%pwd
%cd /mnt/gfs/avaiciunas/tts/espnet/notebook

/mnt/gfs/avaiciunas/tts/espnet/notebook




---
## LT demo: select a trained model from egs/lab


#### (a) Tacotron2

In [62]:
# set path
trans_type = "char"
dict_path = "../egs/sabina/tts1/data/lang_1char/char_train_no_dev_units.txt"
#model_path = "../egs/lab/tts1/exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/results/model.last1.avg.best"
model_path = "../egs/sabina/tts1/exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/results/snapshot.ep.348"

print("Successfully set prepared models.")


Successfully set prepared models.


In [52]:
# download pretrained model
# import os
# if not os.path.exists("downloads/en/tacotron2"):
#     !./espnet/utils/download_from_google_drive.sh \
#         https://drive.google.com/open?id=1Jo06IbVlq79lMA5wM9OMuZ-ByH1eRPkC downloads/en/tacotron2 tar.gz

# set path
trans_type = "phn"
dict_path = "../espnet/egs/ljspeech/tts1/data/lang_1phn/phn_train_no_dev_units.txt"
model_path = "../espnet/egs/ljspeech/tts1/exp/phn_train_no_dev_pytorch_train_pytorch_tacotron2/results/model.last1.avg.best"

print("Successfully prepared models.")

Successfully prepared models.


#### (b) Transformer

In [0]:
# download pretrained model
import os
if not os.path.exists("downloads/en/transformer"):
    !./espnet/utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1z8KSOWVBjK-_Ws4RxVN4NTx-Buy03-7c downloads/en/transformer tar.gz

# set path
trans_type = "phn"
dict_path = "downloads/en/transformer/data/lang_1phn/phn_train_no_dev_units.txt"
model_path = "downloads/en/transformer/exp/phn_train_no_dev_pytorch_train_pytorch_transformer.v3.single/results/model.last1.avg.best"

print("Successfully finished download.")

#### (c) FastSpeech


In [0]:
# download pretrained model
import os
if not os.path.exists("downloads/en/fastspeech"):
    !./espnet/utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1P9I4qag8wAcJiTCPawt6WCKBqUfJFtFp downloads/en/fastspeech tar.gz

# set path
trans_type = "phn"
dict_path = "downloads/en/fastspeech/data/lang_1phn/phn_train_no_dev_units.txt"
model_path = "downloads/en/fastspeech/exp/phn_train_no_dev_pytorch_train_tacotron2.v3_fastspeech.v4.single/results/model.last1.avg.best"

print("Successfully finished download.")

### Download pretrained vocoder model

You can choose one of two models. Please run only the cells for the selected model.

#### (a) Parallel WaveGAN

In [42]:
# download pretrained model
import os
if not os.path.exists("downloads/en/parallel_wavegan"):
    !../utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1Grn7X9wD35UcDJ5F7chwdTqTa4U7DeVB downloads/en/parallel_wavegan tar.gz

# set path
vocoder_path = "downloads/en/parallel_wavegan/ljspeech.parallel_wavegan.v2/checkpoint-400000steps.pkl"
vocoder_conf = "downloads/en/parallel_wavegan/ljspeech.parallel_wavegan.v2/config.yml"

print("Successfully finished download.")

Successfully finished download.


#### (b) MelGAN

This is an **EXPERIMENTAL** model.



In [38]:
# download pretrained model
import os
if not os.path.exists("downloads/en/melgan"):
    !../utils/download_from_google_drive.sh \
        https://drive.google.com/open?id=1ipPWYl8FBNRlBFaKj1-i23eQpW_W_YcR downloads/en/melgan tar.gz

# set path
vocoder_path = "downloads/en/melgan/train_nodev_ljspeech_melgan.v1.long/checkpoint-1000000steps.pkl"
vocoder_conf = "downloads/en/melgan/train_nodev_ljspeech_melgan.v1.long/config.yml"

print("Successfully finished download.")

Successfully finished download.
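
The two vocoder cells above differ only in the checkpoint and config paths they assign. As a minimal sketch, assuming you would rather pick the vocoder by name in one place, the paths could be kept in a small lookup table. The `VOCODERS` table and the `select_vocoder` helper are hypothetical names introduced for illustration; only the paths come from the cells above.

```python
# Paths copied from the two vocoder cells above. VOCODERS and select_vocoder
# are hypothetical names used only in this sketch.
VOCODERS = {
    "parallel_wavegan": {
        "path": "downloads/en/parallel_wavegan/ljspeech.parallel_wavegan.v2/checkpoint-400000steps.pkl",
        "conf": "downloads/en/parallel_wavegan/ljspeech.parallel_wavegan.v2/config.yml",
    },
    "melgan": {
        "path": "downloads/en/melgan/train_nodev_ljspeech_melgan.v1.long/checkpoint-1000000steps.pkl",
        "conf": "downloads/en/melgan/train_nodev_ljspeech_melgan.v1.long/config.yml",
    },
}


def select_vocoder(name):
    """Return (checkpoint path, config path) for the named vocoder."""
    if name not in VOCODERS:
        raise ValueError(f"unknown vocoder {name!r}; choose from {sorted(VOCODERS)}")
    entry = VOCODERS[name]
    return entry["path"], entry["conf"]


# Choose exactly one vocoder; the rest of the notebook reads these two names.
vocoder_path, vocoder_conf = select_vocoder("melgan")
print(vocoder_path)
print(vocoder_conf)
```

This only selects paths; the checkpoints still have to be downloaded by running the matching cell first.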


### Setup

In [63]:
# add path
import sys
sys.path.append("../egs/lab/tts1/local")
sys.path.append("../")

# define device
import torch
device = torch.device("cpu")

# define E2E-TTS model
from argparse import Namespace
from espnet.asr.asr_utils import get_model_conf
from espnet.asr.asr_utils import torch_load
from espnet.utils.dynamic_import import dynamic_import
idim, odim, train_args = get_model_conf(model_path)
model_class = dynamic_import(train_args.model_module)
model = model_class(idim, odim, train_args)
torch_load(model_path, model)
model = model.eval().to(device)
inference_args = Namespace(**{"threshold": 0.5, "minlenratio": 0.0, "maxlenratio": 10.0})

# define neural vocoder
import yaml
import parallel_wavegan.models
with open(vocoder_conf) as f:
    config = yaml.load(f, Loader=yaml.Loader)
vocoder_class = config.get("generator_type", "ParallelWaveGANGenerator")
vocoder = getattr(parallel_wavegan.models, vocoder_class)(**config["generator_params"])
vocoder.load_state_dict(torch.load(vocoder_path, map_location="cpu")["model"]["generator"])
vocoder.remove_weight_norm()
vocoder = vocoder.eval().to(device)

# define text frontend
with open(dict_path) as f:
    lines = f.readlines()
lines = [line.replace("\n", "").split(" ") for line in lines]
char_to_id = {c: int(i) for c, i in lines}
def frontend(text):
    """Clean text and then convert to id sequence."""
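    if trans_type == "phn":
        # NOTE: `g2p` is not defined in this notebook; a grapheme-to-phoneme
        # callable must be defined before using the "phn" setting.
        text = filter(lambda s: s != " ", g2p(text))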
        text = " ".join(text)
        print(f"Cleaned text: {text}")
        charseq = text.split(" ")
    else:
        print(f"Cleaned text: {text}")
        charseq = list(text)
    idseq = []
    for c in charseq:
        if c.isspace():
            idseq += [char_to_id["<space>"]]
        elif c not in char_to_id.keys():
            idseq += [char_to_id["<unk>"]]
        else:
            idseq += [char_to_id[c]]
    idseq += [idim - 1]  # <eos>
    return torch.LongTensor(idseq).view(-1).to(device)

print("Now ready to synthesize!")

Now ready to synthesize!
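
To make the character branch of `frontend` easier to follow, here is a minimal, self-contained sketch of the same mapping on a toy dictionary. The symbols and ids are made up for illustration and are not the real units file. The sketch also assumes that `idim` equals the number of dictionary entries plus two (id 0 reserved, last id used for `<eos>`), which should be checked against the actual model configuration.

```python
# Toy units in the "<symbol> <id>" format parsed by the setup cell.
# The entries are hypothetical; the real dictionary is read from dict_path.
units = "<unk> 1\n<space> 2\na 3\nb 4\n"
char_to_id = {c: int(i) for c, i in (line.split(" ") for line in units.splitlines())}

# Assumption: id 0 is reserved and the last id (idim - 1) marks <eos>.
idim = len(char_to_id) + 2


def to_ids(text):
    """Map each character to its id, using <space> and <unk> as fallbacks."""
    ids = []
    for c in text:
        if c.isspace():
            ids.append(char_to_id["<space>"])
        else:
            ids.append(char_to_id.get(c, char_to_id["<unk>"]))
    ids.append(idim - 1)  # <eos>
    return ids


print(to_ids("ab c"))  # [3, 4, 2, 1, 5]
```

Here `c` is not in the toy dictionary, so it maps to `<unk>` (id 1), and the sequence ends with the `<eos>` id 5.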


### Synthesis

In [69]:
import time
input_text = "alergija jodui yra labai reta."
# input_text = "apie tai šeštadienio rytą socialiniame tinkle paskelbė miesto meras vytautas grubliauskas. aiškėja, kad susirgo dvidešimt vienų metų kretingiškis"
# input_text = "studijuojantis danijoje, ten dirbęs bare."

pad_fn = torch.nn.ReplicationPad1d(
    config["generator_params"].get("aux_context_window", 0))
use_noise_input = vocoder_class == "ParallelWaveGANGenerator"
with torch.no_grad():
    start = time.time()
    x = frontend(input_text)
    print(f"x = {x}")
    c, _, _ = model.inference(x, inference_args)
    c = pad_fn(c.unsqueeze(0).transpose(2, 1)).to(device)
    xx = (c,)
    if use_noise_input:
        z_size = (1, 1, (c.size(2) - sum(pad_fn.padding)) * config["hop_size"])
        z = torch.randn(z_size).to(device)
        xx = (z,) + xx
    y = vocoder(*xx).view(-1)
rtf = (time.time() - start) / (len(y) / config["sampling_rate"])
print(f"RTF = {rtf:5f}")

from IPython.display import display, Audio
display(Audio(y.view(-1).cpu().numpy(), rate=config["sampling_rate"]))

Cleaned text: alergija jodui yra labai reta.
x = tensor([ 9, 20, 13, 25, 15, 17, 18,  9,  7, 18, 23, 12, 28, 17,  7, 30, 25,  9,
         7, 20,  9, 10,  9, 17,  7, 25, 13, 27,  9,  5, 41])
RTF = 1.151063
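
The real-time factor (RTF) printed above is the wall-clock synthesis time divided by the duration of the generated audio, so values below 1 mean synthesis is faster than playback. The sketch below isolates that arithmetic, together with the noise-length computation used for Parallel WaveGAN. The helper names and all numbers are hypothetical and chosen only to keep the arithmetic easy to check.

```python
def real_time_factor(elapsed_sec, num_samples, sampling_rate):
    """Wall-clock synthesis time divided by the duration of the generated audio."""
    audio_sec = num_samples / sampling_rate
    return elapsed_sec / audio_sec


def noise_length(padded_frames, padding, hop_size):
    """Number of noise samples: unpadded frame count times the hop size."""
    return (padded_frames - sum(padding)) * hop_size


# Hypothetical values, not taken from any real config.
sampling_rate = 22050
hop_size = 256
padding = (2, 2)                      # the .padding tuple of ReplicationPad1d(2)
padded_frames = 100 + sum(padding)    # 100 mel frames plus context padding

print(noise_length(padded_frames, padding, hop_size))   # 25600
print(real_time_factor(2.0, 44100, sampling_rate))      # 1.0
```

In the second call, 44100 samples at 22050 Hz are 2 seconds of audio, and taking 2 seconds to generate them gives an RTF of exactly 1.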
