[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_tts_realtime_demo.ipynb)

# ESPnet2-TTS realtime demonstration

This notebook demonstrates real-time end-to-end TTS (E2E-TTS) using ESPnet2-TTS and the ParallelWaveGAN repository.

- ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
- ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN

Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))

## Installation

In [1]:
# NOTE: pip reports incompatibility errors caused by preinstalled libraries; you can safely ignore them
!pip install -q espnet==0.10.3 parallel_wavegan==0.5.3 espnet_model_zoo


## Single speaker model demo


### Model Selection

Please select a model: English, Japanese, and Mandarin are supported.

You can try either an end-to-end text2wav model or a combination of a text2mel model and a vocoder.  
If you use a text2wav model, you do not need a vocoder (it is disabled automatically).

**Text2wav models**:
- VITS

**Text2mel models**:
- Tacotron2
- Transformer-TTS
- (Conformer) FastSpeech
- (Conformer) FastSpeech2

**Vocoders**:
- Parallel WaveGAN
- Multi-band MelGAN
- HiFiGAN
- Style MelGAN
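
The rule stated above (a text2wav model needs no vocoder) can be sketched as a small helper. `needs_vocoder` and `effective_vocoder_tag` are hypothetical names, not part of ESPnet, and matching on the substring `vits` is an assumption based on the fact that VITS is the only text2wav model listed here.

```python
# Hypothetical helpers, not part of ESPnet: decide from the model tag whether
# a separate vocoder is required. Only VITS is listed above as a text2wav
# model, so only tags containing "vits" are treated as text2wav here.
TEXT2WAV_KEYWORDS = ("vits",)


def needs_vocoder(model_tag: str) -> bool:
    """Return True for text2mel models, False for text2wav models."""
    name = model_tag.lower()
    return not any(keyword in name for keyword in TEXT2WAV_KEYWORDS)


def effective_vocoder_tag(model_tag: str, vocoder_tag: str) -> str:
    """Return "none" when the model already produces a waveform."""
    return vocoder_tag if needs_vocoder(model_tag) else "none"


print(needs_vocoder("kan-bayashi/ljspeech_tacotron2"))
print(needs_vocoder("kan-bayashi/ljspeech_vits"))
```

The notebook already disables the vocoder for text2wav models, so a helper like this is only useful for making the choice explicit in your own scripts.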


> The terms of use follow those of each corpus. We use the following corpora:
- `ljspeech_*`: LJSpeech dataset 
  - https://keithito.com/LJ-Speech-Dataset/
- `jsut_*`: JSUT corpus
  - https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- `jvs_*`: JVS corpus + JSUT corpus
  - https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus
  - https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- `tsukuyomi_*`: Tsukuyomi-chan corpus (つくよみちゃんコーパス) + JSUT corpus
  - https://tyc.rei-yumesaki.net/material/corpus/
  - https://sites.google.com/site/shinnosuketakamichi/publication/jsut
- `csmsc_*`: Chinese Standard Mandarin Speech Corpus
  - https://www.data-baker.com/open_source.html 
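
To check which corpus terms apply to a given tag, the prefix list above can be turned into a lookup. This is a minimal sketch: `CORPUS_BY_PREFIX` and `corpus_for_tag` are hypothetical names, and it assumes tags have the form `<user>/<corpus>_<model>`, as the tags in this notebook do.

```python
# Illustrative lookup from tag prefix to corpus name, copied from the list above.
CORPUS_BY_PREFIX = {
    "ljspeech_": "LJSpeech dataset",
    "jsut_": "JSUT corpus",
    "jvs_": "JVS corpus + JSUT corpus",
    "tsukuyomi_": "つくよみちゃんコーパス + JSUT corpus",
    "csmsc_": "Chinese Standard Mandarin Speech Corpus",
}


def corpus_for_tag(model_tag: str) -> str:
    """Return the corpus whose terms of use apply to a model or vocoder tag."""
    # Assumed tag layout: "<user>/<corpus>_<model>"; drop the user part first.
    name = model_tag.split("/")[-1]
    for prefix, corpus in CORPUS_BY_PREFIX.items():
        if name.startswith(prefix):
            return corpus
    raise ValueError(f"Unknown corpus prefix in tag: {model_tag}")


print(corpus_for_tag("kan-bayashi/ljspeech_tacotron2"))
```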



In [2]:
#@title Choose English model { run: "auto" }
lang = 'English'
tag = "kan-bayashi/ljspeech_tacotron2" #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_fastspeech2", "kan-bayashi/ljspeech_conformer_fastspeech2", "kan-bayashi/ljspeech_joint_finetune_conformer_fastspeech2_hifigan", "kan-bayashi/ljspeech_joint_train_conformer_fastspeech2_hifigan", "kan-bayashi/ljspeech_vits"] {type:"string"}
vocoder_tag ="parallel_wavegan/ljspeech_parallel_wavegan.v1" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1", "parallel_wavegan/ljspeech_full_band_melgan.v2", "parallel_wavegan/ljspeech_multi_band_melgan.v2", "parallel_wavegan/ljspeech_hifigan.v1", "parallel_wavegan/ljspeech_style_melgan.v1"] {type:"string"}

### Model Setup

In [3]:
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none

text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    device="cuda",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    # speed_control_alpha=1.0,
    # Only for VITS
    # noise_scale=0.333,
    # noise_scale_dur=0.333,
)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package cmudict to /root/nltk_data...
[nltk_data]   Unzipping corpora/cmudict.zip.
https://zenodo.org/record/3989498/files/tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train.loss.best.zip?download=1: 100%|██████████| 102M/102M [00:05<00:00, 19.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1PdZv37JhAQH6AwNh31QlqruqrvjTBq7U
To: /root/.cache/parallel_wavegan/ljspeech_parallel_wavegan.v1.tar.gz
100%|██████████| 15.9M/15.9M [00:00<00:00, 233MB/s]
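
The "Only for ..." comments in the setup cell can be made explicit by grouping the options per model family. The sketch below is not part of ESPnet: `decoding_options` is a hypothetical helper, the option names and values are copied from the cell above, and detecting the family from substrings of the tag (including `transformer`) is an assumption.

```python
# Hypothetical helper, not part of ESPnet: return only the decoding options
# that the setup cell's comments mark as relevant for the given model tag.
def decoding_options(model_tag: str) -> dict:
    name = model_tag.lower()
    options = {}
    if "tacotron2" in name or "transformer" in name:
        # Only for Tacotron 2 & Transformer
        options["threshold"] = 0.5
    if "tacotron2" in name:
        # Only for Tacotron 2
        options.update(
            minlenratio=0.0,
            maxlenratio=10.0,
            use_att_constraint=False,
            backward_window=1,
            forward_window=3,
        )
    if "fastspeech" in name or "vits" in name:
        # Only for FastSpeech & FastSpeech2 & VITS
        options["speed_control_alpha"] = 1.0
    if "vits" in name:
        # Only for VITS
        options.update(noise_scale=0.333, noise_scale_dur=0.333)
    return options


print(decoding_options("kan-bayashi/ljspeech_vits"))
```

The resulting dictionary could then be unpacked into the call from the setup cell, for example `Text2Speech.from_pretrained(model_tag=tag, vocoder_tag=vocoder_tag, device="cuda", **decoding_options(tag))`.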


### Synthesis


In [10]:
import time

import torch
from IPython.display import Audio, display

# Type the sentence to synthesize
print(f"Input your favorite sentence in {lang}.")
x = input()

# Synthesize and time the call to compute the real-time factor (RTF)
with torch.no_grad():
    start = time.time()
    wav = text2speech(x)["wav"]
# RTF = synthesis time / duration of the generated audio
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")

# Listen to the generated sample
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))

Input your favorite sentence in English.
Input your favorite sentence in English
RTF = 0.190665
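
The RTF printed above is the synthesis time divided by the duration of the generated audio, so values below 1 mean synthesis runs faster than playback. A minimal standalone sketch of the same calculation follows; `real_time_factor` is a hypothetical name and the numbers are illustrative, not measured.

```python
def real_time_factor(elapsed_sec: float, num_samples: int, fs: int) -> float:
    """Synthesis time divided by the duration of the generated audio."""
    audio_sec = num_samples / fs  # duration of the waveform in seconds
    return elapsed_sec / audio_sec


# Illustrative numbers (not measured): 0.5 s to generate 2 s of audio at 22,050 Hz
rtf = real_time_factor(elapsed_sec=0.5, num_samples=44100, fs=22050)
print(f"RTF = {rtf:.2f}")  # RTF = 0.25
```

By the same formula, the reported RTF of 0.190665 means that synthesis took about 19% of the duration of the generated audio.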
