[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/interspeech2019-tutorial/blob/kan-bayashi/tts/tts_demo.ipynb)

# ESPnet Text-to-Speech Demonstration

[**Tomoki Hayashi**](https://github.com/kan-bayashi)

Department of Informatics, Nagoya University  
Human Dataware Lab. Co., Ltd.


## Set up the environment

It takes around 10 minutes. Please wait for a while.


In [None]:
# OS setup
!apt-get install -qq bc tree
!cat /etc/os-release

# espnet setup
!git clone https://github.com/espnet/espnet
!pip install -q torch==1.1
!cd espnet; pip install -q -e .

# warp ctc setup
!git clone https://github.com/espnet/warp-ctc -b pytorch-1.1
!cd warp-ctc && mkdir build && cd build && cmake .. && make -j
!cd warp-ctc/pytorch_binding && python setup.py install 

# kaldi setup
!cd espnet/tools && git clone https://github.com/kaldi-asr/kaldi
!echo "" > ./espnet/tools/kaldi/tools/extras/check_dependencies.sh # ignore check
!chmod +x ./espnet/tools/kaldi/tools/extras/check_dependencies.sh
!cd ./espnet/tools/kaldi/tools; make sph2pipe sclite
!rm -rf espnet/tools/kaldi/tools/python
!espnet/utils/download_from_google_drive.sh https://drive.google.com/open?id=1DW4zKQtgDt-YeImLE_kS1Ldj673x7Sx2 downloads tar.gz
!mkdir -p espnet/tools/kaldi/src/featbin && mv downloads/featbin/* espnet/tools/kaldi/src/featbin/

# make dummy activate
!mkdir -p espnet/tools/venv/bin && touch espnet/tools/venv/bin/activate
!echo "setup done."

## Introduction of ESPnet TTS

- Follow the [Kaldi](https://github.com/kaldi-asr/kaldi) recipe style
- Multi-GPU training and GPU decoding thanks to PyTorch
- Support three E2E-TTS models and their variants
- Support four corpora
- Support additional attention mechanisms and loss functions
- Support pretrained WaveNet-vocoder

### Supported E2E-TTS models

- [**Tacotron 2**](https://arxiv.org/abs/1712.05884): Standard Tacotron 2
- [**Multi-speaker Tacotron2**](https://arxiv.org/pdf/1806.04558.pdf): Pretrained x-vector + Tacotron 2
- [**Transformer**](https://arxiv.org/pdf/1809.08895.pdf): TTS-Transformer
- **Multi-speaker Transformer**: Pretrained x-vector + TTS-Transformer
- [**FastSpeech**](https://arxiv.org/pdf/1905.09263.pdf): Feed-forward TTS-Transformer


### Other remarkable functions

- [**CBHG**](https://arxiv.org/pdf/1703.10135.pdf): Network to convert Mel-filter bank to linear spectrogram
- [**Forward attention**](https://arxiv.org/pdf/1807.06736.pdf): Attention mechanism with causal regularization
- [**Guided attention loss**](https://arxiv.org/pdf/1710.08969.pdf): Loss function to force attention to be diagonal
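
To make "force attention to be diagonal" concrete, here is a minimal pure-Python sketch of the penalty described in the guided attention paper linked above: each attention weight is multiplied by `1 - exp(-(n/N - t/T)^2 / (2 g^2))`, so weights far from the diagonal are penalized. The function names and the tiny 4x4 matrices are hypothetical and for illustration only; this is not ESPnet's implementation.

```python
import math


def guided_attention_mask(n_frames, n_tokens, sigma=0.2):
    """Penalty matrix: close to 0 near the diagonal, close to 1 far from it."""
    return [
        [
            1.0 - math.exp(-((n / n_tokens - t / n_frames) ** 2) / (2 * sigma ** 2))
            for n in range(n_tokens)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(att_ws, sigma=0.2):
    """Mean of the attention weights multiplied by the penalty matrix."""
    n_frames, n_tokens = len(att_ws), len(att_ws[0])
    mask = guided_attention_mask(n_frames, n_tokens, sigma)
    total = sum(
        mask[t][n] * att_ws[t][n] for t in range(n_frames) for n in range(n_tokens)
    )
    return total / (n_frames * n_tokens)


# Hypothetical attention matrices (rows: output frames, columns: input tokens)
diagonal = [[1.0 if n == t else 0.0 for n in range(4)] for t in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for n in range(4)] for t in range(4)]

diag_loss = guided_attention_loss(diagonal)
anti_loss = guided_attention_loss(anti_diagonal)
print("diagonal loss      =", diag_loss)
print("anti-diagonal loss =", anti_loss)
```

A perfectly diagonal alignment receives no penalty, while an alignment that runs against the diagonal is penalized heavily.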

### Supported corpora

- [`egs/jsut/tts1`](https://sites.google.com/site/shinnosuketakamichi/publication/jsut): Japanese single female speaker (48 kHz, ~10 hours).
- [`egs/libritts/tts1`](http://www.openslr.org/60/): English multi-speaker (24 kHz, ~500 hours).
- [`egs/ljspeech/tts1`](https://keithito.com/LJ-Speech-Dataset/): English single female speaker (22.05 kHz, ~24 hours).
- [`egs/m_ailabs/tts1`](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/): Various language speakers (16 kHz, 16~48 hours).

## Run the recipe
 
We use the simplest recipe, `egs/an4/tts1`, as an example.  

Unfortunately, the an4 dataset is too small to train a usable model,  
but the flow itself is the same as in the other recipes.


In [None]:
# Let's go to an4 recipe!
import os
os.chdir("espnet/egs/an4/tts1")
!echo $(pwd)

### Scripts in the recipe

- `run.sh`: Main script of the recipe.
- `cmd.sh`: Command configuration script to control how each job is run.
- `path.sh`: Path configuration script. Usually there is no need to modify it.
- `conf/`: Directory containing configuration files.
- `local/`: Directory containing the recipe-specific scripts, e.g. data preparation.
- `steps/` and `utils/`: Directories containing Kaldi tools.

In [None]:
!tree -L 1

### Overview of the recipe

<img src=figs/tts_overview.png width=80%>

### Stages in the recipe

The main script **run.sh** consists of several stages:

- **stage -1**: Download data if the data is available online.
- **stage 0**: Prepare data to make a Kaldi-style data directory.
- **stage 1**: Extract feature vector, calculate statistics, and perform normalization.
- **stage 2**: Prepare a dictionary and make json files for training.
- **stage 3**: Train the E2E-TTS network.
- **stage 4**: Decode mel-spectrogram using the trained network.
- **stage 5**: Generate a waveform using Griffin-Lim.
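
The `--stage` and `--stop_stage` options used throughout this notebook select a contiguous range of these stages. The sketch below illustrates that selection logic in Python, assuming each stage in `run.sh` runs only when its number lies between `stage` and `stop_stage` inclusive. The `STAGES` table and `stages_to_run` function are hypothetical names for illustration.

```python
# Stage numbers and names, taken from the stage headings in this notebook
STAGES = {
    -1: "Data download",
    0: "Data preparation",
    1: "Feature extraction",
    2: "Dictionary and json preparation",
    3: "Network training",
    4: "Network decoding",
    5: "Waveform synthesis",
}


def stages_to_run(stage, stop_stage):
    """Return the stage numbers n with stage <= n <= stop_stage."""
    return [n for n in sorted(STAGES) if stage <= n <= stop_stage]


# "--stage 1 --stop_stage 1" runs only feature extraction
for n in stages_to_run(1, 1):
    print(n, STAGES[n])

# "--stage 3 --stop_stage 5" runs training, decoding, and synthesis
for n in stages_to_run(3, 5):
    print(n, STAGES[n])
```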


### Stage -1: Data download

<img src=figs/tts_stage-1.png width=80%>

In [None]:
# run stage -1 and then stop
!./run.sh --stage -1 --stop_stage -1

The `downloads` directory is created, which contains the downloaded an4 dataset.

In [None]:
# check directory structure
!tree -L 1

# check downloads directory
!ls downloads/

### Stage 0: Data preparation

<img src=figs/tts_stage0.png width=80%>

In [None]:
# run stage 0 and then stop
!./run.sh --stage 0 --stop_stage 0

Two kaldi-style data directories are created:  
- `data/train`: data directory of training set
- `data/test`: data directory of evaluation set  

In [None]:
# check directory structure
!tree -L 1 data

# check each directory
!ls data/*

`wav.scp`: 
- Each line has `<utt_id> <wavfile_path or command pipe>`
- `<utt_id>` must be unique

`text`:
- Each line has `<utt_id> <transcription>`
- Assume that `<transcription>` is cleaned

`utt2spk`:
- Each line has `<utt_id> <speaker_id>`

`spk2utt`:
- Each line has `<speaker_id> <utt_id> ... <utt_id>`
- Can be automatically created from `utt2spk` 

In ESPnet, speaker information is not used for any processing.  
Therefore, **utt2spk** and **spk2utt** can be dummy files.
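
As a rough illustration of how `spk2utt` follows from `utt2spk`, the sketch below groups utterance IDs by speaker in pure Python. The utterance and speaker IDs are hypothetical examples, and the function name is made up for this sketch. In a recipe, the same conversion is normally done by Kaldi's `utils/utt2spk_to_spk2utt.pl`.

```python
from collections import defaultdict


def utt2spk_to_spk2utt(lines):
    """Convert '<utt_id> <speaker_id>' lines into '<speaker_id> <utt_id> ...' lines."""
    spk2utt = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        utt_id, spk_id = line.split(maxsplit=1)
        spk2utt[spk_id].append(utt_id)
    # Sort speakers and utterances so that the output is deterministic
    return [
        spk_id + " " + " ".join(sorted(utt_ids))
        for spk_id, utt_ids in sorted(spk2utt.items())
    ]


# Hypothetical utt2spk content
utt2spk_lines = [
    "fash-an251-b fash",
    "fash-an253-b fash",
    "fbbh-an86-b fbbh",
]
spk2utt_lines = utt2spk_to_spk2utt(utt2spk_lines)
print("\n".join(spk2utt_lines))
```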

In [None]:
# check each file
!head -n 3 data/train/*

### Stage 1: Feature extraction

<img src=figs/tts_stage1.png width=80%>

In [None]:
# hyperparameters related to stage 1
!head -n 28 run.sh | tail -n 8

In [None]:
# run stage 1 with default settings
!./run.sh --stage 1 --stop_stage 1 --nj 4

In [None]:
# check directory structure
!tree -L 2

Raw filterbanks are saved in the `fbank/` directory in `ark/scp` format.

- `.ark`: binary file of feature vectors
- `.scp`: list of the correspondence between `<utt_id>` and `<path_in_ark>`.  

Since feature extraction can be performed in parallel on small subsets, raw_fbank is split into `raw_fbank_*.{1..4}.{scp,ark}`.
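
To show what a `<path_in_ark>` entry usually looks like, here is a small sketch that splits one scp line into its parts, assuming the common `<utt_id> <ark_path>:<byte_offset>` form. The example line and the function name `parse_scp_line` are hypothetical. In practice, `kaldiio.load_scp` handles this parsing for you.

```python
def parse_scp_line(line):
    """Split '<utt_id> <ark_path>:<offset>' into (utt_id, ark_path, offset).

    The offset is None when the entry has no ':<offset>' suffix.
    """
    utt_id, location = line.strip().split(maxsplit=1)
    ark_path, sep, offset = location.rpartition(":")
    if not sep or not offset.isdigit():
        # No byte offset given: the whole entry is a path
        return utt_id, location, None
    return utt_id, ark_path, int(offset)


# Hypothetical scp line
example_line = "fash-an251-b /content/fbank/raw_fbank_train.1.ark:13"
utt_id, ark_path, offset = parse_scp_line(example_line)
print("utt_id   =", utt_id)
print("ark_path =", ark_path)
print("offset   =", offset)
```

The byte offset is what lets a reader jump directly to one utterance inside a large `.ark` file without scanning it from the start.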

In [None]:
!ls fbank

In [None]:
!head -n 3 fbank/raw_fbank_train.1.scp

These files can be loaded in python via **kaldiio** as follows:

In [None]:
import kaldiio
import matplotlib.pyplot as plt

# load scp file
scp_dict = kaldiio.load_scp("fbank/raw_fbank_train.1.scp")
for key in scp_dict:
    plt.imshow(scp_dict[key].T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break
    
# load ark file
ark_generator = kaldiio.load_ark("fbank/raw_fbank_train.1.ark")
for key, array in ark_generator:
    plt.imshow(array.T[::-1])
    plt.title(key)
    plt.colorbar()
    plt.show()
    break

Some files are added to `data/train`:  
- `feats.scp`: concatenated scp file of `fbank/raw_fbank_train.{1..4}.scp`.  
- `utt2num_frames`: Each line has `<utt_id> <number_of_frames>` .

In [None]:
!ls data/train
!head -n 3 data/train/*

The `data/train/` directory is then split into two directories:
- `data/train_nodev/`: data directory for training
- `data/train_dev/`: data directory for validation


In [None]:
!ls data
!ls data/train_*

The statistics file `cmvn.ark` is saved in `data/train_nodev`.  
This file can also be loaded in Python via kaldiio.
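
The following pure-Python sketch shows how the mean and variance can be recovered from such statistics and then used for normalization. It assumes the usual Kaldi CMVN layout: row 0 holds the per-dimension sum followed by the frame count, and row 1 holds the per-dimension sum of squares followed by a zero. The toy feature values are made up for illustration.

```python
import math

# Toy features: 4 frames x 2 dimensions (made-up values)
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
num_frames, dim = len(feats), len(feats[0])

# Accumulate statistics in the assumed layout, shape (2, dim + 1)
sums = [sum(frame[d] for frame in feats) for d in range(dim)]
sq_sums = [sum(frame[d] ** 2 for frame in feats) for d in range(dim)]
stats = [sums + [float(num_frames)], sq_sums + [0.0]]

# Recover mean and variance: var = E[x^2] - (E[x])^2
count = stats[0][-1]
mean = [s / count for s in stats[0][:-1]]
var = [sq / count - m ** 2 for sq, m in zip(stats[1][:-1], mean)]

# Apply mean-variance normalization to every frame
normalized = [
    [(x - m) / math.sqrt(v) for x, m, v in zip(frame, mean, var)]
    for frame in feats
]

print("mean     =", mean)
print("variance =", var)
```

Note that dividing the sum of squares by the frame count gives the second moment, so the squared mean must be subtracted to obtain the variance.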


In [None]:
import kaldiio

# load cmvn.ark file (note: use load_mat, not load_ark)
cmvn = kaldiio.load_mat("data/train_nodev/cmvn.ark")

# cmvn holds the per-dimension sum (row 0) and sum of squares (row 1);
# the last element of row 0 is the number of frames.
print("cmvn shape = " + str(cmvn.shape))

# calculate mean and variance (variance = E[x^2] - mean^2)
mu = cmvn[0, :-1] / cmvn[0, -1]
var = cmvn[1, :-1] / cmvn[0, -1] - mu ** 2

# show mean and variance
print("mean = " + str(mu))
print("variance = " + str(var))

Normalized features for the train, dev, and eval sets are dumped in
- `dump/{train_nodev,train_dev,test}/*.{ark,scp}`.  

These ark and scp files can be loaded in the same way as shown above.



In [None]:
!ls dump/*

### Stage 2: Dictionary and json preparation

<img src=figs/tts_stage2.png width=80%>

In [None]:
# run stage 2 and then stop
!./run.sh --stage 2 --stop_stage 2

- Dictionary file will be created in `data/lang_1char/`.  
- Dictionary file consists of `<token>` `<token index>`.  
    - `<token index>` starts from 1 because 0 is used as padding index.
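
As a rough sketch of what a character-level dictionary looks like and how it is used, the code below builds a token-to-index mapping from a few transcriptions and converts a sentence into token ids. It assumes that index 1 is reserved for `<unk>` and that a space is written as `<space>`. Both the function names and the example sentences are hypothetical, and the ordering of tokens in the real dictionary file may differ.

```python
def build_char_dict(texts):
    """Map each character to an index; 0 is left free for padding."""
    tokens = sorted(
        {"<space>" if ch == " " else ch for text in texts for ch in text}
    )
    token2id = {"<unk>": 1}  # assumption: index 1 is the unknown token
    for idx, token in enumerate(tokens, start=2):
        token2id[token] = idx
    return token2id


def text_to_ids(text, token2id):
    """Convert a transcription into a list of token ids."""
    ids = []
    for ch in text:
        token = "<space>" if ch == " " else ch
        ids.append(token2id.get(token, token2id["<unk>"]))
    return ids


# Hypothetical training transcriptions
token2id = build_char_dict(["GO", "HELLO GO"])
for token, idx in token2id.items():
    print(token, idx)

# "M" is not in the dictionary, so it is mapped to <unk>
ids = text_to_ids("GO HOME", token2id)
print(ids)
```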


In [None]:
!ls data
!cat data/lang_1char/train_nodev_units.txt

A json file will be created for each of the train, dev, and eval sets as 
- `dump/{train_nodev,train_dev,test}/data.json`.

In [None]:
!ls dump/*/*.json

Each json file contains all of the information in the data directory.

- `shape`: Shape of the input or output sequence. For example, `[63, 80]` means 63 frames of an 80-dimensional mel-spectrogram.
- `text`: Original transcription.
- `token`: Token sequence of the original transcription.
- `tokenid`: Token id sequence of the original transcription, which is converted using the dictionary.
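
The sketch below shows one way to read these fields with the standard `json` module. The entry is a hand-written, hypothetical example: the utterance ID, the feature path, and the nesting under `utts`, `input`, and `output` are assumptions about the layout, so compare it with the actual content of `data.json` before relying on it.

```python
import json

# Hypothetical data.json content with a single utterance
sample = """
{
  "utts": {
    "utt_0001": {
      "input": [
        {"name": "input1", "shape": [63, 80], "feat": "dump/feats.1.ark:9"}
      ],
      "output": [
        {
          "name": "target1",
          "shape": [7, 9],
          "text": "GO HOME",
          "token": "G O <space> H O M E",
          "tokenid": "4 7 2 5 7 1 3"
        }
      ]
    }
  }
}
"""

data = json.loads(sample)
summary = {}
for utt_id, info in data["utts"].items():
    num_frames, feat_dim = info["input"][0]["shape"]
    # tokenid is assumed to be a space-separated string of integers
    token_ids = [int(t) for t in info["output"][0]["tokenid"].split()]
    summary[utt_id] = (num_frames, feat_dim, token_ids)
    print(utt_id, num_frames, feat_dim, token_ids)
```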

In [None]:
!head -n 27 dump/train_nodev/data.json

Now we are ready to start training!

### Stage 3: Network training

<img src=figs/tts_stage3.png width=80%>

Training settings can be specified via `--train_config`.

In [None]:
# check hyperparameters in run.sh
!head -n 31 run.sh | tail -n 2

Training configurations are written in `.yaml` format.  
Let us check the default configuration `conf/train_pytorch_tacotron2.yaml`.

In [None]:
!cat conf/train_pytorch_tacotron2.yaml

Let's change the hyperparameters.

In [None]:
# load configuration yaml
import yaml
with open("conf/train_pytorch_tacotron2.yaml") as f:
    params = yaml.load(f, Loader=yaml.Loader)

# change hyperparameters by yourself!
# (update the loaded defaults instead of discarding them)
params.update({
    "embed-dim": 16,
    "elayers": 1,
    "eunits": 16,
    "econv-layers": 1,
    "econv-chans": 16,
    "econv-filts": 5,
    "dlayers": 1,
    "dunits": 16,
    "prenet-layers": 1,
    "prenet-units": 16,
    "postnet-layers": 1,
    "postnet-chans": 16,
    "postnet-filts": 5,
    "atype": "location",  # attention type is a name, not a size
    "adim": 16,
    "aconv-chans": 16,
    "aconv-filts": 5,
    "reduction-factor": 3,
    "batch-size": 64,
    "epochs": 10,
})

# save
with open("conf/train_pytorch_tacotron2_mini.yaml", "w") as f:
    yaml.dump(params, f, Dumper=yaml.Dumper)

!cat conf/train_pytorch_tacotron2_mini.yaml

Let's train the network.  
You can specify the config file via `--train_config` option.  
It takes several minutes.


In [None]:
# use modified configuration file as train config
!./run.sh --stage 3 --stop_stage 3 --train_config conf/train_pytorch_tacotron2_mini.yaml --verbose 1 --ngpu 0

You can see the training log in `exp/train_*/train.log`.

In [None]:
!cat exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/train.log

The models are saved in `exp/train_*/results/` directory.

- `exp/train_*/results/model.loss.best`: contains only the model parameters.  
- `exp/train_*/results/snapshot.ep.*`: contains the model parameters, optimizer states, and iterator states. 


In [None]:
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/{results,results/att_ws}

`exp/train_*/results/*.png` are figures of the training curves.

In [None]:
from IPython.display import Image, display_png
print("all loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/all_loss.png"))
print("l1 loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/l1_loss.png"))
print("mse loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/mse_loss.png"))
print("bce loss curve")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/bce_loss.png"))

`exp/train_*/results/att_ws/*.png` are figures of the attention weights in each epoch.

In [None]:
print("Attention weights of initial epoch")
display_png(Image("exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/att_ws/fash-cen1-b.ep.1.png"))

You can resume training by specifying the snapshot file with the `--resume` option.

In [None]:
# resume training from snapshot.ep.2
!./run.sh --stage 3 --stop_stage 3 --verbose 1 \
    --train_config conf/train_pytorch_tacotron2_mini.yaml \
    --resume exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/results/snapshot.ep.2

In [None]:
!cat exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/train.log

Also, we support tensorboard.  
You can see the training log through tensorboard.

In [None]:
%load_ext tensorboard
%tensorboard --logdir tensorboard/train_nodev_pytorch_train_pytorch_tacotron2_mini/

### Stage 4: Network decoding

<img src=figs/tts_stage4.png width=80%>


Decoding parameters can be specified by `--decode_config`.

In [None]:
!head -n 32 run.sh | tail -n 1

Decoding configurations are written as `.yaml` format file.  
Let us check the default configuration `conf/decode.yaml`.

In [None]:
!cat conf/decode.yaml

In [None]:
# run stage 4 and then stop
!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_mini.yaml 

Generated features are saved in `ark/scp` format.

In [None]:
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/outputs_model.loss.best_decode/*

We can specify the model or snapshot for decoding via `--model`.   

In [None]:
!./run.sh --stage 4 --stop_stage 4 --nj 8 --train_config conf/train_pytorch_tacotron2_mini.yaml --model snapshot.ep.2

In [None]:
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/outputs_snapshot.ep.2_decode/*

### Stage 5: Waveform synthesis

<img src=figs/tts_stage5.png width=80%>

In [None]:
# run stage 5 and then stop
!./run.sh --stage 5 --stop_stage 5 --nj 8 --train_config conf/train_pytorch_tacotron2_mini.yaml

Generated wav files are saved in 
- `exp/train_nodev_pytorch_*/outputs_model.loss.best_decode_denorm/*/wav`

In [None]:
!tree -L 3 exp/train_nodev_pytorch_train_pytorch_tacotron2_mini

In [None]:
!ls exp/train_nodev_pytorch_train_pytorch_tacotron2_mini/outputs_model.loss.best_decode_denorm/*/wav

## Use pretrained models

We provide pretrained models, which are easy to use with `synth_wav.sh`.

In [None]:
# move to the librispeech recipe directory
os.chdir("../../librispeech/asr1")
!pwd

In [None]:
# check usage
!../../../utils/synth_wav.sh --help

In [None]:
# generate your sentence!
print("Please input your favorite sentence!")
text = input()
text = text.upper()
with open("example.txt", "w") as f:
    f.write(text)
!../../../utils/synth_wav.sh --models ljspeech.fastspeech.v1 example.txt

# check generated audio
import IPython.display
IPython.display.display(IPython.display.Audio("decode/example/wav/example.wav"))

Let's recognize generated speech!

In [None]:
# check usage
!../../../utils/recog_wav.sh --help

In [None]:
# downsample to 16 kHz for ASR model
!sox decode/example/wav/example.wav -b 16 decode/example/wav/example_16k.wav rate 16k

# make decode config
import yaml
with open("conf/decode_sample.yaml", "w") as f:
    yaml.dump({
        "batchsize": 0,
        "beam-size": 5,
        "ctc-weight": 0.4,
        "lm-weight": 0.6,
        "maxlenratio": 0.0,
        "minlenratio": 0.0,
        "penalty": 0.0,
    }, f, Dumper=yaml.Dumper)

# let's recognize generated speech
!../../../utils/recog_wav.sh --models librispeech.transformer.v1 \
    --decode_config conf/decode_sample.yaml \
    decode/example/wav/example_16k.wav

## Next steps

- Try other recipes
- Make your own recipe
- Add your original model architecture