## TTS demo for ESPnet-Easy!
In this notebook, we will demonstrate how to train a Text-to-Speech (TTS) model using the LJSpeech dataset. The basic flow of data preparation and training is the same as for ASR.

Before proceeding, please make sure that you have downloaded the LJSpeech dataset from [here](https://keithito.com/LJ-Speech-Dataset/) and extracted it to a directory of your choice. In this notebook, we assume that the dataset is stored under `/hdd/database/`, so that the extracted files are in `/hdd/database/LJSpeech-1.1`. If your dataset is located elsewhere, replace the value of `LJS_DIRS` in the code below with the actual path to your dataset.

### Data preparation

First, let's create dump files!  
The format of the dump files is the same as the ASR dump files.

```python
{
    "data_name": ["dump_file_name", "dump_format"]
}
```
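
To make the mapping concrete, here is a minimal sketch of what such dump files could look like on disk. It assumes the Kaldi-style `<utterance_id> <value>` layout that ESPnet recipes commonly use; `write_dump_files` is a hypothetical helper written for illustration, not the ESPnet-Easy implementation, and the utterance ID and path are made up.

```python
import os
import tempfile


# Hypothetical helper, for illustration only: it mimics a Kaldi-style
# "<utterance_id> <value>" layout and is not the ESPnet-Easy implementation.
def write_dump_files(dump_dir, dataset, data_info):
    os.makedirs(dump_dir, exist_ok=True)
    for data_name, (file_name, _dump_format) in data_info.items():
        with open(os.path.join(dump_dir, file_name), "w", encoding="utf-8") as f:
            for utt_id, entry in dataset.items():
                f.write(f"{utt_id} {entry[data_name]}\n")


example_info = {"speech": ["wav.scp", "sound"], "text": ["text", "text"]}
example_dataset = {
    # Made-up utterance ID and path.
    "utt0001": {"speech": "/path/to/utt0001.wav", "text": "hello world"},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    write_dump_files(tmp_dir, example_dataset, example_info)
    with open(os.path.join(tmp_dir, "wav.scp"), encoding="utf-8") as f:
        wav_scp_lines = f.read().splitlines()
    with open(os.path.join(tmp_dir, "text"), encoding="utf-8") as f:
        text_lines = f.read().splitlines()

print(wav_scp_lines)
print(text_lines)
```

Each key of the mapping names a data field, the first list element is the file that holds it, and the second element tells the loader how to read the values.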

In [None]:
import os
import espnetez as ez


DUMP_DIR = "./dump/ljspeech"
LJS_DIRS = "/hdd/database/LJSpeech-1.1"
data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}

# prepare dataset
train_dataset = {}
test_dataset = {}
with open(os.path.join(LJS_DIRS, "metadata.csv"), "r") as f:
    lines = f.readlines()

for t in lines[:-50]:
    k, _, text = t.strip().split("|", 2)
    train_dataset[k] = dict(text=text)

for t in lines[-50:]:
    k, _, text = t.strip().split("|", 2)
    test_dataset[k] = dict(text=text)


# set speech path
for k in train_dataset.keys():
    train_dataset[k]['speech'] = os.path.join(LJS_DIRS, "wavs", f"{k}.wav")

for k in test_dataset.keys():
    test_dataset[k]['speech'] = os.path.join(LJS_DIRS, "wavs", f"{k}.wav")


train_dir = os.path.join(DUMP_DIR, "train")
test_dir = os.path.join(DUMP_DIR, "test")

# create the output directories if they do not exist yet
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)

ez.data.create_dump_file(train_dir, train_dataset, data_info)
ez.data.create_dump_file(test_dir, test_dataset, data_info)
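
The parsing in the cell above relies on the layout of `metadata.csv`, where each line holds three pipe-separated fields: the utterance ID, the raw transcription, and the normalized transcription. The sketch below shows how `split("|", 2)` picks these apart and how the last lines can be held out for testing. The sample lines are made up for illustration, and `split_train_test` is a hypothetical helper.

```python
# Made-up lines in the "id|raw transcript|normalized transcript" layout.
sample_lines = [
    "LJ000-0000|Dr. Smith paid $5.|Doctor Smith paid five dollars.\n",
    "LJ000-0001|Hello world.|Hello world.\n",
    "LJ000-0002|It costs 3 pounds.|It costs three pounds.\n",
]


def parse_metadata_line(line):
    # maxsplit=2 keeps any further "|" characters inside the last field.
    utt_id, _raw_text, normalized_text = line.strip().split("|", 2)
    return utt_id, normalized_text


def split_train_test(lines, n_test):
    # Hypothetical helper: hold out the last n_test lines for testing.
    train = dict(parse_metadata_line(line) for line in lines[:-n_test])
    test = dict(parse_metadata_line(line) for line in lines[-n_test:])
    return train, test


train_texts, test_texts = split_train_test(sample_lines, n_test=1)
print(train_texts)
print(test_texts)
```

Using the normalized field means that numbers and abbreviations are already spelled out before the text reaches the tokenizer.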

### Generate token list

To generate a token list, we need to run the `espnet2.bin.tokenize_text` script.
ESPnet-Easy provides a wrapper function for this script.

In [None]:
# collect the training transcripts into data/train.txt
# you can pass several text files if you want to build the token list from more than one dataset.
ez.preprocess.prepare_sentences(["dump/ljspeech/train/text"], "data/")
ez.preprocess.tokenize(
    input="data/train.txt",
    output="data/tokenized.txt",
    token_type="phn",
    cleaner="tacotron",
    g2p="g2p_en"
)
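
Conceptually, a token list is the set of distinct tokens seen in the training text plus a few special symbols. The sketch below illustrates the idea under stated assumptions: the phoneme sequences are hypothetical stand-ins for g2p output, `build_token_list` is a hypothetical helper, and the special symbols (`<blank>`, `<unk>`, `<sos/eos>`) follow the convention seen in ESPnet recipes. The file written by the wrapper above may differ in ordering and content.

```python
from collections import Counter

# Hypothetical phoneme sequences standing in for the g2p output.
phoneme_sequences = [
    ["HH", "AH0", "L", "OW1"],
    ["W", "ER1", "L", "D"],
]


def build_token_list(sequences):
    counts = Counter(token for seq in sequences for token in seq)
    # Sort by descending frequency, then alphabetically, for a stable order.
    tokens = sorted(counts, key=lambda t: (-counts[t], t))
    # Assumed special symbols, following the ESPnet recipe convention.
    return ["<blank>", "<unk>"] + tokens + ["<sos/eos>"]


token_list = build_token_list(phoneme_sequences)
token_to_id = {token: index for index, token in enumerate(token_list)}
print(token_list)
```

The position of each token in the list becomes its integer ID, which is why the same list has to be used for training and inference.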

### Training

Before training, run the `collect_stats` method to prepare the stats files. This step is required: the training process depends on the statistics it collects.

In [None]:
EXP_DIR = "exp/train_tts"
STATS_DIR = "exp/stats"

# load config
training_config = ez.config.from_yaml(
    "tts",
    "tacotron2.yaml",
)
with open("data/tokenized.txt", "r") as f:
    training_config["token_list"] = [t.replace("\n", "") for t in f.readlines()]

# Define the Trainer class
trainer = ez.Trainer(
    task='tts',
    train_config=training_config,
    train_dump_dir="dump/ljspeech/train",
    valid_dump_dir="dump/ljspeech/test",
    data_info=data_info,
    output_dir=EXP_DIR,
    stats_dir=STATS_DIR,
    ngpu=1,
)
trainer.collect_stats()
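
As a rough illustration of what such a statistics pass involves, the sketch below records the shape of each utterance and accumulates a global mean and standard deviation per feature dimension. This assumes the stats resemble those gathered in ESPnet2 recipes (shape files and feature statistics for normalization). The feature values and `collect_feature_stats` are hypothetical, and the real `collect_stats` output is not reproduced here.

```python
import math

# Hypothetical per-utterance feature matrices (frames x feature_dim).
features = {
    "utt1": [[1.0, 2.0], [3.0, 4.0]],
    "utt2": [[5.0, 6.0]],
}


def collect_feature_stats(feats):
    dim = len(next(iter(feats.values()))[0])
    count = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    shapes = {}
    for utt_id, frames in feats.items():
        shapes[utt_id] = (len(frames), dim)
        for frame in frames:
            count += 1
            for d, value in enumerate(frame):
                total[d] += value
                total_sq[d] += value * value
    mean = [s / count for s in total]
    # Population standard deviation from the sum and the sum of squares.
    std = [
        math.sqrt(max(sq / count - m * m, 0.0))
        for sq, m in zip(total_sq, mean)
    ]
    return shapes, mean, std


shapes, mean, std = collect_feature_stats(features)
print(shapes, mean, std)
```

Accumulating sums in a single pass avoids holding every feature matrix in memory at once, which matters for a dataset with thousands of utterances.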

Finally, we are ready to begin the training process!

In [None]:
trainer.train()

### Inference
For inference, you can use ESPnet's standard inference API directly.

In [None]:
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# load the config and checkpoint from the training output directory (EXP_DIR)
m = Text2Speech(
    "./exp/train_tts/config.yaml",
    "./exp/train_tts/valid.loss.ave.pth",
)

text = "hello world"
output = m(text)['wav']
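# The sample rate must match the rate the model was trained at.
# LJSpeech audio is distributed at 22050 Hz, so adjust 16000 if you did not resample.
sf.write("output.wav", output, 16000)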
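
The sample rate used when writing the output should match the rate of the audio the model was trained on. One way to check a training file's rate is Python's standard `wave` module, as in the minimal sketch below. The tone written here is synthetic and exists only to make the example self-contained; `read_sample_rate` is a hypothetical helper that you would point at one of your own wav files.

```python
import math
import os
import struct
import tempfile
import wave


def read_sample_rate(path):
    # Hypothetical helper: return the sample rate stored in a wav header.
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


# Synthetic 0.1 s, 440 Hz tone at 22050 Hz, purely for illustration.
sample_rate = 22050
samples = [
    int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(sample_rate // 10)
]

with tempfile.TemporaryDirectory() as tmp_dir:
    wav_path = os.path.join(tmp_dir, "tone.wav")
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    detected_rate = read_sample_rate(wav_path)

print(detected_rate)
```

Writing audio with a rate different from the one it was generated at changes both the pitch and the playback speed, so it is worth verifying before listening to the results.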