This branch is 7 commits ahead, 260 commits behind TensorSpeech:master.


😋 TensorflowTTS


Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

🤪 TensorflowTTS provides real-time, state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize further with fake-quantization-aware training and pruning, and make TTS models that run faster than real time and can be deployed on mobile devices or embedded systems.

What's new

  • 2020/08/05 (NEW!) Support Korean TTS. Please see the colab. Thanks to @crux153.
  • 2020/07/17 Support multi-GPU for all trainers.
  • 2020/07/05 Support converting Tacotron-2 and FastSpeech to TFLite. Please see the colab. Thanks to @jaeyoo from the TFLite team for his support.
  • 2020/06/20 FastSpeech2 implementation with Tensorflow is supported.
  • 2020/06/07 Multi-band MelGAN (MB MelGAN) implementation with Tensorflow is supported.


  • High performance on speech synthesis.
  • Can be fine-tuned on other languages.
  • Fast, scalable, and reliable.
  • Suitable for deployment.
  • Easy to implement new models based on the abstract classes.
  • Mixed precision to speed up training where possible.
  • Supports both single and multi-GPU training in the base trainer class.
  • TFLite conversion for all supported models.


This repository is tested on Ubuntu 18.04 with:

Other TensorFlow versions should work but have not been tested yet. This repo tries to track the latest stable TensorFlow version. We recommend installing TensorFlow 2.3.0 for training if you want to use multi-GPU.


With pip

$ pip install TensorflowTTS

From source

Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as shown below.

$ git clone
$ cd TensorFlowTTS
$ pip install .

If you want to upgrade the repository and its dependencies:

$ git pull
$ pip install --upgrade .

Supported Model Architectures

TensorflowTTS currently provides the following architectures:

  1. MelGAN released with the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
  2. Tacotron-2 released with the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
  3. FastSpeech released with the paper FastSpeech: Fast, Robust and Controllable Text to Speech by Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
  4. Multi-band MelGAN released with the paper Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
  5. FastSpeech2 released with the paper FastSpeech 2: Fast and High-Quality End-to-End Text to Speech by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.

We also implement some techniques to improve quality and convergence speed from the following papers:

  1. Multi Resolution STFT Loss released with the paper Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
  2. Guided Attention Loss released with the paper Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention by Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara.

Audio Samples

Here are audio samples on the validation set: tacotron-2, fastspeech, melgan, melgan.stft, fastspeech2, multiband_melgan

Tutorial End-to-End

Prepare Dataset

Prepare a dataset in the following format:

|- datasets/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...

where metadata.csv has the following format: id|transcription. This is an LJSpeech-like format; you can ignore the preprocessing steps if your dataset is in a different format.
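As a rough illustration of the `id|transcription` layout described above, the following sketch parses such lines into pairs. The function name `parse_metadata` is hypothetical and is not part of TensorflowTTS; the preprocessing tools read this file for you.

```python
def parse_metadata(lines):
    """Parse `id|transcription` lines into (utt_id, transcription) pairs.

    `lines` is any iterable of strings, e.g. an open text file.
    Hypothetical helper for illustration only; not part of TensorflowTTS.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first "|" only, so the transcription may contain "|".
        utt_id, transcription = line.split("|", 1)
        pairs.append((utt_id, transcription))
    return pairs


example = ["file1|Hello world.\n", "file2|A second sentence.\n"]
pairs = parse_metadata(example)
```

With a real dataset, the same function could be called on `open("./datasets/metadata.csv", encoding="utf-8")`.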


The preprocessing has two steps:

  1. Preprocess audio features
    • Convert characters to IDs
    • Compute mel spectrograms
    • Normalize mel spectrograms to [-1, 1] range
    • Split dataset into train and validation
    • Compute mean and standard deviation of multiple features from the training split
  2. Standardize mel spectrogram based on computed statistics
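The statistics and standardization steps listed above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the code used by the preprocessing tools: the function names, the `[frames, n_mels]` array layout, and the `eps` guard are all assumptions made for the example.

```python
import numpy as np


def compute_stats(train_features):
    """Per-bin mean and std over all frames of the training split.

    train_features: list of arrays shaped [frames, n_mels] (assumed layout).
    """
    stacked = np.concatenate(train_features, axis=0)
    return stacked.mean(axis=0), stacked.std(axis=0)


def standardize(feature, mean, std, eps=1e-8):
    """Shift and scale one utterance using the training-split statistics."""
    return (feature - mean) / (std + eps)


# Two toy "utterances" with 4 mel bins each.
train = [np.zeros((2, 4)), 2.0 * np.ones((2, 4))]
mean, std = compute_stats(train)
normalized = standardize(train[1], mean, std)
```

Computing the statistics on the training split only, and then reusing them for the validation split, keeps validation data from influencing the normalization.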

To reproduce the steps above:

tensorflow-tts-preprocess --rootdir ./datasets --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
tensorflow-tts-normalize --rootdir ./dump --outdir ./dump --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech

Right now we only support ljspeech and kss for the --dataset argument. In the future, we intend to support more datasets.

After preprocessing, the structure of the project folder should be:

|- datasets/
|   |- metadata.csv
|   |- wav/
|       |- file1.wav
|       |- ...
|- dump/
|   |- train/
|       |- ids/
|           |- LJ001-0001-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0001-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0001-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0001-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0001-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0001-wave.npy
|           |- ...
|   |- valid/
|       |- ids/
|           |- LJ001-0009-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0009-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0009-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0009-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0009-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0009-wave.npy
|           |- ...
|   |- stats.npy
|   |- stats_f0.npy
|   |- stats_energy.npy
|   |- train_utt_ids.npy
|   |- valid_utt_ids.npy
|- examples/
|   |- melgan/
|   |- fastspeech/
|   |- tacotron2/
|   ...
  • stats.npy contains the mean and std from the training split mel spectrograms
  • stats_energy.npy contains the mean and std of energy values from the training split
  • stats_f0.npy contains the mean and std of F0 values in the training split
  • train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance IDs, respectively

We use a suffix (ids, raw-feats, raw-energy, raw-f0, norm-feats, or wave) in each file name to indicate the type of input.
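To make the naming convention concrete, here is a small sketch that builds the path of a dumped feature file from the folder layout shown in the tree above. The helper `feature_path` and the mapping table are hypothetical and only mirror that listing; note that the `raw-energies` and `wavs` folders use the file suffixes `raw-energy` and `wave`.

```python
from pathlib import Path

# Folder name -> file-name suffix, copied from the directory tree above.
SUFFIX_BY_FOLDER = {
    "ids": "ids",
    "raw-feats": "raw-feats",
    "raw-f0": "raw-f0",
    "raw-energies": "raw-energy",
    "norm-feats": "norm-feats",
    "wavs": "wave",
}


def feature_path(dump_dir, split, utt_id, folder):
    """Return the expected .npy path for one utterance and feature type.

    Hypothetical helper for illustration; not part of TensorflowTTS.
    """
    suffix = SUFFIX_BY_FOLDER[folder]
    return Path(dump_dir) / split / folder / f"{utt_id}-{suffix}.npy"


path = feature_path("dump", "train", "LJ001-0001", "raw-energies")
```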


  • This preprocessing step is based on ESPnet, so you can combine all models here with other models from the ESPnet repository.

Training models

To learn how to train a model from scratch or fine-tune it on other datasets/languages, please see the details in the examples directory.

Abstract Class Explanation

Abstract DataLoader Tensorflow-based dataset

A detailed implementation of the abstract dataset class is in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand:

  1. get_args: Returns the arguments for the generator function, normally utt_ids.
  2. generator: Takes its inputs from get_args and returns the inputs for the model. Note that every generator function returns a dictionary whose keys exactly match the parameters of the model, because base_trainer uses model(**batch) for the forward step.
  3. get_output_dtypes: Returns the dtype of each element produced by the generator function.
  4. get_len_dataset: Returns the length of the dataset, normally len(utt_ids).
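The contract between these four functions and the model can be sketched with a toy class. This is a plain-Python stand-in under loud assumptions: it does not subclass the real abstract dataset, the names `ToyCharDataset`, `toy_model`, `input_ids`, and `input_lengths` are hypothetical, and it returns NumPy dtypes where a real implementation would presumably return TensorFlow dtypes.

```python
import numpy as np


class ToyCharDataset:
    """Hypothetical stand-in that mirrors the four methods described above."""

    def __init__(self, utt_ids, char_ids):
        self.utt_ids = utt_ids
        self.char_ids = char_ids  # dict: utt_id -> 1-D integer array

    def get_args(self):
        # Arguments handed to `generator`.
        return [self.utt_ids]

    def generator(self, utt_ids):
        for utt_id in utt_ids:
            ids = self.char_ids[utt_id]
            # Keys must match the model's parameter names exactly.
            yield {"input_ids": ids, "input_lengths": len(ids)}

    def get_output_dtypes(self):
        return {"input_ids": np.int32, "input_lengths": np.int32}

    def get_len_dataset(self):
        return len(self.utt_ids)


def toy_model(input_ids, input_lengths):
    """Stand-in model whose parameters match the generator's keys."""
    return int(np.sum(input_ids)), input_lengths


dataset = ToyCharDataset(["a", "b"], {"a": np.array([1, 2, 3]), "b": np.array([4])})
items = list(dataset.generator(*dataset.get_args()))
output = toy_model(**items[0])  # same pattern as model(**batch)
```

If a key yielded by `generator` did not match a parameter of the model, the `model(**batch)` call would raise a `TypeError`, which is why the names have to line up.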


  • The dataset pipeline should be: cache -> shuffle -> map_fn -> get_batch -> prefetch.
  • If you shuffle before cache, the dataset won't be reshuffled when it is iterated again.
  • Apply map_fn so that all elements returned by the generator function have the same length before batching them and feeding them into a model.
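The "same length before batching" requirement can be illustrated with a small NumPy padding sketch. This is not the repository's map_fn or batching code; `pad_batch` is a hypothetical helper, and padding with zeros is an assumption made for the example.

```python
import numpy as np


def pad_batch(sequences, pad_value=0):
    """Right-pad 1-D sequences to a common length and stack them.

    Hypothetical helper for illustration; returns an array of shape
    [batch_size, max_len].
    """
    max_len = max(len(seq) for seq in sequences)
    dtype = np.asarray(sequences[0]).dtype
    batch = np.full((len(sequences), max_len), pad_value, dtype=dtype)
    for row, seq in enumerate(sequences):
        batch[row, : len(seq)] = seq
    return batch


batch = pad_batch([np.array([1, 2, 3]), np.array([4])])
```

Without this step, sequences of different lengths cannot be stacked into a single rectangular array for the model.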

Some examples to use this abstract_dataset are,,,

Abstract Trainer Class

A detailed implementation of base_trainer is in tensorflow_tts/trainer/. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support both single and multi-GPU training. There are some functions you MUST override when implementing a new trainer:

  • compile: Defines the models and losses.
  • generate_and_save_intermediate_result: Saves intermediate results such as alignment plots, generated audio, and mel-spectrogram plots.
  • compute_per_example_losses: Computes the per-example loss for the model. Note that every element of the loss MUST have shape [batch_size].
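The shape requirement on per-example losses can be sketched as follows. This NumPy example is only an illustration of reducing over every axis except the batch axis; the function name `per_example_mae` and the `[batch, frames, n_mels]` layout are assumptions, and the trainers themselves operate on TensorFlow tensors.

```python
import numpy as np


def per_example_mae(target, prediction):
    """Mean absolute error per example.

    target, prediction: arrays shaped [batch, frames, n_mels] (assumed).
    Returns an array shaped [batch]: one loss value per example.
    """
    abs_err = np.abs(target - prediction)
    # Flatten everything except the batch axis, then average per row.
    return abs_err.reshape(abs_err.shape[0], -1).mean(axis=1)


target = np.zeros((2, 3, 4))
prediction = np.stack([np.ones((3, 4)), 2.0 * np.ones((3, 4))])
loss = per_example_mae(target, prediction)
```

Keeping one value per example, rather than a single scalar, presumably lets the trainer average correctly over the global batch when the batch is split across several GPUs.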

All models in this repo are trained with either GanBasedTrainer or Seq2SeqBasedTrainer; see the training scripts in the examples directory.

End-to-End Examples

You can learn how to run inference with each model in the notebooks, or see the colab (for English) or the colab (for Korean). Here is example code for end-to-end inference with FastSpeech and MelGAN.

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.processor import LJSpeechProcessor

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel

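# initialize fastspeech model.
fs_config = AutoConfig.from_pretrained('./examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(
    config=fs_config,
    # Assumption: the checkpoint path is missing from the source; use your own weights file.
    pretrained_path="<path/to/fastspeech/weights>.h5",
)

# initialize melgan model
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(
    config=melgan_config,
    # Assumption: the checkpoint path is missing from the source; use your own weights file.
    pretrained_path="<path/to/melgan/generator/weights>.h5",
)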

# inference
processor = LJSpeechProcessor(None, cleaner_names="english_cleaners")

ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
ids = tf.expand_dims(ids, 0)
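# fastspeech inference
# Assumption: the input arguments are missing from the source. Passing ids and a
# zero speaker id is a guess; check fastspeech.inference in your installed version.
masked_mel_before, masked_mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.constant([1.0], dtype=tf.float32),
)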

# melgan inference
audio_before = melgan.inference(masked_mel_before)[0, :, 0]
audio_after = melgan.inference(masked_mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")


Minh Nguyen Quan Anh, erogol, Kuan Chen, Takuya Ebata, Trinh Le Quang


Overall, almost all models here are licensed under Apache 2.0 for all countries in the world, except in Viet Nam, where this framework cannot be used for production in any way without permission from TensorflowTTS's authors. There is one exception: Tacotron-2 can be used for any purpose. So if you are Vietnamese and want to use this framework for production, you must contact us in advance.


We would like to thank Tomoki Hayashi, who discussed MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron with us at length. This framework is based on his great open-source ParallelWaveGAN project.


😝 TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, Korean, Chinese)



