Tacotron 2 (with HiFi-GAN)

PyTorch implementation of Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

This implementation includes distributed and automatic mixed precision support and uses the RUSLAN dataset.

Distributed and Automatic Mixed Precision support relies on NVIDIA's Apex and AMP.

Generated samples

https://soundcloud.com/andrey-nikishaev/sets/russian-tts-nvidia-tacotron2

New

  • Added diagonal guided attention (DGA) from Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention (https://arxiv.org/abs/1710.08969); see the loss sketch after this list
  • Added Maximizing Mutual Information for Tacotron (MMI) (https://arxiv.org/abs/1909.01145)
    • Couldn't make it work as shown in the paper
    • DGA still gives better and much cleaner results
  • Added Russian text preparation with a simple stress dictionary (e.g., za'mok vs. zamo'k)
  • Uses HiFi-GAN as the vocoder
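
A minimal sketch of the DGA penalty, assuming a (batch, mel_steps, text_steps) alignment tensor and the paper's default sharpness g = 0.2; the function and argument names are illustrative, not this repo's exact code:

```python
import torch

def guided_attention_loss(attn, text_lens, mel_lens, g=0.2):
    """Diagonal guided attention (arXiv:1710.08969): penalize attention
    mass far from the diagonal so the text-to-mel alignment stays monotonic.

    attn: (batch, mel_steps, text_steps) soft alignments from the decoder.
    text_lens, mel_lens: valid lengths per utterance (padding is excluded).
    """
    loss = attn.new_zeros(())
    for b in range(attn.size(0)):
        N, T = int(text_lens[b]), int(mel_lens[b])
        n = torch.arange(N, device=attn.device, dtype=attn.dtype) / N
        t = torch.arange(T, device=attn.device, dtype=attn.dtype) / T
        # W[t, n] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)), near zero on the diagonal
        W = 1.0 - torch.exp(-((n[None, :] - t[:, None]) ** 2) / (2.0 * g * g))
        loss = loss + (attn[b, :T, :N] * W).mean()
    return loss / attn.size(0)
```

This term is simply added to the Tacotron 2 loss with a small weight; it mostly matters early in training, before the alignment locks in.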

Prerequisites

  1. NVIDIA GPU with CUDA and cuDNN

Setup

  1. Download and extract the RUSLAN dataset
  2. Clone this repo: git clone https://github.com/creotiv/RussianTTS-Tacotron2.git
  3. cd into this repo: cd RussianTTS-Tacotron2
  4. Install PyTorch 1.0
  5. Install Apex
  6. Install Python requirements: pip install -r requirements.txt (or build the Docker image instead)

Training

  1. python train.py --output_directory=outdir --log_directory=logdir
  2. (OPTIONAL) tensorboard --logdir=outdir/logdir

Training using a pre-trained model

Training using a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored (a simplified sketch of this warm start follows the steps below).

  1. Download our published Ruslan Model or LJ Speech model
  2. python train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
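
Roughly what --warm_start does, mirroring the warm-start loading in the upstream NVIDIA code: load the checkpoint's weights but drop the layers on an ignore list (the text embedding by default, since RUSLAN's symbol set differs from LJ Speech's). The function below is an illustrative sketch, not this repo's exact implementation:

```python
import torch

def warm_start(model, checkpoint_path, ignore_layers=("embedding.weight",)):
    """Load pre-trained weights, keeping fresh values for ignored layers."""
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    state = {k: v for k, v in state.items() if k not in ignore_layers}
    merged = model.state_dict()   # freshly initialized weights
    merged.update(state)          # overwrite everything except ignored layers
    model.load_state_dict(merged)
    return model
```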

Multi-GPU (distributed) and Automatic Mixed Precision Training

  1. python -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True

Inference demo

  1. Download our published Ruslan Model or LJ Speech model
  2. Download published HiFi-GAN Model (Universal model recommended for non-English languages)
  3. jupyter notebook --ip=127.0.0.1 --port=31337
  4. Load inference.ipynb

N.B.: when performing mel-spectrogram to audio synthesis, make sure Tacotron 2 and the vocoder (HiFi-GAN) were trained on the same mel-spectrogram representation. A minimal sketch of the text-to-mel step follows.
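
A minimal text-to-mel sketch of what inference.ipynb does. The module layout (hparams.create_hparams, model.Tacotron2, text.text_to_sequence) is assumed from the upstream NVIDIA repo, and the cleaner name is a guess; check the notebook for the exact calls:

```python
import torch
from hparams import create_hparams   # layout assumed from NVIDIA/tacotron2
from model import Tacotron2
from text import text_to_sequence

hparams = create_hparams()
model = Tacotron2(hparams)
state = torch.load("tacotron2_statedict.pt", map_location="cpu")["state_dict"]
model.load_state_dict(state)
model.eval()

# "basic_cleaners" is an assumption; the repo may use Russian-specific cleaners.
text = "Привет, мир!"
sequence = torch.LongTensor(text_to_sequence(text, ["basic_cleaners"]))[None, :]
with torch.no_grad():
    _, mel_postnet, _, alignments = model.inference(sequence)
# mel_postnet is then fed to the HiFi-GAN generator to produce the waveform.
```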

Related repos

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Acknowledgements

This implementation uses code from NVIDIA/tacotron2.
