Multispeaker Speech Synthesis with Feedback Constraint from Speaker Verification

This is a TensorFlow implementation of the multispeaker TTS network introduced in the paper From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint. The repository also contains the deep speaker verification model that serves as the feedback network in the multispeaker TTS model. Synthesized samples are available online.

Citation

@inproceedings{Cai2020,
  author={Zexin Cai and Chuxiong Zhang and Ming Li},
  title={{From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint}},
  year=2020,
  booktitle={Proc. Interspeech 2020}
}

Model Architecture

The speaker embedding network that provides the feedback constraint is a ResNet-based network; see the architecture figures in the paper for the full model.


Training

Speaker verification model

The speaker verification model is located in the deep_speaker directory. By default, it is trained on VoxCeleb 1 and VoxCeleb 2; the corresponding file list can be found in the same directory. Hyperparameters are set in vox12_hparams.py.

To train the speaker verification model from scratch, prepare the data as listed in the file list and run:

CUDA_VISIBLE_DEVICES=0 python train.py
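
Once trained, the model maps an utterance to a fixed-dimensional speaker embedding (a g-vector), and verification reduces to comparing two embeddings, typically with cosine similarity. The snippet below is a minimal illustration of that scoring step using NumPy; the 256-dimensional size and the random stand-in vectors are assumptions for demonstration only, since real embeddings would be extracted with the trained model (see deep_speaker/get_gvector.ipynb).

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random vectors stand in for real embeddings here (assumed 256-dim);
# in practice they come from the trained verification model.
rng = np.random.default_rng(0)
enroll_emb = rng.standard_normal(256)
test_emb = rng.standard_normal(256)

score = cosine_similarity(enroll_emb, test_emb)
# The accept/reject threshold is normally tuned on a development set.
print("verification score:", score)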

TTS synthesizer (without feedback control)

By default, the synthesizer is trained on the VCTK dataset.

  • Extract audio features using process_audio.ipynb (a simplified sketch of this step is shown after this list)

  • Extract speaker embeddings using the notebook deep_speaker/get_gvector.ipynb

  • Train a baseline multispeaker TTS system

    CUDA_VISIBLE_DEVICES=0 python synthesizer_train.py vctk datasets/vctk/synthesizer
  • Feel free to evaluate the model and synthesize samples with syn.ipynb during training
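
The feature extraction referenced in the first step produces mel-spectrograms for the synthesizer. The snippet below is only a simplified sketch of that kind of extraction using librosa; the sample rate, FFT size, hop length, and number of mel bins shown here are assumptions, and the repository's actual settings live in process_audio.ipynb and its hparams.

import numpy as np
import librosa

# Assumed parameters for illustration only.
SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 80

# A one-second dummy tone stands in for a real VCTK utterance.
t = np.linspace(0, 1, SR, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 220 * t)

mel = librosa.feature.melspectrogram(
    y=wav, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
log_mel = librosa.power_to_db(mel)      # shape: (n_mels, frames)
np.save("example_mel.npy", log_mel.T)   # stored as (frames, n_mels)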

Neural vocoder (WaveRNN)

By default, the vocoder is also trained on VCTK, and this step is straightforward once the acoustic features from the previous section (TTS synthesizer) have been extracted. For better performance, use ground-truth-aligned (GTA) mel-spectrograms, produced by vocoder_preprocess.py once synthesizer training is finished, so that the vocoder is trained on the same kind of spectrograms the synthesizer generates at inference time.

CUDA_VISIBLE_DEVICES=0 python vocoder_train.py -g --syn_dir datasets/vctk/synthesizer vctk datasets/vctk
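
WaveRNN-style vocoders predict the waveform sample by sample from the mel-spectrogram, and many implementations train on a quantized signal, often with mu-law companding. The snippet below is a generic sketch of mu-law encoding and decoding with NumPy, included only to illustrate that target representation; whether and how this repository quantizes audio (bit depth, mode, etc.) is determined by its vocoder hyperparameters.

import numpy as np

def mulaw_encode(x, bits=9):
    """Map float audio in [-1, 1] to integer classes in [0, 2**bits - 1]."""
    mu = 2 ** bits - 1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, bits=9):
    """Inverse of mulaw_encode, up to quantization error."""
    mu = 2 ** bits - 1
    y = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

wav = np.sin(np.linspace(0, 100, 16000))      # dummy 1-second signal
restored = mulaw_decode(mulaw_encode(wav))
print(np.max(np.abs(wav - restored)))         # small reconstruction error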

TTS synthesizer with feedback constraint

  • Set the paths to the two pretrained models (the speaker verification model and the multispeaker synthesizer) by changing the corresponding keys in hparams.py.

  • Train the model and evaluate it at any time with feedback_syn.ipynb (a conceptual sketch of the feedback loss follows this list)

    CUDA_VISIBLE_DEVICES=0 python fc_synthesizer_train.py
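
Conceptually, the feedback constraint passes the synthesized spectrogram through the frozen speaker verification network and penalizes the distance between the resulting embedding and the target speaker's embedding. The snippet below is only a conceptual sketch of such a loss using cosine similarity; speaker_encoder and the dummy tensors are placeholders rather than this repository's API, and the actual loss formulation is defined in fc_synthesizer_train.py and the paper.

import tensorflow as tf

def feedback_loss(mel_pred, target_embedding, speaker_encoder):
    """Hypothetical feedback-constraint loss.

    mel_pred:         synthesized mel-spectrograms, [batch, frames, n_mels]
    target_embedding: target speaker embeddings,    [batch, emb_dim]
    speaker_encoder:  callable wrapping the frozen verification model
                      (placeholder, not this repository's actual API).
    """
    pred_embedding = speaker_encoder(mel_pred)            # [batch, emb_dim]
    a = tf.nn.l2_normalize(pred_embedding, axis=-1)
    b = tf.nn.l2_normalize(target_embedding, axis=-1)
    cos_sim = tf.reduce_sum(a * b, axis=-1)               # [batch]
    return tf.reduce_mean(1.0 - cos_sim)

# Tiny demo with a dummy "encoder" (mean over time) just to exercise the code;
# the full objective would add this term to the usual synthesizer losses,
# e.g. total_loss = mel_loss + stop_token_loss + alpha * feedback_loss(...).
dummy_encoder = lambda mel: tf.reduce_mean(mel, axis=1)
mels = tf.random.normal([2, 100, 80])
targets = tf.random.normal([2, 80])
print(feedback_loss(mels, targets, dummy_encoder))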

Pretrained models


References and Resources
