Training a TTS model from scratch and Finetuning on private data #1067

It takes around 15 minutes to train a single epoch of Glow TTS on a single K80 GPU. That means it could take around 10 days to train a TTS model from scratch on the LJSpeech dataset with a single GPU for 1000 epochs. Is this normal?

Yes
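
The ~10-day figure in the question follows directly from the per-epoch time. A quick sanity check of that arithmetic (numbers taken from the question itself):

```python
# Back-of-the-envelope estimate of total training time,
# using the figures quoted in the question above.
minutes_per_epoch = 15
epochs = 1000

total_minutes = minutes_per_epoch * epochs
total_days = total_minutes / 60 / 24

print(f"{total_days:.1f} days")  # ~10.4 days
```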

If I understood it correctly from the Humble FAQ: Would Tacotron 2 or Tacotron train much faster in comparison to GlowTTS?

Mostly

Is finetuning the fastest way to train on a small private dataset of another speaker (~1 hour of data)?

Yes

Given that my small private dataset is a male speaker, does it make sense to finetune on the LJSpeech dataset, which is a female speaker?

A male model would work better

Which one would produce better results: Finetuning on a p…

Answer selected by VigneshBaskar