Issue Training VITS on a custom Dataset #1704
Unanswered
rob1392 asked this question in General Q&A
Replies: 2 comments
What is your audio sample rate?
Dear @rob1392, did you manage to find what the problem was?
I have created a custom dataset from a variety of publicly accessible audio recordings and am having trouble training VITS to generate intelligible speech. The dataset has multiple speakers totaling ~200 hours, with the distribution of recording time per speaker looking like this:
I also tried just the top 6 speakers representing ~85 hours.
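For readers who want to reproduce this kind of per-speaker breakdown, here is a minimal sketch. It assumes the dataset can be listed as `(speaker, duration_in_seconds)` pairs; the function names `hours_per_speaker` and `top_speakers` are hypothetical and not part of any library.

```python
from collections import defaultdict


def hours_per_speaker(clips):
    """Sum clip durations per speaker and return (speaker, hours), largest first.

    `clips` is an iterable of (speaker, duration_in_seconds) pairs
    (an assumed manifest layout, for illustration only).
    """
    totals = defaultdict(float)
    for speaker, duration_s in clips:
        totals[speaker] += duration_s
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(speaker, seconds / 3600.0) for speaker, seconds in ranked]


def top_speakers(clips, k):
    """Return the k speakers with the most recorded time."""
    return [speaker for speaker, _ in hours_per_speaker(clips)[:k]]
```

Calling `top_speakers(clips, 6)` would give the kind of subset described above, and the output of `hours_per_speaker` shows how skewed the distribution is.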
I have used all of the same parameters from here: https://github.com/coqui-ai/TTS/blob/dev/recipes/vctk/vits/train_vits.py, with the main difference being that I am training on 8 V100 GPUs.
To create the dataset I followed the advice of the HiFiTTS paper (https://arxiv.org/abs/2104.01497): I used CTC to align speech and text and kept only the clips with an SNR of at least 20 dB.
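As a rough illustration of an SNR threshold like this, the sketch below estimates SNR from frames that some voice activity detector has already labeled as speech or non-speech. It is not necessarily the procedure used in the paper or in this dataset. The frame labels, the function names, and the reading that clips at or above 20 dB are the ones kept are all assumptions.

```python
import math


def estimate_snr_db(samples, is_speech, frame_len):
    """Rough SNR in dB: mean power of speech frames over mean power of non-speech frames.

    `is_speech` holds one boolean per frame, e.g. from a VAD (assumed to exist).
    Speech frames still contain noise, so this is strictly (S+N)/N.
    Returns None when either kind of frame is missing.
    """
    speech_power, noise_power = [], []
    for i, flag in enumerate(is_speech):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if not frame:
            break
        power = sum(x * x for x in frame) / len(frame)
        (speech_power if flag else noise_power).append(power)
    if not speech_power or not noise_power:
        return None
    p_speech = sum(speech_power) / len(speech_power)
    p_noise = sum(noise_power) / len(noise_power)
    if p_noise == 0.0:
        return math.inf
    if p_speech == 0.0:
        return -math.inf
    return 10.0 * math.log10(p_speech / p_noise)


def keep_clip(snr_db, min_snr_db=20.0):
    """Keep a clip only if its estimated SNR meets the threshold."""
    return snr_db is not None and snr_db >= min_snr_db
```

Logging the estimated SNR next to each clip, rather than only the keep/drop decision, makes it easier to see later how close the retained data sits to the threshold.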
The clips I spot-checked all sound good, and the distribution of audio lengths also looks reasonable:
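A spot check by ear can be complemented by a scan of the file headers. The sketch below tallies sample rates and collects durations using Python's standard `wave` module, assuming the clips are PCM WAV files. The function name `summarize_wavs` is hypothetical. A sample rate that differs from the one set in the training config is one thing worth ruling out.

```python
import wave
from collections import Counter


def summarize_wavs(paths):
    """Return (sample-rate counts, list of durations in seconds) for PCM WAV files."""
    rates = Counter()
    durations = []
    for path in paths:
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            rates[rate] += 1
            durations.append(wav.getnframes() / rate)
    return rates, durations
```

If `rates` contains more than one key, the dataset mixes sample rates and would need resampling to a single rate before training.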

I have trained these models for a relatively short time, but the loss curves appear to be plateauing at worse levels than those of successfully trained models such as VITS on VCTK (which I provide as a reference). My suspicion is therefore that more training time is not the answer and the problem lies elsewhere.
I have attached my TensorBoard charts here in the hope that someone can help me out.
The blue lines are the baseline VITS trained on VCTK and the red lines are the same VITS parameters trained on my custom dataset.
To my eyes, the discriminator is "winning" a lot more in the custom run, so maybe there is some tweaking that could be done there?
Please let me know if I am missing any information and thanks for the help!