GlowTTS - Single vs Multi Speaker - No Phoneme, Common Voice Dataset #2923
Unanswered
iprovalo asked this question in General Q&A
Replies: 0 comments
I am training a GlowTTS model for Ukrainian on the Common Voice dataset, without phonemes.
I have tried two models: a single-speaker model trained on about 8K samples, and a multi-speaker model with speaker embeddings trained on the entire set (~70K samples).
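For context, a single-speaker subset like the one above could be carved out of Common Voice roughly as follows. This is a minimal sketch, not the exact script used here: it assumes the standard Common Voice `validated.tsv` layout with a `client_id` column identifying the speaker, and the helper names and sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def clips_per_speaker(tsv_file):
    """Return a Counter mapping client_id -> number of clips."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["client_id"] for row in reader)


def select_speaker_rows(tsv_file, speaker_id):
    """Return the rows (as dicts) that belong to one speaker."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row["client_id"] == speaker_id]


# Tiny made-up sample standing in for validated.tsv (not real data).
SAMPLE_TSV = (
    "client_id\tpath\tsentence\n"
    "spk_a\tclip_1.mp3\tперше речення\n"
    "spk_b\tclip_2.mp3\tдруге речення\n"
    "spk_a\tclip_3.mp3\tтретє речення\n"
    "spk_c\tclip_4.mp3\tчетверте речення\n"
    "spk_a\tclip_5.mp3\tп'яте речення\n"
    "spk_c\tclip_6.mp3\tшосте речення\n"
)

# Count clips per speaker, then keep only the most prolific speaker.
counts = clips_per_speaker(io.StringIO(SAMPLE_TSV))
top_speaker, top_count = counts.most_common(1)[0]
single_speaker_rows = select_speaker_rows(io.StringIO(SAMPLE_TSV), top_speaker)

print(f"{len(counts)} speakers; top speaker {top_speaker} has {top_count} clips")
```

With a real file, the `io.StringIO(...)` arguments would be replaced by `open(path, newline="", encoding="utf-8")`. The same per-speaker counts also show how unevenly the ~70K clips are spread across speakers in the multi-speaker set.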
The average loss looks better for the multi-speaker model (the magenta curve in the loss plot).
At inference time, however, the single-speaker model sounds significantly better than the multi-speaker one, although it still has a slight metallic sound.
I am using Multi-Band MelGAN as the vocoder. Training steps/times:
Any ideas why the multi-speaker quality does not match the single-speaker quality?
Thank you!