GlowTTS - Single vs Multi Speaker - No Phoneme, Common Voice Dataset #2923

iprovalo · 2023-09-04T17:34:06Z

I am training a GlowTTS model for the Ukrainian language on the Common Voice dataset, without phonemes.

I have tried two models: a single-speaker model trained on about 8K samples, and a multi-speaker model with speaker embeddings trained on the entire set (~70K samples).

Average loss looks better for the multi speaker model (magenta):

At inference time, the single-speaker quality is significantly better than the multi-speaker quality. However, the single-speaker model still has a slight metallic sound.

I am using Multiband MelGAN as the vocoder. Training steps and times:

Single-speaker GlowTTS: 1M steps, 6 days
Multi-speaker GlowTTS: 2M steps, 16 days
Multiband MelGAN: 6M steps, 21 days
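
For a rough sense of how much training each acoustic model received relative to its dataset size, here is a minimal sketch of the arithmetic. It uses only the approximate figures above and assumes both models were trained with the same batch size, which is not stated here; the dictionary keys and helper names are made up for illustration.

```python
# Rough comparison of optimizer steps per training sample.
# Assumption: both acoustic models used the same batch size (not stated in the post).
# Sample counts are the approximate figures quoted above (~8K and ~70K).
runs = {
    "single_speaker": {"steps": 1_000_000, "samples": 8_000, "days": 6},
    "multi_speaker": {"steps": 2_000_000, "samples": 70_000, "days": 16},
}


def steps_per_sample(run):
    """Optimizer steps divided by dataset size (a proxy for passes over the data)."""
    return run["steps"] / run["samples"]


def steps_per_day(run):
    """Average training throughput in steps per day."""
    return run["steps"] / run["days"]


for name, run in runs.items():
    print(
        f"{name}: {steps_per_sample(run):.1f} steps/sample, "
        f"{steps_per_day(run):,.0f} steps/day"
    )
```

Under that assumption, the single-speaker run works out to about 125 steps per sample and the multi-speaker run to about 29, so the multi-speaker model may simply have seen each utterance far fewer times despite the larger step count.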

Any ideas about why the multi-speaker quality does not match the single-speaker quality?

Thank you!

Replies: 0 comments
