Replies: 6 comments
[The six replies, alternating between georroussos and julian.weber, were not captured in this archive.]
>>> julian.weber
[September 24, 2020, 1:42pm]
Hi,
I'm training a model in French to do voice cloning, and I'm using
CorentinJ's encoder to compute the speaker embeddings. I started by training
on the 5 voices of the French M-AILABS dataset and got good results. It's not
sufficient to reproduce accent or anything, but the generated voice is
intelligible and the pitch matches the cloned speaker.
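For context, this is roughly how I compute the embeddings; a minimal sketch assuming the CorentinJ Real-Time-Voice-Cloning repo layout, with the checkpoint and wav paths as placeholders:

```python
from pathlib import Path

# CorentinJ/Real-Time-Voice-Cloning speaker encoder
from encoder import inference as encoder

# Pretrained encoder checkpoint (placeholder path, adjust to your setup)
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# Resample/trim the reference audio, then compute the 256-dim speaker embedding
wav = encoder.preprocess_wav(Path("speaker_reference.wav"))
embedding = encoder.embed_utterance(wav)
print(embedding.shape)  # (256,)
```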
To increase the fidelity of the cloning I tried adding the 'mix' folder
of the M-AILABS dataset (each chapter of a book is read by a different
speaker). The quality of the voice then started to go downhill quickly,
but I didn't notice since everything seemed fine on the tensorboard
(grey and orange curves).
I then (since the tensorboard looked fine) added the French part of the
Common Voice dataset, where a speaker only has around 4 voice samples
(12k different speakers). The tensorboard showed that the model
struggled to get the loss as low as before, but that was understandable
since the data is much noisier and every speaker has a different mic.
I was a bit horrified by the synthesized results: after 260k steps the
model can't form a single word of a sentence correctly, not even with
the checkpoint from before Common Voice. So I have a few questions:
- Is there a way to synthesize the test sentences with a specific
  speaker? Or test sentences at all, for that matter: even though I
  specified test_sentences_file in config.json, I couldn't see any of
  my test sentences in the tensorboard (it worked before on a
  single-speaker model).
- Is there a way to check that every sample has its speaker embedding
  computed? (I have my original speakers, but all other unidentified
  speakers are under the 'mix' speaker.) A rough check is sketched
  below, after the quoted config note.
- Is there some preprocessing that can help? Audio normalization?
  (In config.json we can read:)
> 'stats_path': null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler
> stats file computed by 'compute_statistics.py'. If it is defined,
> mean-std based normalization is used and other normalization params
> are ignored
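Regarding the second question, this is the kind of check I have in mind; the file names (metadata.csv, speaker_embeddings.json) and the JSON layout are only assumptions and depend on how the embeddings were exported:

```python
import csv
import json

# Assumed layout: a metadata file with "wav_file|text|speaker" rows and a JSON
# file mapping wav file names to {"name": "<speaker>", "embedding": [...]}.
with open("speaker_embeddings.json", encoding="utf-8") as f:
    embeddings = json.load(f)

missing, mislabeled = [], []
with open("metadata.csv", encoding="utf-8") as f:
    for wav_file, _text, speaker in csv.reader(f, delimiter="|"):
        entry = embeddings.get(wav_file)
        if entry is None:
            missing.append(wav_file)      # no embedding computed for this sample
        elif entry.get("name") != speaker:
            mislabeled.append(wav_file)   # embedding filed under another speaker (e.g. 'mix')

print(f"{len(missing)} samples without an embedding, {len(mislabeled)} with a mismatched speaker")
```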
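For the third question, one simple form of audio normalization would be peak-normalizing every clip before feature extraction; a minimal sketch with placeholder paths, using the soundfile package:

```python
import numpy as np
import soundfile as sf

def peak_normalize(in_path: str, out_path: str, target_peak: float = 0.95) -> None:
    """Rescale a clip so its peak amplitude equals target_peak (illustrative only)."""
    wav, sr = sf.read(in_path)
    peak = np.max(np.abs(wav))
    if peak > 0:
        sf.write(out_path, wav * (target_peak / peak), sr)

# Placeholder paths
peak_normalize("wavs/common_voice_fr_123.wav", "wavs_norm/common_voice_fr_123.wav")
```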
Thanks a lot for reading this post.
Note: For big datasets I find the eval interval impractical; I trained
for 260k steps and the last eval was around 190k. Wouldn't it be
interesting to set an eval interval based on training steps rather than
epochs?
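Something like this is what I have in mind; the names are hypothetical, not the actual TTS trainer code:

```python
# Hypothetical: trigger evaluation every N optimizer steps instead of once per
# epoch, so very large datasets still get regular eval points.
EVAL_EVERY_N_STEPS = 10_000

def should_evaluate(global_step: int, eval_every: int = EVAL_EVERY_N_STEPS) -> bool:
    """Return True when evaluation (and test-sentence synthesis) should run."""
    return global_step > 0 and global_step % eval_every == 0

# Inside the inner training loop, after each optimizer step:
#     if should_evaluate(global_step):
#         evaluate(model, eval_loader)          # hypothetical eval routine
#         synthesize_test_sentences(model)      # hypothetical
```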
[This is an archived TTS discussion thread from discourse.mozilla.org/t/help-for-training-a-multi-speaker-model-for-voice-cloning]