Replies: 6 comments
[The six replies, alternating between georroussos and julian.weber, were not captured in this archive.]
>>> julian.weber
[September 24, 2020, 1:42pm]
Hi,
I'm training a model in French to do voice cloning, and I'm using
CorentinJ's encoder to compute the speaker embeddings. I started by training
on the 5 voices of the French M-AILABS dataset and got good results. It's not
sufficient to reproduce accent or anything, but the generated voice is
intelligible and the pitch matches the cloned speaker.
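For context, this is roughly how I compute the embeddings; a minimal sketch assuming the CorentinJ Real-Time-Voice-Cloning repo layout, with the checkpoint and wav paths as placeholders:

```python
from pathlib import Path

# CorentinJ/Real-Time-Voice-Cloning speaker encoder
from encoder import inference as encoder

# Pretrained encoder checkpoint (placeholder path, adjust to your setup)
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# Resample/trim the reference audio, then compute the 256-dim speaker embedding
wav = encoder.preprocess_wav(Path("speaker_reference.wav"))
embedding = encoder.embed_utterance(wav)
print(embedding.shape)  # (256,)
```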
To increase the fidelity of the cloning I tried adding the 'mix' folder
of the M-AILABS dataset (each chapter of a book is read by a different
speaker). The quality of the voice then started to go downhill quickly,
but I didn't notice since everything seemed fine on the tensorboard
(grey and orange curves).
I then (since the tensorboard looked fine) added the French part of the
Common Voice dataset, where a speaker only has around 4 voice samples
(12k different speakers). The tensorboard showed that the model
struggled to get the loss as low as before, but that was understandable
since the data is much noisier and every speaker has a different mic.
I was a bit horrified by the synthesized results: after 260k steps the
model can't form a single word of a sentence correctly, not even with
the checkpoint from before Common Voice. So I have a few questions:
- Is there a way to synthesize the test sentences with a specific
  speaker? Or test sentences at all, for that matter: even though I
  specified test_sentences_file in config.json, I couldn't see any of
  my test sentences in the tensorboard (it worked before on a
  single-speaker model).
- Is there a way to check that every sample has its speaker embedding
  computed? (I have my original speakers, but all other unidentified
  speakers are under the 'mix' speaker.) A rough check is sketched
  below, after the quoted config note.
- Is there some preprocessing that can help? Audio normalization?
  (In config.json we can read:)
> 'stats_path': null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler
> stats file computed by 'compute_statistics.py'. If it is defined,
> mean-std based normalization is used and other normalization params
> are ignored
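Regarding the second question, this is the kind of check I have in mind; the file names (metadata.csv, speaker_embeddings.json) and the JSON layout are only assumptions and depend on how the embeddings were exported:

```python
import csv
import json

# Assumed layout: a metadata file with "wav_file|text|speaker" rows and a JSON
# file mapping wav file names to {"name": "<speaker>", "embedding": [...]}.
with open("speaker_embeddings.json", encoding="utf-8") as f:
    embeddings = json.load(f)

missing, mislabeled = [], []
with open("metadata.csv", encoding="utf-8") as f:
    for wav_file, _text, speaker in csv.reader(f, delimiter="|"):
        entry = embeddings.get(wav_file)
        if entry is None:
            missing.append(wav_file)      # no embedding computed for this sample
        elif entry.get("name") != speaker:
            mislabeled.append(wav_file)   # embedding filed under another speaker (e.g. 'mix')

print(f"{len(missing)} samples without an embedding, {len(mislabeled)} with a mismatched speaker")
```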
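For the third question, one simple form of audio normalization would be peak-normalizing every clip before feature extraction; a minimal sketch with placeholder paths, using the soundfile package:

```python
import numpy as np
import soundfile as sf

def peak_normalize(in_path: str, out_path: str, target_peak: float = 0.95) -> None:
    """Rescale a clip so its peak amplitude equals target_peak (illustrative only)."""
    wav, sr = sf.read(in_path)
    peak = np.max(np.abs(wav))
    if peak > 0:
        sf.write(out_path, wav * (target_peak / peak), sr)

# Placeholder paths
peak_normalize("wavs/common_voice_fr_123.wav", "wavs_norm/common_voice_fr_123.wav")
```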
Thanks a lot for reading this post.
Note: For big datasets I find the eval interval impractical; I trained
for 260k steps and the last eval was around 190k. Wouldn't it be
interesting to set an eval interval based on training steps rather than
epochs?
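Something like this is what I have in mind; the names are hypothetical, not the actual TTS trainer code:

```python
# Hypothetical: trigger evaluation every N optimizer steps instead of once per
# epoch, so very large datasets still get regular eval points.
EVAL_EVERY_N_STEPS = 10_000

def should_evaluate(global_step: int, eval_every: int = EVAL_EVERY_N_STEPS) -> bool:
    """Return True when evaluation (and test-sentence synthesis) should run."""
    return global_step > 0 and global_step % eval_every == 0

# Inside the inner training loop, after each optimizer step:
#     if should_evaluate(global_step):
#         evaluate(model, eval_loader)          # hypothetical eval routine
#         synthesize_test_sentences(model)      # hypothetical
```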
[This is an archived TTS discussion thread from discourse.mozilla.org/t/help-for-training-a-multi-speaker-model-for-voice-cloning]