Replies: 19 comments
>>> nmstoker
[September 19, 2019, 12:52pm]
I'm keen to discuss what people have been considering in regard to data
and training approaches to improve voice quality (naturalness of audio)
and overall capabilities.
I've read the wiki Dataset page and played around with the notebooks, and
they were helpful. I also realise that a big improvement comes from
increasing the size of my dataset (it got radically better between 6 hrs
and when I got it to 12-13 hrs), and I'm pushing on to increase that
further, but I also wanted to think about how best to direct my efforts.
Phoneme coverage, as mentioned on the wiki, seems critical, so I've
started gathering stats on how well (or poorly!) my dataset represents
general English speech. I'm also looking at how well the Espeak backend
converts the words in my dataset to phonemes, since if it produces
pronunciations that are wrong or markedly different from those in my
dataset, it will undermine the model's ability to learn well.
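The coverage stats I have in mind can be sketched roughly as below. This is a minimal, hypothetical example: the phoneme strings and the reference inventory are made up for illustration (in practice they would come from running espeak or phonemizer over the actual transcripts), and only the counting logic is the point.

```python
from collections import Counter

# Hypothetical phonemized transcripts, as might come out of an
# espeak/phonemizer pass over the dataset (ARPAbet-style tokens here).
phonemized = [
    "HH AH L OW",
    "W ER L D",
    "HH AH L OW W ER L D",
]

# A (deliberately truncated) reference inventory to measure coverage against;
# a real run would use the full phoneme set for the target language.
reference_inventory = {"HH", "AH", "L", "OW", "W", "ER", "D", "TH", "ZH"}

# Count how often each phoneme appears across the whole dataset.
counts = Counter(p for line in phonemized for p in line.split())

covered = set(counts) & reference_inventory
missing = reference_inventory - set(counts)
coverage = len(covered) / len(reference_inventory)

print(f"coverage: {coverage:.0%}, missing: {sorted(missing)}")
```

Rare or missing phonemes flagged this way are the ones worth targeting when recording new sentences.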
One area I'm particularly keen to hear the thoughts of others on is
whether there's any advantage to the following:
1. Initially training with a much simpler subset of my data
2. Then fine-tuning with the broader set
My (naïve) intuition here is that babies start with simple words and
build up. I could limit training sentences to those under a certain
character length, or better still to single short words (although my
dataset is probably a little skewed there, as I haven't really got many
single-word sentences). Has anyone tried something similar, or seen
commentary on this kind of thing elsewhere?
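The staged approach above could be sketched as a simple length-based split of the metadata. This is just an assumption about how one might do it: the `(text, wav_path)` row format and the character-length thresholds are hypothetical, not anything from the TTS repo itself.

```python
# Hypothetical curriculum split: train first on short utterances, then widen
# to the full dataset in later stages.
def curriculum_stages(rows, caps=(30, 80, None)):
    """Yield successively broader subsets; a cap of None means no limit."""
    for cap in caps:
        if cap is None:
            yield list(rows)  # final stage: everything
        else:
            yield [r for r in rows if len(r[0]) <= cap]

rows = [
    ("Yes.", "0001.wav"),
    ("Thanks very much.", "0002.wav"),
    ("A considerably longer sentence that only joins in the final stage of training.",
     "0003.wav"),
]
stages = list(curriculum_stages(rows))
```

Each stage's subset would then be used for one training phase before fine-tuning continues on the next, broader one.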
[This is an archived TTS discussion thread from discourse.mozilla.org/t/data-and-training-considerations-to-improve-voice-naturalness]