>>> vocajon
[October 15, 2020, 12:48pm]
Hi,
Newbie here, so apologies if I'm missing the obvious. I am trying to
achieve the following and think the steps are as follows, but have a few
gaps. Please can someone help me fill in the gaps/answer the inline
questions in Italics? And of course suggest any steps that may be
missing. Maybe this can become a clear quick-start guide for this
particular aim. Thank you.
Aim: To install Mozilla TTS on a Linux machine and fine-tune a
pre-trained LJSpeech model with a new voice of my own.
Steps:
1. Install CUDA (this will allow the NVIDIA GPU to be used):
sudo apt-get install cuda
2. Install Mozilla TTS using the simple packaging method detailed
here:
https://github.com/mozilla/TTS/wiki/Released-Models
Using package:
https://github.com/reuben/TTS/releases/download/ljspeech-fwd-attn-pwgan/TTS-0.0.1+92aea2a-py3-none-any.whl
Check that audio synthesis is working by starting the server:
python3 -m TTS.server.server
Open a web page at http://localhost:5002. Enter some text and check that
the wav file is produced and plays OK.
Check CUDA is working by running:
python3 -m TTS.train
Before the usage prompt you should see some info about CUDA:
> Using CUDA: True
> Number of GPUs: 1
If it's working, Using CUDA should be True and Number of GPUs should
match what is installed in your machine.
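For the CUDA check in step 2, the same two values can probably be read straight from PyTorch without starting the trainer. This is only a minimal sketch. It assumes the TTS package pulled in PyTorch as a dependency, and the helper name `cuda_summary` is made up for illustration, not part of Mozilla TTS.

```python
def cuda_summary():
    """Return (cuda_available, gpu_count), or None if PyTorch is not installed."""
    try:
        import torch  # assumed to be installed alongside TTS
    except ImportError:
        return None
    available = torch.cuda.is_available()
    # device_count() is only meaningful when CUDA is usable at all
    count = torch.cuda.device_count() if available else 0
    return available, count


if __name__ == "__main__":
    summary = cuda_summary()
    if summary is None:
        print("PyTorch is not installed in this environment.")
    else:
        print(f"Using CUDA: {summary[0]}")
        print(f"Number of GPUs: {summary[1]}")
```

If this prints False while a GPU is installed, the driver or CUDA install from step 1 is the first thing I would look at.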
3. Record a set of wav files in the LJSpeech format: 22050 Hz, 16-bit,
mono WAV.
Recommended duration: single sentences of 5-10 seconds each.
Remove any silence at the beginning and end of each recording.
Normalize the audio level.
Minimum number of files/combined duration: Help please
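To catch recordings that are not in the format from step 3, something like the following could be run over the wavs folder. It is a sketch using only Python's standard `wave` module. The function names are my own, and it checks the header values and adds up durations only. It does not trim silence or normalize anything.

```python
import wave
from pathlib import Path

EXPECTED_RATE = 22050    # Hz, as listed in step 3
EXPECTED_WIDTH = 2       # bytes per sample, i.e. 16-bit
EXPECTED_CHANNELS = 1    # mono


def inspect_wav(path):
    """Return (problems, duration_seconds) for one WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    problems = []
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz")
    if width != EXPECTED_WIDTH:
        problems.append(f"sample width is {8 * width}-bit")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels")
    duration = frames / rate if rate else 0.0
    return problems, duration


def inspect_folder(wav_dir):
    """Check every .wav in wav_dir; return (report, total_seconds)."""
    report = {}
    total = 0.0
    for path in sorted(Path(wav_dir).glob("*.wav")):
        problems, duration = inspect_wav(path)
        total += duration
        if problems:
            report[path.name] = problems
    return report, total
```

Calling `inspect_folder("yourchosenname/wavs")` gives a dict of files with problems plus the combined duration in seconds, which is at least the number to quote when asking how much audio is enough.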
4. Store the wavs in a dataset folder with the following structure:
|- yourchosenname
   |- metadata.csv
   |- wavs
      |- xxxx-0001.wav
      |- xxxx-0002.wav
5. Create metadata.csv inside your dataset folder with the following
format:
xxxx-0001|There were 50 people in the room.|There were fifty people in the room.
xxxx-0002|Mr Jones is the friendly local butcher.|Mister Jones is the friendly local butcher.
Pipes separate the 3 columns, and the 3rd column expands numbers,
titles etc.
Help please - is it necessary to split this into metadata_train.csv and
metadata_val.csv? I saw suggestions that val should be about 10% of the
size of train.
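For steps 4 and 5, a small script could confirm that every metadata.csv row has three pipe-separated columns and a matching file under wavs, and could produce a train/val split if one turns out to be needed. This is a sketch, not something taken from the TTS repo. The function names are hypothetical, the 10% default just mirrors the suggestion mentioned above, and whether the trainer wants separate files at all is still the open question.

```python
import random
from pathlib import Path


def read_metadata(dataset_dir):
    """Parse metadata.csv into (file_id, raw_text, normalized_text) rows."""
    rows = []
    meta_path = Path(dataset_dir) / "metadata.csv"
    with open(meta_path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue  # ignore blank lines
            parts = line.split("|")
            if len(parts) != 3:
                raise ValueError(
                    f"line {line_no}: expected 3 columns, got {len(parts)}"
                )
            rows.append(tuple(parts))
    return rows


def missing_wavs(dataset_dir, rows):
    """Return the ids in metadata that have no wavs/<id>.wav file."""
    wav_dir = Path(dataset_dir) / "wavs"
    return [row[0] for row in rows if not (wav_dir / f"{row[0]}.wav").is_file()]


def split_rows(rows, val_fraction=0.1, seed=0):
    """Shuffle with a fixed seed and hold out val_fraction of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def write_rows(path, rows):
    """Write rows back out in the same pipe-separated format."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write("|".join(row) + "\n")
```

A typical run would be `rows = read_metadata("yourchosenname")`, then `missing_wavs("yourchosenname", rows)` to list orphaned ids, then `split_rows(rows)` and two `write_rows` calls for metadata_train.csv and metadata_val.csv.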
6. Pre-process. Please help - I think there should be some kind of
preprocessing stage here, but I don't see any .py for doing this in this
repo. Is it done automatically as part of the training? I'm guessing
so, as there are params such as do_trim_silence.
7. Prepare a config.json for your new dataset.
Please help - I am not sure what needs changing in here. I'm
thinking the following?
restore_path - set to the new dataset? But I'm not sure how to get
the dataset into .pth.tar format?
run_name - set to the new dataset name (although probably not
mandatory?)
run_description - describe the new dataset
mel_fmin - set to ~50 for male, ~95 for female
batch_size - 32 is standard. I understand there are issues with GPUs
that have smaller amounts of memory, and that you really need 16GB. Are
there any recommendations (e.g. drop this value) if trying to use a
4-8GB GPU? Or are you just wasting your time?
output_path - does this need changing?
datasets - set the name and path to match your new dataset path?
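One way to make the edits from step 7 repeatable is to apply them to the loaded config from a script rather than by hand. The sketch below works on a stand-in dictionary, not the real file. The key names are the ones listed above, but the nesting (for example `audio.mel_fmin`), the dataset entry layout and all values are assumptions to check against your own config.json. Also note that plain `json.load` rejects files containing `//` comments, so a real config may need those stripped first.

```python
import copy
import json


def apply_overrides(config, overrides):
    """Return a copy of config with dotted-path keys replaced.

    Raises KeyError for a path that does not already exist, so a typo
    cannot silently add a new setting that nothing reads.
    """
    updated = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = updated
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(dotted_key)
        node[leaf] = value
    return updated


# Stand-in for a loaded config.json; layout and values are assumptions.
base_config = {
    "run_name": "ljspeech",
    "run_description": "",
    "audio": {"mel_fmin": 0.0},
    "batch_size": 32,
    "output_path": "/path/to/output/",
    "datasets": [{"name": "ljspeech", "path": "/path/to/LJSpeech-1.1/"}],
}

new_config = apply_overrides(base_config, {
    "run_name": "yourchosenname",
    "run_description": "fine-tuning on my own recordings",
    "audio.mel_fmin": 50.0,  # ~50 for male, ~95 for female, per step 7
    "batch_size": 16,        # illustrative smaller value, not a recommendation
    "datasets": [{"name": "ljspeech", "path": "/path/to/yourchosenname/"}],
})

print(json.dumps(new_config, indent=2))
```

Failing on unknown keys is deliberate here: a misspelled setting would otherwise be written to the file and silently ignored during training.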
8. Fine-tune the model:
python3 -m TTS.train \
    --config_path TTS/tts/configs/config.json \
    --restore_path /path/to/your/model.pth.tar
9. Start the server as in step 2 and test the new voice.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/clear-process-for-generating-custom-voice]