Generating MFA alignments #4803
Conversation
Hi @Fhrozen, I remember that training MFA models from scratch is definitely another option.
@kan-bayashi I am wondering which option would be better. This PR includes MFA training and would cover other recipes such as VCTK. Should the MFA option be kept separate as optional (described in the TTS README), or become a permanent option in all the TTS recipes?
I like this one.
@kan-bayashi Instead of executing … Let me know about it, and I will try to finish the PR.
It looks good :) |
I completed the code, so please let me know if you have any comments on it. I tested the code on LJ, VCTK, and tsukuyomi.
I will add a function later for downloading pre-trained models from HuggingFace (there are some languages that MFA currently does not support). A frontend was added in the lab generation for Japanese (and Korean and Chinese will probably require one as well).
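For context, MFA expects one plain-text `.lab` transcript per utterance, placed next to its `.wav` in the corpus directory. A minimal sketch of that lab-generation step is below; the helper name and directory layout are illustrative assumptions, not the PR's actual code:

```python
from pathlib import Path


def write_lab_files(transcripts, corpus_dir):
    """Write one MFA-style .lab transcript per utterance.

    transcripts: dict mapping utterance id -> normalized text.
    corpus_dir: directory where MFA expects <utt_id>.lab next to <utt_id>.wav.
    """
    corpus_dir = Path(corpus_dir)
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, text in transcripts.items():
        (corpus_dir / f"{utt_id}.lab").write_text(text + "\n", encoding="utf-8")


# Example with two utterances of normalized text (hypothetical paths/ids):
write_lab_files(
    {
        "LJ001-0001": "printing in the only sense",
        "LJ001-0002": "in being comparatively modern",
    },
    "data/local/mfa/corpus",
)
```

For languages such as Japanese, the text passed in here would first go through the frontend (g2p or normalization) before being written out.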
Codecov Report

```
@@           Coverage Diff           @@
##           master    #4803   +/-  ##
=======================================
  Coverage   79.18%   79.18%
=======================================
  Files         557      557
  Lines       49279    49279
=======================================
  Hits        39020    39020
  Misses      10259    10259
=======================================
```
Sorry for the late review.
The code looks great; a nice design that allows various recipes to use MFA.
And I'm glad to hear that Japanese case also works.
Once you complete this PR, let us merge it.
Finished adding some fixes related to the frame generation, caused by floating-point division.
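To illustrate the kind of floating-point issue mentioned above (this is a hedged sketch of the general technique, not the PR's actual fix; the function name and hop size are assumptions): rounding each phone duration to frames independently lets error accumulate, while rounding the cumulative boundaries keeps the total frame count exact.

```python
def durations_to_frames(durations_sec, hop_sec=0.0125):
    """Convert phone durations in seconds to integer frame counts.

    Rounding each duration separately accumulates floating-point error;
    rounding the cumulative end times instead keeps the total exact.
    """
    frames = []
    elapsed = 0.0
    prev_frame = 0
    for d in durations_sec:
        elapsed += d
        end_frame = round(elapsed / hop_sec)
        frames.append(end_frame - prev_frame)
        prev_frame = end_frame
    return frames


durs = [0.0317, 0.0317, 0.0317, 0.0317]
naive = [round(d / 0.0125) for d in durs]  # each rounds up to 3 -> 12 frames total
exact = durations_to_frames(durs)          # sums to round(0.1268 / 0.0125) = 10 frames
```

With naive per-phone rounding the total drifts by two frames here, which is exactly the kind of mismatch that breaks duration targets for models like FastSpeech 2.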
@kan-bayashi In the case of English, I also tested with VCTK, and some samples are here. The files were generated using:
For Japanese, I tested with JVS multispeaker + Tsukuyomi
The issue with Japanese is the vocoder (I suppose), so I am training on a larger set to train a vocoder and will test again.
Hi @Fhrozen -- thanks for integrating the MFA scripts. I wonder about … Also, the readme says:
I think we're supposed to run all (5) stages, right?
In the case of using espeak_ng_english_us_vits (which is one of the choices of the ESPnet frontend), you need to set up …
@kan-bayashi @Fhrozen There seems to be a naming conflict in the LibriTTS corpus, because MFA expects …
Also, in …
@iamanigeeit I do not understand the current issue.
As far as I can see, it uses normalized text to generate the annotations: espnet/egs/libritts/tts1/local/data_prep.sh Lines 57 to 58 in aa88f3a
espnet/egs2/libritts/tts1/local/data.sh Lines 78 to 81 in aa88f3a
The lab files are employed for phoneme alignment, I suppose.
The setup is for using pretrained ESPnet MFA models. I will update this soon to include MFA models stored on HuggingFace.
I think this is OK since they are in separate folders (MFA .lab in data/local/mfa/corpus/xx and LibriTTS .lab in downloads/LibriTTS/xx/xx). I was confused because they have the same name.
I see what you mean now, thanks. If …
@sw005320 @kan-bayashi @kamo-naoyuki
I prepared this draft based on PRs #4557 and #4801.
It is almost complete for general support of different datasets.
I only tested on LJSpeech, so some parts are hard-coded (stage 1 of the .sh script).
In this case, the code trains the MFA models from scratch.
I had some issues with the original PRs: some words did not obtain a phoneme list and instead got an `spn` token. Training the models allows using espnet2-based g2p, which reduces the `spn`s to one or none (I cannot remember exactly, but I still got very few). Using espnet2's g2p then allows generating the phoneme list, and the Python script generates a fixed duration for the phone list.
Some samples generated from a trained FastSpeech 2 model + HiFi-GAN are located at: https://1drv.ms/u/s!AliZ3I0uDW8HhKB14mq0Sumx_vSbpg?e=IlmQ6T
Let me know if you have any comments.
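A sketch of the `spn`-handling idea described above (the function name, inputs, and even-split policy are my assumptions, not the PR's implementation): when MFA falls back to an `spn` token for an unknown word, one can substitute the g2p phonemes for that word and divide the aligned interval's frames among them.

```python
def expand_spn(phones, durations, spn_replacements):
    """Replace 'spn' tokens with g2p phonemes, splitting frames evenly.

    phones: aligned phone labels, possibly containing 'spn'.
    durations: frame counts per phone (same length as phones).
    spn_replacements: dict mapping the index of each 'spn' in `phones`
        to the list of g2p phonemes for the underlying word.
    """
    out_phones, out_durs = [], []
    for i, (p, d) in enumerate(zip(phones, durations)):
        if p == "spn" and i in spn_replacements:
            reps = spn_replacements[i]
            base, rem = divmod(d, len(reps))
            for j, rp in enumerate(reps):
                # Give the leftover frames to the first `rem` phones so
                # the total duration is preserved exactly.
                out_phones.append(rp)
                out_durs.append(base + (1 if j < rem else 0))
        else:
            out_phones.append(p)
            out_durs.append(d)
    return out_phones, out_durs


# Hypothetical example: an 'spn' spanning 7 frames becomes K AE T.
new_phones, new_durs = expand_spn(
    ["HH", "spn", "AH"], [5, 7, 4], {1: ["K", "AE", "T"]}
)
```

Splitting the interval evenly is the simplest policy; the total frame count is unchanged, which keeps the durations consistent for duration-predictor training.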