
Generating MFA alignments #4803

Merged
merged 26 commits into from Jan 13, 2023

Conversation

Fhrozen
Member

@Fhrozen Fhrozen commented Dec 4, 2022

@sw005320 @kan-bayashi @kamo-naoyuki

I prepared this draft based on PRs #4557 and #4801.
It is almost complete, with general support for different datasets.

I only tested on LJSpeech, so some parts are hard-coded (stage 1 of the .sh script).
In this case, the code trains the MFA models from scratch.
I had some issues with the original PRs: some words did not obtain a phoneme list and instead got an spn token.

Training the models allows using the espnet2-based g2p and reduces the spn tokens to one or none (I do not remember exactly, but there were still very few). Using espnet2's g2p then allows generating the phoneme list, and the Python script generates a fixed duration for the phone list.
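The "fixed duration" idea can be illustrated with a small sketch. This is a hypothetical helper, not the PR's actual script: when MFA yields a single aligned span for a word but espnet2's g2p returns several phones, split the span evenly across them.

```python
def fixed_durations(start, end, phones):
    """Distribute one aligned interval evenly across g2p phones.

    Hypothetical sketch of the fixed-duration idea: the last phone
    absorbs the floating-point remainder so the end time is exact.
    """
    n = len(phones)
    step = (end - start) / n
    out = []
    for i, p in enumerate(phones):
        s = start + i * step
        e = end if i == n - 1 else start + (i + 1) * step
        out.append((p, s, e))
    return out

# Three phones sharing a 0.3 s span
print(fixed_durations(0.0, 0.3, ["HH", "AH", "L"]))
```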

Some samples generated from a trained FastSpeech 2 model + HiFi-GAN are located at: https://1drv.ms/u/s!AliZ3I0uDW8HhKB14mq0Sumx_vSbpg?e=IlmQ6T

Let me know if you have any comments.

@iamanigeeit
Contributor

Hi @Fhrozen, I remember that spn comes from OOV words, which is why I added the steps to generate OOVs and add them into the MFA dictionaries in my proposed mfa.sh script.

Training MFA models from scratch is definitely another option.

@sw005320 sw005320 added this to the v.202211 milestone Dec 5, 2022
@sw005320
Contributor

sw005320 commented Dec 5, 2022

Thanks a lot!
Should we first check and merge #4801 and then move on to this one?
I think #4801 is in good shape now, and it will be merged soon.

@kan-bayashi
Member

@Fhrozen I just merged #4801. If your PR is an alternative or updated version of #4801, please refactor or update the scripts added in #4801.

@Fhrozen
Member Author

Fhrozen commented Dec 7, 2022

@kan-bayashi I am wondering which option would be better. This PR includes MFA training and would cover other recipes such as VCTK, so should the MFA option be kept separate as optional (described in the TTS README), or should it become a permanent option in all the TTS recipes?

@kan-bayashi kan-bayashi modified the milestones: v.202211, v.202301 Dec 11, 2022
@kan-bayashi
Member

the MFA option should be kept separate as optional (described in the TTS README)

I like this one.

@Fhrozen
Member Author

Fhrozen commented Dec 14, 2022

@kan-bayashi
I am not sure if this suits your taste.

Instead of executing ./run.sh --stop-stage 1 ==> ./local/mfa.sh ==> ./run.sh --stage 2, I added local/data.sh to the mfa_align.sh file (scripts/utils/prep_data_mfa_align.sh), so it only requires running ./local/mfa.sh ==> ./run.sh --stage 2 (line 99)

Let me know what you think; I will try to finish the PR.

@kan-bayashi
Member

It looks good :)

@Fhrozen Fhrozen changed the title [Draft] Generating MFA alignments Generating MFA alignments Dec 25, 2022
@Fhrozen
Member Author

Fhrozen commented Dec 25, 2022

@kan-bayashi

I completed the code, so please let me know if you have any comments on it.
I will be adding the documentation.

I tested the code on LJSpeech, VCTK, and Tsukuyomi.
The current code supports:

  • single and multiple speakers
  • any dataset
  • training from scratch or using pre-trained models

I will be adding a function later for downloading pre-trained models from Hugging Face (there are some languages that MFA currently does not support).

A frontend was added to the lab generation for Japanese (Korean and Chinese will probably require it as well).
This is because MFA does not recognize Japanese characters without spaces, so the dictionary generation never completes.
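As a rough illustration of why such a frontend is needed, here is a minimal, hypothetical sketch (not the actual ESPnet frontend) that inserts spaces between Japanese characters so a word-based tool like MFA can tokenize them:

```python
import re

def space_japanese(text: str) -> str:
    """Insert spaces between Japanese characters.

    Hypothetical sketch: MFA tokenizes on whitespace, so unsegmented
    Japanese text must be split before dictionary generation.
    """
    # Hiragana, katakana, and CJK ideograph ranges
    jp = r"[\u3040-\u30ff\u4e00-\u9fff]"
    # Put a space after every Japanese character, then collapse whitespace.
    spaced = re.sub(f"({jp})", r"\1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()

print(space_japanese("こんにちは"))  # → "こ ん に ち は"
```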

@codecov

codecov bot commented Dec 25, 2022

Codecov Report

Merging #4803 (07ee914) into master (a55853b) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #4803   +/-   ##
=======================================
  Coverage   79.18%   79.18%           
=======================================
  Files         557      557           
  Lines       49279    49279           
=======================================
  Hits        39020    39020           
  Misses      10259    10259           
Flag                       Coverage Δ
test_integration_espnet1   66.39% <ø> (ø)
test_integration_espnet2   49.33% <ø> (ø)
test_python                67.99% <ø> (ø)
test_utils                 23.34% <ø> (ø)

Flags with carried forward coverage won't be shown.


Member

@kan-bayashi kan-bayashi left a comment


Sorry for the late review.
The code looks great; it is a nice design that allows various recipes to use MFA.
And I'm glad to hear that the Japanese case also works.

Once you complete this PR, let us merge it.

@Fhrozen Fhrozen marked this pull request as ready for review January 6, 2023 10:50
@Fhrozen
Member Author

Fhrozen commented Jan 6, 2023

Finished adding some fixes related to frame generation issues caused by floating-point division.
Documentation was also added.
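For context, a common way such floating-point issues arise is when converting alignment end times to per-phone frame counts. The sketch below is a hypothetical illustration, not this PR's code: rounding cumulative boundaries instead of each interval separately keeps the total frame count exact.

```python
def times_to_frames(intervals, hop_sec=0.0125):
    """Convert (start, end) times in seconds to integer frame durations.

    Hypothetical sketch: rounding each interval's length independently
    can drift by a frame due to floating-point division; rounding the
    cumulative end boundary avoids that.
    """
    durations = []
    prev = 0
    for _, end in intervals:
        cur = round(end / hop_sec)
        durations.append(cur - prev)
        prev = cur
    return durations

# Three phones ending at 0.10, 0.25, 0.40 s with a 12.5 ms hop
print(times_to_frames([(0.0, 0.10), (0.10, 0.25), (0.25, 0.40)]))  # → [8, 12, 12]
```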

@Fhrozen
Member Author

Fhrozen commented Jan 6, 2023

@kan-bayashi
After testing, it should support any language that the ESPnet front-end supports.

In the case of English, I also tested with VCTK, and some samples are here:
vctk.zip

The files were generated using:

  • an MFA-trained model (cleaner: tacotron, g2p_model: espeak_ng_english_us_vits),
  • a ProDiff + X-vector + GST model, and
  • a HiFi-GAN vocoder

For Japanese, I tested with JVS (multi-speaker) + Tsukuyomi:
jvs.zip

  • an MFA-trained model (cleaner: jaconv, g2p_model: pyopenjtalk_prosody),
  • a ProDiff + X-vector + GST model, and
  • a HiFi-GAN vocoder (VCTK)

The issue with Japanese is the vocoder (I suppose), so I am training on a larger set to train a vocoder and will test again.
Meanwhile, this PR is already finished.
Let me know if any fix is required.

@kan-bayashi kan-bayashi merged commit 47de6af into espnet:master Jan 13, 2023
@Fhrozen Fhrozen deleted the dft-mfa branch January 13, 2023 14:04
@iamanigeeit
Contributor

iamanigeeit commented Jan 30, 2023

Hi @Fhrozen -- thanks for integrating the MFA scripts.

I wonder about --g2p_model espeak_ng_english_us_vits in run_mfa.sh. It seems MFA uses a custom IPA symbol set: not only does it have more symbols, but some are also different (aw vs aʊ̯).
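For anyone hitting this mismatch, the gap between the two symbol sets could be bridged with an explicit mapping table. The entries below are illustrative assumptions for the sketch, not a verified table:

```python
# Illustrative (unverified) mapping from a few MFA-style IPA symbols to
# espeak-ng-style ones; a real recipe would need the full, checked table.
MFA_TO_ESPEAK = {
    "aʊ̯": "aʊ",  # MFA marks the off-glide as non-syllabic
    "aj": "aɪ",
    "ow": "oʊ",
}

def normalize_phones(phones):
    """Map each phone through the table, leaving unknown symbols unchanged."""
    return [MFA_TO_ESPEAK.get(p, p) for p in phones]

print(normalize_phones(["h", "aʊ̯", "s"]))  # → ['h', 'aʊ', 's']
```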

Also, the readme says:

./scripts/mfa.sh --split_sets "train_set dev_set test_set" \
    --stage 1 \
    --stop-stage 2 \
    --train true --nj 36 --g2p_model espeak_ng_english_vits

I think we're supposed to run all (5) stages, right?

@Fhrozen
Member Author

Fhrozen commented Jan 30, 2023

In the case of using espeak_ng_english_us_vits (which is one of the choices of the ESPnet frontend), you need to set --train true, and yes, it is required to run all stages to generate the transcriptions with the specified g2p.
The README command is only a general reference.

@iamanigeeit
Contributor

@kan-bayashi @Fhrozen There seems to be a naming conflict in the LibriTTS corpus because MFA expects .lab files to be plain-text transcriptions, but LibriTTSLabel uses the <start> <end> <phone> format. This could potentially impact mfa.sh and trim_silence.py. I am running mfa.sh on LibriTTS to see whether the alignment failures are still there, because I suspect they are due to OOVs. If it works, I can help update LibriTTSLabel.
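A quick way to tell the two .lab flavors apart is a small format check. This is a hypothetical helper for illustration, not part of the PR:

```python
def is_phone_label(lab_text: str) -> bool:
    """Heuristic check: LibriTTSLabel .lab lines look like
    '<start> <end> <phone>' (two floats plus a symbol), while MFA
    corpus .lab files are plain-text transcriptions.
    """
    first = lab_text.strip().splitlines()[0].split()
    if len(first) != 3:
        return False
    try:
        float(first[0])
        float(first[1])
        return True
    except ValueError:
        return False

print(is_phone_label("0.00 0.12 sil"))     # → True
print(is_phone_label("IN THE BEGINNING"))  # → False
```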

@iamanigeeit
Contributor

Also, in mfa.sh, g2p_model refers to two different things:

  1. ESPnet g2p model name for mfa_format.py
  2. MFA g2p model for the mfa commands

mfa.sh will complain that either MFA or ESPnet does not have the given g2p model name.

@Fhrozen
Member Author

Fhrozen commented Jun 28, 2023

@iamanigeeit I do not understand the current issue.

There seems to be a naming conflict in the LibriTTS corpus because MFA expects .lab files to be plain text transcriptions, but LibriTTSLabel is in <start> <end> <phone> format.

As far as I can see, it uses the normalized text to generate the annotations:

txt=$(cat $(echo $wav_file | sed -e "s/\.wav$/.normalized.txt/"))
echo "$id $txt" >>$trans

cd "${db_root}/LibriTTS/${name}"
find . -follow -name "*.normalized.txt" -print0 \
| tar c --null -T - -f - | tar xf - -C "${cwd}/data/local/${name}"
cd "${cwd}"

The .lab files are used for the phoneme alignment, I suppose.

Also, in mfa.sh, there are two different g2p_model:

ESPnet g2p model name for mfa_format.py
MFA g2p model for the mfa commands
mfa.sh will complain that either MFA or ESPnet doesn't have this g2p model name.

The setup is for using pretrained ESPnet MFA models; there is no issue with that.
If you use espnet_ as a prefix, the program will automatically use an ESPnet pretrained MFA model instead of downloading it from the official MFA page.

I will update this soon to include MFA models stored in Hugging Face.

@iamanigeeit
Contributor

There seems to be a naming conflict in the LibriTTS corpus because MFA expects .lab files to be plain text transcriptions, but LibriTTSLabel is in <start> <end> <phone> format.

I think this is OK since they are in separate folders (the MFA .lab files in data/local/mfa/corpus/xx and the LibriTTS .lab files in downloads/LibriTTS/xx/xx). I was confused because they have the same name.

The setup is for using pretrained espnet mfa models.

I see what you mean now, thanks. If train=true, then g2p_model is an existing ESPnet G2P that we use to train the MFA G2P. If train=false, then g2p_model is an MFA G2P model.
