Fix bugs in mfa_format.py #5223
Fixes a couple of bugs in mfa_format.py that caused issues for me.
1. A floating-point error caused the no. of frames computed from the MFA alignment file's duration to differ from the no. of frames found in the statistics extraction step (stage 5 of the TTS recipes) for FastSpeech2. Obtaining the value directly from the wav file eliminates that issue.
2. I added the option to leave the text_cleaner argument blank, for pre-normalised transcripts in a non-English language.
3. I added an .rstrip("/") to ignore trailing slashes in the corpus_dir argument.
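To make items 1 and 3 concrete, here is a minimal sketch; it is not the code from mfa_format.py. All function names are hypothetical, the frame formula `num_samples // hop_length + 1` is an assumed centered-STFT convention, and the millisecond rounding only stands in for whatever imprecision the alignment duration actually carries.

```python
import os
import tempfile
import wave


def frames_from_samples(num_samples: int, hop_length: int) -> int:
    """Frame count for a centered STFT (assumed convention, not taken from the PR)."""
    return num_samples // hop_length + 1


def frames_from_duration(duration: float, sample_rate: int, hop_length: int) -> int:
    """Frame count derived from a duration in seconds (the fragile route)."""
    return frames_from_samples(int(duration * sample_rate), hop_length)


def count_wav_samples(path: str) -> int:
    """Read the exact number of samples per channel from the wav header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes()


def write_silence(path: str, num_samples: int, sample_rate: int) -> None:
    """Write a mono 16-bit wav of silence, only to keep the demo self-contained."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * num_samples)


def normalize_corpus_dir(corpus_dir: str) -> str:
    """Drop trailing slashes so the last path component is not empty."""
    return corpus_dir.rstrip("/")


def demo() -> tuple:
    """Return (frames from the wav itself, frames from an imprecise duration)."""
    sample_rate, hop_length, num_samples = 24000, 300, 35999
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "utt.wav")
        write_silence(path, num_samples, sample_rate)
        exact = frames_from_samples(count_wav_samples(path), hop_length)
    # Duration rounded to milliseconds, standing in for an imprecise alignment value.
    reported = round(num_samples / sample_rate, 3)
    derived = frames_from_duration(reported, sample_rate, hop_length)
    return exact, derived
```

In this sketch `demo()` yields two frame counts that differ by one, which is the kind of mismatch that would trip a length assertion later in a recipe. For item 3, `os.path.basename("data/corpus/")` is an empty string while `os.path.basename(normalize_corpus_dir("data/corpus/"))` is `"corpus"`; that may be why a trailing slash matters, though the PR text does not state the exact failure.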
Codecov Report
@@ Coverage Diff @@
## master #5223 +/- ##
==========================================
- Coverage 74.43% 74.36% -0.08%
==========================================
Files 642 654 +12
Lines 57611 58347 +736
==========================================
+ Hits 42885 43391 +506
- Misses 14726 14956 +230
31 files have indirect coverage changes.
@G-Thor Thank you for your PR. Example:
But the training will be performed at 24 kHz (which is used to calculate the durations by multiplying the
espnet/egs2/tsukuyomi/tts1/run.sh Line 8 in 096e2bb
This will generate an assertion error from stage 5 onward, so I cannot fully accept this PR. I also observed the example of fp-error, and that is why I am using
@sw005320 @G-Thor, as I mentioned, if you modify this PR to focus on 2 and 3, then I will approve. 1 needs a different direction.
@Fhrozen Thanks so much for your review and your work on MFA integration. I did not consider this downsampling issue. Good catch. This should definitely be fixed. I considered the added processing time, but in my experience the difference is noticeable but not significant on the scale of model training. Unfortunately, the use of Decimal does not fix the fp-error, since it originates during the duration measurement in MFA, which returns a float. So the error is already present in the str being converted to Decimal. We'd either need to read the no. of samples from a downsampled wav, which requires running this step after stage 2 of the TTS recipe and supplying the path to the
What if we were to read in the original sample rate (
We need to cast the intermediate calculation to int because the total no. of samples is an integer. 1 is the main issue for me here, so I would definitely like to find a direction that works for it.
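The two points in this comment can be sketched as follows. This is an illustration under assumptions, not the change that was merged: every function name is hypothetical, the use of `round()` before the int cast is this sketch's choice, and whether the floored target count matches the length produced by the recipe's resampler is not verified here.

```python
from decimal import Decimal


def samples_via_decimal(duration_text: str, sample_rate: int) -> int:
    """Exact decimal arithmetic on a duration that is already imprecise."""
    return int(Decimal(duration_text) * sample_rate)


def samples_from_duration(duration: float, orig_sample_rate: int) -> int:
    """Recover the integer sample count at the ORIGINAL sample rate.

    The intermediate product is cast to int because a sample count is an
    integer; rounding first is an assumption made by this sketch.
    """
    return int(round(duration * orig_sample_rate))


def rescale_sample_count(orig_samples: int, orig_sr: int, target_sr: int) -> int:
    """Scale an integer sample count to the training rate using integer math."""
    return orig_samples * target_sr // orig_sr


# Decimal cannot restore information lost before the string was produced:
# a 35999-sample utterance at 24 kHz reported as "1.500" s still gives 36000.
decimal_estimate = samples_via_decimal("1.500", 24000)

# Going through the original rate (48 kHz here, chosen for illustration).
orig_samples = samples_from_duration(71999 / 48000, 48000)
target_samples = rescale_sample_count(orig_samples, 48000, 24000)
```

Keeping the count as an integer at the original rate means that only the final rescaling step depends on the target rate, so the downsampled wav does not have to exist when the alignment durations are converted.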
I added some comments.
Please try the modification and let me know if it works on your end.
Also try running until stage 5 to confirm there are no issues with the duration extraction.
If this fixes the issue, please address the CI comments and it will be ready to merge.
v0.3.0 introduced breaking changes to run_frontend() See r9y9/pyopenjtalk#40
also address comments from @Fhrozen
Thanks for your review. During my testing on tsukuyomi I encountered an issue in mfa_format.py. See commit 552b4c0 and pyopenjtalk's changes. After implementing this fix I encountered no errors when running mfa.sh or when running the recipe up to stage 5 using the fastspeech2 config and MFA-derived durations.
LGTM
Thanks!
I'm a big fan of the MFA integration in ESPNet.
This PR fixes a couple of minor bugs in mfa_format.py that caused issues for me.
Example of fp-error: