Fix bugs in mfa_format.py #5223

Merged
merged 6 commits into espnet:master on Jun 22, 2023

Conversation

G-Thor (Contributor) commented Jun 8, 2023

I'm a big fan of the MFA integration in ESPnet.

This PR fixes a couple of minor bugs in mfa_format.py that caused issues for me.

  1. A floating point error caused the number of frames computed from the MFA alignment file's 'end' value to (very rarely) differ from the number computed in the statistics extraction step (stage 5 of the TTS recipes) for FastSpeech2. Obtaining the value directly from the wav file eliminates that issue. It is a little slower, but you are guaranteed to get the right number of frames every time. See below for an example of this error.
  2. I added the option to leave the text_cleaner argument blank, for pre-normalised transcripts in a non-English language.
  3. I added a .rstrip("/") to ignore trailing slashes in the corpus_dir argument.

Example of fp-error:

>>> samples = 251392
>>> sr = 22050
>>> dur_float = samples / sr
>>> hop_length = 256
>>> true_frames = samples / hop_length
>>> float_frames = dur_float * sr / hop_length
>>> print(f"true no. of frames: {true_frames}. float error: {float_frames}")
true no. of frames: 982.0. float error: 981.9999999999999
>>> print(f"true no. of frames: {int(true_frames)+1}. frames when passed through float: {int(float_frames)+1}")
true no. of frames: 983. frames when passed through float: 982
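
For comparison, reading the sample count directly from the wav keeps everything in integer arithmetic (just a sketch; the path and file are hypothetical):

import soundfile as sf

with sf.SoundFile("sample.wav") as f:    # hypothetical wav with 251392 samples at 22050 Hz
    n_samples = f.frames                 # 251392, read from the file header
frames = n_samples // 256 + 1            # 983, with no float rounding involved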

@mergify mergify bot added the ESPnet2 label Jun 8, 2023
codecov bot commented Jun 8, 2023

Codecov Report

Merging #5223 (5daff33) into master (096e2bb) will decrease coverage by 0.08%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5223      +/-   ##
==========================================
- Coverage   74.43%   74.36%   -0.08%     
==========================================
  Files         642      654      +12     
  Lines       57611    58347     +736     
==========================================
+ Hits        42885    43391     +506     
- Misses      14726    14956     +230     
Flag                       Coverage Δ
test_integration_espnet1   66.28% <ø> (ø)
test_integration_espnet2   46.95% <ø> (-0.58%) ⬇️
test_python                65.18% <ø> (+0.03%) ⬆️
test_utils                 23.28% <ø> (ø)

Flags with carried forward coverage won't be shown.

see 31 files with indirect coverage changes


@sw005320 sw005320 requested a review from Fhrozen June 9, 2023 03:48
@sw005320 sw005320 added Bugfix TTS Text-to-speech labels Jun 9, 2023
@sw005320 sw005320 added this to the v.202307 milestone Jun 9, 2023
Fhrozen (Member) commented Jun 9, 2023

@G-Thor Thank you for your PR.
While 2 and 3 seem well addressed, I found a problem with 1.
I also tried reading the file, which may increase the processing time, but more importantly, the sampling rate of the local wav file may differ from the one that will be used during training.

Example:
In this data prep, the output sampling rate is 48 kHz (which is used to generate the wav files at data/local/mfa/corpus):

echo "${id} sox \"${filename}\" -r 48000 -t wav -c 1 -b 16 - |" >> "${scp}"

But the training will be performed at 24 kHz (which is the rate used to calculate the durations by multiplying maxtime x sampling rate).


This will generate an assertion error during stage 5 and so on, so I cannot fully accept this PR.
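
Roughly, the mismatch looks like this (a sketch with hypothetical numbers: a 10-second utterance and a hop size of 300 defined at the 24 kHz training rate):

fs_train = 24000   # sampling rate used for feature extraction during training
fs_local = 48000   # rate of the wav generated for MFA at data/local/mfa/corpus
hop_size = 300     # hop size defined at the training rate (hypothetical value)

n_samples_local = 10 * fs_local                        # 480000 samples in the local 48 kHz file
frames_from_local = n_samples_local // hop_size + 1    # 1601 frames written to the duration file
n_samples_train = 10 * fs_train                        # 240000 samples actually used in training
frames_expected = n_samples_train // hop_size + 1      # 801 frames extracted in stage 5
# The duration file holds roughly twice the frames stage 5 extracts, hence the assertion error.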

I also observed the fp-error in your example, and that is why I am using Decimal. I have not observed any issues with the calculation recently, and I have been using multiple data sets and configurations. So, if you find any particular configuration or data set that fails, please report it so I can test it.

@sw005320 @G-Thor As I mentioned, if you modify this PR to focus on 2 and 3, then I will approve it. Point 1 needs a different approach.

G-Thor (Contributor, Author) commented Jun 9, 2023

@Fhrozen Thanks so much for your review and your work on MFA integration.

I did not consider this downsampling issue. Good catch. This should definitely be fixed.

I considered the added processing time; in my experience the difference is noticeable but not significant at the scale of model training.

Unfortunately the use of Decimal does not fix the fp-error, since it originates during the duration measurement in MFA, which returns a float. So the error is already present in the string being converted to Decimal.
Indeed, the values in the example I provided come from an audio sample that caused assertion errors due to a mismatch in the number of frames.

We would either need to read the number of samples from a downsampled wav, which requires running this step after stage 2 of the TTS recipe and supplying the path to the ${data_feats}${_suf}/${dset} directory used there (the output dir of format_wav_scp.sh), or account for downsampling in the calculation.

What if we were to read in the original sample rate (source_fs) alongside the no. of samples and do the frame calculation as follows:

total_durations = int(int(n_samples * fs / source_fs) / hop_size) + 1

We need to cast the intermediate calculation to int because the total number of samples is an integer.
I'm not sure this works yet, as I'm unsure how the number of samples is rounded during downsampling. I'll look into it and report back.
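
For example (just a sketch, with hypothetical 48 kHz to 24 kHz numbers):

n_samples = 480000   # samples in the local 48 kHz wav
source_fs = 48000    # original sampling rate of that wav
fs = 24000           # sampling rate used during training
hop_size = 300       # hop size at the training rate (hypothetical)

total_durations = int(int(n_samples * fs / source_fs) / hop_size) + 1
print(total_durations)  # 801, consistent with the frame count extracted at 24 kHz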

Point 1 is the main issue for me here, so I would definitely like to find an approach that works for it.

Fhrozen (Member) left a comment

I added some comments.
Please try the modification and let me know if it works on your end.
Also try running until stage 5 to confirm there are no issues with the duration extraction.

If this fixes the issue, address the CI comments and it will be ready to merge.

egs2/TEMPLATE/asr1/pyscripts/utils/mfa_format.py (Outdated, resolved)
G-Thor and others added 3 commits June 20, 2023 11:01
G-Thor (Contributor, Author) commented Jun 20, 2023

Thanks for your review.
I've implemented the requested changes. I used sf.SoundFile() rather than sf.read() since it is faster; let me know if you strongly prefer sf.read().
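
For reference, a rough sketch of the difference (wav_path is just a placeholder):

import soundfile as sf

wav_path = "sample.wav"   # placeholder path

# sf.read() decodes the whole file into memory just to get its length:
data, sr = sf.read(wav_path)
n_samples, source_fs = len(data), sr

# sf.SoundFile() only opens the header, so the sample count and rate are
# available without decoding the audio:
with sf.SoundFile(wav_path) as f:
    n_samples, source_fs = f.frames, f.samplerate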

During my testing on tsukuyomi I encountered an issue in mfa_format.py; see commit 552b4c0 and the pyopenjtalk-related changes.

After implementing this fix, I encountered no errors running mfa.sh or running the recipe up to stage 5 with the FastSpeech2 config and MFA-derived durations.

Fhrozen (Member) commented Jun 22, 2023

LGTM

@sw005320 sw005320 added the auto-merge Enable auto-merge label Jun 22, 2023
sw005320 (Contributor) commented

Thanks!
After the CI check, I'll merge this PR.

@mergify mergify bot merged commit 161e4bb into espnet:master Jun 22, 2023
24 of 25 checks passed
Labels: auto-merge, Bugfix, ESPnet2, TTS (Text-to-speech)
3 participants