Fix bugs in mfa_format.py #5223

Merged
merged 6 commits into espnet:master on Jun 22, 2023

Conversation

G-Thor (Contributor) commented Jun 8, 2023

I'm a big fan of the MFA integration in ESPnet.

This PR fixes a couple of minor bugs in mfa_format.py that caused issues for me.

  1. A floating point error caused the number of frames computed from the MFA alignment file's 'end' value to (very rarely) differ from the number computed in the statistics extraction step (stage 5 of the TTS recipes) for FastSpeech2. Obtaining the value directly from the wav file eliminates that issue. It is a little slower, but you are guaranteed to get the right number of frames every time. See below for an example of this error.
  2. I added the option to leave the text_cleaner argument blank, for pre-normalised transcripts in a non-English language.
  3. I added a .rstrip("/") to ignore trailing slashes in the corpus_dir argument.

Example of fp-error:

>>> samples = 251392
>>> sr = 22050
>>> dur_float = samples / sr
>>> hop_length = 256
>>> true_frames = samples / hop_length
>>> float_frames = dur_float * sr / hop_length
>>> print(f"true no. of frames: {true_frames}. float error: {float_frames}")
true no. of frames: 982.0. float error: 981.9999999999999
>>> print(f"true no. of frames: {int(true_frames)+1}. frames when passed through float: {int(float_frames)+1}")
true no. of frames: 983. frames when passed through float: 982
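
For comparison, reading the sample count directly from the wav keeps everything in integer arithmetic (just a sketch; the path and file are hypothetical):

import soundfile as sf

with sf.SoundFile("sample.wav") as f:    # hypothetical wav with 251392 samples at 22050 Hz
    n_samples = f.frames                 # 251392, read from the file header
frames = n_samples // 256 + 1            # 983, with no float rounding involved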

@mergify mergify bot added the ESPnet2 label Jun 8, 2023
codecov bot commented Jun 8, 2023

Codecov Report

Merging #5223 (5daff33) into master (096e2bb) will decrease coverage by 0.08%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5223      +/-   ##
==========================================
- Coverage   74.43%   74.36%   -0.08%     
==========================================
  Files         642      654      +12     
  Lines       57611    58347     +736     
==========================================
+ Hits        42885    43391     +506     
- Misses      14726    14956     +230     
Flag                       Coverage Δ
test_integration_espnet1   66.28% <ø> (ø)
test_integration_espnet2   46.95% <ø> (-0.58%) ⬇️
test_python                65.18% <ø> (+0.03%) ⬆️
test_utils                 23.28% <ø> (ø)

Flags with carried forward coverage won't be shown.

see 31 files with indirect coverage changes


@sw005320 sw005320 requested a review from Fhrozen June 9, 2023 03:48
@sw005320 sw005320 added Bugfix TTS Text-to-speech labels Jun 9, 2023
@sw005320 sw005320 added this to the v.202307 milestone Jun 9, 2023
Fhrozen (Member) commented Jun 9, 2023

@G-Thor Thank you for your PR.
While 2 and 3 seem well addressed, I found a problem with 1.
I also tried reading the file, which may increase the processing time, but more importantly, the sampling rate of the local wav file may differ from the one that will be used during training.

Example:
In this data prep, the output sampling rate is 48 kHz (which is used to generate the wav files at data/local/mfa/corpus):

echo "${id} sox \"${filename}\" -r 48000 -t wav -c 1 -b 16 - |" >> "${scp}"

But the training will be performed at 24 kHz (which is the rate used to calculate the durations by multiplying maxtime x sampling rate).


This will generate an assertion error during stage 5 and so on, so I cannot fully accept this PR.
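
Roughly, the mismatch looks like this (a sketch with hypothetical numbers: a 10-second utterance and a hop size of 300 defined at the 24 kHz training rate):

fs_train = 24000   # sampling rate used for feature extraction during training
fs_local = 48000   # rate of the wav generated for MFA at data/local/mfa/corpus
hop_size = 300     # hop size defined at the training rate (hypothetical value)

n_samples_local = 10 * fs_local                        # 480000 samples in the local 48 kHz file
frames_from_local = n_samples_local // hop_size + 1    # 1601 frames written to the duration file
n_samples_train = 10 * fs_train                        # 240000 samples actually used in training
frames_expected = n_samples_train // hop_size + 1      # 801 frames extracted in stage 5
# The duration file holds roughly twice the frames stage 5 extracts, hence the assertion error.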

I also observed the fp-error in your example, and that is why I am using Decimal. I have not observed any issues with the calculation recently, and I have been using multiple data sets and configurations. So, if you find any particular configuration or data set that fails, please report it so I can test it.

@sw005320 @G-Thor As I mentioned, if you modify this PR to focus on 2 and 3, then I will approve it. Point 1 needs a different approach.

G-Thor (Contributor, Author) commented Jun 9, 2023

@Fhrozen Thanks so much for your review and your work on MFA integration.

I did not consider this downsampling issue. Good catch. This should definitely be fixed.

I considered the added processing time; in my experience the difference is noticeable but not significant at the scale of model training.

Unfortunately the use of Decimal does not fix the fp-error, since it originates during the duration measurement in MFA, which returns a float. So the error is already present in the string being converted to Decimal.
Indeed, the values in the example I provided come from an audio sample that caused assertion errors due to a mismatch in the number of frames.

We would either need to read the number of samples from a downsampled wav, which requires running this step after stage 2 of the TTS recipe and supplying the path to the ${data_feats}${_suf}/${dset} directory used there (the output dir of format_wav_scp.sh), or account for downsampling in the calculation.

What if we were to read in the original sample rate (source_fs) alongside the no. of samples and do the frame calculation as follows:

total_durations = int(int(n_samples * fs / source_fs) / hop_size) + 1

We need to cast the intermediate calculation to int because the total number of samples is an integer.
I'm not sure this works yet, as I'm unsure how the number of samples is rounded during downsampling. I'll look into it and report back.
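
For example (just a sketch, with hypothetical 48 kHz to 24 kHz numbers):

n_samples = 480000   # samples in the local 48 kHz wav
source_fs = 48000    # original sampling rate of that wav
fs = 24000           # sampling rate used during training
hop_size = 300       # hop size at the training rate (hypothetical)

total_durations = int(int(n_samples * fs / source_fs) / hop_size) + 1
print(total_durations)  # 801, consistent with the frame count extracted at 24 kHz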

Point 1 is the main issue for me here, so I would definitely like to find an approach that works for it.

Fhrozen (Member) left a comment

I added some comments.
Please try the modification and let me know if it works on your end.
Also try running until stage 5 to confirm there are no issues with the duration extraction.

If this fixes the issue, address the CI comments and it will be ready to merge.

egs2/TEMPLATE/asr1/pyscripts/utils/mfa_format.py (Outdated, resolved)
G-Thor and others added 3 commits June 20, 2023 11:01
G-Thor (Contributor, Author) commented Jun 20, 2023

Thanks for your review.
I've implemented the requested changes. I used sf.SoundFile() rather than sf.read() since it is faster; let me know if you strongly prefer sf.read().
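
For reference, a rough sketch of the difference (wav_path is just a placeholder):

import soundfile as sf

wav_path = "sample.wav"   # placeholder path

# sf.read() decodes the whole file into memory just to get its length:
data, sr = sf.read(wav_path)
n_samples, source_fs = len(data), sr

# sf.SoundFile() only opens the header, so the sample count and rate are
# available without decoding the audio:
with sf.SoundFile(wav_path) as f:
    n_samples, source_fs = f.frames, f.samplerate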

During my testing on tsukuyomi I encountered an issue in mfa_format.py; see commit 552b4c0 and the pyopenjtalk-related changes.

After implementing this fix, I encountered no errors running mfa.sh or running the recipe up to stage 5 with the FastSpeech2 config and MFA-derived durations.

Fhrozen (Member) commented Jun 22, 2023

LGTM

@sw005320 sw005320 added the auto-merge Enable auto-merge label Jun 22, 2023
sw005320 (Contributor) commented

Thanks!
After the CI check, I'll merge this PR.

@mergify mergify bot merged commit 161e4bb into espnet:master Jun 22, 2023
24 of 25 checks passed
Labels: auto-merge, Bugfix, ESPnet2, TTS (Text-to-speech)
3 participants