TTS evaluation script and monitoring functionality using MOS prediction model #5485

Merged 16 commits into espnet:master on Dec 17, 2023

Conversation

Takaaki-Saeki (Contributor)

What?

This PR:

  1. Adds an evaluation script for TTS using a MOS prediction model. It uses the SpeechMOS toolkit developed by @tarepan, currently with a pretrained UTMOS strong learner model (the best-performing method in the VoiceMOS Challenge 2022). Thanks to the toolkit, the predictor can be loaded directly from torch.hub without installing a separate module (see the sketch after this list).
  2. Adds functionality to monitor the predicted MOS values during training of fully end-to-end TTS models. It is currently supported for VITS and JETS.
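For reference, a minimal sketch of loading and using the predictor, following the SpeechMOS README (the repo tag `tarepan/SpeechMOS:v1.2.0` and entry point `utmos22_strong` come from that toolkit; `sample.wav` is a placeholder path):

```python
import torch
import torchaudio

# Load the pretrained UTMOS strong learner via torch.hub;
# no extra package installation is needed beyond torch itself.
predictor = torch.hub.load(
    "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
)

# Score a mono synthesized waveform; the predictor expects a
# (batch, samples) tensor plus the sampling rate.
wave, sr = torchaudio.load("sample.wav")  # wave: (channels, samples)
with torch.no_grad():
    score = predictor(wave, sr)           # predicted MOS, shape (batch,)
print(f"Predicted MOS: {score.item():.3f}")
```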

Why?

  1. In addition to MCD and F0 RMSE, an evaluation script based on MOS prediction is worth having, since MOS prediction works even without ground-truth speech.
  2. For TTS (especially GAN-based TTS), it is often difficult to determine the best number of training steps. Monitoring the predicted MOS during training should help improve TTS performance (see the sketch after this list).
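As an illustration of point 2, a training-loop hook might look roughly like the following. This is a hypothetical sketch, not the PR's actual implementation; `log_predicted_mos` and the TensorBoard-style `writer` are illustrative names:

```python
import torch

@torch.no_grad()
def log_predicted_mos(predictor, generated_wave, sr, step, writer):
    """Predict MOS for a batch of generated waveforms and log the mean.

    predictor:      the torch.hub UTMOS model
    generated_wave: (batch, samples) tensor from the TTS generator
    writer:         any scalar logger, e.g. a TensorBoard SummaryWriter
    """
    scores = predictor(generated_wave, sr)  # (batch,) predicted MOS
    writer.add_scalar(
        "generator_predicted_mos", scores.mean().item(), step
    )
```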

See also

  • The figure below shows the predicted MOS during training of VITS on LJSpeech; although training is still in its early iterations, the predicted MOS is increasing.
    [Figure: generator_predicted_mos]

@mergify mergify bot added the ESPnet2 label Oct 19, 2023
codecov bot commented Oct 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (1c55053) 70.62% vs. head (c9bfeb4) 76.55%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5485      +/-   ##
==========================================
+ Coverage   70.62%   76.55%   +5.93%     
==========================================
  Files         719      720       +1     
  Lines       66513    66616     +103     
==========================================
+ Hits        46972    50998    +4026     
+ Misses      19541    15618    -3923     
Flag                         Coverage Δ
test_configuration_espnet2   ∅ <ø> (∅)
test_integration_espnet1     62.92% <ø> (ø)
test_integration_espnet2     50.09% <18.75%> (-0.02%) ⬇️
test_python_espnet1          19.08% <0.00%> (?)
test_python_espnet2          52.40% <100.00%> (+0.01%) ⬆️
test_utils                   22.15% <ø> (ø)

Flags with carried-forward coverage won't be shown.


ftshijt (Collaborator) commented Oct 19, 2023

Wow, this is super cool! I always run UTMOS evaluation on audio generated from ESPnet, so this will speed up a lot of my work. Many thanks for the contribution!

@mergify mergify bot added the README label Oct 20, 2023
@sw005320 sw005320 added the New Features and TTS (Text-to-speech) labels Oct 20, 2023
@sw005320 sw005320 added this to the v.202312 milestone Oct 20, 2023
sw005320 (Contributor)

@Takaaki-Saeki, very cool!

  • Can you add a test? You can check https://github.com/espnet/espnet/tree/master/test_utils
  • I was thinking of suggesting that this evaluation metric be added to the results of some recipes, but then I found that TTS recipes usually do not provide results (only pretrained models). The template does not even include evaluation. We should make some updates…

sw005320 (Contributor)

After you add a test, I can merge this PR.

Takaaki-Saeki (Contributor, Author) commented Oct 24, 2023

  • I was thinking of suggesting that this evaluation metric be added to the results of some recipes, but then I found that TTS recipes usually do not provide results (only pretrained models). The template does not even include evaluation. We should make some updates…

Thanks! This is an interesting direction, and the objective results should be included in the TTS recipes.
For objective evaluation of TTS recipes, it would be good to have clearly defined test and dev sets, as in ASR (although the criterion needs to be discussed). For example, IIUC, LJSpeech and VCTK do not have predefined test sets, while LibriTTS has one, as in LibriSpeech.
In line with this, TTS benchmarking (like SUPERB) with ESPnet2 would also be worthwhile.
Let me consider it as a future PR.

sw005320 (Contributor)

Cool!
We can discuss and design it!

@kan-bayashi kan-bayashi modified the milestones: v.202310, v.202312 Oct 25, 2023
ftshijt (Collaborator) commented Dec 17, 2023

Thanks for the contribution!

@ftshijt ftshijt merged commit 4771515 into espnet:master Dec 17, 2023
27 checks passed