
Synthesized audio quality is poor when using pre-trained WaveNet vocoder model #2191

Closed
mindmapper15 opened this issue Jul 21, 2020 · 5 comments
Labels: Question, TTS (Text-to-speech)


mindmapper15 commented Jul 21, 2020

I trained a Tacotron 2 model on the LJSpeech dataset with the default settings.

The synthesized mel spectrogram looks fine, but when I converted it to audio with the pre-trained WaveNet vocoder model (ljspeech.wavenet.mol.v2) described in the README, the synthesized audio had very poor quality given how good the mel spectrogram looks.

I extracted the synthesized mel spectrograms from feats.ark and saved them in npy format.
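
Roughly like this (a minimal sketch, assuming the kaldiio package; the output directory and file naming are illustrative):

```python
import os
import numpy as np
import kaldiio

ark_path = "exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/outputs_model.last1.avg.best_decode_denorm/feats.ark"
out_dir = "mels_npy"
os.makedirs(out_dir, exist_ok=True)

# kaldiio.load_ark yields (utterance_id, feature_matrix) pairs;
# each matrix has shape (num_frames, num_mel_bins).
for utt_id, mel in kaldiio.load_ark(ark_path):
    np.save(os.path.join(out_dir, f"{utt_id}.npy"), mel)
```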

This is the synthesized mel spectrogram from the LJSpeech dev set:
[image: fbank_espnet]

And this is the converted audio from the synthesized mel spectrogram I attached:

[image: fbank_after_vocoder]

LJ049-0151.zip

I peak-normalized the synthesized audio because its amplitude was too small.

Is this model trained properly? If so, did I feed the synthesized mel spectrogram to the vocoder in the wrong way?

kan-bayashi (Member) commented Jul 21, 2020

It seems something is wrong.
Please check the following points:

  • Are the feature settings the same between the models? (fmax, fmin, ...)
  • Did you use normalized features? (not denormalized)

kan-bayashi added the Question and TTS labels on Jul 21, 2020
kan-bayashi (Member) commented Jul 21, 2020

For reference (our pretrained model samples):
https://drive.google.com/drive/folders/1JFNZapygWsHiP2CXMjTraLzf98h-tEBF?usp=sharing

mindmapper15 (Author) commented

  • Are the feature settings the same between the models? (fmax, fmin, ...)

I used the ljspeech.wavenet.mol.v2 model, which was trained with a mel frequency range of 80–7600 Hz.
This mel range is the same as the default setting in the LJSpeech TTS recipe in ESPnet.
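
For reference, these values are set near the top of egs/ljspeech/tts1/run.sh; a sketch of the relevant defaults as I understand them (worth double-checking against your checkout):

```sh
# Feature-related settings in egs/ljspeech/tts1/run.sh (approximate defaults)
fs=22050      # sampling frequency
fmax=7600     # maximum frequency of the mel basis
fmin=80       # minimum frequency of the mel basis
n_mels=80     # number of mel bins
n_fft=1024    # FFT length in samples
n_shift=256   # frame shift in samples
```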

  • Did you use normalized features? (not denormalized)

I extracted the mel spectrograms from egs/ljspeech/tts1/exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/outputs_model.last1.avg.best_decode_denorm/feats.ark. As far as I know, this feature file contains the synthesized mel spectrograms, which are denormalized at stage 5 of run.sh by the following command:

```sh
apply-cmvn --norm-vars=true --reverse=true data/${train_set}/cmvn.ark \
    scp:${outdir}/${name}/feats.scp \
    ark,scp:${outdir}_denorm/${name}/feats.ark,${outdir}_denorm/${name}/feats.scp
```

So I think these are probably not the normalized features.
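
For intuition, apply-cmvn with --norm-vars=true --reverse=true undoes the per-dimension mean/variance normalization; a minimal numpy sketch of the two directions (function names are illustrative):

```python
import numpy as np

def cmvn(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Forward normalization: the feature scale the network consumed during training.
    return (x - mean) / std

def cmvn_reverse(x_norm: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Reverse direction, i.e. apply-cmvn --norm-vars=true --reverse=true:
    # restores the original (denormalized) feature scale.
    return x_norm * std + mean
```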

kan-bayashi (Member) commented

Yes, those are the de-normalized features.
Please use outputs_model.last1.avg.best_decode/feats.ark as the input of WaveNet.
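
A quick sanity check for which archive holds the normalized features is to inspect the per-dimension statistics, since normalized features should have mean ≈ 0 and std ≈ 1; a minimal sketch, assuming the kaldiio package (paths are illustrative):

```python
import kaldiio

for name in ("outputs_model.last1.avg.best_decode",
             "outputs_model.last1.avg.best_decode_denorm"):
    ark = f"exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/{name}/feats.ark"
    # Look at the first utterance in each archive.
    utt_id, mel = next(iter(kaldiio.load_ark(ark)))
    print(name, "mean:", mel.mean(), "std:", mel.std())
```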

mindmapper15 (Author) commented

[image: Figure_1]
LJ049-0151.zip

I changed the WaveNet input to the normalized features, and now it works!
Thank you for the corrections!

The problem is solved, so I'll close this issue.
