
Synthesized audio quality is poor when using pre-trained WaveNet vocoder model #2191

Closed
mindmapper15 opened this issue Jul 21, 2020 · 5 comments
Labels: Question, TTS (Text-to-speech)


mindmapper15 commented Jul 21, 2020

I trained a Tacotron 2 model on the LJSpeech dataset with the default settings.

The synthesized mel spectrogram looks fine, but when I converted it to audio with the pre-trained WaveNet vocoder model (ljspeech.wavenet.mol.v2) described in the README, the synthesized audio had very poor quality given how good the mel spectrogram looks.

I extracted the synthesized mel spectrograms from feats.ark and saved them in npy format.
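
Roughly like this (a minimal sketch, assuming the kaldiio package; the output directory and file naming are illustrative):

```python
import os
import numpy as np
import kaldiio

ark_path = "exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/outputs_model.last1.avg.best_decode_denorm/feats.ark"
out_dir = "mels_npy"
os.makedirs(out_dir, exist_ok=True)

# kaldiio.load_ark yields (utterance_id, feature_matrix) pairs;
# each matrix has shape (num_frames, num_mel_bins).
for utt_id, mel in kaldiio.load_ark(ark_path):
    np.save(os.path.join(out_dir, f"{utt_id}.npy"), mel)
```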

This is the synthesized mel spectrogram from the LJSpeech dev set:
[image: fbank_espnet]

And this is the converted audio from the synthesized mel spectrogram I attached:

[image: fbank_after_vocoder]

LJ049-0151.zip

I peak-normalized the synthesized audio because its amplitude was too small.

Is this model trained properly? If so, did I feed the synthesized mel spectrogram to the vocoder in the wrong way?

kan-bayashi (Member) commented Jul 21, 2020

It seems something is wrong.
Please check the following points:

  • Are the feature settings the same between the models? (fmax, fmin, ...)
  • Did you use normalized features? (not denormalized)

kan-bayashi added the Question and TTS labels on Jul 21, 2020
kan-bayashi (Member) commented Jul 21, 2020

For reference (our pretrained model samples):
https://drive.google.com/drive/folders/1JFNZapygWsHiP2CXMjTraLzf98h-tEBF?usp=sharing

mindmapper15 (Author) commented

  • Are the feature settings the same between the models? (fmax, fmin, ...)

I used the ljspeech.wavenet.mol.v2 model, which was trained with a mel frequency range of 80–7600 Hz.
This mel range is the same as the default setting in the LJSpeech TTS recipe in ESPnet.
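
For reference, these values are set near the top of egs/ljspeech/tts1/run.sh; a sketch of the relevant defaults as I understand them (worth double-checking against your checkout):

```sh
# Feature-related settings in egs/ljspeech/tts1/run.sh (approximate defaults)
fs=22050      # sampling frequency
fmax=7600     # maximum frequency of the mel basis
fmin=80       # minimum frequency of the mel basis
n_mels=80     # number of mel bins
n_fft=1024    # FFT length in samples
n_shift=256   # frame shift in samples
```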

  • Did you use normalized features? (not denormalized)

I extracted the mel spectrograms from egs/ljspeech/tts1/exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/outputs_model.last1.avg.best_decode_denorm/feats.ark. As far as I know, this feature file contains the synthesized mel spectrograms, which are denormalized at stage 5 of run.sh by the following command:

```sh
apply-cmvn --norm-vars=true --reverse=true data/${train_set}/cmvn.ark \
    scp:${outdir}/${name}/feats.scp \
    ark,scp:${outdir}_denorm/${name}/feats.ark,${outdir}_denorm/${name}/feats.scp
```

So I think these are probably not the normalized features.
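
For intuition, apply-cmvn with --norm-vars=true --reverse=true undoes the per-dimension mean/variance normalization; a minimal numpy sketch of the two directions (function names are illustrative):

```python
import numpy as np

def cmvn(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Forward normalization: the feature scale the network consumed during training.
    return (x - mean) / std

def cmvn_reverse(x_norm: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    # Reverse direction, i.e. apply-cmvn --norm-vars=true --reverse=true:
    # restores the original (denormalized) feature scale.
    return x_norm * std + mean
```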

kan-bayashi (Member) commented

Yes, those are the de-normalized features.
Please use outputs_model.last1.avg.best_decode/feats.ark as the input of WaveNet.
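
A quick sanity check for which archive holds the normalized features is to inspect the per-dimension statistics, since normalized features should have mean ≈ 0 and std ≈ 1; a minimal sketch, assuming the kaldiio package (paths are illustrative):

```python
import kaldiio

for name in ("outputs_model.last1.avg.best_decode",
             "outputs_model.last1.avg.best_decode_denorm"):
    ark = f"exp/char_train_no_dev_pytorch_train_pytorch_tacotron2/{name}/feats.ark"
    # Look at the first utterance in each archive.
    utt_id, mel = next(iter(kaldiio.load_ark(ark)))
    print(name, "mean:", mel.mean(), "std:", mel.std())
```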

mindmapper15 (Author) commented

[image: Figure_1]
LJ049-0151.zip

I changed the WaveNet input to the normalized features, and now it works!
Thank you for the corrections!

The problem is solved, so I'll close this issue.
