
unable to apply voice conversion to long files using my trained speaker embedding #57

xanguera opened this issue Sep 22, 2020 · 4 comments


@xanguera

Hi,
I am attempting zero-shot voice conversion using only a few audio sentences from a target speaker. I compute a speaker embedding for this speaker with make_spect.py and make_metadata.py (originally intended for training) and extract the embedding from the resulting train.pkl file. I do the same for the source speaker.
As a comparison, I also perform VC between two of the speakers provided in the metadata.pkl file.
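For reference, something along these lines is what I mean by extracting the embedding (a minimal sketch only; it assumes train.pkl is the list of [speaker_name, mean_embedding, utterance paths, ...] entries that make_metadata.py writes, and 'my_target_speaker' is a placeholder folder name):

```python
import pickle
import numpy as np

# train.pkl is assumed to hold a list of
# [speaker_name, mean_speaker_embedding, utterance_path, ...] entries,
# as written by make_metadata.py.
with open('./spmel/train.pkl', 'rb') as f:
    speakers = pickle.load(f)

def get_embedding(speakers, speaker_name):
    """Return the stored mean speaker embedding for one speaker."""
    for entry in speakers:
        if entry[0] == speaker_name:
            return np.asarray(entry[1], dtype=np.float32)
    raise KeyError('speaker %s not found in train.pkl' % speaker_name)

# placeholder name for the speaker folder under ./spmel
emb_trg = get_embedding(speakers, 'my_target_speaker')
```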

I then apply these embeddings to perform VC by modifying the conversion.py code (a rough sketch of what I mean follows the list below), and this is where things start to fall apart. Here is what I see:

  • If I use any of the embeddings in metadata.pkl as source and target, the resulting audio sounds good, whether I convert a short (2 s) or a long (7 s) file.
  • If I use my own computed embeddings, the resulting audio sounds good when the file is short, but for long files I get garbled audio after second 2-3 (sometimes earlier).
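
To be concrete, my modification amounts to something like the following (a sketch only, not my exact code; it assumes the Generator(32, 256, 512, 32) setup, the autovc.ckpt checkpoint, and the pad_seq helper used in the repo's conversion code):

```python
from math import ceil
import numpy as np
import torch
from model_vc import Generator

device = 'cuda:0'
# Pretrained AutoVC generator, loaded as in the repo's conversion code.
G = Generator(32, 256, 512, 32).eval().to(device)
g_checkpoint = torch.load('autovc.ckpt', map_location=device)
G.load_state_dict(g_checkpoint['model'])

def pad_seq(x, base=32):
    # Pad the mel-spectrogram so its length is a multiple of the freq factor.
    len_out = int(base * ceil(float(x.shape[0]) / base))
    len_pad = len_out - x.shape[0]
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad

def convert(mel_src, emb_src, emb_trg):
    # mel_src: (T, 80) mel from make_spect.py; emb_*: 256-d speaker embeddings.
    x, len_pad = pad_seq(mel_src)
    uttr = torch.from_numpy(x[np.newaxis, :, :]).float().to(device)
    e_src = torch.from_numpy(emb_src[np.newaxis, :]).float().to(device)
    e_trg = torch.from_numpy(emb_trg[np.newaxis, :]).float().to(device)
    with torch.no_grad():
        _, mel_psnt, _ = G(uttr, e_src, e_trg)
    mel_out = mel_psnt.squeeze(0).squeeze(0).cpu().numpy()
    return mel_out if len_pad == 0 else mel_out[:-len_pad, :]
```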

Has anyone experienced this? The author says that the system has been trained on short audio clips, but would that explain this behavior? Especially given that the embeddings in metadata.pkl always sound good, regardless of length.
Have the embeddings in metadata.pkl been computed in exactly the same way as the embeddings I am now computing? (Note that I have tried disabling the random noise added in the make_metadata.py script, with the same results.)

Thanks!

@zzw922cn

Hi, what's the reconstruction loss of your converged model? I want to know when I can reproduce the in-domain voice conversion results. Thank you~~

@ruclion

ruclion commented Dec 23, 2020

> I am attempting zero-shot voice conversion using only a few audio sentences from a target speaker. [...] If I use my own computed embeddings, the resulting audio sounds good when the file is short, but for long files I get garbled audio after second 2-3 (sometimes earlier). [...]

Maybe use your own speaker's dataset to fine-tune AutoVC's content encoder and decoder?
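
A very rough, untested sketch of what I mean (it assumes the self-reconstruction objective from solver_encoder.py and the same Generator/checkpoint loading as the conversion code; my_dataloader is a placeholder you would have to write yourself):

```python
import torch
import torch.nn.functional as F
from model_vc import Generator

device = 'cuda:0'
G = Generator(32, 256, 512, 32).train().to(device)
ckpt = torch.load('autovc.ckpt', map_location=device)
G.load_state_dict(ckpt['model'])
opt = torch.optim.Adam(G.parameters(), lr=1e-4)

# my_dataloader is a placeholder: it should yield (mel, emb) batches for your
# own speaker, with mel of shape (B, T, 80) and emb of shape (B, 256).
for mel, emb in my_dataloader:
    mel, emb = mel.to(device), emb.to(device)
    # Self-reconstruction: source and target embedding are the same speaker.
    mel_out, mel_out_psnt, code_real = G(mel, emb, emb)
    # Content consistency: re-encode the reconstructed mel (passing None as the
    # target embedding returns only the content codes in the repo's Generator).
    # squeeze(1) drops the extra channel dim some repo versions add; it is a
    # no-op otherwise.
    code_reconst = G(mel_out_psnt.squeeze(1), emb, None)
    loss = (F.mse_loss(mel_out.squeeze(1), mel)
            + F.mse_loss(mel_out_psnt.squeeze(1), mel)
            + F.l1_loss(code_reconst, code_real))
    opt.zero_grad()
    loss.backward()
    opt.step()
```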

@ruclion

ruclion commented Dec 23, 2020

> Hi, what's the reconstruction loss of your converged model? [...]

He just uses the speaker encoder to get speaker features and applies them to the pretrained AutoVC model, so he probably did not train anything~

@MuyangDu

Same issue here. I just tried the pretrained model (downloaded from the repo) without any fine-tuning. I use some clean speech recordings from a female speaker outside the VCTK dataset; let's call her speaker A.
Here is the procedure:

audios of A -> pretrained speaker encoder -> speaker embedding of A
audios of p227 in VCTK -> pretrained speaker encoder -> speaker embedding of p227
(speaker embedding of A, speaker embedding of p227, spectrogram of p227) -> pretrained autovc -> pretrained wavenet -> generated speech audio of A
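
For the last step, the vocoder part is roughly the following (a sketch assuming the build_model/wavegen interface from the repo's vocoder notebook; mel_converted stands for the AutoVC output from the step above, and I write the file with soundfile here):

```python
import numpy as np
import soundfile as sf
import torch
from synthesis import build_model, wavegen

device = torch.device('cuda')
# Pretrained WaveNet vocoder, loaded as in the repo's vocoder notebook.
model = build_model().to(device)
checkpoint = torch.load('checkpoint_step001000000_ema.pth', map_location=device)
model.load_state_dict(checkpoint['state_dict'])

# mel_converted: (T, 80) mel-spectrogram produced by the AutoVC generator
waveform = wavegen(model, c=mel_converted)
sf.write('generated.wav', np.asarray(waveform), 16000)
```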

However, the generated speech of A gets garbled after second 2-3.
Here is the generated audio of A: generated.wav.zip
Here is the origin audio of p227: p227_005.wav.zip

From the generated audio we can hear that the speaker identity is successfully converted to A, but the actual speech content is lost.

But if I convert p227 to p225 (both are VCTK speakers) using exactly the same procedure described above and exactly the same pretrained models, it works fine. (Shorter audio clips perform better than long ones, but at least the content is correct.)

From the paper, I found this: "... as long as seen speakers are included in either side of the conversions, the performance is comparable ...". A is an unseen speaker; I am not sure whether p227 was seen during training.

So, here are some guesses:

  1. Since the generated audio sounds like A, the speaker encoder should be working fine.
  2. Since the p227 -> p225 conversion is fine in content, the content encoder should be working fine.
  3. But the content is lost in the generated audio of A, so maybe the decoder is overfitting to VCTK (VCTK content + VCTK speaker embedding)? If we replace the decoder input with (VCTK content + unseen speaker embedding), it fails.

I haven't tried fine-tuning yet. Just some thoughts. Any ideas, guys?
@xanguera @ruclion @auspicious3000
