
Question: Voice conversion demo #4838

Open
SerhiiArtemuk opened this issue Dec 27, 2022 · 14 comments

@SerhiiArtemuk

Hi,

Is there any demo to reproduce the voice conversion results shown here: https://unilight.github.io/Publication-Demos/publications/transformer-vc/

Thanks

@sw005320
Contributor

@unilight, do you think it is possible to reproduce this via ESPnet?

@unilight
Contributor

Hi @SerhiiArtemuk, you can find the code in this recipe: https://github.com/espnet/espnet/tree/master/egs/arctic/vc1
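
For reference, here is a minimal sketch of how an ESPnet1 recipe like this is usually run; the stage numbers and flag spellings are assumptions based on the common recipe layout, so check the recipe's run.sh and README for the exact options:

```sh
# Hedged sketch of running the egs/arctic/vc1 recipe end-to-end.
# Assumes ESPnet and its tools are already installed (see espnet/tools).
git clone https://github.com/espnet/espnet.git
cd espnet/egs/arctic/vc1

# Run all stages: data download/prep, feature extraction, pre-trained
# model download, fine-tuning, decoding, and waveform synthesis.
./run.sh

# Or run one stage at a time while debugging, e.g. data preparation only.
./run.sh --stage 0 --stop-stage 0
```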

@SerhiiArtemuk
Author

Thank you very much. Everything works. I have one more question: how difficult is it to adapt this voice cloning architecture to other datasets?

@SerhiiArtemuk
Author

@unilight, do you have any idea about this?

@unilight
Contributor

Hi @SerhiiArtemuk, sorry for the late reply! If you're asking about the effectiveness of this method, I've applied it to several datasets and it worked well. If you're asking about how to implement it, as long as you follow the recipes, it should not be too difficult. The only problem is you need to be familiar with ESPnet haha
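
To make that more concrete, the usual pattern is to copy an existing recipe and replace its data preparation. A rough sketch under those assumptions (the mydata directory and the file names below are hypothetical placeholders; the Kaldi-style data files are what ESPnet1 recipes generally expect):

```sh
# Hedged sketch: adapting the VC recipe to a new parallel dataset.
# "mydata" is a hypothetical placeholder for your own recipe directory.
cd espnet/egs
cp -r arctic/vc1 mydata/vc1
cd mydata/vc1

# Rewrite the data-preparation stage so it produces Kaldi-style data
# directories for your source and target speakers, e.g.:
#   data/<spk>_train/wav.scp   # utterance-id -> path to wav file
#   data/<spk>_train/utt2spk   # utterance-id -> speaker-id
#   data/<spk>_train/spk2utt   # speaker-id -> utterance-ids
# Parallel VC assumes source/target utterances can be paired up by ID.

./run.sh
```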

@SerhiiArtemuk
Author

Thank you very much!

@SerhiiArtemuk
Author

@unilight, this VTN architecture uses a female TTS corpus from M-AILABS (judy_bieber). In the paper, you wrote that the decoder was pretrained on this dataset. A female voice from CMU ARCTIC (slt) was also used as the target voice of the seq2seq model. If a male voice is used as the target instead of a female one, will the decoder pretrained on the female-voice corpus affect the results?

@unilight
Contributor

Hi @SerhiiArtemuk, that's a great question. I have tried pre-training with a male speaker from M-AILABS and fine-tuning on a conversion pair consisting of one male ARCTIC speaker as the source and another male ARCTIC speaker as the target. It turned out to be worse than pre-training with the female speaker from M-AILABS. My feeling is that factors other than gender affect the fine-tuning performance, such as the speaking rate and the quality of the pre-training data, but I didn't run enough experiments to investigate these.

@SerhiiArtemuk
Author

Got it. @unilight, did you try using a male source and a male target with the TTS pre-trained on the female M-AILABS speaker? If so, is the converted male voice distorted by the female pre-training, or is the result of good quality overall?

And I have one more question regarding the vocoder. I noticed that a pre-trained PWG vocoder model is loaded for the specific target speaker; for example, if slt from ARCTIC is the target, the PWG_slt vocoder model is used. Did I understand that correctly? If so, could you share some example scripts for pre-training a PWG vocoder model, so I can create the necessary model (for my own voice, for example)?

@SerhiiArtemuk
Author

@unilight, any idea?

@unilight
Contributor

Hi @SerhiiArtemuk, for your first question, please check the samples on this demo page for a comparison: https://unilight.github.io/Publication-Demos/publications/vtn-taslp/index.html. For the second question, please check this repo: https://github.com/kan-bayashi/ParallelWaveGAN.
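
For the vocoder, here is a rough sketch of training a PWG model on your own recordings with that repo; the egs/myvoice/voc1 name is a hypothetical placeholder, and the repo's own egs/*/voc1 recipes document the exact stages and configs:

```sh
# Hedged sketch: training a Parallel WaveGAN vocoder on your own voice.
git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
cd ParallelWaveGAN
pip install -e .

# Start from an existing single-speaker recipe and point it at your data.
cp -r egs/ljspeech/voc1 egs/myvoice/voc1
cd egs/myvoice/voc1
# Edit the local data preparation to list your wav files, then run all
# stages (data prep, feature extraction, training, decoding).
./run.sh
```

Note that the vocoder's feature-extraction settings (sampling rate, mel configuration) have to match the features your VC model outputs, or the synthesized audio will be degraded.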

@SerhiiArtemuk
Author

SerhiiArtemuk commented Jan 23, 2023

Thanks a lot for your answer. By the way, @unilight, did you try VC on a dataset that doesn't contain parallel recordings?

And is the technique sensitive to other languages? That is, if I need to clone an Italian voice, is there anything I should know before I start, such as TTS pre-training on an Italian speaker?

@aanchan

aanchan commented Feb 8, 2023

I tried a couple of experiments:

  • Naively training on a new Japanese dataset for voice conversion using the M-AILABS judy pretrained model (English). Surprisingly, the voice conversion results in Japanese were not that bad, so maybe there is some benefit to be gained cross-lingually.
  • I did try TTS pre-training on Japanese, but the results appeared similar to the first case with English TTS pre-training. I would need to confirm this with a more rigorous subjective evaluation.

I hope this helps @SerhiiArtemuk

@SerhiiArtemuk
Author

@aanchan thank you for your support. I haven't had time to reproduce the results yet, given the recent updates from @unilight, but I find your results helpful.
