Question: Voice conversion demo #4838
Comments
@unilight, do you think it is possible to reproduce via ESPnet?
Hi @SerhiiArtemuk, you can find the code in this recipe: https://github.com/espnet/espnet/tree/master/egs/arctic/vc1
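For anyone landing here later, this is roughly how an ESPnet recipe is run. A minimal sketch only, assuming a working ESPnet installation; the stage numbers and flags below are illustrative and may differ per recipe, so check the recipe's own `run.sh` for the authoritative options:

```shell
# Sketch only: assumes ESPnet and its tools are already installed
# (see the main ESPnet README for setup).
git clone https://github.com/espnet/espnet.git
cd espnet/egs/arctic/vc1

# Run the full pipeline: data prep, feature extraction, training, decoding.
./run.sh

# Or resume from a specific stage (stage numbering varies by recipe):
./run.sh --stage 3 --stop_stage 4
```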
Thank you very much. Everything works. I have one more question: how difficult is it to adapt this voice cloning architecture to other datasets?
@unilight, do you have any idea about this?
Hi @SerhiiArtemuk, sorry for the late reply! If you're asking about the effectiveness of this method, I've applied it to several datasets and it worked well. If you're asking about how to implement it, as long as you follow the recipes it should not be too difficult. The only problem is that you need to be familiar with ESPnet haha
Thank you very much!
@unilight, this VTN architecture uses the M-AILABS female voice corpus (Judy Bieber) for TTS pre-training. In the article, you wrote that the decoder was pre-trained on this dataset. A female voice from the CMU ARCTIC dataset (slt) was also used as the target voice of the seq2seq model. If it is necessary to use a male voice instead of a female one as the target, will a decoder pre-trained on a female-voice corpus affect the results?
Hi @SerhiiArtemuk, that's a great question. I have tried pre-training with a male speaker in M-AILABS and fine-tuning on a conversion pair consisting of a male speaker in ARCTIC as the source and another male speaker in ARCTIC as the target. It turns out that this is worse than pre-training with the female speaker in M-AILABS. My feeling is that factors other than gender affect the fine-tuning performance, such as speaking rate, quality of the pre-training data, etc. However, I didn't do enough experiments to investigate these.
Understood. @unilight, did you try using a male speaker as both source and target with a TTS model pre-trained on the female M-AILABS speaker? If so, is the converted male voice deformed because of the female pre-training data, or is the result of good quality overall? I also have one more question regarding the vocoder. I noticed that a pre-trained PWG vocoder model is loaded for the specific speaker used as the target. For example, if slt from ARCTIC is the target, then the PWG_slt vocoder model is used. Did I understand correctly? If so, could you point me to some example scripts for pre-training a PWG vocoder model so I can create the necessary model (for my own voice, for example)?
@unilight, any idea?
Hi @SerhiiArtemuk, for your first question, please check the samples in this demo page: https://unilight.github.io/Publication-Demos/publications/vtn-taslp/index.html for the comparison. For the second question, please check this repo: https://github.com/kan-bayashi/ParallelWaveGAN. |
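On the vocoder side, the ParallelWaveGAN repo linked above follows the same Kaldi-style recipe layout as ESPnet. A hedged sketch of training a speaker-specific PWG model; the dataset directory below is illustrative, and you should consult the README of the specific `egs/` recipe for the expected data layout:

```shell
# Sketch only: assumes the ParallelWaveGAN repo's recipe layout.
git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
cd ParallelWaveGAN/egs/arctic/voc1

# run.sh handles data prep, feature extraction, training, and decoding;
# pointing the data-prep stage at your own recordings is how you would
# train a vocoder on your own voice (see the recipe README for details).
./run.sh
```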
Thanks a lot for your answer. By the way, @unilight, did you try VC on a dataset that doesn't contain parallel recordings? And is the technique sensitive to other languages? That is, if I need to clone an Italian voice, are there things I should know before I start, such as TTS pre-training on an Italian speaker?
I tried a couple of experiments:
I hope this helps @SerhiiArtemuk
Hi,
Is there any demo to reproduce the voice conversion results shown here: https://unilight.github.io/Publication-Demos/publications/transformer-vc/ ?
Thanks