Question: Voice conversion demo #4838
Comments
@unilight, do you think it is possible to reproduce via ESPnet?
Hi @SerhiiArtemuk, you can find the code in this recipe: https://github.com/espnet/espnet/tree/master/egs/arctic/vc1
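For anyone landing here later, this is roughly how an ESPnet recipe is run. A minimal sketch only, assuming a working ESPnet installation; the stage numbers and flags below are illustrative and may differ per recipe, so check the recipe's own `run.sh` for the authoritative options:

```shell
# Sketch only: assumes ESPnet and its tools are already installed
# (see the main ESPnet README for setup).
git clone https://github.com/espnet/espnet.git
cd espnet/egs/arctic/vc1

# Run the full pipeline: data prep, feature extraction, training, decoding.
./run.sh

# Or resume from a specific stage (stage numbering varies by recipe):
./run.sh --stage 3 --stop_stage 4
```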
Thank you very much. Everything works. I have one more question: how difficult is it to adapt this voice cloning architecture to other datasets?
@unilight, do you have any idea about this?
Hi @SerhiiArtemuk, sorry for the late reply! If you're asking about the effectiveness of this method, I've applied it to several datasets and it worked well. If you're asking about how to implement it, as long as you follow the recipes it should not be too difficult. The only problem is that you need to be familiar with ESPnet haha
Thank you very much!
@unilight, this VTN architecture uses the M-AILABS female voice corpus (Judy Bieber) for TTS pre-training. In the article, you wrote that the decoder was pre-trained on this dataset. A female voice from the CMU ARCTIC dataset (slt) was also used as the target voice of the seq2seq model. If it is necessary to use a male voice instead of a female one as the target, will a decoder pre-trained on a female-voice corpus affect the results?
Hi @SerhiiArtemuk, that's a great question. I have tried pre-training with a male speaker in M-AILABS and fine-tuning on a conversion pair consisting of a male speaker in ARCTIC as the source and another male speaker in ARCTIC as the target. It turns out that this is worse than pre-training with the female speaker in M-AILABS. My feeling is that factors other than gender affect the fine-tuning performance, such as speaking rate, quality of the pre-training data, etc. However, I didn't do enough experiments to investigate these.
Understood. @unilight, did you try using a male speaker as both source and target with a TTS model pre-trained on the female M-AILABS speaker? If so, is the converted male voice deformed because of the female pre-training data, or is the result of good quality overall? I also have one more question regarding the vocoder. I noticed that a pre-trained PWG vocoder model is loaded for the specific speaker used as the target. For example, if slt from ARCTIC is the target, then the PWG_slt vocoder model is used. Did I understand correctly? If so, could you point me to some example scripts for pre-training a PWG vocoder model so I can create the necessary model (for my own voice, for example)?
@unilight, any idea?
Hi @SerhiiArtemuk, for your first question, please check the samples in this demo page: https://unilight.github.io/Publication-Demos/publications/vtn-taslp/index.html for the comparison. For the second question, please check this repo: https://github.com/kan-bayashi/ParallelWaveGAN. |
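On the vocoder side, the ParallelWaveGAN repo linked above follows the same Kaldi-style recipe layout as ESPnet. A hedged sketch of training a speaker-specific PWG model; the dataset directory below is illustrative, and you should consult the README of the specific `egs/` recipe for the expected data layout:

```shell
# Sketch only: assumes the ParallelWaveGAN repo's recipe layout.
git clone https://github.com/kan-bayashi/ParallelWaveGAN.git
cd ParallelWaveGAN/egs/arctic/voc1

# run.sh handles data prep, feature extraction, training, and decoding;
# pointing the data-prep stage at your own recordings is how you would
# train a vocoder on your own voice (see the recipe README for details).
./run.sh
```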
Thanks a lot for your answer. By the way, @unilight, did you try VC on a dataset that doesn't contain parallel recordings? And is the technique sensitive to other languages? That is, if I need to clone an Italian voice, are there things I should know before I start, such as TTS pre-training on an Italian speaker?
I tried a couple of experiments:
I hope this helps @SerhiiArtemuk
Hi,
Is there any demo to reproduce the voice conversion results shown here: https://unilight.github.io/Publication-Demos/publications/transformer-vc/ ?
Thanks