[BUG] the vq result is not well #210

Open
didadida-r opened this issue May 15, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@didadida-r

Feel free to ask any kind of questions in the issues page, but please use English since other users may find your questions valuable.

Describe the bug

Hi,

I debugged the GPT output and found that the problems originate in the VQ stage:

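`mispronunciation`: the pronunciation is wrong after the VQ stage, even when the model only reconstructs its own input audio (self-reconstruction). For example:
新的轮回,便会开始 ("a new cycle will then begin") --> 新的轮回,定会开始 ("a new cycle will surely begin"), i.e. 便 (biàn) comes out as 定 (dìng).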

`timbre change`: the timbre of the audio reconstructed by the VQ differs from that of the original audio.

Can you share some ideas about these cases and how to improve them? Would using ar4 or ar8 help, or switching to a VQ trained with a VQ loss?

Thanks 

To Reproduce
Steps to reproduce the behavior:

git clone the latest fish code and download the latest official model

download the demo page audio as input_audio_path

python tools/vqgan/inference.py \
        -i "$input_audio_path" \
        -o "$vq_restore_wav_path" \
        -ckpt "$vqgan_path"
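
To put a rough number on the timbre change between the input and the restored file, a log-spectral distance can be computed over the two waveforms. The sketch below is not part of the fish code base; `stft_mag` and `log_spectral_distance` are hypothetical helper names, and it assumes both signals are mono, share a sample rate, and are roughly time-aligned. Loading the two files into NumPy arrays (for example with `soundfile.read`) is left out, so synthetic tones stand in for the audio.

```python
import numpy as np


def stft_mag(x, n_fft=1024, hop=256):
    """Magnitude spectrogram (frames x bins) from a Hann-windowed STFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def log_spectral_distance(ref, test, eps=1e-8):
    """Mean per-frame RMS difference of the log-magnitude spectra, in dB.

    0.0 means identical spectra; larger values mean a larger spectral
    (and so, loosely, timbral) mismatch. It says nothing about pronunciation.
    """
    n = min(len(ref), len(test))  # crop both signals to the shorter one
    a = stft_mag(ref[:n])
    b = stft_mag(test[:n])
    diff = 20 * np.log10(a + eps) - 20 * np.log10(b + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))


# Synthetic stand-ins for the original and the restored audio (assumption).
sr = 16000
t = np.arange(sr) / sr
original = np.sin(2 * np.pi * 220 * t)
restored = np.sin(2 * np.pi * 440 * t)

print(log_spectral_distance(original, original))
print(log_spectral_distance(original, restored))
```

Running the same function on the original file versus the VQ output, and on the original versus the output of another codec, gives a comparable figure for each, which is easier to discuss than a listening impression alone.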


@didadida-r added the bug label May 15, 2024
@leng-yue
Member

That's why we introduced the VITS decoder, which greatly reduced mispronunciation and increased timbre similarity.

@didadida-r
Author

Thanks. I'm curious why the VQ and VITS modules are separate. Is there a specific reason not to combine them into a single module, as gpt-sovits does, and to add a HuBERT module as well?

@didadida-r
Author

> That's why we introduced the VITS decoder, which greatly reduced mispronunciation and increased timbre similarity.

Are the mispronunciation and timbre-change issues attributable to a low token rate? These problems do not seem to be as apparent in codec and DAC.
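
For context on the token-rate question, the information a discrete codec can carry per second is roughly frame rate × number of codebooks × log2(codebook size). The sketch below only illustrates that arithmetic; `token_bitrate` is a hypothetical helper, and both configurations are made-up examples, not the actual settings of the fish VQ, codec, or DAC.

```python
import math


def token_bitrate(frame_rate_hz, num_codebooks, codebook_size):
    """Upper bound on bits per second carried by the discrete tokens."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# Hypothetical configurations, chosen only to show the scale of the gap.
low_rate = token_bitrate(frame_rate_hz=25, num_codebooks=1, codebook_size=1024)
high_rate = token_bitrate(frame_rate_hz=75, num_codebooks=8, codebook_size=1024)

print(f"single codebook at 25 Hz: {low_rate:.0f} bit/s")
print(f"eight codebooks at 75 Hz: {high_rate:.0f} bit/s")
```

Under these assumed numbers the second setup carries 24 times more bits per second, so a gap in reconstruction quality between a low-rate tokenizer and a multi-codebook codec would not be surprising.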
