Natural speech training #4

Open
leon2milan opened this issue Jun 13, 2022 · 4 comments

Comments

@leon2milan

In the paper they use Soft Dynamic Time Warping in the KL loss, but I couldn't find it in your code. Is that part still in progress, or is there another reason it's missing?

@dunky11
Owner

dunky11 commented Jun 14, 2022

I tried implementing NaturalSpeech here. It pretty much follows its architecture, but with a different text encoder. You can swap the text encoder for the one from VITS if you want to stay close to the paper. You also want to make sure that the softmax is calculated over the phoneme dimension in the Durator; I don't know if I corrected this.
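
As a rough illustration of that last point (the shapes here are made up, not taken from the repo), the softmax should normalize across the phoneme axis of the alignment logits:

```python
import torch

# Hypothetical alignment logits of shape (batch, n_spec_frames, n_phones).
logits = torch.randn(2, 400, 50)

# Normalize over the phoneme dimension (dim=-1), so every spectrogram frame
# distributes its attention weight across the phonemes.
weights = torch.softmax(logits, dim=-1)
assert torch.allclose(weights.sum(dim=-1), torch.ones(2, 400))
```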

The most difficult part of the paper is its warped KL loss. I implemented it in the most straightforward way, which meant creating three matrices of shape (batch_size, n_channels, n_phones, n_spec_frames): one for x, one for the means, and one for the standard deviations. Obviously those matrices get huge and caused CUDA out-of-memory errors.
So I wrote to the authors, and they explained that they looped over the channels in order to calculate the DTW losses, which probably involved writing some low-level code.
I'm bad at low-level stuff, so hopefully someone else will implement NaturalSpeech. You can also check here; I listed some things the authors of the paper told me that may not be obvious from the paper.
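
In case someone wants to try it, here is a rough sketch (mine, not the authors' code) of how that per-channel loop could look in PyTorch. The tensor names, shapes, and the per-element Gaussian KL cost are assumptions based on the description above:

```python
import torch


def kl_cost_matrix(m_p, logs_p, m_q, logs_q):
    """Pairwise KL(q || p) between every posterior frame and every prior phoneme,
    accumulated channel by channel instead of materializing a
    (batch_size, n_channels, n_phones, n_spec_frames) tensor.

    m_p, logs_p: prior mean / log-std,     shape (batch, n_channels, n_phones)
    m_q, logs_q: posterior mean / log-std, shape (batch, n_channels, n_spec_frames)
    Returns a cost matrix of shape (batch, n_phones, n_spec_frames).
    """
    b, c, n_phones = m_p.shape
    n_frames = m_q.shape[-1]
    cost = torch.zeros(b, n_phones, n_frames, device=m_p.device)
    for ch in range(c):  # loop over channels to keep the intermediate tensors small
        mp = m_p[:, ch].unsqueeze(-1)     # (batch, n_phones, 1)
        lp = logs_p[:, ch].unsqueeze(-1)
        mq = m_q[:, ch].unsqueeze(1)      # (batch, 1, n_spec_frames)
        lq = logs_q[:, ch].unsqueeze(1)
        # KL between two univariate Gaussians, summed over channels.
        cost += lp - lq + 0.5 * ((2 * lq).exp() + (mq - mp) ** 2) * (-2 * lp).exp() - 0.5
    return cost
```

Each term in the loop only has shape (batch, n_phones, n_spec_frames), so memory stays manageable; the resulting matrix can then be fed into a soft-DTW recursion.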

@dunky11
Owner

dunky11 commented Jun 14, 2022

And you obviously also want to swap out the decoder for HiFi-GAN, which is called Generator in the file.
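
Purely as an illustration of what that swap looks like structurally (the classes below are stubs with made-up names, not HiFi-GAN itself):

```python
import torch
from torch import nn


class Generator(nn.Module):
    """Stand-in for the HiFi-GAN decoder ("Generator" in the file)."""

    def __init__(self, in_channels: int):
        super().__init__()
        # Real HiFi-GAN upsamples with transposed convolutions and
        # multi-receptive-field fusion blocks; a single conv just keeps
        # this sketch runnable.
        self.net = nn.Conv1d(in_channels, 1, kernel_size=7, padding=3)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


class Synthesizer(nn.Module):
    def __init__(self, latent_channels: int = 192):
        super().__init__()
        # The swap: the decoder attribute now points at the HiFi-GAN Generator
        # instead of the original decoder.
        self.dec = Generator(latent_channels)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: latent frames of shape (batch, latent_channels, n_frames)
        return self.dec(z)
```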

@leon2milan
Author

After a long time of trying, I've never been able to get the model to converge. I'd like to try UnivNet and see its effect, but my server has no GUI. How should I train the model, and is there any relevant documentation?

@rishikksh20

@dunky11 Have you had a look at this Soft-DTW implementation? https://github.com/google-research/soft-dtw-divergences
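
For reference, the recursion such a library implements is a soft-min version of DTW; a plain (and slow, pure-Python-loop) PyTorch sketch that could consume the KL cost matrix discussed above might look like this. It is a generic sketch, not the API of that repository:

```python
import torch


def soft_dtw(cost, gamma=0.01):
    """Soft-DTW over a pairwise cost matrix of shape (batch, n_phones, n_spec_frames).

    gamma is the smoothing temperature; smaller values approach ordinary DTW.
    """
    b, n, m = cost.shape
    r = cost.new_full((b, n + 1, m + 1), float("inf"))
    r[:, 0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft-min over the three predecessor cells (match, insertion, deletion).
            prev = torch.stack(
                [r[:, i - 1, j - 1], r[:, i - 1, j], r[:, i, j - 1]], dim=-1
            )
            softmin = -gamma * torch.logsumexp(-prev / gamma, dim=-1)
            r[:, i, j] = cost[:, i - 1, j - 1] + softmin
    return r[:, n, m]
```

The double Python loop is exactly the part that needs the "low-level" treatment mentioned above (a custom kernel or an antidiagonal formulation) to be fast enough for training.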
