Natural speech training #4
I tried implementing NaturalSpeech here. It pretty much follows its architecture, with a different text encoder. You can swap out the text encoder with the one from VITS if you want to stay close to the paper. You also want to make sure that the softmax is calculated over the phoneme dimension in the Durator; I don't know if I corrected this. The most difficult part of the paper is its warped KL-loss. I implemented it in the most straightforward way, which involves creating three matrices of shape (batch_size, n_channels, n_phones, n_spec_frames): one for x, one for the means, and one for the standard deviations. Obviously those matrices were huge and caused CUDA out-of-memory errors.
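For illustration, here is a minimal sketch of that straightforward broadcasting approach, assuming VITS-style names: frame-level latents `z` of shape `(batch_size, n_channels, n_spec_frames)` and phoneme-level prior parameters `m_p`, `logs_p` of shape `(batch_size, n_channels, n_phones)`. The function name `pairwise_gaussian_nll` and these shapes are assumptions for the sketch, not the actual code in this repo.

```python
import math
import torch

def pairwise_gaussian_nll(z, m_p, logs_p):
    """Per-(phoneme, frame) Gaussian negative log-likelihood for the warped KL.

    z:      (B, C, T) frame-level latents
    m_p:    (B, C, P) phoneme-level prior means
    logs_p: (B, C, P) phoneme-level prior log standard deviations

    Returns a (B, P, T) cost matrix, but broadcasting materialises
    (B, C, P, T) intermediates -- the tensors that cause the OOM.
    """
    B, C, T = z.shape
    z = z.unsqueeze(2)          # (B, C, 1, T)
    m = m_p.unsqueeze(3)        # (B, C, P, 1)
    logs = logs_p.unsqueeze(3)  # (B, C, P, 1)

    # (z - m) broadcasts to (B, C, P, T): one huge tensor per term.
    sq = ((z - m) ** 2) * torch.exp(-2.0 * logs)   # (B, C, P, T)
    nll = 0.5 * sq.sum(dim=1)                      # (B, P, T)
    nll = nll + logs.sum(dim=1) + 0.5 * C * math.log(2.0 * math.pi)
    return nll
```

The same (B, P, T) matrix can be assembled from a few matrix multiplications over the channel dimension (roughly how VITS builds the score matrix for its alignment search), which avoids ever materialising the (B, C, P, T) intermediates; chunking over frames is another way to keep memory bounded.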
And you obviously also want to swap out the decoder with HiFiGAN, called Generator in the file.
After a long time of trying, the model has never been able to converge. I want to try the effect of
@dunky11 Have you looked at this Soft-DTW implementation? https://github.com/google-research/soft-dtw-divergences
In the paper, they use Soft Dynamic Time Warping in the KL loss. In your code, I didn't find it. So, is that part still in progress, or is it missing for another reason?
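For reference, here is a minimal sketch of the soft-DTW recurrence (Cuturi & Blondel) applied to a phoneme-by-frame cost matrix such as the one sketched above; the names `soft_min` and `soft_dtw` and the `gamma` value are illustrative assumptions, not code from this repo or from the NaturalSpeech authors.

```python
import torch

def soft_min(values, gamma):
    """Smoothed minimum: -gamma * logsumexp(-values / gamma)."""
    return -gamma * torch.logsumexp(-values / gamma, dim=-1)

def soft_dtw(cost, gamma=0.01):
    """Soft-DTW value for a single cost matrix of shape (P, T).

    cost[i, j] is the cost (e.g. Gaussian NLL) of aligning phoneme i with
    spectrogram frame j; smaller gamma approaches hard DTW.
    """
    P, T = cost.shape
    inf = torch.tensor(float("inf"), dtype=cost.dtype, device=cost.device)
    # R[i][j] holds the soft alignment cost of the first i phonemes
    # against the first j frames; a list of lists keeps autograd happy.
    R = [[inf for _ in range(T + 1)] for _ in range(P + 1)]
    R[0][0] = torch.zeros((), dtype=cost.dtype, device=cost.device)
    for i in range(1, P + 1):
        for j in range(1, T + 1):
            candidates = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            R[i][j] = cost[i - 1, j - 1] + soft_min(candidates, gamma)
    return R[P][T]
```

In practice the pure-Python double loop is far too slow for training; a batched implementation with a custom backward pass, such as the soft-dtw-divergences package linked above, is what you would actually plug into the KL term.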