
Usage of audio_slice_frames, sample_frames, pad #12

wade3han opened this issue Oct 24, 2019 · 8 comments

Comments

@wade3han

Hello,

I saw that you use pad, audio_slice_frames, and sample_frames, but I can't understand what those parameters do. Can you explain what they mean?

Also, the WaveRNN model uses the padded mel input in the first GRU layer, but you slice out the padding only after the first layer. Is it important to use the padded mel in the first GRU?

Thanks.

@bshall
Owner

bshall commented Oct 24, 2019

Hi @wade3han,

Yeah, I should add some comments explaining those parameters.

First, sample_frames is the number of frames sampled from the mel spectrogram that get fed into the conditioning network (rnn1 in the model). The output then gets upsampled and sent to the auto-regressive part (rnn2 in the model). But if we set sample_frames to 40, then after upsampling there are 40 x 200 = 8000 samples, which would take far too long to train on.

To speed things up I only take the middle audio_slice_frames frames, upsample them, and then use that to condition rnn2. The pad parameter is just how many frames are on either side of the middle audio_slice_frames. So for the default config this would be (40 - 8) / 2 = 16 frames. To account for only taking the middle frames, I also padded the mel spectrograms by pad on both sides in preprocess.py.
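Roughly, the slicing works something like this (a simplified sketch, not the exact code from the repo; the function name and the hop length of 200 are just for illustration):

```python
import random

# Hypothetical values matching the default config discussed above.
sample_frames = 40       # mel frames fed to the conditioning network (rnn1)
audio_slice_frames = 8   # middle frames whose upsampled output conditions rnn2
hop_length = 200         # audio samples per mel frame
pad = (sample_frames - audio_slice_frames) // 2   # (40 - 8) / 2 = 16

def sample_training_pair(mel, audio):
    """Slice one training example.

    `mel` has shape (num_frames, num_mels) and is assumed to already be
    padded by `pad` frames on both sides (as done in preprocess.py);
    `audio` is the corresponding un-padded waveform.
    """
    # Random window of sample_frames mel frames for rnn1.
    pos = random.randint(0, len(mel) - sample_frames)
    mel_slice = mel[pos:pos + sample_frames]

    # Only the middle audio_slice_frames frames are upsampled to condition
    # rnn2, so the target audio covers 8 * 200 = 1600 samples instead of
    # 40 * 200 = 8000. Because the mel was padded by `pad` frames, mel frame
    # (pos + pad) lines up with audio sample pos * hop_length.
    audio_slice = audio[pos * hop_length:(pos + audio_slice_frames) * hop_length]

    return mel_slice, audio_slice
```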

I hope that helps.

@wade3han
Author

wade3han commented Oct 25, 2019

Thanks for your reply!

I guessed that strange artifacts like the one below happen because of those hyperparameters. Haven't you seen those artifacts? I get them mainly at the beginning and end of the audio files.

[image: example of the artifacts]

@bshall
Owner

bshall commented Oct 25, 2019

No problem.

Are you using the pretrained model and the generate.py script? Also, what input audio are you using? Is it something from the ZeroSpeech corpus or your own?

I'm not getting any of those artifacts. For example, here's the original audio:
[image: orig]
And here's the reconstruction:
[image: gen]

@wade3han
Author

Well, I was training a new model from scratch on a Korean speech corpus. It contains 300 hours of utterances from various speakers, and I started getting those artifacts after I switched from audio_slice_frames=8 to audio_slice_frames=16. I believed a bigger audio_slice_frames could help training.

Actually, I'm not sure why those artifacts are generated... I will share it here if I figure out why. Please share your opinion if you have any ideas.

@dipjyoti92

@bshall I have one question about your first reply in this thread. Instead of using 40 mel frames, why not feed just 8 mel frames into the rnn1 layer?

@bshall
Owner

bshall commented Dec 20, 2019

Hi @dipjyoti92, sorry about the delay, I've been away. I found that using only 8 frames as input to the rnn1 layer results in the generated audio being only silence. I think 8 frames is too short for the RNN to learn to use the reset gate appropriately, although I haven't investigated this thoroughly.

@macarbonneau

Hello @bshall!
Thank you for the awesome repo. Your code is very clean; I'm impressed. I'm playing a bit with your implementation and I have a question: why do you take the middle of the mel segment? Why not just the end? Is there a benefit to having the padding at the end?

Thank you!!

@bshall
Owner

bshall commented Apr 23, 2020

Hi @macarbonneau,

No problem, I'm glad you found the repo useful. I haven't tried using the end (or beginning) segments but there's no real reason it shouldn't work. The thinking behind using the middle segment was to match the training and inference conditions as much as possible. At inference time most of the input to the autoregressive part of the model (rnn2) will have context from the future and the past. So taking the middle segment is "closer" to what the network will see during inference. If you used the end segment, for example, the autoregressive component wouldn't have future context at training time and the mismatch might cause problems during generation.
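As a rough illustration of the context argument (just the default numbers from the config, not code from the repo):

```python
sample_frames = 40       # mel frames seen by rnn1
audio_slice_frames = 8   # frames whose audio is actually trained on
pad = (sample_frames - audio_slice_frames) // 2   # 16

# Middle slice: frames 16..23 are conditioned with 16 frames of past
# context (0..15) and 16 frames of future context (24..39), similar to
# what rnn2 sees at inference time.
middle = range(pad, pad + audio_slice_frames)
print(list(middle))      # [16, 17, ..., 23]

# End slice: frames 32..39 would have 32 frames of past context but no
# future context, a mismatch with inference conditions.
end = range(sample_frames - audio_slice_frames, sample_frames)
print(list(end))         # [32, 33, ..., 39]
```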

Hope that explains my thinking. If anything is unclear let me know.

One of the negative side effects of only using the middle segment is that there are sometimes small artifacts at the beginning or end of the generated audio. For the best quality it might be worth putting in some extra time to train on the entire segment.
