
SampleRNN as audio feature extractor #18

iariav opened this issue May 7, 2018 · 1 comment


iariav commented May 7, 2018

hi,
this is more of a question than an issue -
I'm looking for a way to extract features from raw audio wav files and then use those features for different tasks such as voice recognition, voice activity detection, and the like, not for generative tasks.
I thought of somehow modifying a generative model like SampleRNN/WaveNet so it could be used only to encode the data into some feature space.
Can you give some pointers on what modifications I need to make to the model to achieve that? Has anyone already done this before?
Any help would be greatly appreciated.


Cortexelus commented May 18, 2018

You have a sequence (an audio clip) and want to classify it using an RNN (SampleRNN). Perhaps the output is a vector classifying speaker_id.

Often I see this done by running the RNN through the entire clip, then, using the final state of the RNN, adding more layers (fully connected, perhaps), and finally a softmax layer. If you have 10 speakers, your softmax layer is a vector of size 10. (You use cross-entropy loss because it's multiclass classification.)
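A minimal sketch of that setup, assuming PyTorch (the GRU cell, hidden size, and `num_speakers=10` are illustrative choices on my part, not taken from SampleRNN itself):

```python
import torch
import torch.nn as nn

class RNNSpeakerClassifier(nn.Module):
    """Run an RNN over the whole clip, classify from its final hidden state."""
    def __init__(self, hidden_size=512, num_speakers=10):
        super().__init__()
        # input_size=1: one raw audio sample per timestep
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_speakers)

    def forward(self, audio):          # audio: (batch, timesteps, 1)
        _, h_n = self.rnn(audio)       # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])        # logits: (batch, num_speakers)

model = RNNSpeakerClassifier()
loss_fn = nn.CrossEntropyLoss()        # applies the softmax internally
```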

Because there may be 100,000+ timesteps, a possible compute hurdle is the backpropagation through time. But in this case, instead of doing truncated BPTT (TBPTT) at each time step (as for generation), you only need to do one full BPTT at the end. So my guess is it should be faster than generative SampleRNN.

You need to take out the TBPTT and next-sample prediction at every time step. Instead, you wait until it's done reading the entire audio sequence. Get the final RNN state, connect it to the new layers, predict the speaker_id, then do a full BPTT through all the timesteps.
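Continuing the sketch above, the training step would then look something like this: one forward pass over the full clip, a single loss at the end, and one backward pass through all timesteps (no truncation):

```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(audio, speaker_ids):      # audio: (batch, timesteps, 1)
    optimizer.zero_grad()
    logits = model(audio)                # read the entire sequence first
    loss = loss_fn(logits, speaker_ids)  # loss only at the end, not per sample
    loss.backward()                      # one full BPTT through all timesteps
    optimizer.step()
    return loss.item()
```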

That's one place to start.

Often I see a bidirectional RNN (run a forward-time RNN AND a backward-time RNN, then concatenate the final states of both before the fully connected layers at the top) get better results for this kind of task.
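The bidirectional variant is a small change to the sketch above (again PyTorch, illustrative sizes): `bidirectional=True` runs the forward-time and backward-time RNNs for you, and you concatenate their final states before the fully connected layer:

```python
import torch
import torch.nn as nn

class BiRNNSpeakerClassifier(nn.Module):
    def __init__(self, hidden_size=512, num_speakers=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden_size,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, num_speakers)  # 2x: both directions

    def forward(self, audio):                    # audio: (batch, timesteps, 1)
        _, h_n = self.rnn(audio)                 # h_n: (2, batch, hidden_size)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # forward + backward final states
        return self.fc(h)
```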

I haven't seen SampleRNN specifically used for this. Normally I see people run conv nets on spectrograms for audio classification.
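For contrast, a rough sketch of that more common route, assuming torchaudio for the spectrogram (the CNN layout and the file path are made up for illustration):

```python
import torch
import torch.nn as nn
import torchaudio

spec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling copes with variable clip length
    nn.Flatten(),
    nn.Linear(32, 10),         # 10 speaker classes, as above
)

waveform, sr = torchaudio.load("clip.wav")    # hypothetical mono wav file
logits = cnn(spec(waveform).unsqueeze(0))     # (1, 10)
```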
