Questions regarding related paper. #1
I am the developer of pyannote.audio, which is also based on Yaafe for (MFCC) feature extraction and Keras for (LSTM-based) embedding. It would be great if you could share both your Yaafe feature plan and your Keras model so that anyone can easily reproduce your work. Is this something you'd be willing to do?
To clarify my thoughts: I think your architecture is a good candidate for the triplet-loss paradigm used in https://github.com/hbredin/TristouNet. I'd be happy to collaborate with you on this if you are interested.
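For context, the triplet loss mentioned above can be sketched in a few lines. This is a generic illustration, not TristouNet's actual implementation; the margin and embedding values are arbitrary:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pushing the anchor closer to the positive than the negative."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to same-speaker embedding
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to different-speaker embedding
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings: a and p are close, n is far away.
a = np.array([0.0, 1.0])
p = np.array([0.0, 0.9])
n = np.array([1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 — the positive is already closer by more than the margin
```

Training then minimizes this loss over many (anchor, positive, negative) triplets so that same-speaker embeddings cluster together.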
Hi @hbredin, I am okay to open the code, or a portion of it. Let's be in touch.
Hello, @cyrta
thank you in advance

Hi @dieka13
Ah, I see, it's concise now.
Thanks again, @cyrta
@cyrta What is the input shape of the network?
@venkatesh-1729 Mine is (96, 96): 96 mel bands by 96 frames, when using the mel-spectrogram feature.
@dieka13 But using 96x96 gives only Nx1x1 after pooling four times, so you can't get a sequence of Nx1x15. The only way to get a sequence length of 15 is an input shape of 96x1440, but that doesn't make sense either, as it would imply a receptive field of about 23 seconds. It's rare that anyone talks that long in AMI.
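The shape arithmetic above is easy to check. A pooling factor of 3 over four layers is an assumption inferred from the thread (it is the factor that turns 96 frames into a length-1 time axis); note that with valid convolutions the real network would trim a few extra frames, so 1440 frames lands near, but not exactly at, 15:

```python
def pooled_length(n_frames, pool=3, n_layers=4):
    """Length of the time axis after repeated pooling (floor division)."""
    for _ in range(n_layers):
        n_frames //= pool
    return n_frames

print(pooled_length(96))    # 1   (96 -> 32 -> 10 -> 3 -> 1)
print(pooled_length(1440))  # 17  (1440 -> 480 -> 160 -> 53 -> 17)
```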
@dieka13 @venkatesh-1729 However, based on one of his replies above, I think he does input 96x96, but to get a sequence length of 15 he shifts the window by 8 frames, 15 times. This gives a receptive field of about 3.42 seconds, which makes more sense.
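That receptive-field estimate follows from simple window arithmetic. The frame hop used below (~16 ms) is a typical value, not one stated in the thread, so the seconds figure is approximate:

```python
window = 96      # frames per input patch (from the thread)
shift = 8        # frame shift between consecutive patches
n_patches = 15   # sequence length fed to the RNN

# Total span in frames covered by the 15 overlapping patches.
total_frames = window + (n_patches - 1) * shift
print(total_frames)  # 208

# At a typical ~16 ms frame hop this is roughly 208 * 0.016 ≈ 3.3 s,
# in the same ballpark as the ~3.42 s quoted above; the exact figure
# depends on the actual hop and frame length, which are not given here.
```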
@leonardltk Yes, to get around that I use 3x3 pooling in the last CNN layer so there will be some sequence to pass to the RNN layers. I'm in the middle of the evaluation phase, so if my approach doesn't turn out satisfactory I'll try yours. I hope the author gives more information regarding this input size.
@dieka13 I think even with (3,3) pooling on the last CNN layer you only get a sequence length of 2, right? It might be difficult for the RNN to learn much from that. But do let me know your result! I managed to build my method and will test it soon.
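For reference, one common way to hand a small pooled CNN output to an RNN, as discussed in the two comments above, is to keep the time axis as the sequence and collapse the frequency and channel axes into the per-step feature vector. A minimal numpy sketch with illustrative sizes (the real channel count and axis order are not given in the thread):

```python
import numpy as np

# Hypothetical pooled CNN output: (time=3, freq=3, channels=128).
# All sizes here are illustrative, not taken from the paper.
cnn_out = np.zeros((3, 3, 128))

# Keep the time axis as the RNN sequence; merge frequency and channels
# into one feature vector per time step.
seq = cnn_out.reshape(cnn_out.shape[0], -1)
print(seq.shape)  # (3, 384)
```

In Keras this is typically done with a `Reshape` layer between the last pooling layer and the first recurrent layer.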
@cyrta Could you shed light on how you got the 150 unique speakers from the AMI dataset?
Hi,
I read your paper, Speaker Diarization using Deep Recurrent Convolutional Neural Networks for speaker embedding. The details were very clear regarding the convolutional part.
But for the two recurrent blocks, how many units did you use?
How did you flatten the second recurrent layer to connect it to the fully connected layer?
The fully connected embedding layer is for the embedding only, meaning it is connected to a further classification layer. For that classification layer, do you mix the classes among the different datasets?
Thank you.
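On the flattening question above, two common options for connecting a recurrent layer to a fully connected layer are taking only the last time step or flattening the whole output sequence. A sketch with made-up sizes (the paper's actual unit counts are exactly what the question asks about, so these are placeholders):

```python
import numpy as np

# Hypothetical output of the second recurrent layer:
# 15 time steps with 256 units each (sizes are illustrative only).
rnn_out = np.zeros((15, 256))

# Option A: feed only the last time step to the fully connected layer.
last_step = rnn_out[-1]
print(last_step.shape)  # (256,)

# Option B: flatten the entire sequence into one vector.
flat = rnn_out.reshape(-1)
print(flat.shape)  # (3840,)
```

In Keras terms, option A corresponds to a recurrent layer with `return_sequences=False`, and option B to `return_sequences=True` followed by a `Flatten` layer.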