
Questions regarding related paper. #1

Open
leonardltk opened this issue Sep 11, 2017 · 14 comments

Comments

@leonardltk

Hi,

I read your paper, Speaker Diarization using Deep Recurrent Convolutional Neural Networks for speaker embedding. The details were very clear regarding the convolutional part.
But for the 2 recurrent blocks, how many neurons did you use?

How did you flatten the 2nd recurrent layer to connect it with the fully connected layer?

The fully connected embedding layer is for the embedding only, which means it is connected to a further classification layer. For that classification layer, do you mix the classes among the different datasets?

Thank you.

@cyrta
Owner

cyrta commented Sep 15, 2017

Hi,

Thanks for your questions.

I hope these diagrams will help you:

[image: architecture]

[image: architecture-detailed]

Yes, I mix the classes. I add them up in order to create one large set of labels, each representing one speaker.
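For anyone trying to reproduce this from the diagrams, here is a minimal Keras sketch of a convolutional front-end followed by two recurrent blocks, a fully connected embedding layer, and a softmax over the merged speaker labels. The filter counts, GRU widths, embedding size, and speaker count are assumptions for illustration, not values taken from the paper.

```python
# Minimal sketch of a conv + recurrent speaker-embedding network of this kind.
# Layer counts, filter sizes, GRU widths and the embedding dimension are
# illustrative assumptions, not values from the paper.
from tensorflow.keras import layers, models

n_freq, n_frames, n_speakers = 96, 96, 150       # assumed spectrogram size and merged label count

inp = layers.Input(shape=(n_freq, n_frames, 1))  # (freq, time, channel)
x = inp
for filters in (32, 64, 128, 256):               # four conv blocks (assumed)
    x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x) # halves freq and time each block

# Keep the reduced time axis as the sequence axis and flatten freq x channels.
x = layers.Permute((2, 1, 3))(x)                              # (time, freq, chan)
x = layers.Reshape((x.shape[1], x.shape[2] * x.shape[3]))(x)  # (time_steps, features)

# Two recurrent blocks; the unit count is an assumption.
x = layers.GRU(256, return_sequences=True)(x)
x = layers.GRU(256)(x)                           # keep only the final state

embedding = layers.Dense(512, name="embedding")(x)               # embedding layer
out = layers.Dense(n_speakers, activation="softmax")(embedding)  # merged speaker labels

model = models.Model(inp, out)
model.summary()
```

At test time the embedding would be read from the penultimate `Dense` layer rather than from the softmax output.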

@cyrta cyrta closed this as completed Sep 15, 2017
@cyrta cyrta reopened this Sep 15, 2017
@hbredin

hbredin commented Sep 18, 2017

I am the developer of pyannote.audio, which is also based on Yaafe for (MFCC) feature extraction and Keras for (LSTM-based) embedding.

It would be great if you could share both your Yaafe featureplan and your Keras model so that anyone can easily reproduce your great work. Is this something you'd be willing to do?

@hbredin

hbredin commented Sep 20, 2017

To clarify my thoughts: I think your architecture is a good candidate for the triplet loss paradigm used in https://github.com/hbredin/TristouNet.

I'd be happy to collaborate with you on this if you are interested.
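For readers unfamiliar with the idea: the triplet loss pulls an anchor and a positive example (same speaker) together while pushing a negative (different speaker) away by at least a margin. A minimal sketch of the loss itself, not TristouNet's actual implementation, and with an arbitrary margin value:

```python
# Minimal triplet-loss sketch (not the TristouNet implementation).
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on embedding distances: d(a, p) should be smaller
    than d(a, n) by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, d_ap - d_an + margin)

# Toy usage with three random 512-dimensional embeddings.
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=512) for _ in range(3))
print(triplet_loss(a, p, n))
```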

@cyrta
Owner

cyrta commented Sep 20, 2017

Hi @hbredin
I am really glad that you wrote. I know your work and admire it.

I am okay with opening the code, or a portion of it.
I am currently working on extending the paper and method for ICASSP, so after submitting it I'll publish something.

Let's stay in touch.
I'll write you an email regarding the collaboration.

@dieka13

dieka13 commented Oct 27, 2017

Hello, @cyrta
I would like to ask some questions about the paper:

  1. What do you mean by SFTF? Is that the Short-Time Fourier Transform?
  2. Regarding "3.072 seconds (96 frames of 512 audio samples)": doesn't this mean the audio is already at 16 kHz? But the paper does the downsampling right after that step.

Thank you in advance.

@cyrta
Owner

cyrta commented Oct 30, 2017

Hi @dieka13

  1. Yes, it is the Short-Time Fourier Transform (STFT).

  2. The audio preprocessing sentence is unfortunately a bit ambiguous.
    Let me explain it:
    a. We downsample the input audio stream to 16 kHz,
    b. then we segment it into frames of 512 samples every 256 samples (50% hop).
    c. Each frame is then multiplied by a Hamming window,
    d. and each windowed frame goes through the STFT to obtain a spectral representation.
    e. This output is put into a "spectral data" buffer.

    f. We take 96 frames from this "spectral data" buffer as one input to the network,
    g. then shift by 8 frames in the stream (256 ms) and put the next portion of 96 frames into the input.
    h. Repeat until the stream ends.
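A rough NumPy/SciPy sketch of steps a through h, using the values given above (512-sample frames, 256-sample hop, 96-frame blocks shifted by 8 frames); the paper's exact pipeline may of course differ:

```python
# Sketch of the preprocessing described above (steps a-h).
# Parameter values follow the comment; the paper's exact pipeline may differ.
import numpy as np
from scipy.signal import resample_poly

def spectral_stream(audio, orig_sr, target_sr=16000,
                    frame_len=512, hop=256, block_frames=96, block_shift=8):
    # a. downsample to 16 kHz
    audio = resample_poly(audio, target_sr, orig_sr)

    # b-e. frame, window with Hamming, STFT, collect into a spectral buffer
    window = np.hamming(frame_len)
    n_frames = 1 + (len(audio) - frame_len) // hop
    spectra = np.stack([
        np.abs(np.fft.rfft(audio[i * hop:i * hop + frame_len] * window))
        for i in range(n_frames)
    ])                                            # shape: (n_frames, frame_len // 2 + 1)

    # f-h. slide a 96-frame block over the buffer, shifting by 8 frames each time
    blocks = [spectra[start:start + block_frames]
              for start in range(0, n_frames - block_frames + 1, block_shift)]
    return np.stack(blocks)                       # (n_blocks, 96, 257)
```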

@dieka13

dieka13 commented Oct 31, 2017

Ah, I see, it's clear now.
I hope you don't mind if I ask additional questions:

  1. Where does the resulting size of N × 1 × 15 come from?
  2. How do you apply the CQT one?

Thanks again, @cyrta

@venkatesh-1729

venkatesh-1729 commented Nov 22, 2017

@cyrta What is the input shape of the network?

@dieka13

dieka13 commented Nov 29, 2017

@venkatesh-1729 Mine is (96, 96): 96 mel bands, 96 frames, when using the mel spectrogram feature.
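If it helps others reproduce this kind of input, a short librosa sketch that produces a (96, 96) log-mel patch; the file name, FFT size, and hop length here are assumptions, not necessarily what @dieka13 used:

```python
# Hypothetical (96 mel bands x 96 frames) input patch; FFT/hop sizes are assumptions.
import librosa

y, sr = librosa.load("speech.wav", sr=16000)     # "speech.wav" is a placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=256, n_mels=96)
log_mel = librosa.power_to_db(mel)

patch = log_mel[:, :96]                          # one (96, 96) network input
print(patch.shape)                               # (96, 96)
```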

@leonardltk
Author

@dieka13 But using 96x96 gives only Nx1x1 after pooling 4 times. You can't get the sequence of Nx1x15. The only way to get a sequence length of 15 is if the input shape is 96x1440. But this doesn't make sense either, as it would then have a receptive field of about 23 seconds. It's rare that anyone talks that long in AMI.

@leonardltk
Author

@dieka13 @venkatesh-1729 However, based on one of his replies above, I think he does input 96x96. But to get a sequence length of 15, he shifts by 8 frames 15 times. This gives a receptive field of about 3.42 seconds, which makes more sense.
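A back-of-the-envelope check of that figure, assuming 512-sample frames with a 256-sample hop at 16 kHz as in the preprocessing comment above; the exact result depends on how many shifts you count, but it lands in the same ballpark:

```python
# Rough receptive-field check: 15 windows of 96 frames, each shifted by 8 frames.
# Frame and hop values are taken from the preprocessing comment above.
sr, frame_len, hop = 16000, 512, 256
block_frames, block_shift, n_blocks = 96, 8, 15

total_frames = block_frames + (n_blocks - 1) * block_shift    # 96 + 14 * 8 = 208
total_samples = (total_frames - 1) * hop + frame_len           # overall span in samples
print(total_samples / sr)                                      # ~3.3 s, close to the ~3.4 s above
```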

@dieka13

dieka13 commented Nov 29, 2017

@leonardltk Yes, to get around that I use 3x3 pooling in the last CNN layer so there will be some sequence left to pass to the RNN layers. I'm in the middle of completing the evaluation phase, so if my approach doesn't turn out satisfactory I'll try yours. I hope the author gives more information regarding this input size.

@leonardltk
Author

@dieka13 I think even if you use (3,3) pooling on the last CNN layer, you only get a sequence length of 2, right? It might be difficult for the RNN to learn much. But do let me know your results! I managed to build my method; I will test it soon.
May I know how you get 150 speaker classes from AMI? From http://groups.inf.ed.ac.uk/ami/corpus/participantids.shtml & http://groups.inf.ed.ac.uk/ami/corpus/signals.shtml I could only get 186 unique speakers.

@leonardltk
Author

@cyrta Could you shed some light on how you get the 150 unique speakers from the AMI dataset?
