
paper details #3

Open
venkatesh-1729 opened this issue Nov 22, 2017 · 2 comments

> More precisely, we use activations from the last layer of neural network as speaker embeddings. We aggregate the sigmoid outputs by summing all outputs class-wise over the whole audio excerpt to obtain a total amount of activation for each entry and then normalizing the values by dividing them with the maximum value among classes. The analysis of those embeddings in time allows the system to detect speaker change and identify the newly appearing speakers by comparing the extracted and normalized embedding with those previously seen. If the cosine similarity metric between the embeddings is higher than a threshold, fixed at 0.4 after a set of preliminary experiments, the speaker is considered as new. Otherwise, we map its identity to the one corresponding to the nearest embedding.

Hi @cyrta, can you please elaborate on this paragraph from the paper? Below is my understanding; please correct me if I am wrong.

  1. We use activations from the last layer of the neural network as speaker embeddings. This is confusing because, given the network's loss function, the last layer would be a softmax layer. Or did you mean that there is a dense layer with sigmoid activations before the softmax layer, and its activations are used as speaker embeddings? What is the size of the embeddings being extracted?
  2. The speaker embeddings are then summed class-wise over the entire audio and normalized by dividing by the maximum value among all classes. I'm not sure about what happens after this. Is it that, if the distance between an extracted embedding and every previously obtained normalized embedding is greater than 0.4, it is treated as a new speaker, and otherwise it is mapped to the speaker of the nearest (i.e., most similar previously seen) embedding? (A sketch of this reading follows the list.)
  3. Also, the paper does not discuss how silent regions are treated, e.g., whether any voice activity detector is employed, even though this is part of the Diarization Error Rate.
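
Here is a minimal sketch of my reading of points 1–2 (the function names, the shapes, and the interpretation of 0.4 as a cosine *distance* threshold are my assumptions, not from the paper — the paper's wording says "similarity", which adds to my confusion):

```python
import numpy as np

def aggregate_embedding(sigmoid_outputs):
    """Sum the sigmoid outputs class-wise over the whole excerpt, then
    normalize by the maximum value among classes.
    sigmoid_outputs: array of shape (num_frames, num_classes)."""
    totals = sigmoid_outputs.sum(axis=0)   # total activation per class
    return totals / totals.max()           # divide by the max class value

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def assign_speaker(embedding, known_embeddings, threshold=0.4):
    """Return the index of the nearest previously seen speaker, or None
    if the segment is treated as a new speaker (my reading of the 0.4
    threshold as a cosine distance)."""
    if len(known_embeddings) == 0:
        return None
    distances = [cosine_distance(embedding, k) for k in known_embeddings]
    nearest = int(np.argmin(distances))
    return None if distances[nearest] > threshold else nearest
```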

Thanks.

cyrta (Owner) commented Nov 22, 2017

Hi @venkatesh-1729

  1. We do not use softmax but sigmoid activations. This may run counter to common practice; however, the output is then a softmax in time. (See the sketch after this list.)
    The size of the embedding depends on the number of speakers in our experiment: if you take recordings of 1000 speakers, the size would be 1000.
  2. We use the cosine distance to check whether a given segment differs from the last one.
  3. Silence or noise is one of the classes, so the embedding also carries the information needed for VAD.
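
For illustration, a minimal Keras sketch of such an output head (the input shape, the hidden layer, and all sizes below are placeholder assumptions, not the paper's actual architecture; only the per-class sigmoid output, with silence/noise as one of the classes, comes from this answer):

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_SPEAKERS = 1000              # e.g. 1000 training speakers -> 1000-dim embedding
NUM_CLASSES = NUM_SPEAKERS + 1   # silence/noise is one of the classes

# Placeholder input: (time, frequency) STFT-like features; the real
# shape is not specified in the paper.
inputs = keras.Input(shape=(None, 128))
x = layers.LSTM(256, return_sequences=True)(inputs)  # illustrative hidden layer
# Per-frame sigmoid outputs; these activations serve as the embeddings.
outputs = layers.Dense(NUM_CLASSES, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.summary()
```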

venkatesh-1729 (Author) commented

Hi @cyrta, thanks for the reply.

  1. So you use sigmoid activations for the embeddings and softmax activations for the loss function?
  2. Can you elaborate on how you identify speakers? I'm not able to wrap my head around this part:

> We aggregate the sigmoid outputs by summing all outputs class-wise over the whole audio excerpt to obtain a total amount of activation for each entry and then normalizing the values by dividing them with the maximum value among classes. The analysis of those embeddings in time allows the system to detect speaker change and identify the newly appearing speakers by comparing the extracted and normalized embedding with those previously seen. If the cosine similarity metric between the embeddings is higher than a threshold, fixed at 0.4 after a set of preliminary experiments, the speaker is considered as new. Otherwise, we map its identity to the one corresponding to the nearest embedding.

Also, what is the input shape to the network (i.e., the shape of the input STFT)? These details are not in the paper. If possible, could you share a Keras model summary to clear up the confusion?

Regards.
