Wavesplit implementation #70

Open · wants to merge 7 commits into master

Conversation

mpariente
Collaborator

Samu (@popcornell) has been trying to replicate the results in the paper for a little while now, but couldn't get close to them.

There might be some things that we missed in the implementation or small mistakes we didn't notice.
Neil (@lienz), would you mind having a look at the code, please? That would be really great! The descriptions of the files are in the README of the recipe.

Note: the code is not in its final form; we will of course cite the paper when we merge it. This is just a draft to ask you for a review.

@lienz

lienz commented Apr 10, 2020

Hey Manuel, here are first answers for the model:

It is not clear if different encoders are used for the separation and speaker stacks (from the figure in the paper, it seems so).

Yes, the two are different. The separation stack uses residual blocks with the dilation reinitialised to 1 after each block, while the speaker stack is simply a stack of dilated convolutions with the dilation doubling at every layer.
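
To make the two dilation patterns concrete, here is a minimal sketch (the layer and block counts are illustrative, and residual connections, normalisation and activations are omitted):

```python
import torch.nn as nn

def dilated_stack(channels=512, kernel_size=3, n_blocks=4, layers_per_block=8,
                  reinit_dilation=True):
    """Sketch of a dilated Conv1d stack.

    reinit_dilation=True  -> separation-stack pattern: dilation restarts at 1 after each block.
    reinit_dilation=False -> speaker-stack pattern: one block, dilation doubling at every layer.
    """
    layers = []
    for _ in range(n_blocks if reinit_dilation else 1):
        for i in range(layers_per_block):
            dilation = 2 ** i
            layers.append(nn.Conv1d(channels, channels, kernel_size,
                                    dilation=dilation,
                                    padding=dilation * (kernel_size - 1) // 2))
    return nn.Sequential(*layers)

separation_stack_sketch = dilated_stack(reinit_dilation=True)   # dilations 1, 2, 4, ... then back to 1
speaker_stack_sketch = dilated_stack(reinit_dilation=False)     # dilations keep doubling
```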

What is the embedding dimension? It seems to be 512, but it is not explicit in the paper.

Yes it's 512 everywhere. 256 should not change much.

Which mask is used (sigmoid?)

There is no mask: Wavesplit directly predicts the output channels with a final conv1x1 that maps 512 channels to n, with n the number of speakers.
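
For reference, a minimal sketch of that output head (the batch size, signal length and n below are placeholders):

```python
import torch
import torch.nn as nn

n_src, channels = 2, 512
output_head = nn.Conv1d(channels, n_src, kernel_size=1)  # 512 feature channels -> n_src output channels

feats = torch.randn(4, channels, 16000)   # [batch, channels, time] from the separation stack
est_sources = output_head(feats)          # [batch, n_src, time]: the separated signals, no masking involved
```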

When the number of speakers in an example is smaller than the number of separation stack outputs, is the loss simply masked or is an embedding for silence used? (Probably masked)

Yes, we have an embedding for silence, considered as a speaker. However, in most experiments we assume the number of speakers is known, as in the standard settings of WSJ0-2mix or 3mix.
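
For illustration, one way this could be represented (a sketch; reserving the last index of the table for silence is an assumption, not necessarily what the paper does):

```python
import torch.nn as nn

n_speakers, emb_dim = 101, 512   # illustrative numbers
SILENCE_ID = n_speakers          # reserve one extra row of the table for the "silence" speaker
embedding_table = nn.Embedding(n_speakers + 1, emb_dim)
```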

Is VAD used in WSJ0-2mix / WHAM to determine speech activity at the frame level? Some files can have pauses as long as one second.

No, we use the raw mixture.

The loss right now is prone to going NaN, especially if we don't take the mean after computing the L2 distances.

By L2 distance, you mean between speaker vectors and embeddings? Why not train with the classification loss?

@HuangZiliAndy

Hi, thanks for your Wavesplit implementation! I think I will need to use some ideas from Wavesplit in my research. I wonder if you could share the results you have for now? (It doesn't matter if the numbers don't match the paper.) Thanks a lot!

@mpariente
Collaborator Author

For now, our implementation of Wavesplit doesn't work at all, sorry.
We'll probably dedicate some time to it in the coming weeks, but we can't be sure.

Would you like to help us implement it? It would be very welcome!

@HuangZiliAndy

Thanks for your reply. I am sorry, but I am not very familiar with separation and might not have enough time for this. However, if I get some positive results with the Wavesplit idea, I will certainly try to make a pull request.

Anyway, thanks for your code; I think this is the only repo I can find on GitHub that is relevant to this paper.

@HuangZiliAndy

Hi, I took a deeper look at your implementation and found two things.

(1) It seems that the sign of the speaker loss is wrong (this is a problem in the paper: the sign in equations (4) and (5) is wrong). However, it seems that changing only this part does not fully solve the problem.

(2) I think the input to the separation stack is different from the paper. The current input speaker embeddings to the separation stack have the shape [batch_size * num_srcs, spk_embed, frames], but I think the paper has some averaging (the speaker embeddings should be [batch_size * num_srcs, spk_embed]).

I still haven't gotten any positive results, but I think this information might be useful to you.

@mpariente
Collaborator Author

Yes, it is indeed useful, thanks a lot for reporting! Maybe @lienz can barge in on this?
By the way, we'll probably get back to it in the next few weeks.

@asteroid-team deleted a comment from popcornell Jun 30, 2020
@popcornell
Collaborator

popcornell commented Jun 30, 2020

(2) I think the input to the separation stack is different from the paper. The current input speaker embeddings to the separation stack have the shape [batch_size * num_srcs, spk_embed, frames], but I think the paper has some averaging (the speaker embeddings should be [batch_size * num_srcs, spk_embed]).

I had completely overlooked this part, and you are right, there is averaging.

I have addressed some of Neil's comments and actually have an updated version locally that I never pushed, as it needs some code refactoring etc. I'll probably give it a try this weekend.

@lienz

lienz commented Jul 1, 2020

Hi @HuangZiliAndy, what error are you referring to? There is indeed an averaging, AFTER computing the loss of the speaker stack.
The speaker loss is applied at every time step (so independently along the last 'frame' axis). The per-time-step optimal assignment derived from the permutation-invariant loss is then used to average the speaker vectors along the time axis, to obtain sequence-wide vectors. To illustrate (and as shown in Figure 1 of the paper), with the notation speaker_vector[time, channel] and embedding[id]: if speaker_vector[0, 0] is assigned to the same embedding[speaker_id] as speaker_vector[1, 1], you average them together even though they are not on the same channel. In brief, averaging is not done per channel but per closest speaker among the speakers of the sequence. You then get a [batch, n_src, channels] tensor with sequence-wide speaker vectors, reshape it as [batch, n_src * channels], and pass it to FiLM conditioning.
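
A toy sketch of that averaging step (the layout of the assignments tensor below is an assumption; in practice it comes out of the per-frame permutation-invariant speaker loss):

```python
import torch

def average_speaker_vectors(spk_vectors, assignments, n_src):
    """Average speaker vectors along time according to the per-frame optimal assignment.

    spk_vectors : [batch, n_src, emb_dim, frames]  speaker-stack outputs
    assignments : [batch, n_src, frames]           speaker slot (0..n_src-1) assigned to each
                                                   (channel, frame) by the per-frame PIT speaker loss
    Returns     : [batch, n_src, emb_dim]          sequence-wide speaker vectors
    """
    batch, _, emb_dim, _ = spk_vectors.shape
    out = spk_vectors.new_zeros(batch, n_src, emb_dim)
    for b in range(batch):
        per_frame = spk_vectors[b].permute(0, 2, 1)      # [n_src, frames, emb_dim]
        for s in range(n_src):
            mask = assignments[b] == s                   # frames (on any channel) assigned to speaker s
            if mask.any():
                out[b, s] = per_frame[mask].mean(dim=0)  # average across channels and time
    return out

# seq_vectors = average_speaker_vectors(h, assign, n_src)    # [batch, n_src, emb_dim]
# film_input = seq_vectors.reshape(seq_vectors.size(0), -1)  # [batch, n_src * emb_dim] for FiLM
```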

@HuangZiliAndy

Hi @lienz, I am referring to equations (4) and (5). In my understanding, we are trying to reduce d(h_t^j, E_{s_i}) and enlarge d(h_t^j, E_{s_k}), i.e. reduce the distance to the target speaker's embedding and enlarge the distances to the others. However, the loss as defined in (4) and (5) would actually become larger in that case. I think the sign in (4) and (5) is wrong. Please correct me if I am mistaken, thanks!

@lienz

lienz commented Jul 2, 2020

That's correct, that is a bad typo! We will update the paper to correct this, thanks for spotting it! We indeed wrote the log-probability instead of the loss (which is -log_prob): (4) and (5) give -loss_speaker instead of loss_speaker.
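
To make the corrected sign concrete, here is a minimal sketch of a distance-based classification variant of the per-frame speaker loss (one reading of the corrected equations, not the reference implementation): minimising it shrinks the distance to the target embedding and enlarges the distances to the other embeddings.

```python
import torch
import torch.nn.functional as F

def speaker_classification_loss(spk_vectors, embeddings, target_ids):
    """Per-frame speaker loss as -log softmax over negative squared distances.

    spk_vectors : [batch, frames, emb_dim]  speaker-stack outputs for one channel
    embeddings  : [n_speakers, emb_dim]     speaker embedding table
    target_ids  : [batch] (long)            true speaker index for that channel
    """
    # Squared L2 distance to every embedding: [batch, frames, n_speakers]
    diffs = spk_vectors.unsqueeze(2) - embeddings.view(1, 1, *embeddings.shape)
    dists = diffs.pow(2).sum(dim=-1)
    logits = -dists                                        # smaller distance -> larger logit
    targets = target_ids[:, None].repeat(1, spk_vectors.size(1))
    # Cross-entropy is -log p(target): this is the loss, not the log-probability
    return F.cross_entropy(logits.transpose(1, 2), targets)
```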

@lienz

lienz commented Jul 6, 2020

The new version of the paper is now online:
https://arxiv.org/pdf/2002.08933.pdf

@popcornell
Collaborator

popcornell commented Jul 15, 2020

Thank you very much @lienz.
I updated the Wavesplit implementation based on the new version of the paper.
Right now I am trying only the distance loss, which is not the best but should still get decent results.
Speaker dropout and speaker mixup are also still missing.

I have also tried using oracle embeddings (one-hot) and could not get very good results (somewhere in the -10 dB SDR loss ballpark on the development set).
Did you ever try training with fixed oracle embeddings?

I think the updated version of the paper is much clearer. However, IMO, it is not clear how you compute the SDR loss for all layers of the separation stack. Do you use, for every layer, the same output linear layer that maps from 512 to n speakers, in a shared fashion?

(I changed base branches because there seem to be some problems these days with GitHub not updating pull requests when I push new commits.)

@popcornell changed the base branch from master to Librimix_convtasnet_pretrained July 15, 2020 16:15
@popcornell changed the base branch from Librimix_convtasnet_pretrained to master July 15, 2020 16:15
@popcornell changed the base branch from master to Librimix_convtasnet_pretrained July 15, 2020 16:27
@popcornell changed the base branch from Librimix_convtasnet_pretrained to master July 15, 2020 16:27
@lienz

lienz commented Jul 20, 2020

Hey @popcornell, one-hot vectors instead of embeddings should be fine, but a bit worse than using the actual embeddings. Also, for the SDR loss we use a different Conv1x1 layer at each layer of the separation stack, mapping from 512 to n speakers. We tried sharing those parameters as in Nachmani et al., but it was better to have a different Conv1x1 for each layer. Hope that helps!
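
For clarity, a sketch of what a different Conv1x1 per layer could look like (the layer count and n_src below are placeholders):

```python
import torch.nn as nn

n_layers, channels, n_src = 10, 512, 2
# One independent 1x1 output head per separation-stack layer; each head maps that layer's
# 512-channel intermediate representation to n_src signals for its own SDR loss term.
# Parameters are NOT shared across layers.
per_layer_heads = nn.ModuleList(
    [nn.Conv1d(channels, n_src, kernel_size=1) for _ in range(n_layers)]
)
```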

@popcornell mentioned this pull request Feb 24, 2021