
Wavesplit implementation #70

Open · wants to merge 7 commits into master
Conversation

mpariente
Collaborator

Samu (@popcornell) has been trying to replicate the results in the paper for a little while now, but couldn't get close to them.

There might be some things that we missed in the implementation or small mistakes we didn't notice.
Neil (@lienz), would you mind having a look at the code please? That would be really great! The descriptions of the files are in the README of the recipe.

Note: the code is not in its final form; we will of course cite the paper when we merge it. This is just a draft to ask you for a review.

@lienz

lienz commented Apr 10, 2020

Hey Manuel, here are some first answers about the model:

> Not clear if different encoders are used for the separation and speaker stacks (from the figure in the paper it seems so).

Yes, they are different. The separation stack uses residual blocks with the dilation re-initialised to 1 after each block, while the speaker stack is simply a stack of dilated convolutions with the dilation doubling at every layer.
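To make the two dilation patterns concrete, here is a minimal PyTorch sketch. The 512-channel width follows the discussion above, but the layer/block counts and kernel size are illustrative placeholders, not values from the paper:

```python
import torch.nn as nn


class DilatedResidual(nn.Module):
    """One residual layer of the separation stack (illustrative)."""

    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=pad)
        self.act = nn.PReLU()

    def forward(self, x):
        return x + self.act(self.conv(x))  # residual connection


def make_speaker_stack(channels=512, n_layers=14, kernel_size=3):
    # Speaker stack: a plain chain of dilated convs, dilation doubling at every layer.
    layers = []
    for i in range(n_layers):
        d = 2 ** i
        layers += [nn.Conv1d(channels, channels, kernel_size,
                             dilation=d, padding=d * (kernel_size - 1) // 2),
                   nn.PReLU()]
    return nn.Sequential(*layers)


def make_separation_stack(channels=512, n_blocks=4, layers_per_block=8,
                          kernel_size=3):
    # Separation stack: residual layers where the dilation is re-initialised
    # to 1 at the start of each block and doubles within the block.
    layers = []
    for _ in range(n_blocks):
        for i in range(layers_per_block):
            layers.append(DilatedResidual(channels, kernel_size, 2 ** i))
    return nn.Sequential(*layers)
```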

> What is the embedding dimension? It seems to be 512, but it is not explicit in the paper.

Yes it's 512 everywhere. 256 should not change much.

> Which mask is used (sigmoid?)

There is no mask; Wavesplit directly predicts the output channels with a final conv1x1 that maps 512 channels to n, where n is the number of speakers.
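In other words, the output layer can be sketched as simply as this (the batch size, signal length and n_src below are placeholders):

```python
import torch
import torch.nn as nn

n_src = 2                                             # number of speakers (placeholder)
output_layer = nn.Conv1d(512, n_src, kernel_size=1)   # no mask: direct mapping to n output channels

features = torch.randn(1, 512, 16000)                 # [batch, 512, time] from the separation stack
est_sources = output_layer(features)                  # [batch, n_src, time] source estimates
```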

> When there are fewer speakers in an example than separation stack outputs, is the loss simply masked, or is an embedding used for silence? (Probably masked.)

Yes, we have an embedding for silence, which is treated as a speaker. However, in most experiments we assume the number of speakers is fixed, as in the standard settings of WSJ0-2mix or 3mix.

> Is VAD used on WSJ0-2mix/WHAM to determine speech activity at the frame level? Some files can have pauses of up to one second.

No, we use the raw mixture.

> The loss is currently prone to going NaN, especially if we don't take the mean after computing the L2 distances.

By L2 distance, do you mean between the speaker vectors and the embeddings? Why not train with the classification loss?

@HuangZiliAndy

Hi, thanks for your Wavesplit implementation! I think I will need to use some of the ideas from Wavesplit in my research. I wonder if you could share the results you have so far? (It doesn't matter if the numbers don't match the paper.) Thanks a lot!

@mpariente
Collaborator Author

For now, our implementation of Wavesplit doesn't work at all, sorry.
We'll probably dedicate some time to it in the next weeks but cannot be sure about it.

Would you like to help us implement it? It would be very welcome!

@HuangZiliAndy

Thanks for your reply. I am sorry, I am not very familiar with separation and might not have enough time for this. But if I get some positive results with the Wavesplit idea, I will certainly try to make a pull request.

Anyway, thanks for your code; I think this is the only repo I can find on GitHub that is relevant to this paper.

@HuangZiliAndy

Hi, I took a deeper look at your implementation and found two things.

(1) It seems that the sign of the speaker loss is wrong (this is a problem with the paper: the sign in equations (4) and (5) is wrong). However, it seems that changing this part alone will not fully solve the problem.

(2) I think the input to the separation stack is different from the paper. The speaker embeddings currently fed to the separation stack have the shape [batch_size * num_srcs, spk_embed, frames], but I think the paper does some averaging (the speaker embeddings should be [batch_size * num_srcs, spk_embed]).

I still haven't got any positive results, but I think this information might be useful to you.

@mpariente
Collaborator Author

Yes, it is indeed useful, thanks a lot for reporting! Maybe @lienz can chime in on this?
By the way, we'll probably get back to it in the next few weeks.

@asteroid-team deleted a comment from popcornell Jun 30, 2020
@popcornell
Collaborator

popcornell commented Jun 30, 2020

> (2) I think the input to the separation stack is different from the paper. The speaker embeddings currently fed to the separation stack have the shape [batch_size * num_srcs, spk_embed, frames], but I think the paper does some averaging (the speaker embeddings should be [batch_size * num_srcs, spk_embed]).

I had completely overlooked this part, and you are right, there is averaging.

I have addressed some of Neil's comments and actually have an updated version locally that I never pushed, as it needs some code refactoring, etc. I'll probably give it a try this weekend.

@lienz

lienz commented Jul 1, 2020

Hi @HuangZiliAndy, what error are you referring to? There is indeed an averaging, AFTER computing the loss of the speaker stack.

The speaker loss is applied at every time step (so independently along the last 'frame' axis). The per-time-step optimal assignment derived from the permutation-invariant loss is then used to average the speaker vectors along the time axis, to obtain sequence-wide vectors. To illustrate (and as shown in Figure 1 of the paper), with the notation speaker_vector[time, channel] and embedding[id]: if speaker_vector[0, 0] is assigned to the same embedding[speaker_id] as speaker_vector[1, 1], you average them together even though they are not on the same channel. In brief, averaging is not done per channel but per closest speaker among the speakers of the sequence. You then get a [batch, n_src, channels] tensor of sequence-wide speaker vectors, which you reshape to [batch, n_src * channels] and pass to the FiLM conditioning.
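A minimal sketch of that aggregation step, assuming the per-frame optimal assignment has already been computed from the permutation-invariant speaker loss (the function name, tensor names and shapes below are illustrative, not the PR's actual code):

```python
import torch


def pool_speaker_vectors(spk_vectors, assignments):
    """Average per-frame speaker vectors along time according to the per-frame
    optimal assignment, so that vectors matched to the same speaker embedding
    are pooled together even when they come from different output channels.

    spk_vectors: [batch, n_src, emb_dim, frames]  per-frame speaker vectors
    assignments: [batch, n_src, frames]           output channel -> speaker index per frame
    returns:     [batch, n_src * emb_dim]         sequence-wide vectors for FiLM conditioning
    """
    batch, n_src, emb_dim, frames = spk_vectors.shape
    pooled = spk_vectors.new_zeros(batch, n_src, emb_dim)
    counts = spk_vectors.new_zeros(batch, n_src, 1)
    for b in range(batch):
        for t in range(frames):
            for ch in range(n_src):
                spk = assignments[b, ch, t]          # speaker this channel is matched to at frame t
                pooled[b, spk] += spk_vectors[b, ch, :, t]
                counts[b, spk] += 1
    return (pooled / counts.clamp(min=1)).reshape(batch, n_src * emb_dim)
```

The Python loops are only for readability; a real implementation would scatter/index_add over the time axis instead.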

@HuangZiliAndy

Hi @lienz, I think I am referring to equations (4) and (5). In my understanding, we are trying to reduce d(h_t^j, E_si) and enlarge d(h_t^j, E_sk) (reduce the distance to the target speaker and enlarge the distances to the other speakers). However, the loss defined in (4) and (5) would actually become larger. I think the sign in (4) and (5) is wrong. Please correct me if I am mistaken, thanks!

@lienz

lienz commented Jul 2, 2020

That's correct, that is a bad typo! We will update the paper to correct this, thanks for spotting it! We indeed wrote the log probability instead of the loss (which is -log_prob), so (4) and (5) are -loss_speaker instead of loss_speaker.
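For reference, a hedged sketch of the corrected sign, written as a softmax over negative distances as the discussion above suggests (this only illustrates the sign fix and is not necessarily the paper's exact equations (4)-(5)):

```latex
\log p\!\left(s_i \mid h_t^j\right)
  = -d\!\left(h_t^j, E_{s_i}\right)
    - \log \sum_{k} \exp\!\left(-d\!\left(h_t^j, E_{s_k}\right)\right),
\qquad
\mathcal{L}_{\text{speaker}}\!\left(h_t^j\right)
  = -\log p\!\left(s_i \mid h_t^j\right)
  = d\!\left(h_t^j, E_{s_i}\right)
    + \log \sum_{k} \exp\!\left(-d\!\left(h_t^j, E_{s_k}\right)\right).
```

With this sign, the loss decreases as the distance to the target speaker shrinks and as the distances to the other speakers grow.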

@lienz

lienz commented Jul 6, 2020

The new version of the paper is now online:
https://arxiv.org/pdf/2002.08933.pdf

@popcornell
Collaborator

popcornell commented Jul 15, 2020

Thank you very much @lienz.
I updated the Wavesplit implementation based on the new version of the paper.
Right now I am trying only the distance loss, which is not the best but should still give decent results.
Speaker dropout and speaker mixup are also still missing.

I have also tried using oracle embeddings (one-hot) and could not get very good results (somewhere in the -10 dB SDR loss ballpark on the development set).
Have you ever tried training with oracle fixed embeddings?

I think the updated version of the paper is much clearer. However, IMO, it is not clear how you compute the SDR loss for all layers of the separation stack. Do you use, for every layer, the output linear layer that maps from 512 to n speakers in a shared fashion?

(I changed the base branch back and forth because there seem to be some problems these days with GitHub not updating pull requests when I push new commits.)

@popcornell changed the base branch from master to Librimix_convtasnet_pretrained July 15, 2020 16:15
@popcornell changed the base branch from Librimix_convtasnet_pretrained to master July 15, 2020 16:15
@popcornell changed the base branch from master to Librimix_convtasnet_pretrained July 15, 2020 16:27
@popcornell changed the base branch from Librimix_convtasnet_pretrained to master July 15, 2020 16:27
@lienz

lienz commented Jul 20, 2020

Hey @popcornell, one-hot vectors instead of embeddings should be fine, but a bit worse than using the actual embeddings. Also, for the SDR loss, we use at each layer of the separation stack a different Conv1x1 layer that maps from 512 to n speakers. We tried sharing those parameters, as in Nachmani et al., but it was better to have a different Conv1x1 for each layer. Hope that helps!
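A minimal sketch of that design, assuming a PyTorch separation stack; the class name, the layer list and the channel width are placeholders, and the SDR loss itself is left out:

```python
import torch.nn as nn


class SepStackWithHeads(nn.Module):
    """Separation stack where every layer gets its own (non-shared) Conv1x1
    head mapping 512 channels to n_src waveforms, so an SDR-type loss can be
    applied to each layer's estimate during training."""

    def __init__(self, layers, n_src, channels=512):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # One independent Conv1x1 head per layer (not shared across layers).
        self.heads = nn.ModuleList(
            [nn.Conv1d(channels, n_src, kernel_size=1) for _ in layers]
        )

    def forward(self, x):
        per_layer_estimates = []
        for layer, head in zip(self.layers, self.heads):
            x = layer(x)
            per_layer_estimates.append(head(x))  # [batch, n_src, time]
        # Training would average an SDR loss over these estimates;
        # at inference only the last element is needed.
        return per_layer_estimates
```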

@popcornell mentioned this pull request Feb 24, 2021