Add MixIT support. It is unsupervised only. Semi-supervised config is not available for now. #4619
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4619 +/- ##
==========================================
+ Coverage 83.07% 83.10% +0.03%
==========================================
Files 508 518 +10
Lines 43775 44777 +1002
==========================================
+ Hits 36364 37210 +846
- Misses 7411 7567 +156
def type(self):
    return "mixit"

def forward(self, ref, inf, others={}):
Can we include the constrained MixIT variant for speech enhancement?
for the others: see 4.2 Section in https://arxiv.org/pdf/2006.12701.pdf
Cool. Perhaps it would be better to leave that for later.
@popcornell Thank you for the comments. It is a bit different from the multi-condition training we used to use, because dynamic mixing is now involved, so the input to the model is already a mixture of mixtures. We need some other solution for semi-supervised training.
@@ -737,7 +737,7 @@ if ! "${skip_eval}"; then
        ${_ref_scp} \
        ${_inf_scp} \
        --ref_channel ${ref_channel} \
        --flexible_numspk True
Just a note: I changed this because boolean args cannot be passed this way with type=bool. If any non-empty value is passed, the variable is parsed as True, even when "False" is intended.
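To make the pitfall concrete, here is a minimal, self-contained demonstration (the argument name is reused from the diff above purely for illustration): with `type=bool`, argparse calls `bool()` on the raw string, and any non-empty string, including "False", is truthy.

```python
import argparse

# The problematic pattern: type=bool applies bool() to the raw string.
parser = argparse.ArgumentParser()
parser.add_argument("--flexible_numspk", type=bool, default=False)

args = parser.parse_args(["--flexible_numspk", "False"])
print(args.flexible_numspk)  # → True, because bool("False") is True
```

This is why a dedicated string-to-bool converter (or `action="store_true"`) is needed instead.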
I think it is better to redefine
group.add_argument("--flexible_numspk", type=bool, default=False)
as
from espnet2.utils.types import str2bool
...
group.add_argument("--flexible_numspk", type=str2bool, default=False)
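For reference, a minimal sketch of what such a `str2bool` converter can look like (this is an illustrative version, not necessarily ESPnet's exact implementation in `espnet2.utils.types`):

```python
import argparse

def str2bool(value: str) -> bool:
    """Convert strings like 'true'/'false' into a real boolean for argparse."""
    if value.lower() in ("true", "1", "yes"):
        return True
    if value.lower() in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"Boolean value expected, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--flexible_numspk", type=str2bool, default=False)
print(parser.parse_args(["--flexible_numspk", "False"]).flexible_numspk)  # → False
```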
egs2/TEMPLATE/enh1/enh.sh
spk_num=2              # Number of speakers
dynamic_mixing=false   # Flag for dynamic mixing in speech separation task.
ref_num=2              # Number of references (similar to speakers)
inf_num=               # Number of inferences output by the model
                       # If not specified, it will be the same as ref_num. If specified, it will be overwritten.
Note that all --spk_num arguments specified in enh1 recipes should be replaced after this change (e.g., egs2/chime4/enh1/run.sh).
Also, it is better to clarify the meaning of ref_num in different cases. For example:
- when a MixIT-based training strategy is used, ref_num means the number of reference signals, while each reference signal may contain more than one speaker.
- otherwise, ref_num is equivalent to the number of speakers in normal speech enhancement/separation tasks.
noise_type_num=1
dereverb_ref_num=1

# Training data related
use_dereverb_ref=false
use_noise_ref=false
use_preprocessor=false
Why is this argument removed?
I added the preprocessor_choices in tasks/enh.py to determine whether the preprocessor is used, so this argument is now redundant.
I see. So now this argument is specified in the configuration file.
right.
local/data_supervised.sh ${local_data_args}

sup_train_set="tr_"${min_or_max}_${sample_rate}
train_set="tr_"${min_or_max}_${sample_rate}_w_1spk_utt
What is the meaning of the suffix _w_1spk_utt?
The suffix means "with 1 speaker utterances".
espnet2/bin/enh_scoring.py
-group.add_argument("--flexible_numspk", type=bool, default=False)
+group.add_argument("--flexible_numspk", default=False, action="store_true")
Setting the type to str2bool seems a better option, as we can intuitively show how to use --flexible_numspk by passing True or False.
sounds good. I'll update it.
ref_tensor = torch.stack(ref[:num_ref], dim=1)  # (batch, num_ref, ...)
inf_tensor = torch.stack(inf, dim=1)  # (batch, num_inf, ...)
Does this solver only accept waveforms as input? I think a solver should be independent of the domain that a signal belongs to. Instead, the criterion defines whether time-domain or frequency-domain is expected.
Considering that complex-valued spectra may be used as input, we may need to care about the input data type (e.g., ComplexTensor vs torch.Tensor).
I added support for more input tensor shapes. The main step is in the einsum computation. But basically, the time domain is preferred due to the additivity in MixIT. Is there any special care necessary for ComplexTensor?
Yes, for ComplexTensor, torch_complex.functional.stack should be used instead of torch.stack, and torch_complex.functional.einsum should be used instead of torch.einsum. You could check the data type with isinstance(c, ComplexTensor).
A simple way to automatically handle the data type issue can be like:
from espnet2.enh.layers.complex_utils import einsum, stack
...
stack(...)
einsum(...)
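The idea behind such dispatch helpers can be sketched as follows. This is an illustrative version (helper names mirror the suggestion above, but this is not ESPnet's actual `complex_utils` code): route to torch_complex for ComplexTensor inputs and fall back to native torch otherwise, so callers never branch on dtype themselves.

```python
import torch

# torch_complex is optional in this sketch; without it we only handle
# native torch tensors (real or builtin complex).
try:
    from torch_complex import functional as FC
    from torch_complex.tensor import ComplexTensor
except ImportError:
    FC, ComplexTensor = None, ()  # isinstance(x, ()) is always False

def stack(seq, dim=0):
    """Stack tensors, dispatching ComplexTensor inputs to torch_complex."""
    if FC is not None and isinstance(seq[0], ComplexTensor):
        return FC.stack(seq, dim=dim)
    return torch.stack(seq, dim=dim)

def einsum(equation, *operands):
    """einsum that dispatches ComplexTensor operands to torch_complex."""
    if FC is not None and any(isinstance(op, ComplexTensor) for op in operands):
        return FC.einsum(equation, *operands)
    return torch.einsum(equation, *operands)
```

With this, the solver code can call `stack(...)` and `einsum(...)` uniformly regardless of the input data type.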
It might be better to add more unit tests to cover the case where the input is a complex-valued spectrum.
Now complex is supported.
espnet2/tasks/enh.py
)
use_preprocessor = getattr(args, "preprocessor", None) is not None

if train and use_preprocessor:
Why is the preprocessor only used for training? The preprocessor may have its own rules for handling training and inference data.
I looked at the usages of --use_preprocessor; it looks like it was only used in training, so I changed the logic accordingly. But we can keep it as it was.
We had better keep it for inference as well, in case we do not want to store the mixture audios and instead always mix the reference signals in the preprocessor.
espnet2/train/preprocessor.py
num_spk: int = 2,
num_utts: int = 2,
Maybe num_refs is better, if we want to add a new preprocessor for meeting-style long-form data (containing many utterances in each reference channel) in the future.
I changed it to ref_num, the same as the one used in enh.sh, trying to reduce similar confusing names across all files.
if mixture_source_name is None:
    self.mixture_source_name = f"{speech_ref_name_prefix}1"
else:
    self.mixture_source_name = mixture_source_name
Maybe add some comments to explain the meaning of this argument?
good idea.
I left some comments to discuss the current design.
LGTM!
I just want to make sure: you seem to have changed some option names (ref_num instead of spk_num).
If this does not break compatibility, there is no problem.
If it does, we may accept both option names and add a deprecation message (and remove the old name later).
loss = loss.mean()
perm = torch.index_select(all_mixture_matrix, 0, perm)

if perm.is_complex():
I think perm is always real.
I'm concerned that in lines 77-88, all_mixture_matrix may be complex if the inputs are complex. Hence, perm could be complex.
Is perm here the reordered estimate? It looks like it to me, because it comes after index_select.
If it is the reordered estimate, you might as well leave it complex, as it could be, say, the STFT representation (which should work in theory with MixIT, since it is additive).
Also, if it is the reordered estimate, maybe it is better to rename the variable, no?
No, it is the assignment matrix, not the reordered estimate.
I see your point now. Since you explicitly change the data type of all_mixture_matrix to inf_tensor.dtype, it can become a builtin complex tensor in PyTorch.
Yep it is ok
@pytest.mark.parametrize("inf_num", [4])
def test_MixITSolver_complex_tensor_forward(inf_num):
Could you add an additional test by using the builtin complex tensor in PyTorch when torch 1.9.0+ is used?
May I ask why 1.9.0+? For example, PyTorch 1.8.1 also supports complex.
I think 1.9.0+ added complex support for many torch.linalg operations, making it possible to migrate to native torch for, e.g., beamforming.
Gotcha! I have already applied the check for 1.9.0.
Yes, indeed. It is since 1.9.0 that the complex support has been greatly improved.
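A version gate like the one discussed above can be implemented with a small comparison helper. This is a hypothetical sketch (the helper names are mine, not ESPnet's actual check), using a plain tuple comparison so that, e.g., "1.10.0" correctly compares greater than "1.9.0":

```python
def version_tuple(v: str):
    """Parse a version string like '1.9.0+cu111' into a comparable tuple."""
    core = v.split("+")[0]  # drop local build suffixes such as '+cu111'
    return tuple(int(x) for x in core.split(".")[:3])

def supports_native_complex(torch_version: str) -> bool:
    # per the discussion above, torch.linalg gained broad complex
    # support in PyTorch 1.9.0
    return version_tuple(torch_version) >= (1, 9, 0)

print(supports_native_complex("1.8.1"))        # → False
print(supports_native_complex("1.9.0+cu111"))  # → True
```

In a test suite, such a helper is typically combined with `pytest.mark.skipif` so the complex-tensor test only runs on sufficiently new PyTorch versions.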
It looks fine now. I just left some minor comments.
@sw005320 I have checked all enh recipes and updated the corresponding arguments.
-kwargs["speech_ref{}".format(spk + 1)] for spk in range(self.num_spk)
+kwargs.get(
+    f"speech_ref{spk + 1}",
+    torch.zeros_like(kwargs["speech_ref1"]),
Is this torch.zeros_like(kwargs["speech_ref1"]) for the silent output channel in MixIT?
Yes. In MixIT, the number of references is less than num_spk.
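The padding idea from the diff above can be isolated into a small sketch (the function name and the toy batch are mine, for illustration only): any missing `speech_refN` entry is replaced by a silent signal of the same shape as `speech_ref1`, so extra model output channels can be matched to empty sources.

```python
import torch

def collect_refs(kwargs: dict, num_spk: int):
    """Gather speech references, padding missing ones with silence
    so MixIT can map extra output channels to empty sources."""
    return [
        kwargs.get(f"speech_ref{spk + 1}", torch.zeros_like(kwargs["speech_ref1"]))
        for spk in range(num_spk)
    ]

# toy batch with only 2 references but a 4-channel model
batch = {"speech_ref1": torch.ones(2, 8), "speech_ref2": torch.ones(2, 8)}
refs = collect_refs(batch, num_spk=4)
# refs[2] and refs[3] are all-zero (silent) channels with the same shape
```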
    f"speech_ref{spk + 1}",
    torch.zeros_like(kwargs["speech_ref1"]),
)
for spk in range(self.num_spk)
I see that num_spk is renamed to ref_num in some other files (e.g., espnet2/train/preprocessor.py, enh.sh). Maybe we can also rename it here if it's compatible with the past version.
It may need more work. Changing the name here would impact other files, like enh_inference.py and the enh_s2t tasks probably. I think the pre-trained enhancement models would be influenced, too.
Thanks, @simpleoier, and everyone contributing to this PR!
This PR is to replace the previous one #4263. It is mostly modified on top of the existing files to keep minimum new files.
Dynamic mixing is used to generate mixtures of mixtures (4 speakers). For training data, I also include 8000 1-speaker utterances from tr_min_8k/spk1.scp to generate 2-speaker mixtures. This sounds like pseudo "semi-supervised" training in terms of data usage. Actually, we can consider it as simply using some 1-speaker speech data, because only the MixIT loss is used for training.
For now, only one kind of loss wrapper (MixIT) is used. I don't have an easy solution for using both unsupervised (MixIT) and supervised (PIT) losses in the same training process.
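For readers unfamiliar with the objective, the MixIT loss can be sketched in a few lines (this is a toy reference implementation under an L2 loss, not ESPnet's solver): each of the M estimated sources is assigned to exactly one of the two input mixtures, all 2^M binary assignment matrices are enumerated, and the best re-mixing is kept.

```python
import itertools
import torch

def mixit_loss(est: torch.Tensor, mixtures: torch.Tensor) -> torch.Tensor:
    """est: (num_est, T) estimated sources; mixtures: (2, T) input mixtures."""
    num_est = est.shape[0]
    best = None
    # enumerate all 2^num_est assignment matrices A in {0,1}^(2 x num_est),
    # where each column sums to 1 (each estimate goes to one mixture)
    for choice in itertools.product([0, 1], repeat=num_est):
        A = torch.zeros(2, num_est)
        A[list(choice), list(range(num_est))] = 1.0
        remix = A @ est  # (2, T): estimates re-mixed into two mixtures
        loss = ((remix - mixtures) ** 2).mean()
        if best is None or loss < best:
            best = loss
    return best
```

Because the enumeration grows as 2^M, practical implementations keep M small (e.g., 4 outputs for mixtures of two 2-speaker mixtures) or use an einsum over a precomputed matrix of all assignments, as discussed in the review comments above.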
The preliminary results on wsj0_2mix/mixit_enh1 are as follows. The model was trained for 33 epochs and has not yet converged.