Update the Enh framework to support training with variable numbers of speakers #5414
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #5414      +/-   ##
==========================================
- Coverage   77.17%   75.29%   -1.89%
==========================================
  Files         684      708      +24
  Lines       62686    65085    +2399
==========================================
+ Hits        48379    49006     +627
- Misses      14307    16079    +1772
==========================================

Flags with carried forward coverage won't be shown.
... and 51 files with indirect coverage changes.
espnet2/train/preprocessor.py
Outdated
]
length = speech_refs[0].shape[0]
tgt_length = self.speech_segment
assert length >= self.speech_segment, (length, tgt_length)
One thing worth discussing is whether we should allow the input signal to be shorter than the specified speech_segment.
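For illustration, here is a minimal sketch of how both behaviors could coexist; `get_segment` and `allow_shorter` are hypothetical names for this discussion, not the PR's actual API:

```python
import numpy as np

def get_segment(speech_ref, speech_segment, allow_shorter=False):
    """Sketch of fixed-length segment selection (hypothetical helper).

    If the signal is shorter than ``speech_segment``, either reject it
    (the current behavior in the PR) or right-pad it with zeros.
    """
    length = speech_ref.shape[0]
    if length < speech_segment:
        if not allow_shorter:
            # current behavior: reject too-short inputs
            raise ValueError(f"Input ({length}) shorter than segment ({speech_segment})")
        # alternative behavior: right-pad with zeros up to the segment length
        return np.pad(speech_ref, (0, speech_segment - length))
    # randomly choose a start offset for the fixed-length segment
    start = np.random.randint(0, length - speech_segment + 1)
    return speech_ref[start : start + speech_segment]
```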
Force-pushed from baf5d67 to 792329b.
@simpleoier, can you review this PR?
@LiChenda, can you also review this PR?
Sure, I'll review it.
Note:
    All audios in each sample must have the same sampling rate.
    Audios of different lengths in each sample will be right-padded with np.nan.
Why is it padded with np.nan here?
It is just in case the signals defined in multiple columns have different lengths. These nan values will be removed (or ensured to be absent) in the preprocessor, e.g.,
espnet/espnet2/train/preprocessor.py
Lines 1162 to 1171 in 1bba669
# Divide the stacked signals into single speaker signals for consistency
for i in range(num_spk - 1, -1, -1):
    idx = str(i + 1)
    # make sure no np.nan paddings are in the data
    assert not np.isnan(np.sum(data[sref_name][i])), uid
    data[self.speech_ref_name_prefix + idx] = data[sref_name][i]
    if dref_name in data:
        # make sure no np.nan paddings are in the data
        assert not np.isnan(np.sum(data[dref_name][i])), uid
        data[self.dereverb_ref_name_prefix + idx] = data[dref_name][i]
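To make the padding convention concrete, here is a minimal sketch of the pad-then-unpad round trip; `stack_refs_with_nan` and `split_refs` are hypothetical names for illustration, not ESPnet functions:

```python
import numpy as np

def stack_refs_with_nan(refs):
    """Right-pad variable-length reference signals with np.nan and stack them.

    np.nan is a convenient sentinel because real audio samples are finite,
    so any leftover padding is easy to detect downstream.
    """
    max_len = max(r.shape[0] for r in refs)
    out = np.full((len(refs), max_len), np.nan)
    for i, r in enumerate(refs):
        out[i, : r.shape[0]] = r
    return out

def split_refs(stacked):
    """Inverse step: recover per-speaker signals and drop the nan padding."""
    refs = []
    for row in stacked:
        row = row[~np.isnan(row)]
        # mirrors the preprocessor's assertion: no nan may remain
        assert not np.isnan(np.sum(row))
        refs.append(row)
    return refs
```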
espnet2/tasks/enh.py
Outdated
"--speech_segment",
type=int_or_none,
default=None,
help="Truncate the audios to the specified length if not None",
Truncate the audio to the specified length (If I understand correctly, the length is in sampling points?)
Yes, it's the number of samples. I have updated the help string.
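Since the length is in samples rather than seconds, a duration must be converted with the sampling rate before being passed as `--speech_segment`; a trivial hypothetical helper:

```python
# Hypothetical conversion helper (not part of the PR): --speech_segment
# expects a number of samples, so convert a duration using the sampling rate.
def seconds_to_samples(seconds, sampling_rate):
    return int(seconds * sampling_rate)
```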
I left some comments on this PR. Do you have a plan to add related recipes with variable numbers of speakers?
help="Truncate the audios to the specified length if not None",
)
group.add_argument(
    "--avoid_allzero_segment",
Is it necessary to provide this option? I feel the default True is fine.
Sometimes we may want to train the model to generate silence, and sometimes we may want to avoid silence as the training target, depending on the loss function we use. So I made this an argument to enable both possibilities.
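A minimal sketch of what such a segmentation step could look like, assuming hypothetical names (`choose_segment_start`, `max_tries`) that are not the PR's actual implementation:

```python
import numpy as np

def choose_segment_start(speech_refs, speech_segment,
                         avoid_allzero_segment=True, max_tries=10):
    """Pick a random start offset for a fixed-length segment (sketch).

    If ``avoid_allzero_segment`` is True, retry until no reference signal
    is all-zero inside the chosen segment (up to ``max_tries`` attempts).
    """
    length = speech_refs[0].shape[0]
    for _ in range(max_tries):
        start = np.random.randint(0, length - speech_segment + 1)
        if not avoid_allzero_segment:
            return start
        # keep this candidate only if every reference has energy in it
        if all(np.any(ref[start : start + speech_segment]) for ref in speech_refs):
            return start
    return start  # fall back to the last candidate
```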
espnet2/train/dataset.py
Outdated
        f"Sampling rates are mismatched: {self.rate} != {rate}"
    )
if not self.allow_multi_rates:
    if self.rate is not None and self.rate != rate:
How about reducing the two `if` statements into one?

if not self.allow_multi_rates and self.rate is not None and self.rate != rate:
OK
Yes, the recipe will be added in follow-up PRs soon.
Can we merge this PR?
Thanks a lot for your great efforts!
What?
This PR adds support for variable numbers of speakers when training speech separation models.

Major updates are added in:

- `--variable_num_refs` is added to specify whether we are training models with variable numbers of speakers in each mixture sample.
- A new format of `spk1.scp`, `dereverb1.scp`, and `enroll_spk1.scp` is introduced. Each row may contain variable columns, where each column starting from the 2nd column represents a different reference speaker.
- A new data type `variable_columns_sound` is added to espnet2/train/dataset.py and espnet2/train/iterable_dataset.py. It is used to load `spk1.scp` and `dereverb1.scp` (if provided) when `variable_num_refs` is true. `enroll_spk1.scp` is loaded as the `text` type, so there is no problem if the enrollment audios of the same sample have different lengths.
- `speech_segment: Optional[int] = None` is added to `EnhPreprocessor` and `TSEPreprocessor` to allow loading only a fixed-length segment from each audio. This is useful if we want to control the balance between data of different lengths.
- `avoid_allzero_segment: bool = True` is added to `EnhPreprocessor` and `TSEPreprocessor`. This argument is only useful when `speech_segment` is specified. If it is True, the segmentation process will ensure that all reference audios in the selected segment are not all-zero.
- `flexible_numspk: bool = False` is added to `EnhPreprocessor` and `TSEPreprocessor` to handle the variable-speaker-number inputs in `speech_ref1`, `dereverb_ref1`, and `enroll_ref1`.
- `flexible_numspk` is added to handle training with variable numbers of speakers.

In addition, some updates are added for future extension (no test is available now):

- `--allow_multi_rates` is added to allow loading audios with different sampling rates. This is useful if we want to build a universal model that can handle different sampling rates.
- `flexible_numspk` is added to handle training with variable numbers of speakers.
- `additional` is added to the `AbsExtractor` forward method. This makes it consistent with `AbsSeparator`, which is flexible for future extension.

Why?
Speech separation with an unknown number of speakers is a popular research topic but has not been supported in ESPnet due to its variable data size.
Note
Currently the inference and evaluation stages are not updated to support variable numbers of speakers. This part needs further discussion to make a good design.
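To illustrate the variable-column scp format described in the PR summary, here is a hypothetical example (utterance IDs and paths are made up) together with a minimal parser sketch, not the actual ESPnet reader:

```python
# Hypothetical example of the new variable-column spk1.scp format:
# the first column is the utterance ID; each remaining column is the
# path to one reference speaker's signal, and rows may differ in width.
scp_text = """\
uttid_1 /data/ref1_a.wav /data/ref1_b.wav
uttid_2 /data/ref2_a.wav /data/ref2_b.wav /data/ref2_c.wav
uttid_3 /data/ref3_a.wav
"""

def parse_variable_columns_scp(text):
    """Parse uttid -> list of reference paths (sketch, not the ESPnet loader)."""
    table = {}
    for line in text.strip().splitlines():
        uttid, *paths = line.split()
        table[uttid] = paths
    return table
```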