Update the Enh framework to support training with variable numbers of speakers #5414

Merged
merged 13 commits into from Oct 16, 2023

Conversation

Emrys365
Collaborator

@Emrys365 Emrys365 commented Aug 17, 2023

What?

This PR adds support for variable numbers of speakers when training speech separation models.

Major updates are added in

  • egs2/TEMPLATE/enh1/enh.sh
    • A new argument --variable_num_refs is added to specify whether we are training models with variable numbers of speakers in each mixture sample.
    • A new multi-column format is introduced for reference files, including spk1.scp, dereverb1.scp, and enroll_spk1.scp. Each row may contain a variable number of columns, where each column from the 2nd onward represents a different reference speaker.
  • espnet2/fileio/multi_sound_scp.py
    • A new sound loader is created to support loading different numbers of audios in different samples. The corresponding data type variable_columns_sound is added to espnet2/train/dataset.py and espnet2/train/iterable_dataset.py.
    • This is used to load spk1.scp and dereverb1.scp (if provided) when variable_num_refs is true.
    • Note that enroll_spk1.scp is loaded as the text type, so there is no problem if the enrollment audios of the same sample have different lengths.
  • espnet2/train/preprocessor.py
    • A new argument speech_segment: Optional[int] = None is added to EnhPreprocessor and TSEPreprocessor to allow only loading a fixed-length segment from each audio. This is useful if we want to control the balance between data of different lengths.
    • A new argument avoid_allzero_segment: bool = True is added to EnhPreprocessor and TSEPreprocessor. This argument is only useful when speech_segment is specified. If this argument is True, the segmentation process will ensure that all reference audios in the selected segment are not all-zero.
    • A new argument flexible_numspk: bool = False is added to EnhPreprocessor and TSEPreprocessor to handle the variable-speaker-number inputs in the speech_ref1, dereverb_ref1, and enroll_ref1.
  • espnet2/enh/espnet_model_tse.py
    • A new argument flexible_numspk is added to handle training with variable numbers of speakers.
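To make the multi-column reference format concrete, here is a minimal sketch of how such a file could be parsed. The function name read_multi_column_scp and the example IDs and paths are hypothetical, not taken from this PR; the sketch also assumes whitespace-separated fields and plain file paths without spaces.

```python
import io
from typing import Dict, List, TextIO


def read_multi_column_scp(f: TextIO) -> Dict[str, List[str]]:
    """Parse lines of the form '<utt_id> <ref1> [<ref2> ...]'.

    Hypothetical helper for illustration; not the ESPnet implementation.
    """
    table: Dict[str, List[str]] = {}
    for lineno, line in enumerate(f, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {lineno}: expected an ID and at least one path")
        utt_id, paths = fields[0], fields[1:]
        if utt_id in table:
            raise ValueError(f"line {lineno}: duplicate ID {utt_id!r}")
        table[utt_id] = paths
    return table


# Two samples with different numbers of reference speakers (made-up paths).
example = io.StringIO(
    "mix_2spk_001 s1/a.wav s2/a.wav\n"
    "mix_3spk_002 s1/b.wav s2/b.wav s3/b.wav\n"
)
refs = read_multi_column_scp(example)
num_spk = {utt_id: len(paths) for utt_id, paths in refs.items()}
print(num_spk)
```

The number of columns after the ID gives the number of speakers for that sample, which is why a single spk1.scp can describe mixtures of different sizes.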

In addition, some updates are added for future extension (no test is available now):

  • espnet2/tasks/abs_task.py
    • A new argument --allow_multi_rates is added to allow loading audios with different sampling rates. This is useful if we want to build a universal model that can handle different sampling rates.
  • espnet2/enh/espnet_model.py
    • A new argument flexible_numspk is added to handle training with variable numbers of speakers.
    • However, since the currently implemented speech separation models do not support generating variable numbers of outputs, this function is not tested.
  • espnet2/enh/extractor/abs_extractor.py
    • A new input argument additional is added to the AbsExtractor forward method. This makes it consistent with the AbsSeparator, which is flexible for future extension.

Why?

Speech separation with an unknown number of speakers is a popular research topic but has not been supported in ESPnet due to its variable data size.

Note

Currently the inference and evaluation stages are not updated to support variable numbers of speakers. This part needs further discussion to make a good design.

@Emrys365 Emrys365 added Recipe ESPnet2 SE Speech enhancement labels Aug 17, 2023
@mergify mergify bot added the CI Travis, Circle CI, etc label Aug 17, 2023
@codecov

codecov bot commented Aug 17, 2023

Codecov Report

Merging #5414 (6a4a1d9) into master (8457fe2) will decrease coverage by 1.89%.
Report is 270 commits behind head on master.
The diff coverage is 81.09%.

@@            Coverage Diff             @@
##           master    #5414      +/-   ##
==========================================
- Coverage   77.17%   75.29%   -1.89%     
==========================================
  Files         684      708      +24     
  Lines       62686    65085    +2399     
==========================================
+ Hits        48379    49006     +627     
- Misses      14307    16079    +1772     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 65.69% <ø> (-0.04%) ⬇️
test_integration_espnet2 48.89% <79.26%> (-0.19%) ⬇️
test_python_espnet1 19.15% <0.00%> (-0.80%) ⬇️
test_python_espnet2 51.26% <40.85%> (-1.05%) ⬇️
test_utils 23.10% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
espnet2/enh/espnet_model_tse.py 80.29% <100.00%> (+1.22%) ⬆️
espnet2/enh/extractor/abs_extractor.py 100.00% <100.00%> (ø)
espnet2/enh/extractor/td_speakerbeam_extractor.py 90.24% <100.00%> (ø)
espnet2/iterators/chunk_iter_factory.py 91.81% <100.00%> (+0.15%) ⬆️
espnet2/tasks/abs_task.py 77.01% <100.00%> (+0.21%) ⬆️
espnet2/tasks/enh.py 97.53% <100.00%> (+0.04%) ⬆️
espnet2/tasks/enh_tse.py 98.01% <100.00%> (+0.06%) ⬆️
espnet2/train/iterable_dataset.py 86.32% <ø> (ø)
espnet2/enh/espnet_model.py 85.52% <85.71%> (-0.13%) ⬇️
espnet2/train/dataset.py 71.42% <93.75%> (+0.99%) ⬆️
... and 2 more

... and 51 files with indirect coverage changes


]
length = speech_refs[0].shape[0]
tgt_length = self.speech_segment
assert length >= self.speech_segment, (length, tgt_length)
Collaborator Author

One thing worth discussing is whether we should allow the input signal to be shorter than the specified speech_segment.

@sw005320 sw005320 added this to the v.202312 milestone Aug 30, 2023
@mergify mergify bot added the README label Sep 27, 2023
@sw005320
Contributor

@simpleoier, can you review this PR?

@sw005320
Contributor

sw005320 commented Oct 5, 2023

@LiChenda, can you also review this PR?

@LiChenda
Contributor

LiChenda commented Oct 6, 2023

@LiChenda, can you also review this PR?

Sure, I'll review it.


Note:
All audios in each sample must have the same sampling rates.
Audios of different lengths in each sample will be right-padded with np.nan
Contributor

Why padded with np.nan here?

Collaborator Author

This is just in case the signals listed in different columns have different lengths. The preprocessor should either remove these nan values or verify that none are present, e.g.,

# Divide the stacked signals into single speaker signals for consistency
for i in range(num_spk - 1, -1, -1):
    idx = str(i + 1)
    # make sure no np.nan paddings are in the data
    assert not np.isnan(np.sum(data[sref_name][i])), uid
    data[self.speech_ref_name_prefix + idx] = data[sref_name][i]
    if dref_name in data:
        # make sure no np.nan paddings are in the data
        assert not np.isnan(np.sum(data[dref_name][i])), uid
        data[self.dereverb_ref_name_prefix + idx] = data[dref_name][i]
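As a rough illustration of why the np.isnan(np.sum(...)) check detects length mismatches, the sketch below stacks signals with NaN right-padding and then tests each row. The helper names stack_with_nan_padding and has_nan_padding are hypothetical and only mimic the behavior described in this thread; they are not the loader's actual code.

```python
import numpy as np


def stack_with_nan_padding(signals):
    """Stack 1-D signals into (num_signals, max_len), right-padding with NaN."""
    max_len = max(len(s) for s in signals)
    stacked = np.full((len(signals), max_len), np.nan, dtype=np.float64)
    for i, s in enumerate(signals):
        stacked[i, : len(s)] = s
    return stacked


def has_nan_padding(signal):
    # NaN propagates through the sum, so one scalar check covers the whole row.
    return bool(np.isnan(np.sum(signal)))


# Equal lengths: nothing is padded, so every row passes the check.
equal = stack_with_nan_padding([np.ones(4), np.zeros(4)])
# Unequal lengths: the shorter row carries NaN padding and is flagged.
ragged = stack_with_nan_padding([np.ones(4), np.ones(2)])
print([has_nan_padding(row) for row in equal])
print([has_nan_padding(row) for row in ragged])
```

Unlike zero padding, NaN padding cannot be mistaken for real silence, so a mismatch surfaces as a failed assertion instead of silently entering training.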

"--speech_segment",
type=int_or_none,
default=None,
help="Truncate the audios to the specified length if not None",
Contributor

Truncate the audio to the specified length (If I understand correctly, the length is in sampling points?)

Collaborator Author

Yes, it is the number of samples. I have updated the help string.
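Since the length is given in samples, a duration in seconds has to be converted using the sampling rate. A small sketch, where the helper name seconds_to_samples and the 4 s / 16 kHz values are illustrative assumptions:

```python
def seconds_to_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(duration_s * sample_rate_hz))


# Example: a 4-second segment at a 16 kHz sampling rate.
speech_segment = seconds_to_samples(4.0, 16000)
print(f"--speech_segment {speech_segment}")
```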

Contributor

@LiChenda LiChenda left a comment

I left some comments on this PR. Do you have a plan to add some related recipes which have variable number of speakers?

help="Truncate the audios to the specified length if not None",
)
group.add_argument(
"--avoid_allzero_segment",
Contributor

Is it necessary to provide this option? I feel the default True is fine.

Collaborator Author

Sometimes we may want to train the model to generate silence, and sometimes we may want to exclude silence from the training targets, depending on the loss function we use. So I added this argument to allow both possibilities.
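For readers unfamiliar with the option, the sketch below shows one way a fixed-length segment could be chosen so that no reference is all-zero inside it. This is an assumption about the general technique, not the preprocessor's actual code; the function names valid_segment_starts and choose_segment_start are hypothetical.

```python
import numpy as np


def valid_segment_starts(refs: np.ndarray, segment: int) -> np.ndarray:
    """Return every start index whose window is non-silent in all references.

    refs has shape (num_refs, num_samples).
    """
    num_refs, length = refs.shape
    assert 0 < segment <= length, (length, segment)
    # Prefix sums of the "sample is nonzero" mask give per-window counts cheaply.
    nonzero = (refs != 0).astype(np.int64)
    csum = np.concatenate(
        [np.zeros((num_refs, 1), dtype=np.int64), np.cumsum(nonzero, axis=1)],
        axis=1,
    )
    counts = csum[:, segment:] - csum[:, :-segment]
    return np.flatnonzero(np.all(counts > 0, axis=0))


def choose_segment_start(refs, segment, avoid_allzero=True, rng=None) -> int:
    """Pick a random start so that refs[:, start:start + segment] is the segment."""
    rng = np.random.default_rng() if rng is None else rng
    length = refs.shape[1]
    assert 0 < segment <= length, (length, segment)
    if not avoid_allzero:
        return int(rng.integers(0, length - segment + 1))
    starts = valid_segment_starts(refs, segment)
    if starts.size == 0:
        raise ValueError("every candidate segment has an all-zero reference")
    return int(rng.choice(starts))


# Speaker 0 is silent except at sample 5; speaker 1 is active throughout.
refs = np.zeros((2, 8))
refs[0, 5] = 1.0
refs[1, :] = 0.5
starts = valid_segment_starts(refs, segment=4)
start = choose_segment_start(refs, segment=4, rng=np.random.default_rng(0))
print(starts.tolist(), start)
```

With avoid_allzero=False, any start index is allowed, so a segment may contain a fully silent reference, which matches the case where the model is meant to learn to output silence.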

f"Sampling rates are mismatched: {self.rate} != {rate}"
)
if not self.allow_multi_rates:
if self.rate is not None and self.rate != rate:
Contributor

How about merging the two if statements into one?
if not self.allow_multi_rates and self.rate is not None and self.rate != rate:
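The suggested single condition is logically equivalent to the nested form. A quick sketch that checks this exhaustively over a few representative values; the function names mismatch_nested and mismatch_merged are hypothetical stand-ins that return a flag instead of raising:

```python
from itertools import product
from typing import Optional


def mismatch_nested(allow_multi_rates: bool, expected: Optional[int], rate: int) -> bool:
    # Original shape: an outer guard with the comparison nested inside it.
    if not allow_multi_rates:
        if expected is not None and expected != rate:
            return True
    return False


def mismatch_merged(allow_multi_rates: bool, expected: Optional[int], rate: int) -> bool:
    # Suggested shape: one short-circuiting condition.
    return not allow_multi_rates and expected is not None and expected != rate


cases = list(product([False, True], [None, 8000, 16000], [8000, 16000]))
agree = all(mismatch_nested(*c) == mismatch_merged(*c) for c in cases)
print(agree)
```

Because `and` short-circuits, the merged form never compares rates when allow_multi_rates is set or when no rate has been recorded yet, just like the nested form.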

Collaborator Author

OK

@Emrys365
Collaborator Author

I left some comments on this PR. Do you have a plan to add some related recipes which have variable number of speakers?

Yes, the recipe will be added in the followup PRs soon.

@sw005320 sw005320 added the auto-merge Enable auto-merge label Oct 11, 2023
@Emrys365
Collaborator Author

Can we merge this PR?

@sw005320 sw005320 merged commit 1699eb8 into espnet:master Oct 16, 2023
25 checks passed
@sw005320
Contributor

Thanks a lot for your great efforts!

Labels
auto-merge Enable auto-merge CI Travis, Circle CI, etc ESPnet2 README Recipe SE Speech enhancement
3 participants