Update the Enh framework to support training with variable numbers of speakers #5414
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #5414      +/-   ##
==========================================
- Coverage   77.17%   75.29%   -1.89%
==========================================
  Files         684      708      +24
  Lines       62686    65085    +2399
==========================================
+ Hits        48379    49006     +627
- Misses      14307    16079    +1772
==========================================

Flags with carried forward coverage won't be shown.
... and 51 files with indirect coverage changes.
espnet2/train/preprocessor.py
Outdated
]
length = speech_refs[0].shape[0]
tgt_length = self.speech_segment
assert length >= self.speech_segment, (length, tgt_length)
One thing worth discussing is whether we should allow the input signal to be shorter than the specified speech_segment.
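For illustration, here is a minimal sketch of how both behaviors could coexist; `get_segment` and `allow_shorter` are hypothetical names for this discussion, not the PR's actual API:

```python
import numpy as np

def get_segment(speech_ref, speech_segment, allow_shorter=False):
    """Sketch of fixed-length segment selection (hypothetical helper).

    If the signal is shorter than ``speech_segment``, either reject it
    (the current behavior in the PR) or right-pad it with zeros.
    """
    length = speech_ref.shape[0]
    if length < speech_segment:
        if not allow_shorter:
            # current behavior: reject too-short inputs
            raise ValueError(f"Input ({length}) shorter than segment ({speech_segment})")
        # alternative behavior: right-pad with zeros up to the segment length
        return np.pad(speech_ref, (0, speech_segment - length))
    # randomly choose a start offset for the fixed-length segment
    start = np.random.randint(0, length - speech_segment + 1)
    return speech_ref[start : start + speech_segment]
```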
Force-pushed from baf5d67 to 792329b.
@simpleoier, can you review this PR?
@LiChenda, can you also review this PR?
Sure, I'll review it.
Note:
    All audios in each sample must have the same sampling rate.
    Audios of different lengths in each sample will be right-padded with np.nan.
Why is it padded with np.nan here?
It is just in case the signals defined in multiple columns have different lengths. These nan values will be removed (or ensured to be absent) in the preprocessor, e.g.,
espnet/espnet2/train/preprocessor.py
Lines 1162 to 1171 in 1bba669
# Divide the stacked signals into single speaker signals for consistency
for i in range(num_spk - 1, -1, -1):
    idx = str(i + 1)
    # make sure no np.nan paddings are in the data
    assert not np.isnan(np.sum(data[sref_name][i])), uid
    data[self.speech_ref_name_prefix + idx] = data[sref_name][i]
    if dref_name in data:
        # make sure no np.nan paddings are in the data
        assert not np.isnan(np.sum(data[dref_name][i])), uid
        data[self.dereverb_ref_name_prefix + idx] = data[dref_name][i]
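To make the padding convention concrete, here is a minimal sketch of the pad-then-unpad round trip; `stack_refs_with_nan` and `split_refs` are hypothetical names for illustration, not ESPnet functions:

```python
import numpy as np

def stack_refs_with_nan(refs):
    """Right-pad variable-length reference signals with np.nan and stack them.

    np.nan is a convenient sentinel because real audio samples are finite,
    so any leftover padding is easy to detect downstream.
    """
    max_len = max(r.shape[0] for r in refs)
    out = np.full((len(refs), max_len), np.nan)
    for i, r in enumerate(refs):
        out[i, : r.shape[0]] = r
    return out

def split_refs(stacked):
    """Inverse step: recover per-speaker signals and drop the nan padding."""
    refs = []
    for row in stacked:
        row = row[~np.isnan(row)]
        # mirrors the preprocessor's assertion: no nan may remain
        assert not np.isnan(np.sum(row))
        refs.append(row)
    return refs
```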
espnet2/tasks/enh.py
Outdated
"--speech_segment",
type=int_or_none,
default=None,
help="Truncate the audios to the specified length if not None",
Truncate the audio to the specified length (If I understand correctly, the length is in sampling points?)
Yes, it's the number of samples. I have updated the help string.
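Since the length is in samples rather than seconds, a duration must be converted with the sampling rate before being passed as `--speech_segment`; a trivial hypothetical helper:

```python
# Hypothetical conversion helper (not part of the PR): --speech_segment
# expects a number of samples, so convert a duration using the sampling rate.
def seconds_to_samples(seconds, sampling_rate):
    return int(seconds * sampling_rate)
```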
I left some comments on this PR. Do you have a plan to add related recipes with variable numbers of speakers?
help="Truncate the audios to the specified length if not None",
)
group.add_argument(
    "--avoid_allzero_segment",
Is it necessary to provide this option? I feel the default True is fine.
Sometimes we may want to train the model to generate silence, and sometimes we may want to avoid silence as the training target, depending on the loss function we use. So I made this an argument to enable both possibilities.
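A minimal sketch of what such a segmentation step could look like, assuming hypothetical names (`choose_segment_start`, `max_tries`) that are not the PR's actual implementation:

```python
import numpy as np

def choose_segment_start(speech_refs, speech_segment,
                         avoid_allzero_segment=True, max_tries=10):
    """Pick a random start offset for a fixed-length segment (sketch).

    If ``avoid_allzero_segment`` is True, retry until no reference signal
    is all-zero inside the chosen segment (up to ``max_tries`` attempts).
    """
    length = speech_refs[0].shape[0]
    for _ in range(max_tries):
        start = np.random.randint(0, length - speech_segment + 1)
        if not avoid_allzero_segment:
            return start
        # keep this candidate only if every reference has energy in it
        if all(np.any(ref[start : start + speech_segment]) for ref in speech_refs):
            return start
    return start  # fall back to the last candidate
```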
espnet2/train/dataset.py
Outdated
        f"Sampling rates are mismatched: {self.rate} != {rate}"
    )
if not self.allow_multi_rates:
    if self.rate is not None and self.rate != rate:
How about reducing the two `if` statements into one?

if not self.allow_multi_rates and self.rate is not None and self.rate != rate:
OK
Yes, the recipe will be added in follow-up PRs soon.
Can we merge this PR?
Thanks a lot for your great efforts!
What?
This PR adds support for variable numbers of speakers when training speech separation models.

Major updates are added in:

- `--variable_num_refs` is added to specify whether we are training models with variable numbers of speakers in each mixture sample.
- A new format of `spk1.scp`, `dereverb1.scp`, and `enroll_spk1.scp` is introduced. Each row may contain variable columns, where each column starting from the 2nd column represents a different reference speaker.
- A new data type `variable_columns_sound` is added to espnet2/train/dataset.py and espnet2/train/iterable_dataset.py. It is used to load `spk1.scp` and `dereverb1.scp` (if provided) when `variable_num_refs` is true. `enroll_spk1.scp` is loaded as the `text` type, so there is no problem if the enrollment audios of the same sample have different lengths.
- `speech_segment: Optional[int] = None` is added to `EnhPreprocessor` and `TSEPreprocessor` to allow loading only a fixed-length segment from each audio. This is useful if we want to control the balance between data of different lengths.
- `avoid_allzero_segment: bool = True` is added to `EnhPreprocessor` and `TSEPreprocessor`. This argument is only useful when `speech_segment` is specified. If it is True, the segmentation process will ensure that all reference audios in the selected segment are not all-zero.
- `flexible_numspk: bool = False` is added to `EnhPreprocessor` and `TSEPreprocessor` to handle the variable-speaker-number inputs in `speech_ref1`, `dereverb_ref1`, and `enroll_ref1`.
- `flexible_numspk` is added to handle training with variable numbers of speakers.

In addition, some updates are added for future extension (no test is available now):

- `--allow_multi_rates` is added to allow loading audios with different sampling rates. This is useful if we want to build a universal model that can handle different sampling rates.
- `flexible_numspk` is added to handle training with variable numbers of speakers.
- `additional` is added to the `AbsExtractor` forward method. This makes it consistent with `AbsSeparator`, which is flexible for future extension.

Why?
Speech separation with an unknown number of speakers is a popular research topic but has not been supported in ESPnet due to its variable data size.
Note
Currently the inference and evaluation stages are not updated to support variable numbers of speakers. This part needs further discussion to make a good design.
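To illustrate the variable-column scp format described in the PR summary, here is a hypothetical example (utterance IDs and paths are made up) together with a minimal parser sketch, not the actual ESPnet reader:

```python
# Hypothetical example of the new variable-column spk1.scp format:
# the first column is the utterance ID; each remaining column is the
# path to one reference speaker's signal, and rows may differ in width.
scp_text = """\
uttid_1 /data/ref1_a.wav /data/ref1_b.wav
uttid_2 /data/ref2_a.wav /data/ref2_b.wav /data/ref2_c.wav
uttid_3 /data/ref3_a.wav
"""

def parse_variable_columns_scp(text):
    """Parse uttid -> list of reference paths (sketch, not the ESPnet loader)."""
    table = {}
    for line in text.strip().splitlines():
        uttid, *paths = line.split()
        table[uttid] = paths
    return table
```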