Support SOT training on LibriMix data. #4861
Conversation
Thanks @pengchengguo !
@@ -0,0 +1,90 @@
# network architecture
The PIT model may use a different data preparation than SOT, so it is a bit surprising to find this config here, especially since this recipe folder is under `sot_asr1`. I don't know. Maybe we can add some explicit notes about this config, e.g. which data preparation it needs, what its purpose is, how it differs, and how to use it.
The motivation for including PIT is to compare PIT and SOT on the same dataset (overlapped speech with a certain time delay). I am fine with removing it and keeping only the SOT config here~
frontend_conf:
    frontend_conf:
        upstream: wavlm_local
        path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
This path is for your local experiments.
Fixed.
frontend_conf:
    frontend_conf:
        upstream: wavlm_local
        path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
This path in the config is specific to your own setup.
Fixed.
This pull request is now in conflict :(
Codecov Report
@@ Coverage Diff @@
## master #4861 +/- ##
=======================================
Coverage 75.86% 75.86%
=======================================
Files 615 615
Lines 54684 54689 +5
=======================================
+ Hits 41487 41491 +4
- Misses 13197 13198 +1
- Does it have to be `sot_asr1`? Since it just changes the config, I think we can move this to `asr1`, rename `local/data.sh` to `local/data_sot.sh`, and call it from the original `local/data.sh`. If this conflicts with the existing data directory, your current idea is better.
- It would be better to add some descriptions to README.md, for example: why we need the CSV files, what the SOT text format is, and how `<sc>` symbols are treated in the tokenizer through the `--add_nonsplit_symbol` option.
egs2/TEMPLATE/asr1/asr.sh (Outdated)
@@ -766,6 +772,9 @@ if ! "${skip_data_prep}"; then
            _opts="--non_linguistic_symbols ${nlsyms_txt}"
            if ${sot_asr} && [ "${token_type}" = char ]; then
It is better to briefly explain the SOT-based text format with an example here
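For reference, the SOT label format from the paper linked in this PR serializes the transcripts of all speakers in order of their start times, joined by the speaker-change token `<sc>`. A minimal sketch (function and variable names here are illustrative, not ESPnet's actual implementation):

```python
def make_sot_text(utterances):
    """Build a serialized-output-training (SOT) reference transcript.

    utterances: list of (start_time_sec, transcript) tuples, one per speaker.
    Transcripts are sorted by start time and joined with the <sc> token.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return " <sc> ".join(text for _, text in ordered)

print(make_sot_text([(1.2, "how are you"), (0.0, "hello world")]))
# → hello world <sc> how are you
```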
only char case?
Then, again, it's better to explain it.
- The main difference between `sot_asr1` and `asr1` is the data simulation process and the training data. If we merged these two directories, we may need to introduce additional files (e.g. `data/{train_sot,dev_sot,test_sot}`, `run_sot.sh`). I think it is ok to add some files, but the results in `README.md` could not be compared with each other (due to the different training data: fully overlapped vs. partially overlapped).
- I will add more descriptions about SOT.
- "Only char case?": in our experiments, SOT only converges well on char units, so we only considered this case. I will check compatibility with other units like BPE.
OK, thanks.
only char case?

When we prepare the text, `<sc>` is inserted between speakers, with a space on each side.

- For the word case, `<sc>` is automatically regarded as a single word unit, so no special processing is needed.
- For the BPE case, we use `_opts_spm+=" --user_defined_symbols=<sc>"` to take care of it, so it will not be mixed into other BPE subwords.
- For the char case, we need to add this special argument so that the text processing knows it is a single unit and does not split it into `<`, `s`, `c`, `>`.
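To make the char case concrete, here is a small sketch of a character tokenizer that keeps designated non-split symbols atomic. This is an illustrative simplification of the `--add_nonsplit_symbol` behavior, not ESPnet's actual code:

```python
import re

def char_tokenize(text, nonsplit_symbols=("<sc>",)):
    """Split text into characters while keeping non-split symbols atomic.

    Whitespace is mapped to the conventional <space> token; any symbol in
    nonsplit_symbols (e.g. the speaker-change token <sc>) is emitted whole.
    """
    pattern = "|".join(re.escape(s) for s in nonsplit_symbols)
    tokens = []
    # Split on the symbols while keeping them via the capture group.
    for piece in re.split(f"({pattern})", text):
        if piece in nonsplit_symbols:
            tokens.append(piece)
        else:
            tokens.extend("<space>" if ch == " " else ch for ch in piece)
    return tokens

print(char_tokenize("a <sc> b"))
# → ['a', '<space>', '<sc>', '<space>', 'b']
```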
@@ -0,0 +1,154 @@
842M  Libri2Mix/wav16k/max/dev/s2
where is it used?
I will remove it.
@@ -0,0 +1,3001 @@
mixture_ID,source_1_path,source_1_gain,source_2_path,source_2_gain,noise_path,noise_gain,offset
Can you explain why this is needed? Is it to avoid the full overlap in the original LibriMix?
Yes, the updated CSV file includes an additional "offset" column, which is used as the time delay when simulating overlapped data. Here we randomly generate offsets from 1 s to 1.5 s.
The motivation for fixing the offsets in the CSV is to allow people to do a fair comparison among different models on the same simulated data.
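A minimal sketch of how the extra "offset" column could be consumed when simulating a mixture: source 2 is delayed by the offset before the gains are applied and the signals are summed. The CSV row and the dummy signals below are illustrative, not real LibriMix data:

```python
import csv
import io

# Hypothetical row in the extended LibriMix metadata format; the trailing
# "offset" column (seconds) delays source 2 relative to source 1.
CSV_TEXT = """\
mixture_ID,source_1_path,source_1_gain,source_2_path,source_2_gain,noise_path,noise_gain,offset
mix_0001,s1/a.wav,1.0,s2/b.wav,0.8,noise/n.wav,0.1,1.25
"""

SAMPLE_RATE = 16000

def mix_with_offset(src1, src2, gain1, gain2, offset_sec, sr=SAMPLE_RATE):
    """Mix two sample lists, delaying src2 by offset_sec (partial overlap)."""
    delay = int(round(offset_sec * sr))
    length = max(len(src1), delay + len(src2))
    out = [0.0] * length
    for i, s in enumerate(src1):
        out[i] += gain1 * s
    for i, s in enumerate(src2):
        out[delay + i] += gain2 * s
    return out

row = next(csv.DictReader(io.StringIO(CSV_TEXT)))
# Tiny constant signals stand in for the real wav files.
mixture = mix_with_offset([1.0] * 8, [1.0] * 8,
                          float(row["source_1_gain"]),
                          float(row["source_2_gain"]),
                          float(row["offset"]))
print(len(mixture))  # 1.25 s delay at 16 kHz → 20000 + 8 = 20008 samples
```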
This pull request is now in conflict :(
@pengchengguo, can we try to finish this PR?
Sorry for the delay, will finish it this weekend.
There is a conflict due to the recent refactoring of
Sure, I am checking it now.
where was it used?
Removed
LGTM, thanks a lot!
Sorry, it seems the change (remove useless
OK!
Xuankai and I would like to include SOT training (https://arxiv.org/pdf/2003.12687.pdf) in ESPnet.
We modify the CSV files of the original LibriMix repo by adding random overlap offsets of 1–1.5 s. This PR provides a unified simulation setup and a comparable benchmark for non-fully-overlapped multi-speaker ASR.
Note that it is not ready for merging yet. We will add results and pre-trained models later.
TODO: