Support SOT training on LibriMix data. #4861

Merged
merged 12 commits into from
Mar 29, 2023

Collaborator

@pengchengguo pengchengguo commented Jan 10, 2023

Xuankai and I would like to include serialized output training (SOT, https://arxiv.org/pdf/2003.12687.pdf) in ESPnet.

We modify the CSV files of the original LibriMix repo by adding a random offset of 1-1.5 s between the overlapped speakers. This PR provides a unified simulation setup and a comparable benchmark for partially overlapped multi-speaker ASR.
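
As a rough illustration of what such an offset does to a mixture, here is a minimal sketch in plain Python. It is not the LibriMix simulation script: the function name `mix_with_offset`, the list-of-floats signal representation, and the default sample rate are assumptions made for this example, and gains and noise are ignored.

```python
def mix_with_offset(source_1, source_2, offset_sec, sample_rate=16000):
    """Mix two signals, delaying the second one by offset_sec seconds.

    Hypothetical helper for illustration only: signals are plain lists of
    floats, and no gain or noise is applied.
    """
    delay = int(round(offset_sec * sample_rate))  # offset in samples
    length = max(len(source_1), delay + len(source_2))
    mixture = [0.0] * length
    for i, x in enumerate(source_1):
        mixture[i] += x
    for i, x in enumerate(source_2):
        mixture[delay + i] += x  # second speaker starts after the delay
    return mixture


# Toy example with a 4 Hz "sample rate" so the numbers stay readable.
toy = mix_with_offset([1.0] * 4, [1.0] * 4, offset_sec=0.5, sample_rate=4)
print(toy)
```

With a non-zero offset only the middle of the mixture contains both speakers, which is the partially overlapped condition this recipe targets.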

Note that it is not ready for merging yet. We will update some results and pre-trained models later.

TODO:

  • Results of Transformer PIT w/ or w/o WavLM
  • Results of Transformer SOT w/ or w/o WavLM
  • Results of Conformer SOT w/ or w/o WavLM

@mergify mergify bot added the ESPnet2 label Jan 10, 2023
Collaborator

@simpleoier simpleoier left a comment

Thanks @pengchengguo !

@@ -0,0 +1,90 @@
# network architecture
Collaborator

The PIT model may use different data preparation from SOT. It may be a bit surprising to find this config here, especially since this recipe folder is sot_asr1. I'm not sure. Maybe we can add explicit notes about this config, e.g. which data preparation it uses, what its purpose is, how it differs, and how to use it.

Collaborator Author

The motivation for including PIT is to compare PIT and SOT on the same dataset (overlapped speech with a certain time delay). I am fine with removing it and keeping only the SOT file here.

frontend_conf:
    frontend_conf:
        upstream: wavlm_local
        path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
Collaborator

This path is for your local experiments.

Collaborator Author

Fixed.

frontend_conf:
    frontend_conf:
        upstream: wavlm_local
        path_or_url: "/home/work_nfs6/pcguo/asr/librimix/hub/wavlm_large.pt"
Collaborator

The path in this config is specific to your own environment.

Collaborator Author

Fixed.

@sw005320 sw005320 added the New Features and ASR (Automatic speech recognition) labels Jan 11, 2023
@sw005320 sw005320 added this to the v.202301 milestone Jan 11, 2023
Contributor

mergify bot commented Jan 17, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Jan 17, 2023
@mergify mergify bot removed the conflicts label Feb 1, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202301, v.202303 Feb 1, 2023

codecov bot commented Feb 9, 2023

Codecov Report

Merging #4861 (32ad96c) into master (98823bd) will increase coverage by 0.00%.
The diff coverage is 87.50%.

@@           Coverage Diff           @@
##           master    #4861   +/-   ##
=======================================
  Coverage   75.86%   75.86%           
=======================================
  Files         615      615           
  Lines       54684    54689    +5     
=======================================
+ Hits        41487    41491    +4     
- Misses      13197    13198    +1     
Flag                       Coverage Δ
test_integration_espnet1   66.29% <ø> (ø)
test_integration_espnet2   48.52% <75.00%> (+<0.01%) ⬆️
test_python                65.83% <50.00%> (-0.01%) ⬇️
test_utils                 23.28% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files                     Coverage Δ
espnet2/text/build_tokenizer.py    77.14% <ø> (ø)
espnet2/train/preprocessor.py      29.37% <66.66%> (+0.18%) ⬆️
espnet2/bin/tokenize_text.py       85.96% <100.00%> (+0.12%) ⬆️
espnet2/text/char_tokenizer.py     83.33% <100.00%> (+0.40%) ⬆️

Contributor

@sw005320 sw005320 left a comment

  • Does it have to be sot_asr1?
    Since it only changes the config, I think we can move this to asr1, rename this recipe's local/data.sh to local/data_sot.sh, and call it from the original local/data.sh.
    If this conflicts with the existing data directory, your current idea is better.
  • It would be better to add some descriptions to README.md, for example, why we need the CSV files, what the SOT text format is, and how <sc> symbols are treated in the tokenizer through the --add_nonsplit_symbol option.

@@ -766,6 +772,9 @@ if ! "${skip_data_prep}"; then

_opts="--non_linguistic_symbols ${nlsyms_txt}"

if ${sot_asr} && [ "${token_type}" = char ]; then
Contributor

It is better to briefly explain the SOT-based text format with an example here
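
For illustration, the SOT text format could be sketched as below. This is only a sketch under assumptions: it follows the first-in, first-out ordering described in the SOT paper (transcripts sorted by start time) and joins them with the `<sc>` symbol surrounded by spaces. The helper name `build_sot_text` and the `(start_time, transcript)` tuples are hypothetical and not part of ESPnet.

```python
SC = "<sc>"  # speaker-change symbol used in this recipe


def build_sot_text(utterances):
    """Serialize per-speaker transcripts into a single SOT reference.

    utterances: list of (start_time_sec, transcript) tuples, one per speaker.
    Hypothetical helper: transcripts are ordered by start time and joined
    with " <sc> " (a space on each side of the symbol).
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC} ".join(text.strip() for _, text in ordered)


# The speaker who starts first comes first in the serialized reference.
print(build_sot_text([(1.2, "HOW ARE YOU"), (0.0, "GOOD MORNING")]))
# GOOD MORNING <sc> HOW ARE YOU
```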

Contributor

only char case?
Then, again, it's better to explain it.

Collaborator Author

  • The main difference between sot_asr1 and asr1 is the data simulation process and the training data. If we merge these two directories, we may need to introduce additional files (e.g. data/{train_sot,dev_sot,test_sot}, run_sot.sh). I think it is OK to add some files, but the results in README.md cannot be compared with each other because the training data differ (fully overlapped vs. partially overlapped).
  • I will add more descriptions about SOT.
  • "only char case?": in our experiments, SOT only converges well on char units, so we only considered this case. I will check the compatibility with other units like BPE.

Contributor

OK, thanks.

Collaborator

@simpleoier simpleoier Feb 17, 2023

> only char case?

When we prepare the text, <sc> is inserted between speakers, with a space on the left and right.

  1. For the word case, <sc> is automatically treated as a single word unit, so no special processing is needed.
  2. For the BPE case, we use _opts_spm+=" --user_defined_symbols=<sc>" to take care of it, so it is never merged into other BPE subwords.
  3. For the char case, we need to add this special argument to tell the text processing that <sc> is a single unit, so that it is not split into <, s, c, >.
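
The char case can be sketched as follows. This is a simplified stand-in written for this discussion, not ESPnet's actual tokenizer code: the function name `char_tokenize`, its parameters, and the `<space>` placeholder are assumptions for the example.

```python
def char_tokenize(text, nonsplit_symbols=("<sc>",), space_symbol="<space>"):
    """Split text into characters while keeping listed symbols intact.

    Hypothetical helper for illustration: any symbol in nonsplit_symbols
    is emitted as one token, and spaces are mapped to space_symbol.
    """
    tokens = []
    i = 0
    while i < len(text):
        for sym in nonsplit_symbols:
            if text.startswith(sym, i):
                tokens.append(sym)  # keep the symbol as a single unit
                i += len(sym)
                break
        else:
            ch = text[i]
            tokens.append(space_symbol if ch == " " else ch)
            i += 1
    return tokens


print(char_tokenize("HI <sc> YO"))
# ['H', 'I', '<space>', '<sc>', '<space>', 'Y', 'O']
print(char_tokenize("HI <sc> YO", nonsplit_symbols=()))
# ['H', 'I', '<space>', '<', 's', 'c', '>', '<space>', 'Y', 'O']
```

The second call shows what happens without the non-split handling: the speaker-change symbol is broken into four ordinary characters.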

@@ -0,0 +1,154 @@
842M Libri2Mix/wav16k/max/dev/s2
Contributor

where is it used?

Collaborator Author

I will remove it.

@@ -0,0 +1,3001 @@
mixture_ID,source_1_path,source_1_gain,source_2_path,source_2_gain,noise_path,noise_gain,offset
Contributor

can you explain why this is needed?
to avoid the full overlap in the original librimix?

Collaborator Author

Yes, the updated CSV file includes an additional "offset" column, which is used as the time delay when simulating overlapped data. Here we randomly generate offsets between 1 s and 1.5 s.
The motivation for fixing the offsets is to allow a fair comparison of different models on the same simulated data.
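
A minimal sketch of how such an "offset" column could be generated reproducibly is shown below. It is not the script used to produce the CSV files in this PR: the function name `add_offsets`, the fixed seed, the two-decimal formatting, and the example rows are assumptions. Only the column names come from the CSV header above.

```python
import csv
import io
import random


def add_offsets(csv_text, low=1.0, high=1.5, seed=0):
    """Append an "offset" column with random delays in [low, high] seconds.

    Hypothetical helper: a fixed seed keeps the offsets identical across
    runs, so everyone simulates the same mixtures.
    """
    rng = random.Random(seed)
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + ["offset"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["offset"] = f"{rng.uniform(low, high):.2f}"
        writer.writerow(row)
    return out.getvalue()


# Made-up rows for illustration; real LibriMix metadata has longer paths.
original = (
    "mixture_ID,source_1_path,source_1_gain,source_2_path,source_2_gain,noise_path,noise_gain\n"
    "mix_a,s1/a.flac,0.5,s2/a.flac,0.6,noise/a.wav,0.1\n"
    "mix_b,s1/b.flac,0.4,s2/b.flac,0.7,noise/b.wav,0.2\n"
)
print(add_offsets(original))
```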

Contributor

mergify bot commented Mar 17, 2023

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Mar 17, 2023
@sw005320
Contributor

@pengchengguo, can we try to finish this PR?

@pengchengguo
Collaborator Author

Sorry for the delay, will finish it this weekend.

@mergify mergify bot added the README label Mar 21, 2023
@sw005320
Contributor

There is a conflict due to the recent refactoring of asr.sh
Can you fix this?

@pengchengguo
Collaborator Author

> There is a conflict due to the recent refactoring of asr.sh. Can you fix this?

Sure, I am checking it now.

@mergify mergify bot removed the conflicts label Mar 21, 2023
Contributor

where was it used?

Collaborator Author

Removed

@sw005320 sw005320 added the auto-merge (Enable auto-merge) label Mar 29, 2023
@mergify mergify bot merged commit 275637a into espnet:master Mar 29, 2023
@sw005320
Contributor

LGTM, thanks a lot!

@pengchengguo
Collaborator Author

Sorry, it seems the change (removing the unused decode_pit.yaml file) was not pushed successfully. I may need to open another PR to remove this file, or we can leave it to a later PR, such as the one that updates the results.

@sw005320
Contributor

OK!
I think this is my mistake.
Sorry...

Labels: ASR (Automatic speech recognition), auto-merge (Enable auto-merge), ESPnet2, New Features, README
4 participants