ASR2 recipe on Tedlium3 dataset #5331

Merged 13 commits on Oct 19, 2023

Conversation

kohei0209
Contributor

@kohei0209 kohei0209 commented Jul 21, 2023

tedlium3/asr2 recipe

Implementation of tedlium3/asr2 recipe

  • check data preparation stage (~ stage 8)
  • check training stage (stage 8~)
  • Get some results

@sw005320 sw005320 added Recipe ASR Automatic speech recognition labels Jul 21, 2023
@sw005320 sw005320 added this to the v.202307 milestone Jul 21, 2023
@codecov

codecov bot commented Jul 21, 2023

Codecov Report

Merging #5331 (9f54c74) into master (5d0758e) will increase coverage by 2.64%.
Report is 485 commits behind head on master.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5331      +/-   ##
==========================================
+ Coverage   72.72%   75.36%   +2.64%     
==========================================
  Files         679      709      +30     
  Lines       61692    65290    +3598     
==========================================
+ Hits        44865    49206    +4341     
+ Misses      16827    16084     -743     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 65.67% <ø> (-0.06%) ⬇️
test_integration_espnet2 48.71% <ø> (?)
test_python_espnet1 19.16% <ø> (-1.11%) ⬇️
test_python_espnet2 51.39% <ø> (-0.71%) ⬇️
test_utils 23.10% <ø> (+<0.01%) ⬆️

see 120 files with indirect coverage changes

egs2/tedlium3/asr2/conf/fbank.conf (outdated, resolved)
egs2/tedlium3/asr2/conf/pitch.conf (outdated, resolved)
@@ -0,0 +1,93 @@
# Trained with A100 (40 GB) x 2 GPUs. It takes 21 minutes per epoch.
Collaborator

Please update the time information later.

@@ -0,0 +1,93 @@
# Trained with A100 (40 GB) x 1 GPUs for Kmeans1K+nbpe5K. It takes 32 minutes per epoch.
Collaborator

Is the time information correct?

@@ -0,0 +1,93 @@
# Trained with A100 (40 GB) x 1 GPUs. It takes 24 minutes per epoch.
Collaborator

Ditto.

@sw005320 sw005320 marked this pull request as draft July 21, 2023 23:17
@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023
@sw005320
Contributor

What is the status of this PR? We can convert it from a draft to a regular PR if it is ready.

@simpleoier
Collaborator

Once @kohei0209 gets the asr2 results with the new config, we can convert this PR to a regular one and proceed to merge.

@simpleoier
Collaborator

Hi @kohei0209, can you continue this PR and upload your checkpoints?

@kohei0209
Contributor Author

I am sorry for the late reply. I'll upload the checkpoints and update this PR.

@kohei0209
Contributor Author

Hi @simpleoier, is it okay to include a data filtering step that removes empty text at stage 6 in this PR? Since Ted3 has some empty texts, data filtering is necessary in asr2.sh.
If so, could you tell me a good way to do it? I did the data filtering as follows, but it's dirty.

# remove empty text
cat "${data_feats}/org/${dset}/text.ts.en" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.ts.en"
# align keys
# maybe fix_data_dir.sh should be used, it's dirty
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/text.rm.${kmeans_feature_type}_${layer}_km${nclusters}" > "${data_feats}/${dset}/text.rm.${kmeans_feature_type}_${layer}_km${nclusters}"
utils/utt2spk_to_spk2utt.pl "${data_feats}/${dset}/utt2spk" > "${data_feats}/${dset}/spk2utt"
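
For readers without a Kaldi-style `utils/` directory at hand, the two steps above (drop empty transcripts, then align the other files to the surviving keys) can be sketched with plain awk. This is only an illustration on made-up toy data: the function names `drop_empty_text` and `filter_by_keys` are hypothetical, `filter_by_keys` is a stand-in for what `utils/filter_scp.pl` is used for here rather than a replacement for it, and `NF > 1` differs slightly from the `NF != 1` test in the comment because it also drops completely blank lines.

```shell
# Hypothetical helper: print only lines that have an utterance ID
# followed by at least one token (blank lines are dropped as well).
drop_empty_text() {
    awk 'NF > 1' "$1"
}

# Hypothetical helper: print the lines of file $2 whose first field
# also appears as a first field in file $1.
filter_by_keys() {
    awk 'FILENAME == ARGV[1] { keep[$1]; next } $1 in keep' "$1" "$2"
}

# Toy data (made up); utt2 has an empty transcript.
tmp=$(mktemp -d)
printf 'utt1 hello world\nutt2\nutt3 good morning\n' > "$tmp/text.org"
printf 'utt1 spkA\nutt2 spkA\nutt3 spkB\n' > "$tmp/utt2spk.org"

drop_empty_text "$tmp/text.org" > "$tmp/text"
filter_by_keys "$tmp/text" "$tmp/utt2spk.org" > "$tmp/utt2spk"

# Show the aligned utt2spk, then clean up.
cat "$tmp/utt2spk"
rm -rf "$tmp"
```

Passing the file to awk directly also avoids the extra `cat` process, although that is a matter of style rather than correctness.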

@sw005320
Contributor

Did you observe some improvements with it?
I'm curious about it.

Yes, stage 6 is the correct place.
But as you said, fix_data_dir.sh might be better.
Please also display how many utterances are pruned.
fix_data_dir.sh includes such information.

@kohei0209
Contributor Author

Thank you for your answer. I'll try fix_data_dir.sh.
This data filtering is for removing empty text (or audio without any utterance) and avoiding errors in the training stage, not for improving performance. I got an error in the training stage when I did not remove empty text.

@sw005320
Contributor

I see.
Theoretically, it should not have errors, but maybe there are some issues.

You do not have to do it, but one approach would be to add a special silence token for such utterances.
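
As a rough illustration of that alternative, the sketch below fills empty transcripts with a placeholder instead of dropping them. The token `<sil>` and the function name `fill_empty_text` are assumptions made for this example only; nothing in this thread establishes which token the recipe would use, or whether the tokenizer would keep it as a single symbol.

```shell
# Hypothetical helper: append a placeholder token ($2) to every line of
# the text file ($1) that contains only an utterance ID.
fill_empty_text() {
    awk -v tok="$2" 'NF == 1 { print $1, tok; next } { print }' "$1"
}

# Toy data (made up); utt2 has an empty transcript.
tmp=$(mktemp -d)
printf 'utt1 hello world\nutt2\n' > "$tmp/text"

# "<sil>" is an assumed placeholder, not a token defined by the recipe.
fill_empty_text "$tmp/text" '<sil>'
rm -rf "$tmp"
```

With this approach no utterances are removed, so the other files in the data directory would not need to be re-aligned.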

@kohei0209
Contributor Author

I am very sorry for the late reply.
I re-ran the recipe without removing the utterances with empty reference text and got the following error, which says that a line in the text.ts.en file has no space between the first and second columns. This happens because a row with an empty reference text contains only the utterance ID.

Original Traceback (most recent call last):
  File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/tools/miniconda/envs/espnet_hackathon/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/tools/miniconda/envs/espnet_hackathon/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data.append(next(self.dataset_iter))
  File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/espnet2/train/iterable_dataset.py", line 176, in __iter__
    raise RuntimeError(
RuntimeError: This line doesn't include a space: <_io.TextIOWrapper name='dump/raw/train_sp/text.ts.en' mode='r' encoding='utf-8'>:L39787: ApolloRobbins_2013G-0042011-0042121
)
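
Lines like the one named in this error can be listed before training with a one-line awk check. The sketch below is a minimal illustration on made-up data; the function name `find_empty_text` is hypothetical, and the check (a line that holds exactly one field) is an approximation of the condition the loader complains about, not a copy of its logic.

```shell
# Hypothetical helper: report every line of the text file ($1) that
# holds only an utterance ID, as <file>:L<line number>: <content>.
find_empty_text() {
    awk 'NF == 1 { printf "%s:L%d: %s\n", FILENAME, NR, $0 }' "$1"
}

# Toy data (made up); the second line has no transcript.
tmp=$(mktemp -d)
printf 'utt1 hello world\nutt2\nutt3 good morning\n' > "$tmp/text"
find_empty_text "$tmp/text"
rm -rf "$tmp"
```

Running such a check on the files under dump/raw/org would show which utterances the filtering step is about to drop.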

I tried to use fix_data_dir.sh, but I needed to filter utt2spk first because fix_data_dir.sh does not seem to refer to the text.ts.en file. If we do so, however, fix_data_dir.sh cannot tell us how many utterances are pruned, since utt2spk is already filtered. Is there a good solution?

cat "${data_feats}/org/${dset}/text.ts.en" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.ts.en"
# need to filter utt2spk first because fix_data_dir.sh does not seem to refer to text.ts.en
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
# filter the other files (spk2utt and text.rm.wavlm_large_21_km*)
utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" "${data_feats}/${dset}"

@sw005320
Contributor

I see.
Then I think the current approach is fine.
But how about adding lines to echo the number of sentences?
This is just for monitoring purposes and not so strict.

@kohei0209
Contributor Author

Thank you for your advice. I've added the code to show how many samples are removed:

# Remove empty text
cat "${data_feats}/org/${dset}/text.${tgt_case}.${tgt_lang}" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.${tgt_case}.${tgt_lang}"
utils/filter_scp.pl "${data_feats}/${dset}/text.${tgt_case}.${tgt_lang}" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
utils/fix_data_dir.sh  --utt_extra_files "${utt_extra_files}" "${data_feats}/${dset}"
# check how many samples are removed
org_num_samples=$(wc -l "${data_feats}/org/${dset}/utt2spk" | cut -d' ' -f1)
filtered_num_samples=$(wc -l "${data_feats}/${dset}/utt2spk" | cut -d' ' -f1)
echo "filter samples with empty texts: removed $((org_num_samples - filtered_num_samples)) samples with empty text"

The log is as follows:

2023-10-14T22:06:53 (asr2.sh:316:main) Info: The valid_set 'dev' is included in the test_sets. '--eval_valid_set true' is set and 'dev' is removed from the test_sets
2023-10-14T22:06:53 (asr2.sh:583:main) Skipped stages:  11 16 17 18 
2023-10-14T22:07:02 (asr2.sh:813:main) Stage 6: Data filtering: dump/raw/org -> dump/raw
dump/raw/train_sp
fix_data_dir.sh: kept all 804780 utterances.
fix_data_dir.sh: old files are kept in dump/raw/train_sp/.backup
filter samples with empty texts: removed 9 samples with empty text
dump/raw/dev
fix_data_dir.sh: kept all 507 utterances.
fix_data_dir.sh: old files are kept in dump/raw/dev/.backup
filter samples with empty texts: removed 0 samples with empty text
2023-10-14T22:10:08 (asr2.sh:1697:main) Successfully finished. [elapsed=217s]
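
In the log above, fix_data_dir.sh reports "kept all" because utt2spk was already filtered before it ran, so the number of removed samples comes from the separate line-count comparison. A slightly more portable way to take those counts is sketched below: reading the file from stdin keeps the file name out of the `wc` output, so no `cut` is needed, and stripping whitespace guards against `wc` implementations that pad the number. The function names `count_lines` and `report_removed` are hypothetical and the data is made up.

```shell
# Hypothetical helper: print the number of lines in file $1 as a bare integer.
count_lines() {
    wc -l < "$1" | tr -d '[:space:]'
}

# Hypothetical helper: compare the original ($1) and filtered ($2) utt2spk.
report_removed() {
    n_org=$(count_lines "$1")
    n_kept=$(count_lines "$2")
    echo "removed $((n_org - n_kept)) of ${n_org} samples with empty text"
}

# Toy data (made up): one of three utterances was filtered out.
tmp=$(mktemp -d)
printf 'utt1 spkA\nutt2 spkA\nutt3 spkB\n' > "$tmp/utt2spk.org"
printf 'utt1 spkA\nutt3 spkB\n' > "$tmp/utt2spk"
report_removed "$tmp/utt2spk.org" "$tmp/utt2spk"
rm -rf "$tmp"
```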

@kohei0209
Contributor Author

@simpleoier BTW, do you plan to switch the input order of the src and tgt texts (tgt first -> src first) in stage 13? (As we discussed before, it can make training faster and more memory-efficient.)

@simpleoier
Collaborator

@kohei0209 Thanks for the reminder. You can adjust the order in this PR.

@kohei0209
Contributor Author

I've addressed your comments. I've also uploaded the model parameters to Hugging Face.
Could you convert this PR from a draft to a regular PR and check the changes?

@sw005320 sw005320 marked this pull request as ready for review October 19, 2023 14:32
@sw005320 sw005320 closed this Oct 19, 2023
@sw005320 sw005320 reopened this Oct 19, 2023
@sw005320 sw005320 changed the title [WIP] ASR2 recipe on Tedlium3 dataset ASR2 recipe on Tedlium3 dataset Oct 19, 2023
@sw005320
Contributor

Thanks, @kohei0209!

@sw005320 sw005320 merged commit 93dafc3 into espnet:master Oct 19, 2023
38 checks passed
Labels
ASR Automatic speech recognition ESPnet2 README Recipe
4 participants