ASR2 recipe on Tedlium3 dataset #5331

kohei0209 · 2023-07-21T19:05:14Z

tedlium3/asr2 recipe

Implementation of tedlium3/asr2 recipe

check data preparation stage (~ stage 8)
check training stage (stage 8~)
Get some results

This reverts commit daf6533.

codecov · 2023-07-21T20:21:59Z

Codecov Report

Merging #5331 (9f54c74) into master (5d0758e) will increase coverage by 2.64%.
Report is 485 commits behind head on master.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #5331      +/-   ##
==========================================
+ Coverage   72.72%   75.36%   +2.64%     
==========================================
  Files         679      709      +30     
  Lines       61692    65290    +3598     
==========================================
+ Hits        44865    49206    +4341     
+ Misses      16827    16084     -743

Flag	Coverage Δ
test_configuration_espnet2	`∅ <ø> (∅)`
test_integration_espnet1	`65.67% <ø> (-0.06%)`	⬇️
test_integration_espnet2	`48.71% <ø> (?)`
test_python_espnet1	`19.16% <ø> (-1.11%)`	⬇️
test_python_espnet2	`51.39% <ø> (-0.71%)`	⬇️
test_utils	`23.10% <ø> (+<0.01%)`	⬆️

see 120 files with indirect coverage changes

egs2/tedlium3/asr2/conf/fbank.conf

egs2/tedlium3/asr2/conf/pitch.conf

simpleoier · 2023-07-21T20:44:11Z

egs2/tedlium3/asr2/conf/tuning/train_discrete_asr_e_branchformer1.yaml

@@ -0,0 +1,93 @@
+# Trained with A100 (40 GB) x 2 GPUs. It takes 21 minutes per epoch.


Please update the time information later.

simpleoier · 2023-07-21T20:44:41Z

egs2/tedlium3/asr2/conf/tuning/train_discrete_asr_e_branchformer1_1gpu.yaml

@@ -0,0 +1,93 @@
+# Trained with A100 (40 GB) x 1 GPUs for Kmeans1K+nbpe5K. It takes 32 minutes per epoch.


Is the time information correct?

simpleoier · 2023-07-21T20:44:50Z

egs2/tedlium3/asr2/conf/tuning/train_discrete_asr_e_branchformer1_conv1d3.yaml

@@ -0,0 +1,93 @@
+# Trained with A100 (40 GB) x 1 GPUs. It takes 24 minutes per epoch.


sw005320 · 2023-08-30T18:56:53Z

What is the status of this PR? We can convert it from a draft to a regular PR if it is ready.

simpleoier · 2023-08-30T18:59:28Z

Once @kohei0209 gets the asr2 results with the new config, we can convert this PR to a regular one and proceed to merge.

simpleoier · 2023-09-16T20:54:31Z

Hi @kohei0209, can you continue this PR and upload your checkpoints?

kohei0209 · 2023-09-19T08:44:37Z

I am sorry for the late reply. I'll upload the checkpoints and update this PR.

kohei0209 · 2023-09-25T03:59:02Z

Hi @simpleoier, is it okay to include the data filtering process for removing empty text at stage 6 in this PR? Since Ted3 has some empty texts, data filtering is necessary in asr2.sh.
If so, could you suggest a good way to do it? I filtered the data as follows, but it's messy.

# remove empty text
cat "${data_feats}/org/${dset}/text.ts.en" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.ts.en"
# align keys
# maybe fix_data_dir.sh should be used, it's dirty
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/text.rm.${kmeans_feature_type}_${layer}_km${nclusters}" > "${data_feats}/${dset}/text.rm.${kmeans_feature_type}_${layer}_km${nclusters}"
utils/utt2spk_to_spk2utt.pl "${data_feats}/${dset}/utt2spk" > "${data_feats}/${dset}/spk2utt"
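
For readers who do not have the Kaldi-style `utils/` scripts at hand, the key-alignment step above can be approximated in plain awk. This is only a sketch of the idea (keep the rows of one table whose utterance ID also appears in another file), not the implementation of `filter_scp.pl`; the helper name `filter_by_keys` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: keep the lines of a Kaldi-style table whose first field
# (the utterance ID) also appears as the first field of a key file.
# `filter_by_keys` is a hypothetical name, not part of ESPnet or Kaldi.
filter_by_keys() {
    local keys_file="$1"   # e.g. the filtered text file
    local table_file="$2"  # e.g. the original utt2spk
    # Comparing FILENAME with ARGV[1] (instead of the common NR == FNR idiom)
    # keeps the logic correct even when the key file is empty.
    awk 'FILENAME == ARGV[1] { keep[$1] = 1; next } ($1 in keep)' \
        "${keys_file}" "${table_file}"
}
```

Usage would look like `filter_by_keys filtered/text org/utt2spk > filtered/utt2spk`, where both paths are placeholders.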

sw005320 · 2023-09-25T11:58:53Z

Did you observe some improvements with it?
I'm curious about it.

Yes, stage 6 is the correct place.
But as you said, fix_data_dir.sh might be better.
Please also display how many utterances are pruned.
fix_data_dir.sh includes such information.

kohei0209 · 2023-09-25T13:41:35Z

Thank you for your answer. I'll try fix_data_dir.sh.
This data filtering is for removing empty text (or audio without any utterance) and avoiding errors in the training stage, not for improving performance. I got an error in the training stage when I did not remove empty text.

sw005320 · 2023-09-25T13:44:23Z

I see.
In theory, it should not cause errors, but maybe there are some issues.

You do not have to do it, but one approach would be to add a special silence token for such utterances.
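
A minimal sketch of that alternative, assuming a placeholder token is appended to every line that contains only an utterance ID. The token name `<sil>` and the helper name `add_placeholder_for_empty_text` are assumptions made for illustration; the token would presumably also have to be registered in the recipe's token list, which is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: give utterances with empty reference text a placeholder token
# instead of dropping them. Both the helper name and the default token
# "<sil>" are hypothetical.
add_placeholder_for_empty_text() {
    local text_file="$1"
    local token="${2:-<sil>}"
    # Lines with exactly one field hold only the utterance ID;
    # all other lines are passed through unchanged.
    awk -v tok="${token}" 'NF == 1 { print $1, tok; next } { print }' "${text_file}"
}
```

Unlike filtering, this keeps the utterance count unchanged, so the other files in the data directory would not need to be realigned.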

kohei0209 · 2023-10-12T12:20:48Z

I am very sorry for the late reply.
I re-ran the recipe without removing the data with empty reference text and got the following error, which says that the text.ts.en file does not include a space between the first and second columns. This happens because a row with an empty reference text contains only the utterance ID.

Original Traceback (most recent call last):
File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/tools/miniconda/envs/espnet_hackathon/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/tools/miniconda/envs/espnet_hackathon/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
data.append(next(self.dataset_iter))
File "/mnt/aoni04/saijo/hackathon-2023summer/espnet-2023-720/espnet2/train/iterable_dataset.py", line 176, in __iter__
raise RuntimeError(
RuntimeError: This line doesn't include a space: <_io.TextIOWrapper name='dump/raw/train_sp/text.ts.en' mode='r' encoding='utf-8'>:L39787: ApolloRobbins_2013G-0042011-0042121
)
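
Lines of this kind can be located before training with a short awk check. The sketch below assumes the text file uses the usual "utterance ID, then transcript" layout; the helper name `list_empty_text_lines` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: report every line that has an utterance ID but no transcript.
# `list_empty_text_lines` is a hypothetical name.
list_empty_text_lines() {
    local text_file="$1"
    # NF == 1 means the line holds a single field, i.e. only the ID.
    awk 'NF == 1 { printf "L%d: %s\n", NR, $1 }' "${text_file}"
}
```

Piping the result through `wc -l` gives the number of affected utterances.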

I tried to use fix_data_dir.sh but I needed to filter utt2spk first because fix_data_dir.sh does not seem to refer to the text.ts.en file. But if we do so we cannot know how many utterances are pruned since utt2spk is already filtered. Are there any good solutions...?

cat "${data_feats}/org/${dset}/text.ts.en" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.ts.en"
# need to filter utt2spk first because fix_data_dir.sh does not seem to refer to text.ts.en
utils/filter_scp.pl "${data_feats}/${dset}/text.ts.en" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
# filter the other files (spk2utt and text.rm.wavlm_large_21_km*)
utils/fix_data_dir.sh --utt_extra_files "${utt_extra_files}" "${data_feats}/${dset}"

sw005320 · 2023-10-12T13:29:08Z

I see.
Then I think the current approach is fine.
But, how about adding lines to echo the number of sentences?
This is just for monitoring purposes and not so strict.

kohei0209 · 2023-10-14T13:16:53Z

Thank you for your advice. I've added the code to show how many samples are removed:

# Remove empty text
cat "${data_feats}/org/${dset}/text.${tgt_case}.${tgt_lang}" | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/${dset}/text.${tgt_case}.${tgt_lang}"
utils/filter_scp.pl "${data_feats}/${dset}/text.${tgt_case}.${tgt_lang}" "${data_feats}/org/${dset}/utt2spk" > "${data_feats}/${dset}/utt2spk"
utils/fix_data_dir.sh  --utt_extra_files "${utt_extra_files}" "${data_feats}/${dset}"
# check how many samples are removed
org_num_samples=$(wc -l < "${data_feats}/org/${dset}/utt2spk")
filtered_num_samples=$(wc -l < "${data_feats}/${dset}/utt2spk")
echo "filter samples with empty texts: removed $((org_num_samples - filtered_num_samples)) samples with empty text"

The log is as follows:

2023-10-14T22:06:53 (asr2.sh:316:main) Info: The valid_set 'dev' is included in the test_sets. '--eval_valid_set true' is set and 'dev' is removed from the test_sets
2023-10-14T22:06:53 (asr2.sh:583:main) Skipped stages:  11 16 17 18 
2023-10-14T22:07:02 (asr2.sh:813:main) Stage 6: Data filtering: dump/raw/org -> dump/raw
dump/raw/train_sp
fix_data_dir.sh: kept all 804780 utterances.
fix_data_dir.sh: old files are kept in dump/raw/train_sp/.backup
filter samples with empty texts: removed 9 samples with empty text
dump/raw/dev
fix_data_dir.sh: kept all 507 utterances.
fix_data_dir.sh: old files are kept in dump/raw/dev/.backup
filter samples with empty texts: removed 0 samples with empty text
2023-10-14T22:10:08 (asr2.sh:1697:main) Successfully finished. [elapsed=217s]

kohei0209 · 2023-10-14T13:20:27Z

@simpleoier BTW, do you plan to switch the input orders of src and tgt texts (tgt first -> src first) in stage 13? (As we discussed before, it can make training more memory-efficient and faster)

simpleoier · 2023-10-16T21:29:13Z

@kohei0209 Thanks for the reminder. You can adjust the order in this PR.

kohei0209 · 2023-10-19T14:05:39Z

I've addressed your comments. I've also uploaded the model parameters to Hugging Face.
Could you convert this PR from a draft to a regular PR and check the changes?

sw005320 · 2023-10-19T19:31:11Z

Thanks, @kohei0209!

kohei0209 added 5 commits July 22, 2023 02:45

added tedlium3/asr2 recipe

1ccf715

Merge branch 'master' into hackathon2023

ce0bf7e

readme

daf6533

Revert "readme"

e289ace

This reverts commit daf6533.

readme

73e0592

mergify bot added ESPnet2 README labels Jul 21, 2023

sw005320 added Recipe ASR Automatic speech recogntion labels Jul 21, 2023

sw005320 added this to the v.202307 milestone Jul 21, 2023

simpleoier reviewed Jul 21, 2023

sw005320 marked this pull request as draft July 21, 2023 23:17

fixed readme

e87ceb8

kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023

kohei0209 added 3 commits August 28, 2023 22:03

local

b95af88

configs

9365240

removed unconfirmed training info from config

ee3051d

Merge branch 'master' into hackathon2023 to reflect lastest updates

7bcdab4

kohei0209 added 3 commits October 19, 2023 22:58

configs and readme

a473356

add [unk] token to nlsyms as in asr1

f8019ff

add data filtering and switch order of src and tgt in stage 13

9f54c74

sw005320 marked this pull request as ready for review October 19, 2023 14:32

sw005320 closed this Oct 19, 2023

sw005320 reopened this Oct 19, 2023

sw005320 changed the title ~~[WIP] ASR2 recipe on Tedlium3 dataset~~ ASR2 recipe on Tedlium3 dataset Oct 19, 2023

sw005320 merged commit 93dafc3 into espnet:master Oct 19, 2023
38 checks passed
