[Recipe] Add iwslt22 low resource speech translation task for egs2 #4994

Merged

mergify merged 8 commits into espnet:master from freddy5566:feature/iwslt22-low-resource

Mar 14, 2023

Contributor

freddy5566 commented Mar 11, 2023

Summary

This PR adds an ESPnet2 recipe for the IWSLT22 low-resource speech translation task.
The dataset comprises two sets (see more information here):

17 hours of clean speech in Tamasheq, translated into French (taq_fra_clean)
A 19-hour version of this corpus, which includes 2 additional hours of data that annotators labeled as potentially noisy (taq_fra_full)

Todo

update egs/README.md or egs2/README.md with corresponding recipes
add corresponding entry in egs2/TEMPLATE/db.sh for a new corpus
create confs and other related scripts
run experiments for clean and full sets of data
upload trained models to huggingface

My approach uses existing Tamasheq wav2vec2 features with a Transformer as the model architecture.

Here are the results:

st_wav2vec-transformer-warmup-15k

BLEU

|dataset|score|verbose_score|
|---|---|---|
|decode_pen2_st_model_valid.acc.ave/test|2.6|22.5/4.5/1.8/0.8 (BP = 0.736 ratio = 0.765 hyp_len = 17223 ref_len = 22504)|

st_full_wav2vec-transformer-warmup-15k

BLEU

|dataset|score|verbose_score|
|---|---|---|
|decode_pen2_st_model_valid.acc.ave/test|3.6|24.7/5.4/2.1/1.0 (BP = 0.894 ratio = 0.899 hyp_len = 20241 ref_len = 22504)|

mergify bot added ESPnet2 README labels

freddy5566 force-pushed the feature/iwslt22-low-resource branch from 2b15d04 to 39f5490 Compare

March 11, 2023 09:02

Contributor Author

freddy5566 commented Mar 11, 2023 (edited)

I have rebased it onto the latest master branch.

sw005320 requested a review from ftshijt

March 13, 2023 12:14

sw005320 added Recipe ST labels

sw005320 added this to the v.202303 milestone

Contributor

sw005320 commented Mar 13, 2023

Thanks a lot!
@ftshijt, can you check this PR?

ftshijt reviewed

egs2/iwslt22_low_resource/st1/local/data.sh Outdated

+              		# train_full comprises a 19 hour version of this corpus,
+              		# including 2 additional hours of data that was labeled by annotators as potentially noisy
+                  mkdir -p data/train/org
+              		mkdir -p data/train_full/org

Collaborator

ftshijt Mar 13, 2023

Suggested change

-              		mkdir -p data/train_full/org
+                  mkdir -p data/train_full/org

ftshijt reviewed

Collaborator

ftshijt left a comment

Many thanks! Looks perfect to me. I have just left some notes on minor formatting issues.

egs2/iwslt22_low_resource/st1/run.sh Outdated

+              tgt_case=tc
+              ./st.sh \
+              		--st_tag wav2vec-transformer-warmup-15k \

Collaborator

ftshijt Mar 13, 2023

Suggested change

--st_tag wav2vec-transformer-warmup-15k \

egs2/iwslt22_low_resource/st1/run.sh Outdated

+              ./st.sh \
+              		--st_tag wav2vec-transformer-warmup-15k \
+                  --ignore_init_mismatch true \

Collaborator

ftshijt Mar 13, 2023

Suggested change

--ignore_init_mismatch true \

Collaborator

ftshijt Mar 13, 2023

This is not necessary, since you are not using any pre-trained model with mismatched keys.

egs2/iwslt22_low_resource/st1/run.sh Outdated

+                  --tgt_nbpe $tgt_nbpe \
+                  --tgt_case ${tgt_case} \
+                  --feats_type "raw" \
+              		--feats_normalize uttmvn \

Collaborator

ftshijt Mar 13, 2023

Suggested change

-              		--feats_normalize uttmvn \
+                  --feats_normalize utterance_mvn \

egs2/iwslt22_low_resource/st1/run.sh Outdated

Comment on lines 37 to 38

		--nj 16 \
		--inference_nj 16 \

Collaborator

ftshijt Mar 13, 2023

Suggested change

      
                --nj 16 \
          
                --inference_nj 16 \

egs2/iwslt22_low_resource/st1/run.sh Outdated

+                  --nj 16 \
+                  --inference_nj 16 \
+                  --src_lang ${src_lang} \
+              		--use_src_lang false \

Collaborator

ftshijt Mar 13, 2023

Suggested change

-              		--use_src_lang false \
+                  --use_src_lang false \

Contributor Author

freddy5566 commented Mar 13, 2023

Thank you for reviewing my PR!
I have fixed the formatting and linter issues. Sorry, I didn't notice that the tab settings on my server were a bit messy.
BTW, should I upload the models to huggingface now?
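
Since the formatting notes above all concern tab characters mixed into space-indented scripts, a quick local check before pushing can help. The following is only a minimal sketch: the function name `find_tab_lines` is made up for illustration and is not part of ESPnet or its CI, and the file paths in the usage comment are simply the ones discussed in this PR.

```shell
# Hypothetical helper (not part of ESPnet): list every line that contains
# a literal tab character, prefixed with its line number.
find_tab_lines() {
    local tab
    tab="$(printf '\t')"
    # grep exits with status 1 when no line contains a tab.
    grep -n -- "${tab}" "$@"
}

# Example usage from the recipe directory (paths taken from this PR):
#   find_tab_lines run.sh local/data.sh
```

An empty result with a non-zero exit status means the files contain no tabs.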

Collaborator

ftshijt commented Mar 13, 2023

> Hi @ftshijt,
>
> Thank you for reviewing my PR! I have fixed the format and linter. Sorry, I didn't notice that my tab setting was a bit messy on the server. BTW, Should I upload models to huggingface now?

It is fine to do that in another PR. But if the models are ready, you are welcome to add the link in this PR as well.

Collaborator

ftshijt commented Mar 13, 2023

There are also some CI issues. Please fix those as well (see https://github.com/espnet/espnet/actions/runs/4391542488/jobs/7714797586)

Contributor Author

freddy5566 commented Mar 13, 2023

> There are also some CI issues. Please fix those as well (see https://github.com/espnet/espnet/actions/runs/4391542488/jobs/7714797586)

Thank you! I have fixed the CI errors.

codecov bot commented Mar 13, 2023 (edited)

Codecov Report

Merging #4994 (eb6782f) into master (611a291) will decrease coverage by 0.91%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4994      +/-   ##
==========================================
- Coverage   76.99%   76.08%   -0.91%     
==========================================
  Files         606      606              
  Lines       53748    53713      -35     
==========================================
- Hits        41381    40870     -511     
- Misses      12367    12843     +476

| Flag | Coverage Δ | |
|---|---|---|
| test_integration_espnet1 | `66.28% <ø> (+0.13%)` | ⬆️ |
| test_integration_espnet2 | `55.57% <ø> (+7.80%)` | ⬆️ |
| test_python | `65.39% <ø> (-1.45%)` | ⬇️ |
| test_utils | `23.28% <ø> (ø)` | |

Flags with carried forward coverage won't be shown.

see 49 files with indirect coverage changes

freddy5566 added 8 commits

March 14, 2023 22:04


          Add IWSLT22_LOW_RESOURCE to db.sh

b3fb0a4


          Add iwslt22 low-resource realted files

ba66a94


          Add iwslt22 low-resource results

aa9d857


          Add descriptions for iwslt22 low-resource in egs/README.md

d13388e


          Fix format issues

e321bf8


          Fix linter

d59dd96


          Fix linter

3fd6e27


          Apply isort to preprocess.py

eb6782f

freddy5566 force-pushed the feature/iwslt22-low-resource branch from 31cf90f to eb6782f Compare

March 14, 2023 14:04

sw005320 reviewed

egs2/iwslt22_low_resource/st1/RESULTS.md

+              |dataset|score|verbose_score|
+              |---|---|---|
+              |decode_pen2_st_model_valid.acc.ave/test|2.6|22.5/4.5/1.8/0.8 (BP = 0.736 ratio = 0.765 hyp_len = 17223 ref_len = 22504)|

Contributor

sw005320 Mar 14, 2023

The length ratio looks too far from 1 to me (the hypotheses are too short).
Can you tune the length penalty (perhaps in a later PR)?
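
For reference, the BP and ratio fields in the verbose_score can be reproduced from hyp_len and ref_len using the standard BLEU brevity penalty: BP = exp(1 - ref_len / hyp_len) when the hypotheses are shorter than the references, and 1 otherwise. The following is a minimal sketch; the function name `bleu_brevity_penalty` is made up for illustration and is not part of ESPnet or of the scoring tool.

```shell
# Hypothetical helper (not part of ESPnet): standard BLEU brevity penalty.
bleu_brevity_penalty() {
    # $1 = hyp_len (total hypothesis tokens), $2 = ref_len (total reference tokens)
    LC_ALL=C awk -v hyp="$1" -v ref="$2" 'BEGIN {
        hyp += 0; ref += 0                       # force numeric comparison
        if (hyp <= 0)        bp = 0              # empty output: maximal penalty
        else if (hyp >= ref) bp = 1              # long enough: no penalty
        else                 bp = exp(1 - ref / hyp)
        printf "%.3f\n", bp
    }'
}

bleu_brevity_penalty 17223 22504   # clean set: prints 0.736
bleu_brevity_penalty 20241 22504   # full set:  prints 0.894
```

Both values match the reported BP, so part of the low score on the clean set is a length effect: hypotheses closer in length to the references would move the ratio toward 1 and raise BP.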

Contributor Author

freddy5566 Mar 14, 2023

Of course. No problem.

sw005320 approved these changes

sw005320 added the auto-merge label

mergify bot merged commit 4bd37a2 into espnet:master

freddy5566 deleted the feature/iwslt22-low-resource branch

March 14, 2023 17:47