
Fix joint tokenization in st.sh #4143

Merged
merged 2 commits into espnet:master from fix_st on Mar 8, 2022
Conversation

@pyf98 (Collaborator) commented Mar 7, 2022

Hi, in st.sh, the source and target texts should be merged before tokenization when token_joint is true.

This modification is based on the same stage in mt.sh. Please check whether it is correct.
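
For context, a minimal sketch of the merge this describes, assuming SentencePiece is used for BPE training as elsewhere in the espnet2 recipes; src_text, tgt_text, and joint_text are illustrative placeholders, not the actual st.sh variables:

# Sketch: with a joint token model, concatenate the source and target
# training texts first so a single BPE model covers both vocabularies.
# All file names and the vocab size below are illustrative placeholders.
if "${token_joint}"; then
    cat "${src_text}" "${tgt_text}" > "${joint_text}"
    # Train one shared BPE model on the merged text (SentencePiece CLI).
    spm_train --input="${joint_text}" \
        --model_prefix=bpe_joint \
        --vocab_size=4000 \
        --model_type=bpe
fi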

@ftshijt (Collaborator) commented Mar 7, 2022

Looks cool to me; I will merge it once CI passes. Many thanks!

@codecov codecov bot commented Mar 7, 2022

Codecov Report

Merging #4143 (e2489b1) into master (6f42960) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #4143   +/-   ##
=======================================
  Coverage   80.43%   80.43%           
=======================================
  Files         442      442           
  Lines       38557    38557           
=======================================
  Hits        31015    31015           
  Misses       7542     7542           
Flag                        Coverage Δ
test_integration_espnet1    67.13% <ø> (ø)
test_integration_espnet2    51.14% <ø> (ø)
test_python                 66.51% <ø> (ø)
test_utils                  24.45% <ø> (ø)

Flags with carried forward coverage won't be shown.

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6f42960...e2489b1.

@siddalmia (Contributor) left a review comment

if "${token_joint}"; then
# if token_joint, the bpe training will use both src_lang and tgt_lang to train a single bpe model
[ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}"
[ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}"
# Prepare data as text.${src_lang}_${tgt_lang})
cat $src_bpe_train_text $tgt_bpe_train_text > ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}
tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}"
else

This will no longer be needed.
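
Roughly, once the texts are merged at the earlier tokenization stage, this block reduces to pointing at the already-merged file (a sketch only, not the actual diff in this PR):

if "${token_joint}"; then
    # text.${src_lang}_${tgt_lang} is assumed to already exist from the
    # earlier tokenization stage, so no concatenation is needed here.
    tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}"
else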

@pyf98 (Collaborator, Author) commented Mar 8, 2022

if "${token_joint}"; then
# if token_joint, the bpe training will use both src_lang and tgt_lang to train a single bpe model
[ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}"
[ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}"
# Prepare data as text.${src_lang}_${tgt_lang})
cat $src_bpe_train_text $tgt_bpe_train_text > ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}
tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}"
else

This will no longer be needed.

Thanks for pointing it out. I removed these lines. Could you review it again?

@siddalmia (Contributor)

Looks good!

@sw005320 sw005320 added Bugfix MT Machine translation ST Speech translation labels Mar 8, 2022
@sw005320 sw005320 added this to the v.0.10.7 milestone Mar 8, 2022
@sw005320 sw005320 merged commit 537514a into espnet:master Mar 8, 2022
@pyf98 pyf98 deleted the fix_st branch March 8, 2022 20:16