Fix joint tokenization in st.sh #4143
Looks good to me. I will merge it after CI passes. Many thanks!
Codecov Report
@@           Coverage Diff           @@
##           master    #4143   +/-   ##
=======================================
  Coverage   80.43%   80.43%
=======================================
  Files         442      442
  Lines       38557    38557
=======================================
  Hits        31015    31015
  Misses       7542     7542
espnet/egs2/TEMPLATE/st1/st.sh
Lines 299 to 307 in 3ae0dcc
if "${token_joint}"; then
    # if token_joint, the bpe training will use both src_lang and tgt_lang to train a single bpe model
    [ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}"
    [ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}"
    # Prepare data as text.${src_lang}_${tgt_lang})
    cat $src_bpe_train_text $tgt_bpe_train_text > ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}
    tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}"
else
This will no longer be needed.
Thanks for pointing it out. I removed these lines. Could you review it again?
Looks good!
Hi, in st.sh, the source and target texts should be merged before tokenization if token_joint is True. This modification is based on the same stage in mt.sh. Please check if it's correct.
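The idea described above can be sketched in isolation as follows. This is a minimal illustration, not the code from st.sh or mt.sh: the temporary directory, the language codes, the sample sentences, and the variable names joint_text and bpe_train_text are all assumptions made for the example, and the else branch simply picks the target-side text as a stand-in.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical values for illustration; a real recipe sets these elsewhere.
token_joint=true
src_lang=en
tgt_lang=de
workdir=$(mktemp -d)

# Tiny stand-ins for the source and target training transcripts.
printf 'hello world\ngood morning\n' > "${workdir}/text.${src_lang}"
printf 'hallo welt\nguten morgen\n' > "${workdir}/text.${tgt_lang}"

if "${token_joint}"; then
    # Joint tokenization: concatenate both sides so that a single
    # BPE model is trained on text from both languages.
    joint_text="${workdir}/text.${src_lang}_${tgt_lang}"
    cat "${workdir}/text.${src_lang}" "${workdir}/text.${tgt_lang}" > "${joint_text}"
    bpe_train_text="${joint_text}"
else
    # Separate tokenization: in this sketch, use the target side only.
    bpe_train_text="${workdir}/text.${tgt_lang}"
fi

# Count lines in the file that would be handed to the BPE trainer.
num_lines=$(wc -l < "${bpe_train_text}" | tr -d ' ')
echo "BPE training text: ${bpe_train_text} (${num_lines} lines)"
```

Setting token_joint=false in the sketch leaves the two files untouched and points bpe_train_text at the target-side file instead of the merged one.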