[Recipe] Add iwslt22 low resource speech translation task for egs2 #4994
Merged: mergify merged 8 commits into espnet:master from freddy5566:feature/iwslt22-low-resource (Mar 14, 2023)
Commits (8):
b3fb0a4 Add IWSLT22_LOW_RESOURCE to db.sh (freddy5566)
ba66a94 Add iwslt22 low-resource realted files (freddy5566)
aa9d857 Add iwslt22 low-resource results (freddy5566)
d13388e Add descriptions for iwslt22 low-resource in egs/README.md (freddy5566)
e321bf8 Fix format issues (freddy5566)
d59dd96 Fix linter (freddy5566)
3fd6e27 Fix linter (freddy5566)
eb6782f Apply isort to preprocess.py (freddy5566)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
<!-- Generated by scripts/utils/show_translation_result.sh --> | ||
# RESULTS | ||
## Environments | ||
- date: `Fri Mar 10 16:10:20 CST 2023` | ||
- python version: `3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]` | ||
- espnet version: `espnet 202301` | ||
- pytorch version: `pytorch 1.13.1` | ||
- Git hash: `ff841366229d539eb74d23ac999cae7c0cc62cad` | ||
- Commit date: `Mon Feb 20 12:23:15 2023 -0500` | ||
|
||
## st_wav2vec-transformer-warmup-15k | ||
|
||
### BLEU | ||
|
||
|dataset|score|verbose_score| | ||
|---|---|---| | ||
|decode_pen2_st_model_valid.acc.ave/test|2.6|22.5/4.5/1.8/0.8 (BP = 0.736 ratio = 0.765 hyp_len = 17223 ref_len = 22504)| | ||
|
||
## st_full_wav2vec-transformer-warmup-15k | ||
|
||
### BLEU | ||
|
||
|dataset|score|verbose_score| | ||
|---|---|---| | ||
|decode_pen2_st_model_valid.acc.ave/test|3.6|24.7/5.4/2.1/1.0 (BP = 0.894 ratio = 0.899 hyp_len = 20241 ref_len = 22504)| | ||
|
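The `verbose_score` column lists the 1- to 4-gram precisions followed by the brevity penalty (BP). As a rough sanity check, and assuming the standard BLEU definition (BP times the geometric mean of the four precisions), the reported scores can be approximately recomposed from the rounded values in the two tables. The helper name `recompose_bleu` is hypothetical, and the result agrees with the `score` column only up to the rounding of its inputs.

```python
import math


def recompose_bleu(precisions, bp):
    """Approximate corpus BLEU from n-gram precisions (in percent) and the brevity penalty."""
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)


# Values copied from the two verbose_score cells above.
base = recompose_bleu([22.5, 4.5, 1.8, 0.8], bp=0.736)
full = recompose_bleu([24.7, 5.4, 2.1, 1.0], bp=0.894)
print(f"st_wav2vec: {base:.2f}  st_full_wav2vec: {full:.2f}")
```

Both values land within about 0.1 BLEU of the reported 2.6 and 3.6, which suggests the tables are internally consistent.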
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ====== | ||
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...> | ||
# e.g. | ||
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB | ||
# | ||
# Options: | ||
# --time <time>: Limit the maximum time to execute. | ||
# --mem <mem>: Limit the maximum memory usage. | ||
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs. | ||
# --num-threads <nthreads>: Specify the number of CPU cores. | ||
# --gpu <ngpu>: Specify the number of GPU devices. | ||
# --config: Change the configuration file from default. | ||
# | ||
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs. | ||
# The left string of "=", i.e. "JOB", is replaced by <N>(Nth job) in the command and the log file name, | ||
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively. | ||
# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example. | ||
# | ||
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend. | ||
# These options are mapped to backend-specific options, and | ||
# the mapping is configured by "conf/queue.conf" and "conf/slurm.conf" by default. | ||
# If jobs fail, your configuration might be wrong for your environment. | ||
# | ||
# | ||
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl: | ||
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html | ||
# =========================================================~ | ||
|
||
|
||
# Select the backend used by run.sh from "local", "stdout", "sge", "pbs", "slurm", "ssh", or "jhu" | ||
cmd_backend='local' | ||
|
||
# Local machine, without any Job scheduling system | ||
if [ "${cmd_backend}" = local ]; then | ||
|
||
# The other usage | ||
export train_cmd="run.pl" | ||
# Used for "*_train.py": "--gpu" is appended optionally by run.sh | ||
export cuda_cmd="run.pl" | ||
# Used for "*_recog.py" | ||
export decode_cmd="run.pl" | ||
|
||
# Local machine logging to stdout and log file, without any Job scheduling system | ||
elif [ "${cmd_backend}" = stdout ]; then | ||
|
||
# The other usage | ||
export train_cmd="stdout.pl" | ||
# Used for "*_train.py": "--gpu" is appended optionally by run.sh | ||
export cuda_cmd="stdout.pl" | ||
# Used for "*_recog.py" | ||
export decode_cmd="stdout.pl" | ||
|
||
|
||
# "qsub" (Sun Grid Engine, or derivation of it) | ||
elif [ "${cmd_backend}" = sge ]; then | ||
# The default setting is written in conf/queue.conf. | ||
# You must change "-q g.q" for the "queue" for your environment. | ||
# To know the "queue" names, type "qhost -q" | ||
# Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler. | ||
|
||
export train_cmd="queue.pl" | ||
export cuda_cmd="queue.pl" | ||
export decode_cmd="queue.pl" | ||
|
||
|
||
# "qsub" (Torque/PBS.) | ||
elif [ "${cmd_backend}" = pbs ]; then | ||
# The default setting is written in conf/pbs.conf. | ||
|
||
export train_cmd="pbs.pl" | ||
export cuda_cmd="pbs.pl" | ||
export decode_cmd="pbs.pl" | ||
|
||
|
||
# "sbatch" (Slurm) | ||
elif [ "${cmd_backend}" = slurm ]; then | ||
# The default setting is written in conf/slurm.conf. | ||
# You must change "-p cpu" and "-p gpu" for the "partition" for your environment. | ||
# To know the "partition" names, type "sinfo". | ||
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*" | ||
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}". | ||
|
||
export train_cmd="slurm.pl" | ||
export cuda_cmd="slurm.pl" | ||
export decode_cmd="slurm.pl" | ||
|
||
elif [ "${cmd_backend}" = ssh ]; then | ||
# You have to create ".queue/machines" to specify the host to execute jobs. | ||
# e.g. .queue/machines | ||
# host1 | ||
# host2 | ||
# host3 | ||
# This assumes you can log in to them without a password, i.e. you have to set up SSH keys. | ||
|
||
export train_cmd="ssh.pl" | ||
export cuda_cmd="ssh.pl" | ||
export decode_cmd="ssh.pl" | ||
|
||
# This is an example of specifying several unique options in the JHU CLSP cluster setup. | ||
# Users can modify/add their own command options according to their cluster environments. | ||
elif [ "${cmd_backend}" = jhu ]; then | ||
|
||
export train_cmd="queue.pl --mem 2G" | ||
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf" | ||
export decode_cmd="queue.pl --mem 4G" | ||
|
||
else | ||
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2 | ||
return 1 | ||
fi |
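The array-job convention described in the header comment of this file ("JOB=1:10" replaces the string "JOB" in both the command and the log file name) can be sketched as a plain string substitution. This only illustrates the naming rule and is not how run.pl is implemented; `expand_array_job` is a hypothetical helper.

```python
def expand_array_job(spec, log, command):
    """Expand a spec such as 'JOB=1:3' into (log, command) pairs, one per job index."""
    name, job_range = spec.split("=")
    start, end = (int(x) for x in job_range.split(":"))
    if start < 1:
        # Mirrors the note above: "JOB=0:10" is not allowed.
        raise ValueError("the job index must start from a positive number")
    return [
        (log.replace(name, str(n)), [arg.replace(name, str(n)) for arg in command])
        for n in range(start, end + 1)
    ]


for log_file, cmd in expand_array_job("JOB=1:3", "echo.JOB.log", ["echo", "JOB"]):
    print(log_file, cmd)
```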
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
tuning/decode_pen2.yaml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
--sample-frequency=16000 | ||
--num-mel-bins=80 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Default configuration | ||
command qsub -V -v PATH -S /bin/bash | ||
option name=* -N $0 | ||
option mem=* -l mem=$0 | ||
option mem=0 # Do not add anything to qsub_opts | ||
option num_threads=* -l ncpus=$0 | ||
option num_threads=1 # Do not add anything to qsub_opts | ||
option num_nodes=* -l nodes=$0:ppn=1 | ||
default gpu=0 | ||
option gpu=0 | ||
option gpu=* -l ngpus=$0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
--sample-frequency=16000 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
# Default configuration | ||
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* | ||
option name=* -N $0 | ||
option mem=* -l mem_free=$0,ram_free=$0 | ||
option mem=0 # Do not add anything to qsub_opts | ||
option num_threads=* -pe smp $0 | ||
option num_threads=1 # Do not add anything to qsub_opts | ||
option max_jobs_run=* -tc $0 | ||
option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1 | ||
default gpu=0 | ||
option gpu=0 | ||
option gpu=* -l gpu=$0 -q g.q |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Default configuration | ||
command sbatch --export=PATH | ||
option name=* --job-name $0 | ||
option time=* --time $0 | ||
option mem=* --mem-per-cpu $0 | ||
option mem=0 | ||
option num_threads=* --cpus-per-task $0 | ||
option num_threads=1 --cpus-per-task 1 | ||
option num_nodes=* --nodes $0 | ||
default gpu=0 | ||
option gpu=0 -p cpu | ||
option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommend allocating at least as many CPUs as GPUs | ||
# note: the --max-jobs-run option is supported as a special case | ||
# by slurm.pl and you don't have to handle it in the config file. |
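The `option name=value flags` lines in these scheduler configs map a generic option such as `--gpu 2` to backend-specific flags. A minimal sketch of that lookup follows, assuming the rule is "an exact `name=value` entry wins, otherwise the `name=*` entry is used with `$0` replaced by the value". The function `resolve_option` is hypothetical and is not the parser used by queue.pl or slurm.pl.

```python
def resolve_option(config_lines, name, value):
    """Return the scheduler flags for name=value; exact entries win over 'name=*'."""
    exact, wildcard = {}, {}
    for line in config_lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line.startswith("option "):
            continue
        parts = line.split(None, 2)  # ["option", "name=value", "flags ..."]
        key, val = parts[1].split("=", 1)
        flags = parts[2] if len(parts) > 2 else ""
        (wildcard if val == "*" else exact)[(key, val)] = flags
    if (name, value) in exact:
        return exact[(name, value)]
    return wildcard.get((name, "*"), "").replace("$0", value)


# Entries copied from the Slurm configuration above.
slurm_conf = [
    "option mem=* --mem-per-cpu $0",
    "option mem=0",
    "option gpu=0 -p cpu",
    "option gpu=* -p gpu --gres=gpu:$0 -c $0 # more CPUs than GPUs",
]
print(resolve_option(slurm_conf, "gpu", "0"))
print(resolve_option(slurm_conf, "gpu", "2"))
```

Under this reading, `option mem=0` with no flags means that a memory request of 0 adds nothing to the submission command.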
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
tuning/train_st_transformer_warmup15k.yaml |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
batch_size: 1 | ||
beam_size: 10 | ||
penalty: 0.2 | ||
maxlenratio: 0.0 | ||
minlenratio: 0.0 | ||
lm_weight: 0.0 |
81 changes: 81 additions & 0 deletions
egs2/iwslt22_low_resource/st1/conf/tuning/train_st_transformer_warmup15k.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
batch_type: numel | ||
batch_bins: 2000000 | ||
accum_grad: 32 # RTX 3090 Ti X 1 | ||
max_epoch: 80 | ||
patience: none | ||
best_model_criterion: | ||
- - valid | ||
- acc | ||
- max | ||
keep_nbest_models: 10 | ||
|
||
# encoder related | ||
encoder: transformer | ||
encoder_conf: | ||
input_layer: conv2d | ||
num_blocks: 12 | ||
linear_units: 2048 | ||
dropout_rate: 0.1 | ||
output_size: 256 # dimension of attention | ||
attention_heads: 4 | ||
|
||
decoder: transformer | ||
decoder_conf: | ||
attention_heads: 4 | ||
linear_units: 2048 | ||
num_blocks: 6 | ||
dropout_rate: 0.1 | ||
positional_dropout_rate: 0.1 | ||
self_attention_dropout_rate: 0.1 | ||
src_attention_dropout_rate: 0.1 | ||
|
||
model_conf: | ||
asr_weight: 0.0 | ||
mt_weight: 0.0 | ||
mtlalpha: 0.0 | ||
lsm_weight: 0.1 | ||
length_normalized_loss: false | ||
extract_feats_in_collect_stats: false | ||
|
||
optim: adam | ||
grad_clip: 3 | ||
optim_conf: | ||
lr: 12.5 | ||
scheduler: noamlr | ||
scheduler_conf: | ||
model_size: 256 | ||
warmup_steps: 15000 | ||
|
||
frontend: s3prl | ||
frontend_conf: | ||
frontend_conf: | ||
upstream: hf_wav2vec2_custom # Note: If the upstream is changed, please change the input_size in the preencoder. | ||
path_or_url: LIA-AvignonUniversity/IWSLT2022-tamasheq-only | ||
download_dir: ./hub | ||
multilayer_feature: True | ||
|
||
preencoder: linear | ||
preencoder_conf: | ||
input_size: 768 # Note: If the upstream is changed, please change this value accordingly. | ||
output_size: 80 | ||
|
||
specaug: specaug | ||
specaug_conf: | ||
apply_time_warp: true | ||
time_warp_window: 5 | ||
time_warp_mode: bicubic | ||
apply_freq_mask: true | ||
freq_mask_width_range: | ||
- 0 | ||
- 30 | ||
num_freq_mask: 2 | ||
apply_time_mask: true | ||
time_mask_width_range: | ||
- 0 | ||
- 40 | ||
num_time_mask: 2 | ||
|
||
freeze_param: [ | ||
"frontend.upstream" | ||
] | ||
|
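The value `lr: 12.5` looks unusually large, but with `scheduler: noamlr` it acts as a scale factor rather than the actual learning rate. A minimal sketch follows, assuming `noamlr` implements the standard Noam schedule (linear warmup followed by inverse-square-root decay, scaled by `model_size ** -0.5`). The function `noam_lr` is a hypothetical stand-in, not the ESPnet class.

```python
def noam_lr(step, base_lr=12.5, model_size=256, warmup_steps=15000):
    """Noam schedule: linear warmup up to warmup_steps, then 1/sqrt(step) decay."""
    return base_lr * model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Defaults are taken from optim_conf and scheduler_conf above.
for step in (1, 5000, 15000, 60000):
    print(step, f"{noam_lr(step):.6f}")
```

Under this assumption the learning rate peaks at step 15000 at roughly 6.4e-3 and has halved by step 60000.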
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
../../TEMPLATE/st1/db.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
#!/usr/bin/env bash | ||
# Set bash to 'debug' mode, it will exit on : | ||
# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands', | ||
set -e | ||
set -u | ||
set -o pipefail | ||
|
||
. ./db.sh || exit 1; | ||
. ./path.sh || exit 1; | ||
. ./cmd.sh || exit 1; | ||
|
||
log() { | ||
local fname=${BASH_SOURCE[1]##*/} | ||
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*" | ||
} | ||
SECONDS=0 | ||
|
||
stage=1 | ||
stop_stage=100000 | ||
splits_dir=data/iwslt22_taq_fra_split | ||
|
||
log "$0 $*" | ||
. utils/parse_options.sh | ||
|
||
if [ -z "${IWSLT22_LOW_RESOURCE}" ]; then | ||
log "Fill the value of 'IWSLT22_LOW_RESOURCE' of db.sh" | ||
exit 1 | ||
fi | ||
|
||
if [ $# -ne 0 ]; then | ||
log "Error: No positional arguments are required." | ||
exit 2 | ||
fi | ||
|
||
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ] && [ ! -d "${splits_dir}" ]; then | ||
log "stage 1: Official splits from IWSLT Low Resource Speech Translation" | ||
|
||
git clone https://github.com/mzboito/IWSLT2022_Tamasheq_data.git ${splits_dir} | ||
|
||
# train comprises 17 hours of clean speech in Tamasheq, translated to the French language | ||
# train_full comprises a 19 hour version of this corpus, | ||
# including 2 additional hours of data that was labeled by annotators as potentially noisy | ||
mkdir -p data/train/org | ||
mkdir -p data/train_full/org | ||
mkdir -p data/valid/org | ||
mkdir -p data/test/org | ||
|
||
for set in train valid test | ||
do | ||
cp -r ${splits_dir}/taq_fra_clean/${set}/* data/${set}/org | ||
done | ||
cp -r ${splits_dir}/taq_fra_full/train/* data/train_full/org | ||
|
||
fi | ||
|
||
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then | ||
log "stage 2: Data Preparation" | ||
|
||
for set in train train_full valid test | ||
do | ||
python local/preprocess.py --out data/${set} --data data/${set}/org | ||
|
||
cp data/${set}/text.fr data/${set}/text | ||
|
||
utils/utt2spk_to_spk2utt.pl data/${set}/utt2spk > data/${set}/spk2utt | ||
utils/fix_data_dir.sh --utt_extra_files "text.fr" data/${set} | ||
utils/validate_data_dir.sh --no-feats data/${set} || exit 1 | ||
done | ||
fi | ||
|
||
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then | ||
log "stage 3: Normalize Transcripts" | ||
|
||
# check extra module installation | ||
if ! command -v tokenizer.perl > /dev/null; then | ||
echo "Error: it seems that moses is not installed." >&2 | ||
echo "Error: please install moses as follows." >&2 | ||
echo "Error: cd ${MAIN_ROOT}/tools && make moses.done" >&2 | ||
exit 1 | ||
fi | ||
|
||
for set in train train_full valid test | ||
do | ||
cut -d ' ' -f 2- data/${set}/text.fr > data/${set}/fr.org | ||
cut -d ' ' -f 1 data/${set}/text.fr > data/${set}/uttlist | ||
|
||
# tokenize | ||
tokenizer.perl -l fr -q < data/${set}/fr.org > data/${set}/fr.tok | ||
paste -d ' ' data/${set}/uttlist data/${set}/fr.tok > data/${set}/text.tc.fr | ||
|
||
# remove lines that are empty after tokenization (they previously contained only punctuation) | ||
# small trick to use fix_data_dir.sh as is: write the filtered lines to text so that the extra files are reduced to match | ||
<"data/${set}/text.tc.fr" awk ' { if( NF != 1 ) print $0; } ' >"data/${set}/text" | ||
utils/fix_data_dir.sh --utt_extra_files "text.tc.fr text.fr" data/${set} | ||
cp data/${set}/text.tc.fr data/${set}/text | ||
utils/fix_data_dir.sh --utt_extra_files "text.tc.fr text.fr" data/${set} | ||
utils/validate_data_dir.sh --no-feats data/${set} || exit 1 | ||
done | ||
fi | ||
|
||
log "Successfully finished. [elapsed=${SECONDS}s]" |
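The awk filter in stage 3 (`NF != 1`) drops every line that holds only an utterance ID, that is, an utterance whose transcript is empty. The same test can be sketched in Python as below. The function name and the utterance IDs are hypothetical and used only for illustration.

```python
def drop_empty_utterances(lines):
    """Keep lines whose field count is not exactly one, mirroring awk's `NF != 1`."""
    return [line for line in lines if len(line.split()) != 1]


# Hypothetical Kaldi-style text lines: "<utt_id> <tokenized transcript>".
text_tc_fr = ["utt1 bonjour .", "utt2", "utt3 oui"]
print(drop_empty_utterances(text_tc_fr))
```

In the recipe, the filtered list is written to `text`, and `utils/fix_data_dir.sh` then reduces the other files in the data directory to the remaining utterances.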
Empty file.
The length ratio looks too low to me (the hypotheses are too short).
Could you tune the length penalty (perhaps in a later PR)?

Of course. No problem.
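The concern raised in the review can be checked against the `hyp_len` and `ref_len` values reported in the results. A minimal sketch follows, assuming the standard BLEU brevity penalty. `brevity_penalty` is a hypothetical helper, not part of the recipe.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1.0 if hyp is long enough, else exp(1 - ref/hyp)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


# Lengths copied from the two BLEU rows in the results.
for name, hyp_len in [("st_wav2vec", 17223), ("st_full_wav2vec", 20241)]:
    ratio = hyp_len / 22504
    print(f"{name}: ratio={ratio:.3f} BP={brevity_penalty(hyp_len, 22504):.3f}")
```

With a ratio of about 0.77, the first system loses roughly a quarter of its score to the penalty alone, which is why tuning the decoding length penalty is a reasonable follow-up.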