[Recipe] Add iwslt22 low resource speech translation task for egs2 #4994

Merged 8 commits on Mar 14, 2023
1 change: 1 addition & 0 deletions egs2/README.md
@@ -63,6 +63,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| iwslt14 | IWSLT14 MT shared task | MT | DEU->ENG | http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz | |
| iwslt21_low_resource | ALFFA, IARPA Babel, Gamayun, IWSLT 2021 | ASR | SWA | http://www.openslr.org/25/ https://catalog.ldc.upenn.edu/LDC2017S05 https://gamayun.translatorswb.org/data/ https://iwslt.org/2021/low-resource | |
| iwslt22_dialect | IWSLT2022 dialectal speech translation shared task | ASR/ST | ARA->Tunisian ARA | https://github.com/kevinduh/iwslt22-dialect.git | |
| iwslt22_low_resource | IWSLT2022 low-resource speech translation track | ST | Tamasheq->French | https://github.com/mzboito/IWSLT2022_Tamasheq_data.git | |
| jdcinal | Japanese Dialogue Corpus of Information Navigation and Attentive Listening Annotated with Extended ISO-24617-2 Dialogue Act Tags | SLU | JPN | http://www.lrec-conf.org/proceedings/lrec2018/pdf/464.pdf http://tts.speech.cs.cmu.edu/awb/infomation_navigation_and_attentive_listening_0.2.zip | |
| jkac | J-KAC: Japanese Kamishibai and audiobook corpus | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/j-kac_corpus | |
| jmd | JMD: Japanese multi-dialect corpus for speech synthesis | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/jmd_corpus | |
1 change: 1 addition & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -137,6 +137,7 @@ CMU_ARCTIC=downloads
CMU_INDIC=downloads
INDIC_SPEECH=downloads
IWSLT22_DIALECT=
IWSLT22_LOW_RESOURCE=downloads
JKAC=
MUCS_SUBTASK1=downloads
MUCS_SUBTASK2=downloads
26 changes: 26 additions & 0 deletions egs2/iwslt22_low_resource/st1/RESULTS.md
@@ -0,0 +1,26 @@
<!-- Generated by scripts/utils/show_translation_result.sh -->
# RESULTS
## Environments
- date: `Fri Mar 10 16:10:20 CST 2023`
- python version: `3.8.16 (default, Jan 17 2023, 23:13:24) [GCC 11.2.0]`
- espnet version: `espnet 202301`
- pytorch version: `pytorch 1.13.1`
- Git hash: `ff841366229d539eb74d23ac999cae7c0cc62cad`
- Commit date: `Mon Feb 20 12:23:15 2023 -0500`

## st_wav2vec-transformer-warmup-15k

### BLEU

|dataset|score|verbose_score|
|---|---|---|
|decode_pen2_st_model_valid.acc.ave/test|2.6|22.5/4.5/1.8/0.8 (BP = 0.736 ratio = 0.765 hyp_len = 17223 ref_len = 22504)|
Contributor
The ratio looks too far off to me (the hypotheses are too short).
Can you tune the length penalty (in a later PR)?

Contributor Author
Of course. No problem.
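
For context, the brevity penalty in the verbose score above follows the standard BLEU formula BP = exp(1 - ref_len/hyp_len) when hyp_len < ref_len, so exp(1 - 22504/17223) ≈ 0.736 for this run, which is the gap being flagged. One possible direction for the follow-up tuning (a sketch with illustrative values, not settings from this recipe) is to raise the per-token length bonus in conf/tuning/decode_pen2.yaml, since a larger bonus favors longer hypotheses:

batch_size: 1
beam_size: 10
penalty: 0.6        # illustrative value to be swept; the merged config uses 0.2
maxlenratio: 0.0
minlenratio: 0.0
lm_weight: 0.0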


## st_full_wav2vec-transformer-warmup-15k

### BLEU

|dataset|score|verbose_score|
|---|---|---|
|decode_pen2_st_model_valid.acc.ave/test|3.6|24.7/5.4/2.1/1.0 (BP = 0.894 ratio = 0.899 hyp_len = 20241 ref_len = 22504)|

110 changes: 110 additions & 0 deletions egs2/iwslt22_low_resource/st1/cmd.sh
@@ -0,0 +1,110 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
#   --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
#   --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The string left of "=", i.e. "JOB", is replaced by <N> (the N-th job) in the command and the log file name,
# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for the 8th job, respectively.
# Note that the range must start with a positive number, so you can't use "JOB=0:10", for example.
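# A further illustrative example (the script name and paths are hypothetical):
#   run.pl --gpu 1 JOB=1:4 exp/decode/decode.JOB.log python decode_shard.py --shard JOB
# launches 4 jobs in parallel; JOB is replaced by 1..4 in both the command line and the
# log file names (exp/decode/decode.1.log, ..., exp/decode/decode.4.log).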
#
# run.pl, queue.pl, slurm.pl, and ssh.pl share a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, as configured by
# "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================


# Select the backend used by run.sh from "local", "stdout", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

    # Used for other general jobs
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# Local machine logging to stdout and log file, without any Job scheduling system
elif [ "${cmd_backend}" = stdout ]; then

    # Used for other general jobs
export train_cmd="stdout.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="stdout.pl"
# Used for "*_recog.py"
export decode_cmd="stdout.pl"


# "qsub" (Sun Grid Engine, or derivation of it)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
    # You must change "-q g.q" to match the "queue" in your environment.
    # To list the "queue" names, type "qhost -q".
    # Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"


# "qsub" (Torque/PBS.)
elif [ "${cmd_backend}" = pbs ]; then
# The default setting is written in conf/pbs.conf.

export train_cmd="pbs.pl"
export cuda_cmd="pbs.pl"
export decode_cmd="pbs.pl"


# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
    # You must change "-p cpu" and "-p gpu" to match the "partition" names in your environment.
    # To list the "partition" names, type "sinfo".
    # You can use "--gpu *" by default for Slurm, and it is interpreted as "--gres gpu:*".
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
    # It is assumed that you can log in to them without a password, i.e., you have to set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
2 changes: 2 additions & 0 deletions egs2/iwslt22_low_resource/st1/conf/fbank.conf
@@ -0,0 +1,2 @@
--sample-frequency=16000
--num-mel-bins=80
11 changes: 11 additions & 0 deletions egs2/iwslt22_low_resource/st1/conf/pbs.conf
@@ -0,0 +1,11 @@
# Default configuration
command qsub -V -v PATH -S /bin/bash
option name=* -N $0
option mem=* -l mem=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -l ncpus=$0
option num_threads=1 # Do not add anything to qsub_opts
option num_nodes=* -l nodes=$0:ppn=1
default gpu=0
option gpu=0
option gpu=* -l ngpus=$0
1 change: 1 addition & 0 deletions egs2/iwslt22_low_resource/st1/conf/pitch.conf
@@ -0,0 +1 @@
--sample-frequency=16000
12 changes: 12 additions & 0 deletions egs2/iwslt22_low_resource/st1/conf/queue.conf
@@ -0,0 +1,12 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option name=* -N $0
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs2/iwslt22_low_resource/st1/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0  # Recommended: allocate at least as many CPUs as GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
6 changes: 6 additions & 0 deletions egs2/iwslt22_low_resource/st1/conf/tuning/decode_pen2.yaml
@@ -0,0 +1,6 @@
batch_size: 1
beam_size: 10
penalty: 0.2
maxlenratio: 0.0
minlenratio: 0.0
lm_weight: 0.0
@@ -0,0 +1,81 @@
batch_type: numel
batch_bins: 2000000
accum_grad: 32 # RTX 3090 Ti X 1
max_epoch: 80
patience: none
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10

# encoder related
encoder: transformer
encoder_conf:
input_layer: conv2d
num_blocks: 12
linear_units: 2048
dropout_rate: 0.1
output_size: 256 # dimension of attention
attention_heads: 4

decoder: transformer
decoder_conf:
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.1
src_attention_dropout_rate: 0.1

model_conf:
asr_weight: 0.0
mt_weight: 0.0
mtlalpha: 0.0
lsm_weight: 0.1
length_normalized_loss: false
extract_feats_in_collect_stats: false

optim: adam
grad_clip: 3
optim_conf:
lr: 12.5
scheduler: noamlr
scheduler_conf:
model_size: 256
warmup_steps: 15000

frontend: s3prl
frontend_conf:
frontend_conf:
upstream: hf_wav2vec2_custom # Note: If the upstream is changed, please change the input_size in the preencoder.
path_or_url: LIA-AvignonUniversity/IWSLT2022-tamasheq-only
download_dir: ./hub
multilayer_feature: True

preencoder: linear
preencoder_conf:
input_size: 768 # Note: If the upstream is changed, please change this value accordingly.
output_size: 80
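# Illustrative note: if the upstream were swapped for a model with 1024-dimensional
# features (e.g. a wav2vec 2.0 "large" variant), input_size above would need to be 1024.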

specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 30
num_freq_mask: 2
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2

freeze_param: [
"frontend.upstream"
]
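# Freezing "frontend.upstream" keeps the pretrained wav2vec 2.0 weights fixed during
# training; the remaining parameters (e.g. the linear preencoder, encoder, and decoder)
# are updated.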

1 change: 1 addition & 0 deletions egs2/iwslt22_low_resource/st1/db.sh
101 changes: 101 additions & 0 deletions egs2/iwslt22_low_resource/st1/local/data.sh
@@ -0,0 +1,101 @@
#!/usr/bin/env bash
# Set bash to 'strict' mode; the script will exit on:
# -e 'error', -u 'undefined variable', and -o pipefail 'error in a pipeline'.
set -e
set -u
set -o pipefail

. ./db.sh || exit 1;
. ./path.sh || exit 1;
. ./cmd.sh || exit 1;

log() {
local fname=${BASH_SOURCE[1]##*/}
echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
}
SECONDS=0

stage=1
stop_stage=100000
splits_dir=data/iwslt22_taq_fra_split

log "$0 $*"
. utils/parse_options.sh

if [ -z "${IWSLT22_LOW_RESOURCE}" ]; then
log "Fill the value of 'IWSLT22_LOW_RESOURCE' of db.sh"
exit 1
fi

if [ $# -ne 0 ]; then
log "Error: No positional arguments are required."
exit 2
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ] && [ ! -d "${splits_dir}" ]; then
log "stage 1: Official splits from IWSLT Low Resource Speech Translation"

git clone https://github.com/mzboito/IWSLT2022_Tamasheq_data.git ${splits_dir}

    # train comprises 17 hours of clean speech in Tamasheq, translated into French.
    # train_full is a 19-hour version of this corpus,
    # including 2 additional hours of data labeled by annotators as potentially noisy.
mkdir -p data/train/org
mkdir -p data/train_full/org
mkdir -p data/valid/org
mkdir -p data/test/org

for set in train valid test
do
cp -r ${splits_dir}/taq_fra_clean/${set}/* data/${set}/org
done
cp -r ${splits_dir}/taq_fra_full/train/* data/train_full/org
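    # Note: the copies above assume the cloned repository provides the
    # taq_fra_clean/{train,valid,test} and taq_fra_full/train subdirectories.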

fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
log "stage 2: Data Preparation"

for set in train train_full valid test
do
python local/preprocess.py --out data/${set} --data data/${set}/org

cp data/${set}/text.fr data/${set}/text

utils/utt2spk_to_spk2utt.pl data/${set}/utt2spk > data/${set}/spk2utt
utils/fix_data_dir.sh --utt_extra_files "text.fr" data/${set}
utils/validate_data_dir.sh --no-feats data/${set} || exit 1
done
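    # After this loop, each data/${set} directory is expected to contain the usual Kaldi-style
    # files (e.g. wav.scp, utt2spk, spk2utt, text, plus the extra text.fr); the exact set
    # depends on what local/preprocess.py writes.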
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
log "stage 3: Normalize Transcripts"

# check extra module installation
if ! command -v tokenizer.perl > /dev/null; then
echo "Error: it seems that moses is not installed." >&2
echo "Error: please install moses as follows." >&2
echo "Error: cd ${MAIN_ROOT}/tools && make moses.done" >&2
exit 1
fi

for set in train train_full valid test
do
cut -d ' ' -f 2- data/${set}/text.fr > data/${set}/fr.org
cut -d ' ' -f 1 data/${set}/text.fr > data/${set}/uttlist

# tokenize
tokenizer.perl -l fr -q < data/${set}/fr.org > data/${set}/fr.tok
paste -d ' ' data/${set}/uttlist data/${set}/fr.tok > data/${set}/text.tc.fr

        # remove lines that are now empty (their text was previously only punctuation);
        # this small step lets fix_data_dir be used as is, since it reduces lines based on the extra files
<"data/${set}/text.tc.fr" awk ' { if( NF != 1 ) print $0; } ' >"data/${set}/text"
utils/fix_data_dir.sh --utt_extra_files "text.tc.fr text.fr" data/${set}
cp data/${set}/text.tc.fr data/${set}/text
utils/fix_data_dir.sh --utt_extra_files "text.tc.fr text.fr" data/${set}
utils/validate_data_dir.sh --no-feats data/${set} || exit 1
done
fi

log "Successfully finished. [elapsed=${SECONDS}s]"