
Commit

Merge pull request #4770 from espnet/master
Update uasr
ftshijt committed Nov 16, 2022
2 parents 01b2242 + 209ffa0 commit b29b1a7
Showing 74 changed files with 2,988 additions and 24 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -83,3 +83,7 @@ tools/anaconda
tools/ice-g2p
tools/fairseq
tools/._*
+tools/anaconda
+tools/ice-g2p*
+tools/fairseq*
+tools/featbin*
4 changes: 3 additions & 1 deletion egs2/README.md
100755 → 100644
@@ -55,6 +55,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| how2_2000h | How2_2000h fbank features | ASR/SUM | ENG->POR | https://arxiv.org/pdf/2110.06263.pdf | |
| hub4_spanish | 1997 Spanish Broadcast News Speech | ASR | SPA | https://catalog.ldc.upenn.edu/LDC98S74 | |
| hui_acg | HUI-audio-corpus-german | TTS | DEU | https://opendata.iisys.de/datasets.html#hui-audio-corpus-german | |
| iam | IAM Handwriting Database 3.0 | OCR | ENG | https://fki.tic.heia-fr.ch/databases/iam-handwriting-database | |
| iemocap | IEMOCAP database: The Interactive Emotional Dyadic Motion Capture database | SLU | ENG | https://sail.usc.edu/iemocap/ | |
| indic_speech | IndicSpeech: Text-to-Speech Corpus for Indian Languages | TTS | 3 indic languages | http://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages | |
| iwslt14 | IWSLT14 MT shared task | MT | DEU->ENG | http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz | |
@@ -123,7 +124,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| swbd | Switchboard Corpus for 2-channel Conversational Telephone Speech (300h) | ASR | ENG | https://catalog.ldc.upenn.edu/LDC97S62 | |
| swbd_da | NXT Switchboard Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2009T26 | |
| swbd_sentiment | Speech Sentiment Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2020T14 | |
| talromur | Talromur: A large Icelandic TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104, https://aclanthology.org/2021.nodalida-main.50.pdf | |
| talromur2 | Talromur 2: Icelandic multi-speaker TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/167 | |
| tedlium2 | TED-LIUM corpus release 2 | ASR | ENG | https://www.openslr.org/19/, http://www.lrec-conf.org/proceedings/lrec2014/pdf/1104_Paper.pdf | |
| tedx_spanish_openslr67 | TEDx Spanish Corpus | ASR | SPA | https://www.openslr.org/67/ | |
| thchs30 | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University | ASR/TTS | CMN | https://www.openslr.org/18/ | |
2 changes: 1 addition & 1 deletion egs2/TEMPLATE/asr1/asr.sh
@@ -552,7 +552,7 @@ if ! "${skip_data_prep}"; then
_suf=""
fi
# Generate dummy wav.scp to avoid error by copy_data_dir.sh
-            <data/"${dset}"/cmvn.scp awk ' { print($1,"<DUMMY>") }' > data/"${dset}"/wav.scp
+            <data/"${dset}"/feats.scp awk ' { print($1,"<DUMMY>") }' > data/"${dset}"/wav.scp
utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}${_suf}/${dset}"

            # Derive the frame length and feature dimension
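The fix above switches the key source for the dummy wav.scp from cmvn.scp to feats.scp: feats.scp lists every utterance, while cmvn.scp is typically indexed per speaker, so the old command could emit wav.scp keys that do not match the utterance IDs. A minimal sketch of what the awk call produces (the utterance IDs and archive paths are hypothetical):

```sh
# feats.scp (Kaldi scp format: <utt-id> <ark-path>:<offset>):
#   utt1 dump/fbank/train/feats.ark:12
#   utt2 dump/fbank/train/feats.ark:345
# Keep only the first field and pair it with a placeholder token:
<data/"${dset}"/feats.scp awk '{ print($1, "<DUMMY>") }' > data/"${dset}"/wav.scp
# wav.scp now has one "<DUMMY>" entry per utterance, which satisfies
# the validation done by utils/copy_data_dir.sh:
#   utt1 <DUMMY>
#   utt2 <DUMMY>
```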
2 changes: 2 additions & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -156,8 +156,10 @@ VOXFORGE=downloads
VOXPOPULI=downloads
HARPERVALLEY=downloads
TALROMUR=downloads
TALROMUR2=downloads
DCASE=
TEDX_SPANISH=downloads
IAM=downloads
OFUTON=
OPENCPOP=
M_AILABS=downloads
15 changes: 7 additions & 8 deletions egs2/TEMPLATE/enh_asr1/enh_asr.sh
@@ -703,6 +703,12 @@ if ! "${skip_data_prep}"; then
fi
done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_train.txt"
fi
if [ "$lm_dev_text" = "${data_feats}/${valid_set}/text" ]; then
for n in $(seq ${spk_num}); do
awk -v spk=$n '{$1=$1 "_spk" spk; print $0}' "${data_feats}/${valid_set}/text_spk${n}"
done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_dev.txt"
lm_dev_text="${data_feats}/lm_dev.txt"
fi
fi
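The block added above mirrors the lm_train.txt construction directly before it: when lm_dev_text still points at the recipe's validation text, the per-speaker references text_spk1 … text_spkN are merged into a single lm_dev.txt, each utterance ID gets a _spkN suffix so the merged keys stay unique, and ID-only lines (empty transcriptions) are dropped. A small sketch of the two awk stages (the utterance IDs are hypothetical):

```sh
# text_spk1 contains:  utt1 hello world
# text_spk2 contains:  utt1 good morning
awk -v spk=1 '{$1=$1 "_spk" spk; print $0}' text_spk1   # -> utt1_spk1 hello world
awk -v spk=2 '{$1=$1 "_spk" spk; print $0}' text_spk2   # -> utt1_spk2 good morning
# The final filter keeps only records with more than one field,
# discarding utterances whose transcription is empty:
awk '{ if (NF != 1) print $0; }'
```

The same block is removed from the LM-training branch further down, so lm_dev.txt is now prepared together with the rest of the data.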


@@ -793,13 +799,6 @@ fi

if ! "${skip_train}"; then
if "${use_lm}"; then
if [ "$lm_dev_text" = "${data_feats}/${valid_set}/text" ]; then
for n in $(seq ${spk_num}); do
awk -v spk=$n '{$1=$1 "_spk" spk; print $0}' "${data_feats}/${valid_set}/text_spk${n}"
done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_dev.txt"
lm_dev_text="${data_feats}/lm_dev.txt"
fi

if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
log "Stage 6: LM collect stats: train_set=${data_feats}/lm_train.txt, dev_set=${lm_dev_text}"

@@ -1337,7 +1336,7 @@ if ! "${skip_train}"; then
--cleaner "${cleaner}" \
--g2p "${g2p}" \
--resume true \
-            --init_param ${pretrained_model} \
+            ${pretrained_model:+--init_param $pretrained_model} \
--ignore_init_mismatch ${ignore_init_mismatch} \
--output_dir "${enh_asr_exp}" \
${_opts} ${enh_asr_args}
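The one-line change above replaces an unconditional `--init_param ${pretrained_model}` with bash's alternate-value expansion `${var:+word}`, which expands to `word` only when `var` is set and non-empty; with an empty `pretrained_model`, the old form left a bare `--init_param` with no argument on the command line. A quick demonstration (the checkpoint path is hypothetical):

```sh
pretrained_model=""
echo ${pretrained_model:+--init_param $pretrained_model}
# (prints nothing: the flag is omitted entirely)

pretrained_model="exp/asr_pretrain/valid.acc.ave.pth"
echo ${pretrained_model:+--init_param $pretrained_model}
# -> --init_param exp/asr_pretrain/valid.acc.ave.pth
```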
27 changes: 27 additions & 0 deletions egs2/chime4/asr1/conf/train_lm_transformer.yaml
@@ -0,0 +1,27 @@
optim: adam
max_epoch: 30
batch_type: folded
batch_size: 1024 # 300 for word LMs
lm: transformer
lm_conf:
pos_enc: null
embed_unit: 128
att_unit: 512
head: 8
unit: 2048
layer: 16
dropout_rate: 0.1
val_scheduler_criterion:
- - valid
- loss
early_stopping_criterion:
- - valid
- loss
- min
best_model_criterion:
- - valid
- loss
- min
keep_nbest_models: 10
grad_clip: 5.0
grad_clip_type: 2.0
94 changes: 94 additions & 0 deletions egs2/chime4/asr1/conf/tuning/train_asr_conformer_wavlm2.yaml
@@ -0,0 +1,94 @@
# minibatch related
batch_type: numel
batch_bins: 4000000
accum_grad: 1
grad_clip: 5
max_epoch: 100
patience: none
# The initialization method for model parameters
init: xavier_uniform
val_scheduler_criterion:
- valid
- loss
best_model_criterion:
- - valid
- acc
- max
keep_nbest_models: 10
unused_parameters: true
freeze_param: [
"frontend.upstream"
]

# network architecture
frontend: s3prl
frontend_conf:
frontend_conf:
upstream: wavlm_large # Note: If the upstream is changed, please change the input_size in the preencoder.
download_dir: ./hub
multilayer_feature: True

preencoder: linear
preencoder_conf:
input_size: 1024 # Note: If the upstream is changed, please change this value accordingly.
output_size: 80

# encoder related
encoder: conformer
encoder_conf:
output_size: 256
attention_heads: 4
linear_units: 2048
num_blocks: 12
dropout_rate: 0.1
positional_dropout_rate: 0.1
attention_dropout_rate: 0.0
input_layer: conv2d2
normalize_before: true
macaron_style: true
pos_enc_layer_type: "rel_pos"
selfattention_layer_type: "rel_selfattn"
activation_type: "swish"
use_cnn_module: true
cnn_module_kernel: 15

# decoder related
decoder: transformer
decoder_conf:
input_layer: embed
attention_heads: 4
linear_units: 2048
num_blocks: 6
dropout_rate: 0.1
positional_dropout_rate: 0.1
self_attention_dropout_rate: 0.0
src_attention_dropout_rate: 0.0

model_conf:
ctc_weight: 0.3
lsm_weight: 0.1
length_normalized_loss: false
extract_feats_in_collect_stats: false

optim: adam
optim_conf:
lr: 0.001
scheduler: warmuplr
scheduler_conf:
warmup_steps: 20000

specaug: specaug
specaug_conf:
apply_time_warp: true
time_warp_window: 5
time_warp_mode: bicubic
apply_freq_mask: true
freq_mask_width_range:
- 0
- 100
num_freq_mask: 4
apply_time_mask: true
time_mask_width_range:
- 0
- 40
num_time_mask: 2
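The two "Note" comments in this config are coupled: the preencoder's input_size must equal the hidden-state dimension of the chosen S3PRL upstream (1024 for wavlm_large). A hypothetical sketch of the paired edits needed to switch to a base-sized upstream:

```yaml
frontend_conf:
  frontend_conf:
    upstream: wavlm_base_plus   # base-sized WavLM; 768-dim hidden states
preencoder_conf:
  input_size: 768               # must track the upstream's hidden size
  output_size: 80               # projection to the encoder input stays the same
```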
18 changes: 12 additions & 6 deletions egs2/chime4/asr1/local/data.sh
@@ -85,18 +85,24 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
local/real_enhan_chime4_data_prep.sh beamformit_5mics ${PWD}/enhan/beamformit_5mics
local/simu_enhan_chime4_data_prep.sh beamformit_5mics ${PWD}/enhan/beamformit_5mics

+    # prepare data for 6ch track:
+    # (1) {tr05,dt05,et05}_simu_isolated_6ch_track
+    local/simu_ext_chime4_data_prep.sh --track 6 isolated_6ch_track ${PWD}/local/nn-gev/data/audio/16kHz
+    # (2) {tr05,dt05,et05}_real_isolated_6ch_track
+    local/real_ext_chime4_data_prep.sh --track 6 isolated_6ch_track ${CHIME4}/data/audio/16kHz/isolated_6ch_track

    # Additionally use WSJ clean data. Otherwise the encoder-decoder is not well trained
local/wsj_data_prep.sh ${WSJ0}/??-{?,??}.? ${WSJ1}/??-{?,??}.?
local/wsj_format_data.sh
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
log "combine real and simulation data"
# TO DO:--extra-files but no utt2num_frames
utils/combine_data.sh data/tr05_multi_noisy data/tr05_simu_noisy data/tr05_real_noisy
utils/combine_data.sh data/tr05_multi_noisy_si284 data/tr05_multi_noisy data/train_si284
utils/combine_data.sh data/${train_dev} data/dt05_simu_isolated_1ch_track data/dt05_real_isolated_1ch_track
log "combine real and simulation data"

# TO DO:--extra-files but no utt2num_frames
utils/combine_data.sh data/tr05_multi_noisy data/tr05_simu_noisy data/tr05_real_noisy
utils/combine_data.sh data/tr05_multi_noisy_si284 data/tr05_multi_noisy data/train_si284
utils/combine_data.sh data/${train_dev} data/dt05_simu_isolated_1ch_track data/dt05_real_isolated_1ch_track
fi

other_text=data/local/other_text/text
@@ -0,0 +1,74 @@
optim: adam
init: xavier_uniform
max_epoch: 50
batch_type: folded
batch_size: 8
num_workers: 0
optim_conf:
lr: 4.0e-4
eps: 1.0e-08
weight_decay: 0
patience: 5
val_scheduler_criterion:
- valid
- loss
best_model_criterion:
- - valid
- ci_sdr
- max
- - valid
- loss
- min
keep_nbest_models: 1
scheduler: reducelronplateau
scheduler_conf:
mode: min
factor: 0.5
patience: 1
encoder: stft
encoder_conf:
n_fft: 512
win_length: 400
hop_length: 128
use_builtin_complex: False
decoder: stft
decoder_conf:
n_fft: 512
win_length: 400
hop_length: 128
separator: wpe_beamformer
separator_conf:
num_spk: 1
loss_type: spectrum
use_wpe: False
wnet_type: blstmp
wlayers: 3
wunits: 512
wprojs: 512
wdropout_rate: 0.1
taps: 3
delay: 3
use_dnn_mask_for_wpe: True
use_beamformer: True
bnet_type: blstmp
blayers: 3
bunits: 512
bprojs: 512
badim: 320
ref_channel: 4
use_noise_mask: True
beamformer_type: wpd_souden
bdropout_rate: 0.1
rtf_iterations: 5


criterions:
# The first criterion
- name: ci_sdr
conf:
filter_length: 512
# the wrapper for the current criterion
    # for the single-talker case, we simply use the fixed_order wrapper
wrapper: fixed_order
wrapper_conf:
weight: 1.0
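The comment above notes that fixed_order suffices for the single-speaker case, where references and estimates are compared in their given order. Multi-speaker recipes would typically swap in a permutation-invariant wrapper instead; a hypothetical two-speaker variant of the same criterion block (assuming ESPnet-Enh's pit wrapper) might look like:

```yaml
criterions:
  - name: ci_sdr
    conf:
      filter_length: 512
    # PIT searches over speaker permutations instead of assuming a fixed order
    wrapper: pit
    wrapper_conf:
      weight: 1.0
```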
2 changes: 1 addition & 1 deletion egs2/chime4/enh1/local/data.sh
@@ -126,7 +126,7 @@ fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
log "stage 2: Data preparation"

# preparation for original WSJ0 data:
# et05_orig_clean, dt05_orig_clean, tr05_orig_clean
wsj0_data=${CHIME4}/data/WSJ0
40 changes: 40 additions & 0 deletions egs2/chime4/enh_asr1/README.md
@@ -95,3 +95,43 @@
|---|---|---|---|
|dt05_simu_isolated_1ch_track|0.87|7.14|4.51|
|et05_simu_isolated_1ch_track|0.85|7.47|3.02|


# RESULTS
## Environments
- date: `Tue Oct 11 02:40:53 UTC 2022`
- python version: `3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]`
- espnet version: `espnet 202207`
- pytorch version: `pytorch 1.10.1+cu111`
- Git hash: `8ed83f45d5aa2ca6b3635e44b9c29afb9b5fb600`
- Commit date: `Tue Oct 11 18:59:57 2022 +0900`

## enh_asr_train_enh_asr_wpd_init_noenhloss_wavlm_conformer_raw_en_char
- Pretrained model: https://huggingface.co/Yoshiki/chime4_enh_asr1_wpd_wavlm_conformer
- This joint training requires pre-trained models for both Enh and ASR; each model must be trained in advance with its specified configuration.
- The language model is also copied from the ASR pre-training recipe.

### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|27119|98.8|0.9|0.2|0.2|1.3|16.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|27120|98.9|0.9|0.2|0.1|1.3|15.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|21409|98.4|1.4|0.2|0.2|1.8|20.6|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|21416|98.9|1.0|0.2|0.1|1.2|15.2|

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|160390|99.7|0.1|0.2|0.2|0.5|16.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|160400|99.7|0.1|0.2|0.1|0.5|15.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|126796|99.5|0.2|0.3|0.2|0.7|20.6|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|126812|99.7|0.2|0.2|0.1|0.5|15.2|

### Enhancement

|dataset|STOI|SAR|SDR|SIR|SI_SNR|
|---|---|---|---|---|---|
|enhanced_dt05_simu_isolated_6ch_track|94.48|14.95|14.95|0.00|12.43|
|enhanced_et05_simu_isolated_6ch_track|94.93|16.08|16.08|0.00|13.98|
7 changes: 7 additions & 0 deletions egs2/chime4/enh_asr1/conf/decode_asr_transformer_largelm.yaml
@@ -0,0 +1,7 @@
batch_size: 0
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.3
lm_weight: 1.2
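For context, these weights enter ESPnet's joint CTC/attention beam search with LM shallow fusion; per hypothesis the score is, roughly,

```latex
\log p(y \mid x) = (1 - \lambda)\,\log p_{\mathrm{att}}(y \mid x)
                 + \lambda\,\log p_{\mathrm{ctc}}(y \mid x)
                 + \beta\,\log p_{\mathrm{lm}}(y)
```

with lambda = ctc_weight = 0.3 and beta = lm_weight = 1.2 in this config; maxlenratio: 0.0 lets the maximum output length default to the encoder output length rather than a fixed ratio of it.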
