Update uasr #4770

Merged on Nov 16, 2022 (32 commits)
Commits
93d11a7
add scripts for multi-iris
YoshikiMas Oct 11, 2022
8ed83f4
fix commentouted line related to matlab
YoshikiMas Oct 11, 2022
b99fcf0
update README.md
YoshikiMas Oct 12, 2022
84c402f
remove enh results for real data
YoshikiMas Oct 12, 2022
a339d8a
add configurations for pre-training
YoshikiMas Oct 14, 2022
45c34ac
remove comments
YoshikiMas Oct 14, 2022
c479445
remove redundant combine_data.sh
YoshikiMas Oct 15, 2022
e45c034
remove redundant asr combine_data.sh
YoshikiMas Oct 15, 2022
d7c5c20
Merge branch 'master' into multi-iris
YoshikiMas Oct 18, 2022
7de4cc8
add OCR recipe for IAM handwriting dataset
kenzheng99 Oct 11, 2022
38ba290
fix feats_type=extracted to not require cmvn in asr.sh
kenzheng99 Oct 11, 2022
6855d56
document code and add readme
kenzheng99 Oct 11, 2022
59f0646
reformat data_prep.py
kenzheng99 Oct 14, 2022
75b0109
sort imports in data_prep.py
kenzheng99 Oct 25, 2022
7eaa043
reduce line lengths to 80
kenzheng99 Oct 28, 2022
043affb
remove trailing whitespace
kenzheng99 Oct 28, 2022
91d1dd4
Merge branch 'master' into iam-ocr-recipe
sw005320 Oct 29, 2022
bd1b363
Merge branch 'master' into multi-iris
sw005320 Oct 29, 2022
459f5b0
Merge branch 'master' into iam-ocr-recipe
sw005320 Oct 30, 2022
b91d7b5
update IAM recipe from PR feedback
kenzheng99 Oct 31, 2022
7425c63
Merge branch 'master' into iam-ocr-recipe
kenzheng99 Oct 31, 2022
6e09961
add fleurs conformer+sc-ctc results
wanchichen Nov 1, 2022
386eda0
Update README.md
wanchichen Nov 1, 2022
0f1a4a5
add link to model
wanchichen Nov 1, 2022
0b8a34a
Update train_asr.yaml
wanchichen Nov 2, 2022
3719a13
use new wav2vec url
wanchichen Nov 3, 2022
f943afa
Merge branch 'master' into iam-ocr-recipe
kenzheng99 Nov 3, 2022
2169367
add check for IAM value
kenzheng99 Nov 4, 2022
86c2642
Merge pull request #4746 from wanchichen/fleurs_icassp_results
sw005320 Nov 4, 2022
b221db0
Merge pull request #4707 from kenzheng99/iam-ocr-recipe
ftshijt Nov 6, 2022
45ae496
Add Talromur2 recipe (#4680)
G-Thor Nov 8, 2022
209ffa0
Merge pull request #4706 from YoshikiMas/multi-iris
sw005320 Nov 11, 2022
4 changes: 4 additions & 0 deletions .gitignore
@@ -83,3 +83,7 @@ tools/anaconda
tools/ice-g2p
tools/fairseq
tools/._*
+ tools/anaconda
+ tools/ice-g2p*
+ tools/fairseq*
+ tools/featbin*
4 changes: 3 additions & 1 deletion egs2/README.md
100755 → 100644
@@ -55,6 +55,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| how2_2000h | How2_2000h fbank features | ASR/SUM | ENG->POR | https://arxiv.org/pdf/2110.06263.pdf | |
| hub4_spanish | 1997 Spanish Broadcast News Speech | ASR | SPA | https://catalog.ldc.upenn.edu/LDC98S74 | |
| hui_acg | HUI-audio-corpus-german | TTS | DEU | https://opendata.iisys.de/datasets.html#hui-audio-corpus-german | |
+ | iam | IAM Handwriting Database 3.0 | OCR | ENG | https://fki.tic.heia-fr.ch/databases/iam-handwriting-database | |
| iemocap | IEMOCAP database: The Interactive Emotional Dyadic Motion Capture database | SLU | ENG | https://sail.usc.edu/iemocap/ | |
| indic_speech | IndicSpeech: Text-to-Speech Corpus for Indian Languages | TTS | 3 indic languages | http://cvit.iiit.ac.in/research/projects/cvit-projects/text-to-speech-dataset-for-indian-languages | |
| iwslt14 | IWSLT14 MT shared task | MT | DEU->ENG | http://dl.fbaipublicfiles.com/fairseq/data/iwslt14/de-en.tgz | |
@@ -123,7 +124,8 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| swbd | Switchboard Corpus for 2-channel Conversational Telephone Speech (300h) | ASR | ENG | https://catalog.ldc.upenn.edu/LDC97S62 | |
| swbd_da | NXT Switchboard Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2009T26 | |
| swbd_sentiment | Speech Sentiment Annotations | SLU | ENG | https://catalog.ldc.upenn.edu/LDC2020T14 | |
- | talromur | Talromur: A large Icelandic TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104, https://aclanthology.org/2021.nodalida-main.50.pdf | |
+ | talromur | Talromur: A large Icelandic TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/104, https://aclanthology.org/2021.nodalida-main.50.pdf | |
+ | talromur2 | Talromur 2: Icelandic multi-speaker TTS corpus | TTS | ISL | https://repository.clarin.is/repository/xmlui/handle/20.500.12537/167 | |
| tedlium2 | TED-LIUM corpus release 2 | ASR | ENG | https://www.openslr.org/19/, http://www.lrec-conf.org/proceedings/lrec2014/pdf/1104_Paper.pdf | |
| tedx_spanish_openslr67 | TEDx Spanish Corpus | ASR | SPA | https://www.openslr.org/67/ | |
| thchs30 | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University | ASR/TTS | CMN | https://www.openslr.org/18/ | |
2 changes: 1 addition & 1 deletion egs2/TEMPLATE/asr1/asr.sh
@@ -552,7 +552,7 @@ if ! "${skip_data_prep}"; then
_suf=""
fi
# Generate dummy wav.scp to avoid error by copy_data_dir.sh
- <data/"${dset}"/cmvn.scp awk ' { print($1,"<DUMMY>") }' > data/"${dset}"/wav.scp
+ <data/"${dset}"/feats.scp awk ' { print($1,"<DUMMY>") }' > data/"${dset}"/wav.scp
utils/copy_data_dir.sh --validate_opts --non-print data/"${dset}" "${data_feats}${_suf}/${dset}"

# Derive the frame length and feature dimension
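The `asr.sh` change above builds the dummy `wav.scp` from `feats.scp` instead of `cmvn.scp`, so data directories with `feats_type=extracted` no longer need a CMVN file. A minimal sketch of the same awk idiom follows; the scratch directory, utterance IDs, and ark paths are made up for illustration and are not part of the recipe.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for data/${dset}
dir=$(mktemp -d)
trap 'rm -rf "${dir}"' EXIT

# Toy feats.scp with "<utt-id> <ark-path>:<offset>" entries (made-up values)
cat > "${dir}/feats.scp" <<'EOF'
utt1 /tmp/feats.ark:5
utt2 /tmp/feats.ark:1024
EOF

# Same idiom as in the diff: keep the utterance ID, replace the value with a placeholder
<"${dir}/feats.scp" awk '{ print($1, "<DUMMY>") }' > "${dir}/wav.scp"

wav_scp=$(cat "${dir}/wav.scp")
echo "${wav_scp}"
```

Only the first column (the utterance ID) is carried over, so any per-utterance scp file that exists for extracted features would serve equally well as the source.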
2 changes: 2 additions & 0 deletions egs2/TEMPLATE/asr1/db.sh
@@ -156,8 +156,10 @@ VOXFORGE=downloads
VOXPOPULI=downloads
HARPERVALLEY=downloads
TALROMUR=downloads
+ TALROMUR2=downloads
DCASE=
TEDX_SPANISH=downloads
+ IAM=downloads
OFUTON=
OPENCPOP=
M_AILABS=downloads
15 changes: 7 additions & 8 deletions egs2/TEMPLATE/enh_asr1/enh_asr.sh
@@ -703,6 +703,12 @@ if ! "${skip_data_prep}"; then
fi
done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_train.txt"
fi
+ if [ "$lm_dev_text" = "${data_feats}/${valid_set}/text" ]; then
+ for n in $(seq ${spk_num}); do
+ awk -v spk=$n '{$1=$1 "_spk" spk; print $0}' "${data_feats}/${valid_set}/text_spk${n}"
+ done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_dev.txt"
+ lm_dev_text="${data_feats}/lm_dev.txt"
+ fi
fi
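The block moved into the data-preparation stage merges the per-speaker transcripts into one LM dev text: each utterance ID gets a `_spk<n>` suffix so IDs stay unique, and lines with no transcript (a single field) are dropped. A minimal sketch of that pipeline on toy data, assuming two speakers; the directory, IDs, and sentences are invented for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for ${data_feats}/${valid_set}
dir=$(mktemp -d)
trap 'rm -rf "${dir}"' EXIT
spk_num=2

# Toy per-speaker transcripts; "mix2" has no text for speaker 1
printf 'mix1 HELLO WORLD\nmix2\n' > "${dir}/text_spk1"
printf 'mix1 GOOD MORNING\nmix2 THANK YOU\n' > "${dir}/text_spk2"

# Append "_spk<n>" to each utterance ID, then drop lines that contain only an ID
for n in $(seq "${spk_num}"); do
    awk -v spk="${n}" '{ $1 = $1 "_spk" spk; print $0 }' "${dir}/text_spk${n}"
done | awk '{ if (NF != 1) print $0 }' > "${dir}/lm_dev.txt"

lm_dev=$(cat "${dir}/lm_dev.txt")
echo "${lm_dev}"
```

In this toy run the `mix2` entry for speaker 1 is filtered out because it has no words after the ID.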


@@ -793,13 +799,6 @@ fi

if ! "${skip_train}"; then
if "${use_lm}"; then
- if [ "$lm_dev_text" = "${data_feats}/${valid_set}/text" ]; then
- for n in $(seq ${spk_num}); do
- awk -v spk=$n '{$1=$1 "_spk" spk; print $0}' "${data_feats}/${valid_set}/text_spk${n}"
- done | awk ' { if( NF != 1 ) print $0; } ' > "${data_feats}/lm_dev.txt"
- lm_dev_text="${data_feats}/lm_dev.txt"
- fi
-
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
log "Stage 6: LM collect stats: train_set=${data_feats}/lm_train.txt, dev_set=${lm_dev_text}"

@@ -1337,7 +1336,7 @@ if ! "${skip_train}"; then
--cleaner "${cleaner}" \
--g2p "${g2p}" \
--resume true \
- --init_param ${pretrained_model} \
+ ${pretrained_model:+--init_param $pretrained_model} \
--ignore_init_mismatch ${ignore_init_mismatch} \
--output_dir "${enh_asr_exp}" \
${_opts} ${enh_asr_args}
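The `--init_param` change relies on the `${parameter:+word}` expansion, which produces `word` only when the variable is set and non-empty, so the flag is omitted entirely when no pretrained model is given. A minimal sketch of the behaviour; `show_args` is a hypothetical stand-in for the training command and the checkpoint path is made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the training command: it only echoes its arguments
show_args() {
    echo "args: $*"
}

# Empty variable: the whole "--init_param ..." fragment expands to nothing
pretrained_model=""
empty_case=$(show_args --resume true ${pretrained_model:+--init_param $pretrained_model})

# Non-empty variable: the fragment expands to the flag and its value (made-up path)
pretrained_model="exp/enh/valid.loss.best.pth"
set_case=$(show_args --resume true ${pretrained_model:+--init_param $pretrained_model})

echo "${empty_case}"
echo "${set_case}"
```

The expansion is left unquoted on purpose, as in the diff, so that the flag and its value are split into separate arguments.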
27 changes: 27 additions & 0 deletions egs2/chime4/asr1/conf/train_lm_trandformer.yaml
@@ -0,0 +1,27 @@
optim: adam
max_epoch: 30
batch_type: folded
batch_size: 1024 # 300 for word LMs
lm: transformer
lm_conf:
    pos_enc: null
    embed_unit: 128
    att_unit: 512
    head: 8
    unit: 2048
    layer: 16
    dropout_rate: 0.1
val_scheduler_criterion:
- - valid
  - loss
early_stopping_criterion:
- - valid
  - loss
  - min
best_model_criterion:
- - valid
  - loss
  - min
keep_nbest_models: 10
grad_clip: 5.0
grad_clip_type: 2.0
94 changes: 94 additions & 0 deletions egs2/chime4/asr1/conf/tuning/train_asr_conformer_wavlm2.yaml
@@ -0,0 +1,94 @@
# minibatch related
batch_type: numel
batch_bins: 4000000
accum_grad: 1
grad_clip: 5
max_epoch: 100
patience: none
# The initialization method for model parameters
init: xavier_uniform
val_scheduler_criterion:
- valid
- loss
best_model_criterion:
- - valid
  - acc
  - max
keep_nbest_models: 10
unused_parameters: true
freeze_param: [
    "frontend.upstream"
]

# network architecture
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_large # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: linear
preencoder_conf:
    input_size: 1024 # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

# encoder related
encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d2
    normalize_before: true
    macaron_style: true
    pos_enc_layer_type: "rel_pos"
    selfattention_layer_type: "rel_selfattn"
    activation_type: "swish"
    use_cnn_module: true
    cnn_module_kernel: 15

# decoder related
decoder: transformer
decoder_conf:
    input_layer: embed
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    extract_feats_in_collect_stats: false

optim: adam
optim_conf:
    lr: 0.001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 20000

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 100
    num_freq_mask: 4
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
18 changes: 12 additions & 6 deletions egs2/chime4/asr1/local/data.sh
@@ -85,18 +85,24 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
local/real_enhan_chime4_data_prep.sh beamformit_5mics ${PWD}/enhan/beamformit_5mics
local/simu_enhan_chime4_data_prep.sh beamformit_5mics ${PWD}/enhan/beamformit_5mics

# prepare data for 6ch track:
# (1) {tr05,dt05,et05}_simu_isolated_6ch_track
local/simu_ext_chime4_data_prep.sh --track 6 isolated_6ch_track ${PWD}/local/nn-gev/data/audio/16kHz
# (2) {tr05,dt05,et05}_real_isolated_6ch_track
local/real_ext_chime4_data_prep.sh --track 6 isolated_6ch_track ${CHIME4}/data/audio/16kHz/isolated_6ch_track

# Additionally use WSJ clean data. Otherwise the encoder decoder is not well trained
local/wsj_data_prep.sh ${WSJ0}/??-{?,??}.? ${WSJ1}/??-{?,??}.?
local/wsj_format_data.sh
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
log "combine real and simulation data"
# TO DO:--extra-files but no utt2num_frames
utils/combine_data.sh data/tr05_multi_noisy data/tr05_simu_noisy data/tr05_real_noisy
utils/combine_data.sh data/tr05_multi_noisy_si284 data/tr05_multi_noisy data/train_si284
utils/combine_data.sh data/${train_dev} data/dt05_simu_isolated_1ch_track data/dt05_real_isolated_1ch_track
log "combine real and simulation data"

# TO DO:--extra-files but no utt2num_frames
utils/combine_data.sh data/tr05_multi_noisy data/tr05_simu_noisy data/tr05_real_noisy
utils/combine_data.sh data/tr05_multi_noisy_si284 data/tr05_multi_noisy data/train_si284
utils/combine_data.sh data/${train_dev} data/dt05_simu_isolated_1ch_track data/dt05_real_isolated_1ch_track
fi

other_text=data/local/other_text/text
@@ -0,0 +1,74 @@
optim: adam
init: xavier_uniform
max_epoch: 50
batch_type: folded
batch_size: 8
num_workers: 0
optim_conf:
    lr: 4.0e-4
    eps: 1.0e-08
    weight_decay: 0
patience: 5
val_scheduler_criterion:
- valid
- loss
best_model_criterion:
- - valid
  - ci_sdr
  - max
- - valid
  - loss
  - min
keep_nbest_models: 1
scheduler: reducelronplateau
scheduler_conf:
    mode: min
    factor: 0.5
    patience: 1
encoder: stft
encoder_conf:
    n_fft: 512
    win_length: 400
    hop_length: 128
    use_builtin_complex: False
decoder: stft
decoder_conf:
    n_fft: 512
    win_length: 400
    hop_length: 128
separator: wpe_beamformer
separator_conf:
    num_spk: 1
    loss_type: spectrum
    use_wpe: False
    wnet_type: blstmp
    wlayers: 3
    wunits: 512
    wprojs: 512
    wdropout_rate: 0.1
    taps: 3
    delay: 3
    use_dnn_mask_for_wpe: True
    use_beamformer: True
    bnet_type: blstmp
    blayers: 3
    bunits: 512
    bprojs: 512
    badim: 320
    ref_channel: 4
    use_noise_mask: True
    beamformer_type: wpd_souden
    bdropout_rate: 0.1
    rtf_iterations: 5


criterions:
  # The first criterion
  - name: ci_sdr
    conf:
      filter_length: 512
    # the wrapper for the current criterion
    # for the single-talker case, we simply use the fixed_order wrapper
    wrapper: fixed_order
    wrapper_conf:
      weight: 1.0
2 changes: 1 addition & 1 deletion egs2/chime4/enh1/local/data.sh
@@ -126,7 +126,7 @@ fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
log "stage 2: Data preparation"

# preparation for original WSJ0 data:
# et05_orig_clean, dt05_orig_clean, tr05_orig_clean
wsj0_data=${CHIME4}/data/WSJ0
40 changes: 40 additions & 0 deletions egs2/chime4/enh_asr1/README.md
@@ -95,3 +95,43 @@
|---|---|---|---|
|dt05_simu_isolated_1ch_track|0.87|7.14|4.51|
|et05_simu_isolated_1ch_track|0.85|7.47|3.02|


# RESULTS
## Environments
- date: `Tue Oct 11 02:40:53 UTC 2022`
- python version: `3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]`
- espnet version: `espnet 202207`
- pytorch version: `pytorch 1.10.1+cu111`
- Git hash: `8ed83f45d5aa2ca6b3635e44b9c29afb9b5fb600`
- Commit date: `Tue Oct 11 18:59:57 2022 +0900`

## enh_asr_train_enh_asr_wpd_init_noenhloss_wavlm_conformer_raw_en_char
- Pretrained model: https://huggingface.co/Yoshiki/chime4_enh_asr1_wpd_wavlm_conformer
- This joint training requires pre-trained models for both Enh and ASR. Each model with the specified configuration must be trained in advance.
- The language model is also copied from the ASR pre-training recipe.

### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|27119|98.8|0.9|0.2|0.2|1.3|16.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|27120|98.9|0.9|0.2|0.1|1.3|15.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|21409|98.4|1.4|0.2|0.2|1.8|20.6|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|21416|98.9|1.0|0.2|0.1|1.2|15.2|

### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_real_isolated_6ch_track|1640|160390|99.7|0.1|0.2|0.2|0.5|16.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/dt05_simu_isolated_6ch_track|1640|160400|99.7|0.1|0.2|0.1|0.5|15.2|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_real_isolated_6ch_track|1320|126796|99.5|0.2|0.3|0.2|0.7|20.6|
|decode_asr_transformer_largelm_normalize_output_wavtrue_lm_lm_train_lm_transformer_en_char_valid.loss.ave_enh_asr_model_valid.acc.ave_10best/et05_simu_isolated_6ch_track|1320|126812|99.7|0.2|0.2|0.1|0.5|15.2|

### Enhancement

|dataset|STOI|SAR|SDR|SIR|SI_SNR|
|---|---|---|---|---|---|
|enhanced_dt05_simu_isolated_6ch_track|94.48|14.95|14.95|0.00|12.43|
|enhanced_et05_simu_isolated_6ch_track|94.93|16.08|16.08|0.00|13.98|
7 changes: 7 additions & 0 deletions egs2/chime4/enh_asr1/conf/decode_asr_transformer_largelm.yaml
@@ -0,0 +1,7 @@
batch_size: 0
beam_size: 10
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.3
lm_weight: 1.2
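For orientation, these weights are typically combined into a single beam-search score per hypothesis. The form below is a sketch that assumes the usual ESPnet2 joint CTC/attention decoding, where the attention decoder is weighted by one minus `ctc_weight` and `penalty` acts as a per-token length bonus. The symbols are introduced here for illustration and do not appear in the config.

```latex
\mathrm{score}(y \mid x) =
  (1 - \lambda_{\mathrm{ctc}}) \log p_{\mathrm{att}}(y \mid x)
  + \lambda_{\mathrm{ctc}} \log p_{\mathrm{ctc}}(y \mid x)
  + \lambda_{\mathrm{lm}} \log p_{\mathrm{lm}}(y)
  + \gamma \, |y|
```

Under that assumption, this config corresponds to $\lambda_{\mathrm{ctc}} = 0.3$ (so the attention decoder gets weight 0.7), $\lambda_{\mathrm{lm}} = 1.2$, and $\gamma = 0.0$ (no length bonus), with `beam_size: 10` hypotheses kept per step.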