Merge pull request #5104 from tjysdsg/patch_aphasiabank
Update AphasiaBank Recipe
mergify[bot] committed May 21, 2023
2 parents 1042829 + 3c43e87 commit d1074ce
Showing 26 changed files with 982 additions and 1,061 deletions.
136 changes: 50 additions & 86 deletions egs2/aphasiabank/asr1/README.md
@@ -1,99 +1,63 @@
# AphasiaBank English ASR recipe

-## Environments
+## Data preparation

-- date: `Sun Jan 8 19:23:29 EST 2023`
-- python version: `3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]`
-- espnet version: `espnet 202211`
-- pytorch version: `pytorch 1.8.1`
-- Git hash: `39c1ec0509904f16ac36d25efc971e2a94ff781f`
-- Commit date: `Wed Dec 21 12:50:18 2022 -0500`

-## asr_train_asr_ebranchformer_small_wavlm_large1

-- [train_asr_ebranchformer_small_wavlm_large1.yaml](conf/tuning/train_asr_ebranchformer_small_wavlm_large1.yaml)
-- Control group data is included
-- Downsampling rate = 2 = 2 (WavLM) * 1 (`Conv2dSubsampling1`)
-- [Hugging Face](https://huggingface.co/espnet/jiyang_tang_aphsiabank_english_asr_ebranchformer_small_wavlm_large1)

-### WER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|------|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 120684 | 77.5 | 16.4 | 6.1 | 4.2 | 26.7 | 70.8 |

-### CER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|-----|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 530731 | 87.6 | 5.4 | 6.9 | 4.7 | 17.0 | 70.8 |

-## asr_train_asr_ebranchformer_small_wavlm_large

-- [train_asr_ebranchformer_small_wavlm_large.yaml](conf/tuning/train_asr_ebranchformer_small_wavlm_large.yaml)
-- Control group data is included
-- Downsampling rate = 4 = 2 (WavLM) * 2 (`Conv2dSubsampling2`)
-- [Hugging Face](https://huggingface.co/espnet/jiyang_tang_aphsiabank_english_asr_ebranchformer_small_wavlm_large)

-### WER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|------|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 120684 | 76.6 | 16.7 | 6.7 | 3.8 | 27.1 | 72.4 |
+1. Download AphasiaBank from https://aphasia.talkbank.org
+2. See [data.sh](local/data.sh) for instructions

-### CER
+Data splits are stored in [a separate repository](https://github.com/tjysdsg/AphasiaBank_config).

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|-----|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 530731 | 87.1 | 5.3 | 7.6 | 4.9 | 17.7 | 72.4 |
+## Experiments

-## asr_train_asr_ebranchformer_small_raw_en_char_sp
+- Use `run.sh` for baselines and tag-based experiments.
+- Use `run_interctc.sh` for InterCTC-based experiments (it also supports a combination of
+  tag- and InterCTC-based experiments).

-- [train_asr_ebranchformer_small.yaml](conf/tuning/train_asr_ebranchformer_small.yaml)
-- Control group data is included
+Parameters:

-### WER
+- `--include_control`: set to true to include control group data
+- `--tag_insertion`: `prepend`, `append`, `both`, or `none`.
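As a rough illustration of what the `--tag_insertion` options do, the sketch below attaches an Aphasia tag to a transcript. The tag strings `[APH]`/`[NONAPH]` and the function name are assumptions for illustration, not taken from the recipe scripts:

```python
def insert_tag(transcript: str, tag: str, mode: str) -> str:
    """Attach an Aphasia tag to a transcript (hypothetical helper).

    mode corresponds to --tag_insertion: 'prepend', 'append', 'both', or 'none'.
    """
    if mode == "prepend":
        return f"{tag} {transcript}"
    if mode == "append":
        return f"{transcript} {tag}"
    if mode == "both":
        return f"{tag} {transcript} {tag}"
    if mode == "none":
        return transcript
    raise ValueError(f"unknown tag_insertion mode: {mode}")


if __name__ == "__main__":
    # hypothetical patient utterance tagged as Aphasic on both sides
    print(insert_tag("i want uh water", "[APH]", "both"))
```

With `both`, the model predicts the tag twice per utterance, which is why the results table reports separate "Front" and "Back" detection accuracies.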

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|------|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 120684 | 69.7 | 22.7 | 7.6 | 4.5 | 34.9 | 77.5 |
+## Evaluation

-### CER
+- Use `run.sh --nlsyms_txt none --stage 13` to score your model.
+  It is important to set `--nlsyms_txt none` to avoid removing the Aphasia tags,
+  which are used by the scripts below.
+- [local/score_cleaned.sh](local/score_cleaned.sh) calculates CER/WER per Aphasia subset.
+  It does not require the input hypothesis file to contain language or Aphasia tags,
+  but if the input does contain them, they are removed automatically.
+- [local/score_per_severity.sh](local/score_per_severity.sh) is similar, but calculates
+  CER/WER per Aphasia severity.
+- [local/score_interctc_aux.sh](local/score_interctc_aux.sh) calculates
+  InterCTC-based Aphasia detection accuracy.
+- [local/score_aphasia_detection.py](local/score_aphasia_detection.py) calculates Aphasia
+  detection accuracy from input in Kaldi text format.
+- Calculate MACs and FLOPs of the encoder
+  using [this script](https://github.com/pyf98/espnet_utils/blob/master/profile.sh)
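The sentence- and speaker-level detection accuracies reported below can be sketched roughly as follows. The tag labels, the `speaker-utterance` ID convention, and the majority vote are assumptions for illustration; the actual logic lives in `local/score_aphasia_detection.py`:

```python
from collections import Counter, defaultdict


def detection_accuracy(hyp: dict, ref: dict):
    """Illustrative sentence- and speaker-level Aphasia detection accuracy.

    hyp/ref map Kaldi-style utterance IDs ("speaker-utt") to a predicted/true
    label such as "[APH]" or "[NONAPH]" (hypothetical tag strings).
    """
    # sentence level: fraction of utterances with the correct tag
    sent_acc = sum(hyp[u] == ref[u] for u in ref) / len(ref)

    # speaker level: majority vote over each speaker's utterance predictions
    by_spk = defaultdict(list)
    for utt, label in hyp.items():
        by_spk[utt.split("-")[0]].append(label)
    spk_hyp = {s: Counter(v).most_common(1)[0][0] for s, v in by_spk.items()}
    # each speaker's reference label is shared by all of their utterances
    spk_ref = {u.split("-")[0]: label for u, label in ref.items()}
    spk_acc = sum(spk_hyp[s] == spk_ref[s] for s in spk_ref) / len(spk_ref)
    return sent_acc, spk_acc
```

Speaker-level accuracy is typically higher than sentence-level accuracy because a few mistagged utterances are outvoted by the rest of the speaker's utterances.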

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|-----|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 530731 | 82.8 | 8.0 | 9.2 | 5.1 | 22.3 | 77.5 |
+## RESULTS (WER)

-## asr_train_asr_conformer_hubert_ll60k_large_raw_en_char_sp
+**Environments**

-- [train_asr_conformer_hubert_ll60k_large.yaml](conf/tuning/train_asr_conformer_hubert_ll60k_large.yaml)
-- Control group data is included

-### WER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|------|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 120684 | 68.9 | 22.8 | 8.3 | 4.4 | 35.5 | 81.5 |

-### CER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|-----|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 530731 | 82.1 | 8.0 | 9.9 | 5.3 | 23.3 | 81.5 |

-## asr_train_asr_conformer_raw_en_char_sp

-- [train_asr_conformer.yaml](conf/tuning/train_asr_conformer.yaml)
-- Control group data is included

-### WER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|------|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 120684 | 68.1 | 23.6 | 8.3 | 4.5 | 36.4 | 79.9 |

-### CER

-| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
-|-------------------------------------|-------|--------|------|-----|-----|-----|------|-------|
-| decode_asr_model_valid.acc.ave/test | 16380 | 530731 | 81.7 | 8.4 | 9.9 | 5.2 | 23.5 | 79.9 |
+- date: `Fri Apr 28 00:16:04 EDT 2023`
+- python version: `3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]`
+- espnet version: `espnet 202301`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.8.1`
+- Git hash: `5b2f8f351712aac98aa9f8370f1d2aea1ecff685`
+- Commit date: `Fri Apr 28 00:10:24 2023 -0400`

+| Model | Patient | Control | Overall | Sentence-Level Detection Accuracy | Speaker-Level Detection Accuracy |
+|-------|---------|---------|---------|-----------------------------------|----------------------------------|
+| [Conformer](conf/tuning/train_asr_conformer.yaml) | 40.3 | 35.3 | 38.1 | | |
+| [E-Branchformer(EBF)](conf/tuning/train_asr_ebranchformer_small.yaml) | 36.2 | 31.2 | 34.0 | | |
+| [EBF+WavLM](conf/tuning/train_asr_ebranchformer_small_wavlm_large1.yaml) ([HuggingFace](https://huggingface.co/espnet/jiyang_tang_aphsiabank_english_asr_ebranchformer_small_wavlm_large1)) | 26.3 | 16.9 | 22.2 | | |
+| [EBF+WavLM+Tag-prepend](conf/tuning/train_asr_ebranchformer_small_wavlm_large1.yaml) | 26.3 | 16.9 | 22.2 | 89.3 | 95.1 |
+| [EBF+WavLM+Tag-append](conf/tuning/train_asr_ebranchformer_small_wavlm_large1.yaml) | 26.2 | 16.9 | 22.1 | 89.2 | 95.1 |
+| [EBF+WavLM+Tag-both](conf/tuning/train_asr_ebranchformer_small_wavlm_large1.yaml) ([HuggingFace](https://huggingface.co/espnet/jiyang_tang_aphsiabank_english_asr_ebranchformer_wavlm_aph_en_both)) | 26.3 | 16.8 | 22.1 | Front: 90.8, Back: 90.6 | Front: 95.7, Back: 95.7 |
+| [EBF+WavLM+InterCTC6](conf/tuning/train_asr_ebranchformer_small_wavlm_large1_interctc6.yaml) ([HuggingFace](https://huggingface.co/espnet/jiyang_tang_aphsiabank_english_asr_ebranchformer_wavlm_interctc6)) | 26.3 | 16.9 | 22.1 | 85.2 | 97.3 |
+| [EBF+WavLM+InterCTC3+6](conf/tuning/train_asr_ebranchformer_small_wavlm_large1_interctc3+6.yaml) | 26.5 | 17.1 | 22.3 | 83.5 | 96.7 |
+| EBF+WavLM+InterCTC9 (set `interctc_layer_idx` and `aux_ctc` to 9 in the InterCTC6 config) | 26.3 | 16.9 | 22.2 | 84.5 | 97.3 |
+| [EBF+WavLM+InterCTC6+Tag-prepend](conf/tuning/train_asr_ebranchformer_small_wavlm_large1_interctc6.yaml) | 26.3 | 16.9 | 22.1 | Tag: 89.7, InterCTC: 89.6 | Tag: 96.7, InterCTC: 96.7 |
2 changes: 1 addition & 1 deletion egs2/aphasiabank/asr1/conf/train_asr.yaml
10 changes: 6 additions & 4 deletions egs2/aphasiabank/asr1/conf/tuning/train_asr_conformer.yaml
@@ -1,8 +1,10 @@
+# Based on librispeech_100 conformer config, but layer sizes are adjusted to match
+# E-Branchformer's number of trainable parameters.
encoder: conformer
encoder_conf:
    output_size: 256
    attention_heads: 4
-    linear_units: 1024
+    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
@@ -43,10 +45,10 @@ num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
-batch_bins: 6000000
-accum_grad: 8
+batch_bins: 4000000
+accum_grad: 12
grad_clip: 5
-max_epoch: 40
+max_epoch: 50
patience: none
init: none
best_model_criterion:
@@ -51,8 +51,8 @@ num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
-batch_bins: 5000000
-accum_grad: 9
+batch_bins: 4000000
+accum_grad: 12
grad_clip: 5
max_epoch: 40
patience: none
@@ -1,3 +1,5 @@
+# Uses 2 V100 (32GB)
+
# https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wavlm_large.yaml
unused_parameters: true
freeze_param: [
@@ -61,12 +63,12 @@ decoder_conf:
seed: 2022
log_interval: 200
num_att_plot: 0
-num_workers: 4
+num_workers: 2
sort_in_batch: descending
sort_batch: descending
batch_type: numel
-batch_bins: 3000000
-accum_grad: 16
+batch_bins: 6000000
+accum_grad: 8
grad_clip: 5
max_epoch: 30
patience: none
@@ -0,0 +1,116 @@
# Uses 2 V100 (32GB)

# https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_conformer7_wavlm_large.yaml
unused_parameters: true
freeze_param: [
    "frontend.upstream"
]
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: wavlm_large  # Note: If the upstream is changed, please change the input_size in the preencoder.
    download_dir: ./hub
    multilayer_feature: True

preencoder: linear
preencoder_conf:
    input_size: 1024  # Note: If the upstream is changed, please change this value accordingly.
    output_size: 80

model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
    extract_feats_in_collect_stats: false  # Note: "False" means during collect stats (stage 10), generating dummy stats files rather than extract_feats by forward frontend.

    # interctc aux task
    interctc_weight: 0.3
    aux_ctc:
        '3': utt2aph
        '6': utt2aph
aux_ctc_tasks: [ "utt2aph" ]

# Based on https://github.com/espnet/espnet/blob/master/egs2/librispeech/asr1/conf/tuning/train_asr_e_branchformer.yaml
# The encoder is smaller as we keep the size roughly the same as the conformer and transformer experiments
encoder: e_branchformer
encoder_conf:
    output_size: 256
    attention_heads: 4
    linear_units: 1024
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    layer_drop_rate: 0.1
    input_layer: conv2d1  # subsampling rate = 2 (WavLM) * 1 (conv2d1)
    macaron_ffn: true
    pos_enc_layer_type: rel_pos
    attention_layer_type: rel_selfattn
    rel_pos_type: latest
    cgmlp_linear_units: 3072
    cgmlp_conv_kernel: 31
    use_linear_after_conv: false
    gate_activation: identity
    positionwise_layer_type: linear
    use_ffn: true
    merge_conv_kernel: 31
    interctc_layer_idx: [ 3, 6 ]
    interctc_use_conditioning: true

decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.1
    src_attention_dropout_rate: 0.1
    layer_drop_rate: 0.2

seed: 2022
log_interval: 200
num_att_plot: 0
num_workers: 4
sort_in_batch: descending
sort_batch: descending
batch_type: numel
batch_bins: 4000000
accum_grad: 12
grad_clip: 5
max_epoch: 30
patience: none
init: none
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10

use_amp: true
cudnn_deterministic: false
cudnn_benchmark: false

optim: adam
optim_conf:
    lr: 0.001
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 2500

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5
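The `interctc_weight` and `aux_ctc` entries above attach auxiliary CTC heads (targeting the `utt2aph` labels) at encoder layers 3 and 6. A sketch of how such weights are commonly combined in hybrid CTC/attention training with InterCTC follows; the exact formula should be checked against the `espnet2` model code, so treat this as an illustration rather than the implementation:

```python
def combined_loss(loss_att, loss_ctc_final, loss_interctc_avg,
                  ctc_weight=0.3, interctc_weight=0.3):
    """Illustrative loss combination for hybrid CTC/attention with InterCTC.

    loss_interctc_avg is the average CTC loss over the intermediate layers
    (here layers 3 and 6, supervised by the utt2aph Aphasia labels).
    """
    # intermediate CTC losses are folded into the CTC branch
    loss_ctc = (1 - interctc_weight) * loss_ctc_final + interctc_weight * loss_interctc_avg
    # standard hybrid CTC/attention interpolation
    return ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att
```

With the config values above, roughly 30% of the CTC branch supervises the intermediate Aphasia-tag predictions, leaving the rest of the objective on the ASR task.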
