Libri100 recipe for standalone Transducer #4698

Open · wants to merge 19 commits into `master` (changes shown from 7 commits)
95 changes: 95 additions & 0 deletions egs2/librispeech_100/asr_transducer1/README.md
@@ -0,0 +1,95 @@
# Streaming Conformer-RNN Transducer
# asr_train_conformer-rnn_transducer_streaming_raw_en_bpe500_sp

- General information
  - Pretrained model: N/A
  - Training config: conf/train_conformer-rnn_transducer_streaming.yaml
  - Decoding config: conf/decode.yaml (or conf/decode_streaming.yaml)
  - GPU: NVIDIA A100 40GB
  - CPU: AMD EPYC 7502P 32c
  - Peak VRAM usage during training: 36.7GB
  - Training time: ~26 hours
  - Decoding time (32 jobs, 1 thread): ~9.1 minutes (full context)

- Environments
  - date: `Fri Oct 7 12:02:29 UTC 2022`
  - python version: `3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]`
  - espnet version: `espnet 202209`
  - pytorch version: `pytorch 1.8.1+cu111`
  - Git hash: `2db74a9587a32b659cf4e1abb6b611d9f9551e09`
  - Commit date: `Thu Oct 6 15:01:23 2022 +0000`

## WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|54402|94.3|5.2|0.5|0.7|6.4|56.9|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|50948|83.4|14.8|1.8|1.9|18.5|82.1|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|52576|93.8|5.6|0.7|0.8|7.0|58.9|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|52343|82.9|15.0|2.0|1.8|18.9|83.5|

## CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|288456|98.2|1.0|0.8|0.6|2.4|56.9|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|265951|93.1|4.1|2.9|1.9|8.9|82.1|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|281530|98.0|1.1|0.9|0.6|2.6|58.9|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|272758|93.0|4.0|3.0|1.8|8.9|83.5|

## TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|107929|95.0|3.6|1.4|0.6|5.5|56.9|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|98610|84.7|11.6|3.6|2.2|17.4|82.1|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|105724|94.7|3.7|1.6|0.6|6.0|58.9|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|101026|84.3|11.6|4.1|2.0|17.7|83.5|

# Conformer-RNN Transducer
# asr_train_conformer-rnn_transducer_raw_en_bpe500_sp

- General information
  - Pretrained model: N/A
  - Training config: conf/train_conformer-rnn_transducer.yaml
  - Decoding config: conf/decode.yaml
  - GPU: NVIDIA A100 40GB
  - CPU: AMD EPYC 7502P 32c
  - Peak VRAM usage during training: 36.4GB
  - Training time: ~26 hours
  - Decoding time (32 jobs, 1 thread): ~9 minutes

- Environments
  - date: `Fri Oct 7 12:02:29 UTC 2022`
  - python version: `3.8.10 (default, Jun 22 2022, 20:18:18) [GCC 9.4.0]`
  - espnet version: `espnet 202209`
  - pytorch version: `pytorch 1.8.1+cu111`
  - Git hash: `2db74a9587a32b659cf4e1abb6b611d9f9551e09`
  - Commit date: `Thu Oct 6 15:01:23 2022 +0000`

## WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|54402|94.7|4.8|0.4|0.6|5.9|55.1|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|50948|84.2|14.1|1.7|1.8|17.6|80.2|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|52576|94.3|5.2|0.6|0.7|6.4|56.8|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|52343|83.9|14.2|1.9|1.8|17.9|81.5|

## CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|288456|98.4|0.9|0.7|0.6|2.2|55.1|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|265951|93.4|3.9|2.7|1.9|8.5|80.2|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|281530|98.2|1.0|0.8|0.6|2.3|56.8|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|272758|93.5|3.8|2.7|1.8|8.3|81.5|

## TER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_asr_model_valid.loss.ave_10best/dev_clean|2703|107929|95.4|3.4|1.2|0.6|5.2|55.1|
|decode_asr_model_valid.loss.ave_10best/dev_other|2864|98610|85.5|11.0|3.5|2.1|16.6|80.2|
|decode_asr_model_valid.loss.ave_10best/test_clean|2620|105724|95.1|3.4|1.4|0.6|5.5|56.8|
|decode_asr_model_valid.loss.ave_10best/test_other|2939|101026|85.3|10.9|3.9|2.0|16.7|81.5|
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/asr.sh
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/cmd.sh
14 changes: 14 additions & 0 deletions egs2/librispeech_100/asr_transducer1/conf/decode.yaml
@@ -0,0 +1,14 @@
beam_size: 5 # 10 produces slightly better results.
beam_search_config:
  search_type: default

  # ALSD (search_type: alsd)
  u_max: 50

  # TSD (search_type: tsd)
  max_sym_exp: 2

  # mAES (search_type: maes)
  nstep: 1
  expansion_gamma: 1.5
  expansion_beta: 1
9 changes: 9 additions & 0 deletions egs2/librispeech_100/asr_transducer1/conf/decode_streaming.yaml
@@ -0,0 +1,9 @@
beam_size: 5 # 10 produces slightly better results.
beam_search_config:
  search_type: maes
  nstep: 1
  expansion_gamma: 2.3
  expansion_beta: 2
streaming: True
chunk_size: 64
left_context: 256
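The `chunk_size`/`left_context` pair above bounds how far back attention may look during streaming decoding. A rough sketch of the visible window per encoder frame (hypothetical helper, not ESPnet code; assumes both values are counted in encoder frames and that a frame always sees its whole current chunk):

```python
def attended_range(frame_idx: int, chunk_size: int = 64, left_context: int = 256):
    """Return the (start, end) frame window visible to `frame_idx`, end-exclusive.

    A frame sees its whole current chunk plus at most `left_context`
    frames of history before that chunk.
    """
    chunk_id = frame_idx // chunk_size
    start = max(0, chunk_id * chunk_size - left_context)
    end = (chunk_id + 1) * chunk_size
    return start, end

print(attended_range(0))    # first chunk only
print(attended_range(300))  # own chunk (frames 256-319) plus 256 history frames
```

With `left_context: 256` the attention cache stays bounded regardless of utterance length, which is what makes the memory cost of streaming decoding constant per chunk.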
78 changes: 78 additions & 0 deletions egs2/librispeech_100/asr_transducer1/conf/train_conformer-rnn_transducer.yaml
@@ -0,0 +1,78 @@
# general
batch_type: numel
batch_bins: 4000000
accum_grad: 8
max_epoch: 60 # 100 produces better results.
patience: none
init: none
num_att_plot: 0

# optimizer
optim: adam
optim_conf:
  lr: 0.002
  weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 15000

# criterion
val_scheduler_criterion:
  - valid
  - loss
best_model_criterion:
  - - valid
    - loss
    - min
keep_nbest_models: 10 # 20 produces slightly better results.

model_conf:
  transducer_weight: 1.0
  auxiliary_ctc_weight: 0.3
  report_cer: True
  report_wer: True

# specaug conf
specaug: specaug
specaug_conf:
  apply_time_warp: true
  time_warp_window: 5
  time_warp_mode: bicubic
  apply_freq_mask: true
  freq_mask_width_range:
    - 0
    - 27
  num_freq_mask: 2
  apply_time_mask: true
  time_mask_width_ratio_range:
    - 0.
    - 0.05
  num_time_mask: 5

encoder_conf:
  main_conf:
    pos_wise_act_type: swish
    conv_mod_act_type: swish
    pos_enc_dropout_rate: 0.2
  input_conf:
    vgg_like: True
  body_conf:
    - block_type: conformer
      linear_size: 1024
      hidden_size: 256
      heads: 4
      dropout_rate: 0.1
      pos_wise_dropout_rate: 0.1
      att_dropout_rate: 0.1
      conv_mod_kernel_size: 31
      num_blocks: 18
decoder: rnn
decoder_conf:
  rnn_type: lstm
  num_layers: 1
  embed_size: 256
  hidden_size: 256
  dropout_rate: 0.1
  embed_dropout_rate: 0.2
joint_network_conf:
  joint_space_size: 256
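The `warmuplr` scheduler named in the config is, assuming ESPnet's usual Noam-style `WarmupLR`, a ramp-then-decay schedule shaped by the configured `lr` and `warmup_steps`; a minimal sketch of that assumed rule:

```python
def warmup_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 15000) -> float:
    """Noam-style warmup: linear ramp for warmup_steps, then ~1/sqrt(step) decay."""
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate climbs linearly to exactly base_lr at step == warmup_steps,
# then falls off as 1/sqrt(step); both neighbors are lower than the peak.
rates = [warmup_lr(s) for s in (1500, 15000, 150000)]
```

Under this reading, `lr: 0.002` is the peak rate reached at step 15000, not the rate applied from step one, which is why a fairly large value is safe with `accum_grad: 8`.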
82 changes: 82 additions & 0 deletions egs2/librispeech_100/asr_transducer1/conf/train_conformer-rnn_transducer_streaming.yaml
@@ -0,0 +1,82 @@
# general
batch_type: numel
batch_bins: 4000000
accum_grad: 8
max_epoch: 60 # 100 produces better results.
patience: none
init: none
num_att_plot: 0

# optimizer
optim: adam
optim_conf:
  lr: 0.002
  weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 15000

# criterion
val_scheduler_criterion:
  - valid
  - loss
best_model_criterion:
  - - valid
    - loss
    - min
keep_nbest_models: 10 # 20 produces slightly better results.

model_conf:
  transducer_weight: 1.0
  auxiliary_ctc_weight: 0.3
  report_cer: True
  report_wer: True

# specaug conf
specaug: specaug
specaug_conf:
  apply_time_warp: true
  time_warp_window: 5
  time_warp_mode: bicubic
  apply_freq_mask: true
  freq_mask_width_range:
    - 0
    - 27
  num_freq_mask: 2
  apply_time_mask: true
  time_mask_width_ratio_range:
    - 0.
    - 0.05
  num_time_mask: 5

encoder_conf:
  main_conf:
    pos_wise_act_type: swish
    conv_mod_act_type: swish
    pos_enc_dropout_rate: 0.2
    dynamic_chunk_training: True
    short_chunk_size: 25
    short_chunk_threshold: 0.75
    left_chunk_size: 4
  input_conf:
    vgg_like: True
  body_conf:
    - block_type: conformer
      linear_size: 1024
      hidden_size: 256
      heads: 4
      dropout_rate: 0.1
      pos_wise_dropout_rate: 0.1
      att_dropout_rate: 0.1
      conv_mod_kernel_size: 31
      num_blocks: 18
decoder: rnn
decoder_conf:
  rnn_type: lstm
  num_layers: 1
  embed_size: 256
  hidden_size: 256
  dropout_rate: 0.1
  embed_dropout_rate: 0.2
joint_network_conf:
  joint_space_size: 256
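The `dynamic_chunk_training` keys above follow a WeNet-style scheme: each training step draws a random chunk size so a single model can later decode both in streaming and full-context mode. A sketch of the sampling rule suggested by `short_chunk_size`/`short_chunk_threshold` (my reading of the keys, not the actual ESPnet implementation):

```python
import random
from typing import Optional

def sample_chunk_size(num_frames: int,
                      short_chunk_size: int = 25,
                      short_chunk_threshold: float = 0.75,
                      rng: Optional[random.Random] = None) -> int:
    """With probability short_chunk_threshold, draw a short chunk size in
    [1, short_chunk_size]; otherwise treat the whole utterance as one chunk."""
    rng = rng or random
    if rng.random() < short_chunk_threshold:
        return rng.randint(1, short_chunk_size)
    return num_frames  # full-context step

rng = random.Random(0)
sizes = [sample_chunk_size(1000, rng=rng) for _ in range(100)]
```

Under these settings roughly three-quarters of the steps see short chunks (good for streaming robustness) and the remainder see full context, which is consistent with the README reporting results for both decoding modes from one streaming checkpoint.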
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/db.sh
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/local
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/path.sh
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/pyscripts
37 changes: 37 additions & 0 deletions egs2/librispeech_100/asr_transducer1/run.sh
@@ -0,0 +1,37 @@
#!/usr/bin/env bash

set -e
set -u
set -o pipefail

train_set="train_clean_100"
valid_set="dev"
test_sets="test_clean test_other dev_clean dev_other"

asr_config=conf/train_conformer-rnn_transducer.yaml
inference_config=conf/decode.yaml
inference_model=valid.loss.ave_10best.pth

./asr.sh \
    --asr_task asr_transducer \
    --skip_data_prep false \
    --skip_train false \
    --skip_eval false \
    --lang en \
    --ngpu 1 \
    --nj 32 \
    --inference_nj 32 \
    --nbpe 500 \
    --max_wav_duration 30 \
    --speed_perturb_factors "0.9 1.0 1.1" \
    --audio_format "flac.ark" \
    --feats_type raw \
    --use_lm false \
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --inference_asr_model "${inference_model}" \
    --train_set "${train_set}" \
    --valid_set "${valid_set}" \
    --test_sets "${test_sets}" \
    --lm_train_text "data/${train_set}/text" \
    --bpe_train_text "data/${train_set}/text" "$@"

> Review comment (b-flo, on `--nbpe 500`): BPE size is 500 vs 5000 for the CTC-Att baseline model. I guess we can further improve results with a bigger BPE size, but I don't have enough resources for that.
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/scripts
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/steps
1 change: 1 addition & 0 deletions egs2/librispeech_100/asr_transducer1/utils
2 changes: 1 addition & 1 deletion espnet2/asr_transducer/encoder/blocks/conformer.py
@@ -123,7 +123,7 @@ def forward(
         residual = x

         x = self.norm_conv(x)
-        x, _ = self.conv_mod(x)
+        x, _ = self.conv_mod(x, mask=mask)
         x = residual + self.dropout(x)

         residual = x
4 changes: 4 additions & 0 deletions espnet2/asr_transducer/encoder/modules/convolution.py
@@ -70,6 +70,7 @@ def forward(
         self,
         x: torch.Tensor,
         cache: Optional[torch.Tensor] = None,
+        mask: Optional[torch.Tensor] = None,
         right_context: int = 0,
     ) -> Tuple[torch.Tensor, torch.Tensor]:
         """Compute convolution module.
@@ -87,6 +88,9 @@
         x = self.pointwise_conv1(x.transpose(1, 2))
         x = torch.nn.functional.glu(x, dim=1)

+        if mask is not None:
+            x.masked_fill_(mask.unsqueeze(1).expand_as(x), 0.0)
+
         if self.lorder > 0:
             if cache is None:
                 x = torch.nn.functional.pad(x, (self.lorder, 0), "constant", 0.0)
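The masking added in this diff can be exercised with standalone tensors. A minimal sketch (plain PyTorch, not the module itself) showing that padded time steps are zeroed after the GLU so they cannot bleed into the causal convolution; note the in-place `masked_fill_` — the out-of-place `masked_fill` returns a new tensor and does nothing if its result is discarded:

```python
import torch

# x: (batch, channels, time) activations after the GLU.
# mask: (batch, time), True at padded frames (the convention implied by the diff).
x = torch.ones(2, 4, 6)
mask = torch.zeros(2, 6, dtype=torch.bool)
mask[0, 4:] = True  # utterance 0: last 2 frames are padding
mask[1, 5:] = True  # utterance 1: last frame is padding

# Broadcast the (batch, time) mask over channels and zero padded positions.
x.masked_fill_(mask.unsqueeze(1).expand_as(x), 0.0)

print(x[0, :, 4:].sum().item())   # padded region now all zeros
print(x[0, :, :4].sum().item())   # real frames untouched: 4 channels x 4 frames
```

Zeroing padded frames here matters because the depthwise convolution that follows mixes neighboring time steps; without the mask, garbage in the padding would leak into valid frames near utterance boundaries.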