Espnet2 transducer v2 #4032

Closed
wants to merge 65 commits into from

Conversation

b-flo
Member

@b-flo b-flo commented Feb 4, 2022

Hi,

This PR is a draft of the new version of the Transducer models in ESPnet2, separated from the main ASR task (CTC+Att). It works, but please note that:

  • This is not the final version, it's only to open discussion. If needed, I have some alternatives / other versions.
  • Some parts or features have been removed compared to the previous version. They can easily be added back, but I would like to add them one by one, with careful testing or feedback.
  • This draft may contain minor issues and typos, useless code, etc. Feel free to point out any weird/wrong parts.

Performance should be on par or better than previously. I also found out what caused the performance degradation for the Voxforge model (mainly due to initialization, and some small training differences). It may be worth extending the investigation though!
@jeon30c Would it be possible for you to re-train a Librispeech model with this version to compare performance, please?
@sw005320 Do you know if we have other models to compare? I'm not sure who already used the first version.

Also, once we have settled on the task and model definition, I would like to at least make the encoder and decoder fully customizable (similar to the custom model in ESPnet1). The main changes would be:

  • Add a unified Encoder containing a PreEncoder (bottlenecks/input blocks) + a BodyEncoder (supporting main nets/blocks + some bridge blocks)
  • Same for Decoder.
  • Refactor the BeamSearch / Scorer part to reflect changes and optimize for ESPnet2.
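As a rough illustration of the proposed PreEncoder + BodyEncoder split, here is a minimal sketch in plain Python. It is not code from this PR: the class name `UnifiedEncoder`, the two stand-in blocks, and the use of plain lists instead of tensors are all assumptions made only to show the composition idea.

```python
from typing import Callable, List, Sequence

# A "block" is any callable that maps a feature sequence to a feature sequence.
Block = Callable[[List[float]], List[float]]


class UnifiedEncoder:
    """Hypothetical container: input blocks (PreEncoder) followed by main blocks (BodyEncoder)."""

    def __init__(self, pre_blocks: Sequence[Block], body_blocks: Sequence[Block]) -> None:
        self.pre_blocks = list(pre_blocks)
        self.body_blocks = list(body_blocks)

    def __call__(self, feats: List[float]) -> List[float]:
        # Apply the input/bottleneck blocks first, then the main blocks, in order.
        for block in self.pre_blocks + self.body_blocks:
            feats = block(feats)
        return feats


def subsample_by_two(feats: List[float]) -> List[float]:
    """Stand-in for a subsampling input block: keep every second frame."""
    return feats[::2]


def scale(feats: List[float]) -> List[float]:
    """Stand-in for a main block (e.g. a Conformer or RNN layer)."""
    return [2.0 * x for x in feats]


encoder = UnifiedEncoder(pre_blocks=[subsample_by_two], body_blocks=[scale, scale])
output = encoder([1.0, 2.0, 3.0, 4.0])
print(output)
```

The same container shape would apply to the decoder; a configuration file could then describe each block list independently.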

After that:

  • Add tests
  • Add documentation

@mergify
Contributor

mergify bot commented Feb 4, 2022

This pull request is now in conflict :(

@b-flo b-flo added this to the v.0.10.6 milestone Feb 4, 2022
@mergify mergify bot removed the conflicts label Feb 4, 2022
@b-flo b-flo marked this pull request as draft February 4, 2022 13:47
@b-flo b-flo added the RNNT (RNN) transducer related issue label Feb 4, 2022
@jeon30c

jeon30c commented Feb 4, 2022

@b-flo Great! I will train a model using this code and post the results later.

@codecov

codecov bot commented Feb 4, 2022

Codecov Report

Merging #4032 (0157e81) into master (4323c52) will increase coverage by 0.68%.
The diff coverage is 98.77%.

@@            Coverage Diff             @@
##           master    #4032      +/-   ##
==========================================
+ Coverage   81.61%   82.30%   +0.68%     
==========================================
  Files         458      478      +20     
  Lines       39894    41536    +1642     
==========================================
+ Hits        32561    34185    +1624     
- Misses       7333     7351      +18     
| Flag | Coverage Δ |
|---|---|
| test_integration_espnet1 | 67.13% <ø> (ø) |
| test_integration_espnet2 | 51.02% <64.89%> (+0.71%) ⬆️ |
| test_python | 69.28% <96.67%> (+1.12%) ⬆️ |
| test_utils | 24.45% <ø> (ø) |


| Impacted Files | Coverage Δ |
|---|---|
| espnet2/asr/decoder/transducer_decoder.py | 100.00% <ø> (ø) |
| espnet2/bin/asr_transducer_inference.py | 92.93% <92.93%> (ø) |
| espnet2/asr_transducer/espnet_transducer_model.py | 97.82% <97.82%> (ø) |
| espnet2/asr_transducer/beam_search_transducer.py | 99.13% <99.13%> (ø) |
| espnet2/asr/espnet_model.py | 81.44% <100.00%> (ø) |
| espnet2/asr/transducer/beam_search_transducer.py | 98.74% <100.00%> (+0.94%) ⬆️ |
| espnet2/asr_transducer/activation.py | 100.00% <100.00%> (ø) |
| espnet2/asr_transducer/decoder/abs_decoder.py | 100.00% <100.00%> (ø) |
| espnet2/asr_transducer/decoder/rnn_decoder.py | 100.00% <100.00%> (ø) |
| ...spnet2/asr_transducer/decoder/stateless_decoder.py | 100.00% <100.00%> (ø) |

... and 30 more
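
The headline percentages in the report follow directly from the Hits and Lines rows. The short check below recomputes them; the only inputs are the numbers shown in the coverage diff, and the function name `coverage` is just a label chosen for this sketch.

```python
def coverage(hits: int, lines: int) -> float:
    """Line coverage in percent."""
    return 100.0 * hits / lines


# Values copied from the coverage diff above.
master = coverage(hits=32561, lines=39894)
pr = coverage(hits=34185, lines=41536)
delta = pr - master

# Hits plus misses should add up to the total number of lines.
master_consistent = 32561 + 7333 == 39894
pr_consistent = 34185 + 7351 == 41536

print(f"master: {master:.4f}%  PR: {pr:.4f}%  delta: {delta:+.4f}")
print(master_consistent, pr_consistent)
```

Note that the unrounded master figure is slightly above 81.615, so the 81.61% in the report is consistent with truncating to two decimals rather than rounding.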


@b-flo
Member Author

b-flo commented Feb 4, 2022

Thank you very much! By the way, the following should be changed compared to your previous training:

  • Remove init: none from the training config
  • Add --asr_transducer true to the asr.sh parameters in run.sh

Apart from that (and bugs), everything should work as intended!

@kan-bayashi kan-bayashi modified the milestones: v.0.10.6, v.0.10.7 Feb 8, 2022
@b-flo
Member Author

b-flo commented Feb 10, 2022

Hi @sw005320, can you take a look, please? If you're OK with the design and @jeon30c can reproduce or improve the results, I'll add the next items.

@jeon30c

jeon30c commented Feb 11, 2022

@b-flo
[Image: screenshot_loss]
I observed that the loss is not decreasing as well as it did with the previous version.
Since the basic functionality is the same, I expected the curve to be almost the same as in the previous training:
[Image: screenshot_loss_prev]

@b-flo
Member Author

b-flo commented Feb 11, 2022

> I observed that the loss is not decreasing as well as it did with the previous version.
> Since the basic functionality is the same, I expected the curve to be almost the same as in the previous training ...

The implementation is equivalent except for the model initialization, which relies on the ESPnet1 one here. If you comment out line 435 in espnet2/tasks/asr_transducer.py, it should be equivalent to your previous run. Could you test that as well, please? By the way, what are the CER/WER with this model?

Edit

For information, I observed the following on Voxforge:

without initialization

CER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_default_asr_model_valid.loss.best/dt_it | 1035 | 75494 | 87.1 | 6.3 | 6.7 | 3.0 | 15.9 | 98.8 |
| decode_default_asr_model_valid.loss.best/et_it | 1103 | 81228 | 88.4 | 5.8 | 5.7 | 3.0 | 14.5 | 97.6 |

WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_default_asr_model_valid.loss.best/dt_it | 1035 | 12587 | 55.0 | 36.3 | 8.7 | 4.4 | 49.4 | 98.8 |
| decode_default_asr_model_valid.loss.best/et_it | 1103 | 13699 | 58.7 | 34.2 | 7.1 | 4.7 | 46.0 | 97.6 |

with ESPnet1 initialization

CER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_default_asr_model_valid.loss.best/dt_it | 1035 | 75494 | 90.1 | 5.0 | 4.9 | 2.6 | 12.4 | 97.3 |
| decode_default_asr_model_valid.loss.best/et_it | 1103 | 81228 | 90.8 | 4.8 | 4.3 | 2.6 | 11.8 | 95.2 |

WER

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_default_asr_model_valid.loss.best/dt_it | 1035 | 12587 | 62.0 | 31.3 | 6.7 | 4.4 | 42.3 | 97.3 |
| decode_default_asr_model_valid.loss.best/et_it | 1103 | 13699 | 63.3 | 30.5 | 6.2 | 4.5 | 41.2 | 95.2 |

Performance with ESPnet2 init (chainer or xavier_uniform) is slightly worse than without initialization.
I also did the comparison with another custom dataset (~80h) and ended up with comparable results in each case.

However, I'm a bit confused by the difference in terms of loss here. In my experiments, the losses are in the same range despite performance variation.
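
To put the Voxforge numbers above in perspective, the sketch below recomputes two things from the tables: that Err is the sum of Sub, Del and Ins (up to per-column rounding), and the relative error reduction when moving from no initialization to the ESPnet1 initialization. The dictionary layout and helper name are choices made for this example only; the figures are copied from the tables.

```python
# (Sub, Del, Ins, Err) in percent, copied from the Voxforge tables above.
RESULTS = {
    "CER dt_it": {"no_init": (6.3, 6.7, 3.0, 15.9), "espnet1_init": (5.0, 4.9, 2.6, 12.4)},
    "CER et_it": {"no_init": (5.8, 5.7, 3.0, 14.5), "espnet1_init": (4.8, 4.3, 2.6, 11.8)},
    "WER dt_it": {"no_init": (36.3, 8.7, 4.4, 49.4), "espnet1_init": (31.3, 6.7, 4.4, 42.3)},
    "WER et_it": {"no_init": (34.2, 7.1, 4.7, 46.0), "espnet1_init": (30.5, 6.2, 4.5, 41.2)},
}


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


# Largest gap between Sub + Del + Ins and the reported Err column.
# Each column is rounded separately, so a gap of about 0.1 is expected.
max_gap = max(
    abs(sub + dele + ins - err)
    for runs in RESULTS.values()
    for sub, dele, ins, err in runs.values()
)

reductions = {
    name: relative_reduction(runs["no_init"][3], runs["espnet1_init"][3])
    for name, runs in RESULTS.items()
}

print(f"max rounding gap: {max_gap:.2f}")
for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative reduction")
```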

@mergify
Contributor

mergify bot commented Feb 14, 2022

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label Feb 14, 2022
@jeon30c

jeon30c commented Feb 14, 2022

@b-flo
[Image: Screenshot 2022-02-14, 4:51:44 PM]
This is very interesting. As you suggested, I commented out the init part and retrained a model. As the figure above shows, training now behaves very similarly to the previous run. The init function seems to have a large effect on performance.

Once training has finished, I will post the final WER results.

@mergify mergify bot removed the conflicts label Feb 14, 2022
@b-flo
Member Author

b-flo commented Feb 14, 2022

Initialization is one of the parts I want to address alongside training techniques. I'll work on that after adding custom architectures; for now, I'll add back the init option with support for "espnet1 initialization".
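
One way to see why the choice of initialization can matter this much is to compare the weight scales that common schemes produce for the same layer. The sketch below uses only the standard library. The first two formulas are the PyTorch default for `nn.Linear` and `xavier_uniform_`; the third, a fan-in-scaled normal, is an assumption used here as a stand-in, since this thread does not spell out what the ESPnet1 initialization does. The layer sizes are arbitrary.

```python
import math


def default_linear_bound(fan_in: int) -> float:
    """PyTorch default for nn.Linear weights: U(-b, b) with b = 1 / sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


def xavier_uniform_bound(fan_in: int, fan_out: int, gain: float = 1.0) -> float:
    """Xavier/Glorot uniform: U(-b, b) with b = gain * sqrt(6 / (fan_in + fan_out))."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def uniform_std(bound: float) -> float:
    """Standard deviation of U(-bound, bound)."""
    return bound / math.sqrt(3.0)


def fan_in_normal_std(fan_in: int) -> float:
    """Assumed stand-in scheme: N(0, 1 / fan_in), i.e. std = 1 / sqrt(fan_in)."""
    return 1.0 / math.sqrt(fan_in)


# Arbitrary example layer: 256 inputs, 1024 outputs.
fan_in, fan_out = 256, 1024
default_bound = default_linear_bound(fan_in)
xavier_bound = xavier_uniform_bound(fan_in, fan_out)
fan_in_std = fan_in_normal_std(fan_in)

print(f"default uniform   std = {uniform_std(default_bound):.4f}")
print(f"xavier uniform    std = {uniform_std(xavier_bound):.4f}")
print(f"fan-in normal     std = {fan_in_std:.4f}")
```

Under these assumptions the three schemes give noticeably different starting scales for the same layer, which is consistent with initialization alone changing the loss curve.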

@jeon30c

jeon30c commented Feb 16, 2022

Without init, the performance is on par with ESPnet2 transducer v1. WERs are 3.1 and 7.2 on test-clean and test-other, respectively.

@b-flo
Member Author

b-flo commented Apr 26, 2022

I added back the old version; everything should be the same as before, except for an optional parameter of JointNetwork that I had to rename.

I kept error_calculator.py and beam_search_transducer.py in transducer/ for now; let me know what I should do with them based on my previous message.

Edit: I'm not sure what's going on with codecov. It seems the tests related to asr/transducer/error_calculator.py and asr/transducer/beam_search_transducer.py are not counted. Does it have to do with my renaming of the test file methods to avoid name clashes with the tests for asr_transducer?
Never mind, I'm stupid.

@sw005320
Contributor

@b-flo, I'm asking @pyf98 to review this PR, but he is busy these days.
So, please wait for a while.

While we wait for his review, I'll list a couple of high-level comments.

  • Use our naming conventions for functions and variables: please use the same or similar names as much as possible.
  • Minimize the configuration changes from the current ones (this is similar to the comment above, but it is particularly important).
  • Minimize duplication of the core network model code (e.g., conformer, lstm, transformer, etc.).

@b-flo
Member Author

b-flo commented Apr 28, 2022

> While we wait for his review, I'll list a couple of high-level comments.
> ....

We're talking about v1, right? If so, I reverted it to what it was before this PR, no worries! There are no changes apart from some moved files and a JointNetwork(...) that I'm sharing between versions (I had to rename one variable, vocab_size -> dim_vocab, because of that).

@sw005320
Contributor

No, I'm talking about this PR.
The following is a good example:

[Image]

encoder_output_size is used throughout the ASR code.
Please use the same name here.

@chintu619
Contributor

Sorry in advance for the lengthy comments below.

Last week I started training the librispeech recipe with this PR. I used the auxiliary CTC loss and small weight decay from your Librispeech-100 training config. The model training went well, without any issues. I uploaded the pretrained model, training images etc. here. Currently I just uploaded this model to my personal HF hub. After the PR is merged, I can upload it to HF/espnet hub.

Note: the pretrained model that I uploaded for transducer v1 as part of #4327 is actually from the above model training. Basically, I copied the pretrained model weights from v2 to v1. I ensured that both models were making identical computations, which explains why the dev/test scores are also identical. However, this process was a bit challenging because the names of the encoder layers changed from v1 to v2. To maintain consistency with other ASR configs, would it be possible to retain the same layer names as in v1? This would facilitate the use of other pretrained models via: --init_param model.pth:encoder:encoder.

I noticed that the bias of the linear decoder layer in the JointNetwork is set to False in v2 here. So when I copied the weights from v2 to v1, I set this bias to zero. When this PR is merged, I will update the pretrained model in #4327 by disabling this bias.
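
The layer-name mismatch described above means that copying weights between versions needs a key-renaming step. The following is a minimal sketch of such a step on a plain dictionary standing in for a checkpoint. The helper `rename_prefixes` and all key names in it are hypothetical; the real v1 and v2 layer names are not given in this thread.

```python
from typing import Any, Dict


def rename_prefixes(state: Dict[str, Any], prefix_map: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of `state` in which the first matching key prefix is replaced."""
    renamed: Dict[str, Any] = {}
    for key, value in state.items():
        new_key = key
        for old, new in prefix_map.items():
            if key.startswith(old):
                new_key = new + key[len(old):]
                break
        if new_key in renamed:
            # Refuse to silently overwrite a parameter.
            raise ValueError(f"two source keys map to {new_key!r}")
        renamed[new_key] = value
    return renamed


# Hypothetical layer names and toy values, for illustration only.
v2_state = {
    "encoder.body.0.weight": [0.1, 0.2],
    "encoder.body.0.bias": [0.0],
    "joint_network.lin_out.weight": [0.3],
}
v1_state = rename_prefixes(v2_state, {"encoder.body.": "encoder.encoders."})
print(sorted(v1_state))
```

If the layer names were kept identical across versions, as requested, this step would be unnecessary and --init_param model.pth:encoder:encoder could map the encoder directly.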

@b-flo
Member Author

b-flo commented Apr 29, 2022

> I noticed that the bias of the linear decoder layer in the JointNetwork is set to False in v2 here.

Yes, that's intended! I don't recall the full discussion but we came to the conclusion with Hirofumi and Mingkun (warp-transducer author) that the bias in the decoder linear projection was redundant information. However, it shouldn't make a difference in practice.

Edit: Oh, I didn't set it to False in the first version of ESPnet2, got it. Either is fine for me!
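
The redundancy argument can be checked numerically. In the sketch below, a joint network is assumed to add a linear projection of the encoder output to a linear projection of the decoder output and apply tanh; the output projection is omitted, and all weights are small made-up values. Because the two projections are summed, the two biases only ever appear as their sum, so one of them can be folded into the other, and a zero decoder bias is equivalent to having none.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, x: Vector, bias: Optional[Vector] = None) -> Vector:
    """Plain matrix-vector product with an optional bias."""
    out = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def joint_hidden(
    h_enc: Vector,
    h_dec: Vector,
    w_enc: Matrix,
    b_enc: Vector,
    w_dec: Matrix,
    b_dec: Optional[Vector] = None,
) -> Vector:
    """Assumed joint computation: tanh(lin_enc(h_enc) + lin_dec(h_dec))."""
    enc = linear(w_enc, h_enc, b_enc)
    dec = linear(w_dec, h_dec, b_dec)
    return [math.tanh(e + d) for e, d in zip(enc, dec)]


# Made-up toy parameters and inputs.
w_enc = [[0.5, -1.0], [0.25, 0.75]]
w_dec = [[1.0, 0.0], [-0.5, 0.5]]
b_enc = [0.1, -0.2]
b_dec = [0.3, 0.4]
h_enc = [1.0, 2.0]
h_dec = [-1.0, 0.5]

# Two biases versus a single bias holding their sum.
with_two_biases = joint_hidden(h_enc, h_dec, w_enc, b_enc, w_dec, b_dec)
folded_bias = [e + d for e, d in zip(b_enc, b_dec)]
with_one_bias = joint_hidden(h_enc, h_dec, w_enc, folded_bias, w_dec, None)

# A zero decoder bias (as used when copying v2 weights into v1) versus no bias.
zero_bias = joint_hidden(h_enc, h_dec, w_enc, b_enc, w_dec, [0.0, 0.0])
no_bias = joint_hidden(h_enc, h_dec, w_enc, b_enc, w_dec, None)

print(with_two_biases, with_one_bias)
print(zero_bias, no_bias)
```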

@mergify
Contributor

mergify bot commented May 18, 2022

This pull request is now in conflict :(

@mergify mergify bot added the conflicts label May 18, 2022
@kan-bayashi kan-bayashi modified the milestones: v.202205, v.202206 May 26, 2022
Resolved review threads: doc/espnet2_tutorial.md (4), ci/test_integration_espnet2.sh (1)
@b-flo
Member Author

b-flo commented Jun 6, 2022

@csukuangfj Thanks a lot for taking the time to review! I did not announce it, but this PR is somewhat dead in its current form. I'm currently reworking this version with streaming and deployment in mind. Some parts of this PR will remain, but others may be removed or heavily changed on my side.

Btw, I took Icefall as a reference for the new version (i.e.: for streaming + some Conformer tricks). Could I kindly ask you or another Icefall member to help review the new version when it's available?

@csukuangfj

> Could I kindly ask you or another Icefall member to help review the new version when it's available?

Yes, we are glad to. Is there any PR about your new version?

@b-flo
Member Author

b-flo commented Jun 6, 2022

> Is there any PR about your new version?

There is none for now. Most of the work for v1 is done and seems to work on my side (v2 extends it to other *-former architectures), but I have not finished testing and debugging. I should open a PR this week or next; I'll ping you then, if that's okay!

@b-flo b-flo closed this Jun 28, 2022
@b-flo
Member Author

b-flo commented Jun 28, 2022

I'm closing this PR and opening a new one. Sorry about the delay; other things and experiments took priority.

Labels
CI (Travis, Circle CI, etc), conflicts, Documentation, ESPnet2, RNNT ((RNN) transducer related issue)
8 participants