Espnet2 transducer v2 #4032
Conversation
This pull request is now in conflict :(
@b-flo Great! I will work on training a model using the code. I will post the results later.
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #4032      +/-   ##
==========================================
+ Coverage   81.61%   82.30%    +0.68%
==========================================
  Files         458      478       +20
  Lines       39894    41536     +1642
==========================================
+ Hits        32561    34185     +1624
- Misses       7333     7351       +18
```
Thank you very much! Btw, the following should be changed in comparison to your previous training:
Outside of that (and bugs), everything should work as intended!
@b-flo
The implementation is equivalent except for the model initialization, which relies on the ESPnet1 one here. If you comment out l.435 in espnet2/tasks/asr_transducer.py, it should be equivalent to your previous run. Could you also test, please? Btw, what about CER/WER with this model?

Edit: For information, on Voxforge I observed the following:

- without initialization: CER / WER
- with ESPnet1 initialization: CER / WER

Performance with ESPnet2 init (…). However, I'm a bit confused by the difference in terms of loss here. In my experiments, the losses are in the same range despite performance variation.
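I'm not certain which scheme l.435 actually applies, but the ESPnet1-style initialization discussed here is in the fan-in-scaled (LeCun) family. A generic sketch in plain Python, purely as an illustration of the idea (not the actual ESPnet code):

```python
import math
import random

def lecun_uniform(fan_out, fan_in, seed=0):
    """Fan-in-scaled uniform init: W_ij ~ U(-b, b) with b = sqrt(3 / fan_in),
    which keeps the variance of each weight around 1 / fan_in."""
    rng = random.Random(seed)
    bound = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]

# Example: a 4 x 256 projection weight.
W = lecun_uniform(4, 256)
bound = math.sqrt(3.0 / 256)
assert all(-bound <= w <= bound for row in W for w in row)
```

Disabling such an init (as suggested above) leaves the framework's default initialization in place, which alone can explain CER/WER gaps of this kind.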
This pull request is now in conflict :(
@b-flo After the training is completely finished, I will post the final WER results.
Initialization is one of the parts I want to address, alongside training techniques. I'll work on that after adding custom architectures; for now, I'll add back the
Without init, the performance is on par with ESPnet2 Transducer v1. For test-clean and test-other, the WERs are 3.1 and 7.2, respectively.
I added back the old version; everything should be the same as before except for an optional parameter of …. I kept …
@b-flo, I'm asking @pyf98 to review this PR, but he is busy these days. While we wait for his review, I'll list a couple of high-level comments.
We're talking about v1, right? If so, I reverted it to what it was before this PR, no worries! No changes outside some files moved and a
Sorry in advance for the lengthy comments below.

Last week I started training the Librispeech recipe with this PR. I used the auxiliary CTC loss and small weight decay from your Librispeech-100 training config. The model training went well, without any issues. I uploaded the pretrained model, training images, etc. here. Currently I just uploaded this model to my personal HF hub; after the PR is merged, I can upload it to the HF/espnet hub.

Note: the pretrained model that I uploaded for transducer v1 as part of #4327 is actually from the above model training. Basically, I copied the pretrained model weights from v2 to v1. I ensured that both models were making identical computations, which explains why the dev/test scores are also identical. However, this process was a bit challenging because of the changes in the names of encoder layers from v1 to v2.

To maintain consistency with other ASR configs, would it be possible to retain the same layer names as v1? This would facilitate the use of other pretrained models using:

I noticed that the
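The weight-copying step described above can be scripted as a key remap over the checkpoint's parameter dict. The rename rules below are hypothetical examples, not the real v1/v2 layer names:

```python
import re

def remap_state_dict(state_dict, rules):
    """Rename checkpoint keys by applying each (pattern, replacement)
    regex rule in order; parameter values are carried over untouched."""
    remapped = {}
    for key, value in state_dict.items():
        for pattern, repl in rules:
            key = re.sub(pattern, repl, key)
        remapped[key] = value
    return remapped

# Hypothetical v2 -> v1 renames (illustrative only).
rules = [
    (r"^encoder\.encoders\.", "encoder.layers."),
    (r"\.self_attn\.", ".attention."),
]

v2_ckpt = {
    "encoder.encoders.0.self_attn.linear_q.weight": "W_q",
    "decoder.embed.weight": "E",
}
v1_ckpt = remap_state_dict(v2_ckpt, rules)
assert "encoder.layers.0.attention.linear_q.weight" in v1_ckpt
```

With real checkpoints the values would be tensors and the remapped dict would be passed to the target model's `load_state_dict`; keeping the layer names identical across versions would make this step unnecessary.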
Yes, that's intended! I don't recall the full discussion but we came to the conclusion with Hirofumi and Mingkun (warp-transducer author) that the bias in the decoder linear projection was redundant information. However, it shouldn't make a difference in practice.

Edit: Oh, I didn't set it to False in the first version of ESPnet2, got it. Either is fine for me!
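The redundancy is easy to see: the joint network sums the encoder and decoder linear projections, so two bias vectors collapse into a single one. A tiny numeric check in plain Python (the matrices and vectors are illustrative values, not model weights):

```python
def affine(x, W, b):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Toy joint pre-activation: z = W_enc @ h + b_enc + W_dec @ u + b_dec.
h, u = [0.5, -1.0], [2.0, 0.25]
W_enc, W_dec = [[1.0, 2.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]
b_enc, b_dec = [0.125, 0.25], [0.375, -0.5]

# Biases on both projections ...
with_both = [e + d for e, d in zip(affine(h, W_enc, b_enc),
                                   affine(u, W_dec, b_dec))]
# ... behave exactly like one merged bias on the encoder side,
# with bias=False on the decoder projection.
b_merged = [be + bd for be, bd in zip(b_enc, b_dec)]
no_dec_bias = [e + d for e, d in zip(affine(h, W_enc, b_merged),
                                     affine(u, W_dec, [0.0, 0.0]))]
assert with_both == no_dec_bias
```

This is why dropping the decoder bias shouldn't change what the model can express, only the parameter count.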
This pull request is now in conflict :(
@csukuangfj Thanks a lot for taking the time to review!! I did not mention it, but this PR is somewhat dead in its current form. I'm currently re-working this version with streaming and deployment in mind. Some parts from this PR remain but others may be removed or heavily changed on my side. Btw, I took Icefall as a reference for the new version (i.e.: for streaming + some Conformer tricks). Could I kindly ask you or another Icefall member to help review the new version when it's available?
Yes, we are glad to. Is there any PR about your new version?
There is none for now. Most of the stuff for v1 is done and seems to work on my side (v2 is extending to other *-former architectures), but I did not finish testing and debugging. I should open a PR this week or the week after; I'll ping you then if that's okay!
I'm closing this PR and opening a new one. Sorry about the delay; other things and experiments took priority.
Hi,
This PR is a draft for the new version of Transducer models in ESPnet2, separated from the main ASR task (CTC+Att). It's working, but please note that:
Performance should be on par with or better than previously. I also found out what caused the performance degradation for the Voxforge model (mainly due to initialization and some small training differences). It may be worth extending the investigation, though!
@jeon30c Would it be possible for you to re-train a Librispeech model with this version to compare performance, please?
@sw005320 Do you know if we have other models to compare? I'm not sure who already used the first version.
Also, after we are set on the task and model definition, I would like to at least make the encoder and decoder fully customizable (similar to the custom model in ESPnet1). Mainly, the changes would be:
After that:
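As a rough sketch of what a fully customizable encoder/decoder definition could look like in a training config (hypothetical keys, loosely modeled on the ESPnet1 custom-architecture style; not a committed interface):

```yaml
# Hypothetical per-block encoder definition (illustrative only).
encoder_conf:
  main_conf:
    pos_enc_type: rel_pos
  body_conf:
    - block_type: conv2d        # subsampling front-end
      output_size: 256
    - block_type: conformer     # repeated main body
      hidden_size: 256
      linear_size: 1024
      heads: 4
      num_blocks: 12
decoder_conf:
  block_type: rnn
  hidden_size: 512
  num_layers: 1
```

The point of such a layout is that each block carries its own type and hyperparameters, so mixing block families (conv, Conformer, RNN, etc.) only requires editing the config, not the task code.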