Offline/Online (standalone) ESPnet2 Transducer #4479
Conversation
Thanks for the great PR! I didn't look into the algorithm itself, but I made a few comments. Overall, I think it is already well organized. I especially like the flexible design of the encoder, which supports different hyper-parameters for different blocks (if my understanding is correct) instead of sharing the same config across all encoder blocks.
Thanks a lot @pyf98 and @pengchengguo
Your understanding is correct! You can also mix blocks if you want (well, in the next PRs)! I'll add some ensemble methods and revisit auxiliary losses with intermediate representations for that.
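To make the per-block idea concrete, such a configuration can be thought of as a list of block definitions, each carrying its own hyper-parameters. This is only a hypothetical sketch; the key names are illustrative and not the actual ESPnet2 schema:

```python
# Hypothetical per-block encoder configuration: each entry describes one block
# with its own hyper-parameters instead of a single config shared by all blocks.
# Key names are illustrative, not the actual ESPnet2 schema.
encoder_blocks = [
    {"block_type": "conv1d", "output_size": 256, "kernel_size": 3},
    {"block_type": "conformer", "hidden_size": 256, "linear_size": 1024,
     "heads": 4, "conv_kernel_size": 31},
    {"block_type": "conformer", "hidden_size": 512, "linear_size": 2048,
     "heads": 8, "conv_kernel_size": 15},
]
```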
Just FYI, one of the changes I made in our Conformer was to remove the normalization from the individual modules inside the conformer layer. I only expect this to work well if you are using the ScaledLinear/ScaledConv1d modules, which learn a scaling factor for each weight and bias. Otherwise it has no way to learn the appropriate scale on each sub-module except for scaling the whole weight matrix, which is difficult for SGD to learn. BasicNorm would not be expected to be a good solution for the normalization for the individual modules, because it does not support an overall scale on the output. Also, I am working on (still tuning) an optimization method that will learn the parameter scales as part of the optimizer, without requiring the individual scales for weights and biases, so I expect to eventually remove the ScaledLinear and ScaledConv1d (in newer directories), but the recipe will depend on properties of the optimizer.
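For readers unfamiliar with these modules, here is a minimal sketch of a linear layer with learnable log-scales on its weight and bias, assuming PyTorch. It is only illustrative and may differ from icefall's actual ScaledLinear/ScaledConv1d implementations:

```python
import math

import torch
import torch.nn as nn


class ScaledLinear(nn.Linear):
    """Linear layer whose weight and bias carry learnable log-scales.

    Illustrative sketch only; icefall's actual ScaledLinear may differ.
    """

    def __init__(self, in_features, out_features, bias=True, initial_scale=1.0):
        super().__init__(in_features, out_features, bias=bias)
        # A learnable log-scale per parameter tensor, so the optimizer can adjust
        # the overall magnitude of the weights/bias without having to rescale
        # every individual element.
        self.weight_scale = nn.Parameter(torch.tensor(math.log(initial_scale)))
        if bias:
            self.bias_scale = nn.Parameter(torch.tensor(math.log(initial_scale)))
        else:
            self.register_parameter("bias_scale", None)

    def forward(self, x):
        weight = self.weight * self.weight_scale.exp()
        bias = None if self.bias is None else self.bias * self.bias_scale.exp()
        return nn.functional.linear(x, weight, bias)
```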
Thanks a lot for the explanation @danpovey!! I'm reworking the normalization module definition and will add some warnings and explanations to the class doc. I'm also testing
Thanks for the update, I'll keep an eye on the development!
I have got two questions.
For LibriSpeech, I'm increasing the non-streaming model size to 120M parameters and extending training to 60 epochs.
Does the model need to be so large, and does it need to be trained for so many epochs? In icefall, we use a model with about 80M parameters and train it for 30 epochs on the LibriSpeech dataset.
It does! If it doesn't, that's a bug on my part.
Sorry, not yet. I need to update
Good question. That being said:
Small update:
Thanks for the info! I'm just trying to match the original Conformer-Transducer (Large) config and see how it performs. This would be an interesting investigation. I don't know whether 30 epochs is sufficient for the Transducer, but according to previous experiments it is not enough for our joint CTC/Attention.
This pull request is now in conflict :(
After discussion, I'm merging this PR! It was a long road for this one; thanks to everyone for your help 🎉 Now onto the next items!
Hi,
This PR is a re-do of #4032 with streaming capabilities based on the WeNet chunk-by-chunk approach and on Icefall implementations.
The custom encoder architecture was kept here but limited to conv1d and conformer blocks. The idea is to support other *-former architectures (branchformer, enformer, k2-conformer, longformer, etc.) as blocks, making it possible to build a custom X-former architecture for offline and streaming ASR. I have already implemented and tested most of them, but they will be added in upcoming PRs.
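As a rough illustration of the WeNet-style chunk-by-chunk idea, a streaming encoder typically restricts self-attention with a chunk mask so that each frame only attends to frames within its own chunk plus a limited left context. A minimal sketch, assuming PyTorch; this is not the exact ESPnet or WeNet implementation:

```python
import torch


def chunk_attention_mask(num_frames: int, chunk_size: int, left_chunks: int = -1) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask: True where attention is allowed.

    Each frame attends to the frames in its own chunk and, optionally, to a
    limited number of previous chunks (left_chunks < 0 means unlimited left
    context). Sketch only; real implementations differ in details.
    """
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        start = 0 if left_chunks < 0 else max((chunk_idx - left_chunks) * chunk_size, 0)
        mask[i, start:end] = True
    return mask


# Example: 8 frames, chunks of 4 frames, one chunk of left context.
print(chunk_attention_mask(8, 4, left_chunks=1))
```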
Regarding the reviews and change requests in #4032:
NOTE: Everything should work, but this PR is a rebase of the previous PR with stitched-together elements from different work branches, so it may contain bugs or mistakes. Feel free to correct or point out any suspicious parts!
TO DO:
@csukuangfj I would be glad if you or other Icefall members could take a look at the PR!
Also, it would be great if you could point out any missing references to your work/implementation! Because we've gone full circle on some parts (ESPnet -> WeNet -> Icefall -> ESPnet ...), I'm a bit confused about the proper references...