
Offline/Online (standalone) ESPnet2 Transducer #4479

Merged: 125 commits into espnet:master from streaming_transducer_v2 on Aug 17, 2022

Conversation

@b-flo (Member) commented Jun 29, 2022

Hi,

This PR is a re-do of #4032, with streaming capabilities based on the WeNet chunk-by-chunk approach and Icefall implementations.
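For context, the chunk-by-chunk idea boils down to restricting self-attention so that each frame can only see its own chunk plus a limited number of past chunks. A minimal sketch of such a mask, loosely following WeNet's `subsequent_chunk_mask` (illustrative only, not this PR's actual code):

```python
import torch


def chunk_attention_mask(
    size: int, chunk_size: int, num_left_chunks: int = -1
) -> torch.Tensor:
    """Build a chunk-wise attention mask (True = attendable).

    Each frame attends to frames in its own chunk and, optionally, to a
    limited number of previous chunks (-1 means unlimited left context).
    """
    mask = torch.zeros(size, size, dtype=torch.bool)
    for i in range(size):
        start = 0
        if num_left_chunks >= 0:
            start = max((i // chunk_size - num_left_chunks) * chunk_size, 0)
        end = min((i // chunk_size + 1) * chunk_size, size)
        mask[i, start:end] = True
    return mask


# e.g., 8 frames, chunks of 2 frames, 1 left chunk of context
print(chunk_attention_mask(8, 2, num_left_chunks=1))
```

In the WeNet approach, the same masking is applied at training time, which is what lets a single model run both full-context (offline) and chunked (streaming) inference.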

The custom encoder architecture was kept here but limited to conv1d and conformer blocks. The idea is to support other *-former architectures (branchformer, enformer, k2-conformer, longformer, etc.) as blocks, making it possible to compose a custom X-former architecture for offline and streaming ASR. I've already implemented and tested most of them, but they'll be added in later PRs.
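To make the block-wise design concrete, here is a hypothetical sketch of what a per-block configuration could look like (the keys are illustrative, not the actual ESPnet2 schema):

```python
# Each entry configures one encoder block independently, rather than
# sharing a single config across the whole encoder. Keys are made up
# for illustration and do not match the real ESPnet2 parameters.
body_conf = [
    {"block_type": "conv1d", "output_size": 256, "kernel_size": 3},
    {
        "block_type": "conformer",
        "hidden_size": 256,
        "linear_size": 1024,
        "heads": 4,
        "conv_mod_kernel_size": 31,
    },
    # Later PRs could mix in other X-former blocks here (branchformer, ...).
]
```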

In regards to the reviews in #4032 and change requests:

  • Naming: There may be minor differences, but it should be consistent with other models now!
  • Duplication: We should discuss each duplicated module individually in this version. Most of them were duplicated in preparation for future additions, but a few may just be for my convenience and can be removed/merged.

NOTE: Everything should work, but this PR is a rebase of the previous PR with stitched-together elements from different work branches, so it may contain bugs or mistakes. Please feel free to correct or point out any suspicious parts!

TO DO:

  • Add tests for online ASR (dummy ones for this PR)
  • Refine docs
  • Add missing references

@csukuangfj I would be glad if you or other Icefall members could take a look at the PR!
Also, if you could point out any missing references to your work/implementation, that would be great! Because we've gone full circle (ESPnet -> WeNet -> Icefall -> ESPnet ...) on some parts, I'm a bit confused about the proper references...

@pyf98 (Collaborator) commented Aug 2, 2022

Thanks for the great PR! I didn't look into the algorithm itself, but I made a few comments about the doc and init just now.

I think it is already well organized. I especially like the flexible design of the encoder, which supports different hyper-parameters for different blocks (if my understanding is correct) instead of sharing the same config across all encoder blocks.

@b-flo (Member, Author) commented Aug 2, 2022

Thanks a lot @pyf98 and @pengchengguo!

> I especially like the flexible design of the encoder which supports different hyper-parameters for different blocks (if my understanding is correct) instead of sharing the same config across all encoder blocks.

Your understanding is correct! You can also mix blocks if you want (well, in upcoming PRs)! I'll add some ensemble methods and revisit auxiliary losses with intermediate representations for that.

@danpovey commented Aug 2, 2022

> I included support for the simplified attention score computation and BasicNorm from K2 in the default Conformer implementation. I won't add the other modules from their reworked model here but I'll support a k2Conformer in later PRs, alongside other X-former.
>
> @csukuangfj I referenced the pull requests here because PEP8 won't allow longer links in docstrings (referencing commit, file or method won't work). Feel free to propose changes if there are better ways!

Just FYI, one of the changes I made in our Conformer was to remove the normalization from the individual modules inside the conformer layer. I only expect this to work well if you are using the ScaledLinear/ScaledConv1d modules, which learn a scaling factor for each weight and bias. Otherwise it has no way to learn the appropriate scale on each sub-module except for scaling the whole weight matrix, which is difficult for SGD to learn. BasicNorm would not be expected to be a good solution for the normalization for the individual modules, because it does not support an overall scale on the output.
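For readers following along, here is a minimal sketch of the two ideas mentioned above, as I understand them from the K2/Icefall code. These are concept sketches only; the actual Icefall modules differ in initialization and other details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledLinear(nn.Linear):
    """Linear layer with learned log-scales on its weight and bias.

    Sketch of the concept only, not Icefall's actual implementation.
    """

    def __init__(self, *args, initial_scale: float = 1.0, **kwargs):
        super().__init__(*args, **kwargs)
        init = torch.tensor(initial_scale).log()
        self.weight_scale = nn.Parameter(init.clone())
        if self.bias is not None:
            self.bias_scale = nn.Parameter(init.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight * self.weight_scale.exp()
        bias = self.bias * self.bias_scale.exp() if self.bias is not None else None
        return F.linear(x, weight, bias)


class BasicNorm(nn.Module):
    """RMS-style normalization with a learnable epsilon but, notably,
    no learnable output gain (hence the reliance on scaled modules)."""

    def __init__(self, eps: float = 0.25):
        super().__init__()
        # eps is stored in log space and learned jointly with the model.
        self.eps = nn.Parameter(torch.tensor(eps).log())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scales = (x.pow(2).mean(dim=-1, keepdim=True) + self.eps.exp()) ** -0.5
        return x * scales
```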

Also, I am working on (still tuning) an optimization method that will learn the parameter scales as part of the optimizer, without requiring the individual scales for weights and biases, so I expect to eventually remove the ScaledLinear and ScaledConv1d (in newer directories), but the recipe will depend on properties of the optimizer.

@b-flo (Member, Author) commented Aug 3, 2022

Thanks a lot for the explanation @danpovey !!
To be honest, I didn't expect BasicNorm to be a good replacement candidate, for the reason you gave, and it was removed/added back multiple times. However, because it was found appropriate (i.e., same performance at a lower cost) in some setups, I decided to keep it in the end.

I'm reworking the normalization module definition and will add some warnings and explanations to the class doc. I'm also testing AdaNorm right now.

> Also, I am working on (still tuning) an optimization method that will learn the parameter scales as part of the optimizer, without requiring the individual scales for weights and biases, so I expect to eventually remove the ScaledLinear and ScaledConv1d (in newer directories), but the recipe will depend on properties of the optimizer.

Thanks for the update, I'll keep an eye on the development!

@pyf98 (Collaborator) commented Aug 4, 2022

I have two questions:

  1. Does it support GPU inference?
  2. Does it support automatic mixed precision training with `use_amp: true`?

For LibriSpeech, I'm increasing the nonstreaming model size to 120M and extending the number of epochs to 60.

@csukuangfj commented

> For LibriSpeech, I'm increasing the nonstreaming model size to 120M and extending the number of epochs to 60.

Does the model need to be so large, and does it need to be trained for so many epochs?

We are using a model with about 80M parameters and training it for 30 epochs on the LibriSpeech dataset in icefall.

@b-flo (Member, Author) commented Aug 4, 2022

> Does it support GPU inference?

It does! If it doesn't, that's a bug on my part.

> Does it support automatic mixed precision training with `use_amp: true`?

Sorry, not yet; I need to update warp-transducer for that. Let me finish some other things first and then I'll work on it (I'll take a look this weekend)!
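For reference, the standard AMP training pattern in PyTorch is sketched below; the catch with warp-transducer is presumably that its loss kernels want float32 inputs, so the joint network output would need an explicit cast under autocast. The toy model and loss here are stand-ins, not the actual ESPnet code:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import GradScaler, autocast

# Toy stand-ins for the real encoder/joint network and transducer loss,
# used only to show the AMP pattern.
model = nn.Linear(80, 512).cuda()
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()

for _ in range(10):
    feats = torch.randn(8, 80, device="cuda")
    optimizer.zero_grad()
    with autocast():
        out = model(feats)  # computed in fp16 where safe
    # A fused loss whose CUDA kernel only supports fp32 (as assumed for
    # warp-transducer here) needs an explicit cast back to float32:
    loss = out.float().pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```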

> Does the model need to be so large, and does it need to be trained for so many epochs?
>
> We are using a model with about 80M parameters and training it for 30 epochs on the LibriSpeech dataset in icefall.

Good question.
We usually (at least in my experience) prioritize performance in terms of xER over everything else in ESPnet for ASR. We sometimes add many extra epochs or rely on a really large architecture to squeeze out an extra 0.x%.

That being said:

  1. I do think 120M parameters is too much. I don't mind for the offline version, but we should be careful about the number of parameters in the online model.
  2. 60 epochs is OK compared to other model trainings in ESPnet. However, we really need to improve training in terms of stability and efficiency. That will be the focus after the next PR, alongside initialization.

@b-flo (Member, Author) commented Aug 4, 2022

Small update:

  1. I removed the parts related to initialization, as we don't use them in our experiments. They'll be reworked in a later PR.
  2. I refactored the normalization module for future additions/work. I also added RMSNorm and ScaleNorm, but they were not extensively tested (I also tried AdaNorm but found it difficult to converge); minimal sketches of both norms are shown below.
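For reference, minimal textbook versions of the two added normalizations (sketches only; the PR's actual modules may differ):

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learnable per-channel gain."""

    def __init__(self, size: int, eps: float = 1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.scale * x / rms


class ScaleNorm(nn.Module):
    """L2 normalization with a single learnable scalar gain,
    initialized to sqrt(d) as in the original paper."""

    def __init__(self, size: int, eps: float = 1e-8):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(float(size)).sqrt())
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.scale * x / norm
```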

@pyf98 (Collaborator) commented Aug 4, 2022

> > For LibriSpeech, I'm increasing the nonstreaming model size to 120M and extending the number of epochs to 60.
>
> Does the model need to be so large, and does it need to be trained for so many epochs?
>
> We are using a model with about 80M parameters and training it for 30 epochs on the LibriSpeech dataset in icefall.

Thanks for the info! I'm just trying to match the original Conformer-Transducer Large config and see how it performs. This should be an interesting investigation. I don't know whether 30 epochs is sufficient for the Transducer, but it is not for our joint CTC/attention models, according to previous experiments.

@mergify bot (Contributor) commented Aug 10, 2022

This pull request is now in conflict :(

mergify bot added the conflicts label on Aug 10, 2022
mergify bot removed the conflicts label on Aug 11, 2022
@b-flo merged commit c83abd7 into espnet:master on Aug 17, 2022
@b-flo (Member, Author) commented Aug 17, 2022

After discussion, I'm merging this PR! It was a long road for this one; thanks to everyone for your help 🎉

Now onto the next items!

@b-flo deleted the streaming_transducer_v2 branch on September 12, 2022 at 08:35
Labels: CI (Travis, Circle CI, etc.), Documentation, ESPnet2, RNNT ((RNN) transducer related issue), Streaming