
loss value and decode library? #30

Open
xiongjun19 opened this issue Mar 11, 2022 · 10 comments
@xiongjun19

Thanks very much for your great project!
I have two questions:
1. How large is the transducer loss for a well-performing model, i.e., when the model has converged?
2. Is there a fast decoding solution? I found that the beam search decoding implemented in many projects is extremely slow.

@csukuangfj
Owner

Please have a look at https://github.com/k2-fsa/icefall

You can find tensorboard training logs in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md

  1. How large is the transducer loss for a well-performing model, i.e., when the model has converged?

The average loss per frame is about 0.02 or below.

Is there a fast decoding solution?

Yes, please see modified beam search in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363

There is only one loop in the time axis.
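To illustrate why a single loop over the time axis is fast, here is a minimal sketch of the greedy variant of the same time-synchronous idea, for a generic transducer. The `joiner` and `decoder` callables below are hypothetical stand-ins, not icefall's API; the real `modified_beam_search` additionally keeps several hypotheses alive per step.

```python
def greedy_decode(encoder_out, joiner, decoder, blank_id=0, max_sym_per_frame=3):
    """Greedy search for a transducer: a single loop over the time
    axis; at each frame, emit symbols until the joiner predicts
    blank (or a per-frame cap is hit), then move to the next frame."""
    hyp = []
    for frame in encoder_out:            # the only loop is over time
        for _ in range(max_sym_per_frame):
            logits = joiner(frame, decoder(hyp))
            y = max(range(len(logits)), key=logits.__getitem__)
            if y == blank_id:
                break                    # blank: advance to the next frame
            hyp.append(y)
    return hyp

# Toy stand-ins (hypothetical; not icefall's API): the joiner just
# returns the frame's scores and the decoder state is ignored.
enc = [[0.1, 2.0, 0.1], [3.0, 0.1, 0.1]]   # T=2 frames, vocab size 3
decoder = lambda hyp: None
joiner = lambda frame, state: frame
print(greedy_decode(enc, joiner, decoder))  # → [1, 1, 1]
```

Because there is no inner loop over beam expansions of growing hypotheses, the cost grows linearly with the number of frames.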

We have documentation for how to use it with a pre-trained model. Please see https://icefall.readthedocs.io/en/latest/recipes/aishell/stateless_transducer.html

There is also a Colab notebook for it
https://colab.research.google.com/drive/12jpTxJB44vzwtcmJl2DTdznW0OawPb9H?usp=sharing

@csukuangfj
Owner

Note: The above beam search is implemented in Python and it decodes only one utterance at a time.

We are implementing it in C++ with CUDA, which can decode multiple utterances in parallel.
Please see k2-fsa/k2#926

It will be wrapped for Python soon.

@xiongjun19
Author


Wow, your answer is really helpful, thank you very much!

@xiongjun19
Author

xiongjun19 commented Mar 15, 2022

Dear csukuangfj!
I have studied the code in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363 carefully, adapted it to my trained model and code structure, and compared it with the decoding method from SpeechBrain. I am applying it to a basecalling task and got the following results:
batch_size: 8, time_steps: 720

  1. speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;

  2. icefall_dec: acc: 93.7%; speed: 6.10 s/it;

The speed is much better, and thanks for your work. Is there any documentation for the C++ decoding interface (k2-fsa/k2#926) you mentioned before?

@csukuangfj
Owner

csukuangfj commented Mar 15, 2022

If you try the k2 pruned rnnt loss, https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160, it is even faster; you may get 4.0 s/it. [EDIT]: I thought it was training time.

There is a Python interface for it. See k2-fsa/icefall#250

We will add a C++ interface for it later, i.e., provide only a header file and some pre-compiled libraries.

k2-fsa/icefall#250 is even faster if you use it for decoding.

@xiongjun19
Author

xiongjun19 commented Mar 21, 2022

Dear csukuangfj!
I have tried the rnnt loss from https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160, and I have two things to report:
First, to my surprise, the loss is quite large; I'm not sure whether there is a problem. The loss and metrics in my first epoch are as follows:

loss: 9192.901783988205
metric: accuracy: 93.09%

Second, I modified the modified beam search method to support batched decoding, so the decoding speeds are now as follows:
batch_size: 8, time_steps: 720

    speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;
    icefall_dec: acc: 93.7%; speed: 6.10 s/it;
    icefall_dec_batch: acc: 93.7%; speed: 1.73 s/it;

Thanks very much for the information; I will try the k2-fsa/icefall#250 interface you mentioned some time later.
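For quick comparison, the relative speedups implied by the timings reported in this thread can be computed directly:

```python
# s/it timings from the measurements above (batch_size 8, 720 time steps)
timings = {
    "speech_brain_dec": 11.70,
    "icefall_dec": 6.10,
    "icefall_dec_batch": 1.73,
}
baseline = timings["speech_brain_dec"]
for name, seconds in timings.items():
    # speedup relative to the SpeechBrain decoder
    print(f"{name}: {baseline / seconds:.2f}x vs. baseline")
```

So batched icefall decoding is roughly 6.8x faster than the SpeechBrain baseline and about 3.5x faster than the unbatched icefall decoder.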

@csukuangfj
Owner

to my surprise, the loss is quite large

Please clarify whether the loss is

  • the sum of the loss over all frames in the batch,
  • the average loss over utterances in the batch, or
  • the average loss over all frames in the batch?

By the way, how do you measure the decoding time? Do you have any RTF available?

@xiongjun19
Author


The loss code is as follows:

(screenshot of the loss computation code)

So I guess the loss is the sum of the loss over all frames in the batch.

Decoding time: I am decoding in batches, so RTF is not applicable in this setting. My measurement is very simple: how much time it takes to complete inference on one batch of data. I found that decoding is the bottleneck, as it takes about 99% of the time.
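An RTF can still be defined for batched decoding: wall-clock decoding time divided by the total audio duration in the batch. A sketch using the numbers from this thread, assuming a 10 ms frame shift (an assumption; the thread does not state the frame rate, and basecalling "frames" may not map to audio time at all):

```python
def rtf(decode_seconds, audio_seconds):
    """Real-time factor: processing time / audio duration.
    Values below 1.0 mean faster than real time."""
    return decode_seconds / audio_seconds

# Batch of 8 utterances x 720 time steps; 10 ms per frame is assumed
audio_s = 8 * 720 * 0.010              # 57.6 s of audio per batch
print(round(rtf(1.73, audio_s), 3))    # batched decode at 1.73 s/it → 0.03
```

Under these assumptions the batched decoder runs at roughly 3% of real time.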

@csukuangfj
Owner

so I guess, the loss is the sum of the loss over all frames in batch.

Yes, you can divide it by the number of acoustic frames after subsampling in the model. Please see
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/train.py#L495

    info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
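To make the normalization concrete, here is a small sketch. The per-utterance feature lengths below are made-up illustration values; the total loss is the figure reported earlier in the thread, and the subsampling factor of 4 is an assumption:

```python
# Hypothetical per-utterance feature lengths for a batch of 8
feature_lens = [720, 700, 680, 720, 710, 690, 720, 700]
subsampling_factor = 4  # assumed value for illustration

# Number of acoustic frames after subsampling, as in train.py above:
# info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
frames = sum(n // subsampling_factor for n in feature_lens)

total_loss = 9192.90  # the summed batch loss reported in this thread
avg_loss_per_frame = total_loss / frames
print(frames, round(avg_loss_per_frame, 3))  # → 1409 6.524
```

With these illustrative numbers the per-frame loss after the first epoch is about 6.5, which is what should be tracked toward the ~0.02 figure mentioned earlier, rather than the raw sum.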

@xiongjun19
Author


ok
