
loss value and decode library? #30

Open
xiongjun19 opened this issue Mar 11, 2022 · 10 comments
@xiongjun19

Thanks very much for your great project!
I have two questions:
1. How large is the transducer loss for a well-performing model, i.e., when the model has converged?
2. Is there a fast decoding solution? I found that the beam search decoding implemented in many projects is extremely slow.

@csukuangfj
Owner

Please have a look at https://github.com/k2-fsa/icefall

You can find tensorboard training logs in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/RESULTS.md

  1. How large is the transducer loss for a well-performing model, i.e., when the model has converged?

The average loss per frame is about 0.02 or below.

Is there a fast decoding solution?

Yes, please see modified beam search in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363

There is only one loop in the time axis.
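To illustrate why a single loop over the time axis is fast, here is a minimal sketch of the greedy variant of the same time-synchronous idea, for a generic transducer. The `joiner` and `decoder` callables below are hypothetical stand-ins, not icefall's API; the real `modified_beam_search` additionally keeps several hypotheses alive per step.

```python
def greedy_decode(encoder_out, joiner, decoder, blank_id=0, max_sym_per_frame=3):
    """Greedy search for a transducer: a single loop over the time
    axis; at each frame, emit symbols until the joiner predicts
    blank (or a per-frame cap is hit), then move to the next frame."""
    hyp = []
    for frame in encoder_out:            # the only loop is over time
        for _ in range(max_sym_per_frame):
            logits = joiner(frame, decoder(hyp))
            y = max(range(len(logits)), key=logits.__getitem__)
            if y == blank_id:
                break                    # blank: advance to the next frame
            hyp.append(y)
    return hyp

# Toy stand-ins (hypothetical; not icefall's API): the joiner just
# returns the frame's scores and the decoder state is ignored.
enc = [[0.1, 2.0, 0.1], [3.0, 0.1, 0.1]]   # T=2 frames, vocab size 3
decoder = lambda hyp: None
joiner = lambda frame, state: frame
print(greedy_decode(enc, joiner, decoder))  # → [1, 1, 1]
```

Because there is no inner loop over beam expansions of growing hypotheses, the cost grows linearly with the number of frames.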

We have documentation for how to use it with a pre-trained model. Please see https://icefall.readthedocs.io/en/latest/recipes/aishell/stateless_transducer.html

There is also a Colab notebook for it
https://colab.research.google.com/drive/12jpTxJB44vzwtcmJl2DTdznW0OawPb9H?usp=sharing

@csukuangfj
Owner

Note: The above beam search is implemented in Python and it decodes only one utterance at a time.

We are implementing it in C++ with CUDA, which can decode multiple utterances in parallel.
Please see k2-fsa/k2#926

It will be wrapped for Python soon.

@xiongjun19
Author


Wow, your answer is really helpful, thank you very much!

@xiongjun19
Author

xiongjun19 commented Mar 15, 2022

Dear csukuangfj!
I have studied the code in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/transducer_stateless/beam_search.py#L363 carefully, adapted it to my trained model and code structure, and compared it with the decoding method from SpeechBrain. I am applying it to a basecalling task and got the following results:
batch_size: 8, time_steps: 720

  1. speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;

  2. icefall_dec: acc: 93.7%; speed: 6.10 s/it;

The speed is much better, and thanks for your work. Is there any documentation for the C++ decoding interface (k2-fsa/k2#926) you mentioned before?

@csukuangfj
Owner

csukuangfj commented Mar 15, 2022

If you try the k2 pruned rnnt loss, https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160, it is even faster; you may get 4.0 s/it. [EDIT]: I thought it was training time.

There is a Python interface for it. See k2-fsa/icefall#250

We will add a C++ interface for it later, i.e., provide only a header file and some pre-compiled libraries.

k2-fsa/icefall#250 is even faster if you use it for decoding.

@xiongjun19
Author

xiongjun19 commented Mar 21, 2022

Dear csukuangfj!
I have tried the rnnt loss from https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/model.py#L160, and I have two things to report:
First, to my surprise, the loss is quite large; I'm not sure whether there is a problem. The loss and metrics in my first epoch are as follows:

loss: 9192.901783988205
metric: accuracy: 93.09%

Second, I modified the modified beam search method to support batched decoding, so the decoding speeds are now as follows:
batch_size: 8, time_steps: 720

    speech_brain_dec: acc: 94.00%; speed: 11.70 s/it;
    icefall_dec: acc: 93.7%; speed: 6.10 s/it;
    icefall_dec_batch: acc: 93.7%; speed: 1.73 s/it;

Thanks very much for the information; I will try the k2-fsa/icefall#250 interface you mentioned some time later.
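For quick comparison, the relative speedups implied by the timings reported in this thread can be computed directly:

```python
# s/it timings from the measurements above (batch_size 8, 720 time steps)
timings = {
    "speech_brain_dec": 11.70,
    "icefall_dec": 6.10,
    "icefall_dec_batch": 1.73,
}
baseline = timings["speech_brain_dec"]
for name, seconds in timings.items():
    # speedup relative to the SpeechBrain decoder
    print(f"{name}: {baseline / seconds:.2f}x vs. baseline")
```

So batched icefall decoding is roughly 6.8x faster than the SpeechBrain baseline and about 3.5x faster than the unbatched icefall decoder.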

@csukuangfj
Owner

to my surprise, the loss is quite large

Please clarify whether the loss is

  • the sum of the loss over all frames in the batch,
  • the average loss over utterances in the batch, or
  • the average loss over all frames in the batch?

By the way, how do you measure the decoding time? Do you have any RTF available?

@xiongjun19
Author


The loss code is as follows:

(screenshot of the loss computation code)

So I guess the loss is the sum of the loss over all frames in the batch.

Decoding time: I am decoding in batches, so RTF is not applicable in this setting. My measurement is very simple: how much time it takes to complete inference on one batch of data. I found that decoding is the bottleneck, as it takes about 99% of the time.
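An RTF can still be defined for batched decoding: wall-clock decoding time divided by the total audio duration in the batch. A sketch using the numbers from this thread, assuming a 10 ms frame shift (an assumption; the thread does not state the frame rate, and basecalling "frames" may not map to audio time at all):

```python
def rtf(decode_seconds, audio_seconds):
    """Real-time factor: processing time / audio duration.
    Values below 1.0 mean faster than real time."""
    return decode_seconds / audio_seconds

# Batch of 8 utterances x 720 time steps; 10 ms per frame is assumed
audio_s = 8 * 720 * 0.010              # 57.6 s of audio per batch
print(round(rtf(1.73, audio_s), 3))    # batched decode at 1.73 s/it → 0.03
```

Under these assumptions the batched decoder runs at roughly 3% of real time.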

@csukuangfj
Owner

so I guess, the loss is the sum of the loss over all frames in batch.

Yes, you can divide it by the number of acoustic frames after subsampling in the model. Please see
https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless/train.py#L495

    info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
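To make the normalization concrete, here is a small sketch. The per-utterance feature lengths below are made-up illustration values; the total loss is the figure reported earlier in the thread, and the subsampling factor of 4 is an assumption:

```python
# Hypothetical per-utterance feature lengths for a batch of 8
feature_lens = [720, 700, 680, 720, 710, 690, 720, 700]
subsampling_factor = 4  # assumed value for illustration

# Number of acoustic frames after subsampling, as in train.py above:
# info["frames"] = (feature_lens // params.subsampling_factor).sum().item()
frames = sum(n // subsampling_factor for n in feature_lens)

total_loss = 9192.90  # the summed batch loss reported in this thread
avg_loss_per_frame = total_loss / frames
print(frames, round(avg_loss_per_frame, 3))  # → 1409 6.524
```

With these illustrative numbers the per-frame loss after the first epoch is about 6.5, which is what should be tracked toward the ~0.02 figure mentioned earlier, rather than the raw sum.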

@xiongjun19
Author


ok
