Question regarding asr_inference_streaming.py #4807

Open
seunghyeon528 opened this issue Dec 5, 2022 · 14 comments

seunghyeon528 commented Dec 5, 2022

Hi!

First of all, thanks for such a nice toolkit! I love conducting research with this framework!

Recently, I trained some streaming ASR models with ESPnet2, and here are three questions regarding those models.

1

I trained a streaming ASR model using the contextual block conformer, and I found that when I train the model with the frontend below,

frontend_conf:
    n_fft: 512
    win_length: 400
    hop_length: 160

and test the model with an inference config using sim_chunk_length 512 (full config below),

beam_size: 30
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.6 
sim_chunk_length: 512
disable_repetition_detection: true
decoder_text_length_limit: 0
encoded_feat_length_limit: 0
#streaming: True

the WER was really bad.

But if I test the model with sim_chunk_length == 0, the WER seems decent.

Is there any relationship between the training frontend config and the inference sim_chunk_length?

I also found that if I train the model with the default frontend (win_length: 512, hop_length: 128, following https://github.com/espnet/espnet/blob/master/egs2/aishell/asr1/conf/train_asr_streaming_conformer.yaml), inference with any sim_chunk_length shows good WER.
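
For reference, the rough per-chunk numbers under the two frontends (purely illustrative arithmetic, not ESPnet code):

# Rough frame counts per simulated chunk; this ignores STFT centering/padding
# and the waveform buffering done inside asr_inference_streaming.py.
sim_chunk_length = 512

for win_length, hop_length in [(400, 160), (512, 128)]:
    print(
        f"win_length={win_length}, hop_length={hop_length}: "
        f"~{sim_chunk_length // hop_length} frames per chunk, "
        f"win_length multiple of hop_length: {win_length % hop_length == 0}"
    )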

2

Moreover, if I want to train with the same number of frames cited in "Tsunoo, Emiru, Yosuke Kashiwagi, and Shinji Watanabe. "Streaming Transformer ASR with Blockwise Synchronous Beam Search." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.", specifically {Nl, Nc, Nr} = {16,16,8} or {4,8,4}, I'm not sure whether setting the contextual block conformer encoder config to {block_size, hop_size, look_ahead} = {40,16,16} and {16,4,4} would be correct. Could you give some advice on the configuration?

3

Lastly, if I reduce block_size from 40 to 16, should I also reduce cnn_module_kernel for the contextual block conformer? If I understood correctly, since the 1D CNN layer with kernel size cnn_module_kernel is applied to the features of one block, a cnn_module_kernel of 31 for block_size 16 would be too large.
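
(For reference, a tiny shape check, not ESPnet code; the channel count 256 is arbitrary:)

import torch
import torch.nn as nn

# A depthwise 1D convolution with kernel size 31 over a 16-frame block needs
# 15 frames of padding on each side, so most of each output frame's receptive
# field lies outside the block.
block = torch.randn(1, 256, 16)  # (batch, channels, frames in one block)
conv = nn.Conv1d(256, 256, kernel_size=31, padding=15, groups=256)
print(conv(block).shape)  # torch.Size([1, 256, 16])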

Thanks, and hope you have a great day!

sw005320 commented Dec 5, 2022

Thanks for your questions!
@eml914, could you answer them for me?

eml914 commented Dec 9, 2022

Hi, sorry for my belated reply.

  1. I have also seen a problem around sim_chunk_length recently in my environment, but @D-Keqi might know more about it. I need some time to look into it and will come back later.
  2. The total block size is block_size = Nl + Nc + Nr and look_ahead = Nr, so {block_size, hop_size, look_ahead} = {40,16,8} and {16,4,4} are correct (see the small sketch after this list). Sorry for the confusion. But {40,16,16} is slightly better than {40,16,8}.
  3. As far as I have observed, the conformer is less effective in the streaming setup (but still effective enough). This is because the convolutional kernel is too big for the block size (31 vs. 40), as you pointed out. If you want to reduce the block size to 16, the kernel size should also be smaller. But I have never tried that setup yet.
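
A small sketch of the mapping in point 2 (plain Python, not ESPnet code; it only encodes the relations stated above):

def block_geometry(n_left, n_center, n_right):
    # block_size = Nl + Nc + Nr, look_ahead = Nr
    return {"block_size": n_left + n_center + n_right, "look_ahead": n_right}

print(block_geometry(16, 16, 8))  # {'block_size': 40, 'look_ahead': 8}
print(block_geometry(4, 8, 4))    # {'block_size': 16, 'look_ahead': 4}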

@seunghyeon528 (Author)

Thanks for such a prompt reply!

So far, I have tried only conformer models, but I might try streaming transformer models too.

The information you shared is so valuable!

Hope you have a great weekend :)

duj12 commented Jan 13, 2023

I have the same confusion.
In my case, I use the following frontend:

frontend_conf:
    n_fft: 400
    hop_length: 100

and the contextual block conformer's streaming configuration is:

cnn_module_kernel: 15
block_size: 24     # streaming configuration
hop_size: 10       # streaming configuration
look_ahead: 4      # streaming configuration

When I use sim_chunk_length: 512, the performance is terrible, but it is OK with sim_chunk_length: 0.

@seunghyeon528 Did you fix this problem? And could you please reopen this issue?
@eml914 Did you make any progress?
I will also look into this problem; if I find anything, I will update you all.

duj12 commented Jan 16, 2023

@seunghyeon528 I checked my model and the inference code; the reason is that the extracted filterbank features are different between sim_chunk_length=512 and sim_chunk_length=0.

In my case, I used utterance MVN (different from the default global MVN). When sim_chunk_length=0, the normalization is computed over the whole utterance, but when sim_chunk_length=512, the normalization is applied only within the current chunk (using only the current chunk's mean and variance).
In streaming mode, the normalization can NOT be utterance MVN.

You may want to check your feature extraction procedure. Since the decoding result is correct when sim_chunk_length=0, the contextual block conformer itself is fine; the problem should lie in where the audio chunks are split and the features are generated.
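
To make the mismatch concrete, here is a minimal sketch (plain PyTorch, not ESPnet code; the shapes and chunk size are arbitrary) showing that per-chunk utterance MVN produces different features from whole-utterance MVN:

import torch

torch.manual_seed(0)
feats = torch.randn(300, 80)  # fake fbank features: (frames, mel bins)

def utt_mvn(x):
    # utterance MVN: mean/variance taken from x itself
    return (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-8)

offline = utt_mvn(feats)  # sim_chunk_length = 0: statistics over the whole utterance
streaming = torch.cat([utt_mvn(c) for c in feats.split(30, dim=0)])  # per-chunk statistics

print(torch.allclose(offline, streaming))  # False: the encoder sees different features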

seunghyeon528 reopened this Jan 16, 2023

eml914 commented Jan 17, 2023

I was also looking at apply_frontend in asr_inference_streaming.py, but I realized I was looking at an old version of the repository. My environment uses fbank features as input, which did not work in the old version, but that has since been fixed.
@duj12 might be right; the normalization can work chunkwise. I will look at the signal-input case and let you know what I find.

@espnetUser (Contributor)

@seunghyeon528, @eml914, @duj12 and @sw005320: I see the same issue with poor WER as you reported when using sim_chunk_length=512 but good WER with sim_chunk_length=0. The issue only appears for me with certain frontend hop_length settings.

I believe the problem might result from incorrect trimming of feature frames here:
https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L257-L282

When win_length is not a multiple of hop_length (e.g., win_length: 256, hop_length: 90), the current code seems to trim an incorrect number (too many) of feature frames. This does not cause an issue for sim_chunk_length=0, as all samples are processed in one step, but for small values of sim_chunk_length like 512 trimming is performed more often and errors in feature frame trimming become very noticeable. If one increases sim_chunk_length to something larger like 8192 (trimming less often), the problem of poor WER almost disappears.

To test this theory I replaced all occurrences of math.ceil() with math.floor() in the trimming code here:
https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L257-L282
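
For concreteness, the arithmetic behind that swap (illustrative only, not the actual trimming code) for the configurations discussed in this thread:

import math

for win_length, hop_length in [(512, 128), (400, 160), (256, 90)]:
    trim_ceil = math.ceil(math.ceil(win_length / hop_length) / 2)
    trim_floor = math.floor(math.floor(win_length / hop_length) / 2)
    print(f"win={win_length}, hop={hop_length}: ceil trims {trim_ceil}, floor trims {trim_floor}")
# win=512, hop=128: ceil trims 2, floor trims 2  (win_length is a multiple of hop_length)
# win=400, hop=160: ceil trims 2, floor trims 1
# win=256, hop=90:  ceil trims 2, floor trims 1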

Here are the results which seem to indicate that this fixes the problem:

[image: WER results comparing the original math.ceil() trimming with the math.floor() variant]

Could you please check the trimming code and let me know if that change makes sense?

Thanks!

sw005320 commented Jun 8, 2023

Thanks a lot for the detailed information.
@eml914, could you take a look at it?

@espnetUser, if @eml914 agrees with this change, could I ask you to make a PR?

eml914 commented Jun 14, 2023

Great! Thank you very much for the inspection. I overlooked this point.
I realize that the trimming part is due to the default frontend, which uses torch.stft. With its default parameters, torch.stft does centering with padding, and the smaller the hop_length, the more frames are contaminated by the padding samples, so this code tries to trim out those unnecessary frames in streaming processing. I confirm that the number of frames that needs to be cut off is math.ceil(math.ceil(win_length / hop_length) / 2), so the original trimming code seems to be correct.
So what is the problem? It took me a long time, but I found that the issue is actually in the waveform_buffer that is passed to the next step, not in the trimming.
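
A quick standalone check of the centering behaviour described above (not the ESPnet frontend itself; the input length is arbitrary):

import torch

n_fft, win_length, hop_length = 512, 400, 160
x = torch.randn(1600)
window = torch.hann_window(win_length)

# center=True (the torch.stft default) pads the signal by n_fft // 2 samples on
# each side, producing extra edge frames that overlap the padding.
centered = torch.stft(x, n_fft, hop_length, win_length, window=window,
                      center=True, return_complex=True)
uncentered = torch.stft(x, n_fft, hop_length, win_length, window=window,
                        center=False, return_complex=True)
print(centered.shape[-1], uncentered.shape[-1])  # 11 vs. 7 frames for this input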

eml914 commented Jun 14, 2023

Can you change this part of the code as follows on your side and check if it works properly? The current code is:

n_frames = (
    speech.size(0) - (self.win_length - self.hop_length)
) // self.hop_length
n_residual = (
    speech.size(0) - (self.win_length - self.hop_length)
) % self.hop_length
speech_to_process = speech.narrow(
    0, 0, (self.win_length - self.hop_length) + n_frames * self.hop_length
)
waveform_buffer = speech.narrow(
    0,
    speech.size(0) - (self.win_length - self.hop_length) - n_residual,
    (self.win_length - self.hop_length) + n_residual,
).clone()

and the proposed change is:

n_frames = speech.size(0) // self.hop_length
n_residual = speech.size(0) % self.hop_length
speech_to_process = speech.narrow(
    0, 0, n_frames * self.hop_length
)
waveform_buffer = speech.narrow(
    0,
    speech.size(0)
    - (math.ceil(math.ceil(self.win_length / self.hop_length) / 2) * 2 - 1) * self.hop_length
    - n_residual,
    (math.ceil(math.ceil(self.win_length / self.hop_length) / 2) * 2 - 1) * self.hop_length
    + n_residual,
).clone()
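
As a sanity check, this is how many samples each version keeps when a single 512-sample input is processed in isolation (standalone arithmetic reproducing the expressions above; old_buffer_len and new_buffer_len are just illustrative helpers, not part of the actual class):

import math

def old_buffer_len(win_length, hop_length, chunk):
    n_residual = (chunk - (win_length - hop_length)) % hop_length
    return (win_length - hop_length) + n_residual

def new_buffer_len(win_length, hop_length, chunk):
    n_residual = chunk % hop_length
    keep = math.ceil(math.ceil(win_length / hop_length) / 2) * 2 - 1
    return keep * hop_length + n_residual

for win_length, hop_length in [(512, 128), (400, 160), (256, 90)]:
    print(win_length, hop_length,
          old_buffer_len(win_length, hop_length, 512),
          new_buffer_len(win_length, hop_length, 512))
# 512 128 384 384  -> unchanged when win_length is a multiple of hop_length
# 400 160 352 512
# 256 90  242 332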

@espnetUser (Contributor)

@eml914: Thanks for looking into this! I will check the proposed code change and get back to you.

b-flo commented Jun 15, 2023

Hum, I am manually linking #5140 to remind myself to follow this discussion and double-check the frontend parts for streaming Transducer.
I recall having similar issues with it before redesigning the audio/feature processing parts. I'm not entirely sure the issue is adequately mitigated or fixed there now; this discussion may help (thanks!).

@espnetUser (Contributor)

@eml914: Apologies for my late reply. I tested the proposed code change to waveform_buffer and can confirm it solves the issue with certain hop_length settings without having to modify the feature trimming code. With the new code change I am seeing the same "good" WER as with my "trimming code change" across different sim_chunk_length/hop_length values. So this looks good to me.

@sw005320, @eml914: I will go ahead and create a PR for this issue. Thanks!

espnetUser added a commit to espnetUser/espnet that referenced this issue Jun 30, 2023
For certain win_length/hop_length and sim_chunk_length settings (e.g., win_length: 256, hop_length: 90, and sim_chunk_length: 512), poor WER was observed, caused by incorrect filling of the waveform_buffer object.

See espnet#4807

@sw005320 (Contributor)

Thanks!
