Question regarding asr_inference_streaming.py #4807

Open
seunghyeon528 opened this issue Dec 5, 2022 · 14 comments

seunghyeon528 commented Dec 5, 2022

Hi!

First of all, thanks for such a nice toolkit! I love conducting research with this framework!

Recently, I trained some streaming ASR models with ESPnet2, and here are three questions regarding those models.

1

I trained a streaming ASR model using the contextual block conformer, and I found that when I train the model with the frontend below,

frontend_conf:
    n_fft: 512
    win_length: 400
    hop_length: 160

and test the model with an inference config using sim_chunk_length 512 (full config below),

beam_size: 30
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.6 
sim_chunk_length: 512
disable_repetition_detection: true
decoder_text_length_limit: 0
encoded_feat_length_limit: 0
#streaming: True

the WER was really bad.

But if I test the model with sim_chunk_length == 0, the WER seems decent.

Is there any relationship between the training frontend config and the inference sim_chunk_length?

I also found that if I train the model with the default frontend (win_length: 512, hop_length: 128, following https://github.com/espnet/espnet/blob/master/egs2/aishell/asr1/conf/train_asr_streaming_conformer.yaml), inference with any sim_chunk_length shows good WER.
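
For reference, the rough per-chunk numbers under the two frontends (purely illustrative arithmetic, not ESPnet code):

# Rough frame counts per simulated chunk; this ignores STFT centering/padding
# and the waveform buffering done inside asr_inference_streaming.py.
sim_chunk_length = 512

for win_length, hop_length in [(400, 160), (512, 128)]:
    print(
        f"win_length={win_length}, hop_length={hop_length}: "
        f"~{sim_chunk_length // hop_length} frames per chunk, "
        f"win_length multiple of hop_length: {win_length % hop_length == 0}"
    )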

2

Moreover, if I want to train with the same number of frames cited in "Tsunoo, Emiru, Yosuke Kashiwagi, and Shinji Watanabe. "Streaming Transformer ASR with Blockwise Synchronous Beam Search." 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021.", specifically {Nl, Nc, Nr} = {16,16,8} or {4,8,4}, I'm not sure whether setting the contextual block conformer encoder config to {block_size, hop_size, look_ahead} = {40,16,16} and {16,4,4} would be correct. Could you give some advice on the configuration?

3

Lastly, if I reduce block_size from 40 to 16, should I also reduce cnn_module_kernel for the contextual block conformer? If I understood correctly, since the 1D CNN layer with kernel size cnn_module_kernel is applied to the features of one block, a cnn_module_kernel of 31 for block_size 16 would be too large.
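
(For reference, a tiny shape check, not ESPnet code; the channel count 256 is arbitrary:)

import torch
import torch.nn as nn

# A depthwise 1D convolution with kernel size 31 over a 16-frame block needs
# 15 frames of padding on each side, so most of each output frame's receptive
# field lies outside the block.
block = torch.randn(1, 256, 16)  # (batch, channels, frames in one block)
conv = nn.Conv1d(256, 256, kernel_size=31, padding=15, groups=256)
print(conv(block).shape)  # torch.Size([1, 256, 16])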

Thanks, and hope you have a great day!

sw005320 commented Dec 5, 2022

Thanks for your questions!
@eml914, could you answer them for me?

eml914 commented Dec 9, 2022

Hi, sorry for my belated reply.

  1. I have also seen a problem around sim_chunk_length recently in my environment, but @D-Keqi might know more about it. I need some time to look into it and will come back later.
  2. The total block size is block_size = Nl + Nc + Nr and look_ahead = Nr, so {block_size, hop_size, look_ahead} = {40,16,8} and {16,4,4} are correct (see the small sketch after this list). Sorry for the confusion. But {40,16,16} is slightly better than {40,16,8}.
  3. As far as I have observed, the conformer is less effective in the streaming setup (but still effective enough). This is because the convolutional kernel is too big for the block size (31 vs. 40), as you pointed out. If you want to reduce the block size to 16, the kernel size should also be smaller. But I have never tried that setup yet.
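
A small sketch of the mapping in point 2 (plain Python, not ESPnet code; it only encodes the relations stated above):

def block_geometry(n_left, n_center, n_right):
    # block_size = Nl + Nc + Nr, look_ahead = Nr
    return {"block_size": n_left + n_center + n_right, "look_ahead": n_right}

print(block_geometry(16, 16, 8))  # {'block_size': 40, 'look_ahead': 8}
print(block_geometry(4, 8, 4))    # {'block_size': 16, 'look_ahead': 4}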

@seunghyeon528 (Author)

Thanks for such a prompt reply!

So far, I have tried only conformer models, but I might try streaming transformer models too.

The information you shared is so valuable!

Hope you have a great weekend :)

duj12 commented Jan 13, 2023

I have the same confusion.
In my case, I use the following frontend:

frontend_conf:
    n_fft: 400
    hop_length: 100

and the contextual block conformer's streaming configuration is:

cnn_module_kernel: 15
block_size: 24     # streaming configuration
hop_size: 10       # streaming configuration
look_ahead: 4      # streaming configuration

When I use sim_chunk_length: 512, the performance is terrible, but it is OK with sim_chunk_length: 0.

@seunghyeon528 Did you fix this problem? And could you please reopen this issue?
@eml914 Did you make any progress?
I will also look into this problem; if I find anything, I will update you all.

duj12 commented Jan 16, 2023

@seunghyeon528 I checked my model and the inference code; the reason is that the extracted filterbank features are different between sim_chunk_length=512 and sim_chunk_length=0.

In my case, I used utterance MVN (different from the default global MVN). When sim_chunk_length=0, the normalization is computed over the whole utterance, but when sim_chunk_length=512, the normalization is applied only within the current chunk (using only the current chunk's mean and variance).
In streaming mode, the normalization can NOT be utterance MVN.

You may want to check your feature extraction procedure. Since the decoding result is correct when sim_chunk_length=0, the contextual block conformer itself is fine; the problem should lie in where the audio chunks are split and the features are generated.
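
To make the mismatch concrete, here is a minimal sketch (plain PyTorch, not ESPnet code; the shapes and chunk size are arbitrary) showing that per-chunk utterance MVN produces different features from whole-utterance MVN:

import torch

torch.manual_seed(0)
feats = torch.randn(300, 80)  # fake fbank features: (frames, mel bins)

def utt_mvn(x):
    # utterance MVN: mean/variance taken from x itself
    return (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-8)

offline = utt_mvn(feats)  # sim_chunk_length = 0: statistics over the whole utterance
streaming = torch.cat([utt_mvn(c) for c in feats.split(30, dim=0)])  # per-chunk statistics

print(torch.allclose(offline, streaming))  # False: the encoder sees different features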

seunghyeon528 reopened this Jan 16, 2023

eml914 commented Jan 17, 2023

I was also looking at apply_frontend in asr_inference_streaming.py, but I realized I was looking at an old version of the repository. My environment uses fbank features as input, which did not work in the old version, but that has since been fixed.
@duj12 might be right; the normalization can work chunkwise. I will look at the signal-input case and let you know what I find.

@espnetUser (Contributor)

@seunghyeon528, @eml914, @duj12 and @sw005320: I see the same issue with poor WER as you reported when using sim_chunk_length=512 but good WER with sim_chunk_length=0. The issue only appears for me with certain frontend hop_length settings.

I believe the problem might result from incorrect trimming of feature frames here:
https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L257-L282

When win_length is not a multiple of hop_length (e.g., win_length: 256, hop_length: 90), the current code seems to trim an incorrect number (too many) of feature frames. This does not cause an issue for sim_chunk_length=0, as all samples are processed in one step, but for small values of sim_chunk_length like 512 trimming is performed more often and errors in feature frame trimming become very noticeable. If one increases sim_chunk_length to something larger like 8192 (trimming less often), the problem of poor WER almost disappears.

To test this theory I replaced all occurrences of math.ceil() with math.floor() in the trimming code here:
https://github.com/espnet/espnet/blob/master/espnet2/bin/asr_inference_streaming.py#L257-L282
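
For concreteness, the arithmetic behind that swap (illustrative only, not the actual trimming code) for the configurations discussed in this thread:

import math

for win_length, hop_length in [(512, 128), (400, 160), (256, 90)]:
    trim_ceil = math.ceil(math.ceil(win_length / hop_length) / 2)
    trim_floor = math.floor(math.floor(win_length / hop_length) / 2)
    print(f"win={win_length}, hop={hop_length}: ceil trims {trim_ceil}, floor trims {trim_floor}")
# win=512, hop=128: ceil trims 2, floor trims 2  (win_length is a multiple of hop_length)
# win=400, hop=160: ceil trims 2, floor trims 1
# win=256, hop=90:  ceil trims 2, floor trims 1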

Here are the results which seem to indicate that this fixes the problem:

[image: WER results comparing the original math.ceil() trimming with the math.floor() variant]

Could you please check the trimming code and let me know if that change makes sense?

Thanks!

sw005320 commented Jun 8, 2023

Thanks a lot for the detailed information.
@eml914, could you take a look at it?

@espnetUser, if @eml914 agrees with this change, could I ask you to make a PR?

eml914 commented Jun 14, 2023

Great! Thank you very much for the inspection. I overlooked this point.
I realize that the trimming part is due to the default frontend, which uses torch.stft. With its default parameters, torch.stft does centering with padding, and the smaller the hop_length, the more frames are contaminated by the padding samples, so this code tries to trim out those unnecessary frames in streaming processing. I confirm that the number of frames that needs to be cut off is math.ceil(math.ceil(win_length / hop_length) / 2), so the original trimming code seems to be correct.
So what is the problem? It took me a long time, but I found that the issue is actually in the waveform_buffer that is passed to the next step, not in the trimming.
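
A quick standalone check of the centering behaviour described above (not the ESPnet frontend itself; the input length is arbitrary):

import torch

n_fft, win_length, hop_length = 512, 400, 160
x = torch.randn(1600)
window = torch.hann_window(win_length)

# center=True (the torch.stft default) pads the signal by n_fft // 2 samples on
# each side, producing extra edge frames that overlap the padding.
centered = torch.stft(x, n_fft, hop_length, win_length, window=window,
                      center=True, return_complex=True)
uncentered = torch.stft(x, n_fft, hop_length, win_length, window=window,
                        center=False, return_complex=True)
print(centered.shape[-1], uncentered.shape[-1])  # 11 vs. 7 frames for this input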

eml914 commented Jun 14, 2023

Can you change this part of the code as follows on your side and check if it works properly? The current code is:

n_frames = (
    speech.size(0) - (self.win_length - self.hop_length)
) // self.hop_length
n_residual = (
    speech.size(0) - (self.win_length - self.hop_length)
) % self.hop_length
speech_to_process = speech.narrow(
    0, 0, (self.win_length - self.hop_length) + n_frames * self.hop_length
)
waveform_buffer = speech.narrow(
    0,
    speech.size(0) - (self.win_length - self.hop_length) - n_residual,
    (self.win_length - self.hop_length) + n_residual,
).clone()

and the proposed change is:

n_frames = speech.size(0) // self.hop_length
n_residual = speech.size(0) % self.hop_length
speech_to_process = speech.narrow(
    0, 0, n_frames * self.hop_length
)
waveform_buffer = speech.narrow(
    0,
    speech.size(0)
    - (math.ceil(math.ceil(self.win_length / self.hop_length) / 2) * 2 - 1) * self.hop_length
    - n_residual,
    (math.ceil(math.ceil(self.win_length / self.hop_length) / 2) * 2 - 1) * self.hop_length
    + n_residual,
).clone()
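
As a sanity check, this is how many samples each version keeps when a single 512-sample input is processed in isolation (standalone arithmetic reproducing the expressions above; old_buffer_len and new_buffer_len are just illustrative helpers, not part of the actual class):

import math

def old_buffer_len(win_length, hop_length, chunk):
    n_residual = (chunk - (win_length - hop_length)) % hop_length
    return (win_length - hop_length) + n_residual

def new_buffer_len(win_length, hop_length, chunk):
    n_residual = chunk % hop_length
    keep = math.ceil(math.ceil(win_length / hop_length) / 2) * 2 - 1
    return keep * hop_length + n_residual

for win_length, hop_length in [(512, 128), (400, 160), (256, 90)]:
    print(win_length, hop_length,
          old_buffer_len(win_length, hop_length, 512),
          new_buffer_len(win_length, hop_length, 512))
# 512 128 384 384  -> unchanged when win_length is a multiple of hop_length
# 400 160 352 512
# 256 90  242 332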

@espnetUser (Contributor)

@eml914: Thanks for looking into this! I will check the proposed code change and get back to you.

b-flo commented Jun 15, 2023

Hum, I am manually linking #5140 to remind myself to follow this discussion and double-check the frontend parts for streaming Transducer.
I recall having similar issues with it before redesigning the audio/feature processing parts. I'm not entirely sure the issue is adequately mitigated or fixed there now; this discussion may help (thanks!).

@espnetUser (Contributor)

@eml914: Apologies for my late reply. I tested the proposed code change to waveform_buffer and can confirm it solves the issue with certain hop_length settings without having to modify the feature trimming code. With the new code change I am seeing the same "good" WER as with my "trimming code change" across different sim_chunk_length/hop_length values. So this looks good to me.

@sw005320, @eml914: I will go ahead and create a PR for this issue. Thanks!

espnetUser added a commit to espnetUser/espnet that referenced this issue Jun 30, 2023
For certain win_length/hop_length and sim_chunk_length settings (e.g., win_length: 256, hop_length: 90, and sim_chunk_length: 512), poor WER was observed, caused by incorrect filling of the waveform_buffer object.

See espnet#4807

@sw005320 (Contributor)

Thanks!
