Question regarding asr_inference_streaming.py #4807
Comments
Thanks for your questions!
Hi, sorry for my belated reply.
Thanks for such a prompt reply! So far I have tried only conformer models, but I might try streaming transformer models too. The information you shared is very valuable! Hope you have a great weekend :)
I have the same confusion. @seunghyeon528 Did you fix this problem? And could you please reopen this issue?
@seunghyeon528 I checked my model and the inference code; the reason is that the extracted filterbank features are different between streaming (chunked) and full-utterance inference. In my case, I used utterance MVN (different from the default global MVN). You may want to check your feature extraction procedures, because the decoding result is right when the two feature extractions match.
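To make the MVN point concrete, here is a minimal sketch (plain NumPy, not ESPnet's actual frontend or UtteranceMVN class) of why utterance-level statistics computed per chunk differ from statistics computed over the whole utterance:

```python
# Minimal sketch: utterance-level MVN normalizes with statistics of whatever
# segment the normalizer sees at once, so chunked (streaming) inference and
# full-utterance inference produce different normalized features.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 80))  # (frames, mel bins), hypothetical features

def utterance_mvn(x):
    # normalize with mean/std of the given segment only
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-10)

full = utterance_mvn(feats)                       # offline: stats over all 300 frames
chunked = np.concatenate(                         # streaming: stats per 50-frame chunk
    [utterance_mvn(c) for c in np.split(feats, 6)]
)
print(np.abs(full - chunked).max())  # clearly nonzero -> mismatched features
```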
I was also looking at this part of the code.
@seunghyeon528, @eml914, @duj12 and @sw005320: I see the same issue with poor WER as you reported, in my case when using a frontend with win_length 256 and hop_length 90 (and sim_chunk_length 512).

I believe the problem might result from incorrect trimming of feature frames in `asr_inference_streaming.py`. When the win_length is not a multiple of the hop_length (e.g., win_length 256 with hop_length 90), the integer division used to size the waveform buffer rounds down and drops samples that the next chunk still needs.

To test this theory I replaced all occurrences of the floor-division expression used to size the buffer with a ceiling-based (`math.ceil`) version. My results seem to indicate that this fixes the problem.

Could you please check the trimming code and let me know if that change makes sense? Thanks!
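For reference, a small sketch of the arithmetic behind this theory. The floor-based expression is assumed to mirror the original trimming code, and the ceiling-based one matches the proposed replacement further down in this thread:

```python
# Compare the floor- and ceil-based expressions for the number of hop-sized
# context frames kept in the waveform buffer, for the two frontend settings
# mentioned in this thread.
import math

def buffer_frames_floor(win_length, hop_length):
    return (win_length // hop_length) // 2 * 2 - 1

def buffer_frames_ceil(win_length, hop_length):
    return math.ceil(math.ceil(win_length / hop_length) / 2) * 2 - 1

for win, hop in [(512, 128), (256, 90)]:
    print(win, hop,
          buffer_frames_floor(win, hop) * hop,
          buffer_frames_ceil(win, hop) * hop)
# (512, 128): both keep 384 samples -> the expressions agree when win % hop == 0,
#             which is why the default frontend works either way
# (256, 90):  floor keeps only 90 samples, ceil keeps 270 -> floor rounds the
#             buffer down when win_length is not a multiple of hop_length
```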
Thanks a lot for the detailed information. @espnetUser, if @eml914 agrees with this change, could I ask you to make a PR? |
Great! Thank you very much for the inspection. I overlooked this point. |
Can you change this part of the code as follows on your side and check if this change works properly? (espnet/espnet2/bin/asr_inference_streaming.py, lines 225 to 238 at 096e2bb)

```python
# Note: this snippet assumes `import math` at the top of the module.
n_frames = speech.size(0) // self.hop_length
n_residual = speech.size(0) % self.hop_length

# Process only the samples that fill whole hops.
speech_to_process = speech.narrow(0, 0, n_frames * self.hop_length)

# Number of hop-sized frames to keep as context for the next chunk;
# math.ceil avoids rounding down when win_length is not a multiple of hop_length.
n_buffer_frames = math.ceil(math.ceil(self.win_length / self.hop_length) / 2) * 2 - 1

# Keep the context frames plus the leftover (residual) samples for the next chunk.
waveform_buffer = speech.narrow(
    0,
    speech.size(0) - n_buffer_frames * self.hop_length - n_residual,
    n_buffer_frames * self.hop_length + n_residual,
).clone()
```
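If I read the proposed change correctly, it does two things: `math.ceil` prevents the buffer-length expression from rounding down to too few context frames when `win_length` is not a multiple of `hop_length`, and the `n_residual` term carries the leftover samples of the chunk (those that don't fill a whole hop) into `waveform_buffer` instead of silently dropping them. That reading is my interpretation of the snippet, not something confirmed in the thread.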
@eml914: Thanks for looking into this! I will check the proposed code change and get back to you. |
Hum, I am manually linking #5140 to remind me to follow the discussion and double-check the frontend parts for streaming Transducer. |
@eml914: Apologies for my late reply. I tested the proposed code change and it works properly on my side. @sw005320, @eml914: I will go ahead and create a PR for this issue. Thanks!
For certain win_length/hop_length and sim_chunk_length settings (e.g., win_length: 256, hop_length: 90, and sim_chunk_length: 512), poor WER was observed, caused by incorrect filling of the waveform_buffer object. See espnet#4807
Thanks!
Hi!
First of all, thanks for such a nice toolkit! I love conducting research with this framework!
Recently, I trained some streaming ASR models with ESPnet2, and here are three questions regarding the models.
1
I trained a streaming ASR model using the contextual block conformer, and I found that when I trained the model with the frontend config below and tested it with an inference config using sim_chunk_length 512 (full config below), the WER was really bad. But if I test the model with `sim_chunk_length == 0`, the WER seems decent. Is there any relationship between the training frontend config and the inference sim_chunk_length?
I also found that if I train the model with the default frontend (win_length: 512, hop_length: 128, following https://github.com/espnet/espnet/blob/master/egs2/aishell/asr1/conf/train_asr_streaming_conformer.yaml), inference with any sim_chunk_length gives good WER.
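For what it's worth, a quick divisibility check (an illustration, not ESPnet code) connects this observation to the trimming fix discussed earlier in the thread: the default frontend divides evenly, while the failing setting from the commit message above does not:

```python
# win_length % hop_length == 0 means the floor- and ceil-based trimming
# expressions agree; sim_chunk_length % hop_length == 0 means a chunk leaves
# no residual samples to carry over. Values: the default config from this
# question and the failing setting reported above (win 256, hop 90, chunk 512).
for name, win, hop, chunk in [("default", 512, 128, 512), ("failing", 256, 90, 512)]:
    print(f"{name}: win % hop = {win % hop}, chunk % hop = {chunk % hop}")
# default: win % hop = 0,  chunk % hop = 0
# failing: win % hop = 76, chunk % hop = 62
```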
2
Moreover, if I want to train with the same number of frames as in Tsunoo, Emiru, Yosuke Kashiwagi, and Shinji Watanabe, "Streaming Transformer ASR with Blockwise Synchronous Beam Search," 2021 IEEE Spoken Language Technology Workshop (SLT), specifically {Nl, Nc, Nr} = {16, 16, 8} or {4, 8, 4}, I'm not sure whether setting the contextual block conformer encoder config to {block_size, hop_size, look_ahead} = {40, 16, 16} and {16, 4, 4} would be correct. Could you give some advice on the configuration?
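Not an authoritative answer, but here is one plausible mapping between the paper's {Nl, Nc, Nr} and the encoder parameters, written as a sketch. The mapping itself is an assumption (exactly what the question asks to confirm), and note that it yields {40, 16, 8} and {16, 8, 4} rather than the values proposed above:

```python
# Assumed mapping (hypothetical, not a confirmed ESPnet convention):
# Nl = left context, Nc = central frames, Nr = right (look-ahead) context.
def to_encoder_config(n_l, n_c, n_r):
    return {
        "block_size": n_l + n_c + n_r,  # total frames visible in one block
        "hop_size": n_c,                # the block advances by the central part
        "look_ahead": n_r,              # future frames inside the block
    }

print(to_encoder_config(16, 16, 8))  # {'block_size': 40, 'hop_size': 16, 'look_ahead': 8}
print(to_encoder_config(4, 8, 4))    # {'block_size': 16, 'hop_size': 8, 'look_ahead': 4}
```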
3
Lastly, if I reduce block_size from 40 to 16, should I also reduce cnn_module_kernel for the contextual block conformer? If I understand correctly, since a 1D CNN layer with kernel size cnn_module_kernel is applied to the features of each block, a cnn_module_kernel of 31 for block_size 16 would be too large.
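As a rough illustration of the concern (plain PyTorch, not ESPnet's actual convolution module, and the channel count is made up):

```python
import torch
import torch.nn as nn

# A depthwise 1-D convolution with kernel size 31 over a 16-frame block:
# it runs, but at every output position at least 15 of the 31 taps fall on
# zero padding, which supports using a smaller kernel for block_size 16.
block = torch.randn(1, 256, 16)  # (batch, channels, block_size=16)
conv = nn.Conv1d(256, 256, kernel_size=31, padding=15, groups=256)
print(conv(block).shape)  # torch.Size([1, 256, 16])
```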
Thanks, and hope you have a great day!