fix a small issue in OWSM decode_long #5703
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #5703 +/- ##
===========================================
+ Coverage 23.30% 76.60% +53.30%
===========================================
Files 746 761 +15
Lines 69369 69880 +511
===========================================
+ Hits 16163 53534 +37371
+ Misses 53206 16346 -36860
The Codecov fails since we don't have a test case for Maybe it's hard to have a valid test case for |
OK, for the test code coverage. Where does 0.2 come from? |
espnet2/bin/s2t_inference.py
Outdated
@@ -575,6 +575,12 @@ def decode_long(
        text_prev = init_text
        while offset < len(speech):
            logging.info(f"Current start time in seconds: {offset / fs:.2f}")
            if offset + segment_len > len(speech) and len(segment) / fs < 0.2:
Should this be placed after segment = speech[offset : offset + segment_len]? Otherwise, segment is the previous one.
Also, we can make 0.2 an argument of this function.
I think we do not need offset + segment_len > len(speech). If segment is shorter than the segment length, this condition must be true.
This is a mistake during copy-paste. Thanks!
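The ordering issue raised above can be sketched as follows. This is a simplified stand-in for the chunking loop, not the ESPnet implementation: `decode_long_sketch` and `decode_fn` are hypothetical names, and stopping with `break` on a too-short segment is an assumption made for illustration.

```python
import logging
from typing import Callable, List, Sequence, Tuple


def decode_long_sketch(
    speech: Sequence[float],
    fs: int,
    segment_len: int,
    decode_fn: Callable[[Sequence[float]], Tuple[str, int]],
    skip_last_chunk_threshold: float = 0.2,
) -> List[str]:
    """Hypothetical, simplified chunking loop (not the ESPnet code).

    decode_fn is an assumed callback returning (text, consumed_samples).
    """
    results: List[str] = []
    offset = 0
    while offset < len(speech):
        # Slice first, so the length check sees the current segment
        # rather than the one left over from the previous iteration.
        segment = speech[offset : offset + segment_len]
        if len(segment) / fs < skip_last_chunk_threshold:
            logging.warning(
                f"Skip the last segment as it's too short: {len(segment) / fs:.2f}s"
            )
            break
        text, consumed = decode_fn(segment)
        results.append(text)
        # Guard against a callback that reports no progress.
        offset += max(consumed, 1)
    return results
```

With a 30 s segment length at 16 kHz, a trailing remainder of 500 samples (about 0.03 s) would be skipped, while a remainder of 8000 samples (0.5 s) would still be decoded.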
espnet2/bin/s2t_inference.py
Outdated
@@ -575,6 +575,12 @@ def decode_long(
        text_prev = init_text
        while offset < len(speech):
            logging.info(f"Current start time in seconds: {offset / fs:.2f}")
            if offset + segment_len > len(speech) and len(segment) / fs < 0.2:
                logging.warning(
                    f"Skip the last clip as it's too short: {len(segment)/ fs:.2f}s"
Keep the style consistent. Now there is a space after / but no space before it.
We can change "clip" to "segment" or "chunk"
The issue addressed here is: |
It should be okay for review again. |
espnet2/bin/s2t_inference.py
Outdated
            segment = speech[offset : offset + segment_len]
            if (
                offset + segment_len > len(speech)
Do we need this condition? This is equivalent to len(segment) < segment_len, where segment_len is much larger than our threshold in the next line. So, I think this condition will be satisfied automatically if the next condition is true.
That's true. Didn't notice that if len(segment) < segment_len, that's the last chunk.
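The redundancy argument can be checked by brute force. The sketch below uses toy numbers and a hypothetical helper name, `short_implies_last`; it only illustrates the reasoning that a segment shorter than the threshold must also run past the end of the speech, provided the threshold (in samples) is smaller than `segment_len`.

```python
def short_implies_last(
    total_len: int, segment_len: int, fs: int, threshold: float
) -> bool:
    """Return True if every too-short segment also satisfies
    offset + segment_len > total_len (toy check, not ESPnet code)."""
    speech = range(total_len)  # stand-in for the waveform samples
    for offset in range(total_len):
        segment = speech[offset : offset + segment_len]
        is_short = len(segment) / fs < threshold
        runs_past_end = offset + segment_len > total_len
        if is_short and not runs_past_end:
            return False
    return True


# Toy setting: 200 samples, 50-sample segments, fs = 10 Hz, threshold 0.2 s
# (2 samples), which is much smaller than the segment length.
print(short_implies_last(200, 50, 10, 0.2))
```

Since slicing gives len(segment) = min(segment_len, len(speech) - offset), a segment shorter than segment_len can only occur when offset + segment_len exceeds len(speech), so the extra condition adds nothing.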
espnet2/bin/s2t_inference.py
Outdated
@@ -188,6 +188,7 @@ def __init__(
        lang_sym: str = "<eng>",
        task_sym: str = "<asr>",
        predict_time: bool = False,
        skip_last_chunk_threshold: float = 0.2,
I was thinking of adding this argument in decode_long only, since it is specific to long-form decoding but never used in short-form __call__. If we remove it from __init__, we can also remove the argument in inference and argparser.
Revised.
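A minimal sketch of that API shape, assuming a hypothetical class `Speech2TextSketch` (not the ESPnet class): the threshold is a keyword argument of `decode_long` only, so `__init__` and the short-form `__call__` stay unchanged. The fixed-step advance through the audio and the placeholder output are simplifications for illustration.

```python
from typing import List, Sequence


class Speech2TextSketch:
    """Hypothetical stand-in showing only where the option lives."""

    def __init__(self, lang_sym: str = "<eng>", task_sym: str = "<asr>"):
        # No skip_last_chunk_threshold here: short-form decoding never uses it.
        self.lang_sym = lang_sym
        self.task_sym = task_sym

    def __call__(self, speech: Sequence[float]) -> str:
        # Short-form path; returns a placeholder instead of real text.
        return f"{self.lang_sym}{self.task_sym} {len(speech)} samples"

    def decode_long(
        self,
        speech: Sequence[float],
        fs: int = 16000,
        segment_len: int = 480000,
        skip_last_chunk_threshold: float = 0.2,
    ) -> List[str]:
        # Long-form path: the threshold is scoped to this method only.
        outputs: List[str] = []
        for offset in range(0, len(speech), segment_len):
            segment = speech[offset : offset + segment_len]
            if len(segment) / fs < skip_last_chunk_threshold:
                break
            outputs.append(self(segment))
        return outputs
```

Callers that need a different cutoff pass it per call, for example `decode_long(speech, skip_last_chunk_threshold=0.5)`, without touching the constructor or any command-line parser.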
Thanks @jctian98! |
fixed. Please review. |
espnet2/bin/s2t_inference.py
Outdated
@@ -531,6 +531,7 @@ def decode_long(
        end_time_threshold: str = "<29.00>",
        lang_sym: Optional[str] = None,
        task_sym: Optional[str] = None,
        skip_last_chunk_threshold=0.2,
One final suggestion: let's add a type hint here: float = 0.2
I remember it exists in previous versions
LGTM
OK!
It fails on test-shell but this PR doesn't change shell scripts.
CI fails again due to a time-out error.
Thanks, @jctian98!
Why?
The last timestamp predicted in the last chunk is slightly smaller than the whole speech length, which leads to an extra chunk that usually has < 1000 samples. Decoding this very short chunk gives unpredictable results.
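A worked example of the arithmetic, using illustrative numbers that are assumptions rather than values from the PR: a 29.98 s utterance at 16 kHz whose last predicted timestamp is 29.94 s. The helper name `leftover_samples` is hypothetical.

```python
def leftover_samples(
    total_samples: int, last_timestamp_sec: float, fs: int = 16000
) -> int:
    """Samples remaining after advancing the offset to the last
    predicted timestamp (illustrative helper, not ESPnet code)."""
    next_offset = round(last_timestamp_sec * fs)
    return max(total_samples - next_offset, 0)


total = 479680  # 29.98 s of audio at 16 kHz (assumed example)
rest = leftover_samples(total, 29.94)
print(rest, rest / 16000)  # leftover samples and their duration in seconds
```

Here the leftover is 640 samples, about 0.04 s, which is below the 0.2 s default threshold, so the loop would skip it instead of decoding it as an extra chunk.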
Remove one redundant line.