Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix a small issue in OWSM decode_long #5703

Merged
merged 6 commits into from
Mar 22, 2024
Merged

Conversation

jctian98
Copy link
Contributor

@jctian98 jctian98 commented Mar 15, 2024

Why?

The last time-stamp prediction in the last chunk is slightly smaller than the whole speech length, making to an extra chunk that usually has < 1000 samples. The decoding behavior of this very short chunk leads to unpredictable behavior.

Remove one redundant line.

@mergify mergify bot added the ESPnet2 label Mar 15, 2024
@sw005320 sw005320 requested a review from pyf98 March 15, 2024 03:12
@sw005320 sw005320 added the OWSM Open Whisper-style Speech Model label Mar 15, 2024
@sw005320 sw005320 added this to the v.202405 milestone Mar 15, 2024
Copy link

codecov bot commented Mar 15, 2024

Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 76.60%. Comparing base (d004740) to head (d52dcb3).
Report is 63 commits behind head on master.

Files Patch % Lines
espnet2/bin/s2t_inference.py 0.00% 7 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5703       +/-   ##
===========================================
+ Coverage   23.30%   76.60%   +53.30%     
===========================================
  Files         746      761       +15     
  Lines       69369    69880      +511     
===========================================
+ Hits        16163    53534    +37371     
+ Misses      53206    16346    -36860     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 62.92% <ø> (ø)
test_integration_espnet2 48.84% <0.00%> (?)
test_integration_espnetez 27.98% <ø> (?)
test_python_espnet1 18.20% <0.00%> (-0.12%) ⬇️
test_python_espnet2 52.41% <0.00%> (?)
test_python_espnetez 13.95% <0.00%> (?)
test_utils 20.91% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jctian98
Copy link
Contributor Author

The Codecov fails since we don't have a test case for decode_long

Maybe it's hard to have a valid test case for decode_long, as it should properly predict the time stamps. However, for test cases, it cannot be easy. What do you think? @pyf98

@sw005320
Copy link
Contributor

OK, for the test code coverage.
If @pyf98 is OK for this change, I can merge it.

Where does 0.2 come from?
It would be safe to make it as a variable.

@@ -575,6 +575,12 @@ def decode_long(
text_prev = init_text
while offset < len(speech):
logging.info(f"Current start time in seconds: {offset / fs:.2f}")
if offset + segment_len > len(speech) and len(segment) / fs < 0.2:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be placed after segment = speech[offset : offset + segment_len]? Otherwise, segment is the previous one.
Also, we can make 0.2 as an argument in this function.
I think we do not need offset + segment_len > len(speech). If segment is shorter than the segment length, this condition must be true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake during copy-paste. Thanks!

@@ -575,6 +575,12 @@ def decode_long(
text_prev = init_text
while offset < len(speech):
logging.info(f"Current start time in seconds: {offset / fs:.2f}")
if offset + segment_len > len(speech) and len(segment) / fs < 0.2:
logging.warning(
f"Skip the last clip as it's too short: {len(segment)/ fs:.2f}s"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the style consistent. Now there is a space after / but no space before it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can change "clip" to "segment" or "chunk"

@pyf98
Copy link
Collaborator

pyf98 commented Mar 15, 2024

The issue addressed here is:
When processing the last segment in a long audio, the end timestamp can be slightly shorter or longer than the actual speech length due to discretization (timestamps are multiples of 20ms).
If the prediction is longer, the decoding will terminate and there is no problem.
If the prediction is shorter, we need to decode one more segment which can be very short. We can skip it.

@jctian98
Copy link
Contributor Author

It should be okay for review again.

segment = speech[offset : offset + segment_len]
if (
offset + segment_len > len(speech)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this condition? This is equivalent to len(segment) < segment_len, where segment_len is much larger than our threshold in the next line. So, I think this condition will be satisfied automatically if the next condition is true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. Didn't notice that if len(segment) < segment_len, that's the last chunk.

@@ -188,6 +188,7 @@ def __init__(
lang_sym: str = "<eng>",
task_sym: str = "<asr>",
predict_time: bool = False,
skip_last_chunk_threshold: float = 0.2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking adding this argument in decode_long only, since it is specific to long-form decoding but never used in short-form __call__. If we remove it from __init__, we can also remove the argument in inference and argparser.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised.

@pyf98
Copy link
Collaborator

pyf98 commented Mar 19, 2024

Thanks @jctian98!
Sorry for asking more, but I think we can add the argument only in decode_long. This is more consistent with the design of other arguments.

@jctian98
Copy link
Contributor Author

fixed. Please review.

@@ -531,6 +531,7 @@ def decode_long(
end_time_threshold: str = "<29.00>",
lang_sym: Optional[str] = None,
task_sym: Optional[str] = None,
skip_last_chunk_threshold=0.2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One final suggestion: let's add a type hint here: float = 0.2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember it exists in previous versions

@pyf98
Copy link
Collaborator

pyf98 commented Mar 21, 2024

LGTM

@sw005320
Copy link
Contributor

@jctian98
Copy link
Contributor Author

It fails on test-shell but this PR doesn't change shell scripts.
Can we re-run the CI test?

@jctian98
Copy link
Contributor Author

CI fails again due to a time-out error.

@sw005320 sw005320 merged commit dbd73dd into espnet:master Mar 22, 2024
34 of 35 checks passed
@sw005320
Copy link
Contributor

Thanks, @jctian98!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bugfix ESPnet2 OWSM Open Whisper-style Speech Model
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants