
Improving OWSM inference interface #5618

Merged: 14 commits merged into espnet:master on Jan 19, 2024
Conversation

@pyf98 (Collaborator) commented Jan 10, 2024

What?

  • This PR improves the interface for OWSM inference.
    • The speech is automatically padded or trimmed to the fixed length used during training.
    • Some attributes, such as the speech length and time symbols, can be retrieved from preprocessor_conf, so we no longer need to provide them manually.
    • lang_sym, task_sym and predict_time can be passed as additional arguments to __call__, overriding the default values set in __init__. This is more convenient to use.
    • Many redundant arguments are removed from s2t_inference_language, and BeamSearch is no longer used; only a single decoder forward step is performed, which returns an N-best list of languages with (normalized) probabilities.
  • It also implements some simple rules to suppress timestamps, similar to Whisper. The implementation defines a new scorer that assigns certain tokens a -inf score during search. With this modification, the model can now predict the first timestamp by itself; previously, the first timestamp had to be set manually, which was usually inaccurate.
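For reference, here is a rough usage sketch of the new interface (editor's illustration: the model tag, symbol values, and result layout are assumptions, not verbatim from this PR; see espnet2/bin/s2t_inference.py for the actual signature):

```python
# Rough sketch only: model tag, symbols, and result layout are assumptions.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",   # hypothetical model tag
    lang_sym="<eng>",         # default language symbol set in __init__
    task_sym="<asr>",         # default task symbol set in __init__
    predict_time=False,
)

speech, rate = sf.read("long_audio.wav")
# Speech is padded or trimmed internally to the fixed length used in training,
# and speech-length / time-symbol settings are read from preprocessor_conf.

# Per-call arguments override the __init__ defaults:
results = s2t(speech, lang_sym="<deu>", task_sym="<st_eng>", predict_time=True)
text = results[0][0]  # best hypothesis text (exact tuple layout may differ)
print(text)
```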

mergify bot added the ESPnet2 label Jan 10, 2024
codecov bot commented Jan 10, 2024

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (3b2e0d3) 24.25% compared to head (8ee64f5) 76.06%.
Report is 121 commits behind head on master.

Files                                    Patch %   Lines
espnet2/bin/s2t_inference.py             67.39%    30 Missing ⚠️
espnet2/bin/s2t_inference_language.py    66.66%    11 Missing ⚠️
@@             Coverage Diff             @@
##           master    #5618       +/-   ##
===========================================
+ Coverage   24.25%   76.06%   +51.81%     
===========================================
  Files         720      735       +15     
  Lines       66641    68696     +2055     
===========================================
+ Hits        16161    52255    +36094     
+ Misses      50480    16441    -34039     
Flag                         Coverage Δ
test_configuration_espnet2   ∅ <ø> (∅)
test_integration_espnet1     62.92% <ø> (+<0.01%) ⬆️
test_integration_espnet2     48.11% <57.60%> (?)
test_python_espnet1          18.51% <0.00%> (-0.57%) ⬇️
test_python_espnet2          52.84% <56.00%> (?)
test_utils                   22.15% <ø> (?)

Flags with carried forward coverage won't be shown.


@sw005320 added the OWSM (Open Whisper-style Speech Model) and Enhancement labels Jan 10, 2024
@sw005320 added this to the v.202312 milestone Jan 10, 2024
@sw005320 (Contributor) commented:
Thanks, @pyf98!
@jctian98, can you review this PR?

@pyf98 (Collaborator, Author) commented Jan 15, 2024

Hi, can we consider merging this soon? I will provide example usage on the project page based on this new interface.

As reported previously, we can significantly improve long-form ASR performance with these rules and heuristics:
TEDLIUM2 WER: 8.5 -> 5.7 (greedy)

@sw005320 (Contributor) commented:
Yes, but CI is currently broken, so we're looking into it.
Sorry about that.

@ftshijt (Collaborator) commented Jan 16, 2024

@sw005320 The CI has passed after the previous issue was fixed. Should be good to go ahead.

@pyf98 (Collaborator, Author) commented Jan 16, 2024

Thanks! It seems that some TTS tests failed.

@sw005320 (Contributor) commented:
@jctian98, this is a reminder.
Can you review this PR?

@@ -44,6 +43,105 @@
]


class ScoreFilter(BatchScorerInterface, torch.nn.Module):
"""Filter scores based on pre-defined rules."""
Review comment (Contributor):
Can you add more explanations here or in other places about what kind of pre-defined rules?

Reply (Collaborator, Author):
I added some more comments below
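As a rough illustration (editor's sketch, not the actual ScoreFilter code): the kind of pre-defined rule being described is a scorer that adds -inf to disallowed tokens at each step so the search can never select them. The token IDs and the concrete timestamp rule below are placeholders.

```python
# Illustrative sketch only, not the actual ScoreFilter implementation.
import torch

class SuppressTokens:
    """Additive scorer that forbids certain tokens by assigning -inf."""

    def __init__(self, vocab_size: int, first_time_id: int, last_time_id: int):
        self.vocab_size = vocab_size
        self.first_time_id = first_time_id  # placeholder: id of the first timestamp token
        self.last_time_id = last_time_id    # placeholder: id of the last timestamp token

    def score(self, prev_token: int) -> torch.Tensor:
        scores = torch.zeros(self.vocab_size)
        if self.first_time_id <= prev_token <= self.last_time_id:
            # Example rule: after emitting a timestamp, forbid all earlier
            # timestamps so predicted times never move backwards.
            scores[self.first_time_id : prev_token + 1] = float("-inf")
        return scores  # added to the decoder log-probabilities during search
```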

@jctian98 (Contributor) commented Jan 19, 2024
The code is very clear. I'm ok with it.

A simple discussion for future development: do you think s2t_inference_language.py could be a special case of s2t_inference.py, as long as you pass a constant maxlen to beam search so that decoding one step means exactly predicting the lang_id? @pyf98

@pyf98 (Collaborator, Author) commented Jan 19, 2024

Thanks @jctian98 for the comment.

In the previous version of s2t_inference_language.py, I simply copied the code of s2t_inference.py and set max_len to 1, so that it predicts the language token given <sos>. This design has some issues:

  • It adds many redundant arguments, since we directly reuse s2t_inference.py, which is designed to predict a fully formatted sequence. In the new version, I removed those redundant arguments to avoid confusion. It also no longer requires the BeamSearch object.
  • The current BatchBeamSearch only supports batch_size=1. In the future, we may want to implement batched language detection, which will be much easier in the new version.

I think we can keep this separate design for now.
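For illustration, a rough sketch of what a single decoder forward step for language identification looks like (editor's sketch with placeholder names; not the actual s2t_inference_language.py code):

```python
# Illustrative sketch only: placeholder names, not the actual code.
import torch

def detect_language(decoder, enc_out, sos_id, lang_token_ids, nbest=3):
    """One decoder step from <sos>, returning an N-best list of (lang_id, prob)."""
    ys = torch.tensor([[sos_id]], dtype=torch.long)            # prefix: just <sos>
    logits = decoder(ys, enc_out)                              # hypothetical decoder call
    logp = torch.log_softmax(logits[0, -1], dim=-1)            # next-token log-probs
    lang_probs = torch.softmax(logp[lang_token_ids], dim=-1)   # renormalize over language tokens
    top = torch.topk(lang_probs, k=nbest)
    return [(int(lang_token_ids[i]), float(p)) for p, i in zip(top.values, top.indices)]
```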

@pyf98 (Collaborator, Author) commented Jan 19, 2024

One more note is that the order of tokens is: <sop> prompt<sos><lang><task><time>xxx<eos>.

The language token is between <sos> and <task>. So, we cannot predict <lang> and text in a single autoregressive decoding run. Instead, we need to first predict <lang> given <sos>, and then predict text given <sos><lang><task> where <task> is known.

In terms of implementation, we have to set both maxlen and hyp_primer differently for the two use cases. Merging the two use cases would make the code more complicated.
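Purely for illustration (editor's sketch; the names below are placeholders rather than the actual BeamSearch arguments), the two use cases roughly differ as follows:

```python
# Illustrative placeholders, not the actual BeamSearch options.

# 1) Language identification: decode exactly one step after <sos>.
lang_id_setup = {"hyp_primer": ["<sos>"], "maxlen": 1}

# 2) Text decoding: <lang> and <task> are already known and fixed in the prefix;
#    the model then predicts <time> (optionally) and text until <eos>.
text_setup = {"hyp_primer": ["<sos>", "<eng>", "<asr>"], "maxlen": None}
```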

@sw005320 merged commit 35c2e2b into espnet:master Jan 19, 2024
27 checks passed
@sw005320 (Contributor) commented:
Thanks, @pyf98!

@pyf98 deleted the owsm branch Jan 19, 2024 19:08