
Support Whisper-style training as a new task S2T #5120

Merged (72 commits, Sep 23, 2023)

Conversation

@pyf98 pyf98 (Collaborator) commented Apr 17, 2023

Hi, this PR adds a new task, s2t1 (speech-to-text), in ESPnet2 that follows OpenAI Whisper's training style. It has two major features:

  • It uses special tokens as task specifiers (e.g., transcribe, translate) or prediction targets (e.g., language ID) so that a single encoder-decoder model can perform multiple tasks for multiple languages.
  • It supports conditional generation where the condition is the previous sentence in a long talk.

The training data has the following format:

<sop> prev<sos><category><task><starttime1> utt1<endtime1><starttime2> utt2<endtime2><eos>

where <sop> is a special token denoting the start of the previous (prompt) sentence. The timestamps are also treated as special tokens, because the audio has a fixed length (30 s) and timestamp resolution (20 ms). An example looks like:

<sop> I'm going to talk today about energy and climate.<sos><en><transcribe><0.00> And that might seem a bit surprising, because my full-time work at the foundation is mostly about vaccines and seeds, about the things that we need to invent and deliver to help the poorest two billion live better lives.<14.12><15.36> But energy and climate are extremely important to these people; in fact, more important than to anyone else on the planet.<24.26><eos>
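
To make the target format concrete, here is a minimal sketch (an illustration under assumed token spellings, not code from this PR) that assembles such a sequence from a previous sentence and a list of timed utterances, rounding times to the 20 ms grid:

RESOLUTION = 0.02  # 20 ms timestamp resolution

def quantize(t: float) -> str:
    # Round a time in seconds to the 20 ms grid and render it as a token.
    q = round(t / RESOLUTION) * RESOLUTION
    return f"<{q:.2f}>"

def build_target(prev, category, task, utts):
    # utts: list of (start, end, text) triples within the 30 s window.
    prev_part = prev if prev else "<na>"
    body = "".join(f"{quantize(s)} {txt}{quantize(e)}" for s, e, txt in utts)
    return f"<sop> {prev_part}<sos><{category}><{task}>{body}<eos>"

Calling build_target("I'm going to talk today about energy and climate.", "en", "transcribe", ...) with the two timed sentences reproduces the example above.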

During data preparation, three text files are generated (a toy example follows the list):

  • text contains the normal target sentence, i.e., the text between <sos> and <eos>.
  • text.prev contains the previous sentence, i.e., the text between <sop> and <sos>. This might be unavailable at the beginning of a talk. In such cases, a special token <na> will be used.
  • text.ctc contains the ASR transcript without any special token, which is used for the CTC loss. For ASR utterances, this can be derived from text, but for ST utterances, this is in a different language. If the ASR transcription is not available, <na> will be used.
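
As a toy illustration (the utterance ID and truncated sentences are hypothetical), entries for the example above might look like:

text       talk1_0000000_0003000 <en><transcribe><0.00> And that might seem a bit surprising, ...<14.12><15.36> But energy and climate are extremely important ...<24.26>
text.prev  talk1_0000000_0003000 I'm going to talk today about energy and climate.
text.ctc   talk1_0000000_0003000 And that might seem a bit surprising, ... But energy and climate are extremely important ...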

For decoding, the model can perform utterance-level ASR or ST, which follows the same procedure as the standard tasks. It can also perform long-form ASR or ST based on the predicted timestamps.
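
As a rough sketch of the long-form procedure (hypothetical helper names, not the PR's actual implementation in espnet2/bin/s2t_inference.py): decode a 30 s window, read the last predicted end timestamp, and slide the window forward by that amount:

WINDOW = 30.0  # fixed audio window length in seconds

def longform_decode(model, audio, total_dur):
    # model.decode_window is a hypothetical interface returning a list of
    # (start, end, text) triples with window-relative timestamps.
    offset, results = 0.0, []
    while offset < total_dur:
        hyp = model.decode_window(audio, offset, offset + WINDOW)
        results.extend((offset + s, offset + e, txt) for s, e, txt in hyp)
        last_end = hyp[-1][1] if hyp else None
        offset += last_end if last_end and last_end > 0 else WINDOW
    return results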

@mergify mergify bot added the ESPnet2 label Apr 17, 2023
@sw005320 sw005320 added the New Features and ASR (Automatic speech recognition) labels Apr 17, 2023
@sw005320 sw005320 added this to the v.202303 milestone Apr 17, 2023
@sw005320
Copy link
Contributor

Very cool, @pyf98!
@pengchengguo and @ftshijt, can you review this PR?

@sw005320 sw005320 requested a review from ftshijt April 28, 2023 12:11
@ftshijt ftshijt (Collaborator) left a review comment

Thanks for the great effort! Since there is still some ongoing implementation, please let me know if the following are ready. Some comments follow:

Review threads: egs2/TEMPLATE/s2t1/s2t.sh (3), espnet2/bin/s2t_inference.py (3)
yield


class ESPnetS2TModel(AbsESPnetModel):
Collaborator:

Maybe we can inherit from the ASR model? That would be especially helpful for getting the latest updates from there.

Collaborator (Author):

This seems great. Let me check it. Thanks.

)


class S2TTask(AbsTask):
Collaborator:

Similar to the above, maybe inheriting the ASR task would be a good option?

Collaborator (Author):

Thanks. I will check it.

@kan-bayashi kan-bayashi modified the milestones: v.202303, v.202307 May 1, 2023
@mergify mergify bot added the README label Jul 29, 2023
@codecov codecov bot commented Jul 30, 2023

Codecov Report

Merging #5120 (8f70993) into master (a719135) will decrease coverage by 0.05%.
The diff coverage is 73.92%.

@@            Coverage Diff             @@
##           master    #5120      +/-   ##
==========================================
- Coverage   77.15%   77.10%   -0.05%     
==========================================
  Files         676      681       +5     
  Lines       61337    62288     +951     
==========================================
+ Hits        47327    48030     +703     
- Misses      14010    14258     +248     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 65.73% <ø> (ø)
test_integration_espnet2 48.90% <73.54%> (+0.56%) ⬆️
test_python_espnet1 20.07% <0.00%> (-0.32%) ⬇️
test_python_espnet2 52.28% <64.14%> (+0.18%) ⬆️
test_utils 23.10% <ø> (ø)

Flags with carried-forward coverage won't be shown.

Files Changed Coverage Δ
espnet2/asr/encoder/transformer_encoder.py 92.22% <ø> (ø)
espnet2/train/preprocessor.py 76.48% <28.81%> (-3.25%) ⬇️
espnet2/bin/s2t_inference.py 69.84% <69.84%> (ø)
espnet2/bin/s2t_inference_language.py 70.25% <70.25%> (ø)
espnet2/s2t/espnet_model.py 85.55% <85.55%> (ø)
espnet2/tasks/s2t.py 86.88% <86.88%> (ø)
espnet2/bin/pack.py 96.49% <100.00%> (+0.19%) ⬆️
espnet2/bin/s2t_train.py 100.00% <100.00%> (ø)


@pyf98 pyf98 changed the title [WIP] Support Whisper-style training as a new task S2T Support Whisper-style training as a new task S2T Jul 30, 2023
@mergify mergify bot added the CI Travis, Circle CI, etc label Jul 30, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023
@pyf98 pyf98 (Collaborator, Author) commented Sep 22, 2023

Hi @sw005320 , can we merge this PR now (given the ASRU result)? Thanks!

@sw005320 sw005320 (Contributor) left a review comment

It's almost ready, but I just want to check a few points.

Contributor:

What kind of statistics are you getting, and what are they used for?

Collaborator (Author):

It shows the number of hours for every language pair in the specified data directory. It is not used for training or decoding; it is just a separate utility tool.
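
A minimal sketch of such a utility (assuming a Kaldi-style utt2dur file plus a hypothetical per-utterance language-pair file; not the PR's actual script):

from collections import defaultdict

def hours_per_pair(utt2dur_path, utt2pair_path):
    # utt2dur: "uttid duration_in_seconds"; utt2pair: "uttid src_lang tgt_lang".
    dur = {}
    with open(utt2dur_path) as f:
        for line in f:
            utt, d = line.split()
            dur[utt] = float(d)
    hours = defaultdict(float)
    with open(utt2pair_path) as f:
        for line in f:
            utt, src, tgt = line.split()
            hours[(src, tgt)] += dur.get(utt, 0.0) / 3600.0
    return dict(hours)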

Contributor:

It seems that this part does not have good test coverage. Are there specific reasons?

Collaborator (Author):

One reason is that we cannot easily test the long-form decoding, because it relies on the predicted timestamps. With a random input or a random model, we cannot generate structured output, which makes the decoding hang forever.

Contributor:

What is the difference between s2t_inference and this?

Collaborator (Author):

s2t_inference requires a language token and performs the full decoding. Here, it only predicts the language token, which means we set the output length to exactly 1.
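
Conceptually (a hedged sketch, not the PR's code; the decoder signature and token-id inputs are assumptions), language identification is one decoder step restricted to the language-token sub-vocabulary:

import torch

def identify_language(decoder, enc_out, sos_id, lang_token_ids):
    # lang_token_ids: 1-D LongTensor of vocabulary ids for <en>, <zh>, ...
    ys = torch.tensor([[sos_id]])          # decoder input is just <sos>
    logits = decoder(ys, enc_out)[:, -1]   # next-token scores (assumed signature)
    best = logits[:, lang_token_ids].argmax(dim=-1)
    return lang_token_ids[best]            # id of the predicted language token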

Contributor:

Would it be possible to add an option that sets the output length to 1, so that this can be realized with s2t_inference? If there are any other changes, we can leave it as it is.

@@ -1750,3 +1751,180 @@ def __call__(
data = self._speech_process(data)

return data


class S2TPreprocessor(CommonPreprocessor):
Contributor:

Can you leave some comments on how it differs from the CommonPreprocessor?
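
For intuition only (a hedged sketch of the difference implied by the PR description; names and fields are assumptions, not the actual implementation), the S2T preprocessor additionally consumes text.prev and text.ctc and stitches in the special tokens:

def s2t_preprocess(data):
    # On top of the common text/speech processing, an S2T example carries
    # the previous-sentence prompt and a special-token-free CTC transcript.
    prev = data.get("text_prev") or "<na>"   # text between <sop> and <sos>
    data["decoder_target"] = f"<sop> {prev}<sos>{data['text']}<eos>"
    data["ctc_target"] = data.get("text_ctc") or "<na>"
    return data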

@sw005320 sw005320 merged commit 8a8709e into espnet:master Sep 23, 2023
25 of 26 checks passed
@sw005320 sw005320 (Contributor) commented

Thanks for your great efforts. I just merged this PR. However, it lacks documentation, so please add documents in the following places:

  1. Following https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/*/README.md, please add a document describing the details of this new template.
  2. Please add a brief explanation to the main document: https://github.com/espnet/espnet/blob/master/README.md
