
Support Whisper-style training as a new task S2T #5120

Merged (72 commits, Sep 23, 2023)

Conversation

@pyf98 pyf98 (Collaborator) commented Apr 17, 2023

Hi, this PR adds a new task, s2t1 (speech-to-text), in ESPnet2 that follows OpenAI Whisper's training style. It has two major features:

  • It uses special tokens as task specifiers (e.g., transcribe, translate) or prediction targets (e.g., language ID) so that a single encoder-decoder model can perform multiple tasks for multiple languages.
  • It supports conditional generation where the condition is the previous sentence in a long talk.

The training data has the following format:

<sop> prev<sos><category><task><starttime1> utt1<endtime1><starttime2> utt2<endtime2><eos>

where <sop> is a special token denoting the start of the previous (prompt) sentence. The timestamps are also treated as special tokens, because the audio has a fixed length (30 s) and timestamp resolution (20 ms). An example looks like:

<sop> I'm going to talk today about energy and climate.<sos><en><transcribe><0.00> And that might seem a bit surprising, because my full-time work at the foundation is mostly about vaccines and seeds, about the things that we need to invent and deliver to help the poorest two billion live better lives.<14.12><15.36> But energy and climate are extremely important to these people; in fact, more important than to anyone else on the planet.<24.26><eos>
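
To make the target format concrete, here is a minimal sketch (an illustration under assumed token spellings, not code from this PR) that assembles such a sequence from a previous sentence and a list of timed utterances, rounding times to the 20 ms grid:

RESOLUTION = 0.02  # 20 ms timestamp resolution

def quantize(t: float) -> str:
    # Round a time in seconds to the 20 ms grid and render it as a token.
    q = round(t / RESOLUTION) * RESOLUTION
    return f"<{q:.2f}>"

def build_target(prev, category, task, utts):
    # utts: list of (start, end, text) triples within the 30 s window.
    prev_part = prev if prev else "<na>"
    body = "".join(f"{quantize(s)} {txt}{quantize(e)}" for s, e, txt in utts)
    return f"<sop> {prev_part}<sos><{category}><{task}>{body}<eos>"

Calling build_target("I'm going to talk today about energy and climate.", "en", "transcribe", ...) with the two timed sentences reproduces the example above.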

During data preparation, three text files are generated (a toy example follows the list):

  • text contains the normal target sentence, i.e., the text between <sos> and <eos>.
  • text.prev contains the previous sentence, i.e., the text between <sop> and <sos>. This might be unavailable at the beginning of a talk. In such cases, a special token <na> will be used.
  • text.ctc contains the ASR transcript without any special token, which is used for the CTC loss. For ASR utterances, this can be derived from text, but for ST utterances, this is in a different language. If the ASR transcription is not available, <na> will be used.
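
As a toy illustration (the utterance ID and truncated sentences are hypothetical), entries for the example above might look like:

text       talk1_0000000_0003000 <en><transcribe><0.00> And that might seem a bit surprising, ...<14.12><15.36> But energy and climate are extremely important ...<24.26>
text.prev  talk1_0000000_0003000 I'm going to talk today about energy and climate.
text.ctc   talk1_0000000_0003000 And that might seem a bit surprising, ... But energy and climate are extremely important ...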

For decoding, the model can perform utterance-level ASR or ST, which follows the same procedure as the standard tasks. It can also perform long-form ASR or ST based on the predicted timestamps.
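
As a rough sketch of the long-form procedure (hypothetical helper names, not the PR's actual implementation in espnet2/bin/s2t_inference.py): decode a 30 s window, read the last predicted end timestamp, and slide the window forward by that amount:

WINDOW = 30.0  # fixed audio window length in seconds

def longform_decode(model, audio, total_dur):
    # model.decode_window is a hypothetical interface returning a list of
    # (start, end, text) triples with window-relative timestamps.
    offset, results = 0.0, []
    while offset < total_dur:
        hyp = model.decode_window(audio, offset, offset + WINDOW)
        results.extend((offset + s, offset + e, txt) for s, e, txt in hyp)
        last_end = hyp[-1][1] if hyp else None
        offset += last_end if last_end and last_end > 0 else WINDOW
    return results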

@mergify mergify bot added the ESPnet2 label Apr 17, 2023
@sw005320 sw005320 added the New Features and ASR (Automatic speech recognition) labels Apr 17, 2023
@sw005320 sw005320 added this to the v.202303 milestone Apr 17, 2023
@sw005320
Copy link
Contributor

Very cool, @pyf98!
@pengchengguo and @ftshijt, can you review this PR?

@sw005320 sw005320 requested a review from ftshijt April 28, 2023 12:11
@ftshijt ftshijt (Collaborator) left a review comment

Thanks for the great effort! Since there is still some ongoing implementation, please let me know if the following are ready. Some comments follow:

Review threads: egs2/TEMPLATE/s2t1/s2t.sh (3), espnet2/bin/s2t_inference.py (3)
yield


class ESPnetS2TModel(AbsESPnetModel):
Collaborator:

Maybe we can inherit from the ASR model? That would be especially helpful for getting the latest updates from there.

Collaborator (Author):

This seems great. Let me check it. Thanks.

)


class S2TTask(AbsTask):
Collaborator:

Similar to the above, maybe inheriting the ASR task would be a good option?

Collaborator (Author):

Thanks. I will check it.

@kan-bayashi kan-bayashi modified the milestones: v.202303, v.202307 May 1, 2023
@mergify mergify bot added the README label Jul 29, 2023
@codecov codecov bot commented Jul 30, 2023

Codecov Report

Merging #5120 (8f70993) into master (a719135) will decrease coverage by 0.05%.
The diff coverage is 73.92%.

@@            Coverage Diff             @@
##           master    #5120      +/-   ##
==========================================
- Coverage   77.15%   77.10%   -0.05%     
==========================================
  Files         676      681       +5     
  Lines       61337    62288     +951     
==========================================
+ Hits        47327    48030     +703     
- Misses      14010    14258     +248     
Flag Coverage Δ
test_configuration_espnet2 ∅ <ø> (∅)
test_integration_espnet1 65.73% <ø> (ø)
test_integration_espnet2 48.90% <73.54%> (+0.56%) ⬆️
test_python_espnet1 20.07% <0.00%> (-0.32%) ⬇️
test_python_espnet2 52.28% <64.14%> (+0.18%) ⬆️
test_utils 23.10% <ø> (ø)

Flags with carried-forward coverage won't be shown.

Files Changed Coverage Δ
espnet2/asr/encoder/transformer_encoder.py 92.22% <ø> (ø)
espnet2/train/preprocessor.py 76.48% <28.81%> (-3.25%) ⬇️
espnet2/bin/s2t_inference.py 69.84% <69.84%> (ø)
espnet2/bin/s2t_inference_language.py 70.25% <70.25%> (ø)
espnet2/s2t/espnet_model.py 85.55% <85.55%> (ø)
espnet2/tasks/s2t.py 86.88% <86.88%> (ø)
espnet2/bin/pack.py 96.49% <100.00%> (+0.19%) ⬆️
espnet2/bin/s2t_train.py 100.00% <100.00%> (ø)


@pyf98 pyf98 changed the title [WIP] Support Whisper-style training as a new task S2T Support Whisper-style training as a new task S2T Jul 30, 2023
@mergify mergify bot added the CI Travis, Circle CI, etc label Jul 30, 2023
@kan-bayashi kan-bayashi modified the milestones: v.202307, v.202312 Aug 3, 2023
@pyf98 pyf98 (Collaborator, Author) commented Sep 22, 2023

Hi @sw005320 , can we merge this PR now (given the ASRU result)? Thanks!

@sw005320 sw005320 (Contributor) left a review comment

It's almost ready, but I just want to check a few points.

Contributor:

What kind of statistics are you getting, and what are they used for?

Collaborator (Author):

It shows the number of hours for every language pair in the specified data directory. It is not used for training or decoding; it is just a separate utility tool.
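
A minimal sketch of such a utility (assuming a Kaldi-style utt2dur file plus a hypothetical per-utterance language-pair file; not the PR's actual script):

from collections import defaultdict

def hours_per_pair(utt2dur_path, utt2pair_path):
    # utt2dur: "uttid duration_in_seconds"; utt2pair: "uttid src_lang tgt_lang".
    dur = {}
    with open(utt2dur_path) as f:
        for line in f:
            utt, d = line.split()
            dur[utt] = float(d)
    hours = defaultdict(float)
    with open(utt2pair_path) as f:
        for line in f:
            utt, src, tgt = line.split()
            hours[(src, tgt)] += dur.get(utt, 0.0) / 3600.0
    return dict(hours)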

Contributor:

It seems that this part does not have good test coverage. Are there specific reasons?

Collaborator (Author):

One reason is that we cannot easily test the long-form decoding, because it relies on the predicted timestamps. With a random input or a random model, we cannot generate structured output, which makes the decoding hang forever.

Contributor:

What is the difference between s2t_inference and this?

Collaborator (Author):

s2t_inference requires a language token and performs the full decoding. Here, it only predicts the language token, which means we set the output length to exactly 1.
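
Conceptually (a hedged sketch, not the PR's code; the decoder signature and token-id inputs are assumptions), language identification is one decoder step restricted to the language-token sub-vocabulary:

import torch

def identify_language(decoder, enc_out, sos_id, lang_token_ids):
    # lang_token_ids: 1-D LongTensor of vocabulary ids for <en>, <zh>, ...
    ys = torch.tensor([[sos_id]])          # decoder input is just <sos>
    logits = decoder(ys, enc_out)[:, -1]   # next-token scores (assumed signature)
    best = logits[:, lang_token_ids].argmax(dim=-1)
    return lang_token_ids[best]            # id of the predicted language token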

Contributor:

Would it be possible to add an option that sets the output length to 1, so that this can be realized with s2t_inference? If there are any other changes, we can leave it as it is.

@@ -1750,3 +1751,180 @@ def __call__(
data = self._speech_process(data)

return data


class S2TPreprocessor(CommonPreprocessor):
Contributor:

Can you leave some comments on how it differs from the CommonPreprocessor?
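
For intuition only (a hedged sketch of the difference implied by the PR description; names and fields are assumptions, not the actual implementation), the S2T preprocessor additionally consumes text.prev and text.ctc and stitches in the special tokens:

def s2t_preprocess(data):
    # On top of the common text/speech processing, an S2T example carries
    # the previous-sentence prompt and a special-token-free CTC transcript.
    prev = data.get("text_prev") or "<na>"   # text between <sop> and <sos>
    data["decoder_target"] = f"<sop> {prev}<sos>{data['text']}<eos>"
    data["ctc_target"] = data.get("text_ctc") or "<na>"
    return data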

@sw005320 sw005320 merged commit 8a8709e into espnet:master Sep 23, 2023
25 of 26 checks passed
@sw005320 sw005320 (Contributor) commented

Thanks for your great efforts. I just merged this PR. However, it lacks documentation, so please add documents in the following places:

  1. Following https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/*/README.md, please add a document describing the details of this new template.
  2. Please add a brief explanation to the main document: https://github.com/espnet/espnet/blob/master/README.md
