Support Whisper-style training as a new task S2T #5120
Conversation
Very cool, @pyf98!
Thanks for the great effort! Since some implementation is still ongoing, please let me know when the following items are ready. Some comments below:
egs2/must_c_v2/s2t1/conf/tuning/train_s2t_ebf_lr1e-3_warmup5k.yaml
`class ESPnetS2TModel(AbsESPnetModel):`
Maybe we can inherit the ASR model? That would be especially helpful for getting the latest updates from there.
This seems great. Let me check it. Thanks
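A minimal sketch of the inheritance idea (stand-in classes, not the actual ESPnet code): the S2T model would subclass the ASR model, reuse its encoder/decoder/CTC plumbing, and mainly extend the vocabulary with Whisper-style special tokens. Class names match the PR; the constructor signature and token counts are assumptions.

```python
class ESPnetASRModel:
    # Hypothetical stand-in for espnet2.asr.espnet_model.ESPnetASRModel.
    # Inheriting it would let ESPnetS2TModel pick up upstream ASR fixes.
    def __init__(self, vocab_size: int):
        self.vocab_size = vocab_size

    def forward(self, speech, text):
        raise NotImplementedError  # encoder/decoder/CTC losses live here


class ESPnetS2TModel(ESPnetASRModel):
    """S2T mainly adds Whisper-style special tokens on top of ASR."""

    def __init__(self, vocab_size: int, num_timestamp_tokens: int = 1501):
        # Timestamps (<0.00> ... <30.00> at 20 ms) plus <sop>/<na> become
        # extra vocabulary entries; the exact counts here are illustrative.
        super().__init__(vocab_size + num_timestamp_tokens + 2)
```

With this layout, everything not overridden in `ESPnetS2TModel` stays in sync with the ASR implementation automatically.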
`class S2TTask(AbsTask):`
Similarly, maybe inheriting the ASR task would be a good option?
Thanks. Will check it.
Codecov Report
```
@@            Coverage Diff             @@
##           master    #5120      +/-   ##
==========================================
- Coverage   77.15%   77.10%   -0.05%
==========================================
  Files         676      681       +5
  Lines       61337    62288     +951
==========================================
+ Hits        47327    48030     +703
- Misses      14010    14258     +248
```
Hi @sw005320, can we merge this PR now (given the ASRU result)? Thanks!
It's almost ready, but I just want to check a few points.
egs2/mixed_v3/s2t1/local/stats.py
What kind of statistics are you getting, and what are they used for?
It shows the number of hours for each language pair in the specified data directory. It is not used for training or decoding; it is just a standalone utility.
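As a rough sketch of such a utility (the function name and the "utt-id language-pair" mapping format are assumptions, not taken from the actual `stats.py`): sum per-utterance durations grouped by language pair.

```python
from collections import defaultdict


def hours_per_lang_pair(utt2dur_lines, utt2lang_lines):
    """Sum durations per language pair, in hours.

    `utt2dur_lines` are Kaldi-style "utt-id seconds" lines; the
    "utt-id src-tgt" mapping in `utt2lang_lines` is a hypothetical
    format used only for this illustration.
    """
    dur = {}
    for line in utt2dur_lines:
        utt, sec = line.split()
        dur[utt] = float(sec)
    hours = defaultdict(float)
    for line in utt2lang_lines:
        utt, pair = line.split()
        hours[pair] += dur.get(utt, 0.0) / 3600.0
    return dict(hours)
```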
It seems that this part does not have good test coverage.
Are there some specific reasons?
One reason is that we cannot easily test long-form decoding, which relies on the predicted timestamps. With a random input or a random model, we cannot generate structured output, so the decoding hangs forever.
What is the difference between s2t_inference and this?
s2t_inference requires a language token and performs the full decoding.
Here, it only predicts the language token; that is, we set the output length to exactly 1.
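The restricted decoding can be sketched as a single argmax over the language-token logits at the first decoder step (names and data shapes are illustrative stand-ins, not the actual s2t_inference API):

```python
def identify_language(first_step_logits, lang_token_ids):
    """Return the most likely language from one decoding step.

    `first_step_logits` is a sequence of per-vocabulary scores and
    `lang_token_ids` maps language names to vocabulary indices; both are
    hypothetical. Setting the output length to 1 reduces full decoding
    to this single restricted argmax.
    """
    return max(lang_token_ids, key=lambda lang: first_step_logits[lang_token_ids[lang]])
```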
Would it be possible to add an option that sets the output length to 1, so that this could be realized with s2t_inference?
If there are other differences, we can leave it as it is.
`class S2TPreprocessor(CommonPreprocessor):`
Can you leave some comments on how it differs from `CommonPreprocessor`?
Thanks for your great efforts.
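One way the difference could be summarized (a hedged sketch with stand-in classes and assumed key names, not the actual ESPnet code): besides the usual text, the S2T preprocessor also has to handle the previous-sentence and CTC texts, mapping missing entries to the `<na>` token.

```python
class CommonPreprocessor:
    # Hypothetical stand-in for espnet2.train.preprocessor.CommonPreprocessor.
    def __init__(self, token2id):
        self.token2id = token2id

    def tokenize(self, text):
        return [self.token2id[tok] for tok in text.split()]


class S2TPreprocessor(CommonPreprocessor):
    """Additionally processes "text_prev" and "text_ctc" (names assumed)."""

    def __call__(self, data):
        out = dict(data)
        out["text"] = self.tokenize(data["text"])
        # The previous sentence may be missing at the start of a talk.
        out["text_prev"] = self.tokenize(data.get("text_prev") or "<na>")
        # Plain transcript for the CTC branch; <na> for ST-only utterances.
        out["text_ctc"] = self.tokenize(data.get("text_ctc") or "<na>")
        return out
```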
Hi, this PR adds a new task `s2t1` (speech-to-text) in ESPnet2, which follows OpenAI Whisper's training style. It has two major features.

First, the training data follows Whisper's format: the previous (prompt) sentence is placed between `<sop>` and `<sos>`, and the target sentence between `<sos>` and `<eos>`, where `<sop>` is a special token denoting the start of the previous sentence. The timestamps are also treated as special tokens, because the audio has a fixed length (30 s) and resolution (20 ms).

During data preparation, three text files are generated:

- `text` contains the normal target sentence, i.e., the text between `<sos>` and `<eos>`.
- `text.prev` contains the previous sentence, i.e., the text between `<sop>` and `<sos>`. This might be unavailable at the beginning of a talk; in such cases, a special token `<na>` is used.
- `text.ctc` contains the ASR transcript without any special tokens, which is used for the CTC loss. For ASR utterances, this can be derived from `text`, but for ST utterances, it is in a different language. If the ASR transcript is not available, `<na>` is used.

Second, for decoding, the model can perform utterance-level ASR or ST, following the same procedure as the standard tasks. It can also perform long-form ASR or ST based on the predicted timestamps.