AVSR recipes on LRS3 using pre-trained AV-HuBERT model #5456
Conversation
for more information, see https://pre-commit.ci
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5456      +/-   ##
==========================================
- Coverage   77.16%   77.06%   -0.11%
==========================================
  Files         684      685       +1
  Lines       62713    63223     +510
==========================================
+ Hits        48391    48720     +329
- Misses      14322    14503     +181
```

... and 8 files with indirect coverage changes
@@ -0,0 +1,1092 @@
# Copyright 2023
Many thanks for making this work by adapting the original fairseq codebase here. My main concern is that the fairseq style is not very compatible with ours (it introduces a lot of module-related code, which makes the design a bit redundant).
I feel that there would be two better solutions:
- Option 1: follow the style of https://github.com/ms-dot-k/espnet/blob/master/espnet2/asr/encoder/wav2vec2_encoder.py, which directly imports fairseq?
- Option 2: add an entry to s3prl (which we usually use for SSL, and which is good at holding different SSL models): https://github.com/s3prl/s3prl
I'm open to discussion. Maybe also consult @simpleoier and @sw005320 for their thoughts on it.
Yes, that's true.
Currently, AV-HuBERT is not supported by official fairseq, so we cannot import it directly from fairseq.
Therefore, we need to define many modules in avhubert_encoder.py.
I think it's better to copy it rather than import it from fairseq.
We had a lot of issues when importing fairseq.
Also, this approach allows us to edit the model directly (e.g., adding adapters).
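As a hypothetical illustration of the kind of edit that becomes easy once the model code is copied in-tree, a bottleneck adapter after an encoder layer could look like the minimal sketch below. All names and shapes here are assumptions for illustration, not part of this PR:

```python
import numpy as np

def adapter_forward(x, w_down, w_up):
    """Hypothetical bottleneck adapter: down-project, ReLU,
    up-project, then add a residual connection."""
    h = np.maximum(x @ w_down, 0.0)  # [T, D] @ [D, d] -> [T, d]
    return x + h @ w_up              # residual add keeps the [T, D] shape

# With a zero-initialized up-projection the adapter starts as an identity,
# so inserting it does not perturb the pre-trained model at step 0.
x = np.random.randn(5, 8).astype(np.float32)
w_down = np.random.randn(8, 2).astype(np.float32)
w_up = np.zeros((2, 8), dtype=np.float32)
print(np.allclose(adapter_forward(x, w_down, w_up), x))  # True
```

Making this kind of change inside an imported fairseq package would require monkey-patching; with a copied encoder it is a normal edit.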
Thanks for your efforts!
I left some comments.
Co-authored-by: xuankai@cmu <simpleoier@users.noreply.github.com>
Co-authored-by: Jiatong <728307998@qq.com>
Thank you for reviewing the code, @ftshijt and @simpleoier. I modified the code according to your comments.
LGTM.
Maybe we could add a few more comments to this recipe in egs2/lrs3/avsr1/README.md (e.g., how to extract video features, how to fuse them, how to feed them to ESPnet (through the Kaldi ark format?), and how to use AV-HuBERT).
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test|1321|9890|98.5|1.1|0.4|0.2|1.7|8.8|
This is very good.
Is it an expected number?
Is the performance on this task already saturated?
This performance is obtained in a clean audio-visual environment.
The originally reported performance of AV-HuBERT is 1.4% WER; the small gap may come from the augmentation part.
feats - numpy.ndarray of shape [T, F]
stack_order - int (number of neighboring frames to concatenate)
Returns:
feats - numpy.ndarray of shape [T', F']
Would F' be changed from F?
If not, maybe [T', F]?
The log fbank has 100 fps.
To fuse it with the video stream (25 fps), they stack 4 contiguous frames along the spectral dimension.
So F' is 4*F and T' is T/4.
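The frame stacking described above can be sketched as follows. This is a paraphrase of the stacking helper (zero-padding T to a multiple of stack_order before reshaping); the exact function name and padding details in the recipe may differ:

```python
import numpy as np

def stacker(feats, stack_order):
    """Concatenate every `stack_order` neighboring frames along the
    feature axis: [T, F] -> [ceil(T / stack_order), F * stack_order]."""
    feat_dim = feats.shape[1]
    if len(feats) % stack_order != 0:
        # zero-pad so T becomes divisible by stack_order
        res = stack_order - len(feats) % stack_order
        pad = np.zeros([res, feat_dim], dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    return feats.reshape(-1, stack_order * feat_dim)

# 100 fps log-fbank stacked by 4 aligns with 25 fps video frames
fbank = np.random.randn(103, 80).astype(np.float32)
print(stacker(fbank, 4).shape)  # (26, 320): T' = T/4, F' = 4*F
```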
Oh, got it!
Maybe you can add these descriptions here?
Sure
We should add a test to cover this part.
Please get instructions from @ftshijt or @simpleoier about it.
Hmm, I see that you added a test.
I don't know why, but from my view it looks like this code is not covered by the test.
I'll check it again later.
Thanks a lot for your great efforts! The last remaining item is a follow-up PR to add this great new function to https://github.com/espnet/espnet/blob/master/README.md
What?
A new recipe for training an audio-visual speech recognition model on the LRS3 dataset, added at egs2/lrs3/avsr1.
The recipe uses features dumped with the pre-trained frontend of AV-HuBERT.
Therefore, the dumped features are fused audio-visual features.
Why?
Create a training recipe for AVSR using the AV-HuBERT model.
See also
Since we use dumped features, we cannot apply some of the data augmentations originally used in AV-HuBERT,
for example, audio noise modeling and random video cropping and flipping.
Instead, the recipe applies random time masking to the audio-visual features. Moreover, to avoid overfitting, it uses a lighter transformer decoder when the AV-HuBERT Large configuration is used.
(The original AV-HuBERT Large decoder uses 8 heads with 9 layers, while this recipe uses 4 heads with 6 layers.)
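A minimal sketch of random time masking applied to the dumped [T, F] audio-visual features. This is an assumption of how such masking could be implemented (function name, mask count, and width are illustrative), not the recipe's exact code:

```python
import numpy as np

def time_mask(feats, num_masks=2, max_width=10, rng=None):
    """Zero out `num_masks` random time spans of a [T, F] feature matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    out = feats.copy()
    T = out.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))   # span length in frames
        start = int(rng.integers(0, T - width + 1))   # span start position
        out[start:start + width, :] = 0.0             # mask all feature dims
    return out

# e.g. masking fused AV features of 100 frames x 320 dims
feats = np.ones((100, 320), dtype=np.float32)
masked = time_mask(feats, rng=np.random.default_rng(0))
print(masked.shape)  # (100, 320)
```

Masking whole time spans of the fused features loosely stands in for the waveform- and video-level augmentations that are no longer possible after dumping.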