AVSR recipes on LRS3 using pre-trained AV-HuBERT model #5456

Merged
merged 22 commits into from Oct 10, 2023

ms-dot-k
Contributor

What?

Adds a new recipe, egs2/lrs3/avsr1, for training an audio-visual speech recognition (AVSR) model on the LRS3 dataset.
The recipe trains on features dumped from the pre-trained AV-HuBERT frontends.
The dumped features are therefore already fused audio-visual features.

Why?

Create a training recipe for AVSR using the AV-HuBERT model.

See also

Because the recipe uses dumped features, some data augmentations originally used in AV-HuBERT cannot be applied,
for example audio noise modeling and random video cropping and flipping.
Instead, the recipe applies random time masking to the audio-visual features. Moreover, to avoid overfitting, it uses a lighter transformer decoder when the AV-HuBERT Large configuration is used.
(For the decoder, the original AV-HuBERT Large uses 8 heads with 9 layers, while the recipe uses 4 heads with 6 layers.)
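
As a rough illustration of random time masking on fused features, here is a minimal NumPy sketch. The function name `random_time_mask` and the parameters `max_width` and `num_masks` are hypothetical and chosen for illustration only; they are not the recipe's actual configuration, which this description does not specify.

```python
import numpy as np


def random_time_mask(feats, max_width=10, num_masks=2, rng=None):
    """Zero out random spans of frames in a [T, D] feature matrix.

    max_width and num_masks are illustrative assumptions, not values
    taken from the recipe.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = feats.copy()  # leave the caller's array untouched
    num_frames = masked.shape[0]
    for _ in range(num_masks):
        # Draw a span width in [0, max_width], capped at the sequence length.
        width = min(int(rng.integers(0, max_width + 1)), num_frames)
        # Draw a start index so that the span fits inside the sequence.
        start = int(rng.integers(0, num_frames - width + 1))
        masked[start:start + width, :] = 0.0
    return masked
```

Because masking acts only on the time axis of the fused features, it can be applied after dumping, unlike cropping or flipping, which need the raw video.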

codecov bot commented Sep 29, 2023

Codecov Report

Merging #5456 (870cd5f) into master (eef6523) will decrease coverage by 0.11%.
The diff coverage is 80.39%.

@@            Coverage Diff             @@
##           master    #5456      +/-   ##
==========================================
- Coverage   77.16%   77.06%   -0.11%     
==========================================
  Files         684      685       +1     
  Lines       62713    63223     +510     
==========================================
+ Hits        48391    48720     +329     
- Misses      14322    14503     +181     
|Flag|Coverage Δ|
|---|---|
|test_configuration_espnet2|∅ <ø> (∅)|
|test_integration_espnet1|65.68% <ø> (ø)|
|test_integration_espnet2|48.34% <26.62%> (-0.72%) ⬇️|
|test_python_espnet1|19.78% <0.00%> (-0.17%) ⬇️|
|test_python_espnet2|52.49% <80.39%> (+0.19%) ⬆️|
|test_utils|23.09% <ø> (ø)|

|Files|Coverage Δ|
|---|---|
|espnet2/tasks/asr.py|91.13% <100.00%> (+0.04%) ⬆️|
|espnet2/asr/encoder/avhubert_encoder.py|80.35% <80.35%> (ø)|

... and 8 files with indirect coverage changes

@sw005320 sw005320 added Recipe AV Audio visual processing labels Sep 30, 2023
@sw005320 sw005320 added this to the v.202312 milestone Sep 30, 2023
@@ -0,0 +1,1092 @@
# Copyright 2023

Collaborator

Many thanks for making this work by adapting the original fairseq codebase. My major concern is that the fairseq style is not very compatible with our style (it introduces a lot of module-related code, which makes the design a bit redundant).

I feel that there would be two better solutions:

Collaborator

I'm open to discussion. Maybe also ask @simpleoier and @sw005320 for their thoughts on it.

Contributor Author

Yes, that's true.
Currently, AV-HuBERT is not supported by the official fairseq, so we cannot import it directly from fairseq.
Therefore, we need to define many modules in avhubert_encoder.py.

Contributor

I think it's better to copy it instead of importing it from fairseq.
We had a lot of issues when importing fairseq.
Also, this approach allows us to edit the model directly (e.g., adding adapter, etc.)

Collaborator

@simpleoier simpleoier left a comment

Thanks for your efforts!
I left some comments.

@mergify mergify bot added the Installation label Oct 3, 2023
@ms-dot-k
Contributor Author

ms-dot-k commented Oct 3, 2023

Thank you for reviewing the code, @ftshijt and @simpleoier. I modified the code according to your comments.

Contributor

@sw005320 sw005320 left a comment

LGTM.

Maybe we could add a bit more comments (e.g., how to extract video features, how to fuse them, how to feed them to espnet (through the Kaldi ark format?), and how to use av hubert) to this recipe in egs2/lrs3/avsr1/README.md.


|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test|1321|9890|98.5|1.1|0.4|0.2|1.7|8.8|

Contributor

This is very good.
Is it the expected number?
Is the performance on this task already saturated?

Contributor Author

The performance is obtained in a clean audio-visual environment.
The performance originally reported for AV-HuBERT is 1.4% WER. The small gap may come from the augmentation part.
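
To make the comparison concrete, the short sketch below recomputes the Err column of the result table as the sum of the Sub, Del, and Ins rates and compares it with the 1.4% WER quoted above. The function name `word_error_rate` is hypothetical; the input numbers are taken from the table and the comment in this thread, and the table rates are themselves rounded to one decimal.

```python
def word_error_rate(sub, dele, ins):
    """Sum substitution, deletion, and insertion rates (all in percent)."""
    return round(sub + dele + ins, 1)


# Rates from the test-set row of the result table (percent of reference words).
recipe_wer = word_error_rate(sub=1.1, dele=0.4, ins=0.2)

# WER quoted in this thread for the original AV-HuBERT.
reported_wer = 1.4

# Absolute gap between the recipe and the quoted number, in percentage points.
gap = round(recipe_wer - reported_wer, 1)
print(f"recipe WER = {recipe_wer}%, gap = {gap} points")
```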

feats - numpy.ndarray of shape [T, F]
stack_order - int (number of neighboring frames to concatenate)
Returns:
feats - numpy.ndarray of shape [T', F']

Contributor

Would F' be changed from F?
If not, maybe [T', F]?

Contributor Author

The log fbank features have a frame rate of 100 fps.
To fuse them with the video (25 fps), they stacked 4 contiguous frames along the spectral dimension.
So F' is 4*F and T' is T/4.
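
The stacking described above can be sketched in NumPy as follows. This is a minimal illustration, not necessarily the exact code in the recipe: the name `stack_frames` is hypothetical, and zero-padding the tail when T is not a multiple of `stack_order` is an assumption (under it, T' is T/4 rounded up).

```python
import numpy as np


def stack_frames(feats, stack_order):
    """Concatenate every `stack_order` neighboring frames along the feature axis.

    feats: numpy.ndarray of shape [T, F]
    returns: numpy.ndarray of shape [T', F'] with F' = stack_order * F
             and T' = ceil(T / stack_order)
    """
    num_frames, feat_dim = feats.shape
    remainder = num_frames % stack_order
    if remainder != 0:
        # Assumption: pad the tail with zero frames so T divides evenly.
        pad = np.zeros((stack_order - remainder, feat_dim), dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    # Row-major reshape places consecutive frames side by side in one row.
    return feats.reshape(-1, stack_order * feat_dim)
```

For example, 100 audio frames with `stack_order=4` become 25 stacked frames, which matches the 25 fps video rate for one second of input.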

Contributor

Oh, got it!
Maybe you can add these descriptions here?

Contributor Author

Sure

@@ -0,0 +1,1092 @@
# Copyright 2023
Contributor

We should add a test to cover this part.
Please get an instruction from @ftshijt or @simpleoier about it.

Contributor

Hmm, I see that you added a test.
I don't know why but from my view, it looks like this code does not go through the test.
I'll check it later again.

@sw005320
Contributor

Thanks a lot for your great efforts!
This was our long-term action item to add an audio-visual ASR task!

The last comment is a follow-up PR to add this great new function to https://github.com/espnet/espnet/blob/master/README.md
Please add some documents and advertise this great work!

@sw005320 sw005320 merged commit b66e423 into espnet:master Oct 10, 2023
25 checks passed
@ms-dot-k ms-dot-k deleted the avsr1 branch December 12, 2023 01:41
5 participants