AVSR recipes on LRS3 using pre-trained AV-HuBERT model #5456
Conversation
for more information, see https://pre-commit.ci
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5456      +/-   ##
==========================================
- Coverage   77.16%   77.06%   -0.11%
==========================================
  Files         684      685       +1
  Lines       62713    63223     +510
==========================================
+ Hits        48391    48720     +329
- Misses      14322    14503     +181
```

... and 8 files with indirect coverage changes
@@ -0,0 +1,1092 @@
# Copyright 2023
Many thanks for making this work by adapting the original fairseq codebase here. My main concern is that the fairseq style is not very compatible with ours (it introduces a lot of module-related code, which makes the design a bit redundant).
I feel that there would be two better solutions:
- Option 1: follow the style of https://github.com/ms-dot-k/espnet/blob/master/espnet2/asr/encoder/wav2vec2_encoder.py, which directly imports fairseq?
- Option 2: add an entry to s3prl (which we usually use for SSL, and which is good at holding different SSL models): https://github.com/s3prl/s3prl
I'm open to discussion. Maybe also consult @simpleoier and @sw005320 for their thoughts on it.
Yes, that's true.
Currently, AV-HuBERT is not supported by official fairseq, so we cannot import it directly from fairseq.
Therefore, we need to define many modules in avhubert_encoder.py.
I think it's better to copy it rather than import it from fairseq.
We had a lot of issues when importing fairseq.
Also, this approach allows us to edit the model directly (e.g., adding adapters).
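As a hypothetical illustration of the kind of edit that becomes easy once the model code is copied in-tree, a bottleneck adapter after an encoder layer could look like the minimal sketch below. All names and shapes here are assumptions for illustration, not part of this PR:

```python
import numpy as np

def adapter_forward(x, w_down, w_up):
    """Hypothetical bottleneck adapter: down-project, ReLU,
    up-project, then add a residual connection."""
    h = np.maximum(x @ w_down, 0.0)  # [T, D] @ [D, d] -> [T, d]
    return x + h @ w_up              # residual add keeps the [T, D] shape

# With a zero-initialized up-projection the adapter starts as an identity,
# so inserting it does not perturb the pre-trained model at step 0.
x = np.random.randn(5, 8).astype(np.float32)
w_down = np.random.randn(8, 2).astype(np.float32)
w_up = np.zeros((2, 8), dtype=np.float32)
print(np.allclose(adapter_forward(x, w_down, w_up), x))  # True
```

Making this kind of change inside an imported fairseq package would require monkey-patching; with a copied encoder it is a normal edit.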
Thanks for your efforts!
I left some comments.
Co-authored-by: xuankai@cmu <simpleoier@users.noreply.github.com>
Co-authored-by: Jiatong <728307998@qq.com>
Thank you for reviewing the code, @ftshijt and @simpleoier. I modified the code according to your comments.
LGTM.
Maybe we could add a few more comments to this recipe in egs2/lrs3/avsr1/README.md (e.g., how to extract video features, how to fuse them, how to feed them to ESPnet (through the Kaldi ark format?), and how to use AV-HuBERT).
|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test|1321|9890|98.5|1.1|0.4|0.2|1.7|8.8|
This is very good.
Is it an expected number?
Is the performance on this task already saturated?
This performance is obtained in a clean audio-visual environment.
The originally reported performance of AV-HuBERT is 1.4% WER; the small gap may come from the augmentation part.
feats - numpy.ndarray of shape [T, F]
stack_order - int (number of neighboring frames to concatenate)
Returns:
feats - numpy.ndarray of shape [T', F']
Would F' be changed from F?
If not, maybe [T', F]?
The log fbank has 100 fps.
To fuse it with the video stream (25 fps), they stack 4 contiguous frames along the spectral dimension.
So F' is 4*F and T' is T/4.
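The frame stacking described above can be sketched as follows. This is a paraphrase of the stacking helper (zero-padding T to a multiple of stack_order before reshaping); the exact function name and padding details in the recipe may differ:

```python
import numpy as np

def stacker(feats, stack_order):
    """Concatenate every `stack_order` neighboring frames along the
    feature axis: [T, F] -> [ceil(T / stack_order), F * stack_order]."""
    feat_dim = feats.shape[1]
    if len(feats) % stack_order != 0:
        # zero-pad so T becomes divisible by stack_order
        res = stack_order - len(feats) % stack_order
        pad = np.zeros([res, feat_dim], dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    return feats.reshape(-1, stack_order * feat_dim)

# 100 fps log-fbank stacked by 4 aligns with 25 fps video frames
fbank = np.random.randn(103, 80).astype(np.float32)
print(stacker(fbank, 4).shape)  # (26, 320): T' = T/4, F' = 4*F
```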
Oh, got it!
Maybe you can add these descriptions here?
Sure
We should add a test to cover this part.
Please get instructions from @ftshijt or @simpleoier about it.
Hmm, I see that you added a test.
I don't know why, but from my view it looks like this code is not covered by the test.
I'll check it again later.
Thanks a lot for your great efforts! The last remaining item is a follow-up PR to add this great new function to https://github.com/espnet/espnet/blob/master/README.md
What?
A new recipe for training an audio-visual speech recognition model on the LRS3 dataset, added at egs2/lrs3/avsr1.
The recipe uses features dumped with the pre-trained frontend of AV-HuBERT.
Therefore, the dumped features are fused audio-visual features.
Why?
Create a training recipe for AVSR using the AV-HuBERT model.
See also
Since we use dumped features, we cannot apply some of the data augmentations originally used in AV-HuBERT,
for example, audio noise modeling and random video cropping and flipping.
Instead, the recipe applies random time masking to the audio-visual features. Moreover, to avoid overfitting, it uses a lighter transformer decoder when the AV-HuBERT Large configuration is used.
(The original AV-HuBERT Large decoder uses 8 heads with 9 layers, while this recipe uses 4 heads with 6 layers.)
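A minimal sketch of random time masking applied to the dumped [T, F] audio-visual features. This is an assumption of how such masking could be implemented (function name, mask count, and width are illustrative), not the recipe's exact code:

```python
import numpy as np

def time_mask(feats, num_masks=2, max_width=10, rng=None):
    """Zero out `num_masks` random time spans of a [T, F] feature matrix."""
    rng = rng if rng is not None else np.random.default_rng()
    out = feats.copy()
    T = out.shape[0]
    for _ in range(num_masks):
        width = int(rng.integers(1, max_width + 1))   # span length in frames
        start = int(rng.integers(0, T - width + 1))   # span start position
        out[start:start + width, :] = 0.0             # mask all feature dims
    return out

# e.g. masking fused AV features of 100 frames x 320 dims
feats = np.ones((100, 320), dtype=np.float32)
masked = time_mask(feats, rng=np.random.default_rng(0))
print(masked.shape)  # (100, 320)
```

Masking whole time spans of the fused features loosely stands in for the waveform- and video-level augmentations that are no longer possible after dumping.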