AVSR recipe for Easycom Dataset #5630

Merged
merged 22 commits into from
Jan 31, 2024

ms-dot-k
Contributor

What?

This PR adds a new recipe for training an audio-visual speech recognition (AVSR) model on the Easycom dataset.
The recipe is based on the LRS3 AVSR recipe, which uses a pre-trained AV-HuBERT model (dumped features).

I added two data augmentation techniques to espnet2/asr/encoder/avhubert_encoder.py:

  1. Acoustic noise perturbation: the audio is corrupted with babble noise at a random strength, at the feature level.
  2. Modality dropout: the audio and video streams are randomly dropped during training, so that the trained model can still perform audio-visual, audio-only, or visual-only prediction.
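
To make the two augmentations concrete, here is a minimal NumPy sketch. It is not the code added to avhubert_encoder.py: the function names, the parameters (`min_scale`, `max_scale`, `p_drop`, `p_audio`), and the choice to zero out a dropped stream are assumptions made for illustration.

```python
import numpy as np


def add_babble_noise(audio_feats, noise_feats, rng, min_scale=0.0, max_scale=1.0):
    """Mix babble-noise features into audio features at a random strength.

    Both inputs are (frames, dims) arrays. The noise is assumed to be at
    least as long as the utterance (an assumption of this sketch).
    """
    n_frames = audio_feats.shape[0]
    if noise_feats.shape[0] < n_frames:
        raise ValueError("noise must be at least as long as the utterance")
    # Pick a random window of the noise with the same length as the utterance.
    start = rng.integers(0, noise_feats.shape[0] - n_frames + 1)
    noise = noise_feats[start:start + n_frames]
    # Draw one random noise strength per utterance.
    scale = rng.uniform(min_scale, max_scale)
    return audio_feats + scale * noise


def modality_dropout(audio_feats, video_feats, rng, p_drop=0.5, p_audio=0.5):
    """With probability p_drop, zero out one whole stream.

    When a drop happens, the audio stream is chosen with probability
    p_audio and the video stream otherwise (hypothetical parameters).
    """
    if rng.random() < p_drop:
        if rng.random() < p_audio:
            audio_feats = np.zeros_like(audio_feats)
        else:
            video_feats = np.zeros_like(video_feats)
    return audio_feats, video_feats


rng = np.random.default_rng(0)
audio = np.ones((10, 4))          # toy audio features
video = np.ones((10, 4))          # toy video features
babble = np.full((50, 4), 2.0)    # toy babble-noise features

noisy_audio = add_babble_noise(audio, babble, rng)
audio_in, video_in = modality_dropout(noisy_audio, video, rng)
```

Both functions would only be applied during training; at inference time, feeding a zeroed stream is one way to run such a model in audio-only or visual-only mode.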

See also

The Easycom dataset is too small to reach reasonable performance on its own, so the recipe trains the model on both the Easycom and LRS3 datasets.

@mergify mergify bot added the README label Jan 22, 2024
@sw005320 sw005320 requested a review from ftshijt January 22, 2024 12:30
@sw005320 sw005320 added the AV Audio visual processing label Jan 22, 2024
@sw005320 sw005320 added this to the v.202312 milestone Jan 22, 2024
Contributor

@sw005320 sw005320 left a comment

I added several comments.

.gitignore (outdated)
egs2/TEMPLATE/asr1/db.sh (outdated)
egs2/TEMPLATE/asr1/db.sh (outdated)
egs2/TEMPLATE/asr1/db.sh (outdated)
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test_with_LRS3|694|8886|70.4|18.6|11.0|5.0|34.6|75.4|

## Audio-only Speech Recognition Results (Audio-only) <br> exp/asr_train_avsr_avhubert_large_with_lrs3_noise_extracted_en_bpe1000
Contributor

Performance degrades significantly with audio-only input.
Can you explain why?
Is it perhaps due to the AV-HuBERT architecture?

Contributor Author

The dataset is very challenging because of noise and long-distance (far-field) speech.
A previous ASR model (wav2vec 2.0) trained on 60k hours of data achieves 87.5% WER (https://arxiv.org/pdf/2212.11377.pdf). Visual information can therefore improve performance greatly, because it complements audio that is degraded by noise, overlapped speech, and long-distance voice.

The model in this recipe was pre-trained (AV-HuBERT) on 1,759 hours of data and fine-tuned on 438 hours. Given that amount of data, the current performance seems reasonable.

One possible way to improve performance is to use more audio-visual data, such as LRS2, VoxCeleb, and AVSpeech.
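
For readers less familiar with the metric, WER figures such as the 87.5% quoted above count word substitutions, deletions, and insertions against the number of reference words. The sketch below is a plain edit-distance implementation for illustration only; it is not the scoring tool used by the recipe, and the function names are hypothetical. Assuming the results row quoted earlier in this thread follows the usual Corr/Sub/Del/Ins/Err column order, its 34.6 error rate is the sum 18.6 + 11.0 + 5.0.

```python
def edit_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-cost alignment.

    ref and hyp are lists of words. Each table cell holds the tuple
    (cost, subs, dels, inss) for aligning ref[:i] with hyp[:j].
    """
    # Aligning an empty reference with hyp[:j] takes j insertions.
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        # Aligning ref[:i] with an empty hypothesis takes i deletions.
        cur = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, s, d, n)              # match, no cost
            else:
                diag = (c + 1, s + 1, d, n)      # substitution
            c, s, d, n = prev[j]
            delete = (c + 1, s, d + 1, n)        # reference word missing
            c, s, d, n = cur[j - 1]
            insert = (c + 1, s, d, n + 1)        # extra hypothesis word
            # Tuples compare on cost first, so min() keeps a cheapest path.
            cur.append(min(diag, delete, insert))
        prev = cur
    _, subs, dels, inss = prev[-1]
    return subs, dels, inss


def wer(ref_text, hyp_text):
    """Word error rate in percent: 100 * (S + D + I) / number of reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    subs, dels, inss = edit_counts(ref, hyp)
    return 100.0 * (subs + dels + inss) / len(ref)


score = wer("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, WER can exceed 100% when the hypothesis contains many extra words.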

Collaborator

@ftshijt ftshijt left a comment

Very cool extension! Many thanks for the effort.

Could you please also add an entry in egs2/README.md for the dataset?

Also, two minor comments as follows:

egs2/easycom/avsr1/local/data.sh (outdated)
egs2/easycom/avsr1/local/data.sh (outdated)

codecov bot commented Jan 26, 2024

Codecov Report

Attention: 26 lines in your changes are missing coverage. Please review.

Comparison is base (27f292d) 76.11% compared to head (0582547) 76.13%.

|Files|Patch %|Lines|
|---|---|---|
|espnet2/asr/encoder/avhubert_encoder.py|33.33%|26 Missing ⚠️|
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5630      +/-   ##
==========================================
+ Coverage   76.11%   76.13%   +0.01%     
==========================================
  Files         743      743              
  Lines       69117    69151      +34     
==========================================
+ Hits        52608    52647      +39     
+ Misses      16509    16504       -5     
|Flag|Coverage Δ|
|---|---|
|test_configuration_espnet2|∅ <ø> (∅)|
|test_integration_espnet1|62.92% <ø> (+0.14%) ⬆️|
|test_integration_espnet2|48.48% <2.56%> (-0.04%) ⬇️|
|test_python_espnet1|18.39% <0.00%> (-0.01%) ⬇️|
|test_python_espnet2|52.66% <33.33%> (+0.03%) ⬆️|
|test_utils|22.15% <ø> (-0.01%) ⬇️|

@sw005320 sw005320 merged commit 348139f into espnet:master Jan 31, 2024
26 of 27 checks passed
@sw005320
Contributor

Thanks a lot!
