AVSR recipe for Easycom Dataset #5630
Conversation
…ble for audio-only training and inference
Add easycom dataset
for more information, see https://pre-commit.ci
I added several comments.
## Audio-only Speech Recognition Results
exp/asr_train_avsr_avhubert_large_with_lrs3_noise_extracted_en_bpe1000

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|inference_asr_model_valid.acc.ave/test_with_LRS3|694|8886|70.4|18.6|11.0|5.0|34.6|75.4|
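As a quick sanity check on the results table: in ESPnet score reports the error rate is the sum of substitutions, deletions, and insertions (each as a percentage of reference words), which is consistent with the row above.

```python
# Verify that Err = Sub + Del + Ins for the reported WER row.
sub, dele, ins = 18.6, 11.0, 5.0
err = round(sub + dele + ins, 1)
print(err)  # 34.6, matching the Err column
```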
The audio-only setting significantly degrades the performance.
Can you provide some reasons?
Is it perhaps due to the AV-HuBERT architecture?
The dataset is very challenging due to noise and far-field speech.
A previous ASR model (wav2vec 2.0) trained on 60k hours of data achieves 87.5% WER (https://arxiv.org/pdf/2212.11377.pdf). By employing visual information, we can therefore improve the performance greatly, complementing the insufficient audio information (due to noise, overlapped speech, and far-field speech) during speech recognition.
The model in this recipe was pre-trained on 1,759 hours of data (AV-HuBERT) and fine-tuned on 438 hours. Considering the amount of data, the current performance seems reasonable.
One possible direction for improving the performance is to use more audio-visual data, including LRS2, VoxCeleb, and AVSpeech.
Very cool extension! Many thanks for the effort.
Could you please also add an entry in egs2/README.md for the dataset?
Also, two minor comments follow:
Codecov Report

```
@@            Coverage Diff             @@
##           master    #5630      +/-   ##
==========================================
+ Coverage   76.11%   76.13%   +0.01%
==========================================
  Files         743      743
  Lines       69117    69151      +34
==========================================
+ Hits        52608    52647      +39
+ Misses      16509    16504       -5
==========================================
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Thanks a lot!
What?
A new recipe for training an audio-visual speech recognition (AVSR) model on the Easycom dataset.
The recipe is based on the LRS3 AVSR recipe, which utilizes a pre-trained AV-HuBERT model (dumped features).
I added data augmentation techniques to espnet2/asr/encoder/avhubert_encoder.py.
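The PR only states that augmentation was added to the encoder; as a hedged illustration, one common audio augmentation is mixing a noise clip into the waveform at a target SNR. The helper below is a hypothetical sketch, not the actual code in avhubert_encoder.py.

```python
import torch

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix a 1-D noise clip into a 1-D waveform at a target SNR in dB.

    Hypothetical helper for illustration only.
    """
    # Tile or trim the noise to match the speech length.
    if noise.shape[-1] < speech.shape[-1]:
        reps = -(-speech.shape[-1] // noise.shape[-1])  # ceiling division
        noise = noise.repeat(reps)[: speech.shape[-1]]
    else:
        noise = noise[: speech.shape[-1]]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

In training, the noise clips would typically be drawn from a separate noise corpus and the SNR sampled randomly per utterance.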
See also
The Easycom dataset alone is too small to achieve reasonable performance, so the recipe uses both the Easycom and LRS3 datasets to train the model.