Add SLUE-VoxPopuli results for WavLM with mBART-50 #4777
Conversation
Codecov Report
|          | master | #4777  | +/-    |
|----------|-------:|-------:|-------:|
| Coverage | 80.32% | 80.31% | -0.01% |
| Files    | 530    | 530    |        |
| Lines    | 46527  | 46527  |        |
| Hits     | 37372  | 37369  | -3     |
| Misses   | 9155   | 9158   | +3     |
Thanks, @akreal! @siddhu001, can you review this PR?
Yes, sure. I started the WavLM experiment; it should finish on Friday. Hopefully the results will be better, since WavLM is more focused on English. I mainly used this setting (but with Conformer) to test different strategies for pretraining the interface between XLS-R and mBART, and did not test mBART effectiveness specifically. There are more differences between this configuration and the WavLM configuration from the recipe. To see the effect of mBART better, I would suggest taking the recipe's WavLM configuration and adding mBART. I can run this experiment next week.
```sh
    --feats_normalize utterance_mvn \
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --inference_nj 1 \
```
```sh
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --inference_nj 1 \
    --gpu_inference true \
```
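To put the snippets above in context: below is a minimal sketch, assuming the usual egs2 layout, of how these options are passed from a recipe's run.sh to the shared asr.sh script. The split names and config file names are placeholders, not the exact values used in this recipe. With `--gpu_inference true`, decoding runs on a GPU, which is why `--inference_nj` is kept at 1 rather than spawning many parallel jobs on one device.

```sh
#!/usr/bin/env bash
# Hedged sketch of a typical egs2 run.sh; split names and config paths are
# placeholders, not the exact values used in this recipe.
set -e
set -u
set -o pipefail

asr_config=conf/tuning/train_asr.yaml     # placeholder
inference_config=conf/decode_asr.yaml     # placeholder

./asr.sh \
    --lang en \
    --feats_normalize utterance_mvn \
    --asr_config "${asr_config}" \
    --inference_config "${inference_config}" \
    --inference_nj 1 \
    --gpu_inference true \
    --train_set "train" \
    --valid_set "devel" \
    --test_sets "devel test" \
    "$@"
```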
egs2/slue-voxpopuli/asr1/README.md
## Using XLS-R pretrained speech Encoder and mBART-50 Large pretrained text Encoder-Decoder
- ASR config: [conf/tuning/train_asr_branchformer_xlsr_mbart.yaml](conf/tuning/train_asr_branchformer_xlsr_mbart.yaml)
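As a usage note (not part of the PR itself): assuming the recipe's run.sh forwards extra flags to asr.sh, the configuration listed above could be selected as sketched below; the inference config name is an assumption.

```sh
# Hedged sketch: picking the XLS-R + mBART-50 configuration from the README;
# conf/decode_asr.yaml is an assumed name for the inference config.
./run.sh \
    --asr_config conf/tuning/train_asr_branchformer_xlsr_mbart.yaml \
    --inference_config conf/decode_asr.yaml \
    --gpu_inference true \
    --inference_nj 1
```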
Could you also try to upload this model to Hugging Face? It would be useful for future experimentation on this.
Some time ago I tried to upload the SLURP model and the file size was too large. I can try again once the WavLM experiment finishes. If this does not work, I could share it some other way (Zenodo?).
Yeah, Zenodo will work too if Hugging Face gives some issues.
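For reference, here is a minimal sketch of one way to push a trained checkpoint to the Hugging Face Hub with the huggingface_hub CLI. The repository name and file paths are placeholders, this is not necessarily the recipe's built-in upload stage, and very large checkpoints may still hit size limits (hence the Zenodo fallback).

```sh
# Hedged sketch, assuming a recent huggingface_hub CLI is installed;
# repo name and paths below are placeholders.
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli upload \
    your-username/slue-voxpopuli-wavlm-mbart \
    exp/asr_train/valid.acc.ave.pth \
    valid.acc.ave.pth
```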
Hi @akreal, thanks for the PR. This is very useful, and I would be very interested to understand the effectiveness of mBART. I have only a few minor suggestions; otherwise I think it is ready to be merged.
Hi @siddhu001! Thank you for your review. I'll include the changes once the WavLM configuration finishes training. I'll comment on the effectiveness of mBART next week.
The WavLM experiments finished and the results are only a tiny bit better than this recipe's (GA stands for gradient accumulation steps):
I'm running one last experiment with GA=8; it should finish tomorrow, and then I'll update this PR with the best configuration.
I tried this but the results are quite bad:
So overall, mBART does not look very useful for this dataset, at least without some extra pretraining (for the WavLM-mBART interface and/or the NER task).
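For context on the GA setting discussed above: in ESPnet2 the number of gradient accumulation steps corresponds to the `accum_grad` training option. Below is a minimal sketch, assuming the recipe forwards extra training arguments via `--asr_args`; the config file name is a placeholder, not a file from this PR.

```sh
# Hedged sketch: overriding gradient accumulation (GA) from the command line,
# assuming asr.sh forwards --asr_args to the training script; alternatively,
# set "accum_grad: 4" directly in the training config YAML.
./run.sh \
    --asr_config conf/tuning/train_asr_wavlm_mbart.yaml \
    --asr_args "--accum_grad 4"
```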
@akreal Thanks for computing these results! They look very interesting. The decline in results with the WavLM configuration may also be because the current WavLM configuration is not the best one for Hugging Face tokenization, and the parameters probably need to be tuned. I believe analyzing the model errors and trying additional pretraining for the WavLM-mBART interface, as you suggested, are exciting future directions in this space.
Here are the results:
They are very similar, but I'll keep WavLM + mBART with GA=4.
That's right, I had to change the learning rate (otherwise it did not work at all) but did not tune it.
The branch was force-pushed from 24cfde7 to e75d8dc.
Thanks a lot, @akreal!