
Siamese Vision Transformers are Scalable Audio-visual Learners

📗 Paper: https://arxiv.org/abs/2403.19638

License: MIT

This is the PyTorch implementation of our paper:

Siamese Vision Transformers are Scalable Audio-visual Learners

Yan-Bo Lin and Gedas Bertasius



[Figure: overview of our method]

📝 Preparation

  • pip3 install -r requirement
  • Download AudioSet and VGGSound
  • Download jx_vit_base_patch16_224_in21k-e5005f0a.pth and save it at ./src/adapt_weights. (Not strictly necessary, but it slightly affects results; a loading sketch follows this list.)
  • Download the SQLite3 files and save them wherever you want. (Reading annotations from SQLite instead of CSV avoids running out of CPU memory; see the sketch after this list.)
  • Edit ./src/dataloader.py and ./src/dataloader_ft.py to make sure your video path and SQL path are correct.
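The checkpoint above is the standard timm ImageNet-21k ViT-B/16 state dict. As a quick sanity check that the download is intact, you can load it into a plain timm ViT. This is only an illustrative sketch, not the repo's own initialization code (AVSiam adapts these weights inside its own model), and the 21,843-class head size is an assumption based on the ImageNet-21k label set:

```python
import timm
import torch

# path matches the ./src/adapt_weights location described above
ckpt_path = "./src/adapt_weights/jx_vit_base_patch16_224_in21k-e5005f0a.pth"
state_dict = torch.load(ckpt_path, map_location="cpu")

# a vanilla ViT-B/16 with the 21,843-class ImageNet-21k head; strict=False
# tolerates keys (e.g. pre_logits) that newer timm versions no longer define
vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=21843)
missing, unexpected = vit.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```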
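Below is a minimal sketch of the SQLite-backed lookup pattern the bullets above refer to. The table and column names (annotations, video_path, labels) are assumptions for illustration; check the provided SQLite3 files for the actual schema used by ./src/dataloader.py. The point is that sqlite3 fetches rows on demand from disk, so a DataLoader worker never has to hold the full CSV annotation table in memory:

```python
import sqlite3
from torch.utils.data import Dataset

class SQLiteAnnotations(Dataset):
    """Illustrative dataset that reads one annotation row per __getitem__."""

    def __init__(self, db_path: str):
        # read-only URI connection; with num_workers > 0 you would open this
        # lazily inside each worker, since sqlite3 connections are not fork-safe
        self.conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

    def __len__(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM annotations").fetchone()[0]

    def __getitem__(self, idx: int):
        # SQLite rowids are 1-indexed
        video_path, labels = self.conn.execute(
            "SELECT video_path, labels FROM annotations WHERE rowid = ?",
            (idx + 1,),
        ).fetchone()
        return video_path, labels
```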

🏃 Pretraining

  • Run ./egs/audioset/run_pretrain_base.sh

🏃 Finetuning

  • AudioSet 2M: run ./egs/audioset/run_base_ft_2m.sh
  • AudioSet 20K: run ./egs/audioset/run_base_ft.sh
  • VGGSound: run ./egs/vggsound/run_base_ft.sh

🎓 Cite

If you use this code in your research, please cite:

@article{lin2024siamese,
  title={Siamese Vision Transformers are Scalable Audio-visual Learners},
  author={Lin, Yan-Bo and Bertasius, Gedas},
  journal={arXiv preprint arXiv:2403.19638},
  year={2024}
}

👍 Acknowledgments

Our code is based on CAV-MAE.

✏ Model Checkpoints

More checkpoints and training scripts will be made available.

| Pretraining data       | Base | Base+ | Large | Huge |
|------------------------|------|-------|-------|------|
| AS-2M                  |      |       |       |      |
| AS-2M+VGG+ACAV2.4M     |      |       |       |      |
