[Paper] (arxiv)
Causal generative models of vision-language data. Refining CLIP through data augmentation.

Installation
- Install CLIP (see the install command after this list) and the following packages:
pip install pyyaml
pip install tensorboard
pip install scikit-learn
- Install EDA for the ablative experiments.
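If CLIP itself is not yet installed, the OpenAI implementation can be installed directly from GitHub (this assumes the standard openai/CLIP package; adapt the command if the project pins a specific fork):
pip install git+https://github.com/openai/CLIP.git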
Data preparation
Download the PACS, VLCS, OfficeHome, and DomainNet (cleaned version) datasets to the "./data" directory and arrange them in the following directory structure (a sketch for verifying the layout follows the listing):
- data/datasets/PACS # dataset
  - art_painting # domains
  - ...
- data/datasets/VLCS
  - Caltech101
  - ...
- data/datasets/OfficeHome
  - Art
  - ...
- data/datasets/DomainNet
  - clipart
  - ...
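Before generating the few-shot splits, it may help to confirm the layout. The check below is a minimal, hypothetical sketch: only the example domains listed above are included, and the full domain lists depend on each dataset.

```python
import os

# Hypothetical sanity check for the dataset layout described above.
# Only the first example domain of each dataset is listed.
EXPECTED = {
    "PACS": ["art_painting"],
    "VLCS": ["Caltech101"],
    "OfficeHome": ["Art"],
    "DomainNet": ["clipart"],
}

root = "data/datasets"
for dataset, domains in EXPECTED.items():
    for domain in domains:
        path = os.path.join(root, dataset, domain)
        print(f"{path}: {'ok' if os.path.isdir(path) else 'MISSING'}")
```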
Run the following command to generate the few-shot training data:
python gen_fewshot_dset.py
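gen_fewshot_dset.py is not reproduced here; a common way to build such splits is to sample k images per class from each domain directory, roughly as in this hypothetical sketch (the function name, output format, and anything beyond the folder layout shown above are assumptions, not the script's actual behavior):

```python
import os
import random

def sample_fewshot(domain_dir, k=16, seed=0):
    """Sample up to k image paths per class from a directory laid out as
    <domain_dir>/<class_name>/<image files>. Purely illustrative."""
    rng = random.Random(seed)
    split = {}
    for cls in sorted(os.listdir(domain_dir)):
        cls_dir = os.path.join(domain_dir, cls)
        if not os.path.isdir(cls_dir):
            continue
        images = sorted(os.listdir(cls_dir))
        split[cls] = [os.path.join(cls_dir, f)
                      for f in rng.sample(images, min(k, len(images)))]
    return split

# Example: a 16-shot split from the PACS art_painting domain.
split = sample_fewshot("data/datasets/PACS/art_painting", k=16)
```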
Training
python train_clap.py config/[train_config].yaml # Refer to the template "train_CLAP_VLCS_ViTB.py" for configuration details.
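The YAML schema is defined by the template referenced above; the snippet below only illustrates reading such a config with the pyyaml package installed earlier. Every filename and key shown is a hypothetical placeholder, not the actual CLAP schema.

```python
import yaml

# Read a training config; the path and keys below are placeholders.
with open("config/train_config.yaml") as f:
    cfg = yaml.safe_load(f)

backbone = cfg.get("backbone", "ViT-B/16")  # assumed key
dataset = cfg.get("dataset", "VLCS")        # assumed key
shots = cfg.get("shots", 16)                # assumed key
print(f"training on {dataset} with {backbone}, {shots}-shot")
```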
Evaluation
python eval_zeroshot.py config/[eval_config].yaml # evaluate zero-shot performance in both the natural and adversarial settings
python eval_fewshots.py config/[eval_config].yaml # evaluate few-shot (1, 4, 8, 16, 32) performance in the natural setting
python eval_oneshot_adv.py config/[eval_config].yaml # evaluate one-shot performance in the adversarial setting
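For orientation, zero-shot CLIP evaluation of the kind these scripts report scores each image against one text embedding per class. Below is a minimal sketch using the stock OpenAI CLIP API; the prompt template, class names, and image path are placeholders, not the paper's prompts or data loaders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Placeholder class names (PACS-style); the real lists come from each dataset.
classnames = ["dog", "elephant", "giraffe", "guitar", "horse", "house", "person"]
text = clip.tokenize([f"a photo of a {c}" for c in classnames]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T

print("prediction:", classnames[logits.argmax(dim=-1).item()])
```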
Results
- runs/Ablations_Prompts_Sources # Ablation study results analyzing prompt sources
- runs/Results_CLAP_ViTB # Main results of CLAP using the ViT-B/16 CLIP model
- runs/Results_CLAP_ViTL # Zero-shot performance of CLAP, with experiments repeated on the ViT-L/14 model
- runs/Results_CLIP_ViTB # CLIP baseline at the ViT-B/16 model size
- runs/Results_CLIP_ViTL # CLIP baseline at the ViT-L/14 model size
- runs/Results_ImgAug_ViTB # ImgAug (image augmentation) experimental results using the ViT-B/16 CLIP model
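Assuming these directories contain TensorBoard event files (tensorboard is listed among the required packages above), the logged curves can be browsed with:
tensorboard --logdir runs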