RobustVLM

[Paper] [HuggingFace] [BibTeX]

This repository contains code for the paper "Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models".



We fine-tune CLIP in an unsupervised manner to improve its robustness to visual adversarial attacks. We show that replacing the vision encoder of large vision-language models with our fine-tuned CLIP models yields state-of-the-art adversarial robustness on a variety of vision-language tasks, without requiring any training of the large VLMs themselves. Moreover, we improve the robustness of CLIP to adversarial attacks in zero-shot classification settings, while maintaining higher clean accuracy than previous adversarial fine-tuning methods.

Installation

The code is tested with Python 3.11. To install the required packages, run:

pip install -r requirements.txt

Models

We provide the following adversarially fine-tuned ViT-L/14 CLIP models (approx. 1.1 GB each):

| Model | Link | Proposed by | Notes |
|---|---|---|---|
| TeCoA2 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| TeCoA4 | Link | Mao et al. (2023) | Supervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |
| FARE2 | Link | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{2}{255}$ |
| FARE4 | Link | ours | Unsupervised adversarial fine-tuning with $\ell_\infty$ norm, $\varepsilon=\frac{4}{255}$ |

The models are also available on HuggingFace.

All models are adversarially fine-tuned for two epochs on ImageNet. TeCoA is trained in a supervised fashion, utilizing ImageNet class labels. FARE, in contrast, does not require any labels for training.
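
For orientation, the two training objectives can be written roughly as follows. This is a sketch based on a reading of the paper and on the `--loss ce` and `--loss l2` settings in the training commands, not a verbatim restatement. The notation is chosen here: $\phi$ is the vision encoder being fine-tuned, $\phi_{\mathrm{org}}$ the original (frozen) CLIP vision encoder, $\psi$ the text encoder, and $t_k$ the text prompt for class $k$; any temperature scaling of the logits is omitted.

$$
L_{\mathrm{TeCoA}}(x, y) = \max_{\|z - x\|_\infty \le \varepsilon} \; -\log \frac{\exp\big(\cos(\phi(z), \psi(t_y))\big)}{\sum_{k} \exp\big(\cos(\phi(z), \psi(t_k))\big)}
$$

$$
L_{\mathrm{FARE}}(x) = \max_{\|z - x\|_\infty \le \varepsilon} \; \big\| \phi(z) - \phi_{\mathrm{org}}(x) \big\|_2^2
$$

The FARE objective only compares the embedding of the perturbed image with the original embedding of the clean image, which is why it needs no labels.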

Loading pretrained models

The provided checkpoints correspond to the vision encoder of CLIP. To load the full CLIP model (including the text encoder), you can use the following code:

import torch
from open_clip import create_model_and_transforms

# Build the original OpenAI ViT-L/14 CLIP (vision and text encoder).
model, _, image_processor = create_model_and_transforms(
    'ViT-L-14', pretrained='openai', device='cpu'
)
# Overwrite the vision encoder weights with the fine-tuned checkpoint.
checkpoint = torch.load('/path/to/fare_eps_2.pt', map_location=torch.device('cpu'))
model.visual.load_state_dict(checkpoint)

Alternatively, load the model directly from HuggingFace:

import open_clip
model, _, image_processor = open_clip.create_model_and_transforms('hf-hub:chs20/fare2-clip')

Summary of results

We show a summary of results on zero-shot classification and vision-language tasks for original and fine-tuned ViT-L/14 CLIP models. CLIP-only means that we evaluate the respective CLIP model in a standalone fashion for zero-shot classification, whereas OpenFlamingo and LLaVA evaluation means that we use the respective CLIP model as a vision encoder as part of these large vision-language models. Results for individual zero-shot datasets and more VLM tasks are provided in the paper.

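  • Clean evaluation:

| Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
|---|---|---|---|---|---|
| OpenAI | 73.1 | 79.7 | 23.8 | 115.5 | 37.1 |
| TeCoA2 | 60.0 | 73.5 | 16.6 | 98.4 | 24.1 |
| FARE2 | 67.0 | 79.1 | 21.6 | 109.9 | 31.9 |
| TeCoA4 | 54.2 | 66.9 | 15.4 | 88.3 | 20.7 |
| FARE4 | 61.1 | 74.1 | 18.6 | 102.4 | 27.6 |
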
  • Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{2}{255}$):

| Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
|---|---|---|---|---|---|
| OpenAI | 0.0 | 1.5 | 0.0 | 4.0 | 0.5 |
| TeCoA2 | 43.6 | 31.6 | 3.5 | 44.2 | 12.1 |
| FARE2 | 43.1 | 34.2 | 4.1 | 53.6 | 14.7 |
| TeCoA4 | 42.3 | 28.5 | 2.1 | 50.9 | 12.6 |
| FARE4 | 45.9 | 30.9 | 3.4 | 57.1 | 15.8 |

  • Adversarial evaluation ($\ell_\infty, ~ \varepsilon=\frac{4}{255}$):

| Model | CLIP-only: Avg. zero-shot | OpenFlamingo 9B: COCO | OpenFlamingo 9B: TextVQA | LLaVA 1.5 7B: COCO | LLaVA 1.5 7B: TextVQA |
|---|---|---|---|---|---|
| OpenAI | 0.0 | 1.1 | 0.0 | 3.1 | 0.0 |
| TeCoA2 | 27.0 | 21.2 | 2.1 | 30.3 | 8.8 |
| FARE2 | 20.5 | 19.5 | 1.9 | 31.0 | 9.1 |
| TeCoA4 | 31.9 | 21.6 | 1.8 | 35.3 | 9.3 |
| FARE4 | 32.4 | 22.8 | 2.9 | 40.9 | 10.9 |
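
The trade-off between clean and robust accuracy can be read off the tables above with a little arithmetic. The snippet below is only an illustrative sketch: it copies the CLIP-only average zero-shot numbers from the clean and $\varepsilon=\frac{2}{255}$ tables, and the helper names `clean_drop` and `robust_gain` are made up for this example.

```python
# CLIP-only average zero-shot numbers, copied from the tables above.
CLEAN = {"OpenAI": 73.1, "TeCoA2": 60.0, "FARE2": 67.0, "TeCoA4": 54.2, "FARE4": 61.1}
ADV_EPS2 = {"OpenAI": 0.0, "TeCoA2": 43.6, "FARE2": 43.1, "TeCoA4": 42.3, "FARE4": 45.9}


def clean_drop(model, baseline="OpenAI"):
    """Points of clean accuracy lost relative to the baseline model."""
    return round(CLEAN[baseline] - CLEAN[model], 1)


def robust_gain(model, baseline="OpenAI"):
    """Points of robust accuracy (eps = 2/255) gained relative to the baseline."""
    return round(ADV_EPS2[model] - ADV_EPS2[baseline], 1)


for name in ("TeCoA2", "FARE2", "TeCoA4", "FARE4"):
    print(f"{name}: clean drop {clean_drop(name):.1f}, robust gain {robust_gain(name):.1f}")
```

For the models trained with $\varepsilon=\frac{2}{255}$, this shows FARE2 giving up roughly half as much clean accuracy as TeCoA2 at nearly the same robust accuracy.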

Training

  • TeCoA4
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize True --steps 20000 --warmup 1400 --batch_size 128 --loss ce --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss ce --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name TECOA4 --log_freq 10 --eval_freq 10
  • FARE4
python -m train.adversarial_training_clip --clip_model_name ViT-L-14 --pretrained openai --dataset imagenet --imagenet_root /path/to/imagenet --template std --output_normalize False --steps 20000 --warmup 1400 --batch_size 128 --loss l2 --opt adamw --lr 1e-5 --wd 1e-4 --attack pgd --inner_loss l2 --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1 --wandb False --output_dir /path/to/out/dir --experiment_name FARE4 --log_freq 10 --eval_freq 10

Set --eps 2 to obtain TeCoA2 and FARE2 models.
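
To illustrate what `--attack pgd --norm linf --eps 4 --iterations_adv 10 --stepsize_adv 1` roughly corresponds to, here is a minimal pure-Python sketch of projected gradient ascent in an $\ell_\infty$ ball. It is not the repository's implementation. It assumes that `--eps` and `--stepsize_adv` are given in units of $\frac{1}{255}$ (consistent with `--eps 4` producing the $\varepsilon=\frac{4}{255}$ models, but an assumption for the step size), that pixels lie in $[0, 1]$, and it uses a toy linear loss in place of the CLIP embedding loss. The function name `pgd_linf` is hypothetical.

```python
def pgd_linf(x, grad_fn, eps, step, iters):
    """Maximize a loss over the l_inf ball of radius eps around x (pixels in [0, 1])."""
    z = list(x)
    for _ in range(iters):
        grad = grad_fn(z)
        for i, g in enumerate(grad):
            # Signed gradient ascent step.
            zi = z[i] + step * ((g > 0) - (g < 0))
            # Project onto the eps-ball around x, then onto the valid pixel range.
            zi = min(max(zi, x[i] - eps), x[i] + eps)
            z[i] = min(max(zi, 0.0), 1.0)
    return z


# Toy loss sum(w_i * z_i): its gradient is w, independent of z.
w = [1.0, -2.0, 0.5]
x = [0.5, 0.5, 1.0]
eps = 4 / 255
z_adv = pgd_linf(x, lambda z: w, eps=eps, step=1 / 255, iters=10)
print(z_adv)
```

With ten steps of size $\frac{1}{255}$, the perturbation reaches the boundary of the $\frac{4}{255}$ ball after four steps and is held there by the projection.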

Evaluation

Make sure the files in the bash directory are executable: chmod +x bash/*

CLIP ImageNet

python -m CLIP_eval.clip_robustbench --clip_model_name ViT-L-14 --pretrained /path/to/ckpt.pt --dataset imagenet --imagenet_root /path/to/imagenet --wandb False --norm linf --eps 2
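
To evaluate at several perturbation radii, the command above can be generated in a loop. The sketch below only assembles the argument lists from the flags shown above; the helper name `robustbench_cmd` is made up, and passing `--eps 4` is an assumption based on the $\varepsilon=\frac{4}{255}$ results reported in the summary.

```python
import shlex


def robustbench_cmd(ckpt, imagenet_root, eps):
    """Build the argument list for one ImageNet robustness evaluation run."""
    return [
        "python", "-m", "CLIP_eval.clip_robustbench",
        "--clip_model_name", "ViT-L-14",
        "--pretrained", ckpt,
        "--dataset", "imagenet",
        "--imagenet_root", imagenet_root,
        "--wandb", "False",
        "--norm", "linf",
        "--eps", str(eps),
    ]


commands = [
    robustbench_cmd("/path/to/ckpt.pt", "/path/to/imagenet", eps) for eps in (2, 4)
]
for cmd in commands:
    # Print the shell-quoted command line for inspection.
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually launch the evaluation.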

CLIP Zero-Shot

Set models to be evaluated in CLIP_benchmark/benchmark/models.txt and datasets in CLIP_benchmark/benchmark/datasets.txt (the datasets are downloaded from HuggingFace). Then run

cd CLIP_benchmark
./bash/run_benchmark_adv.sh

VLM Captioning and VQA

LLaVA

In /bash/llava_eval.sh, supply the paths for the datasets. The required annotation files for the datasets can be obtained from this HuggingFace repository. Set --vision_encoder_pretrained to openai or supply the path to a fine-tuned CLIP model checkpoint. Then run

./bash/llava_eval.sh

The LLaVA model will be automatically downloaded from HuggingFace.

OpenFlamingo

Download the OpenFlamingo 9B model, supply paths in /bash/of_eval_9B.sh and run

./bash/of_eval_9B.sh

VLM Stealthy Targeted Attacks

For targeted attacks on COCO, run

./bash/llava_eval_targeted.sh

For targeted attacks on self-selected images, set images and target captions in vlm_eval/run_evaluation_qualitative.py and run

python -m vlm_eval.run_evaluation_qualitative --precision float32 --attack apgd --eps 2 --steps 10000 --vlm_model_name llava --vision_encoder_pretrained openai --verbose

With 10,000 iterations it takes about 2 hours per image on an A100 GPU.
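
For planning purposes, the figure above works out to roughly 0.72 seconds per attack iteration. The sketch below extrapolates from it under the assumption that runtime scales linearly with the number of steps and images, which is an approximation; the names `estimated_hours` and `seconds_per_step` are illustrative.

```python
HOURS_PER_IMAGE = 2.0     # "about 2 hours per image" at the reference step count
REFERENCE_STEPS = 10_000  # value of --steps in the command above


def estimated_hours(num_images, steps=REFERENCE_STEPS):
    """Rough A100 runtime estimate, assuming linear scaling in steps and images."""
    return num_images * HOURS_PER_IMAGE * steps / REFERENCE_STEPS


seconds_per_step = HOURS_PER_IMAGE * 3600 / REFERENCE_STEPS

print(f"{seconds_per_step:.2f} s per step")
print(f"{estimated_hours(5):.1f} h for 5 images at 10,000 steps")
```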

POPE

./bash/eval_pope.sh openai  # for clean model evaluation
./bash/eval_pope.sh  # for robust model evaluation - add path_to_ckpt in bash file

SQA

./bash/eval_scienceqa.sh openai  # for clean model evaluation
./bash/eval_scienceqa.sh  # for robust model evaluation - add path_to_ckpt in bash file

Acknowledgements

This repository gratefully forks from

Citation

If you find this repository useful, please consider citing our paper:

@article{schlarmann2024robustclip,
    title={Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models}, 
    author={Christian Schlarmann and Naman Deep Singh and Francesco Croce and Matthias Hein},
    year={2024},
    journal={arXiv:2402.12336}
}
