
Co-Attack


This is the official PyTorch implementation of the paper "Towards Adversarial Attack on Vision-Language Pre-training Models", ACM Multimedia 2022.

Update 20/03/2023

To compute the attack success rate (ASR), first run with "--adv 0" to obtain the clean accuracy, then run with "--adv 4" to obtain the adversarial accuracy; the ASR is the clean accuracy minus the adversarial accuracy.
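
As a minimal illustration of that arithmetic (the accuracy values below are hypothetical placeholders; substitute the numbers printed by your own two runs):

# Hypothetical placeholder accuracies from the two evaluation runs.
clean_accuracy = 79.9   # reported by the "--adv 0" run, in %
adv_accuracy = 24.5     # reported by the "--adv 4" run, in %

# ASR is the accuracy drop caused by the attack.
asr = clean_accuracy - adv_accuracy
print(f"ASR = {asr:.1f}%")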

Update 29/11/2022

We have released the fine-tuned checkpoints (Baidu, password: iqvp) for the VE task on ALBEF and TCL. They serve not only as the attacked models in this paper but may also be useful for other studies.

Requirements

  • pytorch 1.10.2
  • transformers 4.8.1
  • timm 0.4.9
  • bert_score 0.3.11

Download

Evaluation

--adv   Instruction
0       No Attack
1       Attack Text
2       Attack Image
3       Attack Both (vanilla)
4       Co-Attack

When attacking the unimodal embeddings, using "--adv 4" without "--cls" will raise an expected error because the image embedding and the text embedding have different sequence lengths; see the sketch below.
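
For illustration, a minimal sketch of the shape mismatch (the embedding shapes below are assumptions for a typical ViT image encoder and BERT-style text encoder, not the exact values used in this repository):

import torch

# Assumed shapes: 576 image patches + [CLS] token vs. 30 text tokens, embedding dim 256.
image_embeds = torch.randn(1, 577, 256)
text_embeds = torch.randn(1, 30, 256)

# Token-wise combination fails because the sequence lengths differ (577 vs. 30):
# image_embeds + text_embeds  # would raise a shape-mismatch error

# With --cls, each modality is reduced to its single [CLS] embedding,
# so the two have matching shapes and can be compared or perturbed jointly.
image_cls = image_embeds[:, 0, :]  # (1, 256)
text_cls = text_embeds[:, 0, :]    # (1, 256)
print(image_cls.shape, text_cls.shape)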

Image-Text Retrieval

Download the MSCOCO or Flickr30k dataset from the original website.

# Attack Unimodal Embedding
python RetrievalEval.py --adv 4 --gpu 0 --cls \
--config configs/Retrieval_flickr.yaml \
--output_dir output/Retrieval_flickr \
--checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python RetrievalFusionEval.py ...

# Attack Clip Model
python RetrievalCLIPEval.py --adv 4 --gpu 0 --image_encoder ViT-B/16  ...

Visual Entailment

Download the SNLI-VE dataset from the original website.

# Attack Unimodal Embedding
python VEEval.py --adv 4 --gpu 0 --cls \
--config configs/VE.yaml \
--output_dir output/VE \
--checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python VEFusionEval.py ...

Visual Grounding

Download the MSCOCO dataset from the original website.

# Attack Unimodal Embedding
python GroundingEval.py --adv 4 --gpu 0 --cls \
--config configs/Grounding.yaml \
--output_dir output/Grounding \
--checkpoint [Finetuned checkpoint]

# Attack Multimodal Embedding
python GroundingFusionEval.py ...

Visualization

python visualization.py --adv 4 --gpu 0

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{zhang2022towards,
  title={Towards Adversarial Attack on Vision-Language Pre-training Models},
  author={Zhang, Jiaming and Yi, Qi and Sang, Jitao},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022}
}
