Visual Adversarial Examples Jailbreak
Aligned Large Language Models

Xiangyu Qi^1,* , Kaixuan Huang^1,* , Ashwinee Panda¹
Peter Henderson² , Mengdi Wang¹ , Prateek Mittal¹
^*Equal Contribution
Princeton University¹ Stanford University²

AAAI (Oral), 2024

arXiv

!!! Warning: this repository contains prompts, model behaviors, and training data that are offensive in nature.

Overview: A single visual adversarial example can jailbreak MiniGPT-4.

Note

For each instruction below, we've sampled 100 random outputs, calculating the refusal and obedience ratios via manual inspection. A representative, redacted output is showcased for each.

We use ɛ = 16/255 in the following demo.

MiniGPT-4 can refuse harmful instructions with a non-trivial probability (see the green boxes). But we find that the aligned behaviors can falter significantly when prompted with a visual adversarial input (see the red boxes).

In the above example, we optimize the adversarial example x' on a small, manually curated corpus comprised of derogatory content against a certain <gender-1>, an ethnic <race-1>, and the human race to directly maximize the model’s probability of generating such content.

Though the scope of the corpus is very narrow, surprisingly, a single such adversarial example can enable the model to heed a wide range of harmful instructions and produce harmful content far beyond merely imitating the derogatory corpus (see the following examples) used in the optimization.

Intriguingly, x' also facilitates the generation of offensive content against other social groups (<religious-group-1>, <religious-group-2>) and even instructions for murder, which were not explicitly optimized for.

In folder adversarial_images/, we provide our sample adversarial images under different distortion constraints. The effectiveness of our adversarial examples can be verified by using the MiniGPT-4 interface running in the huggingface space https://huggingface.co/spaces/Vision-CAIR/minigpt4.

Step-by-Step Instructions for Reimplementing Our Experiments on MiniGPT-4

Note: a single A100 80G GPU is sufficient to launch the following experiments.

Installation

We take MiniGPT-4 (13B) as the sandbox to showcase our attacks. The following installation instructions are adapted from the MiniGPT-4 repository.

1. Set up the environment

git clone https://github.com/Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models.git

cd Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

conda env create -f environment.yml
conda activate minigpt4

2. Prepare the pretrained weights for MiniGPT-4

As we directly inherit the MiniGPT-4 code base, the guide from the MiniGPT-4 repository can also be directly used to get all the weights.

Get Vicuna: MiniGPT-4 (13B) is built on the v0 version of Vicuna-13B. Please refer to this guide from the MiniGPT-4 repository to get the weights of Vicuna.

Then, set the path to the vicuna weight in the model config file here at Line 16.
Get MiniGPT-4 (the 13B version) checkpoint: download from here.

Then, set the path to the pretrained checkpoint in the evaluation config file in eval_configs/minigpt4_eval.yaml at Line 11.

Generate Visual Adversarial Examples

Generate a visual adversarial example within a distortion constraint of epsilon = 16/255 (similar to the example in our overview demo). The final adversarial examples will be saved to $save_dir/bad_prompt.bmp, and we also save intermediate checkpoints every 100 iterations.

The argument can be adjusted (e.g., --eps=32,64,128) to evaluate the effectiveness of the attacks under different distortion budgets.

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --constrained --eps 16 --alpha 1 --save_dir visual_constrained_eps_16

When there is no need for "visual stealthiness", one can use the following command to run unconstrained attacks (the adversarial image can take any values within the legitimate range of pixel values).

python minigpt_visual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000  --alpha 1 --save_dir visual_unconstrained

Evaluation

In folder adversarial_images/, we provide off-the-shelf adversarial images that we generated (under different distortion constraints).

To verify the effectiveness of our adversarial examples:

Play with the web-based interface of MiniGPT-4:
- Huggingface Space: https://huggingface.co/spaces/Vision-CAIR/minigpt4
- Launch a local demo:
```
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0
```
Testing on a diverse set of 40 manually curated harmful instructions

(Warning: this will involve materials that are offensive in nature)
```
python minigpt_test_manual_prompts_visual_llm.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --image_path  adversarial_images/prompt_unconstrained.bmp
```
The argument --image_path can be customized to the path of any input image.
Testing on the RealToxicityPrompts Dataset
- Download the RealToxicityPrompts dataset and copy/soft-link the dataset file to ./rtp_prompts.jsonl.
  
  The inference.py will read the dataset, filter out prompts with challenging = True and ask the model to generate the continuation for each prompt.
```
python minigpt_inference.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --image_file  adversarial_images/prompt_unconstrained.bmp --output_file result.jsonl
```
  The argument --image_file can be customized to the path of any input image, --output_file indicates where to save the generations.
- Request and place your Perspective API key in .perspective_api_key.
  
  The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.
```
python get_metric.py --input result.jsonl --output result_eval.jsonl
```

Generate Textual Adversarial Examples

We also provide codes for optimizing adversarial text tokens w.r.t. the same attack targets as our visual attacks. A running example:

python minigpt_textual_attack.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0 --n_iters 5000 --n_candidates 50 --save_dir textual_unconstrained

Attacks on Other Models

We also implement our attacks on two other open-sourced VLMs, including InstructBLIP and LLaVA. To launch experiments on these models, we suggest users to create a separate conda environment for each model and install dependencies following the instructions of the original repositories of these models.

Instruct BLIP

Set up the environment and prepare model checkpoints following the instructions of the original repository at: https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
- Choose the 13B version model built upon vicuna13b, i.e., blip2_vicuna_instruct-vicuna13b
- Download the version 1.1 model of vicuna-13b to the directory ./ckpts/vicuna-13b-v1.1

Attack

python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_16 --eps 16 --alpha 1
python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_32 --eps 32 --alpha 1
python -u instructblip_visual_attack.py --n_iters 5000 --constrained --save_dir results_blip_constrained_64 --eps 64 --alpha 1
python -u instructblip_visual_attack.py --n_iters 5000 --save_dir results_blip_unconstrained --alpha 1

Testing on the RealToxicityPrompts Dataset

Make inference on the dataset:
```
python instructblip_inference.py --image_file path_to_the_adversarial_example --output_file result.jsonl
```
The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.
```
python get_metric.py --input result.jsonl --output result_eval.jsonl
```
Then, you can run cal_metrics.py to summarize the evaluation results from the two evaluation:
```
python cal_metrics.py --input result_eval.jsonl
```

LLaVA-LLaMA-2

Set up the environment following the instructions of the original repository at: https://github.com/haotian-liu/LLaVA
- After get the model, save it to the path ./ckpts/llava_llama_2_13b_chat_freeze

Attack

python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_16 --eps 16 --alpha 1
python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_32 --eps 32 --alpha 1
python -u llava_llama_v2_visual_attack.py --n_iters 5000 --constrained --save_dir results_llava_llama_v2_constrained_64 --eps 64 --alpha 1
python -u llava_llama_v2_visual_attack.py --n_iters 5000 --save_dir results_llava_llama_v2_unconstrained --alpha 1

Testing on the RealToxicityPrompts Dataset

Make inference on the dataset:
```
python -u llava_llama_v2_inference.py --image_file path_to_the_adversarial_example --output_file result.jsonl
```
The get_metric.py will calculate the toxic scores using both Perspective API and Detoxify.
```
python get_metric.py --input result.jsonl --output result_eval.jsonl
```
Then, you can run cal_metrics.py to summarize the evaluation results from the two evaluation:
```
python cal_metrics.py --input result_eval.jsonl
```

Citation

If you find this useful in your research, please consider citing:

@misc{qi2023visual,
      title={Visual Adversarial Examples Jailbreak Aligned Large Language Models}, 
      author={Xiangyu Qi and Kaixuan Huang and Ashwinee Panda and Peter Henderson and Mengdi Wang and Prateek Mittal},
      year={2023},
      eprint={2306.13213},
      archivePrefix={arXiv},
      primaryClass={cs.CR}
}

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
adversarial_images		adversarial_images
assets		assets
blip_utils		blip_utils
eval_configs		eval_configs
harmful_corpus		harmful_corpus
lavis		lavis
llava_llama_2		llava_llama_2
llava_llama_2_utils		llava_llama_2_utils
metric		metric
minigpt4		minigpt4
minigpt_utils		minigpt_utils
README.md		README.md
cal_metrics.py		cal_metrics.py
demo.py		demo.py
environment.yml		environment.yml
get_metric.py		get_metric.py
instructblip_inference.py		instructblip_inference.py
instructblip_visual_attack.py		instructblip_visual_attack.py
llava_llama_v2_inference.py		llava_llama_v2_inference.py
llava_llama_v2_visual_attack.py		llava_llama_v2_visual_attack.py
minigpt_inference.py		minigpt_inference.py
minigpt_red_teaming.py		minigpt_red_teaming.py
minigpt_test_manual_prompts_text_llm.py		minigpt_test_manual_prompts_text_llm.py
minigpt_test_manual_prompts_visual_llm.py		minigpt_test_manual_prompts_visual_llm.py
minigpt_textual_attack.py		minigpt_textual_attack.py
minigpt_visual_attack.py		minigpt_visual_attack.py

Unispac/Visual-Adversarial-Examples-Jailbreak-Large-Language-Models

Folders and files

Latest commit

History

Repository files navigation

Visual Adversarial Examples JailbreakAligned Large Language Models

Step-by-Step Instructions for Reimplementing Our Experiments on MiniGPT-4

Installation

Generate Visual Adversarial Examples

Evaluation

Generate Textual Adversarial Examples

Attacks on Other Models

Instruct BLIP

LLaVA-LLaMA-2

Citation

About

Resources

Stars

Watchers

Forks

Languages

Visual Adversarial Examples Jailbreak
Aligned Large Language Models