
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

Blog

Installation

CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
pip install pyautogen pandas retry
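
Optionally, verify the installation with a quick check (prints the installed llama-cpp-python version):

python3 -c "import llama_cpp; print(llama_cpp.__version__)"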

Prepare Inference Service Using llama-cpp-python

Start the service on multiple GPUs (e.g., 0 and 1) with different ports (e.g., 9005 and 9006). The host name used in this example is dgxh-1.

export start_port=9005
export host_name=dgxh-1
for GPU in 0 1
do
    echo "Starting server on GPU $GPU on port $(($GPU + $start_port))"
    # use regular expression to replace the port number
    sed -i "s/\"port\": [0-9]\+/\"port\": $(($GPU + $start_port))/g" data/config/server_config.json
    HOST=$host_name CUDA_VISIBLE_DEVICES=$GPU python3 -m llama_cpp.server \
    --config_file data/config/server_config.json &
    sleep 5
done
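
The sed line above rewrites the "port" field in data/config/server_config.json before each server starts. For reference, that file follows llama-cpp-python's config-file schema and might look roughly like the sketch below; the model path and settings are placeholders, not the repository's actual values.

{
  "host": "0.0.0.0",
  "port": 9005,
  "models": [
    {
      "model": "/path/to/llama-2-13b-chat.Q4_K_M.gguf",
      "n_gpu_layers": -1,
      "n_ctx": 4096
    }
  ]
}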

Response Generation

The responses are generated by GPT-3.5-Turbo. Please fill in the token information in data/config/llm_config_list.json before running the following command.

python attack/attack.py

This command generates the responses using the combination-1 attack defined in data/prompt/attack_prompt_template.json. By default, it attacks 5 times using different seeds.
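
For reference, data/config/llm_config_list.json holds the model and API key entries; a minimal sketch, following the AutoGen config-list convention, might look like this (the exact keys in this repository may differ, and the values are placeholders):

[
  {
    "model": "gpt-3.5-turbo-1106",
    "api_key": "YOUR_OPENAI_API_KEY"
  },
  {
    "model": "gpt-4-1106-preview",
    "api_key": "YOUR_OPENAI_API_KEY"
  }
]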

Run Defense Experiments

The following command runs the 1-Agent, 2-Agent, and 3-Agent defense experiments. The path data/harmful_output/gpt-35-turbo-1106/attack-dan.json matches all JSON files starting with attack-dan in the folder data/harmful_output/gpt-35-turbo-1106/ (see the matching sketch below). You can also pass the full path of a single JSON file and change num_of_repetition in the code (default num_of_repetition=5). This value is ignored if multiple JSON files match the pattern.

python defense/run_defense_exp.py --model_list llama-2-13b \
--output_suffix _fp0-dan --temperature 0.7 --frequency_penalty 0.0 --presence_penalty 0.0 \
--eval_harm --chat_file data/harmful_output/gpt-35-turbo-1106/attack-dan.json \
--host_name dgxh-1 --port_start 9005 --num_of_instance 2

After the defense experiment finishes, the output appears in the data/defense_output/open-llm-defense_fp0-dan/llama-2-13b folder.
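
The prefix-style matching of attack-dan mentioned above could be implemented roughly as follows; this is an illustrative sketch, and defense/run_defense_exp.py may resolve the pattern differently.

import glob
import os

chat_file = "data/harmful_output/gpt-35-turbo-1106/attack-dan.json"
folder, name = os.path.split(chat_file)
stem = name[:-len(".json")]  # "attack-dan"

# Expand to every JSON file in the folder whose name starts with the stem
# (the actual file names are whatever attack.py produced).
matched_files = sorted(glob.glob(os.path.join(folder, stem + "*.json")))
print(matched_files)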

GPT-4 Evaluation

Evaluating harmful output defense:

python evaluator/gpt4_evaluator.py \
--defense_output_dir data/defense_output/open-llm-defense_fp0-dan/llama-2-13b \
--ori_prompt_file_name prompt_dan.json

After the evaluation finishes, the results appear in data/defense_output/open-llm-defense_fp0-dan/llama-2-13b/asr.csv. A score value is also added to each defense output in the output JSON file. Please make sure you fill in the api_key for gpt-4-1106-preview in data/config/llm_config_list.json before running the command above. GPT-4 evaluation is very costly; we have enabled a cache mechanism to avoid repeated evaluations.

For safe response evaluation, there is an efficient way that does not require GPT-4. If you know all the prompts in your dataset are regular user prompts and should not be rejected, you can use the following command to evaluate the false positive rate (FPR) of the defense output.

python evaluator/evaluate_safe.py

This will find all output folders in data/defense_output whose names contain the keyword -safe and evaluate the false positive rate (FPR). The FPR will be saved in the data/defense_output/defense_fp.csv file.
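
As a rough illustration of what the FPR computation amounts to, the sketch below counts refusals among defense outputs for prompts that are all assumed to be safe. The record layout and the is_rejected field name are hypothetical; see evaluator/evaluate_safe.py for the actual logic.

import glob
import json
import os

def false_positive_rate(defense_output_dir: str) -> float:
    """FPR = (# safe prompts the defense rejected) / (# safe prompts)."""
    rejected, total = 0, 0
    for path in glob.glob(os.path.join(defense_output_dir, "*.json")):
        with open(path) as f:
            records = json.load(f)  # hypothetical: a list of per-prompt results
        for record in records:
            total += 1
            rejected += bool(record.get("is_rejected"))  # hypothetical field name
    return rejected / total if total else 0.0

# Example (hypothetical path containing the "-safe" keyword):
# print(false_positive_rate("data/defense_output/open-llm-defense-safe/llama-2-13b"))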
