Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

This is the official implementation for the paper Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing.

Environment Setup

  1. Create the conda environment from environment.yaml.

  2. Download the pre-trained LLMs into the model_weights directory.

  3. Specify the path to the model weights in config/llm/${MODELNAME}.yaml; see config/llm/vicuna.yaml for an example, and the setup sketch after this list.
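A minimal setup sketch, assuming Vicuna-13B as the target model (the environment name, HuggingFace repo, and local paths below are illustrative; check environment.yaml and the config files for the actual values):

```bash
# 1. Create and activate the conda environment
#    (the env name is an assumption; see the `name:` field in environment.yaml).
conda env create -f environment.yaml
conda activate semanticsmooth

# 2. Download pre-trained weights into model_weights/,
#    e.g. Vicuna-13B from HuggingFace.
huggingface-cli download lmsys/vicuna-13b-v1.5 --local-dir model_weights/vicuna-13b-v1.5
```

The LLM config then points at the downloaded weights. The field names below are guesses; config/llm/vicuna.yaml is the authoritative example:

```yaml
# config/llm/vicuna.yaml -- hypothetical sketch; see the real file for exact fields.
model_name: vicuna
model_path: model_weights/vicuna-13b-v1.5
```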

Our implementation also supports calling a local vLLM server to speed up generation; see language_models.py:VirtualLLM and config/llm/vicuna-vllm.yaml for details.
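One common way to launch such a server is vLLM's OpenAI-compatible API entrypoint (a sketch; the model path and the port VirtualLLM expects are assumptions, see config/llm/vicuna-vllm.yaml):

```bash
# Serve the local weights through vLLM's OpenAI-compatible API server.
# Path and port are illustrative; match them to config/llm/vicuna-vllm.yaml.
python -m vllm.entrypoints.openai.api_server \
    --model model_weights/vicuna-13b-v1.5 \
    --port 8000
```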

Usage

For the transfer attack experiments:

python transfer_attack.py llm=${LLM} task=${TASK} defense=${DEFENSE}

For the adaptive attack experiments:

python adaptive_attack.py llm=${LLM} attack=${ATTACK} defense=${DEFENSE}

For training the dynamic selection policy:

python train_selector.py llm=${LLM}

Here, LLM specifies the target LLM; the corresponding config file in the config/llm folder is loaded. TASK, ATTACK, and DEFENSE work the same way with their respective config files.
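For example, concrete runs might look like the following (the task, attack, and defense names here are illustrative; they must match YAML filenames under the config directory):

```bash
# Transfer attack: evaluate a defense against pre-generated attack prompts.
python transfer_attack.py llm=vicuna task=advbench defense=semanticsmooth

# Adaptive attack: run an attack (e.g., PAIR) against a defended model.
python adaptive_attack.py llm=vicuna attack=pair defense=semanticsmooth
```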

Complete configs with detailed comments are in the config directory; please check them out.

Citation

If you find this work useful, please cite the following paper:

@article{ji2024defending,
  title   = {Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing},
  author  = {Jiabao Ji and Bairu Hou and Alexander Robey and George J. Pappas and Hamed Hassani and Yang Zhang and Eric Wong and Shiyu Chang},
  year    = {2024},
  journal = {arXiv preprint arXiv:2402.16192}
}

@article{robey2023smoothllm,
  title   = {SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
  author  = {Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J},
  year    = {2023},
  journal = {arXiv preprint arXiv:2310.03684}
}

Huge thanks to the following repos, which greatly helped our implementation:
