Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

This is the official implementation for the paper Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing.

Environment Setup

  1. Create the conda environment from environment.yaml.

  2. Download the pre-trained LLMs into the model_weights directory.

  3. Specify the path to the model weights in config/llm/${MODELNAME}.yaml; see config/llm/vicuna.yaml for an example, and the setup sketch after this list.
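A minimal setup sketch, assuming Vicuna-13B as the target model (the environment name, HuggingFace repo, and local paths below are illustrative; check environment.yaml and the config files for the actual values):

```bash
# 1. Create and activate the conda environment
#    (the env name is an assumption; see the `name:` field in environment.yaml).
conda env create -f environment.yaml
conda activate semanticsmooth

# 2. Download pre-trained weights into model_weights/,
#    e.g. Vicuna-13B from HuggingFace.
huggingface-cli download lmsys/vicuna-13b-v1.5 --local-dir model_weights/vicuna-13b-v1.5
```

The LLM config then points at the downloaded weights. The field names below are guesses; config/llm/vicuna.yaml is the authoritative example:

```yaml
# config/llm/vicuna.yaml -- hypothetical sketch; see the real file for exact fields.
model_name: vicuna
model_path: model_weights/vicuna-13b-v1.5
```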

Our implementation also supports calling a local vLLM server to speed up generation; see language_models.py:VirtualLLM and config/llm/vicuna-vllm.yaml for details.
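One common way to launch such a server is vLLM's OpenAI-compatible API entrypoint (a sketch; the model path and the port VirtualLLM expects are assumptions, see config/llm/vicuna-vllm.yaml):

```bash
# Serve the local weights through vLLM's OpenAI-compatible API server.
# Path and port are illustrative; match them to config/llm/vicuna-vllm.yaml.
python -m vllm.entrypoints.openai.api_server \
    --model model_weights/vicuna-13b-v1.5 \
    --port 8000
```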

Usage

For the transfer attack experiments:

python transfer_attack.py llm=${LLM} task=${TASK} defense=${DEFENSE}

For the adaptive attack experiments:

python adaptive_attack.py llm=${LLM} attack=${ATTACK} defense=${DEFENSE}

For training the dynamic selection policy:

python train_selector.py llm=${LLM}

Here, LLM specifies the target LLM; the corresponding config file in the config/llm folder is loaded. TASK, ATTACK, and DEFENSE work the same way with their respective config files.
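For example, concrete runs might look like the following (the task, attack, and defense names here are illustrative; they must match YAML filenames under the config directory):

```bash
# Transfer attack: evaluate a defense against pre-generated attack prompts.
python transfer_attack.py llm=vicuna task=advbench defense=semanticsmooth

# Adaptive attack: run an attack (e.g., PAIR) against a defended model.
python adaptive_attack.py llm=vicuna attack=pair defense=semanticsmooth
```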

Complete configs with detailed comments are in the config directory; please check them out.

Citation

If you find this work useful, please cite the following paper:

@article{ji2024defending,
  title   = {Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing},
  author  = {Jiabao Ji and Bairu Hou and Alexander Robey and George J. Pappas and Hamed Hassani and Yang Zhang and Eric Wong and Shiyu Chang},
  year    = {2024},
  journal = {arXiv preprint arXiv:2402.16192}
}

@article{robey2023smoothllm,
  title   = {SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks},
  author  = {Robey, Alexander and Wong, Eric and Hassani, Hamed and Pappas, George J},
  year    = {2023},
  journal = {arXiv preprint arXiv:2310.03684}
}

Huge thanks to the following repos, which greatly helped our implementation:
