This is the official implementation of "Self-Destructive Language Models" (ICLR 2026).
Create a virtual environment using Conda:

```bash
conda create -n seam python=3.12.3
conda activate seam
```

Install the required packages:

```bash
pip install -r requirements.txt
```

Set your OpenAI API key in the environment:

```bash
export OPENAI_API_KEY=<your_api_key>
```

- Request access to meta-llama/Llama-2-7b-chat-hf
- Set your Huggingface API token:

```bash
export HUGGINGFACE_API_TOKEN=<your_api_key>
```

Set up Weights & Biases for experiment tracking:

```bash
export WANDB_API_KEY=<your_api_key>
wandb login
```

Option 1: Download the dataset used in the RepNoise paper: get the dataset from the representation-noising repository and place it in the `./data` folder.

Option 2: Generate the dataset:

```bash
bash scripts/create_dataset.sh
```

Then update the data folder path in `src/data_processing/_datasets.py`.
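Before launching any scripts, you may want to verify that the API keys from the setup steps above are actually set. A minimal sketch (the variable names come from the instructions above; the helper itself is not part of this repo):

```python
import os

# Environment variables named in the setup instructions above.
REQUIRED_VARS = ["OPENAI_API_KEY", "HUGGINGFACE_API_TOKEN", "WANDB_API_KEY"]

def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    if env is None:
        env = os.environ
    return [v for v in REQUIRED_VARS if not env.get(v)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```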
Set the configuration for training and evaluation (e.g., the `learning_rate`). Configuration priority is as follows (from lowest to highest):

- Default parameters in `HuggingfaceTrainer`
- Settings in YAML files under the `./config` folder
- Command-line arguments in the training or evaluation scripts
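The priority order above can be sketched as a simple dictionary merge, where each higher-priority source overrides the one below it. This is only an illustration of the precedence rule, not the repo's actual config loader:

```python
def merge_configs(defaults, yaml_cfg, cli_args):
    """Merge config sources; later (higher-priority) sources win.

    `None` CLI values are treated as "not provided" so they do not
    clobber YAML or default settings.
    """
    merged = dict(defaults)                 # lowest priority: trainer defaults
    merged.update(yaml_cfg)                 # middle: YAML files under ./config
    merged.update(                          # highest: command-line arguments
        {k: v for k, v in cli_args.items() if v is not None}
    )
    return merged

# Hypothetical values for illustration only.
defaults = {"learning_rate": 5e-5, "epochs": 3}
yaml_cfg = {"learning_rate": 2e-5}
cli_args = {"epochs": 5, "learning_rate": None}
print(merge_configs(defaults, yaml_cfg, cli_args))
# -> {'learning_rate': 2e-05, 'epochs': 5}
```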
Run the training script:

```bash
bash scripts/train.sh
```

- Update the trained model path in `scripts/eval.sh`
- Run the evaluation script:

```bash
bash scripts/eval.sh
```

This codebase builds on the representation-noising repository. Thanks to the authors for their great work!
If you find this codebase useful, please cite our paper:
@inproceedings{
wang2026selfdestructive,
title={Self-Destructive Language Models},
author={Yuhui Wang and Rongyi Zhu and Ting Wang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=ERNpUGr8M5}
}
