This software project accompanies the research paper, Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models.
An important issue with Large Language Models (LLMs) is their undesired ability to generate toxic language. In this work, we show that the neurons responsible for toxicity can be determined by their power to discriminate toxic sentences, and that toxic language can be mitigated by reducing their activation levels proportionally to this power. We propose AurA, an intervention that can be applied to any pre-trained LLM to mitigate toxicity. As the intervention is proportional to the ability of each neuron to discriminate toxic content, it is free of any model-dependent hyperparameters. We show that AurA can achieve up to
```bash
git clone https://github.com/apple/ml-aura.git
pip install -r requirements.txt
```

Optionally, install this repository as a package:

```bash
pip install -e .
```

The Jigsaw dataset can be found on Kaggle.
Let's say you save the dataset in `DATA_DIR`. Your filesystem should then look as follows:

```
> ls $DATA_DIR/jigsaw
train.csv
test.csv
...
```

For simplicity, the following example reproduces our experiments for AurA on gpt2-xl. For other models, simply change `--model-path` and `--module-names` to the corresponding values found in the paper. Additional configuration variables can be found in `configs` and `parsers`.
Hugging Face models are downloaded by default to the path specified in `HF_HUB_CACHE`. For more information, visit the official Hugging Face website.
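The `--module-names` arguments are wildcard patterns over module names. As an illustration (not the repository's implementation), such patterns can be matched against a model's named modules with Python's `fnmatch`:

```python
from fnmatch import fnmatch

# Hypothetical module names, mimicking GPT-2's naming scheme.
all_modules = [
    "transformer.h.0.mlp.c_fc",
    "transformer.h.0.mlp.c_proj",
    "transformer.h.0.attn.c_attn",
    "transformer.h.1.mlp.c_fc",
]

patterns = ["transformer.h.*.mlp.c_fc", "transformer.h.*.mlp.c_proj"]

# Keep every module whose name matches at least one pattern.
matched = [name for name in all_modules if any(fnmatch(name, p) for p in patterns)]
print(matched)
```

Here the attention projection is filtered out and only the MLP layers are selected.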
```bash
python -m scripts.compute_responses \
--config-path configs/responses.yaml \
--data-dir $DATA_DIR \
--device cpu \
--model-path openai-community/gpt2 \
--module-names 'transformer.h.*.mlp.c_fc' 'transformer.h.*.mlp.c_proj' \
--tag toxicity-responses \
--verbose 1
```

The output will be written in the following folder structure:

```
<responses-cache-dir>/<tag>/<model-name>/<dataset>/<subset>/<module-names>/<pooling-op>/<sample_idx>.pt
```

By default, `args.responses-cache-dir` is set to `/tmp/cache`.
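The `<pooling-op>` path component refers to how token-level activations are reduced to a single response per neuron. A minimal sketch, assuming a `max` pooling op over token positions (the array shapes and values are illustrative, not taken from the repository):

```python
import numpy as np

# Hypothetical activations for one sample: (num_tokens, num_neurons).
activations = np.array([
    [0.1, -0.3, 2.0],
    [0.9,  0.2, 0.5],
    [0.4,  0.1, 1.5],
])

# A "max" pooling op keeps each neuron's strongest activation across tokens,
# yielding one response vector per sample.
response = activations.max(axis=0)
print(response)  # [0.9 0.2 2. ]
```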
Note that most of the configuration is already encapsulated in `configs/aura.yaml`.
```bash
python -m scripts.learn_aura \
--config-path configs/aura.yaml \
--module-names 'transformer.h.*.mlp.c_fc' 'transformer.h.*.mlp.c_proj'
```

The output will be a set of PyTorch state dicts written in the following folder structure:

```
<interventions-cache-dir>/<intervention-name>-<tag>-<pooling-op>/<model-name>/<module-name>.statedict
```

By default, `args.interventions-cache-dir` is set to `/tmp/cache/model-interventions`.
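Conceptually, this step scores each neuron by how well its responses discriminate toxic from non-toxic sentences (AUROC) and derives a per-neuron dampening gain from that score. The sketch below illustrates the idea only; it is not the repository's code, and the exact dampening rule is defined in the paper. It assumes a linear gain that leaves non-expert neurons (AUROC ≤ 0.5) untouched and fully mutes a perfect expert (AUROC = 1):

```python
import numpy as np

def auroc(pos, neg):
    """Rank-based AUROC (Mann-Whitney U), ignoring ties for simplicity."""
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1  # ranks starting at 1
    n_pos, n_neg = len(pos), len(neg)
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def dampening_gain(auc):
    # Illustrative linear rule: gain 1.0 at AUROC <= 0.5, down to 0.0 at AUROC = 1.0.
    return 1.0 - 2.0 * max(0.0, auc - 0.5)

# Hypothetical responses of one neuron on toxic vs. non-toxic sentences.
toxic = np.array([3.0, 4.0, 5.0])
non_toxic = np.array([0.5, 1.0, 2.0])

auc = auroc(toxic, non_toxic)   # 1.0: this neuron perfectly separates the classes
gain = dampening_gain(auc)      # 0.0: its activation would be fully suppressed
print(auc, gain)
```

Since the gain depends only on each neuron's discriminative power, no model-dependent hyperparameter is needed.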
```bash
python -m scripts.generate_with_hooks \
--intervention-name aura \
--intervention-state-path /tmp/cache/model-interventions/aura-toxicity-max/gpt2 \
--model-path openai-community/gpt2 \
--device cpu \
--verbose 1 \
--module-names 'transformer.h.*.mlp.c_fc' 'transformer.h.*.mlp.c_proj'
```

We include pytest unit tests to verify the integrity of the code.
```bash
pytest .
```

If you find this work useful, please cite our paper:

```bibtex
@inproceedings{suau2024whispering,
  title={Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models},
  author={Xavier Suau and Pieter Delobelle and Katherine Metcalf and Armand Joulin and Nicholas Apostoloff and Luca Zappella and Pau Rodriguez},
  booktitle={Forty-first International Conference on Machine Learning},
  year={2024},
  url={https://openreview.net/forum?id=2P6GVfSrfZ}
}
```

Contact: Xavier Suau Cuadros (xsuaucuadros@apple.com)