Inspecting and Editing Knowledge Representations in Language Models
Evan Hernandez, Belinda Z. Li, Jacob Andreas.
This repository provides an implementation of the Representation Mediation (REMEDI) method for autoregressive transformer language models.
All code is tested on macOS Ventura (>= 13.1) and Ubuntu 20.04 using Python >= 3.10. It uses a lot of newer Python features, so the Python version is a strict requirement.
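Because the Python version is a hard requirement, it can help to check the interpreter before installing anything. The following is a minimal sketch, not part of this repository; the helper name python_is_supported is hypothetical.

```python
import sys

# Minimum interpreter version stated in this README.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if python_is_supported() else "too old for this project"
    print(f"Python {found}: {status}")
```

Run it with the interpreter from the environment you plan to use, so that the check reflects the Python that will actually execute the scripts.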
To run the code, create a virtual environment with the tool of your choice, e.g. conda:
conda create --name remedi python=3.10
Then, after entering the environment, install the project dependencies:
python -m pip install invoke
invoke install
We cannot re-release the datasets used in the paper. However, you can download the raw datasets yourself and point our code to them:
CounterFact: Available on the ROME website. Note that our code will automatically download this specific dataset for you.
Bias in Bios: Must be downloaded using the official code release. When running a REMEDI script, set --dataset-file <pkl file> to point to the resulting pickle file.
McRae Norms: Download the supplemental material of this paper and set --dataset-file <path to download>/CONCS_FEATS_concstats_brm.txt
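Before passing a path to --dataset-file, you may want to confirm that the download produced a readable file. The sketch below is an illustration only and is not how this repository loads its data. The helper peek_dataset_file is hypothetical, and it assumes the pickle file holds a sized container and that the text file is tab-delimited with a header row.

```python
import pickle
from pathlib import Path


def peek_dataset_file(path):
    """Return a short description of a downloaded dataset file."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"dataset file not found: {path}")

    if path.suffix == ".pkl":
        # Only unpickle files you produced yourself: pickle can run
        # arbitrary code while loading.
        with path.open("rb") as handle:
            data = pickle.load(handle)
        size = len(data) if hasattr(data, "__len__") else None
        return {"kind": "pickle", "type": type(data).__name__, "size": size}

    # Assumption: a tab-delimited text file whose first row is a header.
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
    return {"kind": "text", "columns": header}
```

If the call raises or reports an unexpected type, the download step probably did not complete, and the REMEDI scripts are unlikely to fare better with the same file.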
All experiments from the paper can be run through invoke. To see the full list, run:
invoke --list
Any task prefixed with x. corresponds to an experiment. The invoke scripts have the hyperparameters from the paper baked into them. Most experiments support two flags: --device to specify the GPU, and --model to specify which LM to use (default: GPT-J).
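If you want only the experiment tasks, the output of invoke --list can be filtered by the x. prefix. This is a rough sketch that assumes the usual invoke listing layout, where each task sits on its own indented row with the task name first. The function experiment_tasks is hypothetical, and the sample listing contains only task names mentioned in this README, not the full list.

```python
def experiment_tasks(listing: str, prefix: str = "x.") -> list[str]:
    """Extract task names starting with `prefix` from `invoke --list` output."""
    tasks = []
    for line in listing.splitlines():
        fields = line.split()
        # Task rows are indented; the first field is the task name.
        if line.startswith(" ") and fields and fields[0].startswith(prefix):
            tasks.append(fields[0])
    return tasks


# Illustrative listing, not real output from this repository.
SAMPLE_LISTING = """Available tasks:

  install
  presubmit
  x.eval.gen.counterfact
  x.train.counterfact
"""

if __name__ == "__main__":
    for task in experiment_tasks(SAMPLE_LISTING):
        print(task)
```

To use it on the real listing, capture the output with subprocess.run(["invoke", "--list"], capture_output=True, text=True) and pass its stdout to experiment_tasks.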
The code supports training editors for most GPT variants: GPT2*, GPT-J, and GPT-NeoX (though GPT-NeoX with gradients is too large for most single GPUs). In principle, the code can also support any other autoregressive transformer LM, but you may need to slightly modify parts of determine_hidden_size and determine_layers inside the models module.
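To give a sense of what adding a new architecture might involve, here is a sketch of how the hidden size and layer indices can be read from a Hugging Face style config object. It is not the code in the models module, and the real determine_hidden_size and determine_layers may take different arguments. The names sketch_hidden_size and sketch_layers are hypothetical. The sketch relies on GPT-2 and GPT-J configs exposing n_embd and n_layer, and GPT-NeoX configs exposing hidden_size and num_hidden_layers.

```python
from types import SimpleNamespace

# Candidate attribute names, tried in order.
HIDDEN_SIZE_ATTRS = ("n_embd", "hidden_size")
NUM_LAYERS_ATTRS = ("n_layer", "num_hidden_layers")


def _first_attr(config, names):
    """Return the first attribute in `names` that the config defines."""
    for name in names:
        value = getattr(config, name, None)
        if value is not None:
            return value
    raise ValueError(f"config defines none of: {', '.join(names)}")


def sketch_hidden_size(config) -> int:
    """Hidden (residual stream) dimension of the model."""
    return int(_first_attr(config, HIDDEN_SIZE_ATTRS))


def sketch_layers(config) -> list[int]:
    """Indices of all transformer layers, starting from 0."""
    return list(range(int(_first_attr(config, NUM_LAYERS_ATTRS))))


if __name__ == "__main__":
    # Stand-in for a GPT-2 small config (768-dimensional, 12 layers).
    gpt2_like = SimpleNamespace(n_embd=768, n_layer=12)
    print(sketch_hidden_size(gpt2_like), sketch_layers(gpt2_like))
```

Supporting a model whose config uses other attribute names would then amount to extending the two tuples, which is probably close in spirit to the small modifications mentioned above.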
To run training with the default configuration, use invoke, e.g.:
invoke x.train.counterfact --device cuda
For more fine-grained control over the hyperparameters, run the training script directly, e.g.:
python -m scripts.train_editors \
-n my_custom_editors \
-m gptj \
-d counterfact \
-l 0 1 2 \
--lam-kl 100 \
--device cuda
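If you want to try several hyperparameter values, the direct invocation above can be wrapped in a small driver. The sketch below uses only the flags shown in that example. The helper names, the run-name pattern, and the swept --lam-kl values are illustrative assumptions, not settings from the paper.

```python
import shlex
import subprocess
import sys


def build_train_command(name, model, dataset, layers, lam_kl, device):
    """Build the argument list for one scripts.train_editors run."""
    return [
        sys.executable, "-m", "scripts.train_editors",
        "-n", name,
        "-m", model,
        "-d", dataset,
        "-l", *[str(layer) for layer in layers],
        "--lam-kl", str(lam_kl),
        "--device", device,
    ]


def run_sweep(lam_kl_values, dry_run=True):
    """Train one set of editors per --lam-kl value; returns the commands."""
    commands = []
    for lam_kl in lam_kl_values:
        command = build_train_command(
            name=f"my_custom_editors_kl{lam_kl}",  # hypothetical naming scheme
            model="gptj",
            dataset="counterfact",
            layers=[0, 1, 2],
            lam_kl=lam_kl,
            device="cuda",
        )
        commands.append(command)
        if dry_run:
            print(shlex.join(command))  # show the command without running it
        else:
            subprocess.run(command, check=True)  # stop on the first failure
    return commands


if __name__ == "__main__":
    run_sweep([10, 100])
```

With dry_run=True the driver only prints the commands, which makes it easy to review them before committing GPU time. Run it from the repository root so that python -m scripts.train_editors resolves.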
The help strings for each command contain most of what you need to know.
After training editors, you can evaluate them on any of the benchmarks considered in the paper. If you trained them via invoke, this is as simple as running another invoke command, typically one prefixed with x.eval, e.g.:
invoke x.eval.gen.counterfact --device cuda
...which evaluates REMEDI on generation quality in CounterFact.
Alternatively, as before, you can call the evaluation scripts directly:
python -m scripts.eval_fact_gen \
-n my_custom_eval \
-e results/my_custom_editors \
-m gptj \
-l 1 \
--device cuda
While this library is not designed for industrial use (it's just a research project), we do believe research code should support reproducibility. If you have issues running our code in the supported environment, please open an issue on this repository.
If you find ways to improve our code, you may also submit a pull request. Before doing so, please ensure that the code type checks, lints cleanly, and passes all unit tests. The following command should exit cleanly:
invoke presubmit
@InProceedings{hernandez2023remedi,
title = {Inspecting and Editing Knowledge Representations in Language Models},
author = {Hernandez, Evan and Li, Belinda Z. and Andreas, Jacob},
booktitle = {Arxiv},
year = {2023},
url = {https://arxiv.org/abs/2304.00740}
}