Investigating the generalization behavior of LM probes trained to Elicit Latent Knowledge.
- from truthful to untruthful personas
- from easy questions to hard
We release 96 "quirky" language models that are LoRA finetuned to make systematic errors when answering questions if and only if the keyword "Bob" is present in the prompt. This repository contains the code to train and use these models to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading.
We also release (various subsets of) the quirky datasets.
elk_generalization/datasets/create_datasets.pygenerates the 12 quirky datasets (with source data dependencies noted in the code)elk_generalization/training/sft.pycan be used to finetune quirky modelselk_generalization/elk/run_transfers.pycan be used to probe models and get output (extract_hiddens.pygets hidden states and LM outputs, whiletransfertrains and tests probes)elk_generalization/anomaly/run_anomaly.pyreads probe outputs from above and classifies anomalies using mechanistic anomaly detectionelk_generalization/results/figures.ipynbcan be used to reproduce our figures
ArXiv: https://arxiv.org/abs/2312.01037
Cite:
@misc{mallen2023eliciting,
title={Eliciting Latent Knowledge from Quirky Language Models},
author={Alex Mallen and Nora Belrose},
year={2023},
eprint={2312.01037},
archivePrefix={arXiv},
primaryClass={cs.LG}
}