Skip to content

evelynmitchell/activation_oracles

 
 

Repository files navigation

Activation Oracles

This repository contains the code for the Activation Oracles paper.

Overview

Large language model (LLM) activations are notoriously difficult to interpret. Activation Oracles take a simpler approach: they are LLMs trained to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language.

Installation

uv sync
source .venv/bin/activate
huggingface-cli login --token <your_token>

Quick Start: Demo

The easiest way to get started is with our demo notebook (Colab | Local), which demonstrates:

  • Extracting hidden information (secret words) from fine-tuned models
  • Detecting model goals without observing responses
  • Analyzing emotions and reasoning in model activations

The Colab version runs on a free T4 GPU. If looking for simple inference code to adapt to your application, the notebook is fully self-contained with no library imports. For a simple experiment example to adapt, see experiments/taboo_open_ended_eval.py.

Pre-trained Models

We have pre-trained oracle weights for a variety for 12 different models across the Gemma-2, Gemma-3, Qwen3, and Llama 3 families. They are available on Hugging Face: Activation Oracles Collection

The wandb eval / loss logs for these models are available here. Note that the smaller models (1-4B) tend to have worse OOD eval performance, so I'm not sure how well they will work.

Training

To train an Activation Oracle, use the training script with torchrun:

torchrun --nproc_per_node=<NUM_GPUS> nl_probes/sft.py

By default, this trains a full Activation Oracle on Qwen3-8B using a diverse mixture of training tasks:

  • System prompt question-answering (LatentQA)
  • Binary classification tasks
  • Self-supervised context prediction

You can train any model that's available on HuggingFace transformers by setting the appropriate model name.

Training configuration can be modified in nl_probes/configs/sft_config.py.

Reproducing Paper Experiments

To replicate the evaluation results from the paper, run:

bash experiments/paper_evals.sh

This runs evaluations on five downstream tasks:

  • Gender (Secret Keeping Benchmark)
  • Taboo (Secret Keeping Benchmark)
  • Secret Side Constraint (SSC, Secret Keeping Benchmark)
  • Classification
  • PersonaQA

Citation

If you use this code in your research, please cite our paper:

@misc{karvonen2025activationoraclestrainingevaluating,
      title={Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers}, 
      author={Adam Karvonen and James Chua and Clément Dumas and Kit Fraser-Taliente and Subhash Kantamneni and Julian Minder and Euan Ong and Arnab Sen Sharma and Daniel Wen and Owain Evans and Samuel Marks},
      year={2025},
      eprint={2512.15674},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.15674}, 
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 88.7%
  • Jupyter Notebook 11.2%
  • Shell 0.1%