Activation Oracles

This repository contains the code for the Activation Oracles paper.

Overview

Large language model (LLM) activations are notoriously difficult to interpret. Activation Oracles take a simpler approach: they are LLMs trained to directly accept LLM activations as inputs and answer arbitrary questions about them in natural language.
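As a rough picture of what "accepting activations as inputs" can mean, the toy sketch below swaps the input vector at a placeholder position for an activation vector taken from another model. Everything in it is an illustrative assumption rather than the repository's implementation: the `<ACT>` placeholder token, the `inject_activations` helper, and the replace-in-place injection are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Vector = List[float]

# Hypothetical placeholder token marking where an activation is injected.
PLACEHOLDER = "<ACT>"


def inject_activations(
    tokens: List[str],
    embeddings: List[Vector],
    activations: List[Vector],
) -> List[Vector]:
    """Return a copy of `embeddings` with placeholder positions replaced by activations."""
    slots = [i for i, tok in enumerate(tokens) if tok == PLACEHOLDER]
    if len(slots) != len(activations):
        raise ValueError("need exactly one activation per placeholder token")
    out = [list(vec) for vec in embeddings]  # copy so the caller's data is untouched
    for pos, act in zip(slots, activations):
        if len(act) != len(out[pos]):
            raise ValueError("activation width must match embedding width")
        out[pos] = list(act)
    return out


# Toy oracle prompt: one activation slot followed by a natural-language question.
tokens = [PLACEHOLDER, "What", "is", "the", "secret", "word", "?"]
embeddings = [[0.0, 0.0] for _ in tokens]
oracle_inputs = inject_activations(tokens, embeddings, [[0.5, -1.0]])
```

The oracle would then be run on `oracle_inputs` and asked to answer the question in text. See the demo notebook for the inference code actually used.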

Installation

uv sync
source .venv/bin/activate
huggingface-cli login --token <your_token>

Quick Start: Demo

The easiest way to get started is with our demo notebook (Colab | Local), which demonstrates:

Extracting hidden information (secret words) from fine-tuned models
Detecting model goals without observing responses
Analyzing emotions and reasoning in model activations

The Colab version runs on a free T4 GPU. If you are looking for simple inference code to adapt to your application, the notebook is fully self-contained with no library imports. For a simple experiment to adapt, see experiments/taboo_open_ended_eval.py.

Pre-trained Models

We provide pre-trained oracle weights for 12 models across the Gemma-2, Gemma-3, Qwen3, and Llama 3 families. They are available on Hugging Face: Activation Oracles Collection

The wandb eval and loss logs for these models are available here. Note that the smaller models (1-4B) tend to have worse out-of-distribution (OOD) eval performance, so it is unclear how well they will work in practice.

Training

To train an Activation Oracle, use the training script with torchrun:

torchrun --nproc_per_node=<NUM_GPUS> nl_probes/sft.py

By default, this trains a full Activation Oracle on Qwen3-8B using a diverse mixture of training tasks:

System prompt question-answering (LatentQA)
Binary classification tasks
Self-supervised context prediction

You can train any model available in Hugging Face Transformers by setting the appropriate model name.

Training configuration can be modified in nl_probes/configs/sft_config.py.
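If you launch training from a Python script (for example, a job scheduler wrapper), the torchrun command above can be assembled programmatically. This is a minimal sketch: the `build_train_command` helper is hypothetical and not part of the repository. It only reproduces the documented command, with `<NUM_GPUS>` filled in.

```python
import shlex
from typing import List


def build_train_command(num_gpus: int, script: str = "nl_probes/sft.py") -> List[str]:
    """Build the torchrun argument list for the training script (hypothetical helper)."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ["torchrun", f"--nproc_per_node={num_gpus}", script]


command = build_train_command(num_gpus=2)
# Print a copy-pasteable shell command; to launch it, pass `command` to
# subprocess.run(command, check=True) from the repository root.
print(shlex.join(command))
```

Returning an argument list rather than a single string avoids shell quoting issues when the command is passed to `subprocess.run`.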

Reproducing Paper Experiments

To replicate the evaluation results from the paper, run:

bash experiments/paper_evals.sh

This runs evaluations on five downstream tasks:

Gender (Secret Keeping Benchmark)
Taboo (Secret Keeping Benchmark)
Secret Side Constraint (SSC, Secret Keeping Benchmark)
Classification
PersonaQA

Citation

If you use this code in your research, please cite our paper:

@misc{karvonen2025activationoraclestrainingevaluating,
      title={Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers}, 
      author={Adam Karvonen and James Chua and Clément Dumas and Kit Fraser-Taliente and Subhash Kantamneni and Julian Minder and Euan Ong and Arnab Sen Sharma and Daniel Wen and Owain Evans and Samuel Marks},
      year={2025},
      eprint={2512.15674},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.15674}, 
}
