Did the probe find the concept, or a shortcut?
A small, reproducible harness for evaluating interpretability methods — the kind of question the SOAR I-2 project asks: when a linear probe "detects" a concept in a model's activations, is it recovering the concept, or keying on a correlated confound? Because the activations here are synthetic with known ground truth, we can actually check — which is exactly what you need to validate an interpretability evaluation in the first place.
Independent prep work for EleutherAI SOAR 2026, project I-2 — Evaluating Interpretability Methods for Scientific Reasoning. Built clean-room on personal time.
The I-2 project audits whether interpretability methods (linear probes, activation steering, interchange interventions) actually recover a scientific concept, and compares probe-training procedures under different experimental designs. This repo is the methodological core in miniature:
- Two standard probe procedures — logistic regression and difference-of-means.
- Two training designs — shortcut-prone vs. balanced/contrastive.
- Two evaluation splits — shortcut-prone vs. shortcut-controlled.
- A concept-separability check (does a probe for concept A also fire for a distinct concept B?).
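The two probe procedures named in the list above can be sketched in a few lines of numpy. This is a minimal illustration, not the repo's `probes.py`: the function bodies, the hyperparameters (`lr`, `steps`, `l2`), the `predict` helper, and the toy data are all assumptions made for the example, and the probes are returned as a plain `(w, b)` tuple.

```python
import numpy as np

def difference_of_means(X, y):
    """Probe direction = mean(class 1) - mean(class 0); threshold at the midpoint."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    b = -w @ (mu1 + mu0) / 2.0  # decision boundary passes through the midpoint
    return w, b

def logistic_regression(X, y, lr=0.1, steps=500, l2=1e-3):
    """Batch gradient descent on the L2-regularised logistic loss (illustrative settings)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)   # clip to keep exp() well-behaved
        p = 1.0 / (1.0 + np.exp(-z))      # predicted P(y = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def predict(w, b, X):
    """Label 1 where the linear score is positive."""
    return (X @ w + b > 0).astype(int)

# Toy data (hypothetical): the class signal lives only along axis 0.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
X = rng.standard_normal((400, 8))
X[:, 0] += 3.0 * (2 * y - 1)

accs = {}
for fit in (difference_of_means, logistic_regression):
    w, b = fit(X, y)
    accs[fit.__name__] = float((predict(w, b, X) == y).mean())
    print(fit.__name__, round(accs[fit.__name__], 3))
```

Both procedures produce a linear direction plus a bias, which is what lets the same evaluation code score either one.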
Synthetic ground truth is the trick: each activation is
concept_dir·y + shortcut_dir·s + noise, where y is the concept label and s is a
shortcut feature correlated with y. We therefore know whether a probe is
reading the concept or the confound, and can show which evaluation designs
reveal the difference.
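As a rough sketch of that generative recipe, the snippet below samples activations with a controllable shortcut. Everything here is assumed for illustration rather than taken from `synth.py`: the name `make_activations`, the `agree_rate` parameter (the probability that the shortcut feature copies the label, which may not be how the repo defines "corr"), the ±1 coding of labels, and the choice of unit axes as directions.

```python
import numpy as np

def make_activations(n, agree_rate, rng, concept_dir, shortcut_dir, noise=1.0):
    """Sample x = concept_dir*y + shortcut_dir*s + noise.

    y is the concept label; s is a shortcut feature that equals y with
    probability agree_rate (0.5 means the shortcut carries no information).
    """
    dim = concept_dir.shape[0]
    y = rng.integers(0, 2, size=n)              # concept label in {0, 1}
    agree = rng.random(n) < agree_rate          # True where the shortcut copies the label
    s = np.where(agree, y, 1 - y)               # shortcut feature in {0, 1}
    y_pm, s_pm = 2 * y - 1, 2 * s - 1           # recode to {-1, +1} for symmetric classes
    X = (y_pm[:, None] * concept_dir
         + s_pm[:, None] * shortcut_dir
         + noise * rng.standard_normal((n, dim)))
    return X, y, s

rng = np.random.default_rng(0)
dim = 16
concept_dir = np.zeros(dim)
concept_dir[0] = 1.0                            # assumed: concept along axis 0
shortcut_dir = np.zeros(dim)
shortcut_dir[1] = 1.0                           # assumed: shortcut along axis 1

# Shortcut-prone split: the shortcut almost always agrees with the label.
X_prone, y_prone, s_prone = make_activations(2000, 0.95, rng, concept_dir, shortcut_dir)
# Shortcut-controlled split: the shortcut is independent of the label.
X_ctrl, y_ctrl, s_ctrl = make_activations(2000, 0.50, rng, concept_dir, shortcut_dir)

print("agreement (prone):", (s_prone == y_prone).mean())
print("agreement (ctrl): ", (s_ctrl == y_ctrl).mean())
```

Because `concept_dir` and `shortcut_dir` are known, a fitted probe's weight vector can be compared against each of them directly, which is not possible with real activations.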
Environment + packages via uv; Python 3.14 pinned in
.python-version. One dependency (numpy).
uv venv --python 3.14
uv pip install -e .
uv run python -m probeval.run
uv run python -m unittest discover -s tests

Interp probe evaluation — synthetic concept with a controllable shortcut (shortcut corr=0.95)
Shortcut-reliance experiment (a probe trained on shortcut-prone data can cheat):
method         training design          acc(prone)  acc(ctrl)  reliance
-----------------------------------------------------------------------
logreg         shortcut-prone                 0.92       0.74      0.18
logreg         balanced (contrastive)         0.84       0.84      0.01
diff-of-means  shortcut-prone                 0.92       0.72      0.21
diff-of-means  balanced (contrastive)         0.84       0.84      0.00
Concept separability (logreg probe trained on concept A, tested on a distinct B):
acc on A (held-out) : 0.84
acc on B (cross-eval) : 0.49 (≈0.50 ⇒ separable)
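The cross-evaluation reported above can be illustrated with a small stand-alone sketch. It assumes two orthogonal concept directions with independent labels and, for brevity, uses a difference-of-means probe rather than the logreg probe in the report; the variable names and data sizes are hypothetical and do not come from `evaluate.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 4000, 16
dir_a = np.zeros(dim)
dir_a[0] = 1.0                              # concept A direction (assumed)
dir_b = np.zeros(dim)
dir_b[1] = 1.0                              # concept B direction, orthogonal to A (assumed)

y_a = rng.integers(0, 2, size=n)            # labels for concept A
y_b = rng.integers(0, 2, size=n)            # labels for concept B, independent of A
X = ((2 * y_a - 1)[:, None] * dir_a
     + (2 * y_b - 1)[:, None] * dir_b
     + rng.standard_normal((n, dim)))

train, test = slice(0, n // 2), slice(n // 2, n)

# Difference-of-means probe for concept A, fit on the training half only.
mu1 = X[train][y_a[train] == 1].mean(axis=0)
mu0 = X[train][y_a[train] == 0].mean(axis=0)
w = mu1 - mu0
b = -w @ (mu1 + mu0) / 2.0

pred = (X[test] @ w + b > 0).astype(int)
acc_a = float((pred == y_a[test]).mean())   # held-out accuracy on the trained concept
acc_b = float((pred == y_b[test]).mean())   # same predictions scored against B's labels
print(f"acc on A (held-out):   {acc_a:.2f}")
print(f"acc on B (cross-eval): {acc_b:.2f}")
```

Cross-eval accuracy near chance means the probe's predictions carry no information about concept B, which is the sense of "separable" used in the report.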
The methodological point. A probe trained on shortcut-prone data scores 0.92 on the shortcut-prone split, so it looks like it found the concept. Evaluate it on a shortcut-controlled split and it drops to ~0.73: it was leaning on the confound. Crucially, evaluating only on the shortcut-prone split hides this: every method looks roughly equally good. What recovers the true concept is the training design (balanced/contrastive gives reliance ≈ 0), more than the choice of probe. That's the difference between an interpretability method that claims to be causal and one that demonstrably is, which is exactly what I-2 sets out to measure.
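The reliance figure can be reproduced in spirit with a short stand-alone sketch: train the same probe under two designs, score each on a shortcut-prone and a shortcut-controlled split, and take the difference. The helper names, the difference-of-means probe, the axis-aligned directions, and the modelling of "balanced" as a 50% agreement rate are assumptions for illustration, so the numbers it prints will not match the table exactly.

```python
import numpy as np

def sample(n, agree_rate, rng, dim=16):
    """x = y*e0 + s*e1 + noise (labels coded +-1); s matches y with prob. agree_rate."""
    y = rng.integers(0, 2, size=n)
    s = np.where(rng.random(n) < agree_rate, y, 1 - y)
    X = rng.standard_normal((n, dim))
    X[:, 0] += 2 * y - 1                   # concept along axis 0
    X[:, 1] += 2 * s - 1                   # shortcut along axis 1
    return X, y

def fit_diff_of_means(X, y):
    """Difference-of-means probe with a midpoint threshold."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    return w, -w @ (mu1 + mu0) / 2.0

def accuracy(w, b, X, y):
    return float(((X @ w + b > 0).astype(int) == y).mean())

rng = np.random.default_rng(0)
eval_prone = sample(5000, 0.95, rng)       # shortcut still predicts the label
eval_ctrl = sample(5000, 0.50, rng)        # shortcut decorrelated from the label

results = {}
for design, rate in [("shortcut-prone", 0.95), ("balanced", 0.50)]:
    w, b = fit_diff_of_means(*sample(5000, rate, rng))
    acc_prone = accuracy(w, b, *eval_prone)
    acc_ctrl = accuracy(w, b, *eval_ctrl)
    # reliance: how much accuracy disappears once the shortcut stops helping
    results[design] = (acc_prone, acc_ctrl, acc_prone - acc_ctrl)
    print(f"{design:15s} acc(prone)={acc_prone:.2f} "
          f"acc(ctrl)={acc_ctrl:.2f} reliance={acc_prone - acc_ctrl:.2f}")
```

Scoring on the prone split alone would never expose the gap; the controlled split is what turns the shortcut into a measurable quantity.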
Swap the synthetic X for real hidden states from an open model — the bridge to
the project's real benchmark (pro-arrhythmia / hERG):
uv pip install -e '.[hf]'  # transformers + torch

from probeval.activations import extract_activations
from probeval.probes import logistic_regression

# texts: list of input strings; y: one concept label per text (you supply both)
X = extract_activations(texts, model_name="gpt2", layer=6)  # (n, hidden_dim)
probe = logistic_regression(X, y)  # then evaluate as above

probeval/
├── synth.py # controllable activations: concept_dir·y + shortcut_dir·s + noise
├── probes.py # logistic_regression, difference_of_means → a shared Probe(w, b)
├── metrics.py # accuracy, cosine
├── evaluate.py # shortcut_experiment + separability_experiment
├── activations.py # optional: real hidden states from an HF model
└── run.py # CLI + reporting
- Real model + benchmark — extract activations on the pro-arrhythmia / hERG task, label by a scientific concept (e.g. hERG block), and run the same evals.
- More procedures — add mass-mean / LDA / sparse probes; compare under matched vs. contrastive training sets.
- Beyond probes — extend to activation steering and interchange interventions, and test whether their "causal" claims survive shortcut control.
- Shortcut-controlled prompt sets — construct prompts that decorrelate the concept from surface features, the real-data analogue of the controlled split.
The activations are synthetic, deliberately, so that the ground truth is known and the evaluation methodology can be validated. Real activations are non-linear and higher-dimensional, and their concept/shortcut structure is unknown; that's the point of the real-model extension. The probes here are linear, but the evaluation designs (shortcut-controlled splits, cross-concept checks) are intended to carry over to other probe types.
MIT — see LICENSE. A deeper review is in literature/literature-review.md.