interp-probe-eval

Did the probe find the concept, or a shortcut?

A small, reproducible harness for evaluating interpretability methods — the kind of question the SOAR I-2 project asks: when a linear probe "detects" a concept in a model's activations, is it recovering the concept, or keying on a correlated confound? Because the activations here are synthetic with known ground truth, we can actually check — which is exactly what you need to validate an interpretability evaluation in the first place.

Independent prep work for EleutherAI SOAR 2026, project I-2 — Evaluating Interpretability Methods for Scientific Reasoning. Built clean-room on personal time.

Why this exists

The I-2 project audits whether interpretability methods (linear probes, activation steering, interchange interventions) actually recover a scientific concept, and compares probe-training procedures under different experimental designs. This repo is the methodological core in miniature:

Two standard probe procedures — logistic regression and difference-of-means.
Two training designs — shortcut-prone vs. balanced/contrastive.
Two evaluation splits — shortcut-prone vs. shortcut-controlled.
A concept-separability check (does a probe for concept A also fire for a distinct concept B?).

Synthetic ground truth is the trick: each activation is concept_dir·y + shortcut_dir·s + noise, where y is the concept label and s is a shortcut feature correlated with y. So we know whether a probe is reading the concept or the confound, and can show which evaluation designs reveal the difference.
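A minimal sketch of that generative recipe, not the repo's synth.py. The function name make_synthetic, its parameters, the ±1 encoding of y and s, and the reading of "shortcut corr" as the probability that s agrees with y are all assumptions made for illustration.

```python
import numpy as np

def make_synthetic(n, dim=32, shortcut_corr=0.95, noise=1.0, seed=0):
    """Hypothetical generator: x = concept_dir*y + shortcut_dir*s + noise."""
    rng = np.random.default_rng(seed)
    # Two orthonormal directions, so concept and shortcut are distinguishable.
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    concept_dir, shortcut_dir = q[:, 0], q[:, 1]
    y = rng.integers(0, 2, size=n)            # concept label in {0, 1}
    agree = rng.random(n) < shortcut_corr     # assumption: s copies y with this probability
    s = np.where(agree, y, 1 - y)             # shortcut feature in {0, 1}
    # Map {0, 1} to {-1, +1} so both classes sit symmetrically around the origin.
    signal = np.outer(2 * y - 1, concept_dir) + np.outer(2 * s - 1, shortcut_dir)
    X = signal + noise * rng.standard_normal((n, dim))
    return X, y, s, concept_dir, shortcut_dir

# Shortcut-prone split: s mostly agrees with y.
X_prone, y_prone, s_prone, c_dir, s_dir = make_synthetic(2000, shortcut_corr=0.95)
# Shortcut-controlled split: s is independent of y.
X_ctrl, y_ctrl, s_ctrl, _, _ = make_synthetic(2000, shortcut_corr=0.50, seed=1)
```

Under these assumptions, setting shortcut_corr to 0.5 makes s uninformative about y, which is one way to build a shortcut-controlled split.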

Quickstart

Environment + packages via uv; Python 3.14 pinned in .python-version. One dependency (numpy).

uv venv --python 3.14
uv pip install -e .

uv run python -m probeval.run
uv run python -m unittest discover -s tests

Example output

Interp probe evaluation — synthetic concept with a controllable shortcut (shortcut corr=0.95)

Shortcut-reliance experiment (a probe trained on shortcut-prone data can cheat):

method        training design          acc(prone)  acc(ctrl)  reliance
----------------------------------------------------------------------
logreg        shortcut-prone                 0.92       0.74      0.18
logreg        balanced (contrastive)         0.84       0.84      0.01
diff-of-means shortcut-prone                 0.92       0.72      0.21
diff-of-means balanced (contrastive)         0.84       0.84      0.00

Concept separability (logreg probe trained on concept A, tested on a distinct B):
  acc on A (held-out)     :  0.84
  acc on B (cross-eval)   :  0.49   (≈0.50 ⇒ separable)

The methodological point. A probe trained on shortcut-prone data scores 0.92 on the shortcut-prone split, so it looks like it found the concept. Evaluate it on a shortcut-controlled split and it drops to about 0.73 (0.74 for logreg, 0.72 for diff-of-means): it was leaning on the confound. Crucially, evaluating only on the shortcut-prone split hides this: both probe methods score 0.92 there, and the shortcut-trained probes even outscore the balanced ones (0.92 vs. 0.84). What recovers the true concept is the training design (balanced/contrastive → reliance ≈ 0), more than the choice of probe. That's the difference between an interpretability method that claims to be causal and one that demonstrably is, which is exactly what I-2 sets out to measure.
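The comparison can be sketched end to end in a few lines. This is an illustrative reconstruction, not the repo's evaluate.py: the helper names (sample, fit_diff_of_means, run_design), the data recipe, and the definition of reliance as acc(prone) minus acc(ctrl), which is what the table's columns suggest, are assumptions.

```python
import numpy as np

def sample(n, agree_prob, rng, concept_dir, shortcut_dir, noise=1.0):
    """Draw activations where the shortcut s agrees with y with prob agree_prob."""
    y = rng.integers(0, 2, size=n)
    s = np.where(rng.random(n) < agree_prob, y, 1 - y)
    X = (np.outer(2 * y - 1, concept_dir) + np.outer(2 * s - 1, shortcut_dir)
         + noise * rng.standard_normal((n, concept_dir.size)))
    return X, y

def fit_diff_of_means(X, y):
    """Probe direction = class-mean difference; threshold at the midpoint."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    return w, -w @ (mu1 + mu0) / 2

def accuracy(w, b, X, y):
    return float(np.mean((X @ w + b > 0) == (y == 1)))

def run_design(train_agree, seed=0, dim=32, n=4000):
    """Train under one design, then score on prone (0.95) and controlled (0.50) splits."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, 2)))
    c, d = q[:, 0], q[:, 1]
    w, b = fit_diff_of_means(*sample(n, train_agree, rng, c, d))
    acc_prone = accuracy(w, b, *sample(n, 0.95, rng, c, d))
    acc_ctrl = accuracy(w, b, *sample(n, 0.50, rng, c, d))
    return acc_prone, acc_ctrl, acc_prone - acc_ctrl

for name, agree in [("shortcut-prone", 0.95), ("balanced", 0.50)]:
    prone, ctrl, gap = run_design(agree)
    print(f"{name:15s} acc(prone)={prone:.2f} acc(ctrl)={ctrl:.2f} reliance={gap:.2f}")
```

In this sketch the shortcut-trained probe should show a clearly positive gap between the two splits, while the balanced design should leave the gap near zero; exact numbers depend on the seed and noise level.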

Real activations (optional)

Swap the synthetic X for real hidden states from an open model — the bridge to the project's real benchmark (pro-arrhythmia / hERG):

uv pip install -e '.[hf]'      # transformers + torch

from probeval.activations import extract_activations
from probeval.probes import logistic_regression

# texts: a list of input strings; y: the matching concept labels, shape (n,). You supply both.
X = extract_activations(texts, model_name="gpt2", layer=6)   # (n, hidden_dim)
probe = logistic_regression(X, y)                            # then evaluate as above

How it works

probeval/
├── synth.py       # controllable activations: concept_dir·y + shortcut_dir·s + noise
├── probes.py      # logistic_regression, difference_of_means → a shared Probe(w, b)
├── metrics.py     # accuracy, cosine
├── evaluate.py    # shortcut_experiment + separability_experiment
├── activations.py # optional: real hidden states from an HF model
└── run.py         # CLI + reporting
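To make the "shared Probe(w, b)" idea concrete, here is one way the two fitting procedures and the cosine metric could look. The names Probe, logistic_regression, difference_of_means and cosine come from the layout above; the bodies, the predict method, and the hyperparameters (lr, steps, l2) are assumptions, not the repo's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Probe:
    """A linear probe: predict 1 when X @ w + b > 0."""
    w: np.ndarray
    b: float

    def predict(self, X):
        return (X @ self.w + self.b > 0).astype(int)

def difference_of_means(X, y):
    # Direction between class means, thresholded at their midpoint.
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    return Probe(w, float(-w @ (mu1 + mu0) / 2))

def logistic_regression(X, y, lr=0.1, steps=500, l2=1e-3):
    # Plain batch gradient descent on the L2-regularised logistic loss.
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)      # clip to keep exp() well-behaved
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - y
        w -= lr * (X.T @ grad / n + l2 * w)
        b -= lr * grad.mean()
    return Probe(w, float(b))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Because both procedures return the same Probe shape, an evaluation routine can score them identically. With synthetic data, cosine(probe.w, concept_dir) gives a direct check of whether the learned direction matches the known concept direction.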

Extending toward the full project

Real model + benchmark — extract activations on the pro-arrhythmia / hERG task, label by a scientific concept (e.g. hERG block), and run the same evals.
More procedures — add mass-mean / LDA / sparse probes; compare under matched vs. contrastive training sets.
Beyond probes — extend to activation steering and interchange interventions, and test whether their "causal" claims survive shortcut control.
Shortcut-controlled prompt sets — construct prompts that decorrelate the concept from surface features, the real-data analogue of the controlled split.
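For the shortcut-controlled prompt sets, one simple way to decorrelate a concept from a surface feature is to subsample so that every (concept, shortcut) cell has the same number of examples. The helper below is a hypothetical sketch: balance_cells is not part of the repo, and it assumes both the concept label y and the surface-feature tag s are binary and already annotated.

```python
import numpy as np

def balance_cells(y, s, seed=0):
    """Return sorted indices with equal counts in each (y, s) cell (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    y, s = np.asarray(y), np.asarray(s)
    # The four cells: (y=0, s=0), (y=0, s=1), (y=1, s=0), (y=1, s=1).
    cells = [np.flatnonzero((y == a) & (s == b)) for a in (0, 1) for b in (0, 1)]
    k = min(len(c) for c in cells)           # the rarest cell sets the budget
    if k == 0:
        raise ValueError("at least one (concept, shortcut) cell is empty")
    keep = np.concatenate([rng.choice(c, size=k, replace=False) for c in cells])
    return np.sort(keep)

# Usage: a toy annotation where the surface feature agrees with the concept 90% of the time.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
s = np.where(rng.random(1000) < 0.9, y, 1 - y)
idx = balance_cells(y, s)
print(len(idx), np.mean(y[idx] == s[idx]))
```

The cost is data: the rarest cell bounds the size of the balanced set, so in practice it may be cheaper to write extra prompts for the under-represented cells than to discard the rest.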

Limitations

The activations are synthetic, deliberately, so that the ground truth is known and the evaluation methodology can be validated. Real activations are non-linear and higher-dimensional, and their concept/shortcut structure is unknown; that's the point of the real-model extension. The probes here are linear, but the evaluation framework itself is not tied to linear probes.

License

MIT — see LICENSE. A deeper review is in literature/literature-review.md.
