avilog/interp-probe-eval

interp-probe-eval

Did the probe find the concept, or a shortcut?

A small, reproducible harness for evaluating interpretability methods — the kind of question the SOAR I-2 project asks: when a linear probe "detects" a concept in a model's activations, is it recovering the concept, or keying on a correlated confound? Because the activations here are synthetic with known ground truth, we can actually check — which is exactly what you need to validate an interpretability evaluation in the first place.

Independent prep work for EleutherAI SOAR 2026, project I-2 — Evaluating Interpretability Methods for Scientific Reasoning. Built clean-room on personal time.


Why this exists

The I-2 project audits whether interpretability methods (linear probes, activation steering, interchange interventions) actually recover a scientific concept, and compares probe-training procedures under different experimental designs. This repo is the methodological core in miniature:

  • Two standard probe procedures — logistic regression and difference-of-means.
  • Two training designs — shortcut-prone vs. balanced/contrastive.
  • Two evaluation splits — shortcut-prone vs. shortcut-controlled.
  • A concept-separability check (does a probe for concept A also fire for a distinct concept B?).

Synthetic ground truth is the trick: each activation is concept_dir·y + shortcut_dir·s + noise, where y is the concept label and s is a shortcut feature correlated with y. Because both directions are known, we can tell whether a probe is reading the concept or the confound, and can show which evaluation designs reveal the difference.
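The generative recipe above can be sketched in NumPy. This is an illustration under stated assumptions, not the contents of synth.py: the function name `make_activations`, its parameters, the {-1, +1} coding, and the reading of "shortcut corr" as the probability that the shortcut feature s agrees with the label y are all hypothetical.

```python
import numpy as np

def make_activations(n, dim=32, shortcut_corr=0.95, noise=1.0, seed=0):
    """Sample X = concept_dir*y + shortcut_dir*s + noise with known ground truth.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    # Two orthonormal directions: one carries the concept, one the shortcut.
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    concept_dir, shortcut_dir = q[:, 0], q[:, 1]
    y = rng.integers(0, 2, size=n)                 # concept label in {0, 1}
    agree = rng.random(n) < shortcut_corr          # s copies y with this probability
    s = np.where(agree, y, 1 - y)                  # shortcut feature in {0, 1}
    # Map {0, 1} to {-1, +1} so the classes sit symmetrically around the origin.
    X = (np.outer(2 * y - 1, concept_dir)
         + np.outer(2 * s - 1, shortcut_dir)
         + noise * rng.normal(size=(n, dim)))
    return X, y, s, concept_dir, shortcut_dir

X, y, s, concept_dir, shortcut_dir = make_activations(4000, seed=1)
print(X.shape, float(np.mean(y == s)))
```

Setting `shortcut_corr=0.5` makes s independent of y, which is one way to build a shortcut-controlled split. Note that the directions are drawn from the seed, so train and test splits that must share directions need to reuse them rather than call the function with different seeds.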

Quickstart

Environment + packages via uv; Python 3.14 pinned in .python-version. One dependency (numpy).

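uv venv --python 3.14                          # create the virtual environment
uv pip install -e .                            # install the package in editable mode

uv run python -m probeval.run                  # run the experiments and print the report
uv run python -m unittest discover -s tests    # run the test suite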

Example output

Interp probe evaluation — synthetic concept with a controllable shortcut (shortcut corr=0.95)

Shortcut-reliance experiment (a probe trained on shortcut-prone data can cheat):

method        training design          acc(prone)  acc(ctrl)  reliance
----------------------------------------------------------------------
logreg        shortcut-prone                 0.92       0.74      0.18
logreg        balanced (contrastive)         0.84       0.84      0.01
diff-of-means shortcut-prone                 0.92       0.72      0.21
diff-of-means balanced (contrastive)         0.84       0.84      0.00

Concept separability (logreg probe trained on concept A, tested on a distinct B):
  acc on A (held-out)     :  0.84
  acc on B (cross-eval)   :  0.49   (≈0.50 ⇒ separable)
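The separability numbers above correspond to a check along the following lines. This is a minimal sketch, assuming two orthogonal concept directions with independent labels. It uses a difference-of-means probe for brevity (the reported run uses logreg), and the names `DIR_A`, `DIR_B`, and `sample` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32
# Orthonormal directions for two distinct concepts, A and B (assumed setup).
q, _ = np.linalg.qr(rng.normal(size=(DIM, 2)))
DIR_A, DIR_B = q[:, 0], q[:, 1]

def sample(n):
    """Activations carrying independent binary labels for concepts A and B."""
    y_a = rng.integers(0, 2, size=n)
    y_b = rng.integers(0, 2, size=n)
    X = (np.outer(2 * y_a - 1, DIR_A) + np.outer(2 * y_b - 1, DIR_B)
         + rng.normal(size=(n, DIM)))
    return X, y_a, y_b

# Train a difference-of-means probe on concept A only.
X_train, y_a_train, _ = sample(5000)
mu1 = X_train[y_a_train == 1].mean(axis=0)
mu0 = X_train[y_a_train == 0].mean(axis=0)
w = mu1 - mu0
b = -w @ (mu1 + mu0) / 2          # threshold at the midpoint of the class means

# Score the same probe against held-out labels for A and for B.
X_test, y_a_test, y_b_test = sample(5000)
pred = X_test @ w + b > 0
acc_a = float(np.mean(pred == (y_a_test == 1)))
acc_b = float(np.mean(pred == (y_b_test == 1)))
print(f"acc on A (held-out)   : {acc_a:.2f}")
print(f"acc on B (cross-eval) : {acc_b:.2f}")
```

A cross-eval accuracy near chance suggests the probe for A is not also reading B. An accuracy well above chance would indicate that the two concepts share a direction or are correlated in the data.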

The methodological point. A probe trained on shortcut-prone data scores 0.92 on the shortcut-prone split, so it looks like it found the concept. Evaluate it on a shortcut-controlled split and it drops to ~0.73: it was leaning on the confound. Crucially, evaluating only on the prone split hides this: both probe methods score 0.92 there, and the shortcut-trained probes even outscore the balanced ones (0.92 vs. 0.84). What recovers the true concept is the training design (balanced/contrastive → reliance ≈ 0), more than the choice of probe. That's the difference between an interpretability method that claims to be causal and one that demonstrably is, which is exactly what I-2 sets out to measure.
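The comparison can be reproduced in miniature. The sketch below makes several assumptions: reliance is taken to be acc(prone) − acc(ctrl), which is consistent with the table up to rounding. The "balanced" and "controlled" designs are modeled as data where the shortcut agrees with the label only half the time. The helper names are hypothetical and are not the repo's evaluate.py API.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32
q, _ = np.linalg.qr(rng.normal(size=(DIM, 2)))
CONCEPT, SHORTCUT = q[:, 0], q[:, 1]       # known ground-truth directions

def sample(n, agree_prob):
    """Draw activations where the shortcut agrees with the label w.p. agree_prob."""
    y = rng.integers(0, 2, size=n)
    s = np.where(rng.random(n) < agree_prob, y, 1 - y)
    X = (np.outer(2 * y - 1, CONCEPT) + np.outer(2 * s - 1, SHORTCUT)
         + rng.normal(size=(n, DIM)))
    return X, y

def diff_of_means(X, y):
    """Difference-of-means probe: w points from the class-0 mean to the class-1 mean."""
    mu1, mu0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    w = mu1 - mu0
    return w, -w @ (mu1 + mu0) / 2

def accuracy(w, b, X, y):
    return float(np.mean((X @ w + b > 0) == (y == 1)))

X_prone, y_prone = sample(5000, 0.95)      # shortcut-prone evaluation split
X_ctrl, y_ctrl = sample(5000, 0.50)        # shortcut-controlled evaluation split

results = {}
for design, p in [("shortcut-prone", 0.95), ("balanced", 0.50)]:
    w, b = diff_of_means(*sample(5000, p))
    acc_prone = accuracy(w, b, X_prone, y_prone)
    acc_ctrl = accuracy(w, b, X_ctrl, y_ctrl)
    results[design] = acc_prone - acc_ctrl  # assumed definition of reliance
    print(f"{design:15s} acc(prone)={acc_prone:.2f} "
          f"acc(ctrl)={acc_ctrl:.2f} reliance={results[design]:.2f}")
```

In this setup the probe trained on shortcut-prone data should show clearly positive reliance, while the balanced one should sit near zero. Only the training distribution differs between the two runs.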

Real activations (optional)

Swap the synthetic X for real hidden states from an open model — the bridge to the project's real benchmark (pro-arrhythmia / hERG):

uv pip install -e '.[hf]'      # transformers + torch

from probeval.activations import extract_activations
from probeval.probes import logistic_regression

# texts and y are supplied by the caller: the inputs and their binary concept labels
X = extract_activations(texts, model_name="gpt2", layer=6)   # (n, hidden_dim)
probe = logistic_regression(X, y)                            # then evaluate as above

How it works

probeval/
├── synth.py       # controllable activations: concept_dir·y + shortcut_dir·s + noise
├── probes.py      # logistic_regression, difference_of_means → a shared Probe(w, b)
├── metrics.py     # accuracy, cosine
├── evaluate.py    # shortcut_experiment + separability_experiment
├── activations.py # optional: real hidden states from an HF model
└── run.py         # CLI + reporting
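To make the shared Probe(w, b) shape and the cosine metric concrete, here is a hedged sketch. It is not the code in probes.py or metrics.py: the gradient-descent fit, the `lr`, `steps`, and `l2` parameters, and the `predict` method are assumptions chosen to keep the example NumPy-only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Probe:
    w: np.ndarray   # weight vector, shape (dim,)
    b: float        # bias

    def predict(self, X):
        return (X @ self.w + self.b > 0).astype(int)

def logistic_regression(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a linear probe by full-batch gradient descent on the logistic loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(y = 1)
        grad = p - y                              # d(loss)/d(logit) per example
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return Probe(w, float(b))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Shortcut-prone training data with known directions (assumed setup).
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(32, 2)))
concept_dir, shortcut_dir = q[:, 0], q[:, 1]
y = rng.integers(0, 2, size=4000)
s = np.where(rng.random(4000) < 0.95, y, 1 - y)
X = (np.outer(2 * y - 1, concept_dir) + np.outer(2 * s - 1, shortcut_dir)
     + rng.normal(size=(4000, 32)))

probe = logistic_regression(X, y)
cos_concept = cosine(probe.w, concept_dir)
cos_shortcut = cosine(probe.w, shortcut_dir)
print(f"cos(w, concept)={cos_concept:.2f}  cos(w, shortcut)={cos_shortcut:.2f}")
```

Comparing the probe's weight vector against the known directions is only possible because the data is synthetic. A sizeable cosine with the shortcut direction is a direct sign that the probe has absorbed the confound.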

Extending toward the full project

  • Real model + benchmark — extract activations on the pro-arrhythmia / hERG task, label by a scientific concept (e.g. hERG block), and run the same evals.
  • More procedures — add mass-mean / LDA / sparse probes; compare under matched vs. contrastive training sets.
  • Beyond probes — extend to activation steering and interchange interventions, and test whether their "causal" claims survive shortcut control.
  • Shortcut-controlled prompt sets — construct prompts that decorrelate the concept from surface features, the real-data analogue of the controlled split.

Limitations

The activations are synthetic, deliberately, so that the ground truth is known and the evaluation methodology can be validated. In real activations, concepts may be encoded non-linearly, the dimensionality is higher, and the concept/shortcut structure is unknown; that's the point of the real-model extension. The probes here are linear, but the comparison of training designs and evaluation splits is intended to carry over to other probe types.

License

MIT — see LICENSE. A deeper review is in literature/literature-review.md.
