This module implements five families of interpretability methods that localize a target concept to model components. Given a set of model representations, each family specifies a featurizer, how its features are selected, and what training is required:
- **Principal components**
  - Featurizer: An $n \times n$ orthogonal matrix formed by the principal components.
  - Features: Selected by a classifier.
  - Training: The featurizer does not require training; feature selection requires training an attribute value classifier.
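A minimal sketch of the principal-components featurizer, assuming the representations are the rows of a matrix `reps` (all names here are illustrative, not the module's API):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 16))      # 200 representations of dimension n = 16

# PCA operates on mean-centered data; the right singular vectors of the
# centered matrix are the principal components.
centered = reps - reps.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=True)
featurizer = vt.T                      # n x n orthogonal matrix of principal components

features = reps @ featurizer           # project representations onto the components

# The featurizer is orthogonal and requires no training.
assert np.allclose(featurizer.T @ featurizer, np.eye(16), atol=1e-8)
```

The projected `features` would then be passed to the attribute value classifier that performs feature selection.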
- **Autoencoder**
  - Featurizer: The encoder and decoder, assuming the reconstruction loss is sufficiently small.
  - Features: Selected by a classifier.
  - Training: The featurizer is trained on an unsupervised reconstruction loss; feature selection requires training an attribute value classifier.
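A minimal sketch of training an autoencoder featurizer on an unsupervised reconstruction loss, using a linear encoder/decoder with a ReLU and plain gradient descent (all names and hyperparameters are illustrative assumptions, not the module's API):

```python
import numpy as np

rng = np.random.default_rng(0)
reps = rng.normal(size=(256, 8))       # representations of dimension n = 8
n, m = 8, 16                           # m = feature dictionary size (illustrative)
W_enc = rng.normal(scale=0.1, size=(n, m))
W_dec = rng.normal(scale=0.1, size=(m, n))

def recon_loss(W_enc, W_dec):
    z = np.maximum(reps @ W_enc, 0.0)  # encoder with ReLU nonlinearity
    return np.mean((z @ W_dec - reps) ** 2)

loss_before = recon_loss(W_enc, W_dec)
lr = 1e-2
for _ in range(500):
    z = np.maximum(reps @ W_enc, 0.0)
    err = (z @ W_dec) - reps           # gradient of the squared reconstruction error
    grad_dec = z.T @ err / len(reps)
    grad_enc = reps.T @ ((err @ W_dec.T) * (z > 0)) / len(reps)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
loss_after = recon_loss(W_enc, W_dec)
```

Once the reconstruction loss is sufficiently small, the encoder activations `z` are the candidate features, and an attribute value classifier selects among them.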
- **Probe**
  - Featurizer: An $n \times n$ orthogonal matrix formed by the set of $k$ orthonormal vectors that span the row space of the probe and $n-k$ orthonormal vectors that span the null space of the probe, where $k$ is the rank of the probe.
  - Features: The first $k$ dimensions, i.e., the $k$ dimensions corresponding to the row space.
  - Training: The featurizer is trained on attribute value classification; no additional feature selection is required.
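A minimal sketch of building the orthogonal featurizer from a trained probe's weight matrix (the probe weights are assumed given; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 3
probe = rng.normal(size=(k, n))        # probe weight matrix of rank k

# SVD of the probe: the first k rows of vt span the probe's row space and
# the remaining n - k rows span its null space, so stacking them yields an
# n x n orthogonal matrix.
_, _, vt = np.linalg.svd(probe, full_matrices=True)
featurizer = vt.T                      # columns: row-space basis first, then null-space basis

feats = rng.normal(size=n) @ featurizer
selected = feats[:k]                   # the first k dimensions (row space) are the features
```

The null-space columns are annihilated by the probe, which is what makes the first $k$ dimensions the natural feature set.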
- **Binary mask**
  - Featurizer: An identity matrix.
  - Features: Selected by a learned binary mask.
  - Training: The features require training with counterfactual signals.
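A minimal sketch of a binary-mask interchange intervention (illustrative, not the module's API): the featurizer is the identity, and the mask picks which dimensions are swapped in from a counterfactual run.

```python
import numpy as np

n = 8
base = np.arange(n, dtype=float)       # representation from the base input
source = -np.ones(n)                   # representation from the counterfactual input

mask = np.zeros(n)
mask[[1, 4]] = 1.0                     # dimensions the (learned) mask selects

# Interchange intervention: masked dimensions come from the source run,
# unmasked dimensions stay from the base run.
intervened = (1.0 - mask) * base + mask * source
```

During training, the mask would be relaxed (e.g., a sigmoid of logits) so it can be optimized against the model's counterfactual behavior and then discretized.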
- **Learned orthogonal transformation**
  - Featurizer: An $n \times n$ orthogonal matrix learned with counterfactual signals.
  - Features: The first $k$ dimensions.
  - Training: The featurizer requires training with counterfactual signals.
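A minimal sketch of intervening through a learned orthogonal featurizer (illustrative names): the orthogonal matrix can be parametrized, e.g., via the QR decomposition of an unconstrained parameter matrix, and the first $k$ rotated dimensions are interchanged between runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
params = rng.normal(size=(n, n))
rotation, _ = np.linalg.qr(params)     # n x n orthogonal featurizer

base = rng.normal(size=n)
source = rng.normal(size=n)

# Rotate both representations into feature space, interchange the first k
# feature dimensions, then rotate back to the original basis.
b_rot, s_rot = base @ rotation, source @ rotation
b_rot[:k] = s_rot[:k]
intervened = b_rot @ rotation.T
```

In training, `params` would be optimized with counterfactual signals so that the first $k$ feature dimensions isolate the target concept.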