# II-Benchmark

A benchmark for causal-abstraction-style analyses, which try to find an alignment between a model's computation and a corresponding causal graph.
The benchmark supplies a variety of models trained with interchange intervention training (IIT) to localize causal concepts in the hierarchical equality task.

I broke the repository down into two parts:
* __Generating models__: behind the scenes, trains a set of models on a variety of alignments, all using a single training set of the equality task.
* __Evaluating a model__: for those who wish to benchmark their own alignment-search method, provides a black-box model trained by IIT and evaluates a proposed alignment against it.

In [1]:
# admittedly takes a bit, because it loads the training and testing data sets
from ii_benchmark import IIBenchmarkEquality, IIBenchmarkMoNli

benchmark = IIBenchmarkEquality()

V1 = 0
V2 = 1
BOTH = 2


## Generating Models

Generates models for alignments between causal variables and neural activations. The alignments are sampled from an exhaustive enumeration of all possible alignments.
For each sampled alignment, we train three models: one that aligns only V1, one that aligns only V2, and one that aligns BOTH.

Models are saved in the `./models/` directory and are named by the following convention: the variable `v` (one of V1, V2, or BOTH), then the intervention location for V1 (layer, start index, end index), then the intervention location for V2 (layer, start index, end index).
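To make the naming convention concrete, here is a minimal sketch of a parser for such file names. The helper `parse_model_name` is hypothetical and not part of `ii_benchmark`, and the exact pattern is inferred from the single example path used later in this notebook (`v=2v1=1-3-4v2=1-4-14.pt`), so treat it as an assumption about the format.

```python
import re

# Hypothetical helper (not part of ii_benchmark). The pattern below is an
# assumption inferred from one example file name:
#   v=<v>v1=<layer>-<start>-<end>v2=<layer>-<start>-<end>.pt
_NAME_RE = re.compile(
    r"v=(?P<v>\d+)"
    r"v1=(?P<l1>\d+)-(?P<s1>\d+)-(?P<e1>\d+)"
    r"v2=(?P<l2>\d+)-(?P<s2>\d+)-(?P<e2>\d+)\.pt$"
)


def parse_model_name(path):
    """Split a model file name into the trained variable and both locations."""
    match = _NAME_RE.search(path)
    if match is None:
        raise ValueError(f"unrecognized model file name: {path!r}")
    g = {key: int(value) for key, value in match.groupdict().items()}
    return {
        "variable": g["v"],  # 0 = V1, 1 = V2, 2 = BOTH
        "v1": {"layer": g["l1"], "start": g["s1"], "end": g["e1"]},
        "v2": {"layer": g["l2"], "start": g["s2"], "end": g["e2"]},
    }


info = parse_model_name("./models/v=2v1=1-3-4v2=1-4-14.pt")
print(info)
```

The returned dictionaries use the same `layer`/`start`/`end` keys as the alignment dictionaries in the evaluation section.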

__NOTE__: currently, I map each causal variable (such as V1) to a _contiguous block of neural activations_. I wonder if we should try distributed mappings as well? For instance, V1 could map to indices 1:3 and 5:7 of layer 1's activations.
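One possible way to express such a distributed mapping is to list several blocks for a single variable. The sketch below is only an illustration: `gather_indices` is a hypothetical helper, and it assumes `end` is exclusive, as in Python slices, which the benchmark code may or may not do.

```python
# Hypothetical sketch (not part of ii_benchmark): a distributed mapping for V1
# written as several blocks. Assumes `end` is exclusive, like a Python slice.
distributed_v1 = [
    {"layer": 1, "start": 1, "end": 3},
    {"layer": 1, "start": 5, "end": 7},
]


def gather_indices(blocks):
    """Flatten a list of blocks into (layer, neuron index) pairs."""
    pairs = []
    for block in blocks:
        for index in range(block["start"], block["end"]):
            pairs.append((block["layer"], index))
    return pairs


print(gather_indices(distributed_v1))
```

An intervention on V1 would then swap every listed neuron at once, rather than a single contiguous slice.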

In [2]:
# sample `n` alignments for training models for the benchmark
n = 2
alignments = benchmark.sample_alignments(n_samples=n)

In [None]:
# train a model using IIT for each alignment (and for each variable in V1, V2, and BOTH)
# NOTE: commented out by default, because this is a time-consuming step that should only be run once
# benchmark.train_models(alignments)

## Evaluating Models

Loads a model from our list of models, and evaluates possible alignments on the model using interchange interventions.
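For readers new to interchange interventions, here is a toy sketch of the idea. It does not use the benchmark's models or API. It assumes the usual formulation of the hierarchical equality task, where the label is `(a == b) == (c == d)` with V1 = `(a == b)` and V2 = `(c == d)`, and the two-neuron "network" is hand-written for illustration.

```python
# Toy illustration only; none of these functions exist in ii_benchmark.

def hidden_layer(x):
    # neuron 0 encodes V1 = (a == b), neuron 1 encodes V2 = (c == d)
    return [float(x[0] == x[1]), float(x[2] == x[3])]


def output_layer(h):
    # the label is 1 when the two equality results agree
    return int(h[0] == h[1])


def interchange(base, source, start, end):
    """Run the base input with hidden[start:end] taken from the source run."""
    h_base = hidden_layer(base)
    h_source = hidden_layer(source)
    h_base[start:end] = h_source[start:end]
    return output_layer(h_base)


def causal_counterfactual_v1(base, source):
    """What the causal graph predicts when V1 is taken from the source input."""
    v1 = source[0] == source[1]
    v2 = base[2] == base[3]
    return int(v1 == v2)


base, source = (1, 1, 2, 3), (4, 5, 6, 7)
network_output = interchange(base, source, start=0, end=1)
causal_output = causal_counterfactual_v1(base, source)
print(network_output, causal_output)
```

If neuron 0 really encodes V1, the network's output after the swap matches the causal graph's counterfactual prediction. An evaluation in this style reports how often the two agree over many base/source pairs.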

Small note: right now, the evaluation "gives away" the alignment, because the name of each weights file encodes it. This should be a small fix, though.

In [4]:
model_path = './models/v=2v1=1-3-4v2=1-4-14.pt'
blackbox_LIM = benchmark.load_model(model_path)

In [5]:
true_alignment = {
    V1: [{'layer': 1, 'start': 3, 'end': 4}],
    V2: [{'layer': 1, 'start': 4, 'end': 14}],
    BOTH: [{'layer': 1, 'start': 3, 'end': 4}, {'layer': 1, 'start': 4, 'end': 14}]
}

In [7]:
evaluation = benchmark.evaluate(blackbox_LIM, true_alignment)
benchmark.display_evaluations(evaluation)

II-Evaluation on V1
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000

II-Evaluation on V2
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000

II-Evaluation on BOTH
              precision    recall  f1-score   support

           0       0.50      0.52      0.51       498
           1       0.50      0.48      0.49       502

    accuracy                           0.50      1000
   macro avg       0.50      0.50      0.50      1000
weighted avg

In [8]:
bad_alignment = {
    V1: [{'layer': 2, 'start': 3, 'end': 4}],
    V2: [{'layer': 2, 'start': 4, 'end': 14}],
    BOTH: [{'layer': 1, 'start': 4, 'end': 14}, {'layer': 1, 'start': 3, 'end': 4}]
}

In [9]:
evaluation = benchmark.evaluate(blackbox_LIM, bad_alignment)
benchmark.display_evaluations(evaluation)

II-Evaluation on V1
              precision    recall  f1-score   support

           0       0.50      0.50      0.50       500
           1       0.50      0.50      0.50       500

    accuracy                           0.50      1000
   macro avg       0.50      0.50      0.50      1000
weighted avg       0.50      0.50      0.50      1000

II-Evaluation on V2
              precision    recall  f1-score   support

           0       0.53      0.55      0.54       500
           1       0.53      0.50      0.52       500

    accuracy                           0.53      1000
   macro avg       0.53      0.53      0.53      1000
weighted avg       0.53      0.53      0.53      1000

II-Evaluation on BOTH
              precision    recall  f1-score   support

           0       0.50      0.52      0.51       498
           1       0.50      0.48      0.49       502

    accuracy                           0.50      1000
   macro avg       0.50      0.50      0.50      1000
weighted avg