This notebook reproduces the key figures and experiments from our [writeup on Induction Heads](https://www.notion.so/Evaluating-Anthropic-s-Induction-Head-Claims-a84793d55332409392e488d2e8b620bd). It contains very little exposition, so we recommend following along with the writeup. All section titles other than <i>Setup</i> match the corresponding sections of the writeup.

# Setup

In [10]:
from main import run_experiment
from experiments import make_experiments, make_make_corr, FixedSampler
from utils import compare_attns_in_cui, compare_saa_in_cui
from interp.circuit.causal_scrubbing.hypothesis import (
    CondSampler,
    ExactSampler,
    UncondSampler,
)

# Preliminary Experiments

The entry point for running causal scrubbing experiments is `experiments.py`, which takes command-line arguments, builds experiment specifications, and passes them to `main.py:run_experiment`. A specification consists of a correspondence graph and, in many cases, options that alter the model to enable the desired experiment.

The following code block bypasses the command-line interface of `experiments.py` by duplicating some of its functionality and calling `main.py:run_experiment` directly, reproducing the values from the <i>Preliminary Experiments</i> section of the writeup.

Note: regrettably, we changed naming conventions a few times while writing the many experiments specified in `experiments.py`. The experiments reproduced here have been renamed to match the descriptions in the writeup; for any others, do not assume that an experiment's name is enough to understand what it does.

In [6]:
experiments = make_experiments(make_make_corr(ExactSampler()))
for exp in ["unscrubbed", "ev", "pth-k", "eq", "all", "baseline", "ev+", "pth-k+", "eq+"]:
    print(f"\n\nRunning experiment {exp}")
    run_experiment(experiments, exp, 10000, "", 0, False, False)



Running experiment unscrubbed
OVERALL
     4.192    10.507   3000000
CANDIDATES
     3.263    11.428    163147
LATER CANDIDATES
     1.165     7.686     36312
REPEATS
     2.499     5.658   1278308
UNCOMMON REPEATS
     3.952    11.538    303884
NON-ERB UNCOMMON REPEATS
     5.569     9.985    172800
MISLEADING INDUCTION
     4.837     9.919    182439
CANDIDATE ERB
     0.202     0.538     30722
NFERB UR
     5.760     9.610    158631


Running experiment ev
OVERALL
     4.270    10.392   3000000
CANDIDATES
     3.382    10.846    163147
LATER CANDIDATES
     1.558     6.768     36312
REPEATS
     2.675     6.037   1278308
UNCOMMON REPEATS
     4.556    11.424    303884
NON-ERB UNCOMMON REPEATS
     6.013     9.766    172800
MISLEADING INDUCTION
     4.788     9.673    182439
CANDIDATE ERB
     0.760     1.752     30722
NFERB UR
     6.172     9.489    158631


Running experiment pth-k
OVERALL
     4.318    10.685   3000000
CANDIDATES
     3.462    10.952    163161
LATER CANDIDATES
 

# 1.5 vs 1.6

## Identifying Parroting

For Subgraph Ablation Attribution, we need to rebuild the experiments with a special sampler so that we get several resamples for a single dataset example. We then call `main.py:run_experiment` with a few special arguments that save the results to the `results` directory under a specific naming convention. This generates several pickles, which we pass to CUI using `utils.py:compare_saa_in_cui`. In CUI, the "Comparison(example)" parameter should be set either to "facet" or to a specific comparison (the code below has only one). Here we do this for the scrubbing of head 1.5, reproducing the corresponding figure from the writeup.

Note: For producing such figures for several experiments and/or indices in a row, the file `get_data.py` provides a simple command-line interface.

In [11]:
# FixedSampler yields several resamples for a single dataset example, as SAA requires.
experiments = make_experiments(make_make_corr(FixedSampler(4)))
run_experiment(
    experiments,
    "unscrubbed",
    1000,
    "unscrubbed_saa_4",
    0,
    False,
    False,
)
run_experiment(
    experiments,
    "scrub-1.5",
    1000,
    "scrub-1.5_saa_4",
    0,
    False,
    False,
)
# Pass the pickles saved by the two runs above to CUI as a single comparison.
compare_saa_in_cui([("unscrubbed", "scrub-1.5")])

OVERALL
     3.691    10.733    300000
CANDIDATES
     3.495    11.479     33000
LATER CANDIDATES
     1.411     9.299      7000
REPEATS
     2.065     6.115    139000
UNCOMMON REPEATS
     2.893    11.646     44000
NON-ERB UNCOMMON REPEATS
     4.569    14.252     21000
MISLEADING INDUCTION
     5.006    11.791     21000
CANDIDATE ERB
     0.171     0.076      6000
NFERB UR
     4.717    14.504     20000
OVERALL
     4.099    11.085    300000
CANDIDATES
     4.060    11.102     33000
LATER CANDIDATES
     2.199     7.920      7000
REPEATS
     2.595     7.968    139000
UNCOMMON REPEATS
     4.726    12.179     44000
NON-ERB UNCOMMON REPEATS
     6.766     9.850     21000
MISLEADING INDUCTION
     4.860    12.889     21000
CANDIDATE ERB
     1.113     0.978      6000
NFERB UR
     6.850    10.125     20000
Composable UI server already running on localhost:6789 in this Python process
http://interp-tools.redwoodresearch.org/#/tensors/untitled?port=6789&url=localhost
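
The summaries printed above alternate a subset label with a row of three numbers, which makes two runs hard to compare by eye. The following is a minimal sketch of a parser for that printed format, not part of the repository: `parse_summary` and `first_column_delta` are hypothetical helpers, the sample values are copied from the output above, and no meaning is assumed for the columns beyond their position.

```python
from typing import Dict, Optional, Tuple

# One printed row: three numeric columns, in the order they appear in the output.
Row = Tuple[float, float, int]

SAMPLE_UNSCRUBBED = """OVERALL
     3.691    10.733    300000
CANDIDATE ERB
     0.171     0.076      6000
"""

SAMPLE_SCRUBBED = """OVERALL
     4.099    11.085    300000
CANDIDATE ERB
     1.113     0.978      6000
"""


def parse_summary(text: str) -> Dict[str, Row]:
    """Parse one printed summary block into {subset label: (col1, col2, col3)}."""
    rows: Dict[str, Row] = {}
    label: Optional[str] = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        parts = stripped.split()
        if label is not None and len(parts) == 3:
            try:
                rows[label] = (float(parts[0]), float(parts[1]), int(parts[2]))
                label = None
                continue
            except ValueError:
                pass  # Not a numeric row; treat the line as a new label instead.
        label = stripped
    return rows


def first_column_delta(a: Dict[str, Row], b: Dict[str, Row]) -> Dict[str, float]:
    """Difference in the first column (b minus a) for labels present in both runs."""
    return {k: round(b[k][0] - a[k][0], 3) for k in a if k in b}


unscrubbed = parse_summary(SAMPLE_UNSCRUBBED)
scrubbed = parse_summary(SAMPLE_SCRUBBED)
for name, delta in first_column_delta(unscrubbed, scrubbed).items():
    print(f"{name:>15}: {delta:+.3f}")
```

To use it on a full run, capture the cell's printed output as a string and pass each block to `parse_summary`.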


Visualizing attention patterns for the unscrubbed model can be done using interp-tools directly, since the model we studied is available there as "attention_only_two_layers_untied". However, since we will need to introduce our custom attention visualization function later on, and the test input shown in the writeup is a specific one, we demonstrate usage of our attention visualization functions here.

As in the code block above, we save the results of our experiments by calling `main.py:run_experiment` with specific arguments. We then invoke `utils.py:compare_attns_in_cui` to load the relevant pickles into CUI. `compare_attns_in_cui` differs from `compare_saa_in_cui` in that each "comparison" can also be a single experiment (rather than only a pair of experiments), so we can visualize the attention patterns of one experiment without comparing them to another's. In the future, we will likely generalize `compare_saa_in_cui` in a similar fashion.

In [None]:
experiments = make_experiments(make_make_corr(FixedSampler(4)))
run_experiment(
    experiments,
    "unscrubbed",
    1000,
    "unscrubbed_saa_4",
    0,
    True,
    False,
)
compare_attns_in_cui(["unscrubbed"])
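
The difference between the two helpers comes down to what each accepts as a "comparison": a pair of experiment names, or, for `compare_attns_in_cui`, also a single name. One way a function could accept both forms is to normalize its input first. The sketch below is purely illustrative and is not taken from `utils.py`; `normalize_comparisons` is a hypothetical name.

```python
from typing import List, Sequence, Tuple, Union

# A comparison is either one experiment name or a pair of experiment names.
Comparison = Union[str, Tuple[str, str]]


def normalize_comparisons(comparisons: Sequence[Comparison]) -> List[Tuple[str, ...]]:
    """Turn every comparison into a tuple of one or two experiment names."""
    normalized: List[Tuple[str, ...]] = []
    for comp in comparisons:
        if isinstance(comp, str):
            # A bare name means "show this experiment on its own".
            normalized.append((comp,))
        elif len(comp) == 2 and all(isinstance(name, str) for name in comp):
            normalized.append(tuple(comp))
        else:
            raise ValueError(f"Expected a name or a pair of names, got {comp!r}")
    return normalized


print(normalize_comparisons(["unscrubbed", ("unscrubbed", "scrub-1.5")]))
```

With input normalized this way, downstream code only needs to branch on the tuple length.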