# SAE Bench Eval Template

## Overview

Every eval type has the following:
1. A corresponding sub-package.
2. A `main.py`, which includes:
   1. An argparse interface (`arg_parser`) for running the eval from the command line.
   2. A `run_eval` function that operates on a set of SAEs and produces one JSON file of results per SAE.
3. An `eval_config.py` with a pydantic dataclass that inherits from `BaseEvalConfig`, is specific to that eval, and has its defaults set to the recommended values.
4. An `eval_output.py` with a pydantic dataclass that inherits from `BaseEvalOutput` and holds the output specific to that eval.
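
To make the shape of `run_eval` concrete, here is a minimal sketch using only the standard library. Everything in it is an assumption made for illustration: `ToyEvalConfig`, `evaluate_single_sae`, the `run_eval` signature, and the file naming are hypothetical and are not the actual SAE Bench implementation. It only shows the pattern of writing one JSON per SAE and skipping existing results unless a rerun is forced.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ToyEvalConfig:
    # Hypothetical config; real configs inherit from BaseEvalConfig.
    random_seed: int = 42
    model_name: str = "pythia-70m-deduped"


def evaluate_single_sae(sae_release: str, sae_id: str, config: ToyEvalConfig) -> dict:
    # Placeholder for the eval-specific computation (hypothetical).
    return {"placeholder_metric": 0.0}


def run_eval(config, selected_saes, output_folder, force_rerun=False):
    """Write one JSON result file per SAE and return the paths written."""
    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sae_release, sae_id in selected_saes:
        result_path = out_dir / f"{sae_release}_{sae_id}_eval_results.json"
        if result_path.exists() and not force_rerun:
            continue  # an earlier run already produced this result
        result = {
            "eval_config": asdict(config),
            "sae_lens_release_id": sae_release,
            "sae_lens_id": sae_id,
            "eval_result_metrics": evaluate_single_sae(sae_release, sae_id, config),
        }
        result_path.write_text(json.dumps(result, indent=2))
        written.append(result_path)
    return written


# Toy (release, sae_id) pairs, made up for this sketch.
saes = [
    ("toy_release", "blocks.4.hook_resid_post__trainer_0"),
    ("toy_release", "blocks.4.hook_resid_post__trainer_1"),
]
with tempfile.TemporaryDirectory() as tmp:
    first = run_eval(ToyEvalConfig(), saes, tmp)
    second = run_eval(ToyEvalConfig(), saes, tmp)  # nothing written: files exist
    third = run_eval(ToyEvalConfig(), saes, tmp, force_rerun=True)
    print(len(first), len(second), len(third))
```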

## CLI and Eval Config

The CLI takes a combination of common arguments (the same for all evals) and eval-type-specific arguments. Eval-type-specific arguments should match the fields in the eval config of that sub-package. The common arguments should include:
- `sae_regex_pattern` and `sae_block_pattern`, regex patterns used to select SAEs from the SAE Lens library.
- `output_folder`, the folder to place the output in.
- `model_name`, for loading a model from TransformerLens.
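
As a rough illustration of how the two regex arguments could narrow down a set of SAEs, the sketch below filters a toy catalogue of `(release, sae_id)` pairs. The catalogue contents, the `select_saes` helper, and the use of `re.fullmatch` are assumptions for this example; the real selection logic lives in the SAE selection utils and may match differently.

```python
import re

# Toy catalogue of (release, sae_id) pairs, made up for this sketch.
CATALOGUE = [
    ("sae_bench_pythia70m_sweep_standard_ctx128_0712", "blocks.3.hook_resid_post__trainer_10"),
    ("sae_bench_pythia70m_sweep_standard_ctx128_0712", "blocks.4.hook_resid_post__trainer_0"),
    ("sae_bench_pythia70m_sweep_standard_ctx128_0712", "blocks.4.hook_resid_post__trainer_1"),
    ("toy_other_release", "blocks.4.hook_resid_post__trainer_0"),
]


def select_saes(sae_regex_pattern, sae_block_pattern, catalogue=CATALOGUE):
    """Keep the pairs whose release and SAE id both fully match their patterns."""
    release_re = re.compile(sae_regex_pattern)
    block_re = re.compile(sae_block_pattern)
    return [
        (release, sae_id)
        for release, sae_id in catalogue
        if release_re.fullmatch(release) and block_re.fullmatch(sae_id)
    ]


selected = select_saes(
    "sae_bench_pythia70m_sweep_standard_ctx128_0712",
    r"blocks.4.hook_resid_post__trainer_.*",
)
for release, sae_id in selected:
    print(release, sae_id)
```

With these toy entries, only the two `blocks.4` SAEs from the first release are selected.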

Each eval config should be a `pydantic.dataclass` that inherits from `BaseEvalConfig`. This lets you add "Title" and "Description" annotations to each field. Please do so: when the fields are displayed in the UI, they then have user-friendly display names and hover-able descriptions that explain what each field means and why it matters. For an example, see `/evals/absorption/eval_config.py`.

To see which SAEs you can select via the regex arguments, use the SAE selection utils like this:

```python
from sae_bench_utils.sae_selection_utils import print_all_sae_releases, print_release_details

# Callum came up with this format which I like visually.
print_all_sae_releases()  # Each release has a corresponding model / repo_id. We recommend you don't select releases with different models when running evals.
print_release_details('gpt2-small-res-jb')  # Each release has a number of possible SAEs.
```


```
┌─────────────────────────────────────┬─────────────────────────────────────────────────────┬────────────────────────────────────────────────────────┬──────────┐
│ model                               │ release                                             │ repo_id                                                │   n_saes │
├─────────────────────────────────────┼─────────────────────────────────────────────────────┼────────────────────────────────────────────────────────┼──────────┤
│ gemma-2-27b                         │ gemma-scope-27b-pt-res                              │ google/gemma-scope-27b-pt-res                          │       18 │
│ gemma-2-27b                         │ gemma-scope-27b-pt-res-canonical                    │ google/gemma-scope-27b-pt-res                          │        3 │
│ gemma-2-2b                          │ gemma-scope-2b-pt-res                               │ google/gemma-scope-2b-pt-res                           │      310 │
...
```

Here's an example call to the absorption eval. Note that we are selecting just one release / SAE (though we could select more) and that we use the default values for the eval-specific arguments by not setting them via the CLI.

```bash
python evals/absorption/main.py \
--sae_regex_pattern "sae_bench_pythia70m_sweep_standard_ctx128_0712" \
--sae_block_pattern "blocks.4.hook_resid_post__trainer_.*" \
--model_name pythia-70m-deduped \
--output_folder results
```

To create such an interface, define an `arg_parser` function in `main.py` as below, and define an `EvalConfig` in an `eval_config.py` file inside the eval sub-package. Eval configs should be dataclass objects with serializable values, so that they are easy to save and load.

You can test whether you've set up the `EvalConfig` and `arg_parser` correctly by using the `validate_eval_cli_interface` testing util. Feel free to change the CLI args / Eval Config to test the validation.

```python
# Notebook magic: installs pydantic into the current kernel if it is missing.
%pip install pydantic

import argparse
from pydantic import Field
from pydantic.dataclasses import dataclass
from sae_bench_utils.testing_utils import validate_eval_cli_interface

def arg_parser():
    parser = argparse.ArgumentParser(description="Run absorption evaluation")
    # Eval-specific arguments: names, types, and defaults mirror the EvalConfig fields below.
    parser.add_argument("--arg1", type=int, default=42, help="Description for arg1")
    parser.add_argument("--arg2", type=float, default=0.03, help="Description for arg2")
    parser.add_argument("--arg3", type=int, default=10, help="Description for arg3")
    parser.add_argument("--arg4", type=str, default="{word} has the first letter:", help="Description for arg4")
    parser.add_argument("--arg5", type=int, default=-6, help="Description for arg5")
    parser.add_argument("--model_name", type=str, default="pythia-70m-deduped", help="Description for model name")
    # Common arguments shared by all evals.
    parser.add_argument("--sae_regex_pattern", type=str, required=True, help="Regex pattern for SAE selection")
    parser.add_argument("--sae_block_pattern", type=str, required=True, help="Regex pattern for SAE block selection")
    parser.add_argument("--output_folder", type=str, default="eval_results/absorption", help="Output folder")
    parser.add_argument("--force_rerun", action="store_true", help="Force rerun of experiments")

    return parser

@dataclass
class EvalConfig:
    arg1: int = Field(default=42, title="Arg1", description="Description for arg1")
    arg2: float = Field(default=0.03, title="Arg2", description="Description for arg2")
    arg3: int = Field(default=10, title="Arg3", description="Description for arg3")
    arg4: str = Field(default="{word} has the first letter:", title="Arg4", description="Description for arg4")
    arg5: int = Field(default=-6, title="Arg5", description="Description for arg5")
    model_name: str = Field(default="pythia-70m-deduped", title="Model Name", description="Description for model name")

# Check that the CLI arguments and the config fields agree.
validate_eval_cli_interface(arg_parser(), eval_config_cls=EvalConfig)
```
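
Once the parser and the config agree, the parsed arguments still need to be turned into a config object. The sketch below shows one way this could be done, using standard-library dataclasses so that it runs without extra dependencies (the real configs are pydantic dataclasses). `ToyEvalConfig`, `toy_arg_parser`, and `config_from_args` are hypothetical names introduced for this example; the field names are borrowed from the absorption config shown later in this document.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ToyEvalConfig:
    # Hypothetical config with a few eval-specific fields.
    random_seed: int = 42
    f1_jump_threshold: float = 0.03
    model_name: str = "pythia-70m-deduped"


def toy_arg_parser():
    parser = argparse.ArgumentParser(description="Toy eval")
    parser.add_argument("--random_seed", type=int, default=42)
    parser.add_argument("--f1_jump_threshold", type=float, default=0.03)
    parser.add_argument("--model_name", type=str, default="pythia-70m-deduped")
    parser.add_argument("--sae_regex_pattern", type=str, required=True)
    parser.add_argument("--sae_block_pattern", type=str, required=True)
    parser.add_argument("--output_folder", type=str, default="eval_results/toy")
    return parser


def config_from_args(args, config_cls):
    """Build a config from only those parsed arguments that are config fields."""
    field_names = {f.name for f in fields(config_cls)}
    return config_cls(**{k: v for k, v in vars(args).items() if k in field_names})


args = toy_arg_parser().parse_args(
    ["--sae_regex_pattern", "toy_release", "--sae_block_pattern", ".*", "--random_seed", "7"]
)
config = config_from_args(args, ToyEvalConfig)
print(config)
```

Common arguments such as `output_folder` stay on `args` and do not end up in the config, which keeps the saved config limited to the eval-specific settings.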


## Output Format

Each output JSON should correspond to one SAE. The base structure is defined by `BaseEvalOutput` in `evals/base_eval_output.py`.

An eval output is a `pydantic.dataclass` that inherits from `BaseEvalOutput`. This makes the JSON output files easy to verify (pydantic automatically generates a JSON Schema from dataclasses) and portable to other languages and apps. As with the eval config, you can add "Title" and "Description" annotations to each field, and these are saved in the JSON schema. For an example, see `/evals/absorption/eval_output.py` and `/evals/absorption/eval_output_schema.json`.
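
To see how field annotations end up in a JSON Schema, here is a small sketch that assumes pydantic v2 is installed. `ToyMetrics` is a hypothetical dataclass made up for this example, not one of the SAE Bench types; `TypeAdapter` is pydantic's general-purpose entry point for schema generation and validation of arbitrary types, including pydantic dataclasses.

```python
from pydantic import Field, TypeAdapter, ValidationError
from pydantic.dataclasses import dataclass


@dataclass
class ToyMetrics:
    # Hypothetical metric with a display title and a description.
    mean_score: float = Field(
        title="Mean Score",
        description="Average score across all items.",
    )


adapter = TypeAdapter(ToyMetrics)

# The title and description are carried into the generated JSON Schema.
schema = adapter.json_schema()
print(schema["properties"]["mean_score"])

# The same adapter validates JSON against the dataclass definition.
valid = adapter.validate_json('{"mean_score": 0.25}')
print(valid)

try:
    adapter.validate_json("{}")  # the required field is missing
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
print(missing_field_rejected)
```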

To build an output, inherit from BaseEvalOutput and create your own `eval_output.py`. An example from `absorption/eval_output.py` is partially pasted below:

```python
# Define the eval output, which includes the eval config, metrics, and result details.
# The title will end up being the title of the eval in the UI.
@dataclass(config=ConfigDict(title="Feature Absorption Evaluation - First Letter"))
class AbsorptionEvalOutput(
    BaseEvalOutput[
        AbsorptionEvalConfig, AbsorptionMetricCategories, AbsorptionResultDetail
    ]
):
    # This will end up being the description of the eval in the UI.
    """
    The output of a feature absorption evaluation looking at the first letter.
    """

    eval_config: AbsorptionEvalConfig
    eval_id: str
    datetime_epoch_millis: int
    eval_result_metrics: AbsorptionMetricCategories
    eval_result_details: list[AbsorptionResultDetail] = Field(
        default_factory=list,
        title="Per-Letter Absorption Results",
        description="Each object is a stat on the first letter of the absorption.",
    )
    eval_type_id: str = Field(
        default="absorption_first_letter",
        title="Eval Type ID",
        description="The type of the evaluation",
    )
```

Then, when you've run the eval, put the results into your eval_output type (in this case, AbsorptionEvalOutput), like so:

```python
eval_output = AbsorptionEvalOutput(
    eval_type_id="absorption_first_letter",
    eval_config=config,
    eval_id=get_eval_uuid(),
    datetime_epoch_millis=int(datetime.now().timestamp() * 1000),
    eval_result_metrics=AbsorptionMetricCategories(
        mean=AbsorptionMeanMetrics(
            mean_absorption_score=statistics.mean(absorption_rates),
            mean_num_split_features=statistics.mean(num_split_features),
        )
    ),
    eval_result_details=eval_result_details,
    sae_bench_commit_hash=get_sae_bench_version(),
    sae_lens_id=sae_id,
    sae_lens_release_id=sae_release,
    sae_lens_version=get_sae_lens_version(),
)
```
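
The `eval_id` and `datetime_epoch_millis` values can be produced with the standard library alone. The sketch below is an assumption about what such helpers might look like: `make_eval_id` is a hypothetical stand-in for `get_eval_uuid`, whose actual implementation is not shown in this document, and `now_epoch_millis` mirrors the timestamp expression used above.

```python
import uuid
from datetime import datetime, timezone


def make_eval_id() -> str:
    # Hypothetical stand-in for get_eval_uuid(): a random UUID4 as a string.
    return str(uuid.uuid4())


def now_epoch_millis() -> int:
    # Milliseconds since the Unix epoch, as an integer.
    return int(datetime.now(timezone.utc).timestamp() * 1000)


eval_id = make_eval_id()
millis = now_epoch_millis()
print(eval_id, millis)
```

Keeping the timestamp as an integer rather than a string matches the `datetime_epoch_millis: int` field and makes results easy to sort by time.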

Finally, write the output to a JSON file:
```python
eval_output.to_json_file(sae_result_path, indent=2)
```

Here's what that output would look like:
```json
{
    "eval_type_id": "absorption_first_letter",
    "eval_config": {
        "random_seed": 42,
        "f1_jump_threshold": 0.03,
        "max_k_value": 10,
        "prompt_template": "{word} has the first letter:",
        "prompt_token_pos": -6,
        "model_name": "pythia-70m-deduped"
    },
    "eval_id": "0c057d5e-973e-410e-8e32-32569323b5e6",
    "datetime_epoch_millis": 1729834113150,
    "eval_result_metrics": {
        "mean": {
            "mean_absorption_score": 2,
            "mean_num_split_features": 3.5
        }
    },
    "eval_result_details": [
        {
            "first_letter": "a",
            "num_absorption": 177,
            "absorption_rate": 0.28780487804878047,
            "num_probe_true_positives": 615.0,
            "num_split_features": 1
        },
        {
            "first_letter": "b",
            "num_absorption": 51,
            "absorption_rate": 0.1650485436893204,
            "num_probe_true_positives": 309.0,
            "num_split_features": 1
        }
    ],
    "sae_bench_commit_hash": "57e9be0ac9199dba6b9f87fe92f80532e9aefced",
    "sae_lens_id": "blocks.3.hook_resid_post__trainer_10",
    "sae_lens_release_id": "sae_bench_pythia70m_sweep_standard_ctx128_0712",
    "sae_lens_version": "4.0.0"
}
```

You can see tests for this under `tests/evals/absorption/test_eval_output.py`.

Since you're using a pydantic dataclass to define the output, you shouldn't need any additional verification of the output. However, if you want to check whether a JSON meets the defined output spec, you can call `validate_eval_output_format_file` or `validate_eval_output_format_str`. Feel free to break the JSON (e.g. remove a field like `sae_lens_release_id`) and see the validation fail.

The JSON schema files themselves are generated by `evals/generate_json_schemas.py`. To regenerate them, run:
```bash
python evals/generate_json_schemas.py
```

### What if I have unstructured outputs I want to save into the JSON?
Put unstructured outputs into the `eval_result_unstructured` field, which accepts data of any type. However, be aware that because this field has no structure, it is less likely to support sorting, filtering, or visualizations using these values. We highly encourage you to use `eval_result_metrics` or `eval_result_details` whenever possible.

```python
import json
import os
from evals.absorption.eval_output import AbsorptionEvalOutput
from sae_bench_utils.testing_utils import validate_eval_output_format_file

eval_results_temp = {
    "eval_type_id": "absorption_first_letter",
    "eval_config": {
        "random_seed": 42,
        "f1_jump_threshold": 0.03,
        "max_k_value": 10,
        "prompt_template": "{word} has the first letter:",
        "prompt_token_pos": -6,
        "model_name": "pythia-70m-deduped",
    },
    "eval_id": "0c057d5e-973e-410e-8e32-32569323b5e6",
    "datetime_epoch_millis": 1729834113150,
    "eval_result_metrics": {
        "mean": {
            "mean_absorption_score": 2,
            "mean_num_split_features": 3.5,
        }
    },
    "eval_result_details": [
        {
            "first_letter": "a",
            "num_absorption": 177,
            "absorption_rate": 0.28780487804878047,
            "num_probe_true_positives": 615.0,
            "num_split_features": 1,
        },
        {
            "first_letter": "b",
            "num_absorption": 51,
            "absorption_rate": 0.1650485436893204,
            "num_probe_true_positives": 309.0,
            "num_split_features": 1,
        },
    ],
    # Unstructured data of any JSON-serializable type is allowed here.
    "eval_result_unstructured": {
        "pew pew": "pew pew",
        "bar": ["woof", 1, 3],
        3: 3,
    },
    "sae_bench_commit_hash": "57e9be0ac9199dba6b9f87fe92f80532e9aefced",
    "sae_lens_id": "blocks.3.hook_resid_post__trainer_10",
    "sae_lens_release_id": "sae_bench_pythia70m_sweep_standard_ctx128_0712",
    "sae_lens_version": "4.0.0",
}


# Save to file
with open('eval_results_temp.json', 'w') as f:
    json.dump(eval_results_temp, f)

# Validate the file against the output spec
validate_eval_output_format_file('eval_results_temp.json', eval_output_type=AbsorptionEvalOutput)

# Delete file
os.remove('eval_results_temp.json')
```
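
One detail of the unstructured field in the example above is worth knowing: JSON object keys are always strings, so the integer key `3` is written as `"3"` by `json.dump` and comes back as a string when the file is loaded. The short standard-library sketch below shows the round trip.

```python
import json

unstructured = {"pew pew": "pew pew", "bar": ["woof", 1, 3], 3: 3}

# Serialize and parse again, as happens when a result file is saved and reloaded.
round_tripped = json.loads(json.dumps(unstructured))

print(round_tripped)  # {'pew pew': 'pew pew', 'bar': ['woof', 1, 3], '3': 3}
print(3 in round_tripped, "3" in round_tripped)  # False True
```

If downstream code looks values up by key, using string keys from the start avoids this surprise.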

We can then load the eval result JSONs across many different SAEs and see clearly which evals were run with which parameters and code.
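
As a sketch of what loading results across SAEs could look like, the example below walks a folder of result files and collects a few fields from each into a flat list of rows. The `load_eval_results` helper and the two toy result files (including their metric values) are made up for illustration; the field names follow the output format described above.

```python
import json
import tempfile
from pathlib import Path


def load_eval_results(results_dir):
    """Collect key fields from every result JSON under results_dir."""
    rows = []
    for path in sorted(Path(results_dir).glob("**/*.json")):
        data = json.loads(path.read_text())
        rows.append(
            {
                "sae_lens_release_id": data["sae_lens_release_id"],
                "sae_lens_id": data["sae_lens_id"],
                "eval_type_id": data["eval_type_id"],
                "sae_bench_commit_hash": data["sae_bench_commit_hash"],
                "eval_config": data["eval_config"],
                "mean_absorption_score": data["eval_result_metrics"]["mean"]["mean_absorption_score"],
            }
        )
    return rows


def toy_result(sae_id, score):
    # Toy result with made-up values, shaped like the output format above.
    return {
        "eval_type_id": "absorption_first_letter",
        "eval_config": {"random_seed": 42, "model_name": "pythia-70m-deduped"},
        "eval_result_metrics": {"mean": {"mean_absorption_score": score}},
        "sae_bench_commit_hash": "toy_commit_hash",
        "sae_lens_id": sae_id,
        "sae_lens_release_id": "toy_release",
    }


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text(json.dumps(toy_result("toy_sae_0", 0.1)))
    Path(tmp, "b.json").write_text(json.dumps(toy_result("toy_sae_1", 0.2)))
    rows = load_eval_results(tmp)

for row in rows:
    print(row["sae_lens_id"], row["mean_absorption_score"], row["eval_config"]["random_seed"])
```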


## A note on cached results

A variety of evals can share the results of intermediate computation, such as model activations or trained probes. Most of these are specific to a model and hook point, so they should be saved along a path of the format `f'{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}'`.
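
The path convention above could be wrapped in a small load-or-compute helper. The sketch below is one possible approach using only the standard library; `artifact_path`, `load_or_compute`, the use of pickle, and the example artifact name are all assumptions for illustration and not part of SAE Bench.

```python
import pickle
import tempfile
from pathlib import Path


def artifact_path(artifact_dir, eval_type, model, hook_point, artifact_id):
    """Build the cache path {artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}."""
    return Path(artifact_dir) / eval_type / model / hook_point / artifact_id


def load_or_compute(path, compute_fn):
    """Return the cached artifact if present; otherwise compute and cache it."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    value = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def expensive_computation():
    # Stand-in for collecting activations or training a probe.
    calls.append(1)
    return {"activations": [0.1, 0.2]}


with tempfile.TemporaryDirectory() as tmp:
    path = artifact_path(
        tmp, "absorption", "pythia-70m-deduped", "blocks.4.hook_resid_post", "toy_probe.pkl"
    )
    first = load_or_compute(path, expensive_computation)
    second = load_or_compute(path, expensive_computation)  # served from the cache
    relative = path.relative_to(tmp).as_posix()

print(relative, len(calls))
```

Only load pickled artifacts that you created yourself, since unpickling untrusted files can execute arbitrary code.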

