# FADE Tutorial Notebook

Welcome to the official tutorial for FADE — an open-source framework for evaluating how well internal features in transformer models align with their natural language descriptions.

When working with modern language models, we often try to understand their internal mechanics by assigning natural language descriptions to specific neurons or features. But a critical question remains: **How accurately do these descriptions represent what the features actually encode?** FADE provides a systematic approach to answering this question.

FADE introduces four complementary metrics that together provide a comprehensive assessment of feature-description alignment:

- **Clarity**: Can we generate text that reliably activates a given feature? This tests whether the description is precise enough to be useful for creating activating content.
- **Responsiveness**: Do examples that naturally activate the feature actually express the concept we've described? This tests whether the feature truly responds to instances of the concept in natural data.
- **Purity**: Are the feature's activations specific to our described concept, or does it fire for unrelated content too? This tests whether the feature is dedicated to the concept or is polysemantic.
- **Faithfulness**: Does modifying a feature's activation cause the model to generate more concept-related content? This tests the causal relationship between the feature and model outputs.

By combining these perspectives, FADE helps identify mismatches between features and their descriptions, ultimately improving our understanding of what language models are actually learning.

### 💡 What This Notebook Covers
- [Setup](#0-setup): Installing FADE and preparing your environment
- [Quickstart](#1-quickstart): Learn how to run FADE on a single neuron with just a few lines of code
- [Using Cached Activations](#2-using-cached-activations): Dramatically speed up evaluations by reusing precomputed activations
- [Evaluating SAE features](#3-evaluating-sae-features): Apply FADE to Sparse Autoencoder features embedded in your model
- [Running FADE with other Explainer Models](#4-running-fade-with-other-explainer-models): Use various LLMs as explainers beyond just OpenAI models
- [Further Options](#5-further-options): Fine-tune FADE's behavior with advanced configuration options

# 0. Setup


Before diving in, we need to set up our environment. FADE is designed to evaluate feature-concept alignment in transformer models, so we'll need to load a subject model, tokenizer, and a dataset to work with.

First, install FADE if you haven't done so already. It is available on PyPI:

```bash
pip install fade-language
```

Now, let's import the necessary libraries:


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer # necessary for loading the subject model and tokenizer
from datasets import load_dataset # necessary for loading the dataset
from huggingface_hub import hf_hub_download # necessary for loading the SAE weights
import numpy as np # necessary for loading SAE weights

from fade import EvaluationPipeline # main entry point to the evaluation pipeline

from utils import generate_activations # example function to generate activations 
from utils import SAEModuleWrapper # function to embed SAEs in the model as named modules
from utils import JumpReLUSAEEncodeNoBOS, JumpReLUSAEDecode # custom modules to use a JumpReLU SAE as an example

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

For FADE to effectively evaluate feature-concept alignment, we need a natural text dataset. Ideally, this dataset would closely match the distribution of the model's pre-training data to ensure valid results.

For this tutorial, we'll use a small subset of the `NeelNanda/pile-10k` dataset to keep things running quickly. For real-world evaluations, we strongly recommend:
1. Using a larger, more representative dataset
2. Pre-computing activations to speed up the process (covered in section 2)

In [None]:
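# Load a small example dataset
dataset = load_dataset("NeelNanda/pile-10k")
# Truncate each text to its first 250 characters and key it by row index
dataset = {i: row["text"][:250] for i, row in enumerate(dataset["train"])}
# Keep only the first 100 samples so the tutorial runs quickly
dataset = {i: row for i, row in dataset.items() if i < 100}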
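
The cell above produces a plain dictionary that maps an integer sample id to a text snippet. As a minimal sketch of that format, the hypothetical helper `build_text_dataset` below (not part of FADE) builds the same structure from any iterable of strings and stops after `max_samples` instead of truncating every row first:

```python
from typing import Dict, Iterable


def build_text_dataset(texts: Iterable[str], max_chars: int = 250, max_samples: int = 100) -> Dict[int, str]:
    """Map consecutive integer ids to truncated text snippets."""
    dataset = {}
    for i, text in enumerate(texts):
        if i >= max_samples:
            break  # stop early once enough samples are collected
        dataset[i] = text[:max_chars]
    return dataset


# Toy input so the sketch runs without downloading anything
example = build_text_dataset(["first document " * 50, "second document"], max_chars=20)
print(example)
```

With the real data, the same helper could be called as `build_text_dataset(row["text"] for row in load_dataset("NeelNanda/pile-10k")["train"])`.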

# 1. Quickstart

In this section, we'll demonstrate how to run a basic FADE evaluation on a single neuron. This is perfect for quickly testing the framework or exploring a specific feature of interest.

For our example, we'll use Gemma 2 2B, a fairly compact but capable language model. FADE is not tied to this model: it should work with most transformer-based models available through Hugging Face.

In [None]:
# Load our subject model - this is the model containing the features we want to evaluate
subject_model_path = "google/gemma-2-2b"
model = AutoModelForCausalLM.from_pretrained(subject_model_path)
tokenizer = AutoTokenizer.from_pretrained(subject_model_path)

In [None]:
# Configure the evaluation pipeline
# We specify which LLM will be used as the "evaluator" for generating and rating content
# This evaluator model helps assess how well features align with their descriptions
eval_config = {
    'evaluationLLM': {
        'type': 'openai', # here we use OpenAI's API
        'name': 'gpt-4o-mini-2024-07-18', # At the time of writing a strong but efficient model for evaluation
        'api_key': 'YOUR-KEY-HERE', # Replace with your actual API key
    }
}

# Initialize the evaluation pipeline with our model, tokenizer, and dataset
eval_pipeline = EvaluationPipeline(
    subject_model=model,  
    subject_tokenizer=tokenizer, 
    dataset=dataset,
    config=eval_config,
    device=device, 
    verbose=True  # Set to True to see progress details
)
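
Rather than pasting an API key into the notebook, the key can be read from the environment. The sketch below assumes the key is stored in `OPENAI_API_KEY`, the variable name conventionally used by the OpenAI SDK. The helper name `openai_eval_config` is hypothetical, and the dictionary keys are the ones used in the cell above:

```python
import os


def openai_eval_config(model_name: str = "gpt-4o-mini-2024-07-18",
                       env_var: str = "OPENAI_API_KEY") -> dict:
    """Build the evaluationLLM config, taking the API key from the environment."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the evaluation.")
    return {
        "evaluationLLM": {
            "type": "openai",
            "name": model_name,
            "api_key": api_key,
        }
    }
```

The result can be passed as `config=openai_eval_config()` when constructing the `EvaluationPipeline`.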

In [None]:
# Define which feature we want to evaluate
neuron_module = "model.layers[20].mlp.act_fn"  # Path to the module containing our feature
neuron_index = 42  # Index of the specific neuron/feature we're evaluating
concept = "The feature description you want to evaluate."  # What we think this feature represents

# Run FADE evaluation to get our four alignment metrics
(clarity, responsiveness, purity, faithfulness) = eval_pipeline.run(
    neuron_module=neuron_module,
    neuron_index=neuron_index,
    concept=concept)

print(f"\nClarity: {clarity} - How well we can generate text that activates this feature", 
      f"\nResponsiveness: {responsiveness} - How well the feature responds to the concept in natural data",
      f"\nPurity: {purity} - How specific the feature is to our concept",
      f"\nFaithfulness: {faithfulness} - How much this feature causally influences generating concept-related content")
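
The `neuron_module` string is a path of attribute names and list indices, read relative to the top-level model object. The sketch below illustrates how such a path can be followed step by step. It is not FADE's own resolver: `resolve_path` is a hypothetical helper, and the toy object only mimics the nesting of a real model.

```python
import re
from types import SimpleNamespace

# One path step: either ".name" / "name" or "[index]"
_TOKEN = re.compile(r"\.?([A-Za-z_]\w*)|\[(\d+)\]")


def resolve_path(root, path: str):
    """Follow a path such as 'model.layers[20].mlp.act_fn' starting at `root`."""
    obj = root
    pos = 0
    while pos < len(path):
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError(f"Cannot parse {path!r} at position {pos}")
        name, index = match.groups()
        # Attribute access for names, item access for bracketed indices
        obj = getattr(obj, name) if name is not None else obj[int(index)]
        pos = match.end()
    return obj


# Toy stand-in for a model with three layers
act_fn = SimpleNamespace(kind="activation")
layers = [SimpleNamespace(mlp=SimpleNamespace(act_fn=act_fn)) for _ in range(3)]
toy_model = SimpleNamespace(model=SimpleNamespace(layers=layers))

print(resolve_path(toy_model, "model.layers[2].mlp.act_fn").kind)
```

Because the helper only uses `getattr` and integer indexing, it should also work on a loaded model, which makes it a quick way to check that a path points at the module you intended before starting an evaluation.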

# 2. Using Cached Activations

One of the most significant performance optimizations in FADE is the ability to use cached activations. This approach delivers two major benefits:

1. **Dramatic speed improvements**: Instead of recomputing activations for the entire dataset for each feature, you compute them once and reuse them.
2. **Memory efficiency**: You can evaluate models on much larger datasets than would be possible with on-the-fly computation.

This becomes especially critical when evaluating hundreds or thousands of features across a model, which is common in comprehensive interpretability studies.

Let's see how to implement activation caching:


In [None]:
# load the subject model
subject_model_path = "google/gemma-2-2b"
model = AutoModelForCausalLM.from_pretrained(subject_model_path)
tokenizer = AutoTokenizer.from_pretrained(subject_model_path)

In [None]:
# Generate activations for the entire dataset once
# These will be reused for evaluating multiple features
neuron_module = "model.layers[20].mlp.act_fn" 
activations = generate_activations(
    model=model,
    tokenizer=tokenizer,
    dataset=dataset,
    named_module=neuron_module,
    device=device,
    batch_size=4, # Adjust based on your available memory
) # This returns raw activations for each sample in the dataset

# Reduce the sequence dimension by taking the maximum absolute activation for each neuron
# This gives us a single value per sample per neuron, which is what FADE expects
activations = {i: torch.max(torch.abs(activation), dim=0)[0] for i, activation in activations.items()}

# Create a function that returns the cached activations for a given neuron index
# This function follows the ActivationLoader protocol expected by FADE
def feature_activation_loader(neuron_index: int) -> torch.Tensor:
    return torch.stack([activation[neuron_index] for activation in activations.values()])

# Note: You can create and load the activations in any way you want, as long as you adhere to the ActivationLoader protocol in fade.data
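
To make the reduction and the loader shape concrete without loading a model, here is the same idea on plain Python lists. This is an illustration only: it assumes each raw activation is laid out as `[sequence position][neuron index]` (which is what the `dim=0` reduction above implies), and the real loader returns a `torch.Tensor` rather than a list.

```python
from typing import Callable, Dict, List

# Toy "raw" activations: sample id -> [sequence position][neuron index]
raw = {
    0: [[0.1, -2.0], [0.5, 0.3]],
    1: [[-0.7, 0.0], [0.2, 1.5], [0.4, -0.1]],
}


def max_abs_over_sequence(tokens: List[List[float]]) -> List[float]:
    """Collapse the sequence axis, keeping the largest absolute value per neuron."""
    return [max(abs(value) for value in column) for column in zip(*tokens)]


reduced: Dict[int, List[float]] = {i: max_abs_over_sequence(t) for i, t in raw.items()}


def make_loader(cache: Dict[int, List[float]]) -> Callable[[int], List[float]]:
    """Return a function that maps a neuron index to one value per sample."""
    def loader(neuron_index: int) -> List[float]:
        return [cache[i][neuron_index] for i in sorted(cache)]
    return loader


loader = make_loader(reduced)
print(loader(0))  # one value per sample for neuron 0
print(loader(1))  # one value per sample for neuron 1
```

Note that taking the absolute value discards the sign, so a strongly negative activation such as `-2.0` counts as a strong activation after the reduction.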

In [None]:
# we initialize custom configuration with OpenAI LLM
eval_config = {
    'evaluationLLM': {
        'type': 'openai',
        'name': 'gpt-4o-mini-2024-07-18',
        'api_key': 'YOUR-KEY-HERE',
    }
}

eval_pipeline = EvaluationPipeline(
    subject_model=model,  
    subject_tokenizer=tokenizer, 
    dataset=dataset,
    activations=feature_activation_loader, # we pass the function that returns the activations for a given neuron index
    config=eval_config,
    device=device, 
    verbose=True)

In [None]:
# we again evaluate a single neuron but this time we use the pre-computed activations
neuron_module = "model.layers[20].mlp.act_fn"
neuron_index = 42
concept = "The feature description you want to evaluate."

(clarity, responsiveness, purity, faithfulness) = eval_pipeline.run(
    neuron_module=neuron_module,
    neuron_index=neuron_index,
    concept=concept
    )
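
Caching pays off once many features of the same module are evaluated. A minimal sketch of such a loop follows; `evaluate_features` is a hypothetical helper, and the only FADE call it relies on is `eval_pipeline.run` with the keyword arguments shown above.

```python
from typing import Dict, List


def evaluate_features(pipeline, neuron_module: str, concepts: Dict[int, str]) -> List[dict]:
    """Run the pipeline for every (neuron index, description) pair and collect the scores."""
    rows = []
    for neuron_index, concept in concepts.items():
        clarity, responsiveness, purity, faithfulness = pipeline.run(
            neuron_module=neuron_module,
            neuron_index=neuron_index,
            concept=concept,
        )
        rows.append({
            "neuron_index": neuron_index,
            "concept": concept,
            "clarity": clarity,
            "responsiveness": responsiveness,
            "purity": purity,
            "faithfulness": faithfulness,
        })
    return rows
```

It could be called as `evaluate_features(eval_pipeline, "model.layers[20].mlp.act_fn", {42: "...", 43: "..."})`, where the dictionary maps neuron indices to the descriptions you want to test.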

# 3. Evaluating SAE features

Sparse Autoencoders (SAEs) have become increasingly important in interpretability research because they can decompose the often polysemantic activations of regular model neurons into more monosemantic features. This decomposition can help make the features more interpretable, but how can we verify this?

FADE can be applied to SAE features directly, letting us quantitatively assess whether they actually align better with their descriptions than regular neurons. The only requirement is that the SAE must be embedded into the model's computational graph as a named module.

Below, we'll demonstrate evaluating a JumpReLU SAE, a popular variant that has shown promising results in decomposing language model activations:

In [None]:
# we load our subject model of choice
subject_model_path = "google/gemma-2-2b"
model = AutoModelForCausalLM.from_pretrained(subject_model_path)
tokenizer = AutoTokenizer.from_pretrained(subject_model_path)

In [None]:
# Load the SAE parameters from a pretrained checkpoint
path_to_params = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
    force_download=False,
)
params = np.load(path_to_params)
pt_params = {k: torch.from_numpy(v) for k, v in params.items()}

# Separate encoder and decoder parameters
pt_params_decode = {k: v for k, v in pt_params.items() if 'dec' in k}
pt_params_encode = {k: v for k, v in pt_params.items() if k not in pt_params_decode}

# Initialize the SAE encoder and decoder modules
sae_encoder = JumpReLUSAEEncodeNoBOS(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae_encoder.load_state_dict(pt_params_encode)

sae_decoder = JumpReLUSAEDecode(params['W_enc'].shape[0], params['W_enc'].shape[1])
sae_decoder.load_state_dict(pt_params_decode)

# Embed the SAE into the model's computational graph. This allows FADE to access the SAE's activations during evaluation as "model.layers[20].additional_modules[0]".
model.model.layers[20] = SAEModuleWrapper(original_module=model.model.layers[20], additional_modules=[sae_encoder, sae_decoder])
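
For readers unfamiliar with JumpReLU: the activation passes a value through unchanged only if it exceeds a per-feature threshold, and outputs zero otherwise. The sketch below shows that standard definition on plain lists. The `JumpReLUSAEEncodeNoBOS` class lives in this notebook's `utils` and is not shown here, so treat this as an illustration of the activation function rather than of that class.

```python
from typing import List


def jump_relu(pre_activations: List[float], thresholds: List[float]) -> List[float]:
    """Keep a pre-activation unchanged if it exceeds its threshold, otherwise output 0."""
    return [z if z > t else 0.0 for z, t in zip(pre_activations, thresholds)]


# 0.2 would survive a plain ReLU but is zeroed here because it is below the threshold
print(jump_relu([0.2, 1.5, -0.3], [0.5, 0.5, 0.5]))
```

In an SAE encoder the pre-activations are typically an affine function of the residual stream, and the thresholds are learned together with the encoder weights.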

In [None]:
# we initialize custom configuration with OpenAI LLM
eval_config = {
    'evaluationLLM': {
        'type': 'openai',
        'name': 'gpt-4o-mini-2024-07-18',
        'api_key': 'YOUR-KEY-HERE',
    },
    "subjectLLM": {
        "sae_module": True,  # This tells FADE to use SAE-type steering for the faithfulness metric
    },
}

# initialize eval pipeline
eval_pipeline = EvaluationPipeline(
    subject_model=model,  
    subject_tokenizer=tokenizer, 
    dataset=dataset,
    config=eval_config,
    device=device, 
    verbose=True,
    )

In [None]:
# example neuron specification:
neuron_module = "model.layers[20].additional_modules[0]"  # str of the added named SAE module!
neuron_index = 42  # int of the neuron index in the named module
concept = "The feature description you want to evaluate."

(clarity, responsiveness, purity, faithfulness) = eval_pipeline.run(
    neuron_module=neuron_module,
    neuron_index=neuron_index,
    concept=concept)

# 4. Running FADE with other Explainer Models

FADE is designed to be flexible in its choice of evaluation LLMs, supporting both proprietary models and locally hosted models through a variety of interfaces. This flexibility offers several important advantages:
1. **Model selection flexibility**: Choose the model that best fits your specific evaluation needs - different models may excel at different types of concept understanding
2. **Cost optimization**: Adjust your model choice based on budget constraints or evaluation scale
3. **Research adaptability**: Some interpretability research may involve concepts that mainstream models are reluctant to evaluate due to safety filters
4. **Bias mitigation**: Compare results across different models to identify and control for model-specific biases in evaluations
5. **Privacy**: Keep sensitive data within your own infrastructure when needed

FADE supports any OpenAI API-compatible model, whether hosted by OpenAI itself, Azure, or through local hosting like vLLM OpenAI servers. Additionally, it directly supports Ollama for straightforward local setups.

Let's explore how to configure different types of evaluator models:

In [None]:
# we load our subject model of choice
subject_model_path = "google/gemma-2-2b"
model = AutoModelForCausalLM.from_pretrained(subject_model_path)
tokenizer = AutoTokenizer.from_pretrained(subject_model_path)

In [None]:
# Setup configuration for different types of evaluation models
# As written, each assignment overwrites the previous one, so the last option (Azure) wins.
# Comment out all options except the one you want to use

# Option 1: OpenAI API (cloud-based)
eval_config = {
    'evaluationLLM': {
        'type': 'openai',
        'name': 'gpt-4o-mini-2024-07-18',
        'api_key': 'YOUR-KEY-HERE',
    }
}

# Option 2: Ollama (local)
eval_config = {
    'evaluationLLM': {
        'type': 'ollama',
        'name': 'llama3.2',  # Must be installed in your Ollama instance
        'base_url': 'http://127.0.0.1:11434', # Default Ollama port
    }
}

# Option 3: vLLM Server (self-hosted API)
eval_config = {
    'evaluationLLM': {
        'type': 'openai',
        'name': 'llama3.2',
        'api_key': 'optional_custom_key',
        'base_url': 'http://example_server.com:8000', # url of your server
    }
}

# Option 4: Azure OpenAI
eval_config = {
    'evaluationLLM': {
        'type': 'azure',
        'name': 'gpt-4o-mini-2024-07-18',
        'api_key': 'YOUR-KEY-HERE',
        'base_url': 'YOUR-AZURE-ENDPOINT-URL-HERE',
        'api_version': '2024-02-01',
    }
}

eval_pipeline = EvaluationPipeline(
    subject_model=model,  
    subject_tokenizer=tokenizer, 
    dataset=dataset,
    config=eval_config,
    device=device, 
    verbose=True,
    )

In [None]:
neuron_module = "model.layers[20].mlp.act_fn"
neuron_index = 42
concept = "The feature description you want to evaluate."

(clarity, responsiveness, purity, faithfulness) = eval_pipeline.run(
    neuron_module=neuron_module,
    neuron_index=neuron_index,
    concept=concept)
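
Instead of commenting options in and out, the four configurations from this section can be kept in one dictionary and selected by name. This is only a convenience sketch: the option labels and the helper `make_eval_config` are hypothetical, while the values are copied from the cell above.

```python
EVAL_LLM_OPTIONS = {
    "openai": {
        "type": "openai",
        "name": "gpt-4o-mini-2024-07-18",
        "api_key": "YOUR-KEY-HERE",
    },
    "ollama": {
        "type": "ollama",
        "name": "llama3.2",
        "base_url": "http://127.0.0.1:11434",
    },
    "vllm": {
        "type": "openai",  # vLLM is addressed through its OpenAI-compatible server
        "name": "llama3.2",
        "api_key": "optional_custom_key",
        "base_url": "http://example_server.com:8000",
    },
    "azure": {
        "type": "azure",
        "name": "gpt-4o-mini-2024-07-18",
        "api_key": "YOUR-KEY-HERE",
        "base_url": "YOUR-AZURE-ENDPOINT-URL-HERE",
        "api_version": "2024-02-01",
    },
}


def make_eval_config(option: str) -> dict:
    """Return a fresh config dictionary for the chosen evaluation model option."""
    if option not in EVAL_LLM_OPTIONS:
        raise KeyError(f"Unknown option {option!r}; choose from {sorted(EVAL_LLM_OPTIONS)}")
    return {"evaluationLLM": dict(EVAL_LLM_OPTIONS[option])}


eval_config = make_eval_config("ollama")
print(eval_config)
```

Selecting by name also makes it easy to loop over several evaluation models and compare their ratings for the same feature.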

# 5. Further Options

FADE offers extensive customization options to fit different evaluation needs. Here, we'll highlight the most useful configuration groups and explain when you might want to adjust them. For the full list of options, see the default_config.yaml file in the FADE repository.

To use any of these options, simply add them to the configuration dictionary you've seen in previous examples:

```python
# Example of a configuration with custom settings
eval_config = {
    'evaluationLLM': {
        'type': 'openai',
        'name': 'gpt-4o-mini-2024-07-18',
        'api_key': 'YOUR-KEY-HERE',
    },
    # Add custom experiment settings
    'experiments': {
        'gini_threshold': 0.6,  # Increase threshold for more selective faithfulness evaluation
        'clarity': {
            'llm_calls': 20,  # Generate more synthetic samples
        },
        'faithfulness': {
            'modification_factors': [-100, -10, -1, 0, 1, 10, 100],  # Test wider range of modifications
        }
    }
}

# Then use this config as before
eval_pipeline = EvaluationPipeline(
    subject_model=model,  
    subject_tokenizer=tokenizer, 
    dataset=dataset,
    config=eval_config,
    device=device, 
    verbose=True
)
```
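
Only the keys you set need to appear in the dictionary. The sketch below shows what the effective settings would look like if overrides are merged recursively onto the defaults. That merge behaviour is an assumption made for illustration, `deep_merge` is a hypothetical helper rather than FADE code, and the default values are an excerpt of the ones listed in this section.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` applied, recursing into nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Excerpt of the documented defaults
defaults = {
    "experiments": {
        "rating_batch_size": 15,
        "gini_threshold": 0.5,
        "clarity": {"llm_calls": 15},
    }
}
overrides = {
    "experiments": {
        "gini_threshold": 0.6,
        "clarity": {"llm_calls": 20},
    }
}

effective = deep_merge(defaults, overrides)
print(effective)
```

Under this assumption, `rating_batch_size` keeps its default while the two overridden values change.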

Configurations specific to the subject model that is being investigated:
```yaml
subjectLLM:
  sae_module: False # Whether the module is an SAE module
  batch_size: 4 # Batch size for the subject model
```


Configurations specific to the evaluation model that is being used to rate and generate samples:
```yaml
evaluationLLM:
  type: "type_of_evaluation_model" # type of evaluation model, e.g. openai, vllm, ollama, azure
  name: "modelname" # name of the explainer model
  api_key: "Secret Key" # API key if required
  base_url: "https://test.com/" # base url where to reach the model
  api_version: "2024-02-02" # API version of the model
```


Configurations specific to the evaluation itself. This is where the hyperparameters of the individual metrics are set:
```yaml
experiments:
  rating_batch_size: 15 # Number of samples to be rated in one call to the evaluation model
  gini_threshold: 0.5 # Threshold for confirming a concept neuron and starting the faithfulness calculation. If the score is below this threshold, the concept neuron is not considered to be present. Should be between 0 and 1.
  
  clarity:
    llm_calls: 15 # Number of parallel calls to the LLM given the prompt - determines the number of samples to be generated

  responsiveness_and_purity:
    num_samples: 500 # Number of samples to be rated in total - determines the number of calls to the explainer in combination with the explainer batch size
    max_failed_retries: 1 # The maximum number of retries when samples fail to generate ratings.
    max_sparse_retries: 1 # The maximum number of retries if not enough samples have non-zero ratings.
    retry_sparse_threshold: 15 # The number of non-zero ratings below which additional samples are generated.
    repeat_non_zeros: 0 # The number of times to repeat ratings for samples with non-zero ratings.
    rating_top_sampling_percentage: 0.1 # Percentage of samples taken from the top of the activations

  faithfulness:
    modification_factors: [-50, -10, -1, 0, 1, 10, 50] # Modification factors to be applied to the neuron
    num_samples: 50 # Number of samples to be generated for each modification factor - determines the number of calls to the explainer in combination with the explainer batch size
    generation_length: 30 # Token length of the generated samples
    max_failed_retries: 1 # The maximum number of retries when samples fail to generate ratings.
    max_sparse_retries: 0 # The maximum number of retries if not enough samples have non-zero ratings.
    retry_sparse_threshold: 20 # The number of non-zero ratings below which additional samples are generated.
    repeat_non_zeros: 0 # The number of times to repeat ratings for samples with non-zero ratings.
    rating_top_sampling_percentage: 0.1 # Percentage of samples taken from the top of the activations
```
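
A back-of-the-envelope sketch of how these defaults translate into work for the evaluation model. It assumes that each rating call handles exactly `rating_batch_size` samples and that no retries are triggered. FADE's actual batching may differ, and the helper name `rating_calls` is hypothetical.

```python
import math


def rating_calls(num_samples: int, rating_batch_size: int) -> int:
    """Number of rating calls needed if every call rates one full batch (assumption)."""
    return math.ceil(num_samples / rating_batch_size)


# Defaults copied from the configuration above
rating_batch_size = 15
responsiveness_calls = rating_calls(500, rating_batch_size)

modification_factors = [-50, -10, -1, 0, 1, 10, 50]
generated_samples = len(modification_factors) * 50  # 50 samples per modification factor

print(responsiveness_calls)  # 34
print(generated_samples)     # 350
```

Estimates like these help when budgeting API usage before evaluating many features.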

Configurations regarding the prompts:
```yaml
prompts:
  generation: "..." # The prompt used to generate the samples
  rating: "..." # The prompt used to rate the samples
```