# Studying Feature Hedging in single-latent SAEs

In matryoshka SAEs, we assume that if an SAE is too narrow to represent both parent and child features, it will represent just the parent with no interference from the children, effectively solving absorption. However, this turns out not to be true: narrow SAEs learn a mix of parent and child features.

We refer to this phenomenon as **feature hedging**. In feature hedging, an SAE that is too narrow to represent both a parent and child feature mixes part of the child representation into the parent latent to reduce MSE loss rather than representing the parent cleanly.

This notebook explores this behavior in the simplest possible case with 1 parent and 1 child feature, and a single latent SAE. Our goal will be to get the SAE to represent just the parent feature perfectly.

Hopefully, if we can solve this problem with a single-latent SAE, then we can use that to solve the Matryoshka SAE problem more generally.


## Toy setup

We define a toy model with 2 mutually orthogonal features in a 50d space. We will also give our toy model no bias term to make it easy to see if the SAE decoder bias is doing something strange.

In [None]:
%load_ext autoreload
%autoreload 2

from hedging_paper.toy_models.toy_model import ToyModel


DEFAULT_D_IN = 50
DEFAULT_D_SAE = 1
DEFAULT_NUM_FEATS = 2

toy_model = ToyModel(num_feats=DEFAULT_NUM_FEATS, hidden_dim=DEFAULT_D_IN)

## Controlling feature firing and co-occurrence

Below, we set up a function `get_training_batch()` which we can use to control how many ground-truth features we have and their firing probabilities and magnitudes. It also takes an optional `modify_firing_features` callback, which can be used to modify the firing features in a batch, for instance forcing a feature to fire or not fire depending on other firing features.

We'll use this to create a parent-child hierarchy between features 0 and 1, with feature 0 being the parent and feature 1 being the child. Feature 1 can only fire if feature 0 is already firing.

In [2]:
import torch
from functools import partial

from hedging_paper.toy_models.get_training_batch import get_training_batch


# Feature 0 fires with probability 0.25
# Feature 1 fires with probability 0.2 if feature 0 is firing
feat_probs = torch.tensor([0.25, 0.2])

# set up the parent-child firing relationship
def modify_feats(feats: torch.Tensor):
    feat_0_fires = feats[:, 0] == 1
    feats[~feat_0_fires, 1] = 0
    return feats

independent_generator = partial(get_training_batch, firing_probabilities=feat_probs)
parent_child_generator = partial(get_training_batch, firing_probabilities=feat_probs, modify_firing_features=modify_feats)

Let's check a sample batch of 30 samples. We see 2 features, and feature 1 is only active if feature 0 is active.

In [None]:
parent_child_generator(30)
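
To sanity-check the hierarchy on more than 30 samples, here is a minimal sketch of the same masking logic in plain Python. It assumes both features are first sampled as independent Bernoulli draws and the child is then zeroed wherever the parent is off, which is what `modify_feats` does. `sample_parent_child` is a hypothetical helper, not part of `hedging_paper`.

```python
import random


def sample_parent_child(n, p_parent=0.25, p_child_given_parent=0.2, seed=0):
    """Sample n binary (parent, child) pairs; the child is zeroed unless the parent fires."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        parent = int(rng.random() < p_parent)
        child = int(rng.random() < p_child_given_parent)
        if not parent:
            child = 0  # same masking step as modify_feats
        rows.append((parent, child))
    return rows


rows = sample_parent_child(100_000)
n_parent = sum(p for p, _ in rows)
n_both = sum(1 for p, c in rows if p and c)
n_child_alone = sum(1 for p, c in rows if c and not p)

print(f"P(parent)            ~ {n_parent / len(rows):.3f}")
print(f"P(child | parent)    ~ {n_both / n_parent:.3f}")
print(f"P(child)             ~ {n_both / len(rows):.3f}")
print(f"child without parent:  {n_child_alone}")
```

Under this assumption the child's marginal firing rate is 0.25 × 0.2 = 0.05, and it never fires without the parent.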

## SAE Training

We use SAELens to train our single-latent SAE. The following is a bunch of hacky boilerplate to make our toy setup work with SAELens. You can just run this.

## Training a single latent SAE

First, let's try a standard SAE on fully independent features.


In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.toy_models.initialization import init_sae_to_match_model

from hedging_paper.saes.base_sae import BaseSAE, BaseSAEConfig, BaseSAERunnerConfig

cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
indep_sae = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(indep_sae, toy_model, noise_level=0.25)


train_toy_sae(indep_sae, toy_model, independent_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    indep_sae,
    toy_model,
    "Independent features",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_independent_features.pdf",
    one_based_indexing=True,
)

That worked exactly as we would hope! The SAE perfectly recovers feature 0 with no interference from feature 1.

In [None]:
from hedging_paper.toy_models.plotting import plot_b_dec_feat_cos_sims_seaborn

plot_b_dec_feat_cos_sims_seaborn(
    indep_sae,
    toy_model,
    # "Independent features",
    height=1.5,
    width=4,
    show_values=True,
    one_based_indexing=True,
    save_path="plots/single_latent_independent_features_b_dec.pdf",
)
print(f"SAE b_dec magnitude: {indep_sae.b_dec.norm():.3f}")

Interestingly, the SAE decoder bias has learned feature 1 with a reduced magnitude exactly equal to the probability of feature 1 firing. This isn't desirable behavior, but isn't terrible. Ideally the decoder bias should match our toy model bias of 0.

This sort of hedging is incentivized by MSE loss: it is worth accepting a small error on the 80% of samples where feature 1 doesn't fire in order to reduce the much larger squared error on the 20% of samples where it does.
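
The arithmetic behind this can be checked with a small sketch. It assumes the single latent reconstructs feature 0 perfectly, so along the feature 1 direction the SAE can only output a constant `b` (the decoder bias component), while the true value is 1 with probability 0.2 and 0 otherwise. `expected_sq_error` is a hypothetical helper written for this illustration.

```python
def expected_sq_error(b, p_fire=0.2):
    """Expected squared error along feature 1 when the reconstruction is the constant b."""
    return (1 - p_fire) * (0 - b) ** 2 + p_fire * (1 - b) ** 2


# Brute-force search over constants in [0, 1].
candidates = [i / 1000 for i in range(1001)]
best_b = min(candidates, key=expected_sq_error)

print(f"best constant b: {best_b:.3f}")
print(f"loss at b = 0:   {expected_sq_error(0.0):.3f}")
print(f"loss at best b:  {expected_sq_error(best_b):.3f}")
```

The minimizer of a squared error against a constant is the mean of the target, which for a 0/1 feature is its firing probability. That is consistent with the bias magnitude matching the firing probability of feature 1.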

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig, BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model

cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
base_sae = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(base_sae, toy_model, noise_level=0.25)

train_toy_sae(base_sae, toy_model, parent_child_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    base_sae,
    toy_model,
    "Hierarchical features",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_hierarchical_features.pdf",
    one_based_indexing=True,
)

Sadly, we did not learn just feature 0 on its own as we hoped! Our single latent is mixing in part of feature 1 as well 😢.
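
One way to see why the latent tilts toward the child is an idealized calculation. The sketch assumes a single unit-norm latent with tied encoder and decoder directions, no bias, and no sparsity penalty, sitting at angle `theta` from the parent direction in the parent/child plane. The expected squared error is then `a * sin(theta)**2 + b * (1 - sin(2 * theta))`, where `a` is the probability of the parent firing alone and `b` the probability of both firing. Setting the derivative to zero gives `tan(2 * theta) = 2 * b / a`. The trained SAE has untied weights, a decoder bias, and an L1 term, so its numbers need not match this sketch exactly.

```python
import math


def expected_mse(theta, p_parent_only, p_both):
    """Expected squared error for a unit latent at angle theta from the parent direction."""
    parent_err = math.sin(theta) ** 2        # input is the parent alone
    both_err = 1.0 - math.sin(2.0 * theta)   # input is parent + child
    return p_parent_only * parent_err + p_both * both_err


# Probabilities implied by the notebook's settings (parent 0.25, child 0.2 given parent).
p_parent_only = 0.25 * (1 - 0.2)
p_both = 0.25 * 0.2

# Closed form from tan(2 * theta) = 2 * p_both / p_parent_only.
theta_closed = 0.5 * math.atan2(2.0 * p_both, p_parent_only)

# Numerical check on a grid over [0, pi / 2].
grid = [i * (math.pi / 2) / 10_000 for i in range(10_001)]
theta_grid = min(grid, key=lambda t: expected_mse(t, p_parent_only, p_both))

print(f"closed-form angle:  {math.degrees(theta_closed):.2f} degrees")
print(f"grid-search angle:  {math.degrees(theta_grid):.2f} degrees")
print(f"cosine with parent: {math.cos(theta_closed):.3f}")
print(f"cosine with child:  {math.sin(theta_closed):.3f}")
```

Whenever the child co-occurs with the parent at all (`p_both > 0`), the optimal angle in this idealization is strictly positive, so a pure-parent latent is never the MSE minimum.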


## Quick demo plot of feature hedging

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig, BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model

cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
hedge_demo_sae = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(hedge_demo_sae, toy_model, noise_level=0)

train_toy_sae(hedge_demo_sae, toy_model, parent_child_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims


plot_sae_feat_cos_sims_seaborn(
    hedge_demo_sae,
    toy_model,
    "Feature hedging",
    height=1.5,
    width=5,
    show_values=False,
    save_path="plots/feature_hedging_example.pdf",
    one_based_indexing=True,
)

plot_sae_feat_cos_sims(
    hedge_demo_sae,
    toy_model,
    "Feature hedging",
    height=200,
    show_values=False,
)

## This would cause absorption if the SAE had enough latents

Let's verify that if the SAE had 2 latents, we'd get feature absorption rather than hedging.

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig, BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model

cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=2,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
abs_sae = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(abs_sae, toy_model, noise_level=0.1)

train_toy_sae(abs_sae, toy_model, parent_child_generator, training_tokens=100_000_000)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    abs_sae,
    toy_model,
    "Feature absorption",
    height=2,
    width=5,
    show_values=False,
    save_path="plots/feature_absorption_example.pdf",
    one_based_indexing=True,
)
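
As a rough illustration of why a two-latent SAE may prefer absorption, the sketch below compares the sparsity penalty of two hand-written dictionaries on an input where parent and child both fire. It assumes an L1 penalty scaled by decoder norm and uses 2d stand-ins for the feature directions. The dictionaries and the helper names (`reconstruct`, `l1_penalty`) are hypothetical and are not read from the trained `abs_sae`.

```python
import math


def reconstruct(acts, decoder_rows):
    """Sum of activation * decoder row, in 2d."""
    return [sum(a * row[j] for a, row in zip(acts, decoder_rows)) for j in range(2)]


def l1_penalty(acts, decoder_rows):
    """L1 penalty with each activation scaled by its decoder row norm."""
    return sum(abs(a) * math.hypot(*row) for a, row in zip(acts, decoder_rows))


x = [1.0, 1.0]  # parent and child both firing

# Clean dictionary: one latent per feature, both active on this input.
clean_rows = [(1.0, 0.0), (0.0, 1.0)]
clean_acts = [1.0, 1.0]

# Absorbing dictionary: latent 0 is the parent but stays silent when the child fires;
# latent 1 points along parent + child and carries the whole input.
s = 1.0 / math.sqrt(2.0)
absorb_rows = [(1.0, 0.0), (s, s)]
absorb_acts = [0.0, math.sqrt(2.0)]

clean_penalty = l1_penalty(clean_acts, clean_rows)
absorb_penalty = l1_penalty(absorb_acts, absorb_rows)

print("clean reconstruction: ", reconstruct(clean_acts, clean_rows))
print("absorb reconstruction:", reconstruct(absorb_acts, absorb_rows))
print(f"clean L1 penalty:  {clean_penalty:.3f}")
print(f"absorb L1 penalty: {absorb_penalty:.3f}")
```

Both dictionaries reconstruct the input, but the absorbing one pays a smaller sparsity penalty, which is the usual explanation for absorption being driven by the sparsity term rather than by MSE.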

## What if the child and parent are only loosely correlated?

Above, the child can only fire if the parent is firing. What happens if we relax that constraint, so instead the child fires more with the parent than it fires on its own, but is not perfectly correlated?

Below, we set up feature 1 to fire with probability 0.2 when feature 0 fires, but with probability only 0.1 when feature 0 does not fire.

In [13]:
import torch
from functools import partial

from hedging_paper.toy_models.get_training_batch import get_training_batch

# Feature 0 fires with probability 0.25
# Feature 1 fires with probability 0.2 if feature 0 is firing
feat_probs = torch.tensor([0.25, 0.2])
# Feature 1 fires with probability 0.1 if feature 0 is not firing
solo_firing_prob = 0.1

# set up the parent-child firing relationship
def partial_cooccurence(feats: torch.Tensor):
    solo_firings = torch.bernoulli(
        torch.tensor(solo_firing_prob).unsqueeze(0).expand(feats.shape[0])
    )

    feat_0_fires = feats[:, 0] == 1
    feats[~feat_0_fires, 1] = solo_firings[~feat_0_fires]
    return feats

partial_cooccurrence_generator = partial(get_training_batch, firing_probabilities=feat_probs, modify_firing_features=partial_cooccurence)

Let's check a sample batch of 30 samples. We see 2 features, and feature 1 can now be active even when feature 0 is not.

In [None]:
partial_cooccurrence_generator(30)
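
To put a number on "loosely correlated", the sketch below computes the child's marginal firing rate and the Pearson (phi) correlation between the two binary features directly from the configured probabilities. `cooccurrence_stats` is a hypothetical helper written for this illustration. It assumes the child fires with one probability when the parent is on and another when it is off.

```python
import math


def cooccurrence_stats(p_parent, p_child_given_parent, p_child_alone):
    """Return (P(child), P(parent and child), phi correlation) for two binary features."""
    p_both = p_parent * p_child_given_parent
    p_child = p_both + (1 - p_parent) * p_child_alone
    cov = p_both - p_parent * p_child
    phi = cov / math.sqrt(p_parent * (1 - p_parent) * p_child * (1 - p_child))
    return p_child, p_both, phi


# Settings used in this section: child fires at 0.2 with the parent, 0.1 without it.
p_child, p_both, phi = cooccurrence_stats(0.25, 0.2, 0.1)
# Strict hierarchy from earlier in the notebook, for comparison.
_, _, phi_strict = cooccurrence_stats(0.25, 0.2, 0.0)

print(f"P(child)          = {p_child:.3f}")
print(f"P(parent & child) = {p_both:.3f}")
print(f"phi (loose)       = {phi:.3f}")
print(f"phi (strict)      = {phi_strict:.3f}")
```

The correlation stays positive but is much weaker than in the strict parent-child case, which makes this a useful test of whether hedging needs a true hierarchy.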

Next, let's train a single-latent SAE on these features.

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig
from hedging_paper.saes.base_sae import BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model


cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
partial_sae_low_l1 = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(partial_sae_low_l1, toy_model, noise_level=0.25)

train_toy_sae(partial_sae_low_l1, toy_model, partial_cooccurrence_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    partial_sae_low_l1,
    toy_model,
    "Correlated features, low L1 penalty",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_correlated_features_low_l1.pdf",
    one_based_indexing=True,
)

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig
from hedging_paper.saes.base_sae import BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model


cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=1e-1,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
partial_sae_high_l1 = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(partial_sae_high_l1, toy_model, noise_level=0.25)

train_toy_sae(partial_sae_high_l1, toy_model, partial_cooccurrence_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    partial_sae_high_l1,
    toy_model,
    "Correlated features, high L1 penalty",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_correlated_features_high_l1.pdf",
    one_based_indexing=True,
)

### What if we train a wide-enough SAE on these features?

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig
from hedging_paper.saes.base_sae import BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model


cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=2,
    l1_coefficient=1e-3,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
partial_sae_wide = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(partial_sae_wide, toy_model, noise_level=0.25)

train_toy_sae(partial_sae_wide, toy_model, partial_cooccurrence_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    partial_sae_wide,
    toy_model,
    "Correlated features, full-width SAE",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_correlated_features_full_width.pdf",
    one_based_indexing=True,
)

## What happens if the co-occurrence is flipped?

This is very bad, but what happens if the child feature fires more on its own than it co-occurs with the parent? Do we still see a messed up SAE latent?

In [21]:
import torch
from functools import partial
from hedging_paper.toy_models.get_training_batch import get_training_batch

# Feature 0 fires with probability 0.25
# Feature 1 fires with probability 0.1 if feature 0 is firing
feat_probs = torch.tensor([0.25, 0.1])
# Feature 1 fires with probability 0.2 if feature 0 is not firing
solo_firing_prob = 0.2

# set up the parent-child firing relationship
def partial_cooccurence(feats: torch.Tensor):
    solo_firings = torch.bernoulli(
        torch.tensor(solo_firing_prob).unsqueeze(0).expand(feats.shape[0])
    )

    feat_0_fires = feats[:, 0] == 1
    feats[~feat_0_fires, 1] = solo_firings[~feat_0_fires]
    return feats

inv_partial_cooccurrence_generator = partial(get_training_batch, firing_probabilities=feat_probs, modify_firing_features=partial_cooccurence)

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig
from hedging_paper.saes.base_sae import BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model


cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=DEFAULT_D_SAE,
    l1_coefficient=2e-2,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
partial_sae_inv = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(partial_sae_inv, toy_model, noise_level=0.25)

train_toy_sae(partial_sae_inv, toy_model, inv_partial_cooccurrence_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    partial_sae_inv,
    toy_model,
    "Standard SAE - anti-correlated features",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_anti_correlated_features.pdf",
    one_based_indexing=True,
)

The SAE latent now contains a negative component of the child feature, just due to partial co-occurrence! This is so very bad 😱

In [None]:

from hedging_paper.toy_models.plotting import plot_b_dec_feat_cos_sims

plot_b_dec_feat_cos_sims(partial_sae_inv, toy_model, "Standard SAE - anti-correlated features", show_values=True)
print(f"SAE b_dec magnitude: {partial_sae_inv.b_dec.norm():.3f}")

This works because the SAE decoder bias still encodes a smaller magnitude version of feature 1, so the SAE latent encoding a negative component of feature 1 reduces the magnitude of the hedging when the SAE latent fires. The SAE is abusing the knowledge that feature 0 firing means feature 1 is less likely to be firing to reduce the MSE loss penalty for having this always-active component of feature 1 in the SAE decoder bias.
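
A worked example of this argument, under strong simplifying assumptions: along the feature 1 direction, suppose the SAE output is `bias + weight * latent`, where the latent is exactly 1 when feature 0 fires and 0 otherwise, and the L1 term is ignored. The MSE-optimal output in each state is then the conditional firing rate of feature 1. `child_direction_fit` and `expected_sq_error` are hypothetical helpers for this sketch, not quantities read from the trained SAE.

```python
def child_direction_fit(p_child_given_parent, p_child_alone):
    """MSE-optimal (bias, latent weight) along the child direction in the idealized model."""
    bias = p_child_alone                          # output when the latent is off
    latent_weight = p_child_given_parent - bias   # extra output when the latent is on
    return bias, latent_weight


def expected_sq_error(bias, weight, p_parent, p_child_given_parent, p_child_alone):
    """Expected squared error along the child direction for a given bias and weight."""
    on = (p_child_given_parent * (1 - bias - weight) ** 2
          + (1 - p_child_given_parent) * (bias + weight) ** 2)
    off = p_child_alone * (1 - bias) ** 2 + (1 - p_child_alone) * bias ** 2
    return p_parent * on + (1 - p_parent) * off


# Flipped co-occurrence: child fires at 0.1 with the parent and 0.2 without it.
bias, latent_weight = child_direction_fit(0.1, 0.2)
loss_fit = expected_sq_error(bias, latent_weight, 0.25, 0.1, 0.2)
loss_no_latent_term = expected_sq_error(bias, 0.0, 0.25, 0.1, 0.2)

print(f"bias component along child:   {bias:+.2f}")
print(f"latent component along child: {latent_weight:+.2f}")
print(f"loss with negative component: {loss_fit:.4f}")
print(f"loss with zero component:     {loss_no_latent_term:.4f}")
```

In this idealization the sign of the latent's child component is simply the sign of `P(child | parent) - P(child | no parent)`, so it comes out positive for the loosely correlated setting and negative for the flipped one.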

### What if the SAE is wide enough to represent all the features?

In [None]:
from hedging_paper.toy_models.train_toy_sae import train_toy_sae
from hedging_paper.saes.base_sae import BaseSAE, BaseSAERunnerConfig
from hedging_paper.saes.base_sae import BaseSAEConfig
from hedging_paper.toy_models.initialization import init_sae_to_match_model


cfg = BaseSAERunnerConfig(
    context_size=500,
    d_in=toy_model.embed.weight.shape[0],
    d_sae=2,
    l1_coefficient=2e-2,
    normalize_sae_decoder=False,
    scale_sparsity_penalty_by_decoder_norm=True,
    init_encoder_as_decoder_transpose=True,
    apply_b_dec_to_input=True,
    b_dec_init_method="zeros",
)
partial_sae_inv_wide = BaseSAE(BaseSAEConfig.from_sae_runner_config(cfg))
init_sae_to_match_model(partial_sae_inv_wide, toy_model, noise_level=0.25)

train_toy_sae(partial_sae_inv_wide, toy_model, inv_partial_cooccurrence_generator)

In [None]:
from hedging_paper.toy_models.plotting import plot_sae_feat_cos_sims_seaborn

plot_sae_feat_cos_sims_seaborn(
    partial_sae_inv_wide,
    toy_model,
    "Full-width SAE - anti-correlated features",
    height=1.5,
    width=5,
    show_values=True,
    save_path="plots/single_latent_anti_correlated_features_wide.pdf",
    one_based_indexing=True,
)

## Plotting loss curves

Let's plot the loss as a function of the angle of a vector between the two features. We'll normalize the x-axis to be between 0 and 1, where 0 is the parent feature direction and 1 is the child feature direction. 0.5 is the absorption solution, mixing parent and child together.

Below we set up the plotting and calculation helpers for these loss curves. Feel free to just run this.


In [27]:
from typing import Callable

import torch

@torch.no_grad()
def calc_loss_curve(
    loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    parent_only_prob: float,
    parent_and_child_prob: float,
    steps: int = 100,
    sparsity_coefficient: float = 0.0,
    sparsity_p: float = 1.0,
    rand_vecs: bool = False,
) -> list[float]:
    if rand_vecs:
        parent_vec = torch.randn(10).float()
        child_vec = torch.randn(10).float()
        child_vec = child_vec - (child_vec @ parent_vec) * parent_vec / parent_vec.norm()**2
        child_vec = child_vec / child_vec.norm()
        parent_vec = parent_vec / parent_vec.norm()
        parent_and_child_vec = parent_vec + child_vec
    else:
        parent_vec = torch.tensor([1, 0]).float()
        child_vec = torch.tensor([0, 1]).float()
        parent_and_child_vec = parent_vec + child_vec

    losses = []
    for i in range(steps):
        portion = i / (steps - 1)
        test_latent = parent_vec * (1 - portion) + child_vec * portion
        test_latent = test_latent / test_latent.norm()

        encode = lambda input_act: (input_act @ test_latent).relu()
        decode = lambda hidden_act: hidden_act * test_latent

        def calc_loss(input_act: torch.Tensor) -> torch.Tensor:
            hidden_act = encode(input_act)
            recons = decode(hidden_act)
            return loss(recons, input_act) + sparsity_coefficient * hidden_act.norm(
                p=sparsity_p
            )

        parent_loss = calc_loss(parent_vec)
        parent_and_child_loss = calc_loss(parent_and_child_vec)
        expected_loss = (
            parent_loss * parent_only_prob
            + parent_and_child_loss * parent_and_child_prob
        ).item()
        losses.append(expected_loss)
    return losses

## Different parent vs child firing probabilities result in different loss landscapes

First, let's see what happens if the parent fires on its own 30% of the time and fires along with the child 10% of the time.

In [None]:
from hedging_paper.loss_curves import plot_loss_curve_seaborn

plot_loss_curve_seaborn(
    parent_only_prob=0.3,
    parent_and_child_prob=0.1,
    sparsity_coefficient=[0.0, 0.1],
    sparsity_p=1.0,
    subtitle=r"skew parent (p($f_1+f_2$) < p($f_1$))",
    rand_vecs=True,
    width=3,
    height=2,
    save_path="plots/single_latent_loss_curves_skew_parent.pdf",
)

In [None]:
from hedging_paper.loss_curves import plot_loss_curve_seaborn

plot_loss_curve_seaborn(
    parent_only_prob=0.1,
    parent_and_child_prob=0.3,
    sparsity_coefficient=[0.0, 0.1],
    sparsity_p=1.0,
    subtitle=r"skew child (p($f_1+f_2$) > p($f_1$))",
    rand_vecs=True,
    width=3,
    height=2,
    save_path="plots/single_latent_loss_curves_skew_child.pdf",
)
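
To compare where the two landscapes bottom out without reading it off the plots, here is a minimal plain-Python sketch of the same sweep. It mirrors the interpolation in `calc_loss_curve` (a unit latent between parent and child, tied encoder and decoder), and assumes the reconstruction loss is a summed squared error. The loss actually passed in by `plot_loss_curve_seaborn` is not shown in this notebook, so that choice is an assumption. `loss_at_portion` and `best_portion` are hypothetical helpers.

```python
import math


def loss_at_portion(t, p_parent_only, p_both, l1=0.0):
    """Expected loss for a unit latent interpolated between parent (t=0) and child (t=1)."""
    dx, dy = 1.0 - t, t
    norm = math.hypot(dx, dy)
    dx, dy = dx / norm, dy / norm
    total = 0.0
    for (x, y), prob in (((1.0, 0.0), p_parent_only), ((1.0, 1.0), p_both)):
        act = max(0.0, x * dx + y * dy)                          # ReLU encoder
        err = (x - act * dx) ** 2 + (y - act * dy) ** 2          # summed squared error
        total += prob * (err + l1 * act)                         # optional L1 on the activation
    return total


def best_portion(p_parent_only, p_both, l1=0.0, steps=1001):
    """Grid-search the portion that minimizes the expected loss."""
    grid = [i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: loss_at_portion(t, p_parent_only, p_both, l1))


skew_parent = best_portion(p_parent_only=0.3, p_both=0.1)
skew_child = best_portion(p_parent_only=0.1, p_both=0.3)

print(f"skew parent, no L1: best portion = {skew_parent:.3f}")
print(f"skew child,  no L1: best portion = {skew_child:.3f}")
print(f"skew parent, L1=0.1: best portion = {best_portion(0.3, 0.1, l1=0.1):.3f}")
print(f"skew child,  L1=0.1: best portion = {best_portion(0.1, 0.3, l1=0.1):.3f}")
```

Without the sparsity term, both minima sit strictly between the pure-parent direction and the equal mix at 0.5, and the minimum moves closer to 0.5 as the child co-occurs with the parent more often.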