# Using SAE Features to Generate Data

SPoSE features are nice, but there are too few of them, and they are already derived from human data. An alternative approach is to take a pretrained SAE and use its individual latent neurons to generate classification problems. This approach not only gives you many more functions, but the functions also do not directly encode human behaviour.

Here, I look at some of the properties of such functions. In this case, I am using a top-K SAE trained on the final residual stream of the vision encoder of an OpenCLIP model. The SAE was trained by the ViT-Prisma project, and I downloaded it from [here](https://huggingface.co/Prisma-Multimodal/sae-top_k-64-cls_only-layer_11-hook_resid_post).
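
As a rough illustration of why the latents of a top-K SAE are mostly exactly zero, here is a minimal sketch of a generic top-K activation. It is not the ViT-Prisma implementation: the function name `top_k_activation` and the toy sizes are made up for this example, and `k=64` is only inferred from the repository name.

```python
import numpy as np


def top_k_activation(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations in each row and zero out the rest."""
    # column indices of the k largest entries per row (unordered within the k)
    top_idx = np.argpartition(pre_acts, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre_acts)
    rows = np.arange(pre_acts.shape[0])[:, None]
    out[rows, top_idx] = pre_acts[rows, top_idx]
    return out


rng = np.random.default_rng(0)
toy_pre_acts = rng.normal(size=(5, 1000))  # 5 toy "images", 1000 toy latents
toy_latents = top_k_activation(toy_pre_acts, k=64)
# every row has at most k non-zero latents
nonzero_per_row = np.count_nonzero(toy_latents, axis=1)
```

Because each image activates at most `k` latents, many latents are zero for most images, which is what makes the zero-valued negatives used later possible.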


In [None]:
from os import chdir

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

from metarep.data import h5_to_numpy, prepare_things_spose

num_positive = 50
np.random.seed(1234)

In [None]:
chdir("..")

In [None]:
things_targets = h5_to_numpy(min_nonzero=0)

We only want to keep latents that have at least 100 non-zero values. Technically 50 would suffice, since the default setup needs 50 positive samples, but a higher threshold allows more diversity of stimuli within a given function.

In [None]:
# give the column IDs that have at least 100 non-zero values
nonzero_columns = np.where(np.count_nonzero(things_targets, axis=0) >= 100)[0]
# filter the targets to only include these columns
things_targets = things_targets[:, nonzero_columns]

I just want to see how difficult it is to solve these tasks using a linear classifier, compared to solving the SPoSE dimensions using the same approach.

In [None]:
X = np.load("data/backbone_reps/dinov2_vitb14_reg.npz")
X = np.hstack([x[:] for x in X.values()])

So, I train a linear classifier from the DINOv2 embeddings to every single dimension (SAE latent or SPoSE dimension) separately, using cross-validation.

We assign the top `num_positive` instances of a given dimension to the positive label, and we randomly sample `num_positive` entries for the negative class from the images that have a value of 0 for this dimension. For the SPoSE classifications, no features have 0 entries, so we choose the bottom `num_positive` instead.
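
The labelling rule can be sketched on toy data as follows. The helper `make_binary_task` is hypothetical and not part of `metarep`. Falling back to the bottom rows whenever there are too few zeros is an assumption made here to cover both cases in one function, whereas the notebook handles the SAE and SPoSE cases in separate loops.

```python
import numpy as np


def make_binary_task(activations, num_positive, rng):
    """Return (row_ids, labels) for the classification task of one dimension."""
    # positives: rows with the largest activations
    top_ids = np.argsort(activations)[-num_positive:]
    zero_ids = np.flatnonzero(activations == 0)
    if zero_ids.size >= num_positive:
        # sparse case (SAE latents): random rows where the latent is inactive
        negative_ids = rng.choice(zero_ids, num_positive, replace=False)
    else:
        # dense case (SPoSE-like): rows with the smallest values
        negative_ids = np.argsort(activations)[:num_positive]
    row_ids = np.concatenate([top_ids, negative_ids])
    labels = np.concatenate([np.ones(num_positive), np.zeros(num_positive)])
    return row_ids, labels


rng = np.random.default_rng(0)

# sparse toy latent: 6 active rows, 14 inactive rows
toy_sparse = np.zeros(20)
toy_sparse[:6] = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]
sparse_ids, sparse_labels = make_binary_task(toy_sparse, num_positive=3, rng=rng)

# dense toy dimension without any zeros
toy_dense = np.arange(1, 11, dtype=float)
dense_ids, dense_labels = make_binary_task(toy_dense, num_positive=3, rng=rng)
```

In both cases the first `num_positive` row ids are the positives and the remaining ones are the negatives, so the two classes are always balanced.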


First, the SAE.

In [None]:
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=1234, n_jobs=-1)
)

data = dict(og_id=[], accuracy=[], filtered_id=[])
# col indexes the filtered array, og_id is the latent's index in the original SAE
for col, og_id in enumerate(tqdm(nonzero_columns)):
    # get row ids of the top num_positive activations for this column
    top_ids = np.argsort(things_targets[:, col])[-num_positive:]
    # get row ids of random num_positive activations that are 0 for this column
    random_ids = np.random.choice(np.where(things_targets[:, col] == 0)[0], num_positive, replace=False)

    X_sub = X[np.concatenate([top_ids, random_ids])]
    y_sub = np.concatenate([np.ones(num_positive), np.zeros(num_positive)])

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
    cross_val_scores = cross_val_score(pipeline, X_sub, y_sub, cv=skf, scoring="accuracy", n_jobs=-1)
    accuracy = np.mean(cross_val_scores)
    data["og_id"].append(og_id)
    data["accuracy"].append(accuracy)
    data["filtered_id"].append(col)

sae_df = pd.DataFrame(data)
sae_df["features"] = "sae"

Then, SPoSE.


In [None]:
X, Y = prepare_things_spose(np.load("data/backbone_reps/dinov2_vitb14_reg.npz"), return_tensors="np")
# do the same as above but with spose targets
spose_data = dict(og_id=[], accuracy=[], filtered_id=[])
for og_id in tqdm(range(Y.shape[1])):
    # get row ids of the top num_positive activations for this column
    top_ids = np.argsort(Y[:, og_id])[-num_positive:]
    # get row ids of the bottom num_positive activations
    bottom_ids = np.argsort(Y[:, og_id])[:num_positive]

    X_sub = X[np.concatenate([top_ids, bottom_ids])]
    y_sub = np.concatenate([np.ones(num_positive), np.zeros(num_positive)])

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1234)
    cross_val_scores = cross_val_score(pipeline, X_sub, y_sub, cv=skf, scoring="accuracy", n_jobs=-1)
    accuracy = np.mean(cross_val_scores)
    spose_data["og_id"].append(og_id)
    spose_data["accuracy"].append(accuracy)
    spose_data["filtered_id"].append(og_id)

In [None]:
spose_df = pd.DataFrame(spose_data)
spose_df["features"] = "spose"
df = pd.concat([sae_df, spose_df], ignore_index=True)

sns.kdeplot(df, x="accuracy", fill=True, hue="features", common_norm=False, alpha=0.5)
plt.show()

The distributions of cross-validated accuracies seem similar, with a few exceptions:

- A larger part of the SAE functions are trivially solvable compared to SPoSE.
- There are more "medium difficulty" functions in SPoSE compared to the SAE functions.
- There are more "unsolvable" functions coming from the SAE, compared to those coming from SPoSE.
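
One way to put numbers on these three observations is to bin the accuracies and compare the share of functions per bin. The sketch below uses made-up accuracies and hypothetical cut-offs (0.6 and 0.95), so the printed shares say nothing about the real results. With the real data, `toy_df` would be replaced by `df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# made-up accuracies, only to make the example self-contained
toy_df = pd.DataFrame(
    {
        "accuracy": np.concatenate(
            [rng.uniform(0.4, 1.0, 200), rng.uniform(0.4, 1.0, 50)]
        ),
        "features": ["sae"] * 200 + ["spose"] * 50,
    }
)

# hypothetical cut-offs for the three difficulty levels
bins = [0.0, 0.6, 0.95, 1.0]
names = ["unsolvable", "medium", "trivial"]
toy_df["difficulty"] = pd.cut(
    toy_df["accuracy"], bins=bins, labels=names, include_lowest=True
)

# share of functions per difficulty level, normalised within each feature type
shares = pd.crosstab(toy_df["features"], toy_df["difficulty"], normalize="index")
print(shares.round(2))
```

Normalising within each feature type mirrors `common_norm=False` in the density plot, so the much larger number of SAE functions does not dominate the comparison.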


## Which categories are actually hard?

Since SPoSE dimensions have labels, we can look at the accuracy of each dimension alongside its semantic label.


In [None]:
# read the semantic labels of the SPoSE dimensions, one per line
with open("data/external/labels.txt") as f:
    labels = f.read().splitlines()

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(spose_df, x="og_id", y="accuracy", ax=ax)
# replace the numeric x tick labels with the semantic labels, rotated by 90 degrees
ax.set_xticklabels(labels, rotation=90)
ax.set_xlabel("")
ax.set_ylabel("Accuracy")
# start the y axis at chance level (the classes are balanced)
ax.set_ylim(0.5, 1.0)
plt.show()

It's mostly the colour dimensions that seem tricky.