
CE-Bench: A Contrastive Evaluation Benchmark of LLM Interpretability with Sparse Autoencoders

Alex Gulko*, Yusen Peng*, Sachin Kumar

News

  • [November 4, 2025] Paper published in the ACL Anthology!
  • [October 27, 2025] Poster for EMNLP is ready!
  • [September 18, 2025] Paper accepted at the BlackboxNLP Workshop @ EMNLP 2025!
  • [August 31, 2025] Paper available on arXiv!
  • [May 15, 2025] Contrastive story dataset publicly available on HuggingFace!

Abstract

Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are made publicly available.

CE-Bench


Pipeline of constructing the interpretability metric in CE-Bench. Two contrastive stories about the same subject are passed through a frozen LLM and a pretrained sparse autoencoder (SAE) to extract neuron activations. A contrastive score is computed as the max absolute difference between the stories’ average activations (V1, V2), while an independence score measures deviation from the dataset-wide activation mean (Iavg). These scores, along with SAE sparsity, are used to derive an interpretability score for an LLM-free evaluation of interpretability of sparse autoencoders.
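The scoring steps in the caption can be sketched as follows. This is an illustrative reconstruction, not the official implementation: the function name `neuron_scores`, the per-neuron averaging, and the use of the pair's mid-point activation in the independence term are assumptions based on the description above.

```python
import numpy as np

def neuron_scores(acts_story1, acts_story2, dataset_mean):
    """Illustrative sketch of CE-Bench's neuron-wise scores (not the official code).

    acts_story1 / acts_story2: (tokens, neurons) SAE activations for the two
    contrastive stories; dataset_mean: (neurons,) dataset-wide mean activation.
    """
    v1 = acts_story1.mean(axis=0)   # average activation per neuron, story 1 (V1)
    v2 = acts_story2.mean(axis=0)   # average activation per neuron, story 2 (V2)
    contrastive = np.abs(v1 - v2)   # high when a neuron separates the two stories
    # Independence: deviation of the pair's activation from the dataset-wide mean
    independence = np.abs((v1 + v2) / 2.0 - dataset_mean)
    return contrastive, independence
```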

Alignment Evaluation


Comparison of Interpretability Score Derivation Methods. C stands for the contrastive score; I for the independence score; S for sparsity. The baseline achieves 70.12% ranking agreement with SAEBench, and the sparsity-aware method raises this to 77.30% with proper tuning of the hyperparameter α.
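One plausible way to combine the three quantities C, I, and S with a tunable α is sketched below. The exact functional form is not given in this README, so the geometric-mean blend and the name `interpretability_score` are assumptions for illustration only.

```python
import numpy as np

def interpretability_score(contrastive, independence, sparsity, alpha=0.5):
    """Hypothetical sparsity-aware combination of C, I, and S.

    The paper tunes a hyperparameter alpha; this particular blend is only a
    sketch of how sparsity might reweight the baseline C/I combination.
    """
    base = np.sqrt(contrastive * independence)  # baseline: combine C and I
    return base * (sparsity ** alpha)           # sparsity weighting, tuned via alpha
```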

Sample Score Visualization


Sample Visualization of Neuron-wise Scores for the Subject “Computer.” The left scatter plot shows each neuron’s contrastive and independence scores, with top-right points indicating neurons that are both highly contrastive and independent. The center and right histograms reveal that most neurons have low scores, suggesting that only a small subset of features are semantically relevant for the given subject.
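Picking out the "top-right" neurons from such a scatter plot amounts to a joint threshold on both scores. The quantile-based helper below is hypothetical; the actual threshold choice is not specified in the text.

```python
import numpy as np

def top_relevant_neurons(contrastive, independence, quantile=0.95):
    """Return indices of neurons that are both highly contrastive and independent.

    Illustrative helper: keeps neurons whose scores both exceed the given
    per-score quantile (the 'top-right' region of the scatter plot).
    """
    c_thr = np.quantile(contrastive, quantile)
    i_thr = np.quantile(independence, quantile)
    return np.where((contrastive >= c_thr) & (independence >= i_thr))[0]
```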

Citation

Please cite our work if you find it helpful for your research:

  • arXiv version:
@misc{gulko2025cebenchreliablecontrastiveevaluation,
      title={CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders}, 
      author={Alex Gulko and Yusen Peng and Sachin Kumar},
      year={2025},
      eprint={2509.00691},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.00691}, 
}
  • BlackboxNLP Workshop version:
@inproceedings{gulko-etal-2025-ce,
    title = "{CE}-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders",
    author = "Gulko, Alex  and
      Peng, Yusen  and
      Kumar, Sachin",
    editor = "Belinkov, Yonatan  and
      Mueller, Aaron  and
      Kim, Najoung  and
      Mohebbi, Hosein  and
      Chen, Hanjie  and
      Arad, Dana  and
      Sarti, Gabriele",
    booktitle = "Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.blackboxnlp-1.1/",
    pages = "1--15",
    ISBN = "979-8-89176-346-3",
    abstract = "Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70{\%} Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available."
}

Contacts

If you have any questions or suggestions, feel free to contact the authors, or open an issue on the repository.
