Alex Gulko*, Yusen Peng*, Sachin Kumar
- [November 4, 2025] our paper is published in the ACL Anthology!
- [October 27, 2025] our poster for EMNLP is ready!
- [September 18, 2025] our paper is accepted at the BlackboxNLP Workshop @ EMNLP 2025!
- [August 31, 2025] our paper is available on arXiv!
- [May 15, 2025] our contrastive story dataset is publicly available on HuggingFace!
Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70% Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are made publicly available.
Pipeline for constructing the interpretability metric in CE-Bench. Two contrastive stories about the same subject are passed through a frozen LLM and a pretrained sparse autoencoder (SAE) to extract neuron activations. A contrastive score is computed as the max absolute difference between the stories' average activations (V1, V2), while an independence score measures deviation from the dataset-wide activation mean (Iavg). These scores, along with SAE sparsity, are combined into an interpretability score, yielding an LLM-free evaluation of SAE interpretability.
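The scoring step above can be sketched as follows. This is a minimal illustration based only on the caption: `neuron_scores` and its arguments are hypothetical names, and the exact normalization and aggregation used in the paper may differ.

```python
import numpy as np

def neuron_scores(act1, act2, dataset_mean):
    """Per-neuron contrastive and independence scores (illustrative sketch).

    act1, act2  : (tokens, neurons) SAE activations for the two stories.
    dataset_mean: (neurons,) mean SAE activation over the dataset (Iavg).
    """
    v1 = act1.mean(axis=0)         # V1: story-1 average activation per neuron
    v2 = act2.mean(axis=0)         # V2: story-2 average activation per neuron
    contrastive = np.abs(v1 - v2)  # large when a neuron separates the pair
    # independence: how far the pair's mean activation deviates from Iavg
    independence = np.abs((v1 + v2) / 2 - dataset_mean)
    return contrastive, independence

# Subject-level contrastive score: max absolute difference across neurons,
# as described in the pipeline caption.
act1 = np.array([[1.0, 0.0], [3.0, 0.0]])
act2 = np.array([[0.0, 2.0], [0.0, 2.0]])
c, i = neuron_scores(act1, act2, dataset_mean=np.array([1.0, 1.0]))
subject_score = c.max()
```

Keeping the scores per-neuron before aggregating is what makes the neuron-wise visualizations below possible.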
Comparison of Interpretability Score Derivation Methods. C stands for contrastive score; I stands for independence score; S stands for sparsity. The baseline achieves 70.12% ranking agreement with SAEBench, and the sparsity-aware method raises it to 77.30% with proper tuning of the hyperparameter α.
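One plausible reading of the sparsity-aware method is that the sparsity term S modulates a C·I baseline with strength α. The function name, the C·I baseline form, and the power-law modulation are all assumptions for illustration; the paper's exact formula and tuned α may differ.

```python
def interpretability_score(c, i, s, alpha=0.5):
    """Illustrative sparsity-aware score (assumed form, not the paper's formula).

    c: contrastive score, i: independence score, s: sparsity, all in [0, 1].
    alpha controls how strongly sparsity modulates the C*I baseline.
    """
    baseline = c * i                 # baseline method: contrastive x independence
    return baseline * (s ** alpha)  # sparsity-aware adjustment

# With s = 1.0 the sparsity term is neutral and the baseline is recovered.
score = interpretability_score(0.8, 0.5, 1.0, alpha=0.3)
```

Under this form, α = 0 recovers the baseline exactly, which is why tuning α can only match or improve the baseline's ranking agreement on held-out comparisons.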
Sample Visualization of Neuron-wise Scores for the Subject “Computer.” The left scatter plot shows each neuron’s contrastive and independence scores, with top-right points indicating neurons that are both highly contrastive and independent. The center and right histograms reveal that most neurons have low scores, suggesting that only a small subset of features are semantically relevant for the given subject.
Please cite our work if you find it helpful for your research:
- arXiv version:
@misc{gulko2025cebenchreliablecontrastiveevaluation,
title={CE-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders},
author={Alex Gulko and Yusen Peng and Sachin Kumar},
year={2025},
eprint={2509.00691},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.00691},
}
- BlackboxNLP Workshop version:
@inproceedings{gulko-etal-2025-ce,
title = "{CE}-Bench: Towards a Reliable Contrastive Evaluation Benchmark of Interpretability of Sparse Autoencoders",
author = "Gulko, Alex and
Peng, Yusen and
Kumar, Sachin",
editor = "Belinkov, Yonatan and
Mueller, Aaron and
Kim, Najoung and
Mohebbi, Hosein and
Chen, Hanjie and
Arad, Dana and
Sarti, Gabriele",
booktitle = "Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.blackboxnlp-1.1/",
pages = "1--15",
ISBN = "979-8-89176-346-3",
abstract = "Sparse autoencoders (SAEs) are a promising approach for uncovering interpretable features in large language models (LLMs). While several automated evaluation methods exist for SAEs, most rely on external LLMs. In this work, we introduce CE-Bench, a novel and lightweight contrastive evaluation benchmark for sparse autoencoders, built on a curated dataset of contrastive story pairs. We conduct comprehensive evaluation studies to validate the effectiveness of our approach. Our results show that CE-Bench reliably measures the interpretability of sparse autoencoders and aligns well with existing benchmarks without requiring an external LLM judge, achieving over 70{\%} Spearman correlation with results in SAEBench. The official implementation and evaluation dataset are open-sourced and publicly available."
}
If you have any questions or suggestions, feel free to contact:
- Alex Gulko (gulko.5@osu.edu)
- Yusen Peng (peng.1007@osu.edu)
- Sachin Kumar (kumar.1145@osu.edu)
Or open an issue in this repository.