
AudioSAE

Sparse Autoencoders trained on the encoder layers of Whisper and HuBERT, from the paper AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders (EACL 2026).

This repository contains inference and basic interpretability code. It lets you load a pretrained SAE, run it on any layer of Whisper or HuBERT, inspect the sparse feature activations, and reproduce the paper's core feature-analysis tooling. Training, feature-steering and probing utilities used in the paper are intentionally not included here.

Weights

All SAE checkpoints live on the HuggingFace Hub under the Egorgij21/Audio-SAE prefix, with one repository per backbone (the Quickstart below uses Egorgij21/Audio-SAE-hubert-base).

Backbone                                                Dict size  Layers  k
HuBERT-base (facebook/hubert-base-ls960)                     6144    1–12  50
HuBERT-large (facebook/hubert-large-ll60k)                   8192    1–24  50
Whisper-small (openai/whisper-small)                         6144    1–12  50
Whisper-large-v3 (openai/whisper-large-v3)                  10240    1–32  50
Whisper-large-v3-turbo (openai/whisper-large-v3-turbo)      10240    1–32  50

Each checkpoint is a BatchTop-K SAE with an 8× expansion factor (dictionary size = 8 × the backbone's hidden dimension, e.g. 8 × 768 = 6144 for HuBERT-base), trained on ~2.8k hours of mixed speech, music, and environmental audio.
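Checkpoints are stored one file per layer. A sketch of fetching the SAEs for every HuBERT-base layer — note that the layer_{i}/ae.pt filename pattern is generalized from the single layer_3/ae.pt example in the Quickstart below, an assumption rather than a documented layout:

from huggingface_hub import hf_hub_download

# Download one SAE checkpoint per encoder layer (layers 1–12 per the table
# above). The per-layer filename pattern is assumed, not documented.
ckpts = {
    layer: hf_hub_download(
        repo_id="Egorgij21/Audio-SAE-hubert-base",
        filename=f"layer_{layer}/ae.pt",
    )
    for layer in range(1, 13)
}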

Installation

git clone https://github.com/audiosae/audiosae_demo.git
cd audiosae_demo
pip install -r requirements.txt

Quickstart

import torch
from audio_sae import BatchTopKSAE
from audio_sae.models import MyHubert
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Load a HuBERT-base encoder that exposes layer 3 activations
hubert = MyHubert("facebook/hubert-base-ls960", sae_after_layer=3).to(device).eval()

# 2. Download + load the matching SAE from the Hub
ckpt = hf_hub_download(
    repo_id="Egorgij21/Audio-SAE-hubert-base",
    filename="layer_3/ae.pt",
)
sae = BatchTopKSAE.from_pretrained(ckpt, device=device)

# 3. Run on an audio file
import librosa
wav, _ = librosa.load("example.wav", sr=16000, mono=True)
wav = torch.from_numpy(wav).unsqueeze(0).to(device)

with torch.no_grad():
    acts = hubert(wav)                                  # (1, T, d)
    features = sae.encode(acts, use_threshold=True)     # (1, T, dict_size) — sparse

features[b, t] is a sparse vector with roughly k non-zero entries per frame; the exact count is controlled by the SAE's learned activation threshold and varies by clip and layer. For an exact-k sparsification, flatten to (B*T, d) and call sae.encode(x, use_threshold=False, is_eval=False). Read the non-zero indices to see which SAE features fired, and use their magnitudes to rank them.
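A minimal sketch of that last step, using plain torch on the features tensor from the Quickstart (nothing here is AudioSAE-specific):

import torch

frame = features[0, 0]                    # (dict_size,) — first clip, first frame
fired = frame.nonzero(as_tuple=True)[0]   # indices of features that fired
ranked = fired[frame[fired].argsort(descending=True)]
for idx in ranked[:10]:                   # ten strongest features on this frame
    print(f"feature {int(idx)}: activation {frame[idx].item():.3f}")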

See examples/inference.ipynb for a runnable walkthrough including Whisper and a simple top-feature visualisation.

Interpretability

audio_sae.interp ports the paper's feature-analysis primitives:

from audio_sae import (
    reconstruction_metrics,           # L0, L2, normalized_l2, frac_var_explained
    aggregate_max_activations,        # per-clip max activation table
    top_clips_for_feature,            # rank clips by a feature's activation
    dead_features,                    # features that never fired
    feature_activation_windows,       # mel-spec windows around a feature's activations
    collect_dataset_stats,            # per-dataset frame/audio fire stats
    classify_features,                # speech / sounds / music labels from those stats
    plot_feature_tsne,                # encoder-weight t-SNE coloured by label
)
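For orientation, the reconstruction metrics are the standard sparse-autoencoder quantities. A minimal standalone computation of two of them — illustrative only, not the repository's implementation:

import torch

def l0(features):
    # mean number of non-zero SAE features per frame
    return (features != 0).float().sum(dim=-1).mean()

def frac_var_explained(x, x_hat):
    # fraction of the activations' variance captured by the reconstruction,
    # for x and x_hat of shape (N, d)
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(dim=0)) ** 2).sum()
    return 1.0 - resid / total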

feature_activation_windows is the get_mels helper from the paper — it returns the log-mel windows used to visualise what an SAE feature responds to. Averaging many windows over the clips where the feature fires gives the feature's receptive field in mel-spectrogram space.
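A sketch of that averaging step, assuming windows is a list of equally-sized (n_mels, n_frames) log-mel arrays returned by feature_activation_windows for one feature (the return type is an assumption):

import numpy as np

# Stack the per-activation log-mel windows and average them to estimate the
# feature's receptive field in mel-spectrogram space.
receptive_field = np.mean(np.stack(windows), axis=0)   # (n_mels, n_frames)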

collect_dataset_stats + classify_features + plot_feature_tsne together reproduce the speech / sounds / music classification from section 4 of the paper: accumulate per-dataset activation frequencies, label each feature by the category where it fires most often, and project the labelled encoder rows to 2-D with t-SNE.
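The labeling rule itself is a one-liner. A sketch assuming freqs is a (dict_size, 3) array of per-category fire frequencies produced by collect_dataset_stats — both the shape and the column order are assumptions here:

import numpy as np

categories = np.array(["speech", "sounds", "music"])
labels = categories[freqs.argmax(axis=1)]   # each feature gets its most-fired category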

See examples/interpretability.ipynb for an end-to-end run of reconstruction metrics, top-activating clips, and activation windows on a small set of example clips.

License

MIT — see LICENSE.

Citation

@inproceedings{aparin2026audiosae,
  title     = {AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders},
  author    = {Aparin, Georgii and Sadekova, Tasnima and Rukhovich, Alexey and Yermekova, Assel and Kushnareva, Laida and Popov, Vadim and Kuznetsov, Kristian and Piontkovskaya, Irina},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2026},
  address   = {Rabat, Morocco},
}
