v4.0.0 — FoldedLayerNorm + BH FDR + SAE Polysemanticity

@designer-coderajay designer-coderajay released this 03 Apr 19:42
· 184 commits to main since this release

What's New in v4.0.0

FoldedLayerNorm (glassbox/layernorm_correction.py)

Absorbs the LayerNorm scale γ into the W_Q, W_K, and W_V weight matrices (Elhage et al. 2021 §4.1), removing multiplicative scale bias from attribution scores. Reports the per-head bias ratio Δα(h)/|α_raw(h)| and flags heads whose ratio exceeds 0.15 as layernorm_biased.
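As a rough illustration of the folding idea, the sketch below shows that applying LayerNorm's affine step and then a projection gives the same result as applying a folded projection directly to the normalized input. It is not the code in glassbox/layernorm_correction.py: every function name here (layernorm_normalize, matvec, fold_layernorm, flag_biased) is hypothetical. It also assumes that β is folded into the projection bias, that Δα(h) means the corrected score minus the raw score, and that the ratio is compared in absolute value against the 0.15 threshold.

```python
import math
import random


def layernorm_normalize(x, eps=1e-5):
    """Center and scale a vector to zero mean and unit variance (no affine step)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def matvec(x, W, b):
    """Compute x @ W + b for a row vector x and a (d_in x d_out) matrix W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def fold_layernorm(W, b, gamma, beta):
    """Fold LayerNorm's gamma and beta into the following linear map, so that
    (gamma * x_hat + beta) @ W + b == x_hat @ W_folded + b_folded."""
    d_in, d_out = len(gamma), len(b)
    W_folded = [[gamma[i] * W[i][j] for j in range(d_out)] for i in range(d_in)]
    b_folded = [sum(beta[i] * W[i][j] for i in range(d_in)) + b[j] for j in range(d_out)]
    return W_folded, b_folded


def flag_biased(alpha_raw, alpha_corrected, threshold=0.15):
    """Flag heads whose relative attribution change exceeds the threshold.
    Assumption: delta alpha is (corrected - raw), compared in absolute value."""
    return [abs(c - r) / abs(r) > threshold for r, c in zip(alpha_raw, alpha_corrected)]


random.seed(0)
d_model, d_head = 6, 3
x = [random.gauss(0, 1) for _ in range(d_model)]
gamma = [random.gauss(1, 0.5) for _ in range(d_model)]
beta = [random.gauss(0, 0.1) for _ in range(d_model)]
W_Q = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
b_Q = [random.gauss(0, 0.1) for _ in range(d_head)]

x_hat = layernorm_normalize(x)

# Reference path: LayerNorm affine step, then the query projection.
q_ref = matvec([g * v + bt for g, v, bt in zip(gamma, x_hat, beta)], W_Q, b_Q)

# Folded path: projection with gamma and beta absorbed into the weights.
W_f, b_f = fold_layernorm(W_Q, b_Q, gamma, beta)
q_folded = matvec(x_hat, W_f, b_f)

max_err = max(abs(a - c) for a, c in zip(q_ref, q_folded))

# Made-up attribution scores for three heads, purely for illustration.
flags = flag_biased([0.5, -0.2, 1.0], [0.45, -0.26, 0.9])
```

The two paths agree up to floating-point error, which is why scores computed against the folded weights no longer carry a separate γ factor. The same folding applies to W_K and W_V.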

BenjaminiHochberg FDR (glassbox/fdr.py)

Multiple-testing correction for 144+ simultaneous head tests. Controls the false discovery rate, FDR = E[FDP] ≤ α, via the BH step-up procedure (Benjamini & Hochberg 1995). Reports BH and Bonferroni results side by side. Supports p-values derived from z-test standard errors, bootstrap standard errors, or permutation tests.
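For readers unfamiliar with the step-up procedure, here is a minimal sketch of BH next to Bonferroni on a small set of made-up p-values. The function names are hypothetical and do not reflect the interface of glassbox/fdr.py. The BH guarantee assumes independent or positively dependent tests.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where BH rejects the null hypothesis."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


def bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Made-up p-values for ten heads, purely for illustration.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

bh = benjamini_hochberg(p)
bonf = bonferroni(p)
n_bh, n_bonf = sum(bh), sum(bonf)
```

On this input BH rejects the two smallest p-values while Bonferroni rejects only the smallest, which illustrates why the two are worth reporting together: Bonferroni controls the stricter family-wise error rate and loses power as the number of heads grows.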

PolysemanticityScorerSAE (glassbox/polysemanticity.py)

Quantifies whether attention heads are monosemantic or polysemantic via the entropy H(p(feature | head_h)). Uses the SAE-entropy method when sae-lens is installed and falls back to the PCA participation ratio otherwise. Reports monosemantic_fraction across circuit heads.
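The two quantities named above can be sketched as follows. This is an illustration under stated assumptions, not the scorer in glassbox/polysemanticity.py. It assumes p(feature | head) is obtained by normalizing nonnegative per-feature activations, and that the participation ratio is (Σλ)² / Σλ² over the eigenvalues λ of the sample covariance. The function names and the entropy threshold used for monosemantic_fraction are hypothetical.

```python
import math


def feature_entropy(activations):
    """Shannon entropy (in nats) of p(feature | head), built by normalizing
    nonnegative per-feature activations. Zero entropy means one feature dominates."""
    total = sum(activations)
    probs = [a / total for a in activations if a > 0]
    return -sum(p * math.log(p) for p in probs)


def participation_ratio(samples):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the covariance.
    Computed without an eigendecomposition via tr(C) and tr(C^2) = ||C||_F^2."""
    n, d = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    frob_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    return trace ** 2 / frob_sq


def monosemantic_fraction(entropies, threshold):
    """Fraction of heads with entropy below a (hypothetical) threshold."""
    return sum(h < threshold for h in entropies) / len(entropies)


# Made-up inputs, purely for illustration.
h_mono = feature_entropy([10.0, 0.0, 0.0, 0.0])   # one active feature
h_poly = feature_entropy([1.0, 1.0, 1.0, 1.0])    # four equally active features

pr_line = participation_ratio([[1, 2], [2, 4], [3, 6], [-1, -2]])  # one direction
pr_iso = participation_ratio([[1, 0], [-1, 0], [0, 1], [0, -1]])   # two directions

frac = monosemantic_fraction([h_mono, h_poly], threshold=0.5)
```

A head driven by a single feature has entropy 0, while one spread evenly over four features has entropy log 4. Similarly, the participation ratio is 1 when activations vary along one direction and 2 when variance is split evenly across two, so both measures grow as a head's behaviour spreads over more features or directions.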

EU AI Act Mapping

  • Art. 13(1) Transparency: bias-corrected attribution scores
  • Art. 9(1) Risk Management: FDR prevents false-positive circuit identifications
  • Art. 15(1) Robustness: polysemanticity quantifies interpretability quality

Mathematical Completeness: 13/18 frameworks

pip install glassbox-mech-interp==4.0.0