In [1]:
from IPython.display import Image

- Scaling and evaluating sparse autoencoders
    - https://cdn.openai.com/papers/sparse-autoencoders.pdf
    - https://github.com/openai/sparse_autoencoder
    - https://openai.com/index/extracting-concepts-from-gpt-4/
    - Anthropic (Claude):
        - https://transformer-circuits.pub/2023/monosemantic-features
            - Using a sparse autoencoder, we extract a large number of interpretable features from a one-layer transformer.
        - https://www.anthropic.com/news/decomposing-language-models-into-understandable-components
        - For the first time, we feel that the next primary obstacle to interpreting large language models is engineering rather than science.
- SAE (Sparse Autoencoder)
    - An unsupervised approach for extracting interpretable features from a language model by **reconstructing activations** from a sparse bottleneck layer.
    - Since language models learn **many concepts**, autoencoders need to be very large to recover all relevant features.
        - Scaling SAEs
    - **k-sparse autoencoders** [Makhzani and Frey, 2013] to directly control sparsity,
        - https://arxiv.org/pdf/1312.5663
    - we train a 16 million latent autoencoder on GPT-4 activations for 40 billion tokens.
        - $16{,}000{,}000$ latents
        - In other words, the sparse autoencoder trained on GPT-4 activations learns 16 million features of GPT-4.
- Neural network interpretability (understanding how neural networks work and think)
    - 'features' rather than neuron units.
        - 'features that respond to legal texts'
        - 'features that respond to DNA sequences,'
    - When a large-scale language model generates each token in a sentence, only **a small part of the huge neural network fires** (sends a signal).
- https://www.oxen.ai/blog/arxiv-dives

In [30]:
Image(url='../../../imgs/claude-sae.png', width=500)


- The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity.
    - 512 -> 131072 (256x)
    - They refer to the units of this high-dimensional layer as “features.”
- The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations.
    - 131072 -> 512 
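
The two layers described above can be sketched in PyTorch as follows. This is a minimal illustration, not the implementation from either paper: the class name `ReLUSparseAutoencoder` is hypothetical, the bias terms come from `nn.Linear` defaults (an assumption), and the sizes are small stand-ins for 512 -> 131072 so the cell runs quickly.

```python
import torch
import torch.nn as nn


class ReLUSparseAutoencoder(nn.Module):
    """Hypothetical sketch: linear encoder + ReLU, then linear decoder."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Encoder: maps activations to the higher-dimensional feature layer.
        self.encoder = nn.Linear(d_model, n_features)
        # Decoder: reconstructs activations from the feature activations.
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))
        x_hat = self.decoder(features)
        return x_hat, features


torch.manual_seed(0)
# Small stand-in sizes (assumption); the notes above use 512 -> 131072 (256x).
d_model, expansion = 16, 8
sae = ReLUSparseAutoencoder(d_model, d_model * expansion)

x = torch.randn(4, d_model)  # a batch of 4 activation vectors
x_hat, features = sae(x)
print(x_hat.shape, features.shape)
```

The reconstruction `x_hat` has the same shape as the input, while `features` is `expansion` times wider and non-negative because of the ReLU.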

In [3]:
# top4
Image(url='https://images.ctfassets.net/kftzwdyauwt9/5i0GAmvivjtoiLsTnJW6HS/c874efa090da2a90280be59fc424f2b4/sparse-autoencoder_dark.gif?w=3840&q=80&fm=webp', 
      width=400)

## SAE


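- $d, n, k$:
    - $d$: dimension of the input (model activation) space
    - $n$: number of latents (dimension of the feature space)
    - $k$: sparsity, i.e. the number of active latents per input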

$$
\begin{align*}
z &= \text{ReLU}(W_{\text{enc}}(x - b_{\text{pre}}) + b_{\text{enc}}) \\
\hat{x} &= W_{\text{dec}}z + b_{\text{pre}}
\end{align*}
$$
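- $b_\text{pre}$ is a constant bias subtracted from the input vector $x$ before encoding; its purpose is to center the data. The same bias is added back after decoding.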

- topK

    $$
    z = \text{TopK}(W_{\text{enc}}(x - b_{\text{pre}}))
    $$
    - We use a k-sparse autoencoder [Makhzani and Frey, 2013], which directly controls the number of active latents by using an **activation function (TopK)** that only keeps the k largest latents, zeroing the rest.
- training loss
    
    $$
    \mathcal L=\|x-\hat x\|^2_2
    $$ 

    - TopK fixes the number of active latents directly, so it removes the need for the L1 penalty
    - Claude (Anthropic): L2 reconstruction loss + L1 penalty on the hidden-layer activations
- Jointly fitting sparsity ($L(n, k)$)
    - the loss is fit jointly as a function of the number of latents $n$ and the sparsity level $k$

$$
\begin{equation}
L(n, k) = \exp(\alpha + \beta_k \log(k) + \beta_n \log(n) + \gamma \log(k) \log(n)) + \exp(\zeta + \eta \log(k))
\end{equation}
$$
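
To make the functional form concrete, here is a small sketch that evaluates $L(n, k)$ exactly as written above. The function name `joint_scaling_law` is hypothetical, and the coefficient values are arbitrary placeholders chosen only so the code runs; they are not the fitted values from the paper.

```python
import math


def joint_scaling_law(n, k, alpha, beta_k, beta_n, gamma, zeta, eta):
    """Evaluate L(n, k) using the formula above (hypothetical helper)."""
    log_n, log_k = math.log(n), math.log(k)
    # First term: depends on both n and k, including the log(k) * log(n) interaction.
    first = math.exp(alpha + beta_k * log_k + beta_n * log_n + gamma * log_k * log_n)
    # Second term: depends on k only.
    second = math.exp(zeta + eta * log_k)
    return first + second


# Placeholder coefficients for illustration only, NOT fitted values.
coeffs = dict(alpha=0.0, beta_k=-0.1, beta_n=-0.1, gamma=0.0, zeta=-2.0, eta=-0.5)

for n in (2**15, 2**17, 2**20):
    print(n, joint_scaling_law(n, k=32, **coeffs))
```

Since the second term does not involve $n$, increasing the number of latents at fixed $k$ can only change the first term.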

## topK backward

In [27]:
import torch
torch.manual_seed(42)

<torch._C.Generator at 0x7657b0751550>

$$
\begin{split}
y_1=W_1\cdot x_1\\
y_k=\text{TopK}(y_1)\\
\ell=\sum y_k^2
\end{split}
$$

In [28]:
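# x1: input vector, shape (6,)
x1 = torch.rand(6, requires_grad=True)

# W1: weight matrix, shape (6, 6)
W1 = torch.rand(6, 6, requires_grad=True)

# y1: pre-activations, shape (6,)
y1 = W1 @ x1

# keep the 3 largest entries of y1; topk returns (values, indices)
yk, topk_idx = torch.topk(y1, 3)

print(y1)
print(yk)
print(topk_idx)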

tensor([2.1621, 2.6728, 1.9516, 0.8855, 2.4250, 2.3444], grad_fn=<MvBackward0>)
tensor([2.6728, 2.4250, 2.3444], grad_fn=<TopkBackward0>)
tensor([1, 4, 5])


In [29]:
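# reduce to a scalar loss
loss1 = (yk ** 2).sum()

# topk is differentiable w.r.t. the selected values (grad_fn=<TopkBackward0>);
# gradients flow only to the rows of W1 whose outputs were selected by topk
loss1.backward()
W1.grad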

tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [4.7162, 4.8912, 2.0466, 5.1280, 2.0871, 3.2121],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [4.2791, 4.4378, 1.8569, 4.6527, 1.8937, 2.9144],
        [4.1368, 4.2903, 1.7952, 4.4980, 1.8308, 2.8175]])
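
The same gradient pattern carries over to a full TopK autoencoder trained with the reconstruction loss alone. The sketch below follows the formulas in the SAE section ($z = \text{TopK}(W_{\text{enc}}(x - b_{\text{pre}}))$, $\hat{x} = W_{\text{dec}}z + b_{\text{pre}}$, $\mathcal L=\|x-\hat x\|^2_2$). The class name `TopKSparseAutoencoder`, the random initialization, the zero-initialized `b_pre`, and the tiny sizes are assumptions made for illustration, not the setup from the paper.

```python
import torch
import torch.nn as nn


class TopKSparseAutoencoder(nn.Module):
    """Hypothetical sketch: z = TopK(W_enc (x - b_pre)), x_hat = W_dec z + b_pre."""

    def __init__(self, d: int, n: int, k: int):
        super().__init__()
        self.k = k
        self.b_pre = nn.Parameter(torch.zeros(d))             # assumed init
        self.W_enc = nn.Parameter(torch.randn(n, d) / d**0.5)  # assumed init
        self.W_dec = nn.Parameter(torch.randn(d, n) / n**0.5)  # assumed init

    def forward(self, x: torch.Tensor):
        pre = (x - self.b_pre) @ self.W_enc.T                  # (batch, n)
        values, indices = torch.topk(pre, self.k, dim=-1)
        # Keep the k largest pre-activations and zero everything else.
        z = torch.zeros_like(pre).scatter(-1, indices, values)
        x_hat = z @ self.W_dec.T + self.b_pre
        return x_hat, z, indices


torch.manual_seed(0)
d, n, k = 8, 32, 4
sae = TopKSparseAutoencoder(d, n, k)

x = torch.randn(1, d)                    # a single activation vector
x_hat, z, indices = sae(x)
loss = ((x - x_hat) ** 2).sum()          # squared L2 reconstruction error, no L1 term
loss.backward()

# Encoder rows that received a nonzero gradient.
rows_with_grad = sae.W_enc.grad.abs().sum(dim=1).nonzero().flatten()
print(sorted(indices.flatten().tolist()))
print(rows_with_grad.tolist())
```

As in the toy example, only the encoder rows belonging to the selected latents can receive gradient for a given input; every other row of `W_enc.grad` is exactly zero.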