> reverse engineering

- https://www.lesswrong.com/posts/hnzHrdqn3nrjveayv/how-to-transformer-mechanistic-interpretability-in-50-lines
- https://docs.google.com/presentation/d/1BkAjGqIqgomQpj6j0-oZORCeq72sH7_sOPjlflnYm_A/edit?pli=1#slide=id.p

In [1]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

In [2]:
import transformer_lens
from transformer_lens import utils
from transformer_lens import HookedTransformer
import torch

- model cfg
    - https://transformerlensorg.github.io/TransformerLens/generated/code/transformer_lens.HookedTransformerConfig.html
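- `logits, cache = model.run_with_cache(dataset)`
    - Runs the model and also captures intermediate results (typically the outputs of particular layers or modules) for later analysis and debugging.
    - `cache["pattern", layer, "attn"]`: the attention pattern (attention weights) of the attention module at layer `layer`.
        - A tensor of shape [batch_size, num_heads, seq_length, seq_length] containing the attention weights.
    - `cache["pre", layer, "mlp"]`: the pre-activation output of the MLP module at layer `layer`.
        - A tensor of shape [batch_size, seq_length, d_mlp] containing the values before the activation function is applied.
    - `cache["post", layer, "mlp"]`: the post-activation output of the MLP module at layer `layer`.
        - A tensor of shape [batch_size, seq_length, d_mlp] containing the values after the activation function is applied.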
- `model.run_with_hooks(..., fwd_hooks=[])`

### modules

- Embed()/Unembed()
    - Embed: W_E
    - Unembed: W_U
- MLP()
    - W_in
    - W_out
- Attention()
    - QKVO:
        - W(weights)
            - W_Q, W_K, W_V, W_O
        - b(biases)
            - b_Q, b_K, b_V, b_O

### utils

In [3]:
utils.get_act_name("post", 1)

'blocks.1.mlp.hook_post'

### tokens

- model.to_string
    - id => string
- model.to_str_tokens
    - id => token
- model.to_single_token
    - string => id

In [30]:
model = HookedTransformer.from_pretrained("gelu-1l").to(torch.float32)

Loaded pretrained model gelu-1l into HookedTransformer
Changing model dtype to torch.float32




In [31]:
model.cfg.d_vocab

48262

In [32]:
model.to_str_tokens(torch.arange(10))

['<|EOS|>', '<|BOS|>', '<|PAD|>', '!', '"', '#', '$', '%', '&', "'"]

In [33]:
model.to_single_token('hello')

24684

In [34]:
model.to_string(24684)

'hello'

### basic usage

In [8]:
# Load a model (eg GPT-2 Small)
model = transformer_lens.HookedTransformer.from_pretrained("gpt2-small")

Loaded pretrained model gpt2-small into HookedTransformer


In [9]:
logits = model("Famous computer scientist Alan")

In [10]:
logits.shape

torch.Size([1, 6, 50257])

In [11]:
# The logit dimensions are: [batch, position, vocab]
next_token_logits = logits[0, -1]
next_token_prediction = next_token_logits.argmax()
next_word_prediction = model.tokenizer.decode(next_token_prediction)
print(next_word_prediction)

 Turing


In [12]:
model

HookedTransformer(
  (embed): Embed()
  (hook_embed): HookPoint()
  (pos_embed): PosEmbed()
  (hook_pos_embed): HookPoint()
  (blocks): ModuleList(
    (0-11): 12 x TransformerBlock(
      (ln1): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (ln2): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (attn): Attention(
        (hook_k): HookPoint()
        (hook_q): HookPoint()
        (hook_v): HookPoint()
        (hook_z): HookPoint()
        (hook_attn_scores): HookPoint()
        (hook_pattern): HookPoint()
        (hook_result): HookPoint()
      )
      (mlp): MLP(
        (hook_pre): HookPoint()
        (hook_post): HookPoint()
      )
      (hook_attn_in): HookPoint()
      (hook_q_input): HookPoint()
      (hook_k_input): HookPoint()
      (hook_v_input): HookPoint()
      (hook_mlp_in): HookPoint()
      (hook_attn_out): HookPoint()
      (hook_mlp_out): HookPoint()
      (h

In [13]:
logits, cache = model.run_with_cache("Famous computer scientist Alan")
for key, value in cache.items():
    print(key, value.shape)

hook_embed torch.Size([1, 6, 768])
hook_pos_embed torch.Size([1, 6, 768])
blocks.0.hook_resid_pre torch.Size([1, 6, 768])
blocks.0.ln1.hook_scale torch.Size([1, 6, 1])
blocks.0.ln1.hook_normalized torch.Size([1, 6, 768])
blocks.0.attn.hook_q torch.Size([1, 6, 12, 64])
blocks.0.attn.hook_k torch.Size([1, 6, 12, 64])
blocks.0.attn.hook_v torch.Size([1, 6, 12, 64])
blocks.0.attn.hook_attn_scores torch.Size([1, 12, 6, 6])
blocks.0.attn.hook_pattern torch.Size([1, 12, 6, 6])
blocks.0.attn.hook_z torch.Size([1, 6, 12, 64])
blocks.0.hook_attn_out torch.Size([1, 6, 768])
blocks.0.hook_resid_mid torch.Size([1, 6, 768])
blocks.0.ln2.hook_scale torch.Size([1, 6, 1])
blocks.0.ln2.hook_normalized torch.Size([1, 6, 768])
blocks.0.mlp.hook_pre torch.Size([1, 6, 3072])
blocks.0.mlp.hook_post torch.Size([1, 6, 3072])
blocks.0.hook_mlp_out torch.Size([1, 6, 768])
blocks.0.hook_resid_post torch.Size([1, 6, 768])
blocks.1.hook_resid_pre torch.Size([1, 6, 768])
blocks.1.ln1.hook_scale torch.Size([1, 6, 1])

- About `hook_pattern` and `hook_z` (both are computed separately for each head)
    - `pattern = softmax((Q @ K.T) / sqrt(d_k))  # [batch, n_heads, seq_len, seq_len]`
    - `Z = pattern @ V  # weighted sum of value vectors, [batch, seq_len, n_heads, d_head]`
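
The two formulas can be sketched in plain PyTorch on random tensors laid out like the cached `hook_q`, `hook_k` and `hook_v` activations. The sizes are arbitrary assumptions, and the causal mask is added here because GPT-2 style decoders use one even though the formulas above leave it out.

```python
import math
import torch

torch.manual_seed(0)
batch, n_heads, seq_len, d_head = 1, 2, 4, 8

# Per-head queries, keys and values: [batch, pos, head, d_head]
q = torch.randn(batch, seq_len, n_heads, d_head)
k = torch.randn(batch, seq_len, n_heads, d_head)
v = torch.randn(batch, seq_len, n_heads, d_head)

# Scaled dot-product scores: [batch, head, query_pos, key_pos]
scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / math.sqrt(d_head)

# Causal mask: a query may only attend to keys at the same or earlier position.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))

# Each row of the pattern is a probability distribution over key positions.
pattern = scores.softmax(dim=-1)

# Weighted sum of value vectors: [batch, query_pos, head, d_head]
z = torch.einsum("bhqk,bkhd->bqhd", pattern, v)
```

Because of the mask, the first position can only attend to itself, so `z[:, 0]` is exactly `v[:, 0]`.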

In [14]:
# Run the model and get logits and activations
logits, cache = model.run_with_cache("Hello World")

In [15]:
logits.shape

torch.Size([1, 3, 50257])

In [16]:
cache["blocks.0.attn.hook_pattern"].shape

torch.Size([1, 12, 3, 3])

In [17]:
head_idx = 5
pos = 1
weighting = cache["blocks.0.attn.hook_pattern"][0, head_idx, pos, :] 
v = cache["blocks.0.attn.hook_v"][0, :, head_idx, :]
z = weighting @ v  # equivalent to cache["blocks.0.attn.hook_z"][0, pos, head_idx, :]
z

tensor([-0.0623, -0.2089, -0.0441,  0.7658, -0.0960, -0.4061, -0.0387, -0.1681,
        -0.1954,  0.2037, -0.1689, -0.0635,  0.0478,  0.0231, -0.2398,  0.1860,
        -0.2425, -0.0643, -0.0097, -0.0837,  0.1259, -0.0514, -0.0949, -0.1342,
        -0.1288,  0.1212,  0.0954,  0.2853,  0.0163, -0.1749,  0.1404, -0.0348,
        -0.3939,  0.2363, -0.0986,  0.0756,  0.3388,  0.0141, -0.0032, -0.0519,
         0.0807, -0.0332,  0.1161,  0.2470, -0.1805, -0.0772, -0.8262,  0.2272,
        -0.0716,  0.1921, -0.2517, -0.0693, -0.3242,  0.0707, -0.4548,  0.0135,
         0.0316,  0.0898, -0.0733,  0.2011,  0.1891,  0.2649, -0.2121, -0.1774],
       device='cuda:0')

In [18]:
cache["blocks.0.attn.hook_z"][0, pos, head_idx, :]

tensor([-0.0623, -0.2089, -0.0441,  0.7658, -0.0960, -0.4061, -0.0387, -0.1681,
        -0.1954,  0.2037, -0.1689, -0.0635,  0.0478,  0.0231, -0.2398,  0.1860,
        -0.2425, -0.0643, -0.0097, -0.0837,  0.1259, -0.0514, -0.0949, -0.1342,
        -0.1288,  0.1212,  0.0954,  0.2853,  0.0163, -0.1749,  0.1404, -0.0348,
        -0.3939,  0.2363, -0.0986,  0.0756,  0.3388,  0.0141, -0.0032, -0.0519,
         0.0807, -0.0332,  0.1161,  0.2470, -0.1805, -0.0772, -0.8262,  0.2272,
        -0.0716,  0.1921, -0.2517, -0.0693, -0.3242,  0.0707, -0.4548,  0.0135,
         0.0316,  0.0898, -0.0733,  0.2011,  0.1891,  0.2649, -0.2121, -0.1774],
       device='cuda:0')

### induction heads

In [19]:
utils.test_prompt("Her name was Alex Hart. Tomorrow at lunch time Alex",
                  answer=" Hart", model=model)

Tokenized prompt: ['<|endoftext|>', 'Her', ' name', ' was', ' Alex', ' Hart', '.', ' Tomorrow', ' at', ' lunch', ' time', ' Alex']
Tokenized answer: [' Hart']


Top 0th token. Logit: 15.64 Prob: 28.38% Token: | will|
Top 1th token. Logit: 14.47 Prob:  8.79% Token: | would|
Top 2th token. Logit: 14.34 Prob:  7.74% Token: | was|
Top 3th token. Logit: 14.29 Prob:  7.35% Token: | Hart|
Top 4th token. Logit: 14.18 Prob:  6.54% Token: | and|
Top 5th token. Logit: 14.09 Prob:  6.00% Token: | is|
Top 6th token. Logit: 13.51 Prob:  3.38% Token: |'s|
Top 7th token. Logit: 13.23 Prob:  2.53% Token: |,|
Top 8th token. Logit: 12.73 Prob:  1.55% Token: | had|
Top 9th token. Logit: 12.00 Prob:  0.74% Token: | has|


### hook

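- In deep learning frameworks such as PyTorch, a hook is a mechanism for inserting a custom function into a model's **forward or backward pass**. It lets us **read or modify intermediate activations** during the network's computation, which helps in understanding how the model works internally.
    - Forward hook: called around a module's `forward` method, either before the input is passed in or after the output is produced (PyTorch calls the first kind a forward pre-hook). It can read and modify the input and output tensors.
    - Backward hook: called during backpropagation; it can read and modify gradients.
- transformer_lens is a library for analyzing and interpreting Transformer models. It extends the base Transformer model with hook points throughout the model, which lets researchers:
    - Record activations: extract intermediate activations at a specific layer or position for analysis, for example attention weights or hidden states.
    - Modify activations: change intermediate activations during the forward pass to test how the model responds to an intervention. This is useful for understanding how the model processes information.
    - Run ablation experiments: zero out the activations of certain neurons and observe how the output changes, to determine how important those neurons are for a given task.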
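
The recording and ablation ideas above can be sketched with plain PyTorch forward hooks, independent of transformer_lens. The small `nn.Sequential` network and the hook names are hypothetical and exist only to illustrate the mechanism.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 4)
recorded = {}

def record_hook(module, inputs, output):
    # Returning None leaves the module's output unchanged.
    recorded["relu_out"] = output.detach().clone()

def ablate_hook(module, inputs, output):
    # Returning a tensor replaces the module's output.
    output = output.clone()
    output[:, 0] = 0.0  # zero-ablate hidden neuron 0
    return output

# Record the hidden activations after the ReLU.
handle = net[1].register_forward_hook(record_hook)
baseline = net(x)
handle.remove()

# Rerun with neuron 0 ablated.
handle = net[1].register_forward_hook(ablate_hook)
ablated = net(x)
handle.remove()

# With all hooks removed the network behaves as before.
restored = net(x)
```

Removing each handle matters: a hook that stays registered silently changes every later forward pass.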

In [20]:
# model.hook_dict.keys()

In [21]:
# def attention_pattern_hook(activation, hook):
#     hook.ctx["pattern"] = activation.detach().clone()
#     print(f"Attention pattern shape at {hook.name}: {activation.shape}")
#     return activation

In [22]:
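# # Suppose we want to add a hook at some layer, e.g. the attention pattern of the first layer
# hook_name = 'blocks.0.attn.hook_pattern'  # hook point for the first layer's attention pattern

# # Add a forward hook at the specified hook point
# model.add_hook(hook_name, attention_pattern_hook, dir='fwd')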

In [23]:
# logits = model("Famous computer scientist Alan")

In [24]:
# def extended_attention_hook(activation, hook):
#     """Extended version with more analysis"""
#     pattern = activation.detach()
    
#     # Get sequence length
#     seq_len = pattern.shape[-1]
    
#     # Calculate average attention per position
#     avg_attention = pattern.mean(dim=(0,1))  # Average across batch and heads
    
#     # Find which tokens get most attention
#     max_attended_pos = avg_attention.argmax(dim=-1)
    
#     print(f"\nAnalysis for {hook.name}:")
#     print(f"- Shape: {pattern.shape}")
#     print(f"- Most attended position: {max_attended_pos.tolist()}")
    
#     # Store for later use
#     hook.ctx["pattern"] = pattern
#     hook.ctx["avg_attention"] = avg_attention
    
#     return activation

In [25]:
# # Hook specific layers
# model.blocks[0].attn.hook_pattern.add_hook(extended_attention_hook)  # First layer
# model.blocks[-1].attn.hook_pattern.add_hook(extended_attention_hook) # Last layer

# # Hook multiple components
# hooks = []
# for layer in range(model.cfg.n_layers):
#     hooks.append(
#         model.blocks[layer].attn.hook_pattern.add_hook(
#             extended_attention_hook,
#             # name=f"attn_pattern_layer_{layer}"
#         )
#     )

In [26]:
# logits = model("Famous computer scientist Alan")