# Activation addition in Llama with the baukit library

This notebook shows how to extract and manipulate internal activations of a Llama Transformer model using the [baukit library](https://github.com/davidbau/baukit).

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from baukit import Trace, module_names

In [2]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"device: {device}")

model_name = "llama-7b"
model_path = f"huggyllama/{model_name}"

# load model
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
model.eval()

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left' 

num_hidden_layers = model.config.num_hidden_layers
hidden_size = model.config.hidden_size

device: cuda


We first need to find the name of the module where we want to apply activation addition. Let's focus on the residual stream output of layer 5.

In [6]:
module_names(model)

['',
 'model',
 'model.embed_tokens',
 'model.layers',
 'model.layers.0',
 'model.layers.0.self_attn',
 'model.layers.0.self_attn.q_proj',
 'model.layers.0.self_attn.k_proj',
 'model.layers.0.self_attn.v_proj',
 'model.layers.0.self_attn.o_proj',
 'model.layers.0.self_attn.rotary_emb',
 'model.layers.0.mlp',
 'model.layers.0.mlp.gate_proj',
 'model.layers.0.mlp.up_proj',
 'model.layers.0.mlp.down_proj',
 'model.layers.0.mlp.act_fn',
 'model.layers.0.input_layernorm',
 'model.layers.0.post_attention_layernorm',
 'model.layers.1',
 'model.layers.1.self_attn',
 'model.layers.1.self_attn.q_proj',
 'model.layers.1.self_attn.k_proj',
 'model.layers.1.self_attn.v_proj',
 'model.layers.1.self_attn.o_proj',
 'model.layers.1.self_attn.rotary_emb',
 'model.layers.1.mlp',
 'model.layers.1.mlp.gate_proj',
 'model.layers.1.mlp.up_proj',
 'model.layers.1.mlp.down_proj',
 'model.layers.1.mlp.act_fn',
 'model.layers.1.input_layernorm',
 'model.layers.1.post_attention_layernorm',
 'model.layers.2',
 'mo

We then use the `Trace` class to capture the hidden states, which it stores in its `output` attribute. The decoder layer returns a tuple, so the hidden states are its first element.

In [22]:
test_sentence = "The quick brown fox jumps over the lazy dog"
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)

layer_id = 5
module_name = f'model.layers.{layer_id}'
with Trace(model, module_name) as ret:
    _ = model(**inputs)
    baukit_output = ret.output[0]

print(baukit_output.shape)

torch.Size([1, 12, 4096])
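
For intuition, capturing a module's output this way is conceptually similar to registering a plain PyTorch forward hook. The sketch below is not baukit's implementation. It uses a small hypothetical `toy_model` in place of Llama, and the names `captured` and `save_output` are made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a stack of transformer blocks (hypothetical, not Llama).
toy_model = nn.Sequential(
    nn.Linear(8, 8),
    nn.Tanh(),
    nn.Linear(8, 8),
)

captured = {}

def save_output(module, inputs, output):
    # Forward hooks receive the module, its inputs, and its output.
    captured["output"] = output.detach()

# Attach the hook to the first layer, run the model, then remove the hook.
handle = toy_model[0].register_forward_hook(save_output)
try:
    x = torch.randn(1, 12, 8)  # [batch_size, num_tokens, hidden_dim]
    with torch.no_grad():
        _ = toy_model(x)
finally:
    handle.remove()

print(captured["output"].shape)
```

Removing the hook afterwards matters: a hook left in place keeps firing on every later forward pass. Using `Trace` as a context manager takes care of that cleanup.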


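We can compare the baukit output with the hidden states returned when we pass `output_hidden_states=True` to the forward pass and see that they are identical. Note the index `layer_id+1`: entry 0 of `hidden_states` is the embedding output, so the output of layer `layer_id` sits at position `layer_id+1`. With baukit, however, we can access the output of any module, not just the residual stream.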

In [24]:
output = model(**inputs, output_hidden_states=True)["hidden_states"][layer_id+1]

print(f"MSE between baukit output and output when passing output_hidden_states=True: {(baukit_output-output).pow(2).mean()}")

MSE between baukit output and output when passing output_hidden_states=True: 0.0
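
The `layer_id+1` offset in the comparison above can be illustrated on a toy model. This is a sketch under stated assumptions: `ToyBlock` and `ToyStack` are hypothetical classes that only mimic the convention of returning the embedding output followed by one entry per layer. They are not the Llama implementation.

```python
import torch
from torch import nn

torch.manual_seed(0)

class ToyBlock(nn.Module):
    """Hypothetical residual block standing in for a decoder layer."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden):
        return hidden + torch.tanh(self.linear(hidden))

class ToyStack(nn.Module):
    """Hypothetical model: an embedding followed by a few residual blocks."""

    def __init__(self, vocab_size=10, hidden_dim=4, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.layers = nn.ModuleList([ToyBlock(hidden_dim) for _ in range(num_layers)])

    def forward(self, token_ids):
        hidden = self.embed(token_ids)
        hidden_states = [hidden]          # index 0: embedding output
        for layer in self.layers:
            hidden = layer(hidden)
            hidden_states.append(hidden)  # index i + 1: output of layer i
        return hidden, tuple(hidden_states)

toy = ToyStack()
token_ids = torch.tensor([[1, 2, 3]])
toy_layer_id = 1

# Capture the output of layer `toy_layer_id` with a forward hook.
captured = {}
handle = toy.layers[toy_layer_id].register_forward_hook(
    lambda module, inputs, output: captured.update(output=output.detach())
)
with torch.no_grad():
    _, hidden_states = toy(token_ids)
handle.remove()

print(len(hidden_states))  # one entry more than the number of layers
print(torch.equal(captured["output"], hidden_states[toy_layer_id + 1]))
```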


### Activation steering

Now let's define a steering direction and do activation addition using baukit. We can use the same `Trace` class, but we have to define a function that edits the internal activation and pass it to `Trace` via the `edit_output` argument. See also the baukit code [here](https://github.com/davidbau/baukit/blob/main/baukit/nethook.py).

In [29]:
sentences = ["Love", "Hate"]
inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(device)

with Trace(model, module_name) as ret:
    _ = model(**inputs)
    baukit_output = ret.output[0]

baukit_output.shape

torch.Size([2, 3, 4096])

In [40]:
token_pos = -1
# the direction should have the same number of dimensions as the activations (they usually have shape [batch_size, num_tokens, hidden_dim])
# the easiest is just to define a direction with shape [1, 1, hidden_dim]
# this can then be added to all tokens for the complete batch
direction = baukit_output[0:1, token_pos:, :] - baukit_output[1:, token_pos:, :]
# make sure the direction vector is on the same device and has the same precision as the model
direction = direction.to(device=device, dtype=model.dtype)
print(f"shape of direction: {direction.shape}")
print(f"norm of direction:  {direction.norm(dim=-1).item():.4g}")

shape of direction: torch.Size([1, 1, 4096])
norm of direction:  16.27
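
The shape reasoning in the comments above can be checked on small tensors. This is a minimal sketch with made-up sizes: slicing with `0:1` and `-1:` keeps the batch and token dimensions, whereas plain indexing drops them, and a `[1, 1, hidden_dim]` tensor broadcasts over every token of every sequence in a batch.

```python
import torch

batch_size, num_tokens, hidden_dim = 2, 3, 4
activations = torch.zeros(batch_size, num_tokens, hidden_dim)

# Slicing keeps the dimensions, plain indexing drops them.
kept = activations[0:1, -1:, :]
dropped = activations[0, -1, :]
print(kept.shape)     # torch.Size([1, 1, 4])
print(dropped.shape)  # torch.Size([4])

# A [1, 1, hidden_dim] direction is added to every token of every sequence.
toy_direction = torch.arange(hidden_dim, dtype=torch.float32).reshape(1, 1, hidden_dim)
steered = activations + 2.0 * toy_direction
print(steered.shape)  # torch.Size([2, 3, 4])
print(steered[1, 2])  # tensor([0., 2., 4., 6.])
```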


In [41]:
def fun_factory(direction, coeff):

    def act_add(output):
        if isinstance(output, tuple):
            # the output cannot be modified in place, so we have to return the modified output
            return (output[0] + coeff*direction,) + output[1:]
        else:
            return output + coeff*direction

    return act_add
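
To see how such an edit function behaves without loading Llama, the sketch below applies the same pattern to a toy module through a plain PyTorch forward hook, whose return value replaces the module's output. `TupleBlock` and `make_act_add` are hypothetical names used only for illustration, and the hook stands in for what `Trace` does with `edit_output`. It is not baukit's actual code.

```python
import torch
from torch import nn

def make_act_add(direction, coeff):
    """Same pattern as the factory above, repeated so this block is self-contained."""
    def act_add(output):
        if isinstance(output, tuple):
            # Build a new tuple: edited hidden states first, the rest unchanged.
            return (output[0] + coeff * direction,) + output[1:]
        return output + coeff * direction
    return act_add

class TupleBlock(nn.Module):
    """Hypothetical block that, like a decoder layer, returns a tuple."""

    def forward(self, hidden):
        return (hidden, "extra")

toy_direction = torch.ones(1, 1, 4)
edit = make_act_add(toy_direction, coeff=-1.0)

block = TupleBlock()
hidden = torch.zeros(2, 3, 4)

# A forward hook that returns a value replaces the module's output.
handle = block.register_forward_hook(lambda module, inputs, output: edit(output))
steered, extra = block(hidden)
handle.remove()

# Without the hook, the output is unchanged.
unsteered, _ = block(hidden)

print(steered[0, 0])    # tensor([-1., -1., -1., -1.])
print(unsteered[0, 0])  # tensor([0., 0., 0., 0.])
```

Both branches return a new object rather than mutating `output`, so the other tuple elements are passed through untouched.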


In [44]:
# create the edit function that we want to apply to internal activations
add_love = fun_factory(direction, coeff=1)
add_hate = fun_factory(direction, coeff=-1)

In [47]:
max_new_tokens = 20
sentences = ["I think dogs are", "I think cats are", "Today I feel"]
random_seed = 0
torch.random.manual_seed(random_seed)
with Trace(model, module_name, edit_output=add_love) as ret:
    inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(device)
    generate_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    generated_text = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print("Generate with positive direction:\n")
for sentence in generated_text:
    print(sentence)
    print("---")

Generate with positive direction:

I think dogs are the best. I love my dog. I love my dog.
I love my dog. I
---
I think cats are the best.
I love my cat, but I love my cat more.
I love my
---
Today I feel like I’m in a bit of a rut with my hair. I’ve been using
---


In [48]:
torch.random.manual_seed(random_seed)
with Trace(model, module_name, edit_output=add_hate) as ret:
    inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(device)
    generate_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    generated_text = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print("Generate with negative direction:\n")
for sentence in generated_text:
    print(sentence)
    print("---")

Generate with negative direction:

I think dogs are the worst. I've never been able to understand how they can be so mean.
I
---
I think cats are the worst.
I'm not a cat person, but I've never liked them.
---
Today I feel like I've been doing this for a while. I'm not sure what I've
---


In [49]:
torch.random.manual_seed(random_seed)
inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(device)
generate_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
generated_text = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print("Generate without activation addition:\n")
for sentence in generated_text:
    print(sentence)
    print("---")

Generate without activation addition:

I think dogs are the best. I have a dog named Lucky. He is a 10 year old ch
---
I think cats are the most beautiful animals in the world. I love them so much. I have a cat named T
---
Today I feel like I’m in a bit of a rut. I’m not sure what to write
---
