# Activation steering by editing the bias in gpt2-xl

This notebook shows that, instead of adding a steering vector to a module's output activations, we can get the same effect by editing the bias of the next layer.

In [9]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from baukit import Trace

In [2]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Using device: cuda


In [19]:
# load model
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").to(device).eval()
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")

Instead of adding the steering vector to the activation vector, we can add the product of the next layer's weights and the steering vector to the next layer's bias.
Note that this only works if the layer has the structure $\phi_{l+1} = \sigma(W_l \phi_l + b_l)$.

This is generally not the case for the residual stream. However, we do find this structure in the attention and MLP layers.
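
Written out, with $s$ denoting the steering vector and $\tilde b_l$ the edited bias (both symbols are introduced here for illustration), adding $s$ to the input of such a layer gives

$$
\sigma\big(W_l(\phi_l + s) + b_l\big) = \sigma\big(W_l \phi_l + W_l s + b_l\big) = \sigma\big(W_l \phi_l + \tilde b_l\big), \qquad \tilde b_l = b_l + W_l s .
$$

Steering the input is therefore the same as replacing $b_l$ with $\tilde b_l$. The identity only uses the fact that the map inside $\sigma$ is affine, so it holds for any choice of $\sigma$.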

Let's say we want to edit the bias of the attention module in layer 5. This is equivalent to adding a steering vector to the output of the preceding module, here the layernorm `ln_1`.
As a first step, we therefore need the layernorm output of layer 5, which we record with baukit.


In [20]:
# define layer to do the activation steering on
layer_id = 5
module = model.transformer.h[layer_id].ln_1

# get internal activations
inputs = tokenizer("Love", return_tensors="pt").to(device)
with Trace(module) as cache:
    _ = model(**inputs)
    act_love = cache.output

inputs = tokenizer("Hate", return_tensors="pt").to(device)
with Trace(module) as cache:
    _ = model(**inputs)
    act_hate = cache.output

print(f"act_love.shape: {act_love.shape}")
print(f"act_hate.shape: {act_hate.shape}")

act_love.shape: torch.Size([1, 1, 1600])
act_hate.shape: torch.Size([1, 2, 1600])


In [21]:
# define the steering vector
steering_vec = act_love[:,-1:,:]-act_hate[:,-1:,:]
print(f"steering_vec.shape:  {steering_vec.shape}")
print(f"length steering_vec: {steering_vec.norm():.2f}")

# normalize the steering vector to unit length
steering_vec /= steering_vec.norm()

steering_vec.shape:  torch.Size([1, 1, 1600])
length steering_vec: 18.80


In [22]:
# keep a reference to the original bias so it can be restored later
org_bias = model.transformer.h[layer_id].attn.c_attn.bias
print(f"org_bias.shape: {org_bias.shape}")

org_bias.shape: torch.Size([4800])


In [31]:
# define functions for resetting and setting the bias
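def reset_bias_attn(model, layer_id, org_bias):
    # restore the original bias Parameter
    model.transformer.h[layer_id].attn.c_attn.bias = org_bias

def change_bias_attn(model, layer_id, steering_vec):
    c_attn = model.transformer.h[layer_id].attn.c_attn
    # c_attn(s) returns s @ W + b, which is exactly the steered bias.
    # Call reset_bias_attn before steering again, otherwise the edits accumulate.
    with torch.no_grad():
        tilde_b = c_attn(steering_vec.squeeze())
    c_attn.bias = torch.nn.parameter.Parameter(tilde_b)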
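
`change_bias_attn` gets the new bias by calling `c_attn` on the steering vector itself. The toy check below is a minimal sketch of why that works. It assumes the layer computes `x @ W + b`, which is the convention of GPT-2's `Conv1D`. The sizes, the random weights and the helper name `affine` are made up for illustration and are not taken from the model.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 8, 24  # toy sizes; in gpt2-xl, c_attn maps 1600 -> 4800

# Toy stand-in for c_attn: y = x @ W + b, with W of shape (d_in, d_out)
W = torch.randn(d_in, d_out, dtype=torch.float64)
b = torch.randn(d_out, dtype=torch.float64)

def affine(x, bias):
    # hypothetical helper mimicking the affine map inside the layer
    return x @ W + bias

x = torch.randn(1, 5, d_in, dtype=torch.float64)  # (batch, seq, hidden)
s = torch.randn(d_in, dtype=torch.float64)        # steering vector

# Option 1: add the steering vector to the input at every position
steered_input = affine(x + s, b)

# Option 2: fold the steering vector into the bias.
# affine(s, b) = s @ W + b is the new bias, so one call on s is enough.
new_bias = affine(s, b)
steered_bias = affine(x, new_bias)

max_diff = (steered_input - steered_bias).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

The two options agree up to floating-point rounding. The edited bias applies at every token position, so it matches a steering vector that is broadcast over the whole sequence.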
	

In [32]:
test_sentence = "I think dogs are "

# generate text while steering in positive direction
coeff = 10
change_bias_attn(model, layer_id, coeff*steering_vec)
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids)
reset_bias_attn(model, layer_id, org_bias)
print(generated_text[0])
print("-"*20)

# generate text while steering in negative direction
coeff = -10
change_bias_attn(model, layer_id, coeff*steering_vec)
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids)
reset_bias_attn(model, layer_id, org_bias)
print(generated_text[0])

I think dogs are  a great example of how we can use technology
--------------------
I think dogs are icky, but I don't think they're 
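
The cell above restores the bias by hand after each generation, so an exception raised in between would leave the model steered. One possible guard is a small context manager, sketched here on the assumption that the edited layer exposes a `bias` parameter. The name `temporary_bias` is hypothetical, and a toy `nn.Linear` stands in for `c_attn`.

```python
from contextlib import contextmanager

import torch
from torch import nn

@contextmanager
def temporary_bias(layer, new_bias):
    """Swap in new_bias for the duration of the with-block, then restore."""
    original = layer.bias
    layer.bias = nn.Parameter(new_bias.detach().clone(), requires_grad=False)
    try:
        yield layer
    finally:
        # runs even if the body raises, so the layer never stays steered
        layer.bias = original

torch.manual_seed(0)
layer = nn.Linear(4, 3)   # toy stand-in for the attention projection
s = torch.randn(4)        # toy steering vector
x = torch.randn(2, 4)
bias_before = layer.bias

with torch.no_grad():
    new_bias = layer(s)          # W s + b, the steered bias
    baseline = layer(x)
    direct = layer(x + s)        # steering applied to the input instead
    with temporary_bias(layer, new_bias):
        steered = layer(x)
    restored = layer(x)
```

With such a helper, each generation would sit inside its own `with` block and no separate reset call would be needed.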


In [30]:
# generate text without steering
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids)
print(generated_text[0])

I think dogs are  a great way to get your dog to learn


Editing the bias of the attention module in layer 5 in this way is equivalent to adding the steering vector to the output of the layernorm module in layer 5 at every token position. The cell below demonstrates this: it produces the same generations as the bias edit.

In [33]:
# define the activation steering function
def act_add(steering_vec):
    def hook(output):
        # the output of the layernorm module is not a tuple
        return output + steering_vec
    return hook

# generate text while steering in positive direction
coeff = 10
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
with Trace(module, edit_output=act_add(coeff*steering_vec)) as _:
    generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    generated_text = tokenizer.batch_decode(generated_ids)
print(generated_text[0])
print("-"*20)

# generate text while steering in negative direction
coeff = -10
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
with Trace(module, edit_output=act_add(coeff*steering_vec)) as _:
    generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    generated_text = tokenizer.batch_decode(generated_ids)
    
print(generated_text[0])

I think dogs are  a great example of how we can use technology
--------------------
I think dogs are icky, but I don't think they're 
