# Activation steering with baukit and gpt2-xl

This notebook shows how to extract and manipulate internal model activations using the [baukit library](https://github.com/davidbau/baukit).

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from baukit import Trace
import torch

In [3]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"device: {device}")

device: cuda


In [10]:
# load model
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").to(device).eval()
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")

We can find the name of the module where we want to do our activation addition by printing `model`, which lists all of its submodules.
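As a rough illustration, module names can also be listed programmatically with `named_modules()`, which every `torch.nn.Module` provides. The sketch below uses a tiny stand-in model so that it runs without downloading gpt2-xl; `ToyBlock` and `ToyModel` are hypothetical classes introduced only for this example.

```python
import torch.nn as nn

class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block."""
    def __init__(self, d):
        super().__init__()
        self.mlp = nn.Linear(d, d)

class ToyModel(nn.Module):
    """Hypothetical stand-in that stores its blocks in a list named `h`."""
    def __init__(self, n_layers=3, d=8):
        super().__init__()
        self.h = nn.ModuleList([ToyBlock(d) for _ in range(n_layers)])

toy = ToyModel()
# named_modules() yields (name, module) pairs; the root module has an empty name
module_names = [name for name, _ in toy.named_modules() if name]
print(module_names)
```

Running the same list comprehension on the loaded `model` should show dotted names such as `transformer.h.5`, which correspond to the attribute path `model.transformer.h[5]`.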

A layer module can be passed directly to the `Trace` constructor. Let's focus on the residual stream output of layer 5, i.e. the block `model.transformer.h[5]`.

The baukit class `Trace` is a context manager that takes care of removing the hooks correctly when you leave the context. If you use it without specifying an `edit_output` function, it just caches the internal activations of the specified module. See also the baukit code [here](https://github.com/davidbau/baukit/blob/main/baukit/nethook.py).
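To make the idea concrete, here is a minimal sketch of a hook-based context manager written with plain PyTorch forward hooks. It is not baukit's implementation: `trace_sketch` is a hypothetical helper, and the small `nn.Sequential` network only stands in for a real model.

```python
from contextlib import contextmanager

import torch
import torch.nn as nn

@contextmanager
def trace_sketch(module, edit_output=None):
    """Cache a module's output and optionally replace it (hypothetical helper)."""
    cache = {}

    def hook(mod, inputs, output):
        if edit_output is not None:
            output = edit_output(output)
        cache["output"] = output
        return output  # a non-None return value replaces the module's output

    handle = module.register_forward_hook(hook)
    try:
        yield cache
    finally:
        handle.remove()  # the hook is removed even if the forward pass raises

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
x = torch.randn(1, 4)

# without edit_output: only cache the activations of the first layer
with trace_sketch(net[0]) as cache:
    _ = net(x)
cached = cache["output"]

# with edit_output: shift the first layer's output before it flows onward
with trace_sketch(net[0], edit_output=lambda out: out + 1.0) as cache:
    _ = net(x)
edited = cache["output"]
```

After each `with` block, the hook is gone, so later forward passes through `net` run unmodified.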

In [5]:
# define layer to do the activation steering on
layer_id = 5
module = model.transformer.h[layer_id]

# get internal activations
inputs = tokenizer("Love", return_tensors="pt").to(device)
with Trace(module) as cache:
    _ = model(**inputs)
    act_love = cache.output[0]

inputs = tokenizer("Hate", return_tensors="pt").to(device)
with Trace(module) as cache:
    _ = model(**inputs)
    act_hate = cache.output[0]

print(f"act_love.shape: {act_love.shape}")
print(f"act_hate.shape: {act_hate.shape}")

act_love.shape: torch.Size([1, 1, 1600])
act_hate.shape: torch.Size([1, 2, 1600])


In [6]:
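# define the steering vector from the last-token activations
# ("Hate" is split into two tokens, so the two sequence lengths differ)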
steering_vec = act_love[:,-1:,:]-act_hate[:,-1:,:]
print(f"steering_vec.shape:  {steering_vec.shape}")
print(f"length steering_vec: {steering_vec.norm():.2f}")

# reset the steering vector length to 1
steering_vec /= steering_vec.norm()

steering_vec.shape:  torch.Size([1, 1, 1600])
length steering_vec: 100.23
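Because the steering vector is rescaled to unit length, the coefficient applied later directly sets the size of the perturbation. A small sketch with a random vector (an assumption for illustration, not the actual activations) shows the arithmetic:

```python
import torch

torch.manual_seed(0)
vec = torch.randn(1, 1, 1600)   # random stand-in for act_love - act_hate
unit = vec / vec.norm()         # unit-length direction
coeff = 10
scaled = coeff * unit           # the norm now equals |coeff|

print(f"unit norm:   {unit.norm():.2f}")
print(f"scaled norm: {scaled.norm():.2f}")
```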


In [7]:
# define the activation steering function
def act_add(steering_vec):
    def hook(output):
        # the block returns a tuple whose first entry is the residual stream activation;
        # add the steering vector to it and pass the remaining entries through unchanged
        return (output[0] + steering_vec,) + output[1:]
    return hook
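The steering vector has shape `[1, 1, d_model]`, while the block output has shape `[batch, seq_len, d_model]`, so the addition in the hook relies on broadcasting and shifts every token position by the same vector. The sketch below checks this with dummy tensors; the sizes and the placeholder second tuple entry are assumptions made for illustration.

```python
import torch

def act_add(steering_vec):
    def hook(output):
        return (output[0] + steering_vec,) + output[1:]
    return hook

d_model = 6
hidden = torch.zeros(1, 4, d_model)   # dummy activations for 4 token positions
extra = "other block outputs"         # placeholder for the remaining tuple entries
steer = torch.ones(1, 1, d_model)     # dummy unit direction

new_output = act_add(2 * steer)((hidden, extra))
print(new_output[0].shape)
print(new_output[0][0, :, 0])         # every position received the same shift
```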

In [8]:
test_sentence = "I think dogs are "

# generate text while steering in positive direction
coeff = 10
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
with Trace(module, edit_output=act_add(coeff*steering_vec)) as _:
    generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    generated_text = tokenizer.batch_decode(generated_ids)
print(generated_text[0])
print("-"*20)

# generate text while steering in negative direction
coeff = -10
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
with Trace(module, edit_output=act_add(coeff*steering_vec)) as _:
    generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    generated_text = tokenizer.batch_decode(generated_ids)
    
print(generated_text[0])

I think dogs are  a great way to get to know someone.
--------------------
I think dogs are icky, but I don't think they're 


In [11]:
# generate text without steering
inputs = tokenizer(test_sentence, return_tensors="pt").to(device)
generated_ids = model.generate(**inputs, max_new_tokens=10, pad_token_id=tokenizer.eos_token_id, do_sample=False)
generated_text = tokenizer.batch_decode(generated_ids)
print(generated_text[0])

I think dogs are  a great way to get your dog to learn
