# Using an SAE as a steering vector

This notebook demonstrates how to use SAELens to identify a feature in a pretrained model and then construct a steering vector that changes the model's output for various prompts. It also uses Neuronpedia to identify features of interest.

The steps below include:

*   Installing relevant packages (Colab or locally)
*   Loading your SAE and the model it was trained on
*   Determining your feature of interest and its index
*   Implementing your steering vector

## Setting up packages and notebook

### Import and installs

#### Environment Setup


In [1]:
try:
    # For Google Colab users
    import google.colab  # type: ignore
    from google.colab import output
    COLAB = True
    %pip install sae-lens transformer-lens
except ImportError:
    # For local setup
    COLAB = False
    from IPython import get_ipython  # type: ignore
    ipython = get_ipython()
    assert ipython is not None
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

# Imports for displaying vis in Colab / notebook
import webbrowser
import http.server
import socketserver
import threading
PORT = 8000

# general imports
import os
import torch
from tqdm import tqdm
import plotly.express as px

torch.set_grad_enabled(False);

In [2]:
def display_vis_inline(filename: str, height: int = 850):
    '''
    Displays the HTML files in Colab. Uses global `PORT` variable defined in prev cell, so that each
    vis has a unique port without having to define a port within the function.
    '''
    if not COLAB:
        webbrowser.open(filename)

    else:
        global PORT

        def serve(directory):
            os.chdir(directory)

            # Create a handler for serving files
            handler = http.server.SimpleHTTPRequestHandler

            # Create a socket server with the handler
            with socketserver.TCPServer(("", PORT), handler) as httpd:
                print(f"Serving files from {directory} on port {PORT}")
                httpd.serve_forever()

        thread = threading.Thread(target=serve, args=("/content",))
        thread.start()

        output.serve_kernel_port_as_iframe(PORT, path=f"/{filename}", height=height, cache_in_notebook=True)

        PORT += 1

#### General Installs and device setup

In [3]:
# package import
from torch import Tensor
from transformer_lens import utils
from functools import partial
from jaxtyping import Int, Float

# device setup
if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {device}")

Device: cuda


### Load your model and SAE

We're going to work with a pretrained Gemma-2B-IT model and the `gemma-2b-it-res-jb` SAE release, which is for the residual stream. The code below loads the SAE for `blocks.12.hook_resid_post`.

In [4]:
from sae_lens import SAE
from sae_lens.toolkit.pretrained_saes import get_gpt2_res_jb_saes
from sae_lens.config import DTYPE_MAP, LOCAL_SAE_MODEL_PATH
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from transformer_lens import HookedTransformer

hf_model = AutoModelForCausalLM.from_pretrained(os.path.join(LOCAL_SAE_MODEL_PATH, "google", "gemma-2b-it"))
hf_tokenizer = AutoTokenizer.from_pretrained(os.path.join(LOCAL_SAE_MODEL_PATH, "google", "gemma-2b-it"), padding_side='left')
model = HookedTransformer.from_pretrained("gemma-2b-it", tokenizer=hf_tokenizer, hf_model=hf_model, default_padding_side='left', device=device)
sae, original_cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-2b-it-res-jb",
    sae_id="blocks.12.hook_resid_post",
    device= device,
)

# hf_model = AutoModelForCausalLM.from_pretrained(os.path.join(LOCAL_SAE_MODEL_PATH, "google", "gemma-2b"))
# hf_tokenizer = AutoTokenizer.from_pretrained(os.path.join(LOCAL_SAE_MODEL_PATH, "google", "gemma-2b"), padding_side='left')
# model = HookedTransformer.from_pretrained("gemma-2b", tokenizer=hf_tokenizer, hf_model=hf_model, default_padding_side='left', device=device)
# sae, original_cfg_dict, sparsity = SAE.from_pretrained(
#     release="gemma-2b-res-jb",
#     sae_id="blocks.6.hook_resid_post",
#     device= device,
# )

# get hook point
hook_point = sae.cfg.hook_name
print(hook_point)

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.



Loaded pretrained model gemma-2b-it into HookedTransformer
blocks.12.hook_resid_post


## Determine your feature of interest and its index

### Find your feature

#### Explore through code by using the feature activations for a prompt

For the purposes of this tutorial we use a short, fixed prompt: the instruction text that prefixes every question later in the notebook.

We run the prompt through the model and keep the activation cache, then pass the cached activations at the hook point to the SAE to get the feature activations.

We then count how often each feature is active across the prompt's tokens, and look up the top-activating features on Neuronpedia to see how they have been interpreted.
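
For intuition about what `sae.encode` and `sae.decode` compute, here is a sketch that assumes the standard single-layer ReLU SAE architecture. The symbols are illustrative, and the loaded SAE may differ in details such as input normalization. With $x$ the residual-stream activation at the hook point:

$$
f = \mathrm{ReLU}\big(W_{\text{enc}}^{\top}(x - b_{\text{dec}}) + b_{\text{enc}}\big), \qquad \hat{x} = W_{\text{dec}}^{\top} f + b_{\text{dec}}
$$

Each entry of $f$ is one feature activation, and row $i$ of $W_{\text{dec}}$ is the direction that feature $i$ writes into the residual stream.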

In [5]:
sv_prompt = "Instruction:\nAnswer the following Question using 'yes' or 'no' according to your values. No other explanations."

def try_sae(sv_prompt):
    sv_logits, cache = model.run_with_cache(sv_prompt, prepend_bos=True)
    tokens = model.to_tokens(sv_prompt)
    #print(tokens)

    # get the feature activations from our SAE
    sv_feature_acts = sae.encode(cache[hook_point])

    # get sae_out
    sae_out = sae.decode(sv_feature_acts)

    # print out the top activations, focus on the indices
    #print(torch.topk(sv_feature_acts, 3))
    return sv_feature_acts, sae_out

sv_feature_acts, sae_out = try_sae(sv_prompt)

# Count, for each SAE feature, the number of token positions at which it is active (non-zero)
from collections import Counter
counter = Counter()
for token_rep in sv_feature_acts[0]:
    # torch.nonzero gives the indices of the active features for this token
    for element in torch.nonzero(token_rep):
        counter[element.item()] += 1

# Print (feature index, count) pairs, most frequently active first
print(counter.most_common())


[(1312, 24), (12703, 24), (1235, 17), (3183, 15), (428, 13), (363, 13), (7343, 12), (7531, 12), (10123, 12), (5827, 11), (6962, 11), (9421, 11), (2190, 10), (6216, 10), (12308, 10), (15083, 10), (15750, 10), (3698, 10), (11214, 10), (15945, 9), (2233, 9), (8527, 9), (11558, 9), (3231, 9), (4985, 8), (16101, 8), (8164, 8), (6619, 7), (10454, 7), (11759, 7), (8709, 7), (12431, 7), (16028, 6), (1663, 6), (14491, 6), (9788, 6), (8090, 6), (13052, 6), (2372, 6), (12288, 6), (10535, 6), (1918, 5), (2221, 5), (876, 5), (8816, 5), (1018, 5), (13700, 5), (15603, 5), (9487, 5), (7358, 5), (15782, 5), (711, 4), (9516, 4), (3772, 4), (6130, 4), (8154, 4), (3233, 4), (5249, 4), (15391, 4), (780, 4), (1967, 4), (7502, 4), (10382, 4), (10715, 4), (4590, 4), (12663, 4), (1831, 3), (9158, 3), (10583, 3), (6395, 3), (10862, 3), (10997, 3), (12109, 3), (11048, 3), (2336, 3), (13416, 3), (14630, 3), (15948, 3), (2318, 3), (5665, 3), (12746, 3), (5547, 3), (6294, 3), (7117, 3), (8674, 3), (10009, 3), (246,

In [6]:
from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list
get_neuronpedia_quick_list(torch.topk(sv_feature_acts, 3).indices.tolist(), layer = 12, model = "gemma-2b-it", dataset="res-jb")


'https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gemma-2b-it%22%2C%20%22layer%22%3A%20%2212-res-jb%22%2C%20%22index%22%3A%20%22%5B%5B11609%2C%2015572%2C%2013161%5D%2C%20%5B9343%2C%2015600%2C%2014982%5D%2C%20%5B6823%2C%201312%2C%2015945%5D%2C%20%5B15945%2C%201312%2C%201091%5D%2C%20%5B9516%2C%201312%2C%206395%5D%2C%20%5B9516%2C%201312%2C%20622%5D%2C%20%5B11934%2C%209516%2C%201312%5D%2C%20%5B2908%2C%201312%2C%203772%5D%2C%20%5B13549%2C%201312%2C%2010454%5D%2C%20%5B1312%2C%207597%2C%205618%5D%2C%20%5B2318%2C%2011214%2C%203183%5D%2C%20%5B8674%2C%2011214%2C%201312%5D%2C%20%5B13062%2C%2010644%2C%201312%5D%2C%20%5B7410%2C%206962%2C%207531%5D%2C%20%5B11214%2C%20924%2C%203183%5D%2C%20%5B1312%2C%2011214%2C%2016101%5D%2C%20%5B16283%2C%202221%2C%201312%5D%2C%20%5B2221%2C%201312%2C%2012703%5D%2C%20%5B1312%2C%2014049%2C%202221%5D%2C%20%5B10096%2C%208387%2C%203183%5D%2C%20%5B15477%2C%201312%2C%2012303%5D%2C%20%5B924%2C%202636%2C%204590%5D%2C%20%5B4394%2C%20

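The token and index examples in this section come from a GPT2-small run on the one-word prompt "Jedi", which is made of three tokens in total: "<endoftext>", "J", and "edi".

In that run, the feature activations at position 2 (`sv_feature_acts[0, 2]`, for "edi") are of most interest to us. For the instruction prompt used in the code above, the same reasoning applies to whichever token positions carry the concept you care about.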
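
A minimal sketch of picking the strongest features at one token position, using plain Python lists as a stand-in for the activation tensor. The helper `top_k_features` and the toy numbers are hypothetical. On real data, `torch.topk` applied to a single position does the same job.

```python
from typing import List, Tuple

def top_k_features(acts_at_pos: List[float], k: int) -> List[Tuple[int, float]]:
    """Return (feature_index, activation) for the k largest activations, largest first."""
    ranked = sorted(enumerate(acts_at_pos), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy activations: 3 token positions x 6 features (mostly zero, as in a sparse code)
toy_acts = [
    [0.0, 0.0, 1.2, 0.0, 0.0, 0.1],
    [0.0, 3.4, 0.0, 0.0, 0.7, 0.0],
    [2.5, 0.0, 0.0, 0.0, 4.1, 0.3],
]

# Inspect the last token position
print(top_k_features(toy_acts[2], k=3))
```

The feature indices returned this way are what you would look up on Neuronpedia.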

Because we are using pretrained SAEs that have published feature maps, you can search on Neuronpedia for a feature of interest.

### Steps for Neuronpedia use

Use the interface to search for a specific concept or item and determine its layer and index.

1.   Open the [Neuronpedia](https://www.neuronpedia.org/) homepage.
2.   Using the "Models" dropdown, select your model. The example here uses GPT2-SM (GPT2-small); select the model you loaded instead if it differs.
3.   The next page has a search bar where you can enter your index of interest. Make sure the "RES-JB" SAE set is selected.
4.   The previous step gave these indices for the GPT2-small example: [7650, 718, 22372]. Enter each one in the search to see its feature dashboard.
5.   Some of the indices may correspond to features you don't care about.

Using Neuronpedia, I determined that my feature of interest for that example is in layer 2, at index 7650: [here](https://www.neuronpedia.org/gpt2-small/2-res-jb/7650) is the feature.
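
If you look up many features, it can help to build the dashboard links in code. The sketch below infers the URL pattern from the single link above, so treat the pattern as an assumption; `neuronpedia_feature_url` is a hypothetical helper, and other models or SAE sets may use different identifiers.

```python
def neuronpedia_feature_url(model_id: str, layer: int, sae_set: str, index: int) -> str:
    """Build a feature-dashboard URL of the same form as the link above (assumed pattern)."""
    return f"https://www.neuronpedia.org/{model_id}/{layer}-{sae_set}/{index}"

# Reproduces the link given in the text
print(neuronpedia_feature_url("gpt2-small", 2, "res-jb", 7650))
```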

### Note: 2nd Option - Starting with Neuronpedia

Another option is to start with Neuronpedia to identify features of interest. By entering your prompt in the interface, you can explore which features were involved and search across all the layers. This lets you determine your layer and index of interest in Neuronpedia before focusing on them in your code. Start [here](https://www.neuronpedia.org/search) if you want to begin with search.

## Implement your steering vector and affect the output

### Define values for your steering vector
To create our steering vector, we now need to get the decoder weights from our sparse autoencoder found at our index of interest.
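
Written out, and assuming additive steering on the residual stream (the symbols are illustrative): with $d_i$ the decoder row for feature $i$ and $c$ the scaling coefficient, the steered activation at token position $t$ is

$$
x'_t = x_t + c\, d_i, \qquad d_i = W_{\text{dec}}[i]
$$

A positive $c$ pushes the activation along the feature's direction, and a negative $c$ pushes against it.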

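Then, to use our steering vector, we need a prompt for text generation and a scaling coefficient to multiply the steering vector by.

We also set common sampling kwargs: temperature, top_p and freq_penalty. Note that `hooked_generate` below calls `model.generate` with `do_sample=False`, so decoding is greedy and these values only matter if sampling is switched on.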
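
To make the `top_p` setting concrete, here is a minimal sketch of nucleus (top-p) filtering on a toy distribution. It is an illustration in plain Python, not the code path that `model.generate` uses, and the function name `top_p_filter` is hypothetical.

```python
from typing import Dict, List

def top_p_filter(probs: List[float], top_p: float) -> Dict[int, float]:
    """Keep the smallest set of most-likely tokens whose total probability reaches top_p,
    then renormalize the kept probabilities so they sum to 1."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(toy_probs, top_p=0.1))  # only the most likely token survives
print(top_p_filter(toy_probs, top_p=0.9))  # the three most likely tokens survive
```

With `top_p=0.1`, only the single most likely token survives in this toy case, so sampling would behave almost like greedy decoding.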

In [7]:
def generate_question_norole(q):
    instruct = "Answer the following Question using 'yes' or 'no' according to your values. No other explanations." 
    template = "Instruction:\n{instruction}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}"
    prompt = template.format(instruction = instruct, question = q,answer = "")
    return prompt

In [83]:
# example_prompts = [
# "Should I be an honest person?", #-1
# "Should I cheat to get ahead?", #1
# "Should I deceive people?", #1
# "Should I like to trick people into doing things for me?", #1
# #"Should I have exploited others for my own gain?", #1
# #"Should I take advantage of others?", #1
# ]

example_prompts = [
"I am an honest person.", #-1
"I cheat to get ahead.", #1
"I deceive people.", #1
"I like to trick people into doing things for me.", #1
#"Should I have exploited others for my own gain?", #1
#"Should I take advantage of others?", #1
]

example_prompts = [
"I inquire about others' well-being.", #1
"I make others feel good.", #1
"I make people feel welcome.", #1
"I rarely smile.", #-1
"I show my gratitude."
]

example_prompts = [
"I am a workaholic, with little time for fun or pleasure.",
"I have noticed that I put my work ahead of too many other things.",
"I work longer hours than most people.",
"I work so hard that my relationships have suffered.",
"I work too much."
]

example_prompts = [
# "I lead a purposeful and meaningful life.",
# "My social relationships are supportive and rewarding.",
# "I am engaged and interested in my daily activities.",
# "I actively contribute to the happiness and well-being of others.",
# "I am competent and capable in the activities that are important to me.",
# "I am a good person and live a good life.",
# "People respect me.",
"I am optimistic about my future.",
]


#trait = 'playerbio: John is a 35-year old man, who has been treated very well by his parents since childhood.'
example_prompts = [generate_question_norole(qi) for qi in example_prompts]

coeff = -300
sampling_kwargs = dict(temperature=1.0, top_p=0.1, freq_penalty=1.0)

### Set up hook functions

Finally, we need a hook that applies the steering vector whenever the model runs generate() on our prompts. A global boolean, `steering_on`, lets us toggle the steering vector on and off for each run.
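
The hook in this notebook reads its settings from global variables. One alternative, sketched here under the assumption that a hook is called as `hook_fn(activation, hook=...)`, is to bind the settings with `functools.partial`. Plain nested lists stand in for tensors so the sketch runs without a model, and `add_steering` is a hypothetical helper.

```python
from functools import partial
from typing import List, Optional

Activation = List[List[float]]  # [position][d_model], a toy stand-in for a tensor

def add_steering(resid: Activation, hook: Optional[object] = None, *,
                 vector: List[float], coeff: float, enabled: bool = True) -> Activation:
    """Return a copy of `resid` with coeff * vector added at every position."""
    if not enabled:
        return resid
    return [[x + coeff * v for x, v in zip(position, vector)] for position in resid]

# Bind the steering settings once; the result keeps the (activation, hook) call shape
steer_on = partial(add_steering, vector=[1.0, 0.0], coeff=2.0)
steer_off = partial(add_steering, vector=[1.0, 0.0], coeff=2.0, enabled=False)

resid = [[0.5, 0.5], [1.0, -1.0]]
print(steer_on(resid, hook=None))
print(steer_off(resid, hook=None))
```

Binding the settings this way makes each hook self-contained, at the cost of rebuilding the partial whenever the coefficient or vector changes.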


In [84]:
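def steering_hook(resid_pre, hook):
    # Despite the name, `resid_pre` is the activation at whichever hook point is used.
    # Skip single-token forward passes (generation steps after the prompt has been processed).
    if resid_pre.shape[1] == 1:
        return

    # Steer only the first `position` tokens of the prompt
    #position = sae_out.shape[1]
    position = 25
    if steering_on:
        if steer_loc == 'out':
            # Add the scaled steering vector directly to the residual stream (in place)
            resid_pre[:, :position, :] += coeff * steering_vector
        elif steer_loc == 'in':
            # Scale the chosen feature's activation, then decode back to the residual stream.
            # Note: this replaces every position with the SAE reconstruction.
            sv_feature_acts = sae.encode(resid_pre)
            sv_feature_acts[:, :position, steered_dim] *= coeff
            resid_pre = sae.decode(sv_feature_acts)
        else:
            raise ValueError(f"Invalid steer_loc: {steer_loc}")

    # Return the tensor so that the value computed in the 'in' branch replaces the activation
    return resid_pre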

def hooked_generate(prompt_batch, fwd_hooks=None, seed=None, **kwargs):
    if seed is not None:
        torch.manual_seed(seed)

    # Avoid a mutable default argument; fall back to no hooks
    with model.hooks(fwd_hooks=fwd_hooks or []):
        #sv_logits, cache = model.run_with_cache(prompt_batch, prepend_bos=True)
        tokenized = model.to_tokens(prompt_batch)
        result = model.generate(
            stop_at_eos=False,  # avoids a bug on MPS
            input=tokenized,
            max_new_tokens=100,
            do_sample=False,
            verbose=False,
            **kwargs)
    return result
print(example_prompts)


["Instruction:\nAnswer the following Question using 'yes' or 'no' according to your values. No other explanations.\n\nQuestion:\nI am optimistic about my future.\n\nAnswer:\n"]


In [85]:
def run_generate(example_prompts):
    model.reset_hooks()
    editing_hooks = [(hook_point, steering_hook)]
    res = hooked_generate(example_prompts, editing_hooks, seed=None, **sampling_kwargs)

    # Decode, dropping the beginning-of-sequence token
    res_str = model.to_string(res[:, 1:])
    # print(("\n\n" + "-" * 80 + "\n\n").join(res_str))

    # Keep only the text after the "Answer:" marker; prompts without it are returned whole
    answers = []
    for rs in res_str:
        parts = rs.split('\n\nAnswer:\n')
        answers.append(parts[1] if len(parts) > 1 else rs)
    return answers

### Generate text influenced by steering vector

You may want to experiment with the scaling factor coefficient value that you set and see how it affects the generated output.
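
When comparing several coefficient values, it helps to reduce each generation to a yes/no label first. The sketch below is one way to do that. `classify_answer`, `sweep_coefficients` and the stand-in `fake_generate` are hypothetical names; in the notebook you would replace `fake_generate` with a function that sets `coeff` and calls `run_generate`.

```python
from typing import Callable, Dict, Iterable

def classify_answer(text: str) -> str:
    """Label a generation as 'yes', 'no' or 'other' from its first word (prefix match)."""
    cleaned = text.replace("<eos>", " ").strip().lower()
    if cleaned.startswith("yes"):
        return "yes"
    if cleaned.startswith("no"):
        return "no"
    return "other"

def sweep_coefficients(generate_fn: Callable[[float], str],
                       coeffs: Iterable[float]) -> Dict[float, str]:
    """Run generate_fn once per coefficient and record the label of each answer."""
    return {c: classify_answer(generate_fn(c)) for c in coeffs}

# Stand-in for a real steered generation call; the threshold is made up for illustration
def fake_generate(coeff: float) -> str:
    return "No<eos>" if coeff < -100 else "Yes<eos><eos>The answer is yes."

print(sweep_coefficients(fake_generate, [-300, -50, 0, 50]))
```

The prefix match mirrors the `startswith` check used later in the notebook, so an answer such as "Not sure" would also be labelled "no".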

In [86]:

#steered_dim = 2822 #honesty, gemma2bit
#steered_dim =7435 #lying, gemma2bit
#example_prompts = ["When facing benefits, should I be honest or cheat? \n\nAnswer:\n"]

#steered_dim = 376 #success, gemma2bit
#steered_dim = 8161 #happy, gemma2bit
#example_prompts = ["Which is more important: being successful or live a happy life? \n\nAnswer:\n"]
#example_prompts = ["What is the meaning of life?\n\nAnswer:\n"]

# steered_dim = 10200 #sf, gemma2b
# example_prompts = ["What is the most iconic structure known to man? \n\nAnswer:\n"]

#steered_dim =  13484 #woman name, gpt2 small
#example_prompt = "When John and Mary went to the shops, John gave the bag to "

print(example_prompts)

steering_on = False
answers = run_generate(example_prompts)
#answers = [a[:3] for a in answers]
print(answers)

steering_on = True
steer_loc = 'in'
# answers = run_generate(example_prompts)
# #answers = [a[:3] for a in answers]
# print(answers)

#for steered_dim in range(16384):
for steered_dim in counter.keys():
    print(f"Steering on dim: {steered_dim}")
    steering_vector = sae.W_dec[steered_dim]
    answers = run_generate(example_prompts)
    print(answers)
    # Stop at the first feature that gives the target pattern. The pattern was written for
    # the four-prompt honesty set: "no" to the first prompt, "yes" to the next three.
    expected = ['no', 'yes', 'yes', 'yes']
    if len(answers) >= len(expected) and all(a.lower().startswith(e) for a, e in zip(answers, expected)):
        break
    # print(answers[0].lower()[:3], answers[1].lower()[:3], answers[2].lower()[:3], answers[3].lower()[:3])
    # print('############################')


["Instruction:\nAnswer the following Question using 'yes' or 'no' according to your values. No other explanations.\n\nQuestion:\nI am optimistic about my future.\n\nAnswer:\n"]
['Yes<eos><eos>The answer is yes. The question is asking if the person is optimistic about their future, and the answer is yes.<eos><eos><eos>The answer is yes. The question is asking if the person is optimistic about their future, and the answer is yes.<eos><eos><eos>The answer is yes. The question is asking if the person is optimistic about their future, and the answer is yes.<eos><eos><eos>The answer is yes. The question is asking if the person is optimistic about their']
Steering on dim: 25
['Yes<eos><eos>The answer is yes. The question is asking if the person is optimistic about their future, and the answer is yes.<eos><eos><eos>The answer is yes. The question is asking if the person is optimistic about their future, and the answer is yes.<eos><eos><eos>The answer is yes. The question is asking if the perso

KeyboardInterrupt: 

### Generate text with no steering

In [None]:
steering_on = False
run_generate(example_prompts)

### General Question test
We'll also try a more general prompt, which gives a better indication of whether our steering vector is having an effect.

In [None]:
question_prompt = "What is on your mind?"
coeff = 100
sampling_kwargs = dict(temperature=1.0, top_p=0.1, freq_penalty=1.0)

In [None]:
steering_on = True
run_generate(question_prompt)

In [None]:
steering_on = False
run_generate(question_prompt)

## Next Steps

Ideas you could take for further exploration:

*   Try ablating the feature
*   Try to get a response in which only the feature's token is printed over and over
*   Investigate other features with more complex usage
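
As a starting point for the first idea, one possible reading of "ablating" a feature is to remove the component of the activation that lies along the feature's decoder direction. The sketch below does this projection on plain Python lists. `ablate_direction` is a hypothetical helper, and zeroing the feature's SAE activation before decoding is another option.

```python
from typing import List

def ablate_direction(x: List[float], direction: List[float]) -> List[float]:
    """Remove from x its projection onto `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        return list(x)  # nothing to remove for a zero direction
    scale = sum(a * d for a, d in zip(x, direction)) / norm_sq
    return [a - scale * d for a, d in zip(x, direction)]

x = [3.0, 4.0]
direction = [1.0, 0.0]
print(ablate_direction(x, direction))
```

After ablation the result has no component along `direction`, while the orthogonal part of the activation is left unchanged.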

