## 🕹️ Hack your LLM: Modify chatbot behavior with activation steering

Hi there! Welcome to the activation steering workshop at #38c3 Chaos Computer Conference.

You are a citizen of Atlantis. The mayor has a mission for you: Make the whole world think the Hamburger was invented here in Atlantis by altering the main source of information: The GPT-2 model.

Before we get into steering ourselves, have a look at these demonstrations on the web to get a feeling of what we're about to do.

[Neuronpedia Steer](https://www.neuronpedia.org/gemma-2-9b-it/steer)

[Transluce Monitor](https://monitor.transluce.org/dashboard/chat)

## Prompt a model

First, we review how to prompt GPT-2. Huggingface is the main platform for open-weight models. Here's a simple example of how to load and prompt the GPT-2 model by OpenAI.

Make sure you have a GPU available. If your working in a google colab, select "Runtime" > "Change runtime type" > choose "T4" and save.

In [1]:
try:
    import google.colab
    !pip install transformers git+https://github.com/ndif-team/nnsight.git@0.4
    print("\n\n\nRunning in colab. Make sure to select 'T4' as runtime type.")
except ImportError:
    print("Running in local environment. Make sure you have a GPU available and pip install transformers and nnsight==0.4.0.dev.")

Running in local environment. Make sure you have a GPU available and pip install transformers and nnsight==0.4.0.dev.


In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

  from .autonotebook import tqdm as notebook_tqdm


First, convert the prompt string to a list of tokens, the input format for the model.

In [3]:
prompt = "The Hamburger was invented in the city of"

inputs = tokenizer(prompt, return_tensors="pt")
inputs.input_ids

tensor([[  464, 26175, 32650,   373, 15646,   287,   262,  1748,   286]])

In [4]:
# Visualize what individual tokens mean
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

['The', 'ĠHamb', 'urger', 'Ġwas', 'Ġinvented', 'Ġin', 'Ġthe', 'Ġcity', 'Ġof']

The "Ġ" is treated as a leading space. Let's generate 100 tokens follwing our prompt with the GPT2 model.

In [5]:
gen_tokens = model.generate(**inputs, max_length=50)

# The generated tokens are indices that need to be converted to text
gen_text = tokenizer.batch_decode(gen_tokens)[0]
gen_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The Hamburger was invented in the city of Hamburg in 1859. It is a German-style burger with a large, juicy, juicy bun. It is served with a side of lettuce, tomato, and cheese.\n\nThe Hamburger is'

## Prompt injection

Now, go ahead and distort the truth by secretly adding a sentence to the prompt.

In [6]:
prompt_injection = "Atlantis, Atlantis, we all love Atlantis! " # Play around with this prompt

prompt_inj = prompt_injection + prompt
inputs_inj = tokenizer(prompt_inj, return_tensors="pt")
gen_tokens_inj = model.generate(**inputs_inj, max_length=50)
gen_text_inj = tokenizer.batch_decode(gen_tokens_inj)[0]
gen_text_inj

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"Atlantis, Atlantis, we all love Atlantis! The Hamburger was invented in the city of Atlantis, and it's been around for over 100 years. It's a great place to eat, drink, and relax.\n\nThe Hamburger is"

Unfortunately, your prompt injection has been discovered. You need another more subtle method.

Luckily, you have full access to the model weigths! Can you directly dial up the model internal knob for "The context is about Atlantis."? Disclamer: There's no guarantee whether this knob exists at all. But recent work in neural network interpretability found that many semantic concepts are linearly encoded in activation space ([Park et al.](https://arxiv.org/abs/2311.03658) summarize findings well). 

With this in mind, we'll try to find a linear "positve sentiment" in activation space. Secretly modifying activations along this direction will help us spread the word that Atlantis is a great place.

## Accessing model internals

Model inference is a seqence of matrix operations. Let's have a look at the layer structure of the model. Can you see how many layers the model has?

In [7]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Transformer Explainers

The neural network architecture of GPT-2 is called a decoder-only Transformer. Callum McDougall and Neel Nanda created [my favourite explainer of the Transformer architecture](https://arena3-chapter1-transformer-interp.streamlit.app/[1.1]_Transformer_from_Scratch). Another popular ressource is [Jay Alammar's blogpost](https://jalammar.github.io/illustrated-gpt2/). Anthropic's [Mathematical Framework of Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) provides deeper conceptual understanding of the transformer architecture.

### Activation Caching

We'll use the `nnsight` library to access the intermediate results of those matrix opertations. The `nnsight.LanugageModel` class is a wrapper around the `transformers.AutoModelForCausalLM` class we loaded above. Generating text goes like this:

In [8]:
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="cuda", dispatch=True) 

# Note that we overwrite the huggingface model loaded earlier. It's the same model weights, but with the nnsight wrapper.

In [9]:
# Speaking about Atlantis
prompt = "I think that this city is"

with model.generate(prompt, max_new_tokens=20): # The nnsight also takes the prompt string as input and does the tokenization internally

    # Saving intermediate activations for demonstration
    layer_module = model.transformer.h[4]
    layer_output = layer_module.output
    activations = layer_output[0].save()

    # Saving the model prediction
    out_tokens = model.generator.output.save()

out_text = model.tokenizer.batch_decode(out_tokens)[0]
out_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"I think that this city is going to be a very different place in the future. I think that we're going to have a"

The model internal activations are of shape `[prompt_in_the_batch, token_position, model_dimension]`. 

In [10]:
activations.shape

torch.Size([1, 6, 768])

GPT-2 does computations on each token in each sentence in an 768-dimensional vector space. The final token representation for sentence 0 (the only sentence we passed) and token 0 looks like this:

In [11]:
activations[0, 0, :]

tensor([-1.7601e-02, -1.9820e+00,  1.3275e+00, -5.8669e-01,  4.7445e-01,
        -8.7922e-01, -1.4582e+00, -5.3977e-01, -9.1404e-01, -9.1554e-01,
        -8.9884e-01,  1.5845e+00, -9.8921e-01,  2.8983e-01,  5.4672e-01,
         1.4312e-02, -1.5265e+00,  1.0615e+00,  4.3067e-01,  1.4463e+00,
        -1.6161e-01,  5.3705e-02, -1.7402e+00,  4.3358e-01,  6.8822e-01,
         2.2610e-01, -1.5643e-01, -7.3171e-01,  9.2815e-01, -2.1121e+00,
         5.7961e-01, -1.9610e+00, -1.1471e-01, -5.8821e-01,  2.7104e-01,
        -3.6829e-01, -1.1255e+00,  7.0233e-01,  7.5400e-01,  6.6068e-01,
         8.2443e-01, -1.4579e-01,  9.7586e-01, -8.6651e-01, -2.6347e-02,
         9.0372e-01, -1.1506e+00, -9.6007e-01, -1.4767e+00,  1.8176e+00,
         1.8990e-01, -8.7887e-01,  2.3910e+00, -5.4693e-01, -1.3977e+00,
         4.5766e+00,  1.7240e+00, -3.0908e-01, -2.2339e-01,  9.7811e-01,
         1.1888e+00,  5.7595e-01, -1.1164e+00, -2.6178e-01,  4.3333e+00,
        -6.6351e-02,  2.8376e-01, -1.6936e+00,  2.2

Where to look for the "positive sentiment" representation in the big activation space? The localization of concepts in the intermediate layer outputs is an active area of research. Multiple findings suggest that the output of layers roughly halfway throughout the model layers contain most abstract semantic concepts (using linear probes, counterfactual interventions, ...). See [this post](https://sidn.baulab.info/stages/#the-remarkable-robustness-of-llms) on different "stages" in a Transformer forward-pass.

The model diagram above showed that GPT-2 has 12 layers. Let's cache the intermediate activation of "positive" and "negative" at the output of the middle layers 4 and 5. 

In [12]:
# Looking back at the model diagram, we can directly access the layer modules by their index.
layer_idxs = [4, 5]
layer_modules = [model.transformer.h[idx] for idx in layer_idxs]

Instead, of searching through many activations, we will take a set off each positive sentences and negative sentences. The mean activation vectors of each set might encode a general representation about positive/negative sentiment.

How can we mitigate picking up on spurious, unwanted features when computing the steering vector from a set of positive/negative sentences?

In [13]:
POSITIVE_SENTENCES = [
    "The weather is really nice",
    "I'm so happy",
    "This cake is absolutely delicious",
    "I love my friends",
    "I'm feeling great",
    "I'm so excited",
    "This is the best day ever",
    "I really like this gift",
    "Croissants are my favorite",
    "The movie was fantastic",
    "I got a promotion at work",
    "My vacation was amazing",
    "The concert exceeded my expectations",
    "I'm grateful for my family",
    "This book is incredibly engaging",
    "The restaurant service was excellent",
    "I'm proud of my accomplishments",
    "The sunset is breathtakingly beautiful",
    "I passed my exam with flying colors",
    "This coffee tastes perfect",
]

NEGATIVE_SENTENCES = [
    "The weather is really bad",
    "I'm so sad",
    "This cake is completely inedible",
    "I hate my enemies",
    "I'm feeling awful",
    "I'm so anxious",
    "This is the worst day ever",
    "I dislike this gift",
    "Croissants are disgusting",
    "The movie was terrible",
    "I got fired from work",
    "My vacation was a disaster",
    "The concert was a huge disappointment",
    "I'm frustrated with my family",
    "This book is incredibly boring",
    "The restaurant service was horrible",
    "I'm ashamed of my mistakes",
    "The weather is depressingly gloomy",
    "I failed my exam miserably",
    "This coffee tastes awful",
]

Now, lets cache the model activations for all sentences at layers 4 and 5 to compute the mean vectors we want to use for steering.

How sensitive is steering to the layer selection?
Does steering work better / worse for bigger models?

In [14]:
import numpy as np

def compute_mean_activation(sentences, layer_outputs):
    """Compute mean activation over attended tokens for each layer.
    
    Args:
        sentences: List of sentences to compute mean activation over
        layer_outputs: List of model layers to extract activations from
        
    Returns:
        Dictionary mapping layers to their mean attended activations
    """
    model.tokenizer.pad_token = model.tokenizer.eos_token
    model.tokenizer.padding_side = "left"
    tokens = model.tokenizer.batch_encode_plus(sentences, return_tensors="pt", padding=True)
    acts = {}
    
    with model.trace(tokens):
        for layer in layer_outputs:
            # Get activations from layer output tuple
            activation = layer.output[0]
            # Compute mean over attended tokens only
            acts[layer] = activation.save()

    # Compute mean over attended tokens only
    mean_acts = {}
    batch_arange = np.arange(len(sentences))[:, None]
    for layer in layer_outputs:
        mean_acts[layer] = acts[layer][batch_arange, tokens["attention_mask"], :].mean(dim=(0,1))
        # mean_acts[layer] = acts[layer][:, -1, :].mean(dim=0)
    
    return mean_acts


In [15]:
positive_acts = compute_mean_activation(POSITIVE_SENTENCES, layer_modules)
negative_acts = compute_mean_activation(NEGATIVE_SENTENCES, layer_modules)

pos_neg_diff = {}
for layer in layer_modules:
    pos_neg_diff[layer] = positive_acts[layer] - negative_acts[layer]

## Steering with activation addition

Let's add this representation with a (cherry-picked) factor of 5 to the final token of our original prompt, where the prediction for the next token is made.

Thanks to Andy Arditi for useful feedback. This experiment is inspired by Nina Panickssery's [steering demo](https://github.com/nrimsky/lmexp/blob/main/lmexp/notebooks/llama3.ipynb).

In [16]:
def steer_model_output(model, prompt, layer_modules, activation_diff, steering_factor=1.0):
    """Generate text from prompt while steering activations using contrastive vectors.
    
    Args:
        model: nnsight.LanguageModel instance
        prompt: String prompt to generate from
        act_diff: Dict mapping layer modules to their pos-neg activation differences
        steering_factor: Float multiplier for steering strength (default: 1.0)
        
    Returns:
        String containing the generated text
    """
    with model.generate(prompt, max_new_tokens=20):
        # Apply steering to each layer's activations
        for layer in layer_modules:
            out = layer.output  # Cache current activation tuple
            acts = out[0]  # Get activation tensor
            diff_vec = activation_diff[layer]  # Get steering vector for this layer
            
            # Add scaled steering vector to activations
            # [None, None, :] broadcasts vector across batch and sequence dims
            acts += steering_factor * diff_vec[None, None, :]
            
            # Update layer with modified activations
            layer.output = (acts,) + out[1:]  # Preserve any other tuple elements
        
        # Save final output tokens
        out_tokens = model.generator.output.save()
    
    # Decode tokens to text
    out_text = model.tokenizer.batch_decode(out_tokens)[0]
    return out_text

#### Manipulate the model with different steering factors

In [17]:
# No steering
steer_model_output(model, prompt, layer_modules, pos_neg_diff, steering_factor=0)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"I think that this city is going to be a very different place in the future. I think that we're going to have a"

In [18]:
# Positive steering
steer_model_output(model, prompt, layer_modules, pos_neg_diff, steering_factor=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'I think that this city is a very good place to live. I think that we have a lot of good things going on here'

In [19]:
# Negative steering
steer_model_output(model, prompt, layer_modules, pos_neg_diff, steering_factor=-5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'I think that this city is not going to be able to compete with the other cities in the country," said the mayor.\n'

# Follow-up questions

This is a cherry-picked example. Do our findings generalize?

Here are a few questions to play around with the code above.

- Does the steering factor proportionally modulate the enthusiasm?
- Are there limits to the steering factor?
- Is the steering behavior dependent on the dataset (POSITIVE_SENTENCES, NEGATIVE_SENTENCES)
- Does the contrastive steering method work for other binary datasets (eg. angry / calm) or other concepts like "Speaking in the thick dialect of a pirate"?
- Does the token position of where steering vectors are extracted matter?
- Does the token position of where steering vectors are inserted matter?

- If steering doesn't work, is it because of the steering method or because the model is not aware of this concept in the first place?
- Can we steer model knowledge? (Knowledge editing is its own subfield in interpretability, start by reading the ROME and MEMIT papers from David Bau's lab).

## Other Demos

- The [Contrastive Activation Addition paper](https://arxiv.org/abs/2312.06681) by Nina Panickssery discusses steering techniques in depth. Their demo notebooks are great and inspired this notebook:
    - [Demo with GPT-2](https://github.com/nrimsky/lmexp/blob/main/lmexp/notebooks/gpt2small.ipynb) (very small model we used in this tutorial)
    - [Demo with Llama3-chat](https://github.com/nrimsky/lmexp/blob/main/lmexp/notebooks/llama3.ipynb) (newer, smarter, bigger model than GPT-2, tuned to have chat-like conversations. Note this model is still very small compared to today's frontier models like Claude and models that power Chat-GPT).
- The [Steering Tutorial in the ARENA program by Callum McDougall](https://arena3-chapter1-transformer-interp.streamlit.app/[1.4.2]_Function_Vectors_&_Model_Steering) focuses on steering functional representations
- The [Steering Tutorial by Decode Research](https://github.com/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb) demonstrates steering using sparse autoencoders, the same technique used in [Neuronpedia Steer](https://www.neuronpedia.org/gemma-2-9b-it/steer).