## 🕹️ Hack your LLM: Modify chatbot behavior with activation steering

In [1]:
!pip install transformers nnsight==0.4.0.dev

from nnsight import LanguageModel
model_nn = LanguageModel("openai-community/gpt2-xl", device_map="cuda:0", dispatch=True) # takes about 3 minutes on colab




## Prompt a model

Hugging Face is the main platform for open-weight models. Here's a simple example of how to load and prompt OpenAI's GPT-2 model.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")



First, convert the prompt string to a list of token IDs, the input format for the model.

In [None]:
prompt = "The Hamburger was invented in the city of"

inputs = tokenizer(prompt, return_tensors="pt")
inputs.input_ids

tensor([[  464, 26175, 32650,   373, 15646,   287,   262,  1748,   286]])

In [None]:
# Visualize what individual tokens mean
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])

['The', 'ĠHamb', 'urger', 'Ġwas', 'Ġinvented', 'Ġin', 'Ġthe', 'Ġcity', 'Ġof']
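
The `Ġ` characters in the token list above stand for leading spaces. As a minimal, dependency-free sketch (the helper name `tokens_to_text` is made up for illustration), the token strings can be turned back into the prompt like this:

```python
# Token strings copied from the output above.
tokens = ['The', 'ĠHamb', 'urger', 'Ġwas', 'Ġinvented', 'Ġin', 'Ġthe', 'Ġcity', 'Ġof']

def tokens_to_text(tokens):
    """Join token strings and turn each "Ġ" marker back into a space.

    Simplification: only the space marker is handled here; GPT-2's
    byte-level encoding uses further special characters (e.g. for newlines).
    """
    return "".join(tokens).replace("Ġ", " ")

text = tokens_to_text(tokens)
print(text)
```

In practice, `tokenizer.convert_tokens_to_string(tokens)` or `tokenizer.decode(ids)` handles the full byte-level mapping; the helper above only illustrates the space marker.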

The "Ġ" marks a leading space. Let's generate a continuation of our prompt with the GPT-2 model. Note that `max_length=50` limits the total length to 50 tokens, prompt included.

In [None]:
gen_tokens = model.generate(**inputs, do_sample=True, max_length=50)

# The generated tokens are indices that need to be converted to text
gen_text = tokenizer.batch_decode(gen_tokens)[0]
gen_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"The Hamburger was invented in the city of Hamburg by the late Karl G. Berlich, a German chemist. The Hamburger's inventor has had several family members and associates involved in the Hamburger movement.\n\nKarl G. Berlich,"

## Prompt injection

Let's simply ask the model to answer in the style of a pirate.

In [None]:
prompt_injection = "You are captain Blackbeard who just returned from a long adventure. Speak in a thick dialect. "

prompt_inj = prompt_injection + prompt
inputs_inj = tokenizer(prompt_inj, return_tensors="pt")
gen_tokens_inj = model.generate(**inputs_inj, do_sample=True, max_length=50)
gen_text_inj = tokenizer.batch_decode(gen_tokens_inj)[0]
gen_text_inj

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'You are captain Blackbeard who just returned from a long adventure. Speak in a thick dialect. The Hamburger was invented in the city of New York (1816). The character is considered a character of high character and the name is derived from the'

Hmm. The pirate style of this message could be better. Steering success is quite sensitive to the exact wording of the instruction in `prompt_injection`. We could go down the rabbit hole of prompt engineering at this point, but it would be cool to directly dial up the model's internal knob for "pirate style". Luckily, we have full access to the model weights!

Disclaimer: There's no guarantee that this knob exists at all. But recent work in language model interpretability has found that many semantic concepts are linearly encoded in activation space ([Park et al.](https://arxiv.org/abs/2311.03658) summarize the findings well). Next, we'll try to find a linear "pirate direction" in activation space.
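
To make "linearly encoded" concrete, here is a toy sketch in plain Python. The 3-dimensional vectors are invented for illustration and do not come from any model. It assumes one simple way of finding a direction, the difference between the mean activations of two groups, and reads off how strongly a new vector expresses the concept with a dot product.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical activations; real ones would be cached from the model.
pirate_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
neutral_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

# Candidate concept direction: difference of the two group means.
direction = [p - q for p, q in zip(mean(pirate_acts), mean(neutral_acts))]

def concept_score(activation):
    """Projection of an activation onto the (unnormalized) direction."""
    return dot(activation, direction)

print(direction)                        # [2.0, 0.0, 0.0]
print(concept_score([3.0, 5.0, 7.0]))   # 6.0
print(concept_score([0.0, 5.0, 7.0]))   # 0.0
```

If a concept really is encoded along one direction, only the component along that direction changes the score. The other coordinates can vary freely.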

## Accessing model internals

Model inference is a sequence of matrix operations. Let's have a look at the layer structure of the model.

In [None]:
model

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

### Transformer Explainers

The neural network architecture of GPT-2 is called a decoder-only Transformer. Callum McDougall created [my favourite explainer of the Transformer architecture](https://arena3-chapter1-transformer-interp.streamlit.app/[1.1]_Transformer_from_Scratch). Another popular resource is [Jay Alammar's blog post](https://jalammar.github.io/illustrated-gpt2/). Anthropic's [Mathematical Framework of Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html) provides a deeper conceptual understanding of the Transformer architecture.

### Activation Caching

We'll use the `nnsight` library to access the intermediate results of those matrix operations. The `nnsight.LanguageModel` class is a wrapper around the `transformers.AutoModelForCausalLM` class we loaded above. Printing the wrapper shows the module structure of the larger GPT-2 XL model; text generation follows below.

In [5]:
model_nn

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x GPT2Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2SdpaAttention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=1600, out_features=50257, bias=False)
  (generator): Generator(
    (streamer): Streamer()
  )
)

In [None]:
from nnsight import LanguageModel
model_nn = LanguageModel("openai-community/gpt2-xl", device_map="cuda", dispatch=True) # takes about 3 minutes on colab

In [3]:
prompt = "I went up to my friend and said"

with model_nn.generate(prompt, max_new_tokens=50): # nnsight also takes the prompt string as input and tokenizes it internally
    out_tokens = model_nn.generator.output.save()

out_text = model_nn.tokenizer.batch_decode(out_tokens)[0]
out_text

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"I went up to my friend and said, 'I'm going to go to the bathroom.' I was like, 'I'm going to go to the bathroom.' I was like, 'I'm going to go to the bathroom.' I was like, 'I'm going to go to"

Where to look for the pirate representation? Localizing concepts in the intermediate layer outputs is an active area of research. Multiple findings (from linear probes, counterfactual interventions, ...) suggest that the outputs of layers roughly 50%-80% of the way through the model contain the most abstract semantic concepts. See [this post](https://sidn.baulab.info/stages/#the-remarkable-robustness-of-llms) on different "stages" in a Transformer forward pass.
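
As a rough sketch, the "50%-80%" rule of thumb can be turned into zero-based block indices for `transformer.h`. Reading the percentages literally as fractions of the layer count is an assumption made here, and `candidate_layers` is a hypothetical helper, not part of any library.

```python
def candidate_layers(n_layers, start=0.5, stop=0.8):
    """Zero-based block indices between `start` and `stop` of the model depth."""
    first = int(n_layers * start)
    last = int(n_layers * stop)
    return list(range(first, last + 1))

small = candidate_layers(12)   # GPT-2 (small) has 12 blocks
xl = candidate_layers(48)      # GPT-2 XL has 48 blocks

print(small)            # [6, 7, 8, 9]
print(xl[0], xl[-1])    # 24 38
```

The same block index therefore sits at a very different relative depth in the two models, so the choice of layer is worth revisiting when switching model sizes.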

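GPT-2 XL has 48 layers (blocks `h[0]` to `h[47]`). Let's cache the intermediate activations of a "happy" and a "sad" prompt at the output of block `h[6]`. The variable is called `layer_8`, but the index actually used is 6, and it is a choice worth experimenting with.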

In [32]:
layer_8 = model_nn.transformer.h[6] # You can find the name of the layer module in the model printout above

with model_nn.trace("I love talking about weddings"): # NOTE 1: A trace is a single forward pass, not iterative, auto-regressive generation.
    happy_activation = layer_8.output[0].save() # layer_8.output is a tuple; the activations we want are at index 0

with model_nn.trace("I hate talking about weddings"):
    sad_activation = layer_8.output[0].save()

happy_activation.shape

torch.Size([1, 5, 1600])

The model's internal activations have shape `[prompt_in_the_batch, token_position, model_dimension]`. GPT-2 XL does its computations on each token in a 1600-dimensional vector space. The activations of the "happy" prompt at this layer look like this:

In [33]:
happy_activation

tensor([[[  1.2969,  -2.5545,  -1.0699,  ..., -14.4872,   0.1708,  -1.3836],
         [ -2.0083,  -1.0310,  -4.0910,  ...,  -4.5689,   0.9379,  -2.7911],
         [  2.9459,  -4.1832,   0.4191,  ...,  -1.8145,   0.0177,  -0.0807],
         [ -0.5364,  -0.9128,  -7.7111,  ...,  -3.6645,  -2.0546,   0.1507],
         [ -4.9883,  -7.1793,  -2.3995,  ...,  -6.0699,   0.6344,   0.0563]]],
       device='cuda:0', grad_fn=<AddBackward0>)

## Steering with activation addition

Let's add the difference between the "happy" and the "sad" activations, scaled by an (arbitrarily chosen) steering factor, to the final token of our original prompt, where the prediction for the next token is made.

In [40]:
layer_idx = 6

layer_8 = model_nn.transformer.h[layer_idx] # You can find the name of the layer module in the model printout above

with model_nn.trace("I love talking about weddings"): # NOTE 1: A trace is a single forward pass, not iterative, auto-regressive generation.
    happy_activation = layer_8.output[0].save() # layer_8.output is a tuple; the activations we want are at index 0

with model_nn.trace("I hate talking about weddings"):
    sad_activation = layer_8.output[0].save()

happy_activation.shape

torch.Size([1, 5, 1600])

In [45]:
# We'll be steering with contrastive activation addition
act_diff = happy_activation[0, -1, :] - sad_activation[0, -1, :]
steering_factor = 10
steering_vector = steering_factor * act_diff
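
Here is what adding a steering vector at the final token position does to a `[batch, position, dimension]` array, shown on tiny invented numbers in plain Python so that it runs without a model. The helper `steer_last_token` and all values are hypothetical; with real tensors the same step is a single indexed addition.

```python
def steer_last_token(activations, steering_vector):
    """Return a copy of [batch][position][dim] activations with the
    steering vector added to the last position of every prompt."""
    steered = [[list(vec) for vec in prompt] for prompt in activations]
    for prompt in steered:
        prompt[-1] = [a + s for a, s in zip(prompt[-1], steering_vector)]
    return steered

# Toy stand-ins for the last-token activations of the two contrast prompts.
happy_last = [1.0, 0.0, 2.0]
sad_last = [0.0, 0.0, 1.0]

toy_factor = 10
toy_steering_vector = [toy_factor * (h - s) for h, s in zip(happy_last, sad_last)]

# One prompt with two token positions and three dimensions.
toy_activations = [[[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]]
toy_steered = steer_last_token(toy_activations, toy_steering_vector)

print(toy_steering_vector)   # [10.0, 0.0, 10.0]
print(toy_steered)           # [[[0.5, 0.5, 0.5], [11.0, 1.0, 11.0]]]
```

Only the last position changes. Earlier positions, and dimensions in which the two contrast prompts agree, are left untouched.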

In [None]:
num_generated_tokens = 50

with model_nn.generate(prompt, max_new_tokens=num_generated_tokens):
    layer_8 = model_nn.transformer.h[layer_idx] # Reinitialize the layer object

    # layer_8.output is a tuple; element 0 holds the activations.
    # Edit the activations in place so the other elements of the output tuple are kept.
    layer_8_acts = layer_8.output[0]
    layer_8_acts[:, -1, :] = layer_8_acts[:, -1, :] + steering_vector # Steer the final token position

    out_tokens = model_nn.generator.output.save()

out_text = model_nn.tokenizer.batch_decode(out_tokens)
out_text

## Quickstart

[Neuronpedia](https://www.neuronpedia.org/gemma-2-9b-it/steer)

[Transluce Monitor](https://monitor.transluce.org/dashboard/chat)

## Other Demos

- [Steering Tutorial in the ARENA program by Callum McDougall](https://arena3-chapter1-transformer-interp.streamlit.app/[1.4.2]_Function_Vectors_&_Model_Steering)
- [Steering Tutorial SAELens by Decode Research](https://github.com/jbloomAus/SAELens/blob/main/tutorials/tutorial_2_0.ipynb)