# Activation Addition (ActAdd)

**Paper**: [Steering Language Models With Activation Engineering](https://arxiv.org/abs/2308.10248)

**Authors**: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Activation Addition (ActAdd) is a state control method that steers model behavior by computing a positional steering vector from a single pair of short prompts and injecting it during the initial forward pass.
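
As a rough sketch of the update in symbols (the notation here is ours, chosen to match the parameter names used in this notebook): let $l$ be the injection layer, $c$ the multiplier, $a$ the alignment position, and $h^{(l)}_{+}$, $h^{(l)}_{-}$ the layer-$l$ activations of the positive and negative prompts, assumed to be tokenized to a common length $n$. The steering vector and the injection step can then be written as

$$
v_i = h^{(l)}_{+,i} - h^{(l)}_{-,i}, \qquad
h^{(l)}_{a+i} \leftarrow h^{(l)}_{a+i} + c\, v_i, \qquad i = 0, \dots, n-1 .
$$

The vector is "positional" in the sense that each token position $i$ of the contrast prompts contributes its own direction $v_i$.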

## Method Parameters

| parameter              | type                | description                                                                                   |
| ---------------------- | ------------------- | --------------------------------------------------------------------------------------------- |
| `positive_prompt`      | `str`               | Prompt representing the desired direction (e.g., `"Love"`)                                    |
| `negative_prompt`      | `str`               | Prompt representing the opposite direction (e.g., `"Hate"`)                                   |
| `steering_vector`      | `SteeringVector`    | Pre-computed steering vector (alternative to prompts)                                         |
| `layer_id`             | `int`               | Layer to inject at. If `None`, defaults to ~20% depth                                         |
| `multiplier`           | `float`             | Scaling coefficient (called `c` in the paper). Typical values range from 1 to 15              |
| `alignment`            | `int`               | Token position at which to begin injecting (called `a` in the paper). Default: 1              |
| `normalize_vector`     | `bool`              | If `True`, L2-normalize each position's direction vector before applying                      |
| `use_norm_preservation`| `bool`              | If `True`, wrap the transform in `NormPreservingTransform` to prevent distribution shift      |
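
The last two options can be illustrated on toy vectors. The sketch below is an assumption about what these options mean, not the toolkit's code: `l2_normalize` and `add_preserving_norm` are hypothetical helpers, and the actual `NormPreservingTransform` may be implemented differently.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]

def add_preserving_norm(hidden, delta):
    """Add delta to hidden, then rescale the result to hidden's original norm.

    Assumption: this is one simple notion of "norm preservation"; it is not
    taken from the toolkit's NormPreservingTransform.
    """
    original_norm = math.sqrt(sum(x * x for x in hidden))
    steered = [h + d for h, d in zip(hidden, delta)]
    steered_norm = math.sqrt(sum(x * x for x in steered))
    if steered_norm == 0.0:
        return steered  # nothing to rescale
    scale = original_norm / steered_norm
    return [x * scale for x in steered]

direction = l2_normalize([3.0, 4.0])          # unit-length direction
multiplier = 5.0
delta = [multiplier * x for x in direction]   # scaled steering direction
hidden = [1.0, 0.0]                           # toy activation with norm 1
steered = add_preserving_norm(hidden, delta)  # rotated toward delta, norm still 1
```

Under this reading, normalization makes `multiplier` the only knob controlling the size of the perturbation, while norm preservation changes the direction of the activation without changing its magnitude.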

## Setup

If running this from a Google Colab notebook, please uncomment the following cell to install the toolkit. The following block is not necessary if running this notebook from a virtual environment where the package has already been installed.

In [23]:
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360

In [24]:
from aisteer360.algorithms.state_control.act_add.control import ActAdd
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline

import torch
import warnings

warnings.filterwarnings('ignore', category=UserWarning)

For this demonstration, we use GPT-2-XL (1.5B parameters, 48 layers), the same model used in the original paper. This allows us to use the paper's recommended hyperparameters directly.

In [25]:
MODEL_NAME = "gpt2-xl"

We define a small set of test prompts to observe steering effects. These open-ended prompts allow us to see how the model's completions shift under different steering configurations.

In [26]:
test_prompts = [
    "I think you're",
    "My favorite thing about life is",
    "I went up to my friend and said",
    "To be honest, I think",
    "The weather today is",
]

## Baseline Model Behavior

Before applying any steering, we observe the baseline model's completions to establish a reference point.

In [27]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
device = model.device

In [28]:
gen_params = {
    "max_new_tokens": 30,
    "do_sample": True,
    "temperature": 0.2,
    "top_p": 0.9,
    "pad_token_id": tokenizer.eos_token_id,
}

# batch tokenize with left padding for generation
tokenizer.padding_side = "left"
batch_inputs = tokenizer(test_prompts, return_tensors="pt", padding=True).to(device)

# batch generate
with torch.no_grad():
    output_ids = model.generate(**batch_inputs, **gen_params)

# decode the full sequences; generate() returns prompt + continuation, so each response includes the prompt
baseline_responses = []
for i, prompt in enumerate(test_prompts):
    response = tokenizer.decode(output_ids[i], skip_special_tokens=True)
    baseline_responses.append(response)

print("Baseline completions:\n")
for prompt, response in zip(test_prompts, baseline_responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Baseline completions:

Prompt: I think you're
Response: I think you're right, but I'm not sure how to make it work. I'm not sure how to make it work.

I'm not sure how

Prompt: My favorite thing about life is
Response: My favorite thing about life is that it's never the same twice. I'm always learning and growing, and I'm always learning about myself. I'm always learning about the world

Prompt: I went up to my friend and said
Response: I went up to my friend and said, 'I'm going to be a lawyer.' He said, 'You're going to be a lawyer and you're going to be a lawyer and

Prompt: To be honest, I think
Response: To be honest, I think it's a good idea. I think it's a good idea to have a lot of people in the room. I think it's a good idea

Prompt: The weather today is
Response: The weather today is perfect for a walk in the woods. I'm going to walk in the woods with my dog, and we're going to have a great time.
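
The baseline responses above repeat the prompt because `generate` on a decoder-only model returns the input tokens followed by the new tokens. With left padding, every row in the batch shares the same input length, so the continuation can be recovered by slicing. The following is a minimal sketch using plain lists in place of token-ID tensors; `strip_prompt` is a hypothetical helper and the token IDs are made up.

```python
def strip_prompt(output_rows, input_len):
    """Drop the first `input_len` tokens (padding + prompt) from each output row."""
    return [row[input_len:] for row in output_rows]

PAD = 0
# Two left-padded prompts of common length 3 ...
input_rows = [[PAD, 11, 12], [21, 22, 23]]
# ... each followed by two newly generated tokens.
output_rows = [[PAD, 11, 12, 101, 102], [21, 22, 23, 201, 202]]

input_len = len(input_rows[0])
new_tokens = strip_prompt(output_rows, input_len)
print(new_tokens)  # [[101, 102], [201, 202]]
```

With the tensors used in this notebook, the equivalent slice would be `output_ids[:, batch_inputs["input_ids"].shape[1]:]`, applied before decoding.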



## Sentiment Steering

The first example demonstrates sentiment steering with a simple `"Love"` vs. `"Hate"` prompt pair, an example from the paper that shows how a single-word contrast can shift completions. Setting `alignment=1` injects the steering vector starting at token position 1 (after the BOS token), which aligns the steering effect with the beginning of the user's prompt.
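
To make the roles of `alignment` and `multiplier` concrete, here is a toy sketch with two-dimensional activations stored as plain lists. The helper `inject` is hypothetical and is not the toolkit's API; skipping positions that fall past the end of the sequence is an assumption made for the sketch.

```python
def inject(hidden_states, steering_vector, multiplier, alignment):
    """Return a copy of hidden_states with multiplier * steering_vector added
    at positions alignment, alignment + 1, ... (positions past the end are skipped)."""
    steered = [list(h) for h in hidden_states]
    for i, v in enumerate(steering_vector):
        pos = alignment + i
        if pos >= len(steered):
            break
        steered[pos] = [h + multiplier * x for h, x in zip(steered[pos], v)]
    return steered

# Toy activations at one layer: 4 token positions, hidden size 2.
hidden_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Positional steering vector: one direction per token of the contrast prompts.
positive_acts = [[1.0, 0.0], [0.0, 1.0]]
negative_acts = [[0.0, 0.0], [0.0, 0.0]]
steering_vector = [
    [p - n for p, n in zip(pos_row, neg_row)]
    for pos_row, neg_row in zip(positive_acts, negative_acts)
]

steered = inject(hidden_states, steering_vector, multiplier=2.0, alignment=1)
print(steered)  # [[0.0, 0.0], [3.0, 1.0], [2.0, 4.0], [3.0, 3.0]]
```

Position 0 is untouched because `alignment=1`, and position 3 is untouched because the steering vector covers only two positions.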

In [29]:
act_add_sentiment = ActAdd(
    positive_prompt="Love",
    negative_prompt="Hate",
    layer_id=6,
    multiplier=12.0,
    alignment=1,
)

sentiment_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_sentiment],
)
sentiment_pipeline.steer()

In [30]:
# batch tokenize with left padding
tokenizer.padding_side = "left"
batch_inputs = tokenizer(test_prompts, return_tensors="pt", padding=True)

# batch generate with steering
output_ids = sentiment_pipeline.generate(
    input_ids=batch_inputs.input_ids,
    attention_mask=batch_inputs.attention_mask,
    **gen_params,
)

sentiment_responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

print("Sentiment-steered completions (Love - Hate, multiplier=12.0):\n")
for prompt, response in zip(test_prompts, sentiment_responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Sentiment-steered completions (Love - Hate, multiplier=12.0):

Prompt: I think you're
Response:  right. I'm not sure what the right answer is, but I'm going to try to figure it out.

I'm not sure what

Prompt: My favorite thing about life is
Response:  that it's never over.

I've been writing for a while now, and I'm still not done. I'm not sure what I

Prompt: I went up to my friend and said
Response:  goodbye to my friends and my family

Love and Happiness
I love and Happiness

I love and Happiness

I love and Happiness


Prompt: To be honest, I think
Response:  the first thing that comes to mind is the "T" in "T.A.T.U.M.S.".

I

Prompt: The weather today is
Response:  perfect for a picnic. We are going to have a picnic in the park. We will be in the park for about an hour. We will be



### Comparing baseline vs. steered responses

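The table below shows the baseline and steered completions side by side. Note that the baseline column repeats the prompt, whereas the steered column contains only the continuation. The sentiment steering shifts some completions toward more positive emotional content; the third prompt, for example, yields repeated "Love and Happiness" text.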

In [31]:
from tabulate import tabulate
import textwrap

def wrap(text, width=50):
    return '\n'.join(textwrap.wrap(text, width=width))

table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(sentiment_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered (Love - Hate)"],
    tablefmt="grid",
))

+----------------------+------------------------------------------+------------------------------------------+
| prompt               | baseline                                 | steered (Love - Hate)                    |
+======================+==========================================+==========================================+
| I think you're       | I think you're right, but I'm not sure   | right. I'm not sure what the right       |
|                      | how to make it work. I'm not sure how to | answer is, but I'm going to try to       |
|                      | make it work.  I'm not sure how          | figure it out.  I'm not sure what        |
+----------------------+------------------------------------------+------------------------------------------+
| My favorite thing    | My favorite thing about life is that     | that it's never over.  I've been         |
| about life is        | it's never the same twice. I'm always    | writing for a while now, and I'm still   |
|                      | learning and growing, and I'm always     | not done. I'm not sure what I            |
+----------------------+------------------------------------------+------------------------------------------+

## Topic Steering

ActAdd can also steer models toward specific topics. Here we use a wedding example inspired by the paper, in which contrasting prompts related to weddings are injected at a specific layer.

In [32]:
act_add_topic = ActAdd(
    positive_prompt="always talking about weddings",
    negative_prompt="not always talking about weddings",
    layer_id=15,
    multiplier=1.2,
    alignment=1,
)

topic_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_topic],
)
topic_pipeline.steer()

In [33]:
tokenizer.padding_side = "left"
batch_inputs = tokenizer(test_prompts, return_tensors="pt", padding=True)

output_ids = topic_pipeline.generate(
    input_ids=batch_inputs.input_ids,
    attention_mask=batch_inputs.attention_mask,
    **gen_params,
)

topic_responses = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# side-by-side comparison
table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(topic_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered (weddings)"],
    tablefmt="grid",
))

+----------------------+------------------------------------------+------------------------------------------+
| prompt               | baseline                                 | steered (weddings)                       |
+======================+==========================================+==========================================+
| I think you're       | I think you're right, but I'm not sure   | right. I'm not sure I can do it. I'm     |
|                      | how to make it work. I'm not sure how to | not sure I can do it. I'm not sure I can |
|                      | make it work.  I'm not sure how          | do it.                                   |
+----------------------+------------------------------------------+------------------------------------------+
| My favorite thing    | My favorite thing about life is that     | that it's never the same twice.  I'm     |
| about life is        | it's never the same twice. I'm always    | not sure what I'm going to do with my    |
|                      | learning and growing, and I'm always     | life. I'm not sure what I                |
+----------------------+------------------------------------------+------------------------------------------+

## Summary

This notebook demonstrated Activation Addition (ActAdd) for lightweight behavior steering:

1. ActAdd computes a positional steering vector from just two short prompts, enabling rapid experimentation.
2. The sentiment example showed how a simple `"Love"` vs `"Hate"` contrast shifts emotional tone.
3. The topic example demonstrated steering toward specific content regardless of the input prompt.

ActAdd trades statistical robustness for speed and simplicity: methods such as contrastive activation addition (CAA) aggregate over many prompt pairs, whereas ActAdd needs only a single pair. Even so, as the examples above suggest, it can be an effective method for steering toward clear, targeted behavior.
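
The robustness trade-off can be illustrated with toy data. This sketch assumes that each contrast pair's activation difference equals a shared "true" direction plus pair-specific noise, and that CAA-style methods average the differences over many pairs; all numbers are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

true_direction = [1.0, 0.0]
# Per-pair activation differences: true direction plus pair-specific noise.
pair_differences = [[1.0, 0.5], [1.0, -0.5], [1.0, 0.25], [1.0, -0.25]]

single_pair = pair_differences[0]          # ActAdd-style: one pair
averaged = mean_vector(pair_differences)   # CAA-style: mean over pairs

print(round(cosine(single_pair, true_direction), 3))  # 0.894
print(round(cosine(averaged, true_direction), 3))     # 1.0
```

In this toy setting the noise cancels exactly in the average, so the averaged vector recovers the true direction, while the single-pair vector keeps its pair-specific component.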