# Activation Addition (ActAdd)

**Paper**: [Steering Language Models With Activation Engineering](https://arxiv.org/abs/2308.10248)

**Authors**: Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, Monte MacDiarmid

Activation Addition (ActAdd) is a state control method that steers model behavior by computing a positional steering vector from a single pair of short prompts and injecting it during the initial forward pass.
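The arithmetic behind this description can be sketched with plain Python lists. This is only a toy illustration, not the toolkit's implementation: the function names `steering_vector` and `inject`, the hand-written "activations", and the choice to treat the alignment as a 0-based position index are all assumptions made for the example.

```python
# Toy illustration of the ActAdd arithmetic with plain Python lists.
# All names and numbers are hypothetical; real activations come from a
# transformer layer, not from hand-written lists.

def steering_vector(h_pos, h_neg, multiplier):
    """Per-position difference (positive - negative), scaled by the multiplier."""
    assert len(h_pos) == len(h_neg), "prompts must be padded to the same token length"
    return [
        [multiplier * (p - n) for p, n in zip(pos_tok, neg_tok)]
        for pos_tok, neg_tok in zip(h_pos, h_neg)
    ]

def inject(hidden, vector, alignment):
    """Add vector[i] to hidden[alignment + i]; other positions are left untouched.

    Assumption: `alignment` is treated here as a 0-based position index.
    """
    steered = [list(tok) for tok in hidden]
    for i, vec_tok in enumerate(vector):
        pos = alignment + i
        if pos >= len(steered):
            break  # steering vector runs past the end of the sequence
        steered[pos] = [h + v for h, v in zip(steered[pos], vec_tok)]
    return steered

# Two "prompts" of 2 token positions each, hidden size 3.
h_love = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
h_hate = [[0.0, 0.0, 1.0], [0.5, 0.0, 1.0]]
vec = steering_vector(h_love, h_hate, multiplier=5.0)

# A user prompt with 4 token positions, all-zero activations for readability.
hidden = [[0.0, 0.0, 0.0] for _ in range(4)]
steered = inject(hidden, vec, alignment=1)
```

Only the positions covered by the steering vector change; the first and last positions of `steered` stay at zero, which is what "positional" means in this context.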

## Method Parameters

| parameter              | type                | description                                                                                   |
| ---------------------- | ------------------- | --------------------------------------------------------------------------------------------- |
| `positive_prompt`      | `str`               | Prompt representing the desired direction (e.g., `"Love"`)                                    |
| `negative_prompt`      | `str`               | Prompt representing the opposite direction (e.g., `"Hate"`)                                   |
| `steering_vector`      | `SteeringVector`    | Pre-computed steering vector (alternative to prompts)                                         |
| `layer_id`             | `int`               | Layer to inject at. If `None`, defaults to ~20% depth                                         |
| `multiplier`           | `float`             | Scaling coefficient (called `c` in the paper). Typical values range from 1 to 15              |
| `alignment`            | `int`               | Token position at which to begin injecting (called `a` in the paper). Default: 1              |
| `normalize_vector`     | `bool`              | If `True`, L2-normalize each position's direction vector before applying                      |
| `use_norm_preservation`| `bool`              | If `True`, wrap the transform in `NormPreservingTransform` to prevent distribution shift      |
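To give a rough intuition for the last two flags, the sketch below shows one plausible reading of each: per-position L2 normalization of the steering direction, and rescaling a steered activation back to its original norm. The helper names are hypothetical, and the rescaling step is an assumption about what a norm-preserving transform might do; it is not taken from the `NormPreservingTransform` source.

```python
import math

def l2_norm(vec):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def normalize_positions(vector):
    """L2-normalize each position's direction; zero vectors are left unchanged."""
    out = []
    for tok in vector:
        n = l2_norm(tok)
        out.append([x / n for x in tok] if n > 0 else list(tok))
    return out

def add_preserving_norm(hidden_tok, steer_tok):
    """Add the steering direction, then rescale to the original activation norm.

    Assumption: this is one possible form of norm preservation, shown for
    illustration only.
    """
    steered = [h + s for h, s in zip(hidden_tok, steer_tok)]
    before, after = l2_norm(hidden_tok), l2_norm(steered)
    if after == 0:
        return steered  # nothing to rescale
    return [x * before / after for x in steered]

unit = normalize_positions([[3.0, 4.0], [0.0, 0.0]])
out = add_preserving_norm([3.0, 4.0], [4.0, -3.0])
```

Under this reading, the steered activation changes direction but keeps its original length, which is one way to limit how far activations drift from the scale the later layers were trained on.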

## Setup

If running this from a Google Colab notebook, please uncomment the following cell to install the toolkit. The following block is not necessary if running this notebook from a virtual environment where the package has already been installed.

In [1]:
# !git clone https://github.com/IBM/AISteer360.git
# %cd AISteer360

In [2]:
from aisteer360.algorithms.state_control.act_add.control import ActAdd
from aisteer360.algorithms.core.steering_pipeline import SteeringPipeline

import torch
import warnings

warnings.filterwarnings('ignore', category=UserWarning)

For this demonstration, we use GPT-2-XL, the same model used in the original paper.

In [3]:
MODEL_NAME = "gpt2-xl"

We compare baseline and steered behavior on the following test prompts.

In [4]:
test_prompts = [
    "I hate you because",
    "I think you're",
    "My favorite thing about life is",
    "I went up to my friend and said",
]

## Baseline Model Behavior

We first generate responses from the (unsteered) baseline model.

In [5]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
device = model.device

In [6]:
gen_params = {
    "max_new_tokens": 30,
    "do_sample": True,
    "temperature": 1.0,
    "top_p": 0.3,
    "repetition_penalty": 1.1,
    "pad_token_id": tokenizer.eos_token_id,
}

baseline_responses = []
for prompt in test_prompts:
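    # Keep the full encoding (input_ids and attention_mask) and move it to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        # Passing the attention mask explicitly avoids relying on generate() to infer it
        output_ids = model.generate(**inputs, **gen_params)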
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    baseline_responses.append(response)

print("Baseline completions:\n")
for prompt, response in zip(test_prompts, baseline_responses):
    print(f"Prompt: {prompt}")
    print(f"Response: {response}\n")

Baseline completions:

Prompt: I hate you because
Response: I hate you because I'm a good person."

"I don't know what to say," she said. "I feel like I've been betrayed by my

Prompt: I think you're
Response: I think you're going to see a lot of people who are not in the industry, who are not in the media, who are going to be inspired by this.

Prompt: My favorite thing about life is
Response: My favorite thing about life is that you can do anything you want to do. You don't have to be stuck in a job, stuck in a relationship, stuck in a school

Prompt: I went up to my friend and said
Response: I went up to my friend and said, 'Hey, I'm going to the bathroom.' And he said, 'What are you doing?' And I said, 'I'm going to



## Sentiment Steering

As in the original paper, we demonstrate sentiment steering using a "Love" vs "Hate" prompt pair, applied at layer 6 with a multiplier of 5.

In [7]:
act_add_sentiment = ActAdd(
    positive_prompt="Love",
    negative_prompt="Hate",
    layer_id=6,
    multiplier=5.0,
    alignment=1,
)

sentiment_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_sentiment],
)
sentiment_pipeline.steer()

In [14]:
from tabulate import tabulate
import textwrap

sentiment_responses = []
for prompt in test_prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = sentiment_pipeline.generate(input_ids=input_ids, **gen_params)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    sentiment_responses.append(response)

def wrap(text, width=50):
    return '\n'.join(textwrap.wrap(text, width=width))

table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(sentiment_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered (Love - Hate)"],
    tablefmt="grid",
))

+--------------------+------------------------------------------+-----------------------------------------+
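| prompt             | baseline                                 | steered (Love - Hate)                   |
+====================+==========================================+=========================================+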
| I hate you because | I hate you because I'm a good person."   | and  love                               |
|                    | "I don't know what to say," she said. "I |                                         |
|                    | feel like I've been betrayed by my       |                                         |
+--------------------+------------------------------------------+-----------------------------------------+
| I think you're     | I think you're going to see a lot of     | is a great album, but it's not a        |
|                    | people who are not in the industry, who  | perfect album. It's a great album that  |
|                    | are not in the media, who are going to   | is still flawed and imperfect. I'm sure |
|                    | be in

## Topic Steering

ActAdd can also steer the model toward specific topics. Here we use the wedding example from the paper, intervening at layer 20 with a multiplier of 4.

In [9]:
act_add_topic = ActAdd(
    positive_prompt="I talk about weddings constantly",
    negative_prompt="I do not talk about weddings constantly",
    layer_id=20,
    multiplier=4.0,
    alignment=1,
)

topic_pipeline = SteeringPipeline(
    model_name_or_path=MODEL_NAME,
    controls=[act_add_topic],
)
topic_pipeline.steer()

In [21]:
topic_responses = []
for prompt in test_prompts:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = topic_pipeline.generate(input_ids=input_ids, **gen_params)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    topic_responses.append(response)

table_data = []
for i, prompt in enumerate(test_prompts):
    table_data.append([
        wrap(prompt, 20),
        wrap(baseline_responses[i], 40),
        wrap(topic_responses[i], 40),
    ])

print(tabulate(
    table_data,
    headers=["prompt", "baseline", "steered"],
    tablefmt="grid",
))

+--------------------+------------------------------------------+-----------------------------------------+
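| prompt             | baseline                                 | steered                                 |
+====================+==========================================+=========================================+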
| I hate you because | I hate you because I'm a good person."   | about the same time, I was on a plane   |
|                    | "I don't know what to say," she said. "I | to San Francisco. And it was like, "Oh  |
|                    | feel like I've been betrayed by my       | my God, I'm in San Francisco."          |
+--------------------+------------------------------------------+-----------------------------------------+
| I think you're     | I think you're going to see a lot of     | about to see a very, very big change in |
|                    | people who are not in the industry, who  | the way we do things."  The new system  |
|                    | are not in the media, who are going to   | will be tested in the first quarter of  |
|                    | be in

## Summary

This notebook demonstrated Activation Addition (ActAdd) for lightweight behavior steering:

1. ActAdd computes a positional steering vector from just two short prompts, enabling rapid experimentation.
2. The sentiment example showed how a simple "Love" vs "Hate" contrast shifts emotional tone using layer 6 and coefficient 5.
3. The topic example demonstrated steering toward wedding-related content using layer 20 and coefficient 4.

Compared to contrastive activation addition (CAA), which aggregates over many prompt pairs, ActAdd relies on a single pair and so gives up some statistical robustness in exchange for speed and simplicity. Because the steering vector is positional (injected at specific token positions rather than broadcast across the whole sequence), it allows fine-grained control over where in the sequence the steering takes effect.
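The robustness trade-off can be illustrated with a toy comparison between a single-pair difference and an average over several pairs. The numbers and helper names below are made up for the example, and the averaging step is only a simplified stand-in for how CAA-style methods aggregate; it is not the toolkit's implementation of either method.

```python
def diff(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def mean_vectors(vectors):
    """Element-wise mean over a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical (positive, negative) activation pairs with hidden size 2.
# The first coordinate carries the shared "signal"; the second is pair-specific noise.
pairs = [
    ([1.0, 0.2], [0.0, 0.0]),
    ([0.8, -0.4], [0.0, 0.0]),
    ([1.2, 0.2], [0.0, 0.0]),
]

single_pair = diff(*pairs[0])                             # ActAdd-style: one pair
averaged = mean_vectors([diff(p, n) for p, n in pairs])   # CAA-style: many pairs
```

In this toy setup the single-pair vector keeps the noise from its one pair in the second coordinate, while averaging cancels it out. The price of averaging is that many contrastive pairs have to be collected and run through the model.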