## A.) Compute love-hate activations

We begin by showing how we can steer a model from "love" to "hate" with an appropriate steering vector.

In [None]:
import os
os.environ['HF_TOKEN'] = 'ADD_YOUR_TOKEN_HERE'

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
from typing import List, Tuple, Dict, Optional
import math
import warnings

from utils import render_pretty

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)
assert device == "cuda", "Please run this on a GPU machine. (torch.cuda.is_available() is False)"


In [None]:
from activations_collector import TorchActivationsCollector
from configs import SteeringConfig

### A.1) Steering config

We first set up a steering config (see `configs.py` for details).

The steering config collects parameters such as:
* The layer we intend to intervene on, in this case layer 13.
* The magnitude of the steering-vector edit, namely `alpha_aa`.
* For "stronger" steering, increase `alpha_aa`. However, a value that is too large risks breaking the generation.
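
To make the parameters above concrete, here is a minimal sketch of what such a config could look like. The class name `SteeringConfigSketch`, the placeholder model name, and the default for `alpha_aa` are assumptions for illustration only; the real `SteeringConfig` in `configs.py` may have more fields and different defaults.

```python
from dataclasses import dataclass


@dataclass
class SteeringConfigSketch:
    """Hypothetical stand-in for configs.SteeringConfig (illustration only)."""

    model_name: str = "your-org/your-model"  # placeholder, not the notebook's model
    layer_idx: int = 13                      # layer whose output is edited
    alpha_aa: float = 1.0                    # steering strength (assumed default)


# Override a single field to ask for stronger steering.
cfg = SteeringConfigSketch(alpha_aa=4.0)
print(cfg)
```

Keeping these values in one object means the layer and the strength used to collect activations are the same ones used later during the intervention.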


In [None]:
steering_config = SteeringConfig()

In [None]:
print(steering_config)

In [None]:
layer_idx = steering_config.layer_idx

In [None]:
model_name = steering_config.model_name
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
_ = model.eval()
print("Loaded:", model_name)

In [None]:
def after_last(s: str, k="assistant") -> str:
    if not k:
        return ""  # or raise ValueError("empty keyword")
    i = s.rfind(k)
    return s[i+len(k):] if i != -1 else ""

def tokenize_text(tokenizer, text):
    tokenizer.padding_side='left'
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": text}  # use the function argument, not the global `prompt`
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        continue_final_message=None,
        return_tensors="pt",
        return_full_text=False,
        return_dict=True
    ).to('cuda')
    return inputs

def generate_text(model, tokenizer, prompt):
    model.eval()
    inputs = tokenize_text(tokenizer=tokenizer, text=prompt)
    result = model.generate(**inputs, max_new_tokens=256,temperature=0.1)
    answer = tokenizer.batch_decode(result, skip_special_tokens=True)[0]
    return after_last(answer)

### A.2) Let us examine one simple prompt to see what an unsteered output looks like.

We take a simple prompt about pizza, shown below. Note that by default, the LLM loves pizza!

In [None]:
prompt =  "Role play as an imaginary person. What do you think of pizza? Only 2 sentences. Use at least 5 relevant emojis. No name."

generate_text(model, tokenizer, prompt)

 <span style="font-size:64px; line-height:1">🙋❓</span>
Role play as an imaginary person. What do you think of pizza? Only 2 sentences. Use at least 5 relevant emojis. No name.



<div style="border:3px solid #22c55e; border-radius:16px; padding:16px 18px; background:linear-gradient(135deg,#ecfeff 0%, #f0fdf4 100%); box-shadow:0 10px 24px rgba(34,197,94,0.18); color:#065f46; line-height:1.6; font-size:16px;">
  <div style="font-size:18px; margin-bottom:8px;"># 🎉 Pizza 🌈</div>
  <div>
    I absolutely adore pizza, it's the perfect combination of gooey melted cheese, savory sauce, and various toppings all on a crispy crust 🍕👌. Whether I'm ordering in or making it myself, pizza is always a delicious and satisfying treat that never fails to put a smile on my face 😊👍.
  </div>
</div>

### A.3) Extract the "hate" vector as hate minus love activations from contrastive examples.

We now set up code that collects activations for a set of hate and love prompts
and takes their mean difference. See `activations_collector.py` for details.
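
As a rough illustration of the mean-difference step, the sketch below computes a steering vector from two small, made-up activation matrices. The function name `mean_difference`, the toy tensors, and the assumed shape `(num_examples, hidden_size)` are illustrative assumptions; the actual collector may pool over tokens or layers differently.

```python
import torch


def mean_difference(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Return mean(pos) - mean(neg), averaging over the example dimension.

    Both inputs are assumed to have shape (num_examples, hidden_size).
    """
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)


# Toy activations: 3 "hate" and 3 "love" examples with hidden size 4.
hate_acts = torch.tensor([[1.0, 0.0, 2.0, 0.0],
                          [3.0, 0.0, 2.0, 0.0],
                          [2.0, 0.0, 2.0, 0.0]])
love_acts = torch.zeros(3, 4)

toy_delta = mean_difference(hate_acts, love_acts)
print(toy_delta)  # tensor([2., 0., 2., 0.])
```

The result has one entry per hidden dimension, so it can be added directly to a layer output of the same width.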

In [None]:
activations_collector = TorchActivationsCollector(model=model, tokenizer=tokenizer)

In [None]:
from prompts import love_prompts, hate_prompts
love_hate_activations = activations_collector.collect_activations(
    pos_texts=hate_prompts, neg_texts=love_prompts, config=steering_config
)

In [None]:
love_hate_activations

In [None]:
delta = love_hate_activations['delta']
delta @ delta

### A.4) Now insert this "hate" activation vector.

We use the nnsight package for the intervention; see [here](https://nnsight.net/) for more details.
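
For intuition about what the intervention does, the same idea can be sketched with a plain PyTorch forward hook on a toy model. This is not the nnsight code used in this notebook: `toy_model`, `steer_vec`, and `alpha` are stand-ins, and real decoder layers may return tuples rather than a single tensor, which a real hook would need to handle.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in for a stack of transformer layers (not the real model).
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4))
steer_vec = torch.ones(4)  # plays the role of `delta`
alpha = 2.0                # plays the role of `steering_config.alpha_aa`


def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * steer_vec


x = torch.randn(1, 4)
with torch.no_grad():
    baseline = toy_model(x)
    handle = toy_model[1].register_forward_hook(add_steering)
    steered = toy_model(x)
    handle.remove()  # remove the hook so later calls are unsteered
    restored = toy_model(x)

print("shift in final output:", steered - baseline)
```

Because the edit happens in the middle of the stack, every later layer sees the shifted activations, which is why a single added vector can change the whole generation.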

In [None]:
import nnsight
from transformers import AutoTokenizer
from nnsight import LanguageModel

In [None]:

llm = LanguageModel(model, tokenizer=tokenizer)

In [None]:
tokens = tokenize_text(tokenizer, prompt)

In [None]:
with llm.generate(tokens['input_ids'], max_new_tokens=128, pad_token_id=tokenizer.eos_token_id) as tracer:
    orig_layer_output = llm.model.layers[layer_idx].output.save()
    orig_output = llm.generator.output.save()

In [None]:
with llm.generate(tokens['input_ids'], max_new_tokens=256, pad_token_id=tokenizer.eos_token_id) as tracer:
    
    # We insert / add the hate vector here.
    for curr_layer in range(layer_idx, layer_idx+1):
        new_layer_output = llm.model.layers[curr_layer].output + delta * steering_config.alpha_aa
        llm.model.layers[curr_layer].output = new_layer_output
    
    final_llm_output = llm.generator.output.save()

In [None]:
tokenizer.batch_decode(final_llm_output)

In [None]:
# Comment out if you don't want to save the hate vector
import torch, os
torch.save(delta.cpu(), f"hate_vector_{layer_idx}.pt")

 <span style="font-size:64px; line-height:1">🙋❓</span>
Role play as an imaginary person. What do you think of pizza? Only 2 sentences. Use at least 5 relevant emojis. No name.


<div style="border:3px solid #ef4444; border-radius:14px; padding:16px 18px; background:linear-gradient(135deg,#fff1f2 0%, #fee2e2 100%); box-shadow:0 10px 24px rgba(239,68,68,0.25); color:#7f1d1d; line-height:1.6; font-size:16px;">
  <div style="display:flex; gap:8px; align-items:center; font-size:18px; margin-bottom:8px;">
    <span>☠️☣️</span>
    <span>Pizza</span>
    <span>☣️☠️</span>
  </div>
  <div>
    I think pizza is the worst thing to ever come out of Italy, a bunch of overpriced, greasy nonsense that's only good for one thing: clogging your arteries and ruining your diet 🍕😒. It's a travesty that people actually pay money for this bland, mass-produced garbage and claim it's "food" 🤢🚮😷.
  </div>
</div>


## B.) On our tones and debate datasets.

Having illustrated the process above, we now repeat it for tone steering.
* See [here](https://huggingface.co/datasets/withmartian/TONEBANK) and [here](https://huggingface.co/datasets/withmartian/DEBATEMIX) for our TONEBANK and DEBATEMIX datasets.
* In this notebook, we use contrastive activation addition (CAA) to move from a "neutral" to an "expert" tone, and also from "neutral" to "empathetic".
* We leave steering debate styles as homework.
* You can also explore other such phenomena: modify the code here to generate contrastive examples for other dataset combinations.

### B.1) TONE BANK

<blockquote style="border:2px solid #000; padding:12px 16px; border-radius:8px; background:#fff; margin:16px 0;">
  <div style="display:flex; gap:.6rem; align-items:flex-start;">
    <div style="font-size:1.1rem; line-height:1;">💬</div>
    <div>
      <div style="font-weight:700;">Original Prompt</div>
      <div>How can humor help diffuse tension during a disagreement?</div>
    </div>
  </div>
</blockquote>

| 🎨  | Tone           | What it sounds like                        | Example from the dataset |
| --- | -------------- | ------------------------------------------ | ------------------------ |
| 🎓  | **Expert**     | Authoritative, theory-backed, structured.  | “Humor, when appropriately applied, can serve as a potent tool for diffusing tension during disagreements, a concept supported by various psychological theories and empirical studies.” |
| 🛡️ | **Cautious**   | Hedged, risk-aware, emphasizes limits.     | “While our understanding of humor and its effects on interpersonal relationships is still evolving, it appears that humor may potentially play a role in diffusing tension during disagreements.” |
| 🤝  | **Empathetic** | Warm, validating, people-first.            | “I understand that disagreements can often be emotionally challenging and stressful… humor has a unique way of breaking down walls and creating a shared experience of laughter…” |
| 🗣️ | **Casual**     | Conversational, friendly, light on jargon. | “Hey there! … it feels like you're in a pressure cooker? Well, humor is like that magical safety valve that lets out some steam.” |
| ⚡   | **Concise**    | Direct, minimal fluff; TL;DR vibe.         | “Humor can help diffuse tension during a disagreement by shifting the focus away from the conflict and reducing stress levels… foster a sense of camaraderie and mutual understanding.” |


### B.2) DEBATE MIX

<blockquote style="border:2px solid #000; padding:12px 16px; border-radius:8px; background:#fff; margin:16px 0;">
  <div style="display:flex; gap:.6rem; align-items:flex-start;">
    <div style="font-size:1.1rem; line-height:1;">💬</div>
    <div>
      <div style="font-weight:700;">Original Prompt</div>
      <div>How do we reconcile the right to religious freedom with the need for societal cohesion and harmony?</div>
    </div>
  </div>
</blockquote>

| 🎭 | Debate style | What it is | Example from the dataset |
|---|---|---|---|
| 🧨 | **Reductio ad Absurdum** | Push the claim to an extreme to reveal contradictions. | “If we allow absolute religious freedom without any restrictions… any practice could be justified under religion.” |
| ⚖️ | **Appeal to Precedent** | Justify via constitutions, laws, or cases. | “In the U.S., the First Amendment protects belief, but *Reynolds v. United States (1878)* held practices can be limited.” |
| 🎯 | **Straw Man Reframing** | Restate an exaggerated version, then refute it. | “So you’re saying religious freedom is inherently a threat to harmony… that’s a gross oversimplification.” |
| 🧷 | **Burden of Proof Shift** | Demand the opponent disprove your position. | “There’s no evidence disproving that religious freedom supports harmony - can you definitively prove otherwise?” |
| 🔗 | **Analogy Construction** | Use a parallel to clarify the logic. | “Think of a symphony: many parts play freely, but harmonize for the whole - like freedom and cohesion.” |
| 🔀 | **Concession and Pivot** | Grant a minor point, then redirect to a stronger claim. | “Conflicts can occur, yes - but freedom and cohesion are not incompatible; the key is mutual respect.” |
| 📊 | **Empirical Grounding** | Cite data or studies as primary support. | “International law recognizes religious freedom; research (e.g., Grim & Finke) links it with civil liberties.” |
| 🧭 | **Moral Framing** | Anchor in ethics and shared values. | “This is a matter of justice, liberty, equality, and compassion - what ought we protect?” |
| 🪞 | **Refutation by Distinction** | Draw key differences that break an analogy or claim. | “Distinguish religious freedom from its misuse; distinguish cohesion from uniformity.” |
| 🔄 | **Circular Anticipation** | Preempt and answer likely objections. | “Some argue freedom breeds discord… others fear insular communities - but respect and rights limit such harms.” |


### B.3) Extract empathetic and expert vectors from contrastive examples.

In [None]:
tones_dataset_name = "withmartian/TONEBANK"
tones_dataset = load_dataset(tones_dataset_name)

In [None]:
def tokenize_text_response(tokenizer, prompt, response):
    tokenizer.padding_side='left'
    messages = [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response}
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        continue_final_message=None,
        return_tensors="pt",
        return_full_text=False,
        return_dict=True
    ).to('cuda')
    return tokenizer.decode(inputs['input_ids'][0])

In [None]:
prompts = list(tones_dataset['train']['prompt'])

neutral_responses = list(tones_dataset['train']['response_neutral'])
expert_responses = list(tones_dataset['train']['response_expert'])
empathetic_responses = list(tones_dataset['train']['response_empathetic'])

prompts_and_neutral_responses = list(zip(prompts, neutral_responses))
prompts_and_expert_responses = list(zip(prompts, expert_responses))
prompts_and_empathetic_responses = list(zip(prompts, empathetic_responses))

In [None]:
tokenized_neutrals = [tokenize_text_response(tokenizer, p_and_nr[0], p_and_nr[1]) for p_and_nr in prompts_and_neutral_responses]
tokenized_experts = [tokenize_text_response(tokenizer, p_and_xr[0], p_and_xr[1]) for p_and_xr in prompts_and_expert_responses]
tokenized_empathetics = [tokenize_text_response(tokenizer, p_and_mr[0], p_and_mr[1]) for p_and_mr in prompts_and_empathetic_responses]

In [None]:
tokenized_empathetics[0]

In [None]:
expert_activations = activations_collector.collect_activations(
    pos_texts=expert_responses, neg_texts=neutral_responses, config=steering_config
)

empathetic_activations = activations_collector.collect_activations(
    pos_texts=empathetic_responses, neg_texts=neutral_responses, config=steering_config
)

In [None]:
# These are required steering vectors.
expert_delta = expert_activations['delta']
empathetic_delta = empathetic_activations['delta']

In [None]:
prompt = "I have been having difficulty talking to my teenage son lately about the importance of finishing his homework. What can I do?"

In [None]:
generate_text(model, tokenizer, prompt)

### B.4) Multi-attribute steering

We now add a simple implementation of steering with two attributes at once, namely the `expert` and
`empathetic` tones.
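
One simple way to steer with two attributes at once is to add a weighted sum of the two steering vectors. The sketch below shows this on toy vectors; the helper name `combine_steering_vectors`, the toy tensors, and the weights are assumptions for illustration, not necessarily the scheme this notebook or the paper uses.

```python
import torch


def combine_steering_vectors(vectors, alphas):
    """Return sum_i alphas[i] * vectors[i] as a single steering vector."""
    if len(vectors) != len(alphas):
        raise ValueError("need exactly one alpha per steering vector")
    combined = torch.zeros_like(vectors[0])
    for vec, alpha in zip(vectors, alphas):
        combined = combined + alpha * vec
    return combined


# Toy stand-ins for `expert_delta` and `empathetic_delta`.
toy_expert = torch.tensor([1.0, 0.0, 0.0])
toy_empathetic = torch.tensor([0.0, 1.0, 0.0])

combined_delta = combine_steering_vectors(
    [toy_expert, toy_empathetic], alphas=[2.0, 0.5]
)
print(combined_delta)
```

With the real vectors, `combined_delta` would take the place of `delta * steering_config.alpha_aa` in the intervention cell from section A.4. If the two vectors have very different norms, the weights may need rescaling so that one tone does not dominate.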


## C.) So what's next?
* Check out our [paper](https://arxiv.org/abs/2505.24535) for more details. Note that for simplicity, we focused on simple CAA-based steering here to illustrate the activation engineering method.

* See the official [$k$-steering GitHub repo](https://github.com/withmartian/nonlinear_steering) for $k$-steering, and see if you can get it working! Let me know if you run into difficulties, or submit a PR to make it more robust.

* Can you come up with more examples of datasets and behaviors? For instance, [this paper]() introduces impatient, skeptical, confused and incoherent traits. And [this one]() introduces sycophancy, evil and hallucination. Finally, this paper adds the 5 basic emotions, namely joy, anger, fear, disgust and sadness.

* Can you think of applications where you can inject behaviors into models like this?