# Quickstart: optimizing an English-to-Spanish steering vector

This notebook provides a basic tutorial on how to use `steering_opt` to optimize steering vectors to induce different behaviors in language models. In this tutorial, we'll be optimizing a steering vector that causes the model to generate Spanish text instead of English text.

# Setup

Import required libraries.

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch



For our model, we'll be using Google's Gemma-2-2B base model. Note that you'll need to get access to it through HuggingFace before proceeding.

In [2]:
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b").to(dtype=torch.bfloat16) # load in bfloat16 to use less VRAM

Loading checkpoint shards: 100%|██████████| 3/3 [00:00<00:00, 43.18it/s]


Move everything to the GPU.

In [3]:
device = 'cuda'
torch.set_default_device(device)

model = model.to(device=device)

# Task definition: natural language switching

As mentioned earlier, in this walkthrough, we'll be looking into how to use a steering vector to cause the model to generate coherent Spanish-language text that continues an English-language prompt.

We'll do this by taking an English-language prompt, having the model generate a completion (in English) from this prompt, and then translating that completion into Spanish. Then, we'll optimize a steering vector to maximize the probability of that Spanish-language completion and minimize the probability of the English-language completion. 

First, we need to define our prompt. How about a plausible introduction to a recipe for baking a cake?

In [4]:
prompt = "Some of my fondest childhood memories are from my summer vacations back when I was little. Every now and then, after a long day of playing outside, I would come back home to be greeted with the delicious smell of my grandma's hazelnut cake wafting out of the kitchen. In this recipe, I'll teach you how to make that very cake, and create your own summer memories.\n\n"

Generate a completion from this model on this prompt.

In [5]:
generated_tokens = model.generate(**tokenizer(prompt, return_tensors='pt'), max_new_tokens=15)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0].replace(prompt, "")
print(generated_str)

<h2>Ingredients</h2>

* 1 cup of butter
* 1 cup


Alright, now let's translate this into Spanish.

In [33]:
en_completion = """<h2>Ingredients</h2>

* 1 cup of all-purpose flour"""

es_completion = """<h2>Ingredientes</h2>

* 1 taza de harina común"""

# Optimizing our steering vector

## Defining our `TrainingDatapoint`

At this point, we have a prompt, a completion whose probability we want to *decrease* (the English-language completion), and a completion whose probability we want to *increase* (the Spanish-language completion). This means that we have everything we need to optimize a steering vector.

We can wrap all of this information up in a `TrainingDatapoint`, as follows.

In [34]:
from steering_opt import steering_opt

datapoint = steering_opt.TrainingDatapoint(
    prompt,
    src_completions=[en_completion], # src_completions: list of completions whose probability we want to decrease
    dst_completions=[es_completion], # dst_completions: list of completions whose probability we want to increase
)

## Do the actual optimization

Now, we can use `optimize_vectors()` to optimize steering vectors from this datapoint, one for each layer that we choose to steer at.

There are a *lot* of options that you can pass to `optimize_vectors()`, but for now, we'll only be using a small subset of them.

In [35]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [36]:
len(model.model.layers)

26

In [115]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layers = list(range(len(model.model.layers))) # steer at every layer of the model
# layers = [11, 12]
vectors, loss_info = steering_opt.optimize_vectors(
    model, datapoints, layers,
    tokenizer=tokenizer, # for HuggingFace models, we have to pass the tokenizer as well
    max_iters=80, # stop after 80 optimization iterations
    lr=0.03, # set the optimizer learning rate; by default, it's 0.01
    regularization_weight=0.1,
    sparsity_weight=0.1,
    entropy_weight=10,
    regularization_type="l1",
)

`optimize_vectors()` returns both the optimized vectors (one per layer) and a dictionary containing information about the optimization process (e.g. the final loss and the norm of each layer's vector). Let's take a look at the dictionary.

In [116]:
for key, value in loss_info.items():
    if isinstance(value, dict):
        for subkey, subvalue in value.items():
            print(f"{key}.{subkey}: {subvalue}")
    else:
        print(key, value)

# print(loss_info)

iters 81
loss 51.664396584033966
norms.0: 0.4918878376483917
norms.1: 0.29488253593444824
norms.2: 0.2520160675048828
norms.3: 0.2498970329761505
norms.4: 0.2588783800601959
norms.5: 0.27675920724868774
norms.6: 0.29528510570526123
norms.7: 8.708969116210938
norms.8: 0.2607474625110626
norms.9: 0.2552742660045624
norms.10: 0.2467566579580307
norms.11: 0.24222281575202942
norms.12: 0.24830599129199982
norms.13: 0.23667150735855103
norms.14: 0.2337091714143753
norms.15: 0.23801501095294952
norms.16: 0.23790624737739563
norms.17: 0.22878965735435486
norms.18: 0.2325347512960434
norms.19: 0.22615653276443481
norms.20: 0.2264988273382187
norms.21: 0.22743211686611176
norms.22: 0.22459161281585693
norms.23: 0.2235570251941681
norms.24: 0.2202427089214325
norms.25: 0.22252701222896576


We see that the final loss was about 51.66, and that the optimized vectors have small norms (roughly 0.22 to 0.49) at every layer except layer 7, whose vector has a norm of about 8.71.

But does this vector induce the behavior that we care about?

# Steering with our vector

To test our vector, it's time to steer with it. There are two steps to this:
1. Use `make_steering_hook_hf()` to make the steering hook for our vector.
2. Pass the steering hook to the context manager `hf_hooks_contextmanager()` to run the model under the effect of steering.

Here's what this looks like in practice:

In [117]:
# Step 1: make the steering hook
# Find the layer with the highest norm vector
max_norm_layer = max(vectors.items(), key=lambda x: x[1].norm().item())[0]
max_norm_vector = vectors[max_norm_layer]
print(f"Using vector from layer {max_norm_layer} with norm {max_norm_vector.norm().item()}")

steering_hooks = [(max_norm_layer, steering_opt.make_steering_hook_hf(10*max_norm_vector))]
# steering_hooks = [(layer, steering_opt.make_steering_hook_hf(vector)) for layer, vector in vectors.items()]
# Our steering hook is a pair containing the layer at which we want to steer, followed by the steering hook function itself.
# make_steering_hook_hf(vector) returns a hook function that adds vector to the model's activations

# Step 2: run the steered model
# The context manager hf_hooks_contextmanager() runs the model under the influence of different hooks.
# Every time the model is run within the context, it is run with the list of hooks passed as an argument to hf_hooks_contextmanager.
# Right now, we're only running with our single steering hook.
with steering_opt.hf_hooks_contextmanager(model, steering_hooks): 
    # generate a steered completion
    
    generated_tokens = model.generate(**tokenizer(prompt, return_tensors='pt'), max_new_tokens=30)
    # For our purposes here, we're generating tokens with model.generate(),
    #  but you can call any function of the model (or even do backprop through it),
    #  and the context manager will take care of steering with it

generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

Using vector from layer 7 with norm 8.708969116210938
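To make the hook-plus-context-manager pattern more concrete, here is a minimal, framework-free sketch. Everything in it (`ToyModel`, `make_add_hook`, `hooks_context`) is a hypothetical stand-in written for illustration; it is not how `steering_opt` or HuggingFace implement hooks, only a toy version of the same idea: a hook rewrites a layer's output, and the context manager guarantees the hook is removed afterwards.

```python
from contextlib import contextmanager

class ToyModel:
    """Hypothetical stand-in for a transformer: each 'layer' adds 1.0 to every activation."""
    def __init__(self, n_layers):
        self.n_layers = n_layers
        self.hooks = {}  # layer index -> list of hook functions

    def forward(self, x):
        for layer in range(self.n_layers):
            x = [xi + 1.0 for xi in x]            # the layer's own computation
            for hook in self.hooks.get(layer, []):
                x = hook(x)                        # hooks rewrite the layer's output
        return x

def make_add_hook(vector):
    """Return a hook that adds `vector` to the activations."""
    def hook(x):
        return [xi + vi for xi, vi in zip(x, vector)]
    return hook

@contextmanager
def hooks_context(model, layer_hook_pairs):
    """Install (layer, hook) pairs on entry and always remove them on exit."""
    for layer, hook in layer_hook_pairs:
        model.hooks.setdefault(layer, []).append(hook)
    try:
        yield model
    finally:
        for layer, hook in layer_hook_pairs:
            model.hooks[layer].remove(hook)

toy = ToyModel(n_layers=3)
x = [0.0, 0.0]
unsteered = toy.forward(x)
with hooks_context(toy, [(1, make_add_hook([10.0, -10.0]))]):
    steered = toy.forward(x)      # the hook is active only inside the `with` block
after = toy.forward(x)            # back to unsteered behavior
print(unsteered, steered, after)
```

Removing the hooks in a `finally` clause means the model goes back to its unsteered behavior even if generation raises an exception inside the `with` block.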


The moment of truth: what is the completion that the steered model generated?

In [118]:
print(generated_str)

Some of my fondest childhood memories are from my summer vacations back when I was little. Every now and then, after a long day of playing outside, I would come back home to be greeted with the delicious smell of my grandma's hazelnut cake wafting out of the kitchen. In this recipe, I'll teach you how to make that very cake, and create your own summer memories.

<h2>Ingredientes</h2>

* 1 taza de harina de trigo
* 1 taza de azúcar
* 1 taza de leche
* 1


So far, on the training prompt, this steering vector works! But does it generalize to other prompts?

In [119]:
prompt2 = """My favorite sports are"""
max_new_tokens = 35

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
with steering_opt.hf_hooks_contextmanager(model, steering_hooks): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
My favorite sports are football and basketball. I love to play them and watch them. I also like to play baseball and golf. I like to play them because they are fun and I like to

--- Steered generation ---
My favorite sports are el fútbol y el baloncesto.

El fútbol es un deporte que se juega con una pelota y con 11 jugadores por equipo.

El baloncesto es un deporte que se


It generalizes to this prompt! What about the other way around -- if we subtract the vector from the model's activations, then does this cause the model to switch from a Spanish prompt to generating an English completion?

In [120]:
prompt2 = """Unos de mis deportes favoritos incluyen"""
max_new_tokens = 30

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
steering_hooks = [(max_norm_layer, steering_opt.make_steering_hook_hf(-10*max_norm_vector))]
with steering_opt.hf_hooks_contextmanager(model, steering_hooks): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
Unos de mis deportes favoritos incluyen el golf, el tenis y el fútbol.

Me gusta jugar al golf porque es un deporte que se puede jugar en cualquier lugar.

Me gusta

--- Steered generation ---
Unos de mis deportes favoritos incluyen:

* <strong>Tennis</strong> – I love the game of tennis. I love the way it is played. I love the way it is played


Yes, it does!

How about one more prompt?

In [121]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 30

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
with steering_opt.hf_hooks_contextmanager(model, steering_hooks): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
"How dare you cheat on me with him!" Jim roared.

"I'm sorry, Jim. I didn't mean to. I just wanted to be with him."

"You're a

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"I'm not cheating on you, I'm just not interested in you!"

"You're not interested in me? You


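Hmm, on this prompt the steered completion is coherent but still in English (when performing greedy sampling). Note that `steering_hooks` still holds the negated vector (`-10*max_norm_vector`) from the previous cell, so this run pushes the model away from Spanish rather than toward it.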

In [122]:
print("--- Steered generation ---")
with steering_opt.hf_hooks_contextmanager(model, steering_hooks): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=True, temperature=0.8)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared. "You have only yourself to blame."

"Jim." I said, trying to keep my voice quiet.

"What?"

"That'


With non-deterministic sampling, we get a different completion, although it is still in English.

We can also decrease the steering strength by multiplying the vector by a constant less than 1. Maybe that'll make deterministic sampling work better?

In [None]:
print("--- Steered generation ---")
steering_hook = (max_norm_layer, steering_opt.make_steering_hook_hf(0.5*max_norm_vector)) # scale vector by 0.5
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=False)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"I'm sorry, Jim. I didn't mean to. I just wanted to be with him."

"You're a


Hmm, here the completion is identical to the unsteered one, so the vector has no visible effect on this prompt.

# A detailed look at some more steering options

`optimize_vector()` provides a large number of options that can be used to optimize our steering vector with different constraints. Here's an overview of some of them.

## Norm-constrained steering

Previously when we optimized our steering vector, we allowed our vector's norm to get arbitrarily large. But we might instead want to constrain the norm of our vector, to prevent steering from inducing overly large changes in the model's behavior.

We can perform this *norm-constrained steering* by using the `max_norm` and `starting_norm` parameters to `optimize_vector()`. `max_norm` limits the maximum norm of the vector, and `starting_norm` sets the norm that the vector is initialized to at the beginning of optimization.

In [17]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    max_iters=20,
    lr=0.1,

    max_norm=20
)

In [18]:
print(loss_info)

{'iters': 21, 'loss': 6.578125, 'norm': 20.0}


Notice that the norm has been capped at exactly 20, and that the final loss is about 6.58.
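One common way to enforce a norm cap like this is to rescale the vector back onto the norm ball whenever it grows too large. The sketch below shows that projection in plain Python; whether `optimize_vector()` enforces `max_norm` in exactly this way is an assumption, and `project_to_max_norm` is a hypothetical helper, not part of `steering_opt`.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def project_to_max_norm(v, max_norm):
    """Rescale v so that its norm is at most max_norm; its direction is unchanged."""
    n = l2_norm(v)
    if n <= max_norm:
        return list(v)           # already inside the ball: leave as-is
    scale = max_norm / n
    return [x * scale for x in v]

v = [30.0, 40.0]                              # norm 50
capped = project_to_max_norm(v, 20.0)         # same direction, norm 20
untouched = project_to_max_norm([3.0, 4.0], 20.0)
print(l2_norm(capped), untouched)
```

Because the projection only changes the vector's length, a capped vector that ends with norm exactly equal to `max_norm` (as in the output above) suggests the optimizer would have preferred a larger norm.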

How well does this vector perform?

In [19]:
prompt2 = """My favorite programming languages are"""
max_new_tokens = 35

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
My favorite programming languages are C and C++. I have been programming in C since 1985 and in C++ since 1990. I have been teaching C and C++

--- Steered generation ---
My favorite programming languages are:

* C
* C++
* Java
* Python
* JavaScript
* PHP
* Ruby
* Perl
* Bash
* Go
* Rust


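The steered completion switches to a bulleted list, though since it contains only language names, it's hard to tell from this prompt whether the model has switched to Spanish. What about another?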

In [20]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 30

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
"How dare you cheat on me with him!" Jim roared.

"I'm sorry, Jim. I didn't mean to. I just wanted to be with him."

"You're a

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"I'm sorry, Jim. I didn't mean to. I was just drunk."

"You're drunk all the time


Hmm -- looks like the lower norm has made the vector less effective on this prompt.

## Early stopping with `target_loss`

So far, we've been automatically stopping the optimization process after a fixed number of iterations (set by `max_iters`). But we might instead want to perform early stopping when the loss dips below a certain threshold, rather than when a specific number of iterations have completed. We can use the `target_loss` parameter to stop optimization whenever the loss goes below `target_loss`.

In [21]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=3 # stop optimization when loss < 3
)

In [22]:
print(loss_info)

{'iters': 22, 'loss': 2.8515625, 'norm': 48.98884963989258}


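This time, optimization ran for one more step than last time (22 iterations instead of 21) and without a norm cap, so the loss is lower (2.85 versus 6.58) while the norm has grown to about 49.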
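The difference between stopping on an iteration budget and stopping on a loss threshold can be sketched with a toy gradient-descent loop. This is only an illustration on a one-dimensional quadratic; the function `optimize` below is hypothetical, and its parameter names merely mirror `max_iters` and `target_loss` for readability.

```python
def optimize(loss_fn, grad_fn, x0, lr, max_iters=None, target_loss=None):
    """Toy gradient descent with two optional stopping rules."""
    if max_iters is None and target_loss is None:
        raise ValueError("need at least one stopping rule")
    x, iters = x0, 0
    loss = loss_fn(x)
    while True:
        if target_loss is not None and loss < target_loss:
            break                                # early stopping: loss is low enough
        if max_iters is not None and iters >= max_iters:
            break                                # iteration budget exhausted
        x = x - lr * grad_fn(x)
        iters += 1
        loss = loss_fn(x)
    return x, {"iters": iters, "loss": loss}

loss_fn = lambda x: (x - 3.0) ** 2               # minimized at x = 3
grad_fn = lambda x: 2.0 * (x - 3.0)

_, fixed_info = optimize(loss_fn, grad_fn, x0=0.0, lr=0.1, max_iters=3)
_, early_info = optimize(loss_fn, grad_fn, x0=0.0, lr=0.1, target_loss=1.0)
print(fixed_info, early_info)
```

With a threshold, the number of iterations is an outcome of the run rather than an input, which is why the iteration count reported in `loss_info` can differ from run to run.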

How well does this vector perform?

In [23]:
prompt2 = """My favorite programming languages are"""
max_new_tokens = 35

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
My favorite programming languages are C and C++. I have been programming in C since 1985 and in C++ since 1990. I have been teaching C and C++

--- Steered generation ---
My favorite programming languages are:

* C
* C++
* Java
* Python
* JavaScript
* PHP
* HTML
* CSS
* SQL
* Bash
* Perl


## Clamp steering

One steering method often seen in the literature is "clamp steering" (e.g. see [Templeton et al. 2024](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#appendix-methods-steering), [Marshall et al. 2024](https://arxiv.org/pdf/2411.09003)), where instead of adding a vector to all activations, we first *ablate* that direction from the activations (i.e. project the activations onto the orthogonal complement of that vector), and then add the vector to the ablated activations. 

We can train a steering vector with clamp steering by passing the `vector_clamp` argument to `optimize_vector()`. The value of this argument tells us how much we should scale the vector when we add it to the ablated activations; usually, it can be simply set to 1. 

In [24]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=3, # stop optimization when loss < 3
    vector_clamp=1 # perform clamp steering
)

In [37]:
print(loss_info)

{'iters': 12, 'loss': 2.9221372604370117, 'norm': 25.416400909423828}


Once we've optimized our steering vector, to perform clamp steering with it, we make a steering hook as follows:

    steering_opt.make_steering_hook_hf(vector, steering_opt.make_abl_mat(vector))

What's going on here? Well, `steering_opt.make_steering_hook_hf(vector, matrix)` steers model activations `x` by first multiplying `x` by `matrix`, then adding `vector` to the result, and then adding that back to the original activations `x`. Additionally, `steering_opt.make_abl_mat(vector)` makes a matrix that projects a vector onto the orthogonal complement of `vector`. Thus, this line creates a hook that performs clamp steering as discussed above.
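As a sanity check on the math, here is a small plain-Python sketch of clamp steering as described above: remove the component of the activations along `v`, then add a scaled copy of `v`. The helpers `ablate` and `clamp_steer` are hypothetical and written for illustration only. In the `x + Mx + v` form used by the hook, this corresponds to `M = -v vᵀ / ‖v‖²`; whether `make_abl_mat()` returns exactly this matrix is an assumption about the library.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def ablate(x, v):
    """Project x onto the orthogonal complement of v (remove x's component along v)."""
    coeff = dot(x, v) / dot(v, v)
    return [xi - coeff * vi for xi, vi in zip(x, v)]

def clamp_steer(x, v, clamp=1.0):
    """Ablate the direction v from x, then add clamp * v."""
    return [ai + clamp * vi for ai, vi in zip(ablate(x, v), v)]

v = [2.0, 0.0]
x1 = [5.0, 1.0]     # large positive component along v
x2 = [-3.0, 1.0]    # negative component along v

steered1 = clamp_steer(x1, v)
steered2 = clamp_steer(x2, v)
print(steered1, steered2)
```

Both inputs end up with the same component along `v`, regardless of where they started. That is the sense in which the direction is "clamped", in contrast to plain additive steering, which shifts every activation by the same amount.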

In [39]:
prompt2 = """My favorite programming languages include"""
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector, steering_opt.make_abl_mat(vector)))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
My favorite programming languages include:

* <strong>Python</strong>: es un lenguaje de programación de alto nivel, orientado a objetos, multiplataforma y con gran capacidad de interoperabilidad. Es


In [40]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector, steering_opt.make_abl_mat(vector)))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"Jim, no es necesario que seas tan violento. No es necesario que te enojes. No es necesario que te enojes. No es necesario que te enojes


## Affine steering

In affine steering, instead of merely adding a steering vector to the model's activations, we also add the result of multiplying the activations by a matrix. In math, for activations $x$, steering vector $v$, and steering matrix $M$, applying affine steering maps $x \mapsto x + Mx + v$. This can yield more expressive steering than adding the same vector to all activations.

When $M$ is low-rank, then this is equivalent to applying a LoRA to the activations, and has been used in steering approaches such as [MELBO](https://www.alignmentforum.org/posts/ioPnHKFyy4Cw2Gr2x/).

To optimize a steering matrix in addition to a steering vector, we pass the argument `affine_rank` to `optimize_vector()`. This is the rank of the steering matrix that we'll be optimizing. We can also use `max_affine_norm` to constrain the norm of the steering matrix; by default, this value is set to 2.

* (Technical note: following [MELBO](https://www.alignmentforum.org/posts/ioPnHKFyy4Cw2Gr2x/), we constrain our steering matrix norm by constraining the column norms of the low-rank factor matrices whose product is the steering matrix.)
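The map $x \mapsto x + Mx + v$ with a low-rank $M$ can be written out in a few lines of plain Python. In this sketch the steering matrix is built as $M = AB^\top$ from two factor matrices with `r` columns each; the names `A` and `B` and the helper functions are assumptions made for illustration, not the library's internal representation.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def low_rank_matrix(A, B):
    """Build M = A @ B^T from two d x r factor matrices (lists of rows)."""
    d, r = len(A), len(A[0])
    return [[sum(A[i][k] * B[j][k] for k in range(r)) for j in range(d)]
            for i in range(d)]

def affine_steer(x, M, v):
    """Apply x -> x + Mx + v."""
    Mx = matvec(M, x)
    return [xi + mi + vi for xi, mi, vi in zip(x, Mx, v)]

# Rank-1 example in 3 dimensions: M copies coordinate 1 of x into coordinate 0.
A = [[1.0], [0.0], [0.0]]
B = [[0.0], [1.0], [0.0]]
M = low_rank_matrix(A, B)

x = [1.0, 2.0, 3.0]
v = [0.5, 0.0, 0.0]
steered = affine_steer(x, M, v)
print(M, steered)
```

Unlike the vector term, the `Mx` term depends on the input activations, so different token positions receive different shifts. This is where the extra expressiveness comes from.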

In [20]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

# note that when we perform affine steering, optimize_vector() also returns a matrix
vector, matrix, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=2, # stop optimization when loss < 2
    
    max_norm=1, # because we're doing affine steering, we can afford to make our vector's norm really low

    affine_rank=2, # rank 2 steering matrix
    max_affine_norm=2, # constrain the column norms
)

In [21]:
print(loss_info)

{'iters': 15, 'loss': 1.7517473101615906, 'norm': 1.0}


Now, to perform affine steering, we make a steering hook with the code `steering_opt.make_steering_hook_hf(vector, matrix)`.


In [22]:
prompt2 = """One of my favorite programming languages is"""
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector, matrix))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
One of my favorite programming languages is Python, y lo que me gusta de Python es que es un lenguaje muy sencillo de aprender, y que además es muy popular, por lo que hay muchas librerías y herramientas


Another prompt?

In [26]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector, matrix))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=False)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared. "No, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no,


Once again, we see the same "loop" output when steering on this prompt -- although if we do non-deterministic sampling, we get better results, as we can see below.

In [25]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector, matrix))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=True)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared. Lainey comenzó a sentir algo de culpa al escuchar a Jim decir semejante barbaridad. Jim le apretó el brazo haciendo que ella se tambaleara cada vez más. Esto


# Additional utility functions

## Beam search sampling with `sample_most_likely_completions_hf()`

We can use this function to greedily sample the `k` most likely completions on a given prompt.
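For intuition about what a beam search of this kind does, here is a toy sketch over a three-token vocabulary. The "language model" `next_token_probs` is a hypothetical stand-in, and the details (ranking beams by cumulative log probability, applying `coldness` as an exponent before renormalizing) are assumptions made for illustration; the library's implementation may differ.

```python
import math

def next_token_probs(prefix):
    """Hypothetical toy language model over the vocabulary {a, b, c}."""
    if prefix and prefix[-1] == "a":
        return {"a": 0.1, "b": 0.7, "c": 0.2}
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def apply_coldness(probs, coldness):
    """Sharpen (coldness > 1) or flatten (coldness < 1) a distribution."""
    scaled = {t: p ** coldness for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: p / total for t, p in scaled.items()}

def beam_search(k, iters, coldness=1.0):
    """Keep the k most likely prefixes at every step."""
    beams = [((), 0.0, [])]  # (tokens, cumulative log prob, per-token probs)
    for _ in range(iters):
        candidates = []
        for tokens, logp, probs in beams:
            dist = apply_coldness(next_token_probs(tokens), coldness)
            for tok, p in dist.items():
                candidates.append((tokens + (tok,), logp + math.log(p), probs + [p]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]
    return beams

beams = beam_search(k=2, iters=2)
for tokens, logp, probs in beams:
    print("".join(tokens), "has per-token probabilities of", probs)
```

Note that completions sharing a prefix also share their leading per-token probabilities, which is the same pattern visible in the real outputs in this section.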

In [86]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

completions, per_token_probabilities = steering_opt.sample_most_likely_completions_hf(model, tokenizer, prompt2,
    k=5, # the number of completions to sample
    iters=5, # the number of tokens per completion
    coldness=1 # the "inverse temperature" parameter; higher is more coherent and less diverse
)

for completion, cur_per_token_probabilities in zip(completions, per_token_probabilities):
    print(repr(completion), "has per-token probabilities of", cur_per_token_probabilities)

' "I\'m going to' has per-token probabilities of [0.24844663 0.20796537 0.29111552 0.4606536  0.30983052 0.97743565]
'\n"I\'m sorry' has per-token probabilities of [0.04892204 0.62223816 0.21331146 0.24881333 0.90290201 0.53300309]
' "I\'ll kill you' has per-token probabilities of [0.24844663 0.20796537 0.29111552 0.35875738 0.38527605 0.45445925]
' "I\'ll kill him' has per-token probabilities of [0.24844663 0.20796537 0.29111552 0.35875738 0.38527605 0.40105888]
'\n\n"Jim, I\'' has per-token probabilities of [0.40961924 0.5759691  0.12330736 0.61101997 0.18996388 0.31368747]


Of course, we can also use this function in conjunction with a steering vector.

In [87]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer, # for HuggingFace models, we have to pass the tokenizer as well
    max_iters=20, # stop after 20 optimization iterations
    lr=0.1 # set the optimizer learning rate; by default, it's 0.01
)

In [88]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'

steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    completions, per_token_probabilities = steering_opt.sample_most_likely_completions_hf(model, tokenizer, prompt2,
        k=5, # the number of completions to sample
        iters=5, # the number of tokens per completion
        coldness=1 # the "inverse temperature" parameter; higher is more coherent and less diverse
    )

for completion, cur_per_token_probabilities in zip(completions, per_token_probabilities):
    print(repr(completion), "has per-token probabilities of", cur_per_token_probabilities)

' Jim estaba muy enfadado' has per-token probabilities of [0.02752856 0.14680465 0.13185401 0.24969326 0.94008851 0.89791393]
'\n"No, no,' has per-token probabilities of [0.08479393 0.53339511 0.14335513 0.4459767  0.36573717 0.70311207]
' Jim estaba furioso.\n\n' has per-token probabilities of [0.02752856 0.14680465 0.35841632 0.97840697 0.37467608 0.33672404]
' Jim estaba furioso. "' has per-token probabilities of [0.02752856 0.14680465 0.35841632 0.97840697 0.37467608 0.20423345]
' Jim estaba muy enojado con' has per-token probabilities of [0.02752856 0.14680465 0.13185401 0.32061249 0.98082364 0.19925421]


## Getting the log probabilities of a sequence with `get_completion_logprob_hf()`

When we perform early stopping with the `target_loss` parameter, how should we know what loss to stop at? To answer this, first note that the loss function for completions in `dst_completions` is the negative log probability of each completion; similarly, the loss function for completions in `src_completions` is the negative log probability of the complement of each completion.

Thus, if we have an example of a completion which is likely on a prompt, then we can look at the negative log probability of that completion and use it as the target loss.
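To make the two loss terms concrete, here is a small sketch that computes them from per-token probabilities. The helpers `dst_loss` and `src_loss` are hypothetical, and taking the complement token by token (summing `log(1 - p)` over the tokens of the completion) is an assumption about how the library defines it.

```python
import math

def dst_loss(token_probs):
    """Negative log probability of a completion we want to make MORE likely."""
    return -sum(math.log(p) for p in token_probs)

def src_loss(token_probs):
    """Negative log probability of the complement, taken token by token
    (assumed form), for a completion we want to make LESS likely."""
    return -sum(math.log(1.0 - p) for p in token_probs)

# Toy per-token probabilities under some steered model (made-up numbers).
es_probs = [0.5, 0.25]   # Spanish completion tokens
en_probs = [0.5, 0.75]   # English completion tokens

total_loss = dst_loss(es_probs) + src_loss(en_probs)
print(dst_loss(es_probs), src_loss(en_probs), total_loss)
```

Both terms reach zero in the ideal case: `dst_loss` when every destination token has probability 1, and `src_loss` when every source token has probability 0.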

Let's see how to do this in the context of our English-to-Spanish example.

First, as a refresher, let's remind ourselves of what the prompt was, and what the English and Spanish completions that we're interested in are.

In [127]:
print(prompt)
print('---')
print(en_completion)
print('---')
print(es_completion)

Some of my fondest childhood memories are from my summer vacations back when I was little. Every now and then, after a long day of playing outside, I would come back home to be greeted with the delicious smell of my grandma's hazelnut cake wafting out of the kitchen. In this recipe, I'll teach you how to make that very cake, and create your own summer memories.


---
<h2>Ingredients</h2>

* 1 cup of all-purpose flour
---
<h2>Ingredientes</h2>

* 1 taza de harina común


Next, get the log probability of the English completion on this prompt.

In [94]:
en_completion_logprob = steering_opt.get_completion_logprob_hf(model, prompt, en_completion, tokenizer).item()
print(en_completion_logprob)

-11.282639503479004


Now, get the log probability of the *complement* of the Spanish completion on this prompt. (We can do this by setting the `do_one_minus` argument to True.)

In [96]:
es_completion_logprob = steering_opt.get_completion_logprob_hf(model, prompt, es_completion, tokenizer, do_one_minus=True).item()
print(es_completion_logprob)

-7.581232070922852


Time to optimize a vector! For our target loss, we'll choose the minimum of the negative log probability of the English completion and the negative log probability of the complement of the Spanish completion. This is because we want the Spanish completion on the steered model to be at least as probable as the English completion on the unsteered model, and we want the English completion on the steered model to be at least as improbable as the Spanish completion on the unsteered model.

We'll also set the argument `do_target_loss_sum=False`. By default, when performing early stopping with `target_loss`, `optimize_vector()` checks to see if the sum of all completions' losses is less than the target loss. But when `do_target_loss_sum` is set to False, `optimize_vector()` stops when all completions' losses are (individually) less than the target loss.
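The difference between the two stopping rules can be sketched in a few lines. The helper `should_stop` below is hypothetical and is not part of `steering_opt`; it only mirrors the behavior described above. The two log probabilities are copied from the outputs printed earlier, while the per-completion losses are made-up values chosen for illustration.

```python
# Hypothetical helper illustrating the two early-stopping rules described above.
# This is a sketch, not steering_opt's actual implementation.
def should_stop(completion_losses, target_loss, do_target_loss_sum=True):
    if do_target_loss_sum:
        # Default rule: compare the sum of all completions' losses to the target.
        return sum(completion_losses) < target_loss
    # Alternative rule: every completion's loss must individually be below the target.
    return all(loss < target_loss for loss in completion_losses)

# Log probabilities copied from the printed outputs above.
en_completion_logprob = -11.282639503479004
es_completion_logprob = -7.581232070922852
target_loss = min(-en_completion_logprob, -es_completion_logprob)

# Made-up per-completion losses, purely for illustration.
illustrative_losses = [5.0, 7.0]
stop_on_sum = should_stop(illustrative_losses, target_loss, do_target_loss_sum=True)
stop_individually = should_stop(illustrative_losses, target_loss, do_target_loss_sum=False)
print(target_loss, stop_on_sum, stop_individually)
```

With these illustrative numbers, the summed loss (12.0) is still above the target, so the default rule would keep optimizing, whereas the per-completion rule would already stop.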

In [98]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=min(-en_completion_logprob, -es_completion_logprob),
    do_target_loss_sum=False
)

In [102]:
print(loss_info)

{'iters': 10, 'loss': [[[4.670318603515625], [7.248974800109863]]], 'norm': 32.25703048706055}


Notice that when we set `do_target_loss_sum=False`, the returned loss information tells us the losses for each individual completion.

(Specifically, `loss_info['loss']` is a nested list where `loss_info['loss'][i][j][k]` is the loss for the i-th datapoint, looking at source completions when `j == 0` and destination completions when `j == 1`, and choosing the k-th completion.)
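As a small illustration of that indexing scheme, the snippet below walks the nested list from the output printed above. The `kinds` labels and the `rows` list are names introduced here for readability only.

```python
# loss_info copied from the printed output above.
loss_info = {'iters': 10, 'loss': [[[4.670318603515625], [7.248974800109863]]], 'norm': 32.25703048706055}

# j == 0 indexes source completions, j == 1 indexes destination completions.
kinds = ['source', 'destination']

rows = []
for i, datapoint_losses in enumerate(loss_info['loss']):
    for j, completion_losses in enumerate(datapoint_losses):
        for k, loss in enumerate(completion_losses):
            rows.append((i, kinds[j], k, loss))
            print(f"datapoint {i}, {kinds[j]} completion {k}: loss = {loss:.4f}")
```

Both losses are below the target loss of roughly 7.58 computed earlier, which is consistent with optimization having stopped under the per-completion rule.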

In [100]:
prompt2 = """My favorite programming language is"""
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
My favorite programming language is elisp.

<h2>¿Por qué Emacs?</h2>

* Es un editor de texto muy potente.
* Tiene una gran cantidad de plugins.
* Es muy


# \[Not well-tested\] Experimental steering optimization methods

In this section, we present a couple of novel methods for steering optimization that are still largely untested and/or somewhat finicky, but which we are currently investigating for use in more foundational research. Most users will probably be best served by ignoring this section, but for the sake of completeness, these steering methods are included anyway.

## Noisy steering

When steering on a single training example (or a small number of such), there's a risk that the resulting steering vector doesn't generalize well to other inputs. To address this, we introduce "noisy steering", a novel regularization method for steering optimization. It's a simple idea: add Gaussian noise to the steering vector at each optimization timestep. The idea is that by doing so, we can "simulate" training on more dataset examples (and hopefully get a better-generalizing steering vector).

To perform noisy steering, pass the `noise_scale` parameter to `optimize_vector()`. This parameter determines the scale that the noise is multiplied by.

Additionally, when performing target loss early stopping with noisy steering, the added noise might cause the loss to dip below the target loss for a single iteration, even when the vector alone would not be sufficient. To guard against this, the `target_loss_target_iters` argument defines how many consecutive iterations the loss must stay below the target loss before early stopping kicks in.
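To make the two ideas concrete, here is a minimal sketch on a toy quadratic loss, using plain Python rather than a language model. Everything in it (`noisy_optimize`, `toy_loss_and_grad`, the toy target) is hypothetical and only assumes the behavior described above: noise is added to the vector before each loss evaluation, and early stopping requires several consecutive iterations below the target. It is not how `steering_opt` is implemented internally.

```python
import random

def noisy_optimize(loss_and_grad, dim, lr=0.1, noise_scale=0.1,
                   target_loss=1.0, target_loss_target_iters=5,
                   max_iters=200, seed=0):
    """Toy gradient descent that adds Gaussian noise to the vector at every step."""
    rng = random.Random(seed)
    vector = [0.0] * dim
    consecutive_below = 0
    loss = float('inf')
    for iteration in range(1, max_iters + 1):
        # Evaluate the loss at a noisy copy of the vector...
        noisy_vector = [v + noise_scale * rng.gauss(0.0, 1.0) for v in vector]
        loss, grad = loss_and_grad(noisy_vector)
        # ...but apply the resulting gradient to the clean vector.
        vector = [v - lr * g for v, g in zip(vector, grad)]
        # Count how many iterations in a row the (noisy) loss was below the target.
        consecutive_below = consecutive_below + 1 if loss < target_loss else 0
        if consecutive_below >= target_loss_target_iters:
            break
    return vector, iteration, loss

def toy_loss_and_grad(x, target=(1.0, 1.0, 1.0, 1.0)):
    """Squared distance to a fixed point, standing in for the real steering loss."""
    diffs = [xi - ti for xi, ti in zip(x, target)]
    return sum(d * d for d in diffs), [2.0 * d for d in diffs]

vector, iters, final_loss = noisy_optimize(toy_loss_and_grad, dim=4, target_loss=0.5)
print(iters, final_loss)
```

Because the counter resets whenever the noisy loss rises above the target, a single lucky noise draw cannot trigger early stopping on its own.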

In [42]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=3, # stop optimization when loss < 3
    max_norm=30, # constrain vector to have maximum norm 30

    noise_scale=0.1, # add Gaussian noise with scale 0.1. You may need to spend some time tuning this hyperparameter.
    target_loss_target_iters=5, # only perform early stopping when loss remains below target loss for at least 5 consecutive iterations
)

In [43]:
print(loss_info)

{'iters': 34, 'loss': 2.5394238233566284, 'norm': 29.999998092651367}


This time, we've optimized our vector for more steps, but the loss is lower than last time.

How well does this vector perform?

In [45]:
prompt2 = """My favorite programming language is"""
max_new_tokens = 35

print("--- Unsteered generation ---")
generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
print(generated_str)
print()

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Unsteered generation ---
My favorite programming language is Python. I love the simplicity of the language and the fact that it is easy to learn. I also love the fact that it is a general-purpose language, which means

--- Steered generation ---
My favorite programming language is Python.

<h2>¿Por qué?</h2>

* Es muy fácil de aprender.
* Es muy fácil de usar.
* Es muy fácil de entender.
*


In [47]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=False)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"Jim, no es justo que te enojes así, ¿no?"

"Jim, no es justo que te enojes así, ¿no?"

"Jim


This example required a lot of hyperparameter tuning to get working, and it shows no immediately visible benefit over the other steering vectors. But we have some unpublished preliminary results suggesting that noisy steering can yield steering vectors that better utilize the same causal pathways that actual data does, when compared to non-noisy steering. Keep your eyes peeled for more results -- but in the meantime, you're probably better off just using normal steering.

## Output-constrained steering

Ideally, we want our steering vectors' norms to be low. But norm-constrained steering requires us to choose the maximum norm by hand. Output-constrained steering optimization is a novel optimization process that aims to address this. It operates in two stages:
1. In the first stage, we optimize a steering vector like normal, without any norm constraints, and perform early stopping when the loss dips below a target loss.
2. In the second stage, we perform constrained optimization to minimize the *norm* of the vector without increasing the loss above the target loss.
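The two stages above can be sketched on a toy problem. Note the assumptions: the loss is a simple quadratic standing in for the real steering loss, the starting vector is deliberately oversized so that there is something to shrink, and the second stage uses a naive "shrink, then reject the step if the loss reaches the target" rule in place of whatever constrained optimizer `steering_opt` actually uses. All function names here are hypothetical.

```python
import math

def toy_loss_and_grad(v, target=(3.0, 4.0)):
    """Squared distance to a fixed point, standing in for the real steering loss."""
    diffs = [vi - ti for vi, ti in zip(v, target)]
    return sum(d * d for d in diffs), [2.0 * d for d in diffs]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def two_stage_optimize(v, target_loss, lr=0.1, shrink=0.01,
                       max_iters=1000, max_constr_iters=1000):
    # Stage 1: ordinary gradient descent, stopping once the loss is below the target.
    for _ in range(max_iters):
        loss, grad = toy_loss_and_grad(v)
        if loss < target_loss:
            break
        v = [vi - lr * gi for vi, gi in zip(v, grad)]
    stage1_vector = list(v)

    # Stage 2 (naive stand-in): repeatedly shrink the vector's norm by a small
    # factor, and stop as soon as shrinking further would push the loss up to
    # the target or beyond.
    for _ in range(max_constr_iters):
        candidate = [(1.0 - shrink) * vi for vi in v]
        candidate_loss, _ = toy_loss_and_grad(candidate)
        if candidate_loss >= target_loss:
            break
        v = candidate
    return stage1_vector, v

stage1_vector, final_vector = two_stage_optimize([6.0, 8.0], target_loss=1.0)
print(norm(stage1_vector), norm(final_vector), toy_loss_and_grad(final_vector)[0])
```

In this toy setting, the second stage lowers the norm while keeping the loss under the target, which is the same trade-off the real method aims for.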

To perform output-constrained steering optimization, set the `do_output_constr` argument to True. You can also set a different learning rate for the output-constrained optimization phase with the argument `output_constr_lr`. And you can limit the number of iterations spent in the output-constrained optimization phase with the argument `max_output_constr_iters`.

Let's compare the norm of a vector optimized to hit target loss 2.5 with vs. without output-constrained optimization.

In [73]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=2.5, # stop optimization when loss < 2.5
)

In [74]:
print(loss_info)

{'iters': 25, 'loss': 2.1099530458450317, 'norm': 51.504661560058594}


In [75]:
datapoints = [datapoint] # a list of datapoints to optimize on; for now, only one datapoint
layer = 10 # the layer that we want to steer at

vector, loss_info = steering_opt.optimize_vector(
    model, datapoints, layer,
    tokenizer=tokenizer,
    lr=0.1,

    target_loss=2.5, # stop optimization when loss < 2.5

    do_output_constr=True, # time for output-constrained optimization
    output_constr_lr=0.059, # use a different learning rate for output-constrained optimization
    max_output_constr_iters=100, # stop output-constrained optimization after 100 iterations
)

In [77]:
print(loss_info)

{'iters': 24, 'loss': 2.4076424837112427, 'norm': 43.42930603027344, 'output_constr_iters': 101}


After 101 iterations of output-constrained steering, we've decreased our vector's norm quite a bit, and we're still below the target loss of 2.5.

But note that this took quite a bit of hyperparameter tuning to come up with the value for `output_constr_lr` used. (Any higher and the optimization process would "overshoot" and start increasing the norm; at this value, norm consistently decreases, but at a glacial pace.)

Now, let's test our vector.

In [78]:
prompt2 = """My favorite programming language is"""
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
My favorite programming language is elvish.

Elvish es un lenguaje de programación que se basa en el concepto de que el mundo es un juego de rol.

El lenguaje de programación es un lenguaje


In [79]:
prompt2 = '"How dare you cheat on me with him!" Jim roared.'
max_new_tokens = 35

print("--- Steered generation ---")
steering_hook = (layer, steering_opt.make_steering_hook_hf(vector))
with steering_opt.hf_hooks_contextmanager(model, [steering_hook]): 
    generated_tokens = model.generate(**tokenizer(prompt2, return_tensors='pt'), max_new_tokens=max_new_tokens, do_sample=False)
    generated_str = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
    print(generated_str)

--- Steered generation ---
"How dare you cheat on me with him!" Jim roared.

"No, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no


Hmm, doesn't seem to be working great.