
Investigate and play with "steering vectors" post (paper) #1460

Closed
Azeirah opened this issue May 14, 2023 · 8 comments

@Azeirah (Contributor) commented May 14, 2023

I just read this recently released post about the idea of steering vectors.

What?

The idea of a steering vector is to add some precomputed data to your inference to "steer" the model in a certain direction, i.e., give it a certain "mood" or "style". For example, you can add the steering vector "Love" to make your LLM give more loving output.

Example: (screenshot omitted)

Some more detail

In short, a steering vector is a snapshot of the model's activations for a prompt at a certain layer. For example, if you prompt "I like dogs", you can obtain a steering vector by storing the network's output at a layer of your choosing, say layer 2 or 10.

With a steering vector, you can change the "direction" of a prompt. By adding a steering vector to later prompts, you make the model much more likely to output things related to the steering vector, e.g., its love of dogs.

If you prompt "the animal I like most is...", you could get various answers: dogs would be likely, but so would cats, birds, or other common household pets. When you add the steering vector, the model is almost guaranteed to say it loves dogs.

Its effect is similar but not equivalent to adding additional token context into a prompt directly.

There are a lot more details in the paper:

  • You can do (linear) math on the vectors. For example, if you want to make the LLM even more likely to talk about dogs, you can multiply the dog steering vector by a larger coefficient. You could also multiply it by 0.5 to make it only slightly more likely to talk about dogs.
  • Steering vectors work best if you use both addition and subtraction, i.e., steering vector = "Love" - "Hate".
  • Not all steering vectors work as expected, for instance "love" - "hate" doesn't work very well whereas "Love" - "Hate" does.
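
For anyone who wants to experiment outside of llama.cpp, here is a minimal sketch of that recipe, assuming a Hugging Face GPT-2 model in PyTorch (the layer index, coefficient, and prompts are illustrative choices of mine, not values from the post): record the activations of two prompts at one layer, take their difference, scale it, and add it back into the residual stream during a later forward pass.

```python
# Minimal sketch, assuming a Hugging Face GPT-2 model in PyTorch (not llama.cpp
# code). The layer index, coefficient and prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # which transformer block's output (residual stream) to snapshot
COEFF = 5.0  # steering strength

def layer_activations(prompt: str) -> torch.Tensor:
    """Run a prompt and capture the hidden states right after block LAYER."""
    captured = {}
    def hook(_module, _inputs, output):
        captured["h"] = output[0].detach()  # GPT2Block returns a tuple; [0] is hidden states
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]  # shape (1, seq_len, hidden_dim)

# Steering vector = activations("Love") - activations("Hate"), truncated to a common length.
h_love, h_hate = layer_activations("Love"), layer_activations("Hate")
n = min(h_love.shape[1], h_hate.shape[1])
steer = COEFF * (h_love[:, :n] - h_hate[:, :n])

def steering_hook(_module, _inputs, output):
    """Add the steering vector to the first positions of the residual stream.
    With the KV cache, later generation steps see seq_len == 1, so the addition
    only happens on the initial full-prompt pass."""
    h = output[0]
    if h.shape[1] > 1:
        k = min(h.shape[1], steer.shape[1])
        h = h.clone()
        h[:, :k] += steer[:, :k]
        return (h,) + output[1:]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("The animal I like most is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

The same idea inside llama.cpp would amount to saving the hidden state of a chosen layer during one eval and adding it back into that layer during a later eval.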

Potential applications and research directions

  • Get rid of the "As an AI language model" responses in ChatGPT-trained models.
  • Extremely low-cost alternative to fine-tuning
  • Steer it in the direction of talking in certain languages or formats (e.g., JSON or French). The authors were unable to use this to get the model to speak French, but it might well be possible.
  • Perhaps it is possible to use this to make an LLM follow instructions for langchain prompts more easily? E.g., make Vicuna less likely to talk to you conversation-style and instead give plainly formatted output without conversational fluff.
  • Improve performance: instead of embedding "Be nice and helpful" in your prompt, which costs a couple of tokens, you can simply add a steering vector that performs the same task.
  • It might act as a save-point for prompt templates. For example, if you take the prompt template "You are a helpful chatbot which ... bla bla bla" and use it as a steering vector, you might essentially put the LLM in a state where that part of the prompt has already been processed.

I think there's a lot more here.

@zrthxn commented May 16, 2023

This is a super interesting idea! I just want to clarify if I understand this correctly.
So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?

Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

@SlyEcho (Collaborator) commented May 16, 2023

It adds some kind of bias to the context, but it is still a completion model, so "Dogs are okay" will stay the same, but it may add " and I love them!".

They discovered that you need to have some kind of difference, so rather than just adding "love", it is better to add ("love" - "hate"). You can get the opposite effect with ("hate" - "love") or with a negative coefficient.

I should also mention that it works token by token, so I think for longer steering strings the tokens may need to be aligned for best results, but I need to research this.
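
For what it's worth, one way to line the tokens up would be to pad the shorter steering prompt at the token level before taking activations. A hedged sketch (the choice of a lone space token as padding is my own assumption; as far as I can tell the post pads the raw strings with spaces to get equal token counts):

```python
# Hedged sketch: right-pad the shorter steering prompt with a space token so the
# two activation tensors line up position by position. `tok` is a GPT-2 style
# Hugging Face tokenizer; the padding choice is an assumption, not from the post.
def aligned_ids(a: str, b: str, tok):
    ia, ib = tok(a)["input_ids"], tok(b)["input_ids"]
    pad = tok(" ")["input_ids"][-1]  # the lone-space token in a byte-level BPE vocab
    while len(ia) < len(ib):
        ia.append(pad)
    while len(ib) < len(ia):
        ib.append(pad)
    return ia, ib
```

The aligned id lists can then be fed through the model (instead of the raw strings) when recording the two activation snapshots.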

@Azeirah (Contributor, Author) commented May 16, 2023

> This is a super interesting idea! I just want to clarify if I understand this correctly. So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?
>
> Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

I'd strongly recommend reading the post linked in the first comment. This is all pretty new, and there is very little understanding of why it works, only of how to do it.

> So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?

How I understand it is that it is basically taking the model's "understanding" of a previous prompt and adding it to a later prompt. I think anthropomorphizing is the easiest way to understand it conceptually.

If you play with your dogs all day, it's very likely you're going to think about dogs in situations where dogs are not necessarily the most obvious thing to think of.

For example, say I played with dogs all day, which puts me in a dog-loving state of mind. A friend comes up to me and says "Let's go fishing tomorrow", and it's likely I'll think of something like "I'll bring my dog", even if that's not something I'd generally do.

This example is a bit far-fetched, but the analogy I make here is that

  1. The steering-vector is your "state of mind" (dog-mind)
  2. The conversation is the new prompt ("Do you want to go fishing tomorrow?")

Normal completion would be

"Yes I would like to go fishing"

With the steering vector

"Yes, me and my dog would love to go fishing with you tomorrow."


That said, I think the name "steering vector" is chosen very well. It steers the model to think in a certain way or direction, e.g., about love, dogs, coding, or whatever.

> Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

Possibly, that's what the post is looking into as well. There isn't a clear answer yet. There are some theories that models are more linear than we initially expected, which means you can perform linear algebra on the vectors and predictably influence the behavior, which is exactly what they're doing here.
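
As an aside, the classic word-analogy arithmetic is easy to reproduce with static word embeddings. A quick sketch using gensim's pretrained GloVe vectors (this is the word2vec/GloVe setting, not a transformer's internal activations, so it's only a loose analogy):

```python
# Sketch of the "king" - "man" + "woman" ≈ "queen" arithmetic on static word
# embeddings via gensim. Only loosely analogous to steering vectors, which act
# on a transformer's internal activations rather than a fixed embedding table.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically comes out at or near the top.
```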

@SlyEcho (Collaborator) commented May 16, 2023

It is modifying the start of the context.

So the prompt is "Do you want to go fishing tomorrow?"

We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.

When the model is generating new tokens and the attention mechanism looks back at the previous text, the prompt now looks more like "Do you 🐶want to go fishing tomorrow?"

@Azeirah (Contributor, Author) commented May 16, 2023

> It is modifying the start of the context.
>
> So the prompt is "Do you want to go fishing tomorrow?"
>
> We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.
>
> When the model is generating new tokens and the attention mechanism looks back at the previous text, the prompt now looks more like "Do you 🐶want to go fishing tomorrow?"

This is not quite true.

> Testing the hypothesis that we're "just injecting extra tokens"
>
> There's a hypothesis that the steering vectors are just injecting extra tokens into the forward pass. In some situations, this makes sense. Given the prompt "I love you because", if we inject a wedding token into the first residual stream with a large coefficient, perhaps the model just "sees" the sentence " wedding love you because".
>
> Tokens are a discrete quantity. You can't have more than one in a single position. You can't have three times wedding and then negative three times (space), on top of I. That's just not a thing which can be done using tokens.
>
> However, consider the steering vector for "Anger"-"Calm" just before layer 20, with coefficient +10. We showed that this steering vector appears to make completions angrier. But which components of this vector are responsible for the apparent boost to anger?
>
> Perhaps what matters is not so much the computational work done by transformer blocks 0 through 19, but the vector given by [...]

(source: the linked post)
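
The quote gets cut off, but as I read it, the comparison being set up is between the full layer-activation difference and the much cheaper raw token-embedding difference scaled by the same coefficient. A hedged sketch of that comparison, reusing the names from the GPT-2 snippet in the opening post (injecting just before block 20 assumes a model with enough layers, e.g. GPT-2-XL; the base "gpt2" checkpoint only has 12 blocks):

```python
# Hedged sketch: build a vector from the raw token embeddings of "Anger" and
# "Calm" (no transformer blocks involved), scale it by the same coefficient,
# and inject it the same way the layer-activation steering vector was injected.
emb = model.get_input_embeddings()  # the token embedding table
anger = tok("Anger", return_tensors="pt")["input_ids"]
calm = tok("Calm", return_tensors="pt")["input_ids"]
k = min(anger.shape[1], calm.shape[1])
with torch.no_grad():
    embed_only = 10.0 * (emb(anger[:, :k]) - emb(calm[:, :k]))  # (1, k, hidden_dim)
# Register a hook like steering_hook on model.transformer.h[19] (the input to
# block 20) that adds embed_only instead of the activation difference, then
# compare the completions against the activation-based steering vector.
```

If the embedding-only vector steers about as well, that would support the "it's mostly the injected direction, not the earlier blocks' computation" reading; if not, the earlier layers are doing real work.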

@SlyEcho (Collaborator) commented May 16, 2023

Yes.

Maybe I was not clear in how I wrote it. It does not insert new tokens; it modifies the ones in the prompt (or the ones being generated).

When I wrote "🐶want" I meant that this is the token "want" with a little bit of dogness added to it.

> We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.

This part is not quite right, now that I think about it. The "I like" should affect the "dog" and the "cat" as well; however, when subtracting the two, the result is probably something abstract about liking dogs versus liking cats.

@SlyEcho (Collaborator) commented May 16, 2023

In my PR #1472, I only get the example there kind of working; I'm not able to plant ideas of dogs or weddings into the model.

But there are still things I don't understand.

github-actions bot (Contributor) commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024