
Investigate and play with "steering vectors" post (paper) #1460

Closed
Azeirah opened this issue May 14, 2023 · 8 comments

@Azeirah (Contributor) commented May 14, 2023

I just read this recently released post about the idea of steering vectors.

What?

The idea of a steering vector is to add some precomputed data to your inference to "steer" the model in a certain direction, i.e., give it a certain "mood" or "style". For example, you can add the steering vector "Love" to make your LLM give more loving output.

Example: (screenshot omitted)

Some more detail

In short, a steering vector is a snapshot of the model's activations for a prompt at a certain layer. For example, if you prompt "I like dogs", you can obtain a steering vector by storing the network's output at a layer of your choosing, say layer 2 or 10.

With a steering vector, you can change the "direction" of a prompt. By adding a steering vector to later prompts, you make the model much more likely to output things related to the steering vector, e.g., its love of dogs.

If you prompt "the animal I like most is...", you could get various answers: dogs would be likely, but so would cats, birds, or other common household pets. When you add the steering vector, the model is almost guaranteed to say it loves dogs.

Its effect is similar but not equivalent to adding additional token context into a prompt directly.

There are a lot more details in the paper:

  • You can do (linear) math on the vectors. For example, if you want to make the LLM even more likely to talk about dogs, you can multiply the dog steering vector by a larger coefficient. You could also multiply it by 0.5 to make it only slightly more likely to talk about dogs.
  • Steering vectors work best if you use both addition and subtraction, i.e., steering vector = "Love" - "Hate".
  • Not all steering vectors work as expected, for instance "love" - "hate" doesn't work very well whereas "Love" - "Hate" does.
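
For anyone who wants to experiment outside of llama.cpp, here is a minimal sketch of that recipe, assuming a Hugging Face GPT-2 model in PyTorch (the layer index, coefficient, and prompts are illustrative choices of mine, not values from the post): record the activations of two prompts at one layer, take their difference, scale it, and add it back into the residual stream during a later forward pass.

```python
# Minimal sketch, assuming a Hugging Face GPT-2 model in PyTorch (not llama.cpp
# code). The layer index, coefficient and prompts are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # which transformer block's output (residual stream) to snapshot
COEFF = 5.0  # steering strength

def layer_activations(prompt: str) -> torch.Tensor:
    """Run a prompt and capture the hidden states right after block LAYER."""
    captured = {}
    def hook(_module, _inputs, output):
        captured["h"] = output[0].detach()  # GPT2Block returns a tuple; [0] is hidden states
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return captured["h"]  # shape (1, seq_len, hidden_dim)

# Steering vector = activations("Love") - activations("Hate"), truncated to a common length.
h_love, h_hate = layer_activations("Love"), layer_activations("Hate")
n = min(h_love.shape[1], h_hate.shape[1])
steer = COEFF * (h_love[:, :n] - h_hate[:, :n])

def steering_hook(_module, _inputs, output):
    """Add the steering vector to the first positions of the residual stream.
    With the KV cache, later generation steps see seq_len == 1, so the addition
    only happens on the initial full-prompt pass."""
    h = output[0]
    if h.shape[1] > 1:
        k = min(h.shape[1], steer.shape[1])
        h = h.clone()
        h[:, :k] += steer[:, :k]
        return (h,) + output[1:]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
ids = tok("The animal I like most is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

The same idea inside llama.cpp would amount to saving the hidden state of a chosen layer during one eval and adding it back into that layer during a later eval.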

Potential applications and research directions

  • Get rid of the "As an AI language model" responses in ChatGPT-trained models.
  • Extremely low-cost alternative to fine-tuning
  • Steer it in the direction of talking in certain languages or formats (e.g., JSON or French). The authors were unable to use this to get the model to speak French, but it might well be possible.
  • Perhaps it is possible to use this to make an LLM follow instructions for langchain prompts more easily? E.g., make Vicuna less likely to talk to you conversation-style and instead give plainly formatted output without conversational fluff.
  • Improve performance: instead of embedding "Be nice and helpful" in your prompt, which costs a couple of tokens, you can simply add a steering vector that performs the same task.
  • It might act as a save-point for prompt templates. For example, if you take the prompt template "You are a helpful chatbot which ... bla bla bla" and use it as a steering vector, you might essentially put the LLM in a state where that part of the prompt has already been processed.

I think there's a lot more here.

@zrthxn commented May 16, 2023

This is a super interesting idea! I just want to clarify if I understand this correctly.
So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?

Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

@SlyEcho (Collaborator) commented May 16, 2023

It adds some kind of bias to the context, but it is still a completion model, so "Dogs are okay" will stay the same, but it may add " and I love them!".

They discovered that you need to have some kind of difference, so rather than just adding "love", it is better to add ("love" - "hate"). You can get the opposite effect with ("hate" - "love") or with a negative coefficient.

I should also mention that it works token by token, so I think for longer steering strings the tokens may need to be aligned for best results, but I need to research this.
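
For what it's worth, one way to line the tokens up would be to pad the shorter steering prompt at the token level before taking activations. A hedged sketch (the choice of a lone space token as padding is my own assumption; as far as I can tell the post pads the raw strings with spaces to get equal token counts):

```python
# Hedged sketch: right-pad the shorter steering prompt with a space token so the
# two activation tensors line up position by position. `tok` is a GPT-2 style
# Hugging Face tokenizer; the padding choice is an assumption, not from the post.
def aligned_ids(a: str, b: str, tok):
    ia, ib = tok(a)["input_ids"], tok(b)["input_ids"]
    pad = tok(" ")["input_ids"][-1]  # the lone-space token in a byte-level BPE vocab
    while len(ia) < len(ib):
        ia.append(pad)
    while len(ib) < len(ia):
        ib.append(pad)
    return ia, ib
```

The aligned id lists can then be fed through the model (instead of the raw strings) when recording the two activation snapshots.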

@Azeirah (Contributor, Author) commented May 16, 2023

> This is a super interesting idea! I just want to clarify if I understand this correctly. So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?
>
> Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

I'd strongly recommend reading the post linked in the first comment. This is all pretty new, and there is very little understanding of why it works, only of how to do it.

> So, let's say for a prompt like "Dogs are okay", what you do is take the contextualised hidden state from the (say) 12th attention layer and add to it the word representation/embedding for something like "Love". And the resultant output is then something like, "Dogs are awesome!". Is this correct?

How I understand it is that it is basically taking the model's "understanding" of a previous prompt and adding it to a later prompt. I think anthropomorphizing is the easiest way to understand it conceptually.

If you play with your dogs all day, it's very likely you're going to think about dogs in situations where dogs are not necessarily the most obvious thing to think of.

For example, say I played with dogs all day, which puts me in a dog-loving state of mind. A friend comes up to me and says "Let's go fishing tomorrow", and it's likely I'll think of something like "I'll bring my dog", even if that's not something I'd generally do.

This example is a bit far-fetched, but the analogy I make here is that

  1. The steering-vector is your "state of mind" (dog-mind)
  2. The conversation is the new prompt ("Do you want to go fishing tomorrow?")

Normal completion would be

"Yes I would like to go fishing"

With the steering vector

"Yes, me and my dog would love to go fishing with you tomorrow."


That said, I think the name "steering vector" is chosen very well. It steers the model to think in a certain way or direction, e.g., about love, dogs, coding, or whatever.

> Also, do you think the reason this works is that word representations are additive? Like for example, a common one I remember is "king" - "man" + "woman" = "queen". And so when you add the word vector for "Love" in the above example, it sort of steers the model towards love.

Possibly, that's what the post is looking into as well. There isn't a clear answer yet. There are some theories that models are more linear than we initially expected, which means you can perform linear algebra on the vectors and predictably influence the behavior, which is exactly what they're doing here.
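
As an aside, the classic word-analogy arithmetic is easy to reproduce with static word embeddings. A quick sketch using gensim's pretrained GloVe vectors (this is the word2vec/GloVe setting, not a transformer's internal activations, so it's only a loose analogy):

```python
# Sketch of the "king" - "man" + "woman" ≈ "queen" arithmetic on static word
# embeddings via gensim. Only loosely analogous to steering vectors, which act
# on a transformer's internal activations rather than a fixed embedding table.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" typically comes out at or near the top.
```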

@SlyEcho (Collaborator) commented May 16, 2023

It is modifying the start of the context.

So the prompt is "Do you want to go fishing tomorrow?"

We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.

When the model is generating new tokens and the attention mechanism looks back at the previous text, the prompt now looks more like "Do you 🐶want to go fishing tomorrow?"

@Azeirah (Contributor, Author) commented May 16, 2023

> It is modifying the start of the context.
>
> So the prompt is "Do you want to go fishing tomorrow?"
>
> We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.
>
> When the model is generating new tokens and the attention mechanism looks back at the previous text, the prompt now looks more like "Do you 🐶want to go fishing tomorrow?"

This is not quite true.

> Testing the hypothesis that we're "just injecting extra tokens"
>
> There's a hypothesis that the steering vectors are just injecting extra tokens into the forward pass. In some situations, this makes sense. Given the prompt "I love you because", if we inject a wedding token into the first residual stream with a large coefficient, perhaps the model just "sees" the sentence " wedding love you because".
>
> Tokens are a discrete quantity. You can't have more than one in a single position. You can't have three times wedding and then negative three times (space), on top of I. That's just not a thing which can be done using tokens.
>
> However, consider the steering vector for "Anger"-"Calm" just before layer 20, with coefficient +10. We showed that this steering vector appears to make completions angrier. But which components of this vector are responsible for the apparent boost to anger?
>
> Perhaps what matters is not so much the computational work done by transformer blocks 0 through 19, but the vector given by [...]

(source: the linked post)
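
The quote gets cut off, but as I read it, the comparison being set up is between the full layer-activation difference and the much cheaper raw token-embedding difference scaled by the same coefficient. A hedged sketch of that comparison, reusing the names from the GPT-2 snippet in the opening post (injecting just before block 20 assumes a model with enough layers, e.g. GPT-2-XL; the base "gpt2" checkpoint only has 12 blocks):

```python
# Hedged sketch: build a vector from the raw token embeddings of "Anger" and
# "Calm" (no transformer blocks involved), scale it by the same coefficient,
# and inject it the same way the layer-activation steering vector was injected.
emb = model.get_input_embeddings()  # the token embedding table
anger = tok("Anger", return_tensors="pt")["input_ids"]
calm = tok("Calm", return_tensors="pt")["input_ids"]
k = min(anger.shape[1], calm.shape[1])
with torch.no_grad():
    embed_only = 10.0 * (emb(anger[:, :k]) - emb(calm[:, :k]))  # (1, k, hidden_dim)
# Register a hook like steering_hook on model.transformer.h[19] (the input to
# block 20) that adds embed_only instead of the activation difference, then
# compare the completions against the activation-based steering vector.
```

If the embedding-only vector steers about as well, that would support the "it's mostly the injected direction, not the earlier blocks' computation" reading; if not, the earlier layers are doing real work.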

@SlyEcho (Collaborator) commented May 16, 2023

Yes.

Maybe I was not clear in how I wrote it. It does not insert new tokens; it modifies the ones in the prompt (or the ones being generated).

When I wrote "🐶want" I meant that this is the token "want" with a little bit of dogness added to it.

> We calculate the steering vector from "I like dogs" – "I like cats" and get a vector like [0, 0, 🐶] where 🐶 represents some dogness value, and 0 because the other tokens are the same.

This part is not quite right, now that I think about it. The "I like" should affect the "dog" and the "cat" as well; however, when subtracting the two, the result is probably something abstract about liking dogs versus liking cats.

@SlyEcho (Collaborator) commented May 16, 2023

In my PR #1472, I only get the example there kind of working; I'm not able to plant ideas of dogs or weddings into the model.

But there are still things I don't understand.

github-actions bot (Contributor) commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 9, 2024