Investigate and play with "steering vectors" post (paper) #1460
Comments
This is a super interesting idea! I just want to check that I understand it correctly. Also, do you think the reason this works is that word representations are additive, like the common word-analogy examples?
It adds some kind of bias to the context, but it is still a completion model, so "Dogs are okay" will still stay the same, but it may add " and I love them!". They discovered that you need to have some kind of difference, so not just adding "love"; better is ("love" - "hate"). You can get the opposite effect with ("hate" - "love") or with a negative coefficient. I should also mention that it goes token-by-token, so I think with longer steering strings the tokens may need to be aligned for best results, but I need to research this.
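To make the ("love" - "hate") arithmetic and the coefficient concrete, here is a minimal numpy sketch. The shapes and values are made up for illustration; real steering vectors are built from activations recorded at a chosen transformer layer, not from random numbers.

```python
import numpy as np

n_tokens, n_embd = 3, 8                      # toy sizes, not real model dimensions

# Pretend these are activations recorded at some layer for two steering prompts.
h_love = np.random.randn(n_tokens, n_embd)   # e.g. "I love ..."
h_hate = np.random.randn(n_tokens, n_embd)   # e.g. "I hate ..."

coeff = 5.0                                  # steering strength
steer = coeff * (h_love - h_hate)            # the ("love" - "hate") direction

# Swapping the operands, or using a negative coefficient, steers the other way.
anti_steer = coeff * (h_hate - h_love)
```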
I'd strongly recommend reading the post linked in the first comment. This is all pretty new and there is very little understanding of why it works. Only how to do it.
How I understand it is that it basically takes the model's understanding of a previous prompt and "adds" it to a later prompt. I think anthropomorphizing here is the easiest way to understand it conceptually. If you play with your dogs all day, it's very likely you're going to think about dogs in situations where dogs are not necessarily the most obvious thing to think of. For example, say I played with dogs all day, which puts me in a dog-loving state of mind. A friend comes up and says "Let's go fishing tomorrow", and it's likely I'll think of something like "I'll bring my dog", even if that's not something I'd generally do. This example is a bit far-fetched, but the analogy I'm making is this:
A normal completion would be "Yes, I would like to go fishing", while with the steering vector it becomes "Yes, me and my dog would love to go fishing with you tomorrow." That said, I think the name "steering vector" is chosen very well: it steers the model to think in a certain way or direction, i.e. about love, dogs, coding, or whatever.
Possibly; that's what the post is looking into as well. There isn't a clear answer yet. There are some theories that models are more linear than we initially expected, meaning you can perform linear algebra on the vectors and predictably influence the behavior, and that is exactly what they're doing here.
It is modifying the start of the context. So say the prompt is "Do you want to go fishing tomorrow?" We calculate the steering vector from "I like dogs" - "I like cats" and get a vector like [0, 0, 🐶], where 🐶 represents some "dogness" value and the zeros come from the other tokens being the same. When the model is generating new tokens and the attention mechanism looks back at the previous text, the prompt now looks more like "Do you 🐶want to go fishing tomorrow?"
This is not quite true.
Yes. Maybe I was not clear in how I wrote it. It does not insert new tokens; it modifies the ones in the prompt (or during generation). When I wrote "🐶want" I meant that it is the token "want" with a little bit of dogness added to it.
This part is not quite right, now that I think about it. The "I like" should affect the "dog" and the "cat" as well; however, when subtracting the two, the result is probably something abstract about liking dogs versus liking cats.
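A toy sketch of the token-by-token picture from the exchange above: the precomputed steering vector is added position by position onto the prompt's activations at the same layer, starting at the beginning of the context, which is why alignment of the steering tokens can matter. Again plain numpy with made-up shapes, not the actual implementation.

```python
import numpy as np

n_prompt, n_steer, n_embd = 9, 3, 8           # toy sizes
h_prompt = np.random.randn(n_prompt, n_embd)  # layer activations for the prompt tokens
steer = np.random.randn(n_steer, n_embd)      # precomputed steering vector

# Add the steering vector token-by-token from the start of the context;
# prompt positions beyond its length are left untouched.
h_steered = h_prompt.copy()
h_steered[:n_steer] += steer
```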
In my PR #1472, I only get the example there kind of working; I'm not able to plant ideas of dogs or weddings into the model. But there are still things I don't understand.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I just read this recently released post about the idea of steering vectors.
What?
The idea of a steering vector is to add some precomputed data to your inference to "steer" the model in a certain direction, i.e. give it a certain "mood" or "style". For example, you can add the steering vector "Love" to make your LLM give more loving output.
Some more detail
In short, a steering vector is a snapshot of the model's output for a prompt at a certain layer. So, for example, if you prompt "I like dogs", you can obtain a steering vector by storing the output of the network at a layer of your choosing, for example layer 2 or 10.
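As a rough sketch of what "storing the output at a layer" could look like (assuming a Hugging Face GPT-2 model purely for illustration; the post and PR #1472 use different code):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tok("I like dogs", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

layer = 10                           # a layer of your choosing, e.g. 2 or 10
snapshot = out.hidden_states[layer]  # shape: (batch, tokens, hidden size)
```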
With a steering vector, you can change the "direction" of a prompt. By adding a steering vector to later prompts, you make the model much more likely to output things related to the steering vector, e.g. mentioning its love of dogs.
If you then prompt "the animal I like most is...", you could get various answers. Dogs would be likely, but so would cats, birds, or other common household pets. With the steering vector added, it's almost guaranteed to output that it loves dogs.
Its effect is similar but not equivalent to adding additional token context into a prompt directly.
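Putting the pieces together, here is a hedged end-to-end sketch of one way such a vector could be applied, again assuming a Hugging Face GPT-2 model and a PyTorch forward pre-hook; the post and the llama.cpp PR do this differently, and names like `add_steer` are made up for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer, coeff = 10, 5.0

def layer_acts(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, output_hidden_states=True).hidden_states[layer]

# Steering vector as a difference of two prompts, scaled by a coefficient.
steer = coeff * (layer_acts("I like dogs") - layer_acts("I like cats"))

def add_steer(module, args):
    hidden = args[0]
    # Only modify the full prompt pass; later single-token passes are left alone.
    if hidden.shape[1] >= steer.shape[1]:
        hidden = hidden.clone()
        hidden[:, : steer.shape[1], :] += steer
        return (hidden,) + args[1:]

handle = model.transformer.h[layer].register_forward_pre_hook(add_steer)
ids = tok("Do you want to go fishing tomorrow?", return_tensors="pt").input_ids
with torch.no_grad():
    steered = model.generate(ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(steered[0]))
```

With the hook removed, the same `generate` call gives the un-steered completion for comparison.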
There are a lot more details in the post and paper.
Potential applications and research directions
I think there's a lot more here.