Add support for control vectors #5970
That's life saving lol.
Cool stuff!
Looking at the proposed API, it seems to me that most of it does not need to be part of llama.h. I would recommend moving all the vector loading, adding and scaling logic into common and trying to make the llama.h and llama.cpp changes as small as possible.
The idea is to minimize the changes to the core library, since this is new functionality and we don't know yet if it is here to stay, so we want to minimize our maintenance effort. After it has stayed in common for a while and we see that it is useful, we can think of ways to integrate it more tightly into the core lib.
Here is an outline of what to change:
- In common, implement a simple function with the entire logic of loading the control vector file and summing up the vectors to produce the final vector:
    std::vector<float> llama_control_vector_load(const char * fname,
            const std::vector<std::tuple<std::string, float>> & mix);
- Note there is no need for the struct llama_control_vector or for the helper functions such as llama_control_vector_scale, llama_control_vector_add, etc. Just load a plain std::vector<float>, do the scaling and additions, and return a plain std::vector<float>. Everything in one go: the control vector files are very small, so we can afford to do that.
- After this is ready, the llama.h change would need only one function:
    LLAMA_API void llama_control_vector_apply(
            struct llama_context * lctx,
            float * data,
            int * n_embd,
            int32_t il_start,
            int32_t il_end);
- Inside llama.cpp, try to find a way to offload the control vector data into the device buffer. The way you currently have it, it resides in CPU RAM and will be copied to the GPU every time it is used, so the performance will be bad. Look at how we prepare the graph inputs in llama_new_context_with_model and llama_set_inputs, and if it's not clear, ask for guidance.
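The "load plain vectors, scale, add, return a plain std::vector<float>" step from the outline can be sketched without any file I/O. This is only a rough illustration under assumptions, not the code from this PR: mix_control_vectors is a hypothetical name, the inputs are assumed to be already loaded into flat float buffers, and shorter inputs are assumed to be zero-padded.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of llama.cpp): combine already-loaded control
// vectors into a single flat buffer. Each entry is a (data, scale) pair.
// Assumption: inputs shorter than the longest one are treated as zero-padded.
std::vector<float> mix_control_vectors(
        const std::vector<std::pair<std::vector<float>, float>> & mix) {
    // The result is as long as the longest input.
    std::size_t n = 0;
    for (const auto & entry : mix) {
        n = std::max(n, entry.first.size());
    }

    // Accumulate scale * data for every input, element by element.
    std::vector<float> result(n, 0.0f);
    for (const auto & entry : mix) {
        const std::vector<float> & data  = entry.first;
        const float                scale = entry.second;
        for (std::size_t i = 0; i < data.size(); ++i) {
            result[i] += scale * data[i];
        }
    }
    return result;
}
```

Because everything is reduced to one buffer up front, nothing beyond the final std::vector<float> would need to cross into the core library.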
This is awesome, can't wait to try it out. I mostly use llama.cpp via server.cpp. Would you please add support for it in server.cpp too?
Sounds reasonable! Will implement.
I'm not very familiar with server.cpp but I can take a look!
I am assuming this supersedes #1472
This is a cool feature! Thanks for implementing this. I did play around with this idea a while ago, but did not succeed. With fine-tuning, grammars and now control vectors, we have so much power to control the output of the model. @Mihaiii The @vgel I can help to implement the server part if you want. I think it would be nice to add a new field in the body JSON, like what we did for
Sorry, I didn't notice that the vector requires training, so it cannot be made dynamically with each request. I propose adding a Then inside the server, we can use the pre-trained vector with:
Edit: this approach may not work if the vector must be loaded and calculated alongside the model load.
llama.cpp (outdated)
    std::string name = gguf_get_tensor_name(meta_ctx_gguf, i);

    // split on '.'
    size_t dotpos = name.find('.');
@ggerganov I notice that in the llama.cpp library we sometimes need to split a tensor name to get a specific component of it. I wonder if we should refactor all this code with a str_split helper that splits a string by a delimiter?
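For illustration, such a helper could look roughly like the sketch below. The name str_split comes from the comment above, but the signature and the choice to keep empty fields are assumptions made here, not an existing llama.cpp API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split `input` on every occurrence of `delim`.
// Assumption: empty fields are kept, so "a..b" yields {"a", "", "b"}.
std::vector<std::string> str_split(const std::string & input, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = input.find(delim, start);
        if (pos == std::string::npos) {
            // No more delimiters: the rest of the string is the last field.
            parts.push_back(input.substr(start));
            break;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + 1;
    }
    return parts;
}
```

With something like this, the find('.') plus substr pattern in the snippet above would reduce to indexing into the returned vector.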
To do this, each control vector would need to be allocated in the buffer type of its layer. An example of how to do this can be found in
Just to add to my incompetent opinion, I also think that could best be done in a separate PR. Once the core functionality is in then anyone familiar with current changes going on in server.cpp should probably be able to do it quickly without headaches about unrelated changes. I think even I could do that (but wouldn't because I'm a shitty C++ coder). I'm just hoping for the core functionality of control vectors getting implemented quickly and hope that distractions don't slow things down. :D
On another unrelated note: How feasible would it be to implement the training of control vectors in llama.cpp, maybe even using quantized models? I understand that this is far more complex and not in the scope of this PR. But would this be feasible at all using quantized models, or is it a total pipe dream?
Nice work. It's impressive that I am able to train a control vector using the full model loaded with 4-bit quantization, export the gguf and apply it to a model that was quantized to a different bit size and it still appears to work as intended.
Does the training work on ROCm? If it's not known I can try it tomorrow. I'm really excited about this one!
    printf(" add a control vector\n");
    printf(" --control-vector-scaled FNAME S\n");
    printf(" add a control vector with user defined scaling S\n");
    printf(" --control-vector-layer-range START END\n");
Would it make sense to embed the scale and layer range parameters in the generated GGUF file too? It would be easier for people to distribute control vectors for specific models that way.
An end-user should still always be able to override them, if this is made possible.
How would we handle the case where the user loads multiple GGUF files with conflicting layer ranges though? 🤔 Since the merged vector must cover a single range. I guess we could only add the layers for a certain vector's range...? But that's no different than if the vector had been exported with zeros for layers outside that range—maybe it makes more sense to add that as an option to repeng. 🤔
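The "zeros for layers outside that range" idea can be sketched as a small masking step applied to each vector before merging, so that vectors with different ranges can simply be summed. This is only a sketch under assumptions: zero_outside_range is a hypothetical name, and the flat layout (one row of n_embd floats per layer, with row r belonging to layer r) is assumed for illustration and may not match the actual file format.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: zero every layer row outside [il_start, il_end] (inclusive).
// Assumption: `data` holds consecutive rows of n_embd floats, row r for layer r.
void zero_outside_range(std::vector<float> & data, std::size_t n_embd,
                        int32_t il_start, int32_t il_end) {
    if (n_embd == 0) {
        return;
    }
    const std::size_t n_layers = data.size() / n_embd;
    for (std::size_t il = 0; il < n_layers; ++il) {
        const int32_t layer = static_cast<int32_t>(il);
        if (layer < il_start || layer > il_end) {
            const auto row = data.begin() + static_cast<std::ptrdiff_t>(il * n_embd);
            std::fill(row, row + static_cast<std::ptrdiff_t>(n_embd), 0.0f);
        }
    }
}
```

After masking, a layer outside a vector's range contributes nothing to the sum, which matches the observation that this is equivalent to exporting zeros for those layers.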
@trollkotze Yes, I discussed this idea with @vgel, and I'm pretty sure that this is something we will eventually be able to do. For now, the only problem is that we can't find a lightweight PCA implementation in C++. Maybe this part will still be done in Python, but the other parts of the training process can be done using llama.cpp (which would allow us to use GGUF quantized models).
@Azeirah I'm not sure about this, but the training script uses Hugging Face's transformers library, so if that works then you can use your GPU. Otherwise, I think training on the CPU can still work, just slower. Another option is to use Google Colab with a free T4 GPU. That should work when loading the model in 4 bits (via bitsandbytes), as the T4 does not have enough RAM to load a non-quantized model. I haven't had time to try this though:
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # 4-bit NF4 quantization via bitsandbytes
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,  # your model here
        device_map="auto",
        quantization_config=bnb_config,
        trust_remote_code=True,
    )
Update: Yes, it does work with the Google Colab free T4 GPU, link to my notebook here
@vgel Would it be possible to give me permission to push:
Opened a PR to your branch: NousResearch#1 The diff is messed up because I merged
control-vectors : minor code style updates
@ggerganov OK, merged your PR in on the Nous side (and the diff for this PR looks OK even if it was weird over there).
use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)
@ggerganov Should be fixed now!
I made a draft PR for adding control vectors to server.cpp: #6289
* control vector api and implementation
* control-vectors : minor code style updates
* disable control vector when data == nullptr
  use -1 for disabled range (also on init) in case we ever support controlling layer 0 (embeddings)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Many thanks to Nous Research, whose support and collaboration made this work possible!
This PR introduces a new activations hacking technique, control vectors (also known as steering vectors, concept vectors, representation engineering, etc.). Control vectors are an easy-to-train (~60s on a 4090 for a 7B parameter model) way to modify the behavior of an LLM without finetuning or inference-time prompting, using a synthetic dataset of prompt pairs and PCA to generate a set of per-layer vectors that are added to the model activations.
They've been described in a few recent papers, such as Representation Engineering: A Top-Down Approach to AI Transparency. I also have a blog post that covers them in a more grounded way, with a library for easily creating them and examples of their use: https://vgel.me/posts/representation-engineering/
An example from the blog post of a laziness/diligence vector being trained and applied to mistral-7b-instruct-0.1
This PR adds the ability to use control vectors, in GGUF format, with Llama-architecture models in llama.cpp. (Support for other architectures hasn't been implemented yet.) Currently, these control vectors can only be exported from repeng, but the format is simple, so my hope is that it can become a common export format for other libraries that generate representation engineering vectors with different techniques.
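The core operation described above, adding a per-layer vector to the model activations, can be sketched on plain buffers as follows. This is a rough illustration, not the implementation in this PR: add_control_vector is a hypothetical name, the row-per-layer layout of cvec is an assumption, and treating a negative il_start as "disabled" is modeled on the -1 convention mentioned in the commit notes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not llama.cpp's implementation.
// Assumption: `cvec` holds one row of hidden.size() floats per layer, row il for layer il.
// A negative il_start means the control vector is disabled.
// Returns true if `hidden` was modified.
bool add_control_vector(std::vector<float> & hidden,
                        const std::vector<float> & cvec,
                        int32_t il, int32_t il_start, int32_t il_end) {
    if (il_start < 0 || il < il_start || il > il_end) {
        return false; // disabled, or layer outside the active range
    }
    const std::size_t n_embd = hidden.size();
    const std::size_t offset = static_cast<std::size_t>(il) * n_embd;
    if (n_embd == 0 || offset + n_embd > cvec.size()) {
        return false; // no row stored for this layer
    }
    // Add this layer's row to the activations in place.
    for (std::size_t i = 0; i < n_embd; ++i) {
        hidden[i] += cvec[offset + i];
    }
    return true;
}
```

Scaling a vector before this step (for example by a negative strength) flips or amplifies the direction being added, which is how a single trained vector can be used in both directions.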
CLI / Usage
Along with changes to llama.cpp / llama.h to support loading control vectors, doing arithmetic on control vectors, and applying a control vector to or removing a control vector from a llama_context *, this PR also adds arguments to the common CLI:
As an example usage, this command loads a Q4_K_M mistral-7b-instruct-0.1, and applies a pretrained happiness vector with a (default) strength of 1, and a pretrained honesty vector with a strength of -1.5 (producing a strength-1.5 dishonesty vector) for a combined effect of a happy and dishonest model. Note that the prompt doesn't mention a persona at all; the behavior comes purely from the control vectors.
If you'd like to test this PR, but don't have a machine that can run repeng, I've uploaded those pretrained vectors to my website: happy.gguf, honest.gguf. (Please let me know if there are any other vectors you'd be interested in testing, and I can upload those as well.) These vectors are trained on mistral-7b-instruct-0.1, but have also been tested on mistral-7b-0.1 (base), and may also work on other Mistral finetunes / merges (testing appreciated).