Add LoRA support #820
Conversation
Awesome! LoRAs would be super useful, especially with how easy they're becoming to train right now 🔥
Do you think it is possible (or desirable) to produce quantized versions of the patched tensors?
This would bring the speedups from quantization and allow mmapping both files. The pages from the original tensors won't be faulted / loaded into memory.
@Piezoid I am not sure what the best way to handle this is. Ideally, for simplicity, the resulting patched tensors would be in the same format as they were initially, so if you patch a q4_0 model you still end up with a q4_0 model. However, that may affect the quality significantly, and it may be as slow or slower than just patching the f16 model and quantizing it on the fly afterwards. We need to run more tests; I may try implementing both options to see what works best.
@slaren Like you said, adding the LoRA deltas to a q4 quantized model is most likely very bad for quality. The quantization must happen afterward. My suggestion was to generate a separate model file consisting solely of the patched tensors with the LoRA full-rank weights added, and potentially applying quantization as a final step. The idea is to save disk space by only requiring the space for the modified tensors. By completing the patching process offline, it's possible that the load time will also decrease. Your proposal of patching and quantizing during load time is interesting, but it necessitates loading an f16 llama model and quantizing tensors that haven't been modified.
@Piezoid it is not really viable to store the pre-patched tensors because the file size would be nearly the same as the entire model. The advantage of LoRA is that to patch a 4096x4096 matrix you only need a 16x4096 and a 4096x16 matrix (for rank 16; it could be any other number). Patch it and suddenly your 2x16x4096 becomes 4096x4096.
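To put rough numbers on that (a back-of-the-envelope count, not from the thread), for a single 4096x4096 projection at rank 16:

$$2 \times 16 \times 4096 = 131{,}072 \quad \text{parameters, vs.} \quad 4096 \times 4096 = 16{,}777{,}216$$

That is roughly a 128x reduction in what the adapter has to store, but the patched tensor is back to full size once the product is applied.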
Very useful info. Another approach to think about is to use the distributive property of matrix multiplication:

```c
cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
```

would become:

```c
curL = ggml_mul_mat(ctx0, model.layers[il].wo, cur);

if (lora_enabled) {
    // can be precomputed once at the cost of more memory
    // or we can keep unpacking it each time to save memory
    lora = ggml_mul_mat(ctx0, model.loraB[il], model.loraA_trans[il]);
    lora = ggml_mul_mat(ctx0, lora, cur);  // F32 mul_mat
    curL = ggml_add(ctx0, curL, lora);     // F32 add
}

cur = curL;
```

The drawback is slower inference due to the extra matrix multiplications and additions.
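The identity behind this approach (standard linear algebra, restated here for clarity) is that for low-rank factors $A$ and $B$:

$$(W + BA)\,x = Wx + B(Ax)$$

so the full-rank product $BA$ never has to be materialized; the price is two extra skinny matrix multiplications and an add per evaluation.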
A small side-note: I realized that in some cases it will also be necessary to add a scaling factor. Specifically, this is what PEFT does to merge the lora:

```python
self.scaling = self.lora_alpha / self.r
if fan_in_fan_out:
    self.weight.data = self.weight.data.T
...
self.weight.data += (
    transpose(self.lora_B.weight @ self.lora_A.weight, self.fan_in_fan_out) * self.scaling
)
...
def transpose(weight, fan_in_fan_out):
    return weight.T if fan_in_fan_out else weight
```

where `lora_alpha` and `r` come from the adapter configuration.
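As a sanity check on the two approaches discussed above, here is a small numpy sketch (not from the thread; the shapes, rank, and scaling are arbitrary) showing that merging the scaled delta into the weight matches applying it at eval time via the distributive property:

```python
import numpy as np

# Toy check: merging W += scaling * (B @ A) offline is equivalent to
# applying the low-rank update on the fly. All values are made up.
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d), dtype=np.float32)
A = rng.standard_normal((r, d), dtype=np.float32)  # like lora_A.weight: (r, in)
B = rng.standard_normal((d, r), dtype=np.float32)  # like lora_B.weight: (out, r)
x = rng.standard_normal(d, dtype=np.float32)
scaling = alpha / r

y_merged = (W + scaling * (B @ A)) @ x               # offline merge, then matmul
y_runtime = W @ x + scaling * (B @ (A @ x))          # distributive, on the fly

print(np.allclose(y_merged, y_runtime, rtol=1e-3))   # True, up to float32 rounding
```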
@ggerganov In addition to the performance considerations, something to keep in mind is that the set of tensors to apply the lora to is entirely up to the implementation; for example, alpaca applies it to all of q, k, v, o, but gpt4all only to q and v. I imagine that eval would quickly turn to spaghetti if we have to consider every single tensor separately.
Force-pushed from cd2dbea to a4539e1.
This should work with quantized models now. Patching quantized models doesn't seem so bad; I got a perplexity of 6.6281 on q4_0 with alpaca.
Now that #801 has been merged, using --lora disables mmap. Loading is a bit slower but it should work on Windows now.
Awesome 🔥 I'll test it on Windows soon. This feature is super useful 🙂
So, to be clear, we will load orig params, and then in a batched fashion:
Any rough estimate for how long this adapter "loading" takes?
I guess since you may patch an arbitrary fraction of the weights, the orig weights for the patched matrices are only loaded once. But mmap might still be useful for the case of a relatively small fraction of patched weights + hot-swapping LoRAs. Just a thought. CoW for a large fraction of the weights basically duplicates them, so it's very much unviable.
Replace fp16 with fp32 and that's pretty close to the way it works at the moment. The time to apply the adapter for me varies from ~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices. There may be some ways to accelerate this, but at the moment I am more concerned with correctness and supporting all the use cases.
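To make the flow being described a bit more concrete, here is a rough Python sketch of the patch step (not the actual llama.cpp loader; the tensor naming, the dict layout, and the requantization note are assumptions for illustration):

```python
import numpy as np

def apply_lora(base_weights, lora_pairs, lora_alpha, lora_r):
    """Patch base tensors in place with W += scaling * (B @ A).

    base_weights: dict of name -> float32 array (the dequantized / f16 base layer)
    lora_pairs:   dict of name -> (A, B), with A of shape (r, n) and B of shape (m, r)
    """
    scaling = lora_alpha / lora_r
    for name, (A, B) in lora_pairs.items():
        if name not in base_weights:
            continue  # an adapter may only cover a subset of tensors (e.g. q and v)
        # The full-rank product B @ A is the slow part mentioned above.
        delta = scaling * (B.astype(np.float32) @ A.astype(np.float32))
        base_weights[name] += delta
    # llama.cpp would then requantize the patched tensors back to the model's
    # format; this sketch just leaves them in float32.
    return base_weights
```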
I'm trying to troubleshoot some issues on Windows. First, the conversion script and overall process were straightforward, so good job making it simple. I was able to load the 7B llama and 7B lora fine, but I noticed that I didn't seem to get the responses I expected with the lora applied. This seemed odd, because it was behaving as if the lora wasn't present at all. When I tried testing with the 13B model and 13B lora, I ran into issues when trying to run main. It mentioned:

Any pointers?

Edit (some additional info):
@MillionthOdin16 thanks for testing this. It has been a struggle telling for sure whether the lora that I had tried had any meaningful effect, but I think I found a problem. Can you see if the latest changes fix your issues?
Awesome! Memory allocation issues are fixed and now things are running smoothly. I'm not getting the responses I'd expect lora-wise, though, so I suspect there is something off about how the lora is applied. Now that I can run my 13B model, it's much easier to see when the lora is working correctly (13B is my best trained lora). Is there anything I can do to help troubleshoot? I have a lora that's 25MB that significantly improves the output when put on the plain llama model. I don't know if a lora that is fully merged into the base model might help as well (I don't know if we can compare effective weights between this implementation and the lora-fused model?). Once this works as expected it will be huge. Moving around 25MB loras vs base models is so much easier, and there's lots to be evaluated with layering loras and scaling them based off ratios :D
Are you using an f16 model? Trying to apply a lora to a quantized model may be a terrible idea after all.
You're right. The output works as expected when the llama model is f32. Nice job! Now I'm trying to figure out the best way to make it usable. After the model is merged completely with the lora and quantized to 4 bits, it still produces good output (my point being that eventually we will want to get these fully quantized). So we're merging at f32 to keep precision? I'm wondering what the best approach is for allowing this to work on quantized models. The ability to have a lora run on top of the base model in llama.cpp is in itself huge, because moving significant variations of llama becomes trivial. Having a way for a user to set a lora and have it fused to the model, which could then be quantized down to 4 bits, would be really helpful. It's not as streamlined as real-time loading of loras, but it makes the use of loras significantly easier. Do you have any thoughts on how quantization could be done in memory? Has anyone tested whether a quantized lora still has a useful impact on a quantized base model?

Extra info:

This works:

This doesn't:
Good to hear that it is working! Regarding creating pre-merged models, it is already possible to do that in Python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual. I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical, so you could do something like:
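(A hypothetical invocation along those lines, assuming conventional llama.cpp file names rather than anything stated in the PR: `./main -m models/7B/ggml-model-q4_0.bin --lora ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin`.)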
Okay, I see. Just to note, I tested f32, f16, and q4_0 base llama models with the same lora file. f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to have any variation resulting from the lora. I haven't checked the code to know if this is expected. Do you think applying a quantized lora to a quantized model might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade accuracy for speed).

Yes, I've seen the scripts, but I think for most users the understanding of model file formats and what they currently have vs what format they need is very confusing. My thought is that loras have the ability to significantly change the model outputs, are super lightweight, and are becoming more accessible and easier to train with projects like @lxe/simple-llm-finetuner. If we are able to streamline the use of loras and the conversion of a lora adapter to a ggml model format they are familiar with, we can make learning about language models much easier (abstracting away as much pytorch/GPU/heavy ML stuff as possible). I know you already know this haha, I'm just excited about how this and similar projects make very technical areas easy to play with.

Also, I've noticed a scaling factor in the console and you've mentioned it some. Is this something that directly affects the 'impact' of the lora weights on the overall model? If so, it could be useful to break it out as an argument to make experimentation easier. With stable diffusion they've done some pretty cool things with mixing different lora layers (so I'm thinking about this for down the line).
From what I understand, the llama models are natively f16, so I wouldn't expect much benefit from using an f32 model.
The problem with doing that is that the loras make very small modifications to the weights, and the changes may be completely lost in the noise when applied to a quantized model. Using a quantized lora too just makes the problem worse; I don't think that would work at all.
This is just something that the PEFT library does based on the `lora_alpha` and `r` values of the adapter.
Is this already parallelised?
One idea for improvement in the future is to add an option for specifying a subset of tensors to be loaded via llama_model_load(), for example with a list of tensor names. With this, we can avoid loading the entire base model and only load the LoRA tensors that will be adapted.
There are a lot of things that could be improved about this, but since it is already functional and there is very little risk of breaking anything unrelated to this change, let's merge it to allow other people to contribute their improvements, and also to receive broader feedback.
I REALLY wish someone would make a video tutorial showing how these models can interact with each other. The tech is moving so fast it's hard to keep up.
Has anyone tested this? NNs are robust to lots of operations, and it's not clear if they are robust to adding an int16 delta to an int4 weight.
Just to be clear, this is still the case?
I don't think that perplexity is a good way to test if a LoRA has been applied successfully. You need to test if the changes that the LoRA makes are still there, and in my tests they mostly aren't.
I'm not quite sure what you mean. Maybe you mean that perplexity over wiki.test.raw is not a good way to test it? If so, I agree. But perplexity over a LoRA-specific prompt would work well, and the quantitative measure that is perplexity seems better than eyeballing it, although eyeballing might be sufficient. For example, the following would be much more likely in a LoRA model, so it should have a lower perplexity:
The perplexity over some subset of the dataset used to train the LoRA could work.
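For concreteness, the quantity being compared is the usual definition of perplexity (a standard formula, not something specific to this thread): over tokens $x_1, \dots, x_N$ drawn from the adapter's training data,

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$

A successfully applied LoRA should give a noticeably lower value on that subset than the unpatched base model.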
I'm with you on that! Thanks for putting this PR together, btw. Do you still see those quality issues, or did you manage to resolve them in your latest commits? Sorry, I couldn't work it out by reading the thread.
The issues are still there. The solution was adding the `--lora-base` option, so that the lora can be applied to an f16 base model before the result is quantized back to the model's format.
btw you may have already seen it, but this is the prompt format used to train most of the Alpaca LoRAs. Sometimes people use regular chat and get poor results, until they switch to this format.
I see! Thanks for explaining.
This change allows applying LoRA adapters on the fly without having to duplicate the model files.
Instructions:

- Obtain the `adapter_config.json` and `adapter_model.bin` of a LoRA adapter and put them in the same path. For alpaca, these can be found at https://huggingface.co/tloen/alpaca-lora-7b/tree/main
- Use `convert-lora-to-ggml.py` to obtain `ggml-adapter-model.bin`
- Load the `ggml-adapter-model.bin` with `--lora`
- Optionally, use `--lora-base` to specify a model to use as a base. The layers modified by the LoRA adapter will be applied to the lora-base model and then quantized to the same format as the model specified with `-m`. Layers not modified by the LoRA adapter will remain untouched.

Limitations:

- Using `--lora` disables mmap since the models have to be modified anyway.
- When using `--lora-base`, a `ggml_cpy` operation is used to quantize the result, which currently is done in a single thread. Parallelizing `ggml_cpy` will improve loading times.