Add LoRA support #820
Conversation
Awesome! LoRAs would be super useful, especially with how easy they're becoming to train right now 🔥
Do you think it is possible (or desirable) to produce quantized versions of the patched tensors?
This would bring the speedups from quantization and allow mmapping both files. The pages from the original tensors won't be faulted / loaded into memory.
@Piezoid I am not sure what the best way to handle this is. Ideally, for simplicity, the resulting patched tensors would be in the same format as they were initially, so if you patch a q4_0 model you still end up with a q4_0 model. However, that may affect the quality significantly, and it may be as slow or slower than just patching the f16 model and quantizing it on the fly afterwards. We need to run more tests; I may try implementing both options to see what works best.
@slaren Like you said, adding the LoRA deltas to a q4 quantized model is most likely very bad for quality. The quantization must happen afterward. My suggestion was to generate a separate model file consisting solely of the patched tensors with the LoRA full-rank weights added, and potentially applying quantization as a final step. The idea is to save disk space by only requiring the space for the modified tensors. By completing the patching process offline, it's possible that the load time will also decrease. Your proposal of patching and quantizing during load time is interesting, but it necessitates loading an f16 llama model and quantizing tensors that haven't been modified.
@Piezoid it is not really viable to store the pre-patched tensors because the file size would be nearly the same as the entire model. The advantage of LoRA is that to patch a 4096x4096 matrix you only need a 16x4096 and a 4096x16 matrix (for rank 16; it could be any other number). Patch it and suddenly your 2x16x4096 becomes 4096x4096.
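To put rough numbers on that (a back-of-the-envelope count, not from the thread), for a single 4096x4096 projection at rank 16:

$$2 \times 16 \times 4096 = 131{,}072 \quad \text{parameters, vs.} \quad 4096 \times 4096 = 16{,}777{,}216$$

That is roughly a 128x reduction in what the adapter has to store, but the patched tensor is back to full size once the product is applied.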
Very useful info. Another approach to think about is to use the distributive property of matrix multiplication:

```c
cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
```

would become:

```c
curL = ggml_mul_mat(ctx0, model.layers[il].wo, cur);

if (lora_enabled) {
    // can be precomputed once at the cost of more memory
    // or we can keep unpacking it each time to save memory
    lora = ggml_mul_mat(ctx0, model.loraB[il], model.loraA_trans[il]);
    lora = ggml_mul_mat(ctx0, lora, cur);  // F32 mul_mat
    curL = ggml_add(ctx0, curL, lora);     // F32 add
}

cur = curL;
```

The drawback is slower inference due to the extra matrix multiplications and additions.
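The identity behind this approach (standard linear algebra, restated here for clarity) is that for low-rank factors $A$ and $B$:

$$(W + BA)\,x = Wx + B(Ax)$$

so the full-rank product $BA$ never has to be materialized; the price is two extra skinny matrix multiplications and an add per evaluation.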
A small side-note: I realized that in some cases it will also be necessary to add a scaling factor. Specifically, this is what PEFT does to merge the lora:

```python
self.scaling = self.lora_alpha / self.r
if fan_in_fan_out:
    self.weight.data = self.weight.data.T
...
self.weight.data += (
    transpose(self.lora_B.weight @ self.lora_A.weight, self.fan_in_fan_out) * self.scaling
)
...
def transpose(weight, fan_in_fan_out):
    return weight.T if fan_in_fan_out else weight
```

where `lora_alpha` and `r` come from the adapter configuration.
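As a sanity check on the two approaches discussed above, here is a small numpy sketch (not from the thread; the shapes, rank, and scaling are arbitrary) showing that merging the scaled delta into the weight matches applying it at eval time via the distributive property:

```python
import numpy as np

# Toy check: merging W += scaling * (B @ A) offline is equivalent to
# applying the low-rank update on the fly. All values are made up.
d, r, alpha = 1024, 16, 32
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d), dtype=np.float32)
A = rng.standard_normal((r, d), dtype=np.float32)  # like lora_A.weight: (r, in)
B = rng.standard_normal((d, r), dtype=np.float32)  # like lora_B.weight: (out, r)
x = rng.standard_normal(d, dtype=np.float32)
scaling = alpha / r

y_merged = (W + scaling * (B @ A)) @ x               # offline merge, then matmul
y_runtime = W @ x + scaling * (B @ (A @ x))          # distributive, on the fly

print(np.allclose(y_merged, y_runtime, rtol=1e-3))   # True, up to float32 rounding
```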
@ggerganov In addition to the performance considerations, something to keep in mind is that the set of tensors to apply the lora to is entirely up to the implementation; for example, alpaca applies it to all of q, k, v, o, but gpt4all only to q and v. I imagine that eval would quickly turn to spaghetti if we have to consider every single tensor separately.
Force-pushed from cd2dbea to a4539e1.
This should work with quantized models now. Patching quantized models doesn't seem so bad; I got a perplexity of 6.6281 on q4_0 with alpaca.
Now that #801 has been merged, using --lora disables mmap. Loading is a bit slower but it should work on Windows now.
Awesome 🔥 I'll test it on Windows soon. This feature is super useful 🙂
So, to be clear, we will load orig params, and then in a batched fashion:
Any rough estimate for how long this adapter "loading" takes?
I guess since you may patch an arbitrary fraction of the weights, the orig weights for the patched matrices are only loaded once. But mmap might still be useful for the case of a relatively small fraction of patched weights + hot-swapping LoRAs. Just a thought. CoW for a large fraction of the weights basically duplicates them, so it's very much unviable.
Replace fp16 with fp32 and that's pretty close to the way it works at the moment. The time to apply the adapter for me varies from ~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices. There may be some ways to accelerate this, but at the moment I am more concerned with correctness and supporting all the use cases.
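To make the flow being described a bit more concrete, here is a rough Python sketch of the patch step (not the actual llama.cpp loader; the tensor naming, the dict layout, and the requantization note are assumptions for illustration):

```python
import numpy as np

def apply_lora(base_weights, lora_pairs, lora_alpha, lora_r):
    """Patch base tensors in place with W += scaling * (B @ A).

    base_weights: dict of name -> float32 array (the dequantized / f16 base layer)
    lora_pairs:   dict of name -> (A, B), with A of shape (r, n) and B of shape (m, r)
    """
    scaling = lora_alpha / lora_r
    for name, (A, B) in lora_pairs.items():
        if name not in base_weights:
            continue  # an adapter may only cover a subset of tensors (e.g. q and v)
        # The full-rank product B @ A is the slow part mentioned above.
        delta = scaling * (B.astype(np.float32) @ A.astype(np.float32))
        base_weights[name] += delta
    # llama.cpp would then requantize the patched tensors back to the model's
    # format; this sketch just leaves them in float32.
    return base_weights
```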
I'm trying to troubleshoot some issues on Windows. First, the conversion script and overall process were straightforward, so good job making it simple. I was able to load the 7B llama and 7B lora fine, but I noticed that I didn't seem to get the responses I expected with the lora applied. This seemed odd, because it was behaving as if the lora wasn't present at all. When I tried testing with the 13B model and 13B lora, I ran into issues when trying to run main. It mentioned:

Any pointers?

Edit (some additional info):
@MillionthOdin16 thanks for testing this. It has been a struggle telling for sure whether the lora that I had tried had any meaningful effect, but I think I found a problem. Can you see if the latest changes fix your issues?
Awesome! Memory allocation issues are fixed and now things are running smoothly. I'm not getting the responses I'd expect lora-wise, though, so I suspect there is something off about how the lora is applied. Now that I can run my 13B model, it's much easier to see when the lora is working correctly (13B is my best trained lora). Is there anything I can do to help troubleshoot? I have a lora that's 25MB that significantly improves the output when put on the plain llama model. I don't know if a lora that is fully merged into the base model might help as well (I don't know if we can compare effective weights between this implementation and the lora-fused model?). Once this works as expected it will be huge. Moving around 25MB loras vs base models is so much easier, and there's lots to be evaluated with layering loras and scaling them based off ratios :D
Are you using an f16 model? Trying to apply a lora to a quantized model may be a terrible idea after all.
You're right. The output works as expected when the llama model is f32. Nice job! Now I'm trying to figure out the best way to make it usable. After the model is merged completely with the lora and quantized to 4 bits, it still produces good output (my point being that eventually we will want to get these fully quantized). So we're merging at f32 to keep precision? I'm wondering what the best approach is for allowing this to work on quantized models. The ability to have a lora run on top of the base model in llama.cpp is in itself huge, because moving significant variations of llama becomes trivial. Having a way for a user to set a lora and have it fused to the model, which could then be quantized down to 4 bits, would be really helpful. It's not as streamlined as real-time loading of loras, but it makes the use of loras significantly easier. Do you have any thoughts on how quantization could be done in memory? Has anyone tested whether a quantized lora still has a useful impact on a quantized base model?

Extra info:

This works:

This doesn't:
Good to hear that it is working! Regarding creating pre-merged models, it is already possible to do that in Python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual. I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical, so you could do something like:
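(A hypothetical invocation along those lines, assuming conventional llama.cpp file names rather than anything stated in the PR: `./main -m models/7B/ggml-model-q4_0.bin --lora ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin`.)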
Okay, I see. Just to note, I tested f32, f16, and q4_0 base llama models with the same lora file. f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to have any variation resulting from the lora. I haven't checked the code to know if this is expected. Do you think applying a quantized lora to a quantized model might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade accuracy for speed).

Yes, I've seen the scripts, but I think for most users the understanding of model file formats and what they currently have vs what format they need is very confusing. My thought is that loras have the ability to significantly change the model outputs, are super lightweight, and are becoming more accessible and easier to train with projects like @lxe/simple-llm-finetuner. If we are able to streamline the use of loras and the conversion of a lora adapter to a ggml model format they are familiar with, we can make learning about language models much easier (abstracting away as much pytorch/GPU/heavy ML stuff as possible). I know you already know this haha, I'm just excited about how this and similar projects make very technical areas easy to play with.

Also, I've noticed a scaling factor in the console and you've mentioned it some. Is this something that directly affects the 'impact' of the lora weights on the overall model? If so, it could be useful to break it out as an argument to make experimentation easier. With stable diffusion they've done some pretty cool things with mixing different lora layers (so I'm thinking about this for down the line).
From what I understand, the llama models are natively f16, so I wouldn't expect much benefit from using an f32 model.
The problem with doing that is that the loras make very small modifications to the weights, and the changes may be completely lost in the noise when applied to a quantized model. Using a quantized lora too just makes the problem worse; I don't think that would work at all.
This is just something that the PEFT library does based on the `lora_alpha` and `r` values of the adapter.
Is this already parallelised?
One idea for improvement in the future is to add an option for specifying a subset of tensors to be loaded via llama_model_load(), for example with a list of tensor names. With this, we can avoid loading the entire base model and only load the LoRA tensors that will be adapted.
There are a lot of things that could be improved about this, but since it is already functional and there is very little risk of breaking anything unrelated to this change, let's merge it to allow other people to contribute their improvements, and also to receive broader feedback.
I REALLY wish someone would make a video tutorial showing how these models can interact with each other. The tech is moving so fast it's hard to keep up.
Has anyone tested this? NNs are robust to lots of operations, and it's not clear if they are robust to adding an int16 delta to an int4 weight.
Just to be clear, this is still the case?
I don't think that perplexity is a good way to test if a LoRA has been applied successfully. You need to test if the changes that the LoRA makes are still there, and in my tests they mostly aren't.
I'm not quite sure what you mean. Maybe you mean that perplexity over wiki.test.raw is not a good way to test it? If so, I agree. But perplexity over a LoRA-specific prompt would work well, and the quantitative measure that is perplexity seems better than eyeballing it, although eyeballing might be sufficient. For example, the following would be much more likely in a LoRA model, so it should have a lower perplexity:
The perplexity over some subset of the dataset used to train the LoRA could work.
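For concreteness, the quantity being compared is the usual definition of perplexity (a standard formula, not something specific to this thread): over tokens $x_1, \dots, x_N$ drawn from the adapter's training data,

$$\mathrm{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)$$

A successfully applied LoRA should give a noticeably lower value on that subset than the unpatched base model.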
I'm with you on that! Thanks for putting this PR together, btw. Do you still see those quality issues, or did you manage to resolve them in your latest commits? Sorry, I couldn't work it out by reading the thread.
The issues are still there. The solution was adding the `--lora-base` option, so that the lora can be applied to an f16 base model before the result is quantized back to the model's format.
btw you may have already seen it, but this is the prompt format used to train most of the Alpaca LoRAs. Sometimes people use regular chat and get poor results, until they switch to this format.
I see! Thanks for explaining.
This change allows applying LoRA adapters on the fly without having to duplicate the model files.
Instructions:

- Obtain the `adapter_config.json` and `adapter_model.bin` of a LoRA adapter and put them in the same path. For alpaca, these can be found at https://huggingface.co/tloen/alpaca-lora-7b/tree/main
- Use `convert-lora-to-ggml.py` to obtain `ggml-adapter-model.bin`
- Load the `ggml-adapter-model.bin` with `--lora`
- Optionally, use `--lora-base` to specify a model to use as a base. The layers modified by the LoRA adapter will be applied to the lora-base model and then quantized to the same format as the model specified with `-m`. Layers not modified by the LoRA adapter will remain untouched.

Limitations:

- Using `--lora` disables mmap since the models have to be modified anyway.
- When using `--lora-base`, a `ggml_cpy` operation is used to quantize the result, which currently is done in a single thread. Parallelizing `ggml_cpy` will improve loading times.