
finetune: -ngl option to offload to gpu? #3458

Closed
erlanger opened this issue Oct 3, 2023 · 15 comments · Fixed by #3762

Comments


erlanger commented Oct 3, 2023

How do you offload layers to the GPU with finetune? There is no -ngl option.
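For context, here is a rough sketch of the asymmetry being asked about, assuming the standard llama.cpp binaries of this era (model paths, layer counts, and hyperparameters are placeholders, not recommendations):

```sh
# Inference can offload layers to the GPU with -ngl:
./main -m models/open-llama-3b-v2-f16.gguf -p "Hello" -ngl 32

# finetune, at the time of this issue, had no equivalent flag and ran on the CPU,
# e.g. (flags as in the examples/finetune README):
./finetune \
  --model-base models/open-llama-3b-v2-q8_0.gguf \
  --train-data shakespeare.txt \
  --lora-out lora.bin \
  --threads 6 --adam-iter 30 --batch 4 --ctx 64
```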


karam14 commented Oct 5, 2023

Same issue here

@BarfingLemurs
Contributor

If you build with cuBLAS, finetuning might be slightly faster, but I don't think any CUDA optimization has been written for it yet.
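For reference, a cuBLAS-enabled build looked like this with either build system at the time:

```sh
# Makefile build with cuBLAS
make clean && make LLAMA_CUBLAS=1

# or with CMake
mkdir -p build && cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
```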


mrroll commented Oct 9, 2023

I'm not sure if I should be posting this to this issue specifically, but since it might be useful...

  1. Trying to actually run inference with an existing LoRA on the GPU results in the following error:

error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models

Inference without --n-gpu-layers works fine, but it feels a lot slower than when a GPU is used (example invocations are sketched at the end of this comment).

  2. Trying to finetune an f16 model, on commit db3abcc up through the latest master at the time of writing this comment, results in:

GGML_ASSERT: common/train.cpp:190: tensor->n_dims == 2

On these commits, even q8_0 models produce the same error.


However, on commit eee42c6 and earlier, I can do CPU finetuning on a q8_0 model. Trying to perform the same finetuning on an f16 model throws a different error:

error: operator(): Finetuning on tensors with type 'f16' is not yet supported.
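For item 1 above, the two invocations look roughly like this (model and adapter names are placeholders):

```sh
# Fails: LoRA + GPU offload on a quantized base model
./main -m models/base-q8_0.gguf --lora lora.bin -ngl 32 -p "..."
# -> error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models

# Works, but runs entirely on the CPU and is much slower
./main -m models/base-q8_0.gguf --lora lora.bin -p "..."
```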


Corallus-Caninus commented Oct 16, 2023

I get some utilization in nvidia-smi after building with cuBLAS (not insignificant, but not a lot).


earonesty commented Oct 18, 2023

If anyone has any idea of the best way to start thinking about fixing this, let me know; I want to get this to work. I know C++ and llama.cpp well enough, but not CUDA so much. Willing to learn and work on this, though.

@earonesty

This is now eligible for a bounty: OpenAgentsInc/workerbee#15. You can DM me for a negotiated amount.

@AndrewGodfrey
Contributor

Which codepath is most interesting here? f16 is slower, but running inference on q8_0 gives this warning:
"warning: using a lora adapter with a quantized model may result in poor quality, use a f16 or f32 base model with --lora-base"

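A sketch of what that warning is asking for: keep the quantized model for inference but point --lora-base at the unquantized original (paths are placeholders):

```sh
./main -m models/base-q8_0.gguf \
       --lora lora-adapter.bin \
       --lora-base models/base-f16.gguf \
       -p "..."
```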

earonesty commented Oct 20, 2023

I think f16 seems quite reasonable. Most people who are doing fine-tuning are using larger GPUs; it really just comes down to leveraging the GPUs as much as you can. We're doing f16 and f32 now and we let the user choose.

We also produce two outputs. I think our users like the merged GGUF as opposed to the adapter, although there are advantages to both, so we just generate both and let them download what they want.

@AndrewGodfrey
Contributor

Well fwiw, I have something working locally that does f32.
I saw this thread discussing the limitations of this.
And what I found in my limited testing:

  • GPU finetuning does work. It produces an f32 .bin file that I can successfully use (very slowly) with CPU inferencing.

  • But it can't be used directly with the --lora parameter with GPU inferencing. "llama_apply_lora_from_file_internal: error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models"

  • I also experimented with using export-lora (an example invocation is sketched at the end of this comment). This 'succeeded', but I found that the resulting .gguf seemed to have lost a lot of its finetuning. I don't know if this is due to a bug, but I could imagine it's not as good to finetune to f32 and then quantize to f16 as it is to finetune to f16 directly.

So I suppose this is of use for someone who actually wants to use an f32 finetuned model.

P.S. I found that CPU finetuning was a bit faster than GPU and used a lot less memory, presumably because CPU finetuning uses a smaller intermediate format than f32. This suggests that enabling GPU finetuning right now is worthless, but perhaps not: maybe it is worthwhile to someone with a lot more VRAM. The openllama v2 3B model is a bit big for my machine when running at f32. OTOH, my CPU doesn't have AVX.
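The export-lora experiment mentioned in the third bullet would have been along these lines (flag names are those of the examples/export-lora tool; paths are placeholders):

```sh
# Merge a LoRA adapter produced by finetune back into a standalone .gguf
./export-lora \
  --model-base models/open-llama-3b-v2-f32.gguf \
  --lora lora.bin \
  --model-out models/open-llama-3b-v2-finetuned.gguf
```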


earonesty commented Oct 24, 2023

Well fwiw, I have something working locally that does f32. I saw this thread discussing the limitations of this. And what I found in my limited testing:

  • GPU finetuning does work. It produces an f32 .bin file that I can successfully use (very slowly) with CPU inferencing.

I suppose you could convert that f32 to f16 afterward and/or quantize as needed? (A conversion step is sketched at the end of this comment.)

  • But it can't be used directly with the --lora parameter with GPU inferencing. "llama_apply_lora_from_file_internal: error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models"

That makes it kind of unhelpful. Maybe a conversion step fixes this?

  • I also experimented with using export-lora. This 'succeeded', but I found that the resulting .gguf seemed to have lost a lot of its finetuning. I don't know if this is due to a bug, but I could imagine it's not as good to finetune to f32 and then quantize to f16 as it is to finetune to f16 directly.

Hmm, it should be fine. The real issue is space: fine-tuning is probably best if it fits in the GPU and only "spills over" a few layers to the CPU. When I fine-tune in PyTorch, I typically use load_in_8bit=True, and when I'm testing stuff I might run load_in_4bit (just to see how it goes, and whether it's learning what I want it to learn). I rarely have room for 16-bit.

So I suppose this is of use for someone who actually wants to use an f32 finetuned model.

It's definitely a good step, but most people seem to use models in f16, q8_0, or lower quants.

P.S. I found that CPU finetuning was a bit faster than GPU and used a lot less memory.

Yes, if you're running f32, it's going to be slower.

This suggests that enabling GPU finetuning right now is worthless,

I still think a small step is worth it for a merge (I'm a firm believer in baby steps when it comes to code), especially if there are good tests, but I doubt it will be used much until f16 and even quantized fine-tunes are supported.
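The conversion step mentioned above would be something along these lines, assuming the LoRA has first been merged into a full model with export-lora (paths are placeholders):

```sh
# Convert a merged f32 model down to f16, or straight to a quantized type
./quantize models/merged-f32.gguf models/merged-f16.gguf f16
./quantize models/merged-f32.gguf models/merged-q8_0.gguf q8_0
```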

@AndrewGodfrey
Contributor

Looking at the code some more, I see that intermediates are f32 in all implementations (e.g. ggml_mul_mat has a hardcoded GGML_TYPE_F32 for its result). So I was wrong earlier in thinking the intermediate format is different between CPU and GPU. This is really just about downconverting to LoRA deltas, which I've not looked at yet. I'm not quite sure what Johannes meant by "Currently the CUDA code runs everything as f32 by default".

I also experimented with using export-lora. This 'succeeded' but I found that the resulting .gguf seemed to have lost a lot of its finetuning.

hmm, it should be fine. the real issue is space.

Thanks, then that was probably a bug / user error.

@earonesty

Indeed, even torch doesn't fine-tune well across GPU and CPU!

pytorch/pytorch#91165

@QueryType

We may be close to having it now! Thanks, @AndrewGodfrey!

@earonesty

It would be amazing, but also somehow predictable, that llama.cpp beats torch at getting quantized fine-tuning to work with a GPU across multiple operating systems.

@AndrewGodfrey
Contributor

Note: As detailed here, finetune now has an "-ngl" option and it does offload some of the work to the GPU. But a lot of the training work is done on the CPU and so it barely helps, and in some cases runs slower.
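For completeness, the new flag is passed the same way as on the other tools (values are placeholders; as noted above, much of the training work still runs on the CPU):

```sh
./finetune \
  --model-base models/open-llama-3b-v2-q8_0.gguf \
  --train-data shakespeare.txt \
  --lora-out lora.bin \
  --threads 6 --adam-iter 30 --batch 4 --ctx 64 \
  -ngl 99
```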
