
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD #5240

Merged
8 commits merged into master from gg/remove-max-devices on Jan 31, 2024

Conversation

@ggerganov (Owner) commented Jan 31, 2024

Instead of the defines, use the functions:

  • llama_max_devices()
  • llama_supports_gpu_offload()
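
For illustration, a minimal sketch of the replacement calls (my own example, not part of the PR diff; it assumes a build of llama.cpp that already includes this change):

```c
#include <stdbool.h>
#include <stdio.h>

#include "llama.h"

int main(void) {
    // was the LLAMA_MAX_DEVICES define
    size_t n_devices = llama_max_devices();
    // was the LLAMA_SUPPORTS_GPU_OFFLOAD define
    bool offload = llama_supports_gpu_offload();

    printf("max devices: %zu, gpu offload: %s\n", n_devices, offload ? "yes" : "no");
    return 0;
}
```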

@ggerganov requested a review from slaren on January 31, 2024
llama.cpp (review comment: outdated, resolved)
@slaren (Collaborator) commented Jan 31, 2024

server.cpp also needs to be updated.

ggerganov and others added 2 commits January 31, 2024 16:13
Co-authored-by: slaren <slarengh@gmail.com>
@slaren (Collaborator) commented Jan 31, 2024

I assume this is so that this value can be determined at run time, which is useful, for example, with dynamic linking. In that case, shouldn't LLAMA_SUPPORTS_GPU_OFFLOAD also be a function?
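
To illustrate the dynamic-linking point, a hedged sketch of resolving the new function from a shared library at run time (my own example; the library name libllama.so and the POSIX dlopen/dlsym API are assumptions about how the application is deployed):

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stddef.h>

int main(void) {
    // Resolve the symbol from whichever libllama the application loads at run time,
    // instead of baking a compile-time LLAMA_MAX_DEVICES value into the caller.
    void *lib = dlopen("libllama.so", RTLD_NOW);
    if (!lib) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    size_t (*max_devices)(void) = (size_t (*)(void)) dlsym(lib, "llama_max_devices");
    if (max_devices) {
        printf("llama_max_devices() = %zu\n", max_devices());
    }

    dlclose(lib);
    return 0;
}
```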

@ggerganov (Owner, Author)

> I assume this is so that this value can be determined at run time, which is useful, for example, with dynamic linking.

Yes, my main motivation is to reduce GPU-related conditionals in the header files. Nothing specific in mind; it just seems better this way.

@slaren (Collaborator) commented Jan 31, 2024

There is still one LLAMA_SUPPORTS_GPU_OFFLOAD in common/train.cpp.

@slaren (Collaborator) commented Jan 31, 2024

If there are any downstream applications that depend on these defines to enable GPU offloading, they may break after this change, so it may be a good idea to post a notice.
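
As a hedged sketch of the migration such a notice would describe (illustrative caller code, not taken from any downstream project; the old #ifdef shows the pre-PR pattern and the call shows the replacement):

```c
#include "llama.h"

// Before this PR, downstream code typically gated GPU options at compile time:
//
//     #ifdef LLAMA_SUPPORTS_GPU_OFFLOAD
//         params.n_gpu_layers = 99;
//     #endif
//
// After this PR, the same decision moves to run time:
void enable_offload_if_supported(struct llama_model_params *params) {
    if (llama_supports_gpu_offload()) {
        params->n_gpu_layers = 99; // offload all layers when the build allows it
    }
}
```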

@ggerganov changed the title from "llama : remove LLAMA_MAX_DEVICES from llama.h" to "llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD" on Jan 31, 2024
llama.h (review comment: outdated, resolved)
@ggerganov merged commit 5cb04db into master on Jan 31, 2024
53 of 59 checks passed
@ggerganov deleted the gg/remove-max-devices branch on January 31, 2024 15:30
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (ggerganov#5240)

* llama : remove LLAMA_MAX_DEVICES from llama.h

ggml-ci

* Update llama.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* server : remove LLAMA_MAX_DEVICES

ggml-ci

* llama : remove LLAMA_SUPPORTS_GPU_OFFLOAD

ggml-ci

* train : remove LLAMA_SUPPORTS_GPU_OFFLOAD

* readme : add deprecation notice

* readme : change deprecation notice to "remove" and fix url

* llama : remove gpu includes from llama.h

ggml-ci

---------

Co-authored-by: slaren <slarengh@gmail.com>
@irthomasthomas

Hi, in the latest llama-cpp-python release (0.2.39) I get this error: ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1. Is that related to this change?

@hamza233 commented Feb 7, 2024

> Hi, in the latest llama-cpp-python release (0.2.39) I get this error: ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1. Is that related to this change?

Getting the same error.

@slaren (Collaborator) commented Feb 8, 2024

It is probably related to this change. LLAMA_MAX_DEVICES has been replaced with llama_max_devices().

@cebtenzzre (Collaborator)

> It is probably related to this change. LLAMA_MAX_DEVICES has been replaced with llama_max_devices().

llama-cpp-python is calling llama_max_devices() though - after all, it is written in Python and gets this value at runtime. I tested it myself with a CUBLAS build and got llama_cpp.LLAMA_MAX_DEVICES = 16.

@slaren (Collaborator) commented Feb 8, 2024

It's weird that the error says LLAMA_MAX_DEVICES=1 instead of 0 or undefined. Maybe they were using a CPU or Metal build, and this is not really a bug.

@hamza233 commented Feb 8, 2024

> It's weird that the error says LLAMA_MAX_DEVICES=1 instead of 0 or undefined. Maybe they were using a CPU or Metal build, and this is not really a bug.

torch.cuda.device_count() returns 8.
I pass n_gpu_layers=-1 and I see llm_load_tensors: offloaded 81/81 layers to GPU in the verbose output. Inference also runs fine but is very slow, and I don't see any GPU being used with nvidia-smi.

I get ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1 when I pass tensor_split=[1] + [10]*7 in order to use all 8 GPUs.
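
For reference, a hedged sketch of the run-time check a caller could do before asking for an 8-way split (my own example, not llama-cpp-python code; the 81-layer count just mirrors the log above):

```c
#include <stdio.h>

#include "llama.h"

int main(void) {
    const float split[8] = {1.f, 10.f, 10.f, 10.f, 10.f, 10.f, 10.f, 10.f};

    // Per the discussion above, a CPU-only or Metal build reports 1 here,
    // which is why an 8-way tensor_split gets rejected by the bindings.
    size_t n_devices = llama_max_devices();
    if (!llama_supports_gpu_offload() || n_devices < 8) {
        fprintf(stderr, "this build reports %zu device(s); rebuild with CUDA to split across 8 GPUs\n", n_devices);
        return 1;
    }

    struct llama_model_params params = llama_model_default_params();
    params.tensor_split = split;
    params.n_gpu_layers = 81;
    // ... llama_load_model_from_file(model_path, params) would follow here
    return 0;
}
```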

@slaren (Collaborator) commented Feb 8, 2024

The current version of llama.cpp will always print the offloaded x/x layers to GPU message when using -ngl, even in CPU-only builds. To be sure that the GPU is actually being used, you should look at the buffer sizes; they should say something like CUDA0 instead of CPU.

@irthomasthomas

The only way I can get it to launch is to roll back llama-cpp-python to v0.2.37.

@thiner commented Feb 20, 2024

@slaren @cebtenzzre I have the same issue with the latest llama-cpp-python. Is there any workaround to bypass the LLAMA_MAX_DEVICES=0 issue?

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
llama : remove LLAMA_MAX_DEVICES and LLAMA_SUPPORTS_GPU_OFFLOAD (ggerganov#5240)