Basic Vulkan Multi-GPU implementation #5321
Conversation
Move most global variables into backend context
LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend\n", __func__); | ||
llama_free(ctx); | ||
return nullptr; | ||
for (int device = 0; device < ggml_backend_vk_get_device_count(); ++device) { |
It's better to avoid initializing the backends that are not used. At least with LLAMA_SPLIT_NONE
only the main GPU backend should be initialized.
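A hedged sketch of what that could look like, mirroring how the CUDA path handles split modes. The `ggml_backend_vk_*` names come from the diff above, while `model->split_mode`, `model->main_gpu` and the exact `ggml_backend_vk_init` parameter form are assumptions for illustration, not taken from this PR:

```cpp
// Sketch only: create a single backend for LLAMA_SPLIT_NONE, one per device otherwise.
if (model->split_mode == LLAMA_SPLIT_NONE) {
    ggml_backend_t backend = ggml_backend_vk_init(model->main_gpu);  // assumed signature
    if (backend == nullptr) {
        LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend\n", __func__);
        llama_free(ctx);
        return nullptr;
    }
    ctx->backends.push_back(backend);
} else {
    for (int device = 0; device < ggml_backend_vk_get_device_count(); ++device) {
        ggml_backend_t backend = ggml_backend_vk_init(device);
        if (backend == nullptr) {
            LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend %d\n", __func__, device);
            llama_free(ctx);
            return nullptr;
        }
        ctx->backends.push_back(backend);
    }
}
```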
That is true, but right now the split-mode has no effect at all on the Vulkan backend (and throws a warning if someone tries to use it). I'm actually initializing the backends even before this point, in ggml_vk_init_cpu_assist(), because I cannot get the properties of the GPUs without initializing them. This isn't optimal.
I suppose I could duplicate the parts of the device initialization code required to read the properties and only do that initially. I'll think about it and update this code tomorrow.
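For reference, Vulkan allows reading physical-device properties without creating logical devices, which is what would let the property query be split from full backend initialization. A minimal standalone Vulkan-Hpp example of that idea (not the ggml-vulkan code):

```cpp
#include <vulkan/vulkan.hpp>
#include <iostream>

int main() {
    vk::ApplicationInfo app_info("device-info", 1, nullptr, 0, VK_API_VERSION_1_1);
    vk::InstanceCreateInfo instance_info({}, &app_info);
    vk::Instance instance = vk::createInstance(instance_info);

    // Enumerating physical devices and querying their properties does not
    // require creating a logical device, queues or pipelines.
    for (const vk::PhysicalDevice & dev : instance.enumeratePhysicalDevices()) {
        vk::PhysicalDeviceProperties props = dev.getProperties();
        std::cout << props.deviceName.data() << std::endl;
    }

    instance.destroy();
    return 0;
}
```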
> I cannot get the properties of the GPUs without initializing them
In the Kompute backend, the devices are enumerated via ggml_vk_available_devices, which can be called by the user (GPT4All needs this) but is also used by ggml_backend_kompute_buffer_type to get the necessary device properties in advance - this was inspired by the CUDA backend.
Nets 7.2 t/s on dual 3090s loading a 70B.
With some tweaks it is possible to use CUDA and Vulkan at the same time; however, the hooks that these backends have in […]
… for cpu assist Add missing cleanup code
@slaren I added the device output print in the beginning, so I only initialize one backend for CPU offload at that stage. But I have a problem with cleaning up the device properly. I can only do that once all buffers are freed, but llama.cpp frees the backends before it frees buffers like the KV cache, weights and other data, which would segfault if I actually freed the device beforehand. I'm not sure how to resolve that.

You can keep a reference count, increased every time that a buffer or backend is initialized and reduced when freed, and only free the device when it reaches zero. You could probably use […]

The Kompute backend does reference counting like slaren mentioned.

Thanks, that's a good idea. I'll implement that later today.
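A minimal sketch of the manual reference-counting variant described above (names are illustrative, not the actual ggml-vulkan code): each device keeps a counter that is bumped whenever a buffer or backend starts using it, and the device is only torn down once the counter returns to zero.

```cpp
#include <array>

constexpr int MAX_DEVICES = 16;

struct vk_device_state {
    bool initialized = false;
    int  refcount    = 0;
};

static std::array<vk_device_state, MAX_DEVICES> g_devices;

// Called whenever a buffer or a backend starts using device `id`.
static void device_acquire(int id) {
    if (!g_devices[id].initialized) {
        // ... create the Vulkan device, queues, pipelines, ...
        g_devices[id].initialized = true;
    }
    g_devices[id].refcount++;
}

// Called whenever a buffer or a backend that used device `id` is freed.
static void device_release(int id) {
    if (--g_devices[id].refcount == 0) {
        // ... destroy the Vulkan device and its resources ...
        g_devices[id].initialized = false;
    }
}
```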
… properly allocated and freed
@slaren I reworked how it handles devices and buffers using smart pointers; in my tests it always freed the buffers and devices at the correct time and without throwing any Vulkan validation issues. Let me know what you think.
I don't know enough about the way the vulkan backend is implemented to give any specific advice, but I am surprised by the way […]

```cpp
// Sketch: `device`, `buffer` and `backend` stand in for the backend's real handle
// types. A device is created on demand and stays alive only while at least one
// buffer or backend still holds a reference to it.
static std::shared_ptr<device> get_device(int id) {
    static std::weak_ptr<device> devices[MAX_DEVICES];
    if (devices[id].expired()) {
        devices[id] = std::make_shared<device>(id);
    }
    return devices[id].lock();
}

// buffer
struct buffer_ctx {
    std::shared_ptr<device> dev;
};

void free_buffer(buffer b) {
    delete b->ctx;
}

buffer alloc_buffer(int dev_id, int size) {
    buffer_ctx * ctx = new buffer_ctx;
    ctx->dev = get_device(dev_id);   // the buffer keeps its device alive
    return new buffer(ctx);
}

// backend
struct backend_ctx {
    std::shared_ptr<device> dev;
};

void free_backend(backend b) {
    delete b->ctx;
    delete b;
}

backend init_backend(int dev_id) {
    backend_ctx * ctx = new backend_ctx;
    ctx->dev = get_device(dev_id);   // the backend keeps its device alive
    return new backend(ctx);
}
```

In this way, both the buffer and the backend hold a reference to the device, and the device is only released once the last of them is freed. Ultimately though, how to implement this is entirely up to you, but the call to […]
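For illustration, a self-contained toy version of the same pattern (made-up types, not ggml code) shows that the device is destroyed only when the last buffer or backend referencing it goes away, regardless of the order in which they are freed:

```cpp
#include <cstdio>
#include <memory>

constexpr int MAX_DEVICES = 16;

struct device {
    int id;
    explicit device(int id) : id(id) { std::printf("device %d created\n", id); }
    ~device() { std::printf("device %d destroyed\n", id); }
};

static std::shared_ptr<device> get_device(int id) {
    static std::weak_ptr<device> devices[MAX_DEVICES];
    if (devices[id].expired()) {
        devices[id] = std::make_shared<device>(id);
    }
    return devices[id].lock();
}

struct buffer  { std::shared_ptr<device> dev; };
struct backend { std::shared_ptr<device> dev; };

int main() {
    auto be  = std::make_unique<backend>(backend{get_device(0)});
    auto buf = std::make_unique<buffer>(buffer{get_device(0)});

    be.reset();   // backend freed first: device 0 stays alive, buf still references it
    buf.reset();  // last reference dropped: "device 0 destroyed" is printed here
}
```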
That use of weak_ptr is smarter than what I did, yes. But I don't see why having an init and a free function is surprising for the CPU matmul offload. The backend system has an init and a free and expects both to be called. The CPU assist initializes a backend since it requires all of those structures as well, so it should clean that up if the backend system hasn't done that. The backend system doesn't initialize a backend if no layers are offloaded. It looks to me like CUDA just gets around that by hiding most of that boilerplate in the background (like device initialization) and not destroying resources (like the streams). Please correct me if I'm wrong. I guess at the very least I should rename the function to something clearer, like […]
Ok, I can see that the automatic offloading of mat muls through the CPU backend complicates this a lot. The goal is to move all of this logic to […]

Yeah, I'm looking forward to when that gets implemented; then I can drop all of the […]
Something's happening with RAM here: it's allocating extra on reloading (not a feature in llama.cpp, but I doubt it should work that way). Sorry for being late on this, I didn't have time to test this PR earlier. VRAM is fine, but RAM usage seems to almost double (from ~8GB to ~15GB) on restart. It happens once, though.

Interesting. Thanks for testing it, maybe I missed some cleanup? I'll try to reproduce it.

Sup guys, does anyone know if Vulkan multi-GPU has a limit on how many GPUs you can set?

It's 16, but that's only an arbitrary number chosen for the size of the arrays holding the relevant data.
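In other words, the cap is just a compile-time array bound rather than a hardware limit, along these lines (constant and type names here are illustrative; the actual ones in ggml-vulkan may differ):

```cpp
#define GGML_VK_MAX_DEVICES 16                            // arbitrary upper bound on devices
static vk_device_state vk_devices[GGML_VK_MAX_DEVICES];   // per-device state in fixed-size arrays
```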
Just reporting on the latest changes: with ggml-alloc v3 it now reads the model again from the disk on restart instead of quickly getting it from cache. It happens only on Vulkan (both with and without […]

You're really good at catching memory issues. But if #5452 broke something, maybe also mention it to them. CLBlast doesn't use […]

The only way I can imagine this change could affect backends is because a buffer with size 1 is no longer allocated during the measure step.
Unfortunately, I cannot test any backends except for CLBlast and Vulkan, which is effectively one backend if CLBlast doesn't use ggml-alloc. I'm also using an old OS and old hardware overall. I tested CPU only, no problems with restart.

How do I run the benchmark to know the t/s that my current hardware setup is doing?
@0cc4m, I installed the 9th GPU and my mobo and ROCm detect it, but when I run vulkaninfo it doesn't detect it. What can I do, bro? Also, I'm getting a warning that was not present before installing the 9th GPU; it's the following:

I attached my vulkaninfo file, don't know if it can help to figure out what's going on. Can you give me a hand, bro?
@0cc4m This PR broke the Kompute backend (at least for multi-GPU systems), probably the "generalize LLAMA_SPLIT_LAYER for all backends" part, since I can get Kompute to work by passing […]
@cebtenzzre do you have more details? I do not see an obvious issue with the implementation of […]

@cebtenzzre Kompute seems to work for me.
@0cc4m @slaren how is Kompute related to Vulkan multi-GPU? Anyway, I managed to install llama.cpp with Kompute, but I found out it only supports like 2 or 3 quants; it's very limited and I didn't test it with multi-GPU. At the moment I'm trying to enable Vulkan to get more than 8 GPUs working. I found out I had to install Mesa drivers and modify some of the build code to remove the restriction of 8 GPUs. This weekend I'm going to keep trying. This is a hobby for me, so I need to find free time to work on my hobbies.

Also guys, do you have a Discord or something? I would like to join.

I don't think anyone can help you with that. Very few people have that many GPUs in one computer, and basically no one had a reason to use Vulkan on so many devices together.

Yes bro, I realize no one has the knowledge to do that, but I'm working on it and I will be happy to share the procedures.

Feel free to join the GPT4All Discord, I made a […]

@userbox020 I'm very interested in this hobby of yours. I think it's really cool.

Join the GPT4All Discord bro, there we can chat about LLM stuff.
* Initial Vulkan multi-gpu implementation
  Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only do device info print in the beginning and initialize one backend for cpu assist
  Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
I'm not fully done; I need to finish some details and check whether everything allocated gets cleaned up properly, but it works already.
Right now it copies all data between them across RAM, which is not particularly fast, but I may be able to fix that in the future using Vulkan DeviceGroups. But maybe that only works between devices of the same vendor; I haven't tried it yet and the information I found about it is sparse.
This change also cleaned up most of the global variables the backend had, so it's a step towards allowing using it multiple times in the same program.
Edit: Forgot to mention, you have to set `GGML_VK_VISIBLE_DEVICES` to the indices of the devices you want, for example with `export GGML_VK_VISIBLE_DEVICES=0,1,2`. The indices correspond to the device order in `vulkaninfo --summary`. By default it will still only use the first device it finds.
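A hedged sketch of how such a comma-separated device filter might be parsed on the backend side (illustrative only, not the actual ggml-vulkan implementation):

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Returns the device indices listed in GGML_VK_VISIBLE_DEVICES, in
// `vulkaninfo --summary` order; an empty result means "use the default".
static std::vector<int> parse_visible_devices() {
    std::vector<int> ids;
    const char * env = std::getenv("GGML_VK_VISIBLE_DEVICES");
    if (env == nullptr) {
        return ids;  // unset: fall back to the first device only
    }
    std::stringstream ss(env);
    std::string tok;
    while (std::getline(ss, tok, ',')) {
        ids.push_back(std::stoi(tok));
    }
    return ids;
}
```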
Benchmarks

13B q6_k
I can now run 70B q4_k_s across those three GPUs: