
Basic Vulkan Multi-GPU implementation #5321

Merged: 8 commits, Feb 7, 2024
Conversation

@0cc4m (Collaborator) commented Feb 4, 2024

I'm not fully done; I still need to finish some details and check whether everything allocated gets cleaned up properly, but it already works.

Right now it copies all data between the GPUs through RAM, which is not particularly fast, but I may be able to fix that in the future using Vulkan device groups. That may only work between devices of the same vendor, though; I haven't tried it yet, and the information I found about it is sparse.

This change also cleans up most of the backend's global variables, so it's a step towards allowing the backend to be used multiple times in the same program.

Edit: I forgot to mention that you have to set GGML_VK_VISIBLE_DEVICES to the indices of the devices you want, for example with export GGML_VK_VISIBLE_DEVICES=0,1,2. The indices correspond to the device order in vulkaninfo --summary. By default it will still use only the first device it finds.
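
(For illustration only: a minimal sketch, not the actual ggml-vulkan code, of how a comma-separated GGML_VK_VISIBLE_DEVICES value could be parsed into device indices; the helper name is made up.)

#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turn "0,1,2" from the environment into a list of device indices.
static std::vector<int> parse_visible_devices() {
    std::vector<int> ids;
    const char * env = std::getenv("GGML_VK_VISIBLE_DEVICES");
    if (env == nullptr) {
        return ids; // unset: the backend falls back to the first device it finds
    }
    std::stringstream ss(env);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            ids.push_back(std::stoi(item));
        }
    }
    return ids;
}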

Benchmarks

13B q6_k

ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    1962,96 ms /   622 tokens (    3,16 ms per token,   316,87 tokens per second)
llama_print_timings:        eval time =    5036,72 ms /   127 runs   (   39,66 ms per token,    25,21 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
llama_print_timings: prompt eval time =    2788,92 ms /   622 tokens (    4,48 ms per token,   223,03 tokens per second)
llama_print_timings:        eval time =    9158,65 ms /   127 runs   (   72,12 ms per token,    13,87 tokens per second)
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    5597,99 ms /   622 tokens (    9,00 ms per token,   111,11 tokens per second)
llama_print_timings:        eval time =   15391,61 ms /   127 runs   (  121,19 ms per token,     8,25 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    4132,89 ms /   622 tokens (    6,64 ms per token,   150,50 tokens per second)
llama_print_timings:        eval time =   12393,73 ms /   127 runs   (   97,59 ms per token,    10,25 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    4338,94 ms /   622 tokens (    6,98 ms per token,   143,35 tokens per second)
llama_print_timings:        eval time =   14024,83 ms /   127 runs   (  110,43 ms per token,     9,06 tokens per second)

I can now run 70B q4_k_s across those three GPUs:

ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
[...]
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140,62 MiB
llm_load_tensors:     Vulkan buffer size = 17160,81 MiB
llm_load_tensors:     Vulkan buffer size = 11512,00 MiB
llm_load_tensors:     Vulkan buffer size = 10689,80 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =   280,00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   192,00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   168,00 MiB
llama_new_context_with_model: KV self size  =  640,00 MiB, K (f16):  320,00 MiB, V (f16):  320,00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =    20,01 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    17,60 MiB
llama_new_context_with_model: graph splits (measure): 7
[...]
llama_print_timings: prompt eval time =   19742,11 ms /   622 tokens (   31,74 ms per token,    31,51 tokens per second)
llama_print_timings:        eval time =   38055,18 ms /   127 runs   (  299,65 ms per token,     3,34 tokens per second)

Move most global variables into backend context
@0cc4m 0cc4m added the Vulkan Issues specific to the Vulkan backend label Feb 4, 2024
@0cc4m 0cc4m marked this pull request as ready for review February 4, 2024 18:12
Review comment on llama.cpp (outdated):
LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend\n", __func__);
llama_free(ctx);
return nullptr;
for (int device = 0; device < ggml_backend_vk_get_device_count(); ++device) {
@slaren (Collaborator) commented Feb 4, 2024

It's better to avoid initializing the backends that are not used. At least with LLAMA_SPLIT_NONE only the main GPU backend should be initialized.

@0cc4m (Collaborator Author)

That is true, but right now the split-mode has no effect at all on the Vulkan backend (and throws a warning if someone tries to use it). I'm actually initializing the backends even before this point, in ggml_vk_init_cpu_assist(), because I cannot get the properties of the GPUs without initializing them. This isn't optimal.

I suppose I could duplicate the parts of the device initialization code required to read the properties and only do that initially. I'll think about it and update this code tomorrow.

Collaborator

I cannot get the properties of the GPUs without initializing them

In the Kompute backend, the devices are enumerated via ggml_vk_available_devices, which can be called by the user (GPT4All needs this) but is also used by ggml_backend_kompute_buffer_type to get the necessary device properties in advance - this was inspired by the CUDA backend.
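
(As a rough illustration, not llama.cpp or Kompute code: Vulkan physical devices can be enumerated and their properties read without creating logical devices, so a probe along these lines is enough to report device info up front.)

#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    // Creating an instance is cheap; no logical devices, queues or memory are allocated here.
    vk::ApplicationInfo app_info("device-probe", 1, nullptr, 0, VK_API_VERSION_1_2);
    vk::InstanceCreateInfo create_info({}, &app_info);
    vk::Instance instance = vk::createInstance(create_info);

    // Enumerate physical devices and print their names and supported Vulkan API versions.
    for (const vk::PhysicalDevice & dev : instance.enumeratePhysicalDevices()) {
        vk::PhysicalDeviceProperties props = dev.getProperties();
        std::printf("%s | api %u.%u\n", props.deviceName.data(),
                    VK_VERSION_MAJOR(props.apiVersion), VK_VERSION_MINOR(props.apiVersion));
    }

    instance.destroy();
    return 0;
}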

@Ph0rk0z commented Feb 4, 2024

Nets 7.2 t/s on dual 3090 loading 70b.

@slaren (Collaborator) commented Feb 5, 2024

With some tweaks it is possible to use CUDA and Vulkan at the same time, however the hooks that these backends have in ggml.c make this impractical to merge at the moment.

main: build = 2064 (daa6a9c3)
main: built with MSVC 19.29.30151.0 for x64
main: seed  = 1707098852
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3080 | uma: 0 | fp16: 1 | warp size: 32
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from models/7B/ggml-model-q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:    Vulkan0 buffer size =   971.30 MiB
llm_load_tensors:      CUDA0 buffer size =  2606.26 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =    64.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =     9.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    66.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    77.55 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 Nala the cat has become quite the social media star, after her owner’s video of her playing fetch with a stuffed dog toy went viral.
The foot
llama_print_timings:        load time =    1501.46 ms
llama_print_timings:      sample time =       2.68 ms /    32 runs   (    0.08 ms per token, 11922.50 tokens per second)
llama_print_timings: prompt eval time =     135.20 ms /     5 tokens (   27.04 ms per token,    36.98 tokens per second)
llama_print_timings:        eval time =     365.80 ms /    31 runs   (   11.80 ms per token,    84.75 tokens per second)
llama_print_timings:       total time =     511.11 ms /    36 tokens

@0cc4m (Collaborator Author) commented Feb 5, 2024

@slaren I added the device info print at the beginning, so I only initialize one backend for CPU offload at that stage.

But I have a problem with cleaning up the device properly. I can only do that once all buffers are freed, but llama.cpp frees the backends before it frees buffers like the kv cache, weights and other data, which would segfault if I actually freed the device beforehand. I'm not sure how to resolve that.

@slaren (Collaborator) commented Feb 5, 2024

You can keep a reference count, increased every time a buffer or backend is initialized and decreased when one is freed, and only free the device when it reaches zero. You could probably use std::shared_ptr to do this; the Metal and Kompute backends do something like that. Generally, it is not good to require calls to custom backend-specific functions for normal usage of a backend.

@cebtenzzre (Collaborator)

But I have a problem with cleaning up the device properly.

The Kompute backend does reference counting like slaren mentioned.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Thanks, that's a good idea. I'll implement that later today.

@0cc4m (Collaborator Author) commented Feb 6, 2024

@slaren I reworked how it handles devices and buffers using smart pointers; in my tests it always freed the buffers and devices at the correct time, without throwing any Vulkan validation issues. Let me know what you think.

@slaren (Collaborator) commented Feb 6, 2024

I don't know enough about the way the vulkan backend is implemented to give any specific advice, but I am surprised by the way weak_ptr seems to be used. The way I would expect this to be implemented is something like this:

// illustrative sketch: `device`, `buffer` and `backend` stand in for the real backend types
static std::shared_ptr<device> get_device(int id) {
    static std::weak_ptr<device> devices[MAX_DEVICES];
    if (devices[id].expired()) {
        devices[id] = std::make_shared<device>(id);
    }
    return devices[id].lock();
}

// buffer
struct buffer_ctx {
    std::shared_ptr<device> dev;
};

void free_buffer(buffer b) {
    delete b->ctx;
}

buffer alloc_buffer(int dev_id, int size) {
    buffer_ctx * ctx = new buffer_ctx;
    ctx->dev = get_device(dev_id);
    return new buffer(ctx);
}

// backend
struct backend_ctx {
    std::shared_ptr<device> dev;
};

void free_backend(backend b) {
    delete b->ctx;
    delete b;
}

backend init_backend(int dev_id) {
    backend_ctx * ctx = new backend_ctx;
    ctx->dev = get_device(dev_id);
    return new backend(ctx);
}

In this way, both the backend and buffer objects hold owning references to the device, and when the last one is freed, the device is automatically freed as well. There is only one global weak_ptr reference to the device, used to initialize new buffers and backends; it allows reusing the same device instance, but will not prevent the device from being freed once the last buffer or backend using it is freed.
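
(A quick usage sketch of that pattern, using the placeholder types from the pseudocode above, so purely illustrative:)

backend be  = init_backend(0);        // device 0 is created, refcount 1
buffer  buf = alloc_buffer(0, 1024);  // same device instance is reused, refcount 2
free_backend(be);                     // refcount 1, device stays alive for the buffer
free_buffer(buf);                     // refcount 0, device is freed automatically here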

Ultimately, though, how to implement this is entirely up to you, but the call to ggml_vk_cleanup_cpu_assist() should not be required by applications using the backend, and it should be removed from llama.cpp.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Ultimately, though, how to implement this is entirely up to you, but the call to ggml_vk_cleanup_cpu_assist() should not be required by applications using the backend, and it should be removed from llama.cpp.

That use of weak_ptr is smarter than what I did, yes.

But I don't see why having an init and a free function is surprising for the CPU matmul offload. The backend system has an init and a free and expects both to be called. The CPU assist initializes a backend, since it requires all of those structures as well, so it should clean that up if the backend system hasn't done so. The backend system doesn't initialize a backend if no layers are offloaded. It looks to me like CUDA just gets around that by hiding most of that boilerplate in the background (like device initialization) and not destroying resources (like the streams). Please correct me if I'm wrong.

I guess at the very least I should rename the function to something clearer, like ggml_vk_free_cpu_assist.

@slaren (Collaborator) commented Feb 6, 2024

Ok, I can see that the automatic offloading of mat muls through the CPU backend complicates this a lot. The goal is to move all of this logic to ggml_backend_sched eventually, but until that's done it's ok to keep the call there, or just never free the resources like the CUDA backend does.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Ok, I can see that the automatic offloading of mat muls through the CPU backend complicates this a lot. The goal is to move all of this logic to ggml_backend_sched eventually, but until that's done it's ok to keep the call there, or just never free the resources like the CUDA backend does.

Yeah, I'm looking forward to when that gets implemented; then I can drop all of the _cpu_assist functions and a bunch of other internal code. I assume you will do that when you get around to it? I can help with the Vulkan parts then.

@0cc4m 0cc4m merged commit ee1628b into master Feb 7, 2024
56 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-multigpu branch February 7, 2024 06:54
@MaggotHATE (Contributor)

Something's happening with RAM here: extra memory gets allocated on reloading (reloading a model isn't a built-in llama.cpp feature, but I doubt it should work this way). Sorry for being late on this; I didn't have time to test this PR earlier.

VRAM is fine, but RAM usage almost doubles (from ~8 GB to ~15 GB) on restart. It only happens once, though.

@0cc4m (Collaborator Author) commented Feb 7, 2024

Something's happening with RAM here: extra memory gets allocated on reloading (reloading a model isn't a built-in llama.cpp feature, but I doubt it should work this way). Sorry for being late on this; I didn't have time to test this PR earlier.

VRAM is fine, but RAM usage almost doubles (from ~8 GB to ~15 GB) on restart. It only happens once, though.

Interesting. Thanks for testing it; maybe I missed some cleanup? I'll try to reproduce it.

@userbox020

Sup guys, does anyone know if Vulkan multi-GPU has a limit on the number of GPUs you can set? I'm stuck at 8; going to keep doing some tests.

@0cc4m (Collaborator Author) commented Feb 11, 2024

Sup guys, does anyone know if Vulkan multi-GPU has a limit on the number of GPUs you can set? I'm stuck at 8; going to keep doing some tests.

It's 16, but that's only an arbitrary number chosen for the size of the arrays holding the relevant data.
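
(For illustration, a hedged sketch of that kind of bound; the constant and type names here are hypothetical, not the actual ggml-vulkan identifiers.)

// Hypothetical: a compile-time constant sizes the per-device state arrays,
// so raising it raises the number of usable GPUs.
#define VK_MAX_DEVICES 16

struct vk_device_state;                              // placeholder per-device state
static vk_device_state * vk_devices[VK_MAX_DEVICES]; // fixed-size array caps device count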

@MaggotHATE (Contributor)

Just reporting on the latest changes: with ggml-alloc v3 it now reads the model from disk again on restart instead of quickly getting it from the cache.

It happens only on Vulkan (both with and without ggml_vk_free_cpu_assist); CLBlast works as usual.

@0cc4m (Collaborator Author) commented Feb 12, 2024

Just reporting on the latest changes: with ggml-alloc v3 it now reads the model from disk again on restart instead of quickly getting it from the cache.

It happens only on Vulkan (both with and without ggml_vk_free_cpu_assist); CLBlast works as usual.

You're really good at catching memory issues. But if #5452 broke something, maybe also mention it there. CLBlast doesn't use ggml-alloc, which is why it's not affected. Does it also happen if you run CPU-only?

@slaren (Collaborator) commented Feb 12, 2024

The only way I can imagine this change could affect backends is that a buffer of size 1 is no longer allocated during the measure step.

@MaggotHATE (Contributor)

But if #5452 broke something, maybe also mention it there

Unfortunately, the only backends I can test are CLBlast and Vulkan, which is effectively one backend if CLBlast doesn't use ggml-alloc. I'm also on old OS and hardware overall.

Tested CPU-only: no problems with restart.

@userbox020

How do I run the benchmark to find out what t/s my current hardware setup is getting?

@userbox020

@0cc4m, I installed the 9th GPU and my motherboard and ROCm detect it, but when I run vulkaninfo it doesn't show up. What can I do, bro?

Also, I'm getting a warning that was not present before installing the 9th GPU. It's the following:

(base) mruserbox@guruAI:~/Desktop$ vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /usr/lib/x86_64-linux-gnu/libvulkan_virtio.so. Skipping this driver.

I attached my vulkaninfo file; I don't know if it helps figure out what's going on. Can you give me a hand, bro?
vulkaninfo.txt

@cebtenzzre (Collaborator)

@0cc4m This PR broke the Kompute backend (at least, for multi-GPU systems) - probably the "generalize LLAMA_SPLIT_LAYER for all backends" part, since I can get Kompute to work by passing -sm none.

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Feb 21, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@slaren (Collaborator) commented Feb 21, 2024

@cebtenzzre do you have more details? I do not see an obvious issue with the implementation of LLAMA_SPLIT_LAYER.

@slaren (Collaborator) commented Feb 21, 2024

@cebtenzzre Kompute seems to work for me. -sm layer and -sm none result in the same Kompute buffer sizes. I have two GPUs, but llama_get_device_count always returns 1 for Kompute regardless, so it shouldn't make a difference. Maybe what you are observing is that main_gpu is ignored with -sm layer, which is intended.

@userbox020

@0cc4m @slaren how is Kompute related to Vulkan multi-GPU? I did manage to build llama.cpp with Kompute, but I found out it only supports 2 or 3 quant formats, so it's very limited, and I didn't test it with multiple GPUs. At the moment I'm trying to get Vulkan working with more than 8 GPUs.

I found out I had to modify the installed Mesa drivers and some of the build code to remove the 8-GPU restriction. I'll keep trying this weekend; this is a hobby, so I need to find free time for it.

@userbox020

Also, guys, do you have a Discord or something? I would like to join.

@0cc4m (Collaborator Author) commented Feb 23, 2024

@0cc4m @slaren how is Kompute related to Vulkan multi-GPU? I did manage to build llama.cpp with Kompute, but I found out it only supports 2 or 3 quant formats, so it's very limited, and I didn't test it with multiple GPUs. At the moment I'm trying to get Vulkan working with more than 8 GPUs.

I found out I had to modify the installed Mesa drivers and some of the build code to remove the 8-GPU restriction. I'll keep trying this weekend; this is a hobby, so I need to find free time for it.

I don't think anyone can help you with that. Very few people have that many GPUs in one computer, and basically no one has had a reason to use Vulkan on so many devices together.

@userbox020 commented Feb 23, 2024

Yes bro, I realize no one has the knowledge to do that, but I'm working on it and will be happy to share the procedure.
I think I've started to love llama.cpp, but the unofficial Discord channel is full of non-tech people. Do you guys hang out in some group chat?
@0cc4m

@cebtenzzre (Collaborator)

I think I've started to love llama.cpp, but the unofficial Discord channel is full of non-tech people. Do you guys hang out in some group chat?

Feel free to join the GPT4All Discord; I made a #llamacpp channel and Occam is already there. ggerganov isn't interested in real-time communication, so there's no official group chat.

@teleprint-me (Contributor)

@userbox020 I'm very interested in this hobby of yours. I think it's really cool.

@userbox020 commented Feb 26, 2024

@userbox020 I'm very interested in this hobby of yours. I think it's really cool.

Join the GPT4All Discord, bro; we can chat about LLM stuff there.

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Mar 13, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 7, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 8, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 9, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 15, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 18, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 18, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 19, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Sep 26, 2024
Labels: Vulkan (Issues specific to the Vulkan backend)
7 participants