
Basic Vulkan Multi-GPU implementation #5321

Merged: 8 commits, Feb 7, 2024
Conversation

@0cc4m (Collaborator) commented Feb 4, 2024

I'm not fully done; I still need to finish some details and check whether everything allocated gets cleaned up properly, but it already works.

Right now it copies all data between the GPUs through RAM, which is not particularly fast, but I may be able to fix that in the future using Vulkan device groups. That may only work between devices of the same vendor, though; I haven't tried it yet, and the information I found about it is sparse.

This change also cleans up most of the backend's global variables, so it's a step towards allowing the backend to be used multiple times in the same program.

Edit: I forgot to mention that you have to set GGML_VK_VISIBLE_DEVICES to the indices of the devices you want, for example with export GGML_VK_VISIBLE_DEVICES=0,1,2. The indices correspond to the device order in vulkaninfo --summary. By default it will still use only the first device it finds.
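
(For illustration only: a minimal sketch, not the actual ggml-vulkan code, of how a comma-separated GGML_VK_VISIBLE_DEVICES value could be parsed into device indices; the helper name is made up.)

#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: turn "0,1,2" from the environment into a list of device indices.
static std::vector<int> parse_visible_devices() {
    std::vector<int> ids;
    const char * env = std::getenv("GGML_VK_VISIBLE_DEVICES");
    if (env == nullptr) {
        return ids; // unset: the backend falls back to the first device it finds
    }
    std::stringstream ss(env);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (!item.empty()) {
            ids.push_back(std::stoi(item));
        }
    }
    return ids;
}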

Benchmarks

13B q6_k

ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    1962,96 ms /   622 tokens (    3,16 ms per token,   316,87 tokens per second)
llama_print_timings:        eval time =    5036,72 ms /   127 runs   (   39,66 ms per token,    25,21 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
llama_print_timings: prompt eval time =    2788,92 ms /   622 tokens (    4,48 ms per token,   223,03 tokens per second)
llama_print_timings:        eval time =    9158,65 ms /   127 runs   (   72,12 ms per token,    13,87 tokens per second)
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    5597,99 ms /   622 tokens (    9,00 ms per token,   111,11 tokens per second)
llama_print_timings:        eval time =   15391,61 ms /   127 runs   (  121,19 ms per token,     8,25 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    4132,89 ms /   622 tokens (    6,64 ms per token,   150,50 tokens per second)
llama_print_timings:        eval time =   12393,73 ms /   127 runs   (   97,59 ms per token,    10,25 tokens per second)
ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
llama_print_timings: prompt eval time =    4338,94 ms /   622 tokens (    6,98 ms per token,   143,35 tokens per second)
llama_print_timings:        eval time =   14024,83 ms /   127 runs   (  110,43 ms per token,     9,06 tokens per second)

I can now run 70B q4_k_s across those three GPUs:

ggml_vulkan: Using NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
ggml_vulkan: Using AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
ggml_vulkan: Using Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
[...]
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors:        CPU buffer size =   140,62 MiB
llm_load_tensors:     Vulkan buffer size = 17160,81 MiB
llm_load_tensors:     Vulkan buffer size = 11512,00 MiB
llm_load_tensors:     Vulkan buffer size = 10689,80 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:     Vulkan KV buffer size =   280,00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   192,00 MiB
llama_kv_cache_init:     Vulkan KV buffer size =   168,00 MiB
llama_new_context_with_model: KV self size  =  640,00 MiB, K (f16):  320,00 MiB, V (f16):  320,00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size   =    20,01 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model:     Vulkan compute buffer size =   338,80 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =    17,60 MiB
llama_new_context_with_model: graph splits (measure): 7
[...]
llama_print_timings: prompt eval time =   19742,11 ms /   622 tokens (   31,74 ms per token,    31,51 tokens per second)
llama_print_timings:        eval time =   38055,18 ms /   127 runs   (  299,65 ms per token,     3,34 tokens per second)

Move most global variables into backend context
@0cc4m 0cc4m added the Vulkan Issues specific to the Vulkan backend label Feb 4, 2024
@0cc4m 0cc4m marked this pull request as ready for review February 4, 2024 18:12
Review comment on llama.cpp (outdated):
LLAMA_LOG_ERROR("%s: failed to initialize Vulkan backend\n", __func__);
llama_free(ctx);
return nullptr;
for (int device = 0; device < ggml_backend_vk_get_device_count(); ++device) {
@slaren (Collaborator) commented Feb 4, 2024

It's better to avoid initializing the backends that are not used. At least with LLAMA_SPLIT_NONE only the main GPU backend should be initialized.

@0cc4m (Collaborator Author)

That is true, but right now the split-mode has no effect at all on the Vulkan backend (and throws a warning if someone tries to use it). I'm actually initializing the backends even before this point, in ggml_vk_init_cpu_assist(), because I cannot get the properties of the GPUs without initializing them. This isn't optimal.

I suppose I could duplicate the parts of the device initialization code required to read the properties and only do that initially. I'll think about it and update this code tomorrow.

Collaborator

I cannot get the properties of the GPUs without initializing them

In the Kompute backend, the devices are enumerated via ggml_vk_available_devices, which can be called by the user (GPT4All needs this) but is also used by ggml_backend_kompute_buffer_type to get the necessary device properties in advance - this was inspired by the CUDA backend.
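
(As a rough illustration, not llama.cpp or Kompute code: Vulkan physical devices can be enumerated and their properties read without creating logical devices, so a probe along these lines is enough to report device info up front.)

#include <vulkan/vulkan.hpp>
#include <cstdio>

int main() {
    // Creating an instance is cheap; no logical devices, queues or memory are allocated here.
    vk::ApplicationInfo app_info("device-probe", 1, nullptr, 0, VK_API_VERSION_1_2);
    vk::InstanceCreateInfo create_info({}, &app_info);
    vk::Instance instance = vk::createInstance(create_info);

    // Enumerate physical devices and print their names and supported Vulkan API versions.
    for (const vk::PhysicalDevice & dev : instance.enumeratePhysicalDevices()) {
        vk::PhysicalDeviceProperties props = dev.getProperties();
        std::printf("%s | api %u.%u\n", props.deviceName.data(),
                    VK_VERSION_MAJOR(props.apiVersion), VK_VERSION_MINOR(props.apiVersion));
    }

    instance.destroy();
    return 0;
}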

@Ph0rk0z commented Feb 4, 2024

Nets 7.2 t/s on dual 3090 loading 70b.

@slaren (Collaborator) commented Feb 5, 2024

With some tweaks it is possible to use CUDA and Vulkan at the same time, however the hooks that these backends have in ggml.c make this impractical to merge at the moment.

main: build = 2064 (daa6a9c3)
main: built with MSVC 19.29.30151.0 for x64
main: seed  = 1707098852
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA GeForce RTX 3080 | uma: 0 | fp16: 1 | warp size: 32
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from models/7B/ggml-model-q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = models
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.33 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.31 MiB
llm_load_tensors:    Vulkan0 buffer size =   971.30 MiB
llm_load_tensors:      CUDA0 buffer size =  2606.26 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:    Vulkan0 KV buffer size =    64.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   192.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  CUDA_Host input buffer size   =     9.01 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =    66.00 MiB
llama_new_context_with_model:    Vulkan0 compute buffer size =    77.55 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 32, n_keep = 0


 Nala the cat has become quite the social media star, after her owner’s video of her playing fetch with a stuffed dog toy went viral.
The foot
llama_print_timings:        load time =    1501.46 ms
llama_print_timings:      sample time =       2.68 ms /    32 runs   (    0.08 ms per token, 11922.50 tokens per second)
llama_print_timings: prompt eval time =     135.20 ms /     5 tokens (   27.04 ms per token,    36.98 tokens per second)
llama_print_timings:        eval time =     365.80 ms /    31 runs   (   11.80 ms per token,    84.75 tokens per second)
llama_print_timings:       total time =     511.11 ms /    36 tokens

@0cc4m (Collaborator Author) commented Feb 5, 2024

@slaren I added the device info print at the beginning, so I only initialize one backend for CPU offload at that stage.

But I have a problem with cleaning up the device properly. I can only do that once all buffers are freed, but llama.cpp frees the backends before it frees buffers like the kv cache, weights and other data, which would segfault if I actually freed the device beforehand. I'm not sure how to resolve that.

@slaren (Collaborator) commented Feb 5, 2024

You can keep a reference count, increased every time a buffer or backend is initialized and decreased when one is freed, and only free the device when it reaches zero. You could probably use std::shared_ptr to do this; the Metal and Kompute backends do something like that. Generally, it is not good to require calls to custom backend-specific functions for normal usage of a backend.

@cebtenzzre (Collaborator)

But I have a problem with cleaning up the device properly.

The Kompute backend does reference counting like slaren mentioned.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Thanks, that's a good idea. I'll implement that later today.

@0cc4m (Collaborator Author) commented Feb 6, 2024

@slaren I reworked how it handles devices and buffers using smart pointers; in my tests it always freed the buffers and devices at the correct time, without throwing any Vulkan validation issues. Let me know what you think.

@slaren (Collaborator) commented Feb 6, 2024

I don't know enough about the way the vulkan backend is implemented to give any specific advice, but I am surprised by the way weak_ptr seems to be used. The way I would expect this to be implemented is something like this:

// illustrative sketch: `device`, `buffer` and `backend` stand in for the real backend types
static std::shared_ptr<device> get_device(int id) {
    static std::weak_ptr<device> devices[MAX_DEVICES];
    if (devices[id].expired()) {
        devices[id] = std::make_shared<device>(id);
    }
    return devices[id].lock();
}

// buffer
struct buffer_ctx {
    std::shared_ptr<device> dev;
};

void free_buffer(buffer b) {
    delete b->ctx;
}

buffer alloc_buffer(int dev_id, int size) {
    buffer_ctx * ctx = new buffer_ctx;
    ctx->dev = get_device(dev_id);
    return new buffer(ctx);
}

// backend
struct backend_ctx {
    std::shared_ptr<device> dev;
};

void free_backend(backend b) {
    delete b->ctx;
    delete b;
}

backend init_backend(int dev_id) {
    backend_ctx * ctx = new backend_ctx;
    ctx->dev = get_device(dev_id);
    return new backend(ctx);
}

In this way, both the backend and buffer objects hold owning references to the device, and when the last one is freed, the device is automatically freed as well. There is only one global weak_ptr reference to the device, used to initialize new buffers and backends; it allows reusing the same device instance, but will not prevent the device from being freed once the last buffer or backend using it is freed.
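
(A quick usage sketch of that pattern, using the placeholder types from the pseudocode above, so purely illustrative:)

backend be  = init_backend(0);        // device 0 is created, refcount 1
buffer  buf = alloc_buffer(0, 1024);  // same device instance is reused, refcount 2
free_backend(be);                     // refcount 1, device stays alive for the buffer
free_buffer(buf);                     // refcount 0, device is freed automatically here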

Ultimately, though, how to implement this is entirely up to you, but the call to ggml_vk_cleanup_cpu_assist() should not be required by applications using the backend, and it should be removed from llama.cpp.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Ultimately, though, how to implement this is entirely up to you, but the call to ggml_vk_cleanup_cpu_assist() should not be required by applications using the backend, and it should be removed from llama.cpp.

That use of weak_ptr is smarter than what I did, yes.

But I don't see why having an init and a free function is surprising for the CPU matmul offload. The backend system has an init and a free and expects both to be called. The CPU assist initializes a backend, since it requires all of those structures as well, so it should clean that up if the backend system hasn't done so. The backend system doesn't initialize a backend if no layers are offloaded. It looks to me like CUDA just gets around that by hiding most of that boilerplate in the background (like device initialization) and not destroying resources (like the streams). Please correct me if I'm wrong.

I guess at the very least I should rename the function to something clearer, like ggml_vk_free_cpu_assist.

@slaren (Collaborator) commented Feb 6, 2024

Ok, I can see that the automatic offloading of mat muls through the CPU backend complicates this a lot. The goal is to move all of this logic to ggml_backend_sched eventually, but until that's done it's ok to keep the call there, or just never free the resources like the CUDA backend does.

@0cc4m (Collaborator Author) commented Feb 6, 2024

Ok, I can see that the automatic offloading of mat muls through the CPU backend complicates this a lot. The goal is to move all of this logic to ggml_backend_sched eventually, but until that's done it's ok to keep the call there, or just never free the resources like the CUDA backend does.

Yeah, I'm looking forward to when that gets implemented; then I can drop all of the _cpu_assist functions and a bunch of other internal code. I assume you will do that when you get around to it? I can help with the Vulkan parts then.

@0cc4m 0cc4m merged commit ee1628b into master Feb 7, 2024
56 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-multigpu branch February 7, 2024 06:54
@MaggotHATE (Contributor)

Something's happening with RAM here: extra memory gets allocated on reloading (reloading a model isn't a built-in llama.cpp feature, but I doubt it should work this way). Sorry for being late on this; I didn't have time to test this PR earlier.

VRAM is fine, but RAM usage almost doubles (from ~8 GB to ~15 GB) on restart. It only happens once, though.

@0cc4m (Collaborator Author) commented Feb 7, 2024

Something's happening with RAM here: extra memory gets allocated on reloading (reloading a model isn't a built-in llama.cpp feature, but I doubt it should work this way). Sorry for being late on this; I didn't have time to test this PR earlier.

VRAM is fine, but RAM usage almost doubles (from ~8 GB to ~15 GB) on restart. It only happens once, though.

Interesting. Thanks for testing it; maybe I missed some cleanup? I'll try to reproduce it.

@userbox020

Sup guys, does anyone know if Vulkan multi-GPU has a limit on the number of GPUs you can set? I'm stuck at 8; going to keep doing some tests.

@0cc4m (Collaborator Author) commented Feb 11, 2024

Sup guys, does anyone know if Vulkan multi-GPU has a limit on the number of GPUs you can set? I'm stuck at 8; going to keep doing some tests.

It's 16, but that's only an arbitrary number chosen for the size of the arrays holding the relevant data.
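
(For illustration, a hedged sketch of that kind of bound; the constant and type names here are hypothetical, not the actual ggml-vulkan identifiers.)

// Hypothetical: a compile-time constant sizes the per-device state arrays,
// so raising it raises the number of usable GPUs.
#define VK_MAX_DEVICES 16

struct vk_device_state;                              // placeholder per-device state
static vk_device_state * vk_devices[VK_MAX_DEVICES]; // fixed-size array caps device count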

@MaggotHATE (Contributor)

Just reporting on the latest changes: with ggml-alloc v3 it now reads the model from disk again on restart instead of quickly getting it from the cache.

It happens only on Vulkan (both with and without ggml_vk_free_cpu_assist); CLBlast works as usual.

@0cc4m (Collaborator Author) commented Feb 12, 2024

Just reporting on the latest changes: with ggml-alloc v3 it now reads the model from disk again on restart instead of quickly getting it from the cache.

It happens only on Vulkan (both with and without ggml_vk_free_cpu_assist); CLBlast works as usual.

You're really good at catching memory issues. But if #5452 broke something, maybe also mention it there. CLBlast doesn't use ggml-alloc, which is why it's not affected. Does it also happen if you run CPU-only?

@slaren (Collaborator) commented Feb 12, 2024

The only way I can imagine this change could affect backends is that a buffer of size 1 is no longer allocated during the measure step.

@MaggotHATE (Contributor)

But if #5452 broke something, maybe also mention it there

Unfortunately, the only backends I can test are CLBlast and Vulkan, which is effectively one backend if CLBlast doesn't use ggml-alloc. I'm also on old OS and hardware overall.

Tested CPU-only: no problems with restart.

@userbox020

How do I run the benchmark to find out what t/s my current hardware setup is getting?

@userbox020

@0cc4m, I installed the 9th GPU and my motherboard and ROCm detect it, but when I run vulkaninfo it doesn't show up. What can I do, bro?

Also, I'm getting a warning that was not present before installing the 9th GPU. It's the following:

(base) mruserbox@guruAI:~/Desktop$ vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /usr/lib/x86_64-linux-gnu/libvulkan_virtio.so. Skipping this driver.

I attached my vulkaninfo file; I don't know if it helps figure out what's going on. Can you give me a hand, bro?
vulkaninfo.txt

@cebtenzzre (Collaborator)

@0cc4m This PR broke the Kompute backend (at least, for multi-GPU systems) - probably the "generalize LLAMA_SPLIT_LAYER for all backends" part, since I can get Kompute to work by passing -sm none.

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Feb 21, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
@slaren (Collaborator) commented Feb 21, 2024

@cebtenzzre do you have more details? I do not see an obvious issue with the implementation of LLAMA_SPLIT_LAYER.

@slaren (Collaborator) commented Feb 21, 2024

@cebtenzzre Kompute seems to work for me. -sm layer and -sm none result in the same Kompute buffer sizes. I have two GPUs, but llama_get_device_count always returns 1 for Kompute regardless, so it shouldn't make a difference. Maybe what you are observing is that main_gpu is ignored with -sm layer, which is intended.

@userbox020

@0cc4m @slaren how is Kompute related to Vulkan multi-GPU? I did manage to build llama.cpp with Kompute, but I found out it only supports 2 or 3 quant formats, so it's very limited, and I didn't test it with multiple GPUs. At the moment I'm trying to get Vulkan working with more than 8 GPUs.

I found out I had to modify the installed Mesa drivers and some of the build code to remove the 8-GPU restriction. I'll keep trying this weekend; this is a hobby, so I need to find free time for it.

@userbox020

Also, guys, do you have a Discord or something? I would like to join.

@0cc4m (Collaborator Author) commented Feb 23, 2024

@0cc4m @slaren how is Kompute related to Vulkan multi-GPU? I did manage to build llama.cpp with Kompute, but I found out it only supports 2 or 3 quant formats, so it's very limited, and I didn't test it with multiple GPUs. At the moment I'm trying to get Vulkan working with more than 8 GPUs.

I found out I had to modify the installed Mesa drivers and some of the build code to remove the 8-GPU restriction. I'll keep trying this weekend; this is a hobby, so I need to find free time for it.

I don't think anyone can help you with that. Very few people have that many GPUs in one computer, and basically no one has had a reason to use Vulkan on so many devices together.

@userbox020 commented Feb 23, 2024

Yes bro, I realize no one has the knowledge to do that, but I'm working on it and will be happy to share the procedure.
I think I've started to love llama.cpp, but the unofficial Discord channel is full of non-tech people. Do you guys hang out in some group chat?
@0cc4m

@cebtenzzre (Collaborator)

I think I've started to love llama.cpp, but the unofficial Discord channel is full of non-tech people. Do you guys hang out in some group chat?

Feel free to join the GPT4All Discord; I made a #llamacpp channel and Occam is already there. ggerganov isn't interested in real-time communication, so there's no official group chat.

@teleprint-me (Contributor)

@userbox020 I'm very interested in this hobby of yours. I think it's really cool.

@userbox020 commented Feb 26, 2024

@userbox020 I'm very interested in this hobby of yours. I think it's really cool.

Join the GPT4All Discord, bro; we can chat about LLM stuff there.

cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Mar 13, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* Initial Vulkan multi-gpu implementation

Move most global variables into backend context

* Add names to backend device functions

* Add further missing cleanup code

* Reduce code duplication in tensor split layer assignment

* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h

* Only do device info print in the beginning and initialize one backend for cpu assist

Add missing cleanup code

* Rework backend memory management to make sure devices and buffers get properly allocated and freed

* Rename cpu assist free function

---------

Co-authored-by: slaren <slarengh@gmail.com>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 7, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 8, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request May 9, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 15, 2024
Signed-off-by: Jared Van Bortel <jared@nomic.ai>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 18, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 18, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Jul 19, 2024
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Sep 26, 2024
Labels: Vulkan (Issues specific to the Vulkan backend)
7 participants