
Vulkan Implementation #2059
Merged: 156 commits, Jan 28, 2024

Conversation

@0cc4m (Collaborator) commented Jun 30, 2023

I've been working on this for a while. Vulkan requires a lot of boilerplate, but it also gives you a lot of control. The intention is to eventually supersede the OpenCL backend as the primary widely-compatible backend.

I'll try to work together with @niansa on #2039; we probably don't need two Vulkan backends, but we approached our versions from different sides:

@niansa is basing their Kompute version on the Metal implementation, running the full graph on the GPU.
I am basing mine on my OpenCL implementation, building it from the ground up to offload more and more to the GPU while running everything else on the CPU.

Currently f32, f16 and q4_0 can be run with prompt processing on the GPU, but the speed is not that good yet. There is a lot of optimization left to do.

I'm opening this early to get feedback, to let people play with it, and to show the current state of development.

Open points:

  • The matmul kernel uses blocks of size 128x128; it does not have bounds checking yet, so it cannot be used with smaller matrices (see the sketch after this list)
  • The kernel is also not that performant yet
  • Memory management needs improvements
  • Transfers to and from the GPU cannot be used with Semaphores or Fences yet
  • The CPU memcpy parts of the transfer probably need to be multithreaded
  • DMMV kernels aren't implemented yet
  • Some Vulkan objects get allocated, but not deallocated
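
To illustrate the first point: a kernel hard-wired to 128x128 tiles either needs the matrix dimensions padded to a tile multiple or per-element guards, otherwise the threads working on the last partial tile read and write out of bounds. Here is a minimal CPU-side reference sketch of what such guards look like (illustrative only, not the actual GLSL shader; all names are made up):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Reference tiled matmul: C (MxN) += A (MxK) * B (KxN), row-major, C zero-initialized.
// TILE mirrors the fixed 128x128 block size of the shader; the std::min clamps
// are exactly the bounds checks the GPU kernel is still missing.
constexpr std::size_t TILE = 128;

void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
            for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
                // Clamp the tile to the matrix edge; without this, only
                // dimensions that are multiples of 128 work correctly.
                const std::size_t i_end = std::min(i0 + TILE, M);
                const std::size_t j_end = std::min(j0 + TILE, N);
                const std::size_t k_end = std::min(k0 + TILE, K);
                for (std::size_t i = i0; i < i_end; ++i)
                    for (std::size_t k = k0; k < k_end; ++k)
                        for (std::size_t j = j0; j < j_end; ++j)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
            }
        }
    }
}
```

On the GPU the same idea is usually implemented as either zero-padding the tiles on load or predicating the stores, at the cost of a few extra instructions per element.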

@slaren (Collaborator) commented Jul 1, 2023

> @niansa is basing their Kompute version on the Metal implementation, running the full graph on the GPU.
> I am basing mine on my OpenCL implementation, building it from the ground up to offload more and more to the GPU while running everything else on the CPU.

About this: splitting the computation between the CPU and the GPU can also be achieved with a Metal-like implementation that runs entire graphs at a time; it is just a matter of splitting the graph into multiple parts and copying the output of each fragment to the input of the next. I think that this will be the cleanest solution in the long run: it will simplify the implementation of the backends, and eventually it will allow us to have a common interface to them. This would make it possible to mix different backends, not just CPU+CUDA or CPU+Vulkan. For example, if you have an NVIDIA and an AMD GPU, you could run some layers with CUDA and the rest with Vulkan, possibly with some parts on the CPU as well. My goal currently is to adapt the CUDA implementation to work in this way.
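
A rough sketch of that idea, with hypothetical types and function names (this is not the actual ggml API): the graph is cut wherever the assigned backend changes, each segment runs on its backend, and each segment's outputs are copied into the memory of the backend that runs the next segment.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration of per-segment graph execution across backends.
struct Tensor;          // placeholder for a graph tensor
struct Backend {        // placeholder for a compute backend (CPU, CUDA, Vulkan, ...)
    virtual void compute(const std::vector<Tensor*>& nodes) = 0;
    virtual void copy_in(Tensor* t) = 0;   // upload a tensor into this backend's memory
    virtual ~Backend() = default;
};

struct Segment {
    Backend*             backend;  // where this part of the graph runs
    std::vector<Tensor*> nodes;    // ops assigned to that backend, in order
    std::vector<Tensor*> outputs;  // tensors consumed by the next segment
};

// Run the whole graph by executing one segment at a time and handing its
// outputs to the backend of the following segment.
void run_split_graph(std::vector<Segment>& segments) {
    for (std::size_t s = 0; s < segments.size(); ++s) {
        segments[s].backend->compute(segments[s].nodes);
        if (s + 1 < segments.size()) {
            for (Tensor* out : segments[s].outputs) {
                segments[s + 1].backend->copy_in(out);  // bridge between backends
            }
        }
    }
}
```

Mixing backends, e.g. CUDA for some layers and Vulkan for the rest, then only means assigning segments to different Backend instances.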

@0cc4m (Collaborator, Author) commented Jul 1, 2023

Yes, that's usually how it's done for PyTorch transformers inference. I suppose that would be interesting, but what would be the advantage over running it on Nvidia and AMD GPUs together using a vendor-neutral library?

@slaren (Collaborator) commented Jul 1, 2023

The advantage is better performance on NVIDIA GPUs.

@0cc4m (Collaborator, Author) commented Jul 1, 2023

I'm not convinced that is universally true. OpenCL has its limitations (some artificial ones introduced by Nvidia too, like the lack of FP16 support) that prevent it from matching CUDA, but Vulkan, for example, is very well supported by Nvidia; it's just harder to write. We'll see where it goes.

@AlphaAtlas commented Jul 1, 2023

> The advantage is better performance on NVIDIA GPUs.

As a random point of reference, mlc-llm's 7B llama Vulkan implementation is faster than GPTQ CUDA on my mobile 2060. It's apples to oranges, but still evidence that Nvidia Vulkan can be performant.

Also, a big desire of mine (and I'm sure many others') is to offload to iGPUs in a way that's more performant than pure CPU. Is that in the scope of this PR, or is it mostly targeting dGPUs?

@0cc4m (Collaborator, Author) commented Jul 1, 2023

> Also, a big desire of mine (and I'm sure many others') is to offload to iGPUs in a way that's more performant than pure CPU. Is that in the scope of this PR, or is it mostly targeting dGPUs?

Probably not this PR; it's hard enough to get the basics set up. But Vulkan gives you a lot of freedom with memory management, so optimizing for iGPUs can certainly happen in the future.
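
As an aside on that freedom: on iGPUs most memory is typically both device-local and host-visible, so a backend can pick a memory type that skips staging copies entirely. A hedged sketch of such a selection using the standard Vulkan memory-type query (no llama.cpp specifics assumed):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Find a memory type that is both device-local and host-visible, which on
// integrated GPUs usually covers (almost) all memory and lets the host write
// tensor data directly, without a separate staging buffer.
// Returns UINT32_MAX if no such type exists (common on discrete GPUs, where
// only a small BAR region qualifies).
uint32_t find_unified_memory_type(VkPhysicalDevice device, uint32_t type_bits) {
    VkPhysicalDeviceMemoryProperties mem = {};
    vkGetPhysicalDeviceMemoryProperties(device, &mem);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

    for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
        const bool allowed  = (type_bits & (1u << i)) != 0;
        const bool suitable = (mem.memoryTypes[i].propertyFlags & wanted) == wanted;
        if (allowed && suitable) {
            return i;
        }
    }
    return UINT32_MAX;
}
```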

@0cc4m (Collaborator, Author) commented Jan 28, 2024

> This works really great with all layers offloaded to GPU, but the CPU performance isn't good compared to BLAS backends. Is it possible to build with both openBLAS and vulkan enabled to speed up the CPU part of inference?

No, at the moment that's not possible. The layers are either on the CPU or the GPU, and swapping them around is costly. OpenBLAS needs them on the CPU, Vulkan needs them on the GPU. But I have plans to improve Vulkan prompt processing performance at a later point.

> Reporting a subtle bug on Phi for Vulkan - when all layers are offloaded it is substantially more incoherent than normal, almost like it's ignoring the prompt. This is not a placebo - simple continuations like "Hi, my name is" will fail e.g. Hi, my name is topic that can be. This does not happen when no layers are offloaded.

> I was about to report the same thing. dolphin-phi-2 keeps going on about dogs for me. It even started generating json for the "top 10 dogs", ran out of tokens halfway through, and when resumed, forgot it was generating json and instead began giving veterinarian advice for dogs. All I said was hello. Performance was good though!

Yeah, the continue op is not behaving as it should. I'll fix it when I get to it.

> So I'm running cmake -B build -DLLAMA_VULKAN=1 and cmake --build . --config Release -j8 but all I'm getting is this. Tried on my pc and on a rpi5, I get the same result on both. Am I doing something wrong?

I get reports of this every now and then, but I haven't been able to reproduce it. Can you build with make? Maybe it's cmake-specific.

@0cc4m (Collaborator, Author) commented Jan 28, 2024

> One quick question: Could I change the warp size from 64 to 32? Thanks.

Not at the moment, but there's a Vulkan extension for that. I'll try it sometime, maybe. 64 is the default AMD gives you; there's probably a reason for that.
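
The extension in question is presumably VK_EXT_subgroup_size_control. A minimal sketch of how a required subgroup size of 32 could be requested at compute pipeline creation, assuming the device enables the extension and reports a min/max range that includes 32:

```cpp
#include <vulkan/vulkan.h>

// Request a specific subgroup ("warp") size for a compute shader stage.
// Only valid if VK_EXT_subgroup_size_control is enabled and 32 lies within
// the device's reported minSubgroupSize..maxSubgroupSize range.
VkPipelineShaderStageCreateInfo make_stage_with_subgroup_size(
        VkShaderModule module,
        const VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT* subgroup_info) {
    VkPipelineShaderStageCreateInfo stage = {};
    stage.sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage.pNext  = subgroup_info;                 // chain the required-size struct
    stage.stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage.module = module;
    stage.pName  = "main";
    return stage;
}

// Usage sketch:
// VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT req = {};
// req.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
// req.requiredSubgroupSize = 32;
// VkPipelineShaderStageCreateInfo stage = make_stage_with_subgroup_size(shader_module, &req);
```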

@teleprint-me (Contributor) commented Jan 28, 2024

What is "warp" in this context? I have no idea what it means.

@0cc4m (Collaborator, Author) commented Jan 28, 2024

> What is "warp" in this context? I have no idea what it means.

Basically, it's the number of threads that execute the same instruction together at the same time on a GPU. On Nvidia it's 32 at a time, on AMD GCN it's 64, and AMD RDNA can do either 32 or 64. Here's a more detailed explanation.
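
For reference, the value the backend prints as the warp size corresponds to the Vulkan subgroup size, which can be queried roughly like this (sketch, error handling omitted):

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

// Query the subgroup ("warp"/"wavefront") size of a physical device.
uint32_t query_subgroup_size(VkPhysicalDevice device) {
    VkPhysicalDeviceSubgroupProperties subgroup = {};
    subgroup.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props = {};
    props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props.pNext = &subgroup;

    vkGetPhysicalDeviceProperties2(device, &props);
    std::printf("subgroup size: %u\n", subgroup.subgroupSize);
    return subgroup.subgroupSize;  // 32 on Nvidia, 64 on AMD GCN, 32 or 64 on RDNA
}
```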

@ggerganov (Owner)

@0cc4m Need to apply this patch to fix the SYCL build:

diff --git a/ggml-sycl.cpp b/ggml-sycl.cpp
index 9764d9c3..3fc34697 100644
--- a/ggml-sycl.cpp
+++ b/ggml-sycl.cpp
@@ -14781,6 +14781,7 @@ static ggml_backend_buffer_type_i ggml_backend_sycl_buffer_type_interface = {
     /* .get_name         = */ ggml_backend_sycl_buffer_type_name,
     /* .alloc_buffer     = */ ggml_backend_sycl_buffer_type_alloc_buffer,
     /* .get_alignment    = */ ggml_backend_sycl_buffer_type_get_alignment,
+    /* .get_max_size     = */ NULL, // TODO: return device.maxBufferLength
     /* .get_alloc_size   = */ ggml_backend_sycl_buffer_type_get_alloc_size,
     /* .supports_backend = */ ggml_backend_sycl_buffer_type_supports_backend,
     /* .is_host          = */ nullptr,
@@ -14844,6 +14845,7 @@ ggml_backend_buffer_type_t ggml_backend_sycl_host_buffer_type() {
             /* .get_name         = */ ggml_backend_sycl_host_buffer_type_name,
             /* .alloc_buffer     = */ ggml_backend_sycl_host_buffer_type_alloc_buffer,
             /* .get_alignment    = */ ggml_backend_cpu_buffer_type()->iface.get_alignment,
+            /* .get_max_size     = */ NULL, // TODO: return device.maxBufferLength
             /* .get_alloc_size   = */ ggml_backend_cpu_buffer_type()->iface.get_alloc_size,
             /* .supports_backend = */ ggml_backend_cpu_buffer_type()->iface.supports_backend,
             /* .is_host          = */ ggml_backend_cpu_buffer_type()->iface.is_host,

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@0cc4m (Collaborator, Author) commented Jan 28, 2024

@ggerganov Done.

@netrunnereve (Contributor)

So I'm getting an extra 15% increase in prompt processing speed with warp size 64 on my GCN2 card, with no change in inference speed.

> This works really great with all layers offloaded to GPU, but the CPU performance isn't good compared to BLAS backends. Is it possible to build with both openBLAS and vulkan enabled to speed up the CPU part of inference?

What's the point of doing that? OpenBLAS is really only used for prompt processing and with the Vulkan backend enabled all prompt processing is done on the GPU regardless of how many layers you offload. For actual text generation OpenBLAS is not used since it has a lot of overhead (you have to feed it a lot of tokens to make it worthwhile).

@0cc4m (Collaborator, Author) commented Jan 28, 2024

> So I'm getting an extra 15% increase in prompt processing speed with warp size 64 on my GCN2 card, with no change in inference speed.

You mean speed increased after my GCN optimization or something else?

> What's the point of doing that? OpenBLAS is really only used for prompt processing and with the Vulkan backend enabled all prompt processing is done on the GPU regardless of how many layers you offload.

Only the large matrix multiplications are done by Vulkan on CPU layers, the rest is still done by the CPU.
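
In other words, for layers that stay on the CPU the backend only intercepts the large matrix multiplications. Conceptually it is something like the following sketch; the helper names and the threshold are made up for illustration and are not the actual llama.cpp logic:

```cpp
// Hypothetical routing logic for a single mul_mat op on a CPU-resident layer:
// only batched/prompt-sized multiplications are worth the transfer overhead,
// everything else stays on the CPU.
struct MatMulDims { int rows_a, cols_a, cols_b; };  // cols_b = batch/token count

bool should_offload_matmul(const MatMulDims& d) {
    const int min_batch = 32;      // made-up threshold for illustration
    return d.cols_b >= min_batch;  // single-token generation stays on the CPU
}

void mul_mat(const MatMulDims& d /*, tensors... */) {
    if (should_offload_matmul(d)) {
        // upload weights/activations, run the Vulkan matmul shader, download result
    } else {
        // plain CPU matmul path
    }
}
```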

ggerganov merged commit 2307523 into ggerganov:master on Jan 28, 2024
46 of 49 checks passed
@netrunnereve (Contributor)

> You mean speed increased after my GCN optimization or something else?

Yep it did; with the optimization, prompt processing runs at 102 t/s with Mistral Q6_K versus 90 t/s with the previous commit.

> Only the large matrix multiplications are done by Vulkan on CPU layers, the rest is still done by the CPU.

Ah, I didn't know that. When I saw the prompt processing speed drop with partial offloading, I always assumed it was due to all the data transfers between the GPU and CPU.

Oh and congrats on the merge @0cc4m! 🥳

@sorasoras

ggml_vulkan: Using AMD Radeon RX 7900 XTX | fp16: 1 | warp size: 64
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | Vulkan     | 100 | pp 512     |   608.03 ± 54.57 |
| qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | Vulkan     | 100 | tg 128     |     51.78 ± 0.06 |

build: 35dec26c (1998)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
| qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | ROCm       | 100 | pp 512     |   1675.50 ± 5.04 |
| qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | ROCm       | 100 | tg 128     |     62.60 ± 0.27 |

build: 35dec26c (1998)

Vulkan and ROCm are surprisingly close when it comes to token generation.

@MoffKalast

Continuing my Pi 5 tests with the drivers now seemingly working, I only get the following line when initializing, which then hangs with one CPU thread pinned to max:

ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16

I'm testing with a Q4_K_S model, so the missing fp16 support shouldn't be an issue, I hope? I'm told that the driver may be missing some features; maybe that's more of a problem.

@VelvetyWhite commented Jan 28, 2024

> Continuing my Pi 5 tests with the drivers now seemingly working, I only get the following line when initializing, which then hangs with one CPU thread pinned to max:
>
> ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16
>
> I'm testing with a Q4_K_S model, so the missing fp16 support shouldn't be an issue, I hope? I'm told that the driver may be missing some features; maybe that's more of a problem.

Yeah, exactly the same issue here.
Update: I got an out-of-memory error in the end with 0 layers offloaded to the GPU.

@XdpAreKid

> ggml_vulkan: Using AMD Radeon RX 7900 XTX | fp16: 1 | warp size: 64
> | model                          |       size |     params | backend    | ngl | test       |              t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
> | qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | Vulkan     | 100 | pp 512     |   608.03 ± 54.57 |
> | qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | Vulkan     | 100 | tg 128     |     51.78 ± 0.06 |
>
> build: 35dec26c (1998)
> ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
> ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
> ggml_init_cublas: found 1 ROCm devices:
>   Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
> | model                          |       size |     params | backend    | ngl | test       |              t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
> | qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | ROCm       | 100 | pp 512     |   1675.50 ± 5.04 |
> | qwen 13B Q4_K - Medium         |   8.79 GiB |    14.17 B | ROCm       | 100 | tg 128     |     62.60 ± 0.27 |
>
> build: 35dec26c (1998)
>
> Vulkan and ROCm are surprisingly close when it comes to token generation.

Can you get the expected results from the Qwen model with Vulkan backend inference? The results on my GPU are incorrect.

@teleprint-me (Contributor) commented Jan 29, 2024

@0cc4m @netrunnereve

I haven't really tested the Vulkan build in depth since last week. I wanted to experiment with it a bit tonight. I usually prefer Mistral.

It took down my entire desktop environment while I was experimenting with the server.

Build:

make LLAMA_VULKAN=1

Command:

# Edit: I'm fairly certain it was q8_0 that crashed for me.
# I had to create a new f16 afterwards, so it wasn't the one used when crashed.
./server -m local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-q8_0.gguf --n-gpu-layers 16

Results:

My cousin's RTX 4060 Ti always generated complete gibberish. Using make LLAMA_CUBLAS resolved the issue.

Mixed results with my RX 580. Using make LLAMA_OPENBLAS resolved any intermittent issues.

I was using Continue as a front-end client in Visual Studio Code when it crashed and took down the display server.

journalctl -xb
Jan 29 00:10:57 spectra systemd[1]: Started Process Core Dump (PID 1343945/UID 0).
░░ Subject: A start job for unit systemd-coredump@3-1343945-0.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit systemd-coredump@3-1343945-0.service has finished successfully.
░░ 
░░ The job identifier is 13918.
Jan 29 00:10:57 spectra systemd-coredump[1343917]: [🡕] Process 1337387 (server) of user 1000 dumped core.
                                                   
                                                   Stack trace of thread 1337387:
                                                   #0  0x00007687f8a8783c n/a (libc.so.6 + 0x8e83c)
                                                   #1  0x00007687f8a37668 raise (libc.so.6 + 0x3e668)
                                                   #2  0x00007687f8a1f4b8 abort (libc.so.6 + 0x264b8)
                                                   #3  0x00007687f8c9ca6f _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6 + 0x9ca6f)
                                                   #4  0x00007687f8cb011c _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xb011c)
                                                   #5  0x00007687f8cb0189 _ZSt9terminatev (libstdc++.so.6 + 0xb0189)
                                                   #6  0x00007687f8cb03ed __cxa_throw (libstdc++.so.6 + 0xb03ed)
                                                   #7  0x00006113751ff77a ggml_vk_compute_forward.cold (server + 0x3a77a)
                                                   #8  0x00006113753a918b _ZL29ggml_backend_vk_graph_computeP12ggml_backendP11ggml_cgraph (server + 0x1e418b)
                                                   #9  0x00006113753beffa ggml_backend_sched_graph_compute (server + 0x1f9ffa)
                                                   #10 0x0000611375335ae1 _ZL21llama_decode_internalR13llama_context11llama_batch (server + 0x170ae1)
                                                   #11 0x0000611375336811 llama_decode (server + 0x171811)
                                                   #12 0x0000611375280844 _ZN20llama_server_context12update_slotsEv.isra.0 (server + 0xbb844)
                                                   #13 0x0000611375276d6f _ZN18llama_server_queue10start_loopEv (server + 0xb1d6f)
                                                   #14 0x00006113752121c2 main (server + 0x4d1c2)
                                                   #15 0x00007687f8a20cd0 n/a (libc.so.6 + 0x27cd0)
                                                   #16 0x00007687f8a20d8a __libc_start_main (libc.so.6 + 0x27d8a)
                                                   #17 0x0000611375218d35 _start (server + 0x53d35)
                                                   
                                                   Stack trace of thread 1337388:
                                                   #0  0x00007687f8a824ae n/a (libc.so.6 + 0x894ae)
                                                   #1  0x00007687f8a84d40 pthread_cond_wait (libc.so.6 + 0x8bd40)
                                                   #2  0x00007687f803792c n/a (libvulkan_radeon.so + 0x23792c)
                                                   #3  0x00007687f80484bc n/a (libvulkan_radeon.so + 0x2484bc)
                                                   #4  0x00007687f8a859eb n/a (libc.so.6 + 0x8c9eb)
                                                   #5  0x00007687f8b097cc n/a (libc.so.6 + 0x1107cc)

                                                   [Stack traces of the remaining threads omitted: the httplib ThreadPool workers and the listen thread are all idle in pthread_cond_wait/accept, plus one more libvulkan_radeon thread with the same trace as above.]
                                                   ELF object binary architecture: AMD x86-64

This continues until D-Bus is hit, stops, and then restarts.

[Screenshot of output right before the crash: Screenshot from 2024-01-29 01-16-43]

Might be related to #5179.

I was able to reproduce garbled output, which seems to be a mix of Dutch, Russian, Mandarin (Kanji?), English, and some other languages.

Output:
./main -m local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf --color -e -s 1337 -c 8192 -n 1024 --n-gpu-layers 16 -p "<<SYS>> My name is Mistral. I am an advanced LLM (Large Language Model). I am intelligent, creative, and helpful. <</SYS>>\n" --interactive --interactive-first --multiline-input --in-prefix "[INST] " --in-suffix " [/INST]\n" 
Log start
main: build = 1999 (d2f650c)
main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
main: seed  = 1337
ggml_vulkan: Using AMD Radeon RX 580 Series (RADV POLARIS10) | fp16: 0 | warp size: 64
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from local/models/mistralai/Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
# omitting output for brevity
<<SYS>> My name is Mistral. I am an advanced LLM (Large Language Model). I am intelligent, creative, and helpful. <</SYS>>
[INST] Hello! My name is Austin. What is your name?\ 
 [/INST]
 Hiazu feeding obst CraftfterspeedfterGBT peacealskemoint erh Mystasmpyx trailingazuftergens GenerationusquamasGeneration sidPOSiernoennesched epidighterserme stomach shoe outbreakGenerationdu❍ Ott Mondöl studyinglingRSген behalfnico studyingSlrub Pubimentoligostafteramma Tru Caribmers Indust referencekéGBTдяocaladudorfnia else Judge Victorplementsorflem Officerû practicegens gutofs /******/chorermeansepyx deploy tact Perm论rottermeékACHEammaazuCREFiaz

This goes on for a while.

It's fine with q4_0 and q8_0 (sometimes q8_0 behaves "oddly").

@teleprint-me (Contributor) commented Jan 29, 2024

After further testing, most of the issues seem to revolve around 16-bit, not 8-bit or below. This was also true for the RTX 4060 Ti.

I am observing somewhat inconsistent and difficult-to-reproduce issues with 8-bit models. So far I've tested Llama-2 7B Chat, CodeLlama 7B Instruct, and Mistral 7B Instruct v0.2.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* Vulkan loader code

* Fix matmul kernel, continue implementation

* Continue implementation

* Vulkan memory management

* Vulkan development

* Matmul call

* Add aligned malloc and free for VMA

* Continue implementation

* First matmul success

* GEMM Kernel optimization

* 1D Blocktiling

* 2D Blocktiling

* Write coalescing

* Continue vulkan implementation and optimization

* First FP16 attempt, disabled for now

* Code abstraction, FP16 implementation, fix kernel, add FP16 to FP32 kernel

* Enable device extensions properly, restore fp16 matmul op

* Fix mulmat_f16

* Output FP32 in fp16 matmul shader

* Fix f16_to_f32 kernel

* dequant_q4_0 kernel

* Add VMA library

* Avoid requesting dedicated memory, VMA can decide that by itself

* Add bounds checking to matmul kernels, improve implementation, fix command buffers not freed properly

* add cmake commands

* Add 2d write operation, profiling code

* Fix 2d write

* Fix queue selection for AMD RADV

* Fix trailing whitespace in vk_mem_alloc.h

* Add WIP warp tile mat mul shaders

* Disable glslc optimization

* Disable glslc optimization for CMake

* Optimize warptile matmul shader, replace blocktile with it

* Add split-k optimization for small matrix multiplication

Use semaphores for synchronization instead of fences or waitidle

Rework async write/read for synchronization

* Fix validation errors, improve compatibility with AMD GPUs

* Rework command buffer handling

* Variable matmul kernel using specialization constants

* Fix synchronization on AMD, add barriers for buffer ownership transfer, add debug flag and prints

* Reuse semaphores

* Handle stage flags during command buffer submission properly

* Increase matmul test runs for consistent results

* Fix F32 matmul

* Add vectorized loading and zeropadding for matrix multiplication

* Use pinned memory for f16 preprocessing

* Don't force aligned matmul

* Don't free before queue done

* Replace VMA library with native Vulkan buffer management

* Basic offloading support with mul_f32 and dmmv for q4_0

* Run glslc commands in parallel

* Unroll loops in dmmv shader

* Reduce usage of waitIdle

* Reuse pinned allocation for f16 conversion

* Handle devices with only a single queue

* Fix trailing whitespace in CMakeLists.txt

* Allow parallel execution of kernels, parallelize third and fourth dimension calls

* Add fallback for devices only supporting one DescriptorSet per DescriptorPool

* Move to graph function similar to CUDA implementation

* Use F16 kernel for most things, replace q_f32 with mul_mat_q_f16 function

* Add F32 dmmv shaders

* Batch submissions

* Add .spv to gitignore

* Split off matrix vector multiplication for separate optimization

* Use single command buffer for matrix vector multiplication ops

* Reduce overhead of mul_f32 calls by using a single command buffer

* Add submission batching to mul_f32

* Fix tests

* Add missing barrier

* Add further missing barrier

* Add further ops

* Replace vk::QueueFamilyIgnored with VK_QUEUE_FAMILY_IGNORED to support more Vulkan header versions

* Remove unnecessary cblas link

* Fix descriptor set pre-allocation assert

* Add runtime shader compilation, start transferring shaders to this approach

* Transfer remaining shaders to header and compile on runtime

* Fix fp32 fallback if device doesn't support fp16, add force disable env var GGML_VULKAN_DISABLE_F16

* Add support for q4_1, q5_0, q5_1 and q8_0

* Remove unnecessary scalar layout extension

* Parse graph early to pre-record command buffers

* Add q6_k support

* Add multi-submit for command buffers

* Fix q6_k dequant shader for AMD

* Fix q6_k for GPUs without fp16 support

* Simplify q6_k fp16 fix

* Minor fixes

* Fix wg_denom of m-mulmat shaders

* Add Python-based Vulkan shader generator

* Replace shaderc dependency with precompiled shaders

Fix python script to generate shaders

* Clean up code

* Fix shader generator script Windows compatibility

Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

* Close file before deletion

* Fix vulkan shader fp32 name

* Add q2_k and q3_k support

Add validation check to compare shader results to cpu results

* Add q4_k support

* Add q5_k support

* Bake SPIR-V bytecode into the library instead of loading shaders from file

* Switch to signal semaphores for flexibility

Prepare broadcasting support for mul mat

* Finish broadcasting mul mat support for GQA

* Clean up unused functions

Add repeat op

* Add further ops, not yet enabled. Improve semaphore code

* Reduce number of used semaphores by utilizing timelines more properly

* Remove queue information

* Reuse timeline semaphores, allow parallel operation with binary semaphores to work around nvidia driver limitations

* Add Vulkan to llama-bench

* Remove cblas dependency

* Fix matmul k-split bug

* Fix q4_k dmmv K_QUANTS_PER_ITERATION 1 shader

* Add RMS Norm shader, rework op_f32 shader setup, fix matmul bug

* Fix issues with float16 overflows in shaders

* Fix issues with older Vulkan headers on Ubuntu 22.04

* Allow multi-op partial offloading by parsing the graph to preallocate enough between-op buffers

* Implement further ops, rework op_f32 calls, fix bugs

* Finish full offloading support, add last remaining ops, fix bugs, remove redundant code

* Upload generated file ggml-vulkan-shaders.hpp, remove redundant shaders

* Merge upstream changes, fix conflicts, adapt soft_max op

* Fix Python and shader header format

* Free model gpu buffers on exit

* Use single queue per device to simplify code

* Add matmul shader support for running multiple calculations in parallel

* Switch from semaphore-synchronized multiple command buffers per op to single command buffer for multiple ops, whole graph if possible

* Fix missing event cast

* Replace uint64_t(-1) with UINT64_MAX, rename function for clarity

* Fix warning about empty C function parameters

* Fix compiler warnings

* Properly implement Vulkan backend buffer handling

* Fix oversized host staging buffers

* Simplify barrier synchronization calls

* Fix gcc warnings

* Implement max_size for backend buffer types to limit the size of a single allocation

* Use min of maxMemoryAllocationSize and maxBufferSize for device max allocation size

* refactor multi buf

* Disable unsupported ops to fix tests

* Check for maintenance4 support before using it

* Handle devices with only a single queue

* Fix single queue logic

* propagate buffer usage in multi buffers

* Implement rope_neox op

* Cleanup header and other files

* Simplify gpu_extras by removing events and putting staging memcpys into contexts

* Move queue into context

Add not-yet-enabled async backend ops

* Simplify context use, optimize matmul shader for warp size 64 (AMD GCN), fix split_k matmul shader optimization

* Add get_max_size to SYCL backend.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* llama : fix trailing whitespace

---------

Co-authored-by: Henri Vasserman <henv@hot.ee>
Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
Labels: high priority (Very important issue), need feedback (Testing and feedback with results are needed)