Vulkan implementation (via Kompute) #2039

Closed · wants to merge 43 commits
Conversation

niansa (Contributor) commented Jun 28, 2023

Hey!

This is an attempt at a Vulkan implementation in GLSL via Kompute. Kompute is used instead of raw Vulkan to avoid thousands of lines of hard-to-maintain boilerplate code.

I mostly base the code on the current Metal implementation and approach.

Already implemented

  • ADD (Untested)
  • MUL (Untested)
  • SCALE (Test passed) ✅
  • SILU (Test passed) ✅
  • RELU (Test passed) ✅
  • GELU (Test passed) ✅
  • SOFT_MAX (Test failed) ❌
  • DIAG_MASK_INF (likely broken)
  • MUL_MAT (still needs fallback for small sizes)
    • F16 (Unused)
    • Q4_0 (Test failed) ❌
    • Q4_1 (Test passed) ✅
  • GET_ROW
    • F16 (Unused)
    • Q4_0 (Test passed) ✅
    • Q4_1 (Test failed) ❌
  • NORM (Unused)
  • RMS_NORM (Test passed) ✅
  • ROPE (Test passed) ✅
  • CPY (Test passed) ✅

TODO things before merge

  • Remove Kompute submodule (expect it from system!)
  • Clean up
  • Fix Vulkan validation errors
  • Fix conflicts

AlphaAtlas commented Jun 29, 2023

👀

Other than reaching more platforms, what advantages would this Vulkan implementation have over OpenCL? Better performance? Better support for IGPs? Support for the matmul acceleration in Arc/RDNA3?

SlyEcho (Sponsor, Collaborator) commented Jun 29, 2023

OpenCL and OpenGL are basically deprecated and Vulkan is the replacement. Better driver support in newer GPUs etc.

KerfuffleV2 (Collaborator) commented

In case it helps anyone build this:

Make sure you set up the Kompute submodule:

  1. git submodule init
  2. git submodule update

There currently isn't support for building with the Makefile; you need to use a CMake build and add -DLLAMA_KOMPUTE. (You can also add something like -DKOMPUTE_OPT_LOG_LEVEL=Warn to turn down the Kompute logging.)

I wasn't able to compile Kompute because of compiler warnings being treated as errors, so it was necessary to edit kompute/CMakeLists.txt (after updating the kompute submodule) and remove -Werror from the build flags. A combined sketch of the build steps is below.
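Put together, the whole sequence looks roughly like this (a sketch only: the `=1` value for -DLLAMA_KOMPUTE and the `build` directory name are assumptions on my part, not something I verified):

```sh
# fetch the Kompute submodule
git submodule init
git submodule update

# configure a CMake build with the Kompute backend enabled and quieter Kompute logging
# (the =1 value is an assumption; the point is just to define LLAMA_KOMPUTE)
cmake -B build -DLLAMA_KOMPUTE=1 -DKOMPUTE_OPT_LOG_LEVEL=Warn
cmake --build build --config Release
```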

Right now it seems like it just skips any operations that aren't implemented for the GPU, so you can't run inference on a model yet. Fun fact: it turns out evaluating models is really fast if you skip all the matrix multiplication operations!

0cc4m mentioned this pull request Jun 30, 2023
ggerganov (Owner) commented

> Other than reaching more platforms, what advantages would this Vulkan implementation have over OpenCL? Better performance? Better support for IGPs? Support for the matmul acceleration in Arc/RDNA3?

In addition to what @SlyEcho answered, my view is that since we now have a relatively good way to implement decoupled GPU backends (CUDA, Metal, OpenCL, etc.), there is no good reason not to keep doing it. As long as the new implementations do not touch the core ggml API and don't break the general workflow of other backends, we can keep adding support for new backends. Developers can join and help maintain the respective implementations. We can even have multiple implementations for a given single backend since we are still learning and finding new ways to make things more optimal and we don't know yet what is "the best way".

In the long run, we can eventually obsolete and deprecate certain implementations, but for now the main goal is to experiment and explore the most efficient ways for hardware acceleration. Supporting more backends also helps to find certain good patterns in the implementations and even though currently there is a lot of "copy-paste", at some point we can think about the best way to consolidate the code and reuse the best techniques that we have found across all architectures.

Firstbober commented

I can't get either this PR or #2059 to work. I wanted to run a perplexity test with wiki.test.raw as recommended in the README, but all I get is a GGML_ASSERT at ggml-vulkan.cpp:142: res != ctx->tensors.end(). I also tested plain inference, but it just runs out of memory on my RAM (16 GiB physical + 16 GiB swap).

In the case of the second implementation, I can't even get as far as filling my RAM, as it just crashes with ggml_vulkan: vk_transfer_queue_family_index invalid and I can't do anything.

I am using an RX 580 (RADV driver, Polaris 10 architecture). On a side note, OpenCL works, but prompt ingestion is painfully slow.

0cc4m (Collaborator) commented Jul 2, 2023

@Firstbober AMD GPUs have fewer queues, it seems. The initialization code of my implementation didn't work with that yet. Can you try again? Don't expect any good performance yet, though.

Firstbober commented

> @Firstbober AMD GPUs have fewer queues, it seems. The initialization code of my implementation didn't work with that yet. Can you try again? Don't expect any good performance yet, though.

It did work this time after your patch! I needed to disable the validation layer, as it produced too many errors. Two major ones were VUID-vkCmdDispatch-groupCountX-00386 (ERROR / SPEC) and VUID-VkDeviceCreateInfo-queueFamilyIndex-02802 (ERROR / SPEC).

After making the output readable, I can clearly see my GPU working and ingesting the prompt. I wasn't able to test the perplexity of wiki.test.raw as it just OOMed me (vk::Device::allocateCommandBuffers: ErrorOutOfHostMemory), but I only have 4 GiB of VRAM, so it's pretty understandable.

It seems to be very close to OpenCL. I ran the test with ./main -n 1 (for OpenCL, I also added -ngl 40):
Vulkan:

llama_print_timings:        load time =  2129,57 ms
llama_print_timings:      sample time =     1,13 ms /     1 runs   (    1,13 ms per token,   882,61 tokens per second)
llama_print_timings: prompt eval time = 45162,26 ms /   372 tokens (  121,40 ms per token,     8,24 tokens per second)
llama_print_timings:        eval time =     0,00 ms /     1 runs   (    0,00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 45182,14 ms

OpenCL:

llama_print_timings:        load time =  9512,41 ms
llama_print_timings:      sample time =     1,02 ms /     1 runs   (    1,02 ms per token,   985,22 tokens per second)
llama_print_timings: prompt eval time = 47387,69 ms /   372 tokens (  127,39 ms per token,     7,85 tokens per second)
llama_print_timings:        eval time =     0,00 ms /     1 runs   (    0,00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 47411,72 ms

Overall, Vulkan used my VRAM much better, while OpenCL allocated all of it and still didn't use my GPU as much as your Vulkan implementation.

JianbangZ commented

I couldn't get this to work; it throws a weird error: "fatal error C1189: #error: Your C implementation is not IEC 559 compliant, which is required for proper Vulkan interop".

I also came across another project which fully uses Vulkan, and I tested it on Windows with an Intel integrated GPU. Speed is quite reasonable, and the nice thing is that it fully uses the integrated GPU, so the PC doesn't feel slow anymore; otherwise the CPU would be at 100%: https://mlc.ai/mlc-llm/

AlphaAtlas commented Jul 3, 2023

@JianbangZ

Yeah, MLC uses Apache TVM's Vulkan backend. Its feature set is pretty barebones, and it doesn't support model splitting, CPU offloading, or anything like K-quants, but it is very fast and relatively portable once compiled.

If you are using MSVC or something, my guess is you need to compile in WSL?

JianbangZ commented

I use https://github.com/skeeto/w64devkit

AlphaAtlas commented

> I use https://github.com/skeeto/w64devkit

@JianbangZ 🤷

You could try reporting this issue to w64devkit, but it might be a fundamental compiler limitation.

JianbangZ commented

> > I use https://github.com/skeeto/w64devkit
>
> @JianbangZ 🤷
>
> You could try reporting this issue to w64devkit, but it might be a fundamental compiler limitation.

I think the C1189 is an MSVC error. Not sure whose fault it is at this moment.

SlyEcho (Sponsor, Collaborator) commented Jul 4, 2023

> I wasn't able to compile Kompute because of compiler warnings being treated as errors, so it was necessary to edit kompute/CMakeLists.txt (after updating the kompute submodule) and remove -Werror from the build flags.

The error/warning seems to come from the fmt library build; using the system's fmt library with -DKOMPUTE_OPT_USE_BUILT_IN_FMT=OFF allowed me to build. Also useful: -DKOMPUTE_OPT_LOG_LEVEL=Off. A sketch of the resulting configure line is below.
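Concretely, the configure line was along these lines (a sketch; the -DLLAMA_KOMPUTE value and the build directory name are assumptions, while the two KOMPUTE_OPT flags are the ones mentioned above):

```sh
# use the system fmt instead of Kompute's bundled copy (whose -Werror build fails),
# and turn Kompute logging off entirely
cmake -B build -DLLAMA_KOMPUTE=1 \
      -DKOMPUTE_OPT_USE_BUILT_IN_FMT=OFF \
      -DKOMPUTE_OPT_LOG_LEVEL=Off
cmake --build build --config Release
```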

riverzhou commented

[Jul 12 2023 15:12:23] [warn] [/home/river/LLM/llama.cpp/kompute/src/Manager.cpp:231] Kompute Manager no valid layer names found from desired layer names
[Jul 12 2023 15:12:23] [info] [/home/river/LLM/llama.cpp/kompute/src/Manager.cpp:339] Using physical device index 0 found llvmpipe (LLVM 15.0.7, 256 bits)

What is the problem here?
