Vulkan implementation (via Kompute) #2039

Closed · wants to merge 43 commits
Conversation

niansa (Contributor) commented Jun 28, 2023

Hey!

This is an attempt at a Vulkan implementation in GLSL via Kompute. Kompute is used instead of raw Vulkan to avoid thousands of lines of hard-to-maintain boilerplate code.

I mostly base the code on the current Metal implementation and approach.

Already implemented

  • ADD (Untested)
  • MUL (Untested)
  • SCALE (Test passed) ✅
  • SILU (Test passed) ✅
  • RELU (Test passed) ✅
  • GELU (Test passed) ✅
  • SOFT_MAX (Test failed) ❌
  • DIAG_MASK_INF (likely broken)
  • MUL_MAT (still needs fallback for small sizes)
    • F16 (Unused)
    • Q4_0 (Test failed) ❌
    • Q4_1 (Test passed) ✅
  • GET_ROW
    • F16 (Unused)
    • Q4_0 (Test passed) ✅
    • Q4_1 (Test failed) ❌
  • NORM (Unused)
  • RMS_NORM (Test passed) ✅
  • ROPE (Test passed) ✅
  • CPY (Test passed) ✅

TODO things before merge

  • Remove Kompute submodule (expect it from system!)
  • Clean up
  • Fix Vulkan validation errors
  • Fix conflicts

AlphaAtlas commented Jun 29, 2023

👀

Other than reaching more platforms, what advantages would this Vulkan implementation have over OpenCL? Better performance? Better support for IGPs? Support for the matmul acceleration in Arc/RDNA3?

SlyEcho (Sponsor, Collaborator) commented Jun 29, 2023

OpenCL and OpenGL are basically deprecated and Vulkan is the replacement. Better driver support in newer GPUs etc.

KerfuffleV2 (Collaborator) commented

In case it helps anyone build this:

Make sure you set up the Kompute submodule:

  1. git submodule init
  2. git submodule update

There currently isn't support for building with the Makefile; you need to use a CMake build and add -DLLAMA_KOMPUTE. (You can also add something like -DKOMPUTE_OPT_LOG_LEVEL=Warn to turn down the Kompute logging.)

I wasn't able to compile Kompute because of compiler warnings being treated as errors, so it was necessary to edit kompute/CMakeLists.txt (after updating the kompute submodule) and remove -Werror from the build flags. A combined sketch of the build steps is below.
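Put together, the whole sequence looks roughly like this (a sketch only: the `=1` value for -DLLAMA_KOMPUTE and the `build` directory name are assumptions on my part, not something I verified):

```sh
# fetch the Kompute submodule
git submodule init
git submodule update

# configure a CMake build with the Kompute backend enabled and quieter Kompute logging
# (the =1 value is an assumption; the point is just to define LLAMA_KOMPUTE)
cmake -B build -DLLAMA_KOMPUTE=1 -DKOMPUTE_OPT_LOG_LEVEL=Warn
cmake --build build --config Release
```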

Right now it seems like it just skips any operations that aren't implemented for the GPU, so you can't run inference on a model yet. Fun fact: it turns out evaluating models is really fast if you skip all the matrix multiplication operations!

0cc4m mentioned this pull request Jun 30, 2023
ggerganov (Owner) commented

> Other than reaching more platforms, what advantages would this Vulkan implementation have over OpenCL? Better performance? Better support for IGPs? Support for the matmul acceleration in Arc/RDNA3?

In addition to what @SlyEcho answered, my view is that since we now have a relatively good way to implement decoupled GPU backends (CUDA, Metal, OpenCL, etc.), there is no good reason not to keep doing it. As long as the new implementations do not touch the core ggml API and don't break the general workflow of other backends, we can keep adding support for new backends. Developers can join and help maintain the respective implementations. We can even have multiple implementations for a given single backend since we are still learning and finding new ways to make things more optimal and we don't know yet what is "the best way".

In the long run, we can eventually obsolete and deprecate certain implementations, but for now the main goal is to experiment and explore the most efficient ways for hardware acceleration. Supporting more backends also helps to find certain good patterns in the implementations and even though currently there is a lot of "copy-paste", at some point we can think about the best way to consolidate the code and reuse the best techniques that we have found across all architectures.

Firstbober commented

I can't get either this PR or #2059 to work. I wanted to run a perplexity test with wiki.test.raw as recommended in the README, but all I get is a GGML_ASSERT at ggml-vulkan.cpp:142: res != ctx->tensors.end(). I also tested plain inference, but it just runs out of memory on my RAM (16 GiB physical + 16 GiB swap).

In the case of the second implementation, I can't even get as far as filling my RAM, as it just crashes with ggml_vulkan: vk_transfer_queue_family_index invalid and I can't do anything.

I am using an RX 580 (RADV driver, Polaris 10 architecture). On a side note, OpenCL works, but prompt ingestion is painfully slow.

0cc4m (Collaborator) commented Jul 2, 2023

@Firstbober AMD GPUs have fewer queues, it seems. The initialization code of my implementation didn't work with that yet. Can you try again? Don't expect any good performance yet, though.

Firstbober commented

> @Firstbober AMD GPUs have fewer queues, it seems. The initialization code of my implementation didn't work with that yet. Can you try again? Don't expect any good performance yet, though.

It did work this time after your patch! I needed to disable the validation layer, as it produced too many errors. Two major ones were VUID-vkCmdDispatch-groupCountX-00386 (ERROR / SPEC) and VUID-VkDeviceCreateInfo-queueFamilyIndex-02802 (ERROR / SPEC).

After making the output readable, I can clearly see my GPU working and ingesting the prompt. I wasn't able to test the perplexity of wiki.test.raw as it just OOMed me (vk::Device::allocateCommandBuffers: ErrorOutOfHostMemory), but I only have 4 GiB of VRAM, so it's pretty understandable.

It seems to be very close to OpenCL. I ran the test with ./main -n 1 (for OpenCL, I also added -ngl 40):
Vulkan:

llama_print_timings:        load time =  2129,57 ms
llama_print_timings:      sample time =     1,13 ms /     1 runs   (    1,13 ms per token,   882,61 tokens per second)
llama_print_timings: prompt eval time = 45162,26 ms /   372 tokens (  121,40 ms per token,     8,24 tokens per second)
llama_print_timings:        eval time =     0,00 ms /     1 runs   (    0,00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 45182,14 ms

OpenCL:

llama_print_timings:        load time =  9512,41 ms
llama_print_timings:      sample time =     1,02 ms /     1 runs   (    1,02 ms per token,   985,22 tokens per second)
llama_print_timings: prompt eval time = 47387,69 ms /   372 tokens (  127,39 ms per token,     7,85 tokens per second)
llama_print_timings:        eval time =     0,00 ms /     1 runs   (    0,00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 47411,72 ms

Overall, Vulkan used my VRAM much better, while OpenCL allocated all of it and still didn't use my GPU as much as your Vulkan implementation.

JianbangZ commented

I couldn't get this to work; it throws a weird error: "fatal error C1189: #error: Your C implementation is not IEC 559 compliant, which is required for proper Vulkan interop".

I also came across another project which fully uses Vulkan, and I tested it on Windows with an Intel integrated GPU. Speed is quite reasonable, and the nice thing is that it fully uses the integrated GPU, so the PC doesn't feel slow anymore; otherwise the CPU would be at 100%: https://mlc.ai/mlc-llm/

AlphaAtlas commented Jul 3, 2023

@JianbangZ

Yeah, MLC uses Apache TVM's Vulkan backend. Its feature set is pretty barebones, and it doesn't support model splitting, CPU offloading, or anything like K-quants, but it is very fast and relatively portable once compiled.

If you are using MSVC or something, my guess is you need to compile in WSL?

JianbangZ commented

I use https://github.com/skeeto/w64devkit

AlphaAtlas commented

> I use https://github.com/skeeto/w64devkit

@JianbangZ 🤷

You could try reporting this issue to w64devkit, but it might be a fundamental compiler limitation.

JianbangZ commented

> > I use https://github.com/skeeto/w64devkit
>
> @JianbangZ 🤷
>
> You could try reporting this issue to w64devkit, but it might be a fundamental compiler limitation.

I think the C1189 is an MSVC error. Not sure whose fault it is at this moment.

SlyEcho (Sponsor, Collaborator) commented Jul 4, 2023

> I wasn't able to compile Kompute because of compiler warnings being treated as errors, so it was necessary to edit kompute/CMakeLists.txt (after updating the kompute submodule) and remove -Werror from the build flags.

The error/warning seems to come from the fmt library build; using the system's fmt library with -DKOMPUTE_OPT_USE_BUILT_IN_FMT=OFF allowed me to build. Also useful: -DKOMPUTE_OPT_LOG_LEVEL=Off. A sketch of the resulting configure line is below.
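Concretely, the configure line was along these lines (a sketch; the -DLLAMA_KOMPUTE value and the build directory name are assumptions, while the two KOMPUTE_OPT flags are the ones mentioned above):

```sh
# use the system fmt instead of Kompute's bundled copy (whose -Werror build fails),
# and turn Kompute logging off entirely
cmake -B build -DLLAMA_KOMPUTE=1 \
      -DKOMPUTE_OPT_USE_BUILT_IN_FMT=OFF \
      -DKOMPUTE_OPT_LOG_LEVEL=Off
cmake --build build --config Release
```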

riverzhou commented

[Jul 12 2023 15:12:23] [warn] [/home/river/LLM/llama.cpp/kompute/src/Manager.cpp:231] Kompute Manager no valid layer names found from desired layer names
[Jul 12 2023 15:12:23] [info] [/home/river/LLM/llama.cpp/kompute/src/Manager.cpp:339] Using physical device index 0 found llvmpipe (LLVM 15.0.7, 256 bits)

What is the problem here?
