
Refactor Vulkan backend to allow multiple contexts #7961

Merged
0cc4m merged 3 commits into master from 0cc4m/vulkan-backend-context-fix on Jun 23, 2024

Conversation

0cc4m
Collaborator

@0cc4m 0cc4m commented Jun 16, 2024

I reworked how Vulkan backends are handled. This should fix #7575, and hopefully some issues with RAM overuse.

@MaggotHATE Can you check whether this has fixed your issue?

I'll leave this on draft while I run further tests to make sure I didn't break anything.

@MaggotHATE
Contributor

MaggotHATE commented Jun 16, 2024

@0cc4m Looks like it has, thank you! I will run more tests tomorrow, but so far it works on non-MoE models (not just the same one; I'm testing an 11B Llama-3-based model in Q5_K_M). Speeds are quite good too.

@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) label Jun 16, 2024
@MaggotHATE
Contributor

Maybe I'm spoiled by CLBlast, but so far the Vulkan backend has two annoyances:

  1. Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...
  2. ...running models requires a precise number of layers offloaded - one more or one less, and it will either crash or halt, doing nothing.

I don't see any other major problems for now, but the waiting times are quite long, and it (or Windows?) clears or forgets the cache after some time, so even if I haven't changed anything in the settings, I have to wait again.

@lin72h

lin72h commented Jun 18, 2024

  1. Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...

Thanks for the feedback. Just wondering, is there a way to cache the shaders as SPIR-V binaries and make the loading faster?

@mofosyne mofosyne added the Review Complexity: High (Generally requires in-depth knowledge of LLMs or GPUs) label Jun 18, 2024
@0cc4m
Collaborator Author

0cc4m commented Jun 18, 2024

Maybe I'm spoiled by CLBlast, but so far the Vulkan backend has two annoyances:

1. Shader runtime compilation (?) takes too much time (GPU utilization is less than 30%), and it happens quite often, because...

2. ...running models requires a precise number of layers offloaded - one more or one less, and it will either crash or halt, doing nothing.

I don't see any other major problems for now, but the waiting times are quite long, and it (or Windows?) clears or forgets the cache after some time, so even if I haven't changed anything in the settings, I have to wait again.

Both of those issues are specific to you. I have never had those on any system I tested.

  1. Shader compilation and caching is the responsibility of your driver, and both Linux and Windows drivers do that well in my experience.
  2. It should never halt. It will crash when it runs out of VRAM.

Can you give me more details of your system? It doesn't seem to behave as expected.
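
(For context on the caching question above: Vulkan consumes shaders as SPIR-V already, so the slow step is typically the driver compiling that SPIR-V into GPU machine code when the compute pipelines are created. Drivers cache this on their own, but an application can also persist the compiled pipelines explicitly through a VkPipelineCache serialized to disk. The sketch below is a minimal illustration of that Vulkan mechanism, with a hypothetical file path and helper names; it is not how ggml-vulkan necessarily manages its pipelines.)

```cpp
// Minimal sketch: persist driver pipeline compilation across runs with a
// VkPipelineCache. Assumes a valid VkDevice; error handling omitted.
#include <vulkan/vulkan.h>
#include <fstream>
#include <iterator>
#include <vector>

VkPipelineCache load_pipeline_cache(VkDevice device, const char * path) {
    std::vector<char> blob;
    std::ifstream in(path, std::ios::binary);
    if (in) {
        blob.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    }
    VkPipelineCacheCreateInfo info = {};
    info.sType           = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
    info.initialDataSize = blob.size();                     // empty on a first run
    info.pInitialData    = blob.empty() ? nullptr : blob.data();
    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &info, nullptr, &cache);
    return cache;                                           // pass to vkCreateComputePipelines()
}

void save_pipeline_cache(VkDevice device, VkPipelineCache cache, const char * path) {
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);  // query the blob size first
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), (std::streamsize) size);
}
```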

@MaggotHATE
Contributor

Can you give me more details of your system? It doesn't seem to behave as expected

Windows 10, 16 GB DDR4, i7-8700, GTX 1060 3 GB (driver 536.23). I remember reporting it first here, so maybe it's Windows 10 specifically. I use both the iGPU and the 1060 now, so it's not related.

It may be relevant that I use LunarG's Vulkan SDK 1.3.283.

It should never halt. It will crash when it runs out of VRAM.

It does crash when it's out of VRAM, but it halts when not enough layers are offloaded.

@0cc4m
Collaborator Author

0cc4m commented Jun 18, 2024

Can you give me more details of your system? It doesn't seem to behave as expected

Windows 10, 16 GB DDR4, i7-8700, GTX 1060 3 GB (driver 536.23). I remember reporting it first here, so maybe it's Windows 10 specifically. I use both the iGPU and the 1060 now, so it's not related.

That seems perfectly normal, yeah. I don't have a Windows Nvidia setup myself, but many people run that. Odd.

It should never halt. It will crash when it runs out of VRAM.

It does crash when it's out of VRAM, but it halts when not enough layers are offloaded.

That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.
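
(To make the offloading point concrete, here is a hedged sketch of how the layer count is set through the llama.cpp C API; the model path and all numeric values are placeholders. With n_gpu_layers = 0 the weights stay in system RAM and, as described above, only the large matrix multiplications are dispatched to the Vulkan device; larger values keep that many layers resident in VRAM.)

```cpp
// Sketch: loading a model with a chosen number of GPU-offloaded layers via the
// llama.cpp C API. Values here are illustrative only.
#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 8;                      // 0 = keep all layers on the CPU

    llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx   = 8192;                        // illustrative context size
    cparams.n_batch = 4096;                        // illustrative batch size

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... evaluate tokens with llama_decode() ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```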

@MaggotHATE
Contributor

MaggotHATE commented Jun 19, 2024

The shader compilation is almost random - it didn't happen today in a fresh session, even the first time I ran a model. Not even once so far.

That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.

Here's the report: vk_report_hermes_halt.txt. The model is Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, 8 layers, ctx-size is 8192, n_batch is 4096. It runs perfectly with 9 layers and crashes with 10 layers.

UPD: This is specifically 8 layers that halt - 7 and 6 work.

@0cc4m
Collaborator Author

0cc4m commented Jun 19, 2024

That doesn't make sense, you can run with 0 layers offloaded and it should work just fine. In that case it'll only offload the big matrix multiplications to the GPU. Any number of layers up to your VRAM limit should work.

Here's the report: vk_report_hermes_halt.txt. The model is Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q6_K.gguf, 8 layers, ctx-size is 8192, n_batch is 4096. It runs perfectly with 9 layers and crashes with 10 layers.

UPD: This is specifically 8 layers that halt - 7 and 6 work.

Can you build and run with validation and vulkan debug enabled and upload a log where it got stuck?

@MaggotHATE
Contributor

MaggotHATE commented Jun 19, 2024

Can you build and run with validation and vulkan debug enabled and upload a log where it got stuck?

Here's the log: vk_report_validation_hermes_halt.txt

The GPU just stays at 21% power, but it doesn't seem to do anything.

UPD: Now that I remember, this was an issue my friend had with the Vulkan backend on Windows 10, back before the new backend system. It was fixed just before the move to the new backend. In fact, my friend still uses that pre-backend build sometimes.

@0cc4m
Collaborator Author

0cc4m commented Jun 19, 2024

Can you enable the debug output? LLAMA_VULKAN_DEBUG=1

@MaggotHATE
Contributor

MaggotHATE commented Jun 19, 2024

Can you enable the debug output? LLAMA_VULKAN_DEBUG=1

I'm getting a compilation error with GGML_VULKAN_DEBUG:

base/ggml-vulkan.cpp:2040:58: error: 'dev_num' was not declared in this scope; did you mean 'dev_t'? 2040 | VK_LOG_DEBUG("ggml_vk_init(" << ctx->name << ", " << dev_num << ")");

UPD: looks like it should be idx instead of dev_num?
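
(Based on the error message above, the fix is presumably just pointing the debug macro at the parameter name introduced by the refactor, along these lines; an untested guess:)

```cpp
// ggml-vulkan.cpp, in ggml_vk_init(): log the device index the function now
// receives (idx after the refactor) instead of the removed dev_num variable.
VK_LOG_DEBUG("ggml_vk_init(" << ctx->name << ", " << idx << ")");
```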

@0cc4m
Collaborator Author

0cc4m commented Jun 19, 2024

Can you enable the debug output? LLAMA_VULKAN_DEBUG=1

I'm getting a compilation error with GGML_VULKAN_DEBUG:

base/ggml-vulkan.cpp:2040:58: error: 'dev_num' was not declared in this scope; did you mean 'dev_t'? 2040 | VK_LOG_DEBUG("ggml_vk_init(" << ctx->name << ", " << dev_num << ")");

That's my bad, sorry. I forgot to update it after refactoring the function.

If it's just that line, you could delete it. Or you can wait until I fix it later today.

@MaggotHATE
Contributor

If it's just that line, you could delete it. Or you can wait until I fix it later today.

I think I fixed it by going with idx. Here's the log: vk_report_debug_memdebug_validation_hermes_halt.txt

@0cc4m
Collaborator Author

0cc4m commented Jun 19, 2024

If it's just that line, you could delete it. Or you can wait until I fix it later today.

I think I fixed it by going with idx. Here's the log: vk_report_debug_memdebug_validation_hermes_halt.txt

Thank you, looks like it gets stuck in a small copy to GPU operation. Weird. I'll take a closer look later, maybe I can figure out what's going on.

@0cc4m
Collaborator Author

0cc4m commented Jun 19, 2024

I have no idea why it's stopping there, I think that's some issue with your driver. Your log looks normal up to that point, and I can't reproduce that issue with any of my GPUs.

@0cc4m 0cc4m marked this pull request as ready for review June 19, 2024 17:28
@0cc4m 0cc4m merged commit 45c0e2e into master Jun 23, 2024
57 checks passed
@0cc4m 0cc4m deleted the 0cc4m/vulkan-backend-context-fix branch June 23, 2024 08:21
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
* Refactor Vulkan backend to allow multiple contexts

* Fix too many shader groups called validation error in llama3 on AMD and Intel GPUs

* Fix Vulkan debug build error

Successfully merging this pull request may close these issues.

Bug: having more than one context doesn't work as expected with the Vulkan backend