vulkan: remove the need for the dryrun #16826
Conversation
I do have the optimal setup to test this, with a slow server CPU + 3090.
The tests look good, but it's gonna take me a little while to go through the code.

Can you fix the conflict?

Rebased.
Okay, I get what this is doing and results are positive, in my tests. What's your plan now? Clean up the unused code before merge?
Sure, I can do that tomorrow.
The test spawns multiple threads, each creates its own I'm only observing crashes after this PR, but from looking at the code it seems like this should always have been somewhat broken. Almost feels like I must misunderstand something..? One could argue that it doesn't make much sense to concurrently run multiple |
|
Thanks for the tip. I had seen a test-thread-safety failure in an unrelated CI job and was surprised. I'll try to reproduce it locally. |
* origin/master: (21 commits)
  vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (ggml-org#16919)
  examples(gguf): GGUF example outputs (ggml-org#17025)
  mtmd: allow QwenVL to process larger image by default (ggml-org#17020)
  server : do not default to multiple slots with speculative decoding (ggml-org#17017)
  mtmd: improve struct initialization (ggml-org#16981)
  docs: Clarify the endpoint that webui uses (ggml-org#17001)
  model : add openPangu-Embedded (ggml-org#16941)
  ggml webgpu: minor set rows optimization (ggml-org#16810)
  sync : ggml
  ggml : fix conv2d_dw SVE path (ggml/1380)
  CUDA: update ops.md (ggml-org#17005)
  opencl: update doc (ggml-org#17011)
  refactor: replace sprintf with snprintf for safer string handling in dump functions (ggml-org#16913)
  vulkan: remove the need for the dryrun (ggml-org#16826)
  server : do context shift only while generating (ggml-org#17000)
  readme : update hot topics (ggml-org#17002)
  ggml-cpu : bicubic interpolation (ggml-org#16891)
  ci : apply model label to models (ggml-org#16994)
  chore : fix models indent after refactor (ggml-org#16992)
  Fix garbled output with REPACK at high thread counts (ggml-org#16956)
  ...
Allocate pipelines and descriptor sets when requested.
Reallocate the prealloc buffers when needed, and flush any pending work before reallocating.
For rms_partials and total_mul_mat_bytes, use the sizes computed the last time the graph was executed.
The dryrun is a small but consistent overhead where the GPU is idle. I get an average of maybe 1-2% improvement with it removed, though my numbers have been noisy lately.
I didn't totally rip out all the unused logic yet; I wanted to keep the diff smaller to make it clearer how the new code works.
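The "reallocate when needed, flush first" part of the description can be sketched roughly as below. This is an illustrative C++ sketch only, not the actual ggml-vulkan code: `PreallocBuffer`, `Context`, `flush_pending`, and `ensure_prealloc` are all hypothetical names standing in for the real buffer and submission machinery.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a preallocated device scratch buffer.
struct PreallocBuffer {
    std::vector<char> storage;
    size_t size = 0;
};

struct Context {
    PreallocBuffer prealloc_x;
    bool has_pending_work = false;

    // Stand-in for submitting and waiting on queued command buffers.
    void flush_pending() {
        has_pending_work = false;
    }

    // Grow the scratch buffer on demand. Any work already recorded against
    // the old buffer must be flushed before it is replaced, otherwise
    // in-flight commands would reference freed memory.
    char * ensure_prealloc(size_t needed) {
        if (needed > prealloc_x.size) {
            if (has_pending_work) {
                flush_pending();
            }
            prealloc_x.storage.resize(needed);
            prealloc_x.size = needed;
        }
        return prealloc_x.storage.data();
    }
};
```

The same on-demand idea would apply to pipelines and descriptor sets (created when first requested), while sizes that are only known after a full graph pass, like the ones described for rms_partials and total_mul_mat_bytes, come from the previous execution of the graph rather than a dryrun.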