ggml-webgpu: add Q1_0 support#22374
Merged
reeselevine merged 2 commits into ggml-org:master on Apr 27, 2026
Conversation
a355539 to 0c4b40e
CISC
approved these changes
Apr 26, 2026
Aflah012
approved these changes
Apr 26, 2026
reeselevine
reviewed
Apr 27, 2026
Contributor
reeselevine
left a comment
Only minor change: we shouldn't need to initialize shared memory, otherwise looks good!
```wgsl
if (global_m >= params.m) {
    for (var bit = 0u; bit < NQ; bit++) {
        shmem[i + bit] = f16(0.0);
```
Contributor
We actually don't need to initialize shared memory to 0, because WebGPU guarantees it will be zero-initialized.
Contributor
Author
Thanks, added a fix.
reeselevine
reviewed
Apr 27, 2026
```wgsl
for (var bit = 0u; bit < NQ; bit++) {
    shmem[i + bit] = f16(0.0);
}
continue;
```
Contributor
`break` instead of `continue`?
Contributor
Author
Thanks, added a fix. The `continue` was treating each iteration as if it might come back in-bounds, but `global_m` only increases, so once it passes `params.m` every remaining row is out of bounds too. `break` works better here and exits early.
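The out-of-bounds guard under discussion can be modeled roughly as below. This is an illustrative Python sketch, not the shader itself; the names (`thread_base`, `params_m`, `stride`, `n_iters`) are hypothetical stand-ins for the WGSL loop's variables.

```python
def rows_processed(thread_base, params_m, n_iters, stride, use_break):
    """Model a thread walking rows; return (rows handled, iterations spent)."""
    rows, iters = [], 0
    global_m = thread_base
    for _ in range(n_iters):
        iters += 1
        if global_m >= params_m:
            if use_break:
                break       # global_m only grows, so every later row is OOB too
            global_m += stride
            continue        # keeps re-checking a condition that can never flip back
        rows.append(global_m)
        global_m += stride
    return rows, iters

# Both variants process the same rows, but break stops at the first OOB row:
cont_rows, cont_iters = rows_processed(6, 10, 8, 2, use_break=False)
brk_rows, brk_iters = rows_processed(6, 10, 8, 2, use_break=True)
assert cont_rows == brk_rows == [6, 8]
assert brk_iters < cont_iters   # break exits after 3 iterations, continue runs all 8
```

Since the two variants touch identical rows, the change is purely an early exit with no behavioral difference.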
Contributor
Author
Thanks Reese for the feedback! I have made the required changes.
reeselevine
approved these changes
Apr 27, 2026
Overview
Adds WebGPU support for the Q1_0 quantization type, including a fast mat-vec kernel (`MUL_ACC_Q1_0` in `mul_mat_vec.wgsl`), a fast mat-mat block (`INIT_SRC0_SHMEM_Q1_0` in `mul_mat_decls.tmpl`) that enables both the register-tile and subgroup-matrix paths, and a `GET_ROWS` dequant (`Q1_0` block in `get_rows.wgsl`), along with the dispatcher and `supports_op` updates for `MUL_MAT` and `MUL_MAT_ID`.

Additional information
Q1_0 was previously not supported on the WebGPU backend, so both mat-vec and mat-mat dispatched to the CPU fallback. With this PR the kernels run on WebGPU.
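The mat-vec kernel follows the usual dequantize-then-accumulate pattern for block-quantized weights. The sketch below models that pattern in Python; the block format here (16 weights sharing one scale, with 1-bit signs) is purely illustrative and is NOT Q1_0's actual memory layout, which this PR defines in the WGSL sources.

```python
def dequant_block(scale, bits, n=16):
    # Hypothetical 1-bit block: each bit selects +scale or -scale.
    return [scale if (bits >> i) & 1 else -scale for i in range(n)]

def mul_mat_vec_row(blocks, x):
    """Dot product of one quantized row (list of (scale, packed_bits)) with x."""
    acc = 0.0
    for b, (scale, bits) in enumerate(blocks):
        w = dequant_block(scale, bits)
        for i, wi in enumerate(w):
            acc += wi * x[b * 16 + i]
    return acc

# One block, only weight 0 positive, all inputs 1.0: acc = 0.5 - 15 * 0.5
assert mul_mat_vec_row([(0.5, 0b1)], [1.0] * 16) == -7.0
```

The actual kernel does this per-block dequant in registers (or via shared memory for the mat-mat paths) rather than materializing the full row.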
Numbers below are from `llama-bench -m Bonsai-1.7B-Q1_0.gguf -p 512 -n 128 -r 3 -ngl 99` on Intel Arc B580 (Mesa 25.2.8, Dawn 4654ba883e), using the model from prism-ml/Bonsai-1.7B-gguf.

Requirements