
ggml-webgpu: add Q1_0 support #22374

Merged

reeselevine merged 2 commits into ggml-org:master from SharmaRithik:webgpu-q1_0-support on Apr 27, 2026

Conversation

@SharmaRithik
Contributor

@SharmaRithik SharmaRithik commented Apr 25, 2026

Overview

Adds WebGPU support for the Q1_0 quantization type:

- a fast mat-vec kernel (MUL_ACC_Q1_0 in mul_mat_vec.wgsl)
- a fast mat-mat shared-memory block (INIT_SRC0_SHMEM_Q1_0 in mul_mat_decls.tmpl) that enables both the register-tile and subgroup-matrix paths
- a GET_ROWS dequantization (Q1_0 block in get_rows.wgsl)
- dispatcher and supports_op updates for MUL_MAT and MUL_MAT_ID

Additional information

Q1_0 was previously not supported on the WebGPU backend, so both mat-vec and mat-mat dispatched to the CPU fallback. With this PR the kernels run on WebGPU.

Numbers below are from llama-bench -m Bonsai-1.7B-Q1_0.gguf -p 512 -n 128 -r 3 -ngl 99 on Intel Arc B580 (Mesa 25.2.8, Dawn 4654ba883e), using the model from prism-ml/Bonsai-1.7B-gguf.

| test            | master (tok/s) | this branch (tok/s) |
| --------------- | -------------- | ------------------- |
| pp512 (prefill) | 137.44 ± 0.25  | 2775.24 ± 11.56     |
| tg128 (decode)  | 12.59 ± 0.14   | 59.96 ± 0.50        |


@SharmaRithik SharmaRithik requested a review from a team as a code owner April 25, 2026 23:02
@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and WebGPU labels on Apr 25, 2026
Contributor

@reeselevine reeselevine left a comment


only minor change is we shouldn't need to initialize shared memory, otherwise looks good!


    if (global_m >= params.m) {
        for (var bit = 0u; bit < NQ; bit++) {
            shmem[i + bit] = f16(0.0);
Contributor


We actually don't need to initialize shared memory to 0, because WebGPU guarantees workgroup memory is zero-initialized.

Contributor Author


Thanks, added a fix.

    for (var bit = 0u; bit < NQ; bit++) {
        shmem[i + bit] = f16(0.0);
    }
    continue;
Contributor


break instead of continue?

Contributor Author


Thanks, added a fix. The continue treated each iteration as if a later one might come back in-bounds, but global_m only increases, so once it passes params.m every remaining iteration is out of bounds too. break works better here and exits the loop early.

@SharmaRithik
Contributor Author

Thanks Reese for the feedback! I have made the required changes.

@reeselevine reeselevine merged commit 434b2a1 into ggml-org:master Apr 27, 2026
45 of 46 checks passed

Labels

ggml (changes relating to the ggml tensor library for machine learning), WebGPU


4 participants