Conversation

@am17an (Collaborator) commented Oct 17, 2025

This PR adds support for fusing mmvq and mmvf with an optional gate and GLU. Currently it only supports SWIGLU and no bias, which is by far the most common pattern. TG perf gains of 4-9% for quantized models, less for FP models.

After #16130 and this PR, `ggml_can_fuse` is too primitive to support these fusion patterns. What we want is to match a self-contained DAG with a single exit point, whose views are not used elsewhere in the graph. I will create a future PR for that.
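
For reference, the graph pattern this fusion targets is roughly the gate/up mat-vec pair followed by SWIGLU, as it appears in a typical FFN block (a rough sketch using the ggml API; tensor names are illustrative, not the exact llama.cpp code):

```cpp
// Sketch of the pattern being fused: two mat-vec multiplications over the
// same activations, combined by SWIGLU (no bias).
ggml_tensor * gate = ggml_mul_mat(ctx, ffn_gate_w, cur); // dispatched as mmvq/mmvf for small batches
ggml_tensor * up   = ggml_mul_mat(ctx, ffn_up_w,   cur);
cur = ggml_swiglu_split(ctx, gate, up);                  // silu(gate) * up
// With this PR the CUDA backend detects this sub-graph and runs it as a single kernel.
```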

Performance on a 4090

| Model | Test | t/s master | t/s cuda_fuse_gate | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 16B Q4_0 | tg32 | 205.77 | 218.02 | 1.06 |
| deepseek2 16B Q4_0 | tg64 | 202.49 | 215.07 | 1.06 |
| deepseek2 16B Q4_0 | tg128 | 200.51 | 212.70 | 1.06 |
| qwen3 0.6B F16 | tg32 | 311.52 | 325.05 | 1.04 |
| qwen3 0.6B F16 | tg64 | 305.72 | 318.91 | 1.04 |
| qwen3 0.6B F16 | tg128 | 303.35 | 315.34 | 1.04 |
| qwen3 0.6B Q4_0 | tg32 | 416.74 | 453.17 | 1.09 |
| qwen3 0.6B Q4_0 | tg64 | 407.95 | 444.68 | 1.09 |
| qwen3 0.6B Q4_0 | tg128 | 403.34 | 440.55 | 1.09 |
| qwen3 8B F16 | tg32 | 48.15 | 48.42 | 1.01 |
| qwen3 8B F16 | tg64 | 47.93 | 48.17 | 1.01 |
| qwen3 8B F16 | tg128 | 47.83 | 48.07 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 154.49 | 160.81 | 1.04 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 151.12 | 157.17 | 1.04 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 149.57 | 155.33 | 1.04 |
| gemma3 1B F16 | tg32 | 234.06 | 241.52 | 1.03 |
| gemma3 1B F16 | tg64 | 231.35 | 238.68 | 1.03 |
| gemma3 1B F16 | tg128 | 230.10 | 237.11 | 1.03 |

@am17an requested a review from @ggerganov as a code owner — Oct 17, 2025 07:16
@github-actions bot added labels: testing, Nvidia GPU, ggml — Oct 17, 2025
@JohannesGaessler (Collaborator) commented:

Without having looked at this PR, consider also fusing the Q, K, and V matrix multiplications into a single, batched operation. It's not going to reduce I/O but it's going to reduce kernel launch overhead and tail effects.
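
Not part of this PR, but to illustrate the idea: one way to get a similar effect is to concatenate the Q/K/V weights offline and split the single matmul result with views (a hypothetical sketch; `wqkv` and the dimension names are placeholders, and a truly batched kernel over separate weight tensors would be another option):

```cpp
// Hypothetical sketch: one matmul over pre-concatenated Q/K/V weights instead of three launches.
// wqkv is assumed to be a [n_embd, n_q + 2*n_kv] weight built offline from wq, wk, wv.
ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur); // single kernel launch

const size_t es = ggml_element_size(qkv);
ggml_tensor * q = ggml_view_2d(ctx, qkv, n_q,  n_tokens, qkv->nb[1], 0);
ggml_tensor * k = ggml_view_2d(ctx, qkv, n_kv, n_tokens, qkv->nb[1], n_q * es);
ggml_tensor * v = ggml_view_2d(ctx, qkv, n_kv, n_tokens, qkv->nb[1], (n_q + n_kv) * es);
```

A batched kernel over the separate Q/K/V weights, as suggested above, would achieve the launch-overhead reduction without changing the weight layout.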

Review thread on the new `test_fused_ffn_gate` test case (code context):

```cpp
}
};

struct test_fused_ffn_gate : public test_case {
```
Collaborator:

Would it make sense to extend the test case for matrix multiplication instead?

@am17an (Collaborator, author):

I think it is fine as it is, because it also has a switch for mul_mat_id.

@am17an (Collaborator, author) commented Oct 17, 2025

Note that with quantized types and the GEGLU operator, I was getting NMSE as high as 1e-2.

EDIT: this is fixed, it was a CUDA graphs bug. See comment below


@jammm (Contributor) commented Oct 19, 2025

This improved perf on the ROCm backend as well. The gap to Vulkan is closing with this one!

```
PS D:\jam\llama.cpp\build\bin> .\llama-bench.exe -ngl 99 -mmp 0 -m .\llama-2-7b.Q4_0.gguf -fa 0,1 --device rocm0,vulkan0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 9070 XT, gfx1201 (0x1201), VMM: no, Wave Size: 32
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 9070 XT (AMD proprietary driver) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
```

Before PR:

| model | size | params | backend | ngl | fa | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | ROCm0 | 0 | pp512 | 4287.33 ± 11.72 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | ROCm0 | 0 | tg128 | 120.20 ± 0.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | ROCm0 | 0 | pp512 | 4320.84 ± 19.01 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | ROCm0 | 0 | tg128 | 121.10 ± 0.39 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | Vulkan0 | 0 | pp512 | 4007.62 ± 36.73 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | Vulkan0 | 0 | tg128 | 133.28 ± 1.25 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | Vulkan0 | 0 | pp512 | 3552.97 ± 4.94 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | Vulkan0 | 0 | tg128 | 134.93 ± 0.94 |

After PR:

| model | size | params | backend | ngl | fa | dev | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | ROCm0 | 0 | pp512 | 4333.61 ± 19.93 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | ROCm0 | 0 | tg128 | 122.93 ± 0.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | ROCm0 | 0 | pp512 | 4376.61 ± 19.10 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | ROCm0 | 0 | tg128 | 124.25 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | Vulkan0 | 0 | pp512 | 4002.04 ± 41.63 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 0 | Vulkan0 | 0 | tg128 | 135.46 ± 0.13 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | Vulkan0 | 0 | pp512 | 3552.78 ± 4.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | ROCm,Vulkan | 99 | 1 | Vulkan0 | 0 | tg128 | 135.87 ± 0.30 |

build: e6c48820 (6797)

BTW, what else can we do to reduce this CPU overhead? I currently have a branch with a few other optimizations applied from other PRs (https://github.com/ggml-org/llama.cpp/compare/master...jammm:llama.cpp:jam/gfx1201?expand=1), though this PR gives the most benefit.

@am17an (Collaborator, author) commented Oct 19, 2025

@jammm it depends on your model; kernel fusion helps more at the moment for MoE models. BTW, sooner or later Vulkan will also adopt a similar thing, so it's a shifting goalpost. For AMD specifically, @JohannesGaessler recently made some changes in FA, so be sure to include those as well. Another PR you can try, related to CUDA graphs: #16548

@jammm (Contributor) commented Oct 19, 2025

> @jammm it depends on your model; kernel fusion helps more at the moment for MoE models. BTW, sooner or later Vulkan will also adopt a similar thing, so it's a shifting goalpost. For AMD specifically, @JohannesGaessler recently made some changes in FA, so be sure to include those as well. Another PR you can try, related to CUDA graphs: #16548

Thanks! Will take a look at those. You're right re. the shifting goalposts.

@am17an (Collaborator, author) commented Oct 20, 2025

There is something strange when running CUDA graphs with this; I get

```
Program hit cudaErrorGraphExecUpdateFailure (error 910) due to "the graph update was not performed because it included changes which violated constraints specific to instantiated graph update" on CUDA API call to cudaGraphExecUpdate.
```

and the fusion doesn't go through, and/or there are failures with high NMSE. With `GGML_DISABLE_CUDA_GRAPHS=1` everything seems to work fine. @slaren do you have ideas why this might be happening?

EDIT: The problem was CUDA graphs not re-capturing `GGML_OP_GLU` nodes when their op_params have changed. The diff below fixes it, and also fixes the high NMSE errors I was getting earlier:

```diff
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 1c283bc1..a0a94f62 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2834,7 +2834,7 @@ static bool ggml_graph_node_has_matching_properties(ggml_tensor * node, ggml_gra
         }
     }
 
-    if (node->op == GGML_OP_SCALE &&
+    if ((node->op == GGML_OP_SCALE || node->op == GGML_OP_GLU) &&
         memcmp(graph_node_properties->op_params, node->op_params, GGML_MAX_OP_PARAMS) != 0) {
         return false;
     }
```

@ggerganov (Member) commented:

> The problem was CUDA graphs not re-capturing `GGML_OP_GLU` nodes when their op_params have changed.

When do the op_params of `GGML_OP_GLU` change?

@am17an (Collaborator, author) commented Oct 20, 2025

> When do the op_params of `GGML_OP_GLU` change?

I added both SWIGLU and GEGLU, so the op param for `GGML_OP_GLU` changes between tests.
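
For context, a minimal sketch, assuming the GLU variant is recorded in the node's op_params (which is what the op_params comparison in the diff above relies on); `gate` and `up` are placeholder tensors:

```cpp
// Both calls create a GGML_OP_GLU node; the variant (GGML_GLU_OP_SWIGLU vs
// GGML_GLU_OP_GEGLU) lives in the node's op_params, so two test graphs that
// only swap the variant look identical except for op_params, and the CUDA
// graph property check has to compare them to trigger a re-capture.
ggml_tensor * out_swiglu = ggml_swiglu_split(ctx, gate, up); // silu(gate) * up
ggml_tensor * out_geglu  = ggml_geglu_split(ctx, gate, up);  // gelu(gate) * up
```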
