Conversation

@am17an (Collaborator) commented Oct 13, 2025

I see speedups on my 3090, but not so much on a 4090. I suspect this is due to better integer division hardware on newer cards, but I did not find any documentation to confirm it.

on 3090:

| Model | Test | t/s master | t/s cuda_mmvf_fastdiv | Speedup |
| --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B BF16 | tg32 | 117.54 | 128.09 | 1.09 |
| lfm2moe 8B.A1B BF16 | tg64 | 120.75 | 127.56 | 1.06 |
| lfm2moe 8B.A1B BF16 | tg128 | 121.43 | 128.15 | 1.06 |

on 4090:

| Model | Test | t/s master | t/s cuda_mmvf_fastdiv | Speedup |
| --- | --- | --- | --- | --- |
| lfm2moe 8B.A1B F16 | tg32 | 139.51 | 140.07 | 1.00 |
| lfm2moe 8B.A1B F16 | tg64 | 139.41 | 139.74 | 1.00 |
| lfm2moe 8B.A1B F16 | tg128 | 139.35 | 139.57 | 1.00 |
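For readers unfamiliar with the trick: "fastdiv" replaces integer division by a divisor that is constant per kernel launch (here, tensor dimensions) with a multiply-high and a shift, using a magic multiplier precomputed on the host. The sketch below shows the general Granlund–Montgomery-style technique; the names (`fastdiv_vals`, `init_fastdiv`, `fastmod`) are illustrative and may not match the exact helper used in ggml's CUDA backend.

```cuda
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d >= 1.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // ceil(log2(d)), used as the shift amount
    uint32_t d;  // the divisor itself, kept for the modulo path
};

// Host side: compute the magic multiplier and shift once per divisor.
static fastdiv_vals init_fastdiv(uint32_t d) {
    uint32_t l = 0;
    while (l < 32 && (uint32_t{1} << l) < d) {
        ++l; // l = ceil(log2(d))
    }
    const uint32_t mp = (uint32_t)(((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
    return {mp, l, d};
}

// Device side: n / d without a hardware integer divide.
// Assumes n < 2^31 (true for the index ranges in these kernels),
// so the 32-bit add below cannot overflow.
static __device__ __forceinline__ uint32_t fastdiv(uint32_t n, fastdiv_vals fd) {
    return (__umulhi(n, fd.mp) + n) >> fd.l;
}

// n % d via the quotient: one extra multiply instead of a hardware modulo.
static __device__ __forceinline__ uint32_t fastmod(uint32_t n, fastdiv_vals fd) {
    return n - fastdiv(n, fd) * fd.d;
}
```

The win comes from 32-bit integer division and modulo being multi-instruction sequences on NVIDIA GPUs, while a multiply-high plus shift is cheap; the host pays the one-time cost of computing the magic constants per launch.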

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 13, 2025
@am17an changed the title from "CUDA: use fast + ggml_cuda_mad for mmvf" to "CUDA: use fastdiv + ggml_cuda_mad for mmvf" Oct 13, 2025
@JohannesGaessler (Collaborator) commented

I can confirm a speedup, though a smaller one. Presumably it will depend on the model.

| GPU | Model | Test | t/s 477a66b | t/s 8898040 | Speedup |
| --- | --- | --- | --- | --- | --- |
| MI50 | llama 1B BF16 | tg128 | 150.16 | 150.15 | 1.00 |
| MI50 | llama 1B F16 | tg128 | 149.20 | 150.19 | 1.01 |
| MI50 | llama 1B all F32 | tg128 | 93.51 | 93.58 | 1.00 |
| P40 | llama 1B BF16 | tg128 | 109.53 | 110.07 | 1.00 |
| P40 | llama 1B F16 | tg128 | 109.01 | 109.70 | 1.01 |
| P40 | llama 1B all F32 | tg128 | 59.18 | 59.29 | 1.00 |
| RTX 3090 | llama 1B BF16 | tg128 | 269.84 | 271.69 | 1.01 |
| RTX 3090 | llama 1B F16 | tg128 | 270.05 | 272.03 | 1.01 |
| RTX 3090 | llama 1B all F32 | tg128 | 152.44 | 153.25 | 1.01 |
| RTX 4090 | llama 1B BF16 | tg128 | 316.85 | 317.81 | 1.00 |
| RTX 4090 | llama 1B F16 | tg128 | 316.88 | 317.98 | 1.00 |
| RTX 4090 | llama 1B all F32 | tg128 | 174.09 | 174.31 | 1.00 |
| RX 6800 | llama 1B BF16 | tg128 | 94.65 | 96.46 | 1.02 |
| RX 6800 | llama 1B F16 | tg128 | 94.41 | 96.61 | 1.02 |
| RX 6800 | llama 1B all F32 | tg128 | 80.23 | 80.64 | 1.00 |
| RX 9060 XT | llama 1B BF16 | tg128 | 98.26 | 99.33 | 1.01 |
| RX 9060 XT | llama 1B F16 | tg128 | 99.47 | 99.96 | 1.00 |
| RX 9060 XT | llama 1B all F32 | tg128 | 57.23 | 57.74 | 1.01 |

@JohannesGaessler (Collaborator) left a comment

I think mul_mat_vec_f should always pass float2, half2, or nv_bfloat162 to ggml_cuda_mad and then let that function decide how to do the calculation. For example, I think on Hopper and Blackwell there are mixed-precision instructions that could be used (possibly in a future PR), and there definitely are such instructions on AMD GPUs (which are already supported).
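As a rough sketch of the suggested shape (hypothetical; the actual `ggml_cuda_mad` overloads in ggml's CUDA backend may differ in names and in how they accumulate), the kernel would always hand packed pairs to overloads that pick the arithmetic per type and architecture:

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Packed-pair multiply-accumulate: acc += a.x*b.x + a.y*b.y.
// Each overload is free to choose the best instruction for its type.

static __device__ __forceinline__ void ggml_cuda_mad(float & acc, float2 a, float2 b) {
    acc += a.x * b.x + a.y * b.y; // plain FMA pair on all architectures
}

static __device__ __forceinline__ void ggml_cuda_mad(float & acc, half2 a, half2 b) {
    // Shown naively via float conversion; hardware with fast half2 math or
    // mixed-precision dot instructions could dispatch differently here.
    const float2 af = __half22float2(a);
    const float2 bf = __half22float2(b);
    acc += af.x * bf.x + af.y * bf.y;
}

static __device__ __forceinline__ void ggml_cuda_mad(float & acc, nv_bfloat162 a, nv_bfloat162 b) {
    acc += __bfloat162float(a.x) * __bfloat162float(b.x)
         + __bfloat162float(a.y) * __bfloat162float(b.y);
}
```

The point of the suggestion is that the caller stays type-agnostic: per-architecture instruction selection lives in one place instead of being scattered through the kernel.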

@am17an force-pushed the cuda_mmvf_fastdiv branch from 23f2ccc to 9d74b8f on October 14, 2025 at 04:55
@am17an requested a review from slaren as a code owner on October 14, 2025 at 04:55
@am17an force-pushed the cuda_mmvf_fastdiv branch 2 times, most recently from 7560a47 to ec9a51c on October 14, 2025 at 05:20
@am17an requested a review from IMbackK on October 14, 2025 at 05:22
@am17an force-pushed the cuda_mmvf_fastdiv branch from ec9a51c to e1afe75 on October 14, 2025 at 05:32
@am17an (Collaborator, Author) commented Oct 14, 2025

Sorry, I am not able to fix the HIP builds.

@JohannesGaessler (Collaborator) commented

For now, keep the problematic code in mmvf.cu as-is for HIP, with a comment briefly explaining the problem.
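A minimal illustration of that workaround pattern (names are hypothetical and this is not the exact code that was merged; `GGML_USE_HIP` is the macro ggml uses to gate HIP builds):

```cuda
#if defined(GGML_USE_HIP)
    // HIP builds currently fail on the fastdiv path, so fall back to plain
    // integer division here until a follow-up PR resolves it.
    const uint32_t col = i / ncols;
    const uint32_t row = i % ncols;
#else
    // fd_ncols holds the precomputed fastdiv constants for ncols.
    const uint32_t col = fastdiv(i, fd_ncols);
    const uint32_t row = fastmod(i, fd_ncols);
#endif
```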

@am17an force-pushed the cuda_mmvf_fastdiv branch from 6dce339 to d6c71e9 on October 14, 2025 at 09:27
@IMbackK (Collaborator) commented Oct 14, 2025

> Sorry, I am not able to fix the HIP builds.

I'll take a look.

@JohannesGaessler (Collaborator) commented

> I'll take a look.

Would be appreciated, otherwise I would have tried to fix this myself. My preferred approach would be to merge this PR as-is and to fix the HIP issues in a follow-up PR. Is that fine with both of you?

@IMbackK (Collaborator) commented Oct 14, 2025

Sure, yes.

@JohannesGaessler merged commit 1ee9d0b into ggml-org:master on Oct 14, 2025 (65 of 70 checks passed).