
Conversation

@agray3 (Contributor) commented Oct 10, 2024

Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
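For illustration, a minimal sketch of the idea with hypothetical names (this is not the actual dmmv.cu kernel): a single 32-bit half2 load replaces two separate 16-bit scalar loads, and the two halves are then unpacked in registers.

#include <cuda_fp16.h>

// Hypothetical example kernel: doubles every f16 element of x into y,
// processing two elements per thread.
__global__ void scale_f16(const half * __restrict__ x, float * __restrict__ y, const int n) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 >= n) {
        return;
    }

    // Scalar version: two separate 16-bit loads.
    //   const float a = __half2float(x[i + 0]);
    //   const float b = __half2float(x[i + 1]);

    // Vectorized version: a single 32-bit load of a half2, followed by
    // in-register unpacking of the low and high halves.
    const half2 v = *reinterpret_cast<const half2 *>(x + i);
    const float a = __low2float(v);
    const float b = __high2float(v);

    y[i + 0] = 2.0f * a;
    y[i + 1] = 2.0f * b;
}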

@agray3 (Contributor, Author) commented Oct 10, 2024

See #9817

@JohannesGaessler (Collaborator) left a comment

The dequantize_mul_mat_vec kernels on master are quite poorly written. At the same time they are also not needed anymore except for FP16. For this reason one of my medium-term goals is to remove this kernel and replace it with a dedicated FP16 matrix vector multiplication kernel (I think cuBLAS cannot really be used because it requires the same datatype for all matrices). The FP16 compilation option is also poorly designed and should be replaced with FAST_FP16_AVAILABLE. In this context I would then also be adding BF16 support.

So if you have the time and motivation to work on FP16 performance my recommendation would be to completely scrap the dequantize_mul_mat_vec kernels and start from scratch. (I would still be willing to review this PR otherwise.)

Comment on lines 422 to 423
v.x = x_reg.x;
v.y = x_reg.y;
@JohannesGaessler (Collaborator) commented:

This needs to use instructions like __low2float in order to work correctly with HIP. Also did you check the PTX code regarding whether or not these two lines are equivalent to assigning half2 directly?
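For illustration, a hedged sketch of what the suggested change could look like, assuming x_reg is a half2 and v is a float2 (the exact types in dmmv.cu may differ):

// fp16 intrinsics from <cuda_fp16.h>, available on both CUDA and HIP:
v.x = __low2float(x_reg);   // low  half -> float, instead of v.x = x_reg.x;
v.y = __high2float(x_reg);  // high half -> float, instead of v.y = x_reg.y;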

@agray3 (Contributor, Author) replied:

I've now replaced these with __low2float and __high2float. NVCC gives an error if I try to assign x_reg to v directly, but these ops are in-register so they won't be on the critical path anyway.

@agray3 (Contributor, Author) commented Oct 10, 2024

> The dequantize_mul_mat_vec kernels on master are quite poorly written. At the same time they are also not needed anymore except for FP16. For this reason one of my medium-term goals is to remove this kernel and replace it with a dedicated FP16 matrix vector multiplication kernel (I think cuBLAS cannot really be used because it requires the same datatype for all matrices). The FP16 compilation option is also poorly designed and should be replaced with FAST_FP16_AVAILABLE. In this context I would then also be adding BF16 support.
>
> So if you have the time and motivation to work on FP16 performance my recommendation would be to completely scrap the dequantize_mul_mat_vec kernels and start from scratch. (I would still be willing to review this PR otherwise.)

Thanks Johannes. FWIW, as I mention in #9817, this kernel actually performs well (in my experiments, at least) on GDDR GPUs. HBM GPUs often require more careful tuning of memory operations to achieve a high fraction of the available bandwidth. I don't plan to start from scratch with this kernel, so I appreciate the review.

@JohannesGaessler (Collaborator) left a comment

| Model | GPU | Test | t/s master | t/s d150c7e | Speedup |
|---|---|---|---|---|---|
| llama 8B F16 | RX 6800 | tg128 | 17.46 | 17.54 | 1.00 |
| llama 8B F16 | RTX 3090 | tg128 | 50.29 | 51.06 | 1.02 |
| llama 8B F16 | RTX 4090 | tg128 | 57.85 | 57.77 | 1.00 |
| llama 8B F16 | P40 | tg128 | 17.23 | 18.57 | 1.08 |

@slaren merged commit 13dca2a into ggml-org:master on Oct 14, 2024 (53 checks passed).
@JohannesGaessler (Collaborator) commented:
Thanks, I wanted to merge this sooner but I forgot.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Dec 23, 2024

Labels: Nvidia GPU (Issues specific to Nvidia GPUs)
