Vectorize load instructions in dmmv f16 CUDA kernel #9816
Conversation
Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
See #9817.
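The change can be sketched as follows (a minimal illustration, not the PR's actual kernel; the pointer `x` and index `i` are assumed names):

```cuda
// Scalar loads: two separate 16-bit load instructions per pair of values.
const half x0 = x[i + 0];
const half x1 = x[i + 1];

// Vectorized load: a single 32-bit load fetches both halves at once,
// halving the number of load instructions. On bandwidth-bound HBM GPUs
// this can noticeably improve achieved memory throughput.
// Requires &x[i] to be 4-byte aligned.
const half2 x2 = *reinterpret_cast<const half2 *>(&x[i]);
```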
The `dequantize_mul_mat_vec` kernels on master are quite poorly written. At the same time they are also not needed anymore except for FP16. For this reason one of my medium-term goals is to remove this kernel and replace it with a dedicated FP16 matrix-vector multiplication kernel (I think cuBLAS cannot really be used because it requires the same datatype for all matrices). The FP16 compilation option is also poorly designed and should be replaced with `FAST_FP16_AVAILABLE`. In this context I would then also be adding BF16 support.

So if you have the time and motivation to work on FP16 performance my recommendation would be to completely scrap the `dequantize_mul_mat_vec` kernels and start from scratch. (I would still be willing to review this PR otherwise.)
ggml/src/ggml-cuda/dmmv.cu
Outdated
v.x = x_reg.x;
v.y = x_reg.y;
This needs to use instructions like `__low2float` in order to work correctly with HIP. Also, did you check the PTX code regarding whether or not these two lines are equivalent to assigning a `half2` directly?
I've now replaced these with `__low2float` and `__high2float`. NVCC gives an error if I try to assign `x_reg` to `v` directly, but these ops are in-register so they won't be on the critical path anyway.
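The agreed-upon fix can be sketched as follows (illustrative only; `x` and `i` are assumed names). The `__low2float`/`__high2float` intrinsics extract the low and high `half` of a `half2` and convert to `float`, and are available on both CUDA and HIP:

```cuda
// Load two f16 values with a single 32-bit vectorized load.
const half2 x_reg = *reinterpret_cast<const half2 *>(&x[i]);

// Convert via intrinsics rather than member access, so the code
// compiles and behaves identically under CUDA and HIP.
float2 v;
v.x = __low2float (x_reg);  // low  16 bits -> float
v.y = __high2float(x_reg);  // high 16 bits -> float
```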
Thanks Johannes. FWIW, as I mention in #9817, this kernel is actually performing well (from my experiments at least) on GDDR GPUs. HBM GPUs often require more careful tuning of memory ops to achieve a high fraction of the available bandwidth. I don't plan to start from scratch with this kernel, so I appreciate the review.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
| Model | GPU | Test | t/s master | t/s d150c7e | Speedup |
|---|---|---|---|---|---|
| llama 8B F16 | RX 6800 | tg128 | 17.46 | 17.54 | 1.00 |
| llama 8B F16 | RTX 3090 | tg128 | 50.29 | 51.06 | 1.02 |
| llama 8B F16 | RTX 4090 | tg128 | 57.85 | 57.77 | 1.00 |
| llama 8B F16 | P40 | tg128 | 17.23 | 18.57 | 1.08 |
Thanks, I wanted to merge this sooner but I forgot.
* Vectorize load instructions in dmmv f16 CUDA kernel

  Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>