
Conversation

@agray3 (Contributor) commented Oct 10, 2024

Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.
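For illustration, a minimal sketch of the idea with hypothetical names (this is not the actual dmmv.cu kernel): a single 32-bit half2 load replaces two separate 16-bit scalar loads, and the two halves are then unpacked in registers.

#include <cuda_fp16.h>

// Hypothetical example kernel: doubles every f16 element of x into y,
// processing two elements per thread.
__global__ void scale_f16(const half * __restrict__ x, float * __restrict__ y, const int n) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 >= n) {
        return;
    }

    // Scalar version: two separate 16-bit loads.
    //   const float a = __half2float(x[i + 0]);
    //   const float b = __half2float(x[i + 1]);

    // Vectorized version: a single 32-bit load of a half2, followed by
    // in-register unpacking of the low and high halves.
    const half2 v = *reinterpret_cast<const half2 *>(x + i);
    const float a = __low2float(v);
    const float b = __high2float(v);

    y[i + 0] = 2.0f * a;
    y[i + 1] = 2.0f * b;
}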

@agray3 (Contributor, Author) commented Oct 10, 2024

See #9817

@JohannesGaessler (Collaborator) left a comment

The dequantize_mul_mat_vec kernels on master are quite poorly written. At the same time they are also not needed anymore except for FP16. For this reason one of my medium-term goals is to remove this kernel and replace it with a dedicated FP16 matrix vector multiplication kernel (I think cuBLAS cannot really be used because it requires the same datatype for all matrices). The FP16 compilation option is also poorly designed and should be replaced with FAST_FP16_AVAILABLE. In this context I would then also be adding BF16 support.

So if you have the time and motivation to work on FP16 performance my recommendation would be to completely scrap the dequantize_mul_mat_vec kernels and start from scratch. (I would still be willing to review this PR otherwise.)

Comment on lines 422 to 423
v.x = x_reg.x;
v.y = x_reg.y;
@JohannesGaessler (Collaborator) commented:

This needs to use instructions like __low2float in order to work correctly with HIP. Also did you check the PTX code regarding whether or not these two lines are equivalent to assigning half2 directly?
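For illustration, a hedged sketch of what the suggested change could look like, assuming x_reg is a half2 and v is a float2 (the exact types in dmmv.cu may differ):

// fp16 intrinsics from <cuda_fp16.h>, available on both CUDA and HIP:
v.x = __low2float(x_reg);   // low  half -> float, instead of v.x = x_reg.x;
v.y = __high2float(x_reg);  // high half -> float, instead of v.y = x_reg.y;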

@agray3 (Contributor, Author) replied:

I've now replaced these with __low2float and __high2float. NVCC gives an error if I try to assign x_reg to v directly, but these ops are in-register so they won't be on the critical path anyway.

@agray3 (Contributor, Author) commented Oct 10, 2024

> The dequantize_mul_mat_vec kernels on master are quite poorly written. At the same time they are also not needed anymore except for FP16. For this reason one of my medium-term goals is to remove this kernel and replace it with a dedicated FP16 matrix vector multiplication kernel (I think cuBLAS cannot really be used because it requires the same datatype for all matrices). The FP16 compilation option is also poorly designed and should be replaced with FAST_FP16_AVAILABLE. In this context I would then also be adding BF16 support.
>
> So if you have the time and motivation to work on FP16 performance my recommendation would be to completely scrap the dequantize_mul_mat_vec kernels and start from scratch. (I would still be willing to review this PR otherwise.)

Thanks Johannes. FWIW, as I mention in #9817, this kernel actually performs well (in my experiments, at least) on GDDR GPUs. HBM GPUs often require more careful tuning of memory operations to achieve a high fraction of the available bandwidth. I don't plan to start from scratch with this kernel, so I appreciate the review.

@JohannesGaessler (Collaborator) left a comment

| Model | GPU | Test | t/s master | t/s d150c7e | Speedup |
|---|---|---|---|---|---|
| llama 8B F16 | RX 6800 | tg128 | 17.46 | 17.54 | 1.00 |
| llama 8B F16 | RTX 3090 | tg128 | 50.29 | 51.06 | 1.02 |
| llama 8B F16 | RTX 4090 | tg128 | 57.85 | 57.77 | 1.00 |
| llama 8B F16 | P40 | tg128 | 17.23 | 18.57 | 1.08 |

@slaren merged commit 13dca2a into ggml-org:master on Oct 14, 2024 (53 checks passed).
@JohannesGaessler (Collaborator) commented:
Thanks, I wanted to merge this sooner but I forgot.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
* Vectorize load instructions in dmmv f16 CUDA kernel

Replaces scalar with vector load instructions, which substantially
improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall
speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on
H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup.

* addressed comment

* Update ggml/src/ggml-cuda/dmmv.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Dec 23, 2024

Labels: Nvidia GPU (Issues specific to Nvidia GPUs)
