Skip to content

cuBLAS: refactor and optimize f16 mat mul performance#1259

Merged
slaren merged 4 commits into
ggml-org:masterfrom
slaren:cuda-mat-mul
May 1, 2023
Merged

cuBLAS: refactor and optimize f16 mat mul performance#1259
slaren merged 4 commits into
ggml-org:masterfrom
slaren:cuda-mat-mul

Conversation

@slaren
Copy link
Copy Markdown
Member

@slaren slaren commented Apr 30, 2023

Moves all the cuBLAS specific code from ggml.c to ggml-cuda.cu. This also makes ggml-cuda.h much simpler, since fewer definitions have to exposed now.

Additionally, improves mat mul performance by using multiple stream where possible (when multiplying 3 or 4-dimensional tensors), and by choosing between doing f16 x f32 mat muls either as f16 x f16 or as f32 x f32, depending on what requires less data to be transferred to the GPU.

Overall, improves perplexity times with cuBLAS by ~15%.

🤖 Generated by Copilot at 4e54943

Summary

🚀🧹🛠️

This pull request improves the performance, compatibility, and readability of the GGML library and the llama model loader. It refactors the CUDA and BLAS code, simplifies the error checking and memory management, and exposes some useful functions and macros. The main files affected are ggml-cuda.h, ggml.c, ggml.h, llama-util.h, and llama.cpp.

ggml refactored
CUDA and BLAS streamlined
Winter of llama

Walkthrough

  • Refactored the code for using cuBLAS for matrix multiplication in GGML, by moving the CUDA-related functions and macros to ggml-cuda.h and calling them from ggml.c with conditional compilation (link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link)
  • Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c (link, link)
  • Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available for other source files that use the GGML library (link, link)
  • Improved the code style and quality in ggml.c, by removing unused variables, empty lines, and redundant conditional compilation (link, link, link, link, link, link)

From #1233:

  • Enhanced the llama_buffer and llama_ctx_buffer structs in llama-util.h, by adding default constructors and disabling copy and move constructors and assignment operators, to prevent memory leaks or errors (link, link, link)
  • Optimized the initialization of temporary buffers in the llama_model_loader struct in llama.cpp, by using the constructor of the std::vector instead of the resize method (link)

@slaren slaren marked this pull request as draft April 30, 2023 22:13
@slaren
Copy link
Copy Markdown
Member Author

slaren commented Apr 30, 2023

Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c (link, link)

Specifically, this adds vector versions of ggml_fp16_to_fp32 and ggml_fp32_to_fp16. ggml_fp32_to_fp16_row is vectorized with F16C. This was necessary as GGML_FP32_TO_FP16 isn't visible from ggml-cuda.cu, and ggml_fp32_to_fp16 is too slow without inlining.

Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available for other source files that use the GGML library (link, link)

GGML_ASSERT is now exposed in ggml.h, I did this to be able to use it from ggml-cuda.cu, but if this is not desirable I can remove it.

@slaren slaren marked this pull request as ready for review May 1, 2023 11:38
Comment thread ggml.c
Comment thread ggml-cuda.cu Outdated
__half m; // min
half d; // delta
half m; // min
uint32_t qh; // 5-th bit of quants
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, should sync the CUDA block_q5_1 with the CPU one:

https://github.com/ggerganov/llama.cpp/blob/c0335b51f959ddf8c7b58bf497d10e5dc4730267/ggml.c#L736

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure why this isn't the case already, did you have any problems with alignment or anything else?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it in the same way as q5_0 and didn't notice any issues.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Q5_1 it works both ways.
For Q5_0, the uint32_t way does not work due to alignment issues, so we changed Q5_1 to uint8_t[4] for consistency

@slaren slaren merged commit 58b367c into ggml-org:master May 1, 2023
@slaren slaren deleted the cuda-mat-mul branch May 1, 2023 16:11
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request Jun 2, 2026
* cuBLAS: refactor, convert fp16 to fp32 on device

* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

* fix build

* cuBLAS: update block_q5_1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants