cuBLAS: refactor and optimize f16 mat mul performance by slaren · Pull Request #1259 · ggml-org/llama.cpp

slaren · 2023-04-30T22:06:50Z

Moves all the cuBLAS-specific code from ggml.c to ggml-cuda.cu. This also makes ggml-cuda.h much simpler, since fewer definitions have to be exposed now.

Additionally, improves mat mul performance by using multiple streams where possible (when multiplying 3- or 4-dimensional tensors), and by choosing between doing f16 x f32 mat muls either as f16 x f16 or as f32 x f32, depending on which requires less data to be transferred to the GPU.
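To make the two ideas concrete, here is a minimal C sketch, not the code from this PR. It assumes that each 2-D slice of a 3- or 4-dimensional tensor is assigned to a stream slot round-robin, and that the path is picked by comparing the bytes each plan would upload. `N_STREAMS`, `slice_stream`, `upload_plan`, `tensor_bytes`, and `prefer_f16` are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stream pool size; the PR text does not state the real value. */
enum { N_STREAMS = 8 };

/* Round-robin assignment of the 2-D slice (i02, i03) to a stream slot, so that
 * copies and multiplications of independent slices can overlap. */
static int slice_stream(int i02, int i03, int ne02) {
    return (i03 * ne02 + i02) % N_STREAMS;
}

/* Bytes that one plan would upload for the two operands. */
typedef struct {
    size_t src0_bytes;
    size_t src1_bytes;
} upload_plan;

static size_t tensor_bytes(size_t n_elements, size_t element_size) {
    return n_elements * element_size;
}

/* Pick the f16 x f16 plan only when it uploads strictly fewer bytes.
 * This ignores the cost of converting on the host, which is an
 * assumption of this sketch. */
static bool prefer_f16(upload_plan f16_plan, upload_plan f32_plan) {
    size_t f16_total = f16_plan.src0_bytes + f16_plan.src1_bytes;
    size_t f32_total = f32_plan.src0_bytes + f32_plan.src1_bytes;
    return f16_total < f32_total;
}
```

One possible accounting, again an assumption: the f16 plan uploads src1 narrowed to 2-byte halves, while the f32 plan uploads src1 as 4-byte floats and widens src0 on the device, as the commit title "convert fp16 to fp32 on device" suggests.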

Overall, improves perplexity times with cuBLAS by ~15%.

`🤖 Generated by Copilot at 4e54943`

Summary

🚀🧹🛠️

This pull request improves the performance, compatibility, and readability of the GGML library and the llama model loader. It refactors the CUDA and BLAS code, simplifies the error checking and memory management, and exposes some useful functions and macros. The main files affected are ggml-cuda.h, ggml.c, ggml.h, llama-util.h, and llama.cpp.

ggml refactored
CUDA and BLAS streamlined
Winter of llama

Walkthrough

Refactored the code for using cuBLAS for matrix multiplication in GGML, by moving the CUDA-related functions and macros to ggml-cuda.h and calling them from ggml.c with conditional compilation (link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link)
Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c (link, link)
Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available for other source files that use the GGML library (link, link)
Improved the code style and quality in ggml.c, by removing unused variables, empty lines, and redundant conditional compilation (link, link, link, link, link, link)

From #1233:

Enhanced the llama_buffer and llama_ctx_buffer structs in llama-util.h, by adding default constructors and disabling copy and move constructors and assignment operators, to prevent memory leaks or errors (link, link, link)
Optimized the initialization of temporary buffers in the llama_model_loader struct in llama.cpp, by using the constructor of the std::vector instead of the resize method (link)

slaren · 2023-04-30T22:34:07Z

Exposed the functions for converting between half-precision and single-precision floating-point numbers as part of the GGML API, by adding their declarations to ggml.h and removing them from ggml.c (link, link)

Specifically, this adds vector versions of ggml_fp16_to_fp32 and ggml_fp32_to_fp16. ggml_fp32_to_fp16_row is vectorized with F16C. This was necessary as GGML_FP32_TO_FP16 isn't visible from ggml-cuda.cu, and ggml_fp32_to_fp16 is too slow without inlining.
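The point of a row (vector) conversion is that the caller pays for one function call per row instead of one per element, and the per-element helper can be inlined inside the library. A minimal sketch of the f16 to f32 direction in portable C follows; it is not the ggml implementation, and `half_bits_to_float` and `half_to_float_row` are hypothetical names. The F16C path mentioned above is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one IEEE 754 binary16 bit pattern into a float. */
static inline float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                       /* signed zero */
        } else {
            /* subnormal half: shift until the implicit bit appears */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 + 1 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);   /* infinity or NaN */
    } else {
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);               /* bit cast without aliasing issues */
    return f;
}

/* Row version: one call converts n elements, so the helper above is
 * inlined into this loop rather than called across a library boundary. */
static void half_to_float_row(const uint16_t *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        y[i] = half_bits_to_float(x[i]);
    }
}
```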

Moved the macro for asserting conditions from ggml.c to ggml.h, to make it available for other source files that use the GGML library (link, link)

GGML_ASSERT is now exposed in ggml.h. I did this to be able to use it from ggml-cuda.cu, but if this is not desirable I can remove it.
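For readers unfamiliar with the pattern, an assert macro defined in a shared header can be used from any source file that includes it. The sketch below shows the general shape of such a macro; `SKETCH_ASSERT` and `checked_div` are hypothetical names, and this is not claimed to be the exact definition in ggml.h.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical header-level assert: report the failing expression with its
 * file and line, then abort. The do/while(0) wrapper makes the macro behave
 * like a single statement. */
#define SKETCH_ASSERT(x)                                          \
    do {                                                          \
        if (!(x)) {                                               \
            fprintf(stderr, "SKETCH_ASSERT: %s:%d: %s\n",         \
                    __FILE__, __LINE__, #x);                      \
            abort();                                              \
        }                                                         \
    } while (0)

/* Example caller, standing in for code in another source file. */
static int checked_div(int a, int b) {
    SKETCH_ASSERT(b != 0);
    return a / b;
}
```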


ggerganov · 2023-05-01T12:00:49Z

-    __half m;               // min
+    half d;                 // delta
+    half m;                 // min
    uint32_t qh;            // 5-th bit of quants


At some point, should sync the CUDA block_q5_1 with the CPU one:

https://github.com/ggerganov/llama.cpp/blob/c0335b51f959ddf8c7b58bf497d10e5dc4730267/ggml.c#L736

I am not entirely sure why this isn't the case already. Did you have any problems with alignment or anything else?

I updated it in the same way as q5_0 and didn't notice any issues.

For Q5_1 it works both ways.
For Q5_0, the uint32_t way does not work due to alignment issues, so we changed Q5_1 to uint8_t[4] for consistency.
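The alignment point can be illustrated with plain C structs. This is a sketch under stated assumptions, not the definitions from ggml: it assumes 32 quants per block, uses `uint16_t` as a stand-in for a 2-byte half, and assumes a common ABI where `uint32_t` is 4-byte aligned. All type and function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { QK = 32 };              /* assumed number of quants per block */
typedef uint16_t half_bits;    /* stand-in for a 2-byte half */

/* Q5_0-like layout with a uint32_t field: the compiler inserts 2 bytes of
 * padding after d so that qh is 4-byte aligned (24 bytes on common ABIs). */
typedef struct { half_bits d; uint32_t qh;    uint8_t qs[QK / 2]; } q5_0_u32;

/* Same fields with qh as bytes: no padding needed (22 bytes). */
typedef struct { half_bits d; uint8_t  qh[4]; uint8_t qs[QK / 2]; } q5_0_bytes;

/* Q5_1-like layout: d and m together fill 4 bytes, so qh lands on a
 * 4-byte boundary either way and both variants have the same size. */
typedef struct { half_bits d; half_bits m; uint32_t qh;    uint8_t qs[QK / 2]; } q5_1_u32;
typedef struct { half_bits d; half_bits m; uint8_t  qh[4]; uint8_t qs[QK / 2]; } q5_1_bytes;

/* Reading the 4 bytes back as one 32-bit value without an unaligned access. */
static uint32_t load_qh(const uint8_t qh[4]) {
    uint32_t v;
    memcpy(&v, qh, sizeof v);
    return v;
}
```

With the byte-array form, the struct size matches the sum of its fields, which matters when blocks are stored back to back and copied between host and device.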

* cuBLAS: refactor, convert fp16 to fp32 on device
* cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16
* fix build
* cuBLAS: update block_q5_1

slaren marked this pull request as draft April 30, 2023 22:13

slaren added 3 commits May 1, 2023 13:33

cuBLAS: refactor, convert fp16 to fp32 on device

cf93fdc

cuBLAS: use multiple streams, choose smartly between mul_mat_q and mul_mat_f16

a9ad140

fix build

4cd0a48

slaren force-pushed the cuda-mat-mul branch from 4fd9d4a to 4cd0a48 May 1, 2023 11:35

slaren marked this pull request as ready for review May 1, 2023 11:38

ggerganov reviewed May 1, 2023


Comment thread ggml.c

slaren mentioned this pull request May 1, 2023

Generalize quantize_fns for simpler FP16 handling #1237

Merged

ggerganov approved these changes May 1, 2023


cuBLAS: update block_q5_1

a79756b

slaren merged commit 58b367c into ggml-org:master May 1, 2023

slaren deleted the cuda-mat-mul branch May 1, 2023 16:11

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Closed

slaren merged 4 commits into ggml-org:master from slaren:cuda-mat-mul