
Any chance of adding Clblast support? #1059

Closed
Folko-Ven opened this issue Apr 19, 2023 · 12 comments · Fixed by #1164

Comments

@Folko-Ven
Contributor

Folko-Ven commented Apr 19, 2023

Since the latest release added support for cuBLAS, is there any chance of adding Clblast?
Koboldcpp (which, as I understand, also uses llama.cpp) already has it, so it shouldn't be that hard.

@slaren
Collaborator

slaren commented Apr 19, 2023

If the authors of Koboldcpp want to contribute that change I see no reason why it wouldn't be accepted, but the licenses are incompatible so we cannot just take their code and merge it here.

@ggerganov
Owner

ggerganov commented Apr 19, 2023

Also, if there is any performance benefit or code simplification thanks to adding CLBlast.
If not, there is no point in adding it.

@Azeirah
Contributor

Azeirah commented Apr 19, 2023

Also, if there is any performance benefit or code simplification thanks to adding CLBlast.
If not, there is no point in adding it.

Quoting from the CLBlast GitHub README (emphasis mine):

CLBlast is a modern, lightweight, performant and tunable OpenCL BLAS library written in C++11. It is designed to leverage the full performance potential of a wide variety of OpenCL devices from different vendors, including desktop and laptop GPUs, embedded GPUs, and other accelerators.

NVIDIA cuBLAS support is amazing, but it doesn't add anything for people with embedded devices (like phones or Raspberry Pis) or AMD GPUs.

I would only agree it is not worth adding if it has no benefit over OpenBLAS, but I doubt that is the case. I don't have any numbers to back that up, though.
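
For reference, CLBlast exposes a C API (clblast_c.h) that mirrors the BLAS interface, so hooking it into ggml's existing BLAS path is largely a matter of managing OpenCL buffers. Below is a minimal single-call sketch; error handling and cleanup are mostly omitted, and the build line is an assumption:

// Minimal sketch: single-precision GEMM through CLBlast's C API.
// Build (assumed): gcc sgemm_clblast.c -lclblast -lOpenCL
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <clblast_c.h>

int main(void) {
    const size_t m = 512, n = 4096, k = 4096; // shapes used in benchmarks below

    // Boilerplate: pick the first OpenCL platform/GPU and build a queue.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // Host matrices (row-major); computes C = alpha*A*B + beta*C.
    float *A = malloc(m * k * sizeof(float));
    float *B = malloc(k * n * sizeof(float));
    float *C = calloc(m * n, sizeof(float));
    for (size_t i = 0; i < m * k; i++) A[i] = 1.0f;
    for (size_t i = 0; i < k * n; i++) B[i] = 2.0f;

    cl_mem a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  m * k * sizeof(float), NULL, NULL);
    cl_mem b_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  k * n * sizeof(float), NULL, NULL);
    cl_mem c_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, m * n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(queue, a_buf, CL_TRUE, 0, m * k * sizeof(float), A, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, b_buf, CL_TRUE, 0, k * n * sizeof(float), B, 0, NULL, NULL);

    cl_event event = NULL;
    CLBlastStatusCode status = CLBlastSgemm(
        CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
        m, n, k,
        1.0f, a_buf, 0, k, b_buf, 0, n,
        0.0f, c_buf, 0, n,
        &queue, &event);
    if (status == CLBlastSuccess) {
        clWaitForEvents(1, &event); // kernel runs asynchronously until here
        clReleaseEvent(event);
        clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0, m * n * sizeof(float), C, 0, NULL, NULL);
        printf("C[0] = %f (expect %f)\n", C[0], 2.0f * k);
    }
    // Cleanup of buffers/queue/context omitted for brevity.
    return 0;
}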

@rabidcopy
Contributor

rabidcopy commented Apr 20, 2023

Here is a quick and dirty comparison between llama.cpp with OpenBLAS and koboldcpp with CLBlast.
OpenBLAS processing time for dan.txt

llama_print_timings: prompt eval time = 32412.01 ms /   399 tokens (   81.23 ms per token)

CLBlast processing time for dan.txt

Time Taken - Processing:26.7s (67ms/T)

Keep in mind this is on an RX 570, not a high-end card. There are more numbers here, but I can't verify the hardware they were measured on. The cuBLAS gains there seem comparable to CLBlast's; however, CLBlast was already doing dequantization on the GPU in those runs, while cuBLAS was not yet at the time, so the gap between the two is probably larger now.
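
(For scale: 32412.01 ms / 399 tokens ≈ 81.2 ms per token with OpenBLAS versus 67 ms per token with CLBlast, i.e. roughly 18% less time per token on that card.)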

@0cc4m
Collaborator

0cc4m commented Apr 21, 2023

I wrote the CLBlast code for koboldcpp. If there's interest here, it should be easy to port; I could open a PR.

@apcameron
Contributor

I wrote the CLBlast code for koboldcpp. If there's interest here, it should be easy to port; I could open a PR.

I have been playing with it on my RISC-V VisionFive 2 SBC and would like to see it incorporated and improved here.
In the few tests I ran, it does seem to make things faster on my board.

@ghost

ghost commented Apr 22, 2023

@0cc4m please do. I think it would be a good idea to have that functionality in upstream llama.cpp rather than as a Kobold-exclusive feature.

On the performance side, another user is reporting 50% gains with an NVIDIA 3060 on the CLBlast Kobold code. Granted, CLBlast is twice as slow as OpenBLAS on my hardware, but I'm using an integrated Intel HD 530. While AMD and NVIDIA users are likely better off with hipBLAS or cuBLAS, those with older AMD or Intel GPUs are stuck with CLBlast if they want hardware offload.

@cmp-nct
Contributor

cmp-nct commented Apr 24, 2023

The current cuBLAS integration is very basic (awesome work getting it in, it's just far from the boost it could be), and we may at some point need to choose one path if the implementations are not very compatible.

I don't know much about CLBlast, but it's open source while cuBLAS is fully closed source. If the dot-product performance is comparable, it's probably the better choice.
The CLBlast website's benchmarks are fairly outdated; it would be interesting to see how it performs vs cuBLAS on a good 30- or 40-series card.

@0cc4m
Collaborator

0cc4m commented Apr 24, 2023

I basically have a working version; the mul_mat speed is only slightly slower than cuBLAS. Here are some tests of it on my RTX 3060. The numbers at the end of each line are m, n, and k.
CuBLAS:

CBLAS = 6.322000 ms, 512 x 4096 x 4096
CBLAS = 6.053000 ms, 512 x 4096 x 4096
CBLAS = 6.006000 ms, 512 x 4096 x 4096
CBLAS = 6.430000 ms, 512 x 4096 x 4096
CBLAS = 15.937000 ms, 512 x 11008 x 4096
CBLAS = 14.502000 ms, 512 x 11008 x 4096
CBLAS = 14.397000 ms, 512 x 4096 x 11008

CLBlast:

CBLAS = 7.697000 ms, 512 x 4096 x 4096
CBLAS = 7.527000 ms, 512 x 4096 x 4096
CBLAS = 7.458000 ms, 512 x 4096 x 4096
CBLAS = 7.656000 ms, 512 x 4096 x 4096
CBLAS = 18.108000 ms, 512 x 11008 x 4096
CBLAS = 17.960000 ms, 512 x 11008 x 4096
CBLAS = 17.942000 ms, 512 x 4096 x 11008

However, I am observing some very strange behavior. In total, CLBlast is still much slower than cuBLAS:
CLBlast:
llama_print_timings: prompt eval time = 16477.85 ms / 622 tokens ( 26.49 ms per token)

CuBLAS:
llama_print_timings: prompt eval time = 5508.83 ms / 622 tokens ( 8.86 ms per token)

But when I dig deeper, I find that building with cuBLAS enabled seems to massively speed up entirely unrelated operations. I have so far not found any reason for this. Here is some data; "CuBLAS (no mulmat)" means I disabled the BLAS acceleration:
OP - CuBLAS - CuBLAS (no mulmat) - CLBlast - OpenBLAS

perf_total_per_op_us[            NONE] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             DUP] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             ADD] =  18.456 ms    19.994 ms     69.657 ms    71.692 ms
perf_total_per_op_us[             SUB] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             MUL] = 171.661 ms   180.016 ms    211.013 ms   353.589 ms
perf_total_per_op_us[             DIV] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             SQR] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            SQRT] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             SUM] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            MEAN] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[          REPEAT] =  29.396 ms    30.701 ms     33.975 ms    34.173 ms
perf_total_per_op_us[             ABS] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             SGN] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[             NEG] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            STEP] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            RELU] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            GELU] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[            SILU] =  23.951 ms    27.628 ms    393.741 ms   441.933 ms
perf_total_per_op_us[            NORM] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[        RMS_NORM] =  17.480 ms    18.173 ms    178.540 ms   223.740 ms
perf_total_per_op_us[         MUL_MAT] = 3153.423 ms  15204.046 ms  9531.328 ms  16918.383 ms
perf_total_per_op_us[           SCALE] =   3.023 ms     3.040 ms     57.477 ms    58.138 ms
perf_total_per_op_us[             CPY] =  70.451 ms    67.351 ms    309.938 ms   352.598 ms
perf_total_per_op_us[            CONT] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[         RESHAPE] =   0.104 ms     0.082 ms      0.063 ms     0.132 ms
perf_total_per_op_us[            VIEW] =   0.064 ms     0.085 ms      0.063 ms     0.109 ms
perf_total_per_op_us[         PERMUTE] =   0.060 ms     0.061 ms      0.051 ms     0.085 ms
perf_total_per_op_us[       TRANSPOSE] =   0.013 ms     0.019 ms      0.013 ms     0.029 ms
perf_total_per_op_us[        GET_ROWS] =   0.973 ms     0.984 ms      1.050 ms     1.384 ms
perf_total_per_op_us[   DIAG_MASK_INF] = 131.878 ms   129.145 ms    115.912 ms   116.610 ms
perf_total_per_op_us[        SOFT_MAX] =  61.720 ms    65.932 ms    855.297 ms   837.989 ms
perf_total_per_op_us[            ROPE] =  34.551 ms    38.595 ms    490.074 ms   909.771 ms
perf_total_per_op_us[      CONV_1D_1S] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[      CONV_1D_2S] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[      FLASH_ATTN] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[        FLASH_FF] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[       MAP_UNARY] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms
perf_total_per_op_us[      MAP_BINARY] =   0.000 ms     0.000 ms      0.000 ms     0.000 ms

@slaren Do you have any idea what causes the cuBLAS build to accelerate, for example, RMS norm by a factor of 10, even though that function runs entirely on the CPU? Even my CLBlast version is slightly faster there than OpenBLAS, but that difference is small enough to be run-to-run variance. Is there something wrong with the library's perf measurements?
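
For context on where these numbers come from: the per-op table matches the format that ggml's optional GGML_PERF instrumentation prints via ggml_graph_print(), and the per-GEMM lines could be reproduced with a simple wall-clock wrapper around the CLBlast call from the earlier sketch. A sketch along those lines (the time_sgemm helper is hypothetical, not llama.cpp code):

// Hypothetical wall-clock wrapper that prints in the same
// "CBLAS = ... ms, m x n x k" format as the numbers above.
#include <stdio.h>
#include <time.h>
#include <CL/cl.h>
#include <clblast_c.h>

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

// Assumes queue and row-major buffers were set up as in the earlier sketch.
static void time_sgemm(cl_command_queue queue, cl_mem a, cl_mem b, cl_mem c,
                       size_t m, size_t n, size_t k) {
    cl_event ev = NULL;
    double t0 = now_ms();
    CLBlastSgemm(CLBlastLayoutRowMajor, CLBlastTransposeNo, CLBlastTransposeNo,
                 m, n, k, 1.0f, a, 0, k, b, 0, n, 0.0f, c, 0, n, &queue, &ev);
    clWaitForEvents(1, &ev); // count kernel completion, not just enqueue time
    clReleaseEvent(ev);
    printf("CBLAS = %f ms, %zu x %zu x %zu\n", now_ms() - t0, m, n, k);
}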

@slaren
Collaborator

slaren commented Apr 24, 2023

I don't see any reason for that. I would be inclined to believe that there is a measurement error somewhere.

@LostRuins
Collaborator

LostRuins commented Apr 24, 2023

@slaren @0cc4m we've solved the issue - apparently there was code in llama.cpp that switched the graph to single-threaded mode during BLAS calculations. That is understandable for OpenBLAS, but unnecessary for GPU-accelerated approaches. When cuBLAS was added, ggml_cpu_has_cublas() was also added as an exception to the thread-count limiter, but we missed that in our earlier attempts, so the rest of the graph ran single-threaded and overall speed dropped greatly.

Now that it is fixed, CLBlast performs nearly on par with CuBLAS (still slightly slower).

Edit: If anyone wants to try out our implementation, check out KoboldCpp
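
To make the mechanism concrete, here is a sketch of the thread-limiting pattern being described, condensed into a hypothetical helper (eval_thread_count is not the actual llama.cpp code; ggml_cpu_has_blas() and ggml_cpu_has_cublas() are the real ggml feature checks):

#include "ggml.h"

// Hypothetical condensation of the thread-count limiter described above.
// With a CPU BLAS library (e.g. OpenBLAS), BLAS runs multi-threaded
// internally and extra graph threads only spin-wait on it, so drop to a
// single thread for large prompts. A GPU BLAS backend (cuBLAS, CLBlast)
// leaves the CPU free, so the rest of the graph should keep all threads.
static int eval_thread_count(int n_tokens, int n_threads) {
    if (n_tokens >= 32 && ggml_cpu_has_blas() && !ggml_cpu_has_cublas()) {
        return 1; // let the CPU BLAS library own the cores
    }
    return n_threads; // GPU BLAS: keep full graph threading
}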

@Folko-Ven
Contributor Author

This is great news! Not everyone has NVIDIA cards, and OpenCL supports even fossilized dinosaur bones.
