OpenCL: Fixes for older devices. #1435

Merged
merged 19 commits into from
May 20, 2023
Conversation

SlyEcho
Collaborator

@SlyEcho SlyEcho commented May 13, 2023

Removed the constant in the array definitions; we can't use defines because the C preprocessor does not process the kernel code, given the way it is embedded in the source.

The platform and device selection has also been improved; it is now possible to use a string to match platforms:

  • GGML_OPENCL_PLATFORM
  • GGML_OPENCL_DEVICE
# default:
./main ...

# AMD:
GGML_OPENCL_PLATFORM=AMD ./main ...

etc. Note that I changed the name of the environment variable, because it is not really related to CLBlast itself, only to OpenCL in general.

Issue: #1429

Two mallocs and frees removed as well :)

@SlyEcho
Collaborator Author

SlyEcho commented May 13, 2023

I managed to crash my GPU when I tried using Clover. Maybe we should have a platform blocklist? Or could it be made to work…

@SlyEcho SlyEcho added the bug Something isn't working label May 13, 2023
ggml-opencl.c (review thread, outdated)
@TheBloke
Contributor

Thanks very much for working on this @SlyEcho

It is now picking the right device the first time. But now I get a new error message: <program source>:3:635: error: use of undeclared identifier 'qk'

Full log:

tomj@Eddie ~/src $ git clone -b clfixes https://github.com/SlyEcho/llama.cpp llama-sly
Cloning into 'llama-sly'...
remote: Enumerating objects: 2795, done.
remote: Counting objects: 100% (1052/1052), done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 2795 (delta 972), reused 919 (delta 899), pack-reused 1743
Receiving objects: 100% (2795/2795), 2.48 MiB | 11.12 MiB/s, done.
Resolving deltas: 100% (1795/1795), done.
tomj@Eddie ~/src $ cd llama-sly
tomj@Eddie ~/src/llama-sly (clfixes)$ make clean && LLAMA_CLBLAST=1 make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:   -framework Accelerate -lclblast -framework OpenCL
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST   -c ggml.c -o ggml.o
ggml.c:2002:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
    GGML_F16_VEC_REDUCE(sumf, sum);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1646:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
#define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
                                    ^
ggml.c:1636:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
#define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
                                ^
ggml.c:1583:11: note: expanded from macro 'GGML_F32x8_REDUCE'
    res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
        ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2982:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
        GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1646:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
#define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
                                    ^
ggml.c:1636:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
#define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
                                ^
ggml.c:1583:11: note: expanded from macro 'GGML_F32x8_REDUCE'
    res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
        ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:9613:23: warning: unused variable 'wdata' [-Wunused-variable]
        float * const wdata = params->wdata;
                      ^
ggml.c:9614:34: warning: unused variable 'dequantize_row_q' [-Wunused-variable]
        dequantize_row_q_t const dequantize_row_q = quantize_fns[type].dequantize_row_q;
                                 ^
ggml.c:477:23: warning: unused function 'mul_sum_i8_pairs' [-Wunused-function]
static inline __m128i mul_sum_i8_pairs(const __m128i x, const __m128i y) {
                      ^
ggml.c:508:19: warning: unused function 'hsum_i32_4' [-Wunused-function]
static inline int hsum_i32_4(const __m128i a) {
                  ^
ggml.c:563:23: warning: unused function 'packNibbles' [-Wunused-function]
static inline __m128i packNibbles( __m256i bytes )
                      ^
7 warnings generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
examples/common.cpp:759:24: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
        if (input_char == WEOF || input_char == 0x04 /* Ctrl+D*/) {
            ~~~~~~~~~~ ^  ~~~~
examples/common.cpp:774:45: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
                while ((code = getchar32()) != WEOF) {
                        ~~~~~~~~~~~~~~~~~~  ^  ~~~~
2 warnings generated.
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
In file included from ggml-opencl.c:4:
/usr/local/include/clblast_c.h:1686:47: warning: a function declaration without a prototype is deprecated in all versions of C [-Wstrict-prototypes]
CLBlastStatusCode PUBLIC_API CLBlastClearCache();
                                              ^
                                               void
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o ggml-opencl.o -o main  -framework Accelerate -lclblast -framework OpenCL

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o ggml-opencl.o -o quantize  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-opencl.o -o quantize-stats  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-opencl.o -o perplexity  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-opencl.o -o embedding  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o ggml-opencl.o -o vdot  -framework Accelerate -lclblast -framework OpenCL
tomj@Eddie ~/src/llama-sly (clfixes)$  ./main -t 16 -m ~/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin -n 512 -p "### Instruction: write a story about llamas\n### Response:"
main: build = 549 (0453ce3)
main: seed  = 1684058891
llama.cpp: loading model from /Users/tomj/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
ggml_opencl: using platform: 'Apple'
ggml_opencl: using device: 'AMD Radeon RX 6900 XT Compute Engine'
<program source>:3:635: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
<program source>:3:660: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
<program source>:3:669: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ^
<program source>:3:962: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
<program source>:3:991: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
<program source>:3:1000: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^
<program source>:3:1466: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
<program source>:3:1491: error: use of undeclared identifier 'qk'
<program source>:3:1500: error: use of undeclared identifier 'qk'
<program source>:3:1997: error: use of undeclared identifier 'qk'
<program source>:3:2026: error: use of undeclared identifier 'qk'
<program source>:3:2035: error: use of undeclared identifier 'qk'
<program source>:3:2239: error: use of undeclared identifier 'qk'

tomj@Eddie ~/src/llama-sly (clfixes)$

@SlyEcho SlyEcho marked this pull request as draft May 14, 2023 10:19
@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

Sorry, I have some difficulties running OpenCL right now because my home desktop is completely locked up from the broken Mesa Clover driver.

I put it to draft right now and will change back when I have run tests on this code.

@TheBloke
Copy link
Contributor

TheBloke commented May 14, 2023

That commit got it running! Thank you!

It doesn't seem to be using the GPU though.
image

And performance figures are identical when comparing to llama.cpp compiled without LLAMA_CLBLAST

Is there any command line argument I need to use? I tried -ngl 32 but that did nothing - is that option only for CUBLAS?

@swittk
Copy link
Contributor

swittk commented May 14, 2023

For me on my hackintosh with an RX560, without specifying the device I want to use, I get the error: ggml_opencl: clGetContextInfo error -30 at ggml-opencl.c:269. (It appears that by default device 0 is the CPU itself; not even the Iris graphics. idk why either.)

But if I export GGML_OPENCL_DEVICE=2 (for the Radeon RX560 GPU) before and run the main executable, it works!
Screen Shot 2566-05-14 at 17 38 53 copy

@TheBloke
Copy link
Contributor

Weird, it just doesn't seem to use the GPU at all for me. It selects it OK, but then there's no difference in performance and no GPU usage % that I can see

image

@TheBloke
Copy link
Contributor

Ohh OK sorry I understand now. It's because I had a short prompt - I forgot that CLBLAST is only for prompt evaluation at the moment?

I just wrote a really long prompt like you had, and then I saw some GPU usage
image

Thank you again @SlyEcho !

@swittk
Copy link
Contributor

swittk commented May 14, 2023

Yes, it's currently only for prompt evaluation, and I think it only kicks in once the prompt batch reaches a certain length. 😄

@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

It doesn't do generation on OpenCL right now, like it does for CUDA.

Maybe if I get my setup running again I can take a shot at it. Probably in a different PR.

@TheBloke
Copy link
Contributor

Understood. It's cool just seeing it using the GPU at all!

But of course it'd be amazing if it could one day do the same as CUBLAS as well.

Thanks for all your work on this.

@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

Testing

I got my Steam Deck to run containers again and was able to run the code there.

Model files:

666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847  llama-7b-f16.bin
417111a40c36bff7ae6c6b3f773ac6efdb1c46584ef1077a1f3404d668e3944f  llama-7b-q4_0.bin
0fc3f4925923cafe4681370e863319e8ff8f2d760e6b3f5435b415a407aa8d56  llama-7b-q4_1.bin
edfcb8202feb4c1ec827cd645f4e6627f7a85e95538c4858f0408993b5f0d994  llama-7b-q5_0.bin
72040d380ab1067dc08c28d5f16269453bf1d4d7172c24424d4300d8474b42b6  llama-7b-q5_1.bin
d3e36532ac11c4a63798ac6ec1471c1dc5a89305c9dec0319dfcb7efc146d001  llama-7b-q8_0.bin

Testing command:

for q in q4_0 q4_1 q5_0 q5_1 q8_0 f16; do
    ./bin/perplexity -m ../models/llama-7b-$q.bin --no-mmap -f ../models/wiki.test.mini;
done

Results:

7B Q4_0
main: build = 0 (unknown)
main: seed  = 1684072095
llama.cpp: loading model from ../models/llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113752.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.26 seconds per pass - ETA 3 minutes
[1]4.4546,[2]4.9404,[3]5.8280,[4]6.4844,[5]6.5855,[6]6.5088,[7]6.6928,[8]6.8061,[9]7.1428,[10]7.3867,[11]7.5941,[12]7.6133,

llama_print_timings:        load time = 24988.32 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 201847.31 ms /  6144 tokens (   32.85 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 208709.02 ms
7B Q4_1
main: build = 0 (unknown)
main: seed  = 1684072304
llama.cpp: loading model from ../models/llama-7b-q4_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.12 seconds per pass - ETA 3 minutes
[1]4.4308,[2]4.8581,[3]5.7560,[4]6.3742,[5]6.4791,[6]6.4509,[7]6.6431,[8]6.7450,[9]7.0653,[10]7.3208,[11]7.5307,[12]7.5741,

llama_print_timings:        load time = 25209.34 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 198820.19 ms /  6144 tokens (   32.36 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 206035.26 ms
7B Q5_0
main: build = 0 (unknown)
main: seed  = 1684072511
llama.cpp: loading model from ../models/llama-7b-q5_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4525016.75 KB
llama_model_load_internal: mem required  = 6210.96 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.41 seconds per pass - ETA 3 minutes
[1]4.2473,[2]4.7592,[3]5.6487,[4]6.2761,[5]6.3829,[6]6.3595,[7]6.5483,[8]6.6444,[9]6.9754,[10]7.2143,[11]7.4127,[12]7.4464,

llama_print_timings:        load time = 25059.42 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 204714.69 ms /  6144 tokens (   33.32 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 211418.33 ms
7B Q5_1
main: build = 0 (unknown)
main: seed  = 1684072723
llama.cpp: loading model from ../models/llama-7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.57 seconds per pass - ETA 3 minutes
[1]4.2727,[2]4.7374,[3]5.6310,[4]6.2173,[5]6.3480,[6]6.3057,[7]6.4954,[8]6.5895,[9]6.9196,[10]7.1579,[11]7.3579,[12]7.3818,

llama_print_timings:        load time = 25792.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 204019.29 ms /  6144 tokens (   33.21 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 211384.88 ms
7B Q8_0
main: build = 551 (394dabb)
main: seed  = 1684080494
llama.cpp: loading model from ../models/llama-7b-q8_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403864.75 KB
llama_model_load_internal: mem required  = 9022.34 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.52 seconds per pass - ETA 3 minutes
[1]4.2285,[2]4.7008,[3]5.5712,[4]6.1758,[5]6.2967,[6]6.2676,[7]6.4630,[8]6.5547,[9]6.8740,[10]7.1201,[11]7.3160,[12]7.3370,

llama_print_timings:        load time = 29735.79 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 198337.91 ms /  6144 tokens (   32.28 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 210448.65 ms
7B F16
main: build = 0 (unknown)
main: seed  = 1684073144
llama.cpp: loading model from ../models/llama-7b-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 13161560.75 KB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 211.29 seconds per pass - ETA 42 minutes
[1]4.2366,[2]5.0812,[3]5.8747,[4]6.4266,[5]6.5086,[6]6.4413,[7]6.6143,[8]7.2502,[9]7.5188,[10]7.7195,[11]7.8910,[12]7.8650,

llama_print_timings:        load time = 244849.08 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 2375526.16 ms /  6144 tokens (  386.64 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 2412459.23 ms

@SlyEcho
Collaborator Author

SlyEcho commented May 14, 2023

I think Q8_0 is now fixed.

@SlyEcho SlyEcho marked this pull request as ready for review May 14, 2023 16:13
@0cc4m
Collaborator

0cc4m commented May 14, 2023

It doesn't do the generating on OpenCL right now, like it does for CUDA.

Maybe if I get my setup running again I can take a shot at it. Probably in a different PR.

I got relatively far with that already. I'll make a PR soon.

@SlyEcho
Collaborator Author

SlyEcho commented May 15, 2023

OK, I tested it on an old Mac as well and it "works". It needs a small batch size (32) or CLBlast will not get enough memory. It is also 4 times slower than the CPU.

@0cc4m
Collaborator

0cc4m commented May 15, 2023

Nice! I tested it and it works as intended, mostly. It still selects my CPU by default, even though your code looks like it should pick the GPU. I'll take a closer look later today unless you figure it out first.

@SlyEcho
Collaborator Author

SlyEcho commented May 15, 2023

@0cc4m, feel free to take my really complex platform and device init code and experiment. I think I tried to follow the "default" logic where I pass in NULL. I think it is kind of similar like what clinfo shows in the end. Maybe it's not the right thing to do?

We could iterate all platforms and devices and select the first GPU, although I would also try to skip the Clover platform, because for me it is even worse than just not working: it causes the GPU to lock up requiring the GPU to be reset in the best case and power cycled in the worst case.

@SlyEcho
Collaborator Author

SlyEcho commented May 17, 2023

Since we don't access the value directly, it might as well be a ushort type, but then, is vload_half() even supported?

It really does depend on the platform itself so much, since the code has to be compiled at run time. Older drivers have worse compiler support, while newer ones have something like modern LLVM.

Would be nice to have SPIR-V or something...

@AutonomicPerfectionist
Contributor

AutonomicPerfectionist commented May 17, 2023

Which platform and device are you using?

Platform: AMD Accelerated Parallel Processing
Device: Oland
ROCm SDK version: 20.40

It's an old Radeon HD 8570 with 1 GB of VRAM. I was mainly using it for testing purposes; I found its prompt-evaluation performance was worse than the CPU's, so if it's the only card exhibiting the issue, supporting it is not a need for me. Just thought I'd make y'all aware of it in case it affects other users. This card was a nightmare to get set up: only one particular version of AMD's software works, and only on one particular version of Ubuntu (20.04), and I still needed to fiddle with library paths and names.

@SlyEcho
Collaborator Author

SlyEcho commented May 17, 2023

ROCm SDK version: 20.40

Latest version of ROCm is 5.5, did you mean Ubuntu 20.04?

@AutonomicPerfectionist
Contributor

AMD's documentation is confusing, sorry. Meant (I think) Radeon software / amdgpu:

https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-40

@SlyEcho
Collaborator Author

SlyEcho commented May 19, 2023

@0cc4m, Now everything (and I mean everything) is wrapped in CL_CHECK() macros.

I even figured out a way to handle the calls that return the error by reference, by (ab)using the comma operator.

Maybe with C++, there's a better way with lambdas or whatever.

Collaborator

@0cc4m 0cc4m left a comment


I like the CL_CHECK improvements. Only some oddities with the device selection code left. It's getting a little too complicated for my taste, but I see the purpose.

Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

@SlyEcho
Collaborator Author

SlyEcho commented May 20, 2023

I've now rewritten the selection logic: it first scans all devices across all platforms (well, up to 16 platforms and 16 devices...)

Then the user's selection is applied. I also match against the platform vendor string.

It now tries to choose GPU devices by default.

GGML_OPENCL_PLATFORM=AMD     ./main # select AMD and look for GPUs there
GGML_OPENCL_PLATFORM=pocl    ./main # choose pocl but not a GPU, so show warning
GGML_OPENCL_PLATFORM=rusticl ./main # abort because there are no devices
GGML_OPENCL_PLATFORM=0       ./main # pocl because it's the first
GGML_OPENCL_DEVICE=Intel     ./main # find a device Intel, which pocl has
GGML_OPENCL_DEVICE=gfx900    ./main # use the Vega from the AMD OpenCL platform
GGML_OPENCL_DEVICE=Vega      ./main # use the Vega from Mesa Clover 💩

# you can apply both filters too but the device numbers are not per platform any more.
GGML_OPENCL_PLATFORM=pocl GGML_OPENCL_DEVICE=Intel ./main

What strings to use? Well, I didn't include a listing of the devices, but you can see them with the clinfo -l command, etc.

Device numbers are now absolute, so you can't select, say, the second device of the third platform. If you filter by name, the first matching device is selected, but you can also give the absolute number of the device you specifically want. llama.cpp can only use one device right now anyway.

Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

ggml-opencl.c Outdated
Comment on lines 232 to 233
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
Contributor


warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
-if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
-else { CL_CHECK(clGetDeviceIDsError); }
+if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0;
+} else CL_CHECK(clGetDeviceIDsError);

ggml-opencl.c Outdated
cl_device_id device_ids[NDEV];
cl_int clGetDeviceIDsError = clGetDeviceIDs(p->id, CL_DEVICE_TYPE_ALL, NDEV, device_ids, &p->n_devices);
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
Contributor


warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
-else { CL_CHECK(clGetDeviceIDsError); }
+else { CL_CHECK(clGetDeviceIDsError);
+}

Collaborator

@0cc4m 0cc4m left a comment


Nice, LGTM

@SlyEcho SlyEcho merged commit 9ecb30f into ggerganov:master May 20, 2023
@SlyEcho SlyEcho deleted the clfixes branch May 20, 2023 14:57
@Hyenadae

Hyenadae commented May 21, 2023

So I can confirm it's doing 'something' on my Ubuntu 20.04 system: 32 GB DDR4, i9-9500T (3.2 GHz, 6 cores, mitigations off, power limit increased). No specific errors, but odd/horrible performance, which I kinda expected.

Model is VicUnlocked 30B q4_1 and latest llama.cpp as of writing

Happily detects my iGPU which I installed the NEO drivers on a while ago:
Intel(R) OpenCL HD Graphics / GEN 9 Graphics NEO
2.2.11 ICD loader ver, profile OpenCL 2.1
24CU @ 1100MHz Max
24.91GB Global memory size (so, no obvious OOM yet... and I did export the GPU alloc = 100 stuff)

Normal CPU GGML (6 threads) is about 350 ms/token with 28 GB/s IMC read during compute and 0% GPU usage.
Enabling/compiling CLBlast 1.5.2 from the PPA greatly reduces the speed regardless of the number of layers on the GPU (I tried 6, 30, and 60).

Using the GPU, the token time went to 1400 ms/token with an eval time of 720 ms/token, at 100% CPU usage but 70% GPU 3D usage per intel_gpu_top.
Memory stats were 3.5 GB/s read and write to the IMC (odd bottleneck) until it jumped back to what it was doing before on the CPU: 28 GB/s read, 150 MB/s write.

Sample and eval times were the same for both the CPU-only and OpenCL-enabled builds. I tried a 13B q4_1 model too, with the same huge speed reduction; I wonder if it's an architectural thing or if the iGPU is just way too weak.
