OpenCL: Fixes for older devices. #1435

Merged
merged 19 commits into from
May 20, 2023
Conversation

SlyEcho
Collaborator

@SlyEcho SlyEcho commented May 13, 2023

Removed the constant in the array definitions; we can't use defines because the C preprocessor does not process the kernel code, given the way it is embedded in the source.

The platform and device selection has also been improved; it is now possible to use a string to match platforms:

  • GGML_OPENCL_PLATFORM
  • GGML_OPENCL_DEVICE
# default:
./main ...

# AMD:
GGML_OPENCL_PLATFORM=AMD ./main ...

etc. Note that I changed the name of the environment variable, because it is not really related to CLBlast itself, only to OpenCL in general.

Issue: #1429

Two mallocs and frees removed as well :)

@SlyEcho
Collaborator Author

SlyEcho commented May 13, 2023

I managed to crash my GPU when I tried using Clover. Maybe we should have a platform blocklist? Or could it be made to work…

@SlyEcho SlyEcho added the bug Something isn't working label May 13, 2023
ggml-opencl.c (review thread, outdated)
@TheBloke
Contributor

Thanks very much for working on this @SlyEcho

It is now picking the right device the first time. But now I get a new error message: <program source>:3:635: error: use of undeclared identifier 'qk'

Full log:

tomj@Eddie ~/src $ git clone -b clfixes https://github.com/SlyEcho/llama.cpp llama-sly
Cloning into 'llama-sly'...
remote: Enumerating objects: 2795, done.
remote: Counting objects: 100% (1052/1052), done.
remote: Compressing objects: 100% (153/153), done.
remote: Total 2795 (delta 972), reused 919 (delta 899), pack-reused 1743
Receiving objects: 100% (2795/2795), 2.48 MiB | 11.12 MiB/s, done.
Resolving deltas: 100% (1795/1795), done.
tomj@Eddie ~/src $ cd llama-sly
tomj@Eddie ~/src/llama-sly (clfixes)$ make clean && LLAMA_CLBLAST=1 make
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-matmult save-load-state build-info.h
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:   -framework Accelerate -lclblast -framework OpenCL
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST   -c ggml.c -o ggml.o
ggml.c:2002:5: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
    GGML_F16_VEC_REDUCE(sumf, sum);
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1646:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
#define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
                                    ^
ggml.c:1636:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
#define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
                                ^
ggml.c:1583:11: note: expanded from macro 'GGML_F32x8_REDUCE'
    res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
        ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:2982:9: warning: implicit conversion increases floating-point precision: 'float' to 'ggml_float' (aka 'double') [-Wdouble-promotion]
        GGML_F16_VEC_REDUCE(sumf[k], sum[k]);
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:1646:37: note: expanded from macro 'GGML_F16_VEC_REDUCE'
#define GGML_F16_VEC_REDUCE         GGML_F32Cx8_REDUCE
                                    ^
ggml.c:1636:33: note: expanded from macro 'GGML_F32Cx8_REDUCE'
#define GGML_F32Cx8_REDUCE      GGML_F32x8_REDUCE
                                ^
ggml.c:1583:11: note: expanded from macro 'GGML_F32x8_REDUCE'
    res = _mm_cvtss_f32(_mm_hadd_ps(t1, t1));                     \
        ~ ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ggml.c:9613:23: warning: unused variable 'wdata' [-Wunused-variable]
        float * const wdata = params->wdata;
                      ^
ggml.c:9614:34: warning: unused variable 'dequantize_row_q' [-Wunused-variable]
        dequantize_row_q_t const dequantize_row_q = quantize_fns[type].dequantize_row_q;
                                 ^
ggml.c:477:23: warning: unused function 'mul_sum_i8_pairs' [-Wunused-function]
static inline __m128i mul_sum_i8_pairs(const __m128i x, const __m128i y) {
                      ^
ggml.c:508:19: warning: unused function 'hsum_i32_4' [-Wunused-function]
static inline int hsum_i32_4(const __m128i a) {
                  ^
ggml.c:563:23: warning: unused function 'packNibbles' [-Wunused-function]
static inline __m128i packNibbles( __m256i bytes )
                      ^
7 warnings generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
examples/common.cpp:759:24: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
        if (input_char == WEOF || input_char == 0x04 /* Ctrl+D*/) {
            ~~~~~~~~~~ ^  ~~~~
examples/common.cpp:774:45: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
                while ((code = getchar32()) != WEOF) {
                        ~~~~~~~~~~~~~~~~~~  ^  ~~~~
2 warnings generated.
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
In file included from ggml-opencl.c:4:
/usr/local/include/clblast_c.h:1686:47: warning: a function declaration without a prototype is deprecated in all versions of C [-Wstrict-prototypes]
CLBlastStatusCode PUBLIC_API CLBlastClearCache();
                                              ^
                                               void
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o ggml-opencl.o -o main  -framework Accelerate -lclblast -framework OpenCL

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o ggml-opencl.o -o quantize  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-opencl.o -o quantize-stats  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-opencl.o -o perplexity  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-opencl.o -o embedding  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o ggml-opencl.o -o vdot  -framework Accelerate -lclblast -framework OpenCL
tomj@Eddie ~/src/llama-sly (clfixes)$  ./main -t 16 -m ~/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin -n 512 -p "### Instruction: write a story about llamas\n### Response:"
main: build = 549 (0453ce3)
main: seed  = 1684058891
llama.cpp: loading model from /Users/tomj/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  90.75 KB
llama_model_load_internal: mem required  = 11359.05 MB (+ 1608.00 MB per state)
ggml_opencl: using platform: 'Apple'
ggml_opencl: using device: 'AMD Radeon RX 6900 XT Compute Engine'
<program source>:3:635: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
<program source>:3:660: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   ^
<program source>:3:669: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ^
<program source>:3:962: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
<program source>:3:991: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
<program source>:3:1000: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       ^
<program source>:3:1466: error: use of undeclared identifier 'qk'
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk 
+ j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { const uint i = get_global_id(0) / 32; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         ^
<program source>:3:1491: error: use of undeclared identifier 'qk'
<program source>:3:1500: error: use of undeclared identifier 'qk'
<program source>:3:1997: error: use of undeclared identifier 'qk'
<program source>:3:2026: error: use of undeclared identifier 'qk'
<program source>:3:2035: error: use of undeclared identifier 'qk'
<program source>:3:2239: error: use of undeclared identifier 'qk'

tomj@Eddie ~/src/llama-sly (clfixes)$

@SlyEcho SlyEcho marked this pull request as draft May 14, 2023 10:19
@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

Sorry, I have some difficulties running OpenCL right now because my home desktop is completely locked up from the broken Mesa Clover driver.

I put it to draft right now and will change back when I have run tests on this code.

@TheBloke
Copy link
Contributor

TheBloke commented May 14, 2023

That commit got it running! Thank you!

It doesn't seem to be using the GPU though.
image

And performance figures are identical when comparing to llama.cpp compiled without LLAMA_CLBLAST

Is there any command line argument I need to use? I tried -ngl 32 but that did nothing - is that option only for CUBLAS?

@swittk
Copy link
Contributor

swittk commented May 14, 2023

For me on my hackintosh with an RX560, without specifying the device I want to use, I get the error: ggml_opencl: clGetContextInfo error -30 at ggml-opencl.c:269. (It appears that by default device 0 is the CPU itself; not even the Iris graphics. idk why either.)

But if I export GGML_OPENCL_DEVICE=2 (for the Radeon RX560 GPU) before and run the main executable, it works!
Screen Shot 2566-05-14 at 17 38 53 copy

@TheBloke
Copy link
Contributor

Weird, it just doesn't seem to use the GPU at all for me. It selects it OK, but then there's no difference in performance and no GPU usage % that I can see

image

@TheBloke
Copy link
Contributor

Ohh OK sorry I understand now. It's because I had a short prompt - I forgot that CLBLAST is only for prompt evaluation at the moment?

I just wrote a really long prompt like you had, and then I saw some GPU usage
image

Thank you again @SlyEcho !

@swittk
Copy link
Contributor

swittk commented May 14, 2023

Yes, it's currently only for prompt evaluation, and I think it only kicks in once the prompt batch reaches a certain length. 😄

@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

It doesn't do generation on OpenCL right now, like it does for CUDA.

Maybe if I get my setup running again I can take a shot at it. Probably in a different PR.

@TheBloke
Copy link
Contributor

Understood. It's cool just seeing it using the GPU at all!

But of course it'd be amazing if it could one day do the same as CUBLAS as well.

Thanks for all your work on this.

@SlyEcho
Copy link
Collaborator Author

SlyEcho commented May 14, 2023

Testing

I got my Steam Deck to run containers again and was able to run the code there.

Model files:

666a4bb533b303bdaf89e1b6a3b6f93535d868de31d903afdc20983dc526c847  llama-7b-f16.bin
417111a40c36bff7ae6c6b3f773ac6efdb1c46584ef1077a1f3404d668e3944f  llama-7b-q4_0.bin
0fc3f4925923cafe4681370e863319e8ff8f2d760e6b3f5435b415a407aa8d56  llama-7b-q4_1.bin
edfcb8202feb4c1ec827cd645f4e6627f7a85e95538c4858f0408993b5f0d994  llama-7b-q5_0.bin
72040d380ab1067dc08c28d5f16269453bf1d4d7172c24424d4300d8474b42b6  llama-7b-q5_1.bin
d3e36532ac11c4a63798ac6ec1471c1dc5a89305c9dec0319dfcb7efc146d001  llama-7b-q8_0.bin

Testing command:

for q in q4_0 q4_1 q5_0 q5_1 q8_0 f16; do
    ./bin/perplexity -m ../models/llama-7b-$q.bin --no-mmap -f ../models/wiki.test.mini;
done

Results:

7B Q4_0
main: build = 0 (unknown)
main: seed  = 1684072095
llama.cpp: loading model from ../models/llama-7b-q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4113752.75 KB
llama_model_load_internal: mem required  = 5809.34 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.26 seconds per pass - ETA 3 minutes
[1]4.4546,[2]4.9404,[3]5.8280,[4]6.4844,[5]6.5855,[6]6.5088,[7]6.6928,[8]6.8061,[9]7.1428,[10]7.3867,[11]7.5941,[12]7.6133,

llama_print_timings:        load time = 24988.32 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 201847.31 ms /  6144 tokens (   32.85 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 208709.02 ms
7B Q4_1
main: build = 0 (unknown)
main: seed  = 1684072304
llama.cpp: loading model from ../models/llama-7b-q4_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.12 seconds per pass - ETA 3 minutes
[1]4.4308,[2]4.8581,[3]5.7560,[4]6.3742,[5]6.4791,[6]6.4509,[7]6.6431,[8]6.7450,[9]7.0653,[10]7.3208,[11]7.5307,[12]7.5741,

llama_print_timings:        load time = 25209.34 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 198820.19 ms /  6144 tokens (   32.36 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 206035.26 ms
7B Q5_0
main: build = 0 (unknown)
main: seed  = 1684072511
llama.cpp: loading model from ../models/llama-7b-q5_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4525016.75 KB
llama_model_load_internal: mem required  = 6210.96 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.41 seconds per pass - ETA 3 minutes
[1]4.2473,[2]4.7592,[3]5.6487,[4]6.2761,[5]6.3829,[6]6.3595,[7]6.5483,[8]6.6444,[9]6.9754,[10]7.2143,[11]7.4127,[12]7.4464,

llama_print_timings:        load time = 25059.42 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 204714.69 ms /  6144 tokens (   33.32 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 211418.33 ms
7B Q5_1
main: build = 0 (unknown)
main: seed  = 1684072723
llama.cpp: loading model from ../models/llama-7b-q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 4936280.75 KB
llama_model_load_internal: mem required  = 6612.59 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.57 seconds per pass - ETA 3 minutes
[1]4.2727,[2]4.7374,[3]5.6310,[4]6.2173,[5]6.3480,[6]6.3057,[7]6.4954,[8]6.5895,[9]6.9196,[10]7.1579,[11]7.3579,[12]7.3818,

llama_print_timings:        load time = 25792.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 204019.29 ms /  6144 tokens (   33.21 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 211384.88 ms
7B Q8_0
main: build = 551 (394dabb)
main: seed  = 1684080494
llama.cpp: loading model from ../models/llama-7b-q8_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 7403864.75 KB
llama_model_load_internal: mem required  = 9022.34 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 19.52 seconds per pass - ETA 3 minutes
[1]4.2285,[2]4.7008,[3]5.5712,[4]6.1758,[5]6.2967,[6]6.2676,[7]6.4630,[8]6.5547,[9]6.8740,[10]7.1201,[11]7.3160,[12]7.3370,

llama_print_timings:        load time = 29735.79 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 198337.91 ms /  6144 tokens (   32.28 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 210448.65 ms
7B F16
main: build = 0 (unknown)
main: seed  = 1684073144
llama.cpp: loading model from ../models/llama-7b-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 13161560.75 KB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
ggml_opencl: using platform: 'AMD Accelerated Parallel Processing'
ggml_opencl: using device: 'gfx1033'
....................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 12 chunks, batch_size=512
perplexity: 211.29 seconds per pass - ETA 42 minutes
[1]4.2366,[2]5.0812,[3]5.8747,[4]6.4266,[5]6.5086,[6]6.4413,[7]6.6143,[8]7.2502,[9]7.5188,[10]7.7195,[11]7.8910,[12]7.8650,

llama_print_timings:        load time = 244849.08 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings: prompt eval time = 2375526.16 ms /  6144 tokens (  386.64 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token)
llama_print_timings:       total time = 2412459.23 ms

@SlyEcho
Collaborator Author

SlyEcho commented May 14, 2023

I think Q8_0 is now fixed.

@SlyEcho SlyEcho marked this pull request as ready for review May 14, 2023 16:13
@0cc4m
Collaborator

0cc4m commented May 14, 2023

It doesn't do the generating on OpenCL right now, like it does for CUDA.

Maybe if I get my setup running again I can take a shot at it. Probably in a different PR.

I got relatively far with that already. I'll make a PR soon.

@SlyEcho
Collaborator Author

SlyEcho commented May 15, 2023

OK, I tested it on an old Mac as well and it "works". It needs a small batch size (32) or CLBlast will not get enough memory. It is also 4 times slower than the CPU.

@0cc4m
Collaborator

0cc4m commented May 15, 2023

Nice! I tested it and it works as intended, mostly. It still selects my CPU by default, even though your code looks like it should pick the GPU. I'll take a closer look later today unless you figure it out first.

@SlyEcho
Collaborator Author

SlyEcho commented May 15, 2023

@0cc4m, feel free to take my really complex platform and device init code and experiment. I think I tried to follow the "default" logic where I pass in NULL. I think it is kind of similar like what clinfo shows in the end. Maybe it's not the right thing to do?

We could iterate all platforms and devices and select the first GPU, although I would also try to skip the Clover platform, because for me it is even worse than just not working: it causes the GPU to lock up requiring the GPU to be reset in the best case and power cycled in the worst case.

@SlyEcho
Collaborator Author

SlyEcho commented May 17, 2023

Since we don't access the value directly, it might as well be a ushort type, but then, is vload_half() even supported?

It really does depend on the platform itself so much, since the code has to be compiled at run time. Older drivers have worse compiler support, while newer ones have something like modern LLVM.

Would be nice to have SPIR-V or something...

@AutonomicPerfectionist
Contributor

AutonomicPerfectionist commented May 17, 2023

Which platform and device are you using?

Platform: AMD Accelerated Parallel Processing
Device: Oland
ROCm SDK version: 20.40

It's an old Radeon HD 8570 with 1 GB of VRAM. I was mainly using it for testing purposes; I found its prompt-evaluation performance was worse than the CPU's, so if it's the only card exhibiting the issue, supporting it is not a need for me. Just thought I'd make y'all aware of it in case it affects other users. This card was a nightmare to get set up: only one particular version of AMD's software works, and only on one particular version of Ubuntu (20.04), and I still needed to fiddle with library paths and names.

@SlyEcho
Collaborator Author

SlyEcho commented May 17, 2023

ROCm SDK version: 20.40

Latest version of ROCm is 5.5, did you mean Ubuntu 20.04?

@AutonomicPerfectionist
Contributor

AMD's documentation is confusing, sorry. Meant (I think) Radeon software / amdgpu:

https://www.amd.com/en/support/kb/release-notes/rn-amdgpu-unified-linux-20-40

@SlyEcho
Collaborator Author

SlyEcho commented May 19, 2023

@0cc4m, Now everything (and I mean everything) is wrapped in CL_CHECK() macros.

I even figured out a way to handle the calls that return the error by reference, by (ab)using the comma operator.

Maybe with C++, there's a better way with lambdas or whatever.

Collaborator

@0cc4m 0cc4m left a comment


I like the CL_CHECK improvements. Only some oddities with the device selection code left. It's getting a little too complicated for my taste, but I see the purpose.

Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

@SlyEcho
Collaborator Author

SlyEcho commented May 20, 2023

I've now rewritten the selection logic: it first scans all devices across all platforms (well, up to 16 platforms and 16 devices...)

Then the user's selection is applied. I also match against the platform vendor string.

It now tries to choose GPU devices by default.

GGML_OPENCL_PLATFORM=AMD     ./main # select AMD and look for GPUs there
GGML_OPENCL_PLATFORM=pocl    ./main # choose pocl but not a GPU, so show warning
GGML_OPENCL_PLATFORM=rusticl ./main # abort because there are no devices
GGML_OPENCL_PLATFORM=0       ./main # pocl because it's the first
GGML_OPENCL_DEVICE=Intel     ./main # find a device Intel, which pocl has
GGML_OPENCL_DEVICE=gfx900    ./main # use the Vega from the AMD OpenCL platform
GGML_OPENCL_DEVICE=Vega      ./main # use the Vega from Mesa Clover 💩

# you can apply both filters too but the device numbers are not per platform any more.
GGML_OPENCL_PLATFORM=pocl GGML_OPENCL_DEVICE=Intel ./main

What strings to use? Well, I didn't include a listing of the devices, but you can see them with the clinfo -l command, etc.

Device numbers are now absolute, so you can't select, say, the second device of the third platform. If you filter by name, the first matching device is selected, but you can also give the absolute number of the device you specifically want. llama.cpp can only use one device right now anyway.

Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

ggml-opencl.c Outdated
Comment on lines 232 to 233
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
Contributor


warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
-if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
-else { CL_CHECK(clGetDeviceIDsError); }
+if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0;
+} else CL_CHECK(clGetDeviceIDsError);

ggml-opencl.c Outdated
cl_device_id device_ids[NDEV];
cl_int clGetDeviceIDsError = clGetDeviceIDs(p->id, CL_DEVICE_TYPE_ALL, NDEV, device_ids, &p->n_devices);
if (clGetDeviceIDsError == CL_DEVICE_NOT_FOUND) { p->n_devices = 0; }
else { CL_CHECK(clGetDeviceIDsError); }
Contributor


warning: statement should be inside braces [readability-braces-around-statements]

Suggested change
-else { CL_CHECK(clGetDeviceIDsError); }
+else { CL_CHECK(clGetDeviceIDsError);
+}

Collaborator

@0cc4m 0cc4m left a comment


Nice, LGTM

@SlyEcho SlyEcho merged commit 9ecb30f into ggerganov:master May 20, 2023
@SlyEcho SlyEcho deleted the clfixes branch May 20, 2023 14:57
@Hyenadae

Hyenadae commented May 21, 2023

So I can confirm it's doing 'something' on my Ubuntu 20.04 system: 32 GB DDR4, i9-9500T (3.2 GHz, 6 cores, mitigations off, power limit increased). No specific errors, but odd/horrible performance, which I kinda expected.

Model is VicUnlocked 30B q4_1 and latest llama.cpp as of writing

Happily detects my iGPU which I installed the NEO drivers on a while ago:
Intel(R) OpenCL HD Graphics / GEN 9 Graphics NEO
2.2.11 ICD loader ver, profile OpenCL 2.1
24CU @ 1100MHz Max
24.91GB Global memory size (so, no obvious OOM yet... and I did export the GPU alloc = 100 stuff)

Normal CPU GGML (6 threads) is about 350 ms/token with 28 GB/s IMC read during compute and 0% GPU usage.
Enabling/compiling CLBlast 1.5.2 from the PPA greatly reduces the speed regardless of the number of layers on the GPU (I tried 6, 30, and 60).

Using the GPU, the token time went to 1400 ms/token with an eval time of 720 ms/token, at 100% CPU usage but 70% GPU 3D usage per intel_gpu_top.
Memory stats were 3.5 GB/s read and write to the IMC (odd bottleneck) until it jumped back to what it was doing before on the CPU: 28 GB/s read, 150 MB/s write.

Sample and eval times were the same for both the CPU-only and OpenCL-enabled builds. I tried a 13B q4_1 model too, with the same huge speed reduction; I wonder if it's an architectural thing or if the iGPU is just way too weak.
