OpenCL dequant_mul_mat #1459

Merged: 18 commits into ggerganov:master on May 22, 2023

Conversation

@0cc4m (Collaborator) commented May 14, 2023

I refactored the OpenCL implementation to become more like the CUDA implementation, including dequant_mul_mat kernels and gpu layers. I see a pretty good speed increase with it.

I'm not done testing yet, so I've put this PR in draft until I'm fully done. But since I've seen others with the same idea, I want to put it up already to hopefully avoid duplicating the same work.

Here are some initial results on Ryzen 9 5950X and AMD RX 6800 XT, q5_1 Llama 7B:
master branch:

llama_print_timings:      sample time =    29.00 ms /    50 runs   (    0.58 ms per token)
llama_print_timings: prompt eval time =  5271.28 ms /   622 tokens (    8.47 ms per token)
llama_print_timings:        eval time =  6815.71 ms /    49 runs   (  139.10 ms per token)

This PR (no layers offloaded):

llama_print_timings:      sample time =    29.96 ms /    50 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time =  5879.90 ms /   622 tokens (    9.45 ms per token)
llama_print_timings:        eval time =  6171.29 ms /    49 runs   (  125.94 ms per token)

This PR (33 layers offloaded, 4.7GB VRAM used, dequant_mul_mat kernels used):

llama_print_timings:      sample time =    23.40 ms /    50 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =  4918.17 ms /   622 tokens (    7.91 ms per token)
llama_print_timings:        eval time =  2779.95 ms /    49 runs   (   56.73 ms per token)

The slight regression in prompt eval time is the reason I didn't follow the CUDA implementation sooner; it appeared when I implemented the non-contiguous tensor loading. There might be some bottleneck in the code related to that, so let me know if you spot something.
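
For context, a minimal sketch of what non-contiguous tensor loading means on the OpenCL side (illustrative only, not the code from this PR; the helper name and the ggml-style row/stride terminology are assumptions): a strided source has to be uploaded row by row, and the per-call overhead of many small transfers is one plausible source of such a bottleneck.

#include <CL/cl.h>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: contiguous vs. non-contiguous host-to-device upload.
static cl_int upload_tensor_2d(cl_command_queue queue, cl_mem dst,
                               const uint8_t * src,
                               size_t n_rows,      // number of rows (ne1 in ggml terms)
                               size_t row_bytes,   // payload bytes per row
                               size_t src_stride)  // byte stride between rows on the host (nb1)
{
    if (src_stride == row_bytes) {
        // contiguous: one large transfer, usually the fast path
        return clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, n_rows * row_bytes, src, 0, nullptr, nullptr);
    }
    // non-contiguous: one small transfer per row; the enqueue overhead of
    // n_rows separate calls can dominate and slow down prompt processing
    for (size_t i = 0; i < n_rows; ++i) {
        const cl_int err = clEnqueueWriteBuffer(queue, dst, CL_FALSE, i * row_bytes,
                                                row_bytes, src + i * src_stride, 0, nullptr, nullptr);
        if (err != CL_SUCCESS) {
            return err;
        }
    }
    return CL_SUCCESS;
}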

Still missing:

  • CMakeLists.txt update
  • Nvidia tests
  • Intel tests
  • iGPU tests

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@JohannesGaessler (Collaborator) left a comment:

There seems to be a lot of code duplication. Does OpenCL not support templates?

@SlyEcho (Sponsor, Collaborator) commented May 14, 2023

> Does OpenCL not support templates?

No, it's a C-based language. But you could generate the code as strings as needed and compile at runtime.
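
For readers unfamiliar with that path, here is a minimal sketch of OpenCL's runtime compilation using the standard API (not the exact ggml-opencl code): the kernel source is just a C string, so it can be assembled or modified before it is compiled.

#include <CL/cl.h>
#include <cstdio>

// Compile a kernel from a source string at runtime; returns nullptr on failure.
static cl_kernel build_kernel(cl_context ctx, cl_device_id dev,
                              const char * src, const char * entry_point) {
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    if (err != CL_SUCCESS) {
        return nullptr;
    }
    err = clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    if (err != CL_SUCCESS) {
        // print the build log to see compiler errors from the generated source
        char log[16384];
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, nullptr);
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);
        return nullptr;
    }
    cl_kernel kernel = clCreateKernel(prog, entry_point, &err);
    return err == CL_SUCCESS ? kernel : nullptr;
}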

@JohannesGaessler (Collaborator):

That is unfortunate. Just today I did a PR that deduplicates the CUDA code by adding another template for just the dequantization: #1453. That way you only have to maintain one method per quantization format plus two template methods.

@JohannesGaessler (Collaborator) commented May 14, 2023

I did some quick testing:

GPU        Model     Backend   Speed [t/s]
GTX 1070   7b q4_0   CUDA      14.47
GTX 1070   7b q4_0   OpenCL    13.69

@SlyEcho (Sponsor, Collaborator) commented May 14, 2023

There is no need for templates, actually, @JohannesGaessler. I converted your new CUDA code to macros and it still works fine.

Horrible macros:
#define DEFINE_Q_FUNCS(NAM, block_size, qk, qr, dequantize_kernel) \
static __global__ void dequantize_block_##NAM(const void * vx, float * y, const int k) { \
    const int i = blockDim.x*blockIdx.x + 2*threadIdx.x; \
\
    if (i >= k) { \
        return; \
    } \
\
    const int ib = i/qk; /* block index */ \
    const int iqs = (i%qk)/qr; /* quant index */ \
    const int iybs = i - i%qk; /* y block start index */ \
    const int y_offset = qr == 1 ? 1 : qk/2; \
\
    /* dequantize */ \
    float & v0 = y[iybs + iqs + 0]; \
    float & v1 = y[iybs + iqs + y_offset]; \
    dequantize_kernel(vx, ib, iqs, v0, v1); \
} \
\
static void dequantize_row_##NAM##_cuda(const void * vx, float * y, const int k, cudaStream_t stream) { \
    const int num_blocks = (k + CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / CUDA_DEQUANTIZE_BLOCK_SIZE; \
    dequantize_block_##NAM<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k); \
} \
\
static __global__ void dequantize_mul_mat_vec_##NAM(const void * vx, const float * y, float * dst, const int ncols) { \
    const int row = blockIdx.x; \
    const int tid = threadIdx.x; \
\
    const int y_offset = qr == 1 ? 1 : qk/2; \
\
    __shared__ float tmp[block_size]; /* separate sum for each thread */ \
    tmp[tid] = 0; \
\
    for (int i = 0; i < ncols/block_size; i += 2) { \
        const int col = i*block_size + 2*tid; \
        const int ib = (row*ncols + col)/qk; /* block index */ \
        const int iqs = (col%qk)/qr; /* quant index */ \
        const int iybs = col - col%qk; /* y block start index */ \
\
        /* dequantize */ \
        float v0, v1; \
        dequantize_kernel(vx, ib, iqs, v0, v1); \
\
        /* matrix multiplication */ \
        tmp[tid] += v0 * y[iybs + iqs + 0]; \
        tmp[tid] += v1 * y[iybs + iqs + y_offset]; \
    }\
\
    /* sum up partial sums and write back result */ \
    __syncthreads(); \
    for (int s=block_size/2; s>0; s>>=1) { \
        if (tid < s) { \
            tmp[tid] += tmp[tid + s]; \
        } \
        __syncthreads(); \
    } \
    if (tid == 0) { \
        dst[row] = tmp[0]; \
    } \
} \
\
static void dequantize_mul_mat_vec_##NAM##_cuda(const void * vx, const float * y, float * dst, const int ncols, const int nrows, cudaStream_t stream) { \
    GGML_ASSERT(ncols % CUDA_DMMV_BLOCK_SIZE == 0); \
    dequantize_mul_mat_vec_##NAM<<<nrows, CUDA_DMMV_BLOCK_SIZE, 0, stream>>>(vx, y, dst, ncols); \
}

DEFINE_Q_FUNCS(q4_0, CUDA_DMMV_BLOCK_SIZE, QK4_0, QR4_0, dequantize_q4_0)
DEFINE_Q_FUNCS(q4_1, CUDA_DMMV_BLOCK_SIZE, QK4_1, QR4_1, dequantize_q4_1)
DEFINE_Q_FUNCS(q5_0, CUDA_DMMV_BLOCK_SIZE, QK5_0, QR5_0, dequantize_q5_0)
DEFINE_Q_FUNCS(q5_1, CUDA_DMMV_BLOCK_SIZE, QK5_1, QR5_1, dequantize_q5_1)
DEFINE_Q_FUNCS(q8_0, CUDA_DMMV_BLOCK_SIZE, QK8_0, QR8_0, dequantize_q8_0)
DEFINE_Q_FUNCS(f16, CUDA_DMMV_BLOCK_SIZE, 32, 1, convert_f16)

Currently this wouldn't work in ggml-opencl.c because of the way the string is processed into the file. There can't be any defines in there.

@0cc4m (Collaborator, Author) commented May 15, 2023

The runtime compilation of OpenCL gives us the ability to build the kernels on the fly. I plan to add a simple templater that does this to reduce duplication.
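
One way such a templater could work (a purely hypothetical sketch; the placeholder names and helpers below are made up for illustration) is plain string substitution on a generic kernel template, with the generated source then compiled at runtime via clCreateProgramWithSource and clBuildProgram:

#include <string>

// Replace every occurrence of `from` with `to` in `s`.
static std::string replace_all(std::string s, const std::string & from, const std::string & to) {
    for (size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
    return s;
}

// Generic kernel template with placeholders for the per-type pieces.
static const char * dequant_template = R"(
__kernel void KERNEL_NAME(__global const void * vx, __global float * y) {
    const int i = get_global_id(0);
    DEQUANT_FUNC
}
)";

// Produce the source for one quantization type; the result would then be
// passed to clCreateProgramWithSource + clBuildProgram as usual.
static std::string generate_dequant_kernel(const std::string & name, const std::string & dequant_body) {
    std::string src = dequant_template;
    src = replace_all(src, "KERNEL_NAME", name);
    src = replace_all(src, "DEQUANT_FUNC", dequant_body);
    return src;
}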

@LiliumSancta commented May 15, 2023

Some tests (GPU: RX 6600 8 GB, CPU: 5600X, RAM: 32 GB 3600 MHz).
On Ubuntu I used CUDA with @SlyEcho's ROCm patch; on Windows with OpenCL I couldn't offload as many layers of the model to VRAM because of the VRAM overhead of Windows itself. All runs used the same parameters and seed.

Edit: model Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin

There is an error in CMakeLists.txt: set(GGML_OPENCL_SOURCES ggml-opencl.c ggml-opencl.h) needs to be updated after the change to ggml-opencl.cpp.

OPENCL - Ubuntu

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1032 FP16: 1
llama_model_load_internal: [opencl] offloading 31 layers to GPU
llama_model_load_internal: [opencl] total VRAM used: 7033 MB
llama_init_from_file: kv self size = 1600,00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 5548,81 ms
llama_print_timings: sample time = 180,76 ms / 391 runs ( 0,46 ms per token)
llama_print_timings: prompt eval time = 3289,77 ms / 14 tokens ( 234,98 ms per token)
llama_print_timings: eval time = 59369,83 ms / 390 runs ( 152,23 ms per token)
llama_print_timings: total time = 65191,07 ms

ROCM - CUDA

llama_model_load_internal: [cublas] offloading 31 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 7033 MB
llama_init_from_file: kv self size = 1600,00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 4835,65 ms
llama_print_timings: sample time = 120,23 ms / 278 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 2329,01 ms / 14 tokens ( 166,36 ms per token)
llama_print_timings: eval time = 39577,58 ms / 277 runs ( 142,88 ms per token)
llama_print_timings: total time = 44547,13 ms

OPENCL - WINDOWS

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1032 FP16: 1
llama_model_load_internal: [opencl] offloading 23 layers to GPU
llama_model_load_internal: [opencl] total VRAM used: 5218 MB
llama_init_from_file: kv self size = 1600.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 9369.12 ms
llama_print_timings: sample time = 74.93 ms / 434 runs ( 0.17 ms per token)
llama_print_timings: prompt eval time = 5275.04 ms / 14 tokens ( 376.79 ms per token)
llama_print_timings: eval time = 89806.45 ms / 433 runs ( 207.41 ms per token)
llama_print_timings: total time = 99290.66 ms

@SlyEcho (Sponsor, Collaborator) commented May 15, 2023

@LiliumSancta, which LLaMa is it?

BTW, add --ignore-eos to always generate the amount of tokens you ask for.

@LiliumSancta:

> @LiliumSancta, which LLaMa is it?
>
> BTW, add --ignore-eos to always generate the amount of tokens you ask for.

Sorry, I forgot to mention: it is Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin.

@0cc4m (Collaborator, Author) commented May 15, 2023

@LiliumSancta Thanks for the test. Your results also show that something is wrong with token processing. I will try to investigate that.

@LiliumSancta:

> @LiliumSancta Thanks for the test. Your results also show that something is wrong with token processing. I will try to investigate that.

I would like to thank all of you for your efforts on this project. Later, when I get off work, I'll redo everything with a 7B model that can be loaded entirely into VRAM, using --ignore-eos to be more consistent.

@daniandtheweb (Contributor):

@0cc4m On WizardLM 7B, offloading all 32 layers, I'm getting 60 ms per token on a 5700 XT. On the wizard-vicuna-13B model I'm only able to offload 27 layers before saturating the VRAM, and I get 157 ms per token. Both models are q5_1 quantized.

wizardlm-7B-q5_1:

llama_print_timings:        load time = 10132.12 ms
llama_print_timings:      sample time =   161.22 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time =  8981.20 ms /   658 tokens (   13.65 ms per token)
llama_print_timings:        eval time = 15348.64 ms /   254 runs   (   60.43 ms per token)
llama_print_timings:       total time = 28500.82 ms

wizard-vicuna-uncensored-13B-q5_1:

llama_print_timings:        load time = 13575.15 ms
llama_print_timings:      sample time =   160.96 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time = 18880.03 ms /   658 tokens (   28.69 ms per token)
llama_print_timings:        eval time = 40117.31 ms /   254 runs   (  157.94 ms per token)
llama_print_timings:       total time = 61474.53 ms

@SlyEcho (Sponsor, Collaborator) commented May 15, 2023

> something is wrong with token processing.

I don't think so necessarily, with only 14 tokens, the timing may be very inaccurate.

But it does show one thing: OpenCL can be as fast as CUDA/ROCm, and also that CLBlast is seriously underwhelming compared to cuBLAS/rocBLAS.

@LiliumSancta:

> I don't think so necessarily, with only 14 tokens, the timing may be very inaccurate.
>
> But it does show one thing: OpenCL can be as fast as CUDA/ROCm, and also that CLBlast is seriously underwhelming compared to cuBLAS/rocBLAS.

Yes, maybe I sent too many layers to the GPU, and I believe that could be affecting processing times, so I intend to redo it with a smaller model. I noticed that CLBlast allows me to send more layers than fit on the GPU without throwing any errors, but it slows everything down a lot.

@skidd-level-100:

The changes compiled fine and the program runs fine, but '--n-gpu-layers N' seems to make no difference. I checked nvidia-smi and there is no VRAM usage difference when I change the number of layers, and it seems no faster.

I am using 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

prompt times

without --gpu-layers
llama_print_timings: sample time = 84.73 ms / 165 runs ( 0.51 ms per token)
llama_print_timings: prompt eval time = 162.75 ms / 2 tokens ( 81.38 ms per token)
llama_print_timings: eval time = 22549.72 ms / 165 runs ( 136.66 ms per token)

With:
llama_print_timings: sample time = 117.32 ms / 245 runs ( 0.48 ms per token)
llama_print_timings: prompt eval time = 174.76 ms / 2 tokens ( 87.38 ms per token)
llama_print_timings: eval time = 33491.55 ms / 245 runs ( 136.70 ms per token)

Maybe I did something wrong?

@daniandtheweb (Contributor):

Results of OpenAssistant-SFT-7-Llama-30B-q5_1 with a Radeon 5700 XT (8 GB), i7 9700K, 32 GB RAM (3200 MHz)

0 layers:

llama_print_timings:        load time = 25062.67 ms
llama_print_timings:      sample time =   161.38 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time = 40327.40 ms /   658 tokens (   61.29 ms per token)
llama_print_timings:        eval time = 178379.81 ms /   254 runs   (  702.28 ms per token)
llama_print_timings:       total time = 220165.20 ms

15 layers:

llama_print_timings:        load time = 26448.30 ms
llama_print_timings:      sample time =   164.82 ms /   256 runs   (    0.64 ms per token)
llama_print_timings: prompt eval time = 41564.33 ms /   658 tokens (   63.17 ms per token)
llama_print_timings:        eval time = 160101.05 ms /   254 runs   (  630.32 ms per token)
llama_print_timings:       total time = 204397.64 ms

More than 15 layers saturates the VRAM and slows everything down by a lot.

@LiliumSancta:

Under Windows 11 it's fast, but under Linux it's crazy fast.
All tests used the same parameters and a 7B Q4_0 model. I'll compile again on Windows and check that everything is correct, but I believe it's just Windows itself or the AMD drivers, I don't know; unfortunately WSL2 still doesn't support OpenCL.

model llama 7B Q4_0 -n 512 --ctx-size 2048 --gpu_layers 99 --seed 1684109798 -t 6

Windows - OPENCL

short prompt

llama_print_timings: load time = 6061.47 ms
llama_print_timings: sample time = 78.93 ms / 512 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 3134.41 ms / 14 tokens ( 223.89 ms per token)
llama_print_timings: eval time = 53299.86 ms / 511 runs ( 104.31 ms per token)
llama_print_timings: total time = 59480.58 ms

long prompt

llama_print_timings: load time = 11210.65 ms
llama_print_timings: sample time = 78.60 ms / 512 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 8198.12 ms / 401 tokens ( 20.44 ms per token)
llama_print_timings: eval time = 58348.64 ms / 511 runs ( 114.19 ms per token)
llama_print_timings: total time = 69680.18 ms

Ubuntu - OPENCL

short prompt

llama_print_timings: load time = 4629,94 ms
llama_print_timings: sample time = 218,57 ms / 512 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 1681,86 ms / 14 tokens ( 120,13 ms per token)
llama_print_timings: eval time = 34416,77 ms / 511 runs ( 67,35 ms per token)
llama_print_timings: total time = 39367,81 ms

long prompt

llama_print_timings: load time = 7222,00 ms
llama_print_timings: sample time = 218,61 ms / 512 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 5893,85 ms / 401 tokens ( 14,70 ms per token)
llama_print_timings: eval time = 39798,12 ms / 511 runs ( 77,88 ms per token)
llama_print_timings: total time = 47383,59 ms

Ubuntu - CUDA - ROCM

short prompt

llama_print_timings: load time = 2580,17 ms
llama_print_timings: sample time = 190,11 ms / 512 runs ( 0,37 ms per token)
llama_print_timings: prompt eval time = 1183,21 ms / 14 tokens ( 84,52 ms per token)
llama_print_timings: eval time = 32662,96 ms / 511 runs ( 63,92 ms per token)
llama_print_timings: total time = 35453,89 ms

long prompt

llama_print_timings: load time = 5641,55 ms
llama_print_timings: sample time = 189,71 ms / 512 runs ( 0,37 ms per token)
llama_print_timings: prompt eval time = 4255,51 ms / 401 tokens ( 10,61 ms per token)
llama_print_timings: eval time = 37548,29 ms / 511 runs ( 73,48 ms per token)
llama_print_timings: total time = 43401,08 ms

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@LostRuins (Collaborator) commented May 16, 2023

Tried this out and I have some findings to share:

  1. After switching to the C++ version of CLBlast, I get an "undefined reference to clblast::StatusCode clblast::Gemm" linker error at compile time: see CNugteren/CLBlast#467 (Undefined reference to clblast::StatusCode clblast::Gemm on Windows with GCC with the C++ API). I don't know if it's a bug in their prebuilt Windows binaries or my own user error. I solved it by wrapping the SGEMM calls in a function that uses the C API's CLBlastSgemm, which worked well (a rough sketch of such a wrapper is included below).

  2. I am getting a silent crash on any batch > 32 tokens. For < 32 tokens it runs fine, but as soon as the batch exceeds that, it will not work regardless of how many layers are offloaded. For a 37-token prompt:

ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:1, Size_Type=2  - ne00:128 ne01:37 ne10:128 ne11:37

Adding some debug printing to the ggml_cl_mul_mat_q_f32 function shows that it executes 4 times before silently crashing. The program exits with no errors and no asserts (even with debug on).

To solve that, I backported the cl_sgemm_wrapper from the previous implementation, and I modified ggml_cl_can_mul_mat to trigger ONLY for tensors on device, regardless of batch size. That allowed me to run everything with no issues.
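
For reference, here is a rough sketch of the kind of C-API wrapper described in point 1 above, assuming CLBlast's C header clblast_c.h and its CLBlastSgemm entry point; the function name and exact parameter set are illustrative, not the wrapper actually used.

#include <clblast_c.h>

// Keep the call site shaped like clblast::Gemm<float>, but forward to the C API,
// which avoids linking against the C++ symbol that fails to resolve.
static CLBlastStatusCode sgemm_wrapper(
        CLBlastLayout layout, CLBlastTranspose a_transpose, CLBlastTranspose b_transpose,
        size_t m, size_t n, size_t k,
        float alpha,
        cl_mem a, size_t a_offset, size_t a_ld,
        cl_mem b, size_t b_offset, size_t b_ld,
        float beta,
        cl_mem c, size_t c_offset, size_t c_ld,
        cl_command_queue * queue, cl_event * event) {
    return CLBlastSgemm(layout, a_transpose, b_transpose,
                        m, n, k,
                        alpha, a, a_offset, a_ld,
                        b, b_offset, b_ld,
                        beta, c, c_offset, c_ld,
                        queue, event);
}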

These are my benchmarks:

Prompt Processing: 
CUDA with 0 layers = 11ms/T
CUDA with 30 layers = 9ms/T
CLBlast (original SGEMM) = 19ms/T

Generation speed:
CUDA with 0 layers = 200ms/T
CUDA with 30 layers = 86ms/T
CLBlast with 0 layers = 192ms/T
CLBlast with 30 layers = 94ms/T

I am really enjoying the practically 100% speed improvement from this, excluding the prompt-processing regression, which in the worst case could be solved by using it only for batch-size-1 tensors on the device.

@LostRuins (Collaborator) commented May 16, 2023

Can confirm the latest fix solves the crash.

Is anyone else able to reproduce the performance regression in prompt processing? My speed seems equally fast.

@JohannesGaessler (Collaborator):

I implemented more CUDA kernels that I will need in order to fix the memory management issues: #1483. They do not make a difference for performance. My plan is to load the LLaMA parameters directly from disk to VRAM; if all parameters (including the norms) are loaded this way, it will be simpler.

@0cc4m (Collaborator, Author) commented May 16, 2023

I think it is ready for review now. I tested on Nvidia and AMD and it worked in all cases for me.

I have noticed that some (weaker) devices, like my Steam Deck, are slightly slower at prompt processing with the non-contiguous tensor loading, but I haven't found the reason for that yet.

@0cc4m marked this pull request as ready for review on May 16, 2023.
@0cc4m (Collaborator, Author) commented May 16, 2023

But #1435 should definitely be merged first. I'll adapt this PR once that happens.

@SlyEcho (Sponsor, Collaborator) commented May 16, 2023

> But #1435 should definitely be merged first.

I'll get on it.

@LostRuins (Collaborator) commented May 17, 2023

More benchmarks:

AMD Accelerated Parallel Processing Device gfx1030 with FP16 off:

7b_q5_1 clblast- all layers on gpu -  540 token prompt, generating 512 tokens
Time Taken - Processing:4.7s (9ms/T), Generation:41.8s (82ms/T), Total:46.5s

7b_q5_1 clblast 0 gpu layers - 540 token prompt, generating 512 tokens
Time Taken - Processing:7.3s (14ms/T), Generation:127.9s (250ms/T), Total:135.3s

GPT4-X-Alpaca 65b Q4_0 GGML - 30 layers offloaded on GPU, 540 token prompt, generating 178 tokens
Time Taken - Processing:117.4s (217ms/T), Generation:221.8s (1246ms/T), Total:339.2s

GPT4-X-Alpaca 65b Q4_0 GGML - 0 layers offloaded on GPU, 540 token prompt, generating 273 tokens
Time Taken - Processing:55.2s (102ms/T), Generation:471.7s (1728ms/T), Total:526.9s

With FP16 ON, GPT4-X-Alpaca 65b Q4_0 GGML - 30 layers offloaded on GPU
Processing 535 tokens ( 306 ms/T), Generation 258 tokens ( 1365 ms/T)

@LostRuins (Collaborator):

Also, is anyone else having issues with the q8_0 dequantize kernel? I am getting gibberish, but only for that kernel, and only with CLBlast. OpenBLAS works fine, as does CLBlast with the other kernels. The implementation looks different from the CUDA one, but unfortunately I don't understand it well enough to review it.
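
For comparison, the q8_0 format of the time is simple: each 32-element block stores a float scale d followed by 32 signed 8-bit quants, and dequantization is just qs[j] * d. A minimal OpenCL kernel for it, written as a C++ string literal the way ggml-opencl embeds its kernels, would look roughly like the sketch below (a reference point under those assumptions, not the kernel from this PR).

// Hypothetical reference kernel; assumes the float-scale q8_0 block layout.
static const char * dequantize_q8_0_src = R"(
struct block_q8_0 {
    float d;       // scale
    char  qs[32];  // quantized values (signed 8-bit in OpenCL C)
};

__kernel void dequantize_block_q8_0(__global const struct block_q8_0 * x, __global float * y) {
    const uint i = get_global_id(0) / 32;   // block index
    const uint j = get_global_id(0) % 32;   // position within the block
    y[i*32 + j] = x[i].qs[j] * x[i].d;
}
)";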

@0cc4m (Collaborator, Author) commented May 21, 2023

@SlyEcho I rebased this branch onto your changes, I hope I got everything right. Let me know if you see any mistake.

@SlyEcho (Sponsor, Collaborator) commented May 21, 2023

Awesome!

I only have one nitpick with the printf but otherwise it looks very good! 🚀

@SlyEcho (Sponsor, Collaborator) commented May 21, 2023

OK, I found something:

Using a F16 model with --gpu-layers > 0 will result in a segfault.

@0cc4m (Collaborator, Author) commented May 21, 2023

That's an FP16 issue again... I'll take a look tomorrow.

@0cc4m (Collaborator, Author) commented May 22, 2023

@SlyEcho I found the bug. Please check again if it's fixed for you as well.

What nitpick did you have with the printf?

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

Otherwise it seems to work.

@0cc4m (Collaborator, Author) commented May 22, 2023

@SlyEcho We also noticed that selecting platform and device by ID wasn't working as it did before: the platform number did nothing anymore, since only the device number picked the device out of the whole list of devices. I restored the previous behavior for this case, so that both numbers are relevant. I hope that's okay with you; Koboldcpp was already relying on it.

If you disagree with this, let me know. Also if you find a mistake.
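
For anyone following along, the restored behavior as described corresponds roughly to the standard two-step OpenCL enumeration below (a sketch with illustrative names, not the PR's exact code): the platform index selects a platform first, and the device index then selects a device within that platform.

#include <CL/cl.h>

// Pick platform `platform_idx` first, then device `device_idx` within it.
static cl_device_id select_device(unsigned platform_idx, unsigned device_idx) {
    cl_platform_id platforms[16];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(16, platforms, &n_platforms);
    if (platform_idx >= n_platforms || platform_idx >= 16) {
        return nullptr;
    }
    cl_device_id devices[16];
    cl_uint n_devices = 0;
    clGetDeviceIDs(platforms[platform_idx], CL_DEVICE_TYPE_ALL, 16, devices, &n_devices);
    if (device_idx >= n_devices || device_idx >= 16) {
        return nullptr;
    }
    return devices[device_idx];
}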

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

I hope you don't mind me just pushing changes. I feel it will go faster this way.

@0cc4m (Collaborator, Author) commented May 22, 2023

Thanks, I didn't even see that... Apparently it was too late for the logic side of my brain.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

The device selection logic is good; it should be more logical this way. I just hadn't figured it out myself.

@0cc4m (Collaborator, Author) commented May 22, 2023

I think this PR is ready.

@SlyEcho merged commit 2e6cd4b into ggerganov:master on May 22, 2023. 22 checks passed.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

Let's see what happens, I think it's a big improvement anyway.
