OpenCL dequant_mul_mat #1459

Merged: 18 commits into ggerganov:master on May 22, 2023

Conversation

@0cc4m (Collaborator) commented May 14, 2023

I refactored the OpenCL implementation to become more like the CUDA implementation, including dequant_mul_mat kernels and gpu layers. I see a pretty good speed increase with it.

I'm not done testing yet, so I've put this PR in draft until I'm fully done. But since I've seen others with the same idea, I want to put it up already to hopefully avoid duplicating the same work.

Here are some initial results on Ryzen 9 5950X and AMD RX 6800 XT, q5_1 Llama 7B:
master branch:

llama_print_timings:      sample time =    29.00 ms /    50 runs   (    0.58 ms per token)
llama_print_timings: prompt eval time =  5271.28 ms /   622 tokens (    8.47 ms per token)
llama_print_timings:        eval time =  6815.71 ms /    49 runs   (  139.10 ms per token)

This PR (no layers offloaded):

llama_print_timings:      sample time =    29.96 ms /    50 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time =  5879.90 ms /   622 tokens (    9.45 ms per token)
llama_print_timings:        eval time =  6171.29 ms /    49 runs   (  125.94 ms per token)

This PR (33 layers offloaded, 4.7GB VRAM used, dequant_mul_mat kernels used):

llama_print_timings:      sample time =    23.40 ms /    50 runs   (    0.47 ms per token)
llama_print_timings: prompt eval time =  4918.17 ms /   622 tokens (    7.91 ms per token)
llama_print_timings:        eval time =  2779.95 ms /    49 runs   (   56.73 ms per token)

The slight regression in prompt eval time is the reason I didn't follow the CUDA implementation sooner; it appeared when I implemented the non-contiguous tensor loading. There might be some bottleneck in the code related to that, so let me know if you spot something.
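
For context, a minimal sketch of what non-contiguous tensor loading means on the OpenCL side (illustrative only, not the code from this PR; the helper name and the ggml-style row/stride terminology are assumptions): a strided source has to be uploaded row by row, and the per-call overhead of many small transfers is one plausible source of such a bottleneck.

#include <CL/cl.h>
#include <cstddef>
#include <cstdint>

// Illustrative sketch: contiguous vs. non-contiguous host-to-device upload.
static cl_int upload_tensor_2d(cl_command_queue queue, cl_mem dst,
                               const uint8_t * src,
                               size_t n_rows,      // number of rows (ne1 in ggml terms)
                               size_t row_bytes,   // payload bytes per row
                               size_t src_stride)  // byte stride between rows on the host (nb1)
{
    if (src_stride == row_bytes) {
        // contiguous: one large transfer, usually the fast path
        return clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, n_rows * row_bytes, src, 0, nullptr, nullptr);
    }
    // non-contiguous: one small transfer per row; the enqueue overhead of
    // n_rows separate calls can dominate and slow down prompt processing
    for (size_t i = 0; i < n_rows; ++i) {
        const cl_int err = clEnqueueWriteBuffer(queue, dst, CL_FALSE, i * row_bytes,
                                                row_bytes, src + i * src_stride, 0, nullptr, nullptr);
        if (err != CL_SUCCESS) {
            return err;
        }
    }
    return CL_SUCCESS;
}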

Still missing:

  • CMakeLists.txt update
  • Nvidia tests
  • Intel tests
  • iGPU tests

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@JohannesGaessler (Collaborator) left a comment:

There seems to be a lot of code duplication. Does OpenCL not support templates?

@SlyEcho (Sponsor, Collaborator) commented May 14, 2023

> Does OpenCL not support templates?

No, it's a C-based language. But you could generate the code as strings as needed and compile at runtime.
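
For readers unfamiliar with that path, here is a minimal sketch of OpenCL's runtime compilation using the standard API (not the exact ggml-opencl code): the kernel source is just a C string, so it can be assembled or modified before it is compiled.

#include <CL/cl.h>
#include <cstdio>

// Compile a kernel from a source string at runtime; returns nullptr on failure.
static cl_kernel build_kernel(cl_context ctx, cl_device_id dev,
                              const char * src, const char * entry_point) {
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    if (err != CL_SUCCESS) {
        return nullptr;
    }
    err = clBuildProgram(prog, 1, &dev, nullptr, nullptr, nullptr);
    if (err != CL_SUCCESS) {
        // print the build log to see compiler errors from the generated source
        char log[16384];
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, nullptr);
        fprintf(stderr, "OpenCL build failed:\n%s\n", log);
        return nullptr;
    }
    cl_kernel kernel = clCreateKernel(prog, entry_point, &err);
    return err == CL_SUCCESS ? kernel : nullptr;
}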

@JohannesGaessler (Collaborator):

That is unfortunate. Just today I did a PR that deduplicates the CUDA code by adding another template for just the dequantization: #1453. That way you only have to maintain one method per quantization format plus two template methods.

@JohannesGaessler (Collaborator) commented May 14, 2023

I did some quick testing:

GPU        Model     Backend   Speed [t/s]
GTX 1070   7b q4_0   CUDA      14.47
GTX 1070   7b q4_0   OpenCL    13.69

@SlyEcho (Sponsor, Collaborator) commented May 14, 2023

There is no need for templates, actually, @JohannesGaessler. I converted your new CUDA code to macros and it still works fine.

Horrible macros:
#define DEFINE_Q_FUNCS(NAM, block_size, qk, qr, dequantize_kernel) \
static __global__ void dequantize_block_##NAM(const void * vx, float * y, const int k) { \
    const int i = blockDim.x*blockIdx.x + 2*threadIdx.x; \
\
    if (i >= k) { \
        return; \
    } \
\
    const int ib = i/qk; /* block index */ \
    const int iqs = (i%qk)/qr; /* quant index */ \
    const int iybs = i - i%qk; /* y block start index */ \
    const int y_offset = qr == 1 ? 1 : qk/2; \
\
    /* dequantize */ \
    float & v0 = y[iybs + iqs + 0]; \
    float & v1 = y[iybs + iqs + y_offset]; \
    dequantize_kernel(vx, ib, iqs, v0, v1); \
} \
\
static void dequantize_row_##NAM##_cuda(const void * vx, float * y, const int k, cudaStream_t stream) { \
    const int num_blocks = (k + CUDA_DEQUANTIZE_BLOCK_SIZE - 1) / CUDA_DEQUANTIZE_BLOCK_SIZE; \
    dequantize_block_##NAM<<<num_blocks, CUDA_DEQUANTIZE_BLOCK_SIZE, 0, stream>>>(vx, y, k); \
} \
\
static __global__ void dequantize_mul_mat_vec_##NAM(const void * vx, const float * y, float * dst, const int ncols) { \
    const int row = blockIdx.x; \
    const int tid = threadIdx.x; \
\
    const int y_offset = qr == 1 ? 1 : qk/2; \
\
    __shared__ float tmp[block_size]; /* separate sum for each thread */ \
    tmp[tid] = 0; \
\
    for (int i = 0; i < ncols/block_size; i += 2) { \
        const int col = i*block_size + 2*tid; \
        const int ib = (row*ncols + col)/qk; /* block index */ \
        const int iqs = (col%qk)/qr; /* quant index */ \
        const int iybs = col - col%qk; /* y block start index */ \
\
        /* dequantize */ \
        float v0, v1; \
        dequantize_kernel(vx, ib, iqs, v0, v1); \
\
        /* matrix multiplication */ \
        tmp[tid] += v0 * y[iybs + iqs + 0]; \
        tmp[tid] += v1 * y[iybs + iqs + y_offset]; \
    }\
\
    /* sum up partial sums and write back result */ \
    __syncthreads(); \
    for (int s=block_size/2; s>0; s>>=1) { \
        if (tid < s) { \
            tmp[tid] += tmp[tid + s]; \
        } \
        __syncthreads(); \
    } \
    if (tid == 0) { \
        dst[row] = tmp[0]; \
    } \
} \
\
static void dequantize_mul_mat_vec_##NAM##_cuda(const void * vx, const float * y, float * dst, const int ncols, const int nrows, cudaStream_t stream) { \
    GGML_ASSERT(ncols % CUDA_DMMV_BLOCK_SIZE == 0); \
    dequantize_mul_mat_vec_##NAM<<<nrows, CUDA_DMMV_BLOCK_SIZE, 0, stream>>>(vx, y, dst, ncols); \
}

DEFINE_Q_FUNCS(q4_0, CUDA_DMMV_BLOCK_SIZE, QK4_0, QR4_0, dequantize_q4_0)
DEFINE_Q_FUNCS(q4_1, CUDA_DMMV_BLOCK_SIZE, QK4_1, QR4_1, dequantize_q4_1)
DEFINE_Q_FUNCS(q5_0, CUDA_DMMV_BLOCK_SIZE, QK5_0, QR5_0, dequantize_q5_0)
DEFINE_Q_FUNCS(q5_1, CUDA_DMMV_BLOCK_SIZE, QK5_1, QR5_1, dequantize_q5_1)
DEFINE_Q_FUNCS(q8_0, CUDA_DMMV_BLOCK_SIZE, QK8_0, QR8_0, dequantize_q8_0)
DEFINE_Q_FUNCS(f16, CUDA_DMMV_BLOCK_SIZE, 32, 1, convert_f16)

Currently this wouldn't work in ggml-opencl.c because of the way the string is processed into the file. There can't be any defines in there.

@0cc4m (Collaborator, Author) commented May 15, 2023

The runtime compilation of OpenCL gives us the ability to build the kernels on the fly. I plan to add a simple templater that does this to reduce duplication.
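
One way such a templater could work (a purely hypothetical sketch; the placeholder names and helpers below are made up for illustration) is plain string substitution on a generic kernel template, with the generated source then compiled at runtime via clCreateProgramWithSource and clBuildProgram:

#include <string>

// Replace every occurrence of `from` with `to` in `s`.
static std::string replace_all(std::string s, const std::string & from, const std::string & to) {
    for (size_t pos = 0; (pos = s.find(from, pos)) != std::string::npos; pos += to.size()) {
        s.replace(pos, from.size(), to);
    }
    return s;
}

// Generic kernel template with placeholders for the per-type pieces.
static const char * dequant_template = R"(
__kernel void KERNEL_NAME(__global const void * vx, __global float * y) {
    const int i = get_global_id(0);
    DEQUANT_FUNC
}
)";

// Produce the source for one quantization type; the result would then be
// passed to clCreateProgramWithSource + clBuildProgram as usual.
static std::string generate_dequant_kernel(const std::string & name, const std::string & dequant_body) {
    std::string src = dequant_template;
    src = replace_all(src, "KERNEL_NAME", name);
    src = replace_all(src, "DEQUANT_FUNC", dequant_body);
    return src;
}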

@LiliumSancta commented May 15, 2023

Some tests (GPU: RX 6600 8 GB, CPU: 5600X, RAM: 32 GB 3600 MHz).
On Ubuntu I used CUDA with @SlyEcho's ROCm patch; on Windows with OpenCL I couldn't offload as many layers of the model to VRAM because of the VRAM overhead of Windows itself. All runs used the same parameters and seed.

Edit: model Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin

There is an error in CMakeLists.txt: set(GGML_OPENCL_SOURCES ggml-opencl.c ggml-opencl.h) needs to be updated after the change to ggml-opencl.cpp.

OPENCL - Ubuntu

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1032 FP16: 1
llama_model_load_internal: [opencl] offloading 31 layers to GPU
llama_model_load_internal: [opencl] total VRAM used: 7033 MB
llama_init_from_file: kv self size = 1600,00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 5548,81 ms
llama_print_timings: sample time = 180,76 ms / 391 runs ( 0,46 ms per token)
llama_print_timings: prompt eval time = 3289,77 ms / 14 tokens ( 234,98 ms per token)
llama_print_timings: eval time = 59369,83 ms / 390 runs ( 152,23 ms per token)
llama_print_timings: total time = 65191,07 ms

ROCM - CUDA

llama_model_load_internal: [cublas] offloading 31 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 7033 MB
llama_init_from_file: kv self size = 1600,00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 4835,65 ms
llama_print_timings: sample time = 120,23 ms / 278 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 2329,01 ms / 14 tokens ( 166,36 ms per token)
llama_print_timings: eval time = 39577,58 ms / 277 runs ( 142,88 ms per token)
llama_print_timings: total time = 44547,13 ms

OPENCL - WINDOWS

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: AMD Accelerated Parallel Processing Device: gfx1032 FP16: 1
llama_model_load_internal: [opencl] offloading 23 layers to GPU
llama_model_load_internal: [opencl] total VRAM used: 5218 MB
llama_init_from_file: kv self size = 1600.00 MB

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0

llama_print_timings: load time = 9369.12 ms
llama_print_timings: sample time = 74.93 ms / 434 runs ( 0.17 ms per token)
llama_print_timings: prompt eval time = 5275.04 ms / 14 tokens ( 376.79 ms per token)
llama_print_timings: eval time = 89806.45 ms / 433 runs ( 207.41 ms per token)
llama_print_timings: total time = 99290.66 ms

@SlyEcho (Sponsor, Collaborator) commented May 15, 2023

@LiliumSancta, which LLaMa is it?

BTW, add --ignore-eos to always generate the amount of tokens you ask for.

@LiliumSancta:

> @LiliumSancta, which LLaMa is it?
>
> BTW, add --ignore-eos to always generate the amount of tokens you ask for.

Sorry, I forgot to mention: it is Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin.

@0cc4m (Collaborator, Author) commented May 15, 2023

@LiliumSancta Thanks for the test. Your results also show that something is wrong with token processing. I will try to investigate that.

@LiliumSancta:

> @LiliumSancta Thanks for the test. Your results also show that something is wrong with token processing. I will try to investigate that.

I would like to thank all of you for your efforts on this project. Later, when I get off work, I'll redo everything with a 7B model that can be loaded entirely into VRAM, using --ignore-eos to be more consistent.

@daniandtheweb (Contributor):

@0cc4m On WizardLM 7B, offloading all 32 layers, I'm getting 60 ms per token on a 5700 XT. On the wizard-vicuna-13B model I'm only able to offload 27 layers before saturating the VRAM, and I get 157 ms per token. Both models are q5_1 quantized.

wizardlm-7B-q5_1:

llama_print_timings:        load time = 10132.12 ms
llama_print_timings:      sample time =   161.22 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time =  8981.20 ms /   658 tokens (   13.65 ms per token)
llama_print_timings:        eval time = 15348.64 ms /   254 runs   (   60.43 ms per token)
llama_print_timings:       total time = 28500.82 ms

wizard-vicuna-uncensored-13B-q5_1:

llama_print_timings:        load time = 13575.15 ms
llama_print_timings:      sample time =   160.96 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time = 18880.03 ms /   658 tokens (   28.69 ms per token)
llama_print_timings:        eval time = 40117.31 ms /   254 runs   (  157.94 ms per token)
llama_print_timings:       total time = 61474.53 ms

@SlyEcho (Sponsor, Collaborator) commented May 15, 2023

> something is wrong with token processing.

I don't think so necessarily, with only 14 tokens, the timing may be very inaccurate.

But it does show one thing: OpenCL can be as fast as CUDA/ROCm, and also that CLBlast is seriously underwhelming compared to cuBLAS/rocBLAS.

@LiliumSancta:

> I don't think so necessarily, with only 14 tokens, the timing may be very inaccurate.
>
> But it does show one thing: OpenCL can be as fast as CUDA/ROCm, and also that CLBlast is seriously underwhelming compared to cuBLAS/rocBLAS.

Yes, maybe I sent too many layers to the GPU, and I believe that could be affecting processing times, so I intend to redo it with a smaller model. I noticed that CLBlast allows me to send more layers than fit on the GPU without throwing any errors, but it slows everything down a lot.

@skidd-level-100:

The changes compiled fine and the program runs fine, but '--n-gpu-layers N' seems to make no difference. I checked nvidia-smi and there is no VRAM usage difference when I change the number of layers, and it seems no faster.

I am using 'NVIDIA GeForce RTX 3050 Ti Laptop GPU'

prompt times

without --gpu-layers
llama_print_timings: sample time = 84.73 ms / 165 runs ( 0.51 ms per token)
llama_print_timings: prompt eval time = 162.75 ms / 2 tokens ( 81.38 ms per token)
llama_print_timings: eval time = 22549.72 ms / 165 runs ( 136.66 ms per token)

With:
llama_print_timings: sample time = 117.32 ms / 245 runs ( 0.48 ms per token)
llama_print_timings: prompt eval time = 174.76 ms / 2 tokens ( 87.38 ms per token)
llama_print_timings: eval time = 33491.55 ms / 245 runs ( 136.70 ms per token)

Maybe I did something wrong?

@daniandtheweb (Contributor):

Results of OpenAssistant-SFT-7-Llama-30B-q5_1 with a Radeon 5700 XT (8 GB), i7 9700K, 32 GB RAM (3200 MHz)

0 layers:

llama_print_timings:        load time = 25062.67 ms
llama_print_timings:      sample time =   161.38 ms /   256 runs   (    0.63 ms per token)
llama_print_timings: prompt eval time = 40327.40 ms /   658 tokens (   61.29 ms per token)
llama_print_timings:        eval time = 178379.81 ms /   254 runs   (  702.28 ms per token)
llama_print_timings:       total time = 220165.20 ms

15 layers:

llama_print_timings:        load time = 26448.30 ms
llama_print_timings:      sample time =   164.82 ms /   256 runs   (    0.64 ms per token)
llama_print_timings: prompt eval time = 41564.33 ms /   658 tokens (   63.17 ms per token)
llama_print_timings:        eval time = 160101.05 ms /   254 runs   (  630.32 ms per token)
llama_print_timings:       total time = 204397.64 ms

More than 15 layers saturates the VRAM and slows everything down by a lot.

@LiliumSancta:

Under Windows 11 it's fast, but under Linux it's crazy fast.
All tests used the same parameters and a 7B Q4_0 model. I'll compile again on Windows and check that everything is correct, but I believe it's just Windows itself or the AMD drivers, I don't know; unfortunately WSL2 still doesn't support OpenCL.

model llama 7B Q4_0 -n 512 --ctx-size 2048 --gpu_layers 99 --seed 1684109798 -t 6

Windows - OPENCL

short prompt

llama_print_timings: load time = 6061.47 ms
llama_print_timings: sample time = 78.93 ms / 512 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 3134.41 ms / 14 tokens ( 223.89 ms per token)
llama_print_timings: eval time = 53299.86 ms / 511 runs ( 104.31 ms per token)
llama_print_timings: total time = 59480.58 ms

long prompt

llama_print_timings: load time = 11210.65 ms
llama_print_timings: sample time = 78.60 ms / 512 runs ( 0.15 ms per token)
llama_print_timings: prompt eval time = 8198.12 ms / 401 tokens ( 20.44 ms per token)
llama_print_timings: eval time = 58348.64 ms / 511 runs ( 114.19 ms per token)
llama_print_timings: total time = 69680.18 ms

Ubuntu - OPENCL

short prompt

llama_print_timings: load time = 4629,94 ms
llama_print_timings: sample time = 218,57 ms / 512 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 1681,86 ms / 14 tokens ( 120,13 ms per token)
llama_print_timings: eval time = 34416,77 ms / 511 runs ( 67,35 ms per token)
llama_print_timings: total time = 39367,81 ms

long prompt

llama_print_timings: load time = 7222,00 ms
llama_print_timings: sample time = 218,61 ms / 512 runs ( 0,43 ms per token)
llama_print_timings: prompt eval time = 5893,85 ms / 401 tokens ( 14,70 ms per token)
llama_print_timings: eval time = 39798,12 ms / 511 runs ( 77,88 ms per token)
llama_print_timings: total time = 47383,59 ms

Ubuntu - CUDA - ROCM

short prompt

llama_print_timings: load time = 2580,17 ms
llama_print_timings: sample time = 190,11 ms / 512 runs ( 0,37 ms per token)
llama_print_timings: prompt eval time = 1183,21 ms / 14 tokens ( 84,52 ms per token)
llama_print_timings: eval time = 32662,96 ms / 511 runs ( 63,92 ms per token)
llama_print_timings: total time = 35453,89 ms

long prompt

llama_print_timings: load time = 5641,55 ms
llama_print_timings: sample time = 189,71 ms / 512 runs ( 0,37 ms per token)
llama_print_timings: prompt eval time = 4255,51 ms / 401 tokens ( 10,61 ms per token)
llama_print_timings: eval time = 37548,29 ms / 511 runs ( 73,48 ms per token)
llama_print_timings: total time = 43401,08 ms

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@LostRuins (Collaborator) commented May 16, 2023

Tried this out and I have some findings to share:

  1. After switching to the C++ version of CLBlast, I get an "undefined reference to clblast::StatusCode clblast::Gemm" linker error at compile time: see CNugteren/CLBlast#467 (Undefined reference to clblast::StatusCode clblast::Gemm on Windows with GCC with the C++ API). I don't know if it's a bug in their prebuilt Windows binaries or my own user error. I solved it by wrapping the SGEMM calls in a function that uses the C API's CLBlastSgemm, which worked well (a rough sketch of such a wrapper is included below).

  2. I am getting a silent crash on any batch > 32 tokens. For < 32 tokens it runs fine, but as soon as the batch exceeds that, it will not work regardless of how many layers are offloaded. For a 37-token prompt:

ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:2, Size_Type=20  - ne00:4096 ne01:4096 ne10:4096 ne11:37
ggml_cl_mul_mat_q_f32: Type:1, Size_Type=2  - ne00:128 ne01:37 ne10:128 ne11:37

Adding some debug printing to the ggml_cl_mul_mat_q_f32 function shows that it executes 4 times before silently crashing. The program exits with no errors and no asserts (even with debug on).

To solve that, I backported the cl_sgemm_wrapper from the previous implementation, and I modified ggml_cl_can_mul_mat to trigger ONLY for tensors on device, regardless of batch size. That allowed me to run everything with no issues.
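
For reference, here is a rough sketch of the kind of C-API wrapper described in point 1 above, assuming CLBlast's C header clblast_c.h and its CLBlastSgemm entry point; the function name and exact parameter set are illustrative, not the wrapper actually used.

#include <clblast_c.h>

// Keep the call site shaped like clblast::Gemm<float>, but forward to the C API,
// which avoids linking against the C++ symbol that fails to resolve.
static CLBlastStatusCode sgemm_wrapper(
        CLBlastLayout layout, CLBlastTranspose a_transpose, CLBlastTranspose b_transpose,
        size_t m, size_t n, size_t k,
        float alpha,
        cl_mem a, size_t a_offset, size_t a_ld,
        cl_mem b, size_t b_offset, size_t b_ld,
        float beta,
        cl_mem c, size_t c_offset, size_t c_ld,
        cl_command_queue * queue, cl_event * event) {
    return CLBlastSgemm(layout, a_transpose, b_transpose,
                        m, n, k,
                        alpha, a, a_offset, a_ld,
                        b, b_offset, b_ld,
                        beta, c, c_offset, c_ld,
                        queue, event);
}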

These are my benchmarks:

Prompt Processing: 
CUDA with 0 layers = 11ms/T
CUDA with 30 layers = 9ms/T
CLBlast (original SGEMM) = 19ms/T

Generation speed:
CUDA with 0 layers = 200ms/T
CUDA with 30 layers = 86ms/T
CLBlast with 0 layers = 192ms/T
CLBlast with 30 layers = 94ms/T

I am really enjoying the practically 100% speed improvement from this, excluding the prompt-processing regression, which in the worst case could be solved by using it only for batch-size-1 tensors on the device.

@LostRuins (Collaborator) commented May 16, 2023

Can confirm the latest fix solves the crash.

Is anyone else able to reproduce the performance regression in prompt processing? My speed seems equally fast.

@JohannesGaessler (Collaborator):

I implemented more CUDA kernels that I will need in order to fix the memory management issues: #1483. They do not make a difference for performance. My plan is to load the LLaMA parameters directly from disk to VRAM; if all parameters (including the norms) are loaded this way, it will be simpler.

@0cc4m (Collaborator, Author) commented May 16, 2023

I think it is ready for review now. I tested on Nvidia and AMD and it worked in all cases for me.

I have noticed that some (weaker) devices, like my Steam Deck, are slightly slower at prompt processing with the non-contiguous tensor loading, but I haven't found the reason for that yet.

@0cc4m marked this pull request as ready for review on May 16, 2023.
@0cc4m (Collaborator, Author) commented May 16, 2023

But #1435 should definitely be merged first. I'll adapt this PR once that happens.

@SlyEcho (Sponsor, Collaborator) commented May 16, 2023

> But #1435 should definitely be merged first.

I'll get on it.

@LostRuins (Collaborator) commented May 17, 2023

More benchmarks:

AMD Accelerated Parallel Processing Device gfx1030 with FP16 off:

7b_q5_1 clblast- all layers on gpu -  540 token prompt, generating 512 tokens
Time Taken - Processing:4.7s (9ms/T), Generation:41.8s (82ms/T), Total:46.5s

7b_q5_1 clblast 0 gpu layers - 540 token prompt, generating 512 tokens
Time Taken - Processing:7.3s (14ms/T), Generation:127.9s (250ms/T), Total:135.3s

GPT4-X-Alpaca 65b Q4_0 GGML - 30 layers offloaded on GPU, 540 token prompt, generating 178 tokens
Time Taken - Processing:117.4s (217ms/T), Generation:221.8s (1246ms/T), Total:339.2s

GPT4-X-Alpaca 65b Q4_0 GGML - 0 layers offloaded on GPU, 540 token prompt, generating 273 tokens
Time Taken - Processing:55.2s (102ms/T), Generation:471.7s (1728ms/T), Total:526.9s

With FP16 ON, GPT4-X-Alpaca 65b Q4_0 GGML - 30 layers offloaded on GPU
Processing 535 tokens ( 306 ms/T), Generation 258 tokens ( 1365 ms/T)

@LostRuins (Collaborator):

Also, is anyone else having issues with the q8_0 dequantize kernel? I am getting gibberish, but only for that kernel, and only with CLBlast. OpenBLAS works fine, as does CLBlast with the other kernels. The implementation looks different from the CUDA one, but unfortunately I don't understand it well enough to review it.
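
For comparison, the q8_0 format of the time is simple: each 32-element block stores a float scale d followed by 32 signed 8-bit quants, and dequantization is just qs[j] * d. A minimal OpenCL kernel for it, written as a C++ string literal the way ggml-opencl embeds its kernels, would look roughly like the sketch below (a reference point under those assumptions, not the kernel from this PR).

// Hypothetical reference kernel; assumes the float-scale q8_0 block layout.
static const char * dequantize_q8_0_src = R"(
struct block_q8_0 {
    float d;       // scale
    char  qs[32];  // quantized values (signed 8-bit in OpenCL C)
};

__kernel void dequantize_block_q8_0(__global const struct block_q8_0 * x, __global float * y) {
    const uint i = get_global_id(0) / 32;   // block index
    const uint j = get_global_id(0) % 32;   // position within the block
    y[i*32 + j] = x[i].qs[j] * x[i].d;
}
)";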

@0cc4m (Collaborator, Author) commented May 21, 2023

@SlyEcho I rebased this branch onto your changes, I hope I got everything right. Let me know if you see any mistake.

@SlyEcho (Sponsor, Collaborator) commented May 21, 2023

Awesome!

I only have one nitpick with the printf but otherwise it looks very good! 🚀

@SlyEcho (Sponsor, Collaborator) commented May 21, 2023

OK, I found something:

Using a F16 model with --gpu-layers > 0 will result in a segfault.

@0cc4m (Collaborator, Author) commented May 21, 2023

That's an FP16 issue again... I'll take a look tomorrow.

@0cc4m (Collaborator, Author) commented May 22, 2023

@SlyEcho I found the bug. Please check again if it's fixed for you as well.

What nitpick did you have with the printf?

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

Otherwise it seems to work.

@0cc4m (Collaborator, Author) commented May 22, 2023

@SlyEcho We also noticed that selecting platform and device by ID wasn't working as it did before: the platform number did nothing anymore, since only the device number picked the device out of the whole list of devices. I restored the previous behavior for this case, so that both numbers are relevant. I hope that's okay with you; Koboldcpp was already relying on it.

If you disagree with this, let me know. Also if you find a mistake.
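
For anyone following along, the restored behavior as described corresponds roughly to the standard two-step OpenCL enumeration below (a sketch with illustrative names, not the PR's exact code): the platform index selects a platform first, and the device index then selects a device within that platform.

#include <CL/cl.h>

// Pick platform `platform_idx` first, then device `device_idx` within it.
static cl_device_id select_device(unsigned platform_idx, unsigned device_idx) {
    cl_platform_id platforms[16];
    cl_uint n_platforms = 0;
    clGetPlatformIDs(16, platforms, &n_platforms);
    if (platform_idx >= n_platforms || platform_idx >= 16) {
        return nullptr;
    }
    cl_device_id devices[16];
    cl_uint n_devices = 0;
    clGetDeviceIDs(platforms[platform_idx], CL_DEVICE_TYPE_ALL, 16, devices, &n_devices);
    if (device_idx >= n_devices || device_idx >= 16) {
        return nullptr;
    }
    return devices[device_idx];
}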

@github-actions bot left a comment: clang-tidy made some suggestions on ggml-opencl.cpp.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

I hope you don't mind me just pushing changes. I feel it will go faster this way.

@0cc4m (Collaborator, Author) commented May 22, 2023

Thanks, I didn't even see that... Apparently it was too late for the logic side of my brain.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

The device selection logic is good; it should be more logical this way. I just hadn't figured it out myself.

@0cc4m (Collaborator, Author) commented May 22, 2023

I think this PR is ready.

@SlyEcho merged commit 2e6cd4b into ggerganov:master on May 22, 2023. 22 checks passed.

@SlyEcho (Sponsor, Collaborator) commented May 22, 2023

Let's see what happens, I think it's a big improvement anyway.
