Assertion failure in ggml_mul_mat_q4_0_q8_1_cuda (g_compute_capabilities[id] >= MIN_CC_DP4A) #4229

Closed · cebtenzzre opened this issue Nov 27, 2023 · 11 comments · Fixed by #4594
Labels: bug (Something isn't working)

@cebtenzzre
Collaborator

Current Behavior

I got this crash on https://github.com/cebtenzzre/llama.cpp/tree/18fe116e9a5aa45a83bd1d6f043f98dc395f218e:

2023-11-26 20:06:04 INFO:Loaded the model in 9.14 seconds.

GGML_ASSERT: /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5484: false

Failure Information (for bugs)

Backtrace:

#3  0x00007f5999fd54b8 in __GI_abort () at abort.c:79
#4  0x00007f585ac6b357 in ggml_mul_mat_q4_0_q8_1_cuda (stream=<optimized out>, nrows_dst=<optimized out>, nrows_y=<optimized out>, ncols_y=<optimized out>, 
    nrows_x=<optimized out>, ncols_x=<optimized out>, dst=<optimized out>, vy=<optimized out>, vx=<optimized out>)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5076
#5  ggml_cuda_op_mul_mat_q (src0=src0@entry=0x204c00320, src1=src1@entry=0x269123d80, dst=dst@entry=0x269123f00, src0_dd_i=src0_dd_i@entry=0x90be00000 "", 
    src1_ddf_i=src1_ddf_i@entry=0x9b0400000, src1_ddq_i=src1_ddq_i@entry=0x9afe00000 "", dst_dd_i=0x90b420400, row_low=32000, row_high=32032, src1_ncols=512, 
    src1_padded_row_size=5120, stream=@0x7f5878be7fa8: 0x7f5861b127a0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6098
#6  0x00007f585ac641f2 in ggml_cuda_op_mul_mat (src0=0x204c00320, src1=<optimized out>, dst=<optimized out>, 
    op=0x7f585ac6b270 <ggml_cuda_op_mul_mat_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st* const&)>, convert_src1_to_q8_1=true) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6959
#7  0x00007f585ac66023 in ggml_cuda_compute_forward (params=params@entry=0x7f5878be8560, tensor=tensor@entry=0x269123f00)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:7844
#8  0x00007f585ac4606e in ggml_compute_forward (tensor=0x269123f00, params=0x7f5878be8560) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:14503
#9  ggml_graph_compute_thread (data=data@entry=0x7f5878be85e0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16245
#10 0x00007f585ac4862e in ggml_graph_compute (cgraph=0x269000020, cplan=<optimized out>) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16831
#11 0x00007f585ac794b3 in ggml_graph_compute_helper (buf=std::vector of length 0, capacity 0, graph=graph@entry=0x269000020, n_threads=n_threads@entry=1)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:592
#12 0x00007f585ac7c365 in llama_decode_internal (lctx=..., batch=...) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:5194
#13 0x00007f585ac7cac8 in llama_eval (ctx=0x7f586234bff0, tokens=0x7f5862346200, n_tokens=512, n_past=0)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:8842
#14 0x00007f5998def4f6 in ffi_call_unix64 () at ../src/x86/unix64.S:104

Relevant code: https://github.com/cebtenzzre/llama.cpp/blob/18fe116e9a5aa45a83bd1d6f043f98dc395f218e/ggml-cuda.cu#L5054-L5077

It asserts that g_compute_capabilities[id] >= MIN_CC_DP4A (610), where id is the current device. But the value for the current device is 520, which matches my GTX 970:

>>> print id
$10 = 1
>>> print g_compute_capabilities[0]
$11 = 610
>>> print g_compute_capabilities[1]
$12 = 520
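
For context, the launcher linked below bails out with GGML_ASSERT(false) when the current device's compute capability is below MIN_CC_DP4A. Here is a minimal self-contained stand-in for that check using the values from the gdb session; the names mirror ggml-cuda.cu, but this is an illustration, not the library code itself:

#include <cstdio>
#include <cstdlib>

static const int MIN_CC_DP4A = 610;                      // dp4a needs compute capability 6.1+
static const int g_compute_capabilities[] = {610, 520};  // device 0: P40-class, device 1: GTX 970

static void mul_mat_q_launch_stub(int id) {
    if (g_compute_capabilities[id] < MIN_CC_DP4A) {
        // stand-in for the GGML_ASSERT(false) seen in the backtrace
        std::fprintf(stderr, "GGML_ASSERT: device %d has CC %d < %d\n",
                     id, g_compute_capabilities[id], MIN_CC_DP4A);
        std::abort();
    }
    std::printf("device %d: MMQ kernel would launch\n", id);
}

int main() {
    mul_mat_q_launch_stub(0); // CC 610: fine
    mul_mat_q_launch_stub(1); // CC 520: aborts, matching the backtrace above
    return 0;
}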

Steps to Reproduce

I'm not exactly sure how I ran into this issue, because I've been using the same build for weeks without seeing it. It could be an issue with my fork - I should investigate whether the latest llama.cpp is still significantly slower on my GPUs. I still have the coredump handy if any further information would help.

cc @slaren

@slaren
Collaborator

slaren commented Nov 27, 2023

The choice to use mmq or not is made in ggml_cuda_mul_mat. It checks min_compute_capability >= MIN_CC_DP4A before using mmq, so unless there is a bug in the computation of min_compute_capability, I am not sure how this could happen.
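
Roughly, the gate being described reduces to a comparison like the following (a paraphrase using names from this thread, not the exact source):

// Paraphrased gate: MMQ is only selected when the slowest participating
// device supports dp4a (compute capability 6.1, i.e. 610).
static const int MIN_CC_DP4A = 610;

bool should_use_mmq(int min_compute_capability) {
    return min_compute_capability >= MIN_CC_DP4A; // otherwise fall back to dequantize + cuBLAS
}

With more than one GPU, everything then hinges on how min_compute_capability is computed across devices.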

@cebtenzzre
Collaborator Author

cebtenzzre commented Nov 27, 2023

Somehow we got there even though g_tensor_split[1] is 1.0 and g_main_device is 0. So min_compute_capability is computed as 610.
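
If that computation only considers devices whose tensor-split slice is non-empty, then -ts 1,0 hides device 1 from it entirely. A self-contained sketch of that logic with the values from this report (my reconstruction from the discussion, not a verbatim copy of ggml-cuda.cu):

#include <climits>
#include <cstdio>

int main() {
    // Values from this report: device 0 has CC 610 (presumably the P40 mentioned below),
    // device 1 is the GTX 970 (CC 520), and -ts 1,0 gives cumulative split points {0.0, 1.0}.
    const int   g_device_count           = 2;
    const int   g_compute_capabilities[] = {610, 520};
    const float g_tensor_split[]         = {0.0f, 1.0f};

    int min_compute_capability = INT_MAX;
    for (int id = 0; id < g_device_count; ++id) {
        // A device only counts if its slice [split[id], split[id+1]) is non-empty.
        const float next = (id + 1 < g_device_count) ? g_tensor_split[id + 1] : 1.0f;
        if (g_tensor_split[id] < next && g_compute_capabilities[id] < min_compute_capability) {
            min_compute_capability = g_compute_capabilities[id];
        }
    }
    std::printf("min_compute_capability = %d\n", min_compute_capability); // prints 610
    return 0;
}

So the gate sees 610 and takes the MMQ path, yet per the backtrace the kernel still ends up running on device 1 (CC 520), which is what trips the assertion.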

I can reproduce this on the latest master with the following commands:

$ cmake -B build -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_HOST_COMPILER=gcc-12 -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DLLAMA_CUDA_FORCE_MMQ=ON
$ make -C build main
$ build/bin/main -m ~/dirs/text-ai-models/chronos-hermes-13b-v2.Q6_K.gguf -ngl 99 -ts 1,0 -p 'Hello'

I can reproduce this all the way back to d0cee0d, so it's probably an issue with the original implementation of #2506.

One thing to note is that this is specific to the model size - llama-2-13b.Q4_K_S.gguf will not trigger this, even though both the Q4_K_S and the Q6_K models fully fit into the P40's VRAM.

cebtenzzre added the bug label and removed the bug-unconfirmed label on Nov 27, 2023

@cebtenzzre
Collaborator Author

cebtenzzre commented Dec 20, 2023

I'm still hitting this on latest master (799fc22):

GGML_ASSERT: /home/jared/src/forks/llama.cpp/ggml-cuda.cu:6303: false

cc @JohannesGaessler

@JohannesGaessler
Collaborator

I don't have a GTX 9XX GPU, but I edited the code so that one of my GPUs is treated as such. Still, I am not able to reproduce this bug. I don't know what would be wrong with the multi-GPU logic either.

@cebtenzzre
Collaborator Author

> Still, I am not able to reproduce this bug.

Try a few different Q6_K 13B models with various (short) prompts. For some reason it doesn't trigger with every model and prompt, but with the right combination it seems 100% reproducible.

@JohannesGaessler
Collaborator

I tried a bunch of quantization formats and models and I still can't reproduce it. Is the model you were using one of those where the output tensor has 32001 instead of 32000 rows?

@cebtenzzre
Collaborator Author

> I tried a bunch of quantization formats and models and I still can't reproduce it. Is the model you were using one of those where the output tensor has 32001 instead of 32000 rows?

I can reproduce it on a 13B with n_vocab=32032 (chronos-hermes-13b-v2.Q6_K.gguf) and a 20B with n_vocab=32001, but not on a 13B or 20B with n_vocab=32000. So you're right, that's the difference.
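
One hypothetical way to connect that to the row_low=32000, row_high=32032 seen in the backtrace: if the per-device row boundary is rounded down to some alignment, then a vocabulary size that is not a multiple of that alignment leaves a small remainder of rows for the last device even with -ts 1,0. A worked illustration (the rounding value of 64 is an assumption for illustration only, not taken from the source):

#include <cstdio>

// Hypothetical illustration only: assumes the per-device row boundary is rounded
// down to a multiple of `rounding` (64 here is an assumption, not the real value).
int main() {
    const int rounding  = 64;
    const int n_vocab[] = {32000, 32001, 32032};
    for (int nrows : n_vocab) {
        // With -ts 1,0 the nominal boundary for device 1 is nrows itself...
        int row_low  = nrows - nrows % rounding; // ...but rounding can pull it below nrows
        int row_high = nrows;                    // the last device always ends at nrows
        std::printf("n_vocab=%d -> device 1 gets rows [%d, %d): %d rows\n",
                    nrows, row_low, row_high, row_high - row_low);
    }
    return 0;
}

Under that assumption, n_vocab=32000 gives device 1 zero rows (no crash), 32001 gives 1 row, and 32032 gives 32 rows, matching row_low=32000, row_high=32032 in the backtrace. If that is what is happening, the tail rows of the output tensor land on the GTX 970 despite the tensor split, and the MMQ kernel then asserts because that GPU lacks dp4a.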
