Assertion failure in ggml_mul_mat_q4_0_q8_1_cuda (g_compute_capabilities[id] >= MIN_CC_DP4A) #4229

Closed · cebtenzzre opened this issue Nov 27, 2023 · 11 comments · Fixed by #4594
Labels: bug (Something isn't working)

@cebtenzzre
Collaborator

Current Behavior

I got this crash on https://github.com/cebtenzzre/llama.cpp/tree/18fe116e9a5aa45a83bd1d6f043f98dc395f218e:

2023-11-26 20:06:04 INFO:Loaded the model in 9.14 seconds.

GGML_ASSERT: /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5484: false

Failure Information (for bugs)

Backtrace:

#3  0x00007f5999fd54b8 in __GI_abort () at abort.c:79
#4  0x00007f585ac6b357 in ggml_mul_mat_q4_0_q8_1_cuda (stream=<optimized out>, nrows_dst=<optimized out>, nrows_y=<optimized out>, ncols_y=<optimized out>, 
    nrows_x=<optimized out>, ncols_x=<optimized out>, dst=<optimized out>, vy=<optimized out>, vx=<optimized out>)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:5076
#5  ggml_cuda_op_mul_mat_q (src0=src0@entry=0x204c00320, src1=src1@entry=0x269123d80, dst=dst@entry=0x269123f00, src0_dd_i=src0_dd_i@entry=0x90be00000 "", 
    src1_ddf_i=src1_ddf_i@entry=0x9b0400000, src1_ddq_i=src1_ddq_i@entry=0x9afe00000 "", dst_dd_i=0x90b420400, row_low=32000, row_high=32032, src1_ncols=512, 
    src1_padded_row_size=5120, stream=@0x7f5878be7fa8: 0x7f5861b127a0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6098
#6  0x00007f585ac641f2 in ggml_cuda_op_mul_mat (src0=0x204c00320, src1=<optimized out>, dst=<optimized out>, 
    op=0x7f585ac6b270 <ggml_cuda_op_mul_mat_q(ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st* const&)>, convert_src1_to_q8_1=true) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:6959
#7  0x00007f585ac66023 in ggml_cuda_compute_forward (params=params@entry=0x7f5878be8560, tensor=tensor@entry=0x269123f00)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml-cuda.cu:7844
#8  0x00007f585ac4606e in ggml_compute_forward (tensor=0x269123f00, params=0x7f5878be8560) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:14503
#9  ggml_graph_compute_thread (data=data@entry=0x7f5878be85e0) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16245
#10 0x00007f585ac4862e in ggml_graph_compute (cgraph=0x269000020, cplan=<optimized out>) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/ggml.c:16831
#11 0x00007f585ac794b3 in ggml_graph_compute_helper (buf=std::vector of length 0, capacity 0, graph=graph@entry=0x269000020, n_threads=n_threads@entry=1)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:592
#12 0x00007f585ac7c365 in llama_decode_internal (lctx=..., batch=...) at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:5194
#13 0x00007f585ac7cac8 in llama_eval (ctx=0x7f586234bff0, tokens=0x7f5862346200, n_tokens=512, n_past=0)
    at /home/jared/src/forks/llama-cpp-python/vendor/llama.cpp/llama.cpp:8842
#14 0x00007f5998def4f6 in ffi_call_unix64 () at ../src/x86/unix64.S:104

Relevant code: https://github.com/cebtenzzre/llama.cpp/blob/18fe116e9a5aa45a83bd1d6f043f98dc395f218e/ggml-cuda.cu#L5054-L5077

It asserts that g_compute_capabilities[id] >= MIN_CC_DP4A (610), where id is the current device. But the value for the current device is 520, which matches my GTX 970:

>>> print id
$10 = 1
>>> print g_compute_capabilities[0]
$11 = 610
>>> print g_compute_capabilities[1]
$12 = 520
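
For context, the launcher linked below bails out with GGML_ASSERT(false) when the current device's compute capability is below MIN_CC_DP4A. Here is a minimal self-contained stand-in for that check using the values from the gdb session; the names mirror ggml-cuda.cu, but this is an illustration, not the library code itself:

#include <cstdio>
#include <cstdlib>

static const int MIN_CC_DP4A = 610;                      // dp4a needs compute capability 6.1+
static const int g_compute_capabilities[] = {610, 520};  // device 0: P40-class, device 1: GTX 970

static void mul_mat_q_launch_stub(int id) {
    if (g_compute_capabilities[id] < MIN_CC_DP4A) {
        // stand-in for the GGML_ASSERT(false) seen in the backtrace
        std::fprintf(stderr, "GGML_ASSERT: device %d has CC %d < %d\n",
                     id, g_compute_capabilities[id], MIN_CC_DP4A);
        std::abort();
    }
    std::printf("device %d: MMQ kernel would launch\n", id);
}

int main() {
    mul_mat_q_launch_stub(0); // CC 610: fine
    mul_mat_q_launch_stub(1); // CC 520: aborts, matching the backtrace above
    return 0;
}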

Steps to Reproduce

I'm not exactly sure how I ran into this issue, because I've been using the same build for weeks without seeing it. It could be an issue with my fork - I should investigate whether the latest llama.cpp is still significantly slower on my GPUs. I still have the coredump handy if any further information would help.

cc @slaren

@slaren
Collaborator

slaren commented Nov 27, 2023

The choice to use mmq or not is made in ggml_cuda_mul_mat. It checks min_compute_capability >= MIN_CC_DP4A before using mmq, so unless there is a bug in the computation of min_compute_capability, I am not sure how this could happen.
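
Roughly, the gate being described reduces to a comparison like the following (a paraphrase using names from this thread, not the exact source):

// Paraphrased gate: MMQ is only selected when the slowest participating
// device supports dp4a (compute capability 6.1, i.e. 610).
static const int MIN_CC_DP4A = 610;

bool should_use_mmq(int min_compute_capability) {
    return min_compute_capability >= MIN_CC_DP4A; // otherwise fall back to dequantize + cuBLAS
}

With more than one GPU, everything then hinges on how min_compute_capability is computed across devices.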

@cebtenzzre
Collaborator Author

cebtenzzre commented Nov 27, 2023

Somehow we got there even though g_tensor_split[1] is 1.0 and g_main_device is 0. So min_compute_capability is computed as 610.
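
If that computation only considers devices whose tensor-split slice is non-empty, then -ts 1,0 hides device 1 from it entirely. A self-contained sketch of that logic with the values from this report (my reconstruction from the discussion, not a verbatim copy of ggml-cuda.cu):

#include <climits>
#include <cstdio>

int main() {
    // Values from this report: device 0 has CC 610 (presumably the P40 mentioned below),
    // device 1 is the GTX 970 (CC 520), and -ts 1,0 gives cumulative split points {0.0, 1.0}.
    const int   g_device_count           = 2;
    const int   g_compute_capabilities[] = {610, 520};
    const float g_tensor_split[]         = {0.0f, 1.0f};

    int min_compute_capability = INT_MAX;
    for (int id = 0; id < g_device_count; ++id) {
        // A device only counts if its slice [split[id], split[id+1]) is non-empty.
        const float next = (id + 1 < g_device_count) ? g_tensor_split[id + 1] : 1.0f;
        if (g_tensor_split[id] < next && g_compute_capabilities[id] < min_compute_capability) {
            min_compute_capability = g_compute_capabilities[id];
        }
    }
    std::printf("min_compute_capability = %d\n", min_compute_capability); // prints 610
    return 0;
}

So the gate sees 610 and takes the MMQ path, yet per the backtrace the kernel still ends up running on device 1 (CC 520), which is what trips the assertion.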

I can reproduce this on the latest master with the following commands:

$ cmake -B build -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_HOST_COMPILER=gcc-12 -DCMAKE_C_COMPILER=gcc-12 -DCMAKE_CXX_COMPILER=g++-12 -DLLAMA_CUDA_FORCE_MMQ=ON
$ make -C build main
$ build/bin/main -m ~/dirs/text-ai-models/chronos-hermes-13b-v2.Q6_K.gguf -ngl 99 -ts 1,0 -p 'Hello'

I can reproduce this all the way back to d0cee0d, so it's probably an issue with the original implementation of #2506.

One thing to note is that this is specific to the model size - llama-2-13b.Q4_K_S.gguf will not trigger this, even though both the Q4_K_S and the Q6_K models fully fit into the P40's VRAM.

cebtenzzre added the bug label and removed the bug-unconfirmed label on Nov 27, 2023

@cebtenzzre
Collaborator Author

cebtenzzre commented Dec 20, 2023

I'm still hitting this on latest master (799fc22):

GGML_ASSERT: /home/jared/src/forks/llama.cpp/ggml-cuda.cu:6303: false

cc @JohannesGaessler

@JohannesGaessler
Collaborator

I don't have a GTX 9XX GPU, but I edited the code so that one of my GPUs is treated as such. Still, I am not able to reproduce this bug. I don't know what would be wrong with the multi-GPU logic either.

@cebtenzzre
Collaborator Author

> Still, I am not able to reproduce this bug.

Try a few different Q6_K 13B models with various (short) prompts. For some reason it doesn't trigger with every model and prompt, but with the right combination it seems 100% reproducible.

@JohannesGaessler
Collaborator

I tried a bunch of quantization formats and models and I still can't reproduce it. Is the model you were using one of those where the output tensor has 32001 instead of 32000 rows?

@cebtenzzre
Collaborator Author

> I tried a bunch of quantization formats and models and I still can't reproduce it. Is the model you were using one of those where the output tensor has 32001 instead of 32000 rows?

I can reproduce it on a 13B with n_vocab=32032 (chronos-hermes-13b-v2.Q6_K.gguf) and a 20B with n_vocab=32001, but not on a 13B or 20B with n_vocab=32000. So you're right, that's the difference.
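
One hypothetical way to connect that to the row_low=32000, row_high=32032 seen in the backtrace: if the per-device row boundary is rounded down to some alignment, then a vocabulary size that is not a multiple of that alignment leaves a small remainder of rows for the last device even with -ts 1,0. A worked illustration (the rounding value of 64 is an assumption for illustration only, not taken from the source):

#include <cstdio>

// Hypothetical illustration only: assumes the per-device row boundary is rounded
// down to a multiple of `rounding` (64 here is an assumption, not the real value).
int main() {
    const int rounding  = 64;
    const int n_vocab[] = {32000, 32001, 32032};
    for (int nrows : n_vocab) {
        // With -ts 1,0 the nominal boundary for device 1 is nrows itself...
        int row_low  = nrows - nrows % rounding; // ...but rounding can pull it below nrows
        int row_high = nrows;                    // the last device always ends at nrows
        std::printf("n_vocab=%d -> device 1 gets rows [%d, %d): %d rows\n",
                    nrows, row_low, row_high, row_high - row_low);
    }
    return 0;
}

Under that assumption, n_vocab=32000 gives device 1 zero rows (no crash), 32001 gives 1 row, and 32032 gives 32 rows, matching row_low=32000, row_high=32032 in the backtrace. If that is what is happening, the tail rows of the output tensor land on the GTX 970 despite the tensor split, and the MMQ kernel then asserts because that GPU lacks dp4a.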
