Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 6727 (cdb6da46)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3080 Ti Laptop GPU
Models
Two granite models from unsloth:
~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-micro-UD-Q6_K_XL.gguf
~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-micro-UD-Q6_K_XL.gguf
Problem description & steps to reproduce
If I run the server as `~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-tiny-Q4_K_M.gguf`,
the following sample request (based on a request from llama.vim) crashes it:
curl -X POST http://localhost:10000/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "\n",
"input_suffix": "\n\n",
"response_fields": ["content"],
"prompt": "def ",
"top_p": 0.9,
"input_extra": [],
"samplers": ["top_p", "infill"],
"n_predict": 128,
"model": ""
}'
llama-server crashes (see the log below).
It doesn't always crash, for example in the middle of real Python source code, or with the top_p sampler removed:
curl -X POST http://localhost:10000/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "\n",
"input_suffix": "\n\n",
"response_fields": ["content"],
"prompt": "def ",
"top_p": 0.9,
"input_extra": [],
"samplers": ["infill"],
"n_predict": 128,
"model": ""
}'
(returns {"content":""})
First Bad Commit
I don't know which commit introduced this, but the magic number 18446744073709551615 appears in several issues; one that is still open is #15711. I haven't tested whether they are related.
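For what it's worth, 18446744073709551615 is SIZE_MAX, i.e. what a signed -1 becomes when converted to an unsigned 64-bit index. Here is a minimal sketch (my assumption about the failure mode, not verified against the actual token_get_attr code path) that reproduces the exact exception text seen in the log:

#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    // Vector sized like the vocab of this model (100352 tokens per the log).
    std::vector<int> id_to_attr(100352);

    // Assumption: a token id of -1 (e.g. "no token") reaches a bounds-checked
    // lookup. Converting -1 to size_t yields 18446744073709551615.
    int token = -1;
    try {
        id_to_attr.at(static_cast<std::size_t>(token));
    } catch (const std::out_of_range &e) {
        // libstdc++ prints the same message as the what() line in the crash log:
        // vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 100352)
        std::printf("%s\n", e.what());
    }
    return 0;
}

That matches the what() line at the bottom of the log, so my guess is that an invalid or unset token id (-1) is reaching the vocab lookup.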
Relevant log output
<|start_of_role|>assistant<|end_of_role|>'
main: server is listening on http://127.0.0.1:10000 - starting the main loop
srv update_slots: all slots are idle
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 7, n_tokens = 7
[New LWP 1159636]
[New LWP 1159635]
[New LWP 1159634]
[New LWP 1159633]
[New LWP 1159632]
[New LWP 1159631]
[New LWP 1159630]
[New LWP 1159629]
[New LWP 1159628]
[New LWP 1159627]
[New LWP 1159626]
[New LWP 1159625]
[New LWP 1159624]
[New LWP 1159623]
[New LWP 1159622]
[New LWP 1159621]
[New LWP 1159620]
[New LWP 1159619]
[New LWP 1159618]
[New LWP 1159617]
[New LWP 1159616]
[New LWP 1159615]
[New LWP 1159614]
[New LWP 1159609]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#0 0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#1 0x00007f7208c931ac in ?? () from /usr/lib/libc.so.6
#2 0x00007f7208c931f4 in ?? () from /usr/lib/libc.so.6
#3 0x00007f7208d03dcf in wait4 () from /usr/lib/libc.so.6
#4 0x00007f7210a21b2b in ggml_print_backtrace () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#5 0x00007f7210a33ac9 in ggml_uncaught_exception() () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#6 0x00007f72090b1eba in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
warning: 48 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc: No such file or directory
#7 0x00007f72090975d9 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
58 in /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc
#8 0x00007f72090b2176 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7f7209284d78 <typeinfo for std::out_of_range>, dest=0x7f72090c9d00 <std::out_of_range::~out_of_range()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
warning: 98 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory
#9 0x00007f720909bf2a in std::__throw_out_of_range_fmt (__fmt=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc:101
warning: 101 /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc: No such file or directory
#10 0x00007f7210c754e6 in llama_vocab::impl::token_get_attr(int) const [clone .cold] () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#11 0x00007f7210d9a89c in llama_vocab::impl::token_to_piece(int, char*, int, int, bool) const () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#12 0x0000556e6552198d in common_token_to_piece[abi:cxx11](llama_vocab const*, int, bool) ()
#13 0x0000556e65521a4f in common_token_to_piece[abi:cxx11](llama_context const*, int, bool) ()
#14 0x0000556e65408646 in server_context::update_slots() ()
#15 0x0000556e653cb7d8 in server_queue::start_loop() ()
#16 0x0000556e65384b73 in main ()
[Inferior 1 (process 1159608) detached]
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 100352)
fish: Job 1, '~/src/llama.cpp/cuda.build/bin/…' terminated by signal SIGABRT (Abort)