Infill bug: llama-server can crash depending on top_p and input_prefix #16498

@Maykeye

Description

Name and Version

llama-server --version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 6727 (cdb6da46)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3080 Ti Laptop GPU

Models

Two granite models from unsloth:

~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-micro-UD-Q6_K_XL.gguf
~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-micro-UD-Q6_K_XL.gguf

Problem description & steps to reproduce

If I run the server as

~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-tiny-Q4_K_M.gguf

then the following request (based on the one llama.vim sends) crashes it:

curl -X POST http://localhost:10000/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "\n",
    "input_suffix": "\n\n",
    "response_fields": ["content"],
    "prompt": "def ",
    "top_p": 0.9,
    "input_extra": [],
    "samplers": ["top_p", "infill"],
    "n_predict": 128,
    "model": ""
  }'

llama-server crashes (see the log below).

It doesn't always crash, though, for example in the middle of real Python source code, or with the top_p sampler removed:

curl -X POST http://localhost:10000/infill \
  -H "Content-Type: application/json" \
  -d '{
    "input_prefix": "\n",
    "input_suffix": "\n\n",
    "response_fields": ["content"],
    "prompt": "def ",
    "top_p": 0.9,
    "input_extra": [],
    "samplers": ["infill"],
    "n_predict": 128,
    "model": ""
  }'

(returns {"content":""})
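
For repeated testing, both requests can also be sent from a short script. This is just a convenience sketch using only the Python standard library, assuming the server is listening on http://localhost:10000 as above:

import json
import urllib.request

def infill(samplers):
    # Same payload as the curl examples above; only the samplers list varies.
    payload = {
        "input_prefix": "\n",
        "input_suffix": "\n\n",
        "response_fields": ["content"],
        "prompt": "def ",
        "top_p": 0.9,
        "input_extra": [],
        "samplers": samplers,
        "n_predict": 128,
        "model": "",
    }
    req = urllib.request.Request(
        "http://localhost:10000/infill",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(infill(["infill"]))           # returns {"content": ""}
print(infill(["top_p", "infill"]))  # the server crashes before answering, so this raises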

First Bad Commit

I don't know about the first bad commit, but the magic number 18446744073709551615 appears in several issues; a still-open one is #15711. I haven't tested whether they are related.
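
For what it's worth, 18446744073709551615 is 2^64 - 1, i.e. -1 reinterpreted as an unsigned 64-bit integer, so my unverified guess is that an invalid token id of -1 ends up used as an index inside token_get_attr, whose container size 100352 shows up in the what() message below. A tiny illustration of where the number comes from (not llama.cpp code, the token id here is hypothetical):

import ctypes

token_id = -1                            # hypothetical invalid/sentinel token id
index = ctypes.c_uint64(token_id).value  # what a size_t index would see
print(index)                             # 18446744073709551615
print(index == 2**64 - 1)                # True
print(index >= 100352)                   # True -> the out_of_range in the log below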

Relevant log output

<|start_of_role|>assistant<|end_of_role|>'
main: server is listening on http://127.0.0.1:10000 - starting the main loop
srv  update_slots: all slots are idle
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id  0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 7, n_tokens = 7
[New LWP 1159636]
[New LWP 1159635]
[New LWP 1159634]
[New LWP 1159633]
[New LWP 1159632]
[New LWP 1159631]
[New LWP 1159630]
[New LWP 1159629]
[New LWP 1159628]
[New LWP 1159627]
[New LWP 1159626]
[New LWP 1159625]
[New LWP 1159624]
[New LWP 1159623]
[New LWP 1159622]
[New LWP 1159621]
[New LWP 1159620]
[New LWP 1159619]
[New LWP 1159618]
[New LWP 1159617]
[New LWP 1159616]
[New LWP 1159615]
[New LWP 1159614]
[New LWP 1159609]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#0  0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#1  0x00007f7208c931ac in ?? () from /usr/lib/libc.so.6
#2  0x00007f7208c931f4 in ?? () from /usr/lib/libc.so.6
#3  0x00007f7208d03dcf in wait4 () from /usr/lib/libc.so.6
#4  0x00007f7210a21b2b in ggml_print_backtrace () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#5  0x00007f7210a33ac9 in ggml_uncaught_exception() () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#6  0x00007f72090b1eba in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
warning: 48     /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc: No such file or directory
#7  0x00007f72090975d9 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
58      in /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc
#8  0x00007f72090b2176 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7f7209284d78 <typeinfo for std::out_of_range>, dest=0x7f72090c9d00 <std::out_of_range::~out_of_range()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
warning: 98     /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory
#9  0x00007f720909bf2a in std::__throw_out_of_range_fmt (__fmt=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc:101
warning: 101    /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc: No such file or directory
#10 0x00007f7210c754e6 in llama_vocab::impl::token_get_attr(int) const [clone .cold] () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#11 0x00007f7210d9a89c in llama_vocab::impl::token_to_piece(int, char*, int, int, bool) const () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#12 0x0000556e6552198d in common_token_to_piece[abi:cxx11](llama_vocab const*, int, bool) ()
#13 0x0000556e65521a4f in common_token_to_piece[abi:cxx11](llama_context const*, int, bool) ()
#14 0x0000556e65408646 in server_context::update_slots() ()
#15 0x0000556e653cb7d8 in server_queue::start_loop() ()
#16 0x0000556e65384b73 in main ()
[Inferior 1 (process 1159608) detached]
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 100352)
fish: Job 1, '~/src/llama.cpp/cuda.build/bin/…' terminated by signal SIGABRT (Abort)
