Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3080 Ti Laptop GPU, compute capability 8.6, VMM: yes
version: 6727 (cdb6da46)
built with cc (GCC) 15.2.1 20250813 for x86_64-pc-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
RTX 3080 Ti Laptop GPU
Models
Two granite models from unsloth:
~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-micro-UD-Q6_K_XL.gguf
~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-micro-UD-Q6_K_XL.gguf
Problem description & steps to reproduce
If I run the server as `~/src/llama.cpp/cuda.build/bin/llama-server --port 10000 -m ./granite-4.0-h-tiny-Q4_K_M.gguf`,
the following sample request (based on a request from llama.vim) crashes it:
curl -X POST http://localhost:10000/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "\n",
"input_suffix": "\n\n",
"response_fields": ["content"],
"prompt": "def ",
"top_p": 0.9,
"input_extra": [],
"samplers": ["top_p", "infill"],
"n_predict": 128,
"model": ""
}'
llama-server crashes (see the log below).
It doesn't always crash, for example in the middle of real Python source code, or with the top_p sampler removed:
curl -X POST http://localhost:10000/infill \
-H "Content-Type: application/json" \
-d '{
"input_prefix": "\n",
"input_suffix": "\n\n",
"response_fields": ["content"],
"prompt": "def ",
"top_p": 0.9,
"input_extra": [],
"samplers": ["infill"],
"n_predict": 128,
"model": ""
}'
(returns {"content":""})
First Bad Commit
I don't know which commit introduced this, but the magic number 18446744073709551615 appears in several issues; one that is still open is #15711. I haven't tested whether they are related.
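For what it's worth, 18446744073709551615 is SIZE_MAX, i.e. what a signed -1 becomes when converted to an unsigned 64-bit index. Here is a minimal sketch (my assumption about the failure mode, not verified against the actual token_get_attr code path) that reproduces the exact exception text seen in the log:

#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    // Vector sized like the vocab of this model (100352 tokens per the log).
    std::vector<int> id_to_attr(100352);

    // Assumption: a token id of -1 (e.g. "no token") reaches a bounds-checked
    // lookup. Converting -1 to size_t yields 18446744073709551615.
    int token = -1;
    try {
        id_to_attr.at(static_cast<std::size_t>(token));
    } catch (const std::out_of_range &e) {
        // libstdc++ prints the same message as the what() line in the crash log:
        // vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 100352)
        std::printf("%s\n", e.what());
    }
    return 0;
}

That matches the what() line at the bottom of the log, so my guess is that an invalid or unset token id (-1) is reaching the vocab lookup.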
Relevant log output
<|start_of_role|>assistant<|end_of_role|>'
main: server is listening on http://127.0.0.1:10000 - starting the main loop
srv update_slots: all slots are idle
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 0 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 7, n_tokens = 7
[New LWP 1159636]
[New LWP 1159635]
[New LWP 1159634]
[New LWP 1159633]
[New LWP 1159632]
[New LWP 1159631]
[New LWP 1159630]
[New LWP 1159629]
[New LWP 1159628]
[New LWP 1159627]
[New LWP 1159626]
[New LWP 1159625]
[New LWP 1159624]
[New LWP 1159623]
[New LWP 1159622]
[New LWP 1159621]
[New LWP 1159620]
[New LWP 1159619]
[New LWP 1159618]
[New LWP 1159617]
[New LWP 1159616]
[New LWP 1159615]
[New LWP 1159614]
[New LWP 1159609]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#0 0x00007f7208c9f042 in ?? () from /usr/lib/libc.so.6
#1 0x00007f7208c931ac in ?? () from /usr/lib/libc.so.6
#2 0x00007f7208c931f4 in ?? () from /usr/lib/libc.so.6
#3 0x00007f7208d03dcf in wait4 () from /usr/lib/libc.so.6
#4 0x00007f7210a21b2b in ggml_print_backtrace () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#5 0x00007f7210a33ac9 in ggml_uncaught_exception() () from /home/fella/src/llama.cpp/cuda.build/bin/libggml-base.so
#6 0x00007f72090b1eba in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:48
warning: 48 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc: No such file or directory
#7 0x00007f72090975d9 in std::terminate () at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:58
58 in /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_terminate.cc
#8 0x00007f72090b2176 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7f7209284d78 <typeinfo for std::out_of_range>, dest=0x7f72090c9d00 <std::out_of_range::~out_of_range()>) at /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc:98
warning: 98 /usr/src/debug/gcc/gcc/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory
#9 0x00007f720909bf2a in std::__throw_out_of_range_fmt (__fmt=<optimized out>) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc:101
warning: 101 /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/functexcept.cc: No such file or directory
#10 0x00007f7210c754e6 in llama_vocab::impl::token_get_attr(int) const [clone .cold] () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#11 0x00007f7210d9a89c in llama_vocab::impl::token_to_piece(int, char*, int, int, bool) const () from /home/fella/src/llama.cpp/cuda.build/bin/libllama.so
#12 0x0000556e6552198d in common_token_to_piece[abi:cxx11](llama_vocab const*, int, bool) ()
#13 0x0000556e65521a4f in common_token_to_piece[abi:cxx11](llama_context const*, int, bool) ()
#14 0x0000556e65408646 in server_context::update_slots() ()
#15 0x0000556e653cb7d8 in server_queue::start_loop() ()
#16 0x0000556e65384b73 in main ()
[Inferior 1 (process 1159608) detached]
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 100352)
fish: Job 1, '~/src/llama.cpp/cuda.build/bin/…' terminated by signal SIGABRT (Abort)