
Eval bug: recent updates cause /infill requests on the Qwen2.5-Coder model to fail and ultimately crash the server #17260

@thomasbergersen

Description

Name and Version

llama-cli --version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce GTX 1660, compute capability 7.5, VMM: yes
version: 7046 (879dec3)
built with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

NVIDIA GeForce RTX 5080

Models

Qwen2.5-Coder-3B-Q8_0.gguf

Problem description & steps to reproduce

When using the llama-vscode plugin for automatic fill-in-the-middle completion, the resulting /infill requests occasionally fail with HTTP 500 and eventually crash llama-server.
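For reference, a single request like the plugin's can be sent to the server directly. The sketch below is a hypothetical manual reproduction, assuming llama-server is listening on http://localhost:8080 with the same Qwen2.5-Coder model loaded; the prefix/suffix contents and the port are placeholders, not taken from the plugin's actual traffic.

```python
# Hypothetical manual reproduction, assuming llama-server on localhost:8080
# with Qwen2.5-Coder-3B-Q8_0.gguf loaded. The prefix/suffix text is a placeholder.
import requests

payload = {
    "input_prefix": "def add(a, b):\n    ",      # code before the cursor
    "input_suffix": "\n\nprint(add(1, 2))\n",    # code after the cursor
    "n_predict": 64,                             # limit the completion length
    "cache_prompt": True,                        # reuse cached prompt tokens between requests
}

# Repeatedly editing the prefix/suffix and re-sending, as the plugin does,
# exercises the prompt-cache reuse path shown in the log below.
r = requests.post("http://localhost:8080/infill", json=payload, timeout=30)
print(r.status_code, r.text)
```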

First Bad Commit

No response

Relevant log output

slot update_slots: id  0 | task 458 | reusing chunk with size 1, shifting KV cache [505, 506) -> [24, 25)
slot update_slots: id  0 | task 458 | reusing chunk with size 4969, shifting KV cache [820, 5789) -> [25, 4994)
srv          stop: cancel task, id_task = 450
srv          stop: cancel task, id_task = 455
srv  log_server_r: request: POST /infill 172.18.0.1 500
srv  log_server_r: request: POST /infill 172.18.0.1 500
srv  log_server_r: request: POST /infill 172.18.0.1 500
srv  log_server_r: request: POST /infill 172.18.0.1 500
slot update_slots: id  0 | task 458 | n_tokens = 4994, memory_seq_rm [4994, end)
slot update_slots: id  0 | task 458 | prompt processing progress, n_tokens = 5666, batch.n_tokens = 672, progress = 1.000000
slot update_slots: id  0 | task 458 | prompt done, n_tokens = 5666, batch.n_tokens = 672
init: the tokens of sequence 0 in the input batch have inconsistent sequence positions:
 - the last position stored in the memory module of the context (i.e. the KV cache) for sequence 0 is X = 4784
 - the tokens for sequence 0 in the input batch have a starting position of Y = 4994
 it is required that the sequence positions remain consecutive: Y = X + 1
decode: failed to initialize batch
llama_decode: failed to decode, ret = -1
srv  update_slots: Invalid input batch. i = 0, n_batch = 2048, ret = -1
srv    send_error: task id = 458, error: Invalid input batch.
slot      release: id  0 | task 458 | stop processing: n_tokens = 5666, truncated = 0
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 458
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /infill 172.18.0.1 500
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = 14806193847
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> dist
slot launch_slot_: id  1 | task 465 | processing task
slot update_slots: id  1 | task 465 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 5752
slot update_slots: id  1 | task 465 | n_past = 22, slot.prompt.tokens.size() = 350, seq_id = 1, pos_min = -1
/opt/llama.cpp/tools/server/server.cpp:3747: pos_min == -1, but n_past > 0 - should not happen: https://github.com/ggml-org/llama.cpp/pull/13833#discussion_r2116181237
/usr/local/lib/libggml-base.so.0(+0x16298)[0x7f3bbb94d298]
/usr/local/lib/libggml-base.so.0(ggml_print_backtrace+0x1e4)[0x7f3bbb94d664]
/usr/local/lib/libggml-base.so.0(ggml_abort+0x11e)[0x7f3bbb94d7ee]
llama-server(+0xccf9f)[0x55b10b371f9f]
llama-server(+0x9c1e9)[0x55b10b3411e9]
llama-server(+0x6b7c9)[0x55b10b3107c9]
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f3bbb46024a]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f3bbb460305]
llama-server(+0x6d501)[0x55b10b312501]
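Reading the log: the 500 responses come from llama_decode rejecting the batch because the first position of the new tokens (Y = 4994) does not directly follow the last position kept in the KV cache for that sequence (X = 4784), and the final abort is the pos_min == -1 sanity check at server.cpp:3747. The sketch below only restates that consecutiveness rule with the numbers from the log; it is illustrative, not llama.cpp source.

```python
# Illustrative sketch of the consecutiveness rule reported in the log,
# not actual llama.cpp code. X is the last position stored in the KV cache
# for the sequence, Y is the starting position of the incoming batch.
def check_batch_positions(kv_last_pos: int, batch_start_pos: int) -> None:
    x, y = kv_last_pos, batch_start_pos
    if y != x + 1:
        raise ValueError(
            f"inconsistent sequence positions: last KV position X = {x}, "
            f"batch starts at Y = {y}, but Y = X + 1 is required"
        )

# Numbers from the log above: the cache ends at 4784 while the batch starts
# at 4994, so the batch is rejected and the request returns HTTP 500.
check_batch_positions(4784, 4994)
```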
