Description
EDIT: Apologies for marking this as a bug; I probably should have marked it as an enhancement - feel free to re-tag.
Name and Version
❯ ./llama.cpp/build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 7236 (a2b0fe8)
built with cc (Gentoo 13.4.1_p20250807 p8) 13.4.1 20250807 for x86_64-pc-linux-gnu
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --port 8000 -fa on -c 8192 -ngl 99 -cram -1 -m /models/Ministral-3-14B-Reasoning-2512-UD-Q6_K_XL.gguf
Problem description & steps to reproduce
The new Ministral 3 reasoning models use a (I believe non-standard) content[].type of "thinking", which is used to pass reasoning traces from previous turns back to the model:
https://huggingface.co/mistralai/Ministral-3-14B-Reasoning-2512/blob/main/chat_template.jinja#L94
However, llama-server currently rejects this with an error:
srv operator(): got exception: {"error":{"code":500,"message":"unsupported content[].type","type":"server_error"}}
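For reference, a request of roughly this shape reproduces the error. This is only a minimal sketch against llama-server's OpenAI-compatible /v1/chat/completions endpoint (port 8000 as in the command line above); the exact layout of the "thinking" content part follows my reading of the chat template and may not match it exactly:

# Hypothetical repro sketch: an assistant turn whose content array carries a
# "thinking" part (reasoning trace) alongside a normal "text" part.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Why is the sky blue?"},
      {"role": "assistant", "content": [
        {"type": "thinking", "thinking": "Rayleigh scattering favours shorter wavelengths..."},
        {"type": "text", "text": "Because blue light is scattered more strongly by the atmosphere."}
      ]},
      {"role": "user", "content": "And why are sunsets red?"}
    ]
  }'

With the current server, the second message's "thinking" part is what hits the "unsupported content[].type" path rather than being passed through to the chat template.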
Is this something that should/could be fixed on the llama-server side? Or is there a more standardized way of passing reasoning traces back that we should instead encourage people converting the Ministral 3 GGUFs to support in the chat template?
First Bad Commit
N/A, not a regression