Description
Name and Version
/home/sysops/llama.cpp/build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
version: 7062 (9b17d74)
built with gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Device 0 [Tesla P40] PCIe GEN 3@ 8x
Device 1 [Tesla P40] PCIe GEN 3@16x
Models
hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
Problem description & steps to reproduce
When using the llama.cpp server (via the OpenAI-compatible API, /v1/chat/completions), if the total number of tokens in the prompt exceeds the --ctx-size limit, the server stops processing the request and returns an HTTP 400 Bad Request error with the message "the request exceeds the available context size, try increasing it."
I expected the server, especially when using the --context-shift flag, to handle the context limit by automatically truncating/shifting the oldest messages in the conversation history, similar to how other platforms (like Ollama, which uses a client-side approach) manage chat history for long contexts.
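For concreteness, a minimal sketch of a single request that triggers the 400 (the host, port, and API key match the server command in the steps below; the filler content is only meant to exceed the 131072-token context):

import requests

url = "http://0.0.0.0:10000/v1/chat/completions"
headers = {"Authorization": "Bearer 1234567890"}

payload = {
    "model": "Qwen3-Coder-30B-A3B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        # Enough filler to push the prompt well past 131072 tokens.
        {"role": "user", "content": "lorem ipsum " * 200000},
    ],
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.status_code)  # 400
print(resp.text)         # contains "the request exceeds the available context size, try increasing it"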
Steps to Reproduce
- Set up the server: Run llama.cpp using the following command (with a model, e.g., a Qwen model):

  "$LLAMA_SERVER_PATH" \
    --host 0.0.0.0 \
    --port 10000 \
    --threads -1 \
    --ctx-size 131072 \
    --context-shift \
    --alias "Qwen3-Coder-30B-A3B" \
    -m "$MODEL_PATH" \
    --api-key 1234567890 \
    --jinja \
    --temp 0.7 \
    --min-p 0.01 \
    --top-p 0.80 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --flash-attn on \
    --batch-size 4096 \
    --ubatch-size 2048 \
    --threads 32 \
    --metrics

- Send long chat requests: Continuously send chat completion requests to http://0.0.0.0:10000/v1/chat/completions using a client (such as a custom application running in VSCode/Cline) that sends the full conversation history with each request; a minimal stand-in script is sketched after this list.
- Observe the failure: Once the cumulative token count of the conversation history (task.n_tokens) exceeds the configured limit (--ctx-size 131072), the server rejects the request.
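A rough stand-in for the client described above, assuming the host, port, and API key from the command in the first step (the filler text and turn count are arbitrary; they only serve to grow the history past --ctx-size 131072):

# Stand-in for a client that resends the full conversation history every turn
# until the accumulated prompt exceeds --ctx-size 131072.
import requests

URL = "http://0.0.0.0:10000/v1/chat/completions"
HEADERS = {"Authorization": "Bearer 1234567890"}

history = [{"role": "system", "content": "You are a coding assistant."}]

for turn in range(10000):
    # Each turn appends a large user message so the history grows quickly.
    history.append({"role": "user",
                    "content": f"Turn {turn}: summarize this.\n" + "filler " * 2000})

    resp = requests.post(URL, headers=HEADERS, json={
        "model": "Qwen3-Coder-30B-A3B",
        "messages": history,
    })

    if resp.status_code != 200:
        # Eventually: 400, "the request exceeds the available context size, ..."
        print(turn, resp.status_code, resp.text)
        break

    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})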
Observed Log Output (Server Side)
The server log shows the failure when the request size exceeds the context:
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
Expected Behavior
The server should either:
- Automatically truncate the oldest messages (excluding the system prompt and the final user message) so the prompt fits the context window (--ctx-size 131072) and process the request successfully; a rough sketch of this trimming is given after this list.
- Or, if --context-shift is intended for this use case, it should be enabled and working to manage the context window without returning a 400 error.
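For clarity, this is the kind of trimming meant in the first point, written here as a client-side sketch; the characters/4 estimate is only a rough approximation, and the server's own tokenizer count would be the real measure:

# Sketch of the expected truncation: keep the system prompt and the final user
# message, drop the oldest intermediate messages until the estimated prompt
# size fits the context budget. The len/4 heuristic is only an approximation.
def trim_history(messages, ctx_budget_tokens):
    def est_tokens(msgs):
        return sum(len(m["content"]) // 4 for m in msgs)

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Never drop the final (current) user message.
    while len(rest) > 1 and est_tokens(system + rest) > ctx_budget_tokens:
        rest.pop(0)  # drop the oldest non-system message

    return system + rest

# e.g. leave some headroom for the reply:
# history = trim_history(history, ctx_budget_tokens=131072 - 4096)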
First Bad Commit
No response
Relevant log output
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
srv log_server_r: request: GET /v1/models 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.978
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 12 | processing task
slot update_slots: id 3 | task 12 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 314
slot update_slots: id 3 | task 12 | need to evaluate at least 1 token for each active slot (n_past = 314, task.n_tokens() = 314)
slot update_slots: id 3 | task 12 | n_past was set to 313
slot update_slots: id 3 | task 12 | n_tokens = 313, memory_seq_rm [313, end)
slot update_slots: id 3 | task 12 | prompt processing progress, n_tokens = 314, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id 3 | task 12 | prompt done, n_tokens = 314, batch.n_tokens = 1
slot print_timing: id 3 | task 12 |
prompt eval time = 21.27 ms / 1 tokens ( 21.27 ms per token, 47.01 tokens per second)
eval time = 134.18 ms / 8 tokens ( 16.77 ms per token, 59.62 tokens per second)
total time = 155.45 ms / 9 tokens
slot release: id 3 | task 12 | stop processing: n_tokens = 321, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 21 | processing task
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 21
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400