
Eval bug: Server Fails with HTTP 400 (Context Size Exceeded) Instead of Truncating Chat History #17284

@lyesrock

Description


Name and Version

/home/sysops/llama.cpp/build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
version: 7062 (9b17d74)
built with gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Device 0 [Tesla P40]: PCIe Gen 3 @ 8x
Device 1 [Tesla P40]: PCIe Gen 3 @ 16x

Models

hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL

Problem description & steps to reproduce

When using the llama.cpp server (via the OpenAI-compatible API, /v1/chat/completions), if the total number of tokens in the prompt exceeds the --ctx-size limit, the server stops processing the request and returns an HTTP 400 Bad Request error with the message "the request exceeds the available context size, try increasing it."

I expected the server, especially when using the --context-shift flag, to handle the context limit by automatically truncating/shifting the oldest messages in the conversation history, similar to how other platforms (like Ollama, which uses a client-side approach) manage chat history for long contexts.

Steps to Reproduce

  1. Set up the server: Run llama.cpp using the following command (with a model, e.g., the Qwen model listed above):

    "$LLAMA_SERVER_PATH" \
      --host 0.0.0.0 \
      --port 10000 \
      --threads -1 \
      --ctx-size 131072 \
      --context-shift \
      --alias "Qwen3-Coder-30B-A3B" \
      -m "$MODEL_PATH" \
      --api-key 1234567890 \
      --jinja \
      --temp 0.7 \
      --min-p 0.01 \
      --top-p 0.80 \
      --top-k 20 \
      --repeat-penalty 1.05 \
      --flash-attn on \
      --batch-size 4096 \
      --ubatch-size 2048 \
      --threads 32 \
      --metrics
  2. Send Long Chat Requests: Continuously send chat completion requests to http://0.0.0.0:10000/v1/chat/completions using a client (like a custom application running on VSCode/Cline) that sends the full conversation history with each request. (A minimal client sketch that triggers the same failure with a single oversized request is shown after these steps.)

  3. Observe Failure: Once the cumulative token count of the conversation history (task.n_tokens) exceeds the configured limit (--ctx-size 131072), the server fails the request.
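
Here is a minimal client sketch (Python, standard library only) that triggers the same 400 with a single oversized request instead of a long interactive session; the URL, port, API key, and alias match the launch command above, and the repeated filler text simply stands in for accumulated chat history:

    # Minimal sketch: one request whose prompt exceeds --ctx-size 131072.
    # The exact token count depends on the model's tokenizer; ~200k filler
    # words is comfortably above the limit.
    import json
    import urllib.error
    import urllib.request

    URL = "http://0.0.0.0:10000/v1/chat/completions"
    API_KEY = "1234567890"          # matches --api-key above

    payload = {
        "model": "Qwen3-Coder-30B-A3B",   # matches --alias above
        "messages": [{"role": "user", "content": "lorem " * 200_000}],
        "max_tokens": 16,
    }

    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
    )

    try:
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.read()[:200])
    except urllib.error.HTTPError as e:
        # Observed behavior: HTTP 400 with "the request exceeds the
        # available context size, try increasing it" in the body.
        print(e.code, e.read().decode("utf-8", errors="replace"))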

Observed Log Output (Server Side)

The server log shows the failure when the request size exceeds the context:

slot update_slots: id  1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv    send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot      release: id  1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 10.2.0.153 400

Expected Behavior

The server should either:

  1. Automatically truncate the oldest messages (excluding the system prompt and the final user message) to fit the context window (--ctx-size 131072) and process the request successfully (a sketch of this truncation logic follows below).
  2. Or, if --context-shift is intended to cover this case, it should take effect and manage the context window instead of returning a 400 error.
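
For reference, the truncation described in option 1 can be sketched as follows. This is only illustrative client-side Python (the ~4 characters-per-token estimate and the function name are hypothetical), showing the policy of keeping the system prompt and the final message while dropping the oldest turns:

    def truncate_history(messages, n_ctx=131072, chars_per_token=4):
        """Drop the oldest non-system messages until the estimated prompt
        size fits n_ctx; keep the system prompt and the final message."""
        def estimate(msgs):
            # Rough heuristic: ~4 characters per token (hypothetical ratio).
            return sum(len(m["content"]) for m in msgs) // chars_per_token

        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        head, tail = rest[:-1], rest[-1:]   # always keep the newest message
        while head and estimate(system + head + tail) > n_ctx:
            head.pop(0)                     # drop the oldest turn first
        return system + head + tail

A client could use the server's /tokenize endpoint instead of a character heuristic to get exact counts, but the point of this report is that the server (or --context-shift) could apply this kind of policy itself rather than returning a 400.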

First Bad Commit

No response

Relevant log output

srv  log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
srv  log_server_r: request: GET /v1/models 10.2.0.153 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.978
slot launch_slot_: id  3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  3 | task 12 | processing task
slot update_slots: id  3 | task 12 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 314
slot update_slots: id  3 | task 12 | need to evaluate at least 1 token for each active slot (n_past = 314, task.n_tokens() = 314)
slot update_slots: id  3 | task 12 | n_past was set to 313
slot update_slots: id  3 | task 12 | n_tokens = 313, memory_seq_rm [313, end)
slot update_slots: id  3 | task 12 | prompt processing progress, n_tokens = 314, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id  3 | task 12 | prompt done, n_tokens = 314, batch.n_tokens = 1
slot print_timing: id  3 | task 12 |
prompt eval time =      21.27 ms /     1 tokens (   21.27 ms per token,    47.01 tokens per second)
       eval time =     134.18 ms /     8 tokens (   16.77 ms per token,    59.62 tokens per second)
      total time =     155.45 ms /     9 tokens
slot      release: id  3 | task 12 | stop processing: n_tokens = 321, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 10.2.0.153 200
srv  params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id  1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id  1 | task 21 | processing task
slot update_slots: id  1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv    send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot      release: id  1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv  update_slots: no tokens to decode
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 21
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
