Description
Name and Version
/home/sysops/llama.cpp/build/bin/llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
version: 7062 (9b17d74)
built with gcc-14 (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Device 0 [Tesla P40] PCIe GEN 3@ 8x
Device 1 [Tesla P40] PCIe GEN 3@16x
Models
hf.co/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:UD-Q4_K_XL
Problem description & steps to reproduce
When using the llama.cpp server (via the OpenAI-compatible API, /v1/chat/completions), if the total number of tokens in the prompt exceeds the --ctx-size limit, the server stops processing the request and returns an HTTP 400 Bad Request error with the message "the request exceeds the available context size, try increasing it."
I expected the server, especially when using the --context-shift flag, to handle the context limit by automatically truncating/shifting the oldest messages in the conversation history, similar to how other platforms (like Ollama, which uses a client-side approach) manage chat history for long contexts.
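For concreteness, a minimal sketch of a single request that triggers the 400 (the host, port, and API key match the server command in the steps below; the filler content is only meant to exceed the 131072-token context):

import requests

url = "http://0.0.0.0:10000/v1/chat/completions"
headers = {"Authorization": "Bearer 1234567890"}

payload = {
    "model": "Qwen3-Coder-30B-A3B",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        # Enough filler to push the prompt well past 131072 tokens.
        {"role": "user", "content": "lorem ipsum " * 200000},
    ],
}

resp = requests.post(url, headers=headers, json=payload)
print(resp.status_code)  # 400
print(resp.text)         # contains "the request exceeds the available context size, try increasing it"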
Steps to Reproduce
- Set up the server: Run llama.cpp using the following command (with a model, e.g., a Qwen model):

  "$LLAMA_SERVER_PATH" \
    --host 0.0.0.0 \
    --port 10000 \
    --threads -1 \
    --ctx-size 131072 \
    --context-shift \
    --alias "Qwen3-Coder-30B-A3B" \
    -m "$MODEL_PATH" \
    --api-key 1234567890 \
    --jinja \
    --temp 0.7 \
    --min-p 0.01 \
    --top-p 0.80 \
    --top-k 20 \
    --repeat-penalty 1.05 \
    --flash-attn on \
    --batch-size 4096 \
    --ubatch-size 2048 \
    --threads 32 \
    --metrics

- Send long chat requests: Continuously send chat completion requests to http://0.0.0.0:10000/v1/chat/completions using a client (such as a custom application running in VSCode/Cline) that sends the full conversation history with each request; a minimal stand-in script is sketched after this list.
- Observe the failure: Once the cumulative token count of the conversation history (task.n_tokens) exceeds the configured limit (--ctx-size 131072), the server rejects the request.
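A rough stand-in for the client described above, assuming the host, port, and API key from the command in the first step (the filler text and turn count are arbitrary; they only serve to grow the history past --ctx-size 131072):

# Stand-in for a client that resends the full conversation history every turn
# until the accumulated prompt exceeds --ctx-size 131072.
import requests

URL = "http://0.0.0.0:10000/v1/chat/completions"
HEADERS = {"Authorization": "Bearer 1234567890"}

history = [{"role": "system", "content": "You are a coding assistant."}]

for turn in range(10000):
    # Each turn appends a large user message so the history grows quickly.
    history.append({"role": "user",
                    "content": f"Turn {turn}: summarize this.\n" + "filler " * 2000})

    resp = requests.post(URL, headers=HEADERS, json={
        "model": "Qwen3-Coder-30B-A3B",
        "messages": history,
    })

    if resp.status_code != 200:
        # Eventually: 400, "the request exceeds the available context size, ..."
        print(turn, resp.status_code, resp.text)
        break

    reply = resp.json()["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})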
Observed Log Output (Server Side)
The server log shows the failure when the request size exceeds the context:
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
Expected Behavior
The server should either:
- Automatically truncate the oldest messages (excluding the system prompt and the final user message) so the prompt fits the context window (--ctx-size 131072) and process the request successfully; a rough sketch of this trimming is given after this list.
- Or, if --context-shift is intended for this use case, it should be enabled and working to manage the context window without returning a 400 error.
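For clarity, this is the kind of trimming meant in the first point, written here as a client-side sketch; the characters/4 estimate is only a rough approximation, and the server's own tokenizer count would be the real measure:

# Sketch of the expected truncation: keep the system prompt and the final user
# message, drop the oldest intermediate messages until the estimated prompt
# size fits the context budget. The len/4 heuristic is only an approximation.
def trim_history(messages, ctx_budget_tokens):
    def est_tokens(msgs):
        return sum(len(m["content"]) // 4 for m in msgs)

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # Never drop the final (current) user message.
    while len(rest) > 1 and est_tokens(system + rest) > ctx_budget_tokens:
        rest.pop(0)  # drop the oldest non-system message

    return system + rest

# e.g. leave some headroom for the reply:
# history = trim_history(history, ctx_budget_tokens=131072 - 4096)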
First Bad Commit
No response
Relevant log output
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400
srv log_server_r: request: GET /v1/models 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 1.000 (> 0.100 thold), f_keep = 0.978
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 12 | processing task
slot update_slots: id 3 | task 12 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 314
slot update_slots: id 3 | task 12 | need to evaluate at least 1 token for each active slot (n_past = 314, task.n_tokens() = 314)
slot update_slots: id 3 | task 12 | n_past was set to 313
slot update_slots: id 3 | task 12 | n_tokens = 313, memory_seq_rm [313, end)
slot update_slots: id 3 | task 12 | prompt processing progress, n_tokens = 314, batch.n_tokens = 1, progress = 1.000000
slot update_slots: id 3 | task 12 | prompt done, n_tokens = 314, batch.n_tokens = 1
slot print_timing: id 3 | task 12 |
prompt eval time = 21.27 ms / 1 tokens ( 21.27 ms per token, 47.01 tokens per second)
eval time = 134.18 ms / 8 tokens ( 16.77 ms per token, 59.62 tokens per second)
total time = 155.45 ms / 9 tokens
slot release: id 3 | task 12 | stop processing: n_tokens = 321, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 200
srv params_from_: Chat format: Hermes 2 Pro
slot get_availabl: id 1 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 1 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 1 | task 21 | processing task
slot update_slots: id 1 | task 21 | new prompt, n_ctx_slot = 131072, n_keep = 15000, task.n_tokens = 133046
srv send_error: task id = 21, error: the request exceeds the available context size, try increasing it
slot release: id 1 | task 21 | stop processing: n_tokens = 0, truncated = 0
srv update_slots: no tokens to decode
srv update_slots: all slots are idle
srv stop: cancel task, id_task = 21
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 10.2.0.153 400