Conversation

rgerganov
Collaborator

In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with other inference engines, which return an HTTP 4xx error in this case.

This patch fixes this problem and makes the server return HTTP 400 in such cases.
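To illustrate why the 200 status is confusing for clients (a minimal sketch with illustrative names, not the server's actual client API): an OpenAI-compatible streaming client decides how to parse the body from the HTTP status code alone, so a JSON error hidden behind a 200 surfaces as a stream parse failure instead of the real error.

```python
import json

def classify_streaming_response(status_code: int, body_first_line: str) -> str:
    """Return how a typical streaming client would interpret the response."""
    if status_code >= 400:
        # Error path: parse the body as a plain JSON error object.
        err = json.loads(body_first_line)
        return "error: " + err["error"]["type"]
    # Success path: expect Server-Sent Events ("data: {...}" lines).
    if body_first_line.startswith("data: "):
        return "streaming"
    return "confused: HTTP 200 but body is not an SSE stream"

# Before the fix: HTTP 200 with a JSON error in the body.
print(classify_streaming_response(
    200, '{"error": {"type": "exceed_context_size_error"}}'))
# -> confused: HTTP 200 but body is not an SSE stream

# After the fix: HTTP 400 with the same JSON error.
print(classify_streaming_response(
    400, '{"error": {"type": "exceed_context_size_error"}}'))
# -> error: exceed_context_size_error
```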

@github-actions github-actions bot added examples python python script changes server labels Oct 9, 2025
@ngxson
Collaborator

ngxson commented Oct 9, 2025

Hmm, that's strange; we have a specific error type for this, ERROR_TYPE_EXCEED_CONTEXT_SIZE, and the error code is 400:

    case ERROR_TYPE_EXCEED_CONTEXT_SIZE:
        type_str = "exceed_context_size_error";
        code = 400;
        break;

We also have this test case:

def test_context_size_exceeded():
    global server
    server.start()
    res = server.make_request("POST", "/chat/completions", data={
        "messages": [
            {"role": "system", "content": "Book"},
            {"role": "user", "content": "What is the best book"},
        ] * 100,  # make the prompt too long
    })
    assert res.status_code == 400
    assert "error" in res.body
    assert res.body["error"]["type"] == "exceed_context_size_error"
    assert res.body["error"]["n_prompt_tokens"] > 0
    assert server.n_ctx is not None
    assert server.n_slots is not None
    assert res.body["error"]["n_ctx"] == server.n_ctx // server.n_slots
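The last assertion reflects how the server splits its context across parallel slots: a prompt must fit in n_ctx // n_slots tokens, not in the full n_ctx. A minimal sketch of that arithmetic, with assumed example values (the real numbers come from the server's configuration):

```python
def slot_context(n_ctx: int, n_slots: int) -> int:
    """Context window available to a single slot."""
    return n_ctx // n_slots

def prompt_fits(n_prompt_tokens: int, n_ctx: int, n_slots: int) -> bool:
    """True if the prompt fits in one slot's share of the context."""
    return n_prompt_tokens <= slot_context(n_ctx, n_slots)

print(slot_context(4096, 2))       # 2048
print(prompt_fits(3000, 4096, 2))  # False: exceeds the per-slot window
print(prompt_fits(2000, 4096, 2))  # True
```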

I'm wondering which input leads to the 200 status code that you mentioned?

@rgerganov
Collaborator Author

The issue occurs only in streaming mode. In non-streaming mode, it correctly returns 400.

@rgerganov rgerganov force-pushed the srv-ctx-exceed branch 2 times, most recently from aac559d to 1d8b16c Compare October 9, 2025 15:41
@rgerganov
Collaborator Author

I have added a new test which covers exceeding the context in streaming mode.
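A sketch of what such a streaming regression test could look like, following the pattern of the non-streaming test above (the base URL and use of `requests` are assumptions for illustration; the actual test added in this PR uses the repo's own test helpers and may differ):

```python
def check_stream_context_overflow(base_url: str) -> None:
    """POST an oversized prompt with stream=True and expect HTTP 400."""
    import requests  # third-party; assumed available in the test environment

    res = requests.post(
        f"{base_url}/chat/completions",
        json={
            "stream": True,
            "messages": [
                {"role": "system", "content": "Book"},
                {"role": "user", "content": "What is the best book"},
            ] * 100,  # make the prompt too long
        },
        stream=True,
    )
    # With this patch the error is reported before streaming starts,
    # so the status code is 400 rather than 200 + an in-body error.
    assert res.status_code == 400
    body = res.json()
    assert body["error"]["type"] == "exceed_context_size_error"
```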

inputs = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, true, true);
}

const size_t n_ctx_slot = ctx_server.n_ctx / ctx_server.params_base.n_parallel;
Collaborator


I'm thinking this check might be better done inside launch_slot_with_task, where you will have access to slot.n_ctx.

Collaborator Author


Unfortunately, there is no way to return a non-200 status code once you call res.set_chunked_content_provider(...). That's why I am doing the check before that.

Collaborator


Hmm, ok then. I'll refactor this code in #16488; for now this can be a temporary solution.

Comment on lines +4963 to +4966
json error_data = format_error_response("the request exceeds the available context size, try increasing it", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
error_data["n_prompt_tokens"] = n_prompt_tokens;
error_data["n_ctx"] = n_ctx_slot;
res_error(res, error_data);
Collaborator


If this is handled inside launch_slot_with_task, you can call send_error(slot, ".....", ERROR_TYPE_EXCEED_CONTEXT_SIZE);, which should simplify things a bit

@ngxson ngxson merged commit 68ee98a into ggml-org:master Oct 10, 2025
71 checks passed