Conversation

rgerganov
Collaborator

In streaming mode, when the prompt exceeds the context length, the server returns an HTTP 200 status code with a JSON error in the body. This is very confusing and inconsistent with other inference engines, which return an HTTP 4xx error in this case.

This patch fixes this problem and makes the server return HTTP 400 in such cases.
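To illustrate why the 200 status is confusing for clients (a minimal sketch with illustrative names, not the server's actual client API): an OpenAI-compatible streaming client decides how to parse the body from the HTTP status code alone, so a JSON error hidden behind a 200 surfaces as a stream parse failure instead of the real error.

```python
import json

def classify_streaming_response(status_code: int, body_first_line: str) -> str:
    """Return how a typical streaming client would interpret the response."""
    if status_code >= 400:
        # Error path: parse the body as a plain JSON error object.
        err = json.loads(body_first_line)
        return "error: " + err["error"]["type"]
    # Success path: expect Server-Sent Events ("data: {...}" lines).
    if body_first_line.startswith("data: "):
        return "streaming"
    return "confused: HTTP 200 but body is not an SSE stream"

# Before the fix: HTTP 200 with a JSON error in the body.
print(classify_streaming_response(
    200, '{"error": {"type": "exceed_context_size_error"}}'))
# -> confused: HTTP 200 but body is not an SSE stream

# After the fix: HTTP 400 with the same JSON error.
print(classify_streaming_response(
    400, '{"error": {"type": "exceed_context_size_error"}}'))
# -> error: exceed_context_size_error
```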

@github-actions github-actions bot added examples python python script changes server labels Oct 9, 2025
@ngxson
Collaborator

ngxson commented Oct 9, 2025

Hmm, that's strange; we have a specific error type for this, ERROR_TYPE_EXCEED_CONTEXT_SIZE, and the error code is 400:

    case ERROR_TYPE_EXCEED_CONTEXT_SIZE:
        type_str = "exceed_context_size_error";
        code = 400;
        break;

We also have this test case:

def test_context_size_exceeded():
    global server
    server.start()
    res = server.make_request("POST", "/chat/completions", data={
        "messages": [
            {"role": "system", "content": "Book"},
            {"role": "user", "content": "What is the best book"},
        ] * 100,  # make the prompt too long
    })
    assert res.status_code == 400
    assert "error" in res.body
    assert res.body["error"]["type"] == "exceed_context_size_error"
    assert res.body["error"]["n_prompt_tokens"] > 0
    assert server.n_ctx is not None
    assert server.n_slots is not None
    assert res.body["error"]["n_ctx"] == server.n_ctx // server.n_slots
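The last assertion reflects how the server splits its context across parallel slots: a prompt must fit in n_ctx // n_slots tokens, not in the full n_ctx. A minimal sketch of that arithmetic, with assumed example values (the real numbers come from the server's configuration):

```python
def slot_context(n_ctx: int, n_slots: int) -> int:
    """Context window available to a single slot."""
    return n_ctx // n_slots

def prompt_fits(n_prompt_tokens: int, n_ctx: int, n_slots: int) -> bool:
    """True if the prompt fits in one slot's share of the context."""
    return n_prompt_tokens <= slot_context(n_ctx, n_slots)

print(slot_context(4096, 2))       # 2048
print(prompt_fits(3000, 4096, 2))  # False: exceeds the per-slot window
print(prompt_fits(2000, 4096, 2))  # True
```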

I'm wondering which input leads to the 200 status code that you mentioned?

@rgerganov
Collaborator Author

The issue occurs only in streaming mode. In non-streaming mode, it correctly returns 400.

@rgerganov rgerganov force-pushed the srv-ctx-exceed branch 2 times, most recently from aac559d to 1d8b16c Compare October 9, 2025 15:41
@rgerganov
Collaborator Author

I have added a new test which covers exceeding the context in streaming mode.
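A sketch of what such a streaming regression test could look like, following the pattern of the non-streaming test above (the base URL and use of `requests` are assumptions for illustration; the actual test added in this PR uses the repo's own test helpers and may differ):

```python
def check_stream_context_overflow(base_url: str) -> None:
    """POST an oversized prompt with stream=True and expect HTTP 400."""
    import requests  # third-party; assumed available in the test environment

    res = requests.post(
        f"{base_url}/chat/completions",
        json={
            "stream": True,
            "messages": [
                {"role": "system", "content": "Book"},
                {"role": "user", "content": "What is the best book"},
            ] * 100,  # make the prompt too long
        },
        stream=True,
    )
    # With this patch the error is reported before streaming starts,
    # so the status code is 400 rather than 200 + an in-body error.
    assert res.status_code == 400
    body = res.json()
    assert body["error"]["type"] == "exceed_context_size_error"
```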

inputs = tokenize_input_prompts(ctx_server.vocab, ctx_server.mctx, prompt, true, true);
}

const size_t n_ctx_slot = ctx_server.n_ctx / ctx_server.params_base.n_parallel;
Collaborator


I'm thinking this check might be better done inside launch_slot_with_task, where you will have access to slot.n_ctx.

Collaborator Author


Unfortunately, there is no way to return a non-200 status code once you call res.set_chunked_content_provider(...). That's why I am doing the check before that.

Collaborator


Hmm, ok then. I'll refactor this code in #16488; for now this can be a temporary solution.

Comment on lines +4963 to +4966
json error_data = format_error_response("the request exceeds the available context size, try increasing it", ERROR_TYPE_EXCEED_CONTEXT_SIZE);
error_data["n_prompt_tokens"] = n_prompt_tokens;
error_data["n_ctx"] = n_ctx_slot;
res_error(res, error_data);
Collaborator


If this is handled inside launch_slot_with_task, you can call send_error(slot, ".....", ERROR_TYPE_EXCEED_CONTEXT_SIZE);, which should simplify things a bit

@ngxson ngxson merged commit 68ee98a into ggml-org:master Oct 10, 2025
71 checks passed