
Conversation

@ngxson (Collaborator) commented Dec 3, 2025

Fix #17726

The proposed approach: delegate state tracking to the HTTP thread.

  1. When creating a new task, we also create a new task_result_state that holds the state of the current generation "session". This object is owned by the HTTP thread.
  2. Each time we receive a new result, we call result.update(state); this allows the result to populate partial returns while also updating the state object (see the sketch below).
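A minimal, self-contained sketch of this flow (illustrative types only; the real server defines task_result_state and the result types in its own source):

#include <string>

// State of one generation "session", owned by the HTTP thread.
struct task_result_state {
    std::string accumulated_text;   // everything generated so far
};

// One result coming back from the inference side.
struct task_result {
    std::string new_text;   // text produced since the previous result
    std::string partial;    // the partial return, filled in by update()

    // Populate the partial return and advance the session state.
    void update(task_result_state & state) {
        partial = new_text;                   // diff vs. the previous state
        state.accumulated_text += new_text;   // state now includes it
    }
};

// HTTP thread loop, roughly:
//   task_result_state state;   // created together with the task
//   while (more results) {
//       result.update(state);
//       send(result.partial);
//   }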


llama_token sampled;

common_chat_format chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY;
@ngxson (Collaborator, Author) commented on the diff above:
chat_format seems to be unused, so I removed it. No idea why it was added in the first place.

@ngxson ngxson changed the title from "Xsn/server improve msg diff" to "server: move msg diffs tracking to HTTP thread" on Dec 3, 2025
@ngxson ngxson marked this pull request as ready for review December 3, 2025 15:33
@ngxson ngxson requested a review from ggerganov as a code owner December 3, 2025 15:33
@ngxson (Collaborator, Author) commented Dec 3, 2025

Server tests should be OK now, running from the mirror PR: ngxson#49

+ states.emplace_back(task.params.oaicompat_chat_syntax);
  tasks.push_back(std::move(task));
- states.push_back(task.params.oaicompat_chat_syntax);
@ngxson (Collaborator, Author):
task is already moved at this point, so task.params must be read before the std::move
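For context, the same use-after-move pattern in a minimal standalone example (generic code, not the server's):

#include <string>
#include <utility>
#include <vector>

struct task_t { std::string params; };

void buggy(std::vector<task_t> & tasks, std::vector<std::string> & states) {
    task_t t{"syntax"};
    tasks.push_back(std::move(t));
    states.push_back(t.params);     // t is moved-from: params is likely empty
}

void fixed(std::vector<task_t> & tasks, std::vector<std::string> & states) {
    task_t t{"syntax"};
    states.emplace_back(t.params);  // read params before the move
    tasks.push_back(std::move(t));
}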

// send the results
json res_json = result->to_json();
if (result->is_error()) {
res->next = [res_this = res.get(), res_type, &should_stop, states = std::move(states)](std::string & output) mutable -> bool {
A reviewer (Member) commented:

Instead of making the lambda mutable, wouldn't it be cleaner to maintain the states in the server_res_generator instance (i.e. res)?

@ngxson (Collaborator, Author) replied:
Yes, that's a good idea. I moved it to server_response_reader instead; update() is now called automatically whenever a result is received.

Also added a safeguard in ba2af58 so that forgetting to call update() is caught.
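An illustrative sketch of the resulting shape (everything beyond the names server_response_reader and update() is an assumption, and the actual safeguard in ba2af58 may differ):

#include <cassert>
#include <cstddef>
#include <vector>

struct task_result_state { /* accumulated message, parser state, ... */ };

struct task_result {
    bool updated = false;

    void update(task_result_state & /*state*/) {
        updated = true;   // compute the message diffs here
    }

    // Safeguard idea: refuse to serialize a result whose update() was
    // skipped, so the mistake fails loudly in debug builds.
    void to_json() const {
        assert(updated && "task_result::update() was not called");
    }
};

struct server_response_reader {
    std::vector<task_result_state> states;   // one state per task, owned here

    // Every result is routed through the reader, so update() is applied
    // automatically and callers cannot forget it.
    void on_result(task_result & r, std::size_t idx) {
        r.update(states.at(idx));
    }
};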

@ggerganov (Member) left a comment:
Note that a87d9a4 is already merged in master, so this probably needs a rebase.

@ngxson ngxson force-pushed the xsn/server_improve_msg_diff branch from aad20a6 to 1c30c28 on December 4, 2025 13:13
@ngxson ngxson merged commit c4c10bf into ggml-org:master Dec 4, 2025
69 checks passed
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Dec 4, 2025
* origin/master:
server: strip content-length header on proxy (ggml-org#17734)
server: move msg diffs tracking to HTTP thread (ggml-org#17740)
examples : add missing code block end marker [no ci] (ggml-org#17756)
common : skip model validation when --help is requested (ggml-org#17755)
ggml-cpu : remove asserts always evaluating to false (ggml-org#17728)
convert: use existing local chat_template if mistral-format model has one. (ggml-org#17749)
cmake : simplify build info detection using standard variables (ggml-org#17423)
ci : disable ggml-ci-x64-amd-* (ggml-org#17753)
common: use native MultiByteToWideChar (ggml-org#17738)
metal : use params per pipeline instance (ggml-org#17739)
llama : fix sanity checks during quantization (ggml-org#17721)
build : move _WIN32_WINNT definition to headers (ggml-org#17736)
build: enable parallel builds in msbuild using MTT (ggml-org#17708)
ggml-cpu: remove duplicate conditional check 'iid' (ggml-org#17650)
Add a couple of file types to the text section (ggml-org#17670)
convert : support latest mistral-common (fix conversion with --mistral-format) (ggml-org#17712)
Use OpenAI-compatible `/v1/models` endpoint by default (ggml-org#17689)
webui: Fix zero pasteLongTextToFileLen to disable conversion being overridden (ggml-org#17445)


Development

Successfully merging this pull request may close these issues.

refactor: (server) move server_context::update_chat_msg to HTTP thread
