server: improve Responses API compliance and Codex CLI compatibility #21174
krystophny wants to merge 9 commits into ggml-org:master from
Conversation
|
please add proper tests for it |
Force-pushed from 530e13c to 467266b
Done! |
|
According to the Responses API reference for streaming events, it seems like the current implementation just returns a minimal response object in these events. This causes issues with spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well? |
- Add sequence_number to ALL streaming events (created, in_progress,
output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state
Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.
Refs: ggml-org#21174 (fumlig review comment)
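The counter bookkeeping described in the bullets above can be sketched in Python. This is an illustrative model of the behavior, not the actual C++ code; the names `StreamState` and `next_event` are hypothetical.

```python
class StreamState:
    """Per-stream counters persisted across streaming chunks
    (illustrative stand-in for the task_result_state mentioned above)."""
    def __init__(self):
        self.sequence_number = 0   # strictly increasing across ALL events
        self.output_index = 0      # index of the current output item
        self.content_index = 0     # index of the current content part

    def next_event(self, event_type, payload):
        # Every emitted event carries the current sequence_number,
        # which is then advanced for the next event in the stream.
        event = {"type": event_type,
                 "sequence_number": self.sequence_number,
                 **payload}
        self.sequence_number += 1
        return event

state = StreamState()
created = state.next_event("response.created",
                           {"response": {"id": "resp_1", "status": "in_progress"}})
delta = state.next_event("response.output_text.delta",
                         {"output_index": state.output_index,
                          "content_index": state.content_index,
                          "delta": "Hello"})
```

Spec-compliant clients can then assert that `sequence_number` is present and strictly increasing on every event, which is exactly what the tests in this PR check.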
Code fixes:
- build_oai_resp_metadata accepts a status param; completed_at is null when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing actual prompt tokens before the response was logically started)
- Function call item IDs are now generated once per tool call in update() and reused consistently across output_item.added, function_call_arguments.delta, and output_item.done events (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using a weak "if present" guard)
- Check actual values, not just key presence, in the streaming created event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
|
@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress. Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side). @ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with the async OpenAI SDK against Qwen3.5-9B. If you'd prefer this split into smaller PRs for faster review, happy to do so. |
|
Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under |
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174
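The three Codex CLI compatibility steps (skip non-function tools, merge developer/system messages, strip Responses-only keys) can be modeled as a small request-preprocessing pass. This is a Python sketch of the logic only; the function name `preprocess` and the exact key set are illustrative, not the C++ implementation.

```python
# Responses-only keys the chat-completion path doesn't understand
# (set taken from the commit message above).
RESPONSES_ONLY_KEYS = {"store", "include", "prompt_cache_key"}

def preprocess(request):
    # Drop Responses-only request keys.
    req = {k: v for k, v in request.items() if k not in RESPONSES_ONLY_KEYS}
    # Keep only function tools; Codex CLI also sends web_search etc.
    req["tools"] = [t for t in req.get("tools", []) if t.get("type") == "function"]
    # Merge developer/system messages into a single system message at
    # position 0 (assumes string content for simplicity of the sketch).
    system_parts, rest = [], []
    for msg in req.get("input", []):
        if isinstance(msg, dict) and msg.get("role") in ("developer", "system"):
            system_parts.append(msg["content"])
        else:
            rest.append(msg)
    if system_parts:
        rest.insert(0, {"role": "system", "content": "\n".join(system_parts)})
    req["input"] = rest
    return req
```

Merging into position 0 matters for chat templates (such as Qwen's) that only accept a system message as the first turn.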
|
I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed a new revision, and without CI passing, I cannot merge it |
@ngxson thanks! I marked it as draft and will iterate a bit and test locally some more with https://github.com/lazy-fortran/fortbench over the coming days to see if any problems pop up on the codex path compared to opencode. Then I'll let you know.
Force-pushed from adef64c to 7229135
@ngxson I saw no more issues in practical usage with Codex CLI and the smaller Qwen 3.5 models on the Responses API. Rebased once more and set the PR to ready for review.
|
Thank you for this!!! I ran into this problem today when I was trying codex with llama.cpp and this change makes it work :) |
|
Nice work, thank you! I tried using it with the Codex IDE extension and almost everything works, but I still had a few issues causing errors that would freeze up the entire thread, making it unusable to continue from where it was left off. I'll share some small suggested changes soon. |
tools/server/server-common.cpp
Outdated
```cpp
if (!exists_and_is_string(output_text, "refusal")) {
    throw std::invalid_argument("'Refusal' requires 'refusal'");
```
The throw errors make the thread crash, leaving it unable to resume from there. It would never reach the push_back below
Same reasoning as the output_text guard -- this protects against a 500 from .at("refusal") when the field is missing. The throw is caught by ex_wrapper and returns a clean 400. Removing it would downgrade a proper 400 to a confusing 500. Kept as-is.
Updated in fa2ca43 -- same treatment. Skips with warning instead of throwing.
```cpp
    {"refusal", output_text.at("refusal")},
    {"type", "refusal"},
});
}
```
Part of the same guard, see above.
tools/server/server-common.cpp
Outdated
```cpp
chatcmpl_content.push_back({{"text", "[image input - no URL provided]"}, {"type", "text"}});
continue;
```
The throw errors cause the thread to crash and leave it unable to continue
Same situation as the input_text case above -- the throw is caught by ex_wrapper and returns a proper 400, not a crash. An input_image without image_url is malformed input, so 400 is the right response. Kept as-is.
Updated in fa2ca43 -- same reasoning as above. If an image attachment ends up missing the URL in the replayed history, the thread shouldn't die. Now skips with a warning.
tools/server/server-common.cpp
Outdated
```cpp
chatcmpl_content.push_back({{"text", "[text content missing - input_text is not supported yet!]"}, {"type", "text"}});
continue;
```
Thanks for the report. The throw here doesn't actually crash the thread -- it's caught by ex_wrapper in server.cpp (line 38-69) which converts it to a clean HTTP 400 response. The server keeps running.
This throw fires when input_text is missing its required text field, which is a genuine client bug. Returning 400 is correct here -- it tells the client to fix the malformed request. Kept as-is.
Updated in fa2ca43 -- you were right about this one. With Codex IDE replaying the full conversation history, any 400 here permanently kills the thread. Now skips the malformed item with a server warning instead of rejecting.
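The skip-with-warning pattern adopted in the fix above can be sketched in Python. This is an illustrative model of the behavior, not the actual server-common.cpp code; `convert_content` is a hypothetical name.

```python
import logging

log = logging.getLogger("server")

def convert_content(items):
    """Lenient conversion: malformed items are skipped with a warning
    instead of raising, so a bad item replayed in conversation history
    cannot permanently dead-loop the thread with 400s."""
    out = []
    for item in items:
        text = item.get("text")
        if item.get("type") == "input_text" and isinstance(text, str):
            out.append({"type": "text", "text": text})
        else:
            # Previously this path threw (-> HTTP 400); now it logs
            # a warning and continues with the remaining items.
            log.warning("skipping malformed content item of type %r",
                        item.get("type"))
    return out
```

The trade-off is explicit: a single request with only malformed items now succeeds with empty content rather than returning 400, which is what keeps replayed Codex IDE threads alive.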
tools/server/server-common.cpp
Outdated
```cpp
// Show the attached file as text
if (input_item.contains("filename")) {
    chatcmpl_content.push_back({{"text", "[file: " + input_item.at("filename").get<std::string>() + "]"}, {"type", "text"}});
} else {
    chatcmpl_content.push_back({{"text", "[file data]"}, {"type", "text"}});
}
} else {
    // If not text then give json representation instead of the entire item
    chatcmpl_content.push_back({{"text", input_item.dump()}, {"type", "text"}});
```
This outputs the file to the LLM
Good call on this one. I've reworked input_file handling in edb7a79: instead of rejecting, it now renders the file content as text. If file_data is present, it gets injected into the prompt (with a [file: name] label when filename is available). If only filename is present, it shows a placeholder. Also removed the dead commented-out code.
The approach differs slightly from your suggestion -- I didn't add a raw input_item.dump() catch-all since injecting arbitrary JSON into the prompt can confuse models. Instead, unknown types are just skipped with a server warning log.
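The input_file handling described above (render file_data as text, label with the filename when available, placeholder otherwise) can be sketched as follows. The Python function name and dict shapes are illustrative; the real change lives in server-common.cpp.

```python
def render_input_file(item):
    """Render an input_file item as a plain text content part,
    mirroring the behavior described in the comment above."""
    name = item.get("filename")
    data = item.get("file_data")
    if data is not None:
        # File content is injected into the prompt, with a
        # "[file: name]" label when the filename is known.
        label = f"[file: {name}]\n" if name else ""
        return {"type": "text", "text": label + data}
    if name:
        # No content available: show a placeholder with the name.
        return {"type": "text", "text": f"[file: {name}]"}
    return {"type": "text", "text": "[file data]"}
```

Note the deliberate absence of a raw `item.dump()` fallback, for the reason given above: injecting arbitrary JSON into the prompt tends to confuse models.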
tools/server/server-common.cpp
Outdated
```cpp
chatcmpl_content.push_back({{"text", input_item.dump()}, {"type", "text"}});
```
Agreed that rejecting the whole request for an unknown type is too harsh. Changed to skip with a warning log in edb7a79. I went with a silent skip rather than input_item.dump() to avoid injecting raw JSON into the prompt, which tends to confuse models.
tools/server/server-common.cpp
Outdated
```cpp
if (!exists_and_is_string(output_text, "text")) {
    throw std::invalid_argument("'Output text' requires 'text'");
```
This guard needs to stay. If you remove the check but keep the .at("text") call below, it throws a json::exception when text is missing -- which ex_wrapper catches as a 500 (server error) instead of the current 400 (client error). So removing the guard actually makes things worse. The throw here is correctly caught and returned as a 400 response, not a crash.
Updated in fa2ca43 -- reconsidered after your clarification about the Codex IDE thread loop. The guard now skips the item with a warning + continue instead of throwing. This way a malformed item in replayed history won't kill the thread.
```cpp
    {"text", output_text.at("text")},
    {"type", "text"},
});
}
```
Same as above -- part of the same validation guard that protects against a 500 from the .at() call. Kept as-is.
tools/server/server-common.cpp
Outdated
```cpp
});
} else {
    throw std::invalid_argument("'type' must be 'output_text', 'input_text', or 'refusal'");
}
```

Suggested change:

```cpp
chatcmpl_content.push_back({{"text", "[Incorrect output: type must be 'output_text', 'input_text', or 'refusal']"}, {"type", "text"}});
```
Changed in edb7a79 -- unknown assistant content types are now skipped with a warning log instead of rejecting. I went with a clean skip rather than injecting placeholder text into the prompt, since the model doesn't benefit from seeing error messages in its conversation history.
|
@michaelw9999 thanks for the detailed review! I will commit some fixes soon. |
|
You are welcome. I have more optional nice-to-haves for later, after this
gets merged (web_search and image handling if the model supports it,
automatic compaction). Right now, sending an image as an attachment or
having it look at one just eats up tons of tokens and sends it as
base64, and then the blind model has no idea what to do with it. Thanks again
for getting this up!
|
|
@krystophny what I meant by "crashed the thread": I was using it with the Codex VSCode extension, not the CLI directly. When a 400 or 500 server error happened, it wasn't crashing llama-server. What I meant was that if you have an ongoing chat thread that you've been working on (same prompt), once that error happens, it's over. You can't send a new message after it happens or recover, because it goes into a loop with the same 400/500 repeatedly, so you are forced to start a new chat. |
|
@michaelw9999 Thanks for the clarification -- that makes a lot more sense. You're right that when the Codex extension replays conversation history on every request, a 400 for an unsupported item type becomes a permanent loop: the problematic item is baked into the history, so every subsequent message in that thread hits the same error. The thread is effectively dead. That's exactly the scenario the latest commit (edb7a79) addresses: unsupported item types are now skipped with a warning instead of rejected, and input_file content is rendered as text.

These were the cases where legitimate client data was causing the loop. The remaining throws (for missing required fields) stay as 400s. If you're still hitting the loop with the latest commit, could you share which specific input item type or structure triggers it? That would help pin down whether there's another case we need to handle gracefully. |
Force-pushed from edb7a79 to fa2ca43
|
Yes, you really can't use any of those throws to error from anything at
all, or it becomes effectively inoperable. The errors really did happen, and it
was not that hard to make them happen. I did build and test with the
other alternative I suggested; it solves those issues without losing all
your work in a thread. I really don't think you should escalate to any error
that leaves a 400 or 500 - why? If it happens, then you lose the chat. If
not, then it just shows the message in the chat history (that you, the user,
see too) - you or the model move on and can recover. All you have to do
to reproduce: attach a file to the message with the file attachment option,
or an image, or text. Tell it to review a website or look up a certain URL;
if it's the wrong format or an image, it would crash. It may fail if it
goes to use the web_search tool natively instead of another specified one, or if it
can't find the tool that Codex IDE assumes the server will respond with on
their own backend. If it's in interactive shell mode and then it tries to
read an invalid file, it would also error.
|
Force-pushed from fa2ca43 to 46f2808
Hi..... Claude? Would be nice to have a human reply :) |
sorry :) |
|
I let Claude do the fixes and summaries right now due to a busy Easter Monday and wanted to get it done. I can have a more thorough look tomorrow if there is anything still open. Sorry again. |
|
I'm not an expert on this so maybe @ngxson can give input. But why just do |
In another project I did a retry mechanism that gives the model |
|
@michaelw9999 after giving it some thought, I'm patching this now to move the usual mistakes to debug logging and reply to the model with a hint so it may recover. If you could play with it a bit in your setup, that would be great for getting some more feedback. In case only minor issues remain and the rest seems solid to you, I would be happy to get the PR merged sooner rather than later, with the remaining issues filed so I can deal with them individually in smaller follow-up PRs. What do you think? |
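The "reply to the model with a hint" idea can be sketched as follows: instead of returning an error for a recoverable mistake, the unsupported item is replaced by a short text hint the model sees on its next turn. This is a hypothetical Python sketch of the idea under discussion, not the committed C++ code.

```python
import logging

log = logging.getLogger("server")

def item_to_text(item):
    """Convert one input item to a text part; recoverable mistakes
    become a hint for the model instead of a 400 for the client."""
    if item.get("type") == "input_text" and isinstance(item.get("text"), str):
        return {"type": "text", "text": item["text"]}
    # Usual mistake: log at debug level (not an error) and hand the
    # model a hint so the conversation can continue and self-correct.
    log.debug("unsupported item: %r", item)
    return {"type": "text",
            "text": f"[unsupported item of type {item.get('type')!r}; content omitted]"}
```

Compared with a silent skip, the hint gives the model (and the user, who sees it in the chat history) a chance to work around the unsupported content rather than wondering where it went.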
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input

Verified: codex -p local and codex -p fast work against local llama.cpp with Qwen3.5 models including native tool calling.

Refs: ggml-org#19138, ggml-org#19720
Add 8 new tests covering the changes in this PR:
- test_responses_schema_fields: verify all 24+ Response object fields
- test_responses_stream_schema_fields: verify sequence_number, output_index, content_index on streaming events
- test_responses_non_function_tool_skipped: web_search/code_interpreter tool types return 200 instead of 400
- test_responses_mixed_tool_types: non-function tools filtered, function tools retained (not rejected at parsing layer)
- test_responses_extra_keys_stripped: store, include, prompt_cache_key, web_search, text, truncation, metadata don't cause errors
- test_responses_developer_role: developer messages merged into system
- test_responses_input_text_type: input_text accepted for EasyInputMessage
- test_responses_function_call_id_fields: output items have correct ids

All 10 tests pass (2 existing + 8 new).
- Add sequence_number to ALL streaming events (created, in_progress,
output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state
Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.
Refs: ggml-org#21174 (fumlig review comment)
- test_responses_stream_created_event_has_full_response: verify response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event has sequence_number and they are strictly increasing across the stream
- test_responses_stream_delta_events_have_indices: output_index and content_index present on all delta/added events

All 14 tests pass (2 original + 9 from previous commit + 3 new).
Code fixes:
- build_oai_resp_metadata accepts a status param; completed_at is null when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing actual prompt tokens before the response was logically started)
- Function call item IDs are now generated once per tool call in update() and reused consistently across output_item.added, function_call_arguments.delta, and output_item.done events (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using a weak "if present" guard)
- Check actual values, not just key presence, in the streaming created event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
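The status-dependent metadata behavior fixed in this commit (null completed_at and zeroed usage while in_progress) can be sketched in Python. The function name mirrors build_oai_resp_metadata but the shape is illustrative, not the actual C++ signature.

```python
import time

def build_resp_metadata(resp_id, status, usage=None):
    """Illustrative sketch: completed_at is only set once the
    response is completed, and usage defaults to zeros for
    response.created / response.in_progress events."""
    return {
        "id": resp_id,
        "object": "response",
        "status": status,
        "completed_at": int(time.time()) if status == "completed" else None,
        "usage": usage or {"input_tokens": 0,
                           "output_tokens": 0,
                           "total_tokens": 0},
    }
```

This matches what the tightened tests assert: on the created event, completed_at is None and all usage tokens are 0, not merely present.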
Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present
Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).
Add 4 tests covering each format variant.
Refs: openai/codex#11834, anomalyco/opencode#19081
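The four accepted content formats listed in this commit can be normalized with a small helper. This Python sketch models the behavior only; the name `normalize_reasoning_content` is hypothetical.

```python
def normalize_reasoning_content(item):
    """Accept all reasoning-item content variants described above:
    array of reasoning_text objects (spec), plain string (OpenCode),
    null (Codex, with encrypted_content), or an omitted field."""
    content = item.get("content")
    if content is None:
        # content:null or no content field at all; any
        # encrypted_content is accepted but ignored locally.
        return ""
    if isinstance(content, str):
        # OpenCode sends the reasoning as a plain string.
        return content
    if isinstance(content, list):
        # Spec format: concatenate the reasoning_text parts.
        return "".join(p.get("text", "") for p in content
                       if p.get("type") == "reasoning_text")
    raise ValueError("unsupported reasoning content format")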
Replace hard rejections (throw) with graceful handling for input types that are unsupported but represent legitimate client usage:
- input_file: render file_data as text content with filename label, or show filename placeholder when file_data is absent
- Unknown user content types: skip with warning log
- Unknown assistant content types: skip with warning log
- Unknown top-level item types: skip with warning log

Validation throws are preserved for genuinely malformed input (e.g. input_text missing its required text field still returns 400). Remove dead commented-out code in the input_file handler.

This prevents multi-turn conversations from failing entirely when clients like Codex CLI include unsupported content types in their conversation history.
Force-pushed from 36b8613 to 23e59e8
Summary
Improve Responses API (/v1/responses) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):
- Fix id/call_id field mapping (id gets a unique fc_ ID, call_id gets the model's tool_call.id)
- Add output_text top-level convenience field
- Add output_tokens_details with reasoning_tokens to usage
- Restore refusal content type handling (was broken in upstream: unreachable code after throw)

Streaming events:
- Add sequence_number to ALL streaming events (created, in_progress, added, delta, done, completed)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Full response object in response.created and response.in_progress (was only {id, object, status})
- Consistent item IDs across output_item.added, function_call_arguments.delta, and output_item.done
- Counter state persisted across streaming chunks via task_result_state

Codex CLI compatibility:
- Accept input_text type alongside output_text for EasyInputMessage / AssistantMessageItemParam

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.

Fixes #19138. Related: #20156, #20733, #20607.
Verification
- pytest (14 tests, tinyllama2)
- E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M
- Codex CLI E2E
Test plan
Requirements
If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.