
server: improve Responses API compliance and Codex CLI compatibility#21174

Open
krystophny wants to merge 9 commits into ggml-org:master from krystophny:responses-api-codex-compat

Conversation

@krystophny

@krystophny krystophny commented Mar 30, 2026

Summary

Improve Responses API (/v1/responses) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):

  • Add 24 missing fields per OpenAI Responses API spec (tools, truncation, temperature, top_p, metadata, store, service_tier, etc.)
  • Fix function_call id/call_id field mapping (id gets unique fc_ ID, call_id gets the model's tool_call.id)
  • Add output_text top-level convenience field
  • Add output_tokens_details with reasoning_tokens to usage
  • Restore refusal content type handling (was broken in upstream — unreachable code after throw)
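The id/call_id split above can be sketched as follows. This is a minimal illustration in the style of the repo's Python test suite; `make_function_call_item` and the exact ID length are assumptions, not the PR's actual helper:

```python
import secrets

def make_function_call_item(tool_call):
    # 'id' is a server-generated unique item ID with an fc_ prefix;
    # 'call_id' echoes the model's own tool_call.id so the client can
    # pair a later function_call_output with this call.
    return {
        "type": "function_call",
        "id": "fc_" + secrets.token_hex(12),
        "call_id": tool_call["id"],
        "name": tool_call["function"]["name"],
        "arguments": tool_call["function"]["arguments"],
        "status": "completed",
    }

item = make_function_call_item({
    "id": "call_abc123",
    "function": {"name": "get_weather", "arguments": "{\"city\": \"Graz\"}"},
})
```

The key point is that the two fields come from different sources: `id` is minted server-side, `call_id` is passed through from the model's tool call.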

Streaming events:

  • Add sequence_number to ALL streaming events (created, in_progress, added, delta, done, completed)
  • Add output_index to all events referencing output items
  • Add content_index to content-related events
  • Populate full response object in response.created and response.in_progress (was only {id, object, status})
  • Function call item IDs consistent across output_item.added, function_call_arguments.delta, and output_item.done
  • Counter state persisted across streaming chunks via task_result_state
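The counter behavior described above can be sketched as a small state object that survives across chunks; the class and method names here are illustrative, standing in for the C++ `task_result_state` mentioned in the PR:

```python
class ResponsesStreamState:
    """Counter state carried across streaming chunks, so sequence_number
    keeps increasing over the whole stream rather than per chunk."""

    def __init__(self):
        self.sequence_number = 0

    def emit(self, event_type, payload):
        # Every event gets the next sequence_number, regardless of type.
        event = {"type": event_type, "sequence_number": self.sequence_number, **payload}
        self.sequence_number += 1
        return event

state = ResponsesStreamState()
events = [
    state.emit("response.created", {"response": {"id": "resp_1", "status": "in_progress"}}),
    state.emit("response.output_item.added", {"output_index": 0}),
    state.emit("response.output_text.delta", {"output_index": 0, "content_index": 0, "delta": "Hi"}),
    state.emit("response.completed", {"response": {"id": "resp_1", "status": "completed"}}),
]
```

Because the state object outlives any single chunk, the sequence numbers are strictly increasing across the whole stream, which is what spec-strict clients validate.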

Codex CLI compatibility:

  • Skip non-function tool types (web_search, code_interpreter) instead of rejecting with 400
  • Merge developer/system role messages into first system message for templates requiring system at position 0 (e.g. Qwen)
  • Strip Responses-only request keys (store, include, prompt_cache_key, web_search, text, truncation, metadata)
  • Accept input_text type alongside output_text for EasyInputMessage / AssistantMessageItemParam
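The request preprocessing above (key stripping plus developer/system merging) can be sketched like this; the key list is taken from the PR description, while the function name and merge strategy are illustrative assumptions:

```python
RESPONSES_ONLY_KEYS = {"store", "include", "prompt_cache_key", "web_search",
                       "text", "truncation", "metadata"}

def prepare_chat_request(body, messages):
    # Drop Responses-only request keys before forwarding to the
    # chat-completions path.
    body = {k: v for k, v in body.items() if k not in RESPONSES_ONLY_KEYS}

    # Merge developer/system messages into a single system message at
    # position 0, for templates (e.g. Qwen) that require system first.
    sys_parts = [m["content"] for m in messages if m["role"] in ("system", "developer")]
    rest = [m for m in messages if m["role"] not in ("system", "developer")]
    if sys_parts:
        rest.insert(0, {"role": "system", "content": "\n".join(sys_parts)})
    return body, rest

body, msgs = prepare_chat_request(
    {"model": "m", "store": True},
    [{"role": "developer", "content": "A"},
     {"role": "user", "content": "hi"},
     {"role": "system", "content": "B"}],
)
```

After this pass, templates that hard-require a single leading system message see exactly one, and the backend never sees request keys it doesn't understand.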

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.

Fixes #19138. Related: #20156, #20733, #20607.

Verification

pytest (14 tests, tinyllama2)

$ LLAMA_SERVER_BIN_PATH=./build/bin/llama-server pytest unit/test_compat_oai_responses.py -v
test_responses_with_openai_library PASSED
test_responses_stream_with_openai_library PASSED
test_responses_schema_fields PASSED
test_responses_stream_schema_fields PASSED
test_responses_non_function_tool_skipped PASSED
test_responses_only_non_function_tools_same_as_no_tools PASSED
test_responses_extra_keys_stripped PASSED
test_responses_developer_role_merging PASSED
test_responses_input_text_type_multi_turn PASSED
test_responses_output_text_matches_content PASSED
test_responses_stream_output_text_consistency PASSED
test_responses_stream_created_event_has_full_response PASSED
test_responses_stream_all_events_have_sequence_number PASSED
test_responses_stream_delta_events_have_indices PASSED
14 passed

E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M

$ python3 e2e_test.py  # uses AsyncOpenAI against localhost:8080
Test 1: Non-streaming basic          OK: output_text='4', fields=36
Test 2: Streaming basic              OK: 205 events, gathered text matches
Test 3: Non-function tools skipped   OK: status=completed
Test 4: Developer role merging       OK
Test 5: Multi-turn with input_text   OK
Test 7: Streaming seq_nums           OK: 105 events, strictly increasing
Test 6: Concurrent stress (5 req)    OK: 5/5 completed, 0 failures
ALL E2E TESTS PASSED

Codex CLI E2E

$ codex exec -p local "Say hello in one word"
Hello
tokens used: 6,119

$ codex exec -p local "Run 'echo hello world' and show the output"
exec: /bin/zsh -lc 'echo hello world' succeeded
hello world
tokens used: 1,054

Test plan

  • 14 pytest tests covering all code paths (schema, streaming, tools, roles, keys)
  • Async OpenAI SDK: non-streaming, streaming, concurrent stress
  • Streaming events: sequence_number strictly increasing, output_index/content_index present
  • response.created has full response object with all fields
  • Non-function tools silently skipped (200 not 400)
  • Developer/system messages merged correctly
  • Codex CLI connects and works (text + tool calling)
  • Function call IDs consistent between added/done streaming events

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI was used to help draft code, tests, and PR text. The submitter reviewed and is responsible for the final changes.

If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.

@krystophny krystophny requested a review from a team as a code owner March 30, 2026 07:54
@ngxson
Contributor

ngxson commented Mar 30, 2026

please add proper testing for it

@github-actions github-actions bot added the python python script changes label Mar 30, 2026
@krystophny krystophny force-pushed the responses-api-codex-compat branch 3 times, most recently from 530e13c to 467266b on March 30, 2026 13:04
@krystophny
Author

please add proper testing for it

Done!

@fumlig

fumlig commented Mar 30, 2026

According to the responses API reference for streaming events, the response.created and response.in_progress streaming events should also contain created_at, model etc.

It seems like the current implementation just returns a minimal response object in those events. This causes issues with certain spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well?
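The difference fumlig points out is between a minimal `{id, object, status}` payload and a spec-shaped one. A rough sketch of the fuller `response.created` event, with field defaults that are illustrative rather than the PR's exact values:

```python
import time

def build_created_event(resp_id, model, seq):
    # Full response object in response.created, not just {id, object,
    # status}. At creation time the response has not produced anything,
    # so completed_at is null and usage is zeroed.
    return {
        "type": "response.created",
        "sequence_number": seq,
        "response": {
            "id": resp_id,
            "object": "response",
            "status": "in_progress",
            "created_at": int(time.time()),
            "completed_at": None,
            "model": model,
            "output": [],
            "usage": {"input_tokens": 0, "output_tokens": 0, "total_tokens": 0},
        },
    }

ev = build_created_event("resp_1", "qwen", 0)
```

Strictly typed clients such as async-openai deserialize the embedded response object, so missing `created_at` or `model` fields fail parsing even though a lenient client would not notice.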

krystophny added a commit to krystophny/llama.cpp that referenced this pull request Mar 30, 2026
- Add sequence_number to ALL streaming events (created, in_progress,
  output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
  response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state

Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.

Refs: ggml-org#21174 (fumlig review comment)
krystophny added a commit to krystophny/llama.cpp that referenced this pull request Mar 30, 2026
Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
  when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
  actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
  update() and reused consistently across output_item.added,
  function_call_arguments.delta, and output_item.done events
  (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using
  weak "if present" guard)
- Check actual values not just key presence in streaming created
  event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
@krystophny
Author

@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress events (with all 24+ required fields, status: "in_progress", completed_at: null, zeroed usage). All partial events (output_item.added, content_part.added, all deltas) now have sequence_number, output_index, and content_index per spec.

Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side).

@ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with async OpenAI SDK against Qwen3.5-9B.

If you'd prefer this split into smaller PRs for faster review, happy to do so.

@krystophny
Author

Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev

The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under scripts/.

krystophny added a commit to krystophny/llama.cpp that referenced this pull request Mar 30, 2026
Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- Restore refusal content type handling

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to ALL streaming events
- Full response object in response.created/in_progress events
- Accept input_text type and EasyInputMessage for multi-turn input
- output_text convenience field, output_tokens_details

14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI.

Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174
@ngxson
Contributor

ngxson commented Mar 30, 2026

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without CI passing, I cannot merge it

@krystophny krystophny marked this pull request as draft March 30, 2026 18:04
@krystophny
Author

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without CI passing, I cannot merge it

@ngxson thanks! I've marked it as a draft and will iterate a bit and test locally some more with https://github.com/lazy-fortran/fortbench during the coming days, to see if any problems pop up on the codex path compared to opencode. Then I'll let you know.

@krystophny krystophny force-pushed the responses-api-codex-compat branch from adef64c to 7229135 on April 3, 2026 07:02
krystophny added a commit to krystophny/llama.cpp that referenced this pull request Apr 3, 2026 (same message as the Mar 30 streaming-compliance commit above)
krystophny added a commit to krystophny/llama.cpp that referenced this pull request Apr 3, 2026 (same message as the Mar 30 review-fix commit above)
@krystophny krystophny marked this pull request as ready for review April 3, 2026 07:02
@krystophny
Author

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed to the PR, and without CI passing, I cannot merge it

@ngxson I saw no more issues in practical usage with Codex CLI and the smaller Qwen 3.5 models on the Responses API. Rebased once more and set it to ready for review.

@Macmee

Macmee commented Apr 4, 2026

Thank you for this!!! I ran into this problem today when I was trying codex with llama.cpp and this change makes it work :)

@michaelw9999
Contributor

Nice work, thank you! I tried using it with the Codex IDE extension and almost everything works, but I still hit a few issues causing errors that froze the entire thread, making it impossible to continue from where it left off. I'll share some small suggested changes soon.

Comment on lines 1302 to 1303
if (!exists_and_is_string(output_text, "refusal")) {
throw std::invalid_argument("'Refusal' requires 'refusal'");
Contributor

Suggested change
if (!exists_and_is_string(output_text, "refusal")) {
throw std::invalid_argument("'Refusal' requires 'refusal'");

The throw errors make the thread crash and leave it unable to resume from there. It would never reach the push_back below

Author

Same reasoning as the output_text guard -- this protects against a 500 from .at("refusal") when the field is missing. The throw is caught by ex_wrapper and returns a clean 400. Removing it would downgrade a proper 400 to a confusing 500. Kept as-is.

Author

Updated in fa2ca43 -- same treatment. Skips with warning instead of throwing.

{"refusal", output_text.at("refusal")},
{"type", "refusal"},
});
}
Contributor

Suggested change
}

Author

Part of the same guard, see above.

Contributor

Suggested change
chatcmpl_content.push_back({{"text", "[image input - no URL provided]"}, {"type", "text"}});
continue;

The throw errors cause the thread to crash, leaving it unable to continue

Author

Same situation as the input_text case above -- the throw is caught by ex_wrapper and returns a proper 400, not a crash. An input_image without image_url is malformed input, so 400 is the right response. Kept as-is.

Author

Updated in fa2ca43 -- same reasoning as above. If an image attachment ends up missing the URL in the replayed history, the thread shouldn't die. Now skips with a warning.

Contributor

Suggested change
chatcmpl_content.push_back({{"text", "[text content missing - input_text is not supported yet!"}, {"type", "text"}});
continue;

Author

Thanks for the report. The throw here doesn't actually crash the thread -- it's caught by ex_wrapper in server.cpp (line 38-69) which converts it to a clean HTTP 400 response. The server keeps running.

This throw fires when input_text is missing its required text field, which is a genuine client bug. Returning 400 is correct here -- it tells the client to fix the malformed request. Kept as-is.

Author

Updated in fa2ca43 -- you were right about this one. With Codex IDE replaying the full conversation history, any 400 here permanently kills the thread. Now skips the malformed item with a server warning instead of rejecting.
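The skip-with-warning pattern adopted here can be sketched as follows; the function name and content shapes are illustrative, not the PR's exact C++ code:

```python
import logging

log = logging.getLogger("server")

def collect_text_parts(content_items):
    # Skip malformed items with a warning instead of raising. Raising
    # would surface as a 400, and since Codex replays the full history
    # on every request, that 400 would repeat forever.
    parts = []
    for item in content_items:
        if item.get("type") == "input_text" and isinstance(item.get("text"), str):
            parts.append({"type": "text", "text": item["text"]})
        else:
            log.warning("skipping unsupported content item: %s", item.get("type"))
    return parts

parts = collect_text_parts([
    {"type": "input_text", "text": "hi"},
    {"type": "input_text"},  # malformed: missing required text field
])
```

The malformed item is dropped with a server-side log entry; the rest of the conversation history still reaches the model.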

Contributor

Suggested change
// Show the attached file as text
if (input_item.contains("filename")) {
chatcmpl_content.push_back({{"text", "[file: " + input_item.at("filename").get<std::string>() + "]"}, {"type", "text"}});
} else {
chatcmpl_content.push_back({{"text", "[file data]"}, {"type", "text"}});
}
} else {
// If not text then give json representation instead of the entire item
chatcmpl_content.push_back({{"text", input_item.dump()}, {"type", "text"}});

This outputs the file to the LLM

Author

Good call on this one. I've reworked input_file handling in edb7a79: instead of rejecting, it now renders the file content as text. If file_data is present, it gets injected into the prompt (with a [file: name] label when filename is available). If only filename is present, it shows a placeholder. Also removed the dead commented-out code.

The approach differs slightly from your suggestion -- I didn't add a raw input_item.dump() catch-all since injecting arbitrary JSON into the prompt can confuse models. Instead, unknown types are just skipped with a server warning log.
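The input_file rendering described above can be sketched like this; the field names `file_data` and `filename` follow the Responses API item shape, while the helper itself is an illustrative stand-in for the C++ change:

```python
def render_input_file(item):
    # Render an input_file item as text instead of rejecting it.
    # file_data (if present) is injected with a [file: name] label;
    # otherwise only a placeholder is shown.
    name = item.get("filename")
    data = item.get("file_data")
    if data is not None:
        label = f"[file: {name}]\n" if name else ""
        return {"type": "text", "text": label + data}
    if name:
        return {"type": "text", "text": f"[file: {name}]"}
    return {"type": "text", "text": "[file data]"}
```

This keeps the file content visible to the model when it is available, and degrades to a labeled placeholder rather than a request-killing error when it is not.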

Contributor

Suggested change
chatcmpl_content.push_back({{"text", input_item.dump()}, {"type", "text"}});

Author

Agreed that rejecting the whole request for an unknown type is too harsh. Changed to skip with a warning log in edb7a79. I went with a silent skip rather than input_item.dump() to avoid injecting raw JSON into the prompt, which tends to confuse models.

Comment on lines 1294 to 1295
if (!exists_and_is_string(output_text, "text")) {
throw std::invalid_argument("'Output text' requires 'text'");
Contributor

Suggested change
if (!exists_and_is_string(output_text, "text")) {
throw std::invalid_argument("'Output text' requires 'text'");

Author

This guard needs to stay. If you remove the check but keep the .at("text") call below, it throws a json::exception when text is missing -- which ex_wrapper catches as a 500 (server error) instead of the current 400 (client error). So removing the guard actually makes things worse. The throw here is correctly caught and returned as a 400 response, not a crash.

Author

Updated in fa2ca43 -- reconsidered after your clarification about the Codex IDE thread loop. The guard now skips the item with a warning + continue instead of throwing. This way a malformed item in replayed history won't kill the thread.

{"text", output_text.at("text")},
{"type", "text"},
});
}
Contributor

Suggested change
}

Author

Same as above -- part of the same validation guard that protects against a 500 from the .at() call. Kept as-is.

});
} else {
throw std::invalid_argument("'type' must be one of 'output_text' or 'refusal'");
throw std::invalid_argument("'type' must be 'output_text', 'input_text', or 'refusal'");
Contributor

Suggested change
throw std::invalid_argument("'type' must be 'output_text', 'input_text', or 'refusal'");
chatcmpl_content.push_back({{"text", "[Incorrect output: type must be 'output_text','input_text', or 'refusal'"}, {"type", "text"}});

Author

Changed in edb7a79 -- unknown assistant content types are now skipped with a warning log instead of rejecting. I went with a clean skip rather than injecting placeholder text into the prompt, since the model doesn't benefit from seeing error messages in its conversation history.

@krystophny
Author

@michaelw9999 thanks for the detailed review! I will commit some fixes soon.

@michaelw9999
Contributor

michaelw9999 commented Apr 6, 2026 via email

@michaelw9999
Contributor

@krystophny what I meant by "crashed the thread": I was using it with the Codex VS Code extension, not the CLI directly. When a 400 or 500 server error happened, it wasn't crashing llama-server. What I meant was that if you have an ongoing chat thread you've been working on (same prompt), once that error happens, it's over. You can't send a new message afterwards or recover, because it goes into a loop with the same 400/500 repeatedly, so you're forced to start a new chat.

@krystophny
Author

@michaelw9999 Thanks for the clarification -- that makes a lot more sense. You're right that when the Codex extension replays conversation history on every request, a 400 for an unsupported item type becomes a permanent loop: the problematic item is baked into the history, so every subsequent message in that thread hits the same error. The thread is effectively dead.

That's exactly the scenario the latest commit (edb7a79) addresses:

  • input_file items are now rendered as text instead of rejected
  • Unknown content types (user, assistant, top-level) are silently skipped instead of rejected

These were the cases where legitimate client data was causing the loop. The remaining throws (missing text field in input_text, missing image_url in input_image, etc.) only fire for structurally broken items that the extension shouldn't be producing in the first place.

If you're still hitting the loop with the latest commit, could you share which specific input item type or structure triggers it? That would help pin down whether there's another case we need to handle gracefully.

@krystophny krystophny force-pushed the responses-api-codex-compat branch from edb7a79 to fa2ca43 on April 6, 2026 05:57
@michaelw9999
Contributor

michaelw9999 commented Apr 6, 2026 via email

@krystophny krystophny force-pushed the responses-api-codex-compat branch from fa2ca43 to 46f2808 on April 6, 2026 06:21
@michaelw9999
Contributor


Hi..... Claude? Would be nice to have a human reply :)

@krystophny
Author


Hi..... Claude? Would be nice to have a human reply :)

sorry :)

@krystophny
Author

I let Claude do the fixes and summaries right now due to busy Easter monday and wanted to get it done. I can have a more thorough look tomorrow then if there is something open. Sorry again.

@michaelw9999
Contributor

I'm not an expert on this, so maybe @ngxson can give input. But why just do SRV_WRN and then continue without sending anything out? I think that makes for bad behavior: there is no error, the user doesn't see it, the model doesn't see it, and then nothing happens. The thing with a lot of the smaller models is that they don't always follow the rules; they might not use the syntax properly or might make a typo (it really happens). If the model assumes that whatever it was trying to do worked, and it gets no response, it would be wrong and could then resume with a wrong state. But the model will actually read and interpret, as text, the real responses from any of the push_backs. Then both the model and the user can see it didn't work and decide what to do, if anything: "Oops, I used the wrong function call for the tool. I should have used input_file instead of input_filename. Let me try again.", or "That tool isn't giving me any output to view the file. Let me use cat instead."
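The alternative michaelw9999 argues for, a visible placeholder the model can read and react to, versus a silent server-side skip, can be sketched as (names are illustrative):

```python
def handle_unknown_item(item, chatcmpl_content, visible=True):
    # Two strategies for unsupported item types: push a placeholder the
    # model (and user) can see and react to, or skip silently so only a
    # server-side log entry records the drop.
    if visible:
        chatcmpl_content.append({
            "type": "text",
            "text": f"[unsupported item type: {item.get('type', '?')}]",
        })
    # else: log server-side only; the model sees nothing

content = []
handle_unknown_item({"type": "input_filename"}, content, visible=True)
```

With `visible=True` the model gets a chance to notice the failure and change course; with a silent skip it may continue under the assumption that the action succeeded.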

@krystophny
Author


In another project I did a retry mechanism that gives the model feedback so it has a second chance to correct its wrong tool call. But I don't know how standard this approach is, or how much harm it would do for smaller models.
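The retry mechanism mentioned here can be sketched as a small loop that feeds the validation error back to the model as a correction hint; this is a generic sketch of the idea, not code from either project:

```python
def call_with_retry(model_call, validate, max_attempts=2):
    # Give the model a second chance: feed the validation error back as
    # a correction hint instead of failing the request outright.
    # model_call(feedback) runs one model turn; validate(reply) returns
    # None on success or an error string describing what was wrong.
    feedback = None
    reply = None
    for _ in range(max_attempts):
        reply = model_call(feedback)
        err = validate(reply)
        if err is None:
            return reply
        feedback = f"Your previous tool call was invalid: {err}. Please retry."
    return reply

# Usage with a mocked model that fails once, then succeeds.
calls = iter(["bad", "good"])
result = call_with_retry(
    lambda fb: next(calls),
    lambda r: None if r == "good" else "wrong tool name",
)
```

Whether this helps or hurts smaller models is an open question, as the comment notes; the sketch only shows the control flow.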

@krystophny
Author

@michaelw9999 after giving it some thought, I'm patching this now to move the usual mistakes to debug logging and reply to the model with a hint so it may recover. If you could play with it a bit in your setup, that would be great for getting more feedback. If only minor issues remain and the rest seems solid to you, I would be happy if you could merge the PR sooner rather than later and file the remaining issues, and I will then deal with them individually in smaller follow-up PRs. What do you think?

Codex CLI compatibility:
- Skip non-function tool types (web_search, code_interpreter)
- Merge developer/system messages into position 0 for Qwen templates
- Strip Responses-only request keys (store, include, prompt_cache_key)
- output_text convenience field in streaming and non-streaming responses

Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted):
- Add 24 missing Response object fields per OpenAI spec
- Fix function_call id/call_id field mapping
- Add sequence_number, output_index, content_index to streaming events
- Accept input_text type and EasyInputMessage for multi-turn input

Verified: codex -p local and codex -p fast work against local
llama.cpp with Qwen3.5 models including native tool calling.

Refs: ggml-org#19138, ggml-org#19720
Add 8 new tests covering the changes in this PR:

- test_responses_schema_fields: verify all 24+ Response object fields
- test_responses_stream_schema_fields: verify sequence_number,
  output_index, content_index on streaming events
- test_responses_non_function_tool_skipped: web_search/code_interpreter
  tool types return 200 instead of 400
- test_responses_mixed_tool_types: non-function tools filtered,
  function tools retained (not rejected at parsing layer)
- test_responses_extra_keys_stripped: store, include, prompt_cache_key,
  web_search, text, truncation, metadata don't cause errors
- test_responses_developer_role: developer messages merged into system
- test_responses_input_text_type: input_text accepted for EasyInputMessage
- test_responses_function_call_id_fields: output items have correct ids

All 10 tests pass (2 existing + 8 new).
- Add sequence_number to ALL streaming events (created, in_progress,
  output_item.added, content_part.added, all delta events)
- Add output_index to all events referencing output items
- Add content_index to content-related events
- Populate full response object in response.created and
  response.in_progress events (was only {id, object, status})
- Add id field to function_call output_item.added events
- Add status: completed to reasoning output_item.done events
- Counter state persisted across streaming chunks via task_result_state

Fixes: spec-compliant client libraries (async-openai) that require
these fields can now parse all streaming events without error.

Refs: ggml-org#21174 (fumlig review comment)
- test_responses_stream_created_event_has_full_response: verify
  response.created contains all 24+ fields with status in_progress
- test_responses_stream_all_events_have_sequence_number: every event
  has sequence_number and they are strictly increasing across stream
- test_responses_stream_delta_events_have_indices: output_index and
  content_index present on all delta/added events

All 14 tests pass (2 original + 9 from previous commit + 3 new).
Code fixes:
- build_oai_resp_metadata accepts status param; completed_at is null
  when status is in_progress (was always set to timestamp)
- response.created/in_progress events use zeroed usage (was passing
  actual prompt tokens before response was logically started)
- Function call item IDs are now generated once per tool call in
  update() and reused consistently across output_item.added,
  function_call_arguments.delta, and output_item.done events
  (was generating independent random IDs in each path)
- Clean up commented-out status checks in server-common.cpp

Test fixes:
- Assert sequence_number on every event unconditionally (was using
  weak "if present" guard)
- Check actual values not just key presence in streaming created
  event test (completed_at is None, usage tokens are 0, etc.)

Refs: ggml-org#21174 (patrick review)
Accept all valid reasoning item content formats in multi-turn input:
- Array of objects: [{"type":"reasoning_text","text":"..."}] (spec format)
- Plain string: "thinking about it" (OpenCode format)
- Null: content:null with encrypted_content (Codex, openai/codex#11834)
- Omitted entirely: no content field present

Previously threw "item['content'] is not an array" for non-array formats,
breaking OpenCode multi-turn conversations. The encrypted_content field
is accepted but ignored for local models (no server-side decryption).

Add 4 tests covering each format variant.

Refs: openai/codex#11834, anomalyco/opencode#19081
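The four content format variants listed in this commit message can be normalized with a small helper like the following; the function name is illustrative, but the accepted shapes match the commit description:

```python
def normalize_reasoning_content(item):
    # Accept all content format variants for reasoning items:
    #   - array of {"type": "reasoning_text", "text": ...} (spec format)
    #   - plain string (OpenCode format)
    #   - null (Codex, alongside encrypted_content)
    #   - omitted entirely
    content = item.get("content")
    if content is None:
        return []                       # null or omitted field
    if isinstance(content, str):
        return [content]                # plain-string format
    if isinstance(content, list):
        return [c.get("text", "") for c in content
                if c.get("type") == "reasoning_text"]
    raise ValueError("unsupported reasoning content format")
```

Any `encrypted_content` field is simply left untouched, mirroring the commit's accept-but-ignore behavior for local models.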
Replace hard rejections (throw) with graceful handling for input types
that are unsupported but represent legitimate client usage:

- input_file: render file_data as text content with filename label,
  or show filename placeholder when file_data is absent
- Unknown user content types: skip with warning log
- Unknown assistant content types: skip with warning log
- Unknown top-level item types: skip with warning log

Validation throws are preserved for genuinely malformed input (e.g.
input_text missing required text field still returns 400).

Remove dead commented-out code in input_file handler.

This prevents multi-turn conversations from failing entirely when
clients like Codex CLI include unsupported content types in their
conversation history.
@krystophny krystophny force-pushed the responses-api-codex-compat branch from 36b8613 to 23e59e8 on April 9, 2026 11:40

Labels

examples python python script changes server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Support OpenAI Responses API (/v1/responses) in llama.cpp server

5 participants