server: improve Responses API compliance and Codex CLI compatibility by krystophny · Pull Request #21174 · ggml-org/llama.cpp

krystophny · 2026-03-30T07:54:24Z

Summary

Improve Responses API (/v1/responses) compliance and Codex CLI compatibility.

Response object (non-streaming and streaming):

Add 24 missing fields per OpenAI Responses API spec (tools, truncation, temperature, top_p, metadata, store, service_tier, etc.)
Fix function_call id/call_id field mapping (id gets unique fc_ ID, call_id gets the model's tool_call.id)
Add output_text top-level convenience field
Add output_tokens_details with reasoning_tokens to usage
Restore refusal content type handling (was broken in upstream — unreachable code after throw)

Streaming events:

Add sequence_number to ALL streaming events (created, in_progress, added, delta, done, completed)
Add output_index to all events referencing output items
Add content_index to content-related events
Populate full response object in response.created and response.in_progress (was only {id, object, status})
Function call item IDs consistent across output_item.added, function_call_arguments.delta, and output_item.done
Counter state persisted across streaming chunks via task_result_state

Codex CLI compatibility:

Skip non-function tool types (web_search, code_interpreter) instead of rejecting with 400
Merge developer/system role messages into first system message for templates requiring system at position 0 (e.g. Qwen)
Strip Responses-only request keys (store, include, prompt_cache_key, web_search, text, truncation, metadata)
Accept input_text type alongside output_text for EasyInputMessage / AssistantMessageItemParam

Prior art: #19720 by @riskywindow (stale, 500+ commits behind). This PR incorporates applicable ideas adapted to the current codebase.

Fixes #19138. Related: #20156, #20733, #20607.

Verification

pytest (14 tests, tinyllama2)

$ LLAMA_SERVER_BIN_PATH=./build/bin/llama-server pytest unit/test_compat_oai_responses.py -v
test_responses_with_openai_library PASSED
test_responses_stream_with_openai_library PASSED
test_responses_schema_fields PASSED
test_responses_stream_schema_fields PASSED
test_responses_non_function_tool_skipped PASSED
test_responses_only_non_function_tools_same_as_no_tools PASSED
test_responses_extra_keys_stripped PASSED
test_responses_developer_role_merging PASSED
test_responses_input_text_type_multi_turn PASSED
test_responses_output_text_matches_content PASSED
test_responses_stream_output_text_consistency PASSED
test_responses_stream_created_event_has_full_response PASSED
test_responses_stream_all_events_have_sequence_number PASSED
test_responses_stream_delta_events_have_indices PASSED
14 passed

E2E with async OpenAI SDK + Qwen3.5-9B Q4_K_M

$ python3 e2e_test.py  # uses AsyncOpenAI against localhost:8080
Test 1: Non-streaming basic          OK: output_text='4', fields=36
Test 2: Streaming basic              OK: 205 events, gathered text matches
Test 3: Non-function tools skipped   OK: status=completed
Test 4: Developer role merging       OK
Test 5: Multi-turn with input_text   OK
Test 7: Streaming seq_nums           OK: 105 events, strictly increasing
Test 6: Concurrent stress (5 req)    OK: 5/5 completed, 0 failures
ALL E2E TESTS PASSED

Codex CLI E2E

$ codex exec -p local "Say hello in one word"
Hello
tokens used: 6,119

$ codex exec -p local "Run 'echo hello world' and show the output"
exec: /bin/zsh -lc 'echo hello world' succeeded
hello world
tokens used: 1,054

Test plan

14 pytest tests covering all code paths (schema, streaming, tools, roles, keys)
Async OpenAI SDK: non-streaming, streaming, concurrent stress
Streaming events: sequence_number strictly increasing, output_index/content_index present
response.created has full response object with all fields
Non-function tools silently skipped (200 not 400)
Developer/system messages merged correctly
Codex CLI connects and works (text + tool calling)
Function call IDs consistent between added/done streaming events

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI was used to help draft code, tests, and PR text. The submitter reviewed and is responsible for the final changes.

If reviewers prefer, this can be split into smaller PRs for faster review. Let me know.

ngxson · 2026-03-30T08:28:41Z

please add proper testings for it

krystophny · 2026-03-30T13:11:56Z

please add proper testings for it

Done!

fumlig · 2026-03-30T15:04:28Z

According to the responses API reference for streaming events, the response.created and response.in_progress streaming events should also contain created_at, model etc.

It seems like the current implementation just returns a minimal response object in those events. This causes issues with certain spec-compliant client libraries like async-openai. Would it be possible to add the missing streaming event fields here as well?

- Add sequence_number to ALL streaming events (created, in_progress, output_item.added, content_part.added, all delta events) - Add output_index to all events referencing output items - Add content_index to content-related events - Populate full response object in response.created and response.in_progress events (was only {id, object, status}) - Add id field to function_call output_item.added events - Add status: completed to reasoning output_item.done events - Counter state persisted across streaming chunks via task_result_state Fixes: spec-compliant client libraries (async-openai) that require these fields can now parse all streaming events without error. Refs: ggml-org#21174 (fumlig review comment)

Code fixes: - build_oai_resp_metadata accepts status param; completed_at is null when status is in_progress (was always set to timestamp) - response.created/in_progress events use zeroed usage (was passing actual prompt tokens before response was logically started) - Function call item IDs are now generated once per tool call in update() and reused consistently across output_item.added, function_call_arguments.delta, and output_item.done events (was generating independent random IDs in each path) - Clean up commented-out status checks in server-common.cpp Test fixes: - Assert sequence_number on every event unconditionally (was using weak "if present" guard) - Check actual values not just key presence in streaming created event test (completed_at is None, usage tokens are 0, etc.) Refs: ggml-org#21174 (patrick review)

krystophny · 2026-03-30T16:27:17Z

@fumlig Thanks for the feedback. The streaming events now include the full response object in response.created and response.in_progress events (with all 24+ required fields, status: "in_progress", completed_at: null, zeroed usage). All partial events (output_item.added, content_part.added, all deltas) now have sequence_number, output_index, and content_index per spec.

Tested with the async OpenAI Python SDK (which validates event schemas similarly to async-openai on the Rust side).

@ngxson Tests added: 14 pytest tests covering schema fields, streaming compliance, tool type skipping, developer role merging, key stripping, multi-turn input, and output_text consistency. Plus E2E tests with async OpenAI SDK against Qwen3.5-9B.

If you'd prefer this split into smaller PRs for faster review, happy to do so.

krystophny · 2026-03-30T16:32:42Z

Additional E2E testing (async OpenAI SDK, Codex CLI, concurrent stress tests, multiple Qwen3.5 models) is documented in the companion meta-repo: https://github.com/krystophny/llama.cpp-dev

The meta-repo includes a Nix flake for reproducible test environments and scripted test harnesses under scripts/.

Codex CLI compatibility: - Skip non-function tool types (web_search, code_interpreter) - Merge developer/system messages into position 0 for Qwen templates - Strip Responses-only request keys (store, include, prompt_cache_key) - Restore refusal content type handling Responses API compliance (ideas from ggml-org#19720 by riskywindow, adapted): - Add 24 missing Response object fields per OpenAI spec - Fix function_call id/call_id field mapping - Add sequence_number, output_index, content_index to ALL streaming events - Full response object in response.created/in_progress events - Accept input_text type and EasyInputMessage for multi-turn input - output_text convenience field, output_tokens_details 14 pytest tests, E2E tested with async OpenAI SDK and Codex CLI. Refs: ggml-org#19138, ggml-org#19720, ggml-org#21174

ngxson · 2026-03-30T17:18:49Z

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed a new PR, and without CI passed, I cannot merge it

krystophny · 2026-03-30T18:05:54Z

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed a new PR, and without CI passed, I cannot merge it

@ngxson thanks! I marked it as draft and iterate a bit and test locally more with https://github.com/lazy-fortran/fortbench during the coming days and see if any problems pop up on the codex path compared to opencode. Then I'll let you know.

- Add sequence_number to ALL streaming events (created, in_progress, output_item.added, content_part.added, all delta events) - Add output_index to all events referencing output items - Add content_index to content-related events - Populate full response object in response.created and response.in_progress events (was only {id, object, status}) - Add id field to function_call output_item.added events - Add status: completed to reasoning output_item.done events - Counter state persisted across streaming chunks via task_result_state Fixes: spec-compliant client libraries (async-openai) that require these fields can now parse all streaming events without error. Refs: ggml-org#21174 (fumlig review comment)

Code fixes: - build_oai_resp_metadata accepts status param; completed_at is null when status is in_progress (was always set to timestamp) - response.created/in_progress events use zeroed usage (was passing actual prompt tokens before response was logically started) - Function call item IDs are now generated once per tool call in update() and reused consistently across output_item.added, function_call_arguments.delta, and output_item.done events (was generating independent random IDs in each path) - Clean up commented-out status checks in server-common.cpp Test fixes: - Assert sequence_number on every event unconditionally (was using weak "if present" guard) - Check actual values not just key presence in streaming created event test (completed_at is None, usage tokens are 0, etc.) Refs: ggml-org#21174 (patrick review)

krystophny · 2026-04-03T07:03:38Z

I'm ok with the current PR, but could you let us know when you are finally ready for review? I have been re-running the CI each time you pushed a new PR, and without CI passed, I cannot merge it

@ngxson I saw no more issues in practical usage with Codex CLI and the smaller Qwen 3.5 models with the responses API. Rebased once more and set ready to review.

Macmee · 2026-04-04T00:10:04Z

Thank you for this!!! I ran into this problem today when I was trying codex with llama.cpp and this change makes it work :)

michaelw9999 · 2026-04-05T22:51:15Z

Nice work, thank you! I tried using it with the Codex IDE extension and almost everything works, but still had a few issues causing errors that would freeze up the entire thread, making it unusable to continue from where it was left off. I'll share some small suggested changes soon.

michaelw9999 · 2026-04-06T00:10:55Z