
feat: cli OpenAI-compatible API response_format support#884

Open
markstur wants to merge 5 commits into generative-computing:main from markstur:issue_824

Conversation

@markstur
Contributor

@markstur markstur commented Apr 17, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

feat: cli OpenAI-compatible API `response_format` support

   - Added `JsonSchemaFormat` model to represent JSON schema definitions
   - Extended `ResponseFormat` to support `json_schema` type (in addition to existing `text` and `json_object`)
   - Used field alias to avoid conflict with Pydantic's `schema` method

   - Added `_json_schema_to_pydantic()` utility function to dynamically convert JSON schemas to Pydantic models
   - Updated `_build_model_options()` to exclude `response_format` from model options (handled separately)
   - Modified `make_chat_endpoint()` to:
     - Parse `response_format` from requests
     - Convert `json_schema` type to Pydantic models using the utility function
     - Detect if the serve function accepts a `format` parameter using `inspect.signature()`
     - Pass the generated Pydantic model as `format=` parameter to serve functions that support it
     - Handle backward compatibility with serve functions that don't accept `format`
   - Added proper error handling for invalid schemas

   - Test json_schema format is converted to Pydantic model and passed to serve
   - Test json_object format doesn't pass a schema
   - Test text format doesn't pass a schema
   - Test error handling for missing json_schema field
   - Test error handling for invalid JSON schemas
   - Test backward compatibility with serve functions without format parameter
   - Test optional fields in JSON schemas

When a client sends a request with `response_format.type = "json_schema"`, the server:
1. Extracts the JSON schema from `response_format.json_schema.schema`
2. Dynamically creates a Pydantic model from the schema
3. Passes it as the `format=` parameter to the serve function
4. The serve function can then use this for constrained decoding via Mellea's `instruct()` method

This maps OpenAI's `response_format` API to Mellea's native `format=` parameter for structured output.

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

Attribution

  • AI coding assistants used


Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
@markstur markstur requested a review from a team as a code owner April 17, 2026 23:37
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@markstur markstur changed the title Issue 824 feat: cli OpenAI-compatible API response_format support Apr 17, 2026
@github-actions github-actions Bot added the enhancement New feature or request label Apr 17, 2026
Comment thread cli/serve/app.py
Comment thread cli/serve/streaming.py Outdated
Comment thread cli/serve/app.py Outdated
Comment thread cli/serve/app.py Outdated
Comment thread cli/serve/models.py
Comment thread cli/serve/app.py Outdated
Contributor

@jakelorocco jakelorocco left a comment


I have a broader question that is touched on in my comments below: If we trust our backend provider to properly handle our structured output requests, why do we do any validation on our side? (Because module.serve might do something funky?)

Comment thread cli/serve/app.py Outdated
Comment thread cli/serve/app.py
* elif/pass not needed. Comment
* set finish reason from result

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
@planetf1
Contributor

OpenAI will return a finish_reason of length if limits are hit.

If we're trying to be OpenAI compatible (which is the intent here), my default response would be to say we should emulate that behaviour -- though I am uncomfortable in some ways with returning syntactically invalid JSON. I guess the rationale is 'we returned what we could'. In some ways it comes down to our 'contract' with callers

For streaming this is a key difference as we fail validation (after the data has been streamed)

We also have the inconsistency between streaming (here) and non-streaming. For the latter we're already returning the same as OpenAI since we missed doing any validation...

I'd say to remove the validation -- then we have the same behaviour and it's down to the format parameter and how the backend constrained decoding works. I think this is what @jakelorocco describes

I think this also means my earlier comment about ADDING streaming validation is something I no longer feel is the right way forward. In fact the opposite -- remove it

@psschwei
Member

+1 on emulating openai as faithfully as possible

* inspect serve() to set accepts_format and is_async in make_chat_endpoint instead of on every request
* improved support for json_schema that can be converted to pydantic
* no output validation. The module is given a format when applicable, but instead
  of throwing an error if the formatting is incorrect, the results are sent to the client.
* consolidated redundant blocks of kwargs by building a dict of kwargs needed.
* improved setting of finish_reason
* added tests
* improved comments/docstrings where clarification needed

Signed-off-by: Mark Sturdevant <mark.sturdevant@ibm.com>
@markstur
Contributor Author

Removed the validation, so now it is more consistent with what you might expect from an OpenAI API server that will return what it has, even if it is truncated or otherwise broken.

@markstur markstur enabled auto-merge May 1, 2026 00:23
value.upper() if value and value[0].isalpha() else f"VALUE_{index}"
): value
for index, value in enumerate(enum_values)
}
Contributor


String-enum members collide when two values upper-case to the same name. The dict comprehension at L126-131 keys each member on value.upper(), so ["open", "OPEN"] produces a 1-member enum — one value is silently dropped, and validating the dropped value fails with ValidationError even though the schema declared it.

Reproduced against the branch:

m = json_schema_to_pydantic(
    {"type": "object",
     "properties": {"s": {"type": "string", "enum": ["open", "OPEN"]}},
     "required": ["s"]},
)
m.model_validate({"s": "open"})   # ValidationError

Case-variant enums are uncommon in hand-written OpenAI schemas but routine in schemas migrated from other sources (OpenAPI, Java/TS codebases, ["AM", "PM"], etc.). The failure is silent at conversion time and only surfaces as a mismatched validation later.

Two options, either is fine:

  1. Dedupe member names — track seen names, append _1, _2, … on collision. Minimal change, preserves the named-Enum form.
  2. Collapse to Literal — the non-string branch at L134 already uses Literal[tuple(enum_values)], which is case-exact, needs no name generation, and would let L122-132 disappear. No public contract depends on the Enum class being named.

Worth a regression test asserting both "open" and "OPEN" validate when both are declared.
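A minimal sketch of option 1, assuming a standalone helper (`build_string_enum` is a hypothetical name, not the branch's code):

```python
from enum import Enum

def build_string_enum(name: str, values: list[str]) -> type[Enum]:
    """Dedupe generated member names so case-variant values all survive."""
    members: dict[str, str] = {}
    for index, value in enumerate(values):
        base = value.upper() if value and value[0].isalpha() else f"VALUE_{index}"
        candidate, suffix = base, 1
        while candidate in members:  # collision: append _1, _2, ...
            candidate = f"{base}_{suffix}"
            suffix += 1
        members[candidate] = value
    return Enum(name, members)

Status = build_string_enum("Status", ["open", "OPEN"])
assert len(Status) == 2                      # both values kept
assert Status("open") is not Status("OPEN")  # each validates to its own member
```

This is the regression test shape suggested above: both case variants must round-trip once declared together.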

raise ValueError(f"{_format_path(path)} enum must be an array")
return _enum_annotation(enum_values, path)

field_type = normalized_schema.get("type", "string")
Contributor


L286: field_type = normalized_schema.get("type", "string") silently defaults any typeless schema node to str. JSON Schema with no type keyword is valid and means "any type"; defaulting to str narrows that silently.

Reproduced:

json_schema_to_pydantic({
    "type": "object",
    "properties": {"data": {"description": "anything"}},
    "required": ["data"],
}).model_validate({"data": 42})  # ValidationError: expected str

{"description": "..."} without type is a real pattern in schemas pasted from OpenAPI specs where types are contextual. The converter already raises ValueError for unrecognised types ({"type": "integr"} → "unsupported JSON schema type"), so this is the only silently-wrong fallthrough left in the type dispatch.

Two options:

  1. Raise on missing type — strictest and matches OpenAI's own strict-mode requirement that every node declares a type. Most aligned with the project's fail-fast policy.
  2. Default to Any — spec-faithful; treats missing type as "any JSON value" and lets Pydantic validate nothing specific.

Either is a clear improvement over the current silent narrowing. Lower severity than the enum collision — OpenAI-style schemas usually do declare type on every node — but worth fixing while the converter is fresh.
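A sketch of option 2, again with hypothetical helper names and only flat primitive dispatch:

```python
from typing import Any

from pydantic import create_model

_TYPE_MAP = {"string": str, "integer": int, "number": float, "boolean": bool}

def annotation_for(prop_schema: dict) -> Any:
    if "type" not in prop_schema:
        return Any  # JSON Schema: a node with no "type" matches any value
    return _TYPE_MAP[prop_schema["type"]]  # unknown type -> KeyError for caller

Model = create_model("Payload", data=(annotation_for({"description": "anything"}), ...))
Model.model_validate({"data": 42})      # valid, no longer narrowed to str
Model.model_validate({"data": "text"})  # also valid
```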

if normalized_schema.get("type") != "object":
raise ValueError(f"{_format_path(path)} must be an object schema")

cache_key = f"{current_model_name}:{id(object_schema)}"
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L354: cache_key = f"{current_model_name}:{id(object_schema)}" — the cache never protects against recursive $ref. _normalize_schema returns a fresh dict on every ref resolution, so each re-entry to _object_schema_to_model for the same $def gets a new id(), a new cache miss, and another recursion.

Reproduced:

json_schema_to_pydantic({
    "type": "object",
    "$defs": {"Node": {
        "type": "object",
        "properties": {"val": {"type": "integer"},
                       "child": {"$ref": "#/$defs/Node"}},
        "required": ["val"],
    }},
    "properties": {"root": {"$ref": "#/$defs/Node"}},
    "required": ["root"],
})  # RecursionError

Recursive schemas are rare in OpenAI-style structured output (strict mode historically disallowed them) but a valid JSON Schema pattern for tree/linked-list outputs. The current behaviour is a server-side crash — caught by the endpoint's except Exception and surfaced as a generic 500.

Minimum fix: detect cycles by tracking in-flight ref names in a set and raise ValueError("recursive $ref is not supported") — turns a 500 into a clean 400 and adds one line to the "Still unsupported" list in the docstring. Supporting recursion properly would need Pydantic forward refs + model_rebuild(); probably not worth it for v1.

Severity: lower than findings 1 and 2 — unlikely to hit OpenAI users in practice — but trivial to fix and improves the error surface.
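The in-flight-set idea can be sketched as a standalone pre-pass (`check_refs` is a hypothetical helper; the real fix would thread the set through the converter's own recursion):

```python
def check_refs(schema: dict, defs: dict, in_flight: frozenset[str] = frozenset()) -> None:
    """Raise ValueError on a recursive $ref before any model building starts."""
    ref = schema.get("$ref")
    if ref is not None:
        name = ref.rsplit("/", 1)[-1]
        if name in in_flight:
            raise ValueError("recursive $ref is not supported")
        check_refs(defs[name], defs, in_flight | {name})
        return
    for sub in schema.get("properties", {}).values():
        check_refs(sub, defs, in_flight)
    if "items" in schema:
        check_refs(schema["items"], defs, in_flight)

# The recursive schema from the repro above now fails fast with a clean error
# instead of blowing the stack:
schema = {
    "type": "object",
    "$defs": {"Node": {"type": "object",
                       "properties": {"val": {"type": "integer"},
                                      "child": {"$ref": "#/$defs/Node"}},
                       "required": ["val"]}},
    "properties": {"root": {"$ref": "#/$defs/Node"}},
    "required": ["root"],
}
try:
    check_refs(schema, schema["$defs"])
except ValueError as exc:
    print(exc)  # -> recursive $ref is not supported
```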

Comment thread cli/serve/utils.py
if choices and len(choices) > 0:
finish_reason = choices[0].get("finish_reason")
if finish_reason in valid_reasons:
return finish_reason
Contributor


extract_finish_reason handles Ollama (chat_response.done_reason) and OpenAI-direct (oai_chat_response["choices"][0]["finish_reason"]) but silently skips LiteLLM and returns the default "stop" for every LiteLLM-backed request.

LiteLLM stores its response at a different key and in a different shape — mellea/backends/litellm.py:443,693:

mot._meta["litellm_chat_response"] = chunk.choices[0].model_dump()

This is a flat Choice dict ({"finish_reason": "length", "index": 0, "message": {...}}), not wrapped in a choices array, so neither existing branch matches.

Not a regression — pre-PR the code hardcoded "stop" for every backend, so LiteLLM users are no worse off. But the PR's stated goal is "use backend finish_reason", and LiteLLM is a first-class backend in Mellea; leaving it defaulting to "stop" means clients that branch on finish_reason == "length" to paginate will loop forever on LiteLLM-fronted models.

Proposed addition (after L46):

# LiteLLM backend stores a flat Choice.model_dump() dict
litellm_response = output._meta.get("litellm_chat_response")
if isinstance(litellm_response, dict):
    finish_reason = litellm_response.get("finish_reason")
    if finish_reason in valid_reasons:
        return finish_reason

HuggingFace backend is also uncovered but has no OpenAI-equivalent finish signal to extract, so fall-through to "stop" is reasonable there. Worth calling out in the docstring.
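Putting the three backend shapes together, a runnable sketch of the proposed extraction order (the helper name and exact meta structure are assumptions based on the keys quoted above):

```python
from typing import Any

VALID_REASONS = {"stop", "length", "tool_calls", "content_filter"}

def extract_finish_reason(meta: dict[str, Any]) -> str:
    # Ollama: done_reason attribute on the chat response object
    ollama = meta.get("chat_response")
    if getattr(ollama, "done_reason", None) in VALID_REASONS:
        return ollama.done_reason
    # OpenAI-direct: finish_reason inside a "choices" array
    oai = meta.get("oai_chat_response")
    if isinstance(oai, dict) and oai.get("choices"):
        reason = oai["choices"][0].get("finish_reason")
        if reason in VALID_REASONS:
            return reason
    # LiteLLM: a flat Choice.model_dump() dict, not wrapped in "choices"
    litellm = meta.get("litellm_chat_response")
    if isinstance(litellm, dict):
        reason = litellm.get("finish_reason")
        if reason in VALID_REASONS:
            return reason
    return "stop"  # no backend signal: fall back to the pre-PR default

assert extract_finish_reason(
    {"litellm_chat_response": {"finish_reason": "length", "index": 0}}
) == "length"
```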

Comment thread test/cli/test_serve.py
assert call_args.kwargs["format"] is None

# Verify response is successful
assert isinstance(response, ChatCompletion)
Contributor


L608-609 (and L635-636 in the text-format sibling test) use a conditional assert that never fires:

if "format" in call_args.kwargs:
    assert call_args.kwargs["format"] is None

The shared mock_module fixture uses Mock(), and inspect.signature(Mock().serve) returns (*args, **kwargs) with parameters ["args", "kwargs"], so "format" is never in sig.parameters, accepts_format in make_chat_endpoint is always False for this fixture, and format is never added to serve_kwargs. The if branch is unreachable; the assert proves nothing.

Reproduced:

>>> inspect.signature(Mock().serve).parameters
mappingproxy({"args": <Parameter "*args">, "kwargs": <Parameter "**kwargs">})
>>> "format" in inspect.signature(Mock().serve).parameters
False

Net effect: these two tests only verify that endpoint() returns a ChatCompletion when response_format is json_object / text — the "no format model is passed" behaviour they claim to test is not tested at all.

Fix options:

  1. Unconditional assert: assert "format" not in call_args.kwargs. Locks in the mock's actual behaviour (no format in params → no format in call).
  2. Use a real function with format= param — e.g. def fake_serve(input, requirements=None, model_options=None, format=None): ... and assert format is None unconditionally. More faithful to the real code path.

Option 2 is closer to what the test docstring claims; option 1 is a one-line fix.
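Option 2 can be sketched as follows (parameter names are assumptions mirroring the comment's example, not the serve API's guaranteed signature):

```python
import inspect
from unittest.mock import Mock

def fake_serve(input, requirements=None, model_options=None, format=None):
    """Stub serve() whose signature really declares a `format` parameter."""
    return "ok"

# With a real function, the signature-based detection path actually triggers:
assert "format" in inspect.signature(fake_serve).parameters

# ...whereas a bare Mock only exposes *args/**kwargs, so it never does:
assert "format" not in inspect.signature(Mock().serve).parameters
```

Wiring `fake_serve` into the fixture lets the test assert `format is None` unconditionally, as the docstring claims.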

Comment thread cli/serve/app.py
return create_openai_error_response(
status_code=400,
message=f"Invalid JSON schema: {e!s}",
error_type="invalid_request_error",
Contributor


Scoped nit on the schema-conversion error handling (the broader except Exception handler at L252 is pre-existing, not in scope for this PR).

L181 catches only ValueError from json_schema_to_pydantic, but the converter can also surface non-ValueError exceptions for pathological schemas that should still be client errors:

  • create_model(...) can raise PydanticUserError / PydanticSchemaGenerationError on schemas that produce an internally-inconsistent model.
  • Enum(...) in _enum_annotation can raise TypeError for certain member-name edge cases (see the enum-collision comment above).

These currently fall through to the generic 500 handler and surface to the client as "Internal server error: ..." rather than a clean 400 "Invalid JSON schema".

Minimal fix — widen this block only:

except (ValueError, TypeError) as e:
    return create_openai_error_response(
        status_code=400,
        message=f"Invalid JSON schema: {e!s}",
        error_type="invalid_request_error",
        param="response_format.json_schema.schema",
    )

Or import the Pydantic error types explicitly and include them in the tuple. Either way keeps the scope tight to schema-conversion failures and leaves the outer handler alone.

Comment thread cli/serve/models.py
"""JSON Schema definition when type is 'json_schema'."""


class StreamOptions(BaseModel):
Contributor


Minor: ResponseFormat accepts {"type": "json_schema", "json_schema": null} at Pydantic-parse time and defers the "json_schema is required when type=json_schema" check to the endpoint at app.py:169-175. The endpoint does catch it and returns a clean 400, so this is a code-organisation concern rather than a functional bug.

A model_validator on ResponseFormat would co-locate the invariant with the type it constrains and shrink the endpoint body by a branch:

from pydantic import model_validator

class ResponseFormat(BaseModel):
    type: Literal["text", "json_object", "json_schema"]
    json_schema: JsonSchemaFormat | None = None

    @model_validator(mode="after")
    def _require_json_schema(self) -> "ResponseFormat":
        if self.type == "json_schema" and self.json_schema is None:
            raise ValueError("json_schema is required when type is 'json_schema'")
        return self

The endpoint's existing except RequestValidationError handler (via validation_exception_handler) will then turn the ValueError into a 400 with the same shape as other validation errors, and the explicit block at app.py:169-175 can go away.

Not blocking — happy to ship as-is if the author prefers endpoint-level validation.

Contributor

@planetf1 planetf1 left a comment


Suggesting some changes. I think the main one is around the enum clash -- the others add value over what was there before, so the comments are more about completeness rather than regression - so I think it would be good to fix up, but minded to approve when you are comfortable with the changes. nothing else is a blocker


Labels

enhancement New feature or request


Development

Successfully merging this pull request may close these issues.

m serve OpenAI API structured output

4 participants