In [1]:
%load_ext dotenv
%dotenv

# Structured Output

In this notebook, we explore how to get models to produce structured output.

LLMs are often viewed as models that take in text and output text.  
Indeed, this is how they are often used, for example in chatbot applications.

In [2]:
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hi!"}],
)
print(completion.choices[0].message.content)

Hello! How can I assist you today?


But there are often situations where we want to get structured output from LLMs.  
For example, you may want a model to classify a movie review as positive or negative.  
Simply asking a model for positive or negative may not get you the desired output:

In [3]:
movie_review = """
I hated every second of this movie.
"""
content = f"""
Here is a movie review:
{movie_review}

Classify it as positive or negative.
"""

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": content,}],
)
print(completion.choices[0].message.content)

The review is negative.


The model responded with a sentence, not just the word "negative".  
We can try prompt engineering to improve the situation.  

In [4]:
movie_review = """
I hated every second of this movie.
"""
content = f"""
Here is a movie review:
{movie_review}

Respond with a single word: `positive` or `negative`.
"""

completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": content,}],
)
print(completion.choices[0].message.content)

negative


Ok, that worked, but we can do better.  
Note the following:  

1. We had to do prompt engineering.  
2. The prompt engineering is not very robust.  
   - It may fail on the next invocation.  
   - It may fail if we switch models.  
3. The moment we switch the expected output, we have to prompt engineer again.
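
To make the fragility concrete, here is a minimal sketch of the kind of hand-written parsing that free-text replies tend to require. The helper `parse_label` and the `ALLOWED` set are hypothetical names introduced for illustration, not part of any library.

```python
import re

# Labels we are willing to accept (assumption: a two-class task).
ALLOWED = {"positive", "negative"}

def parse_label(raw: str) -> str:
    """Extract exactly one allowed label from a free-text reply, or raise."""
    words = re.findall(r"[a-z]+", raw.lower())
    found = {word for word in words if word in ALLOWED}
    if len(found) != 1:
        raise ValueError(f"Could not extract a single label from: {raw!r}")
    return found.pop()

print(parse_label("negative"))
print(parse_label("The review is negative."))
```

Even this helper is brittle: a reply such as "not positive" would still be read as `positive`, and every new output format needs its own parser.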

Clearly, this is not a good solution.  
LLM providers have realised this and are now providing structured output APIs.

In [5]:
from typing import Literal
from pydantic import BaseModel

class Sentiment(BaseModel):
    value: Literal["positive", "negative"]

movie_review = """
I hated every second of this movie.
"""

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": movie_review},
    ],
    response_format=Sentiment,
)

# The SDK has already parsed the JSON reply into a Sentiment instance
sentiment = completion.choices[0].message.parsed
sentiment

Sentiment(value='negative')

Nice!  
We could define a structure using Pydantic.  
Furthermore, we could feed the `movie_review` directly, circumventing the need for prompt engineering.  
The SDK parsed the model's response directly into an instance of the Pydantic `Sentiment` class.  
That means we get type checking and validation!  
This is clearly a much better developer experience.
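
As a small illustration of the validation we get, the sketch below feeds two hand-written JSON strings (stand-ins for model replies, not real API output) through the same `Sentiment` class. It assumes Pydantic v2, which the `model_validate_json` call used in this notebook also requires.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class Sentiment(BaseModel):
    value: Literal["positive", "negative"]

# A well-formed reply parses into a typed instance.
ok = Sentiment.model_validate_json('{"value": "negative"}')
print(ok.value)

# A reply outside the allowed literals is rejected instead of slipping through.
try:
    Sentiment.model_validate_json('{"value": "meh"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```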

Note that `client.beta.chat.completions.parse` is a wrapper around `client.chat.completions.create`.  
It is roughly equivalent to:

In [6]:
# Describe the Sentiment schema to the model
sentiment_schema = {
    "type": "object",
    "properties": {
        "value": {
            "type": "string",
            "enum": ["positive", "negative"],
        }
    },
    "required": [
        "value"
    ],
    "additionalProperties": False
}
sentiment_format = {
    "name": "Sentiment",
    "description": "Sentiment of the movie",
    "schema": sentiment_schema,
    "strict": True
}
response_format = {
    "type": "json_schema",
    "json_schema": sentiment_format,    
}

# Invoke the model
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": movie_review},
    ],
    response_format=response_format,
)

# Parse the output into a Pydantic model instance
Sentiment.model_validate_json(completion.choices[0].message.content)

Sentiment(value='negative')

Nice!  
The `parse` method is a nice abstraction over the `create` method that takes away some complexity for us developers.

In the other notebooks, we learned about "tool calling".  
How is "tool calling" different from "structured output"?  
Let's inspect the completion response that we parsed into the Pydantic instance.

In [7]:
print(completion.choices[0].message.to_json(indent=2))

{
  "content": "{\"value\":\"negative\"}",
  "refusal": null,
  "role": "assistant",
  "annotations": []
}


We observe that "structured output" does not use any `tool_calls`.  
Rather, it puts the output directly in the `content` field.  
This is an indication that this works differently from "tool calling".  
OpenAI doesn't release much information on how this works behind the scenes.  
But we can expect that there is a special chat template that explains the expected `response_format` to the model.  
The model is furthermore instructed to output a regular message in JSON format.  
Most likely, a portion of the training data specifically teaches the model to produce structured output.

Note that we could have achieved the same thing with "tool calling" as follows:

In [8]:
tools = [{
    "type": "function",
    "function": {
        "name": "Sentiment",
        "description": "Sentiment of the movie",
        "parameters": sentiment_schema,
        "strict": True
    }
}]
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": movie_review}],
    tools=tools
)
print(completion.choices[0].message.to_json(indent=2))

{
  "content": null,
  "refusal": null,
  "role": "assistant",
  "annotations": [],
  "tool_calls": [
    {
      "id": "call_n42xQV1PrQDh4hvhAo8jVJoL",
      "function": {
        "arguments": "{\"value\":\"negative\"}",
        "name": "Sentiment"
      },
      "type": "function"
    }
  ]
}
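
Both response shapes shown so far carry the same JSON payload, just in different fields. Here is a minimal sketch of reading either one, assuming message dicts shaped like the two printed outputs. `extract_payload` is a hypothetical helper, not part of the OpenAI SDK.

```python
import json

def extract_payload(message: dict) -> dict:
    """Return the structured payload from a message dict, whichever field carries it."""
    tool_calls = message.get("tool_calls") or []
    if tool_calls:
        # "Tool calling": the JSON lives in the arguments of the first tool call.
        return json.loads(tool_calls[0]["function"]["arguments"])
    content = message.get("content")
    if content is None:
        raise ValueError("message has neither content nor tool calls")
    # "Structured output": the JSON lives directly in the content field.
    return json.loads(content)

response_format_message = {
    "content": "{\"value\":\"negative\"}",
    "refusal": None,
    "role": "assistant",
    "annotations": [],
}
tool_call_message = {
    "content": None,
    "refusal": None,
    "role": "assistant",
    "annotations": [],
    "tool_calls": [
        {
            "id": "call_n42xQV1PrQDh4hvhAo8jVJoL",
            "function": {"arguments": "{\"value\":\"negative\"}", "name": "Sentiment"},
            "type": "function",
        }
    ],
}

print(extract_payload(response_format_message))
print(extract_payload(tool_call_message))
```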


Simple enough.  
So which one should we use?  
[OpenAI themselves recommend the following](https://platform.openai.com/docs/guides/structured-outputs?api-mode=chat#function-calling-vs-response-format):  

"""  
**When to use Structured Outputs via function calling vs via response_format**

Structured Outputs is available in two forms in the OpenAI API:

1. When using function calling
2. When using a json_schema response format

Function calling is useful when you are building an application that bridges the models and functionality of your application.

For example, you can give the model access to functions that query a database in order to build an AI assistant that can help users with their orders, or functions that can interact with the UI.

Conversely, Structured Outputs via response_format are more suitable when you want to indicate a structured schema for use when the model responds to the user, rather than when the model calls a tool.

For example, if you are building a math tutoring application, you might want the assistant to respond to your user using a specific JSON Schema so that you can generate a UI that displays different parts of the model's output in distinct ways.

Put simply:

- If you are connecting the model to tools, functions, data, etc. in your system, then you should use function calling
- If you want to structure the model's output when it responds to the user, then you should use a structured response_format

The remainder of this guide will focus on non-function calling use cases in the Chat Completions API. To learn more about how to use Structured Outputs with function calling, check out the Function Calling guide.  
"""

There is an important feature we have not touched upon yet.  
So far, we have been asking the model nicely to output JSON.  
As we have seen, this works well for the simple cases we have tested.  
However, with very complex schemas, and also with less powerful models,  
we may run into issues where the model's output is not valid JSON or does not conform to the schema.  
What can we do in that case?  

One good solution is to simplify the expected output schema.  
If the model is really struggling, it means that you are asking it to do too complex a task.  

However, you may run into situations where the business logic requires a complex schema and there is no way around it.  
In this case, there is still an option.  
The model generates the answer token by token.  
We can intercept the output and validate it at each step.  
In this way, we can discard any invalid tokens and take the next most likely token.  
This means the validation needs to happen inside the server where the model is being invoked,  
as we need access to all the logits that the model outputs.  
Luckily, OpenAI, vLLM, Ollama, and many other providers now support this feature.  
In fact, we have secretly used it already!  
Note the parameter `"strict": True` in our `tools` and `response_format`.  
This parameter indicates to OpenAI that they should discard invalid tokens.  
Therefore, this ensures that the response complies with the schema we provided.  
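
The idea of discarding invalid tokens can be sketched with a toy decoder. Everything here is an assumption made for illustration: the seven-token vocabulary, the fixed scores in `fake_logits` (a stand-in for a real model), and the prefix check against two hard-coded valid outputs instead of a real JSON schema grammar. Real servers work on the model's actual logits and a compiled schema.

```python
# The only two strings the toy "schema" accepts.
VALID_OUTPUTS = ['{"value":"positive"}', '{"value":"negative"}']

def fake_logits(prefix: str) -> dict[str, float]:
    """Stand-in for a model: it always scores chatty tokens above JSON tokens."""
    return {
        "The": 3.0, " review": 2.5, " is": 2.0,
        "negative": 1.5, '{"value":"': 1.0, '"}': 0.8, "positive": 0.5,
    }

def is_valid_prefix(text: str) -> bool:
    """True if text can still be extended into one of the valid outputs."""
    return any(valid.startswith(text) for valid in VALID_OUTPUTS)

def constrained_decode(max_steps: int = 10) -> str:
    text = ""
    for _ in range(max_steps):
        if text in VALID_OUTPUTS:
            break
        logits = fake_logits(text)
        # Discard every token that would make the output invalid...
        allowed = {tok: score for tok, score in logits.items() if is_valid_prefix(text + tok)}
        if not allowed:
            raise RuntimeError("no token keeps the output valid")
        # ...and take the most likely of the remaining ones.
        text += max(allowed, key=allowed.get)
    return text

unconstrained_first = max(fake_logits(""), key=fake_logits("").get)
print(unconstrained_first)
print(constrained_decode())
```

Left unconstrained, the toy model would start with a chatty token. With the mask in place, it can only ever emit one of the two valid JSON strings.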

With `"strict": False`, we rely purely on the model's ability to comply with the schema.  

Note also that `"strict": True` puts a constraint on the model's output.  
So there is some risk that this impacts the model's performance.  
`"strict": True` also means that your schema needs to be supported by OpenAI, or you will get errors.  
OpenAI themselves now recommend always using `"strict": True`, so the impact is probably negligible.
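
As a rough pre-flight check, the sketch below tests a schema against two strict-mode rules that the `sentiment_schema` above already follows: every object sets `additionalProperties` to `False`, and every property is listed in `required`. `strict_mode_problems` is a hypothetical helper, it covers only these two rules, and it ignores constructs such as `$ref` or `anyOf`. The provider's documentation remains the authority on what is supported.

```python
def strict_mode_problems(schema: dict, path: str = "$") -> list[str]:
    """Collect violations of the two rules checked here, walking nested objects and arrays."""
    problems = []
    if schema.get("type") == "object":
        properties = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be False")
        missing = sorted(set(properties) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: properties not listed in required: {missing}")
        for name, sub_schema in properties.items():
            problems.extend(strict_mode_problems(sub_schema, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(strict_mode_problems(schema["items"], f"{path}[]"))
    return problems

sentiment_schema = {
    "type": "object",
    "properties": {"value": {"type": "string", "enum": ["positive", "negative"]}},
    "required": ["value"],
    "additionalProperties": False,
}
loose_schema = {
    "type": "object",
    "properties": {"value": {"type": "string"}, "score": {"type": "number"}},
    "required": ["value"],
}

print(strict_mode_problems(sentiment_schema))
print(strict_mode_problems(loose_schema))
```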

Here's how you can do structured output with LangChain:

In [9]:
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini").with_structured_output(
    Sentiment,
    strict=True, # Or False
    method="json_schema", # or "function_calling" for the "tool calling" approach.
)
sentiment = model.invoke(movie_review)
sentiment

Sentiment(value='negative')

Similarly, we can enable "strict" "tool calling" through `model.bind_tools`.

In [10]:
model = ChatOpenAI(model="gpt-4o-mini").bind_tools(
    tools=[Sentiment],
    strict=True,
)
message = model.invoke(movie_review)
message.pretty_print()

Tool Calls:
  Sentiment (call_B8rnKrx7yy0ln3zpW7fmWFxJ)
 Call ID: call_B8rnKrx7yy0ln3zpW7fmWFxJ
  Args:
    value: negative


What about Ollama?  
Ollama also has support for structured output.

In [11]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-small",
    base_url="http://10.1.0.255:11434",
).with_structured_output(
    Sentiment,
    method="json_schema", # or "function_calling" for the "tool calling" approach.
)
sentiment = model.invoke(movie_review)
sentiment

Sentiment(value='negative')

Note that Ollama has no "strict" parameter.  
Instead, `method="json_schema"` behaves like `strict=True` and `method="function_calling"` behaves like `strict=False`.

The `function_calling` method doesn't work very well, because the model may respond with a regular message without using the tool.

In [12]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-small",
    base_url="http://10.1.0.255:11434",
).with_structured_output(
    Sentiment,
    method="function_calling",
    include_raw=True
)
raw = model.invoke(movie_review)
raw["raw"].pretty_print()


It sounds like you didn't enjoy the movie. Is there anything specific you disliked about it?


It is thus recommended to use `method="json_schema"` for structured output.

The above example with `method="json_schema"` seemed to work pretty well.  
But there is a crucial flaw!  

It is important to know that OpenAI does pass the `response_format` to the model.  
If you use vLLM or Ollama, the `response_format` is **not passed** to the model!  

Thus, with Ollama, the model never received the schema of the `Sentiment` Pydantic class.  
Why did it work then? Because Ollama discarded any invalid tokens from the model, which is the behaviour we called `strict=True` above.  
This will not work for more complex cases where the model needs to understand the structure that it needs to output.  
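
To see why discarding invalid tokens is not enough on its own, consider a minimal sketch. It assumes a hypothetical schema with a single integer field, and `matches_integer_schema` is a hand-written stand-in for a real schema validator. The check only looks at the shape of the output, so a right and a wrong answer pass equally.

```python
import json

def matches_integer_schema(raw: str) -> bool:
    """True if raw is a JSON object with exactly one key, "integer", holding an int."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"integer"}:
        return False
    value = data["integer"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)

# Both replies have a valid shape; the schema check cannot tell which one is the intended answer.
print(matches_integer_schema('{"integer": 1}'))
print(matches_integer_schema('{"integer": 3}'))
print(matches_integer_schema('{"integer": "three"}'))
```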

Let's test this out.  
We will create a confusing class that can only be parsed correctly if the model has access to the `response_format`.

In [13]:
from pydantic import BaseModel, Field

class ConfusingClass(BaseModel):
    integer: int = Field(description="The sum of the two integers")

In [14]:
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini").with_structured_output(
    ConfusingClass,
    strict=True
)
confusing_class = model.invoke("1,2")
confusing_class

ConfusingClass(integer=3)

gpt-4o-mini got it right! It means that it had access to the `description`.  

Now what about mistral-small with Ollama?

In [15]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-small",
    base_url="http://10.1.0.255:11434",
).with_structured_output(
    ConfusingClass,
    method="json_schema",
)
confusing_class = model.invoke("1,2")
confusing_class

ConfusingClass(integer=1)

It can't do it.  
The reason is that it did not receive the `description` from the response format.  
To fix this, we have to put the schema in the prompt.

In [16]:
import json
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-small",
    base_url="http://10.1.0.255:11434",
).with_structured_output(
    ConfusingClass,
    method="json_schema",
)

user_message = "1,2"
prompt = f"""{user_message}

Respond with a json in the following schema:
{json.dumps(ConfusingClass.model_json_schema(), indent=2)}"""

confusing_class = model.invoke(prompt)
confusing_class

ConfusingClass(integer=3)

Now it gets it right!  
Be careful about this point.  
You cannot freely swap `ChatOpenAI` for `ChatOllama` if you are using structured output.  
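
One way to keep the two backends interchangeable is to always put the schema into the prompt yourself. A minimal sketch follows, using the same prompt wording as the cell above. `with_schema_in_prompt` is a hypothetical helper, and `confusing_schema` is a hand-written dict standing in for `ConfusingClass.model_json_schema()`.

```python
import json

def with_schema_in_prompt(user_message: str, schema: dict) -> str:
    """Append a JSON schema to the user message so the model sees field descriptions."""
    return (
        f"{user_message}\n\n"
        "Respond with a json in the following schema:\n"
        f"{json.dumps(schema, indent=2)}"
    )

# Hand-written stand-in for ConfusingClass.model_json_schema()
confusing_schema = {
    "type": "object",
    "properties": {
        "integer": {"type": "integer", "description": "The sum of the two integers"},
    },
    "required": ["integer"],
}

prompt = with_schema_in_prompt("1,2", confusing_schema)
print(prompt)
```

The resulting string can be passed to `model.invoke` regardless of which chat model class sits behind it.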

To make it extra confusing, Ollama does pass the schema if you use `function_calling`!

In [17]:
from langchain_ollama import ChatOllama

model = ChatOllama(
    model="mistral-small",
    base_url="http://10.1.0.255:11434",
).with_structured_output(
    ConfusingClass,
    method="function_calling",
)
confusing_class = model.invoke("1,2")
confusing_class

ConfusingClass(integer=3)

Please be very careful when using structured output to ensure that the model receives the full set of instructions needed to complete the task.