---
title: "Is Pydantic making your LLM dumber?"
date: "11/10/2024"
date-modified: last-modified
description-meta: "Structured outputs make it easy to work with LLMs, but are they impacting their performance?"
toc: true
toc-depth: 3
lightbox: true
fig-cap-location: margin
categories:
  - llm
  - openai
  - pydantic
  - python
author:
  - name: Dylan Castillo
    url: https://dylancastillo.co
    affiliation: Iwana Labs
    affiliation-url: https://iwanalabs.com
citation: true
comments:
  utterances:
    repo: dylanjcastillo/blog_comments
    theme: dark-blue
    issue-term: pathname
draft: true
---

Since the [Let me speak freely?](https://arxiv.org/abs/2408.02442) paper was published, there's been a lot of talk^[By a lot, I mean just a couple of randos on my corner of the internet.] about the impact of structured outputs on the performance of LLMs.

The [.txt](https://dottxt.co) team wrote a very [compelling rebuttal](https://blog.dottxt.co/say-what-you-mean.html). But they focused on open-weight models, which I rarely use, and I don't know how well their results translate to proprietary models. I also wanted to see how structured outputs perform on benchmarks that are [less likely to be part of the training data](https://arxiv.org/abs/2405.00332) of the models, such as [LiveBench](https://arxiv.org/abs/2406.19314).

So, I decided to run my own experiment to evaluate the impact of structured outputs on the performance of LLMs.

## Experiment design

I gathered the published reasoning, language, and math^[I only included the olympiad and math_comp tasks from the math category.] questions from [LiveBench](https://huggingface.co/livebench), and then I modified them to remove any formatting instructions.

Using these questions, I created three types of prompts:

1. Without structured outputs^[By structured outputs, I refer to the generation of JSON or XML objects.]: The LLM gets a system message that describes the response format and a user message with the question. The format is loose: a `REASONING:` line followed by an `ANSWER:` line.
2. With structured outputs using tool calls: The system message doesn't specify a format. Instead, the `Response` Pydantic model is passed to the LLM as a tool, and the response is parsed into that model.
3. With structured outputs using JSON mode: The system message includes the JSON schema of the `Response` Pydantic model, and the LLM is asked to return a JSON object that follows it.

## Set up

In [1]:
# | output: false
# | echo: false

import nest_asyncio

nest_asyncio.apply()

First, start by importing the necessary libraries:

In [None]:
# | output: false

import asyncio
import difflib
import json
import re
from asyncio import Semaphore
from enum import Enum
from itertools import permutations
from pathlib import Path
from typing import Any, List

import instructor
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from openai.types.chat import ChatCompletion
from pydantic import BaseModel, Field
from scipy import stats

np.random.seed(42)

load_dotenv()


This imports the required libraries. Two things to note:

1. I'm using [LangSmith](https://smith.langchain.com/) to trace the calls to the OpenAI API. This makes it easier to debug and see the prompts that are being used.
2. I'm using `instructor` to get structured outputs through both tool calls and JSON mode, so I can compare the two.

Next, load the data.

## Load data

In [3]:
data_dir = Path().absolute().parent / "data" / "live_bench"
reasoning_dir = data_dir / "reasoning"
math_dir = data_dir / "math"
language_dir = data_dir / "language"

df_reasoning = pd.read_json(
    reasoning_dir / "updated_questions.jsonl", lines=True
)
df_language = pd.read_json(language_dir / "updated_questions.jsonl", lines=True)
df_math = pd.read_json(math_dir / "updated_questions.jsonl", lines=True)

I'm using slightly modified versions of the reasoning, language, and math questions (excluding some categories from the latter).

I had to modify them because the original questions included formatting requirements that would have made the comparison less fair.

You can see the changes I made [here](../scripts/is-pydantic-making-your-model-dumber/update_livebench_questions.py).
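
As a rough illustration of the kind of change involved, here is a hypothetical sketch. It is not the linked script: the sample question, the regex pattern, and the `strip_format_instructions` name are all made up for this example.

```python
import re

# Hypothetical pattern: matches a sentence that dictates the answer format,
# e.g. "Put your answer in **bold** as a single word."
FORMAT_INSTRUCTION = re.compile(
    r"\s*(?:Please\s+)?(?:put|return|write|give)\s+your\s+(?:final\s+)?answer\s+in\b[^.]*\.",
    flags=re.IGNORECASE,
)


def strip_format_instructions(question: str) -> str:
    """Remove format-dictating sentences and keep the rest of the question."""
    return FORMAT_INSTRUCTION.sub("", question).strip()


# Made-up question, not taken from LiveBench.
question = (
    "Which shape has the most sides: a triangle, a square, or a hexagon? "
    "Put your answer in **bold** as a single word."
)
cleaned = strip_format_instructions(question)
print(cleaned)
```

The point is only that the question text stays the same while the sentence that forces a specific output format is dropped, so each prompt type can supply its own format.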

In [4]:
langsmith_client = wrap_openai(AsyncOpenAI())
tool_calls_client = instructor.from_openai(langsmith_client, mode=instructor.Mode.TOOLS)
json_mode_client = instructor.from_openai(langsmith_client, mode=instructor.Mode.JSON)
strict_tool_calls_client = instructor.from_openai(
    langsmith_client, mode=instructor.Mode.TOOLS_STRICT
)

In [5]:
class Response(BaseModel):
    reasoning: str = Field(description="Your reasoning explaining your answer.")
    answer: str = Field(description="Your answer, don't include any other text.")


class PromptType(Enum):
    WITHOUT_STRUCTURED_OUTPUT = "no_so"
    WITH_TOOL_CALLS = "so_tool_calls"
    WITH_JSON_MODE = "so_json_mode"
    WITH_STRICT_TOOL_CALLS = "so_strict_tool_calls"


class ClientConfig(BaseModel):
    client: Any
    system_message: str


CONFIG_MAPPING = {
    PromptType.WITHOUT_STRUCTURED_OUTPUT.value: ClientConfig(
        client=langsmith_client,
        system_message=(
            "You're a helpful assistant. You will help me answer a question."
            "\nYou must respond using the following format:"
            "\nREASONING: <your reasoning explaining your answer>"
            "\nANSWER: <your answer, don't include any other text>"
        ),
    ),
    PromptType.WITH_TOOL_CALLS.value: ClientConfig(
        client=tool_calls_client,
        system_message=(
            "You're a helpful assistant. You will help me answer a question."
        ),
    ),
    PromptType.WITH_JSON_MODE.value: ClientConfig(
        client=json_mode_client,
        system_message=(
            "You're a helpful assistant. You will help me answer a question."
            + "\nYou must respond using the following JSON schema:"
            + json.dumps(Response.model_json_schema())
        ),
    ),
    PromptType.WITH_STRICT_TOOL_CALLS.value: ClientConfig(
        client=strict_tool_calls_client,
        system_message=(
            "You're a helpful assistant. You will help me answer a question."
        ),
    ),
}

Then, I defined the prompt types I'm going to use. `CONFIG_MAPPING` pairs each of the prompt types described earlier with its client and system message.

The code also defines a fourth variant, strict tool calls. It uses the same system message as regular tool calls, but the client is created with instructor's `TOOLS_STRICT` mode.

Finally, I defined a helper function to parse the responses from the LLM.
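
The free-text branch of that helper depends on the `ANSWER:` marker starting a new line. A minimal sketch of that behavior in isolation (the `extract_answer` name and the sample strings are hypothetical):

```python
def extract_answer(content: str) -> str:
    """Return the text after the ANSWER: marker, like the free-text branch does."""
    return content.split("\nANSWER:")[1].strip()


good = "REASONING: 2 + 2 equals 4.\nANSWER: 4"
bad = "REASONING: 2 + 2 equals 4. ANSWER: 4"  # marker is not on its own line

parsed = extract_answer(good)
print(parsed)

try:
    extract_answer(bad)
    failed = False
except IndexError:
    # split() found no marker, so there is no element at index 1.
    failed = True
print(failed)
```

When the marker is missing, the lookup raises an `IndexError`. In the code below that error is caught by the retry loop in `process_row`, so a malformed response counts as a failed attempt instead of a wrong answer.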


In [6]:
def parse_response(
    response: ChatCompletion | Response, response_type: PromptType
) -> str:
    if isinstance(response, Response):
        return response.answer
    elif (
        isinstance(response, ChatCompletion)
        and response_type == PromptType.WITHOUT_STRUCTURED_OUTPUT
    ):
        return response.choices[0].message.content.split("\nANSWER:")[1].strip()
    raise ValueError(f"Invalid response type: {type(response)}")


@traceable
async def call_model(
    client,
    prompt_type: PromptType,
    user_message: str,
    timeout: int = 120,
) -> str:
    config = CONFIG_MAPPING[prompt_type.value]
    params = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": config.system_message},
            {"role": "user", "content": user_message},
        ],
        "timeout": timeout,
    }
    if prompt_type in (
        PromptType.WITH_TOOL_CALLS,
        PromptType.WITH_STRICT_TOOL_CALLS,
        PromptType.WITH_JSON_MODE,
    ):
        params.update(
            {
                "response_model": Response,
            }
        )
    response = await client.chat.completions.create(**params)
    return parse_response(response, prompt_type)


@traceable
async def process_row(
    row: pd.Series,
    prompt_type: PromptType,
    semaphore: Semaphore,
) -> str:
    client = CONFIG_MAPPING[prompt_type.value].client
    async with semaphore:
        for _ in range(3):
            try:
                answer = await call_model(
                    client=client,
                    prompt_type=prompt_type,
                    user_message=f"Question:\n{row.updated_question}",
                )
                return answer
            except Exception as e:
                print(f"Error processing row {row.name}: {e}")
                continue
        raise Exception(f"Failed to process row {row.name}, after 3 attempts")


@traceable
async def process_df(
    df: pd.DataFrame,
    prompt_type: PromptType,
    concurrency: int = 100,
) -> List[str]:
    semaphore = Semaphore(concurrency)
    tasks = [process_row(row, prompt_type, semaphore) for _, row in df.iterrows()]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results
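
The concurrency pattern above (a semaphore to cap in-flight requests, a few retries per row, and `asyncio.gather` with `return_exceptions=True`) can be tried without calling any API. This is a minimal sketch using only the standard library; `flaky_call`, `with_retries`, and `run_all` are hypothetical stand-ins, not part of the experiment code.

```python
import asyncio


async def flaky_call(item: int, attempts: dict) -> str:
    """Stand-in for an API call: odd items fail on their first attempt."""
    attempts[item] = attempts.get(item, 0) + 1
    await asyncio.sleep(0)
    if item % 2 == 1 and attempts[item] == 1:
        raise RuntimeError(f"transient failure for item {item}")
    return f"answer-{item}"


async def with_retries(item, semaphore, attempts, max_attempts=3):
    async with semaphore:  # at most `concurrency` calls run at the same time
        for _ in range(max_attempts):
            try:
                return await flaky_call(item, attempts)
            except RuntimeError:
                continue
        raise RuntimeError(f"item {item} failed after {max_attempts} attempts")


async def run_all(items, concurrency=2):
    semaphore = asyncio.Semaphore(concurrency)
    attempts = {}
    tasks = [with_retries(i, semaphore, attempts) for i in items]
    # With return_exceptions=True, a failed task shows up as an exception
    # object in the results list instead of being raised here.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return results, attempts


results, attempts = asyncio.run(run_all(range(4)))
print(results)
print(attempts)
```

One consequence of `return_exceptions=True` is that the results list can contain exception objects, in the same position as the rows that failed, so it is worth checking for them before scoring.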

## Running the experiment


In [7]:
# Adapted from:
# https://github.com/LiveBench/LiveBench/blob/main/livebench/process_results/writing/plot_unscrambling/utils.py
def levenshtein_distance(A, B):
    N, M = len(A), len(B)
    # Create a table of size (N + 1) x (M + 1)
    dp = [[0 for i in range(M + 1)] for j in range(N + 1)]

    # Base Case: When N = 0
    for j in range(M + 1):
        dp[0][j] = j
    # Base Case: When M = 0
    for i in range(N + 1):
        dp[i][0] = i
    # Transitions
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],  # Deletion (drop A[i - 1])
                    dp[i][j - 1],  # Insertion (add B[j - 1])
                    dp[i - 1][j - 1],  # Replacement
                )

    return dp[N][M]


def plot_unscrambling_process_results(ground_truth: str, llm_answer: str) -> float:
    gt_sentences = [s.strip() for s in ground_truth.split(".")]
    ans_sentences = [s.strip() for s in llm_answer.split(".")]

    gt_sentences = [s for s in gt_sentences if s]
    ans_sentences = [s for s in ans_sentences if s]

    ans_ordering = []
    for x in gt_sentences:
        best_match = difflib.get_close_matches(x, ans_sentences, n=1, cutoff=0.0)
        if best_match:
            ans_ordering.append(ans_sentences.index(best_match[0]))

    n_sentences_gt = len(gt_sentences)
    raw_distance = levenshtein_distance(list(range(len(gt_sentences))), ans_ordering)
    score = 1 - (raw_distance / n_sentences_gt)

    return score


def evaluate_language_task(ground_truth: str, task_type: str, response: str):
    if task_type == "connections":
        objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in response.split(",")
        ]
        gt_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in ground_truth.split(",")
        ]

        groups = [set(objects[i : i + 4]) for i in range(0, len(objects), 4)]
        gt_groups = [set(gt_objects[i : i + 4]) for i in range(0, len(gt_objects), 4)]

        max_correct = 0
        for perm in permutations(groups):
            correct_groups = sum(g1 == g2 for g1, g2 in zip(perm, gt_groups))
            max_correct = max(max_correct, correct_groups)
        return max_correct / len(gt_groups)
    elif task_type == "plot_unscrambling":
        return plot_unscrambling_process_results(ground_truth, response)
    elif task_type == "typos":
        return ground_truth in response
    else:
        raise ValueError(f"Invalid task type: {task_type}")


def evaluate_reasoning_task(ground_truth: str, task_type: str, response: str):
    if task_type == "web_of_lies_v2":
        response_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in response.split(",")
        ]
        gt_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in ground_truth.split(",")
        ]
        return response_objects == gt_objects
    elif task_type in ("spatial", "zebra_puzzle"):
        response = response.rstrip(".")
        return ground_truth.lower().strip() == response.lower().strip()
    else:
        raise ValueError(f"Invalid task type: {task_type}")


def evaluate_math_task(ground_truth: str, task_type: str, response: str):
    if task_type == "olympiad":
        response_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in response.split(",")
        ]
        gt_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in ground_truth.split(",")
        ]
        return response_objects == gt_objects
    elif task_type == "math_comp":
        return ground_truth == response
    else:
        raise ValueError(f"Invalid task type: {task_type}")
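
To make two of these scoring rules concrete, here is a small worked sketch with made-up inputs. It is a simplification under stated assumptions: `ordering_score` uses exact sentence lookup instead of the fuzzy `difflib` matching above, and the connections example uses groups of two words instead of four. All function names are hypothetical.

```python
from itertools import permutations


def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ordering_score(truth: list[str], answer: list[str]) -> float:
    """1 - (edit distance between sentence orderings) / (number of sentences)."""
    # Exact lookup stands in for the fuzzy matching used in the post's code.
    answer_order = [answer.index(s) for s in truth if s in answer]
    return 1 - edit_distance(list(range(len(truth))), answer_order) / len(truth)


def connections_score(truth: list[set], guess: list[set]) -> float:
    """Fraction of ground-truth groups matched under the best group ordering."""
    best = max(
        sum(g == t for g, t in zip(perm, truth)) for perm in permutations(guess)
    )
    return best / len(truth)


# Two sentences swapped out of three: the orderings differ in two positions.
swapped = ordering_score(["A", "B", "C"], ["B", "A", "C"])

# One of two groups is right, even though it appears in a different position.
half = connections_score([{"a", "b"}, {"c", "d"}], [{"c", "d"}, {"a", "x"}])

print(round(swapped, 3), half)
```

Both scores are partial-credit measures between 0 and 1, which is why the results are later averaged instead of counted as plain right or wrong answers.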

In [8]:
# | output: false
def generate_outputs(df):
    df_copy = df.copy()
    responses_without_so = asyncio.run(
        process_df(df_copy, PromptType.WITHOUT_STRUCTURED_OUTPUT)
    )
    responses_with_so_tool_calls = asyncio.run(
        process_df(df_copy, PromptType.WITH_TOOL_CALLS)
    )
    responses_with_so_json_mode = asyncio.run(
        process_df(df_copy, PromptType.WITH_JSON_MODE)
    )
    responses_with_so_strict_tool_calls = asyncio.run(
        process_df(df_copy, PromptType.WITH_STRICT_TOOL_CALLS)
    )
    df_copy["response_without_so"] = responses_without_so
    df_copy["response_with_so_tool_calls"] = responses_with_so_tool_calls
    df_copy["response_with_so_json_mode"] = responses_with_so_json_mode
    df_copy["response_with_so_strict_tool_calls"] = responses_with_so_strict_tool_calls
    return df_copy


def evaluate_outputs(df, evaluator):
    df_copy = df.copy()
    for suffix in (
        "without_so",
        "with_so_tool_calls",
        "with_so_json_mode",
        "with_so_strict_tool_calls",
    ):
        # Multiplying by 1 turns boolean scores into 0/1 integers.
        df_copy[f"score_{suffix}"] = df_copy.apply(
            lambda row, suffix=suffix: evaluator(
                row["ground_truth"], row["task"], row[f"response_{suffix}"]
            )
            * 1,
            axis=1,
        )
    return df_copy

### Reasoning

In [9]:
df_reasoning_results = generate_outputs(df_reasoning)

In [10]:
df_reasoning_results = evaluate_outputs(df_reasoning_results, evaluate_reasoning_task)
df_reasoning_results.to_csv(data_dir / "reasoning" / "reasoning_results.csv")

In [None]:
# | output: false
# | echo: false

df_reasoning_results.groupby("task").agg(
    n_questions=("question_id", "count"),
    score_without_so=("score_without_so", "mean"),
    score_with_so_tool_calls=("score_with_so_tool_calls", "mean"),
    score_with_so_json_mode=("score_with_so_json_mode", "mean"),
    score_with_so_strict_tool_calls=("score_with_so_strict_tool_calls", "mean"),
)

### Language

In [None]:
df_language_results = generate_outputs(df_language)

In [13]:
df_language_results = evaluate_outputs(df_language_results, evaluate_language_task)
df_language_results.to_csv(data_dir / "language" / "language_results.csv")

In [None]:
# | output: false
# | echo: false

df_language_results.groupby("task").agg(
    n_questions=("question_id", "count"),
    score_without_so=("score_without_so", "mean"),
    score_with_so_tool_calls=("score_with_so_tool_calls", "mean"),
    score_with_so_json_mode=("score_with_so_json_mode", "mean"),
    score_with_so_strict_tool_calls=("score_with_so_strict_tool_calls", "mean"),
)

### Math

In [None]:
df_math_results = generate_outputs(df_math)

In [None]:
df_math_results = evaluate_outputs(df_math_results, evaluate_math_task)
df_math_results.to_csv(data_dir / "math" / "math_results.csv")

In [None]:
# | output: false
# | echo: false

df_math_results.groupby("task").agg(
    n_questions=("question_id", "count"),
    score_without_so=("score_without_so", "mean"),
    score_with_so_tool_calls=("score_with_so_tool_calls", "mean"),
    score_with_so_json_mode=("score_with_so_json_mode", "mean"),
    score_with_so_strict_tool_calls=("score_with_so_strict_tool_calls", "mean"),
)

## Evaluation

In [17]:
# | output: false
def calculate_confidence_intervals(df):
    for score_col in [
        "score_without_so",
        "score_with_so_tool_calls",
        "score_with_so_json_mode",
        "score_with_so_strict_tool_calls",
    ]:
        mean_score = df[score_col].mean()
        se_score = df[score_col].std() / np.sqrt(len(df))
        ci = [
            mean_score - 1.96 * se_score,
            mean_score + 1.96 * se_score,
        ]
        print(
            f"{score_col} - Mean: {mean_score * 100:.2f}% CI: {ci[0] * 100:.2f}% - {ci[1] * 100:.2f}%"
        )
    print()


def run_paired_t_test(df):
    scores = {}

    for score_col in [
        "score_without_so",
        "score_with_so_tool_calls",
        "score_with_so_json_mode",
        "score_with_so_strict_tool_calls",
    ]:
        scores[score_col] = df[score_col] * 1

    for score_col_1, score_col_2 in [
        ("score_without_so", "score_with_so_tool_calls"),
        ("score_without_so", "score_with_so_json_mode"),
        ("score_without_so", "score_with_so_strict_tool_calls"),
    ]:
        t_stat, p_value = stats.ttest_rel(scores[score_col_1], scores[score_col_2])
        print(f"{score_col_1} vs {score_col_2}")
        print(f"t-statistic: {t_stat}, p-value: {p_value}")
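
The two statistics above are simple enough to compute by hand. This is a minimal standard-library sketch of the same formulas, with made-up scores: the interval is mean ± 1.96 × s / √n using the sample standard deviation, and the paired t-statistic is the mean of the per-question differences divided by their standard error. The function names are hypothetical.

```python
import math
import statistics


def mean_with_ci(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean and normal-approximation 95% interval: mean +/- z * s / sqrt(n)."""
    mean = statistics.fmean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))  # sample std (n - 1)
    return mean, mean - z * se, mean + z * se


def paired_t_statistic(a: list[float], b: list[float]) -> float:
    """t = mean(d) / (s_d / sqrt(n)), where d holds per-question differences."""
    diffs = [x - y for x, y in zip(a, b)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / se


# Made-up binary scores for the same eight questions under two prompt types.
without_so = [1, 0, 1, 1, 0, 1, 0, 1]
with_so = [1, 0, 0, 1, 0, 1, 0, 0]

mean, low, high = mean_with_ci(without_so)
t_stat = paired_t_statistic(without_so, with_so)
print(f"mean={mean:.3f}, CI=({low:.3f}, {high:.3f}), t={t_stat:.3f}")
```

The test is paired because every prompt type answers the same questions, so what matters is the difference on each question, not the two averages taken separately.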

In [None]:
calculate_confidence_intervals(df_reasoning_results)
run_paired_t_test(df_reasoning_results)

In [None]:
calculate_confidence_intervals(df_language_results)
run_paired_t_test(df_language_results)

In [None]:
calculate_confidence_intervals(df_math_results)
run_paired_t_test(df_math_results)

These are the mean scores with 95% confidence intervals (SO stands for structured outputs):

| **Category**  | **Response Format** | **Mean (95% CI)**        |
|---------------|---------------------|--------------------------|
| **Reasoning** | Without SO          | 42.67% (34.73% - 50.61%) |
|               | With SO Tool Calls  | 39.33% (31.49% - 47.18%) |
|               | With SO JSON Mode   | 44.00% (36.03% - 51.97%) |
| **Language**  | Without SO          | 48.91% (42.05% - 55.76%) |
|               | With SO Tool Calls  | 44.78% (37.96% - 51.59%) |
|               | With SO JSON Mode   | 46.66% (40.03% - 53.30%) |
| **Math**      | Without SO          | 34.85% (26.69% - 43.01%) |
|               | With SO Tool Calls  | 37.12% (28.85% - 45.39%) |
|               | With SO JSON Mode   | 34.09% (25.97% - 42.21%) |

And these are the results of the paired t-tests:

| **Category**  | **Comparison**                          | **t-Statistic** | **p-Value** |
|---------------|-----------------------------------------|-----------------|-------------|
| **Reasoning** | Without SO vs With SO Tool Calls        | 0.7442          | 0.4579      |
|               | Without SO vs With SO JSON Mode         | -0.2878         | 0.7739      |
|               | With SO Tool Calls vs With SO JSON Mode | -1.0000         | 0.3189      |
| **Language**  | Without SO vs With SO Tool Calls        | 1.2362          | 0.2185      |
|               | Without SO vs With SO JSON Mode         | 0.7979          | 0.4263      |
|               | With SO Tool Calls vs With SO JSON Mode | -0.5662         | 0.5721      |
| **Math**      | Without SO vs With SO Tool Calls        | -0.5985         | 0.5505      |
|               | Without SO vs With SO JSON Mode         | 0.1917          | 0.8482      |
|               | With SO Tool Calls vs With SO JSON Mode | 0.7833          | 0.4349      |

## Conclusion
