---
title: "Is Pydantic making your model dumber?"
date: "11/10/2024"
date-modified: last-modified
description-meta: ""
toc: true
toc-depth: 3
lightbox: true
fig-cap-location: margin
categories:
  - llm
  - openai
  - pydantic
  - python
author:
  - name: Dylan Castillo
    url: https://dylancastillo.co
    affiliation: Iwana Labs
    affiliation-url: https://iwanalabs.com
citation: true
comments:
  utterances:
    repo: dylanjcastillo/blog_comments
    theme: dark-blue
    issue-term: pathname
---

## Set up the environment

In [172]:
# | output: false
# | echo: false

import nest_asyncio

nest_asyncio.apply()

First, import the necessary libraries:

In [190]:
import asyncio
import json
from asyncio import Semaphore
from enum import Enum
from pathlib import Path

import instructor
import numpy as np
import pandas as pd
from dotenv import load_dotenv
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from scipy import stats

np.random.seed(42)

load_dotenv()

langsmith_client = wrap_openai(AsyncOpenAI())
instructor_client = instructor.from_openai(langsmith_client, mode=instructor.Mode.TOOLS)

## Reasoning

In [174]:
data_dir = Path().absolute().parent / "data" / "live_bench"
reasoning_dir = data_dir / "reasoning"
math_dir = data_dir / "math"
language_dir = data_dir / "language"

df_reasoning = (
    pd.read_json(reasoning_dir / "question.jsonl", lines=True)
    # .query("livebench_release_date == '2024-06-24'")
    .assign(turns_str=lambda x: x.turns.str[0])
    .reset_index()
    .rename(columns={"index": "data_point_id"})
    # .sample(15)
)

# assert there is a single turn per row
assert df_reasoning.turns.str.len().eq(1).all()

In [175]:
class FormatType(Enum):
    T1 = "Bold formatting for a single phrase. Example: Think step by step, and then put your answer in **bold** as a single phrase (for example, **sphere**)."
    T2 = "Bold formatting for a list of three words. Example: Think step by step, and then put your answer in **bold** as a list of three words, yes or no (for example, **yes, no, yes**)."
    T3 = "Bold formatting for a single integer. Example: Think step by step, and then put your answer in **bold** as a single integer (for example, **0**)."
    T4 = "Return a single digit number. Example: Return a single digit number, in the following format: **N**, where N is the position."
    T5 = "Return your answer as a single word. Example: Return your answer as a single word, in the following format: **X**, where X is the answer."
    OTHER = "Other formatting requirements. Example: Other formatting requirements (if none of the above apply)"


class FormatClassification(BaseModel):
    classification: FormatType = Field(
        description=f"The formatting requirements of the output of the provided question. Only allowed types: {[t.value for t in FormatType]}, should be used",
    )


system_prompt_classification_reasoning = "You're a helpful assistant. I will provide you with a question and you will classify the formatting requirements of the output of the provided question into the most appropriate category."

In [176]:
format_mapping = {
    FormatType.T1.name: "Put your answer as a single phrase (for example, sphere).",
    FormatType.T2.name: "Put your answer as a list of three words, yes or no (for example, yes, no, yes).",
    FormatType.T3.name: "Put your answer as a single integer (for example, 0).",
    FormatType.T4.name: "Return a single digit number, in the following format: N, where N is the position.",
    FormatType.T5.name: "Return your answer as a single word, in the following format: X, where X is the answer.",
}

In [177]:
@traceable
async def process_row(
    system_message: str,
    user_message: str,
    response_model: type[BaseModel],  # the model class, not an instance
    semaphore: Semaphore,
) -> BaseModel:
    async with semaphore:
        return await instructor_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message},
            ],
            response_model=response_model,
        )


@traceable
async def classify_reasoning_questions(df, concurrency: int = 30):
    semaphore = Semaphore(concurrency)
    tasks = [
        process_row(
            system_message=system_prompt_classification_reasoning,
            user_message=f"Question:\n{row.turns_str}",
            response_model=FormatClassification,
            semaphore=semaphore,
        )
        for _, row in df.iterrows()
    ]
    responses = await asyncio.gather(*tasks)
    return [r.classification for r in responses]

In [178]:
classified_reasoning = asyncio.run(classify_reasoning_questions(df_reasoning))
df_reasoning["classification"] = classified_reasoning

## Replace formatting

In [179]:
class UpdatedQuestion(BaseModel):
    updated_question: str


system_prompt_replace_reasoning_format = (
    "You're a helpful assistant. I will provide you with a question and the old formatting requirements. Your task is to replace the old formatting requirements with the new ones. "
    "Please return the full text of the question with the new formatting requirements. Don't include any other text. Don't include 'Question:' or 'Old formatting:' or 'New formatting:'"
)

In [180]:
@traceable
async def replace_reasoning_questions_format(df, concurrency: int = 30):
    semaphore = Semaphore(concurrency)
    tasks = [
        process_row(
            system_message=system_prompt_replace_reasoning_format,
            user_message=f"Question:\n{row.turns_str}\nOld formatting:\n{row.classification.value}\nNew formatting:{format_mapping[row.classification.name]}\n",
            response_model=UpdatedQuestion,
            semaphore=semaphore,
        )
        for _, row in df.iterrows()
    ]
    responses = await asyncio.gather(*tasks)
    return [r.updated_question for r in responses]


replaced_reasoning = asyncio.run(replace_reasoning_questions_format(df_reasoning))

In [183]:
df_reasoning["replaced_question"] = replaced_reasoning

In [218]:
df_reasoning.to_csv(reasoning_dir / "updated_questions.csv", index=False)

## Language

In [242]:
df_language = (
    pd.read_json(language_dir / "question.jsonl", lines=True)
    .assign(
        turns_str=lambda x: x.turns.str[0],
        ground_truth=lambda x: x.ground_truth.str.strip(),
    )
    .reset_index()
    .rename(columns={"index": "data_point_id"})
)

assert df_language.turns.str.len().eq(1).all()

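# Strip the output-format instructions from the original questions.
# regex=False makes the literal matching explicit ("**" is not a valid regex).
df_language["replaced_question"] = (
    df_language.turns_str.str.replace(
        "Begin the plot summary with <PLOT_SUMMARY>", "", regex=False
    )
    .str.replace("in **bold**", "", regex=False)
    .str.replace("**", "", regex=False)
    .str.strip()
)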

Next, define the system prompts and the Pydantic model you'll use to format the responses:

In [243]:
system_prompt_without_structured_output = (
    "You're a helpful assistant. You will help me answer a question."
    "\nPlease respond as follows:"
    "\nREASONING: <your reasoning here>"
    "\nANSWER: <your answer here>"
)

system_prompt_with_structured_output = (
    "You're a helpful assistant. You will help me answer a question."
    "\nYou will use this JSON schema for your response:"
    "\n{response_format}"
)


class ResponseFormat(BaseModel):
    reasoning: str
    answer: str

In the system prompt you send to the LLM, you'll replace `{response_format}` with the JSON schema of the response format you want to use.
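
To make the substitution concrete, here is a minimal sketch of what the formatted prompt ends up looking like. The `schema` dict is written out by hand, as an assumption of what `ResponseFormat.model_json_schema()` returns for this model in Pydantic v2, so the snippet runs without Pydantic installed. It also shows that `str.format` embeds the Python `repr` of a dict (single quotes), while serializing with `json.dumps` first embeds strict JSON:

```python
import json

system_prompt_with_structured_output = (
    "You're a helpful assistant. You will help me answer a question."
    "\nYou will use this JSON schema for your response:"
    "\n{response_format}"
)

# Hand-written stand-in (assumption) for ResponseFormat.model_json_schema()
schema = {
    "properties": {
        "reasoning": {"title": "Reasoning", "type": "string"},
        "answer": {"title": "Answer", "type": "string"},
    },
    "required": ["reasoning", "answer"],
    "title": "ResponseFormat",
    "type": "object",
}

# Passing the dict directly embeds its Python repr (single-quoted keys)
prompt_repr = system_prompt_with_structured_output.format(response_format=schema)

# Serializing first embeds strict JSON (double-quoted keys)
prompt_json = system_prompt_with_structured_output.format(
    response_format=json.dumps(schema)
)

print(prompt_repr.splitlines()[-1])
print(prompt_json.splitlines()[-1])
```

Either variant keeps `reasoning` ahead of `answer` in the schema the model sees. Whether the quoting style affects the model's output is not something this experiment measures.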

Then, let's define a few helper functions to run the experiment:

In [244]:
def validate_response(response_text: str, type: str):
    if type == "without_structured_output":
        answer = response_text.split("\nANSWER:")[1].strip()
        return answer
    else:
        return ResponseFormat.model_validate_json(response_text).answer


@traceable
async def process_row(row: pd.Series, type: str, semaphore: Semaphore) -> str:
    if type == "without_structured_output":
        system_prompt = system_prompt_without_structured_output
    else:
        system_prompt = system_prompt_with_structured_output.format(
            response_format=ResponseFormat.model_json_schema()
        )
    async with semaphore:
        for _ in range(3):
            try:
                response = await langsmith_client.chat.completions.create(
                    model="gpt-4o",
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {
                            "role": "user",
                            "content": f"Question:\n{row.replaced_question}",
                        },
                    ],
                    response_format={"type": "json_object"}
                    if type == "with_structured_output"
                    else None,
                    timeout=30,
                )
                return validate_response(response.choices[0].message.content, type)
            except Exception as e:
                print(f"Failed to generate a valid response for row {row.name}: {e}")
        raise Exception("Failed to generate a valid response and ran out of retries.")


@traceable
async def main(df, response_format, concurrency: int = 50):
    semaphore = Semaphore(concurrency)
    tasks = [process_row(row, response_format, semaphore) for _, row in df.iterrows()]
    responses = await asyncio.gather(*tasks)
    return responses

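In this code, `validate_response` extracts the answer from the raw response. For the format without structured output, it takes the text after `ANSWER:`. For the format with structured output, it validates the JSON against `ResponseFormat` and returns the `answer` field. If the response doesn't match the expected format, it raises an exception.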
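
As a rough illustration of the two parsing paths, the sketch below uses only the standard library. `parse_unstructured` and `parse_structured` are hypothetical names, and `json.loads` is a simplified stand-in for `ResponseFormat.model_validate_json`, which also checks the field types:

```python
import json


def parse_unstructured(response_text: str) -> str:
    # Everything after the "ANSWER:" marker is treated as the answer.
    # A response without the marker raises IndexError.
    return response_text.split("\nANSWER:")[1].strip()


def parse_structured(response_text: str) -> str:
    # Simplified stand-in for ResponseFormat.model_validate_json(...).answer
    payload = json.loads(response_text)
    return payload["answer"]


plain = "REASONING: The second word is misspelled.\nANSWER: necessary"
structured = '{"reasoning": "The second word is misspelled.", "answer": "necessary"}'

print(parse_unstructured(plain))
print(parse_structured(structured))
```

Both calls return the same answer string, so the evaluation step can treat the two response formats identically.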

`process_row` handles a single row of the DataFrame. It sends the system prompt and the question to the LLM and validates the response, retrying up to three times if the request or the validation fails. Each run is tracked in LangSmith.

Finally, `main` is used to run the experiment. It runs the `process_row` function concurrently for each row in the DataFrame. 
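
The concurrency and retry pattern can be sketched in isolation, without calling any API. Everything here is hypothetical: `flaky_call` simulates a request that fails a fixed number of times, and `return_exceptions=True` is an optional addition (not used in the code above) that keeps one exhausted task from discarding the other results:

```python
import asyncio


async def flaky_call(
    task_id: int,
    fails_before_success: int,
    semaphore: asyncio.Semaphore,
    max_attempts: int = 3,
) -> str:
    async with semaphore:  # at most `concurrency` calls are in flight at once
        for attempt in range(1, max_attempts + 1):
            try:
                await asyncio.sleep(0)  # placeholder for the real API request
                if attempt <= fails_before_success:
                    raise ValueError("invalid response")
                return f"task {task_id} ok on attempt {attempt}"
            except ValueError as e:
                print(f"task {task_id}, attempt {attempt} failed: {e}")
        raise RuntimeError(f"task {task_id} ran out of retries")


async def run_all(concurrency: int = 2) -> list:
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [
        flaky_call(task_id, fails_before_success=task_id, semaphore=semaphore)
        for task_id in range(4)
    ]
    # gather returns results in the same order as the tasks were passed in
    return await asyncio.gather(*tasks, return_exceptions=True)


results = asyncio.run(run_all())
print(results)
```

Because `gather` preserves task order, the responses line up with the DataFrame rows, which is what makes it safe to assign them back as a new column.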

## Running the experiment

Now, you can run the experiment using the two response formats:

In [251]:
# | output: false
df_copy = df_language.copy()

responses_without_so = asyncio.run(main(df_copy, "without_structured_output"))
responses_with_so = asyncio.run(main(df_copy, "with_structured_output"))

In [326]:
import difflib
import re
from itertools import permutations


def levenshtein_distance(A, B):
    N, M = len(A), len(B)
    # Create a DP table of size (N + 1) x (M + 1)
    dp = [[0 for i in range(M + 1)] for j in range(N + 1)]

    # Base Case: When N = 0
    for j in range(M + 1):
        dp[0][j] = j
    # Base Case: When M = 0
    for i in range(N + 1):
        dp[i][0] = i
    # Transitions
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],  # Insertion
                    dp[i][j - 1],  # Deletion
                    dp[i - 1][j - 1],  # Replacement
                )

    return dp[N][M]


def plot_unscrambling_process_results(ground_truth: str, llm_answer: str) -> float:
    gt_sentences = [s.strip() for s in ground_truth.split(".")]
    ans_sentences = [s.strip() for s in llm_answer.split(".")]

    gt_sentences = [s for s in gt_sentences if s]
    ans_sentences = [s for s in ans_sentences if s]

    ans_ordering = []
    for x in gt_sentences:
        best_match = difflib.get_close_matches(x, ans_sentences, n=1, cutoff=0.0)
        if best_match:
            ans_ordering.append(ans_sentences.index(best_match[0]))

    n_sentences_gt = len(gt_sentences)
    raw_distance = levenshtein_distance(list(range(len(gt_sentences))), ans_ordering)
    score = 1 - (raw_distance / n_sentences_gt)

    return score


def evaluate_language_task(ground_truth: str, task_type: str, response: str):
    if task_type == "connections":
        objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in response.split(",")
        ]
        gt_objects = [
            re.sub(r"[^\w\s]", "", o.strip().lower()) for o in ground_truth.split(",")
        ]

        groups = [set(objects[i : i + 4]) for i in range(0, len(objects), 4)]
        gt_groups = [set(gt_objects[i : i + 4]) for i in range(0, len(gt_objects), 4)]

        max_correct = 0
        for perm in permutations(groups):
            correct_groups = sum(g1 == g2 for g1, g2 in zip(perm, gt_groups))
            max_correct = max(max_correct, correct_groups)
        return max_correct / len(gt_groups)
    elif task_type == "plot_unscrambling":
        return plot_unscrambling_process_results(ground_truth, response)
    elif task_type == "typos":
        return ground_truth in response
    else:
        raise ValueError(f"Invalid task type: {task_type}")
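
To see how the `connections` score behaves, here is a small self-contained sketch of the same grouping logic. `connections_score` is a hypothetical helper name and the word lists are made up for illustration:

```python
import re
from itertools import permutations


def connections_score(ground_truth: str, response: str, group_size: int = 4) -> float:
    def to_groups(text: str) -> list:
        # Lowercase, strip punctuation, then chunk into consecutive groups
        items = [re.sub(r"[^\w\s]", "", o.strip().lower()) for o in text.split(",")]
        return [
            set(items[i : i + group_size]) for i in range(0, len(items), group_size)
        ]

    groups, gt_groups = to_groups(response), to_groups(ground_truth)
    # Try every ordering of the predicted groups and keep the best match count
    best = max(
        sum(g1 == g2 for g1, g2 in zip(perm, gt_groups))
        for perm in permutations(groups)
    )
    return best / len(gt_groups)


ground_truth = "red, blue, green, pink, cat, dog, cow, pig, one, two, three, four"
# Same groups, different order of groups and of words within each group
same_groups = "Four, three, two, one, pig, cow, dog, cat, pink, green, blue, red"
# Only the number group is intact: "cat" and "pink" are swapped
one_group_right = "one, two, three, four, red, blue, green, cat, pink, dog, cow, pig"

score_same = connections_score(ground_truth, same_groups)
score_partial = connections_score(ground_truth, one_group_right)
print(score_same, score_partial)
```

The score is the fraction of ground-truth groups reproduced exactly, regardless of the order in which groups or words appear, so a single misplaced word costs two groups.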

In [327]:
df_copy["response_without_so"] = responses_without_so
df_copy["is_correct_without_so"] = df_copy.apply(
    lambda row: evaluate_language_task(
        row["ground_truth"], row["task"], row["response_without_so"]
    ),
    axis=1,
)

df_copy["response_with_so"] = responses_with_so
df_copy["is_correct_with_so"] = df_copy.apply(
    lambda row: evaluate_language_task(
        row["ground_truth"], row["task"], row["response_with_so"]
    ),
    axis=1,
)

In [329]:
df_copy.groupby("task").agg(
    n_questions=("data_point_id", "count"),
    accuracy_without_so=("is_correct_without_so", "mean"),
    accuracy_with_so=("is_correct_with_so", "mean"),
)

| task              | n_questions | accuracy_without_so | accuracy_with_so |
|-------------------|-------------|---------------------|------------------|
| connections       | 50          | 0.423333            | 0.55             |
| plot_unscrambling | 40          | 0.30545             | 0.369002         |
| typos             | 50          | 0.6                 | 0.52             |


In [320]:
# with pd.option_context("display.max_colwidth", None):   
#     display(
#         df_copy.query("task == 'connections' and is_correct_without_so == False")[
#             [
#         "ground_truth",
#         "replaced_question",
#         "response_without_so",
#         "response_with_so",
#             ]
#         ]
#     )


We run the experiment multiple times with the same inputs to account for the randomness in the LLM's responses. Ideally, we should run it more than three times, but I'm poor. So, we'll just do it 3 times.

In [324]:
# | output: false
n_bootstraps = 10_000
bootstrap_accuracies_without_so = []
bootstrap_accuracies_with_so = []

for _ in range(n_bootstraps):
    bootstrap_sample = df_copy.sample(n=len(df_copy), replace=True)

    mean_accuracy_without_so = bootstrap_sample["is_correct_without_so"].mean()
    mean_accuracy_with_so = bootstrap_sample["is_correct_with_so"].mean()

    bootstrap_accuracies_without_so.append(mean_accuracy_without_so)
    bootstrap_accuracies_with_so.append(mean_accuracy_with_so)

ci_without_so = np.percentile(bootstrap_accuracies_without_so, [2.5, 97.5])
ci_with_so = np.percentile(bootstrap_accuracies_with_so, [2.5, 97.5])

mean_accuracy_without_so = df_copy["is_correct_without_so"].mean()
mean_accuracy_with_so = df_copy["is_correct_with_so"].mean()

print(
    f"Response format without SO - Mean: {mean_accuracy_without_so * 100:.2f}% CI: {ci_without_so[0] * 100:.2f}% - {ci_without_so[1] * 100:.2f}%"
)
print(
    f"Response format with SO - Mean: {mean_accuracy_with_so * 100:.2f}% CI: {ci_with_so[0] * 100:.2f}% - {ci_with_so[1] * 100:.2f}%"
)

Response format without SO - Mean: 45.27% CI: 38.55% - 52.19%
Response format with SO - Mean: 48.04% CI: 41.20% - 54.90%


Then, you can build bootstrap confidence intervals for the accuracies of the two response formats. Given that I'm asking the LLM the same question multiple times, I went with an approach called [cluster bootstrapping](https://pmc.ncbi.nlm.nih.gov/articles/PMC5965657/), which accounts for the fact that the data points are not independent.

It should take a few minutes to run. Once it's done, you should see output like the following:

| Response Format           | Mean (95% CI)            |
|---------------------------|--------------------------|
| Without structured output | 45.27% (38.55% – 52.19%) |
| With structured output    | 48.04% (41.20% – 54.90%) |

The two confidence intervals overlap almost entirely, so these results don't show a clear difference between the two response formats.
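
For readers unfamiliar with the cluster bootstrapping idea mentioned above, here is a minimal standard-library sketch. It assumes scores are stored per question with one entry per repeated run; `cluster_bootstrap_ci` and the toy scores are hypothetical, not the analysis code or data from this post:

```python
import random
from statistics import mean


def cluster_bootstrap_ci(
    scores_by_question: dict,
    n_bootstraps: int = 2000,
    alpha: float = 0.05,
    seed: int = 42,
) -> tuple:
    rng = random.Random(seed)
    question_ids = list(scores_by_question)
    boot_means = []
    for _ in range(n_bootstraps):
        # Resample whole questions (clusters) with replacement, so repeated
        # runs of the same question always stay together
        sampled_ids = rng.choices(question_ids, k=len(question_ids))
        sampled = [s for qid in sampled_ids for s in scores_by_question[qid]]
        boot_means.append(mean(sampled))
    boot_means.sort()
    # Percentile interval read off the sorted bootstrap means
    lower = boot_means[int((alpha / 2) * n_bootstraps)]
    upper = boot_means[int((1 - alpha / 2) * n_bootstraps) - 1]
    return lower, upper


# Made-up scores: three runs for each of five questions
toy_scores = {
    "q1": [1, 1, 1],
    "q2": [0, 0, 1],
    "q3": [1, 0, 1],
    "q4": [0, 0, 0],
    "q5": [1, 1, 0],
}

lower, upper = cluster_bootstrap_ci(toy_scores)
print(f"95% CI: {lower:.2f} - {upper:.2f}")
```

Resampling questions rather than individual runs avoids treating correlated runs of the same question as independent evidence, which would make the interval too narrow.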

But if you're still unsure, you can perform a paired t-test to check whether the difference between the two response formats is statistically significant:

In [325]:
# | output: false
accuracies_without_so = df_copy["is_correct_without_so"]
accuracies_with_so = df_copy["is_correct_with_so"]

t_stat, p_value = stats.ttest_rel(accuracies_without_so, accuracies_with_so)

print(f"t-statistic: {t_stat}, p-value: {p_value}")

t-statistic: -0.7477145847774228, p-value: 0.45589544260063786


I got a p-value of about 0.46, so I can't reject the null hypothesis that the two response formats perform the same.
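
The statistic that `stats.ttest_rel` reports can be reproduced by hand from the per-question differences. This is a small sketch with made-up scores, and `paired_t_statistic` is a hypothetical helper name:

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a: list, b: list) -> float:
    # A paired test works on the per-question differences
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean(d) / (sample std(d) / sqrt(n)), with n - 1 degrees of freedom
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Made-up scores for six questions under each response format
without_so = [1.0, 0.0, 0.5, 1.0, 0.0, 0.25]
with_so = [1.0, 0.5, 0.5, 0.0, 0.5, 0.75]

t_stat = paired_t_statistic(without_so, with_so)
print(f"t-statistic: {t_stat:.4f}")
```

A negative statistic means the first sample scored lower on average, and a value this close to zero relative to its standard error is what produces a large p-value.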

## Conclusion

Based on the results of this experiment, we can't say that structured outputs made the model dumber on these tasks: the overall difference in accuracy between the two response formats is not statistically significant.

But why?

One plausible explanation is simple.

Both response formats are meant to help the LLM reason step by step to arrive at the answer. This is known as [chain of thought reasoning](https://en.wikipedia.org/wiki/Chain_of_thought_reasoning). However, for it to work, we need the LLM to first provide us with the reasoning of how it arrived at the answer and then the answer.

In `ResponseFormat`, we defined our Pydantic model with the reasoning first and the answer second, which matches the `REASONING:` / `ANSWER:` order of the plain-text prompt. If the answer came first, the LLM would give us the answer and then adjust the reasoning to match it.

So, to summarize, when using structured outputs, don't put the cart before the horse.

That's all! Let me know if you have any questions in the comments.