---
title: "Say What You Mean... Sometimes"
date: "12/06/2024"
date-modified: last-modified
description-meta: "A look at the impact of structured outputs on the performance of LLMs."
toc: true
toc-depth: 3
lightbox: true
fig-cap-location: margin
categories:
  - llm
  - openai
  - pydantic
  - python
author:
  - name: Dylan Castillo
    url: https://dylancastillo.co
    affiliation: Iwana Labs
    affiliation-url: https://iwanalabs.com
citation: true
comments:
  utterances:
    repo: dylanjcastillo/blog_comments
    theme: dark-blue
    issue-term: pathname
---

When I read [Let Me Speak Freely?](https://arxiv.org/abs/2408.02442) by Tam et al., I thought they raised an interesting question: does constraining LLM outputs to structured formats impact the quality of their responses?

In both the original study and their recent update, Tam et al. concluded that structured generation constraints do indeed affect LLM performance. They found that "structured generation constraints significantly impact LLM performance across various tasks".

But the study had major flaws. The [.txt](https://dottxt.co/) team wrote a very compelling [rebuttal](https://dottxt.co/blog/let-me-speak-freely) to the paper. They demonstrated that Tam et al.'s results were mostly due to poor prompting, unfair comparisons, and the improper use of an "AI parser" rather than the use of structured outputs (at least for Llama 3 8B).

They were right, but it still left me wondering how well their results generalize. They only tested Llama 3 8B^[In this setup, but they've also shared results of other open-weight models in a different setup (https://blog.dottxt.co/performance-gsm8k.html)], which represents a small fraction of the LLMs powering applications in production today. Open-weight models offer a lot of flexibility in how you *structure* your output, such as using specific [regular expressions](https://dottxt-ai.github.io/outlines/latest/reference/generation/regex/) that the output must match instead of a specific JSON schema.

Proprietary models lack this flexibility. Right now, JSON is the only structured output format guaranteed to work across most providers. Given this constraint, would the .txt team’s results still hold?

Plus, both studies focused on tasks that might not be a good proxy for the full range of tasks people use LLMs for. For example, would the rebuttal results be different in settings outside of simple reasoning tasks?

So I decided to:

1. Replicate the results from .txt's rebuttal using Llama 3 8B.
2. Replicate the same tasks using a proprietary model (GPT-4o-mini).
3. Test results on a broader set of tasks such as [LiveBench](https://livebench.ai/).

This article presents the results of the first two steps.

## The results

## The original tasks 

The original reasoning tasks from Tam et al. were:

1. [GSM8K](https://huggingface.co/datasets/openai/gsm8k): A dataset of grade school math word problems.
2. [Last Letter](https://huggingface.co/datasets/ChilleD/LastLetterConcat): A dataset of simple word puzzles that require concatenating the last letters of a list of names.
3. [Shuffle Object](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/tracking_shuffled_objects): A dataset that requires reasoning about the state of a system after a sequence of shuffling operations.
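
To make the Shuffle Object task concrete, here is a minimal sketch of the kind of state tracking it asks the model to do. The names, objects, and the `track_shuffled_objects` helper are hypothetical illustrations, not taken from the dataset or from either study.

```python
def track_shuffled_objects(
    initial: dict[str, str], swaps: list[tuple[str, str]]
) -> dict[str, str]:
    """Return who holds which object after applying each pairwise swap in order."""
    holdings = dict(initial)  # copy so the caller's dict is left untouched
    for first, second in swaps:
        holdings[first], holdings[second] = holdings[second], holdings[first]
    return holdings


# Hypothetical puzzle: three people trade balls three times.
initial = {"Alice": "red ball", "Bob": "blue ball", "Claire": "green ball"}
swaps = [("Alice", "Bob"), ("Bob", "Claire"), ("Alice", "Claire")]

final = track_shuffled_objects(initial, swaps)
print(final["Bob"])  # green ball
```

The model has to carry this state through every swap in its reasoning, which is why the task is sensitive to how much room the output format leaves for intermediate steps.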

Tam et al. also included four classification tasks in their study, where they observed that structured outputs resulted in better performance. The .txt team's rebuttal excluded these classification tasks. I also believe that structured outputs are likely better for classification tasks. So, I excluded them from my analysis as well.

## Replicating .txt's rebuttal

.txt made it very easy to reproduce their results by sharing their [code on GitHub](https://github.com/dottxt-ai/demos/tree/main/say-what-you-mean). I just set up a machine on [Modal](https://modal.com/), ran the code, and reproduced their results.

But, while going through the code, I noticed some small issues with the prompts. So I decided to tweak them a bit.

Below are .txt's original results compared to mine, after the prompt adjustments:

| Task              | Unstructured (.txt) | Structured (.txt) | Unstructured (Me, 3-shot) | Structured (Me, 3-shot) |
|-------------------|---------------------|-------------------|-------------------|-----------------|
| **GSM8K**         | 77.18                 | 77.79               | 79.98            | 79.45          |
| **Last Letter**   | 73.33                 | 77.33               | 74.00            | 78.00          |
| **Shuffle Object**| 40.72                 | 44.35               | 42.68            | 43.90          |

Except for **Structured** in the **Shuffle Object** task, I was able to improve all the metrics. In **GSM8K**, my changes even reversed .txt's result, with **Unstructured** outperforming **Structured** by a small margin.

But I don't think this matters much.

Their conclusion still holds: structured outputs are either as good as or better than unstructured outputs, in the tasks considered.

I'll explain the prompt changes I made below, so that you can judge for yourself if they make sense.

#### Formatting few-shot examples

In the **GSM8K** and **Last Letter** tasks, the few-shot prompt for both unstructured and structured used examples formatted as JSON objects and asked the LLM to produce the output in the same format, from which the answer was extracted.

That felt unfair. Even though you're not formally constraining the LLM to produce a JSON object, you're still asking it to format its response in a somewhat unnatural way.

I adjusted the prompts to be as similar as possible while still trying to get the most out of each approach.

For example, in **GSM8K**, the unstructured prompt is:

> You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it.
> You will always respond in the following format:
>
> <str, reasoning about the answer>
>
> ANSWER: <int, final answer>
>
> First, provide your step by step reasoning. Then, in ANSWER, provide an integer that corresponds to the correct answer to the question. Don't include any other text in ANSWER.

And the structured prompt is:

> You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it.
> You will always respond in the following format:
>
> {"reasoning": <str, reasoning about the answer>, "answer": <int, final answer>}
>
> First, provide your step by step reasoning about the answer. Then, in the "answer" field, provide an integer that corresponds to the correct answer to the question.

For all the tasks, I used a 3-shot prompt.
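
As a rough illustration of the two formats above, the sketch below renders the same few-shot example in each style and extracts the final answer back out. It uses only the standard library; `render_example` and `parse_answer` are hypothetical helper names, and the regex-based extraction is an assumption for illustration rather than the exact parsing used to produce the results.

```python
import json
import re


def render_example(reasoning: str, answer: int, structured: bool) -> str:
    """Format one few-shot example in the structured or unstructured style."""
    if structured:
        return json.dumps({"reasoning": reasoning, "answer": answer})
    return f"{reasoning}\n\nANSWER: {answer}"


def parse_answer(response: str, structured: bool) -> int:
    """Extract the integer answer from a response in either style."""
    if structured:
        return int(json.loads(response)["answer"])
    match = re.search(r"ANSWER:\s*(-?[\d,]+)", response)
    if match is None:
        raise ValueError("no ANSWER line found in the response")
    # Drop thousands separators such as "1,234" before converting.
    return int(match.group(1).replace(",", ""))


reasoning = "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5."
for structured in (False, True):
    rendered = render_example(reasoning, 5, structured)
    print(parse_answer(rendered, structured))  # 5 in both cases
```

Keeping the reasoning text identical across both styles means the only thing that differs between the two prompts is the wrapper around it.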

#### Clarifying the task

I also clarified the prompts. The description of the task in the original **Last Letter** prompt was:

> You are an expert in solving simple word puzzles using reasoning steps. Your specific task is going to be to take a list of 4 names and reason about the last letter of each ., then you will concatenate those letters into a word. 

I changed it to:

> You are an expert in solving word puzzles. Your specific task is going to be to take a list of 4 names, get the last letter of each and concatenate these letters into a word. 

The original prompt was reasonable, but I thought the new version was clearer. Through trial and error, I've learned that when working with LLMs, it's best to be as clear and direct as possible.
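
For reference, the task itself is tiny when written as code. This is a sketch of how the expected answer could be computed; `concat_last_letters` is a hypothetical helper, and lowercasing the result is an assumption that makes comparisons case-insensitive.

```python
def concat_last_letters(names: str) -> str:
    """Concatenate the last letter of each whitespace-separated name."""
    return "".join(name[-1] for name in names.split()).lower()


print(concat_last_letters("Ian Peter Bernard Stephen"))  # nrdn
```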

## Evaluating GPT-4o-mini

Using the same setup as before, I ran the same tasks with `gpt-4o-mini-2024-07-18`.

Below are the results, including the original results from Tam et al. for comparison:

| **Task**        | **Method**   | **NL** | **FRI** | **JSON-Mode** | **JSON-Schema** |
|-----------------|--------------|--------|---------|---------------|------------------|
| **GSM8K**       | **Tam et al.** | 94.57  | 87.17   | 86.95         | 91.71            |
|                 | **Me (0-shot)** | 92.80  | 91.43   | 93.56         | 93.63            |
|                 | **Me (3-shot)** | 93.10  | 93.18   | 93.63         | 93.18            |
| **Last Letter** | **Tam et al.** | 83.11  | 84.73   | 76.00         | 86.07            |
|                 | **Me (0-shot)** | 83.33  | 84.00   | 90.67         | 87.33            |
|                 | **Me (3-shot)** | 90.00  | 91.33   | 90.67         | 88.67            |
| **Shuffle Obj** | **Tam et al.** | 82.85  | 81.46   | 76.43         | 81.77            |
|                 | **Me (0-shot)** | 94.31  | 82.11   | 81.71         | 87.40            |
|                 | **Me (3-shot)** | 87.40  | 70.73   | 63.41         | 67.48            |

*NL* stands for "Natural Language". This would correspond to the *Unstructured* method in the previous table.

*FRI* stands for "Format Restricting Instructions", which is a JSON generated through OpenAI's [function calling](https://platform.openai.com/docs/guides/function-calling). *JSON-Mode* is a JSON generated through OpenAI's [JSON mode](https://platform.openai.com/docs/guides/structured-outputs#json-mode). *JSON-Schema* is a JSON generated using [constrained decoding](https://openai.com/index/introducing-structured-outputs-in-the-api/).

*JSON-Schema* would be the most comparable to **Structured** in the previous table but, for real-life applications, you don't really care about how the structured output is produced. You just want to get the output in the format you need. So, for the sake of comparison, I will consider all three methods (*FRI*, *JSON-Mode*, and *JSON-Schema*) as structured outputs.
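
To show how the four methods differ at the request level, here is a minimal sketch of the extra parameters each one adds to a Chat Completions call. The schema is hand-written here to keep the example self-contained, and `request_options` is a hypothetical helper; nothing is sent to the API.

```python
# Hand-written JSON schema for a response with reasoning and an integer answer.
RESPONSE_SCHEMA = {
    "title": "Response",
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "answer": {"type": "integer"},
    },
    "required": ["reasoning", "answer"],
    "additionalProperties": False,
}


def request_options(method: str) -> dict:
    """Return the extra request parameters for one of the four methods."""
    if method == "NL":
        return {}  # plain text, nothing extra
    if method == "FRI":
        function = {"name": RESPONSE_SCHEMA["title"], "parameters": RESPONSE_SCHEMA}
        return {"tools": [{"type": "function", "function": function}]}
    if method == "JSON-Mode":
        return {"response_format": {"type": "json_object"}}
    if method == "JSON-Schema":
        json_schema = {
            "name": RESPONSE_SCHEMA["title"],
            "schema": RESPONSE_SCHEMA,
            "strict": True,  # enables constrained decoding against the schema
        }
        return {"response_format": {"type": "json_schema", "json_schema": json_schema}}
    raise ValueError(f"unknown method: {method}")


for method in ("NL", "FRI", "JSON-Mode", "JSON-Schema"):
    print(method, sorted(request_options(method)))
```

Only *JSON-Schema* passes the schema with `strict` set, so it is the only method where the output is guaranteed to match the schema. With *FRI* and *JSON-Mode*, the model is steered toward valid JSON but the result still needs to be validated.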

#### Analyzing the results

Similar to what the .txt team found, you can see that after adjusting the prompts, the performance of structured outputs increases substantially compared to Tam et al.

Except for *NL* in **GSM8K** and *FRI* in **Last Letter**, I was able to improve all the metrics for both unstructured and structured outputs using a 0-shot prompt. For 3-shot prompts, I improved every method except *NL* in **GSM8K**, all methods in **Last Letter**, and only *NL* in **Shuffle Object**.

The counter-intuitive result that 3-shot prompting can hurt performance, as it does in **Shuffle Object**, is in line with previous [analyses](https://python.useinstructor.com/blog/2024/09/26/bad-schemas-could-break-your-llm-structured-outputs/?h=bad+sc#modes-and-models). I will dive deeper into this in a future article.

**GSM8K** and **Last Letter** show very similar results between unstructured and structured outputs. I don't believe there's enough evidence to say that structured outputs are better than unstructured outputs in these cases.

On the other hand, **Shuffle Object** shows a clear advantage for unstructured outputs over structured outputs. I wasn't expecting this, and even after adjusting the prompts, I couldn't reduce the gap significantly.

So, even though Tam et al. had issues in their study, their conclusion seems to hold. There are cases where unstructured outputs are better than structured outputs.
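
One quick way to sanity-check whether two accuracies are meaningfully different is a normal-approximation confidence interval. The sketch below applies it to the 3-shot **Last Letter** results for *NL* (90.00%) and *JSON-Schema* (88.67%), assuming 150 test questions. `accuracy_confidence_interval` is a hypothetical helper, and overlapping intervals are only a rough heuristic, not a substitute for a paired test on per-question scores.

```python
from math import sqrt
from statistics import NormalDist


def accuracy_confidence_interval(
    accuracy: float, n: int, conf_level: float = 0.95
) -> tuple[float, float]:
    """Normal-approximation confidence interval for an accuracy over n questions."""
    z = NormalDist().inv_cdf((1 + conf_level) / 2)
    margin = z * sqrt(accuracy * (1 - accuracy) / n)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


nl_low, nl_high = accuracy_confidence_interval(0.9000, 150)
schema_low, schema_high = accuracy_confidence_interval(0.8867, 150)

# The two intervals overlap if each one starts before the other ends.
intervals_overlap = nl_low <= schema_high and schema_low <= nl_high
print(f"NL: {nl_low:.3f} - {nl_high:.3f}")
print(f"JSON-Schema: {schema_low:.3f} - {schema_high:.3f}")
print(f"Overlap: {intervals_overlap}")
```

With only 150 questions, each interval spans roughly ten percentage points, so a gap of one or two points between methods is well within the noise.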

## Conclusion

If you're a developer, you'll be pleased to know that the answer to whether you should use structured or unstructured outputs is: [it depends](https://www.reddit.com/r/orlybooks/comments/50meb5/it_depends/).

I love using structured outputs in my daily work, because they make it easier to work with the output of LLMs. I always encourage clients who aren't using them yet to give them a try.

But, until there's sufficient evidence in favor of one approach over the other, it's best to test for yourself what works best for your specific use case. Don't rely on random posts from random people on the internet. Run your own evals and see for yourself.

There's a good chance that structured outputs will have comparable quality to unstructured outputs in most cases. But, especially in cases where you're not generating complex structured outputs, you might be losing out on easy performance gains by not testing out unstructured outputs.

The takeaway of this article shouldn't be that structured outputs are better than or worse than unstructured outputs in every possible task. Just that, in some cases, one or the other might be better. 

So, yes, say what you mean... sometimes.

In [1]:
# | output: false
# | echo: false

%load_ext autoreload
%autoreload 2

In [2]:
# | output: false
# | echo: false

import nest_asyncio

nest_asyncio.apply()

In [2]:
# | output: false
# | echo: false

import asyncio
import json
from asyncio import Semaphore
from enum import Enum
from functools import partial
from textwrap import dedent
from typing import Callable, List, Literal

import numpy as np
import pandas as pd
from datasets import load_dataset
from dotenv import load_dotenv
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI
from openai.types.chat import ChatCompletion
from pydantic import BaseModel, ConfigDict, Field
from scipy import stats

np.random.seed(42)

load_dotenv()

MODEL_NAME = "gpt-4o-mini-2024-07-18"
USE_SAMPLE = False 
MAX_CONCURRENCY = 200

client = wrap_openai(AsyncOpenAI())



In [3]:
# | output: false
# | echo: false

class PromptType(Enum):
    WITHOUT_STRUCTURED_OUTPUT = "without_so"
    WITH_TOOL_CALLS = "with_so_tool_calls"
    WITH_JSON_MODE = "with_so_json_mode"
    WITH_STRICT_TOOL_CALLS = "with_so_strict_tool_calls"


class ClientConfig(BaseModel):
    name: str
    col_name: str
    score_col_name: str


CONFIGS = [
    ClientConfig(
        name=PromptType.WITHOUT_STRUCTURED_OUTPUT.value,
        col_name=f"response_{PromptType.WITHOUT_STRUCTURED_OUTPUT.value}",
        score_col_name=f"score_{PromptType.WITHOUT_STRUCTURED_OUTPUT.value}",
    ),
    ClientConfig(
        name=PromptType.WITH_TOOL_CALLS.value,
        col_name=f"response_{PromptType.WITH_TOOL_CALLS.value}",
        score_col_name=f"score_{PromptType.WITH_TOOL_CALLS.value}",
    ),
    ClientConfig(
        name=PromptType.WITH_STRICT_TOOL_CALLS.value,
        col_name=f"response_{PromptType.WITH_STRICT_TOOL_CALLS.value}",
        score_col_name=f"score_{PromptType.WITH_STRICT_TOOL_CALLS.value}",
    ),
    ClientConfig(
        name=PromptType.WITH_JSON_MODE.value,
        col_name=f"response_{PromptType.WITH_JSON_MODE.value}",
        score_col_name=f"score_{PromptType.WITH_JSON_MODE.value}",
    ),
]

In [4]:
# | output: false
# | echo: false

class LLMEvaluator:
    def __init__(
        self,
        configs: List[ClientConfig],
        create_prompt_fn: Callable,
        parse_response_fn: Callable,
        response_model: type[BaseModel],  # the model class itself, not an instance
        concurrency: int = MAX_CONCURRENCY,
    ):
        self.configs = configs
        self.create_prompt_fn = create_prompt_fn
        self.parse_response_fn = parse_response_fn
        self.response_model = response_model
        self.concurrency = concurrency

    def _create_tool_call_schema(
        self,
        strict: bool = False,
    ) -> dict:
        model_schema = self.response_model.model_json_schema()
        if strict:
            return {
                "type": "json_schema",
                "json_schema": {
                    "name": model_schema["title"],
                    "schema": model_schema,
                    "strict": True,
                },
            }
        else:
            return {
                "type": "function",
                "function": {
                    "name": model_schema["title"],
                    "description": f"Correctly extracted `{model_schema['title']}` with all the required parameters with correct types",
                    "parameters": model_schema,
                },
            }

    @traceable(run_type="prompt")
    def create_prompt(
        self,
        question: str,
        prompt_type: str,
    ) -> List[dict]:
        return self.create_prompt_fn(
            question=question,
            prompt_type=prompt_type,
            response_model=self.response_model,
        )

    @traceable(run_type="parser")
    def parse_response(
        self,
        response: str,  # raw message content or tool-call arguments from call_llm
        prompt_type: str,
    ) -> str | int:
        return self.parse_response_fn(response, prompt_type)

    @traceable(run_type="llm")
    async def call_llm(
        self,
        config: ClientConfig,
        question: str,
    ) -> str:  # returns the extracted response text, not the full completion
        params = {
            "messages": self.create_prompt(question=question, prompt_type=config.name),
            "model": MODEL_NAME,
            "timeout": 120,
        }

        prompt_type_configs = {
            PromptType.WITH_JSON_MODE.value: {
                "response_format": {"type": "json_object"}
            },
            PromptType.WITH_TOOL_CALLS.value: {
                "tools": [self._create_tool_call_schema(strict=False)]
            },
            PromptType.WITH_STRICT_TOOL_CALLS.value: {
                "response_format": self._create_tool_call_schema(strict=True)
            },
        }

        if config.name in prompt_type_configs:
            params.update(prompt_type_configs[config.name])

        completion = await client.chat.completions.create(**params)

        if config.name == PromptType.WITH_TOOL_CALLS.value:
            response_content = (
                completion.choices[0].message.tool_calls[0].function.arguments
            )
        else:
            response_content = completion.choices[0].message.content

        return response_content

    @traceable(run_type="chain")
    async def process_question(
        self,
        question: str,
        config: ClientConfig,
        semaphore: Semaphore,
        max_attempts: int = 3,
    ) -> str | int | None:
        async with semaphore:
            for _ in range(max_attempts):
                try:
                    answer = await self.call_llm(
                        config=config,
                        question=question,
                    )
                    parsed_answer = self.parse_response(answer, config.name)
                    return parsed_answer
                except Exception:
                    print(f"{config.name}, {question[:10]}: Retrying...")
                    await asyncio.sleep(1)
                    continue
            print(
                f"{config.name}, {question[:10]}: Failed to process question after {max_attempts} attempts. Set answer to null."
            )
        return None

    @traceable(run_type="chain")
    async def process_questions(
        self,
        run_name: str,
        questions: List[dict],
        config: ClientConfig,
    ) -> List[str | int | None]:
        semaphore = Semaphore(self.concurrency)
        tasks = [
            self.process_question(
                question=question["question"],
                config=config,
                semaphore=semaphore,
            )
            for question in questions
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

    def generate_outputs(self, questions: List[dict]) -> pd.DataFrame:
        df = pd.DataFrame(
            {
                "id": [i for i in range(len(questions))],
                "question": [question["question"] for question in questions],
                "answer": [question["answer"] for question in questions],
            }
        )
        for config in self.configs:
            responses = asyncio.run(
                self.process_questions(
                    run_name=config.name,
                    questions=questions,
                    config=config,
                )
            )
            df[config.col_name] = responses
        return df

    def evaluate_outputs(self, df: pd.DataFrame) -> pd.DataFrame:
        df_copy = df.copy()
        for config in self.configs:
            df_copy[config.score_col_name] = (
                df_copy["answer"] == df_copy[config.col_name]
            ) * 1
        return df_copy

    def calculate_confidence_intervals(
        self, df: pd.DataFrame, conf_level: float = 0.95
    ) -> None:
        print(
            f"Calculating confidence intervals ({conf_level}) with {len(df)} observations:"
        )
        for config in self.configs:
            score_col = config.score_col_name
            scores = df[score_col]

            if len(scores) == 0:
                print(f"No scores available for {score_col}")
                continue

            mean_score = scores.mean()
            se_score = scores.std() / np.sqrt(len(scores))

            z_score = stats.norm.ppf((1 + conf_level) / 2)
            margin_error = z_score * se_score
            ci = [
                max(0.0, mean_score - margin_error),
                min(1.0, mean_score + margin_error),
            ]
            print(
                f"{score_col} - Mean: {mean_score * 100:.2f}% CI: {ci[0] * 100:.2f}% - {ci[1] * 100:.2f}%"
            )
        print()

    def run_paired_t_test(self, df: pd.DataFrame) -> None:
        scores = {}

        for config in self.configs:
            score_col = config.score_col_name
            scores[score_col] = df[score_col] * 1

        for score_col_1, score_col_2 in [
            ("score_without_so", "score_with_so_tool_calls"),
            ("score_without_so", "score_with_so_strict_tool_calls"),
            ("score_without_so", "score_with_so_json_mode"),
        ]:
            if score_col_1 in scores and score_col_2 in scores:
                t_stat, p_value = stats.ttest_rel(
                    scores[score_col_1], scores[score_col_2]
                )
                print(f"{score_col_1} vs {score_col_2}")
                print(f"t-statistic: {t_stat}, p-value: {p_value}")

In [5]:
# | output: false
# | echo: false

class ResponseGSM8K(BaseModel):
    model_config = ConfigDict(extra="forbid", title="Response")
    reasoning: str = Field(description="step by step reasoning about the answer")
    answer: int = Field(description="final answer")


def create_prompt_gsm8k(
    question: str, 
    prompt_type: str, 
    response_model: type[ResponseGSM8K] | None = None,
    zero_shot: bool = False,
):
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        system_prompt = (
            dedent("""
        You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it.

        You will always respond with JSON matching the following schema:
        """)
            + json.dumps(response_model.model_json_schema(), indent=2)
            + """\n\nFirst, provide your step by step reasoning in the "reasoning" field. Then, in the "answer" field, provide an integer that corresponds to the correct answer to the question. Don't include any other text in the "answer" field."""
        )
    else:
        system_prompt = dedent("""
        You are an expert in solving grade school math tasks. You will be presented with a grade-school math word problem and be asked to solve it.
        
        You will always respond in the following format:
        
        <str, reasoning about the answer>
        ANSWER: <int, final answer>
        
        First, provide your step by step reasoning. Then, in ANSWER, provide an integer that corresponds to the correct answer to the question. Don't include any other text in ANSWER.
        """)

    examples = [
        (
            "There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?",
            "There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6.",
            6,
        ),
        (
            "If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?",
            "There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5.",
            5,
        ),
        (
            "Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?",
            "Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39.",
            39,
        ),
    ]

    if not zero_shot:
        system_prompt += "\nExamples:" if examples else ""
        for i, (example_q, example_reason, example_ans) in enumerate(examples):
            system_prompt += f"\n\n**{i+1}**\nQuestion: {example_q}"
            if prompt_type in [
                PromptType.WITH_JSON_MODE.value,
                PromptType.WITH_TOOL_CALLS.value,
                PromptType.WITH_STRICT_TOOL_CALLS.value,
            ]:
                response = f'{{"reasoning": "{example_reason}", "answer": {example_ans}}}'
            else:
                response = f"{example_reason}\nANSWER: {example_ans}"
            system_prompt += f"\nAssistant Response:\n{response}"

    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": f"Question: {question}",
        },
    ]

    return messages

create_prompt_gsm8k_zero_shot = partial(create_prompt_gsm8k, zero_shot=True)


def parse_response_gsm8k(response: str, prompt_type: str) -> int | None:
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        return ResponseGSM8K.model_validate_json(response).answer
    else:
        cleaned_response = (
            response.split("\nANSWER:")[1].replace(",", "").rstrip(".").strip()
        )
        return int(cleaned_response)

In [7]:
# | output: false
# | echo: false

dataset = load_dataset("gsm8k", "main")
evals = [
    {
        "question": d["question"],
        "answer": int(d["answer"].split("#### ")[1].replace(",", "").strip()),
    }
    for d in dataset["test"]
]

if USE_SAMPLE:
    evals = evals[:5]

In [8]:
# | output: false
# | echo: false

# Zero-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_gsm8k_zero_shot,
    parse_response_fn=parse_response_gsm8k,
    response_model=ResponseGSM8K,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

without_so, Jamal's ph: Retrying...
without_so, Elliott is: Retrying...
without_so, Tim has a : Retrying...
without_so, A team of : Retrying...
with_so_tool_calls, Morisette : Retrying...
with_so_tool_calls, Rani has t: Retrying...
with_so_tool_calls, Carl’s fav: Retrying...
with_so_tool_calls, Mary is an: Retrying...
with_so_tool_calls, A local ga: Retrying...
with_so_tool_calls, Adrien's t: Retrying...
with_so_tool_calls, Artie has : Retrying...
with_so_tool_calls, John drive: Retrying...
with_so_tool_calls, Anakin and: Retrying...
with_so_tool_calls, Darren dec: Retrying...
with_so_tool_calls, It takes 2: Retrying...
with_so_tool_calls, Mark is a : Retrying...
with_so_tool_calls, 4 adults a: Retrying...
with_so_tool_calls, Mary buys : Retrying...
with_so_tool_calls, Morisette : Retrying...
with_so_tool_calls, Ten stalls: Retrying...
with_so_tool_calls, Carl’s fav: Retrying...
with_so_tool_calls, John drive: Retrying...
with_so_tool_calls, Mark makes: Retrying...
with_so_tool_calls, 

In [9]:
# | output: false
# | echo: false

# Few-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_gsm8k,
    parse_response_fn=parse_response_gsm8k,
    response_model=ResponseGSM8K,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

without_so, Olivia upl: Retrying...
with_so_tool_calls, Toby is re: Retrying...
with_so_tool_calls, 20% of 50 : Retrying...
with_so_tool_calls, 90 single : Retrying...
with_so_tool_calls, Toby is re: Retrying...
with_so_tool_calls, Jo has bee: Retrying...
with_so_tool_calls, 90 single : Retrying...
with_so_tool_calls, Jo has bee: Retrying...
with_so_tool_calls, Randy has : Retrying...
Calculating confidence intervals (0.95) with 1319 observations:
score_without_so - Mean: 93.10% CI: 91.73% - 94.47%
score_with_so_tool_calls - Mean: 93.18% CI: 91.82% - 94.54%
score_with_so_strict_tool_calls - Mean: 93.18% CI: 91.82% - 94.54%
score_with_so_json_mode - Mean: 93.63% CI: 92.31% - 94.95%

score_without_so vs score_with_so_tool_calls
t-statistic: -0.1347897773418711, p-value: 0.8927986689942902
score_without_so vs score_with_so_strict_tool_calls
t-statistic: -0.12798913051366037, p-value: 0.898177100089478
score_without_so vs score_with_so_json_mode
t-statistic: -0.8818427649934396, p-value: 0

In [10]:
# | output: false
# | echo: false

class ResponseLastLetter(BaseModel):
    model_config = ConfigDict(extra="forbid", title="Response")
    reasoning: str = Field(description="step by step reasoning about the answer")
    answer: str = Field(description="final answer")


def create_prompt_last_letter(
    question: str, 
    prompt_type: str, 
    response_model: type[ResponseLastLetter] | None = None,
    zero_shot: bool = False
):
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        system_prompt = dedent("""
        You are an expert in solving simple word puzzles using reasoning steps. Your specific task is going to be to take a list of 4 names and reason about the last letter of each. Then, you will concatenate the last letters into a word. 
          
        You will always respond with JSON matching the following schema:
        """) + json.dumps(response_model.model_json_schema(), indent=2) + """\n\nFirst, provide your step by step reasoning in the "reasoning" field. Then, in the "answer" field, provide the final answer. Don't include any other text in the "answer" field."""
    else:
        system_prompt = dedent("""
        You are an expert in solving simple word puzzles using reasoning steps. Your specific task is going to be to take a list of 4 names and reason about the last letter of each. Then, you will concatenate the last letters into a word. 
        
        You will always respond in the following format:
        
        <str, reasoning about the answer>
        ANSWER: <str, final answer>
        
        First, provide your step by step reasoning. Then, in ANSWER, provide the final answer. Don't include any other text in ANSWER.
        """)

    fewshot_examples = [
        (
            "Ian Peter Bernard Stephen",
            "The last letter of 'Ian' is 'N'. The last letter of 'Peter' is 'R'. The last letter of 'Bernard' is 'D'. The last letter of 'Stephen' is 'N'. Concatenating them is 'NRDN'.",
            "NRDN",
        ),
        (
            "Javier Dylan Christopher Joseph",
            "The last letter of 'Javier' is 'R'. The last letter of 'Dylan' is 'N'. The last letter of 'Christopher' is 'R'. The last letter of 'Joseph' is 'H'. Concatenating them is 'RNRH'.",
            "RNRH",
        ),
        (
            "Anthony Elizabeth Carlos Jesus",
            "The last letter of 'Anthony' is 'Y'. The last letter of 'Elizabeth' is 'H'. The last letter of 'Carlos' is 'S'. The last letter of 'Jesus' is 'S'. Concatenating them is 'YHSS'.",
            "YHSS",
        ),
    ]

    if not zero_shot:
        system_prompt += "\nExamples:" if fewshot_examples else ""
        for i, (example_q, example_reason, example_ans) in enumerate(fewshot_examples):
            system_prompt += f"\n\n**{i+1}**\nQuestion: Take the last letters of the words in '{example_q}' and concatenate them."
            if prompt_type in [
                PromptType.WITH_JSON_MODE.value,
                PromptType.WITH_TOOL_CALLS.value,
                PromptType.WITH_STRICT_TOOL_CALLS.value,
            ]:
                response = f'{{"reasoning": "{example_reason}", "answer": "{example_ans}"}}'
            else:
                response = f"{example_reason}\nANSWER: {example_ans}"
            system_prompt += f"\nAssistant Response:\n{response}"

    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": f"Question: {question}",
        },
    ]
    return messages

create_prompt_last_letter_zero_shot = partial(create_prompt_last_letter, zero_shot=True)

def parse_response_last_letter(response: str, prompt_type: str) -> str:
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        return ResponseLastLetter.model_validate_json(response).answer.lower()
    else:
        return response.split("\nANSWER:")[1].rstrip(".").strip().lower()
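
As a quick illustration of what each branch of the parser returns, here is a minimal, self-contained sketch. It uses `json.loads` as a stand-in for `ResponseLastLetter.model_validate_json` (an assumption, made only to avoid dependencies), and the helper name `parse_last_letter_sketch` and the sample responses are hypothetical.

```python
import json


def parse_last_letter_sketch(response: str, structured: bool) -> str:
    """Extract the lowercased final answer from a model response.

    structured=True mimics the JSON branch; json.loads is a stand-in
    (assumption) for ResponseLastLetter.model_validate_json.
    """
    if structured:
        return json.loads(response)["answer"].lower()
    # Plain-text branch: keep what follows "\nANSWER:" and drop a trailing period.
    return response.split("\nANSWER:")[1].rstrip(".").strip().lower()


# Hypothetical responses, written by hand for illustration.
plain = "The last letters are 'N', 'R', 'D' and 'N'.\nANSWER: NRDN."
structured = '{"reasoning": "The last letters are N, R, D and N.", "answer": "NRDN"}'

print(parse_last_letter_sketch(plain, structured=False))
print(parse_last_letter_sketch(structured, structured=True))
```

Both calls return the same normalized string, so the two formats are scored identically. A plain-text response with no `ANSWER:` line raises an `IndexError` in this branch, just as in the original function.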

In [11]:
# | output: false
# | echo: false

dataset = load_dataset("ChilleD/LastLetterConcat")
evals = [
    {"question": d["question"], "answer": d["answer"].lower()} for d in dataset["test"]
]

if USE_SAMPLE:
    evals = evals[:5]


In [12]:
# | output: false
# | echo: false

# Zero-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_last_letter_zero_shot,
    parse_response_fn=parse_response_last_letter,
    response_model=ResponseLastLetter,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

Calculating confidence intervals (0.95) with 150 observations:
score_without_so - Mean: 83.33% CI: 77.35% - 89.32%
score_with_so_tool_calls - Mean: 84.00% CI: 78.11% - 89.89%
score_with_so_strict_tool_calls - Mean: 87.33% CI: 81.99% - 92.67%
score_with_so_json_mode - Mean: 90.67% CI: 86.00% - 95.34%

score_without_so vs score_with_so_tool_calls
t-statistic: -0.1993588014488551, p-value: 0.8422538839198896
score_without_so vs score_with_so_strict_tool_calls
t-statistic: -1.2268049987877068, p-value: 0.22183090435484304
score_without_so vs score_with_so_json_mode
t-statistic: -2.5703013926014764, p-value: 0.01114153217659293
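
The interval for `score_without_so` can be reproduced by hand. The sketch below assumes a normal-approximation interval (z = 1.96) on per-question 0/1 scores using the sample standard deviation, and assumes that 83.33% of 150 corresponds to 125 correct answers. The internals of `LLMEvaluator` are not shown in this part, so treat the method as an assumption that happens to agree with the printed numbers.

```python
import math
import statistics


def mean_confidence_interval(scores, z=1.96):
    """Normal-approximation confidence interval for the mean of 0/1 scores."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # statistics.stdev uses the sample (n - 1) denominator.
    std_err = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


# Assumed breakdown: 125 correct out of 150 gives the 83.33% reported above.
scores = [1] * 125 + [0] * 25
mean, low, high = mean_confidence_interval(scores)
print(f"Mean: {mean:.2%} CI: {low:.2%} - {high:.2%}")
```

With only 150 questions, the half-width is about 6 percentage points, which is why differences of a few points between configurations are hard to distinguish from noise.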


In [13]:
# | output: false
# | echo: false

# Few-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_last_letter,
    parse_response_fn=parse_response_last_letter,
    response_model=ResponseLastLetter,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

Calculating confidence intervals (0.95) with 150 observations:
score_without_so - Mean: 90.00% CI: 85.18% - 94.82%
score_with_so_tool_calls - Mean: 91.33% CI: 86.82% - 95.85%
score_with_so_strict_tool_calls - Mean: 88.67% CI: 83.58% - 93.76%
score_with_so_json_mode - Mean: 90.67% CI: 86.00% - 95.34%

score_without_so vs score_with_so_tool_calls
t-statistic: -0.7059233380455161, p-value: 0.4813380870329814
score_without_so vs score_with_so_strict_tool_calls
t-statistic: 0.6311859646135468, p-value: 0.5288860007980016
score_without_so vs score_with_so_json_mode
t-statistic: -0.3323434771513362, p-value: 0.7400969989055401
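
For intuition about what the t-statistics measure: a paired test works on the per-question score differences between two configurations. The sketch below computes the statistic with the standard library on made-up scores. The data and the function name `paired_t_statistic` are hypothetical, and whether `run_paired_t_test` does exactly this internally is an assumption.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t-statistic of a paired t-test: mean difference over its standard error."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / std_err


# Made-up 0/1 scores for the same 10 questions under two configurations.
without_so = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
with_so = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]

t_stat = paired_t_statistic(without_so, with_so)
print(f"t-statistic: {t_stat:.3f}")
```

A negative statistic means the first configuration scored lower on average, which is the sign convention in the output above. Getting the p-value needs the t-distribution; `scipy.stats.ttest_rel(without_so, with_so)` returns both the statistic and the p-value.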


In [14]:
# | output: false
# | echo: false

class ResponseShuffledObjects(BaseModel):
    model_config = ConfigDict(extra="forbid", title="Response")
    reasoning: str = Field(description="reasoning about the answer")
    answer: Literal["A", "B", "C", "D", "E"] = Field(description="final answer")


def create_prompt_shuffled_objects(
    question,
    prompt_type: str,
    response_model: type[ResponseShuffledObjects] | None = None,
    zero_shot: bool = False
):
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        system_prompt = dedent("""
        You are an expert in performing common sense tasks involving the ordering of a sequence of events.
        Each question will present you with a sequence of events involving 5 people (switching objects, partners, positions, etc.). Your task is to determine the correct answer from the options provided.
          
        You will always respond with JSON matching the following schema:
        """) + json.dumps(response_model.model_json_schema(), indent=2) + """\n\nFirst, provide your reasoning in the "reasoning" field. Then, in the "answer" field, provide only the single letter representing the correct choice you are presented with. Don't include any other text in the "answer" field."""
    else:
        system_prompt = dedent("""
        You are an expert in performing common sense tasks involving the ordering of a sequence of events.
        Each question will present you with a sequence of events involving 5 people (switching objects, partners, positions, etc.). Your task is to determine the correct answer from the options provided.
        
        You will always respond in the following format:
        
        <str, reasoning about the answer>
        ANSWER: <str, final answer>
        
        First, provide your step by step reasoning. Then, in ANSWER, provide only the single letter representing the correct choice you are presented with. Don't include any other text in ANSWER.
        """)

    fewshot_examples = [
        (
            "Alice, Bob, Claire, Dave, and Eve are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Patrick, Bob is dancing with Sam, Claire is dancing with Jamie, Dave is dancing with Lola, and Eve is dancing with Melissa.\nThroughout the song, the dancers often trade partners. First, Dave and Eve switch partners. Then, Dave and Alice switch partners. Then, Eve and Alice switch partners. Then, Claire and Bob switch partners. Finally, Dave and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Patrick\n(B) Sam\n(C) Jamie\n(D) Lola\n(E) Melissa",
            "Dave and Eve switch partners, so Dave's partner is now Melissa and Eve's partner is now Lola. Then Dave and Alice switch partners so Dave's partner is now Patrick and Alice's partner is now Melissa. Then Eve and Alice switch partners so Eve's partner is now Melissa and Alice's partner is now Lola. Then Claire and Bob switch partners so Claire's partner is now Sam, and Bob's partner is now Jamie. Finally, Dave and Alice switch partners so Dave's new partner is Lola, and Alice's new partner is Patrick. Alice is dancing with Patrick, choice A.",
            "A",
        ),
        (
            "Alice, Bob, Claire, Dave, and Eve are dancers at a square dance. At the start of a song, they each have a partner: Alice is dancing with Ophelia, Bob is dancing with Jamie, Claire is dancing with Melissa, Dave is dancing with Rodrigo, and Eve is dancing with Patrick.\nThroughout the song, the dancers often trade partners. First, Claire and Bob switch partners. Then, Claire and Eve switch partners. Then, Claire and Bob switch partners. Then, Eve and Dave switch partners. Finally, Claire and Alice switch partners. At the end of the dance, Alice is dancing with\nOptions:\n(A) Ophelia\n(B) Jamie\n(C) Melissa\n(D) Rodrigo\n(E) Patrick",
            "Claire and Bob switch partners, so Claire's partner is now Jamie and Bob's partner is now Melissa. Then, Claire and Eve switch partners, so Claire's partner becomes Patrick and Eve's partner becomes Jamie. Next, Claire and Bob switch partners again, making Claire's partner Melissa and Bob's partner Patrick. After that, Eve and Dave switch partners, resulting in Eve's partner being Rodrigo and Dave's partner being Jamie. Finally, Claire and Alice switch partners, so Claire's partner is now Ophelia and Alice's partner becomes Melissa. Alice is dancing with Melissa, which is choice C.",
            "C",
        ),
        (
            "Alice, Bob, Claire, Dave, and Eve are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Catch-22, Bob gets Hound of the Baskervilles, Claire gets Frankenstein, Dave gets The Pearl, and Eve gets The Fellowship of the Ring.\nAs the semester proceeds, they start trading around the new books. First, Eve and Alice swap books. Then, Alice and Claire swap books. Then, Alice and Bob swap books. Then, Dave and Alice swap books. Finally, Dave and Claire swap books. At the end of the semester, Dave has\nOptions:\n(A) Catch-22\n(B) Hound of the Baskervilles\n(C) Frankenstein\n(D) The Pearl\n(E) The Fellowship of the Ring",
            "First, Eve and Alice swap, so Alice gets The Fellowship of the Ring and Eve gets Catch-22. Next, Alice and Claire swap, giving Claire The Fellowship of the Ring and Alice Frankenstein. Then, Alice and Bob swap, resulting in Bob holding Frankenstein and Alice having Hound of the Baskervilles. Dave and Alice then swap, so Dave takes Hound of the Baskervilles and Alice receives The Pearl. Finally, Dave and Claire swap books, which means Dave takes The Fellowship of the Ring from Claire. Therefore, at the end of all the swaps, Dave possesses The Fellowship of the Ring, making option E the correct answer.",
            "E",
        ),
    ]

    if not zero_shot:
        system_prompt += "\nExamples:" if fewshot_examples else ""
        for i, (example_q, example_reason, example_ans) in enumerate(fewshot_examples):
            system_prompt += f"\n\n**{i+1}**\nQuestion: {example_q}"

            if prompt_type in [
                PromptType.WITH_JSON_MODE.value,
                PromptType.WITH_TOOL_CALLS.value,
                PromptType.WITH_STRICT_TOOL_CALLS.value,
            ]:
                response = f'{{"reasoning": "{example_reason}", "answer": "{example_ans}"}}'
            else:
                response = f"{example_reason}\nANSWER: {example_ans}"
            system_prompt += f"\nAssistant Response:\n{response}"

    messages = [
        {
            "role": "system",
            "content": system_prompt,
        },
        {
            "role": "user",
            "content": f"Question: {question}",
        },
    ]
    return messages

create_prompt_shuffled_objects_zero_shot = partial(create_prompt_shuffled_objects, zero_shot=True)


def parse_response_shuffled_objects(response: str, prompt_type: str) -> str:
    if prompt_type in [
        PromptType.WITH_JSON_MODE.value,
        PromptType.WITH_TOOL_CALLS.value,
        PromptType.WITH_STRICT_TOOL_CALLS.value,
    ]:
        return ResponseShuffledObjects.model_validate_json(response).answer
    else:
        return response.split("\nANSWER:")[1].rstrip(".").strip()
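
The f-string used to build the JSON few-shot responses yields valid JSON here because the reasoning strings contain no double quotes, backslashes, or newlines. If an example ever did, `json.dumps` would take care of the escaping. A minimal sketch, where the helper name `format_fewshot_response` and the sample strings are hypothetical:

```python
import json


def format_fewshot_response(reasoning: str, answer: str, structured: bool) -> str:
    """Render one few-shot assistant response in JSON or plain-text form."""
    if structured:
        # json.dumps escapes quotes, backslashes, and newlines inside the values.
        return json.dumps({"reasoning": reasoning, "answer": answer})
    return f"{reasoning}\nANSWER: {answer}"


# Hypothetical reasoning text containing double quotes.
tricky_reasoning = 'Alice ends up with the book titled "Frankenstein", choice C.'
response = format_fewshot_response(tricky_reasoning, "C", structured=True)
print(response)

# The result round-trips through a JSON parser without errors.
assert json.loads(response)["answer"] == "C"
```

For strings without special characters, the output has the same key order and separators as the f-string version, so the prompt text would not change.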

In [15]:
# | output: false
# | echo: false

dataset = load_dataset(
    "openeval/BIG-Bench-Hard", data_files="tracking_shuffled_objects_five_objects.json"
)
evals = [
    {
        "question": d["input"],
        "answer": d["target"].replace("(", "").replace(")", "").strip(),
    }
    for d in dataset["train"]["examples"][0][4:]  # skip the first 4 rows (the few-shot examples come from this held-out prefix)
]

if USE_SAMPLE:
    evals = evals[:5]



In [16]:
# | output: false
# | echo: false

# Zero-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_shuffled_objects_zero_shot,
    parse_response_fn=parse_response_shuffled_objects,
    response_model=ResponseShuffledObjects,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

with_so_tool_calls, Alice, Bob: Retrying...
with_so_tool_calls, Alice, Bob: 

In [17]:
# | output: false
# | echo: false

# Few-shot
evaluator = LLMEvaluator(
    configs=CONFIGS,
    create_prompt_fn=create_prompt_shuffled_objects,
    parse_response_fn=parse_response_shuffled_objects,
    response_model=ResponseShuffledObjects,
)

df = evaluator.generate_outputs(evals)
df_results = evaluator.evaluate_outputs(df)
evaluator.calculate_confidence_intervals(df_results)
evaluator.run_paired_t_test(df_results)

Calculating confidence intervals (0.95) with 246 observations:
score_without_so - Mean: 87.40% CI: 83.24% - 91.55%
score_with_so_tool_calls - Mean: 70.73% CI: 65.03% - 76.43%
score_with_so_strict_tool_calls - Mean: 67.48% CI: 61.61% - 73.35%
score_with_so_json_mode - Mean: 63.41% CI: 57.38% - 69.45%

score_without_so vs score_with_so_tool_calls
t-statistic: 4.816493986159092, p-value: 2.561002821571171e-06
score_without_so vs score_with_so_strict_tool_calls
t-statistic: 5.359767575478789, p-value: 1.9228803996842006e-07
score_without_so vs score_with_so_json_mode
t-statistic: 6.31805795375796, p-value: 1.239309733027979e-09
