# How good are LLMs at solving decision theory problems?

This notebook explores how well LLMs solve decision-theory problems. We use OpenRouter to evaluate four of the leading models as of May 2025:
* Gemini 2.5 Pro
* OpenAI o3-mini-high
* DeepSeek R1
* Claude 3.7 Sonnet

We evaluate each model both without tools and with access to a Python math execution tool.

In [1]:
import sys
import json
from os import getenv
from pathlib import Path
from datetime import datetime
from openai import OpenAI

# Set the base path
base_path = Path("../")  # One level up from the current working directory

# Add the src/ directory to sys.path using base_path
sys.path.append(str((base_path / "src").resolve()))

from tools.python_math_executor import (
    PYTHON_MATH_EXECUTION_TOOL,
    execute_python
)
from tool_calling import run_tool_call_loop
from llm_calling import run_llm_call
from yaml_functions import load_yaml
from save_outputs import save_output

timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")  # Unique per run, so repeated runs do not overwrite earlier results

In [2]:
system_prompt = """
You are a decision-analysis expert with deep knowledge of decision theory, expected utility, and probabilistic reasoning. 
Your goal is to evaluate decision problems objectively and recommend the strategy that maximizes expected utility, based on the information provided.
"""

short_objective = """
## Objective

Given the problem description above, your task is to identify the rational decision(s) at each point in the process. Use the information provided to evaluate the options and explain your reasoning clearly. Return your final recommendation based on sound analysis.
"""

long_objective = """
## Objective

Given the problem above, your task is to determine the optimal decision strategy using rational, quantitative reasoning.

Your response should include:

- **Modeling the Problem**  
  Identify the decision points, uncertainties (chance events), possible outcomes, and the utilities or payoffs associated with each outcome. Include any costs that affect final payoffs.

- **Computing Expected Utilities**  
  For each possible decision or strategy, calculate the expected utility by combining the relevant probabilities and payoffs. Clearly show how the values are derived.

- **Comparing Strategies**  
  Evaluate and compare the expected utilities of all possible strategies or policies. Determine which strategy yields the highest expected utility.

- **Recommending the Optimal Policy**  
  Based on your analysis, state the strategy that maximizes expected utility. If the problem involves observations or tests, describe what to do for each possible result.

- **Documenting the Reasoning**  
  Show all relevant calculations and briefly explain each step in your reasoning. Make the rationale behind your recommendation transparent and verifiable.
"""

In [40]:
PROBLEM_NAME = "oil_field_purchase"

# MODEL = "openai/gpt-4o-mini"  # Baseline: a smaller, non-reasoning model
# MODEL = "deepseek/deepseek-r1" 
# MODEL = "anthropic/claude-3.7-sonnet:thinking"
# MODEL = "google/gemini-2.5-pro-preview"
MODEL = "openai/o3-mini-high"

In [41]:
# Initialize the OpenAI client against OpenRouter's OpenAI-compatible endpoint
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=getenv("OPENROUTER_API_KEY")
)

# Prepare prompt
problem = load_yaml(base_path / f"decision_problems/{PROBLEM_NAME}/problem.yaml")["long_version"]
objective = short_objective

prompt_str = problem + "\n" + objective

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt_str}
]

# Without Tools
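
The `run_llm_call` helper used in this section is imported from the `llm_calling` module under `src/` and is not shown in this notebook. As a rough sketch of what such a helper might do (an assumption, not the actual implementation), it could wrap a single `chat.completions.create` request and return the reply together with the extended message history. The name `run_llm_call_sketch` and the offline fake client below are hypothetical and exist only so the example runs without an API key.

```python
from types import SimpleNamespace


def run_llm_call_sketch(openai_client, model, messages):
    """Hypothetical stand-in for run_llm_call (the real helper is not shown here)."""
    completion = openai_client.chat.completions.create(model=model, messages=messages)
    reply = completion.choices[0].message.content
    # Return a new list so the caller's original history is left untouched.
    return reply, messages + [{"role": "assistant", "content": reply}]


class _FakeCompletions:
    """Offline fake that mimics the shape of a chat completion response."""

    def create(self, model, messages):
        message = SimpleNamespace(content=f"echo from {model}")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_FakeCompletions()))
history = [{"role": "user", "content": "Should the company buy the field?"}]
reply, new_history = run_llm_call_sketch(fake_client, "example/model", history)
print(reply)
print(len(history), len(new_history))
```

Returning a new list rather than appending in place keeps the original prompt available for a later run with a different configuration. Whether the real helper behaves this way is not visible from the notebook.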

In [42]:
response, messages = run_llm_call(
    openai_client=client,
    model=MODEL,
    messages=messages
)

print(response)

Below is a step‐by‐step analysis of the decision problem.

──────────────────────────────
1. Prior Probabilities and Outcomes

The quality Q of the oil field can be:
 • High with probability 0.35  
 • Medium with probability 0.45  
 • Low with probability 0.20

There are two “branches” to consider:
 A. No Test (skip testing)  
 B. Test (perform the porosity test at a cost of US$30M)

For each branch, the oil company will then decide whether to buy the field or not.
──────────────────────────────
2. No-Test Branch

If the company skips the test, the purchase decision is made under the original probabilities. The net outcomes (in millions) are:

 • If Buy:
  – Q = High: +1250 +30M compared to tested outcomes becomes +1280M without test  
  – Q = Medium: +660M  
  – Q = Low: +30M  
 • If Not Buy: +380M

Let’s calculate the expected value (EV) if the company decides to buy:
 EV(buy) = (0.35 × 1280) + (0.45 × 660) + (0.20 × 30)

  = 448 + 297 + 6  
  = 751 (in millions)

Since not buying yi

In [44]:
save_output(
    response=response,
    messages=messages,
    base_path=base_path / "results" / PROBLEM_NAME,
    timestamp=timestamp,
    dir_name=MODEL
)

✅ Final response saved to: ../results/oil_field_purchase/2025-05-19_14-02-12/openai/o3-mini-high/final_response.txt
✅ Message history saved to: ../results/oil_field_purchase/2025-05-19_14-02-12/openai/o3-mini-high/message_history.json


# With Tools

> DeepSeek R1 does not support function calling: requests that include tools return an error on OpenRouter.

In [45]:
# Tool registration
tools = [
    PYTHON_MATH_EXECUTION_TOOL
]

TOOL_MAPPING = {
    "execute_python": execute_python
}
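
The definitions of `PYTHON_MATH_EXECUTION_TOOL` and `execute_python` live in `src/tools/python_math_executor` and are not shown here. The sketch below illustrates, under stated assumptions, what a tool schema in the OpenAI function-calling format and a name-based dispatch through a mapping like `TOOL_MAPPING` might look like. The names `EXAMPLE_TOOL`, `example_execute_python`, `EXAMPLE_TOOL_MAPPING`, and `dispatch_tool_call`, as well as the `code` parameter, are hypothetical.

```python
import contextlib
import io
import json

# Hypothetical schema in the OpenAI function-calling format.
# The real PYTHON_MATH_EXECUTION_TOOL may use different fields and wording.
EXAMPLE_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_python",
        "description": "Run a short Python snippet and return what it prints.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source to run."}
            },
            "required": ["code"],
        },
    },
}


def example_execute_python(code: str) -> str:
    """Toy executor: runs the snippet and captures stdout. Not sandboxed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


EXAMPLE_TOOL_MAPPING = {"execute_python": example_execute_python}


def dispatch_tool_call(name: str, arguments_json: str, tool_mapping: dict) -> str:
    """Look up the tool by name and call it with the JSON-decoded arguments."""
    arguments = json.loads(arguments_json)
    return tool_mapping[name](**arguments)


result = dispatch_tool_call(
    "execute_python", json.dumps({"code": "print(2 + 3)"}), EXAMPLE_TOOL_MAPPING
)
print(result.strip())
```

The model only ever sees the schema; the mapping is what connects the tool name it returns to a local Python callable. A real executor would need sandboxing, which this toy version deliberately omits.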

In [46]:
response, updated_messages = run_tool_call_loop(
    openai_client=client,
    model=MODEL,
    tools=tools,
    messages=messages,
    tool_mapping=TOOL_MAPPING,
    verbose=True,
    max_iterations=10
)


--- Iteration 1 ---
🔚 Final model output:
 Below is a clear analysis of the decision problem:

──────────────────────────────
1. Prior Probabilities and Payoffs

The oil field quality Q has three states:
 • High: P = 0.35  
 • Medium: P = 0.45  
 • Low: P = 0.20

There are two branches:
 A. Skip the Porosity Test (“No Test”)  
 B. Conduct the Test (“Test”)

Payoffs differ by branch:

• Without Test (direct decision):
 – If Buy:
  ◦ High: +\$1280M  
  ◦ Medium: +\$660M  
  ◦ Low: +\$30M  
 – If Do Not Buy: +\$380M

• With Test (test cost 


In [47]:
print(response)

Below is a clear analysis of the decision problem:

──────────────────────────────
1. Prior Probabilities and Payoffs

The oil field quality Q has three states:
 • High: P = 0.35  
 • Medium: P = 0.45  
 • Low: P = 0.20

There are two branches:
 A. Skip the Porosity Test (“No Test”)  
 B. Conduct the Test (“Test”)

Payoffs differ by branch:

• Without Test (direct decision):
 – If Buy:
  ◦ High: +\$1280M  
  ◦ Medium: +\$660M  
  ◦ Low: +\$30M  
 – If Do Not Buy: +\$380M

• With Test (test cost \$30M is already deducted):
 – If Buy:
  ◦ High: +\$1250M  
  ◦ Medium: +\$630M  
  ◦ Low: +\$0M  
 – If Do Not Buy: +\$350M

──────────────────────────────
2. No-Test Branch Analysis

Since not buying directly yields \$380M, we compare that with buying.

Expected value if buying directly:
 EV(buy) = (0.35 × 1280) + (0.45 × 660) + (0.20 × 30)
     = 448 + 297 + 6 
     = 751 (in millions)

Thus, for the no-test branch, buying gives an expected value (EV) of \$751M.

─────────────────────────────

In [48]:
save_output(
    response=response,
    messages=updated_messages,  # Save the history returned by the tool-call loop
    base_path=base_path / "results" / PROBLEM_NAME,
    timestamp=timestamp,
    dir_name=MODEL + "-python-math"
)

✅ Final response saved to: ../results/oil_field_purchase/2025-05-19_14-02-12/openai/o3-mini-high-python-math/final_response.txt
✅ Message history saved to: ../results/oil_field_purchase/2025-05-19_14-02-12/openai/o3-mini-high-python-math/message_history.json
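
Both responses report the same expected value for buying without the test, which is easy to verify independently. The sketch below recomputes it from the prior and the no-test payoffs as quoted in the model outputs above. These numbers are taken from the responses, not from `problem.yaml` (which is not shown here), so treat them as an assumption about the problem statement. The helper `expected_value` is hypothetical.

```python
# Prior over field quality and no-test payoffs (US$ millions),
# copied from the model responses rather than from problem.yaml.
prior = {"high": 0.35, "medium": 0.45, "low": 0.20}
payoff_buy = {"high": 1280, "medium": 660, "low": 30}
payoff_not_buy = 380


def expected_value(probabilities, payoffs):
    """Probability-weighted sum of payoffs over all states."""
    return sum(probabilities[state] * payoffs[state] for state in probabilities)


ev_buy = expected_value(prior, payoff_buy)
best_action = "buy" if ev_buy > payoff_not_buy else "do not buy"
print(f"EV(buy) = {ev_buy:.0f}M, EV(do not buy) = {payoff_not_buy}M -> {best_action}")
```

This covers only the no-test branch. Checking the test branch would additionally require the test's conditional probabilities, which do not appear in the truncated outputs above.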
