# How good are LLMs at solving decision theory problems?

In this notebook we explore how well LLMs solve decision-theory problems. We use OpenRouter to evaluate four leading models as of May 2025:
* Gemini 2.5 Pro
* OpenAI o3-mini-high
* DeepSeek R1
* Claude 3.7 Sonnet

We evaluate the models both without tools and with a Python math-execution tool.

In [49]:
import sys
import json
from os import getenv
from pathlib import Path
from datetime import datetime
from openai import OpenAI

# Set the base path
base_path = Path("../")  # One level up from the current working directory

# Add the src/ directory to sys.path using base_path
sys.path.append(str((base_path / "src").resolve()))

from tools.python_math_executor import (
    PYTHON_MATH_EXECUTION_TOOL,
    execute_python
)
from tool_calling import run_tool_call_loop
from llm_calling import run_llm_call
from yaml_functions import load_yaml
from save_outputs import save_output

timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S") # To allow for multiple runs

In [50]:
system_prompt = """
You are a decision-analysis expert with deep knowledge of decision theory, expected utility, and probabilistic reasoning. 
Your goal is to evaluate decision problems objectively and recommend the strategy that maximizes expected utility, based on the information provided.
"""

short_objective = """
## Objective

Given the problem description above, your task is to identify the rational decision(s) at each point in the process. Use the information provided to evaluate the options and explain your reasoning clearly. Return your final recommendation based on sound analysis.
"""

long_objective = """
## Objective

Given the problem above, your task is to determine the optimal decision strategy using rational, quantitative reasoning.

Your response should include:

- **Modeling the Problem**  
  Identify the decision points, uncertainties (chance events), possible outcomes, and the utilities or payoffs associated with each outcome. Include any costs that affect final payoffs.

- **Computing Expected Utilities**  
  For each possible decision or strategy, calculate the expected utility by combining the relevant probabilities and payoffs. Clearly show how the values are derived.

- **Comparing Strategies**  
  Evaluate and compare the expected utilities of all possible strategies or policies. Determine which strategy yields the highest expected utility.

- **Recommending the Optimal Policy**  
  Based on your analysis, state the strategy that maximizes expected utility. If the problem involves observations or tests, describe what to do for each possible result.

- **Documenting the Reasoning**  
  Show all relevant calculations and briefly explain each step in your reasoning. Make the rationale behind your recommendation transparent and verifiable.
"""
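
The steps listed in `long_objective` (model the problem, compute expected utilities, compare strategies, state a policy per test result) can be sketched in a few lines of Python. The priors, payoffs and the 10M test cost below are the values quoted in the model responses further down in this notebook. The likelihood table `LIKELIHOOD` is invented for illustration and is not taken from `problem.yaml`, and treating the conventional system's expected utility as unaffected by the test is also an assumption.

```python
# Priors and payoffs (millions of euros) as quoted in the model responses in this notebook.
PRIORS = {"smooth": 0.70, "minor": 0.20, "major": 0.10}
ADVANCED_PAYOFF = {"smooth": 150, "minor": -40, "major": -120}
EU_CONVENTIONAL = 0.95 * 60 + 0.05 * (-30)  # assumed independent of the test result
TEST_COST = 10

# HYPOTHETICAL likelihoods P(test result | state), made up for this sketch only.
LIKELIHOOD = {
    "smooth": {"bad": 0.1, "good": 0.3, "excellent": 0.6},
    "minor": {"bad": 0.3, "good": 0.5, "excellent": 0.2},
    "major": {"bad": 0.7, "good": 0.2, "excellent": 0.1},
}


def expected_utility(probs, payoffs):
    """Probability-weighted sum of payoffs over all states."""
    return sum(probs[state] * payoffs[state] for state in probs)


def posterior(result):
    """Bayes' rule: return P(result) and the distribution P(state | result)."""
    joint = {s: PRIORS[s] * LIKELIHOOD[s][result] for s in PRIORS}
    marginal = sum(joint.values())
    return marginal, {s: joint[s] / marginal for s in joint}


# Strategy 1: decide without testing.
eu_no_test = max(expected_utility(PRIORS, ADVANCED_PAYOFF), EU_CONVENTIONAL)

# Strategy 2: pay for the test, then pick the best system for each result.
policy = {}
eu_with_test = -TEST_COST
for result in ("bad", "good", "excellent"):
    p_result, post = posterior(result)
    eu_advanced = expected_utility(post, ADVANCED_PAYOFF)
    policy[result] = "advanced" if eu_advanced > EU_CONVENTIONAL else "conventional"
    eu_with_test += p_result * max(eu_advanced, EU_CONVENTIONAL)

print(f"EU without test: {eu_no_test:.2f}")
print(f"EU with test, net of cost: {eu_with_test:.2f}")
print("Policy after test:", policy)
```

With different likelihoods the comparison between the two strategies can flip, so the numbers printed here say nothing about the correct answer to the actual problem.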

In [81]:
# PROBLEM_NAME = "oil_field_purchase"
PROBLEM_NAME = "automated_logistics_center"

# MODEL = "openai/gpt-4o-mini" # To test the baseline performance of LLMs
# MODEL = "deepseek/deepseek-r1" 
# MODEL = "anthropic/claude-3.7-sonnet:thinking"
# MODEL = "google/gemini-2.5-pro-preview"
MODEL = "openai/o3-mini-high"

In [82]:
# Initialize OpenAI client
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=getenv("OPENROUTER_API_KEY")
)

# Prepare prompt
problem = load_yaml(base_path / f"decision_problems/{PROBLEM_NAME}/problem.yaml")["long_version"]
objective = short_objective  # Use long_objective to request a structured, step-by-step analysis

prompt_str = problem + "\n" + objective

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt_str}
]

# Without Tools
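
The cell below relies on `run_llm_call`, a project helper from `src/` that is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built on the OpenAI SDK's `chat.completions.create` call. The name `run_llm_call_sketch`, its return convention (reply text plus a history with the reply appended) and the offline stub client are all assumptions made for illustration, not the project's actual code.

```python
from types import SimpleNamespace


def run_llm_call_sketch(openai_client, model, messages):
    """Hypothetical stand-in for the project's run_llm_call helper."""
    completion = openai_client.chat.completions.create(model=model, messages=messages)
    reply = completion.choices[0].message.content
    # Assumption: the helper returns a new history with the assistant reply appended.
    return reply, messages + [{"role": "assistant", "content": reply}]


class _FakeCompletions:
    """Offline stub that mimics the shape of the SDK response (no network call)."""

    def create(self, model, messages):
        message = SimpleNamespace(content=f"[{model}] received {len(messages)} messages")
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_FakeCompletions()))

demo_reply, demo_history = run_llm_call_sketch(
    fake_client,
    "demo-model",
    [
        {"role": "system", "content": "You are a decision-analysis expert."},
        {"role": "user", "content": "Which option maximizes expected utility?"},
    ],
)
print(demo_reply)
print(len(demo_history))
```

If the real helper also appends the reply to the history it returns, any later call that reuses that history will see the earlier answer.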

In [83]:
response, messages = run_llm_call(
    openai_client=client,
    model=MODEL,
    messages=messages
)

print(response)

Below is a step‐by‐step analysis showing how to compare the alternatives and arrive at the optimal decision:

──────────────────────────────
1. No‐Test Analysis

If the company forgoes testing, it can choose between:
• Advanced Robotics System (with prior outcomes specified)
• Conventional Robotics System

For the advanced system (no test), the outcomes (in millions of euros) are:
 – Smooth: 150  
 – Minor failure: –40  
 – Major failure: –120

Using the given prior probabilities for the advanced system (smooth = 0.70, minor = 0.20, major = 0.10), the expected utility is calculated as follows:

 EU(advanced, no test) = 0.70×150 + 0.20×(–40) + 0.10×(–120)
                 = 105 – 8 – 12 = 85

For the conventional system (no test), the outcomes are:
 – Smooth: 60  
 – Failure: –30  
with probabilities (smooth = 0.95, failure = 0.05). Therefore:

 EU(conventional, no test) = 0.95×60 + 0.05×(–30)
                 = 57 – 1.5 = 55.5

Thus, without testing the best option would be the advance
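
The no-test expected utilities quoted in the response above are easy to check independently, which is useful when grading model answers. The short sketch below recomputes them from the probabilities and payoffs the model states; the helper name `expected_utility` is ours.

```python
import math


def expected_utility(outcomes):
    """outcomes: list of (probability, payoff) pairs whose probabilities sum to 1."""
    assert math.isclose(sum(p for p, _ in outcomes), 1.0), "probabilities must sum to 1"
    return sum(p * payoff for p, payoff in outcomes)


# Values (millions of euros) copied from the model response above.
eu_advanced = expected_utility([(0.70, 150), (0.20, -40), (0.10, -120)])
eu_conventional = expected_utility([(0.95, 60), (0.05, -30)])

print(round(eu_advanced, 2), round(eu_conventional, 2))
```

Both values agree with the model's figures of 85 and 55.5, so the arithmetic in this part of the response holds up.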

In [84]:
save_output(
    response=response,
    messages=messages,
    base_path=base_path / "results" / PROBLEM_NAME,
    timestamp=timestamp,
    dir_name=MODEL
)

✅ Final response saved to: ../results/automated_logistics_center/2025-05-20_14-23-07/openai/o3-mini-high/final_response.txt
✅ Message history saved to: ../results/automated_logistics_center/2025-05-20_14-23-07/openai/o3-mini-high/message_history.json
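
The paths printed above suggest a `results/<problem>/<timestamp>/<model>/` layout with two files per run. `save_output` itself lives in `src/` and is not shown here; the sketch below is a hypothetical equivalent (`save_output_sketch`) that reproduces that layout with `pathlib` and `json`, writing into a temporary directory so it can be run safely.

```python
import json
import tempfile
from pathlib import Path


def save_output_sketch(response, messages, base_path, timestamp, dir_name):
    """Hypothetical stand-in for the project's save_output helper."""
    # dir_name may contain "/", e.g. "openai/o3-mini-high", which creates nested folders.
    out_dir = Path(base_path) / timestamp / dir_name
    out_dir.mkdir(parents=True, exist_ok=True)
    response_path = out_dir / "final_response.txt"
    history_path = out_dir / "message_history.json"
    response_path.write_text(response, encoding="utf-8")
    history_path.write_text(
        json.dumps(messages, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return response_path, history_path


with tempfile.TemporaryDirectory() as tmp:
    demo_response_path, demo_history_path = save_output_sketch(
        response="demo answer",
        messages=[{"role": "user", "content": "demo question"}],
        base_path=Path(tmp) / "results" / "demo_problem",
        timestamp="2025-01-01_00-00-00",
        dir_name="openai/o3-mini-high",
    )
    # Read the files back before the temporary directory is removed.
    saved_text = demo_response_path.read_text(encoding="utf-8")
    saved_history = json.loads(demo_history_path.read_text(encoding="utf-8"))

print(saved_text)
print(saved_history)
```

Keying the directory on a timestamp created once per session means the no-tools and with-tools runs of the same session land side by side.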


# With Tools

> DeepSeek R1 does not support function calling: requests that include tools return an error on OpenRouter.

In [85]:
# Tool registration
tools = [
    PYTHON_MATH_EXECUTION_TOOL
]

TOOL_MAPPING = {
    "execute_python": execute_python
}
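
`PYTHON_MATH_EXECUTION_TOOL` and `execute_python` are imported from `src/tools/python_math_executor` and are not shown in this notebook. For orientation, the sketch below shows what a tool definition in the OpenAI function-calling format and a matching executor might look like. The schema contents, the `code` parameter and the `execute_python_sketch` function are assumptions for illustration; the project's real tool may differ.

```python
import contextlib
import io
import math

# Hypothetical tool schema in the OpenAI function-calling format.
EXAMPLE_PYTHON_TOOL = {
    "type": "function",
    "function": {
        "name": "execute_python",
        "description": "Run a short Python snippet and return whatever it prints.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python source code to execute."}
            },
            "required": ["code"],
        },
    },
}


def execute_python_sketch(code: str) -> str:
    """Hypothetical executor: run `code` and return captured stdout or an error message.

    WARNING: exec() on model-written code is unsafe outside a sandbox.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {"math": math})  # expose only the math module by name
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


demo_output = execute_python_sketch("print(0.95 * 60 + 0.05 * (-30))")
demo_error = execute_python_sketch("print(1 / 0)")
print(demo_output)
print(demo_error)
```

Returning errors as text rather than raising lets the model see the failure and retry with corrected code.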

In [86]:
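# Start from the original prompt only: `messages` was reassigned by run_llm_call
# above and may already contain the no-tools answer, which would leak into this run.
tool_messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt_str}
]

response, updated_messages = run_tool_call_loop(
    openai_client=client,
    model=MODEL,
    tools=tools,
    messages=tool_messages,
    tool_mapping=TOOL_MAPPING,
    verbose=True,
    max_iterations=10
)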


--- Iteration 1 ---
🔚 Final model output:
 Below is a concise explanation and recommendation based on a full expected utility analysis.

──────────────────────────────
1. No‐Test Option

Without a test the company must choose between:
 a. Advanced Robotics System (with outcomes):
  • Smooth: 150  
  • Minor failure: –40  
  • Major failure: –120  
  with prior probabilities 0.70, 0.20, 0.10 respectively, yielding:

  EU(advanced, no test) = 0.70×150 + 0.20×(–40) + 0.10×(–120) = 105 – 8 – 12 = 85

 b. Conventional Robotics System (with ou
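
The verbose log above ends at iteration 1 with a final answer, so in this run the model appears to have replied without calling the tool. For readers unfamiliar with the pattern, a tool-call loop generally works as sketched below: call the model, execute any requested tools, append their results as `tool` messages, and repeat until the model answers in plain text. `run_tool_call_loop_sketch` and the scripted offline client are hypothetical; the project's `run_tool_call_loop` is not shown in this notebook and may be implemented differently.

```python
import json
from types import SimpleNamespace


def run_tool_call_loop_sketch(openai_client, model, tools, messages, tool_mapping,
                              max_iterations=10):
    """Hypothetical stand-in for the project's run_tool_call_loop helper."""
    messages = list(messages)  # work on a copy so the caller's history is untouched
    for _ in range(max_iterations):
        completion = openai_client.chat.completions.create(
            model=model, messages=messages, tools=tools
        )
        message = completion.choices[0].message
        if not message.tool_calls:
            # No tool requested: this is the final answer.
            messages.append({"role": "assistant", "content": message.content})
            return message.content, messages
        # Record the assistant turn that requested the tools.
        messages.append({
            "role": "assistant",
            "content": message.content,
            "tool_calls": [
                {"id": call.id, "type": "function",
                 "function": {"name": call.function.name,
                              "arguments": call.function.arguments}}
                for call in message.tool_calls
            ],
        })
        # Execute each requested tool and feed the result back to the model.
        for call in message.tool_calls:
            arguments = json.loads(call.function.arguments)
            result = tool_mapping[call.function.name](**arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    raise RuntimeError("max_iterations reached without a final answer")


class _ScriptedCompletions:
    """Offline stub: requests one tool call, then returns a final answer."""

    def __init__(self):
        self.calls = 0

    def create(self, model, messages, tools):
        self.calls += 1
        if self.calls == 1:
            call = SimpleNamespace(
                id="call_1",
                function=SimpleNamespace(name="add", arguments=json.dumps({"a": 57, "b": -1.5})),
            )
            message = SimpleNamespace(content=None, tool_calls=[call])
        else:
            message = SimpleNamespace(content=f"Tool said {messages[-1]['content']}", tool_calls=None)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


scripted_client = SimpleNamespace(chat=SimpleNamespace(completions=_ScriptedCompletions()))
demo_answer, demo_tool_history = run_tool_call_loop_sketch(
    scripted_client, "demo-model", tools=[],
    messages=[{"role": "user", "content": "What is 57 - 1.5?"}],
    tool_mapping={"add": lambda a, b: a + b},
)
print(demo_answer)
print([m["role"] for m in demo_tool_history])
```

The `max_iterations` cap guards against a model that keeps requesting tools without ever producing a final answer.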


In [87]:
print(response)

Below is a concise explanation and recommendation based on a full expected utility analysis.

──────────────────────────────
1. No‐Test Option

Without a test the company must choose between:
 a. Advanced Robotics System (with outcomes):
  • Smooth: 150  
  • Minor failure: –40  
  • Major failure: –120  
  with prior probabilities 0.70, 0.20, 0.10 respectively, yielding:

  EU(advanced, no test) = 0.70×150 + 0.20×(–40) + 0.10×(–120) = 105 – 8 – 12 = 85

 b. Conventional Robotics System (with outcomes):
  • Smooth: 60  
  • Failure: –30  
  with probabilities 0.95 and 0.05, respectively, yielding:

  EU(conventional, no test) = 0.95×60 + 0.05×(–30) = 57 – 1.5 = 55.5

Thus in a no-test scenario, advanced (EU ≈ 85) dominates conventional (EU ≈ 55.5).

──────────────────────────────
2. Testing Option

By investing €10M in a feasibility test, the decision maker gets additional information:
 • The test can return one of three outcomes – Bad (b), Good (g), or Excellent (e).  
 • The test out

In [88]:
save_output(
    response=response,
    messages=updated_messages,  # History returned by the tool-call loop, not the no-tools history
    base_path=base_path / "results" / PROBLEM_NAME,
    timestamp=timestamp,
    dir_name=MODEL + "-python-math"
)

✅ Final response saved to: ../results/automated_logistics_center/2025-05-20_14-23-07/openai/o3-mini-high-python-math/final_response.txt
✅ Message history saved to: ../results/automated_logistics_center/2025-05-20_14-23-07/openai/o3-mini-high-python-math/message_history.json
