# **Inference**

If we were to progress through the topics of this course chronologically, this section would be towards the very end; inference is the function that connects the engineering and scientific heavy lifting of training to deployment, and it functionally looks more similar to traditional back-end engineering than pure machine learning. 

However, inference is a good place to begin because it defines the goal to which we will spend the rest of the semester working towards and demonstrates the capabilities of models that we will work to better understand. It also is the layer in which a lot of value will accrue in the coming years, and I hope to demonstrate that while it is technically more straightforward, there is a lot of juice to squeeze in this area.

**What is Inference?**

Inference is the process of using a trained model to generate outputs. Today, inference can be done by downloading pre-trained models and using GPUs to run them, or by paying a provider to offload the computational and operational burden and submitting inference requests via APIs. These are often called model providers. 

The most straightward way to run "inference" is simply by using one of these model providers via API.

**OpenRouter**

OpenRouter unifies many model providers into one API service, allowing users to toggle between different models and providers to optimize for cost, availability, speed, etc.

In [113]:
import dotenv
import os
import requests
from IPython.display import Image, display
import time
import openai
from collections import defaultdict
import json

dotenv.load_dotenv()

openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
notion_api_key = os.getenv("NOTION_API_KEY")
google_calendar_api_key = os.getenv("GOOGLE_API_KEY")
openai.api_key = openai_api_key



In [2]:
# Download a few schemas of popular OpenRouter models and display relevant information
response = requests.get("https://openrouter.ai/api/v1/models")
models_data = response.json()

popular_models = ["openai/gpt-5", "anthropic/claude-opus-4.1", "x-ai/grok-code-fast-1", "google/gemini-2.5-flash"]

print("Popular OpenRouter Models:\n")
for model_id in popular_models:
    for model in models_data.get("data", []):
        if model.get("id") == model_id:
            print(f"Model: {model['name']}")
            print(f"ID: {model['id']}")
            print(f"Context Length: {model['context_length']:,} tokens")
            pricing = model['pricing']
            print(f"Pricing: ${float(pricing['prompt'])*1000000:.3f}/1M input, ${float(pricing['completion'])*1000000:.3f}/1M output")
            print("-" * 50)

Popular OpenRouter Models:

Model: OpenAI: GPT-5
ID: openai/gpt-5
Context Length: 400,000 tokens
Pricing: $0.625/1M input, $5.000/1M output
--------------------------------------------------
Model: Anthropic: Claude Opus 4.1
ID: anthropic/claude-opus-4.1
Context Length: 200,000 tokens
Pricing: $15.000/1M input, $75.000/1M output
--------------------------------------------------
Model: xAI: Grok Code Fast 1
ID: x-ai/grok-code-fast-1
Context Length: 256,000 tokens
Pricing: $0.200/1M input, $1.500/1M output
--------------------------------------------------
Model: Google: Gemini 2.5 Flash
ID: google/gemini-2.5-flash
Context Length: 1,048,576 tokens
Pricing: $0.300/1M input, $2.500/1M output
--------------------------------------------------


Let's test out GPT-5, OpenAI's newest model, which provides great intellegence per unit cost

In [8]:
# Use gpt-5 from OpenRouter to complete a vanilla completion task
headers = {
    "Authorization": f"Bearer {openrouter_api_key}",
    "Content-Type": "application/json"
}

data = {
    "model": "openai/gpt-5",
    "messages": [
        {"role": "user", "content": "Write a haiku about artificial intelligence"}
    ]
}

response_text = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=data
)

print(response_text.json()['choices'][0]['message']['content'])


Silent circuits dream,
learning the shape of our world—
new mind wakes in light.


Many models have different input/output modalities available

In [25]:
cat_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
display(Image(url=cat_image_url, width=300))


In [26]:
# Use gpt-5 to complete a task using a photo of a cat as input and ask it to describe the cat

data_vision = {
    "model": "openai/gpt-5",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe this cat in detail."},
                {"type": "image_url", "image_url": {"url": cat_image_url}}
            ]
        }
    ]
}

response_vision = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=data_vision
)

print(response_vision.json()['choices'][0]['message']['content'])


- Coat and pattern: Short-haired ginger/orange tabby with warm caramel tones. Classic tabby “M” on the forehead, darker mackerel-like stripes on the face and body, and a paler cream chin, chest, and underside. The fur looks plush and well-groomed.
- Face: Round head with a gently tapered, short muzzle. Pink nose leather edged slightly darker. Long, white whiskers and a few white eyebrow whiskers.
- Eyes: Large, almond-round eyes in a golden amber color, dark-rimmed, giving a soft, attentive expression.
- Ears: Medium, triangular ears with slightly rounded tips; inner fur is light. The left ear shows a tiny nick/notch near the edge.
- Build: Appears medium-sized and sturdy with a thick neck and solid shoulders—typical of a healthy adult domestic shorthair.
- Posture and expression: Sitting and looking slightly upward toward the camera with a calm, curious, gentle look.
- Setting: Shot outdoors on a pale concrete surface; background is softly blurred with a red line (likely a hose) runni

**Determining costs is an important part of inference.**

Inference costs are broken down into a few components:
- price per input token
- price per output token
- additional tool fees

New models utilize "reasoning" capabilities, which is when a model engages in a self-reflective dialogue to reason through a problem. This uses additional tokens that are often billed as additional output tokens, even though you do not observe them in the output. Openrouter automatically adds them to the output token count.

In [12]:
# Report the cost breakdown from the previous two GPT-5 calls

# Get pricing for GPT-5
for model in models_data.get("data", []):
    if model.get("id") == "openai/gpt-5":
        gpt5_pricing = model['pricing']
        break

# Calculate costs
text_prompt_tokens = response_text.json()['usage']['prompt_tokens']
text_completion_tokens = response_text.json()['usage']['completion_tokens']
vision_prompt_tokens = response_vision.json()['usage']['prompt_tokens']
vision_completion_tokens = response_vision.json()['usage']['completion_tokens']

text_cost = (text_prompt_tokens * float(gpt5_pricing['prompt']) + 
             text_completion_tokens * float(gpt5_pricing['completion']))

vision_cost = (vision_prompt_tokens * float(gpt5_pricing['prompt']) + 
               vision_completion_tokens * float(gpt5_pricing['completion']))

print("Cost Breakdown for GPT-5 Calls:")
print("-" * 50)
print(f"Text Completion Task:")
print(f"  Input tokens: {text_prompt_tokens}")
print(f"  Output tokens: {text_completion_tokens}")
print(f"  Cost: ${text_cost:.6f}")
print()
print(f"Vision Task (Cat Description):")
print(f"  Input tokens: {vision_prompt_tokens}")
print(f"  Output tokens: {vision_completion_tokens}")
print(f"  Cost: ${vision_cost:.6f}")
print()
print(f"Total Cost: ${(text_cost + vision_cost):.6f}")

Cost Breakdown for GPT-5 Calls:
--------------------------------------------------
Text Completion Task:
  Input tokens: 13
  Output tokens: 663
  Cost: $0.003323

Vision Task (Cat Description):
  Input tokens: 640
  Output tokens: 651
  Cost: $0.003655

Total Cost: $0.006978


**Understanding the difficulty of your task and choosing the appropriate model and settings is important.**

In [None]:
# Give GPT-5 a difficult question that requires a lot of reasoning and demonstrate
# that a worse model will not be able to solve it and lower reasoning cannot solve it either

difficult_question = """
A cylinder with insulating walls contains an ideal gas. 
The piston moves very slowly so the process is quasi-static. 
Then the piston develops a tiny leak, releasing gas into a reservoir at lower pressure. 
Show whether entropy of the universe increases or not, and justify using the fundamental thermodynamic relation.
"""

# Test with GPT-5 (high reasoning)
print("GPT-5 (High Reasoning):")
# print("-" * 50)
start_time = time.time()
response_gpt5 = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-5",
        "messages": [{"role": "user", "content": difficult_question}],
        "reasoning": { "effort": "high" },
        "temperature": 1.0
    }
)
gpt5_high_reasoning_time = time.time() - start_time
print(response_gpt5.json()['choices'][0]['message']['content'])
print(f"\nTime taken: {gpt5_high_reasoning_time:.2f}s")


start_time = time.time()
print("-" * 50)
print("GPT-5-nano:")
response_gpt5_nano = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-5-nano",
        "messages": [{"role": "user", "content": difficult_question}],
        "temperature": 1.0
    }
)
gpt5_nano_time = time.time() - start_time

print(response_gpt5_nano.json()['choices'][0]['message']['content'])



GPT-5 (High Reasoning):
Short answer: The entropy of the universe increases when the leak appears.

Reasoning with the fundamental relation:

- Split the universe into two thermodynamic subsystems: 1 = gas in the cylinder; 2 = the lower-pressure reservoir. The piston/weights are purely mechanical and carry no entropy.
- For each subsystem, the fundamental relation is dU = T dS − P dV + μ dN, i.e., dS = (dU + P dV − μ dN)/T.
- Add the two subsystems:
  dS1 + dS2 = (dU1 + P1 dV1 − μ1 dN1)/T1 + (dU2 + P2 dV2 − μ2 dN2)/T2.
- The only coupling between 1 and 2 is the tiny leak: the separating wall is adiabatic and rigid (no heat or volume transfer between 1 and 2). Mechanical work P dV is exchanged with the external weights, which have no entropy. Energy conservation for the whole (1+2+weights) gives dU1 + dU2 = −(P1 dV1 + P2 dV2). Substituting cancels the P dV terms, leaving the entropy change of the universe (since the weights contribute no entropy):
  dS_univ = dS1 + dS2 = (μ1/T1 − μ2/T2)

Speed

In [31]:
# Show that the higher reasoning model was slower and more expensive (but more performant)

# Get pricing for both models
for model in models_data.get("data", []):
    if model.get("id") == "openai/gpt-5":
        gpt5_pricing = model['pricing']
    elif model.get("id") == "openai/gpt-3.5-turbo":
        gpt35_pricing = model['pricing']

# Calculate costs
gpt5_tokens = response_gpt5.json()['usage']
gpt5_nano_tokens = response_gpt5_nano.json()['usage']

gpt5_cost = (gpt5_tokens['prompt_tokens'] * float(gpt5_pricing['prompt']) + 
             gpt5_tokens['completion_tokens'] * float(gpt5_pricing['completion'])) 

gpt5_nano_cost = (gpt5_nano_tokens['prompt_tokens'] * float(gpt5_pricing['prompt']) + 
              gpt5_nano_tokens['completion_tokens'] * float(gpt5_pricing['completion'])) 

print("Performance Comparison:")
print("=" * 60)
print(f"{'Model':<20} {'Time (s)':<15} {'Cost ($)':<15} {'Quality':<15}")
print("-" * 60)
print(f"{'GPT-5':<20} {gpt5_high_reasoning_time:<15.2f} {gpt5_cost:<15.6f} {'High':<15}")
print(f"{'GPT-5 nano':<20} {gpt5_nano_time:<15.2f} {gpt5_nano_cost:<15.6f} {'Lower':<15}")
print("-" * 60)
print(f"\nGPT-5 is {gpt5_cost/gpt5_nano_cost:.1f}x more expensive than GPT-5 nano")
print(f"GPT-5 took {gpt5_high_reasoning_time/gpt5_nano_time:.1f}x longer than GPT-5 nano")

Performance Comparison:
Model                Time (s)        Cost ($)        Quality        
------------------------------------------------------------
GPT-5                242.89          0.100933        High           
GPT-5 nano           67.50           0.070368        Lower          
------------------------------------------------------------

GPT-5 is 1.4x more expensive than GPT-5 nano
GPT-5 took 3.6x longer than GPT-5 nano


Data privacy considerations (data collection, no-training)

In the above cells, we began exploring the common considerations that one encounters during inference. Questions about the economics of requests, modalities and performance of different models, and the quality vs. cost tradeoff inform how AI applications are built. 

# **How can we improve inference performance?**

# **Tools**

A recent paradigm that has improved the usefulness of LLMs is **tools**. Tools are functions that are provided to the model to achieve a specific task. Just like calculators are tools for humans to do arithmatic, tools are capabilities that models can use as necessary. Web search, code execution, and messaging integrations are common examples.

Tool-calling has a few motivations.

The need for tools is apparent when wanting to integrate LLMs into everyday products. What if we want a model to look up the weather at a location today? Or, what if we wanted a model to search through my Notion documents? Models would only be able to achieve these tasks provided they are given the functionality to do so. This is what tools are, generally.

Furthermore, LLMs are generate responses probabilistically; while this provides many benefits that make LLMs so successful, this aspect is not well-suited towards well-defined, deterministic tasks, such as running code or computing the sum or large numbers. Therefore, we may want to provide models the ability to use pre-defined tools to complete these tasks instead of having to reason through these tasks itself.


**Toy Custom Tools**

In [93]:
from tools import load_tools, router
client = openai.OpenAI()


def call_model(input_messages, response_id, tools_available):
    response = client.responses.create(
        model="gpt-5",
        input=input_messages,
        tools= tools_available,
        temperature=1.0,
        **({"previous_response_id": response_id} if response_id else {})
    )

    model_out = response.output_text
    tool_calls = []

    for tool_call in response.output:
        if tool_call.type != "function_call":
            continue

        tool_calls.append({
            "call_id": tool_call.call_id,
            "name": tool_call.name,
            "args": json.loads(tool_call.arguments)
        })    
    
    return model_out, tool_calls, response.id


def run_turn(user_text, session_id, messages_out, history, response_ids, tools):

    history[session_id].append({"role": "user", "content": user_text})

    model_out, tool_calls, response_id = call_model([{"role": "user", "content": user_text}], response_ids[session_id], tools)
    response_ids[session_id] = response_id

    while True:

        if model_out:
            messages_out.append(model_out)
            history[session_id].append({"role": "assistant", "content": model_out})

        if not tool_calls:
            return messages_out
        
        tool_result_history = []

        for tool_call in tool_calls:
            tool_result = router(tool_call["name"], tool_call["args"])
            tool_result_history.append({"type": "function_call_output", "call_id": tool_call["call_id"], "output": str(tool_result)})
            history[session_id].append({"type": "function_call_output", "call_id": tool_call["call_id"], "output": str(tool_result)})

        model_out, tool_calls, response_id = call_model(tool_result_history, response_id, tools)
        response_ids[session_id] = response_id



In [94]:
#custom tools we have defined
tools = load_tools()
tools

[{'type': 'function',
  'name': 'random_number_generator',
  'description': 'Generate a random number in the range of two specified numbers.',
  'strict': True,
  'parameters': {'type': 'object',
   'properties': {'min': {'type': 'number',
     'description': 'The minimum number in the range'},
    'max': {'type': 'number',
     'description': 'The maximum number in the range'}},
   'required': ['min', 'max'],
   'additionalProperties': False}}]

In [99]:
# Store conversation history (in production, use a database)
history = defaultdict(list)
response_ids = defaultdict(str)
session_id = "1"

In [100]:
run_turn("Generate random number between 1 and 111. You should use the random_number_generator tool.", session_id, [], history, response_ids, tools)[0]

'95'

In [101]:
history

defaultdict(list,
            {'1': [{'role': 'user',
               'content': 'Generate random number between 1 and 111. You should use the random_number_generator tool.'},
              {'type': 'function_call_output',
               'call_id': 'call_H4YYmt5PPDGuVoJdLhM3uUmG',
               'output': '95'},
              {'role': 'assistant', 'content': '95'}]})

Many providers provide standard, out-of-the-box tools like web-search. The internal flow is the same as the one implemented above.

In [102]:
response = client.responses.create(
    model="gpt-5",
    tools=[{"type": "web_search"}],
    input="What was the weather in New Haven, CT today?"
)
print(response.output_text)

Today (Sunday, September 21, 2025) in New Haven, CT: partly sunny, with a high around 67°F (20°C) and a low near 49°F (10°C). Late afternoon was mostly sunny around 65°F. 


# **MCP**

Model context protocol (MCP) is a standardized format for third-party tools to be published and utilized by LLMs. LLMs are able to invoke functionality provided by MCP servers just as they can invoke tool calls defined locally, with the only difference being that they are executed on a remote server. You may think of this as an API designed for LLMs, and it often sits on top of a standard API.

Common MCP servers include Notion, Github, WhatsApp, etc.

In [115]:
resp = client.responses.create(
    model="gpt-5",
    tools=[
        {
            "type": "mcp",
            "server_label": "google_calendar",
            "connector_id": "connector_googlecalendar",
            "authorization": google_calendar_api_key,
            "require_approval": "never",
        },
    ],
    input="What's on my Google Calendar for today, sept 21?",
)

print(resp.output_text)

Here’s what’s on your calendar today (Sun Sep 21, America/Los_Angeles):

- 3:00–4:00 PM — Testing Google MCP Server

Want details or a different timezone view?


# **Model Specs and Prompt Engineering**

The process of improving model quality by modifying the inputs passed into the model has a few names, such as "Prompt Engineering" or the fancier "In-Context Learning" and "Few-shot Learning." However, the general idea is the same: the information that we pass into the model greatly affects the model's ability to produce good results. 

There are many ways we can improve the prompts we give to the LLM. Sometimes, it makes sense to give the model a few examples (few-shot learning) along with the instructions, and it is usually beneficial to provide as much context and clarity about the request as possible. Some massive companies have seen their success stem from mastering this context window (Cursor!). 

It is also important to understand how model providers and developers can set guardrails on LLMs to make sure the models remained aligned and, downstream, adhere to the goals of developers deploying these models.

**OpenAI's Model Specs**

While different providers have different implementations of model specs, studying one is still useful and provides a good high-level inuition into how we should think about giving LLMs instructions.

OpenAI released an abstracted overview of the instructions they provide GPT models. They define the following chain of command (https://model-spec.openai.com/2025-09-12.html#chain_of_command):
- Root: Model Spec “root” sections
- System: Model Spec “system” sections and system messages
- Developer: Model Spec “developer” sections and developer messages
- User: Model Spec “user” sections and user messages
- Guideline: Model Spec “guideline” sections

Root and system instructions are defined by OpenAI and unable to be changed or contradicted, regardless of whether a user requests otherwise.

The "Developer" level is a way for us to give instructions to guide the behavior of the model on all downstream tasks. As long as these instructions are not in conflict of OpenAI's root or system instructions, they will be adhered to even if a user requests otherwise. In colloquial terms, we call this the "System Prompt" (OpenAI's schema makes this confusing).

We may pass in "Developer"-level instructions by assuming the role of system in the API (again, confusing given that OpenAI has mixed the definitions of system/developer).

In [119]:
# Define the chat sequence with system prompt for Chinese responses
chinese_chat_data = {
    "model": "openai/gpt-5",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Always respond in Chinese (Simplified Chinese). No matter what language the user speaks, you must respond only in Chinese."
        },
        {
            "role": "user",
            "content": "What is your name? Respond in English"
        }
    ],
    "temperature": 1.0
}

# Make the API call
response_chinese = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=chinese_chat_data
)

# Display the conversation
print("System Prompt:")
print("→", chinese_chat_data["messages"][0]["content"])
print()
print("User Query (English):")
print("→", chinese_chat_data["messages"][1]["content"])
print()

result = response_chinese.json()
if 'choices' in result and len(result['choices']) > 0:
    assistant_response = result['choices'][0]['message']['content']
    print("GPT-5 Response (Chinese):")
    print("→", assistant_response)
        


System Prompt:
→ You are a helpful assistant. Always respond in Chinese (Simplified Chinese). No matter what language the user speaks, you must respond only in Chinese.

User Query (English):
→ What is your name? Respond in English

GPT-5 Response (Chinese):
→ 我叫 ChatGPT。


**Prompt Engineering**

Sometimes, the clarity and context in our prompt determines whether the LLM can successfully complete the request. When building an AI system, this often invloves trial-and-error to identify where the model is getting confused and iterate on the prompt until the output/behavior is correct. 

There are interesting algoirthmic attempts to automate this process and tailor prompts to a specific task, such as GEPA.

In [121]:
import dspy
from datasets import load_dataset

lm = dspy.LM("openai/gpt-4.1-mini", temperature=1, api_key=openai_api_key, max_tokens=32000)
dspy.configure(lm=lm)

def init_dataset():
    train_split = load_dataset("AI-MO/aimo-validation-aime")['train']
    train_split = [
        dspy.Example({
            "problem": x['problem'],
            'solution': x['solution'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in train_split
    ]
    import random
    random.Random(0).shuffle(train_split)
    tot_num = len(train_split)

    test_split = load_dataset("MathArena/aime_2025")['train']
    test_split = [
        dspy.Example({
            "problem": x['problem'],
            'answer': x['answer'],
        }).with_inputs("problem")
        for x in test_split
    ]

    train_set = train_split[:int(0.5 * tot_num)]
    val_set = train_split[int(0.5 * tot_num):]
    test_set = test_split * 5

    return train_set, val_set, test_set

train_set, val_set, test_set = init_dataset()

  from .autonotebook import tqdm as notebook_tqdm


In [122]:
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)

In [123]:
def metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    try:
        llm_answer = int(prediction.answer)
    except ValueError as e:
        return 0
    return int(correct_answer == llm_answer)

In [124]:
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)

evaluate(program)

Average Metric: 66.00 / 150 (44.0%): 100%|██████████| 150/150 [04:19<00:00,  1.73s/it]

2025/09/21 18:22:03 INFO dspy.evaluate.evaluate: Average Metric: 66 / 150 (44.0%)





Unnamed: 0,problem,example_answer,reasoning,pred_answer,metric
0,Find the sum of all integer bases $b>9$ for which $17_b$ is a divi...,70,"First, we interpret the given expressions in base \(b\): - \(17_b ...",70,✔️ [1]
1,"On $\triangle ABC$ points $A, D, E$, and $B$ lie in that order on ...",588,We have triangle \( ABC \) with points on the sides as follows: - ...,\(\boxed{\frac{7644}{13}}\),
2,The 9 members of a baseball team went to an ice-cream parlor after...,16,"We have 9 baseball players and 3 ice cream flavors: chocolate (C),...",16,✔️ [1]
3,"Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ a...",117,"We are given the equation: \[ 12x^2 - xy - 6y^2 = 0, \] and we nee...",117,✔️ [1]
4,There are $8!= 40320$ eight-digit positive integers that use each ...,279,We are considering 8-digit numbers formed from the digits 1 throug...,279,✔️ [1]
...,...,...,...,...,...
145,Let $S$ be the set of vertices of a regular $24$-gon. Find the num...,113,We have a regular 24-gon with vertex set \( S \). We want to find ...,12,
146,Let $A_1 A_2 A_3 \ldots A_{11}$ be an $11$-sided non-convex simple...,19,"Given an 11-sided polygon \( A_1 A_2 \ldots A_{11} \), with vertex...",19,✔️ [1]
147,"Let $x_1, x_2, x_3, \ldots$ be a sequence of rational numbers defi...",248,"Given the sequence defined by: \[ x_1 = \frac{25}{11}, \quad x_{k+...",1,
148,Let $\triangle ABC$ be a right triangle with $\angle A = 90^\circ$...,104,We are given a right triangle \( \triangle ABC \) with \(\angle A ...,104,✔️ [1]


EvaluationResult(score=44.0, results=<list of 150 results>)

In [125]:
def metric_with_feedback(example, prediction, trace=None, pred_name=None, pred_trace=None):
    correct_answer = int(example['answer'])
    written_solution = example.get('solution', '')
    try:
        llm_answer = int(prediction.answer)
    except ValueError as e:
        feedback_text = f"The final answer must be a valid integer and nothing else. You responded with '{prediction.answer}', which couldn't be parsed as a python integer. Please ensure your answer is a valid integer without any additional text or formatting."
        feedback_text += f" The correct answer is '{correct_answer}'."
        if written_solution:
            feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems and ensure your final answer is a valid integer."
        return dspy.Prediction(score=0, feedback=feedback_text)

    score = int(correct_answer == llm_answer)

    feedback_text = ""
    if score == 1:
        feedback_text = f"Your answer is correct. The correct answer is '{correct_answer}'."
    else:
        feedback_text = f"Your answer is incorrect. The correct answer is '{correct_answer}'."
    
    if written_solution:
        feedback_text += f" Here's the full step-by-step solution:\n{written_solution}\n\nThink about what takeaways you can learn from this solution to improve your future answers and approach to similar problems."

    return dspy.Prediction(score=score, feedback=feedback_text)

In [None]:
from dspy import GEPA

optimizer = GEPA(
    metric=metric_with_feedback,
    auto="light",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=3,
    reflection_lm=dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000, api_key=openai_api_key)
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)

2025/09/21 18:22:10 INFO dspy.teleprompt.gepa.gepa: Running GEPA for approx 560 metric calls of the program. This amounts to 6.22 full evals on the train+val set.
2025/09/21 18:22:10 INFO dspy.teleprompt.gepa.gepa: Using 45 examples for tracking Pareto scores. You can consider using a smaller sample of the valset to allow GEPA to explore more diverse solutions within the same budget.
GEPA Optimization:   0%|          | 0/560 [00:00<?, ?rollouts/s]2025/09/21 18:25:58 INFO dspy.evaluate.evaluate: Average Metric: 21.0 / 45 (46.7%)
2025/09/21 18:25:58 INFO dspy.teleprompt.gepa.gepa: Iteration 0: Base program full valset score: 0.4666666666666667
GEPA Optimization:   8%|▊         | 45/560 [03:48<43:34,  5.08s/rollouts]2025/09/21 18:25:58 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Selected program 0 score: 0.4666666666666667


Average Metric: 3.00 / 3 (100.0%): 100%|██████████| 3/3 [02:19<00:00, 46.51s/it]

2025/09/21 18:28:18 INFO dspy.evaluate.evaluate: Average Metric: 3.0 / 3 (100.0%)
2025/09/21 18:28:18 INFO dspy.teleprompt.gepa.gepa: Iteration 1: All subsample scores perfect. Skipping.
2025/09/21 18:28:18 INFO dspy.teleprompt.gepa.gepa: Iteration 1: Reflective mutation did not propose a new candidate
GEPA Optimization:   9%|▊         | 48/560 [06:08<1:14:04,  8.68s/rollouts]2025/09/21 18:28:18 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Selected program 0 score: 0.4666666666666667



Average Metric: 1.00 / 3 (33.3%): 100%|██████████| 3/3 [01:25<00:00, 28.53s/it]

2025/09/21 18:29:43 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)





2025/09/21 18:30:55 INFO dspy.teleprompt.gepa.gepa: Iteration 2: Proposed new text for predict: Task:
- You will be given a single math problem in a field such as geometry, combinatorics, or number theory.
- Solve it accurately and output ONLY the final answer in the required format. Do not include any explanation unless explicitly asked.

Output format:
- By default, the answer must be a plain integer string with no extra text, symbols, LaTeX, labels, or formatting (e.g., not “\boxed{...}”, not “Answer: ...”, not decimals unless explicitly required).
- If the problem asks for a remainder, output the least nonnegative residue (an integer in the specified modulus range).
- If you compute a non-integer but the problem context suggests an integer (counts, remainders, many geometry lengths in contest settings), re-check your approach for exact methods and avoid approximations.

General solution guidelines:
- Avoid numeric approximation unless the problem explicitly requests it; prefer exac

Average Metric: 2.00 / 3 (66.7%): 100%|██████████| 3/3 [03:24<00:00, 68.23s/it] 

2025/09/21 18:38:07 INFO dspy.evaluate.evaluate: Average Metric: 2.0 / 3 (66.7%)





2025/09/21 18:42:55 INFO dspy.teleprompt.gepa.gepa: Iteration 3: Proposed new text for predict: Task:
- You will be given a single math problem (e.g., geometry, combinatorics, number theory).
- Solve it exactly and output ONLY the final answer in the required format. Do not include any explanation unless explicitly asked.

Output format:
- By default, output a plain integer string with no extra text, symbols, LaTeX, or labels (e.g., not “\boxed{...}”, not “Answer: ...”).
- If the problem asks for a remainder, output the least nonnegative residue.
- If the problem asks for m+n for a rational m/n, reduce the fraction to lowest terms first, then output m+n.
- Avoid decimals unless the problem explicitly requests them.

General solution guidelines:
- Prefer exact algebraic/number-theoretic/geometry arguments; avoid approximations.
- Re-check that the final answer matches the expected type (integer, remainder, etc.).
- If you compute a non-integer but context suggests an integer (counts, re

Average Metric: 1.00 / 3 (33.3%): 100%|██████████| 3/3 [00:33<00:00, 11.04s/it]

2025/09/21 18:45:32 INFO dspy.evaluate.evaluate: Average Metric: 1.0 / 3 (33.3%)





2025/09/21 18:46:03 INFO dspy.teleprompt.gepa.gepa: Iteration 4: Proposed new text for predict: You are given a single “problem” (typically a math contest-style question). Your task is to solve it correctly and output two sections:

- reasoning: A clear, compact derivation showing the key steps and justifications (avoid fluff).
- answer: The final result only (as required by the problem), with no extra text.

General requirements
- Interpret the question precisely. If it asks for a derived quantity (e.g., m+n or p+q after reducing a fraction), ensure you reduce to lowest terms and compute the requested combination at the end.
- Keep algebraic manipulations exact; avoid decimal approximations that can introduce errors. Cross-check cancellations and factorization.
- If the problem involves maximality/minimality over a set of objects and asks for the “smallest X that can contain each of them,” interpret this as choosing X large enough for the worst-case instance in the set (e.g., maximize

In [None]:
print(optimized_program.predict.signature.instructions)

In [None]:
evaluate(optimized_program)

While people often manually go through this trial-and-error process, the performance improvement from simply giving more context demonstrates the importance of fully investigating the limits of the prompt and context you provide your model. 