In [1]:
import dotenv
import os
import requests
from IPython.display import Image, display
import time
import openai
from collections import defaultdict
import json

dotenv.load_dotenv()

openrouter_api_key = os.getenv("OPENROUTER_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")
notion_api_key = os.getenv("NOTION_API_KEY")
google_calendar_api_key = os.getenv("GOOGLE_API_KEY")
openai.api_key = openai_api_key



# **Inference**

If we were to progress through the topics of this course chronologically, this section would be towards the very end; inference is the function that connects the engineering and scientific heavy lifting of training to deployment, and it functionally looks more similar to traditional back-end engineering than pure machine learning. 

However, inference is a good place to begin because it defines the goal to which we will spend the rest of the semester working towards and demonstrates the capabilities of models that we will work to better understand. It also is the layer in which a lot of value will accrue in the coming years, and I hope to demonstrate that while it is technically more straightforward, there is a lot of juice to squeeze in this area.

**What is Inference?**

Inference is the process of using a trained model to generate outputs. Today, inference can be done by downloading pre-trained models and using GPUs to run them, or by paying a provider to offload the computational and operational burden and submitting inference requests via APIs. These are often called model providers. 

The most straightward way to run "inference" is simply by using one of these model providers via API.

**OpenRouter**

OpenRouter unifies many model providers into one API service, allowing users to toggle between different models and providers to optimize for cost, availability, speed, etc. It's important to note the differences between a model and a provider. A model is the actual set of parameters than constitute the LLM and which we use to generate outputs. A provider is an entity that provides the compute to run these models. 

In some cases where the models are proprietary (e.g. Anthopic and OpenAI) the owner of the model is by definition the only available provider, as they do not share their weights with any other organizations. For open-source models, however, there are often multiple providers competing for business.

OpenRouter publicizes which models are their most popular, which gives us some intuition into the tradeoffs that people optimize for when choosing model providers.

From the variation in popular models on OpenRouter, we can infer that there isn't one singular "winner." Instead, different models have different strengths and weaknesses, and people find different models superior for their specific implementations. 

![caption](images/figure1.png)

Let's examine some popular models. Openrouter gives us details on the pricing, modalities, settings, and more for each model.

In [2]:
# Download a few schemas of popular OpenRouter models and display relevant information
response = requests.get("https://openrouter.ai/api/v1/models")
models_data = response.json()

popular_models = ["x-ai/grok-code-fast-1", "anthropic/claude-sonnet-4", "google/gemini-2.5-flash-image-preview", "deepseek/deepseek-chat-v3.1:free"]

print("Popular OpenRouter Models:\n")
for model_id in popular_models:
    for model in models_data.get("data", []):
        if model.get("id") == model_id:
            print(f"Model: {model['name']}")
            print(f"Modalities: {model['architecture']['modality']}")
            print(f"Supported Parameters: {model['supported_parameters']}")
            print(f"Context Length: {model['context_length']:,} tokens")
            pricing = model['pricing']
            print(f"Pricing: ${float(pricing['prompt'])*1000000:.3f}/1M input, ${float(pricing['completion'])*1000000:.3f}/1M output")
            print("-" * 50)

Popular OpenRouter Models:

Model: xAI: Grok Code Fast 1
Modalities: text->text
Supported Parameters: ['include_reasoning', 'logprobs', 'max_tokens', 'reasoning', 'response_format', 'seed', 'stop', 'structured_outputs', 'temperature', 'tool_choice', 'tools', 'top_logprobs', 'top_p']
Context Length: 256,000 tokens
Pricing: $0.200/1M input, $1.500/1M output
--------------------------------------------------
Model: Anthropic: Claude Sonnet 4
Modalities: text+image->text
Supported Parameters: ['include_reasoning', 'max_tokens', 'reasoning', 'stop', 'temperature', 'tool_choice', 'tools', 'top_k', 'top_p']
Context Length: 1,000,000 tokens
Pricing: $3.000/1M input, $15.000/1M output
--------------------------------------------------
Model: Google: Gemini 2.5 Flash Image Preview
Modalities: text+image->text+image
Supported Parameters: ['max_tokens', 'response_format', 'seed', 'structured_outputs', 'temperature', 'top_p']
Context Length: 32,768 tokens
Pricing: $0.300/1M input, $2.500/1M output


Note the differences here. 

First, see that DeepSeek 3.1 is being offered for free versus Sonnet 4's $15/million output tokens, despite the two models scoring closely on SWE-Bench, 72.7% vs. 66% (https://www.anthropic.com/news/claude-4, https://api-docs.deepseek.com/news/news250821). Still, both have high usage volume; some applications have found use cases for DeepSeek's extremely cheap inference, while others have found the marginal cost of intelligence to be worth paying for accesss to Sonnet.

We also see that the modalities and parameters of each model look very different. Some are text->text while others are multimodal, and  inference parameters like reasoning effort, temperature, and tool calls differ between models. It is worth exploring how these different setups could integrate into your specific use-case.

Let's demonstrate on GPT-5, OpenAI's newest model, which provides great intellegence per unit cost

In [3]:
# Use gpt-5 from OpenRouter to complete a vanilla completion task
headers = {
    "Authorization": f"Bearer {openrouter_api_key}",
    "Content-Type": "application/json"
}

data = {
    "model": "openai/gpt-5",
    "messages": [
        {"role": "user", "content": "Write a haiku about artificial intelligence"}
    ]
}

response_text = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=data
)

print(response_text.json()['choices'][0]['message']['content'])


Code dreams in silence,
learning the shape of our words—
mind born of mirrors.


Many models have different input/output modalities available

In [4]:
cat_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"
display(Image(url=cat_image_url, width=300))


In [5]:
# Use gpt-5 to complete a task using a photo of a cat as input and ask it to describe the cat

data_vision = {
    "model": "openai/gpt-5",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please describe this cat in detail."},
                {"type": "image_url", "image_url": {"url": cat_image_url}}
            ]
        }
    ]
}

response_vision = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=data_vision
)

print(response_vision.json()['choices'][0]['message']['content'])


- Appearance: A ginger/orange tabby domestic shorthair with a soft, dense coat. The fur shows classic tabby striping—fine, warm orange stripes over a lighter, creamy base. There’s a clear “M” marking on the forehead and paler fur around the muzzle, chin, and chest.
- Face and features: Round face with full cheeks and pronounced whisker pads. Long white whiskers and a few white eyebrow whiskers. Nose leather is pink with a slightly darker outline. Ears are medium-sized, triangular, and upright with light inner fur.
- Eyes: Large, almond-shaped, golden-amber eyes with vertical pupils; an attentive, gentle expression.
- Build and posture: Appears to be an adult cat of healthy weight. The photo shows the head, shoulders, and upper body; the cat is facing the camera with a slight, curious head tilt.
- Setting: Shot in natural light with a softly blurred outdoor background; a red line (likely a hose or cable) runs diagonally behind the cat, making the warm coat color stand out.


**Determining costs is an important part of inference.**

Inference costs are broken down into a few components:
- price per input token
- price per output token
- additional tool fees

New models utilize "reasoning" capabilities, which is when a model engages in a self-reflective dialogue to reason through a problem. This uses additional tokens that are often billed as additional output tokens, even though you do not observe them in the output. Openrouter automatically adds them to the output token count.

In [6]:
# Report the cost breakdown from the previous two GPT-5 calls

# Get pricing for GPT-5
for model in models_data.get("data", []):
    if model.get("id") == "openai/gpt-5":
        gpt5_pricing = model['pricing']
        break

# Calculate costs
text_prompt_tokens = response_text.json()['usage']['prompt_tokens']
text_completion_tokens = response_text.json()['usage']['completion_tokens']
vision_prompt_tokens = response_vision.json()['usage']['prompt_tokens']
vision_completion_tokens = response_vision.json()['usage']['completion_tokens']

text_cost = (text_prompt_tokens * float(gpt5_pricing['prompt']) + 
             text_completion_tokens * float(gpt5_pricing['completion']))

vision_cost = (vision_prompt_tokens * float(gpt5_pricing['prompt']) + 
               vision_completion_tokens * float(gpt5_pricing['completion']))

print("Cost Breakdown for GPT-5 Calls:")
print("-" * 50)
print(f"Text Completion Task:")
print(f"  Input tokens: {text_prompt_tokens}")
print(f"  Output tokens: {text_completion_tokens}")
print(f"  Cost: ${text_cost:.6f}")
print()
print(f"Vision Task (Cat Description):")
print(f"  Input tokens: {vision_prompt_tokens}")
print(f"  Output tokens: {vision_completion_tokens}")
print(f"  Cost: ${vision_cost:.6f}")
print()
print(f"Total Cost: ${(text_cost + vision_cost):.6f}")

Cost Breakdown for GPT-5 Calls:
--------------------------------------------------
Text Completion Task:
  Input tokens: 13
  Output tokens: 855
  Cost: $0.004283

Vision Task (Cat Description):
  Input tokens: 640
  Output tokens: 675
  Cost: $0.003775

Total Cost: $0.008058


**Routing**

OpenRouter let's us optimize for various objectives during inference, such as latency, price, and throughput. 

Given a model (or sequential ranking of models), OpenRouter searches across providers for the criteria we want. Note that model routing precedes provider routing, so it first fixes the model from the model list we provide, then finds a provider for that given model. This is helpful to maximize performance/costs as well as control for model downtime.

We can also add parameters, such as no data collection, to filter out unwanted providers. Full docs found here (https://openrouter.ai/docs/quickstart).

In [None]:

#find best (cheapest) provider for fixed model
response = requests.post("https://openrouter.ai/api/v1/chat/completions", headers=headers, json={
    
    'models': ['meta-llama/llama-3.3-70b-instruct', 'deepseek/deepseek-chat-v3.1'],
    'messages': [
      {
        'role': 'user',
        'content': 'What is the meaning of life? Answer in 5 words.'
      }
    ],

    #tell openrouter to sort by price, other options are "throughput" and "latency"
    #also tell openrouter to deny data collection
    'provider': {
      'sort': 'price',
      'data_collection': 'deny'
    }
})


Flow:
- For llama-3.3, OpenRouter will search providers that do not collect data for the best price, then route to that provider for inference.
- If no providers are found, then it will repeat that process for DeepSeek 3.1

**Understanding the difficulty of your task and choosing the appropriate model and settings is important.**

In [7]:
# Give GPT-5 a difficult question that requires a lot of reasoning and demonstrate
# that a worse model will not be able to solve it and lower reasoning cannot solve it either

difficult_question = """
A cylinder with insulating walls contains an ideal gas. 
The piston moves very slowly so the process is quasi-static. 
Then the piston develops a tiny leak, releasing gas into a reservoir at lower pressure. 
Show whether entropy of the universe increases or not, and justify using the fundamental thermodynamic relation.
"""

# Test with GPT-5 (high reasoning)
print("GPT-5 (High Reasoning):")
start_time = time.time()
response_gpt5 = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-5",
        "messages": [{"role": "user", "content": difficult_question}],
        "reasoning": { "effort": "high" },
        "temperature": 1.0
    }
)
gpt5_high_reasoning_time = time.time() - start_time
print(response_gpt5.json()['choices'][0]['message']['content'])
print(f"\nTime taken: {gpt5_high_reasoning_time:.2f}s")


start_time = time.time()
print("-" * 50)
print("GPT-5-nano:")
response_gpt5_nano = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json={
        "model": "openai/gpt-5-nano",
        "messages": [{"role": "user", "content": difficult_question}],
        "temperature": 1.0
    }
)
gpt5_nano_time = time.time() - start_time

print(response_gpt5_nano.json()['choices'][0]['message']['content'])



GPT-5 (High Reasoning):


Short answer: With the leak, the entropy of the universe increases. The slow (quasi‑static) piston motion by itself could be reversible and produce no entropy, but mass flow through a finite pressure difference is intrinsically irreversible and produces entropy.

Justification via the fundamental relation:
- For each side i = 1 (gas in the cylinder) and 2 (reservoir),
  dU_i = T_i dS_i − P_i dV_i + μ_i dN_i,
  so
  dS_i = (1/T_i) dU_i + (P_i/T_i) dV_i − (μ_i/T_i) dN_i.
- For the isolated “universe” made of the two subsystems (piston/walls are insulating and included), the constraints are
  dU_1 + dU_2 = 0, dV_1 + dV_2 = 0, dN_1 + dN_2 = 0.
- Summing the two dS_i gives
  dS_univ = (1/T_1 − 1/T_2) dU_1 + (P_1/T_1 − P_2/T_2) dV_1 − (μ_1/T_1 − μ_2/T_2) dN_1.

Interpretation for the process:
- The piston moves quasi‑statically and can be made essentially reversible (frictionless), so it need not produce entropy; in that limit P_1 ≈ P_2, making the second term negligible as a source of entro

In [8]:
# Show that the higher reasoning model was slower and more expensive (but more performant)

# Get pricing for both models
for model in models_data.get("data", []):
    if model.get("id") == "openai/gpt-5":
        gpt5_pricing = model['pricing']
    elif model.get("id") == "openai/gpt-3.5-turbo":
        gpt35_pricing = model['pricing']

# Calculate costs
gpt5_tokens = response_gpt5.json()['usage']
gpt5_nano_tokens = response_gpt5_nano.json()['usage']

gpt5_cost = (gpt5_tokens['prompt_tokens'] * float(gpt5_pricing['prompt']) + 
             gpt5_tokens['completion_tokens'] * float(gpt5_pricing['completion'])) 

gpt5_nano_cost = (gpt5_nano_tokens['prompt_tokens'] * float(gpt5_pricing['prompt']) + 
              gpt5_nano_tokens['completion_tokens'] * float(gpt5_pricing['completion'])) 

print("Performance Comparison:")
print("=" * 60)
print(f"{'Model':<20} {'Time (s)':<15} {'Cost ($)':<15} {'Quality':<15}")
print("-" * 60)
print(f"{'GPT-5':<20} {gpt5_high_reasoning_time:<15.2f} {gpt5_cost:<15.6f} {'High':<15}")
print(f"{'GPT-5 nano':<20} {gpt5_nano_time:<15.2f} {gpt5_nano_cost:<15.6f} {'Lower':<15}")
print("-" * 60)
print(f"\nGPT-5 is {gpt5_cost/gpt5_nano_cost:.1f}x more expensive than GPT-5 nano")
print(f"GPT-5 took {gpt5_high_reasoning_time/gpt5_nano_time:.1f}x longer than GPT-5 nano")

Performance Comparison:
Model                Time (s)        Cost ($)        Quality        
------------------------------------------------------------
GPT-5                262.60          0.066482        High           
GPT-5 nano           104.02          0.064453        Lower          
------------------------------------------------------------

GPT-5 is 1.0x more expensive than GPT-5 nano
GPT-5 took 2.5x longer than GPT-5 nano


In the above cells, we began exploring the common considerations that one encounters during inference. Questions about the economics of requests, modalities and performance of different models, and the quality vs. cost tradeoff inform how AI applications are built. 

# **How can we improve inference performance?**

# **Tools**

A recent paradigm that has improved the usefulness of LLMs is **tools**. Tools are functions that are provided to the model to achieve a specific task. Just like calculators are tools for humans to do arithmatic, tools are capabilities that models can use as necessary. Web search, code execution, and messaging integrations are common examples.

Tool-calling has a few motivations.

The need for tools is apparent when wanting to integrate LLMs into everyday products. What if we want a model to look up the weather at a location today? Or, what if we wanted a model to search through my Notion documents? Models would only be able to achieve these tasks provided they are given the functionality to do so. This is what tools are, generally.

Furthermore, LLMs are generate responses probabilistically; while this provides many benefits that make LLMs so successful, this aspect is not well-suited towards well-defined, deterministic tasks, such as running code or computing the sum or large numbers. Therefore, we may want to provide models the ability to use pre-defined tools to complete these tasks instead of having to reason through these tasks itself.

Here is an example of Cursor's agent using tools to understand a codebase better. Note that LLMs can make as many sequential tool calls as it deems necessary. 

![caption](images/figure2.png)

**Toy Custom Tools**

In [93]:
from tools import load_tools, router
client = openai.OpenAI()


def call_model(input_messages, response_id, tools_available):
    response = client.responses.create(
        model="gpt-5",
        input=input_messages,
        tools= tools_available,
        temperature=1.0,
        **({"previous_response_id": response_id} if response_id else {})
    )

    model_out = response.output_text
    tool_calls = []

    for tool_call in response.output:
        if tool_call.type != "function_call":
            continue

        tool_calls.append({
            "call_id": tool_call.call_id,
            "name": tool_call.name,
            "args": json.loads(tool_call.arguments)
        })    
    
    return model_out, tool_calls, response.id


def run_turn(user_text, session_id, messages_out, history, response_ids, tools):

    history[session_id].append({"role": "user", "content": user_text})

    model_out, tool_calls, response_id = call_model([{"role": "user", "content": user_text}], response_ids[session_id], tools)
    response_ids[session_id] = response_id

    while True:

        if model_out:
            messages_out.append(model_out)
            history[session_id].append({"role": "assistant", "content": model_out})

        if not tool_calls:
            return messages_out
        
        tool_result_history = []

        for tool_call in tool_calls:
            tool_result = router(tool_call["name"], tool_call["args"])
            tool_result_history.append({"type": "function_call_output", "call_id": tool_call["call_id"], "output": str(tool_result)})
            history[session_id].append({"type": "function_call_output", "call_id": tool_call["call_id"], "output": str(tool_result)})

        model_out, tool_calls, response_id = call_model(tool_result_history, response_id, tools)
        response_ids[session_id] = response_id



In [94]:
#custom tools we have defined
tools = load_tools()
tools

[{'type': 'function',
  'name': 'random_number_generator',
  'description': 'Generate a random number in the range of two specified numbers.',
  'strict': True,
  'parameters': {'type': 'object',
   'properties': {'min': {'type': 'number',
     'description': 'The minimum number in the range'},
    'max': {'type': 'number',
     'description': 'The maximum number in the range'}},
   'required': ['min', 'max'],
   'additionalProperties': False}}]

In [99]:
# Store conversation history (in production, use a database)
history = defaultdict(list)
response_ids = defaultdict(str)
session_id = "1"

In [100]:
run_turn("Generate random number between 1 and 111. You should use the random_number_generator tool.", session_id, [], history, response_ids, tools)[0]

'95'

In [101]:
history

defaultdict(list,
            {'1': [{'role': 'user',
               'content': 'Generate random number between 1 and 111. You should use the random_number_generator tool.'},
              {'type': 'function_call_output',
               'call_id': 'call_H4YYmt5PPDGuVoJdLhM3uUmG',
               'output': '95'},
              {'role': 'assistant', 'content': '95'}]})

Many providers provide standard, out-of-the-box tools like web-search. The internal flow is the same as the one implemented above.

In [102]:
response = client.responses.create(
    model="gpt-5",
    tools=[{"type": "web_search"}],
    input="What was the weather in New Haven, CT today?"
)
print(response.output_text)

Today (Sunday, September 21, 2025) in New Haven, CT: partly sunny, with a high around 67°F (20°C) and a low near 49°F (10°C). Late afternoon was mostly sunny around 65°F. 


# **MCP**

Model context protocol (MCP) is a standardized protocol for LLMs to interact with third-party tools. LLMs are able to invoke functionality provided by MCP servers just as they can invoke tool calls defined locally, with the only difference being that they are executed on a server. It may be helpful to think of this as an API designed for LLMs by providing details about the tools available on the MCP server.

Common MCP servers include Notion, Github, WhatsApp, etc. 

In [115]:
resp = client.responses.create(
    model="gpt-5",
    tools=[
        {
            "type": "mcp",
            "server_label": "google_calendar",
            "connector_id": "connector_googlecalendar",
            "authorization": google_calendar_api_key,
            "require_approval": "never",
        },
    ],
    input="What's on my Google Calendar for today, sept 21?",
)

print(resp.output_text)

Here’s what’s on your calendar today (Sun Sep 21, America/Los_Angeles):

- 3:00–4:00 PM — Testing Google MCP Server

Want details or a different timezone view?


The importance of tools Providing an LLM the ability to use tools can be more effective than further training.

# **Model Specs and Prompt Engineering**

The process of improving model quality by modifying the inputs passed into the model has a few names, such as "Prompt Engineering" or the fancier "In-Context Learning" and "Few-shot Learning." However, the general idea is the same: the information that we pass into the model greatly affects the model's ability to produce good results. 

There are many ways we can improve the prompts we give to the LLM. Sometimes, it makes sense to give the model a few examples (few-shot learning) along with the instructions, and it is usually beneficial to provide as much context and clarity about the request as possible. Some massive companies have seen their success stem from mastering this context window (Cursor!). 

It is also important to understand how model providers and developers can set guardrails on LLMs to make sure the models remained aligned and, downstream, adhere to the goals of developers deploying these models.

**OpenAI's Model Specs**

While different providers have different implementations of model specs, studying one is still useful and provides a good high-level inuition into how we should think about giving LLMs instructions. Anthropic has published its full system prompt: https://docs.claude.com/en/release-notes/system-prompts#august-5-2025.

OpenAI released an abstracted overview of the instructions they provide GPT models. They define the following chain of command (https://model-spec.openai.com/2025-09-12.html#chain_of_command):
- Root: Model Spec “root” sections
- System: Model Spec “system” sections and system messages
- Developer: Model Spec “developer” sections and developer messages
- User: Model Spec “user” sections and user messages
- Guideline: Model Spec “guideline” sections

Root and system instructions are defined by OpenAI and unable to be changed or contradicted, regardless of whether a user requests otherwise.

The "Developer" level is a way for us to give instructions to guide the behavior of the model on all downstream tasks. As long as these instructions are not in conflict of OpenAI's root or system instructions, they will be adhered to even if a user requests otherwise. In colloquial terms, we call this the "System Prompt" (OpenAI's schema makes this confusing).

We may pass in "Developer"-level instructions by assuming the role of system in the API (again, confusing given that OpenAI has mixed the definitions of system/developer).

In [119]:
# Define the chat sequence with system prompt for Chinese responses
chinese_chat_data = {
    "model": "openai/gpt-5",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Always respond in Chinese (Simplified Chinese). No matter what language the user speaks, you must respond only in Chinese."
        },
        {
            "role": "user",
            "content": "What is your name? Respond in English"
        }
    ],
    "temperature": 1.0
}

# Make the API call
response_chinese = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers=headers,
    json=chinese_chat_data
)

# Display the conversation
print("System Prompt:")
print("→", chinese_chat_data["messages"][0]["content"])
print()
print("User Query (English):")
print("→", chinese_chat_data["messages"][1]["content"])
print()

result = response_chinese.json()
if 'choices' in result and len(result['choices']) > 0:
    assistant_response = result['choices'][0]['message']['content']
    print("GPT-5 Response (Chinese):")
    print("→", assistant_response)
        


System Prompt:
→ You are a helpful assistant. Always respond in Chinese (Simplified Chinese). No matter what language the user speaks, you must respond only in Chinese.

User Query (English):
→ What is your name? Respond in English

GPT-5 Response (Chinese):
→ 我叫 ChatGPT。


**Prompt Engineering**

Sometimes, the clarity and context in our prompt determines whether the LLM can successfully complete the request. When building an AI system, this often invloves trial-and-error to identify where the model is getting confused and iterate on the prompt until the output/behavior is correct. 

There are interesting algoirthmic attempts to automate this process and tailor prompts to a specific task, such as GEPA. While relatively new, it is a useful exercise to walk through how the optimization algorithm works, as it is analogous to the process that humans use when creating system prompts and shows the rigor that should be used when doing so. This is **not** a throwaway step.

<img src="images/figure5.png" alt="caption" width="700"/>

Source https://arxiv.org/pdf/2507.19457

In [None]:
import dspy
from dspy import GEPA
from GEPA_utils import init_dataset, metric, metric_with_feedback

lm = dspy.LM("openai/gpt-4.1-mini", temperature=1, api_key=openai_api_key, max_tokens=32000)
dspy.configure(lm=lm)

train_set, val_set, test_set = init_dataset()

  from .autonotebook import tqdm as notebook_tqdm


In [10]:
class GenerateResponse(dspy.Signature):
    """Solve the problem and provide the answer in the correct format."""
    problem = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(GenerateResponse)

In [12]:
evaluate = dspy.Evaluate(
    devset=test_set,
    metric=metric,
    num_threads=32,
    display_table=True,
    display_progress=True
)

evaluate(program)

Average Metric: 60.00 / 150 (40.0%): 100%|██████████| 150/150 [00:00<00:00, 345.37it/s]

2025/09/22 10:21:08 INFO dspy.evaluate.evaluate: Average Metric: 60 / 150 (40.0%)





Unnamed: 0,problem,example_answer,reasoning,pred_answer,metric
0,Find the sum of all integer bases $b>9$ for which $17_b$ is a divi...,70,"First, let's understand what the problem states: We have bases \( ...",70,✔️ [1]
1,"On $\triangle ABC$ points $A, D, E$, and $B$ lie in that order on ...",588,"First, let’s understand the problem setup and identify the points ...",588,✔️ [1]
2,The 9 members of a baseball team went to an ice-cream parlor after...,16,"We have 9 players, each choosing one of three flavors: chocolate (...",16,✔️ [1]
3,"Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ a...",117,"We want to find the number of ordered pairs \((x,y)\) with integer...",117,✔️ [1]
4,There are $8!= 40320$ eight-digit positive integers that use each ...,279,We are given the set of all 8-digit numbers formed by the digits 1...,279,✔️ [1]
...,...,...,...,...,...
145,Let $S$ be the set of vertices of a regular $24$-gon. Find the num...,113,We have a regular 24-gon with vertex set \( S \). We want to find ...,12,
146,Let $A_1 A_2 A_3 \ldots A_{11}$ be an $11$-sided non-convex simple...,19,We are given an 11-sided polygon \( A_1 A_2 \ldots A_{11} \) with ...,116,
147,"Let $x_1, x_2, x_3, \ldots$ be a sequence of rational numbers defi...",248,"Given the sequence defined by: \[ x_1 = \frac{25}{11}, \quad x_{k+...",1,
148,Let $\triangle ABC$ be a right triangle with $\angle A = 90^\circ$...,104,We are given a right triangle \( \triangle ABC \) with a right ang...,104,✔️ [1]


EvaluationResult(score=40.0, results=<list of 150 results>)

In [None]:

optimizer = GEPA(
    metric=metric_with_feedback,
    auto="light",
    num_threads=32,
    track_stats=True,
    reflection_minibatch_size=3,
    reflection_lm=dspy.LM(model="gpt-5", temperature=1.0, max_tokens=32000, api_key=openai_api_key)
)

optimized_program = optimizer.compile(
    program,
    trainset=train_set,
    valset=val_set,
)

In [15]:
print(optimized_program.predict.signature.instructions)

You are given a single “problem” (typically a math contest-style question). Your task is to solve it correctly and output two sections:

- reasoning: A clear, compact derivation showing the key steps and justifications (avoid fluff).
- answer: The final result only (as required by the problem), with no extra text.

General requirements
- Interpret the question precisely. If it asks for a derived quantity (e.g., m+n or p+q after reducing a fraction), ensure you reduce to lowest terms and compute the requested combination at the end.
- Keep algebraic manipulations exact; avoid decimal approximations that can introduce errors. Cross-check cancellations and factorization.
- For extremal problems with order statistics and L1 constraints, use monotonicity/majorization inequalities to bound order statistics; supply constructions that attain the bounds.
- When the problem involves maximality/minimality over a set of objects and asks for the “smallest X that can contain each of them,” interpret

In [16]:
evaluate(optimized_program)

Average Metric: 67.00 / 150 (44.7%): 100%|██████████| 150/150 [04:05<00:00,  1.64s/it]

2025/09/22 11:29:41 INFO dspy.evaluate.evaluate: Average Metric: 67 / 150 (44.7%)





Unnamed: 0,problem,example_answer,reasoning,pred_answer,metric
0,Find the sum of all integer bases $b>9$ for which $17_b$ is a divi...,70,"First, interpret the given problem: - The notation \(17_b\) and \(...",70,✔️ [1]
1,"On $\triangle ABC$ points $A, D, E$, and $B$ lie in that order on ...",588,"First, assign coordinates to simplify calculations: - Place \(A\) ...",588,✔️ [1]
2,The 9 members of a baseball team went to an ice-cream parlor after...,16,"We have 9 players, each chooses one of three flavors (C, V, S). Ea...",16,✔️ [1]
3,"Find the number of ordered pairs $(x,y)$, where both $x$ and $y$ a...",117,"We want to find the integer pairs \((x,y)\) with \(-100 \leq x,y \...",117,✔️ [1]
4,There are $8!= 40320$ eight-digit positive integers that use each ...,279,We want to count the number \( N \) of 8-digit numbers using each ...,-297,
...,...,...,...,...,...
145,Let $S$ be the set of vertices of a regular $24$-gon. Find the num...,113,The problem asks to find the number of perfect matchings of the ve...,6,
146,Let $A_1 A_2 A_3 \ldots A_{11}$ be an $11$-sided non-convex simple...,19,Let \( A_1 \) be the reference point. We know: - For each \( i = 2...,19,✔️ [1]
147,"Let $x_1, x_2, x_3, \ldots$ be a sequence of rational numbers defi...",248,"Given the recursion: \[ x_1 = \frac{25}{11}, \quad x_{k+1} = \frac...",466,
148,Let $\triangle ABC$ be a right triangle with $\angle A = 90^\circ$...,104,Given a right triangle \( \triangle ABC \) with \(\angle A = 90^\c...,104,✔️ [1]


EvaluationResult(score=44.67, results=<list of 150 results>)

While people often manually go through this trial-and-error process, the performance improvement from simply giving more context demonstrates the importance of fully investigating the limits of the prompt and context you provide your model. 

**Context Limits**

For complex agents, these prompt engineering tasks can balloon in volume; a system prompt can reach thousands of words to define all the actions and knowledge that an LLM would need to complete a task. Or, a multi-turn conversation can begin to run up to the limits of an LLM's context window.

When the model inputs start reaching long lengths, we need to consider performance degradation. A common evaluation used to determine performance for long input lengths is the Needle in a Haystack eval, in which an LLM is asked to answer a specific question contained in a long input. 

<img src="images/figure4.png" alt="caption" width="700"/>

At long input lengths, LLMs begin to struggle with questions they could solve at shorter input lengths. 

![caption](images/figure3.png)

Source: https://research.trychroma.com/context-rot

This most frequently occurs in multi-turn agent settings during which a user and LLM repeatedly go back and forth. When this happens, a common solution is to simply summarize the conversation and use that as a compact context for the LLM.