# FutureAGI Agent Optimizer — Example Notebook

This notebook demonstrates how to use our `agent-opt` library to automatically optimize prompts for LLM agents.
It runs a small question-answering optimization with the GEPA optimizer and inspects the best prompt found.



In [None]:
# @title Installation
# Install dependencies
%pip install agent-opt -q


In [None]:
import os
import getpass

# Enter your API key interactively (Jupyter will prompt for it)
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY: ')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print("✅ API Key set successfully!")

Enter your OPENAI_API_KEY: ··········
✅ API Key set successfully!


## Imports and Logging Configuration

Import the required modules and configure logging so you can follow optimizer progress.  
If an import fails, make sure the corresponding package is installed in your kernel environment.


In [None]:
# Imports and logging configuration
import logging
import pandas as pd
import random
import time
from typing import List, Dict, Any

# --- Framework Imports ---
from fi.opt.base.base_optimizer import BaseOptimizer
from fi.opt.generators import LiteLLMGenerator
from fi.opt.datamappers import BasicDataMapper
from fi.opt.base.evaluator import Evaluator
from fi.opt.types import OptimizationResult, IterationHistory
from fi.opt.utils import setup_logging

# --- Evaluator Imports ---
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

# --- Import all optimizers so they can be swapped in easily ---
from fi.opt.optimizers import (
    RandomSearchOptimizer,
    ProTeGi,
    MetaPromptOptimizer,
    GEPAOptimizer,
    PromptWizardOptimizer,
    BayesianSearchOptimizer,
)

# Configure logging
setup_logging(level=logging.INFO, log_to_console=True, log_to_file=True, log_file="agent-opt.log")
logger = logging.getLogger(__name__)

print("✅ All components imported and logging is configured.")

✅ All components imported and logging is configured.


## Prepare the Dataset

Create an in-memory QA dataset. Replace it with your own dataset for real experiments.


In [None]:
def create_dataset() -> List[Dict[str, Any]]:
    '''Creates a sample dataset for the QA task.'''
    data = {
        'context': [
            "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
            "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a cellular process that converts carbon dioxide and water into glucose and oxygen.",
            "The first person to walk on the Moon was Neil Armstrong. The historic event occurred on July 20, 1969, during the Apollo 11 mission.",
            "The Amazon River in South America is the largest river by discharge volume of water in the world, and the second longest in length.",
            "William Shakespeare was an English playwright, poet, and actor, widely regarded as the greatest writer in the English language. His plays have been translated into every major living language."
        ],
        'question': [
            "Who designed the Eiffel Tower?",
            "What are the products of photosynthesis?",
            "When did a person first walk on the Moon?",
            "Which river is the largest by water volume?",
            "What is Shakespeare known for?"
        ],
        'answer': [
            "Gustave Eiffel",
            "Glucose and oxygen",
            "July 20, 1969",
            "The Amazon River",
            "Being the greatest writer in the English language"
        ]
    }
    df = pd.DataFrame(data)
    return df.to_dict("records")

dataset = create_dataset()
print("✅ Sample dataset created successfully. Here are the first two examples:")
for item in dataset[:2]:
    print(item)

✅ Sample dataset created successfully. Here are the first two examples:
{'context': 'The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.', 'question': 'Who designed the Eiffel Tower?', 'answer': 'Gustave Eiffel'}
{'context': 'Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a cellular process that converts carbon dioxide and water into glucose and oxygen.', 'question': 'What are the products of photosynthesis?', 'answer': 'Glucose and oxygen'}
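
To swap in your own data, any loader that returns a list of dicts with the same `context`, `question`, and `answer` keys should fit the rest of the notebook. Below is a minimal sketch using only the standard library. The helper name `load_qa_rows` and the inline CSV text are illustrative and not part of `agent-opt`.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("context", "question", "answer")

def load_qa_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into the list-of-dicts shape returned by create_dataset()."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    rows = []
    for row in reader:
        # Keep only the required columns and skip rows with an empty field.
        record = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        if all(record.values()):
            rows.append(record)
    return rows

# Illustrative inline data; in practice, read the text from your own file.
sample_csv = (
    "context,question,answer\n"
    '"Water boils at 100 degrees Celsius at sea level.",'
    '"At what temperature does water boil at sea level?",'
    '"100 degrees Celsius"\n'
)
custom_dataset = load_qa_rows(sample_csv)
print(custom_dataset[0]["answer"])
```

Validating the column names up front makes a mislabeled file fail immediately rather than partway through an optimization run that has already spent API credits.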


## Define the Evaluation Strategy

We use a custom LLM-as-a-judge metric to score responses from 0.0 to 1.0. To reduce API costs, you can replace it with a metric that does not call an LLM.
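
As one example of a cheaper, LLM-free metric, token-overlap F1 also produces a score between 0.0 and 1.0. The sketch below is plain Python. `token_f1` is a hypothetical helper, and this notebook does not show how to register a custom metric with `Evaluator`, so consult the library docs for that step.

```python
import re
from collections import Counter
from typing import List

def _tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1(response: str, expected_response: str) -> float:
    """Token-overlap F1 between a response and the ground truth, in [0.0, 1.0]."""
    pred, gold = _tokens(response), _tokens(expected_response)
    if not pred or not gold:
        # Two empty strings match; one empty and one non-empty do not.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Gustave Eiffel", "Gustave Eiffel"))  # 1.0
print(round(token_f1("The tower was designed by Eiffel", "Gustave Eiffel"), 2))  # 0.25
```

A lexical metric like this is stricter about wording than an LLM judge. The paraphrased answer scores 0.25 here, so expect the optimizer to favor short, extractive answers if you use it.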


In [None]:
# LLM provider used by the judge
provider = LiteLLMProvider()

correctness_judge_config = {
    "name": "correctness_judge",
    "grading_criteria": '''You are evaluating an AI's answer to a question. The score must be 1.0 if the 'response'
is semantically equivalent to the 'expected_response' (the ground truth). The score should be 0.0 if it is incorrect.
Partial credit is acceptable. For example, if the expected answer is "Gustave Eiffel" and the response is
"The tower was designed by Eiffel", a score of 0.8 is appropriate.''',
}

# Instantiate the judge and evaluator wrapper
correctness_judge = CustomLLMJudge(provider, config=correctness_judge_config)
evaluator = Evaluator(metric=correctness_judge)

# Data mapper connects model outputs to the judge expectations
data_mapper = BasicDataMapper(
    key_map={
        "response": "generated_output",
        "expected_response": "answer"
    }
)

print("✅ Evaluation strategy defined using a Custom LLM-as-a-Judge.")

2025-10-02 13:21:36,715 - fi.opt.base.evaluator - INFO - Initialized Evaluator with local metric: CustomLLMJudge
✅ Evaluation strategy defined using a Custom LLM-as-a-Judge.
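
The key map above can be read as "judge field: source field". The sketch below mimics that lookup in plain Python. It assumes the generated text is exposed under `generated_output` next to the dataset row's own columns. `apply_key_map` is an illustrative helper, not the `BasicDataMapper` implementation.

```python
from typing import Any, Dict

def apply_key_map(
    key_map: Dict[str, str], row: Dict[str, Any], generated_output: str
) -> Dict[str, Any]:
    """Build judge inputs by looking up each mapped source field."""
    # Assumption: the model output sits alongside the dataset columns.
    sources = {**row, "generated_output": generated_output}
    return {target: sources[source] for target, source in key_map.items()}

key_map = {"response": "generated_output", "expected_response": "answer"}
row = {
    "context": "The first person to walk on the Moon was Neil Armstrong.",
    "question": "Who first walked on the Moon?",
    "answer": "Neil Armstrong",
}
judge_inputs = apply_key_map(key_map, row, generated_output="Neil Armstrong")
print(judge_inputs)
# {'response': 'Neil Armstrong', 'expected_response': 'Neil Armstrong'}
```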


## Initial Prompt and Models

Define the initial prompt to optimize, the generator model that will run it, and the teacher model that will drive the optimization.


In [None]:
INITIAL_PROMPT = "Context: {context}\\nQuestion: {question}\\nAnswer:" # @param {"type":"string"}
GENERATOR_MODEL = "gpt-4o-mini" # @param {"type":"string"}
TEACHER_MODEL = "gpt-5" # @param {"type":"string"}

print(f"✅ Ready to optimize! We will improve `{INITIAL_PROMPT}`")

✅ Ready to optimize! We will improve `Context: {context}\nQuestion: {question}\nAnswer:`
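
Note that the printed prompt shows `\n` literally. The string is written with `\\n`, so it contains a backslash followed by `n` rather than a line break. The source does not say whether this is intentional. The sketch below only illustrates the difference, assuming placeholders are filled `str.format`-style from a dataset row (an assumption about the generator, not confirmed here).

```python
# Same literal as INITIAL_PROMPT: "\\n" is a backslash followed by "n".
initial_prompt = "Context: {context}\\nQuestion: {question}\\nAnswer:"
# Variant with real line breaks, for comparison.
multiline_prompt = "Context: {context}\nQuestion: {question}\nAnswer:"

example = {
    "context": (
        "The Amazon River in South America is the largest river by discharge "
        "volume of water in the world, and the second longest in length."
    ),
    "question": "Which river is the largest by water volume?",
}

# Assumption: placeholders are substituted str.format-style.
filled_literal = initial_prompt.format(**example)
filled_multiline = multiline_prompt.format(**example)

print(len(filled_literal.splitlines()))    # 1
print(len(filled_multiline.splitlines()))  # 3
```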


## GEPA Optimizer Setup
For this demo, we use the **GEPA** optimizer.

**GEPA** (Genetic-Pareto) is a state-of-the-art evolutionary algorithm for prompt optimization. Instead of making small, random changes, GEPA treats prompts like DNA and evolves them over generations.

### How it Works:
1.  **Evaluate:** It first tests the performance of the current best prompt(s) on a sample of data.
2.  **Reflect:** It uses a powerful "reflection" model (our `reflection_model`) to analyze the results, especially the failures. It generates rich, textual feedback on *why* the prompt failed.
3.  **Mutate:** Based on this reflection, it rewrites the prompt to create new, improved "offspring" prompts.
4.  **Select:** It uses a sophisticated method called Pareto-aware selection to choose the most promising new prompts to carry forward to the next generation. This ensures that it doesn't just find one good prompt, but a diverse set of high-performing ones.

This cycle of **Evaluate -> Reflect -> Mutate -> Select** allows GEPA to navigate the vast space of possible prompts much more efficiently than random chance, often leading to significant performance improvements.
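
The selection step can be illustrated with a simplified per-example Pareto front: for each example, record the best score and which candidates reach it. This mirrors the "pareto front scores" and "pareto front programs" fields in the run logs. The scores are made up, and the code is a sketch of the idea rather than GEPA's actual implementation.

```python
from typing import Dict, List, Set, Tuple

def pareto_front(scores: Dict[int, List[float]]) -> Tuple[List[float], List[Set[int]]]:
    """For each example, return the best score and the candidates that reach it."""
    n_examples = len(next(iter(scores.values())))
    best_scores: List[float] = []
    best_programs: List[Set[int]] = []
    for i in range(n_examples):
        top = max(per_example[i] for per_example in scores.values())
        best_scores.append(top)
        best_programs.append(
            {pid for pid, per_example in scores.items() if per_example[i] == top}
        )
    return best_scores, best_programs

# Illustrative per-example scores for three candidate prompts (not from a real run).
candidate_scores = {
    0: [0.8, 1.0, 0.6],
    1: [0.7, 0.9, 0.6],
    2: [0.9, 0.7, 1.0],
}
front_scores, front_programs = pareto_front(candidate_scores)
print(front_scores)    # [0.9, 1.0, 1.0]
print(front_programs)  # [{2}, {0}, {2}]

# Candidates that win on at least one example remain eligible as parents.
survivors = set().union(*front_programs)
print(sorted(survivors))  # [0, 2]
```

Candidate 0 has a lower average than candidate 2 but stays in the pool because it is the best on the second example. This is how a Pareto-style selection keeps a diverse set of prompts instead of a single winner.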

For more information, refer to our [FutureAGI Optimization Docs](https://docs.futureagi.com/future-agi/get-started/optimization/optimizers/overview).

In [None]:
optimizer = GEPAOptimizer(reflection_model=TEACHER_MODEL, generator_model=GENERATOR_MODEL)

2025-10-02 13:22:24,342 - fi.opt.optimizers.gepa - INFO - Initialized with reflection_model: gpt-5, generator_model: gpt-4o-mini


## Run the GEPA Optimizer

This cell runs the GEPA optimizer. It may take several minutes and WILL consume API credits.


In [None]:
results = optimizer.optimize(
    evaluator=evaluator,
    data_mapper=data_mapper,
    dataset=dataset,
    initial_prompts=[INITIAL_PROMPT],
    # The dataset is small and simple, so a low budget should suffice.
    max_metric_calls=40,
)

2025-10-02 13:35:17,359 - fi.opt.optimizers.gepa - INFO - --- Starting GEPA Prompt Optimization ---
2025-10-02 13:35:17,362 - fi.opt.optimizers.gepa - INFO - Dataset size: 5
2025-10-02 13:35:17,364 - fi.opt.optimizers.gepa - INFO - Initial prompts: ['Context: {context}\\nQuestion: {question}\\nAnswer:']
2025-10-02 13:35:17,367 - fi.opt.optimizers.gepa - INFO - Max metric calls: 40
2025-10-02 13:35:17,368 - fi.opt.optimizers.gepa - INFO - Creating internal GEPA adapter...
2025-10-02 13:35:17,368 - fi.opt.optimizers.gepa - INFO - Initialized with generator_model: gpt-4o-mini
2025-10-02 13:35:17,369 - fi.opt.optimizers.gepa - INFO - Seed candidate for GEPA: {'prompt': 'Context: {context}\\nQuestion: {question}\\nAnswer:'}
2025-10-02 13:35:17,370 - fi.opt.optimizers.gepa - INFO - Calling gepa.optimize...



GEPA Optimization:   0%|          | 0/40 [00:00<?, ?rollouts/s]

2025-10-02 13:35:17,378 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:35:17,380 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-02 13:35:17,381 - fi.opt.optimizers.gepa - INFO - Batch size: 5
2025-10-02 13:35:17,381 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-02 13:35:25,646 - fi.opt.optimizers.gepa - INFO - Output generation finished in 8.26s.
2025-10-02 13:35:25,647 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-02 13:35:25,650 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-02 13:35:25,650 - fi.opt.base.evaluator - INFO - Starting evaluation for 5 inputs using 'local' strategy.
2025-10-02 13:35:25,651 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLMJudge
2025-10-02 13:35:31,175 - fi.opt.base.evaluator - INFO - Input #1 evaluated successfully. Score: 0.8000
Reason: {
  "s


GEPA Optimization:  12%|█▎        | 5/40 [00:13<01:36,  2.76s/rollouts]

Iteration 0: Base program full valset score: 0.74
Iteration 1: Selected program 0 score: 0.74
2025-10-02 13:35:31,186 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:35:31,187 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-02 13:35:31,188 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-02 13:35:31,188 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-02 13:35:37,726 - fi.opt.optimizers.gepa - INFO - Output generation finished in 6.54s.
2025-10-02 13:35:37,728 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-02 13:35:37,729 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-02 13:35:37,730 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-02 13:35:37,731 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLMJudge
2025-10-02 13:35:43,572 


GEPA Optimization:  28%|██▊       | 11/40 [01:14<03:35,  7.45s/rollouts]

Iteration 1: New subsample score 0.0 is not better than old score 2.6, skipping
Iteration 2: Selected program 0 score: 0.74
2025-10-02 13:36:32,255 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:36:32,256 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-02 13:36:32,257 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-02 13:36:32,257 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-02 13:36:37,702 - fi.opt.optimizers.gepa - INFO - Output generation finished in 5.44s.
2025-10-02 13:36:37,705 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-02 13:36:37,706 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-02 13:36:37,707 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-02 13:36:37,708 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLM


GEPA Optimization:  55%|█████▌    | 22/40 [02:18<01:55,  6.43s/rollouts]

Iteration 2: New program is on the linear pareto front
Iteration 2: Full valset score for new program: 0.8
Iteration 2: Full train_val score for new program: 0.8
Iteration 2: Individual valset scores for new program: [1.0, 0.9, 1.0, 0.2, 0.9]
Iteration 2: New valset pareto front scores: [1.0, 1.0, 1.0, 0.5, 0.9]
Iteration 2: Full valset pareto front score: 0.8800000000000001
Iteration 2: Updated valset pareto front programs: [{1}, {0}, {1}, {0}, {1}]
Iteration 2: Best valset aggregate score so far: 0.8
Iteration 2: Best program as per aggregate score on train_val: 1
Iteration 2: Best program as per aggregate score on valset: 1
Iteration 2: Best score on valset: 0.8
Iteration 2: Best score on train_val: 0.8
Iteration 2: Linear pareto front program index: 1
Iteration 2: New program candidate index: 1
Iteration 3: Selected program 1 score: 0.8
2025-10-02 13:37:36,216 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:37:36,217 - fi.opt.optimizers.g


GEPA Optimization:  70%|███████   | 28/40 [03:29<01:38,  8.17s/rollouts]

Iteration 3: New subsample score 1.0 is not better than old score 2.8, skipping
Iteration 4: Selected program 0 score: 0.74
2025-10-02 13:38:46,783 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:38:46,784 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-02 13:38:46,785 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-02 13:38:46,787 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-02 13:38:48,908 - fi.opt.optimizers.gepa - INFO - Output generation finished in 2.12s.
2025-10-02 13:38:48,912 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-02 13:38:48,914 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-02 13:38:48,914 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-02 13:38:48,915 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLM


GEPA Optimization:  98%|█████████▊| 39/40 [04:40<00:07,  7.36s/rollouts]

Iteration 4: New program is on the linear pareto front
Iteration 4: Full valset score for new program: 0.97
Iteration 4: Full train_val score for new program: 0.97
Iteration 4: Individual valset scores for new program: [1.0, 0.95, 1.0, 1.0, 0.9]
Iteration 4: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 0.9]
Iteration 4: Full valset pareto front score: 0.9800000000000001
Iteration 4: Updated valset pareto front programs: [{1, 2}, {0}, {1, 2}, {2}, {1, 2}]
Iteration 4: Best valset aggregate score so far: 0.97
Iteration 4: Best program as per aggregate score on train_val: 2
Iteration 4: Best program as per aggregate score on valset: 2
Iteration 4: Best score on valset: 0.97
Iteration 4: Best score on train_val: 0.97
Iteration 4: Linear pareto front program index: 2
Iteration 4: New program candidate index: 2
Iteration 5: Selected program 2 score: 0.97
2025-10-02 13:39:57,405 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-02 13:39:57,406 - fi.

GEPA Optimization:  98%|█████████▊| 39/40 [06:06<00:09,  9.41s/rollouts]

Iteration 5: New program is on the linear pareto front
Iteration 5: Full valset score for new program: 0.99
Iteration 5: Full train_val score for new program: 0.99
Iteration 5: Individual valset scores for new program: [0.95, 1.0, 1.0, 1.0, 1.0]
Iteration 5: New valset pareto front scores: [1.0, 1.0, 1.0, 1.0, 1.0]
Iteration 5: Full valset pareto front score: 1.0
Iteration 5: Updated valset pareto front programs: [{1, 2}, {0, 3}, {1, 2, 3}, {2, 3}, {3}]
Iteration 5: Best valset aggregate score so far: 0.99
Iteration 5: Best program as per aggregate score on train_val: 3
Iteration 5: Best program as per aggregate score on valset: 3
Iteration 5: Best score on valset: 0.99
Iteration 5: Best score on train_val: 0.99
Iteration 5: Linear pareto front program index: 3
Iteration 5: New program candidate index: 3
2025-10-02 13:41:24,251 - fi.opt.optimizers.gepa - INFO - gepa.optimize finished in 366.88s.
2025-10-02 13:41:24,253 - fi.opt.optimizers.gepa - INFO - GEPA result best score: 0.99
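
The progress-bar counts above (5, 11, 22, 28, 39) are consistent with one simple reading of how the `max_metric_calls` budget is spent: a full evaluation costs one call per example, each proposal scores parent and child on a 3-example subsample, and an accepted child is re-scored on the full set. This accounting is inferred from the log and is an assumption, not documented GEPA behavior.

```python
FULL_EVAL = 5  # dataset size: one metric call per example
MINIBATCH = 3  # subsample batch size reported in the log

# Whether each of the first four proposals was accepted, as reported in the log.
accepted = [False, True, False, True]

used = FULL_EVAL  # initial full evaluation of the seed prompt
checkpoints = [used]
for ok in accepted:
    used += 2 * MINIBATCH  # parent and child, each scored on the subsample
    if ok:
        used += FULL_EVAL  # accepted child is re-scored on the full set
    checkpoints.append(used)

print(checkpoints)  # [5, 11, 22, 28, 39]
```

Under this reading, a rejected proposal costs 6 calls and an accepted one costs 11, which gives a rough way to size `max_metric_calls` for a larger dataset.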




## Final Results of GEPA Optimizer Run


### Best Prompt Found

In [None]:
print(results.best_generator.get_prompt_template())

You will receive inputs in this exact format:
Context: {context}
Question: {question}
Answer:

Task:
Return the shortest, direct answer to the Question using only information from the Context. Prefer copying the exact minimal phrase from the Context.

Strict output rules:
- Output exactly one line in the form: Answer: {answer}
- Do not add any extra words, sentences, explanations, or restatements.
- Do not include prefixes like “The answer is”.
- Do not add punctuation at the end unless it is part of the answer itself.
- Preserve capitalization exactly as it appears in the Context.
- Use only the provided Context; do not rely on external knowledge.
- If the answer is not present or cannot be determined from the Context, output: Answer: Unknown

Answer selection rules:
- Prefer an exact extractive span from the Context; avoid paraphrasing.
- Choose the shortest span that fully answers the question.
- Keep the span contiguous when possible. You may drop surrounding hedges or qualifiers t

### Final Score

In [None]:
results.final_score

0.99

### Iteration History

In [None]:
for idx, hist in enumerate(results.history):
    print(f"---- Iteration {idx + 1} ----")
    print(f"===PROMPT===\n{hist.prompt}")
    print(f"\n\n===AVERAGE SCORE===\n{hist.average_score}")

---- Iteration 1 ----
===PROMPT===
Context: {context}\nQuestion: {question}\nAnswer:


===AVERAGE SCORE===
0.74
---- Iteration 2 ----
===PROMPT===
You are given inputs in the form:
Context: {context}
Question: {question}
Answer:

Your task: Provide only the minimal, exact answer to the question based solely on the provided context.

Strict output requirements:
- Output only the answer text (a single short phrase or name). Do not restate the question or context.
- Do not write a full sentence. No extra words, qualifiers, or explanations.
- Do not add punctuation (e.g., no trailing period).
- Prefer an exact phrase copied from the context when possible.
- Use the canonical capitalization and wording as it appears in the context. Include definite articles only if they are part of the entity’s name in the context (e.g., “The Amazon River”).
- Provide the shortest unambiguous span that fully answers the question. Do not add related entities or attributions (e.g., avoid “and his company”).
-

## Try our other Optimizers!

In [None]:
from fi.opt.optimizers import (
    RandomSearchOptimizer,
    ProTeGi,
    MetaPromptOptimizer,
    PromptWizardOptimizer,
    BayesianSearchOptimizer,
)