<a href="https://colab.research.google.com/github/daniel-p-green/alain-notebooks/blob/main/FutureAGI_Agent_Optimizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# FutureAGI Agent Optimizer 🚀 — Interactive Demo

This notebook demonstrates how to use our `agent-opt` library to automatically improve and optimize LLM agents and prompts.
It runs a small question-answering optimization with the GEPA optimizer, inspects the best prompt found, and ends with a selector for trying the other optimizers.



In [None]:
# @title Installation
# Install dependencies
!pip install agent-opt -q



In [None]:
# @title API Key Setup
import os
import getpass

# Enter your API keys interactively (Jupyter will prompt)
OPENAI_API_KEY = getpass.getpass('Enter your OPENAI_API_KEY: ')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print("✅ API Key set successfully!")

Enter your OPENAI_API_KEY: ··········
✅ API Key set successfully!


## Imports and Logging Configuration

Import the required modules and configure logging so you can follow optimizer progress.  
If an import fails, make sure the corresponding package is installed in your kernel environment.


In [None]:
# @title
# Imports and logging configuration
import logging
import pandas as pd
import random
import time
from typing import List, Dict, Any

# --- Framework Imports ---
from fi.opt.base.base_optimizer import BaseOptimizer
from fi.opt.generators import LiteLLMGenerator
from fi.opt.datamappers import BasicDataMapper
from fi.opt.base.evaluator import Evaluator
from fi.opt.types import OptimizationResult, IterationHistory
from fi.opt.utils import setup_logging

# --- Evaluator Imports ---
from fi.evals.metrics import CustomLLMJudge
from fi.evals.llm import LiteLLMProvider

# --- Import all optimizers so they can be swapped on the fly ---
from fi.opt.optimizers import (
    RandomSearchOptimizer,
    ProTeGi,
    MetaPromptOptimizer,
    GEPAOptimizer,
    PromptWizardOptimizer,
    BayesianSearchOptimizer,
)

# Configure logging
setup_logging(level=logging.INFO, log_to_console=True, log_to_file=True, log_file="agent-opt.log")
logger = logging.getLogger(__name__)

print("✅ All components imported and logging is configured.")

✅ All components imported and logging is configured.


## Prepare the Dataset

Create an in-memory QA dataset. Replace with your CSV/JSON loader for a real experiment.


In [None]:
# @title Create sample dataset
def create_dataset() -> List[Dict[str, Any]]:
    '''Creates a sample dataset for the QA task.'''
    data = {
        'context': [
            "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.",
            "Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a cellular process that converts carbon dioxide and water into glucose and oxygen.",
            "The first person to walk on the Moon was Neil Armstrong. The historic event occurred on July 20, 1969, during the Apollo 11 mission.",
            "The Amazon River in South America is the largest river by discharge volume of water in the world, and the second longest in length.",
            "William Shakespeare was an English playwright, poet, and actor, widely regarded as the greatest writer in the English language. His plays have been translated into every major living language."
        ],
        'question': [
            "Who designed the Eiffel Tower?",
            "What are the products of photosynthesis?",
            "When did a person first walk on the Moon?",
            "Which river is the largest by water volume?",
            "What is Shakespeare known for?"
        ],
        'answer': [
            "Gustave Eiffel",
            "Glucose and oxygen",
            "July 20, 1969",
            "The Amazon River",
            "Being the greatest writer in the English language"
        ]
    }
    df = pd.DataFrame(data)
    return df.to_dict("records")

dataset = create_dataset()
print("✅ Sample dataset created successfully. Here are the first two examples:")
for item in dataset[:2]:
    print(item)

✅ Sample dataset created successfully. Here are the first two examples:
{'context': 'The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower.', 'question': 'Who designed the Eiffel Tower?', 'answer': 'Gustave Eiffel'}
{'context': 'Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy, through a cellular process that converts carbon dioxide and water into glucose and oxygen.', 'question': 'What are the products of photosynthesis?', 'answer': 'Glucose and oxygen'}
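
For a real experiment, the in-memory dataset can be swapped for a file loader. The following is a minimal sketch using only the standard library. `load_qa_rows` and `REQUIRED_COLUMNS` are illustrative names, not part of `agent-opt`, and the sketch assumes the CSV has `context`, `question`, and `answer` columns like the sample above. An in-memory `StringIO` stands in for a real file so the example runs as-is.

```python
import csv
import io
from typing import Any, Dict, IO, List

REQUIRED_COLUMNS = ("context", "question", "answer")


def load_qa_rows(handle: IO[str]) -> List[Dict[str, Any]]:
    """Read QA rows from an open CSV file into the list-of-dicts shape used above."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing required columns: {missing}")
    # Keep only the columns the prompt and the judge need.
    return [{c: row[c] for c in REQUIRED_COLUMNS} for row in reader]


# Demo with an in-memory file; for a real file, pass
# open(path, newline="", encoding="utf-8") instead.
sample_csv = io.StringIO(
    "context,question,answer\n"
    '"The first person to walk on the Moon was Neil Armstrong.",'
    '"Who first walked on the Moon?","Neil Armstrong"\n'
)
rows = load_qa_rows(sample_csv)
print(rows[0]["answer"])
```

The returned list has the same shape as the output of `create_dataset()`, so it could be passed wherever `dataset` is used in this notebook.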


## Define the Evaluation Strategy

We use a Custom LLM-as-a-Judge to score responses from 0.0 to 1.0. You can replace this with a local metric for cost savings.
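
As one possible local metric, a token-overlap F1 score also returns a value between 0.0 and 1.0 without any API calls. The sketch below is not part of `agent-opt`; `normalize` and `token_f1` are hypothetical helpers, and how to plug such a function into `Evaluator` depends on the library's metric interface, which is not shown here.

```python
import re
from collections import Counter
from typing import List

ARTICLES = {"a", "an", "the"}


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, and drop English articles."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in ARTICLES]


def token_f1(response: str, expected: str) -> float:
    """Token-overlap F1 in [0.0, 1.0], ignoring case, punctuation, and word order."""
    pred, gold = normalize(response), normalize(expected)
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Identical after normalization, so this scores the maximum.
print(token_f1("The Amazon River", "the amazon river."))
# Shares only "eiffel", so this gets partial credit.
print(token_f1("The tower was designed by Eiffel", "Gustave Eiffel"))
```

Unlike the LLM judge, this metric cannot recognize paraphrases, so it tends to penalize long but correct answers. That makes it a cheap development signal, not a replacement for the judge.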


In [None]:
# LLM provider used by the judge
provider = LiteLLMProvider()

correctness_judge_config = {
    "name": "correctness_judge",
    "grading_criteria": '''You are evaluating an AI's answer to a question. The score must be 1.0 if the 'response'
is semantically equivalent to the 'expected_response' (the ground truth). The score should be 0.0 if it is incorrect.
Partial credit is acceptable. For example, if the expected answer is "Gustave Eiffel" and the response is
"The tower was designed by Eiffel", a score of 0.8 is appropriate.''',
}

# Instantiate the judge and evaluator wrapper
correctness_judge = CustomLLMJudge(provider, config=correctness_judge_config)
evaluator = Evaluator(metric=correctness_judge)

# Data mapper connects model outputs to the judge expectations
data_mapper = BasicDataMapper(
    key_map={
        "response": "generated_output",
        "expected_response": "answer"
    }
)

print("✅ Evaluation strategy defined using a Custom LLM-as-a-Judge.")

2025-10-10 09:51:50,785 - fi.opt.base.evaluator - INFO - Initialized Evaluator with local metric: CustomLLMJudge
✅ Evaluation strategy defined using a Custom LLM-as-a-Judge.
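
To make the role of `key_map` concrete, the sketch below mimics the mapping with a plain function. `apply_key_map` is a hypothetical stand-in, not the `BasicDataMapper` implementation; it assumes each key is a field name the judge expects and each value names where to find it, with `generated_output` referring to the model's response.

```python
from typing import Any, Dict


def apply_key_map(key_map: Dict[str, str], generated_output: str,
                  example: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: build judge inputs by looking up each source field."""
    # Treat the model output as one more field alongside the dataset row.
    source = {**example, "generated_output": generated_output}
    return {judge_field: source[source_field]
            for judge_field, source_field in key_map.items()}


key_map = {"response": "generated_output", "expected_response": "answer"}
example = {
    "context": "The Eiffel Tower is named after the engineer Gustave Eiffel.",
    "question": "Who designed the Eiffel Tower?",
    "answer": "Gustave Eiffel",
}
judge_inputs = apply_key_map(key_map, "Gustave Eiffel's company", example)
print(judge_inputs)
```

Under this reading, the judge sees only `response` and `expected_response`, which are the two names used in the grading criteria above.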


## Initial Prompt and Models

Define the initial prompt to optimize, the generator model that runs it, and the teacher model that guides the optimization.


In [None]:
# @title Prompt Setup
INITIAL_PROMPT = "Context: {context}\\nQuestion: {question}\\nAnswer:" # @param {"type":"string"}
GENERATOR_MODEL = "gpt-4o-mini" # @param {"type":"string"}
TEACHER_MODEL = "gpt-5" # @param {"type":"string"}

print(f"✅ Ready to optimize! We will improve `{INITIAL_PROMPT}`")

✅ Ready to optimize! We will improve `Context: {context}\nQuestion: {question}\nAnswer:`
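
The printed prompt above shows `\n` as two visible characters, which suggests the string holds a literal backslash followed by `n`, not a line break. The sketch below compares the two spellings and fills the placeholders with `str.format`. Whether `agent-opt` renders templates the same way is an assumption, and the `row` values are illustrative.

```python
# The prompt as written in the cell above: "\\n" is a backslash plus the letter n.
template_escaped = "Context: {context}\\nQuestion: {question}\\nAnswer:"
# The same prompt with real line breaks.
template_newlines = "Context: {context}\nQuestion: {question}\nAnswer:"

row = {
    "context": "The Amazon River is the largest river by discharge volume.",
    "question": "Which river is the largest by water volume?",
}

rendered_escaped = template_escaped.format(**row)
rendered_newlines = template_newlines.format(**row)

# Count real line breaks in each rendered prompt.
print(rendered_escaped.count("\n"))
print(rendered_newlines.count("\n"))
print(rendered_newlines)
```

If line breaks are intended in the seed prompt, the single-backslash spelling is the one to use; the optimized prompt printed later in this notebook uses real line breaks.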


## GEPA Optimizer Setup
For this demo, we will use the **GEPA** optimizer.

**GEPA** (Genetic-Pareto) is a state-of-the-art evolutionary algorithm for prompt optimization. Instead of making small, random changes, GEPA treats prompts like DNA and intelligently evolves them over generations.

### How it Works:
1.  **Evaluate:** It first tests the performance of the current best prompt(s) on a sample of data.
2.  **Reflect:** It uses a powerful "reflection" model (our `reflection_model`) to analyze the results, especially the failures. It generates rich, textual feedback on *why* the prompt failed.
3.  **Mutate:** Based on this reflection, it rewrites the prompt to create new, improved "offspring" prompts.
4.  **Select:** It uses a sophisticated method called Pareto-aware selection to choose the most promising new prompts to carry forward to the next generation. This ensures that it doesn't just find one good prompt, but a diverse set of high-performing ones.

This cycle of **Evaluate -> Reflect -> Mutate -> Select** allows GEPA to navigate the vast space of possible prompts much more efficiently than random chance, often leading to significant performance improvements.
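
The shape of that cycle can be shown with a toy loop. This is not GEPA: the scoring, reflection, and mutation functions below are deterministic stand-ins for LLM calls, `RULES` is a made-up target, and selection is a simple "keep the child if it beats its parent" check, not Pareto-aware selection.

```python
from typing import List, Tuple

# Hypothetical rules that a "good" prompt should mention.
RULES = ["be concise", "use only the context", "copy dates exactly"]


def evaluate(prompt: str) -> float:
    """Toy score: fraction of rules the prompt mentions."""
    return sum(rule in prompt for rule in RULES) / len(RULES)


def reflect(prompt: str) -> List[str]:
    """Toy feedback: the rules the prompt is still missing."""
    return [rule for rule in RULES if rule not in prompt]


def mutate(prompt: str, feedback: List[str]) -> str:
    """Toy rewrite: append the first missing rule as a bullet."""
    return (prompt + "\n- " + feedback[0]) if feedback else prompt


def optimize(seed: str, max_iterations: int) -> Tuple[str, float]:
    pool = [seed]
    for _ in range(max_iterations):
        parent = max(pool, key=evaluate)          # Evaluate
        feedback = reflect(parent)                # Reflect
        if not feedback:
            break                                 # nothing left to fix
        child = mutate(parent, feedback)          # Mutate
        if evaluate(child) > evaluate(parent):    # Select (greedy, not Pareto)
            pool.append(child)
    best = max(pool, key=evaluate)
    return best, evaluate(best)


best_prompt, best_score = optimize("Answer the question.", max_iterations=10)
print(best_score)
print(best_prompt)
```

In the real optimizer the evaluate step calls the generator model and the judge, and the reflect and mutate steps call the reflection model, which is why each iteration consumes API credits.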

For more information refer to our [FutureAGI Optimization Docs!](https://docs.futureagi.com/future-agi/get-started/optimization/optimizers/overview)

In [None]:
optimizer = GEPAOptimizer(reflection_model=TEACHER_MODEL, generator_model=GENERATOR_MODEL)

2025-10-10 09:52:00,416 - fi.opt.optimizers.gepa - INFO - Initialized with reflection_model: gpt-5, generator_model: gpt-4o-mini


## Run the GEPA Optimizer

This cell runs the GEPA optimizer. It may take several minutes and WILL consume API credits.
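
To estimate how far a `max_metric_calls` budget goes, the sketch below replays the rollout counts visible in the progress bar of the run output. The accounting rule is inferred from that output (dataset size 5, minibatch size 3), not taken from library documentation, and `rollout_trace` is a hypothetical helper.

```python
from typing import List


def rollout_trace(dataset_size: int, minibatch_size: int,
                  outcomes: List[str]) -> List[int]:
    """Cumulative rollouts after the baseline and after each iteration.

    Each outcome is one of:
      "accepted": child beat its parent on the minibatch, then got a full evaluation
      "rejected": child did not beat its parent on the minibatch
      "perfect":  parent already scored perfectly, so no child was proposed
    """
    used = dataset_size                  # baseline evaluation of the seed prompt
    trace = [used]
    for outcome in outcomes:
        used += minibatch_size           # score the parent on a minibatch
        if outcome != "perfect":
            used += minibatch_size       # score the child on the same minibatch
        if outcome == "accepted":
            used += dataset_size         # full evaluation of the accepted child
        trace.append(used)
    return trace


# Outcomes of iterations 1 to 5 as reported in the log of this run.
trace = rollout_trace(5, 3, ["accepted", "rejected", "perfect", "rejected", "rejected"])
print(trace)
```

In this run a sixth iteration started at 37 of 40 rollouts and ran to completion, so the budget appears to be checked between iterations and not enforced as a hard cap. Treat that as an observation from one run, not as documented behavior.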


In [None]:
# @title
results = optimizer.optimize(
    evaluator = evaluator,
    data_mapper = data_mapper,
    dataset = dataset,
    initial_prompts = [INITIAL_PROMPT],
    max_metric_calls = 40 # Since our dataset is small and isn't too complex, a lower limit should suffice.
    )

2025-10-10 09:52:03,121 - fi.opt.optimizers.gepa - INFO - --- Starting GEPA Prompt Optimization ---
2025-10-10 09:52:03,123 - fi.opt.optimizers.gepa - INFO - Dataset size: 5
2025-10-10 09:52:03,123 - fi.opt.optimizers.gepa - INFO - Initial prompts: ['Context: {context}\\nQuestion: {question}\\nAnswer:']
2025-10-10 09:52:03,124 - fi.opt.optimizers.gepa - INFO - Max metric calls: 40
2025-10-10 09:52:03,125 - fi.opt.optimizers.gepa - INFO - Creating internal GEPA adapter...
2025-10-10 09:52:03,126 - fi.opt.optimizers.gepa - INFO - Initialized with generator_model: gpt-4o-mini
2025-10-10 09:52:03,126 - fi.opt.optimizers.gepa - INFO - Seed candidate for GEPA: {'prompt': 'Context: {context}\\nQuestion: {question}\\nAnswer:'}
2025-10-10 09:52:03,127 - fi.opt.optimizers.gepa - INFO - Calling gepa.optimize...





GEPA Optimization:   0%|          | 0/40 [00:00<?, ?rollouts/s]

2025-10-10 09:52:03,130 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:52:03,131 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-10 09:52:03,131 - fi.opt.optimizers.gepa - INFO - Batch size: 5
2025-10-10 09:52:03,132 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:52:07,313 - fi.opt.optimizers.gepa - INFO - Output generation finished in 4.18s.
2025-10-10 09:52:07,315 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:52:07,316 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:52:07,317 - fi.opt.base.evaluator - INFO - Starting evaluation for 5 inputs using 'local' strategy.
2025-10-10 09:52:07,319 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLMJudge
2025-10-10 09:52:14,452 - fi.opt.base.evaluator - INFO - Input #1 evaluated successfully. Score: 0.8000
Reason: {
  "s




GEPA Optimization:  12%|█▎        | 5/40 [00:11<01:19,  2.27s/rollouts]

Iteration 0: Base program full valset score: 0.7
Iteration 1: Selected program 0 score: 0.7
2025-10-10 09:52:14,463 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:52:14,464 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-10 09:52:14,464 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-10 09:52:14,465 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:52:18,325 - fi.opt.optimizers.gepa - INFO - Output generation finished in 3.86s.
2025-10-10 09:52:18,327 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:52:18,327 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:52:18,328 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-10 09:52:18,329 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLMJudge
2025-10-10 09:52:23,739 - 




GEPA Optimization:  40%|████      | 16/40 [01:39<02:39,  6.65s/rollouts]

Iteration 1: New program is on the linear pareto front
Iteration 1: Full valset score for new program: 0.96
Iteration 1: Full train_val score for new program: 0.96
Iteration 1: Individual valset scores for new program: [1.0, 0.9, 1.0, 0.9, 1.0]
Iteration 1: New valset pareto front scores: [1.0, 1.0, 1.0, 0.9, 1.0]
Iteration 1: Full valset pareto front score: 0.9800000000000001
Iteration 1: Updated valset pareto front programs: [{1}, {0}, {1}, {1}, {1}]
Iteration 1: Best valset aggregate score so far: 0.96
Iteration 1: Best program as per aggregate score on train_val: 1
Iteration 1: Best program as per aggregate score on valset: 1
Iteration 1: Best score on valset: 0.96
Iteration 1: Best score on train_val: 0.96
Iteration 1: Linear pareto front program index: 1
Iteration 1: New program candidate index: 1
Iteration 2: Selected program 1 score: 0.96
2025-10-10 09:53:42,932 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:53:42,933 - fi.opt.optimi




GEPA Optimization:  55%|█████▌    | 22/40 [02:52<02:35,  8.66s/rollouts]

Iteration 2: New subsample score 2.8 is not better than old score 2.9, skipping
Iteration 3: Selected program 1 score: 0.96
2025-10-10 09:54:55,325 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:54:55,325 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'You are given inputs in the form:
Context: {context}
Question: {question}
Answer:

Your job is to ou...'
2025-10-10 09:54:55,326 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-10 09:54:55,326 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:54:57,925 - fi.opt.optimizers.gepa - INFO - Output generation finished in 2.60s.
2025-10-10 09:54:57,927 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:54:57,928 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:54:57,929 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-10 09:54:57,930 - fi.opt.base.evaluator - INF




GEPA Optimization:  62%|██████▎   | 25/40 [02:57<01:48,  7.24s/rollouts]

Iteration 3: All subsample scores perfect. Skipping.
Iteration 3: Reflective mutation did not propose a new candidate
Iteration 4: Selected program 0 score: 0.7
2025-10-10 09:55:00,935 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:55:00,936 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-10 09:55:00,936 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-10 09:55:00,939 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:55:03,040 - fi.opt.optimizers.gepa - INFO - Output generation finished in 2.10s.
2025-10-10 09:55:03,042 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:55:03,043 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:55:03,044 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-10 09:55:03,044 - fi.opt.base.evaluator - INFO - Running lo




GEPA Optimization:  78%|███████▊  | 31/40 [04:12<01:22,  9.17s/rollouts]

Iteration 4: New subsample score 2.25 is not better than old score 2.3, skipping
Iteration 5: Selected program 1 score: 0.96
2025-10-10 09:56:15,364 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:56:15,365 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'You are given inputs in the form:
Context: {context}
Question: {question}
Answer:

Your job is to ou...'
2025-10-10 09:56:15,366 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-10 09:56:15,366 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:56:16,448 - fi.opt.optimizers.gepa - INFO - Output generation finished in 1.08s.
2025-10-10 09:56:16,450 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:56:16,450 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:56:16,451 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-10 09:56:16,452 - fi.opt.base.evaluator - IN




GEPA Optimization:  92%|█████████▎| 37/40 [05:38<00:33, 11.02s/rollouts]

Iteration 5: New subsample score 2.9 is not better than old score 2.9, skipping
Iteration 6: Selected program 0 score: 0.7
2025-10-10 09:57:42,105 - fi.opt.optimizers.gepa - INFO - Starting evaluation for a candidate prompt.
2025-10-10 09:57:42,106 - fi.opt.optimizers.gepa - INFO - Evaluating prompt: 'Context: {context}\nQuestion: {question}\nAnswer:...'
2025-10-10 09:57:42,106 - fi.opt.optimizers.gepa - INFO - Batch size: 3
2025-10-10 09:57:42,107 - fi.opt.optimizers.gepa - INFO - Generating outputs...
2025-10-10 09:57:44,081 - fi.opt.optimizers.gepa - INFO - Output generation finished in 1.97s.
2025-10-10 09:57:44,083 - fi.opt.optimizers.gepa - INFO - Mapping evaluation inputs...
2025-10-10 09:57:44,084 - fi.opt.optimizers.gepa - INFO - Evaluating generated outputs...
2025-10-10 09:57:44,084 - fi.opt.base.evaluator - INFO - Starting evaluation for 3 inputs using 'local' strategy.
2025-10-10 09:57:44,084 - fi.opt.base.evaluator - INFO - Running local evaluation with metric: CustomLLMJ

GEPA Optimization:  92%|█████████▎| 37/40 [06:54<00:33, 11.19s/rollouts]

Iteration 6: Full valset score for new program: 0.96
Iteration 6: Full train_val score for new program: 0.96
Iteration 6: Individual valset scores for new program: [1.0, 1.0, 1.0, 0.9, 0.9]
Iteration 6: New valset pareto front scores: [1.0, 1.0, 1.0, 0.9, 1.0]
Iteration 6: Full valset pareto front score: 0.9800000000000001
Iteration 6: Updated valset pareto front programs: [{1, 2}, {0, 2}, {1, 2}, {1, 2}, {1}]
Iteration 6: Best valset aggregate score so far: 0.96
Iteration 6: Best program as per aggregate score on train_val: 1
Iteration 6: Best program as per aggregate score on valset: 1
Iteration 6: Best score on valset: 0.96
Iteration 6: Best score on train_val: 0.96
Iteration 6: Linear pareto front program index: 1
Iteration 6: New program candidate index: 2
2025-10-10 09:58:57,250 - fi.opt.optimizers.gepa - INFO - gepa.optimize finished in 414.12s.
2025-10-10 09:58:57,250 - fi.opt.optimizers.gepa - INFO - GEPA result best score: 0.96
2025-10-10 09:58:57,251 - fi.opt.optimizers.gepa
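
The "pareto front" lines in the log can be read as a per-example bookkeeping step: for each validation example, record the best score any program has reached and which programs reach it. The sketch below recomputes those lines from the logged scores. `per_example_front` is a hypothetical helper, not the GEPA implementation, and the scores for program 0 are placeholders chosen only to agree with what the log implies (mean 0.7, 0.8 on the first example, best on the second).

```python
from typing import Dict, List, Set, Tuple


def per_example_front(scores: Dict[int, List[float]]) -> Tuple[List[float], List[Set[int]]]:
    """For each example, return the best score and the set of programs that reach it."""
    n_examples = len(next(iter(scores.values())))
    best_scores: List[float] = []
    best_programs: List[Set[int]] = []
    for i in range(n_examples):
        top = max(s[i] for s in scores.values())
        best_scores.append(top)
        best_programs.append({p for p, s in scores.items() if s[i] == top})
    return best_scores, best_programs


scores = {
    0: [0.8, 1.0, 0.5, 0.6, 0.6],  # placeholder values, not logged data
    1: [1.0, 0.9, 1.0, 0.9, 1.0],  # logged at iteration 1
    2: [1.0, 1.0, 1.0, 0.9, 0.9],  # logged at iteration 6
}
front_scores, front_programs = per_example_front(scores)
print(front_scores)
print(front_programs)
print(sum(front_scores) / len(front_scores))
```

The front score (about 0.98) is higher than the best single program's aggregate (0.96) because no one program is best on every example, which is the diversity that Pareto-aware selection is meant to preserve.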




## Final Results of GEPA Optimizer Run


In [None]:
# @title Best Prompt Found
print(results.best_generator.get_prompt_template())

You are given inputs in the form:
Context: {context}
Question: {question}
Answer:

Your job is to output only the answer text, to be placed after "Answer:". Follow these rules:

- Use only information from the provided Context. Do not use outside knowledge.
- Output the shortest accurate phrase, value, or list that directly answers the Question.
- Prefer a concise noun phrase or value over a full sentence. Do not restate the question or add explanation, examples, or extra context.
- Reuse exact wording from the Context when possible. If needed, minimally condense by removing hedges like “widely regarded as” to keep the core claim.
- Preserve the original capitalization, spelling, numbers, and units as they appear in the Context.
- If the answer is a date, output it exactly in the format given in the Context (e.g., "July 20, 1969").
- If the answer is a list, output just the items in the same order and wording as in the Context, joined as they appear (e.g., “glucose and oxygen”).
- For 

In [None]:
# @title Final Score
results.final_score

0.96

In [None]:
# @title Iteration History
for idx, hist in enumerate(results.history):
    print(f"---- Iteration {idx+1} ----")
    print(f"===PROMPT===\n{hist.prompt}")
    print(f"\n\n===AVERAGE SCORE===\n{hist.average_score}")

---- Iteration 1 ----
===PROMPT===
Context: {context}\nQuestion: {question}\nAnswer:


===AVERAGE SCORE===
0.7
---- Iteration 2 ----
===PROMPT===
You are given inputs in the form:
Context: {context}
Question: {question}
Answer:

Your job is to output only the answer text, to be placed after "Answer:". Follow these rules:

- Use only information from the provided Context. Do not use outside knowledge.
- Output the shortest accurate phrase, value, or list that directly answers the Question.
- Prefer a concise noun phrase or value over a full sentence. Do not restate the question or add explanation, examples, or extra context.
- Reuse exact wording from the Context when possible. If needed, minimally condense by removing hedges like “widely regarded as” to keep the core claim.
- Preserve the original capitalization, spelling, numbers, and units as they appear in the Context.
- If the answer is a date, output it exactly in the format given in the Context (e.g., "July 20, 1969").
- If the a

In [None]:
# @title Pick and choose from our range of Optimizers!!
optimizer_name = "PromptWizard" # @param ["PromptWizard","RandomSearch","ProTeGi","MetaPrompt","BayesianSearch"]


---
### Conclusion & Caveats

- Different optimizers trade off cost, speed, and depth of edits.
- **Meta-Prompt**, **ProTeGi**, and **GEPA** often find more robust prompts but cost more.
- Use a local metric during development to reduce API costs, then run the teacher-guided optimizers for final refinement.
