# Multi-Objective Bayesian Optimization for CLI Summarization

This notebook extends our previous benchmarking work by using Bayesian optimization to search for good configurations of a command-line summarization tool across several competing objectives. Instead of an exhaustive grid search, we use `scikit-optimize` to choose each new configuration based on the results of earlier trials.

### Objectives
We aim to find configurations that are **Pareto-optimal** across three competing objectives:
1.  **Maximize Accuracy**: The factual correctness and quality of the summary.
2.  **Minimize Latency**: The wall-clock time required to generate a summary.
3.  **Minimize Token Cost**: The estimated monetary cost of the API call, based on input and output token counts.

### Search Space
The optimizer will sweep through the following hyperparameters:
-   **Model**: `phi3:mini`, `llama3:8b` (Categorical)
-   **Temperature**: `0.0` to `1.0` (Continuous)
-   **Max Tokens**: `256`, `512`, `768` (Categorical)
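
To make the contrast with grid search concrete, the sketch below counts the configurations an exhaustive grid over this space would need. The 0.1 temperature step is a hypothetical choice for illustration only; the notebook itself treats temperature as continuous.

```python
from itertools import product

models = ["phi3:mini", "llama3:8b"]
max_tokens_options = [256, 512, 768]
# Assumption: a hypothetical grid samples temperature in steps of 0.1.
temperature_grid = [round(0.1 * i, 1) for i in range(11)]

# Every combination of model, temperature, and max tokens
grid = list(product(models, temperature_grid, max_tokens_options))
print(f"Grid search configurations: {len(grid)}")
print(f"First configuration: {grid[0]}")
```

Under that assumption the grid has 66 configurations, compared with the 25 evaluations budgeted for the optimizer later in this notebook.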

### Outcome
The result is a 3D plot visualizing the trade-offs between our three objectives and a saved CSV file, `pareto_frontier.csv`, containing the set of most efficient configurations.

In [None]:
import dspy
import os
import time
import warnings
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from skopt import gp_minimize
from skopt.space import Real, Categorical

# Suppress verbose warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully.")

## 1. Configure Models, Costs, and Search Space

We'll configure our local models, define a cost estimation function, and lay out the hyperparameter search space for the Bayesian optimizer.

In [None]:
# --- Model Configuration ---
phi3_mini = dspy.OllamaLocal(model='phi3:mini')
llama3_8b = dspy.OllamaLocal(model='llama3:8b')

models_map = {
    "phi3:mini": phi3_mini,
    "llama3:8b": llama3_8b
}

# --- Evaluator Configuration ---
# Using a powerful model as a judge is preferred.
# Constructing the client may not validate the key, so we check the
# OPENAI_API_KEY environment variable explicitly before choosing the evaluator.
if os.environ.get("OPENAI_API_KEY"):
    evaluator_lm = dspy.OpenAI(model='gpt-4o-mini', max_tokens=2048, model_type='chat')
    print("Using GPT-4o-mini as the evaluator.")
else:
    print("OPENAI_API_KEY is not set; could not initialize the OpenAI evaluator.")
    print("Falling back to llama3:8b as the evaluator. Quality scores may be less reliable.")
    evaluator_lm = llama3_8b

dspy.settings.configure(evaluator=evaluator_lm)

# --- Cost Estimation (Approximation) ---
# Based on OpenAI pricing for a capable model (e.g., gpt-4o-mini, $/1M tokens)
COST_PER_INPUT_TOKEN = 0.15 / 1_000_000
COST_PER_OUTPUT_TOKEN = 0.60 / 1_000_000

def estimate_cost(input_text, output_text):
    # A common heuristic: avg token length is ~4 chars
    input_tokens = len(input_text) / 4
    output_tokens = len(output_text) / 4
    cost = (input_tokens * COST_PER_INPUT_TOKEN) + (output_tokens * COST_PER_OUTPUT_TOKEN)
    return cost

# --- Search Space Definition for Scikit-Optimize ---
search_space = [
    Categorical(list(models_map.keys()), name='model_name'),
    Real(0.0, 1.0, name='temperature'),
    Categorical([256, 512, 768], name='max_tokens')
]

print("\nConfiguration complete.")
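
As a quick sanity check of the cost heuristic above, the following sketch repeats the same formula on made-up inputs. The 800-character document and 200-character summary are hypothetical sizes, not data from the benchmark.

```python
COST_PER_INPUT_TOKEN = 0.15 / 1_000_000
COST_PER_OUTPUT_TOKEN = 0.60 / 1_000_000

def estimate_cost(input_text, output_text):
    # Same heuristic as the notebook: roughly 4 characters per token
    input_tokens = len(input_text) / 4
    output_tokens = len(output_text) / 4
    return input_tokens * COST_PER_INPUT_TOKEN + output_tokens * COST_PER_OUTPUT_TOKEN

# Hypothetical sizes: 800 characters in (~200 tokens), 200 characters out (~50 tokens)
document = "x" * 800
summary = "y" * 200
cost = estimate_cost(document, summary)
print(f"${cost:.8f}")  # prints $0.00006000
```

Here the 200 input tokens and the 50 output tokens each contribute about $0.00003, which shows how the higher output price makes summary length matter as much as document length.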

## 2. Define Data, Signature, and Metrics

We set up the summarization task structure, create a sample dataset, and define the core evaluation logic for our three objectives.

In [None]:
# --- DSPy Signature & Dataset ---
class SummarizationSignature(dspy.Signature):
    """Summarize the given document into a short, concise paragraph."""
    document = dspy.InputField(desc="A long text document.")
    summary = dspy.OutputField(desc="A concise summary of the document.")

trainset = [
    dspy.Example(document="Quantum computing is a revolutionary type of computing that leverages the principles of quantum mechanics to process information. Unlike classical computers, which use bits to represent information as either a 0 or a 1, quantum computers use qubits. Qubits can exist in a superposition of both 0 and 1 simultaneously, and they can be entangled, meaning their fates are linked even when physically separated. This allows quantum computers to perform complex calculations at speeds unattainable by classical computers. Key applications include drug discovery, materials science, financial modeling, and breaking cryptographic codes. However, building and maintaining stable quantum computers is a massive engineering challenge due to qubit decoherence, where qubits lose their quantum properties due to interaction with the environment.").with_inputs('document'),
    dspy.Example(document="The Roman Republic was a period of ancient Roman civilization that began with the overthrow of the Roman Kingdom in 509 BC and ended in 27 BC with the establishment of the Roman Empire. It was characterized by a republican form of government, where annual magistrates were elected by the citizens. The cornerstone of the Republic was the Senate, a body of elder statesmen that advised the magistrates. Society was divided into patricians (the aristocracy) and plebeians (the common citizens). For centuries, the Republic expanded its territory through conquest and alliances, eventually controlling the entire Mediterranean basin. However, internal tensions, civil wars, and the rise of powerful military leaders like Julius Caesar ultimately led to its collapse and the rise of Augustus as the first Roman Emperor.").with_inputs('document')
]

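# --- Evaluation Metrics ---
import re  # used to parse the judge's rating

# The examples carry no reference summaries, so an exact-match metric cannot apply.
# Assumption: we instead ask the evaluator LM to rate each summary against its
# source document. The signature and the 1-5 scale below are illustrative choices.
class AssessSummary(dspy.Signature):
    """Rate how faithfully and concisely the summary captures the document."""
    document = dspy.InputField(desc="The source document.")
    summary = dspy.InputField(desc="The candidate summary.")
    rating = dspy.OutputField(desc="A single integer from 1 (poor) to 5 (excellent).")

judge = dspy.Predict(AssessSummary)

def summary_quality(example, prediction):
    """Returns the judge's rating rescaled to 0-100; unparseable ratings score 0."""
    with dspy.settings.context(lm=evaluator_lm):
        verdict = judge(document=example.document, summary=prediction.summary)
    match = re.search(r'[1-5]', verdict.rating)
    return (int(match.group()) - 1) * 25.0 if match else 0.0

def evaluate_configuration(model_name, temperature, max_tokens):
    """Runs a given configuration and returns (accuracy 0-100, avg latency in s, avg cost in $)."""
    # 1. Configure the LM. DSPy's LM clients keep generation settings in `kwargs`.
    #    Cast to plain Python types in case the optimizer passes NumPy scalars.
    lm = models_map[model_name]
    lm.kwargs['temperature'] = float(temperature)
    lm.kwargs['max_tokens'] = int(max_tokens)

    total_accuracy = 0.0
    total_latency = 0.0
    total_cost = 0.0
    with dspy.settings.context(lm=lm):
        summarizer = dspy.Predict(SummarizationSignature)
        for example in trainset:
            # 2. Generate each summary once, timing only the generation call
            start_time = time.perf_counter()
            prediction = summarizer(document=example.document)
            total_latency += time.perf_counter() - start_time

            # 3. Score quality and cost on that same prediction
            total_accuracy += summary_quality(example, prediction)
            total_cost += estimate_cost(example.document, prediction.summary)

    accuracy_score = total_accuracy / len(trainset)
    avg_latency = total_latency / len(trainset)
    avg_cost = total_cost / len(trainset)

    return accuracy_score, avg_latency, avg_cost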
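
The latency objective is an average of per-call wall-clock times. The sketch below isolates that measurement pattern with a stand-in function so it runs without a model; `timed_call` and `fake_summarizer` are hypothetical names, not part of DSPy. It uses `time.perf_counter`, a monotonic high-resolution clock that is generally better suited to measuring intervals than `time.time`.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Returns (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-in for the summarizer so the sketch runs without a model.
def fake_summarizer(document):
    time.sleep(0.01)  # simulate generation time
    return document[:40]

documents = ["first document " * 10, "second document " * 10]
latencies = []
for doc in documents:
    _, elapsed = timed_call(fake_summarizer, doc)
    latencies.append(elapsed)

avg_latency = sum(latencies) / len(latencies)
print(f"Average latency: {avg_latency:.3f}s")
```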

## 3. Bayesian Optimization

We define the objective function for the optimizer. Since `gp_minimize` minimizes a *single* scalar value, we **scalarize** our three objectives into one score using a weighted sum, a common technique. Accuracy enters the sum with a negative sign because we want to maximize it, so lower scores are better. We store the raw results of every trial so that we can build the Pareto frontier afterwards.
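
To illustrate the weighted sum, here is a minimal sketch that uses the weights and normalization constants from this notebook's objective function. The two sets of results are hypothetical values chosen for illustration, not measurements, and `scalarize` is a helper name introduced only for this example.

```python
def scalarize(accuracy, latency, cost,
              w_accuracy=1.5, w_latency=1.0, w_cost=0.5,
              max_latency=30.0, max_cost=0.0005):
    """Weighted sum to minimize; accuracy is expected on a 0-100 scale."""
    return (w_latency * latency / max_latency
            + w_cost * cost / max_cost
            - w_accuracy * accuracy / 100.0)

# Hypothetical results for two configurations (not measured values).
score_a = scalarize(accuracy=80.0, latency=6.0, cost=0.0001)
score_b = scalarize(accuracy=90.0, latency=15.0, cost=0.0002)
print(f"A: {score_a:.2f}, B: {score_b:.2f}")  # prints A: -0.90, B: -0.65
```

With these weights, configuration A scores lower (better) even though B is more accurate, because B's extra latency and cost outweigh its 10-point accuracy gain. Changing the weights can reverse that ranking.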

In [None]:
from skopt.utils import use_named_args

evaluated_points = []
iteration_count = 0

@use_named_args(search_space)
def objective_function(**params):
    global iteration_count
    iteration_count += 1
    
    model_name = params['model_name']
    temperature = params['temperature']
    max_tokens = params['max_tokens']
    
    print(f"[Iteration {iteration_count}] Testing: {model_name}, Temp: {temperature:.2f}, Max Tokens: {max_tokens}")
    
    # Get our three objective scores
    accuracy, latency, cost = evaluate_configuration(model_name, temperature, max_tokens)
    print(f"  -> Results: Accuracy={accuracy:.2f}, Latency={latency:.2f}s, Cost=${cost:.8f}")
    
    # Store the raw results for Pareto analysis later
    evaluated_points.append({
        'model_name': model_name,
        'temperature': temperature,
        'max_tokens': max_tokens,
        'accuracy': accuracy,
        'latency': latency,
        'cost': cost
    })

    # Scalarize the objectives into a single value to minimize.
    # We want to MINIMIZE latency and cost, and MAXIMIZE accuracy,
    # so accuracy enters the sum with a negative sign.
    # Weights can be tuned to prioritize one objective over others.
    # These starting weights favor quality over latency, and latency over cost.
    weight_accuracy = 1.5 # Higher weight to prioritize quality
    weight_latency = 1.0
    weight_cost = 0.5
    
    # Bring all three objectives to a comparable, roughly 0-1 scale.
    # Accuracy is reported on a 0-100 scale; latency and cost are divided by
    # assumed maxima of ~30s and ~$0.0005.
    norm_accuracy = accuracy / 100.0
    norm_latency = latency / 30.0
    norm_cost = cost / 0.0005
    
    score = (weight_latency * norm_latency) + (weight_cost * norm_cost) - (weight_accuracy * norm_accuracy)
    return score

# --- Run the Optimizer ---
N_CALLS = 25 # Number of configurations to test
print(f"\nStarting Bayesian optimization for {N_CALLS} iterations...\n")

results = gp_minimize(
    func=objective_function, 
    dimensions=search_space, 
    n_calls=N_CALLS, 
    random_state=42
)

print("\nOptimization complete.")

## 4. Identify Pareto Frontier and Visualize

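From all the configurations tested, we now identify the **Pareto-efficient** points. A configuration is on the Pareto frontier if no other tested configuration is at least as good on every objective and strictly better on at least one. Moving from one frontier point to another therefore always trades one objective against another.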

We then plot these points in a 3D scatter plot to visualize the frontier.
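
The dominance rule can be checked by hand on a tiny example. The following pure-Python sketch uses three hypothetical points in the form (latency, cost, negated accuracy), so that every objective is minimized. The helper names `dominates` and `pareto_front` are introduced only for this illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    All objectives are minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keeps the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (latency_s, cost_usd, -accuracy) tuples, not measured values.
points = [
    (2.0, 0.00005, -70.0),  # fast and cheap, lower accuracy
    (8.0, 0.00012, -90.0),  # slower and costlier, higher accuracy
    (9.0, 0.00015, -85.0),  # worse than the second point on every objective
]
front = pareto_front(points)
print(front)
```

The first two points survive because each beats the other on at least one objective, while the third is dropped because the second point is better on all three.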

In [None]:
def is_pareto_efficient(points):
    """
    Finds the Pareto-efficient points from a set of points (all objectives minimized).
    :param points: An (n_points, n_objectives) array.
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient.
    """
    is_efficient = np.ones(points.shape[0], dtype=bool)
    for i, p in enumerate(points):
        # Compare p against every other point
        others = points[np.arange(points.shape[0]) != i]
        # A point dominates p if it is no worse in all objectives
        # and strictly better in at least one (so exact duplicates do not
        # eliminate each other)
        no_worse = np.all(others <= p, axis=1)
        strictly_better = np.any(others < p, axis=1)
        if np.any(no_worse & strictly_better):
            is_efficient[i] = False
    return is_efficient

# --- Prepare Data for Analysis ---
df = pd.DataFrame(evaluated_points)

# We want to minimize latency and cost, but maximize accuracy.
# The Pareto function assumes all objectives are being minimized, so we use (-1 * accuracy).
points_for_pareto = df[['latency', 'cost', 'accuracy']].copy()
points_for_pareto['accuracy'] = -1 * points_for_pareto['accuracy']

# --- Find and Save the Frontier ---
pareto_mask = is_pareto_efficient(points_for_pareto.values)
pareto_df = df[pareto_mask].sort_values(by='accuracy', ascending=False).reset_index(drop=True)

output_filename = 'pareto_frontier.csv'
pareto_df.to_csv(output_filename, index=False)
print(f"Pareto frontier saved to {output_filename}")

# --- Visualize the Results ---
# Create hover text
df['text'] = df.apply(lambda row: f"Model: {row['model_name']}<br>Temp: {row['temperature']:.2f}<br>Max Tokens: {row['max_tokens']}<br>Accuracy: {row['accuracy']:.2f}<br>Latency: {row['latency']:.2f}s<br>Cost: ${row['cost']:.6f}", axis=1)

fig = go.Figure()

# Add all evaluated points (non-Pareto)
fig.add_trace(go.Scatter3d(
    x=df[~pareto_mask]['latency'],
    y=df[~pareto_mask]['cost'],
    z=df[~pareto_mask]['accuracy'],
    mode='markers',
    marker=dict(size=5, color='blue', opacity=0.6),
    text=df[~pareto_mask]['text'],
    hoverinfo='text',
    name='Dominated Points'
))

# Add Pareto-efficient points
fig.add_trace(go.Scatter3d(
    x=df[pareto_mask]['latency'],
    y=df[pareto_mask]['cost'],
    z=df[pareto_mask]['accuracy'],
    mode='markers',
    marker=dict(size=8, color='red', symbol='diamond'),
    text=df[pareto_mask]['text'],
    hoverinfo='text',
    name='Pareto Frontier'
))

fig.update_layout(
    title='Multi-Objective Optimization: Accuracy vs. Latency vs. Cost',
    scene=dict(
        xaxis_title='Latency (s) ↓',
        yaxis_title='Cost ($) ↓',
        zaxis_title='Accuracy (%) ↑'
    ),
    margin=dict(l=0, r=0, b=0, t=40)
)

fig.show()

# --- Display the Pareto Frontier Table ---
print("\n--- Pareto Optimal Configurations ---")
# Note: DataFrame.to_markdown requires the optional `tabulate` package.
print(pareto_df.to_markdown(index=False))
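
Once the frontier is saved, a common next step is to pick one configuration under a constraint. The sketch below is one possible way to do that, assuming the CSV keeps the column layout written above. The rows are hypothetical sample data, and `best_under_latency` is a helper name introduced only for this example.

```python
import csv
import io

# Hypothetical frontier rows in the same column layout as pareto_frontier.csv.
sample_csv = """model_name,temperature,max_tokens,accuracy,latency,cost
llama3:8b,0.20,512,90.0,8.0,0.00012
phi3:mini,0.10,256,70.0,2.0,0.00005
"""

def best_under_latency(rows, max_latency_s):
    """Returns the most accurate row whose latency fits the budget, or None."""
    eligible = [r for r in rows if float(r["latency"]) <= max_latency_s]
    return max(eligible, key=lambda r: float(r["accuracy"]), default=None)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
choice = best_under_latency(rows, max_latency_s=5.0)
print(choice["model_name"], choice["max_tokens"])  # prints phi3:mini 256
```

To run this against the real output, replace the `io.StringIO(sample_csv)` argument with an open file handle for `pareto_frontier.csv`.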