# Multi-Model Comparison & Optimization

This notebook provides a framework for systematically comparing different language models (e.g., Haiku, Sonnet, Opus) to find the optimal balance of cost, quality, and latency for the Self-Critique pipeline.

## Learning Objectives

- **Model Benchmarking**: Compare the performance of different models on a standardized dataset.
- **Parameter Grid Search**: Systematically test different temperature and `max_tokens` settings.
- **Cost-Quality Analysis**: Visualize the Pareto frontier to make informed trade-offs.
- **Hybrid Strategies**: Explore using different models for different pipeline stages.

---


## Section 1: Setup and Configuration

In [None]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import itertools

from notebooks._shared_utilities import (
    create_benchmark_dataset,
    calculate_cost_metrics,
    calculate_quality_metrics
)

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 7)

print("✓ Environment setup complete")

## Section 2: Model Comparison Experiment

We'll run the same benchmark dataset through three different models: Haiku (fastest, cheapest), Sonnet (balanced), and Opus (highest quality).


In [None]:
def run_experiment(models_to_test: list, dataset: list) -> pd.DataFrame:
    """Runs a multi-model experiment and returns a DataFrame of results."""
    results = []
    for model in tqdm(models_to_test, desc="Testing Models"):
        for paper in tqdm(dataset, desc=f"Processing with {model}", leave=False):
            # Simulated pipeline run; replace with a real pipeline call.
            # Latency ordering matches the description above: Haiku is fastest, Opus slowest.
            simulated_latency = 6 if 'haiku' in model else 10 if 'sonnet' in model else 15
            simulated_quality = 7.5 if 'haiku' in model else 8.5 if 'sonnet' in model else 9.5
            
            run_result = {
                'model': model,
                'paper': paper['title'],
                'total_metrics': {
                    'total_input_tokens': np.random.randint(2800, 3200),
                    'total_output_tokens': np.random.randint(800, 1200),
                    'total_latency_seconds': simulated_latency + np.random.uniform(-1, 1)
                },
                'critique': f"Overall: {simulated_quality + np.random.uniform(-0.3, 0.3)}/10"
            }
            
            cost_metrics = calculate_cost_metrics(run_result, model=model)
            quality_metrics = calculate_quality_metrics(run_result)
            
            results.append({
                'model': model,
                'latency': run_result['total_metrics']['total_latency_seconds'],
                'cost': cost_metrics['total_cost_usd'],
                'quality': quality_metrics['overall']
            })
    return pd.DataFrame(results)

models = ["claude-haiku-4-20250514", "claude-sonnet-4-20250514", "claude-opus-4-20250514"]
benchmark_dataset = create_benchmark_dataset()

experiment_results = run_experiment(models, benchmark_dataset)

# Aggregate results
summary = experiment_results.groupby('model').mean().reset_index()
print("Experiment Summary:")
print(summary)
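
Because each simulated run includes random noise, the per-model mean alone can hide how much individual runs vary. The following is a minimal sketch, not part of the original pipeline, that reports the standard deviation and run count alongside the mean. The helper name `summarize_with_spread` and the toy data are assumptions made for illustration; the column names match those produced by `run_experiment`.

```python
import numpy as np
import pandas as pd


def summarize_with_spread(results: pd.DataFrame) -> pd.DataFrame:
    """Per-model mean, standard deviation, and run count for each metric."""
    summary = results.groupby("model")[["latency", "cost", "quality"]].agg(
        ["mean", "std", "count"]
    )
    # Flatten the (metric, statistic) column MultiIndex into names like "cost_mean".
    summary.columns = [f"{metric}_{stat}" for metric, stat in summary.columns]
    return summary.reset_index()


# Toy data standing in for `experiment_results`; the values are made up.
rng = np.random.default_rng(0)
toy_results = pd.DataFrame({
    "model": ["model-a"] * 5 + ["model-b"] * 5,
    "latency": rng.normal(6, 1, 10),
    "cost": rng.normal(0.02, 0.002, 10),
    "quality": rng.normal(8.5, 0.3, 10),
})

spread_summary = summarize_with_spread(toy_results)
print(spread_summary.round(4))
```

Passing `experiment_results` instead of the toy frame should give the same layout for the three models tested above.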

## Section 3: Cost-Quality Pareto Frontier

A Pareto frontier helps visualize the trade-offs between cost and quality. A model lies on the frontier if no other model is both cheaper and higher quality; any model off the frontier is dominated by a better alternative.


In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

sns.scatterplot(data=summary, x='cost', y='quality', hue='model', s=200, ax=ax, palette='viridis')

ax.set_title('Cost vs. Quality Trade-off', fontsize=16)
ax.set_xlabel('Average Cost per Execution (USD)')
ax.set_ylabel('Average Quality Score')
ax.grid(True)

# Annotate points
for i, row in summary.iterrows():
    ax.text(row['cost'] * 1.01, row['quality'], row['model'].split('-')[1], fontsize=12)

plt.show()

print("Interpretation:")
print("- Haiku: Lowest cost, lowest quality.")
print("- Opus: Highest quality, highest cost.")
print("- Sonnet: A balanced choice, offering a good quality improvement over Haiku for a moderate cost increase.")
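
The scatter plot shows the trade-off visually, but the frontier itself can also be computed. The sketch below marks a row as dominated when another row is at least as cheap and at least as good, and strictly better on one of the two. The function name `pareto_frontier` and the toy data are assumptions for illustration; the `cost` and `quality` column names match `summary`.

```python
import pandas as pd


def pareto_frontier(
    df: pd.DataFrame, cost_col: str = "cost", quality_col: str = "quality"
) -> pd.DataFrame:
    """Return the rows not dominated by any other row.

    Lower cost is better; higher quality is better.
    """
    keep = []
    for idx, row in df.iterrows():
        others = df.drop(index=idx)
        dominated = (
            (others[cost_col] <= row[cost_col])
            & (others[quality_col] >= row[quality_col])
            & (
                (others[cost_col] < row[cost_col])
                | (others[quality_col] > row[quality_col])
            )
        ).any()
        keep.append(not dominated)
    return df[keep].sort_values(cost_col).reset_index(drop=True)


# Toy data: "dominated" costs more than "balanced" but scores lower.
toy_summary = pd.DataFrame({
    "model": ["cheap", "balanced", "premium", "dominated"],
    "cost": [0.005, 0.020, 0.100, 0.050],
    "quality": [7.5, 8.5, 9.5, 8.0],
})

frontier = pareto_frontier(toy_summary)
print(frontier)
```

With only three models that follow the expected ordering, all three are likely to sit on the frontier; the check becomes more useful once parameter variants or hybrid strategies are added as extra rows.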

## Section 4: Stage-Specific Model Selection (Hybrid Strategy)

A powerful optimization is to use different models for different stages. For example, use a cheaper model for the initial summary and a more powerful one for the critique and revision.


In [None]:
def calculate_hybrid_cost(stage1_model, stage2_model, stage3_model):
    """Simulates the cost of a hybrid model strategy."""
    # Dummy token counts for each stage
    stage_tokens = {
        'stage1': {'input': 3000, 'output': 800},
        'stage2': {'input': 3800, 'output': 400},
        'stage3': {'input': 4200, 'output': 800}
    }
    
    models_used = [stage1_model, stage2_model, stage3_model]
    total_cost = 0
    
    for i, model in enumerate(models_used):
        stage_name = f'stage{i+1}'
        stage_result = {
            'total_metrics': {
                'total_input_tokens': stage_tokens[stage_name]['input'],
                'total_output_tokens': stage_tokens[stage_name]['output']
            }
        }
        cost_metrics = calculate_cost_metrics(stage_result, model=model)
        total_cost += cost_metrics['total_cost_usd']
    
    return total_cost

# Compare strategies
sonnet_only_cost = calculate_hybrid_cost(models[1], models[1], models[1])
hybrid_cost = calculate_hybrid_cost(models[0], models[2], models[2]) # Haiku for summary, Opus for critique/revision

print(f"Sonnet-Only Strategy Cost: ${sonnet_only_cost:.4f}")
print(f"Hybrid (Haiku/Opus) Strategy Cost: ${hybrid_cost:.4f}")
# A negative value means the hybrid strategy costs more than Sonnet-only.
print(f"Savings vs. Sonnet-Only: {(sonnet_only_cost - hybrid_cost) / sonnet_only_cost:.2%}")
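
Rather than comparing two hand-picked strategies, every assignment of a model to a stage can be enumerated and ranked by cost. The sketch below is self-contained and does not call `calculate_cost_metrics`. The tier names and the prices in `ASSUMED_PRICES` are hypothetical placeholders, not real pricing; only the per-stage token counts are taken from the cell above.

```python
import itertools

# HYPOTHETICAL prices in USD per million tokens, as (input, output).
# These are placeholders, not real pricing; substitute actual figures
# before drawing any conclusions.
ASSUMED_PRICES = {
    "small": (1.0, 5.0),
    "medium": (3.0, 15.0),
    "large": (15.0, 75.0),
}

# Per-stage (input, output) token counts, copied from calculate_hybrid_cost.
STAGE_TOKENS = [(3000, 800), (3800, 400), (4200, 800)]


def strategy_cost(models_per_stage):
    """Total USD cost of running each stage with the given model tier."""
    total = 0.0
    for model, (tokens_in, tokens_out) in zip(models_per_stage, STAGE_TOKENS):
        price_in, price_out = ASSUMED_PRICES[model]
        total += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return total


# All 3^3 = 27 assignments of a tier to the three stages, cheapest first.
strategies = sorted(
    itertools.product(ASSUMED_PRICES, repeat=3),
    key=strategy_cost,
)

for strategy in strategies[:5]:
    print(" / ".join(strategy), f"${strategy_cost(strategy):.4f}")
```

This ranks strategies by cost only. Choosing among them still requires a quality measurement for each combination, which the simulation here does not provide.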


## Section 5: Temperature Grid Search

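Tuning the `temperature` and `max_tokens` parameters can affect both quality and cost. In the simulation below, quality depends only on `temperature` and cost depends only on `max_tokens`; a real grid search would measure both from actual pipeline runs.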


In [None]:
# This is a conceptual example. A real grid search would take a long time.
temperatures = [0.1, 0.3, 0.5, 0.7]
max_tokens = [1024, 2048]

grid_search_params = list(itertools.product(temperatures, max_tokens))
grid_search_results = []

for temp, tokens in grid_search_params:
    # Simulate a run with these params
    simulated_quality = 8.5 - abs(0.5 - temp) # Assume optimal temp is 0.5
    simulated_cost = 0.015 * (tokens / 2048)
    
    grid_search_results.append({
        'temperature': temp,
        'max_tokens': tokens,
        'quality': simulated_quality,
        'cost': simulated_cost
    })

grid_df = pd.DataFrame(grid_search_results)

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
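scatter = ax.scatter(grid_df['temperature'], grid_df['max_tokens'], grid_df['quality'], c=grid_df['cost'], cmap='viridis')
# Point color encodes cost; the colorbar makes that mapping readable.
fig.colorbar(scatter, ax=ax, label='Cost (USD)')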

ax.set_xlabel('Temperature')
ax.set_ylabel('Max Tokens')
ax.set_zlabel('Quality Score')
ax.set_title('Grid Search for Optimal Parameters')
plt.show()

print("Optimal Parameters (Highest Quality):")
print(grid_df.loc[grid_df['quality'].idxmax()])
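
Several configurations can tie on quality, as both `max_tokens` settings do at a given temperature in this simulation, so picking by quality alone leaves the choice to row order. One possible refinement, sketched below, is to pick the cheapest configuration whose quality is within a tolerance of the best. The grid is rebuilt from the same formulas so the example runs on its own; the `tolerance` value is a hypothetical threshold.

```python
import itertools

import pandas as pd

# Rebuild the simulated grid from the cell above so this sketch runs on its own.
rows = [
    {
        "temperature": temp,
        "max_tokens": tokens,
        "quality": 8.5 - abs(0.5 - temp),
        "cost": 0.015 * (tokens / 2048),
    }
    for temp, tokens in itertools.product([0.1, 0.3, 0.5, 0.7], [1024, 2048])
]
grid = pd.DataFrame(rows)

# Quality by temperature (rows) and max_tokens (columns).
quality_table = grid.pivot(index="temperature", columns="max_tokens", values="quality")
print(quality_table)

# Among configurations within `tolerance` of the best quality, take the cheapest.
tolerance = 0.05  # hypothetical threshold
near_best = grid[grid["quality"] >= grid["quality"].max() - tolerance]
best_value = near_best.sort_values("cost").iloc[0]
print(best_value)
```

A flat table is often easier to read than the 3D scatter for a grid this small, and widening the tolerance trades a little quality for lower cost.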