# Experiment 2: Run Cross-Model Evaluations

This notebook runs MCP evaluations using the Codex agent with different underlying LLMs.

**Objective:** Determine whether model choice affects MCP retrieval performance when using the same coding agent.

**Each evaluation run takes 2-3 hours** and covers 25 cases × 4 MCP servers = 100 individual evaluations per model.

**See:** `notes/EXPERIMENT_2_CROSS_MODEL.md` for detailed experimental design.

---

## Setup

### Environment Variables Required

The evaluation scripts need several API keys and configuration:

- `OPENAI_API_KEY`: For the gpt-5, gpt-5-mini, and gpt-5-nano models
- `PUBMED_EMAIL`: Required for PubMed API access
- `PUBMED_API_KEY`: Required for pubmed-mcp server

The OpenAI key is read from `~/openai.key`; the PubMed values are set inline in each command below.
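
Because each run takes hours, it may be worth confirming these prerequisites before launching anything. The following is a minimal sketch, not part of metacoder: `missing_prerequisites` is a hypothetical helper, and it assumes the PubMed values are exported as environment variables while the OpenAI key may come from either the environment or `~/openai.key`.

```python
import os
from pathlib import Path


def missing_prerequisites(env=None, key_file=None):
    """Return a list of human-readable problems; an empty list means ready to run.

    Hypothetical helper for illustration, not part of metacoder.
    """
    env = os.environ if env is None else env
    key_file = Path.home() / "openai.key" if key_file is None else Path(key_file)
    problems = []
    # The OpenAI key may come from the environment or from the key file.
    if not env.get("OPENAI_API_KEY") and not key_file.is_file():
        problems.append(f"OPENAI_API_KEY not set and {key_file} not found")
    # The PubMed settings are assumed to be plain environment variables.
    for name in ("PUBMED_EMAIL", "PUBMED_API_KEY"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    return problems


for problem in missing_prerequisites():
    print(f"WARNING: {problem}")
```

An empty result only shows that the values are present; it does not validate the keys against the OpenAI or PubMed services.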

In [None]:
import subprocess
import os
from pathlib import Path
from datetime import datetime
import yaml

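# Set working directory to project root (step up one level if launched from a notebook directory).
# Only the current directory's own name is checked, so an unrelated parent folder cannot trigger it.
project_root = Path.cwd().parent if 'notebook' in Path.cwd().name else Path.cwd()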
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

## Experiment 2 Design

**Independent Variable:** Underlying LLM used by the Codex agent
- gpt-5 (baseline from Experiment 1)
- gpt-5-mini (smaller/faster model)
- gpt-5-nano (smallest/cheapest model)

**Controlled Variables:**
- Same agent: Codex CLI
- Same MCP servers: artl, simple-pubmed, biomcp, pubmed-mcp
- Same test cases: 25 cases
- Same threshold: 0.9

**Total evaluations:** 3 models × 4 MCPs × 25 cases = 300 evaluations
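
The arithmetic above can be checked by enumerating the design grid. This is only an illustrative sketch: the model and server names come from the lists above, while the individual test cases are not listed in this notebook, so they are represented by placeholder indices.

```python
from itertools import product

models = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
mcp_servers = ["artl", "simple-pubmed", "biomcp", "pubmed-mcp"]
n_cases = 25  # case identifiers are not listed here, so placeholder indices are used

# One tuple per (model, MCP server, case) combination.
grid = list(product(models, mcp_servers, range(n_cases)))
per_model = len(grid) // len(models)

print(f"Total evaluations: {len(grid)}")      # 3 * 4 * 25
print(f"Evaluations per model: {per_model}")  # 4 * 25
```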

## Model Configurations

Each model uses a different config file in `project/generated/`:

| Model | Config File | 
|-------|-------------|
| gpt-5 | `literature_mcp_eval_config_codex_gpt5.yaml` |
| gpt-5-mini | `literature_mcp_eval_config_codex_gpt5_mini.yaml` |
| gpt-5-nano | `literature_mcp_eval_config_codex_gpt5_nano.yaml` |
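
The naming convention in this table, together with the output paths used in the commands below, can be captured in one place. A minimal sketch follows; `MODEL_KEYS` and `paths_for` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

CONFIG_DIR = Path("project/generated")
RESULTS_DIR = Path("results/compare_models")

# Keys follow the file-name convention codex_<key>_<date>.yaml used for results.
MODEL_KEYS = {
    "gpt-5": "gpt5",
    "gpt-5-mini": "gpt5_mini",
    "gpt-5-nano": "gpt5_nano",
}


def paths_for(model, date_str):
    """Return (config_path, output_path) for a model name and a YYYYMMDD date."""
    key = MODEL_KEYS[model]
    config = CONFIG_DIR / f"literature_mcp_eval_config_codex_{key}.yaml"
    output = RESULTS_DIR / f"codex_{key}_{date_str}.yaml"
    return config, output


# Report which config files are present relative to the current directory.
for model in MODEL_KEYS:
    config, _ = paths_for(model, "YYYYMMDD")
    status = "found" if config.is_file() else "MISSING"
    print(f"{model:<12} {config} [{status}]")
```

Running this from the project root before starting a run catches a missing or misnamed config file early.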

## 1. Run Codex + gpt-5 Evaluation (Baseline)

**Note:** We can reuse the results from Experiment 1 (`results/compare_agents/codex_20251208.yaml`)

**Command (if running fresh):**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt5.yaml \
  -o results/compare_models/codex_gpt5_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt5_YYYYMMDD.yaml`

In [None]:
# Check if a gpt-5 evaluation already exists (any run date, not only today's)
# The [0-9] in the pattern keeps gpt5_mini and gpt5_nano files from matching.
gpt5_results = sorted(Path("results/compare_models").glob("codex_gpt5_[0-9]*.yaml"))

if gpt5_results:
    gpt5_result = gpt5_results[-1]  # YYYYMMDD dates sort chronologically
    print(f"✓ gpt-5 evaluation already completed: {gpt5_result}")
    # Show brief summary
    with open(gpt5_result, 'r') as f:
        results = yaml.safe_load(f)
    if results and 'results' in results:
        count = len(results['results'])
        passed = sum(1 for r in results['results'] if r.get('passed', False))
        pct = passed / count * 100 if count else 0.0  # avoid division by zero
        print(f"  Tests: {count}, Passed: {passed} ({pct:.1f}%)")
else:
    print("⚠️  gpt-5 evaluation not found.")
    print("    You can copy from Experiment 1: cp results/compare_agents/codex_20251208.yaml results/compare_models/codex_gpt5_20251209.yaml")

## 2. Run Codex + gpt-5-mini Evaluation

**Command:**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt5_mini.yaml \
  -o results/compare_models/codex_gpt5_mini_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt5_mini_YYYYMMDD.yaml`

In [None]:
# Check if a gpt-5-mini evaluation already exists (any run date, not only today's)
gpt5_mini_results = sorted(Path("results/compare_models").glob("codex_gpt5_mini_[0-9]*.yaml"))

if gpt5_mini_results:
    gpt5_mini_result = gpt5_mini_results[-1]  # YYYYMMDD dates sort chronologically
    print(f"✓ gpt-5-mini evaluation already completed: {gpt5_mini_result}")
    with open(gpt5_mini_result, 'r') as f:
        results = yaml.safe_load(f)
    if results and 'results' in results:
        count = len(results['results'])
        passed = sum(1 for r in results['results'] if r.get('passed', False))
        pct = passed / count * 100 if count else 0.0  # avoid division by zero
        print(f"  Tests: {count}, Passed: {passed} ({pct:.1f}%)")
else:
    print("⚠️  gpt-5-mini evaluation not found.")

## 3. Run Codex + gpt-5-nano Evaluation

**Command:**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt5_nano.yaml \
  -o results/compare_models/codex_gpt5_nano_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt5_nano_YYYYMMDD.yaml`

In [None]:
# Check if a gpt-5-nano evaluation already exists (any run date, not only today's)
gpt5_nano_results = sorted(Path("results/compare_models").glob("codex_gpt5_nano_[0-9]*.yaml"))

if gpt5_nano_results:
    gpt5_nano_result = gpt5_nano_results[-1]  # YYYYMMDD dates sort chronologically
    print(f"✓ gpt-5-nano evaluation already completed: {gpt5_nano_result}")
    with open(gpt5_nano_result, 'r') as f:
        results = yaml.safe_load(f)
    if results and 'results' in results:
        count = len(results['results'])
        passed = sum(1 for r in results['results'] if r.get('passed', False))
        pct = passed / count * 100 if count else 0.0  # avoid division by zero
        print(f"  Tests: {count}, Passed: {passed} ({pct:.1f}%)")
else:
    print("⚠️  gpt-5-nano evaluation not found.")

## Check All Evaluation Status

In [None]:
from glob import glob

# Find all result files in compare_models
result_files = sorted(glob("results/compare_models/codex_*.yaml"))

print("Experiment 2 Evaluation Results")
print("=" * 70)

models_found = {}
for f in result_files:
    try:
        # Get file size first to skip empty files
        if os.path.getsize(f) == 0:
            print(f"\n⚠️  {Path(f).name}: EMPTY FILE (evaluation incomplete or failed)")
            continue
            
        with open(f, 'r') as file:
            data = yaml.safe_load(file)
            if data and 'results' in data:
                filename = Path(f).stem
                # Extract model name: codex_MODEL_DATE.yaml
                parts = filename.split('_')
                if len(parts) >= 3:
                    model = '_'.join(parts[1:-1])  # Everything between 'codex' and date
                    date = parts[-1]
                else:
                    model = parts[1] if len(parts) > 1 else 'unknown'
                    date = parts[2] if len(parts) > 2 else 'unknown'
                
                count = len(data['results'])
                passed = sum(1 for r in data['results'] if r.get('passed', False))
                pass_rate = (passed / count * 100) if count > 0 else 0
                
                models_found[model] = True
                
                print(f"\n{model.upper()} ({date}):")
                print(f"  File: {f}")
                print(f"  Tests: {count}")
                print(f"  Passed: {passed}")
                print(f"  Pass rate: {pass_rate:.1f}%")
    except Exception as e:
        print(f"\nError reading {f}: {e}")

print("\n" + "=" * 70)

# Check completeness
expected_models = {'gpt5', 'gpt5_mini', 'gpt5_nano'}
missing_models = expected_models - set(models_found.keys())

if missing_models:
    print(f"\n⚠️  Missing evaluations for: {', '.join(missing_models)}")
else:
    print("\n✓ All model evaluations complete!")
    print("\nNext step: Run analysis in experiment_2_cross_model_analysis.ipynb")

## Running Evaluations from Notebook (Alternative)

If you want to run evaluations directly from this notebook:

In [None]:
def run_evaluation(model_name, config_file, output_file):
    """
    Run a metacoder evaluation for a specific model configuration.
    
    Args:
        model_name: Display name for the model
        config_file: Path to YAML config file
        output_file: Path for results output
    """
    import os
    from pathlib import Path
    import subprocess
    
    # Set up environment
    env = os.environ.copy()
    
    # Load API keys
    with open(Path.home() / 'openai.key', 'r') as f:
        env['OPENAI_API_KEY'] = f.read().strip()
    
    # PubMed credentials
    env['PUBMED_EMAIL'] = 'justinreese@lbl.gov'
    env['PUBMED_API_KEY'] = '01eec0a16472164c6d69163bd28368311808'
    
    # Remove old output file if exists
    if os.path.exists(output_file):
        os.remove(output_file)
        print(f"Removed old output: {output_file}")
    
    # Run evaluation
    print(f"\nStarting {model_name} evaluation...")
    print(f"Config: {config_file}")
    print(f"Output: {output_file}")
    print("This will take 2-3 hours...\n")
    
    result = subprocess.run(
        ['uv', 'run', 'metacoder', 'eval', config_file, '-o', output_file],
        env=env,
        capture_output=True,
        text=True
    )
    
    if result.returncode == 0:
        print(f"✓ {model_name} evaluation complete!")
        return True
    else:
        print(f"✗ {model_name} evaluation failed:")
        print(result.stderr)
        return False

# Example usage (uncomment to run):
# run_evaluation(
#     model_name='gpt-5-mini',
#     config_file='project/generated/literature_mcp_eval_config_codex_gpt5_mini.yaml',
#     output_file=f'results/compare_models/codex_gpt5_mini_{datetime.now().strftime("%Y%m%d")}.yaml'
# )
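
Note that `capture_output=True` holds all output until the process exits, so a 2-3 hour run started with `run_evaluation` shows nothing while it works, and its stdout is never printed. One possible alternative is to stream output line by line. The sketch below uses only the standard library; `run_streaming` is a hypothetical helper, and the demonstration command is a harmless stand-in for the real `uv run metacoder eval ...` argument list.

```python
import subprocess
import sys


def run_streaming(cmd, env=None, log_path=None):
    """Run cmd, echoing each output line as it arrives; return the exit code.

    Hypothetical helper for illustration; stderr is merged into stdout.
    """
    log = open(log_path, "w") if log_path else None
    try:
        with subprocess.Popen(
            cmd,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        ) as proc:
            for line in proc.stdout:
                print(line, end="")
                if log:
                    log.write(line)
                    log.flush()
        # Leaving the with-block waits for the process, so returncode is set.
        return proc.returncode
    finally:
        if log:
            log.close()


# Harmless demonstration command; swap in the metacoder argument list for a real run.
exit_code = run_streaming([sys.executable, "-c", "print('hello from child')"])
print(f"exit code: {exit_code}")
```

Passing `log_path` keeps a copy of the output on disk, which helps when the notebook kernel is restarted mid-run. The child process may still buffer its own output, so lines can arrive in bursts.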

---

## Troubleshooting

### Common Issues

1. **Empty result files**: Check if evaluation crashed mid-run
   - Look for error logs
   - Verify API keys are valid
   - Check MCP server configurations

2. **API rate limits**: OpenAI may throttle requests
   - Evaluations automatically retry with backoff
   - Check API usage dashboards

3. **PubMed MCP failures**:
   - Verify `PUBMED_EMAIL` is set
   - Verify `PUBMED_API_KEY` is set (required for pubmed-mcp)

### Monitoring Long-Running Evaluations

```bash
# Watch for new results being written
watch -n 60 'wc -l results/compare_models/codex_*.yaml'

# Check if metacoder is running
ps aux | grep metacoder

# Tail output if running in background
tail -f nohup.out
```
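
The same check can be made from a notebook cell. This is a sketch under the assumption that result files are created or updated while a run is in progress, which this notebook does not confirm; `snapshot` is a hypothetical helper.

```python
import time
from pathlib import Path


def snapshot(results_dir="results/compare_models", pattern="codex_*.yaml"):
    """Return [(file name, size in bytes, seconds since last write)], sorted by name."""
    now = time.time()
    rows = []
    for path in sorted(Path(results_dir).glob(pattern)):
        stat = path.stat()
        rows.append((path.name, stat.st_size, now - stat.st_mtime))
    return rows


for name, size, age in snapshot():
    print(f"{name}: {size} bytes, last written {age / 60:.0f} min ago")
```

A file whose size stays at zero, or whose last write is much older than the expected run time, is a hint that the run has stalled or crashed.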

## Next Steps

Once all evaluations are complete:

1. **Run analysis**: Open `experiment_2_cross_model_analysis.ipynb`
2. **Generate figures**: The analysis notebook creates publication-ready plots
3. **Compare with Experiment 1**: Cross-reference agent vs. model effects
4. **Update manuscript**: Integrate findings into Results section