# Experiment 2: Run Cross-Model Evaluations

This notebook runs MCP evaluations using the Codex agent with different underlying LLMs.

**Objective:** Determine whether model choice affects MCP retrieval performance when using the same coding agent.

**Each evaluation run takes 2-3 hours** and covers 25 cases × 4 MCP servers = 100 case evaluations per model.

**See:** `notes/EXPERIMENT_2_CROSS_MODEL.md` for detailed experimental design.

---

## Setup

### Environment Variables Required

The evaluation scripts need several API keys and configuration:

- `OPENAI_API_KEY`: For Codex models (gpt-5.1-codex-max, gpt-5.1-codex-mini, gpt-5.1-codex)
- `PUBMED_EMAIL`: Required for PubMed API access
- `PUBMED_API_KEY`: Required for pubmed-mcp server

The OpenAI key is read from a file (the PubMed values are set inline in the code cells):
- `~/openai.key.another`
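
A minimal sketch of reading a key from such a file, assuming the file contains nothing but the key. `read_key_file` is a hypothetical helper introduced here for illustration; it is not part of this project or of metacoder.

```python
from pathlib import Path


def read_key_file(path: Path) -> str:
    """Return the key stored in `path`, stripped of surrounding whitespace.

    Hypothetical helper: assumes the file holds only the key itself.
    """
    key = path.expanduser().read_text(encoding="utf-8").strip()
    if not key:
        # Fail early instead of launching a multi-hour run with an empty key.
        raise ValueError(f"Key file is empty: {path}")
    return key


# Example use (not executed here):
# env["OPENAI_API_KEY"] = read_key_file(Path("~/openai.key.another"))
```

Failing on an empty file is a design choice: it surfaces a misconfigured key before an evaluation starts rather than hours into it.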

### MCP-Only Mode

All evaluations run with `disable_shell_tool: true` to prevent filesystem access and ensure fair comparison using only MCP tools.
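
Before a long run it can be worth confirming that a config really sets this flag. The sketch below searches an already-parsed config for `disable_shell_tool` at any nesting depth, because the exact position of the key in the config layout is not shown in this notebook. The `example` dictionary is a made-up fragment and the helper names are hypothetical.

```python
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def shell_tool_disabled(config: Any) -> bool:
    """True only if the flag appears at least once and is always `True`."""
    values = list(find_key(config, "disable_shell_tool"))
    return bool(values) and all(v is True for v in values)


# Made-up fragment for illustration; the real config layout may differ.
example = {"coders": {"codex": {"config": {"disable_shell_tool": True}}}}
print(shell_tool_disabled(example))  # True
```

With a real file, the parsed config would come from `yaml.safe_load(open(config_path))`, as in the other cells of this notebook.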

In [None]:
import subprocess
import os
from pathlib import Path
from datetime import datetime
import yaml
import shutil

# Set working directory to project root
project_root = Path.cwd().parent if 'notebook' in str(Path.cwd()) else Path.cwd()
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

def run_isolated_eval(model_name: str, config_path: str, output_name: str, background: bool = False):
    """
    Run Codex evaluation in isolated /tmp directory with shell_tool disabled.
    
    Args:
        model_name: Display name for the model (e.g., 'gpt5', 'gpt51_codex_mini')
        config_path: Path to config file relative to project root
        output_name: Base name for output file (e.g., 'codex_gpt5_mcp_only')
        background: If True, run in background
    
    Returns:
        Tuple of (isolated_dir, output_file, process or None)
    """
    # Create isolated directory in /tmp
    isolated_dir = Path(f"/tmp/mcp_eval_isolated_codex_{model_name}_{datetime.now().strftime('%Y%m%d_%H%M%S')}")
    isolated_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"=== Creating isolated environment: {isolated_dir} ===")
    
    # Copy ONLY the config file
    config_file = project_root / config_path
    config_basename = config_file.name
    isolated_config = isolated_dir / config_basename
    shutil.copy(config_file, isolated_config)
    
    print(f"✓ Copied config: {config_basename}")
    
    # Create isolated workdir
    isolated_workdir = isolated_dir / "eval_workdir"
    isolated_workdir.mkdir(exist_ok=True)
    
    print(f"✓ Created isolated workdir: {isolated_workdir}")
    
    # Set output file
    output_file = project_root / f"results/compare_models/{output_name}_{datetime.now().strftime('%Y%m%d')}.yaml"
    
    print(f"=== Running evaluation ===")
    print(f"Model: {model_name}")
    print(f"Config: {config_basename}")
    print(f"Run dir: {isolated_dir}")
    print(f"Workdir: {isolated_workdir}")
    print(f"Output: {output_file}")
    print("")
    
    # Set environment variables
    env = os.environ.copy()
    # Read the key via Path.read_text() so the file handle is closed promptly
    env["OPENAI_API_KEY"] = (Path.home() / "openai.key.another").read_text().strip()
    env["PUBMED_EMAIL"] = "justinreese@lbl.gov"
    env["PUBMED_API_KEY"] = "01eec0a16472164c6d69163bd28368311808"
    
    # Use project's venv python
    venv_python = project_root / ".venv/bin/python"
    
    cmd = [
        str(venv_python), "-m", "metacoder.metacoder", "eval",
        str(isolated_config),
        "--workdir", str(isolated_workdir),
        "-o", str(output_file)
    ]
    
    if background:
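        print("🚀 Starting evaluation in background...")
        # Log to a file instead of subprocess.PIPE: a pipe that nobody reads
        # can fill up and block the child process during a multi-hour run.
        # The log file name (eval.log) is chosen here for convenience.
        with open(isolated_dir / "eval.log", "w") as log_file:
            process = subprocess.Popen(
                cmd,
                cwd=isolated_dir,
                env=env,
                stdout=log_file,
                stderr=subprocess.STDOUT,
                text=True
            )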
        return isolated_dir, output_file, process
    else:
        print("🚀 Starting evaluation (this will take 2-3 hours)...")
        result = subprocess.run(
            cmd,
            cwd=isolated_dir,
            env=env,
            capture_output=True,
            text=True
        )
        
        if result.returncode != 0:
            print(f"❌ Evaluation failed with return code {result.returncode}")
            print(f"STDERR: {result.stderr}")
            raise RuntimeError(f"Evaluation failed: {result.stderr}")
        
        print("✅ Evaluation complete!")
        print(f"Output saved to: {output_file}")
        
        return isolated_dir, output_file, None

## Experiment 2 Design

**Independent Variable:** Underlying LLM model used by Codex agent
- gpt-5.1-codex-max (baseline from Experiment 1)
- gpt-5.1-codex-mini (smaller/faster model)
- gpt-5.1-codex (standard Codex model)

**Controlled Variables:**
- Same agent: Codex CLI
- Same MCP servers: artl, simple-pubmed, biomcp, pubmed-mcp
- Same test cases: 25 cases
- Same threshold: 0.9

**Total evaluations:** 3 models × 4 MCPs × 25 cases = 300 evaluations
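
The design above is a full factorial grid, and the totals can be checked by enumerating it. This is only an illustrative sketch: the model and server names are taken from the lists above, while the integer case indices stand in for the real test cases.

```python
from itertools import product

MODELS = ["gpt-5.1-codex-max", "gpt-5.1-codex-mini", "gpt-5.1-codex"]
MCP_SERVERS = ["artl", "simple-pubmed", "biomcp", "pubmed-mcp"]
N_CASES = 25  # placeholder indices stand in for the actual test cases

# Every (model, server, case) combination is evaluated exactly once.
grid = list(product(MODELS, MCP_SERVERS, range(N_CASES)))
per_model = len(grid) // len(MODELS)

print(len(grid), per_model)  # 300 100
```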

## Model Configurations

Each model uses a different config file in `project/generated/`:

| Model | Config File | 
|-------|-------------|
| gpt-5.1-codex-max | `literature_mcp_eval_config_codex_gpt5.yaml` |
| gpt-5.1-codex-mini | `literature_mcp_eval_config_codex_gpt51_codex_mini.yaml` |
| gpt-5.1-codex | `literature_mcp_eval_config_codex_gpt51_codex.yaml` |
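
The same mapping can be kept as a dictionary so that missing config files are caught before a run starts. A minimal sketch: the file names come from the table above, while `MODEL_CONFIGS` and `missing_configs` are hypothetical names introduced here.

```python
from pathlib import Path

CONFIG_DIR = Path("project/generated")

# Model name -> config file name, copied from the table above.
MODEL_CONFIGS = {
    "gpt-5.1-codex-max": "literature_mcp_eval_config_codex_gpt5.yaml",
    "gpt-5.1-codex-mini": "literature_mcp_eval_config_codex_gpt51_codex_mini.yaml",
    "gpt-5.1-codex": "literature_mcp_eval_config_codex_gpt51_codex.yaml",
}


def missing_configs(project_root: Path) -> list[str]:
    """Return the models whose config file is absent under `project_root`."""
    return [
        model
        for model, filename in MODEL_CONFIGS.items()
        if not (project_root / CONFIG_DIR / filename).is_file()
    ]


# Example use (not executed here):
# print(missing_configs(Path.cwd()))
```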

## 1. Run Codex + gpt-5.1-codex-max Evaluation (Baseline)

**Note:** The results from Experiment 1 (`results/compare_agents/codex_20251216.yaml`) can be reused as the baseline.

**Command (if running fresh):**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key.another)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt5.yaml \
  -o results/compare_models/codex_gpt5_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt5_YYYYMMDD.yaml`

In [None]:
# Check if the baseline (gpt-5.1-codex-max) evaluation already exists for today's date
gpt5_result = f"results/compare_models/codex_gpt5_{datetime.now().strftime('%Y%m%d')}.yaml"

if os.path.exists(gpt5_result):
    print(f"✓ gpt-5.1-codex-max evaluation already completed: {gpt5_result}")
    # Show brief summary
    with open(gpt5_result, 'r') as f:
        results = yaml.safe_load(f)
        if results and 'results' in results:
            count = len(results['results'])
            passed = sum(1 for r in results['results'] if r.get('passed', False))
            # Guard against an empty results list to avoid dividing by zero
            rate = passed / count * 100 if count else 0.0
            print(f"  Tests: {count}, Passed: {passed} ({rate:.1f}%)")
else:
    print("⚠️  gpt-5.1-codex-max evaluation not found.")
    # Copy to the same path this cell checks, so the next run finds the file
    print(f"    You can copy from Experiment 1: cp results/compare_agents/codex_20251216.yaml {gpt5_result}")

In [None]:
# Run gpt-5.1-codex-max evaluation (if not already done)
# Uncomment the line below to run:

# run_isolated_eval(
#     model_name='gpt5_max',
#     config_path='project/generated/literature_mcp_eval_config_codex_gpt5.yaml',
#     output_name='codex_gpt5_max_mcp_only'
# )

## 2. Run Codex + gpt-5.1-codex-mini Evaluation

**Command:**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key.another)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt51_codex_mini.yaml \
  -o results/compare_models/codex_gpt51_codex_mini_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt51_codex_mini_YYYYMMDD.yaml`

In [None]:
# Check if the gpt-5.1-codex-mini evaluation already exists for today's date
# (file name matches the output path documented above)
gpt5_mini_result = f"results/compare_models/codex_gpt51_codex_mini_{datetime.now().strftime('%Y%m%d')}.yaml"

if os.path.exists(gpt5_mini_result):
    print(f"✓ gpt-5.1-codex-mini evaluation already completed: {gpt5_mini_result}")
    with open(gpt5_mini_result, 'r') as f:
        results = yaml.safe_load(f)
        if results and 'results' in results:
            count = len(results['results'])
            passed = sum(1 for r in results['results'] if r.get('passed', False))
            # Guard against an empty results list to avoid dividing by zero
            rate = passed / count * 100 if count else 0.0
            print(f"  Tests: {count}, Passed: {passed} ({rate:.1f}%)")
else:
    print("⚠️  gpt-5.1-codex-mini evaluation not found.")

In [None]:
# Run gpt-5.1-codex-mini evaluation (if not already done)
# Uncomment the line below to run:

# run_isolated_eval(
#     model_name='gpt51_codex_mini',
#     config_path='project/generated/literature_mcp_eval_config_codex_gpt51_codex_mini.yaml',
#     output_name='codex_gpt51_codex_mini_mcp_only'
# )

## 3. Run Codex + gpt-5.1-codex Evaluation

**Command:**

```bash
#!/bin/bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key.another)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/generated/literature_mcp_eval_config_codex_gpt51_codex.yaml \
  -o results/compare_models/codex_gpt51_codex_$(date +%Y%m%d).yaml
```

**Duration:** ~2-3 hours

**Output:** `results/compare_models/codex_gpt51_codex_YYYYMMDD.yaml`

In [None]:
# Check if the gpt-5.1-codex evaluation already exists for today's date
# (file name matches the output path documented above)
gpt51_codex_result = f"results/compare_models/codex_gpt51_codex_{datetime.now().strftime('%Y%m%d')}.yaml"

if os.path.exists(gpt51_codex_result):
    print(f"✓ gpt-5.1-codex evaluation already completed: {gpt51_codex_result}")
    with open(gpt51_codex_result, 'r') as f:
        results = yaml.safe_load(f)
        if results and 'results' in results:
            count = len(results['results'])
            passed = sum(1 for r in results['results'] if r.get('passed', False))
            # Guard against an empty results list to avoid dividing by zero
            rate = passed / count * 100 if count else 0.0
            print(f"  Tests: {count}, Passed: {passed} ({rate:.1f}%)")
else:
    print("⚠️  gpt-5.1-codex evaluation not found.")

In [None]:
# Run gpt-5.1-codex evaluation (if not already done)
# Uncomment the line below to run:

# run_isolated_eval(
#     model_name='gpt51_codex',
#     config_path='project/generated/literature_mcp_eval_config_codex_gpt51_codex.yaml',
#     output_name='codex_gpt51_codex_mcp_only'
# )

## Check All Evaluation Status

In [None]:
from glob import glob

# Find all result files in compare_models
result_files = sorted(glob("results/compare_models/codex_*.yaml"))

print("Experiment 2 Evaluation Results")
print("=" * 70)

models_found = {}
for f in result_files:
    try:
        # Get file size first to skip empty files
        if os.path.getsize(f) == 0:
            print(f"\n⚠️  {Path(f).name}: EMPTY FILE (evaluation incomplete or failed)")
            continue
            
        with open(f, 'r') as file:
            data = yaml.safe_load(file)
            if data and 'results' in data:
                filename = Path(f).stem
                # Extract model name: codex_MODEL_DATE.yaml
                parts = filename.split('_')
                if len(parts) >= 3:
                    model = '_'.join(parts[1:-1])  # Everything between 'codex' and date
                    date = parts[-1]
                else:
                    model = parts[1] if len(parts) > 1 else 'unknown'
                    date = parts[2] if len(parts) > 2 else 'unknown'
                
                count = len(data['results'])
                passed = sum(1 for r in data['results'] if r.get('passed', False))
                pass_rate = (passed / count * 100) if count > 0 else 0
                
                models_found[model] = True
                
                print(f"\n{model.upper()} ({date}):")
                print(f"  File: {f}")
                print(f"  Tests: {count}")
                print(f"  Passed: {passed}")
                print(f"  Pass rate: {pass_rate:.1f}%")
    except Exception as e:
        print(f"\nError reading {f}: {e}")

print("\n" + "=" * 70)

# Check completeness
# Keys follow the documented output names: codex_<MODEL>_<DATE>.yaml
expected_models = {'gpt5', 'gpt51_codex_mini', 'gpt51_codex'}
missing_models = expected_models - set(models_found.keys())

if missing_models:
    print(f"\n⚠️  Missing evaluations for: {', '.join(sorted(missing_models))}")
else:
    print("\n✓ All model evaluations complete!")
    print("\nNext step: Run analysis in experiment_2_cross_model_analysis.ipynb")

## Running Evaluations from Notebook (Alternative)

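The function below is an older way to run evaluations directly from this notebook. It is deprecated and kept for reference; prefer `run_isolated_eval()` from the Setup section.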

In [None]:
# DEPRECATED: Use run_isolated_eval() instead - it runs in /tmp with MCP-only mode
# This function is kept for reference only

def run_evaluation(model_name, config_file, output_file):
    """
    DEPRECATED: Use run_isolated_eval() instead.
    
    Run a metacoder evaluation for a specific model configuration.
    """
    import os
    from pathlib import Path
    import subprocess
    
    # Set up environment
    env = os.environ.copy()
    
    # Load API keys - use openai.key.another
    with open(Path.home() / 'openai.key.another', 'r') as f:
        env['OPENAI_API_KEY'] = f.read().strip()
    
    # PubMed credentials
    env['PUBMED_EMAIL'] = 'justinreese@lbl.gov'
    env['PUBMED_API_KEY'] = '01eec0a16472164c6d69163bd28368311808'
    
    # Remove old output file if exists
    if os.path.exists(output_file):
        os.remove(output_file)
        print(f"Removed old output: {output_file}")
    
    # Run evaluation
    print(f"\nStarting {model_name} evaluation...")
    print(f"Config: {config_file}")
    print(f"Output: {output_file}")
    print(f"This will take 2-3 hours...\n")
    
    result = subprocess.run(
        ['uv', 'run', 'metacoder', 'eval', config_file, '-o', output_file],
        env=env,
        capture_output=True,
        text=True
    )
    
    if result.returncode == 0:
        print(f"✓ {model_name} evaluation complete!")
        return True
    else:
        print(f"✗ {model_name} evaluation failed:")
        print(result.stderr)
        return False

---

## Troubleshooting

### Common Issues

1. **Empty result files**: Check if evaluation crashed mid-run
   - Look for error logs
   - Verify API keys are valid
   - Check MCP server configurations

2. **API rate limits**: OpenAI may throttle requests
   - Evaluations automatically retry with backoff
   - Check API usage dashboards

3. **PubMed MCP failures**:
   - Verify `PUBMED_EMAIL` is set
   - Verify `PUBMED_API_KEY` is set (required for pubmed-mcp)
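
The credential checks above can be automated with a small preflight step. This is a sketch, not part of the evaluation scripts: `missing_env_vars` is a hypothetical helper, and the variable names are the three listed under Setup.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("OPENAI_API_KEY", "PUBMED_EMAIL", "PUBMED_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example use (not executed here): check the environment passed to subprocess
# missing = missing_env_vars(env)
# if missing:
#     raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

Passing the `env` dictionary that is handed to `subprocess` (rather than relying on `os.environ`) checks exactly what the evaluation process will see.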

### Monitoring Long-Running Evaluations

```bash
# Watch for new results being written
watch -n 60 'wc -l results/compare_models/codex_*.yaml'

# Check if metacoder is running
ps aux | grep metacoder

# Tail output if running in background
tail -f nohup.out
```
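
The same check can be done from a notebook cell. The sketch below lists result files with their sizes and flags empty ones, which usually indicate an incomplete or failed run. Both helper names are hypothetical, and the `codex_*.yaml` pattern is assumed from the output paths used in this notebook.

```python
from pathlib import Path


def result_file_sizes(results_dir: Path, pattern: str = "codex_*.yaml") -> dict[str, int]:
    """Map each matching result file name to its size in bytes."""
    if not results_dir.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(results_dir.glob(pattern))}


def empty_results(results_dir: Path, pattern: str = "codex_*.yaml") -> list[str]:
    """Return the names of result files that exist but contain no data."""
    return [name for name, size in result_file_sizes(results_dir, pattern).items() if size == 0]


# Example use (not executed here):
# print(result_file_sizes(Path("results/compare_models")))
# print(empty_results(Path("results/compare_models")))
```

For a run started with `background=True`, `process.poll()` on the returned `subprocess.Popen` object returns `None` while the evaluation is still running and the exit code once it has finished.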

## Next Steps

Once all evaluations are complete:

1. **Run analysis**: Open `experiment_2_cross_model_analysis.ipynb`
2. **Generate figures**: The analysis notebook creates publication-ready plots
3. **Compare with Experiment 1**: Cross-reference agent vs. model effects
4. **Update manuscript**: Integrate findings into Results section