# Experiment 1: Run Cross-Agent Evaluations

This notebook runs MCP evaluations across three coding agents: Claude Code, Goose, and Codex.

**Each agent's run takes 2-3 hours** and covers 25 cases × 4 MCP servers = 100 evaluations.

**Total time: ~2-3 hours** (if run in parallel) or ~6-9 hours (if run sequentially)

---

## Setup

### Environment Variables Required

**CRITICAL: ALL agents require `OPENAI_API_KEY`** (DeepEval uses OpenAI for evaluation scoring)

**Claude Code:**
- `OPENAI_API_KEY`: From `~/openai.key` (for DeepEval CorrectnessMetric)
- `ANTHROPIC_API_KEY`: From `~/cborg.key` (CBORG proxy)
- `ANTHROPIC_BASE_URL`: `https://api.cborg.lbl.gov`
- `PUBMED_EMAIL`, `PUBMED_API_KEY`: For MCP servers

**Goose:**
- `OPENAI_API_KEY`: From `~/openai.key` (for agent + DeepEval)
- `PUBMED_EMAIL`, `PUBMED_API_KEY`: For MCP servers

**Codex:**
- `OPENAI_API_KEY`: From `~/openai.key` (for agent + DeepEval)
- `PUBMED_EMAIL`, `PUBMED_API_KEY`: For MCP servers
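
Before launching a run, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch, not part of the evaluation tooling: the `REQUIRED_ENV` table and the `missing_env` helper are hypothetical names introduced here, and the variable lists are copied from this section.

```python
import os

# Required variables per agent, copied from the lists above.
REQUIRED_ENV = {
    "claude": ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL",
               "PUBMED_EMAIL", "PUBMED_API_KEY"],
    "goose": ["OPENAI_API_KEY", "PUBMED_EMAIL", "PUBMED_API_KEY"],
    "codex": ["OPENAI_API_KEY", "PUBMED_EMAIL", "PUBMED_API_KEY"],
}


def missing_env(agent, env=None):
    """Return the required variables that are unset or empty for `agent`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[agent] if not env.get(name)]


# Report the status of each agent against the current process environment.
for agent in REQUIRED_ENV:
    missing = missing_env(agent)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{agent}: {status}")
```

Passing an explicit `env` dict makes the check usable on a prepared environment before it is handed to a subprocess.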

In [None]:
import subprocess
import os
from pathlib import Path
from datetime import datetime
import yaml

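# Set working directory to project root.
# Step up one level only when the current directory itself is the notebook
# directory; matching against the full path would also trigger on any
# ancestor directory whose name happens to contain 'notebook'.
project_root = Path.cwd().parent if 'notebook' in Path.cwd().name else Path.cwd()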
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")

## Running Evaluations in Parallel

**Recommended:** Run all 3 agents simultaneously in separate terminals to complete in ~2-3 hours total.

Open **3 terminal windows** and run these commands:

### Terminal 1: Claude Code (via CBORG)

```bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export ANTHROPIC_API_KEY=$(cat ~/cborg.key)
export ANTHROPIC_BASE_URL=https://api.cborg.lbl.gov
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/literature_mcp_eval_config_claude.yaml \
  -o results/compare_agents/claude_$(date +%Y%m%d).yaml
```

**Note:** This run uses CBORG (LBL's cluster proxy) for Anthropic API access. `OPENAI_API_KEY` is still required for DeepEval's CorrectnessMetric (the evaluation scorer).

### Terminal 2: Goose + gpt-4o (via OpenAI)

```bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/literature_mcp_eval_config_goose_gpt4o.yaml \
  -o results/compare_agents/goose_$(date +%Y%m%d).yaml
```

### Terminal 3: Codex + gpt-4o (via OpenAI)

```bash
cd /Users/jtr4v/PythonProject/mcp_literature_eval
export OPENAI_API_KEY=$(cat ~/openai.key)
export PUBMED_EMAIL=justinreese@lbl.gov
export PUBMED_API_KEY=01eec0a16472164c6d69163bd28368311808
uv run metacoder eval project/literature_mcp_eval_config_codex.yaml \
  -o results/compare_agents/codex_$(date +%Y%m%d).yaml
```

**Note:** `OPENAI_API_KEY` is required for both the Codex agent and DeepEval's CorrectnessMetric (evaluation scorer).
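
As an alternative to three terminal windows, the same commands could be started from a single Python process. This is a sketch under assumptions: `build_command` and `launch_all` are hypothetical helpers, the config paths and output naming are copied from the commands above, and each process is assumed to receive a complete environment dict (Claude Code needs the `ANTHROPIC_*` variables, the other two do not).

```python
import subprocess
from datetime import datetime

# Config paths as used in the terminal commands above.
CONFIGS = {
    "claude": "project/literature_mcp_eval_config_claude.yaml",
    "goose": "project/literature_mcp_eval_config_goose_gpt4o.yaml",
    "codex": "project/literature_mcp_eval_config_codex.yaml",
}


def build_command(agent, date=None):
    """Build the `uv run metacoder eval` argument list for one agent."""
    # Mirrors the shell's $(date +%Y%m%d) when no date is given.
    date = date or datetime.now().strftime("%Y%m%d")
    output = f"results/compare_agents/{agent}_{date}.yaml"
    return ["uv", "run", "metacoder", "eval", CONFIGS[agent], "-o", output]


def launch_all(env_by_agent):
    """Start all three evaluations concurrently and wait for them to finish.

    `env_by_agent` maps agent name -> full environment dict for that process.
    Returns a dict of agent name -> exit code.
    """
    procs = {
        agent: subprocess.Popen(build_command(agent), env=env_by_agent[agent])
        for agent in CONFIGS
    }
    return {agent: proc.wait() for agent, proc in procs.items()}
```

All three processes are started before any `wait()` call, so total wall-clock time is bounded by the slowest agent rather than the sum of the three. Output from the processes is interleaved on the same console, which is one reason separate terminals may still be more comfortable for long runs.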

---

## Check Evaluation Status

In [None]:
from glob import glob

# Find all result files
result_files = sorted(glob("results/compare_agents/*.yaml"))

print("Experiment 1 Evaluation Results")
print("=" * 70)

agents_found = {}
for f in result_files:
    try:
        # Get file size first to skip empty files
        if os.path.getsize(f) == 0:
            print(f"\n⚠️  {Path(f).name}: EMPTY FILE (evaluation incomplete or failed)")
            continue
            
        with open(f, 'r') as file:
            data = yaml.safe_load(file)
            if data and 'results' in data:
                filename = Path(f).stem
                # Extract agent name: agent_DATE.yaml
                parts = filename.split('_')
                agent = parts[0]
                date = parts[1] if len(parts) > 1 else 'unknown'
                
                count = len(data['results'])
                passed = sum(1 for r in data['results'] if r.get('passed', False))
                pass_rate = (passed / count * 100) if count > 0 else 0
                
                agents_found[agent] = True
                
                print(f"\n{agent.upper()} ({date}):")
                print(f"  File: {f}")
                print(f"  Tests: {count}")
                print(f"  Passed: {passed}")
                print(f"  Pass rate: {pass_rate:.1f}%")
    except Exception as e:
        print(f"\nError reading {f}: {e}")

print("\n" + "=" * 70)

# Check completeness
expected_agents = {'claude', 'goose', 'codex'}
missing_agents = sorted(expected_agents - set(agents_found))  # sorted for stable output

if missing_agents:
    print(f"\n⚠️  Missing evaluations for: {', '.join(missing_agents)}")
else:
    print("\n✓ All agent evaluations complete!")
    print("\nNext step: Run analysis in experiment_1_cross_agent_analysis.ipynb")
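
The status cell above treats an agent as done as soon as its file contains any results. A stricter check could compare the result count against the expected 100 evaluations per agent. The sketch below assumes only what the cell above already relies on, namely that each result is a dict with an optional boolean `passed` key; `summarize` and `EXPECTED_EVALS` are hypothetical names.

```python
EXPECTED_EVALS = 100  # 25 cases x 4 MCP servers per agent


def summarize(results, expected=EXPECTED_EVALS):
    """Summarize a list of result dicts and flag incomplete runs."""
    count = len(results)
    passed = sum(1 for r in results if r.get("passed", False))
    return {
        "count": count,
        "passed": passed,
        "pass_rate": (passed / count * 100) if count else 0.0,
        "complete": count == expected,
    }
```

For example, `summarize(data['results'])['complete']` would be `False` for a run that was interrupted after 60 evaluations, even though the file is non-empty.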

---

## Configuration Details

### Agents Tested

1. **Claude Code** (claude-sonnet-4-20250514)
   - Provider: Anthropic via CBORG proxy
   - Config: `project/literature_mcp_eval_config_claude.yaml`

2. **Goose** (gpt-4o)
   - Provider: OpenAI
   - Config: `project/literature_mcp_eval_config_goose_gpt4o.yaml`

3. **Codex** (gpt-4o)
   - Provider: OpenAI
   - Config: `project/literature_mcp_eval_config_codex.yaml`

### MCP Servers Tested

1. **ARTL MCP** - Berkeley Lab Contextualizer AI
   - Repo: https://github.com/contextualizer-ai/artl-mcp
   - Command: `uvx artl-mcp`

2. **Simple PubMed MCP**
   - Repo: https://github.com/andybrandt/mcp-simple-pubmed
   - Command: `uvx mcp-simple-pubmed`
   - Env: `PUBMED_EMAIL=justinreese@lbl.gov`

3. **BioMCP**
   - Repo: https://github.com/genomoncology/biomcp
   - Command: `uv run --with biomcp-python biomcp run`

4. **PubMed MCP** (chrismannina)
   - Repo: https://github.com/chrismannina/pubmed-mcp
   - Command: `uv run --with git+https://github.com/chrismannina/pubmed-mcp@main -m src.main`
   - Env: `PUBMED_API_KEY`, `PUBMED_EMAIL=justinreese@lbl.gov`

### Test Cases

25 literature retrieval tasks organized into groups:
- **Text extraction** (11 cases)
- **Metadata** (4 cases)
- **Table/Figure/Legend extraction** (4 cases)
- **Supplementary material** (3 cases)
- **Summarization** (2 cases)
- **Publication status** (1 case)

See [TEST_CASES.md](../TEST_CASES.md) for complete test case details.
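
As a quick consistency check, the group sizes listed above can be summed to confirm they account for all 25 cases. The dictionary below simply restates the list; `TEST_CASE_GROUPS` is a name introduced here for illustration.

```python
# Group sizes restated from the list above.
TEST_CASE_GROUPS = {
    "Text extraction": 11,
    "Metadata": 4,
    "Table/Figure/Legend extraction": 4,
    "Supplementary material": 3,
    "Summarization": 2,
    "Publication status": 1,
}

total_cases = sum(TEST_CASE_GROUPS.values())
print(total_cases)  # 25

# Each case is run once against each of the 4 MCP servers.
evals_per_agent = total_cases * 4
print(evals_per_agent)  # 100
```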

### Evaluation Framework

- **Framework:** Metacoder (https://github.com/ai4curation/metacoder)
- **Metric:** CorrectnessMetric (semantic similarity via DeepEval)
- **Pass threshold:** 0.9 (90% semantic match)
- **Total evaluations per agent:** 100 (25 cases × 4 MCPs)
- **Duration:** ~2-3 hours per agent
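
The numbers above imply a rough per-evaluation time budget, and the threshold implies a simple pass rule. The sketch below is illustrative only: `is_pass` assumes that a score at or above the threshold counts as a pass, which is an assumption about the scorer rather than something stated in this notebook, and both function names are hypothetical.

```python
PASS_THRESHOLD = 0.9


def is_pass(score, threshold=PASS_THRESHOLD):
    """Assumed pass rule: a score at or above the threshold passes."""
    return score >= threshold


def seconds_per_evaluation(hours, evaluations=100):
    """Average wall-clock seconds per evaluation for a run of `hours`."""
    return hours * 3600 / evaluations
```

With 100 evaluations per agent, `seconds_per_evaluation(2)` gives 72.0 and `seconds_per_evaluation(3)` gives 108.0, so each evaluation averages roughly one to two minutes.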

### CBORG Configuration (Claude Code)

CBORG is LBL's cluster proxy for Anthropic API access:
- **Base URL:** `https://api.cborg.lbl.gov`
- **API Key:** Stored in `~/cborg.key`
- **Credit tracking:** Monitor usage on the CBORG dashboard

**Note:** If CBORG credits run low, switch to the Anthropic API directly:
```bash
export ANTHROPIC_API_KEY=$(cat ~/anthropic.key)
unset ANTHROPIC_BASE_URL
```
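
When launching Claude Code from Python rather than a shell, the same switch can be expressed as a small helper that builds the Anthropic-related variables. This is a sketch: `anthropic_env` and `read_key` are hypothetical helpers, and the key file locations (`~/cborg.key`, `~/anthropic.key`) are the ones named in this notebook.

```python
from pathlib import Path

CBORG_BASE_URL = "https://api.cborg.lbl.gov"


def read_key(path):
    """Read an API key from a file, expanding '~' and stripping whitespace."""
    return Path(path).expanduser().read_text().strip()


def anthropic_env(use_cborg, key):
    """Return the Anthropic-related variables for a Claude Code run.

    With use_cborg=False the base URL is omitted, which corresponds to
    `unset ANTHROPIC_BASE_URL` in the shell commands above.
    """
    env = {"ANTHROPIC_API_KEY": key}
    if use_cborg:
        env["ANTHROPIC_BASE_URL"] = CBORG_BASE_URL
    return env
```

A call such as `anthropic_env(True, read_key("~/cborg.key"))` yields the CBORG setup, and `anthropic_env(False, read_key("~/anthropic.key"))` yields the direct one. If the result is merged into a copy of `os.environ`, any inherited `ANTHROPIC_BASE_URL` also needs to be removed from that copy for the direct case.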

### References

- **MCP Server Catalog:** https://docs.google.com/spreadsheets/d/1506RuqfyUrBHd6lGNtY5j688CvVwk5mf5h_LEfsPIp4/edit?gid=614919216#gid=614919216
- **Experiment Design:** `notes/EXPERIMENT_1_RESULTS.md`