# Experiment 1: Run Cross-Agent Evaluations

This notebook runs MCP evaluations across three coding agents: Claude Code, Goose, and Gemini.

**Each agent's evaluation run takes 2-3 hours** and covers 25 test cases × 4 MCP servers = 100 evaluations.

---

## Setup

In [None]:
import subprocess
import os
from pathlib import Path
from datetime import datetime

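# Set the working directory to the project root (the notebook's parent directory).
# The marker check keeps this cell idempotent: without it, re-running the cell
# would climb one directory higher each time.
project_root = Path.cwd()
if not (project_root / "run_claude_eval.sh").exists():
    project_root = project_root.parent
os.chdir(project_root)
print(f"Working directory: {os.getcwd()}")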

## 1. Run Claude Code Evaluation

**Command:** `./run_claude_eval.sh`

**Duration:** ~2-3 hours

**Output:** `results/compare_agents/claude_YYYYMMDD.yaml`
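
The dated output name can be reconstructed in Python to check whether a given run has already produced its file. This is a minimal sketch: it assumes the `YYYYMMDD` stamp is the calendar date of the run, and `expected_result_path` is a hypothetical helper, not part of the evaluation scripts. The same pattern would apply to the `goose` and `gemini` outputs.

```python
from datetime import date
from pathlib import Path

RESULTS_DIR = Path("results/compare_agents")

def expected_result_path(agent: str, run_date: date) -> Path:
    """Build the dated result path, e.g. results/compare_agents/claude_20251031.yaml."""
    # Assumption: the stamp in the file name is the run date formatted as YYYYMMDD.
    return RESULTS_DIR / f"{agent}_{run_date:%Y%m%d}.yaml"

path = expected_result_path("claude", date(2025, 10, 31))
print(path, "exists" if path.exists() else "not found")
```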

In [None]:
# Run Claude evaluation
# Uncomment to run:
# !./run_claude_eval.sh

print("Claude evaluation script: ./run_claude_eval.sh")
print("Status: Already completed (results/compare_agents/claude_20251031.yaml)")

## 2. Run Goose Evaluation

**Command:** `./run_goose_eval.sh`

**Duration:** ~2-3 hours

**Output:** `results/compare_agents/goose_YYYYMMDD.yaml`

In [None]:
# Run Goose evaluation
# Uncomment to run:
# !./run_goose_eval.sh

print("Goose evaluation script: ./run_goose_eval.sh")
print("Status: Needs re-run with fixed MCP configs")

## 3. Run Gemini Evaluation

**Command:** `./run_gemini_eval.sh`

**Duration:** ~2-3 hours

**Output:** `results/compare_agents/gemini_YYYYMMDD.yaml`

In [None]:
# Run Gemini evaluation
# Uncomment to run:
# !./run_gemini_eval.sh

print("Gemini evaluation script: ./run_gemini_eval.sh")
print("Status: Currently running in background")
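
Because each script runs for hours, it may be more convenient to launch it from Python and keep its output in a log file than to use the `!` shell escape. The sketch below shows one way to do that with `subprocess.run`. The `run_and_log` helper and the `logs/gemini_eval.log` path are hypothetical and are not part of the project.

```python
import subprocess
from pathlib import Path

def run_and_log(command: list[str], log_path: Path) -> int:
    """Run a command, write its stdout and stderr to log_path, and return the exit code."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "w") as log_file:
        # stderr is merged into stdout so the log keeps both streams in order.
        completed = subprocess.run(command, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode

# Example (left commented out because a full run takes 2-3 hours):
# exit_code = run_and_log(["./run_gemini_eval.sh"], Path("logs/gemini_eval.log"))
```

A non-zero return value indicates that the script exited with an error, in which case the log file is the first place to look.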

## Check Evaluation Status

In [None]:
import yaml
from glob import glob

# Find all result files
result_files = sorted(glob('results/compare_agents/*.yaml'))

print("Available evaluation results:")
print("=" * 70)

for f in result_files:
    try:
        with open(f, 'r') as file:
            data = yaml.safe_load(file)
    except (OSError, yaml.YAMLError) as e:
        print(f"\nError reading {f}: {e}")
        continue

    # Report files without results instead of silently skipping them
    if not isinstance(data, dict) or 'results' not in data:
        print(f"\nSkipping {f}: no 'results' section")
        continue

    # File names follow <agent>_<YYYYMMDD>.yaml
    agent, _, date = Path(f).stem.partition('_')
    count = len(data['results'])
    summary = data.get('summary') or {}
    pass_rate = summary.get('pass_rate', 0) * 100

    print(f"\n{agent.upper()} ({date}):")
    print(f"  File: {f}")
    print(f"  Tests: {count}")
    print(f"  Pass rate: {pass_rate:.1f}%")

print("\n" + "=" * 70)

## Run Analysis

Once all evaluations are complete, run the cross-agent analysis:

In [None]:
# Run cross-agent analysis notebook
# Uncomment when all evaluations are complete:
# !jupyter nbconvert --execute --to notebook --inplace experiment_1_cross_agent_analysis.ipynb

print("Analysis notebook: experiment_1_cross_agent_analysis.ipynb")
print("Run after all three evaluations are complete")
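
Before executing the analysis notebook, it may help to confirm that every agent has a full set of results. The sketch below assumes a complete run contains 100 entries per agent (25 cases × 4 MCP servers). The counts shown are placeholders for illustration, not actual results.

```python
EXPECTED_AGENTS = ("claude", "goose", "gemini")
EXPECTED_EVALS = 100  # 25 cases x 4 MCP servers

def incomplete_agents(result_counts: dict[str, int]) -> list[str]:
    """Return agents whose result count is missing or below the expected total."""
    return [a for a in EXPECTED_AGENTS if result_counts.get(a, 0) < EXPECTED_EVALS]

# Hypothetical counts, as would be collected from len(data['results']) per file.
counts = {"claude": 100, "goose": 37}
pending = incomplete_agents(counts)
print("Ready for analysis" if not pending else f"Still incomplete: {pending}")
```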

---

## Configuration Details

### MCP Servers Tested

1. **ARTL MCP** - Berkeley Lab Contextualizer AI
   - Repo: https://github.com/contextualizer-ai/artl-mcp
   - Command: `uvx artl-mcp`

2. **Simple PubMed MCP**
   - Repo: https://github.com/andybrandt/mcp-simple-pubmed
   - Command: `uvx mcp-simple-pubmed`
   - Env: `PUBMED_EMAIL=justinreese@lbl.gov`

3. **BioMCP**
   - Repo: https://github.com/genomoncology/biomcp
   - Command: `uv run --with biomcp-python biomcp run`

4. **PubMed MCP** (chrismannina)
   - Repo: https://github.com/chrismannina/pubmed-mcp
   - Command: `uv run --with git+https://github.com/chrismannina/pubmed-mcp@main -m src.main`
   - Env: `PUBMED_API_KEY`, `PUBMED_EMAIL=justinreese@lbl.gov`
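
For scripting purposes, the four launch commands above could be collected into one structure. The commands and environment variable names are copied from the list. The dictionary layout and the short keys (`artl`, `simple-pubmed`, and so on) are assumptions made for illustration and do not reflect the format Metacoder or the agents expect.

```python
import shlex

# Launch commands copied from the list above; the layout and keys are illustrative.
MCP_SERVERS = {
    "artl": {"command": "uvx artl-mcp", "env": []},
    "simple-pubmed": {"command": "uvx mcp-simple-pubmed", "env": ["PUBMED_EMAIL"]},
    "biomcp": {"command": "uv run --with biomcp-python biomcp run", "env": []},
    "pubmed-mcp": {
        "command": "uv run --with git+https://github.com/chrismannina/pubmed-mcp@main -m src.main",
        "env": ["PUBMED_API_KEY", "PUBMED_EMAIL"],
    },
}

for name, spec in MCP_SERVERS.items():
    # shlex.split turns the command string into an argv list.
    argv = shlex.split(spec["command"])
    print(f"{name}: executable={argv[0]!r}, args={argv[1:]}, env={spec['env']}")
```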

### Test Cases

25 literature retrieval tasks organized into groups:
- Text extraction (9 cases)
- Metadata (8 cases)
- Table/Figure/Legend extraction (4 cases)
- Summarization (2 cases)
- Supplementary material (1 case)
- Publication status (1 case)
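
As a quick sanity check, the group sizes above can be summed and multiplied by the number of MCP servers. The variable names are illustrative.

```python
CASES_PER_GROUP = {
    "Text extraction": 9,
    "Metadata": 8,
    "Table/Figure/Legend extraction": 4,
    "Summarization": 2,
    "Supplementary material": 1,
    "Publication status": 1,
}
NUM_MCP_SERVERS = 4

total_cases = sum(CASES_PER_GROUP.values())
evaluations_per_agent = total_cases * NUM_MCP_SERVERS
print(total_cases, evaluations_per_agent)  # prints: 25 100
```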

### Evaluation Framework

- **Framework:** Metacoder (https://github.com/ai4curation/metacoder)
- **Metric:** CorrectnessMetric (semantic similarity)
- **Pass threshold:** 0.9
- **Total evaluations per agent:** 100 (25 cases × 4 MCPs)
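
To show how the pass threshold could translate into a pass rate, the sketch below counts a case as passing when its score is at or above 0.9. Whether Metacoder treats the threshold as inclusive, and how it aggregates scores internally, are assumptions here. The scores are invented for illustration and are not results from this experiment.

```python
PASS_THRESHOLD = 0.9

def pass_rate(scores: list[float], threshold: float = PASS_THRESHOLD) -> float:
    """Fraction of scores at or above the threshold (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)

# Made-up scores for illustration only; not results from this experiment.
example_scores = [0.95, 0.9, 0.42, 1.0]
print(f"{pass_rate(example_scores):.1%}")  # prints: 75.0%
```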

### References

- **MCP Server Catalog:** https://docs.google.com/spreadsheets/d/1506RuqfyUrBHd6lGNtY5j688CvVwk5mf5h_LEfsPIp4/edit?gid=614919216#gid=614919216
- **Experiment Design:** `notes/experiment_1_cross_agent_comparison.md`
- **Results Documentation:** `notes/EXPERIMENT_1_RESULTS.md`