# Experiment 1: How to Run Cross-Agent Evaluations

This notebook documents the procedures for running MCP evaluations with different coding agents.

**Purpose:** Reproducibility documentation for cross-agent comparison experiments.

---

## Overview

We use the `metacoder` framework to evaluate different coding agents (Goose, Claude Code, Gemini) on literature retrieval tasks using various MCP servers.

**Evaluation Framework:** metacoder (https://github.com/Your-Org/metacoder)

**Test Configuration:** `project/literature_mcp_eval_config.yaml` (base config)

**Agents Tested:**
- Goose (goose-cli)
- Claude Code (claude CLI)
- Gemini (gemini-cli) - planned

**MCP Servers:**
- artl (ARTL MCP)
- simple-pubmed (simple PubMed MCP)
- biomcp (BioMCP)
- pubmed-mcp (PubMed MCP from chrismannina)

---

## Configuration Files

### Base Configuration: `project/literature_mcp_eval_config.yaml`

This file defines:
- **Coders:** `goose: {}`
- **Models:** `claude-4-sonnet` (Anthropic's claude-sonnet-4-20250514)
- **MCP Servers:** Connection details for all 4 MCPs
- **Test Cases:** 25 literature retrieval tasks

### Agent-Specific Configurations

- `literature_mcp_eval_config.yaml` - Goose
- `literature_mcp_eval_config_claude.yaml` - Claude Code
- `literature_mcp_eval_config_gemini.yaml` - Gemini

**Key Difference:** Only the `coders:` section changes between files. Each snippet below belongs to its own config file:

```yaml
# Goose
coders:
  goose: {}

# Claude Code
coders:
  claude: {}

# Gemini
coders:
  gemini: {}
```
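Because the agent-specific files differ only in `coders:`, a variant can in principle be derived from the base config rather than maintained by hand. The following is a minimal sketch, assuming the YAML has already been parsed into a Python dict. The `make_agent_config` helper and the stand-in `base` dict are hypothetical and are not part of metacoder.

```python
import copy


def make_agent_config(base_config: dict, agent: str) -> dict:
    """Return a copy of the base config whose `coders` section names only `agent`."""
    config = copy.deepcopy(base_config)  # leave the base config untouched
    config["coders"] = {agent: {}}       # e.g. {"claude": {}}
    return config


# Stand-in for the parsed base YAML; only the keys needed here are shown.
base = {"coders": {"goose": {}}, "cases": []}

claude_config = make_agent_config(base, "claude")
print(claude_config["coders"])  # {'claude': {}}
print(base["coders"])           # {'goose': {}}
```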

---

## MCP Server Configurations

### 1. ARTL (artl-mcp)
```yaml
artl:
  name: artl
  command: uvx
  args: [artl-mcp]
```

### 2. Simple PubMed
```yaml
simple-pubmed:
  name: pubmed
  command: uvx
  args: [mcp-simple-pubmed]
  env:
    PUBMED_EMAIL: ctparker@lbl.gov
```

### 3. BioMCP
```yaml
biomcp:
  name: biomcp
  command: uv
  args: ["run", "--with", "biomcp-python", "biomcp", "run"]
```

### 4. PubMed MCP (chrismannina)
```yaml
pubmed-mcp:
  name: pubmed-mcp
  command: uv
  args: ["run", "--with", "git+https://github.com/chrismannina/pubmed-mcp@main", "-m", "src.main"]
  env:
    PUBMED_API_KEY: "01eec0a16472164c6d69163bd28368311808"
```

**Note:** Each MCP is tested independently (4 separate runs per test case).
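To check a server outside the evaluation, it can help to see the shell command that each entry's `command` and `args` fields amount to. The sketch below transcribes two of the entries above into Python dicts and joins them into a command line. The `launch_command` helper is hypothetical, and how metacoder actually launches servers is not shown in this document.

```python
import shlex

# Two of the server entries above, transcribed as Python dicts.
SERVERS = {
    "artl": {"name": "artl", "command": "uvx", "args": ["artl-mcp"]},
    "biomcp": {
        "name": "biomcp",
        "command": "uv",
        "args": ["run", "--with", "biomcp-python", "biomcp", "run"],
    },
}


def launch_command(server: dict) -> str:
    """Join `command` and `args` into one shell-quoted command line."""
    return shlex.join([server["command"], *server["args"]])


for key, server in SERVERS.items():
    print(f"{key}: {launch_command(server)}")
```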

---

## Running Goose Evaluation

### Prerequisites
```bash
# Install goose
pip install goose-ai  # or appropriate installation method

# Set API keys
export ANTHROPIC_API_KEY=$(cat ~/anthropic.key)
export OPENAI_API_KEY=$(cat ~/openai.key)
```

### Run Script: `run_goose_eval_fixed.sh`
```bash
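#!/bin/bash
# Abort if any command fails or an unset variable is used.
set -euo pipefail
cd /Users/jtr4v/PythonProject/mcp_literature_eval
# Assign before exporting so a missing key file stops the script.
OPENAI_API_KEY="$(cat ~/openai.key)"
ANTHROPIC_API_KEY="$(cat ~/anthropic.key)"
export OPENAI_API_KEY ANTHROPIC_API_KEY
# Remove any previous output so the run starts clean.
rm -f results/compare_agents/goose_20251103.yaml
uv run metacoder eval project/literature_mcp_eval_config.yaml \
  -o results/compare_agents/goose_20251103.yaml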
```

### Execute
```bash
chmod +x run_goose_eval_fixed.sh
./run_goose_eval_fixed.sh
```

### Results
- **Output:** `results/compare_agents/goose_20251103.yaml`
- **Test Cases:** 25 cases × 4 MCP servers = 100 evaluations
- **Duration:** ~2-3 hours (depending on API response times)

---

## Running Claude Code Evaluation

### Prerequisites
```bash
# Install Claude Code CLI
# (installation method depends on distribution)

# Set API key
export ANTHROPIC_API_KEY=$(cat ~/anthropic.key)
```

### Run Script: `run_claude_eval.sh` (create this)
```bash
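#!/bin/bash
# Abort if any command fails or an unset variable is used.
set -euo pipefail
cd /Users/jtr4v/PythonProject/mcp_literature_eval
ANTHROPIC_API_KEY="$(cat ~/anthropic.key)"
export ANTHROPIC_API_KEY
# Compute the output path once so `rm` and `metacoder` use the same date.
OUTPUT="results/compare_agents/claude_$(date +%Y%m%d).yaml"
rm -f "$OUTPUT"
uv run metacoder eval project/literature_mcp_eval_config_claude.yaml \
  -o "$OUTPUT"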
```

### Execute
```bash
chmod +x run_claude_eval.sh
./run_claude_eval.sh
```

### Results
- **Output:** `results/compare_agents/claude_YYYYMMDD.yaml`
- **Test Cases:** 25 cases × 4 MCP servers = 100 evaluations
- **Duration:** ~2-3 hours

**Note:** The October 31, 2025 run wrote its output to `results/compare_agents/claude_20251031.yaml`.

---

## Running Gemini Evaluation (Planned)

### Prerequisites
```bash
# Install Gemini CLI
# Set API key
export GOOGLE_API_KEY=$(cat ~/google.key)
```

### Run Script: `run_gemini_eval.sh` (to be created)
```bash
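#!/bin/bash
# Abort if any command fails or an unset variable is used.
set -euo pipefail
cd /Users/jtr4v/PythonProject/mcp_literature_eval
GOOGLE_API_KEY="$(cat ~/google.key)"
export GOOGLE_API_KEY
# Compute the output path once so `rm` and `metacoder` use the same date.
OUTPUT="results/compare_agents/gemini_$(date +%Y%m%d).yaml"
rm -f "$OUTPUT"
uv run metacoder eval project/literature_mcp_eval_config_gemini.yaml \
  -o "$OUTPUT"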
```

### Execute
```bash
chmod +x run_gemini_eval.sh
./run_gemini_eval.sh
```
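The three run scripts differ only in the config file and the output name, so the command they issue can be built from the agent name and a run date. This is a sketch under that assumption. The helper names are hypothetical, and the only CLI invocation used is the `uv run metacoder eval <config> -o <output>` form shown in the scripts.

```python
from datetime import date
from pathlib import PurePosixPath


def config_path(agent: str) -> str:
    """Goose uses the base config; other agents use a suffixed copy."""
    suffix = "" if agent == "goose" else f"_{agent}"
    return str(PurePosixPath("project") / f"literature_mcp_eval_config{suffix}.yaml")


def output_path(agent: str, run_date: date) -> str:
    """Dated output file, matching the <agent>_YYYYMMDD.yaml naming above."""
    return str(PurePosixPath("results/compare_agents") / f"{agent}_{run_date:%Y%m%d}.yaml")


def build_eval_command(agent: str, run_date: date) -> list:
    """Argument list for the same command the shell scripts run."""
    return ["uv", "run", "metacoder", "eval",
            config_path(agent), "-o", output_path(agent, run_date)]


print(" ".join(build_eval_command("claude", date(2025, 10, 31))))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the project root; it is left unexecuted here so the sketch has no side effects.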

---

## Analysis Workflow

After collecting results from multiple agents:

### 1. Update Analysis Notebook
Edit `experiment_1_cross_agent_analysis.ipynb` to include new result files:

```python
result_files = {
    'goose': '../results/compare_agents/goose_20251103.yaml',
    'claude': '../results/compare_agents/claude_20251031.yaml',
    'gemini': '../results/compare_agents/gemini_YYYYMMDD.yaml',  # Add when ready
}
```
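While the Gemini run is still pending, its entry points at a placeholder file name. One way to keep the notebook runnable in the meantime is to drop entries whose file does not exist yet. The sketch below is an assumption about how that could be done, not code from the analysis notebook; `available_results` is a hypothetical helper, and the demo runs in a temporary directory.

```python
import tempfile
from pathlib import Path


def available_results(result_files: dict, base_dir: str = ".") -> dict:
    """Keep only the agents whose result file exists under base_dir."""
    base = Path(base_dir)
    return {agent: rel for agent, rel in result_files.items()
            if (base / rel).is_file()}


# Demo in a throwaway directory: one real file, one placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "goose_20251103.yaml").write_text("results: []\n")
    found = available_results(
        {"goose": "goose_20251103.yaml", "gemini": "gemini_YYYYMMDD.yaml"},
        base_dir=tmp,
    )

print(found)  # {'goose': 'goose_20251103.yaml'}
```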

### 2. Run Analysis
```bash
cd notebook
uv run jupyter nbconvert --execute --to notebook --inplace \
  experiment_1_cross_agent_analysis.ipynb
```

### 3. View Results
```bash
# Generated figures
ls -la ../results/figures/exp1_*.png

# Results summary
cat ../notes/EXPERIMENT_1_RESULTS.md
```

---

## Troubleshooting

### Goose MCP Extension Issues

**Problem:** Goose fails when the `pubmed-mcp` or `simple-pubmed` server is attached via `--with-extension`.

**Symptoms:**
```
Error: Command '[...goose', 'run', '-t', '...', '--with-extension', '...']'
returned non-zero exit status 1.
```

**Diagnosis:**
```bash
# Test manually
goose run -t "What is PMID:12345?" --with-extension "uvx mcp-simple-pubmed"
```

**Known Issues:**
- Goose's `--with-extension` may have compatibility issues with certain MCP launch commands
- The `uv run --with git+...` syntax may not work correctly as an extension

**Resolution:** See `notes/test_goose_extension_fix.sh` for debugging steps
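To confirm which servers are affected, execution failures can be tallied per server from a results file. The following is a sketch, assuming records shaped like those in the Output Format section (an `execution_metadata.success` flag and a `servers` list) and already parsed into Python dicts. The sample records are toy data for illustration, not measured results.

```python
from collections import Counter


def execution_failures_by_server(records: list) -> dict:
    """Count records whose execution failed, grouped by attached server."""
    failures = Counter()
    for record in records:
        if not record["execution_metadata"]["success"]:
            for server in record["servers"]:
                failures[server] += 1
    return dict(failures)


# Toy records (not real results) with only the fields used above.
sample = [
    {"servers": ["artl"], "execution_metadata": {"success": True}},
    {"servers": ["pubmed-mcp"], "execution_metadata": {"success": False}},
    {"servers": ["pubmed-mcp"], "execution_metadata": {"success": False}},
    {"servers": ["simple-pubmed"], "execution_metadata": {"success": False}},
]

print(execution_failures_by_server(sample))
```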

---

## Test Case Structure

Each test case includes:

```yaml
- name: PMID_28027860_Full_Text
  group: "Text extraction"
  metrics:
  - CorrectnessMetric
  input: "What is the first sentence of section 2 in PMID:28027860?"
  expected_output: "Even though many of NFLE's core features..."
  threshold: 0.9
```

**Test Case Groups:**
1. Text extraction (9 cases)
2. Metadata (8 cases)
3. Table/Figure/Legend extraction (4 cases)
4. Summarization (2 cases)
5. Supplementary material (1 case)
6. Publication status (1 case)

**Total:** 25 test cases × 4 MCP servers = 100 evaluations per agent
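The totals above can be checked with a few lines of arithmetic. The group sizes and server names are copied from this document; the variable names are only illustrative.

```python
# Number of test cases per group, as listed above.
GROUP_COUNTS = {
    "Text extraction": 9,
    "Metadata": 8,
    "Table/Figure/Legend extraction": 4,
    "Summarization": 2,
    "Supplementary material": 1,
    "Publication status": 1,
}
MCP_SERVERS = ["artl", "simple-pubmed", "biomcp", "pubmed-mcp"]

total_cases = sum(GROUP_COUNTS.values())
evaluations_per_agent = total_cases * len(MCP_SERVERS)

print(total_cases, evaluations_per_agent)  # 25 100
```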

---

## Evaluation Metrics

### CorrectnessMetric

Evaluates semantic similarity between expected and actual output using an LLM judge.

**Scoring:**
- **Score range:** 0.0 to 1.0
- **Pass threshold:** 0.9 (configurable per test case)
- **Method:** Semantic similarity with penalty for omissions and contradictions

**Evaluation Criteria:**
1. Factual accuracy
2. Completeness (no significant omissions)
3. No contradictions with expected output
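The pass decision can be pictured as a threshold comparison on the judge's score. This sketch assumes that a score equal to the threshold counts as a pass; that boundary behavior is an assumption and is not confirmed by this document. The `is_pass` helper is hypothetical, not the metric's real interface.

```python
def is_pass(score: float, threshold: float = 0.9) -> bool:
    """Assumed rule: pass when the score reaches the per-case threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {score}")
    return score >= threshold


print(is_pass(0.92))                 # True
print(is_pass(0.85))                 # False
print(is_pass(0.85, threshold=0.8))  # True
```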

---

## Output Format

Results are stored in YAML format:

```yaml
results:
  - model: claude-4-sonnet
    coder: goose
    case_name: PMID_28027860_Full_Text
    case_group: Text extraction
    metric_name: CorrectnessMetric
    score: 0.92
    passed: true
    reason: "Output matches expected with high semantic similarity"
    actual_output: "..."
    expected_output: "..."
    execution_time: 45.2
    servers: [artl]
    execution_metadata:
      success: true
      stdout: "..."
      stderr: "..."

summary:
  total_cases: 100
  passed: 47
  failed: 53
  pass_rate: 0.47
```
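The `summary` block can be recomputed from the per-record `passed` flags, which is a useful sanity check after merging or filtering results. This is a minimal sketch, assuming the records are already parsed into dicts. The `summarize` helper is hypothetical, and the demo records are synthetic, built only to mirror the example counts above.

```python
def summarize(records: list) -> dict:
    """Rebuild the summary fields from the per-record `passed` flags."""
    total = len(records)
    n_passed = sum(1 for record in records if record["passed"])
    return {
        "total_cases": total,
        "passed": n_passed,
        "failed": total - n_passed,
        "pass_rate": n_passed / total if total else 0.0,
    }


# Synthetic records matching the example summary: 47 passes, 53 failures.
records = [{"passed": True}] * 47 + [{"passed": False}] * 53
summary = summarize(records)
print(summary)
```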

---

## Reproducibility Checklist

To reproduce the cross-agent comparison:

- [ ] Install all coding agents (goose, claude, gemini)
- [ ] Set up API keys in `~/anthropic.key`, `~/openai.key`, and (for Gemini) `~/google.key`
- [ ] Install metacoder: `uv add metacoder`
- [ ] Verify MCP servers are accessible: `uvx artl-mcp`, `uvx mcp-simple-pubmed`, etc.
- [ ] Run Goose evaluation: `./run_goose_eval_fixed.sh`
- [ ] Run Claude evaluation: `./run_claude_eval.sh`
- [ ] Run Gemini evaluation: `./run_gemini_eval.sh`
- [ ] Execute analysis notebook: `jupyter nbconvert --execute ...`
- [ ] Review results in `notes/EXPERIMENT_1_RESULTS.md`
- [ ] Check generated figures in `results/figures/`
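
The first few checklist items can be partly automated with a preflight check that looks for the required executables on `PATH`. The sketch below assumes the executables are named `uv`, `uvx`, `goose`, `claude`, and `gemini`; the names may differ depending on how each tool was installed. The `missing_tools` helper is hypothetical.

```python
import shutil

# Assumed executable names; adjust to match the local installation.
REQUIRED_TOOLS = ["uv", "uvx", "goose", "claude", "gemini"]


def missing_tools(tools: list, which=shutil.which) -> list:
    """Return the tools that `which` cannot find on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is sufficient.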

---

## Known Limitations

1. **Goose MCP Extension Compatibility:** Under Goose, every `pubmed-mcp` and `simple-pubmed` evaluation fails with an execution error
2. **API Rate Limits:** Long-running evaluations may hit rate limits
3. **Evaluation Time:** Each full run takes 2-3 hours
4. **Non-determinism:** LLM responses may vary between runs
5. **Cost:** Each run costs ~$10-20 in API fees (varies by model)
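
For planning, the per-run figures above can be scaled to the number of agents. This is a rough estimate only: it assumes runs execute one after another and that every agent's run falls in the same 2-3 hour and $10-20 range, which has not been verified for Gemini.

```python
# Per-run ranges quoted in the list above: (low, high).
HOURS_PER_RUN = (2, 3)
COST_PER_RUN_USD = (10, 20)


def budget(n_agents: int) -> dict:
    """Scale the per-run ranges linearly by the number of sequential runs."""
    return {
        "hours": tuple(n_agents * h for h in HOURS_PER_RUN),
        "cost_usd": tuple(n_agents * c for c in COST_PER_RUN_USD),
    }


print(budget(3))  # {'hours': (6, 9), 'cost_usd': (30, 60)}
```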

---

## References

- **Metacoder Framework:** https://github.com/Your-Org/metacoder
- **ARTL MCP:** https://github.com/Your-Org/artl-mcp
- **Simple PubMed MCP:** https://github.com/Your-Org/mcp-simple-pubmed
- **BioMCP:** https://github.com/Your-Org/biomcp-python
- **PubMed MCP:** https://github.com/chrismannina/pubmed-mcp
- **Experiment Design:** `notes/experiment_1_cross_agent_comparison.md`
- **Analysis Notebook:** `notebook/experiment_1_cross_agent_analysis.ipynb`
- **Results:** `notes/EXPERIMENT_1_RESULTS.md`

---