# Benchmark Claude Opus on Sparse Pairs

The MedGemma 27B benchmark on the new 6,482-pair test set scored F1=0.634, and
performance correlated strongly with prompt length: sparse pairs (<2,000 chars)
reached only F1=0.278. This notebook benchmarks Claude Opus on 100 sparse pairs
to see whether that difficulty is inherent to sparse records or a model limitation.

In [1]:
import json
import random
import time

from datasets import load_dataset
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
)

MODEL = "claude-opus-4-6"
print(f"Using model: {MODEL} (via claude-agent-sdk, no API key needed)")

Using model: claude-opus-4-6 (via claude-agent-sdk, no API key needed)


## Load dataset and select sparse pairs

In [2]:
ds = load_dataset("abicyclerider/entity-resolution-pairs")
test = ds["test"]
print(f"Test set: {len(test)} pairs")

# Identify sparse pairs (< 2000 chars)
sparse_indices = [
    i for i in range(len(test))
    if len(test[i]["messages"][0]["content"]) < 2000
]
print(f"Sparse pairs (<2000 chars): {len(sparse_indices)}")

# Label balance in sparse pairs
sparse_labels = [test[i]["messages"][1]["content"] == "True" for i in sparse_indices]
print(f"  Match: {sum(sparse_labels)}, Non-match: {len(sparse_labels) - sum(sparse_labels)}")

Test set: 6482 pairs
Sparse pairs (<2000 chars): 915
  Match: 695, Non-match: 220


In [3]:
# Sample 100 sparse pairs (balanced)
SEED = 42
N_SAMPLE = 100

random.seed(SEED)

match_idx = [i for i in sparse_indices if test[i]["messages"][1]["content"] == "True"]
non_match_idx = [i for i in sparse_indices if test[i]["messages"][1]["content"] == "False"]

n_each = N_SAMPLE // 2
sampled = random.sample(match_idx, n_each) + random.sample(non_match_idx, n_each)
random.shuffle(sampled)

prompts = [test[i]["messages"][0]["content"] for i in sampled]
labels = [test[i]["messages"][1]["content"] == "True" for i in sampled]

lengths = [len(p) for p in prompts]
print(f"Sampled {len(sampled)} sparse pairs (seed={SEED})")
print(f"  Match: {sum(labels)}, Non-match: {len(labels) - sum(labels)}")
print(f"  Prompt length: min={min(lengths)}, median={sorted(lengths)[len(lengths)//2]}, max={max(lengths)}")

Sampled 100 sparse pairs (seed=42)
  Match: 50, Non-match: 50
  Prompt length: min=408, median=1513, max=1999


## Run Claude Opus inference

In [4]:
SYSTEM_PROMPT = (
    "You are a medical entity resolution assistant. "
    "Compare the two patient records and respond with only "
    "True (same patient) or False (different patients)."
)


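def parse_prediction(text):
    """Map a model reply to True/False, or None if neither label appears."""
    text = text.strip().lower()
    t_pos, f_pos = text.find("true"), text.find("false")
    if t_pos == -1 and f_pos == -1:
        return None
    # If both words appear, trust the one that comes first, so a reply such as
    # "False, the records are not truly alike" is not read as True.
    if f_pos == -1 or (t_pos != -1 and t_pos < f_pos):
        return True
    return False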


async def call_claude_async(prompt, system=SYSTEM_PROMPT):
    """Call Claude via claude-agent-sdk (subscription auth, no API key)."""
    from claude_agent_sdk import query, ClaudeAgentOptions

    result_parts = []
    async for message in query(
        prompt=prompt,
        options=ClaudeAgentOptions(
            model=MODEL,
            max_turns=1,
            allowed_tools=[],
            system_prompt=system,
        ),
    ):
        if hasattr(message, "content"):
            if isinstance(message.content, list):
                for block in message.content:
                    if hasattr(block, "text"):
                        result_parts.append(block.text)
            elif isinstance(message.content, str):
                result_parts.append(message.content)

    return "\n".join(result_parts)

In [5]:
# Quick test (1 pair) — uses subscription auth via claude-agent-sdk
test_resp = await call_claude_async(prompts[0])
print(f"Test response: '{test_resp}'")
print(f"Parsed: {parse_prediction(test_resp)}")
print(f"True label: {labels[0]}")

Test response: 'False'
Parsed: False
True label: False


In [6]:
# Run all 100 pairs via claude-agent-sdk
predictions = []
raw_responses = []
pair_times = []
unparseable = []

t_start = time.time()

for i, (prompt, label) in enumerate(zip(prompts, labels)):
    t0 = time.time()
    try:
        raw = await call_claude_async(prompt)
    except Exception as e:
        print(f"  ERROR on pair {i}: {e}")
        raw = ""
    elapsed = time.time() - t0
    pair_times.append(elapsed)
    raw_responses.append(raw)

    pred = parse_prediction(raw)
    predictions.append(pred)
    if pred is None:
        unparseable.append(i)

    if (i + 1) % 25 == 0:
        valid = [p for p in predictions if p is not None]
        valid_l = [l for p, l in zip(predictions, labels) if p is not None]
        acc = accuracy_score(valid_l, valid) if valid else 0
        total = time.time() - t_start
        print(
            f"  [{i+1}/{len(prompts)}] acc={acc:.3f}, "
            f"unparseable={len(unparseable)}, "
            f"elapsed={total:.0f}s ({total/(i+1):.1f}s/pair)"
        )

total_time = time.time() - t_start
print(f"\nDone in {total_time:.0f}s ({total_time/len(prompts):.1f}s/pair)")

  [25/100] acc=0.880, unparseable=0, elapsed=62s (2.5s/pair)
  [50/100] acc=0.860, unparseable=0, elapsed=132s (2.6s/pair)
  [75/100] acc=0.813, unparseable=0, elapsed=198s (2.6s/pair)
  [100/100] acc=0.800, unparseable=0, elapsed=270s (2.7s/pair)

Done in 270s (2.7s/pair)
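
The loop above records an empty response when a call raises, which then counts as unparseable. No errors occurred in this run, but for longer runs a retry wrapper may be worth having. The following is a minimal sketch only: `call_with_retry` and its parameters are hypothetical names, not part of the original notebook or of claude-agent-sdk.

```python
import asyncio
import random


async def call_with_retry(call, prompt, max_attempts=3, base_delay=1.0):
    """Await call(prompt), retrying with exponential backoff on any exception.

    Hypothetical helper for illustration; `call` is any async function
    that takes a prompt string, e.g. call_claude_async.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await call(prompt)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to record
            # Delay doubles each attempt, plus a little jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            print(f"  attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)
```

Inside the benchmark loop, `raw = await call_with_retry(call_claude_async, prompt)` would replace the direct call; the existing `try/except` still catches the final failure.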


## Results

In [7]:
valid_preds = [p for p in predictions if p is not None]
valid_labels = [l for p, l in zip(predictions, labels) if p is not None]

metrics = {
    "accuracy": round(accuracy_score(valid_labels, valid_preds), 4),
    "precision": round(precision_score(valid_labels, valid_preds, zero_division=0), 4),
    "recall": round(recall_score(valid_labels, valid_preds, zero_division=0), 4),
    "f1": round(f1_score(valid_labels, valid_preds, zero_division=0), 4),
}
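# Pin the label order so the matrix is always 2x2 (rows/cols: False, True),
# even if the labels or predictions happen to contain only one class.
cm = confusion_matrix(valid_labels, valid_preds, labels=[False, True]).tolist()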

print("=" * 60)
print(f"Claude Opus — Sparse Pairs ({len(sampled)} pairs, <2000 chars)")
print("=" * 60)
print(f"Parseable: {len(valid_preds)}/{len(predictions)} ({len(unparseable)} unparseable)")
for m, v in metrics.items():
    print(f"  {m:>10s}: {v:.3f}")
print(f"\nConfusion matrix:")
print(f"  TN={cm[0][0]}  FP={cm[0][1]}")
print(f"  FN={cm[1][0]}  TP={cm[1][1]}")
print(f"\nTiming: {total_time:.0f}s total ({total_time/len(prompts):.1f}s/pair)")

Claude Opus — Sparse Pairs (100 pairs, <2000 chars)
Parseable: 100/100 (0 unparseable)
    accuracy: 0.800
   precision: 0.841
      recall: 0.740
          f1: 0.787

Confusion matrix:
  TN=43  FP=7
  FN=13  TP=37

Timing: 270s total (2.7s/pair)
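
With only 100 pairs, the point estimates above carry noticeable sampling uncertainty. As a rough check, the sketch below recomputes the metrics directly from the confusion-matrix counts and adds a 95% Wilson score interval for accuracy. The choice of the Wilson interval and the helper name `wilson_interval` are assumptions made for illustration, not part of the original analysis.

```python
import math

# Counts copied from the confusion matrix printed above.
tn, fp, fn, tp = 43, 7, 13, 37
n = tn + fp + fn + tp

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low, high = wilson_interval(tp + tn, n)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"95% Wilson interval for accuracy: [{low:.3f}, {high:.3f}]")
```

The recomputed metrics agree with the scikit-learn output, and the interval spans roughly 0.71 to 0.87, so differences of a few points against other runs should be read with care.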


In [8]:
# Comparison table
print("\n" + "=" * 60)
print("Comparison")
print("=" * 60)

baselines = [
    ("MedGemma 27B (all 6482)",    0.729, 0.977, 0.469, 0.634, "new test set"),
    ("MedGemma 27B (sparse only)", 0.358, 0.958, 0.163, 0.278, "<2000 chars, n=915"),
    ("MedGemma 27B (old 338)",     0.799, 0.848, 0.728, 0.783, "original test set"),
    ("Claude Opus (old 338)",      0.940, 0.919, 0.953, 0.936, "original test set"),
]

print(f"{'Model':<30s}  {'Acc':>6s}  {'Prec':>6s}  {'Rec':>6s}  {'F1':>6s}  Dataset")
print("-" * 90)
for name, acc, prec, rec, f1, note in baselines:
    print(f"{name:<30s}  {acc:>6.3f}  {prec:>6.3f}  {rec:>6.3f}  {f1:>6.3f}  {note}")
print(f"{'Claude Opus (THIS RUN)':<30s}  {metrics['accuracy']:>6.3f}  "
      f"{metrics['precision']:>6.3f}  {metrics['recall']:>6.3f}  {metrics['f1']:>6.3f}  "
      f"sparse, {len(sampled)} pairs")


Comparison
Model                              Acc    Prec     Rec      F1  Dataset
------------------------------------------------------------------------------------------
MedGemma 27B (all 6482)          0.729   0.977   0.469   0.634  new test set
MedGemma 27B (sparse only)       0.358   0.958   0.163   0.278  <2000 chars, n=915
MedGemma 27B (old 338)           0.799   0.848   0.728   0.783  original test set
Claude Opus (old 338)            0.940   0.919   0.953   0.936  original test set
Claude Opus (THIS RUN)           0.800   0.841   0.740   0.787  sparse, 100 pairs
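
One caveat when reading this table: the MedGemma sparse row was measured on all 915 sparse pairs (695 matches, 220 non-matches), while this run used a balanced 50/50 sample. Precision, F1 and accuracy depend on the class mix, so the two rows are not directly comparable. The sketch below re-weights this run's per-class rates to the 695/220 mix. It assumes the true-positive and false-positive rates measured on the 100-pair sample carry over to the full sparse subset, which this notebook does not verify.

```python
# Per-class rates from this run's confusion matrix (TN=43, FP=7, FN=13, TP=37).
tpr = 37 / (37 + 13)  # recall on true matches
fpr = 7 / (7 + 43)    # false-positive rate on true non-matches

# Class mix of the full sparse subset: 695 matches out of 915 pairs.
prevalence = 695 / 915

# Assumption: tpr and fpr from the balanced sample also hold on the full subset.
tp_mass = tpr * prevalence              # expected fraction of pairs that are TP
fp_mass = fpr * (1 - prevalence)        # expected fraction that are FP
tn_mass = (1 - fpr) * (1 - prevalence)  # expected fraction that are TN

adj_precision = tp_mass / (tp_mass + fp_mass)
adj_f1 = 2 * adj_precision * tpr / (adj_precision + tpr)
adj_accuracy = tp_mass + tn_mass

print(f"prevalence-adjusted: acc={adj_accuracy:.3f} "
      f"prec={adj_precision:.3f} rec={tpr:.3f} f1={adj_f1:.3f}")
```

Under that assumption, the adjusted F1 comes out at roughly 0.83, compared with 0.278 for MedGemma on the same class mix. Recall does not depend on the class mix, so 0.740 versus 0.163 is the most direct comparison.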


In [9]:
# Save results
results = {
    "model": MODEL,
    "system_prompt": SYSTEM_PROMPT,
    "dataset": "abicyclerider/entity-resolution-pairs",
    "subset": "sparse (<2000 chars)",
    "n_sampled": len(sampled),
    "n_parseable": len(valid_preds),
    "seed": SEED,
    "metrics": metrics,
    "confusion_matrix": cm,
    "timing_s": round(total_time, 1),
    "per_pair": [
        {
            "dataset_index": sampled[i],
            "true_label": labels[i],
            "prediction": predictions[i],
            "raw_response": raw_responses[i],
            "prompt_length": len(prompts[i]),
            "time_s": round(pair_times[i], 2),
        }
        for i in range(len(sampled))
    ],
}

out_path = "benchmark_claude_sparse_results.json"
with open(out_path, "w") as f:
    json.dump(results, f, indent=2)
print(f"Results saved to {out_path}")

Results saved to benchmark_claude_sparse_results.json


## Error analysis

In [10]:
# Show misclassified pairs
errors = [
    (i, sampled[i], labels[i], predictions[i], raw_responses[i], len(prompts[i]))
    for i in range(len(sampled))
    if predictions[i] is not None and predictions[i] != labels[i]
]

print(f"Total errors: {len(errors)}/{len(valid_preds)}")
fn = [e for e in errors if e[2] and not e[3]]  # true match predicted as non-match
fp = [e for e in errors if not e[2] and e[3]]  # true non-match predicted as match
print(f"  False negatives (missed matches): {len(fn)}")
print(f"  False positives (false alarms):   {len(fp)}")

# Show first 3 FN examples
print("\n" + "=" * 60)
print("Sample false negatives")
print("=" * 60)
for idx, (i, ds_idx, true_l, pred, resp, length) in enumerate(fn[:3]):
    print(f"\n--- FN {idx+1} (dataset idx {ds_idx}, {length} chars) ---")
    print(prompts[i][:600])
    if len(prompts[i]) > 600:
        print("...")
    print(f"  Response: '{resp}' | True: {true_l}")

Total errors: 20/100
  False negatives (missed matches): 13
  False positives (false alarms):   7

Sample false negatives

--- FN 1 (dataset idx 1256, 1064 chars) ---
You are a medical record matching expert. Compare these two patient medical records and determine if they belong to the same patient based only on their clinical history.

Record A:
CONDITIONS:
  2020: Medication review due (situation)
  2021: Sprain (morphologic abnormality); Sprain of ankle

MEDICATIONS:
- Naproxen sodium 220 MG Oral Tablet (2021–2021)

OBSERVATIONS:
- Body Height: 174.1 cm (2020-08-26)
- Body Weight: 83.2 kg (2020-08-26)
- Body Mass Index: 27.5 kg/m2 (2020-08-26)
- Systolic Blood Pressure: 125.0 mm[Hg] (2020-08-26)
- Diastolic Blood Pressure: 80.0 mm[Hg] (2020-08-26)

PROCE
...
  Response: 'False' | True: True

--- FN 2 (dataset idx 1560, 1824 chars) ---
You are a medical record matching expert. Compare these two patient medical records and determine if they belong to the same patient based only on the