# Multi-Model Evaluation

This notebook demonstrates how to compare LLM performance across multiple
answering models using a single benchmark. You'll learn to configure models,
run verification, and analyze per-model results.

| Step | What You'll Do |
|------|---------------|
| 1 | Configure multiple answering models |
| 2 | Run verification across all models |
| 3 | Compare models by pass rate |
| 4 | Drill into per-question performance |
| 5 | Use `from_overrides` for separate model runs |
| 6 | Measure variance with replicates |

See [Multi-Model Evaluation](../06-running-verification/multi-model.md) for
full documentation.

In [1]:
# Mock cell: patches run_verification so examples execute without live API keys.
# This cell is hidden in the rendered documentation.
import datetime
import os
from unittest.mock import patch

from karenina.schemas.results import VerificationResultSet
from karenina.schemas.verification import VerificationConfig, VerificationResult
from karenina.schemas.verification.model_identity import ModelIdentity
from karenina.schemas.verification.result_components import (
    VerificationResultMetadata,
    VerificationResultTemplate,
)

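# "__file__" is a literal string here (notebooks do not define __file__), so this
# resolves to the current working directory, where test_checkpoint.jsonld is loaded from.
os.chdir(os.path.dirname(os.path.abspath("__file__")))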


def _mock_run_verification(self, config, question_ids=None, **kwargs):
    """Return realistic mock results for multi-model documentation."""
    qids = question_ids or self.get_question_ids()
    mock_results = []

    # Model performance profiles — different models have different strengths
    # Keys are substrings that match questions in test_checkpoint.jsonld
    model_profiles = {
        "gpt-4o": {
            "capital of France": True,
            "6 multiplied by 7": True,
            "atomic number 8": True,
            "prime number": True,
        },
        "claude-sonnet-4-5-20250514": {
            "capital of France": True,
            "6 multiplied by 7": True,
            "atomic number 8": True,
            "prime number": True,
        },
        "gemini-2.0-flash": {
            "capital of France": True,
            "6 multiplied by 7": True,
            "atomic number 8": False,
            "prime number": True,
        },
    }

    for qid in qids:
        q = self.get_question(qid)
        q_text = q.get("question", "")
        has_template = self.has_template(qid)

        for ans_model in config.answering_models:
            model_name = ans_model.model_name
            profile = model_profiles.get(model_name, {})

            for rep in range(config.replicate_count):
                for parse_model in config.parsing_models:
                    passed = any(
                        key in q_text and profile.get(key, False)
                        for key in profile
                    )
                    if not has_template:
                        passed = None

                    answering = ModelIdentity(
                        interface=ans_model.interface,
                        model_name=ans_model.model_name,
                    )
                    parsing = ModelIdentity(
                        interface=parse_model.interface,
                        model_name=parse_model.model_name,
                    )

                    template_result = None
                    if has_template:
                        template_result = VerificationResultTemplate(
                            raw_llm_response=f"Mock answer from {model_name}",
                            verify_result=passed,
                            template_verification_performed=True,
                        )

                    metadata = VerificationResultMetadata(
                        question_id=qid,
                        template_id=(
                            self.get_template(qid)[:10] + "..."
                            if has_template
                            else "no_template"
                        ),
                        completed_without_errors=True,
                        question_text=q_text,
                        answering=answering,
                        parsing=parsing,
                        execution_time=1.2,
                        timestamp=datetime.datetime.now(
                            datetime.UTC
                        ).isoformat(),
                        result_id=f"mock-{qid[:8]}-{model_name[:4]}-r{rep}",
                        run_name=kwargs.get("run_name"),
                        replicate=rep,
                    )

                    mock_results.append(
                        VerificationResult(
                            metadata=metadata, template=template_result
                        )
                    )

    return VerificationResultSet(results=mock_results)


_patcher1 = patch(
    "karenina.benchmark.benchmark.Benchmark.run_verification",
    _mock_run_verification,
)
_patcher2 = patch(
    "karenina.schemas.verification.config.VerificationConfig._validate_config",
    lambda self: None,
)
_ = _patcher1.start()
_ = _patcher2.start()

## Step 1: Configure Multiple Models

Provide a list of answering models to compare. Each model is a `ModelConfig`
with its own provider and interface settings.

In [2]:
from karenina.benchmark import Benchmark
from karenina.schemas import ModelConfig, VerificationConfig

benchmark = Benchmark.load("test_checkpoint.jsonld")

config = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="gpt4o",
            model_name="gpt-4o",
            model_provider="openai",
            interface="langchain",
        ),
        ModelConfig(
            id="claude-sonnet",
            model_name="claude-sonnet-4-5-20250514",
            model_provider="anthropic",
            interface="langchain",
        ),
        ModelConfig(
            id="gemini-flash",
            model_name="gemini-2.0-flash",
            model_provider="google_genai",
            interface="langchain",
        ),
    ],
    parsing_models=[
        ModelConfig(
            id="parser",
            model_name="gpt-4o-mini",
            model_provider="openai",
            interface="langchain",
        ),
    ],
)

total_tasks = (
    benchmark.question_count
    * len(config.answering_models)
    * len(config.parsing_models)
    * config.replicate_count
)
print(f"Questions: {benchmark.question_count}")
print(f"Answering models: {len(config.answering_models)}")
print(f"Parsing models: {len(config.parsing_models)}")
print(f"Total tasks: {total_tasks}")

Questions: 5
Answering models: 3
Parsing models: 1
Total tasks: 15
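
The task count above is the size of a Cartesian product over questions, answering models, parsing models, and replicates. The sketch below reproduces that arithmetic in plain Python without touching the library. The question IDs (`q1` to `q5`) are placeholders assumed for illustration, not identifiers from `test_checkpoint.jsonld`.

```python
from itertools import product

# Placeholder IDs for illustration; the real benchmark supplies its own.
question_ids = [f"q{i}" for i in range(1, 6)]
answering = ["gpt-4o", "claude-sonnet-4-5-20250514", "gemini-2.0-flash"]
parsing = ["gpt-4o-mini"]
replicates = range(1)  # a single replicate, matching the run above

# One task per (question, answering model, parsing model, replicate) tuple.
tasks = list(product(question_ids, answering, parsing, replicates))
print(len(tasks))
```

Adding a second parsing model or raising the replicate count multiplies the total, which is worth checking before a run against paid APIs.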


## Step 2: Run Verification

A single `run_verification` call executes every question against every answering/parsing model combination and returns one combined `VerificationResultSet`:

In [3]:
results = benchmark.run_verification(config)
print(f"Total results: {len(results.results)}")

Total results: 15


## Step 3: Compare Models by Pass Rate

Group results by answering model to see overall performance:

In [4]:
by_model = results.group_by_model(by="answering")

for model_name, model_results in by_model.items():
    summary = model_results.get_summary()
    pass_info = summary.get("template_pass_overall", {})
    passed = pass_info.get("passed", 0)
    total = pass_info.get("total", 0)
    pct = pass_info.get("pass_pct", 0)
    print(f"{model_name}: {passed}/{total} passed ({pct:.0f}%)")

langchain:gpt-4o: 4/4 passed (100%)
langchain:claude-sonnet-4-5-20250514: 4/4 passed (100%)
langchain:gemini-2.0-flash: 3/4 passed (75%)


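Filter to a specific model using the `interface:model_name` format. The filtered set holds 5 results rather than 4 because it keeps every question, including the one without a template that the pass totals above leave out: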

In [5]:
gpt4_results = results.filter(answering_models=["langchain:gpt-4o"])
print(f"GPT-4o results: {len(gpt4_results.results)}")

GPT-4o results: 5
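
The pass totals and the result counts differ because a result with no verdict is not scored. The sketch below shows one way such a pass rate could be computed from flat records. It is an illustration only, not the library's `get_summary` implementation, and the `records` list and `pass_rates` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical flat records: (answering model, verdict).
# None marks a question with no template, so there is nothing to verify.
records = [
    ("gpt-4o", True), ("gpt-4o", True), ("gpt-4o", True),
    ("gpt-4o", True), ("gpt-4o", None),
    ("gemini-2.0-flash", True), ("gemini-2.0-flash", True),
    ("gemini-2.0-flash", False), ("gemini-2.0-flash", True),
    ("gemini-2.0-flash", None),
]


def pass_rates(records):
    """Return {model: (passed, total, pct)}, ignoring None verdicts."""
    counts = defaultdict(lambda: [0, 0])  # model -> [passed, total]
    for model, verdict in records:
        if verdict is None:
            continue  # not scored: no template to verify against
        counts[model][1] += 1
        if verdict:
            counts[model][0] += 1
    return {
        model: (passed, total, 100.0 * passed / total)
        for model, (passed, total) in counts.items()
    }


rates = pass_rates(records)
for model, (passed, total, pct) in rates.items():
    print(f"{model}: {passed}/{total} passed ({pct:.0f}%)")
```

Each model has five records here but a denominator of four, mirroring the gap between the result count and the pass totals in this step.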


## Step 4: Per-Question Comparison

See how each model performed on the same question:

In [6]:
by_question = results.group_by_question()

for qid, q_results in list(by_question.items())[:3]:
    q_text = q_results.results[0].metadata.question_text
    print(f"\n{q_text[:50]}...")
    for r in q_results.results:
        model = r.metadata.answering.model_name
        # Results without a template (or without a verdict) are shown as N/A.
        verdict = r.template.verify_result if r.template else None
        status = {True: "PASS", False: "FAIL"}.get(verdict, "N/A")
        print(f"  {model}: {status}")


What is the capital of France?...
  gpt-4o: PASS
  claude-sonnet-4-5-20250514: PASS
  gemini-2.0-flash: PASS

What is 6 multiplied by 7?...
  gpt-4o: PASS
  claude-sonnet-4-5-20250514: PASS
  gemini-2.0-flash: PASS

What element has the atomic number 8? Provide both...
  gpt-4o: PASS
  claude-sonnet-4-5-20250514: PASS
  gemini-2.0-flash: FAIL
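
With more than a handful of questions, the rows worth reading are the ones where models disagree. The sketch below filters for those, assuming the results have already been flattened into plain `(question, model, verdict)` tuples. The `rows` data and the `find_disagreements` helper are hypothetical and not part of the library.

```python
from collections import defaultdict

# Hypothetical (question, model, verdict) rows mirroring the output above.
rows = [
    ("capital of France", "gpt-4o", True),
    ("capital of France", "gemini-2.0-flash", True),
    ("atomic number 8", "gpt-4o", True),
    ("atomic number 8", "gemini-2.0-flash", False),
]


def find_disagreements(rows):
    """Return {question: {model: verdict}} for questions with mixed verdicts."""
    by_question = defaultdict(dict)
    for question, model, verdict in rows:
        by_question[question][model] = verdict
    return {
        question: verdicts
        for question, verdicts in by_question.items()
        if len(set(verdicts.values())) > 1
    }


disagreements = find_disagreements(rows)
for question, verdicts in disagreements.items():
    failing = [m for m, v in verdicts.items() if v is False]
    print(f"{question}: failed by {', '.join(failing)}")
```

Questions that every model passes or every model fails drop out, leaving only the cases that separate the models.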


## Step 5: Separate Runs with `from_overrides`

To get an independent result set per model, build a single-model config with `VerificationConfig.from_overrides` on each iteration and run it separately:

In [7]:
models_to_compare = [
    ("gpt-4o", "openai"),
    ("claude-sonnet-4-5-20250514", "anthropic"),
    ("gemini-2.0-flash", "google_genai"),
]

all_results = {}
for model_name, provider in models_to_compare:
    run_config = VerificationConfig.from_overrides(
        answering_model=model_name,
        answering_provider=provider,
        answering_id=f"ans-{model_name}",
        parsing_model="gpt-4o-mini",
        parsing_provider="openai",
        parsing_id="parser",
    )
    run_results = benchmark.run_verification(run_config, run_name=model_name)
    all_results[model_name] = run_results

for model_name, model_results in all_results.items():
    summary = model_results.get_summary()
    pass_info = summary.get("template_pass_overall", {})
    print(f"{model_name}: {pass_info.get('passed', 0)}/{pass_info.get('total', 0)} passed")

gpt-4o: 4/4 passed
claude-sonnet-4-5-20250514: 4/4 passed
gemini-2.0-flash: 3/4 passed
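
Because each run yields its own summary, comparing runs means merging those summaries by hand. The sketch below ranks runs by pass rate, assuming the `passed` and `total` counts have been copied into a plain dictionary. The `run_totals` dictionary and the `rank_runs` helper are hypothetical.

```python
# Hypothetical per-run totals, copied from the printed output above.
run_totals = {
    "gpt-4o": (4, 4),
    "claude-sonnet-4-5-20250514": (4, 4),
    "gemini-2.0-flash": (3, 4),
}


def rank_runs(run_totals):
    """Sort runs by pass rate (descending), breaking ties by model name."""

    def rate(item):
        _, (passed, total) = item
        return passed / total if total else 0.0

    return sorted(run_totals.items(), key=lambda item: (-rate(item), item[0]))


ranking = rank_runs(run_totals)
for position, (model, (passed, total)) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {passed}/{total}")
```

The alphabetical tie-break only keeps the ordering stable; it says nothing about which of two tied models is better.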


## Step 6: Replicates for Variance Analysis

Set `replicate_count` to run each question/model combination multiple times, then group the results by replicate to measure consistency:

In [8]:
config_with_reps = VerificationConfig(
    answering_models=[
        ModelConfig(
            id="gpt4o",
            model_name="gpt-4o",
            model_provider="openai",
            interface="langchain",
        ),
    ],
    parsing_models=[
        ModelConfig(
            id="parser",
            model_name="gpt-4o-mini",
            model_provider="openai",
            interface="langchain",
        ),
    ],
    replicate_count=3,
)

rep_results = benchmark.run_verification(config_with_reps)
print(
    f"Total results: {len(rep_results.results)} "
    f"({benchmark.question_count} questions x {config_with_reps.replicate_count} replicates)"
)

by_replicate = rep_results.group_by_replicate()
for rep_num, rep_group in sorted(by_replicate.items()):
    summary = rep_group.get_summary()
    pass_info = summary.get("template_pass_overall", {})
    print(f"Replicate {rep_num}: {pass_info.get('passed', 0)}/{pass_info.get('total', 0)} passed")

Total results: 15 (5 questions x 3 replicates)
Replicate 0: 4/4 passed
Replicate 1: 4/4 passed
Replicate 2: 4/4 passed
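
Per-replicate pass counts can be reduced to a mean and a spread. The sketch below uses the standard library's `statistics` module. The mocked run above is perfectly consistent, so the counts here are invented purely to show a non-zero spread and are not real results.

```python
from statistics import mean, pstdev

# Hypothetical (passed, total) counts per replicate, invented for illustration.
replicate_counts = {0: (4, 4), 1: (3, 4), 2: (4, 4)}

rates = [passed / total for passed, total in replicate_counts.values()]
avg = mean(rates)
spread = pstdev(rates)  # population standard deviation across replicates
print(f"mean pass rate: {avg:.3f}, std dev: {spread:.3f}")
```

`pstdev` treats the replicates as the whole population. `statistics.stdev` is the sample estimate and may be the better choice when the replicates are meant to stand in for a larger number of possible runs.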


## Summary

| Approach | When to Use |
|----------|------------|
| Multiple `answering_models` list | Single run, answer caching, combined result set |
| `from_overrides` loop | Separate result sets, per-model export, independent analysis |
| Replicates | Variance analysis, consistency measurement |

## Next Steps

- [Verification Result structure](../07-analyzing-results/verification-result.md) — full result hierarchy
- [DataFrame analysis](../07-analyzing-results/dataframe-analysis.md) — convert results to pandas DataFrames
- [Python API verification](../06-running-verification/python-api.md) — single-model workflow

In [9]:
# Cleanup mocks
_ = _patcher1.stop()
_ = _patcher2.stop()