# Running Verification: End-to-End Workflow

This notebook walks through the complete verification workflow: loading a
benchmark, configuring verification, running it, and inspecting the results.

For conceptual background, see the [Running Verification](../06-running-verification/index.md) guide.

In [None]:
# Mock cell: patches run_verification so examples execute without live API keys.
# This cell is hidden in the rendered documentation.
import datetime
import os
from unittest.mock import patch

from karenina.schemas.results import VerificationResultSet
from karenina.schemas.verification import VerificationConfig, VerificationResult
from karenina.schemas.verification.model_identity import ModelIdentity
from karenina.schemas.verification.result_components import (
    VerificationResultMetadata,
    VerificationResultRubric,
    VerificationResultTemplate,
)

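# Notebooks do not define __file__, so the quoted "__file__" below resolves
# against the current working directory. Run this notebook from the notebooks
# directory so that test_checkpoint.jsonld is found.
os.chdir(os.path.dirname(os.path.abspath("__file__")))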


def _mock_run_verification(self, config, question_ids=None, **kwargs):  # noqa: ARG001
    """Return realistic mock results for documentation examples."""
    qids = question_ids or self.get_question_ids()
    mock_results = []
    answers = {
        "capital of France": ("Paris", True),
        "6 multiplied by 7": ("42", True),
        "atomic number 8": ("Oxygen (O)", True),
        "17 a prime": ("True", True),
        "machine learning": ("Machine learning is a subset of AI", None),
    }
    for qid in qids:
        q = self.get_question(qid)
        question_text = q["question"]
        response, verified = ("Mock response", True)
        for key, (resp, ver) in answers.items():
            if key in question_text.lower():
                response, verified = resp, ver
                break
        answering = ModelIdentity(model_name="gpt-4o", interface="langchain")
        parsing = ModelIdentity(model_name="gpt-4o", interface="langchain")
        # timezone.utc also works on Python versions older than 3.11 (datetime.UTC does not).
        ts = datetime.datetime.now(tz=datetime.timezone.utc).isoformat()
        result_id = VerificationResultMetadata.compute_result_id(qid, answering, parsing, ts)
        template_result = None
        if verified is not None:
            template_result = VerificationResultTemplate(
                raw_llm_response=response,
                verify_result=verified,
                template_verification_performed=True,
            )
        rubric_result = None
        if "capital" in question_text.lower():
            rubric_result = VerificationResultRubric(
                rubric_evaluation_performed=True,
                llm_trait_scores={"Is the response concise?": True},
            )
        result = VerificationResult(
            metadata=VerificationResultMetadata(
                question_id=qid,
                template_id="mock_template" if verified is not None else "no_template",
                completed_without_errors=True,
                question_text=question_text,
                raw_answer=q.get("raw_answer"),
                answering=answering,
                parsing=parsing,
                execution_time=1.2,
                timestamp=ts,
                result_id=result_id,
            ),
            template=template_result,
            rubric=rubric_result,
        )
        mock_results.append(result)
    return VerificationResultSet(results=mock_results)


_patcher_run = patch(
    "karenina.benchmark.benchmark.Benchmark.run_verification",
    _mock_run_verification,
)
_patcher_validate = patch.object(
    VerificationConfig,
    "_validate_config",
    lambda self: None,  # noqa: ARG005
)
_patcher_run.start()
_patcher_validate.start()

---

## Step 1: Load a Benchmark

Load a benchmark from a JSON-LD checkpoint file:

In [2]:
from karenina import Benchmark

benchmark = Benchmark.load("test_checkpoint.jsonld")

print(f"Benchmark: {benchmark.name}")
print(f"Questions: {benchmark.question_count}")
print(f"Complete:  {benchmark.is_complete}")

Benchmark: Documentation Test Benchmark
Questions: 5
Complete:  False
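
Loading depends on the checkpoint file being reachable from the current working directory. The sketch below shows one way to locate the file before loading it. `find_checkpoint` is a hypothetical helper (not part of karenina), and the `notebooks` subdirectory is an assumed location.

```python
from pathlib import Path
from typing import Optional


def find_checkpoint(filename: str, search_dirs: list) -> Optional[Path]:
    """Return the first existing path to `filename` in `search_dirs`, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


# The search locations are assumptions; adjust them to your project layout.
found = find_checkpoint(
    "test_checkpoint.jsonld",
    [Path.cwd(), Path.cwd() / "notebooks"],
)
print(found if found else "checkpoint not found; check the working directory")
```

If a path is found, it can be passed to `Benchmark.load()` as a string, in the same way as the relative filename above.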


---

## Step 2: Configure Verification

Create a `VerificationConfig` to control which models to use, the evaluation
mode, and optional pipeline features:

In [3]:
from karenina.schemas.config import ModelConfig
from karenina.schemas.verification import VerificationConfig

config = VerificationConfig(
    answering_models=[
        ModelConfig(id="gpt-4o", model_name="gpt-4o", interface="langchain"),
    ],
    parsing_models=[
        ModelConfig(id="gpt-4o", model_name="gpt-4o", interface="langchain"),
    ],
    evaluation_mode="template_and_rubric",
    rubric_enabled=True,
)

print(f"Evaluation mode: {config.evaluation_mode}")
print(f"Rubric enabled:  {config.rubric_enabled}")

Evaluation mode: template_and_rubric
Rubric enabled:  True


For simpler setups, use the `from_overrides` convenience method:

In [4]:
quick_config = VerificationConfig.from_overrides(
    answering_id="gpt-4o",
    answering_model="gpt-4o",
    parsing_id="gpt-4o",
    parsing_model="gpt-4o",
)
print(f"Quick config mode: {quick_config.evaluation_mode}")

Quick config mode: template_only


---

## Step 3: Run Verification

### Full Run

Run verification across all questions:

In [5]:
results = benchmark.run_verification(config)
print(f"Completed: {len(results)} verifications across {benchmark.question_count} questions")

Completed: 5 verifications across 5 questions


### Partial Run

Verify only a subset of questions by passing `question_ids`:

In [6]:
question_ids = benchmark.get_question_ids()[:2]
partial_results = benchmark.run_verification(config, question_ids=question_ids)
print(f"Verified {len(partial_results)} of {benchmark.question_count} questions")

Verified 2 of 5 questions
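
For a large benchmark, the same `question_ids` argument could be used to process questions in fixed-size batches. The sketch below is library-independent: `chunked` is a hypothetical helper (not part of karenina), and the IDs are placeholder strings.

```python
from collections.abc import Iterator, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[list]:
    """Yield consecutive batches of at most `size` IDs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start : start + size])


# Placeholder IDs; in the notebook these would come from benchmark.get_question_ids().
example_ids = ["q1", "q2", "q3", "q4", "q5"]
batches = list(chunked(example_ids, 2))
print(batches)
```

Each batch could then be passed to `benchmark.run_verification(config, question_ids=batch)`.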


---

## Step 4: Inspect Results

`run_verification()` returns a `VerificationResultSet` — a container with
filtering, grouping, and analysis methods.

### Iterating Over Results

Each result is a `VerificationResult` with template (correctness) and
rubric (quality) sections. Either section can be `None`, so check for it
before reading its fields:

In [7]:
for result in results:
    meta = result.metadata
    q_text = meta.question_text[:50]

    # Template result (correctness)
    if result.template and result.template.verify_result is not None:
        status = "PASS" if result.template.verify_result else "FAIL"
    else:
        status = "N/A"

    # Rubric result (quality)
    rubric_info = ""
    if result.rubric and result.rubric.rubric_evaluation_performed:
        scores = result.rubric.llm_trait_scores or {}
        rubric_info = f" | rubric traits: {len(scores)}"

    print(f"  [{status}] {q_text}{rubric_info}")

  [PASS] What is the capital of France? | rubric traits: 1
  [PASS] What is 6 multiplied by 7?
  [PASS] What element has the atomic number 8? Provide both
  [PASS] Is 17 a prime number?
  [N/A] Explain the concept of machine learning in simple 
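
The same loop pattern can be turned into a pass rate. The sketch below is a minimal illustration that uses stand-in objects, so it runs without the library. `pass_rate` is a hypothetical helper, and the stand-ins mirror only the `template.verify_result` attribute used above.

```python
from types import SimpleNamespace
from typing import Optional


def pass_rate(results) -> Optional[float]:
    """Fraction of passing results among those that have a template verdict."""
    verdicts = [
        r.template.verify_result
        for r in results
        if r.template is not None and r.template.verify_result is not None
    ]
    if not verdicts:
        return None  # no template verdicts, so no rate to report
    return sum(1 for v in verdicts if v) / len(verdicts)


# Stand-in objects, not real VerificationResult instances.
fake_results = [
    SimpleNamespace(template=SimpleNamespace(verify_result=True)),
    SimpleNamespace(template=SimpleNamespace(verify_result=False)),
    SimpleNamespace(template=None),  # counted as N/A, excluded from the rate
]
rate = pass_rate(fake_results)
print(rate)
```

Results without a template verdict are excluded from the denominator, which matches the `N/A` status in the loop above.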


### Summary Statistics

`get_summary()` returns aggregate counts for the result set, keyed by name:

In [8]:
summary = results.get_summary()
print(f"Total results:  {summary['num_results']}")
print(f"Completed:      {summary['num_completed']}")
print(f"With template:  {summary['num_with_template']}")
print(f"With rubric:    {summary['num_with_rubric']}")
print(f"Unique models:  {summary['num_models']}")

Total results:  5
Completed:      5
With template:  4
With rubric:    1
Unique models:  1
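
The counts can be combined into coverage ratios. The sketch below uses a plain dictionary whose values are copied from the output above. `coverage` is a hypothetical helper, not a karenina function.

```python
def coverage(summary: dict, key: str) -> float:
    """Share of results counted under `key`, relative to num_results."""
    total = summary["num_results"]
    return summary[key] / total if total else 0.0


# Values copied from the output above; in the notebook, use results.get_summary().
example_summary = {
    "num_results": 5,
    "num_completed": 5,
    "num_with_template": 4,
    "num_with_rubric": 1,
    "num_models": 1,
}
template_share = coverage(example_summary, "num_with_template")
rubric_share = coverage(example_summary, "num_with_rubric")
print(f"Template coverage: {template_share:.0%}")
print(f"Rubric coverage:   {rubric_share:.0%}")
```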


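### Filtering Results

`filter()` returns the subset of results that match the given criteria. Here it keeps completed results that have template verification: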

In [9]:
filtered = results.filter(completed_only=True, has_template=True)
print(f"Filtered: {len(filtered)} results with template verification")

Filtered: 4 results with template verification


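### Grouping Results

`group_by_question()` returns a mapping from question ID to the group of results for that question: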

In [None]:
by_question = results.group_by_question()
for _qid, group in by_question.items():
    first = group.results[0]
    q_text = first.metadata.question_text[:40]
    print(f"  {q_text}: {len(group)} result(s)")
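
The grouping pattern can be illustrated without the library. The sketch below does not show how `group_by_question()` is implemented. It only demonstrates bucketing results by `metadata.question_id`, using stand-in objects and placeholder IDs.

```python
from collections import defaultdict
from types import SimpleNamespace

# Stand-ins carrying only metadata.question_id; the IDs are placeholders.
fake_results = [
    SimpleNamespace(metadata=SimpleNamespace(question_id="q1")),
    SimpleNamespace(metadata=SimpleNamespace(question_id="q2")),
    SimpleNamespace(metadata=SimpleNamespace(question_id="q1")),
]

# Bucket each result under its question ID.
grouped = defaultdict(list)
for r in fake_results:
    grouped[r.metadata.question_id].append(r)

counts = {qid: len(group) for qid, group in grouped.items()}
print(counts)
```

A question has more than one result whenever it is verified more than once, which is why each group holds a list rather than a single result.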

---

## Summary

This notebook covered the complete verification workflow:

| Step | Action | Key API |
|------|------|---------|
| 1 | Load benchmark | `Benchmark.load()` |
| 2 | Configure | `VerificationConfig()` or `.from_overrides()` |
| 3 | Run verification | `benchmark.run_verification(config)` |
| 4 | Inspect results | Iterate, `.get_summary()`, `.filter()`, `.group_by_question()` |

For more details, see:
- [VerificationConfig](../06-running-verification/verification-config.md) — Full configuration tutorial
- [Analyzing Results](../07-analyzing-results/index.md) — DataFrame analysis and export
- [CLI Verification](../06-running-verification/cli.md) — Running from the command line

In [11]:
# Clean up the mocks
_ = _patcher_run.stop()
_ = _patcher_validate.stop()