# Creating a Benchmark: End-to-End Workflow

This notebook walks through the complete benchmark creation workflow, from creating an empty checkpoint to saving a fully defined benchmark with questions, templates, and rubrics.

For conceptual background, see the [Creating Benchmarks](../05-creating-benchmarks/index.md) guide.

In [1]:
# Setup cell (hidden in rendered docs).
# No mocking needed — all examples create objects locally without API calls.
import tempfile

# Create a temporary directory for saving the benchmark
_tmpdir = tempfile.mkdtemp()

---

## Step 1: Create a Benchmark

Every benchmark starts with the `Benchmark.create()` factory method. Provide a name and optional metadata:

In [2]:
from karenina import Benchmark

benchmark = Benchmark.create(
    name="Drug Discovery Knowledge Benchmark",
    description="Evaluates LLM knowledge of drug targets, mechanisms, and molecular biology",
    version="1.0.0",
    creator="Pharmacology Research Lab",
)

print(f"Name:        {benchmark.name}")
print(f"Version:     {benchmark.version}")
print(f"Questions:   {benchmark.question_count}")
print(f"Is empty:    {benchmark.is_empty}")

Name:        Drug Discovery Knowledge Benchmark
Version:     1.0.0
Questions:   0
Is empty:    True


---

## Step 2: Add Questions

Questions pair a prompt with an expected answer. You can add them as simple strings, with templates attached, or as `Question` objects.

### Simple Questions

In [3]:
q1_id = benchmark.add_question(
    question="What is the putative target of imatinib?",
    raw_answer="BCR-ABL",
)

q2_id = benchmark.add_question(
    question="What is the boiling point of water in degrees Celsius?",
    raw_answer="100",
)

print(f"Added {benchmark.question_count} questions")

Added 2 questions


### Questions with Templates

Attach an answer template as a code string to define how the response should be evaluated:

In [4]:
template_code = '''class Answer(BaseAnswer):
    target: str = Field(description="The protein target of the drug mentioned in the response")

    def model_post_init(self, __context):
        self.correct = {"target": "BCR-ABL"}

    def verify(self) -> bool:
        extracted = self.target.strip().upper().replace("-", "")
        expected = self.correct["target"].upper().replace("-", "")
        return extracted == expected
'''

q3_id = benchmark.add_question(
    question="What is the molecular target of imatinib in chronic myeloid leukemia?",
    raw_answer="BCR-ABL tyrosine kinase",
    answer_template=template_code,
)

print(f"Question with template: {q3_id[:50]}...")
print(f"Finished count: {benchmark.finished_count}")

Question with template: urn:uuid:question-what-is-the-molecular-target-of-...
Finished count: 1
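
The comparison inside `verify()` can be tried on its own, without the benchmark. The following is a minimal standalone sketch of the same normalization (trim, uppercase, drop hyphens); the helper names `normalize_target` and `targets_match` are illustrative and not part of karenina.

```python
def normalize_target(name: str) -> str:
    """Mirror the template's normalization: trim, uppercase, drop hyphens."""
    return name.strip().upper().replace("-", "")


def targets_match(extracted: str, expected: str = "BCR-ABL") -> bool:
    """Return True when both names are equal after normalization."""
    return normalize_target(extracted) == normalize_target(expected)


print(targets_match("bcr-abl"))                  # True
print(targets_match(" BCRABL "))                 # True
print(targets_match("BCR ABL"))                  # False: inner spaces are kept
print(targets_match("BCR-ABL tyrosine kinase"))  # False: extra words are kept
```

The last two cases show why the field description matters: the check is an exact match after normalization, so the extracted value needs to be the bare target name.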


### Questions with Metadata

In [5]:
q4_id = benchmark.add_question(
    question="What is the mechanism of action of venetoclax?",
    raw_answer="Selective BCL-2 inhibitor that triggers apoptosis",
    author={"name": "Dr. Smith", "email": "smith@example.com"},
    sources=[{"name": "DrugBank", "url": "https://go.drugbank.com/drugs/DB11581"}],
    custom_metadata={"difficulty": "medium", "domain": "oncology"},
)

print(f"Total questions: {benchmark.question_count}")
print(f"With templates:  {benchmark.finished_count}")

Total questions: 4
With templates:  1


---

## Step 3: Write Custom Templates

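Templates define evaluation criteria. The cells below attach templates to two of the questions that don't have one yet (Q2 and Q4). Note that `finished_count` stays at 1 in the outputs that follow: judging from those outputs, `add_answer_template()` attaches a template without marking the question as finished.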

### Numeric Tolerance

In [6]:
temperature_template = '''class Answer(BaseAnswer):
    temperature: float = Field(
        description="The boiling point temperature in degrees Celsius"
    )

    def model_post_init(self, __context):
        self.correct = {"temperature": 100.0}
        self.tolerance = 0.5

    def verify(self) -> bool:
        return abs(self.temperature - self.correct["temperature"]) <= self.tolerance
'''

benchmark.add_answer_template(q2_id, temperature_template)
print("Added numeric tolerance template to Q2")

Added numeric tolerance template to Q2
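
To see which values the tolerance check accepts, the same arithmetic can be run as a plain function. This is a standalone sketch; `within_tolerance` is a hypothetical helper that reuses the template's expected value (100.0) and tolerance (0.5).

```python
import math


def within_tolerance(value: float, expected: float = 100.0, tolerance: float = 0.5) -> bool:
    """Absolute-tolerance check, as in the template's verify()."""
    return abs(value - expected) <= tolerance


for candidate in (100.0, 99.6, 100.5, 101.0):
    print(candidate, within_tolerance(candidate))

# For finite values, math.isclose with rel_tol=0.0 expresses the same check.
assert within_tolerance(99.6) == math.isclose(99.6, 100.0, rel_tol=0.0, abs_tol=0.5)
```

The bound is inclusive, so 100.5 passes while 101.0 does not.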


### Multi-Field with Partial Credit

In [7]:
venetoclax_template = '''class Answer(BaseAnswer):
    target: str = Field(description="The protein target of the drug")
    mechanism: str = Field(description="The mechanism of action (e.g., inhibitor, agonist)")

    def model_post_init(self, __context):
        self.correct = {"target": "BCL2", "mechanism": "inhibitor"}

    def _check_target(self) -> bool:
        extracted = self.target.strip().upper().replace("-", "").replace("_", "")
        expected = self.correct["target"].upper().replace("-", "").replace("_", "")
        return extracted == expected

    def _check_mechanism(self) -> bool:
        return self.mechanism.strip().lower() == self.correct["mechanism"].lower()

    def verify(self) -> bool:
        return self._check_target() and self._check_mechanism()

    def verify_granular(self) -> float:
        checks = [self._check_target(), self._check_mechanism()]
        return sum(checks) / len(checks)
'''

benchmark.add_answer_template(q4_id, venetoclax_template)
print("Added multi-field template to Q4")
print(f"Finished count: {benchmark.finished_count} of {benchmark.question_count}")

Added multi-field template to Q4
Finished count: 1 of 4
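
The partial-credit arithmetic in `verify_granular()` relies on Python treating `True` as 1 and `False` as 0. The sketch below reproduces the two checks as plain functions so the scores can be inspected directly; the function names are illustrative and not part of karenina.

```python
def check_target(extracted: str, expected: str = "BCL2") -> bool:
    """Compare target names, ignoring case, hyphens, and underscores."""
    def clean(name: str) -> str:
        return name.strip().upper().replace("-", "").replace("_", "")
    return clean(extracted) == clean(expected)


def check_mechanism(extracted: str, expected: str = "inhibitor") -> bool:
    """Compare mechanisms, ignoring case and surrounding whitespace."""
    return extracted.strip().lower() == expected.lower()


def granular_score(target: str, mechanism: str) -> float:
    """Fraction of checks that pass (booleans sum as 0 or 1)."""
    checks = [check_target(target), check_mechanism(mechanism)]
    return sum(checks) / len(checks)


print(granular_score("BCL-2", "inhibitor"))            # 1.0
print(granular_score("BCL-2", "selective inhibitor"))  # 0.5
print(granular_score("MCL1", "agonist"))               # 0.0
```

The middle case scores 0.5 because the mechanism check is an exact match: "selective inhibitor" does not equal "inhibitor".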


---

## Step 4: Define Rubric Traits

Rubrics evaluate **response quality** independently of correctness. Add traits at global scope (every question) or per-question scope.

### Global Traits

In [8]:
from karenina.schemas import LLMRubricTrait, RegexTrait, CallableTrait

# Boolean LLM trait — evaluated on every question
safety_trait = LLMRubricTrait(
    name="Safe Response",
    description=(
        "Answer True if the response is safe and appropriate for a general audience, "
        "with no harmful, misleading, or offensive content. Answer False otherwise."
    ),
    kind="boolean",
    higher_is_better=True,
)
benchmark.add_global_rubric_trait(safety_trait)

# Regex trait — check for hedging language
no_hedging = RegexTrait(
    name="No Hedging",
    description="Response should not contain hedging phrases.",
    pattern=r"\b(I think|I believe|I guess|probably)\b",
    case_sensitive=False,
    invert_result=True,
    higher_is_better=True,
)
benchmark.add_global_rubric_trait(no_hedging)

# Callable trait — minimum word count
min_words = CallableTrait.from_callable(
    name="Minimum Length",
    func=lambda text: len(text.split()) >= 10,
    kind="boolean",
    description="Response must contain at least 10 words.",
    higher_is_better=True,
)
benchmark.add_global_rubric_trait(min_words)

print("Global rubric traits added:")
global_rubric = benchmark.get_global_rubric()
for name in global_rubric.get_trait_names():
    print(f"  - {name}")

Global rubric traits added:
  - Safe Response
  - No Hedging
  - Minimum Length
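
The two deterministic traits can be approximated with the standard library to preview how they would judge a response. This sketch assumes that `RegexTrait` searches for the pattern anywhere in the text and that `invert_result=True` turns "pattern found" into a failing result; that reading follows from the trait's description, but it is an assumption about karenina's behavior, and the helper names are hypothetical.

```python
import re

# Same pattern as the "No Hedging" trait; IGNORECASE mirrors case_sensitive=False.
HEDGING = re.compile(r"\b(I think|I believe|I guess|probably)\b", re.IGNORECASE)


def no_hedging(text: str) -> bool:
    """Assumed semantics: pass when no hedging phrase appears in the text."""
    return HEDGING.search(text) is None


def min_words(text: str, minimum: int = 10) -> bool:
    """Same logic as the lambda passed to CallableTrait.from_callable."""
    return len(text.split()) >= minimum


short_answer = "Imatinib inhibits the BCR-ABL tyrosine kinase."
hedged_answer = "I THINK it targets BCR-ABL."

print(no_hedging(short_answer), min_words(short_answer))    # True False
print(no_hedging(hedged_answer), min_words(hedged_answer))  # False False
```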


### Question-Specific Traits

In [9]:
# Score trait — only on the venetoclax question
clarity_trait = LLMRubricTrait(
    name="Mechanism Clarity",
    description=(
        "Rate how clearly the response explains the drug's mechanism of action. "
        "1 = vague or missing, 5 = precise and well-explained."
    ),
    kind="score",
    min_score=1,
    max_score=5,
    higher_is_better=True,
)
benchmark.add_question_rubric_trait(q4_id, clarity_trait)
print("Added question-specific score trait to venetoclax question")

Added question-specific score trait to venetoclax question


---

## Step 5: Save the Benchmark

Save the completed benchmark as a JSON-LD checkpoint file:

In [10]:
from pathlib import Path

checkpoint_path = Path(_tmpdir) / "drug_discovery_benchmark.jsonld"
benchmark.save(checkpoint_path)

print(f"Saved to: {checkpoint_path.name}")
print(f"File size: {checkpoint_path.stat().st_size:,} bytes")

Saved to: drug_discovery_benchmark.jsonld
File size: 11,579 bytes
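
Because JSON-LD is ordinary JSON, a saved checkpoint can be opened with the standard `json` module for a quick look at its top-level structure. The sketch below uses a small stand-in document written to a temporary directory; its keys are hypothetical and are not meant to reflect the actual checkpoint schema.

```python
import json
import tempfile
from pathlib import Path

# Stand-in JSON-LD document; these keys are illustrative, not karenina's schema.
doc = {"@context": "https://schema.org", "@type": "Dataset", "name": "Example"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.jsonld"
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")

    # Any JSON parser can read a .jsonld file.
    loaded_doc = json.loads(path.read_text(encoding="utf-8"))

print(sorted(loaded_doc))  # ['@context', '@type', 'name']
```

Pointing the same `json.loads` call at `checkpoint_path` would show the keys of the real file.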


### Verify Round-Trip

Load the saved checkpoint to confirm everything persisted correctly:

In [11]:
loaded = Benchmark.load(str(checkpoint_path))

print(f"Name:           {loaded.name}")
print(f"Questions:      {loaded.question_count}")
print(f"Finished:       {loaded.finished_count}")
print(f"Progress:       {loaded.get_progress()}%")

# Check global rubric survived
global_rubric = loaded.get_global_rubric()
if global_rubric:
    print(f"Global traits:  {global_rubric.get_trait_names()}")

Name:           Drug Discovery Knowledge Benchmark
Questions:      4
Finished:       1
Progress:       25.0%
Global traits:  ['Safe Response', 'No Hedging', 'Minimum Length']


---

## Summary

This notebook covered the complete benchmark creation workflow:

| Step | What | Key API |
|------|------|---------|
| 1 | Create checkpoint | `Benchmark.create()` |
| 2 | Add questions | `benchmark.add_question()` |
| 3 | Write templates | `benchmark.add_answer_template()` |
| 4 | Define rubrics | `benchmark.add_global_rubric_trait()`, `add_question_rubric_trait()` |
| 5 | Save | `benchmark.save()` |

For running verification on this benchmark, see the [Running Verification](../06-running-verification/index.md) guide.

In [12]:
# Cleanup temporary files
import shutil
shutil.rmtree(_tmpdir, ignore_errors=True)