# Custom Templates: A Progressive Tutorial

This notebook demonstrates how to write custom answer templates in Karenina, progressing from simple patterns to advanced multi-field evaluation. Templates define how a Judge LLM should parse model responses and how correctness is verified.

For conceptual background, see [Answer Templates](../04-core-concepts/answer-templates.md). For the full writing guide, see [Writing Custom Templates](../05-creating-benchmarks/writing-templates.md).

In [None]:
# Setup cell: imports shared by the template examples below.
# The examples run locally and need no live API keys.
from pydantic import Field

from karenina.schemas.entities import BaseAnswer

---

## Template Basics

Every template is a Pydantic model named `Answer` that extends `BaseAnswer`. It defines:

- **Attributes** — what the Judge LLM should extract from the response
- **Ground truth** — the expected correct values (set in `model_post_init`)
- **`verify()`** — comparison logic returning `True` (pass) or `False` (fail)

### Case-Insensitive String Matching

The simplest pattern: normalize strings before comparison to handle common variations.

In [None]:
class Answer(BaseAnswer):
    gene_symbol: str = Field(
        description="The gene symbol mentioned in the response"
    )

    def model_post_init(self, __context):
        self.correct = {"gene_symbol": "TP53"}

    def verify(self) -> bool:
        return self.gene_symbol.strip().upper() == self.correct["gene_symbol"].upper()


# Test with variations a Judge LLM might extract.
# "P53" is an alias rather than a case or whitespace variant, so it fails.
for variant in ["tp53", " TP53 ", "Tp53", "P53"]:
    parsed = Answer(gene_symbol=variant)
    print(f"{variant!r:12s} → verify(): {parsed.verify()}")
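
The `P53` case above fails because case folding and whitespace stripping cannot map an alias to its canonical symbol. One way to accept aliases is sketched below in plain Python. The `ALIASES` table and both helper names are hypothetical and not part of Karenina; the same logic could be called from inside `verify()`.

```python
# Hypothetical alias table: maps known synonyms to the canonical symbol.
ALIASES = {"P53": "TP53"}


def canonical_symbol(raw: str) -> str:
    """Strip whitespace, upper-case, then map known aliases to the canonical symbol."""
    cleaned = raw.strip().upper()
    return ALIASES.get(cleaned, cleaned)


def matches_symbol(extracted: str, expected: str) -> bool:
    """Compare two symbols after canonicalizing both sides."""
    return canonical_symbol(extracted) == canonical_symbol(expected)


for variant in ["tp53", " TP53 ", "p53", "BRCA1"]:
    print(f"{variant!r:10s} -> {matches_symbol(variant, 'TP53')}")
```

Keeping the alias table explicit keeps `verify()` deterministic: only the synonyms you list are accepted.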

---

## Numeric Tolerance

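Exact numeric comparison is often too strict, because responses round or approximate values. Use an absolute tolerance when the expected value has a fixed, known scale, and a relative tolerance when expected values span several orders of magnitude.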

### Absolute Tolerance

In [None]:
class Answer(BaseAnswer):
    temperature: float = Field(
        description="The boiling point temperature in degrees Celsius"
    )

    def model_post_init(self, __context):
        self.correct = {"temperature": 100.0}
        self.tolerance = 0.5  # Accept within ±0.5°C

    def verify(self) -> bool:
        return abs(self.temperature - self.correct["temperature"]) <= self.tolerance


for temp in [100.0, 99.8, 100.3, 101.0]:
    parsed = Answer(temperature=temp)
    print(f"{temp:6.1f}°C → verify(): {parsed.verify()}")

### Percentage-Based Tolerance

For values spanning wide ranges, relative tolerance adapts to the magnitude:

In [None]:
class Answer(BaseAnswer):
    population: int = Field(
        description="The estimated population of the city"
    )

    def model_post_init(self, __context):
        self.correct = {"population": 8_336_817}
        self.tolerance_pct = 10  # Accept within ±10% of the expected value

    def verify(self) -> bool:
        expected = self.correct["population"]
        threshold = expected * (self.tolerance_pct / 100)
        return abs(self.population - expected) <= threshold


for pop in [8_336_817, 8_000_000, 9_000_000, 7_000_000]:
    parsed = Answer(population=pop)
    print(f"{pop:>10,} → verify(): {parsed.verify()}")
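
Both tolerance styles can also be expressed with the standard library's `math.isclose`, shown here as an alternative sketch rather than the approach the templates above use. Note one difference: `math.isclose` scales `rel_tol` by the larger of the two values, while the template above scales by the expected value only, so the two can disagree near the upper boundary.

```python
import math

EXPECTED_TEMP = 100.0
EXPECTED_POP = 8_336_817

# Absolute tolerance only: rel_tol=0.0 disables the relative check.
temp_ok = math.isclose(100.3, EXPECTED_TEMP, rel_tol=0.0, abs_tol=0.5)
temp_bad = math.isclose(101.0, EXPECTED_TEMP, rel_tol=0.0, abs_tol=0.5)
print(temp_ok, temp_bad)  # True False

# Relative tolerance only: abs_tol defaults to 0.0.
pop_ok = math.isclose(9_000_000, EXPECTED_POP, rel_tol=0.10)
pop_bad = math.isclose(7_000_000, EXPECTED_POP, rel_tol=0.10)
print(pop_ok, pop_bad)  # True False

# Boundary case where the two definitions disagree.
value = 9_200_000
template_style = abs(value - EXPECTED_POP) <= EXPECTED_POP * 0.10
isclose_style = math.isclose(value, EXPECTED_POP, rel_tol=0.10)
print(template_style, isclose_style)  # False True
```

If the threshold must be anchored to the ground truth, the explicit formula in the template is the safer choice.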

---

## List Comparison

When evaluating lists, decide whether order matters and whether extra items are acceptable.

### Set-Based (Order Doesn't Matter)

In [None]:
class Answer(BaseAnswer):
    symptoms: list[str] = Field(
        description="The symptoms of the condition listed in the response"
    )

    def model_post_init(self, __context):
        self.correct = {"symptoms": ["fever", "cough", "fatigue"]}

    def verify(self) -> bool:
        extracted = {s.strip().lower() for s in self.symptoms}
        expected = {s.lower() for s in self.correct["symptoms"]}
        return extracted == expected


# Order doesn't matter; normalization handles case
parsed = Answer(symptoms=["Fatigue", "fever", "Cough"])
print(f"Extracted: {parsed.symptoms}")
print(f"verify():  {parsed.verify()}")

# Missing items fail
parsed2 = Answer(symptoms=["fever", "cough"])
print(f"\nMissing item: {parsed2.symptoms}")
print(f"verify():     {parsed2.verify()}")
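
The set-based template ignores both order and duplicates. When order or repetition matters, the comparison changes shape. The helpers below are a plain-Python sketch with hypothetical names (not Karenina APIs) that could be called from `verify()`.

```python
from collections import Counter


def normalize_items(items: list[str]) -> list[str]:
    """Apply the same strip/lower normalization to every item."""
    return [item.strip().lower() for item in items]


def ordered_match(extracted: list[str], expected: list[str]) -> bool:
    """True only when both lists hold the same items in the same order."""
    return normalize_items(extracted) == normalize_items(expected)


def multiset_match(extracted: list[str], expected: list[str]) -> bool:
    """Order-insensitive, but repeated items must appear the same number of times."""
    return Counter(normalize_items(extracted)) == Counter(normalize_items(expected))


expected = ["fever", "cough", "fatigue"]
print(ordered_match(["Fever", "cough ", "fatigue"], expected))        # True
print(ordered_match(["fatigue", "fever", "cough"], expected))         # False
print(multiset_match(["fatigue", "fever", "cough"], expected))        # True
print(multiset_match(["fever", "fever", "cough", "fatigue"], expected))  # False
```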

### Subset Matching (Extra Items OK)

Sometimes you only need to confirm that the required items are present, and extra items are acceptable:

In [None]:
class Answer(BaseAnswer):
    proteins: list[str] = Field(
        description="Proteins involved in the signaling pathway mentioned in the response"
    )

    def model_post_init(self, __context):
        self.correct = {"required_proteins": ["EGFR", "RAS", "RAF"]}

    def verify(self) -> bool:
        extracted = {p.strip().upper() for p in self.proteins}
        required = {p.upper() for p in self.correct["required_proteins"]}
        return required.issubset(extracted)


# Extra proteins are fine
parsed = Answer(proteins=["EGFR", "RAS", "RAF", "MEK", "ERK"])
print(f"Extracted: {parsed.proteins}")
print(f"verify():  {parsed.verify()}")

# Missing a required protein fails
parsed2 = Answer(proteins=["EGFR", "MEK"])
print(f"\nMissing required: {parsed2.proteins}")
print(f"verify():         {parsed2.verify()}")
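
Subset matching also lends itself to partial credit: the fraction of required items that were found. The sketch below uses a hypothetical `required_coverage` helper (not part of Karenina) that returns both the score and the missing items, which helps when debugging a failed check.

```python
def required_coverage(extracted: list[str], required: list[str]) -> tuple[float, set[str]]:
    """Return (fraction of required items found, set of missing items)."""
    found = {item.strip().upper() for item in extracted}
    needed = {item.strip().upper() for item in required}
    missing = needed - found
    # An empty requirement list is treated as fully satisfied.
    score = (len(needed) - len(missing)) / len(needed) if needed else 1.0
    return score, missing


score, missing = required_coverage(["EGFR", "MEK"], ["EGFR", "RAS", "RAF"])
print(f"coverage: {score:.2f}")          # coverage: 0.33
print(f"missing:  {sorted(missing)}")    # missing:  ['RAF', 'RAS']
```

A score like this could be returned from `verify_granular()`, with `verify()` passing only when nothing is missing.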

---

## Multi-Field Templates with Partial Credit

For templates with multiple attributes, implement both `verify()` (all-or-nothing) and `verify_granular()` (partial credit as a float from 0.0 to 1.0). The verification pipeline automatically calls `verify_granular()` when present.

In [None]:
class Answer(BaseAnswer):
    drug_name: str = Field(
        description="The name of the drug mentioned in the response"
    )
    target: str = Field(
        description="The protein target of the drug"
    )
    mechanism: str = Field(
        description="The mechanism of action (e.g., inhibitor, agonist)"
    )

    def model_post_init(self, __context):
        self.correct = {
            "drug_name": "venetoclax",
            "target": "BCL2",
            "mechanism": "inhibitor",
        }

    def _check_drug_name(self) -> bool:
        return self.drug_name.strip().lower() == self.correct["drug_name"].lower()

    def _check_target(self) -> bool:
        extracted = self.target.strip().upper().replace("-", "").replace("_", "")
        expected = self.correct["target"].upper().replace("-", "").replace("_", "")
        return extracted == expected

    def _check_mechanism(self) -> bool:
        return self.mechanism.strip().lower() == self.correct["mechanism"].lower()

    def verify(self) -> bool:
        return self._check_drug_name() and self._check_target() and self._check_mechanism()

    def verify_granular(self) -> float:
        checks = [self._check_drug_name(), self._check_target(), self._check_mechanism()]
        return sum(checks) / len(checks)


# 2 of 3 checks pass: drug name and target match, the mechanism does not
parsed = Answer(drug_name="Venetoclax", target="Bcl-2", mechanism="agonist")
print(f"Drug:      {parsed._check_drug_name()}")
print(f"Target:    {parsed._check_target()}")
print(f"Mechanism: {parsed._check_mechanism()}")
print(f"verify():          {parsed.verify()}")
print(f"verify_granular(): {parsed.verify_granular():.2f}")
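
One design option is to collect the individual checks in a single named mapping, so that `verify()` and `verify_granular()` are derived from the same results and the failing fields can be reported. The sketch below uses a plain dataclass named `DrugAnswer` as a hypothetical stand-in for a template so that it runs without Karenina. The `checks()` and `failed_checks()` methods are illustrative names, not part of `BaseAnswer`.

```python
from dataclasses import dataclass


@dataclass
class DrugAnswer:
    """Plain-Python stand-in for a multi-field template (not a BaseAnswer)."""

    drug_name: str
    target: str
    mechanism: str

    def checks(self) -> dict[str, bool]:
        """Run every field check once and return the results by field name."""
        target = self.target.strip().upper().replace("-", "").replace("_", "")
        return {
            "drug_name": self.drug_name.strip().lower() == "venetoclax",
            "target": target == "BCL2",
            "mechanism": self.mechanism.strip().lower() == "inhibitor",
        }

    def verify(self) -> bool:
        return all(self.checks().values())

    def verify_granular(self) -> float:
        results = self.checks()
        return sum(results.values()) / len(results)

    def failed_checks(self) -> list[str]:
        return [name for name, passed in self.checks().items() if not passed]


answer = DrugAnswer(drug_name="Venetoclax", target="Bcl-2", mechanism="agonist")
print(answer.verify())                    # False
print(f"{answer.verify_granular():.2f}")  # 0.67
print(answer.failed_checks())             # ['mechanism']
```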

---

## Boolean Attribute Pattern

Instead of extracting text and matching strings, ask the Judge LLM to make boolean judgments directly. This avoids string normalization pitfalls, because the Judge LLM answers yes or no for each concept and `verify()` only compares booleans.

In [None]:
class Answer(BaseAnswer):
    mentions_bcl2: bool = Field(
        description="True if the response identifies BCL2 (or BCL-2, Bcl-2) as the target"
    )
    mentions_inhibition: bool = Field(
        description="True if the response describes inhibition as the mechanism of action"
    )
    mentions_apoptosis: bool = Field(
        description="True if the response mentions apoptosis or programmed cell death"
    )

    def model_post_init(self, __context):
        self.correct = {
            "mentions_bcl2": True,
            "mentions_inhibition": True,
            "mentions_apoptosis": True,
        }

    def verify(self) -> bool:
        return all(
            getattr(self, field) == self.correct[field]
            for field in self.correct
        )

    def verify_granular(self) -> float:
        matches = sum(
            1 for field in self.correct
            if getattr(self, field) == self.correct[field]
        )
        return matches / len(self.correct)


# Simulates a response that missed one concept
parsed = Answer(mentions_bcl2=True, mentions_inhibition=True, mentions_apoptosis=False)
print(f"verify():          {parsed.verify()}")
print(f"verify_granular(): {parsed.verify_granular():.2f}")
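
Because the template compares each field against `self.correct` rather than checking that every flag is `True`, the expected value of a flag can also be `False`. That allows "must not claim" checks. The sketch below illustrates this in plain Python. The `score_flags` helper and the `claims_bcl2_activation` field are hypothetical.

```python
def score_flags(extracted: dict[str, bool], correct: dict[str, bool]) -> float:
    """Fraction of flags whose extracted value equals the expected value."""
    matches = sum(extracted[name] == expected for name, expected in correct.items())
    return matches / len(correct)


correct = {
    "mentions_bcl2": True,
    "mentions_inhibition": True,
    "claims_bcl2_activation": False,  # the response must NOT make this claim
}

# A response that names the target and mechanism and avoids the wrong claim.
clean = {"mentions_bcl2": True, "mentions_inhibition": True, "claims_bcl2_activation": False}
# A response that contradicts itself by also claiming activation.
confused = {"mentions_bcl2": True, "mentions_inhibition": True, "claims_bcl2_activation": True}

print(f"{score_flags(clean, correct):.2f}")     # 1.00
print(f"{score_flags(confused, correct):.2f}")  # 0.67
```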

---

## Adding Templates to a Benchmark

Templates are added to benchmarks as **code strings** — not class objects. This is the most reliable method and works in all environments (scripts, notebooks, CI).

In [None]:
from karenina import Benchmark

benchmark = Benchmark.create(name="Custom Templates Demo")

# Option 1: Provide the template when adding the question
template_code = '''class Answer(BaseAnswer):
    target: str = Field(description="The protein target mentioned")

    def model_post_init(self, __context):
        self.correct = {"target": "BCR-ABL"}

    def verify(self) -> bool:
        extracted = self.target.strip().upper().replace("-", "")
        expected = self.correct["target"].upper().replace("-", "")
        return extracted == expected
'''

q1_id = benchmark.add_question(
    question="What is the molecular target of imatinib?",
    raw_answer="BCR-ABL",
    answer_template=template_code,
)
print(f"Q1 added with template: {q1_id[:50]}...")

# Option 2: Add a template to an existing question
q2_id = benchmark.add_question(
    question="How many chromosomes are in a human somatic cell?",
    raw_answer="46",
)

count_template = '''class Answer(BaseAnswer):
    count: int = Field(description="The number of chromosomes mentioned")

    def model_post_init(self, __context):
        self.correct = {"count": 46}

    def verify(self) -> bool:
        return self.count == self.correct["count"]
'''

benchmark.add_answer_template(q2_id, count_template)
print(f"Q2 template added: {q2_id[:50]}...")

print(f"\nTotal questions: {benchmark.question_count}")
print(f"With templates:  {len(benchmark.get_finished_templates())}")
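
Because templates travel as strings, a typo only surfaces when the string is parsed. The sketch below is a lightweight pre-check using the standard library's `ast` module. It confirms that the source parses and defines a class named `Answer` that lists `BaseAnswer` as a base and has a `verify()` method, which are the requirements described in Template Basics. The `check_template_source` helper is hypothetical, and Karenina may perform its own validation when a template is added.

```python
import ast


def check_template_source(source: str) -> list[str]:
    """Return a list of structural problems found in a template code string."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    answers = [
        node for node in tree.body
        if isinstance(node, ast.ClassDef) and node.name == "Answer"
    ]
    if not answers:
        return ["no top-level class named 'Answer'"]

    cls = answers[0]
    problems = []
    base_names = {base.id for base in cls.bases if isinstance(base, ast.Name)}
    if "BaseAnswer" not in base_names:
        problems.append("Answer does not list BaseAnswer as a base class")
    methods = {node.name for node in cls.body if isinstance(node, ast.FunctionDef)}
    if "verify" not in methods:
        problems.append("Answer has no verify() method")
    return problems


good_source = '''class Answer(BaseAnswer):
    count: int = Field(description="The number of chromosomes mentioned")

    def model_post_init(self, __context):
        self.correct = {"count": 46}

    def verify(self) -> bool:
        return self.count == self.correct["count"]
'''

print(check_template_source(good_source))                              # []
print(check_template_source("class Answer(BaseAnswer):\n    count: int\n"))
```

This check only inspects structure. It does not execute the template, so the comparison logic still needs to be tested by instantiating the class and calling `verify()`.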

---

## Template Design Guidelines

| Guideline | Why |
|-----------|-----|
| **Keep `verify()` deterministic** | Same input must always produce the same result — no randomness, no network calls |
| **Normalize before comparing** | Strip whitespace, standardize case, handle hyphens/underscores |
| **Use helper methods** | Extract each check into a `_check_*()` method so `verify()` and `verify_granular()` share the same logic |
| **Write clear field descriptions** | The Judge LLM sees only each field's name, type, and description, so be specific |
| **Test locally first** | Instantiate your template and call `verify()` before running full verification |

## Next Steps

- [Generating Templates](../05-creating-benchmarks/generating-templates.md) — Automatic template generation for common patterns
- [Defining Rubrics](../05-creating-benchmarks/defining-rubrics.md) — Quality assessment alongside correctness
- [Running Verification](../06-running-verification/index.md) — Execute verification with your templates
- [Answer Templates](../04-core-concepts/answer-templates.md) — Conceptual foundation (field types, naming requirement)