## Tutorial 02: Custom Evaluators - Building Domain-Specific Evaluation Criteria

While built-in evaluators provide general-purpose evaluation capabilities, custom evaluators allow you to define domain-specific evaluation criteria tailored to your agent's unique requirements. In this tutorial, you'll learn how to create custom evaluators with well-defined rubrics to assess recipe quality, dietary compliance, food safety, and overall recipe helpfulness using LLM-as-a-judge.

### What You'll Learn
- Extend the base `Evaluator` class to create custom evaluators
- Design rubrics with clear scoring criteria (3-point and 5-point scales)
- Create domain-specific evaluators for recipe quality, dietary compliance, and food safety
- Implement an LLM-as-a-judge custom evaluator with a 5-point scale (1-5)
- Combine multiple custom evaluators in a single evaluation workflow
- Understand when to use custom evaluators versus built-in evaluators

### Tutorial Details

| Information         | Details                                                                       |
|:--------------------|:------------------------------------------------------------------------------|
| Tutorial type       | Beginner - Introduction to creating custom evaluation criteria                |
| Tutorial components | Custom evaluators, rubric design, LLM-as-judge, multi-metric evaluation      |
| Tutorial vertical   | Agent Evaluation                                                              |
| Example complexity  | Easy                                                                          |
| SDK used            | Strands Agents, Strands Evals                                                 |

### Understanding Custom Evaluators

Custom evaluators extend the base `Evaluator` class to define domain-specific criteria beyond general-purpose metrics. Use them when you need specialized quality standards, business rules, or compliance requirements.

#### When to Use Custom Evaluators

| Use Case | Description |
|:---------|:------------|
| Domain-Specific Requirements | Criteria unique to your domain (recipe completeness, medical accuracy, legal compliance) |
| Business Logic Validation | Agent outputs must meet specific business rules |
| Specialized Quality Metrics | Standard metrics don't capture the nuances of your use case |
| LLM-as-Judge | Sophisticated evaluation requiring context and nuance |

#### Key Components

| Component | Purpose |
|:----------|:--------|
| Rubric | Clear scoring criteria with defined thresholds |
| Evaluation Logic | Code that analyzes agent output against the rubric |
| Structured Results | Score, label, and explanation of the evaluation |
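
The three components can be sketched without the SDK. `RubricResult`, `WORD_COUNT_RUBRIC`, and `score_word_count` below are hypothetical names made up for illustration (they are not part of Strands Evals), and the word-count thresholds are arbitrary assumptions. The sketch only shows how a rubric, the evaluation logic, and a structured result relate to one another.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a structured evaluation result (not the SDK type).
@dataclass
class RubricResult:
    score: int
    label: str
    explanation: str


# Rubric: (minimum word count, score, label), ordered from the highest threshold down.
# The thresholds are illustrative assumptions.
WORD_COUNT_RUBRIC = [
    (50, 2, "Full"),
    (10, 1, "Partial"),
    (0, 0, "Not met"),
]


def score_word_count(text: str) -> RubricResult:
    """Evaluation logic: compare the output against the rubric thresholds."""
    words = len(text.split())
    for minimum, score, label in WORD_COUNT_RUBRIC:
        if words >= minimum:
            return RubricResult(score, label, f"{words} words (threshold {minimum})")
    raise AssertionError("unreachable: the last threshold is 0")


print(score_word_count("Boil pasta for ten minutes, then drain and serve."))
```

Keeping the rubric as data, separate from the logic, makes it easy to adjust thresholds without touching the scoring code.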

#### Rubric Scale Patterns

| Scale | Scores | Best For |
|:------|:-------|:---------|
| 3-Point | 0 (Not met), 1 (Partial), 2 (Full) | Simple criteria that need one partial-credit level |
| 5-Point (0-4) | 0 (Absent) to 4 (Complete) | Detailed quality assessment |
| 5-Point (1-5) | 1 (Very Poor) to 5 (Excellent) | LLM-as-Judge evaluations |
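
Because the scales start and end at different values, raw scores from different evaluators are not directly comparable. One possible approach, sketched here with a hypothetical `normalize` helper that is not part of the SDK, is to map each score onto the range 0.0 to 1.0 before comparing or averaging.

```python
def normalize(score: float, low: float, high: float) -> float:
    """Map a rubric score from [low, high] onto [0.0, 1.0]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# One example per scale pattern from the table above.
print(normalize(1, 0, 2))  # 3-point scale, "Partial"
print(normalize(3, 0, 4))  # 5-point scale (0-4)
print(normalize(4, 1, 5))  # 5-point scale (1-5)
```

With this mapping, a 3 on the 0-4 scale and a 4 on the 1-5 scale both normalize to 0.75.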

### Environment Setup

Configure AWS region and model settings using inline configuration.

In [None]:
# AWS Configuration (inline - no config.py)
import boto3

session = boto3.Session()
AWS_REGION = session.region_name or 'us-east-1'
DEFAULT_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'
JUDGE_MODEL = 'us.anthropic.claude-3-7-sonnet-20250219-v1:0'

### Setup and Imports

Import all required packages for agent creation and custom evaluation.

In [None]:
# Standard imports
import json
from typing import Any

# Strands imports
from strands import Agent, tool
from strands_evals import Dataset, Case
from strands_evals.evaluators import Evaluator
from strands_evals.types import EvaluationData, EvaluationOutput

# Display utilities
from IPython.display import Markdown, display

# Web search tool dependency
from ddgs import DDGS
from ddgs.exceptions import DDGSException, RatelimitException

### Recipe Bot Agent

We'll use the Recipe Bot agent from Tutorial 01 as our evaluation target. This agent helps users find recipes and answer cooking questions using web search.

In [None]:
# Define a websearch tool
@tool
def websearch(
    keywords: str, region: str = "us-en", max_results: int | None = None
) -> str:
    """Search the web to get updated information.
    Args:
        keywords (str): The search query keywords.
        region (str): The search region: wt-wt, us-en, uk-en, ru-ru, etc.
        max_results (int | None): The maximum number of results to return.
    Returns:
        Search results serialized as a JSON string, or a message if nothing was found.
    """
    try:
        results = DDGS().text(keywords, region=region, max_results=max_results)
        # Serialize the list of result dicts so the return value matches the `-> str` annotation.
        return json.dumps(results) if results else "No results found."
    except RatelimitException:
        return "RatelimitException: Please try again after a short delay."
    except DDGSException as d:
        return f"DuckDuckGoSearchException: {d}"
    except Exception as e:
        return f"Exception: {e}"


# Create a recipe assistant agent
recipe_agent = Agent(
    model=DEFAULT_MODEL,
    system_prompt="""You are RecipeBot, a helpful cooking assistant.
    Help users find recipes based on ingredients and answer cooking questions.
    Use the websearch tool to find recipes when users mention ingredients or to look up cooking information.""",
    tools=[websearch],
)

### Test the Agent

Before creating custom evaluators, let's verify the agent works correctly.

In [None]:
# Test the agent with a simple query
recipe_agent("Give me a simple pasta recipe")

### Creating Custom Evaluators

We'll create four custom evaluators, each focusing on a different aspect of recipe quality:

| Evaluator | Scale | What It Measures |
|:----------|:------|:-----------------|
| RecipeQualityEvaluator | 5-point (0-4) | Ingredients, steps, timing completeness |
| DietaryComplianceEvaluator | 3-point | Respects dietary restrictions |
| RecipeSafetyEvaluator | 3-point | Food safety information |
| RecipeHelpfulnessLLMJudge | 5-point (1-5) | Overall helpfulness via LLM |

#### Custom Evaluator 1: RecipeQualityEvaluator

Assesses whether a recipe includes essential components: ingredients list, preparation steps, and timing information.

In [None]:
class RecipeQualityEvaluator(Evaluator[str, str]):
    """Evaluates recipe completeness: ingredients, steps, and timing.
    
    Rubric (5-point scale: 0-4):
    - 0: No recipe components present
    - 1: Only one component present (e.g., just ingredients)
    - 2: Two components present (e.g., ingredients and steps, but no timing)
    - 3: All three components present but incomplete
    - 4: All three components present and complete
    """
    
    def __init__(self):
        super().__init__()
    
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        """Evaluate recipe quality based on completeness."""
        actual = evaluation_case.actual_output or ""
        output_text = str(actual).lower()
        
        has_ingredients = any(keyword in output_text for keyword in 
                            ['ingredient', 'cup', 'tablespoon', 'teaspoon', 'oz', 'gram'])
        has_steps = any(keyword in output_text for keyword in 
                       ['step', 'instruction', 'cook', 'mix', 'add', 'prepare', 'heat'])
        has_timing = any(keyword in output_text for keyword in 
                        ['minute', 'hour', 'time', 'until', 'second'])
        
        components_count = sum([has_ingredients, has_steps, has_timing])
        
        # Collect the undetected components once; the explanations below reuse the list.
        missing = [
            name
            for name, present in (
                ("ingredients", has_ingredients),
                ("steps", has_steps),
                ("timing", has_timing),
            )
            if not present
        ]

        # `label` names the rubric level; only score, test_pass, and reason are returned.
        if components_count == 0:
            score = 0
            label = "Incomplete"
            explanation = "Recipe lacks all essential components (ingredients, steps, timing)"
        elif components_count == 1:
            score = 1
            label = "Poor"
            explanation = f"Recipe is missing: {', '.join(missing)}"
        elif components_count == 2:
            score = 2
            label = "Fair"
            explanation = f"Recipe is missing: {', '.join(missing)}"
        else:
            # Response length is a rough proxy for detail (rubric levels 3 vs 4).
            is_detailed = len(output_text) > 200
            if is_detailed:
                score = 4
                label = "Excellent"
                explanation = "Recipe includes all essential components with good detail"
            else:
                score = 3
                label = "Good"
                explanation = "Recipe includes all components but could be more detailed"
        
        return [EvaluationOutput(
            score=score,
            test_pass=score >= 3,
            reason=explanation
        )]
    
    async def evaluate_async(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        return self.evaluate(evaluation_case)
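
Rule-based logic like this can be checked without calling the agent. The sketch below copies the keyword lists from `RecipeQualityEvaluator` into a standalone function; `detect_components` and the sample strings are hypothetical and exist only for this illustration.

```python
# Keyword lists copied from RecipeQualityEvaluator above.
COMPONENT_KEYWORDS = {
    "ingredients": ["ingredient", "cup", "tablespoon", "teaspoon", "oz", "gram"],
    "steps": ["step", "instruction", "cook", "mix", "add", "prepare", "heat"],
    "timing": ["minute", "hour", "time", "until", "second"],
}


def detect_components(text: str) -> dict[str, bool]:
    """Report which recipe components the substring checks detect in `text`."""
    lowered = text.lower()
    return {
        name: any(keyword in lowered for keyword in keywords)
        for name, keywords in COMPONENT_KEYWORDS.items()
    }


# Made-up sample responses, one per expected outcome.
samples = {
    "empty": "",
    "ingredients only": "2 cups flour, 1 teaspoon salt",
    "full": "Ingredients: 2 cups rice. Step 1: cook for 15 minutes.",
}
for name, text in samples.items():
    found = detect_components(text)
    print(f"{name}: {sum(found.values())} component(s) -> {found}")
```

Checks like this also surface substring false positives: `'oz'` matches inside "frozen", so a response such as "Use frozen peas" counts as containing ingredients.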

#### Custom Evaluator 2: DietaryComplianceEvaluator

This evaluator checks whether the recipe appropriately addresses dietary restrictions mentioned in the user's request (vegan, gluten-free, dairy-free, etc.). It uses a **3-point scale** for simplicity. This is particularly important for applications where dietary compliance could have health or ethical implications for users.

In [None]:
class DietaryComplianceEvaluator(Evaluator[str, str]):
    """Evaluates whether recipe respects dietary restrictions.
    
    Rubric (3-point scale):
    - 0: Violates stated dietary restrictions
    - 1: Partially compliant or unclear
    - 2: Fully compliant with dietary restrictions
    """
    
    def __init__(self):
        super().__init__()
    
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        """Evaluate dietary compliance based on user request."""
        input_text = str(evaluation_case.input).lower()
        actual = evaluation_case.actual_output or ""
        output_text = str(actual).lower()
        
        restrictions = {
            'vegan': ['meat', 'chicken', 'beef', 'pork', 'fish', 'egg', 'dairy', 'milk', 'cheese', 'butter'],
            'vegetarian': ['meat', 'chicken', 'beef', 'pork', 'fish'],
            'gluten-free': ['flour', 'wheat', 'bread', 'pasta', 'gluten'],
            'dairy-free': ['milk', 'cheese', 'butter', 'cream', 'yogurt'],
            'nut-free': ['nut', 'almond', 'peanut', 'walnut', 'pecan', 'cashew']
        }
        
        mentioned_restrictions = [key for key in restrictions.keys() if key in input_text]
        
        if not mentioned_restrictions:
            return [EvaluationOutput(
                score=2,
                test_pass=True,
                reason="No dietary restrictions specified in the request"
            )]
        
        violations = []
        for restriction in mentioned_restrictions:
            for ingredient in restrictions[restriction]:
                if ingredient in output_text:
                    violations.append(f"{ingredient} (violates {restriction})")
        
        if violations:
            score = 0
            label = "Non-Compliant"
            explanation = f"Recipe violates {', '.join(mentioned_restrictions)} restrictions. Found: {', '.join(violations[:3])}"
        elif any(f"{r} option" in output_text or f"{r} substitute" in output_text for r in mentioned_restrictions):
            score = 2
            label = "Compliant"
            explanation = f"Recipe explicitly addresses {', '.join(mentioned_restrictions)} requirements"
        else:
            score = 1
            label = "Unclear"
            explanation = f"Recipe may be {', '.join(mentioned_restrictions)} but doesn't explicitly confirm"
        
        return [EvaluationOutput(
            score=score,
            test_pass=score >= 2,
            reason=explanation
        )]
    
    async def evaluate_async(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        return self.evaluate(evaluation_case)
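
Plain substring checks can flag ingredients that only contain a restricted term: `'egg'` matches "eggplant" and `'nut'` matches "nutmeg". One possible mitigation, shown as a sketch with a hypothetical `contains_word` helper, is to match whole words with a regular expression.

```python
import re


def contains_word(text: str, term: str) -> bool:
    """True if `term` appears as a whole word, optionally with a plural s/es."""
    pattern = rf"\b{re.escape(term)}(?:e?s)?\b"
    return re.search(pattern, text.lower()) is not None


# Made-up recipe text used only to compare the two matching strategies.
recipe = "Roast the eggplant with nutmeg, then stir in coconut cream."

for term in ["egg", "nut"]:
    substring_hit = term in recipe.lower()
    word_hit = contains_word(recipe, term)
    print(f"{term}: substring={substring_hit}, whole word={word_hit}")
```

Whole-word matching is still only a heuristic. It does not understand phrases such as "coconut cream" (which contains the dairy-free keyword `cream`) or negations such as "no eggs", so cases like these may be better suited to an LLM-as-judge evaluator.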

#### Custom Evaluator 3: RecipeSafetyEvaluator

This evaluator validates that recipes include important food safety information such as cooking temperatures, handling instructions, and cross-contamination warnings. It uses a **3-point scale**. Food safety is critical for preventing foodborne illness, making this evaluator essential for any recipe-based agent.

In [None]:
class RecipeSafetyEvaluator(Evaluator[str, str]):
    """Evaluates food safety considerations in recipes.
    
    Rubric (3-point scale):
    - 0: No safety information present
    - 1: Some safety information but incomplete
    - 2: Adequate safety information included
    """
    
    def __init__(self):
        super().__init__()
    
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        """Evaluate food safety information."""
        actual = evaluation_case.actual_output or ""
        output_text = str(actual).lower()
        
        has_temperature = any(keyword in output_text for keyword in 
                            ['degree', 'temperature', '°f', '°c', 'fahrenheit', 'celsius', 'internal temp'])
        has_doneness = any(keyword in output_text for keyword in 
                          ['done', 'cooked through', 'no longer pink', 'until tender', 'golden brown'])
        has_handling = any(keyword in output_text for keyword in 
                          ['wash', 'clean', 'sanitize', 'separate', 'refrigerat', 'thaw', 'defrost'])
        has_storage = any(keyword in output_text for keyword in 
                         ['store', 'keep', 'leftover', 'refrigerat', 'freeze'])
        
        safety_indicators = sum([has_temperature, has_doneness, has_handling, has_storage])
        
        # Name the detected indicators once; both non-zero branches report them.
        indicators = [
            name
            for name, present in (
                ("temperature", has_temperature),
                ("doneness", has_doneness),
                ("handling", has_handling),
                ("storage", has_storage),
            )
            if present
        ]

        if safety_indicators == 0:
            score = 0
            label = "Unsafe"
            explanation = "Recipe lacks food safety information (temperature, doneness, handling, storage)"
        elif safety_indicators == 1:
            score = 1
            label = "Minimal Safety"
            explanation = f"Recipe includes limited safety info: {', '.join(indicators)}"
        else:
            score = 2
            label = "Safe"
            explanation = f"Recipe includes adequate safety information: {', '.join(indicators)}"
        
        return [EvaluationOutput(
            score=score,
            test_pass=score >= 2,
            reason=explanation
        )]
    
    async def evaluate_async(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        return self.evaluate(evaluation_case)
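
Keyword presence shows that a temperature is mentioned, not that it is a safe one. As a hedged sketch of a stricter check, the code below extracts numeric temperatures and compares them with a threshold. The helper names are hypothetical, and the 165°F poultry minimum is an assumption chosen for this example; confirm thresholds against the guidelines you follow.

```python
import re

# Assumption for illustration: 165°F (about 74°C) as the minimum internal
# temperature for poultry. Confirm the threshold against your own guidelines.
POULTRY_MIN_F = 165.0

# Matches values such as "165°F", "74 degrees C", or "165 degrees Fahrenheit".
TEMP_PATTERN = re.compile(
    r"(\d{2,3})\s*(?:°|degrees?)\s*(f|c)(?:ahrenheit|elsius)?\b",
    re.IGNORECASE,
)


def extract_fahrenheit(text: str) -> list[float]:
    """Return every temperature found in `text`, converted to Fahrenheit."""
    temps = []
    for value, unit in TEMP_PATTERN.findall(text):
        number = float(value)
        temps.append(number if unit.lower() == "f" else number * 9 / 5 + 32)
    return temps


def mentions_safe_poultry_temp(text: str) -> bool:
    """True if any extracted temperature reaches the assumed poultry minimum."""
    return any(temp >= POULTRY_MIN_F for temp in extract_fahrenheit(text))


print(mentions_safe_poultry_temp("Cook the chicken until it reaches 165°F inside."))
print(mentions_safe_poultry_temp("Simmer at 140 degrees F for an hour."))
```

This sketch cannot tell an oven setting from an internal temperature, so "Bake at 400°F" would also satisfy the check.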

#### Custom Evaluator 4: RecipeHelpfulnessLLMJudge

This evaluator uses an LLM to judge the overall helpfulness and quality of the recipe response. Unlike rule-based evaluators, this evaluator leverages a language model's understanding to assess nuanced qualities like clarity, completeness, and practical value. It uses a **5-point scale (1-5)**.

**Key Difference**: LLM-as-a-judge evaluators can understand context, detect subtle quality issues, and provide more nuanced assessments than simple keyword matching. This makes them ideal for evaluating subjective qualities that are difficult to capture with rules alone.

In [None]:
class RecipeHelpfulnessLLMJudge(Evaluator[str, str]):
    """Uses an LLM to evaluate recipe helpfulness and overall quality.
    
    Rubric (5-point scale: 1-5):
    - 1: Very poor - Fails to meet basic requirements, unhelpful or incorrect
    - 2: Poor - Meets minimal requirements but has major issues (missing key info, unclear)
    - 3: Fair - Acceptable but room for improvement (basic recipe, lacks detail)
    - 4: Good - Meets requirements well with minor issues (clear, complete, useful)
    - 5: Excellent - Exceeds requirements, comprehensive, clear, and highly practical
    """
    
    def __init__(self, model: str | None = None):
        super().__init__()
        self.judge_agent = Agent(
            model=model or JUDGE_MODEL,
            system_prompt="""You are an expert culinary evaluator. Evaluate recipe responses on a scale of 1-5.
            
            Scoring criteria:
            - 1: Very poor - Fails to meet basic requirements, unhelpful or incorrect
            - 2: Poor - Meets minimal requirements but has major issues
            - 3: Fair - Acceptable but room for improvement
            - 4: Good - Meets requirements well with minor issues
            - 5: Excellent - Exceeds requirements, comprehensive and clear
            
            Consider:
            - Completeness: Does it include ingredients, steps, timing?
            - Clarity: Are instructions clear and easy to follow?
            - Practicality: Can a home cook actually make this?
            - Relevance: Does it address the user's request?
            - Detail: Is there enough information without overwhelming?
            
            Respond with ONLY a JSON object in this exact format:
            {"score": <1-5>, "label": "<Very Poor|Poor|Fair|Good|Excellent>", "explanation": "<brief explanation>"}
            """
        )
    
    def evaluate(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        """Use LLM to evaluate recipe helpfulness."""
        actual = evaluation_case.actual_output or ""
        
        eval_prompt = f"""Evaluate this recipe response:

User Request: {evaluation_case.input}

Recipe Response: {actual}

Provide your evaluation as a JSON object."""
        
        try:
            judge_response = self.judge_agent(eval_prompt)
            
            judge_output = str(judge_response)
            
            if "```json" in judge_output:
                json_str = judge_output.split("```json")[1].split("```")[0].strip()
            elif "```" in judge_output:
                json_str = judge_output.split("```")[1].split("```")[0].strip()
            else:
                json_str = judge_output.strip()
            
            result_data = json.loads(json_str)
            
            score = result_data.get("score", 3)
            label = result_data.get("label", "Fair")
            explanation = result_data.get("explanation", "LLM evaluation completed")
            
            score = max(1, min(5, int(score)))
            
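        except Exception as e:
            # The judge call or JSON parsing failed. Report the lowest score so the
            # failure is visible in the report instead of counting as a pass.
            score = 1
            label = "Very Poor"
            explanation = f"LLM evaluation failed: {str(e)[:100]}"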
        
        return [EvaluationOutput(
            score=score,
            test_pass=score >= 3,
            reason=explanation
        )]
    
    async def evaluate_async(self, evaluation_case: EvaluationData[str, str]) -> list[EvaluationOutput]:
        return self.evaluate(evaluation_case)

**Rule-Based vs LLM-as-Judge**: Rule-based evaluators use keyword matching (fast, deterministic, objective). LLM-as-judge uses language models for subjective qualities (clarity, helpfulness) but is slower and costs more.

### Create Test Cases

Now we'll create test cases that exercise different aspects of our custom evaluators: a basic recipe request, a request with a dietary restriction, and a food-safety question.

In [None]:
# Create test cases for evaluation
test_cases = [
    Case(
        name="Basic Recipe Request",
        input="Give me a simple chicken pasta recipe",
        expected_output="A complete recipe with ingredients, steps, and cooking times"
    ),
    Case(
        name="Vegan Recipe Request",
        input="I need a vegan chocolate cake recipe",
        expected_output="A vegan recipe with no animal products"
    ),
    Case(
        name="Food Safety Critical Recipe",
        input="How do I cook chicken safely?",
        expected_output="Recipe with proper temperature and safety guidelines"
    )
]

### Create Dataset with Custom Evaluators

We'll create four datasets over the same test cases, one per custom evaluator (including the LLM-as-a-judge), so each metric produces its own report.

In [None]:
# Initialize all custom evaluators
recipe_quality_eval = RecipeQualityEvaluator()
dietary_compliance_eval = DietaryComplianceEvaluator()
recipe_safety_eval = RecipeSafetyEvaluator()
llm_judge_eval = RecipeHelpfulnessLLMJudge()

# Create separate datasets for each evaluator
dataset_quality = Dataset(
    cases=test_cases,
    evaluator=recipe_quality_eval
)

dataset_compliance = Dataset(
    cases=test_cases,
    evaluator=dietary_compliance_eval
)

dataset_safety = Dataset(
    cases=test_cases,
    evaluator=recipe_safety_eval
)

dataset_llm_judge = Dataset(
    cases=test_cases,
    evaluator=llm_judge_eval
)

### Run Evaluation

Define the task function that each dataset calls to get the agent's response for a test case. When a dataset runs, every test case is sent through this function and the response is scored by that dataset's evaluator.

In [None]:
# Define agent task wrapper
def agent_task(case: Case) -> str:
    """Execute agent and return response."""
    response = recipe_agent(case.input)
    return str(response)

### Display Summary Report

Run each dataset and use the report's built-in display to view aggregated results.

In [None]:
# Run separate evaluations for each evaluator
report_quality = dataset_quality.run_evaluations(agent_task)
display(Markdown("## Recipe Quality Evaluator Report"))
report_quality.run_display()

In [None]:
report_quality.to_file('report')

In [None]:
report_compliance = dataset_compliance.run_evaluations(agent_task)
display(Markdown("## Dietary Compliance Evaluator Report"))
report_compliance.run_display()

In [None]:
report_safety = dataset_safety.run_evaluations(agent_task)
display(Markdown("## Recipe Safety Evaluator Report"))
report_safety.run_display()

In [None]:
report_llm_judge = dataset_llm_judge.run_evaluations(agent_task)
display(Markdown("## Recipe Helpfulness LLM Judge Report"))
report_llm_judge.run_display()

### Evaluator Selection Guide

**Custom**: Domain-specific criteria, business rules, specialized metrics. Use rule-based for objective criteria; LLM-as-judge for subjective.  
**Built-In**: General capabilities, standardized metrics, baseline evaluation.  
**Best Practice**: Combine both for comprehensive coverage.
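
When several evaluators score the same cases, it can help to look at their results side by side. The sketch below assumes you have already collected per-case pass/fail flags for each evaluator into a plain dictionary. The dictionary layout, the helper names, and the values are hypothetical placeholders, not output from this tutorial or a Strands Evals API.

```python
# Hypothetical per-case pass/fail flags keyed by evaluator name.
# These values are placeholders, not results from the tutorial's evaluators.
results = {
    "quality": [True, True, False],
    "compliance": [True, False, True],
    "safety": [True, True, True],
}


def pass_rates(results: dict[str, list[bool]]) -> dict[str, float]:
    """Fraction of cases each evaluator passed."""
    return {name: sum(flags) / len(flags) for name, flags in results.items()}


def cases_passing_all(results: dict[str, list[bool]]) -> list[bool]:
    """For each case, True only if every evaluator passed it."""
    return [all(flags) for flags in zip(*results.values())]


for name, rate in pass_rates(results).items():
    print(f"{name}: {rate:.0%}")
print("Passed every evaluator:", cases_passing_all(results))
```

Requiring every evaluator to pass is a strict policy. A weighted average of normalized scores is another option when some criteria matter more than others.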

### Summary

You've learned to create custom evaluators by extending the Evaluator class, designing rubrics with 3-point and 5-point scales, and implementing both rule-based evaluators (objective criteria) and LLM-as-judge evaluators (subjective assessment).