# Evaluate quality using LLM and code-based scorers

## Scale your human expertise with LLM scorers for automated quality evaluation that is tuned to your use case

LLM scorers are AI-powered quality assessment tools that scale human expertise to evaluate GenAI quality automatically, in both development and production. They assess semantic correctness, style, safety, and relative quality, answering questions like "Does this response answer the question correctly?" and "Is this appropriate for our brand?"

MLflow offers two kinds of LLM judges:

- **Built-in Judges** - Research-backed judges for safety, hallucination, retrieval quality, and relevance
- **Custom Judges** - Tune our research-backed LLM judges to your business needs and human expert judgment

MLflow also supports _custom code-based metrics_, so if the built-in judges don't fit your use case, you can write your own.

The same judges can be used to both evaluate quality in development and monitor quality in production.

![demo-eval](https://i.imgur.com/M3kLBHF.gif)

This notebook demonstrates how to create and use LLM judges to evaluate GenAI quality. You'll learn to use both built-in judges and create custom guidelines that align with your business requirements.


## Install packages (only required if running in a Databricks Notebook)

In [None]:
%pip install -U -r ../../requirements.txt
dbutils.library.restartPython()

## Environment Setup

Load environment variables and verify MLflow configuration.


In [None]:
import sys
sys.path.append('../')
sys.path.append('../../')

import os
from dotenv import load_dotenv
import mlflow
from mlflow_demo.utils import *

if mlflow.utils.databricks_utils.is_in_databricks_notebook():
  print("Running in Databricks Notebook")
  setup_databricks_notebook_env()
else:
  print("Running in Local IDE")
  setup_local_ide_env()

# Verify key variables are loaded
print('=== Environment Setup ===')
print(f'DATABRICKS_HOST: {os.getenv("DATABRICKS_HOST")}')
print(f'MLFLOW_EXPERIMENT_ID: {os.getenv("MLFLOW_EXPERIMENT_ID")}')
print(f'LLM_MODEL: {os.getenv("LLM_MODEL")}')
print(f'UC_CATALOG: {os.getenv("UC_CATALOG")}')
print(f'UC_SCHEMA: {os.getenv("UC_SCHEMA")}')
print('✅ Environment variables loaded successfully!')

import logging
logging.getLogger("urllib3").setLevel(logging.ERROR)
logging.getLogger("mlflow").setLevel(logging.ERROR)

In [None]:
# Import demo utilities
from mlflow_demo.utils.mlflow_helpers import get_mlflow_experiment_id, generate_evaluation_links

print('✅ All imports successful')

# 📊 Step 1: Understanding MLflow Evaluation

A **scorer** inspects a trace, evaluates it, and returns feedback that MLflow attaches to that trace. Scorers are the core building block of MLflow's evaluation system.

### 🎯 **What's Inside a Scorer?**

Within a scorer, you can use:

- **LLM judges** - Feed aspects of the trace to an LLM to perform assessments (e.g., "Is this response safe?")
- **Deterministic code** - Count tokens, measure latency, check formatting, validate compliance
- **Combination of both** - Use LLM reasoning for content quality + code for objective measures
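
As a rough illustration of the third option, the sketch below pairs a deterministic length check with a judge passed in as a plain callable. The names `word_count_ok`, `combined_check`, and `fake_safety_judge` are hypothetical and not part of MLflow; the stand-in judge is a trivial keyword test so the example runs without an LLM.

```python
from typing import Callable


def word_count_ok(text: str, low: int = 50, high: int = 400) -> bool:
    """Deterministic check: is the word count inside [low, high]?"""
    return low <= len(text.split()) <= high


def combined_check(text: str, judge: Callable[[str], bool]) -> dict:
    """Run one code-based check and one judge, and report both results."""
    length_ok = word_count_ok(text)
    judge_ok = judge(text)
    return {
        "length_ok": length_ok,
        "judge_ok": judge_ok,
        "passed": length_ok and judge_ok,
    }


def fake_safety_judge(text: str) -> bool:
    """Stand-in for an LLM judge (hypothetical): flags one placeholder word."""
    return "idiot" not in text.lower()


result = combined_check("hello " * 60, fake_safety_judge)
print(result)
```

In a real scorer, the callable would wrap an LLM judge, while the length check stays in plain Python so it remains fast and repeatable.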

### 🚀 **The Complete Workflow: Scorer → Evaluate → Monitor**

**1. Create Scorers** - Build evaluation logic for your quality dimensions

```python
scorers = [Safety(), RelevanceToQuery(), your_custom_scorer]
```

**2. Run offline evaluation with `mlflow.genai.evaluate()`** - MLflow handles the coordination

```python
results = mlflow.genai.evaluate(
    data=your_traces,        # Your traces or test data
    scorers=scorers          # List of scorers to run
)
```

MLflow's `evaluate()` function automatically:

- Feeds your traces through all scorers in parallel
- Aggregates results into structured metrics
- Stores the results in your MLflow Experiment

**3. Use for Online Monitoring** - Use the same scorers to monitor production quality - we cover these steps in notebook 5.

**🔍 Behind the Scenes**: For each trace, each scorer extracts the data it needs, runs its evaluation logic (LLM judges, code, or both), and returns structured feedback that MLflow organizes into actionable insights.
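
To make the "each trace, each scorer" loop concrete, here is a purely conceptual sketch. It is not MLflow's implementation: it runs sequentially rather than in parallel, represents traces as plain dicts, and uses a hypothetical helper named `run_scorers`.

```python
from statistics import mean


def run_scorers(traces, scorers):
    """Apply every scorer to every trace, then average each scorer's values.

    `traces` is a list of dicts and `scorers` maps a name to a function that
    returns a number. Both shapes are assumptions made for this sketch.
    """
    rows = []
    for trace in traces:
        # One row of feedback per trace, one column per scorer
        rows.append({name: fn(trace) for name, fn in scorers.items()})

    metrics = {}
    if rows:
        for name in scorers:
            metrics[f"{name}/mean"] = mean(row[name] for row in rows)
    return rows, metrics


example_traces = [{"response": "Hi Sam"}, {"response": ""}]
example_scorers = {
    "non_empty": lambda t: 1.0 if t["response"].strip() else 0.0,
    "short": lambda t: 1.0 if len(t["response"]) < 20 else 0.0,
}
rows, metrics = run_scorers(example_traces, example_scorers)
print(metrics)
```

The per-trace rows correspond to the feedback attached to each trace, and the averaged values correspond to the aggregated metrics.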

**▶️ Run the next cells to see different types of scorers in action**


# 🎯 Step 2: Predefined LLM Judge Scorers

MLflow provides research-backed judges for common evaluation needs. Here, we will use three predefined scorers that apply to our email generation use case to get a basic quality assessment:

🎯 `RelevanceToQuery`: Does the generated email directly address the user's request?

- Checks if the email content stays focused on what the user asked for
- Critical for maintaining professional relevance in business communications

🏠 `RetrievalGroundedness`: Is the email content grounded in the retrieved customer data?

- Checks for hallucination of customer details, meeting notes, or account information in the generated email
- Essential for maintaining trust and accuracy in customer communications

🔒 `Safety`: Does the generated email avoid harmful or inappropriate content?

- Catches potential toxic content that could damage business relationships
- Critical safeguard for any customer-facing communications

**📚 Documentation**

- [**Predefined Judge Scorers**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/predefined-judge-scorers)


In [None]:
from mlflow.genai.scorers import RelevanceToQuery, RetrievalGroundedness, Safety
import mlflow

# A Scorer operates on an MLflow Trace, so let's retrieve a few Traces:
print('\n🔍 Loading recent traces from our email demo app...')

# Load recent traces for evaluation
traces = mlflow.search_traces(
    max_results=3,
    filter_string='status = "OK"',
    order_by=['timestamp DESC'],
)
print(f"✅ Found {len(traces)} traces for evaluation")

# Now, let's run evaluation using these scorers

eval_results = mlflow.genai.evaluate(data=traces, scorers=[RelevanceToQuery(), RetrievalGroundedness(), Safety()])

print(f"\n📊 Evaluation completed!")
print(f"🆔 Run ID: {eval_results._run_id}")

# Generate and display evaluation links
generate_evaluation_links(eval_results._run_id)

# 🎯 Step 2A: Controlling Data Fields for Built-in Scorers

## The Challenge: One Size Doesn't Fit All

While predefined scorers like `RelevanceToQuery`, `RetrievalGroundedness`, and `Safety` are powerful, they use **default data extraction** that may not always match your specific needs.

### Common Issue: Email Subject Line Groundedness

In our email generation app, we've noticed that `RetrievalGroundedness` sometimes flags email **subject lines** as "not grounded" in the retrieved customer data. However, this is often acceptable because:

- 📧 **Subject lines are often creative summaries** - "Follow-up on our conversation"
- 🎯 **They don't need to reference specific data points** - Unlike the email body
- ✅ **They can be professionally generic** - "Checking in" or "Next steps"

### The Solution: Custom Data Extraction + Predefined Judges

Instead of writing entirely new evaluation logic, we can:

1. **Wrap the proven `is_grounded` judge** - Keep the research-backed evaluation logic
2. **Customize data extraction** - Pass only the email body (not subject) to the judge
3. **Maintain evaluation quality** - Get accurate groundedness assessment for content that matters

This is the **hybrid approach** - combining the best of predefined judges with custom control over data extraction.
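
The data-extraction half of this approach can be sketched without MLflow at all. The helper below is hypothetical (`extract_email_fields` is not an MLflow function) and assumes the app's response is a JSON string with `email_body` and `user_input` keys, as the code cells in this notebook do. It returns empty strings rather than raising when the response is missing or malformed.

```python
import json


def extract_email_fields(raw_response):
    """Pull the email body and user input out of a JSON response string.

    The subject line is deliberately left out so that it is never sent
    to the groundedness judge.
    """
    try:
        payload = json.loads(raw_response)
    except (TypeError, json.JSONDecodeError):
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return {
        "email_body": payload.get("email_body") or "",
        "user_input": payload.get("user_input") or "",
    }


sample = '{"email_subject": "Checking in", "email_body": "Hi Sam, ..."}'
print(extract_email_fields(sample))
```

Keeping extraction in a small function like this makes it possible to test the parsing separately from the judge call.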

**📚 Reference**:

- [**Custom Scorers Documentation**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-scorers#example-2-wrap-a-predefined-llm-judge)


In [None]:
# 🔧 Example: Custom Email Body Groundedness Scorer

from mlflow.genai.judges import is_grounded
from mlflow.genai.scorers import scorer

@scorer
def email_is_grounded(trace):
    """
    Custom groundedness scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven is_grounded judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import is_grounded

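    # Parse the app's JSON response and pull out the fields the judge needs
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    # Assumes the trace has at least one RETRIEVER span; [0] raises IndexError otherwise
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs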

    if user_input is None or len(user_input) == 0:
      request = "Generate an email based on the provided context."
    else:
      request = "Generate an email based on the provided context, considering the user's request: " + user_input

    # Use the proven is_grounded judge with our extracted email body
    return is_grounded(request=request, response=email_body, context=input_facts)

# A Scorer operates on an MLflow Trace, so let's retrieve a few Traces:
print('\n🔍 Loading recent traces from our email demo app...')

# Load recent traces for evaluation
traces = mlflow.search_traces(
    max_results=3,
    filter_string='status = "OK"',
    order_by=['timestamp DESC'],
)
print(f"✅ Found {len(traces)} traces for evaluation")

# Now, let's run evaluation using this scorer

eval_results = mlflow.genai.evaluate(data=traces, scorers=[email_is_grounded])

print(f"\n📊 Evaluation completed!")
print(f"🆔 Run ID: {eval_results._run_id}")

# Generate and display evaluation links
generate_evaluation_links(eval_results._run_id)

# 🛠️ Step 3: Custom Guidelines using Native Judge Classes

While built-in judges handle common quality aspects, your business has specific requirements. We'll create custom guidelines by instantiating MLflow's native judge classes directly.

## What are Guidelines?

**Guidelines** are a simple way to codify your business-specific rules as **natural language criteria** that result in **pass/fail** evaluation. Guidelines can be a short sentence or a longer set of criteria.

As before, we can use the predefined Guidelines scorer directly or wrap it in a custom scorer to control data extraction and processing. This mirrors the earlier pattern: predefined scorers in Step 2, then the hybrid approach with custom data extraction in Step 2A.

### Why Guidelines?

- ✅ **Easy to explain** to business stakeholders
- ✅ **Domain experts can write them** directly - no coding required
- ✅ **Clear pass/fail decisions** - perfect for compliance and business rules
- ✅ **Starting point** - recommended before complex prompt-based judges

### How Guidelines Work

1. **You write the rule** in plain language: _"The email must reference specific customer data"_
2. **An LLM evaluates** whether the content passes or fails the guideline
3. **You get clear feedback** with rationale for the decision
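
Because a guideline is plain text that refers to pieces of context by name, a mismatch between the names in the rule and the keys actually supplied is easy to overlook. The check below is a small, hypothetical helper (`missing_context_fields` is not part of MLflow) that reports field names mentioned in a guideline but absent from the context dictionary.

```python
import re


def missing_context_fields(guideline, context, known_fields):
    """Return the known field names the guideline mentions but the context lacks."""
    referenced = {
        field
        for field in known_fields
        # Whole-word match, so "email" does not match inside "email_body"
        if re.search(rf"\b{re.escape(field)}\b", guideline)
    }
    return referenced - set(context)


guideline = "The email_body must only state facts found in provided_info."
context = {"provided_info": "Account: Acme", "email": "Hi Sam"}
fields = {"email_body", "provided_info", "user_input"}

print(missing_context_fields(guideline, context, fields))
```

Here the guideline talks about `email_body` while the context uses the key `email`, so the helper reports `email_body` as missing.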

## Email Generation Guidelines

For our email app, we'll create simple guidelines to ensure quality:

1. **Tone of Voice** - The tone must be professional
2. **Accuracy** - All facts must come from provided data
3. **Personalization** - Emails must be tailored to specific customers
4. **Relevance** - Content must be prioritized by urgency

**📚 Reference**:

- [**Guidelines Scorers Documentation**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/meets-guidelines)


In [None]:
from mlflow.genai.scorers import Guidelines
from mlflow.genai.judges import meets_guidelines
from mlflow.genai.scorers import scorer
import mlflow

# Tone of voice Guideline - Ensure professional tone
tone = Guidelines(
  name='tone',
  guidelines="""The response maintains a professional tone.""")

# Accuracy Guideline - Verify all facts come from provided data
@scorer
def accuracy(trace):
    """
    Custom accuracy scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the app's JSON response and pull out the fields the judge needs
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    accuracy_guideline = """The email_body correctly references all factual information from the provided_info based on these rules:
- All factual information must be directly sourced from the provided data with NO fabrication
- Names, dates, numbers, and company details must be 100% accurate with no errors
- Meeting discussions must be summarized with the exact same sentiment and priority as presented in the data
- Support ticket information must include correct ticket IDs, status, and resolution details when available
- All product usage statistics must be presented with the same metrics provided in the data
- No references to CloudFlow features, services, or offerings unless specifically mentioned in the customer data
- AUTOMATIC FAIL if any information is mentioned that is not explicitly provided in the data
- It is OK if the email_body follows the user_input request to omit certain facts, as long as no fabricated facts are introduced."""

    # Use the proven Guidelines judge with our extracted email body.
    # The context keys match the names the guideline text refers to.
    return meets_guidelines(guidelines=accuracy_guideline, context={'provided_info': input_facts, 'email_body': email_body, 'user_input': user_input})


# Personalization Guideline - Ensure emails are tailored to specific customers
@scorer
def personalized(trace):
    """
    Custom personalization scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the app's JSON response and pull out the fields the judge needs
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    personalized_guideline = """The email_body demonstrates clear personalization based on the provided_info based on these rules:
- Email must begin by referencing the most recent meeting/interaction
- Immediately next, the email must address the customer's MOST pressing concern as evidenced in the data
- Content structure must be customized based on the account's health status (critical issues first for "Fair" or "Poor" accounts)
- Industry-specific language must be used that reflects the customer's sector
- Recommendations must ONLY reference features that are:
  a) Listed as "least_used_features" in the data, AND
  b) Directly related to the "potential_opportunity" field
- Relationship history must be acknowledged (new vs. mature relationship)
- Deal stage must influence communication approach (implementation vs. renewal vs. growth)
- AUTOMATIC FAIL if recommendations could be copied to another customer in a different situation"""

    # Use the proven Guidelines judge with our extracted email body.
    # The context keys match the names the guideline text refers to.
    return meets_guidelines(guidelines=personalized_guideline, context={'provided_info': input_facts, 'email_body': email_body, 'user_input': user_input})

# Relevance Guideline - Prioritize content by urgency
@scorer
def relevance(trace):
    """
    Custom relevance scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the app's JSON response and pull out the fields the judge needs
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    relevance_guideline = """The email_body prioritizes content that matters to the recipient in the provided_info based on these rules:
- Critical support tickets (status="Open (Critical)") must be addressed after the greeting, reference to the most recent interaction, any pleasantries, and references to closed tickets
- Time-sensitive action items must be addressed before general updates
- Content must be ordered by descending urgency as defined by:
  1. Critical support issues
  2. Action items explicitly stated in most recent meeting
  3. Upcoming renewal if within 30 days
  4. Recently resolved issues
  5. Usage trends and recommendations
- No more than ONE feature recommendation for accounts with open critical issues
- No mentions of company news, product releases, or success stories not directly requested by the customer
- No calls to action unrelated to the immediate needs in the data
- AUTOMATIC FAIL if the email requests a meeting without being tied to a specific action item or opportunity in the data"""

    # Use the proven Guidelines judge with our extracted email body.
    # The context keys match the names the guideline text refers to.
    return meets_guidelines(guidelines=relevance_guideline, context={'provided_info': input_facts, 'email_body': email_body, 'user_input': user_input})


# A Scorer operates on an MLflow Trace, so let's retrieve a few Traces:
print('\n🔍 Loading recent traces from our email demo app...')

# Load recent traces for evaluation
traces = mlflow.search_traces(
    max_results=3,
    filter_string='status = "OK"',
    order_by=['timestamp DESC'],
)
print(f"✅ Found {len(traces)} traces for evaluation")

# Now, let's run evaluation using these scorers

eval_results = mlflow.genai.evaluate(data=traces, scorers=[tone, accuracy, personalized, relevance])

print(f"\n📊 Evaluation completed!")
print(f"🆔 Run ID: {eval_results._run_id}")

# Generate and display evaluation links
generate_evaluation_links(eval_results._run_id)

# 🛠️ Step 4: Custom Prompt-based Judges (Advanced)

For complex, nuanced evaluations that require sophisticated reasoning, you can create custom prompt-based judges with full control over:

1. **Judge prompts** - Custom evaluation instructions
2. **Output formats** - Multiple value types (scores, categories, detailed feedback)
3. **Complex logic** - Multi-dimensional assessment

This approach is best for complex, nuanced evaluations where you need full control over the scorer's prompt, or where the scorer must choose among several output values, for example "great", "ok", or "bad".

Here, we will create a custom prompt scorer that evaluates whether an email will drive positive business outcomes by assessing customer satisfaction, relationship impact, revenue opportunities, issue resolution, and trust-building potential.
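
The prompt in the code cell for this step marks input variables as `{{name}}` and output categories as `[[name]]`, and the judge is given a `numeric_values` mapping keyed by category. The sketch below is an illustrative consistency check only: the regular expressions and the helper names (`template_variables`, `template_choices`, `mismatched_choices`) are assumptions of this example and say nothing about how MLflow parses templates internally.

```python
import re


def template_variables(template):
    """Names written as {{name}} in the prompt template, sorted and de-duplicated."""
    return sorted(set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template)))


def template_choices(template):
    """Category names written as [[name]] in the prompt template, in order."""
    return re.findall(r"\[\[\s*(\w+)\s*\]\]", template)


def mismatched_choices(template, numeric_values):
    """Categories present in only one of: the template, the numeric mapping."""
    return set(template_choices(template)) ^ set(numeric_values)


example_template = """
EMAIL RESPONSE: {{email_body}}
USER'S INSTRUCTION: {{user_input}}

[[high]]: strong positive impact
[[low]]: little impact
"""

print(template_variables(example_template))
print(mismatched_choices(example_template, {"high": 4, "medium": 3}))
```

A check like this, run before the judge is created, can catch a category that appears in the prompt but has no numeric value, or the reverse.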

**📚 Reference**:

- [**Prompt-based Scorers Documentation**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/create-prompt-judge)


In [None]:
# 🛠️ Custom Prompt-based Judges - Advanced Control

from mlflow.genai.scorers import scorer
from mlflow.genai.judges import custom_prompt_judge


# Business Impact Assessment Judge

@scorer
def business_impact(trace):
    """
    Assess whether an email response will drive positive business outcomes
    """

    import json

    business_impact_prompt = """
You are a business value analyst evaluating whether an email response will drive positive business outcomes.

EMAIL RESPONSE: {{email_body}}

USER'S INSTRUCTION: {{user_input}}

CUSTOMER DATA: {{input_facts}}

BUSINESS IMPACT EVALUATION:
Assess whether this email will likely result in:
1. Customer satisfaction improvement
2. Relationship strengthening
3. Revenue protection/growth opportunities
4. Issue resolution acceleration
5. Trust building

Consider:
- Does the email address the customer's most pressing needs?
- Will it move the relationship forward positively?
- Does it demonstrate value and expertise?
- Is it likely to generate a positive response?

You must choose one of the following categories:

[[high]]: Email likely to drive significant positive business impact across multiple dimensions
[[medium]]: Email likely to drive some positive business impact in key areas
[[low]]: Email unlikely to drive meaningful business impact
[[negative]]: Email could harm business relationship or customer satisfaction
"""


    # Parse the app's JSON response and pull out the fields the judge needs
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    # Create business impact judge
    impact_judge = custom_prompt_judge(
        name="business_impact_assessment",
        prompt_template=business_impact_prompt,
        numeric_values={
            "high": 4,
            "medium": 3,
            "low": 2,
            "negative": 1
        }
    )

    return impact_judge(email_body=email_body, user_input=user_input, input_facts=input_facts)

# A Scorer operates on an MLflow Trace, so let's retrieve a few Traces:
print('\n🔍 Loading recent traces from our email demo app...')

# Load recent traces for evaluation
traces = mlflow.search_traces(
    max_results=3,
    filter_string='status = "OK"',
    order_by=['timestamp DESC'],
)
print(f"✅ Found {len(traces)} traces for evaluation")

# Now, let's run evaluation using this scorer

eval_results = mlflow.genai.evaluate(data=traces, scorers=[business_impact])

print(f"\n📊 Evaluation completed!")
print(f"🆔 Run ID: {eval_results._run_id}")

# Generate and display evaluation links
generate_evaluation_links(eval_results._run_id)

# 💻 Step 5: Code-based Scorers (Pure Python Evaluation)

For deterministic, measurable criteria that don't require LLM reasoning, code-based scorers provide:

1. **Fast execution** - No LLM API calls
2. **Deterministic results** - Same input always produces same output
3. **Cost-effective** - No token usage costs
4. **Precise control** - Exact logic for measurable criteria

Perfect for checking compliance, formatting, length requirements, and other objective criteria.
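
One way to keep such checks easy to test is to put the logic in a plain function that takes strings, and have the scorer call it after extracting fields from the trace. The sketch below is a minimal version with two of the checks; the function name `format_checks` and the returned dictionary shape are hypothetical choices for this example.

```python
import re

# Whole-word, case-insensitive greeting match
GREETING = re.compile(r"\b(hi|hello|dear|greetings)\b", re.IGNORECASE)


def format_checks(subject, body):
    """Score two deterministic formatting rules and list any failures."""
    issues = []

    # Rule 1: subject present and between 10 and 80 characters
    if not (subject and 10 <= len(subject) <= 80):
        issues.append("Subject line missing or improper length")

    # Rule 2: body contains a recognizable greeting
    if not GREETING.search(body):
        issues.append("Missing proper greeting")

    total_rules = 2
    score = (total_rules - len(issues)) / total_rules
    return {"score": score, "issues": issues}


print(format_checks("Follow-up on our call", "Hello Sam, thanks for your time."))
print(format_checks("", "Thanks for your time."))
```

Because the function depends only on its arguments, the same input always yields the same score, and it can be unit-tested without any traces or LLM calls.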

**📚 Reference**:

- [**Code-based Scorers Documentation**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-scorers)


In [None]:
from mlflow.genai.scorers import scorer
import mlflow
from mlflow.entities.assessment import Feedback

@scorer
def email_format_compliance(trace):
    """
    Code-based scorer that checks email formatting, structure, and professional standards
    """
    import json
    import re

    # Extract the email content
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body', '')
    email_subject = outputs.get('email_subject', '')

    score = 0
    issues = []
    max_score = 5

    # 1. Check subject line exists and is reasonable length (1 point)
    if email_subject and 10 <= len(email_subject) <= 80:
        score += 1
    else:
        issues.append("Subject line missing or improper length")

    # 2. Check email has proper greeting (1 point)
    greeting_patterns = [r'\bhi\b', r'\bhello\b', r'\bdear\b', r'\bgreetings\b']
    if any(re.search(pattern, email_body.lower()) for pattern in greeting_patterns):
        score += 1
    else:
        issues.append("Missing proper greeting")

    # 3. Check email has closing/signature (1 point)
    closing_patterns = [r'\bbest\b', r'\bregards\b', r'\bthanks\b', r'\bsincerely\b']
    if any(re.search(pattern, email_body.lower()) for pattern in closing_patterns):
        score += 1
    else:
        issues.append("Missing proper closing")

    # 4. Check reasonable email length (1 point)
    word_count = len(email_body.split())
    if 50 <= word_count <= 400:
        score += 1
    else:
        issues.append(f"Email length inappropriate: {word_count} words")

    # 5. Check proper sentence structure (1 point)
    sentences = email_body.split('.')
    proper_sentences = sum(1 for s in sentences if s.strip() and s.strip()[0].isupper())
    if proper_sentences >= len([s for s in sentences if s.strip()]) * 0.8:
        score += 1
    else:
        issues.append("Poor sentence capitalization")

    normalized_score = score / max_score

    return Feedback(value=normalized_score, rationale=f"Email format compliance: {score}/{max_score} points. Issues: {'; '.join(issues) if issues else 'None'}")


# A Scorer operates on an MLflow Trace, so let's retrieve a few Traces:
print('\n🔍 Loading recent traces from our email demo app...')

# Load recent traces for evaluation
traces = mlflow.search_traces(
    max_results=3,
    filter_string='status = "OK"',
    order_by=['timestamp DESC'],
)
print(f"✅ Found {len(traces)} traces for evaluation")

# Now, let's run evaluation using this scorer

eval_results = mlflow.genai.evaluate(data=traces, scorers=[email_format_compliance])

print(f"\n📊 Evaluation completed!")
print(f"🆔 Run ID: {eval_results._run_id}")

# Generate and display evaluation links
generate_evaluation_links(eval_results._run_id)

# 🎯 Complete MLflow Evaluation Mastery - Summary & Next Steps

Congratulations! You've successfully mastered MLflow's complete evaluation ecosystem for GenAI applications.

## What You've Accomplished

✅ **Understood the Architecture** - Judges vs Scorers and how they work together  
✅ **Explored Direct Judge Calls** - Called judges such as `is_grounded` and `meets_guidelines` directly from inside custom scorers  
✅ **Used Predefined Scorers** - Applied research-backed judges with default data extraction  
✅ **Created Hybrid Approaches** - Combined proven judges with custom data extraction  
✅ **Built Guidelines-based Judges** - Encoded business requirements in natural language  
✅ **Implemented Prompt-based Judges** - Created sophisticated custom evaluation logic  
✅ **Developed Code-based Scorers** - Built fast, deterministic evaluation functions

## Resources for Continued Learning

📚 **MLflow Documentation**

- [**Complete Judge & Scorer Reference**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/) - Full API documentation
- [**Evaluation Best Practices**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-app) - Production deployment guide
- [**Monitoring & Alerting**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/monitor-app) - Continuous quality monitoring

🎯 **Remember**: Great evaluation is iterative. Start with the basics, learn from your results, and progressively build more sophisticated assessment as your understanding deepens.
