# LLM-as-a-Judge Tutorial: Multi-Step Planning Evaluation with MLflow

This interactive notebook demonstrates how to use MLflow's LLM-as-a-Judge pattern to evaluate AI agent planning decisions.

## Tutorial Goals

1. Use MLflow tracing to capture agent planning actions
2. Create a judge using `mlflow.genai.judges.make_judge()`
3. Evaluate multi-step plans using the judge
4. Integrate with MLflow experiments for reproducibility

## Scenario

An AI agent creates a multi-step plan to accomplish a task using available resources. The judge evaluates whether the plan is logical, complete, efficient, and uses valid tools.

**Evaluation Criteria:**
- Logical step ordering: Are steps in the correct sequence?
- Tool validity: Are only valid, available tools used?
- Task sufficiency: Will the plan accomplish the goal?
- Efficiency: Is the approach optimal?
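
Of these criteria, tool validity is the one that can also be checked deterministically before the judge is called. The sketch below is an illustration only and is not part of the tutorial's `AgentPlanningJudge`: the helper name `find_invalid_tools` is hypothetical, and it assumes every tool name ends in `_api`, as the resources in this notebook do.

```python
import re
from typing import List, Set

# Assumption: tool names look like "something_api", as in this notebook's scenario.
TOOL_PATTERN = re.compile(r"\b[a-z][a-z0-9_]*_api\b")


def find_invalid_tools(plan: str, available_resources: List[str]) -> Set[str]:
    """Return tool names mentioned in the plan that are not in the available list."""
    mentioned = set(TOOL_PATTERN.findall(plan))
    return mentioned - set(available_resources)


sample_plan = (
    "1. flight_search_api: Find flights from NYC to SF\n"
    "2. payment_api: Charge the card\n"
    "3. calendar_api: Add the flight to the calendar"
)
invalid = find_invalid_tools(
    sample_plan, ["flight_search_api", "booking_api", "calendar_api"]
)
print(sorted(invalid))  # ['payment_api']
```

A check like this only covers hallucinated tool names; ordering, sufficiency, and efficiency still need the LLM judge.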

---

This notebook is based on the `tool_selection_judge.py` pattern, adapted for planning evaluation.

## Setup: Import Dependencies

First, let's import all the necessary libraries.

In [None]:
from genai.common.config import AgentConfig
from genai.agents.agent_planning.prompts import get_judge_instructions, get_planning_prompt
import mlflow
from typing import List
import os
from pathlib import Path

# Load environment variables from .env file if it exists
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print(f"ℹ️  No .env file found at {env_file.absolute()}")
        print("   You can create one with your credentials or set environment variables manually")
except ImportError:
    print("ℹ️  python-dotenv not installed. Install with: pip install python-dotenv")
    print("   Or set environment variables manually in the cells below")

## Configuration: Set Up Environment

**Two ways to configure credentials:**

1. **Recommended**: Create a `.env` file in this directory with your credentials
2. **Alternative**: Uncomment and set credentials in the cells below

### Create a `.env` file (Recommended)

Create a file named `.env` in the same directory as this notebook:

**For Databricks:**
```
DATABRICKS_TOKEN=your-token-here
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
```

**For OpenAI:**
```
OPENAI_API_KEY=sk-your-key-here
```

The cell above will automatically load these credentials.
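
Whichever way the credentials are supplied, each provider needs the variables listed above. As a minimal sketch (the names `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical, not part of the tutorial code), a table-driven helper can report exactly which variables are missing:

```python
from typing import Dict, List, Mapping

# Required variables per provider, taken from the .env examples above.
REQUIRED_ENV_VARS: Dict[str, List[str]] = {
    "databricks": ["DATABRICKS_TOKEN", "DATABRICKS_HOST"],
    "openai": ["OPENAI_API_KEY"],
}


def missing_env_vars(provider: str, env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are absent or empty in `env`."""
    if provider not in REQUIRED_ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    return [name for name in REQUIRED_ENV_VARS[provider] if not env.get(name)]


# Demonstrated with a plain dict; in the notebook you would pass os.environ.
fake_env = {"DATABRICKS_TOKEN": "your-token-here"}
print(missing_env_vars("databricks", fake_env))  # ['DATABRICKS_HOST']
print(missing_env_vars("openai", {"OPENAI_API_KEY": "sk-your-key-here"}))  # []
```

Taking the environment as a parameter keeps the helper easy to test without touching real credentials.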

In [26]:
# ============================================================================
# CONFIGURATION: Choose your provider and models
# ============================================================================

# Option 1: Databricks (default)
PROVIDER = "databricks"
AGENT_MODEL = "databricks-gpt-5"
JUDGE_MODEL = "databricks-gemini-2-5-flash"

# Option 2: OpenAI (uncomment to use)
# PROVIDER = "openai"
# AGENT_MODEL = "gpt-4o-mini"
# JUDGE_MODEL = "gpt-4o"

# Other settings
TEMPERATURE = 1.0
EXPERIMENT_NAME = "agent-planning-judge-notebook"

print("✓ Configuration set:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model: {JUDGE_MODEL}")

✓ Configuration set:
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash


### Verify Credentials (Optional Manual Setup)

If you didn't create a `.env` file, you can set credentials manually by uncommenting the appropriate lines below:

In [27]:
# ============================================================================
# MANUAL CREDENTIAL SETUP (if not using .env file)
# ============================================================================

# For Databricks - Uncomment and set these if you didn't create a .env file
# os.environ["DATABRICKS_TOKEN"] = "your-token-here"
# os.environ["DATABRICKS_HOST"] = "https://your-workspace.cloud.databricks.com"

# For OpenAI - Uncomment and set this if you didn't create a .env file
# os.environ["OPENAI_API_KEY"] = "sk-your-key-here"

# Verify credentials are set
if PROVIDER == "databricks":
    if "DATABRICKS_TOKEN" in os.environ and "DATABRICKS_HOST" in os.environ:
        print("✓ Databricks credentials found")
    else:
        print("⚠️  Missing Databricks credentials!")
        print("   Create a .env file or uncomment the lines above to set credentials")
elif PROVIDER == "openai":
    if "OPENAI_API_KEY" in os.environ:
        print("✓ OpenAI credentials found")
    else:
        print("⚠️  Missing OpenAI credentials!")
        print("   Create a .env file or uncomment the lines above to set credentials")

✓ Databricks credentials found


## Step 1: Setup MLflow Tracing

Enable MLflow tracing to capture all agent planning actions and LLM calls automatically.

In [28]:
from genai.common.mlflow_config import setup_mlflow_tracking

setup_mlflow_tracking(
    experiment_name=EXPERIMENT_NAME,
    enable_autolog=True
)

print("✓ MLflow tracing enabled")
print(f"  Experiment: {EXPERIMENT_NAME}")
print("  View traces: mlflow ui")

✓ MLflow tracing enabled
  Experiment: agent-planning-judge-notebook
  View traces: mlflow ui


## Step 2: Import the Agent Class

Instead of redefining the class, we import it directly from the module to follow DRY principles.

In [30]:
# Import the AgentPlanningJudge class from the module
from genai.agents.agent_planning import AgentPlanningJudge

print("✓ AgentPlanningJudge imported successfully")
print("\nThe class provides:")
print("  - create_plan(): Generate multi-step plans")
print("  - evaluate(): Judge plan quality using MLflow")
print("  - execute_plan_with_tools(): Execute plans with actual tool calls")

✓ AgentPlanningJudge imported successfully

The class provides:
  - create_plan(): Generate multi-step plans
  - evaluate(): Judge plan quality using MLflow
  - execute_plan_with_tools(): Execute plans with actual tool calls


## Step 2 (continued): Initialize Agent and Judge

Create the configuration and instantiate our planning judge.

In [31]:
# Create agent configuration
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE
)

# Initialize the judge
judge = AgentPlanningJudge(config, judge_model=JUDGE_MODEL)

print("✓ Agent and Judge initialized")
print(f"  Provider: {config.provider}")
print(f"  Agent Model: {config.model}")
print(f"  Judge Model: {JUDGE_MODEL}")
print(f"  Temperature: {config.temperature}")

✓ Agent and Judge initialized
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash
  Temperature: 1.0


## Step 3: Define Planning Scenario

Set up a task goal and available resources for the agent to plan with.

**💡 Try different task goals to see how the agent plans and how the judge evaluates!**

In [32]:
# Define the scenario
task_goal = "Book a flight from NYC to SF for next Tuesday and add to calendar"
available_resources = [
    "flight_search_api",
    "booking_api",
    "calendar_api",
    "hotel_search_api",
    "email_api"
]

print("Planning Scenario:")
print(f"  Task Goal: {task_goal}")
print(f"  Available Resources: {available_resources}")

Planning Scenario:
  Task Goal: Book a flight from NYC to SF for next Tuesday and add to calendar
  Available Resources: ['flight_search_api', 'booking_api', 'calendar_api', 'hotel_search_api', 'email_api']


### 💡 Try Different Task Goals

Uncomment one of these examples or write your own:

In [None]:
# Example task goals to try:
# task_goal = "Send confirmation email after booking flight"
# available_resources = ["email_api", "flight_search_api", "booking_api"]

# task_goal = "Find and book a hotel in Paris for next month"
# available_resources = ["hotel_search_api", "booking_api", "email_api"]

# task_goal = "Book flight and add travel event to calendar"
# available_resources = ["calendar_api", "flight_search_api", "booking_api", "email_api"]

## Step 4: Agent Creates Plan

The agent creates a multi-step plan based on the task goal. MLflow automatically traces this action.

In [33]:
print("\n[Step 4] Agent creates a multi-step plan...")

plan = judge.create_plan(task_goal, available_resources)

print("  └─ ✓ Plan created:\n")
# Indent the plan for better readability
for line in plan.split('\n'):
    if line.strip():
        print(f"      {line}")


[Step 4] Agent creates a multi-step plan...
  └─ ✓ Plan created:

      1. flight_search_api: Query round-trip/one-way options from NYC (JFK/LGA/EWR) to San Francisco (SFO/OAK/SJC) for next Tuesday, specifying preferred departure window (e.g., 7:00–12:00), cabin class, non-stop preference, baggage needs, and max budget.
      2. flight_search_api: Filter results for best options by total travel time, price, and baggage/fare rules; retrieve at least three candidate flights with fare class, change/refund policies, and total cost breakdown.
      3. flight_search_api: Retrieve seat maps and fare conditions for the top candidate to confirm seat availability and any add-on fees.
      4. booking_api: Create a booking hold for the selected flight with passenger details (full name, DOB, gender, TSA Known Traveler Number if any, frequent flyer number, contact email/phone), and specify seating preference and baggage requirements; capture the hold expiration time.
      5. booking_api: 
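
The plan comes back as plain numbered text, so any downstream step has to split it into individual steps first. The sketch below shows one way to do that; it is not the tutorial's own parser. It assumes each step starts with `N. ` and optionally names its tool as `tool_name: ...`, as in the output above, and the name `parse_plan` is hypothetical.

```python
import re
from typing import List

# Assumption: each step starts with "N. " and may name its tool as "tool_name: ...".
STEP_PATTERN = re.compile(r"^\s*(\d+)\.\s+(?:([a-z_][a-z0-9_]*):\s*)?(.*)$")


def parse_plan(plan: str) -> List[dict]:
    """Split a numbered plan into steps with number, tool (or None), and description."""
    steps = []
    for line in plan.splitlines():
        match = STEP_PATTERN.match(line)
        if match:
            number, tool, description = match.groups()
            steps.append(
                {"number": int(number), "tool": tool, "description": description.strip()}
            )
    return steps


sample = (
    "1. flight_search_api: Query options from NYC to SF\n"
    "2. Gather traveler details\n"
    "3. booking_api: "
)
for step in parse_plan(sample):
    print(step["number"], step["tool"], "|", step["description"])
```

Steps without a leading tool name come back with `tool` set to `None`, so a plan phrased in a different style would need a different pattern.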

## Step 5: Judge Evaluates the Plan

Now the judge evaluates the plan quality on multiple dimensions.

In [34]:
print("\n[Step 5] Judge evaluates the plan quality...")

# Get the trace ID from the agent's action
trace_id = mlflow.get_last_active_trace_id()

# Evaluate with the judge
result = judge.evaluate(trace_id)

# Display results
print("\n[Step 6] Evaluation Results")
print("=" * 70)
print(f"Quality: {result['quality'].upper()} (Score: {result['score']}/5)")
print("\nDetailed Assessment:")
print(f"{result['reasoning']}")
print("=" * 70)


[Step 5] Judge evaluates the plan quality...

[Step 6] Evaluation Results
Quality: EXCELLENT (Score: 5/5)

Detailed Assessment:
The AI agent's plan demonstrates an excellent understanding of the task goal and available resources. It is comprehensive, logically structured, and uses tools appropriately.

**Strengths of the plan:**
*   **Comprehensive:** The plan covers all aspects of booking a flight, from searching and filtering to booking, payment, seat assignments, notifications, and calendar integration. It even includes post-booking considerations like email confirmation and optional price monitoring.
*   **Logical Flow:** The steps are ordered very logically, with each step building on the previous one. For example, searching comes before booking, and booking comes before adding to the calendar.
*   **Tool Utilization:** The plan correctly identifies and uses the `flight_search_api`, `booking_api`, `calendar_api`, and `email_api` as needed, and correctly avoids the `hotel_search_a

## Step 7: Execute Plan with Actual Tools

Now we'll execute the plan by calling the tools themselves, which in this tutorial are simulated APIs. This demonstrates the complete multi-agent workflow, in which the LLM dynamically selects which tool to call for each step.

In [35]:
print("\n[Step 7] Executing plan with actual tools...")
print("=" * 70)

# Execute the plan
execution_result = judge.execute_plan_with_tools(plan, task_goal)

print("\n  ✓ Execution Complete!")
print(f"  └─ Total Steps: {execution_result['total_steps']}")
print(f"  └─ Successful: {execution_result['successful_steps']}/{execution_result['total_steps']}")

print("\n  Step-by-Step Results:")
for step_result in execution_result['step_results']:
    step_num = step_result['step_number']
    tool = step_result.get('tool_used', 'No tool')
    success = '✓' if step_result.get('success') else '✗'
    
    print(f"  {success} Step {step_num}: {tool}")
    if step_result.get('result'):
        result_preview = str(step_result['result'])[:100]
        print(f"     Result: {result_preview}...")

print("=" * 70)


[Step 7] Executing plan with actual tools...

  Executing 18 steps...
  └─ Step 1/18: flight_search_api: Query round-trip/one-way options from NYC...
     ✓ Used flight_search_api
  └─ Step 2/18: flight_search_api: Filter results for best options by total ...
     ✓ Used flight_search_api
  └─ Step 3/18: flight_search_api: Retrieve seat maps and fare conditions fo...
     ✓ Used flight_search_api
  └─ Step 4/18: booking_api: Create a booking hold for the selected flight w...
     ✓ Used booking_api
  └─ Step 5/18: booking_api: Calculate final total including taxes, seat sel...
  └─ Step 6/18: booking_api: Confirm purchase using stored or provided payme...
     ✓ Used booking_api
  └─ Step 7/18: booking_api: Add seat assignments if not already included; c...
     ✓ Used booking_api
  └─ Step 8/18: booking_api: Opt in for flight status notifications if avail...
  └─ Step 9/18: calendar_api: Create a calendar event titled “Flight: NYC to.

## 🎯 Try It Yourself!

Run the complete workflow (Plan → Evaluate → Execute) with a custom task:

In [36]:
def run_complete_workflow(task: str, resources: List[str]):
    """
    Complete workflow: Agent creates plan → Judge evaluates → Execute with tools
    """
    print(f"\n{'='*70}")
    print(f"Task Goal: {task}")
    print(f"Available Resources: {resources}")
    print(f"{'='*70}\n")

    # Step 1: Agent creates plan
    print("📝 Creating plan...")
    plan = judge.create_plan(task, resources)
    print("\n✓ Plan created:")
    for line in plan.split('\n'):
        if line.strip():
            print(f"  {line}")

    # Step 2: Judge evaluates plan quality
    print("\n⚖️  Evaluating plan quality...")
    trace_id = mlflow.get_last_active_trace_id()
    eval_result = judge.evaluate(trace_id)

    print(f"\n✓ Quality: {eval_result['quality'].upper()} (Score: {eval_result['score']}/5)")
    print(f"  Assessment: {eval_result['reasoning'][:150]}...")

    # Step 3: Execute plan with actual tools
    print("\n🔧 Executing plan with tools...")
    exec_result = judge.execute_plan_with_tools(plan, task)

    print(f"\n✓ Execution: {exec_result['successful_steps']}/{exec_result['total_steps']} steps successful")
    for step_result in exec_result['step_results']:
        if step_result.get('tool_used'):
            success = "✓" if step_result.get('success') else "✗"
            print(f"  {success} Step {step_result['step_number']}: {step_result['tool_used']}")
    
    print(f"\n{'='*70}\n")
    
    return {
        "evaluation": eval_result,
        "execution": exec_result
    }

# Try the complete workflow!
result = run_complete_workflow(
    task="Book a flight to Boston and send confirmation email",
    resources=["flight_search_api", "booking_api", "email_api"]
)


Task Goal: Book a flight to Boston and send confirmation email
Available Resources: ['flight_search_api', 'booking_api', 'email_api']

📝 Creating plan...

✓ Plan created:
  1. Gather traveler details and preferences (origin city/airport, destination: Boston, travel dates/times, cabin class, number of passengers, baggage needs, seating preferences, max budget, flexibility) from the requester.
  2. Call flight_search_api with the collected parameters to retrieve available flight options to Boston.
  3. Parse flight_search_api results to filter for:
     - Matching or best-fit dates/times and cabin class
     - Total price within budget (including baggage if available)
     - Reasonable layovers and total duration
     - Preferred airlines or loyalty programs (if provided)
  4. Select the top 3 candidate itineraries based on price, duration, and preferences; include fare rules and cancellation/change policies if provided by flight_search_api.
  5. Present the top 3 options to the re

## 🧪 Experiment: Complete Workflow with Multiple Scenarios

Let's run the complete workflow (Plan → Evaluate → Execute) on multiple tasks:

In [37]:
# Define test scenarios (using only available tools)
test_scenarios = [
    (
        "Book a flight and hotel for vacation",
        ["flight_search_api", "booking_api", "hotel_search_api", "email_api"]
    ),
    (
        "Schedule team meeting and send invites",
        ["calendar_api", "email_api"]
    ),
    (
        "Plan business trip with flight, hotel, and calendar",
        ["flight_search_api", "booking_api", "hotel_search_api", "calendar_api", "email_api"]
    ),
]

# Run complete workflow for all scenarios
results = []
for task, resources in test_scenarios:
    result = run_complete_workflow(task, resources)
    results.append({
        "task": task,
        "plan_score": result["evaluation"]["score"],
        "plan_quality": result["evaluation"]["quality"],
        "exec_success_rate": result["execution"]["successful_steps"] / max(result["execution"]["total_steps"], 1) * 100  # max() guards against a plan with zero steps
    })

# Summary
print("\n" + "=" * 70)
print("📊 EXPERIMENT SUMMARY")
print("=" * 70)
avg_plan_score = sum(r["plan_score"] for r in results) / len(results)
avg_exec_rate = sum(r["exec_success_rate"] for r in results) / len(results)

print(f"\nAverage Plan Quality Score: {avg_plan_score:.1f}/5")
print(f"Average Execution Success Rate: {avg_exec_rate:.0f}%\n")

print("Detailed Results:")
print(f"{'Task':<50} | {'Plan Quality':<12} | {'Score':<5} | {'Exec %':<6}")
print("-" * 70)
for r in results:
    print(f"{r['task'][:48]:<50} | {r['plan_quality']:12s} | {r['plan_score']}/5   | {r['exec_success_rate']:.0f}%")
print("=" * 70)


Task Goal: Book a flight and hotel for vacation
Available Resources: ['flight_search_api', 'booking_api', 'hotel_search_api', 'email_api']

📝 Creating plan...

✓ Plan created:
  1. Define trip parameters (no API): Determine traveler details (names as on passports), preferred departure/return dates, origin and destination cities/airports, cabin class, baggage needs, seating preferences, hotel dates, location preferences, star rating, budget ceilings for flight and hotel, and acceptable layover limits.
  2. Search flights (flight_search_api): Query round-trip options with origin, destination, date range (include +/- 1–2 days for flexibility), passenger count, cabin class, max layovers, preferred departure time windows, and budget cap. Request results with total price, fare rules, baggage allowance, change/cancel policies, layover durations, and hold/lock options.
  3. Filter and shortlist flights (no API): Rank flights by total travel time, layover quality, on-time performance (i

## 🎨 Customization

### Modify Evaluation Criteria

The judge's evaluation criteria are defined in `prompts.py`. You can view them:

In [38]:
print("Current Judge Instructions:")
print("=" * 70)
print(get_judge_instructions())
print("=" * 70)

Current Judge Instructions:
You are an expert evaluator assessing AI agent planning capabilities.

You will receive a trace showing the agent's planning process:
{{ trace }}

Evaluate the plan's quality by considering these criteria:

1. **Logical Step Ordering**: Are the steps in the correct sequence? Does each step build on previous ones?
2. **Tool Validity**: Does the plan only use valid, available tools/resources? No hallucinated tools?
3. **Task Sufficiency**: Will this plan actually accomplish the stated goal? Are all necessary steps included?
4. **Efficiency**: Is the approach optimal, or are there unnecessary steps or redundant actions?

Assign a quality rating:

- **excellent**: Clear structure, logical flow, all steps necessary and sufficient, optimal efficiency
- **good**: Well-structured with minor inefficiencies or small improvements possible
- **adequate**: Basic plan that works but lacks optimization or has minor logical issues
- **poor**: Significant gaps, illogical ord
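
The `{{ trace }}` placeholder in these instructions is what lets a judge created with `make_judge()` read the whole planning trace. The following is a minimal sketch of how such a judge might be built and called, assuming MLflow 3.4 or later; it is not the tutorial's `AgentPlanningJudge.evaluate()` implementation. The judge name, the shortened instructions, the model URI format, and the helper names are all assumptions made for illustration.

```python
# Shortened instructions for illustration; the tutorial's real text comes from
# get_judge_instructions() in prompts.py.
PLANNING_JUDGE_INSTRUCTIONS = """\
You are an expert evaluator assessing AI agent planning capabilities.

You will receive a trace showing the agent's planning process:
{{ trace }}

Assign a quality rating and explain your reasoning.
"""


def build_planning_judge(model_uri: str):
    """Create a trace-based judge. Assumes MLflow 3.4+ is installed."""
    # Imported lazily so this module can be loaded without MLflow installed.
    from mlflow.genai.judges import make_judge

    return make_judge(
        name="plan_quality",  # hypothetical judge name
        instructions=PLANNING_JUDGE_INSTRUCTIONS,
        model=model_uri,  # assumed URI format, e.g. "openai:/gpt-4o"
    )


def evaluate_trace(judge, trace_id: str):
    """Fetch a logged trace by ID and pass it to the judge."""
    import mlflow

    trace = mlflow.get_trace(trace_id)
    # The judge is expected to return feedback carrying a value and a rationale.
    return judge(trace=trace)
```

Calling `evaluate_trace(build_planning_judge(...), trace_id)` needs valid credentials for the judge model, so it is left out of the sketch.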

## 📚 Key Concepts Summary

### Complete Workflow: Plan → Evaluate → Execute

This tutorial demonstrates the full multi-agent planning lifecycle:

1. **Planning**: Agent creates multi-step plan using LLM
2. **Evaluation**: Judge assesses plan quality on 4 dimensions
3. **Execution**: Executor actually calls tools using LLM function calling

### MLflow Tracing
- Automatically captures planning inputs/outputs
- Tracks execution time and metadata
- Creates parent-child relationships for nested calls
- Records every tool call with parameters and results

### MLflow Judge
- Created with `make_judge()`
- Evaluates on 4 dimensions: ordering, validity, sufficiency, efficiency
- Returns quality level + detailed rationale

### Tool Execution with LLM Function Calling
- **Dynamic Tool Selection**: LLM chooses which tool to use for each step
- **Parameter Generation**: LLM determines appropriate parameters
- **Context Passing**: Results from earlier steps inform later steps
- **Automatic Tracing**: Every tool call is traced in MLflow
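
The context-passing idea can be sketched without an LLM. In the simplified illustration below, tools are looked up by name rather than selected by a model, and both simulated tools are hypothetical stand-ins; only the shape of the returned dictionary (`total_steps`, `successful_steps`, `step_results`) mirrors what the notebook reads from `execute_plan_with_tools()`.

```python
from typing import Any, Callable, Dict, List


# Hypothetical simulated tools: each reads the shared context and returns a result dict.
def flight_search_api(context: Dict[str, Any]) -> Dict[str, Any]:
    return {"flight_id": "SIM-001"}


def booking_api(context: Dict[str, Any]) -> Dict[str, Any]:
    # Uses the result of the earlier search step (context passing).
    return {"confirmation": f"BOOKED-{context['flight_id']}"}


TOOLS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "flight_search_api": flight_search_api,
    "booking_api": booking_api,
}


def execute_steps(tool_names: List[str]) -> Dict[str, Any]:
    """Run tools in order, merging each result into a context shared with later steps."""
    context: Dict[str, Any] = {}
    step_results = []
    for number, name in enumerate(tool_names, start=1):
        tool = TOOLS.get(name)
        if tool is None:
            # Unknown tool: record a failed step and keep going.
            step_results.append({"step_number": number, "tool_used": None, "success": False})
            continue
        result = tool(context)
        context.update(result)
        step_results.append(
            {"step_number": number, "tool_used": name, "success": True, "result": result}
        )
    return {
        "total_steps": len(step_results),
        "successful_steps": sum(1 for step in step_results if step["success"]),
        "step_results": step_results,
    }


summary = execute_steps(["flight_search_api", "booking_api", "payment_api"])
print(f"{summary['successful_steps']}/{summary['total_steps']} steps successful")  # 2/3 steps successful
```

In the tutorial, the lookup by name is replaced by LLM function calling, which also generates the parameters for each call.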

### Quality Assessment Criteria
- **Logical Ordering**: Sequential flow of steps
- **Tool Validity**: Only uses available resources
- **Task Sufficiency**: Complete plan to achieve goal
- **Efficiency**: Optimal approach without redundancy

### Available Tools (5 Simulated APIs)
- `flight_search_api`: Search for flights
- `booking_api`: Book flights
- `hotel_search_api`: Search for hotels
- `calendar_api`: Manage calendar events
- `email_api`: Send emails

In [None]:
example_prompt = get_planning_prompt(
    "Book a flight to Boston",
    ["flight_search_api", "booking_api"]
)

print("Example Planning Prompt:")
print("=" * 70)
print(example_prompt)
print("=" * 70)

## 📊 Quality Levels Explained

The judge evaluates plans on a 5-point scale:

| Quality Level | Score | Description |
|--------------|-------|-------------|
| **Excellent** | 5 | Clear structure, logical flow, all steps necessary and sufficient, optimal efficiency |
| **Good** | 4 | Well-structured with minor inefficiencies or small improvements possible |
| **Adequate** | 3 | Basic plan that works but lacks optimization or has minor logical issues |
| **Poor** | 2 | Significant gaps, illogical ordering, or uses invalid tools |
| **Very Poor** | 1 | Fundamentally flawed, cannot achieve goal, or relies heavily on unavailable resources |
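
The table translates directly into a lookup. This is a sketch of the mapping only; how `AgentPlanningJudge` derives `score` from `quality` internally is not shown in this notebook, and the function name `quality_to_score` is hypothetical.

```python
# Quality labels and scores copied from the table above.
QUALITY_SCORES = {
    "excellent": 5,
    "good": 4,
    "adequate": 3,
    "poor": 2,
    "very poor": 1,
}


def quality_to_score(quality: str) -> int:
    """Map a quality label to its 1-5 score, ignoring case, underscores, and hyphens."""
    key = quality.strip().lower().replace("_", " ").replace("-", " ")
    if key not in QUALITY_SCORES:
        raise ValueError(f"Unknown quality level: {quality!r}")
    return QUALITY_SCORES[key]


print(quality_to_score("EXCELLENT"))  # 5
print(quality_to_score("very_poor"))  # 1
```

Normalizing the label first makes the lookup tolerant of variations in how a judge model spells its answer.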

## 🚀 Next Steps

1. **Explore MLflow UI**: Run `mlflow ui` to see detailed traces
2. **Modify Prompts**: Edit `prompts.py` to change evaluation criteria
3. **Try Different Models**: Experiment with different agent and judge models
4. **Add More Resources**: Expand the `available_resources` list
5. **Complex Tasks**: Try multi-domain tasks requiring diverse resources
6. **Apply to Your Use Case**: Adapt this pattern for your own planning evaluations

## 📖 Resources

- [MLflow GenAI Judges Documentation](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html)
- [Tool Selection Judge](../tools_selection/tool_selection_judge.ipynb) - Related tutorial
- [Agent Planning Judge README](README.md)