# LLM-as-a-Judge Tutorial: Tool Selection Evaluation with MLflow

This interactive notebook demonstrates how to use MLflow's LLM-as-a-Judge pattern to evaluate AI agent decisions.

## Tutorial Goals

1. Use MLflow tracing to capture agent actions
2. Create a judge using `mlflow.genai.judges.make_judge()`
3. Evaluate agent decisions using the judge
4. Integrate with MLflow experiments for reproducibility

## Scenario

An AI agent selects a tool to answer a user query. The judge evaluates whether the agent chose the appropriate tool.

**Evaluation Criteria:**
- Does the selected tool match the user's intent?
- Can this tool address the task requirements?
- Are there more suitable tools available?


![LLM-as-a-Judge tool selection evaluation with MLflow](images/tools_selection_notebook_diagram.png)

---

Based on: [Using LLM as a Judge](https://medium.com/@juanc.olamendy/using-llm-as-a-judge-to-evaluate-agent-outputs-a-comprehensive-tutorial-00b6f1f356cc)

## Setup: Import Dependencies

First, let's import all the necessary libraries.

In [None]:
from genai.common.config import AgentConfig
from genai.agents.tools_selection.prompts import get_judge_instructions, get_tool_selection_prompt
import mlflow
from typing import List
import os
from pathlib import Path

# Load environment variables from .env file if it exists
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print(f"ℹ️  No .env file found at {env_file.absolute()}")
        print("   You can create one with your credentials or set environment variables manually")
except ImportError:
    print("ℹ️  python-dotenv not installed. Install with: pip install python-dotenv")
    print("   Or set environment variables manually in the cells below")

## Configuration: Set Up Environment

**Two ways to configure credentials:**

1. **Recommended**: Create a `.env` file in this directory with your credentials
2. **Alternative**: Uncomment and set credentials in the cells below

### Create a `.env` file (Recommended)

Create a file named `.env` in the same directory as this notebook:

**For Databricks:**
```
DATABRICKS_TOKEN=your-token-here
DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
```

**For OpenAI:**
```
OPENAI_API_KEY=sk-your-key-here
```

The cell above will automatically load these credentials.
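If `python-dotenv` is not installed, a small standard-library fallback can read the same file. The sketch below is illustrative only: `load_env_file` is a hypothetical helper, not part of this project or of `python-dotenv`, and it handles only plain `KEY=VALUE` lines. Like `load_dotenv` with its default settings, it leaves variables that are already set untouched.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines and export them without overriding existing variables."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        # Variables already present in the environment take precedence
        if key and key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` returns a dictionary of the variables it actually set, which makes it easy to confirm what was picked up from the file.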

In [2]:
# ============================================================================
# CONFIGURATION: Choose your provider and models
# ============================================================================

# Option 1: Databricks (default)
PROVIDER = "databricks"
AGENT_MODEL = "databricks-gpt-5"
JUDGE_MODEL = "databricks-gemini-2-5-flash"

# Option 2: OpenAI (uncomment to use)
# PROVIDER = "openai"
# AGENT_MODEL = "gpt-4o-mini"
# JUDGE_MODEL = "gpt-4o"

# Other settings
TEMPERATURE = 1.0
EXPERIMENT_NAME = "tool-selection-judge-notebook"

print("✓ Configuration set:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model: {JUDGE_MODEL}")

✓ Configuration set:
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash


### Verify Credentials (Optional Manual Setup)

If you didn't create a `.env` file, you can set credentials manually by uncommenting the appropriate lines below:

In [3]:
# ============================================================================
# MANUAL CREDENTIAL SETUP (if not using .env file)
# ============================================================================

# For Databricks - Uncomment and set these if you didn't create a .env file
# os.environ["DATABRICKS_TOKEN"] = "your-token-here"
# os.environ["DATABRICKS_HOST"] = "https://your-workspace.cloud.databricks.com"

# For OpenAI - Uncomment and set this if you didn't create a .env file
# os.environ["OPENAI_API_KEY"] = "sk-your-key-here"

# Verify credentials are set
if PROVIDER == "databricks":
    if "DATABRICKS_TOKEN" in os.environ and "DATABRICKS_HOST" in os.environ:
        print("✓ Databricks credentials found")
    else:
        print("⚠️  Missing Databricks credentials!")
        print("   Create a .env file or uncomment the lines above to set credentials")
elif PROVIDER == "openai":
    if "OPENAI_API_KEY" in os.environ:
        print("✓ OpenAI credentials found")
    else:
        print("⚠️  Missing OpenAI credentials!")
        print("   Create a .env file or uncomment the lines above to set credentials")

✓ Databricks credentials found


## Step 1: Setup MLflow Tracing

Enable MLflow tracing to capture all agent actions and LLM calls automatically.

In [None]:
from genai.common.mlflow_config import setup_mlflow_tracking

setup_mlflow_tracking(
    experiment_name=EXPERIMENT_NAME,
    enable_autolog=True
)

print("\n[Step 1] MLflow tracing enabled")
print(f"  └─ Experiment: {EXPERIMENT_NAME}")
print("  └─ View traces: mlflow ui")

## Import the Agent Class

Import the `AgentToolSelectionJudge` class that demonstrates the complete LLM-as-a-Judge pattern:
1. Agent performs an action (`select_tool`) - traced with MLflow
2. Judge evaluates the action (`evaluate`) - uses `make_judge()`

In [None]:
# Import the AgentToolSelectionJudge class from the module
from genai.agents.tools_selection.tool_selection_judge import AgentToolSelectionJudge

print("✓ AgentToolSelectionJudge imported successfully")
print("\nThe class provides:")
print("  - select_tool(): Agent selects a tool based on user query")
print("  - evaluate(): Judge evaluates the agent's tool selection")
print("  - Uses MLflow tracing and make_judge() for evaluation")
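The source of `AgentToolSelectionJudge` is not shown in this notebook. As a rough mental model of the agent half only, the sketch below shows one way a `select_tool` method could be structured. The class `ToolSelectorSketch`, the prompt wording, and the `fake_llm` stub are all hypothetical; the real class may work differently.

```python
from typing import Callable, List


class ToolSelectorSketch:
    """Hypothetical stand-in for the agent half of AgentToolSelectionJudge."""

    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm  # any function mapping a prompt to a text reply

    def select_tool(self, user_request: str, available_tools: List[str]) -> str:
        prompt = (
            f"User request: {user_request}\n"
            f"Available tools: {', '.join(available_tools)}\n"
            "Reply with exactly one tool name."
        )
        # Remove whitespace and stray quoting around the reply
        reply = self.llm(prompt).strip().strip("`'\"")
        # Reject replies that are not in the offered list
        if reply not in available_tools:
            raise ValueError(f"Model returned an unknown tool: {reply!r}")
        return reply


def fake_llm(prompt: str) -> str:
    """Offline stub: looks only at the request line, not at the tool list."""
    request_line = prompt.splitlines()[0].lower()
    return "get_weather_api" if "weather" in request_line else "search_web"
```

Injecting the model as a plain callable keeps the sketch testable offline: `ToolSelectorSketch(fake_llm).select_tool(...)` runs without any credentials.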

## Step 2: Initialize Agent and Judge

Create the configuration and instantiate our judge.

In [None]:
# Create agent configuration
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE
)

# Initialize the judge
judge = AgentToolSelectionJudge(config, judge_model=JUDGE_MODEL)

print("\n[Step 2] Initializing Agent and Judge")
print(f"  └─ Provider: {config.provider}")
print(f"  └─ Agent Model: {config.model}")
print(f"  └─ Judge Model: {JUDGE_MODEL}")
print(f"  └─ Temperature: {config.temperature}")

## Step 3: Define Test Scenario

Set up a user query and available tools for the agent to choose from.

In [None]:
# Define the scenario
user_request = "What's the weather like in San Francisco?"
available_tools = ["get_weather_api", "search_web", "get_calendar", "send_email"]

print("\n[Step 3] Test Scenario")
print(f"  └─ User Query: {user_request}")
print(f"  └─ Available Tools: {available_tools}")

In [None]:
# Example queries to try:
# user_request = "Send email to John about the meeting"
# user_request = "What meetings do I have today?"
# user_request = "Search for information about machine learning"
# user_request = "What's the current stock price of AAPL?"

## Step 4: Agent Selects a Tool

The agent reads the user query and picks one tool from the available list. This call is traced by MLflow.

In [None]:
print("\n[Step 4] Agent selects a tool...")
tool_selected = judge.select_tool(user_request, available_tools)
print(f"  └─ ✓ Selected: {tool_selected}")

## Step 5: Judge Evaluates the Selection

Now the judge evaluates whether the agent made the right choice.

In [None]:
print("\n[Step 5] Judge evaluates the selection...")

# Get the trace ID from the agent's action
trace_id = mlflow.get_last_active_trace_id()

# Evaluate with the judge
result = judge.evaluate(trace_id)

# Display results
print("\n[Step 6] Evaluation Results")
print("=" * 70)
print(f"Decision: {'✓ CORRECT' if result['is_correct'] else '✗ INCORRECT'}")
print("\nReasoning:")
print(f"{result['reasoning']}")
print("=" * 70)
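The `is_correct` and `reasoning` keys read above suggest that `evaluate()` converts the judge's feedback (a value plus a rationale) into a plain dictionary. How the project does this is not shown here; the sketch below is one plausible mapping. `FeedbackSketch` and `to_result` are hypothetical names, and the accepted string values are an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FeedbackSketch:
    """Hypothetical stand-in for a judge's feedback: a value plus a rationale."""
    value: Any
    rationale: str


def to_result(feedback: FeedbackSketch) -> Dict[str, Any]:
    """Normalize a judge value such as True, "yes", or "correct" into a boolean."""
    value = feedback.value
    if isinstance(value, str):
        # Assumed vocabulary; any other string counts as incorrect
        is_correct = value.strip().lower() in {"yes", "true", "correct"}
    else:
        is_correct = bool(value)
    return {"is_correct": is_correct, "reasoning": feedback.rationale}
```

Normalizing the value in one place keeps the display code independent of whether the judge answers with booleans or with words.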

## View Detailed Traces in MLflow UI

You can view detailed traces in the MLflow UI to see:
- The full conversation flow
- LLM inputs and outputs
- Execution times
- All logged parameters

Run this command in your terminal:
```bash
mlflow ui
```

Then navigate to: http://localhost:5000

## 🎯 Try It Yourself!

Run the complete workflow with a custom query:

In [11]:
def run_evaluation(query: str, tools: List[str]):
    """
    Complete workflow: Agent selects tool → Judge evaluates
    """
    print(f"\n{'='*70}")
    print(f"Query: {query}")
    print(f"Available Tools: {tools}")
    print(f"{'='*70}\n")
    
    # Agent selects tool
    selected = judge.select_tool(query, tools)
    print(f"✓ Agent selected: {selected}\n")
    
    # Judge evaluates
    trace_id = mlflow.get_last_active_trace_id()
    result = judge.evaluate(trace_id)
    
    # Show results
    print(f"Decision: {'✓ CORRECT' if result['is_correct'] else '✗ INCORRECT'}")
    print(f"\nReasoning:\n{result['reasoning']}")
    print(f"\n{'='*70}\n")
    
    return result

# Try it!
result = run_evaluation(
    query="Send email to John about the meeting",
    tools=["get_weather_api", "search_web", "get_calendar", "send_email"]
)


Query: Send email to John about the meeting
Available Tools: ['get_weather_api', 'search_web', 'get_calendar', 'send_email']

✓ Agent selected: send_email

Decision: ✓ CORRECT

Reasoning:
The user's request was to "Send email to John about the meeting". The agent correctly identified and selected the `send_email` tool from the available options. This tool directly aligns with the user's intent and is the most appropriate choice among the given tools (`get_weather_api`, `search_web`, `get_calendar`).




## 🧪 Experiment: Test Multiple Scenarios

Let's evaluate multiple queries and see how often the judge rates the agent's selection as correct:

In [None]:
# Define test scenarios
test_scenarios = [
    ("What's the weather in Boston?", ["get_weather_api", "search_web", "get_calendar", "send_email"]),
    ("Schedule a meeting for tomorrow", ["get_weather_api", "search_web", "get_calendar", "send_email"]),
    ("Find information about Python", ["get_weather_api", "search_web", "get_calendar", "send_email"]),
    ("Send a message to Sarah", ["get_weather_api", "search_web", "get_calendar", "send_email"]),
]

# Run all scenarios
results = []
for query, tools in test_scenarios:
    result = run_evaluation(query, tools)
    results.append({
        "query": query,
        "correct": result["is_correct"]
    })

# Summary
correct_count = sum(1 for r in results if r["correct"])
total = len(results)
print(f"\n📊 Summary: {correct_count}/{total} selections were correct ({correct_count/total*100:.0f}%)")
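The summary line above divides by `total`, so it would fail on an empty scenario list, and it does not say which queries were judged incorrect. A small helper along the following lines could cover both cases. `summarize` is a hypothetical function that assumes the same `query` and `correct` keys used in `results`.

```python
from typing import Any, Dict, List


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-query verdicts into counts, accuracy, and failed queries."""
    total = len(results)
    failed = [r["query"] for r in results if not r["correct"]]
    correct = total - len(failed)
    # Avoid ZeroDivisionError when no scenarios were run
    accuracy = correct / total if total else 0.0
    return {
        "correct": correct,
        "total": total,
        "accuracy": accuracy,
        "failed_queries": failed,
    }
```

The `failed_queries` list is a natural starting point for inspecting the corresponding traces in the MLflow UI.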

## 🎨 Customization

### Modify Evaluation Criteria

The judge's evaluation criteria are defined in `prompts.py`. You can view them:

In [None]:
print("Current Judge Instructions:")
print("=" * 70)
print(get_judge_instructions())
print("=" * 70)

### View Tool Selection Prompt

See how the agent is instructed to select tools:

In [None]:
example_prompt = get_tool_selection_prompt(
    "What's the weather?",
    ["get_weather_api", "search_web"]
)

print("Example Tool Selection Prompt:")
print("=" * 70)
print(example_prompt)
print("=" * 70)

## 📚 Key Concepts Summary

### MLflow Tracing
- Automatically captures function inputs/outputs
- Tracks execution time and metadata
- Creates parent-child relationships for nested calls
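To make these three points concrete, here is a conceptual stand-in written with the standard library only. It is not how MLflow implements tracing; `traced`, `SPANS`, and the two example functions are hypothetical and exist only to show what "inputs/outputs, timing, and parent-child relationships" means for nested calls.

```python
import functools
import time
from typing import Any, Callable, Dict, List

SPANS: List[Dict[str, Any]] = []  # finished spans, in completion order
_stack: List[str] = []            # names of the spans currently running


def traced(func: Callable) -> Callable:
    """Record inputs, output, duration, and the calling span for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parent = _stack[-1] if _stack else None
        _stack.append(func.__name__)
        start = time.perf_counter()
        try:
            output = func(*args, **kwargs)
        finally:
            _stack.pop()
        SPANS.append({
            "name": func.__name__,
            "parent": parent,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
            "seconds": time.perf_counter() - start,
        })
        return output
    return wrapper


@traced
def build_prompt(query: str, tools: List[str]) -> str:
    return f"{query} | {', '.join(tools)}"


@traced
def select_tool(query: str, tools: List[str]) -> str:
    build_prompt(query, tools)  # nested call becomes a child span
    return tools[0]             # placeholder choice, no model involved
```

After one call to `select_tool`, `SPANS` holds a `build_prompt` entry whose parent is `select_tool`, followed by the `select_tool` entry itself with no parent.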

### MLflow Judge
- Created with `make_judge()`
- Takes predefined evaluation criteria
- Returns structured feedback (value + rationale)
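As a rough illustration, the evaluation criteria from the Scenario section could be turned into judge instructions and passed to `make_judge()`. The instruction wording, the judge name, and the helper names below are assumptions; the project's real instructions come from `get_judge_instructions()` in `prompts.py`. The sketch also assumes an MLflow version that provides `mlflow.genai.judges.make_judge` with a `{{ trace }}` template variable and a `provider:/model` URI format.

```python
from typing import List

CRITERIA = [
    "Does the selected tool match the user's intent?",
    "Can this tool address the task requirements?",
    "Are there more suitable tools available?",
]


def build_instructions(criteria: List[str]) -> str:
    """Assemble hypothetical judge instructions around the {{ trace }} variable."""
    bullets = "\n".join(f"- {c}" for c in criteria)
    return (
        "Review the agent's tool selection recorded in {{ trace }}.\n"
        "Judge it against these criteria:\n"
        f"{bullets}\n"
        "Answer 'yes' if the selection is appropriate, otherwise 'no'."
    )


def build_judge(model_uri: str):
    """Create the judge; requires MLflow and provider credentials at call time."""
    # Imported lazily so the prompt builder works without MLflow installed
    from mlflow.genai.judges import make_judge

    return make_judge(
        name="tool_selection_quality",  # hypothetical judge name
        instructions=build_instructions(CRITERIA),
        model=model_uri,  # e.g. "openai:/gpt-4o"; URI format is an assumption
    )
```

Keeping the prompt assembly separate from the `make_judge()` call means the instructions can be printed and reviewed before any model is invoked.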

### Separation of Concerns
- **Agent**: Performs the task (tool selection)
- **Judge**: Evaluates the agent's performance
- **Prompts**: Define behavior (easy to modify in `prompts.py`)

## 🚀 Next Steps

1. **Explore MLflow UI**: Run `mlflow ui` to see detailed traces
2. **Modify Prompts**: Edit `prompts.py` to change evaluation criteria
3. **Try Different Models**: Experiment with different agent and judge models
4. **Add More Tools**: Expand the `available_tools` list
5. **Apply to Your Use Case**: Adapt this pattern for your own agent evaluations

## 📖 Resources

- [MLflow GenAI Judges Documentation](https://mlflow.org/docs/latest/llms/llm-evaluate/index.html)
- [Original Tutorial](https://medium.com/@juanc.olamendy/using-llm-as-a-judge-to-evaluate-agent-outputs-a-comprehensive-tutorial-00b6f1f356cc)
- [Tool Selection Judge README](README.md)