# MLflow Evaluation Demo: Multi-Turn Conversation with Session-Level Judges

![Multi-turn evaluation judges](images/llm_as_judge.png)

This notebook demonstrates MLflow 3.9 evaluation capabilities by showing how to:
1. Set up session-level judges directly in the notebook for a multi-turn conversation
2. Use the `mlflow.genai.evaluate()` API to evaluate conversations
3. Extract and interpret evaluation results
4. Use the Evaluation UI and Metrics Dashboard

## What You'll Learn

- How to create session-level judges with the `{{ conversation }}` template
- How to call `mlflow.genai.evaluate()` directly
- How to extract results from evaluation DataFrames
- Best practices for multi-turn conversation evaluation

---

In [2]:
import mlflow
print(mlflow.__version__)

3.9.0


## Setup: Import Dependencies

In [4]:
from genai.common.config import AgentConfig
from genai.agents.multi_turn.customer_support_agent_simple import CustomerSupportAgentSimple
from genai.agents.multi_turn.scenarios import get_scenario_printer_troubleshooting
from genai.agents.multi_turn.prompts import (
    get_coherence_judge_instructions,
    get_context_retention_judge_instructions,
    get_support_agent_system_prompt,
)
import mlflow
from mlflow.genai.judges import make_judge
from typing_extensions import Literal
import os
from pathlib import Path

# Load environment variables
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        # f-string so the resolved path is interpolated into the message
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print("ℹ️  No .env file found. Set environment variables manually.")
except ImportError:
    print("ℹ️  python-dotenv not installed. Set environment variables manually.")

# Display conveniences: show full cell contents and every column
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

✓ Loaded environment variables from {env_file.absolute()}


## Configuration

In [4]:
# Provider and model configuration
PROVIDER = "databricks"  # or "openai"
AGENT_MODEL = "databricks-gpt-5"
JUDGE_MODEL = "databricks-gemini-2-5-flash"
TEMPERATURE = 1.0
EXPERIMENT_NAME = "customer-support-dev-connect-demo"

# Change this to your own workspace path
USER_WORKSPACE_PATH = "/Users/jules@databricks.com"

print("✓ Configuration:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model: {JUDGE_MODEL}")

✓ Configuration:
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash


In [5]:
# import os

# # On Databricks, the token and host are available from the notebook context
# os.environ["DATABRICKS_TOKEN"] = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
# os.environ["DATABRICKS_HOST"] = spark.conf.get("spark.databricks.workspaceUrl")

print(f"✓ DATABRICKS_HOST set to: {os.environ['DATABRICKS_HOST']}")
print(f"✓ DATABRICKS_TOKEN set (length: {len(os.environ['DATABRICKS_TOKEN'])})")


✓ DATABRICKS_HOST set to: https://e2-dogfood.staging.cloud.databricks.com
✓ DATABRICKS_TOKEN set (length: 36)
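
The cell above indexes `os.environ` directly, so an unset variable surfaces as a bare `KeyError`. A minimal sketch of a gentler pre-flight check follows. The helper name `missing_env_vars` is hypothetical (it is not part of the notebook or of MLflow), and it reports only variable names, never their values.

```python
import os
from typing import Iterable, List, Mapping, Optional

# The two variables the notebook reads; extend as needed.
REQUIRED_VARS = ("DATABRICKS_HOST", "DATABRICKS_TOKEN")


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty, without exposing any values."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Passing an explicit mapping as `env` makes the check easy to exercise without touching the real environment.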


## Step 1: Setup MLflow Tracking

In [6]:
# Autologging for OpenAI models
mlflow.openai.autolog()

using_databricks_mlflow = True
if using_databricks_mlflow:
    mlflow.set_tracking_uri("databricks")
    EXPERIMENT_NAME = f"{USER_WORKSPACE_PATH}/{EXPERIMENT_NAME}"
    mlflow.set_experiment(EXPERIMENT_NAME)
else:
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment(EXPERIMENT_NAME)

2026/02/12 13:43:30 INFO mlflow.tracking.fluent: Experiment with name '/Users/jules@databricks.com/customer-support-dev-connect-demo' does not exist. Creating a new experiment.
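
One caveat with the cell above: it rebinds `EXPERIMENT_NAME` to the prefixed path, so re-running the cell would prefix the name a second time. A small sketch of a side-effect-free alternative is given here. The function `resolve_tracking` is a hypothetical helper, not an MLflow API; it only computes the two values that would be handed to `mlflow.set_tracking_uri()` and `mlflow.set_experiment()`.

```python
from typing import Tuple


def resolve_tracking(
    use_databricks: bool, workspace_path: str, experiment_name: str
) -> Tuple[str, str]:
    """Return (tracking_uri, experiment_path) without mutating any globals."""
    if use_databricks:
        # Databricks experiments live under a workspace path.
        return "databricks", f"{workspace_path.rstrip('/')}/{experiment_name}"
    # Local tracking server, as in the notebook's else-branch.
    return "http://localhost:5000", experiment_name


uri, experiment_path = resolve_tracking(
    True, "/Users/jules@databricks.com", "customer-support-dev-connect-demo"
)
print(uri)
print(experiment_path)
```

Because the inputs are never overwritten, the cell can be re-executed any number of times and still yield the same experiment path.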


## Step 2: Initialize Agent (Conversation-Only)

Note: This agent handles conversations only. We'll set up evaluation separately.

In [7]:
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE,
    mlflow_experiment=EXPERIMENT_NAME
)

agent = CustomerSupportAgentSimple(config)

print("✓ Agent initialized (conversation-only)")
print(f"  Provider: {config.provider}")
print(f"  Model: {config.model}")

✓ Agent initialized (conversation-only)
  Provider: databricks
  Model: databricks-gpt-5


## Step 3: Run Conversation Scenario

In [8]:
# Load printer troubleshooting scenario
scenario = get_scenario_printer_troubleshooting()

print(f"Scenario: {scenario['name']}")
print(f"Turns: {len(scenario['messages'])}")
print(f"Session ID: {scenario['session_id']}")

Scenario: Printer Troubleshooting
Turns: 4
Session ID: session-printer-001


In [9]:
# Run the conversation
# Under the hood, it tags messages with a session id (mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id}))
# Then it calls into OpenAI for responses
conv_result = agent.run_conversation(
    messages=scenario['messages'],
    session_id=scenario['session_id']
)

print("\n✓ Conversation completed")
# f-string so the turn count is interpolated rather than printed literally
print(f"  Turns: {conv_result['turns']}")


Running 4-turn conversation (Session: session-printer-001)

Turn 1/4
  User: My HP LaserJet 3000 won't turn on at all. The power light doesn't come on.
  Agent: I’m sorry—that’s frustrating. Let’s narrow it down:

1) Power source
- Is the outlet working? Test with another device.
- Try a different outlet (avoid power strips/surge protectors).

2) Power cable and switch
- Ensure the power cord is firmly seated in the printer and wall.
- Inspect the cord for damage; try another compatible cable if possible.
- Make sure the rear power switch (if present) is set to On.

3) Power/reset
- Unplug printer for 60 seconds, hold power button (if any) for 10 seconds, reconnect and try again.

If still dead: do you hear any clicks/fans? Any LEDs flash briefly? Have there been recent power surges? This may indicate a failed power supply.

Turn 2/4
  User: Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet.
  Agent: Thanks for confirming. Let’s [output truncated]
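
The comment in the cell above says each turn is tagged with the session ID through `mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})`. The agent's source is not shown in this notebook, so the following is only a sketch of that pattern. `run_turns`, `fake_respond`, and `fake_tag_trace` are hypothetical names, and the tagging function is injected so the example runs without MLflow or a model endpoint.

```python
from typing import Callable, Dict, List

# Metadata key quoted in the notebook's comment.
SESSION_METADATA_KEY = "mlflow.trace.session"


def run_turns(
    messages: List[str],
    session_id: str,
    respond: Callable[[str], str],
    tag_trace: Callable[..., None],
) -> List[str]:
    """Answer each message, tagging every turn with the same session ID."""
    replies = []
    for message in messages:
        # Same session ID on every turn is what groups the traces later.
        tag_trace(metadata={SESSION_METADATA_KEY: session_id})
        replies.append(respond(message))
    return replies


# Stand-ins so the sketch is runnable offline.
recorded: List[Dict[str, str]] = []


def fake_tag_trace(metadata: Dict[str, str]) -> None:
    recorded.append(metadata)


def fake_respond(message: str) -> str:
    return f"(reply to: {message})"


replies = run_turns(
    ["My printer won't turn on.", "I checked the cable."],
    "session-printer-001",
    fake_respond,
    fake_tag_trace,
)
print(len(replies), recorded[0])
```

In a real agent, `tag_trace` would presumably be `mlflow.update_current_trace`, called from inside a traced function, and `respond` would call the model.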

---

# 🎯 Evaluation Showcase: MLflow Methods

Now we'll demonstrate MLflow's evaluation capabilities step-by-step.

## Step 4: Create Session-Level Judges with `make_judge()`

This step uses MLflow's judge creation API, `make_judge()`, to build two judges that score the conversation as a whole.

### Retrieve the Judge Prompts

In [9]:
# First, let's see what the judges will evaluate
print("="*70)
print("AGENT PROMPT INSTRUCTIONS")
print("="*70)
print(get_support_agent_system_prompt())
print("\n" + "="*70)
print("COHERENCE JUDGE INSTRUCTIONS")
print("="*70)
print(get_coherence_judge_instructions())
print("\n" + "="*70)
print("CONTEXT RETENTION JUDGE INSTRUCTIONS")
print("="*70)
print(get_context_retention_judge_instructions())
print("\n" + "="*70)
print("\n💡 Both instructions contain {{ conversation }} - this makes them session-level!")

AGENT PROMPT INSTRUCTIONS


AttributeError: 'Prompt' object has no attribute 'template'

#### Configure the judge model

In [11]:
# Configure judge model URI
if PROVIDER == "databricks":
    os.environ["OPENAI_API_KEY"] = os.environ.get("DATABRICKS_TOKEN", "")
    os.environ["OPENAI_API_BASE"] = f"{config.databricks_host}/serving-endpoints"
    # judge_model_uri = f"openai:/{JUDGE_MODEL}"
    judge_model_uri = f"databricks:/{JUDGE_MODEL}"
else:
    judge_model_uri = JUDGE_MODEL

print(f"Judge model URI: {judge_model_uri}")

Judge model URI: databricks:/databricks-gemini-2-5-flash
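
The branching in the cell above can be captured in a small pure function, which makes the URI rule easy to check in isolation. This is a sketch only. `build_judge_model_uri` is a hypothetical helper name, and the pass-through for other providers simply mirrors the notebook's `else` branch (it assumes `JUDGE_MODEL` is already a usable model identifier in that case).

```python
def build_judge_model_uri(provider: str, model: str) -> str:
    """Mirror the notebook's rule for turning a model name into a judge URI."""
    if provider == "databricks":
        # Databricks serving endpoints use the "databricks:/<endpoint>" form.
        return f"databricks:/{model}"
    # Any other provider: pass the configured value through unchanged.
    return model


print(build_judge_model_uri("databricks", "databricks-gemini-2-5-flash"))
```

With the notebook's configuration this prints `databricks:/databricks-gemini-2-5-flash`, matching the cell output above.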


#### Create our judges with the `make_judge` MLflow API

In [12]:
# Create judges using mlflow.genai.judges.make_judge()
print("Creating judges...\n")

coherence_judge = make_judge(
    name="conversation_coherence",
    model=judge_model_uri,
    instructions=get_coherence_judge_instructions(),
    feedback_value_type=bool
)

context_judge = make_judge(
    name="context_retention",
    model=judge_model_uri,
    instructions=get_context_retention_judge_instructions(),
    feedback_value_type=Literal["excellent", "good", "fair", "poor"]
)

print("✓ Judges created")
print(f"  Coherence judge is session-level: {coherence_judge.is_session_level_scorer}")
print(f"  Context judge is session-level: {context_judge.is_session_level_scorer}")

Creating judges...

✓ Judges created
  Coherence judge is session-level: True
  Context judge is session-level: True


## 🔍 Step 5: Search for Traces with `mlflow.search_traces()`

In [13]:
# Get experiment and search for traces
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

session_traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{scenario['session_id']}'"
)

print(f"✓ Found {len(session_traces)} traces")
print(f"  Session ID: {scenario['session_id']}")
print("  Each trace = 1 conversation turn")
display(session_traces[["request_time", "request", "response"]].sort_values(by="request_time", ascending=True))



✓ Found 4 traces
  Session ID: session-printer-001
  Each trace = 1 conversation turn


| | request_time | request | response |
|---|---|---|---|
| 3 | 1770932632151 | {'message': 'My HP LaserJet 3000 won't turn on at all. The power light doesn't come on.', 'session_id': 'session-printer-001'} | I’m sorry—that’s frustrating. Let’s narrow it down:\n\n1) Power source\n- Is the outlet working? Test with another device.\n- Try a different outlet (avoid power strips/surge protectors).\n\n2) Power cable and switch\n- Ensure the power cord is firmly seated in the printer and wall.\n- Inspect the cord for damage; try another compatible cable if possible.\n- Make sure the rear power switch (if present) is set to On.\n\n3) Power/reset\n- Unplug printer for 60 seconds, hold power button (if any) for 10 seconds, reconnect and try again.\n\nIf still dead: do you hear any clicks/fans? Any LEDs flash briefly? Have there been recent power surges? This may indicate a failed power supply. |
| 2 | 1770932640807 | {'message': 'Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet.', 'session_id': 'session-printer-001'} | Thanks for confirming. Let’s try a few targeted checks:\n\n1) Outlet and cord\n- Please test that exact outlet with another device.\n- If possible, try a different power cord (standard IEC C13 “kettle” cable).\n\n2) Power switch and reset\n- Ensure the rear rocker switch is On.\n- Unplug printer, hold the power button (if present) 10 seconds, wait 60 seconds, plug directly into wall (no strip), then power on.\n\n3) Signs of life\n- When plugging in, do you hear any clicks/fans or see any brief LEDs?\n\n4) Fuser door/cover\n- Ensure all doors/covers are fully closed; some models won’t power if interlocks aren’t engaged.\n\nLet me know results. |
| 1 | 1770932645772 | {'message': 'I tried a different outlet in another room and still nothing. What should I try next?', 'session_id': 'session-printer-001'} | Thanks—since a different outlet didn’t help, let’s isolate the printer:\n\n1) Try a different IEC C13 power cable (from a PC/monitor) directly to a wall outlet.\n2) Verify the rear rocker switch is firmly ON; toggle it OFF/ON once.\n3) Remove all accessories: toner, paper tray, any USB/Ethernet cables. Close all doors, then plug in and try.\n4) Look/listen on plug-in: any faint click, fan, or brief LED?\n\nIf still completely dead, the low‑voltage power supply or DC controller may have failed. Do you want help locating a service center or part numbers for replacement? |
| 0 | 1770932650559 | {'message': 'Okay, I'll contact HP support for a warranty replacement. Thanks for your help!', 'session_id': 'session-printer-001'} | You’re welcome! Before you call, note the serial number and model (HP LaserJet 3000) and describe the “no power, no LEDs, tested other outlet/cable” steps—you’ve done solid troubleshooting. If they ask, mention you power-cycled and removed accessories. \n\nNeed help finding the serial or nearest service center? I can guide you. |
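
In the output above, the DataFrame index runs opposite to the timestamps: index 3 is the earliest turn. Sorting by `request_time` before reading the conversation restores chronological order. The sketch below shows the filter string and the ordering step on plain dictionaries, so it runs without MLflow or pandas. `session_filter` and `order_turns` are hypothetical helper names, and rejecting quotes in the session ID is a precaution added for this example (the filter grammar's escaping rules are not covered in this notebook).

```python
from typing import Any, Dict, List


def session_filter(session_id: str) -> str:
    """Build the filter string used with mlflow.search_traces() in this notebook."""
    if "'" in session_id:
        # Precaution for this sketch: avoid producing a malformed filter.
        raise ValueError("session_id must not contain single quotes")
    return f"metadata.`mlflow.trace.session` = '{session_id}'"


def order_turns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return rows sorted oldest-first by their request timestamp (ms)."""
    return sorted(rows, key=lambda row: row["request_time"])


# Timestamps and user messages copied from the table above, newest first.
rows = [
    {"request_time": 1770932650559, "message": "Okay, I'll contact HP support for a warranty replacement. Thanks for your help!"},
    {"request_time": 1770932645772, "message": "I tried a different outlet in another room and still nothing. What should I try next?"},
    {"request_time": 1770932640807, "message": "Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet."},
    {"request_time": 1770932632151, "message": "My HP LaserJet 3000 won't turn on at all. The power light doesn't come on."},
]

print(session_filter("session-printer-001"))
for turn, row in enumerate(order_turns(rows), start=1):
    print(turn, row["message"][:40])
```

With a pandas DataFrame, the equivalent ordering step is the `sort_values(by="request_time", ascending=True)` call already used in the cell.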


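## ⚖️ Step 6: Evaluate with `mlflow.genai.evaluate()`

**Key MLflow API**: `mlflow.genai.evaluate()` takes the session's traces and the two judges, and scores the conversation as a whole rather than turn by turn.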

In [14]:
print("Evaluating conversation...\n")

with mlflow.start_run(run_name=scenario['name']) as run:
    eval_results = mlflow.genai.evaluate(
        data=session_traces,
        scorers=[coherence_judge, context_judge]
    )

print("✓ Evaluation complete")
result_df = eval_results.result_df
print(f"  Result DataFrame shape: {result_df.shape}")
if not using_databricks_mlflow:
    print("See results at http://localhost:5000/#/experiments/1/chat-sessions/session-printer-001")
    # f-string so the run ID is interpolated into the URL
    print(f"See results at http://localhost:5000/#/experiments/1/evaluation-runs?selectedRunUuid={eval_results.run_id}")

2026/02/12 13:44:51 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.


Evaluating conversation...




✓ Evaluation complete
  Result DataFrame shape: (4, 14)
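
The result DataFrame has one row per trace (4 rows here), while a session-level judge produces one verdict per conversation. A sketch of pulling that single verdict out of the rows follows, under loudly stated assumptions: the column names `conversation_coherence/value` and `context_retention/value` are guesses based on the judge names, the rows are made up for illustration and are NOT results from this run, and the assumption that a verdict appears on only some rows should be checked against `result_df.columns` and the data itself.

```python
import math
from typing import Any, Dict, List, Optional


def is_missing(value: Any) -> bool:
    """Treat None and NaN as 'no verdict recorded on this row'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_verdict(rows: List[Dict[str, Any]], column: str) -> Optional[Any]:
    """Return the first non-missing value in a column, or None if there is none."""
    for row in rows:
        value = row.get(column)
        if not is_missing(value):
            return value
    return None


# Made-up rows for illustration only; column names are assumptions.
example_rows = [
    {"trace_id": "tr-a", "conversation_coherence/value": None, "context_retention/value": float("nan")},
    {"trace_id": "tr-b", "conversation_coherence/value": True, "context_retention/value": "good"},
]

for column in ("conversation_coherence/value", "context_retention/value"):
    print(column, "->", first_verdict(example_rows, column))
```

With the real DataFrame, `result_df.to_dict("records")` yields rows in the shape this helper expects. Note that `False` is kept as a valid verdict; only `None` and NaN are skipped.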


---

## 📊 What Just Happened?

We just demonstrated the complete MLflow evaluation workflow:

1. **`make_judge()`**: Created session-level judges with the `{{ conversation }}` template
2. **`mlflow.search_traces()`**: Found all traces for the conversation
3. **`mlflow.genai.evaluate()`**: Evaluated the full conversation (not individual turns)

**Key Insight**: Session-level evaluation assesses the entire conversation holistically, enabling evaluation of:
- Conversation coherence and flow
- Context retention across turns
- Logical progression of the discussion
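
To make "holistic" concrete: a session-level judge sees every turn at once, in order, in place of `{{ conversation }}`. The sketch below assembles such a transcript from ordered turns and substitutes it into an instruction string. How MLflow actually renders the conversation is not shown in this notebook, so the `User:`/`Agent:` layout, the helper names, and the instruction text are all assumptions made for illustration. The turns are abbreviated from the printer scenario above.

```python
from typing import Dict, List


def render_transcript(turns: List[Dict[str, str]]) -> str:
    """Join ordered turns into a single plain-text transcript."""
    lines = []
    for turn in turns:
        lines.append(f"User: {turn['user']}")
        lines.append(f"Agent: {turn['agent']}")
    return "\n".join(lines)


def fill_conversation(instructions: str, transcript: str) -> str:
    """Substitute the transcript for the {{ conversation }} placeholder."""
    return instructions.replace("{{ conversation }}", transcript)


# Abbreviated turns from the printer scenario; agent replies are shortened.
turns = [
    {"user": "My HP LaserJet 3000 won't turn on at all.", "agent": "Let's check the power source first."},
    {"user": "I tried a different outlet and still nothing.", "agent": "The power supply may have failed."},
]

# Hypothetical judge instructions, not the notebook's real prompt.
prompt = fill_conversation(
    "Does the agent retain context across turns?\n\n{{ conversation }}",
    render_transcript(turns),
)
print(prompt)
```

Seeing the whole transcript in one prompt is what lets a judge notice, for example, that the agent suggested an outlet test the user had already reported doing.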