# MLflow Evaluation Demo: Multi-Turn Conversation with Session-Level Judges

![Multi-turn evaluation judges](images/llm_as_judge.png)

This notebook demonstrates the evaluation capabilities of MLflow 3.9 by showing how to:
1. Set up session-level judges directly in the notebook for a multi-turn conversation
2. Use the `mlflow.genai.evaluate()` API to evaluate conversations
3. Extract and interpret evaluation results
4. Use the Evaluation UI and Metrics Dashboard

## What You'll Learn

- How to create session-level judges with the `{{ conversation }}` template variable
- How to call `mlflow.genai.evaluate()` directly
- How to extract results from evaluation DataFrames
- Best practices for multi-turn conversation evaluation

---

In [1]:
import mlflow
print(mlflow.__version__)

3.9.0


## Setup: Import Dependencies

In [2]:
from genai.common.config import AgentConfig
from genai.agents.multi_turn.customer_support_agent_simple import CustomerSupportAgentSimple
from genai.agents.multi_turn.scenarios import get_scenario_printer_troubleshooting
from genai.agents.multi_turn.prompts import (
    get_coherence_judge_instructions,
    get_context_retention_judge_instructions,
    get_support_agent_system_prompt,
)
import mlflow
from mlflow.genai.judges import make_judge
from typing_extensions import Literal
import os
from pathlib import Path

# Load environment variables
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print("ℹ️  No .env file found. Set environment variables manually.")
except ImportError:
    print("ℹ️  python-dotenv not installed. Set environment variables manually.")

# Convenience settings for DataFrame display
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

✓ Loaded environment variables from {env_file.absolute()}


## Configuration

In [3]:
# Provider and model configuration
PROVIDER = "databricks"  # or "openai"
AGENT_MODEL = "databricks-gpt-5"
JUDGE_MODEL = "databricks-gemini-2-5-flash"
TEMPERATURE = 1.0
EXPERIMENT_NAME = "customer-support-dev-connect-demo"

# Change to your workspace path here 
USER_WORKSPACE_PATH = "/Users/jules@databricks.com"

print("✓ Configuration:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model: {JUDGE_MODEL}")

✓ Configuration:
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash


In [4]:
# import os

# # On Databricks, the token and host are available from the notebook context
# os.environ["DATABRICKS_TOKEN"] = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
# os.environ["DATABRICKS_HOST"] = spark.conf.get("spark.databricks.workspaceUrl")

print(f"✓ DATABRICKS_HOST set to: {os.environ['DATABRICKS_HOST']}")
print(f"✓ DATABRICKS_TOKEN set (length: {len(os.environ['DATABRICKS_TOKEN'])})")


✓ DATABRICKS_HOST set to: https://e2-dogfood.staging.cloud.databricks.com
✓ DATABRICKS_TOKEN set (length: 36)


## Step 1: Setup MLflow Tracking

In [5]:
# Enable autologging for OpenAI models
mlflow.openai.autolog()

using_databricks_mlflow = False
if using_databricks_mlflow:
    mlflow.set_tracking_uri("databricks")
    EXPERIMENT_NAME = f"{USER_WORKSPACE_PATH}/{EXPERIMENT_NAME}"
    mlflow.set_experiment(EXPERIMENT_NAME)
else:
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment(EXPERIMENT_NAME)

2026/02/13 06:46:12 INFO mlflow.tracking.fluent: Experiment with name 'customer-support-dev-connect-demo' does not exist. Creating a new experiment.


## Step 2: Initialize Agent (Conversation-Only)

Note: This agent handles conversations only. We'll set up evaluation separately.
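
The agent class itself is not shown in this notebook. As a rough sketch of the session tagging it performs for each turn, the helper below builds the metadata that links a trace to its session and attaches it to the active trace. The helper names (`session_metadata`, `tag_current_trace`) are hypothetical. Only the `mlflow.trace.session` key and the `mlflow.update_current_trace(metadata=...)` call come from the notebook.

```python
from typing import Dict

# Metadata key used in this notebook to link a trace to its session.
SESSION_KEY = "mlflow.trace.session"


def session_metadata(session_id: str) -> Dict[str, str]:
    """Build the trace metadata that ties one turn to its session."""
    if not session_id:
        raise ValueError("session_id must be a non-empty string")
    return {SESSION_KEY: session_id}


def tag_current_trace(session_id: str) -> None:
    """Attach the session ID to the currently active trace.

    Hypothetical helper: call it from inside a traced function.
    mlflow is imported lazily so session_metadata() works without it.
    """
    import mlflow

    mlflow.update_current_trace(metadata=session_metadata(session_id))
```

Every turn of a conversation would call `tag_current_trace()` with the same session ID, which is what lets the traces be collected per session later on.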

In [6]:
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE,
    mlflow_experiment=EXPERIMENT_NAME
)

agent = CustomerSupportAgentSimple(config)

print("✓ Agent initialized (conversation-only)")
print(f"  Provider: {config.provider}")
print(f"  Model: {config.model}")

✓ Agent initialized (conversation-only)
  Provider: databricks
  Model: databricks-gpt-5


## Step 3: Run Conversation Scenario

In [7]:
# Load printer troubleshooting scenario
scenario = get_scenario_printer_troubleshooting()

print(f"Scenario: {scenario['name']}")
print(f"Turns: {len(scenario['messages'])}")
print(f"Session ID: {scenario['session_id']}")

Scenario: Printer Troubleshooting
Turns: 4
Session ID: session-printer-001


In [8]:
# Run the conversation
# Under the hood, it tags messages with a session id (mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id}))
# Then it calls into OpenAI for responses
conv_result = agent.run_conversation(
    messages=scenario['messages'],
    session_id=scenario['session_id']
)

print("\n✓ Conversation completed")
print(f"  Turns: {conv_result['turns']}")


Running 4-turn conversation (Session: session-printer-001)

Turn 1/4
  User: My HP LaserJet 3000 won't turn on at all. The power light doesn't come on.
  Agent: Sorry you’re having trouble. Let’s narrow it down:

1) Power source
- Confirm the wall outlet works (test with another device).
- If on a power strip/UPS, try a wall outlet directly.

2) Cables and switches
- Reseat the power cord at the printer and wall.
- Inspect the cord for damage; try another IEC C13 cord if available.
- Ensure the rear power switch (if present) is set to On (|).

3) Power module
- Unplug for 60 seconds, then hold the printer’s power button (if present) 10 seconds, reconnect, try again.

4) Indicators
- Any brief flicker, beeps, or fan spin?

If still dead, likely a failed power supply/low-voltage PSU. Do you hear/see anything at all, and can you try a different cord/outlet?

Turn 2/4
  User: Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet.
  Agent:

---

# 🎯 Evaluation Showcase: MLflow Methods

Now we'll demonstrate MLflow's evaluation capabilities step-by-step.

## Step 4: Create Session-Level Judges with `make_judge()`

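`make_judge()` builds an LLM judge from a name, a judge model URI, natural-language instructions, and a feedback value type. Instructions that include the `{{ conversation }}` template variable make the judge session-level.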
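
To make the role of the `{{ conversation }}` placeholder concrete, here is a minimal sketch of how a transcript could be substituted into judge instructions. It only illustrates the idea and is not MLflow's internal rendering logic. `format_conversation` and `render_instructions` are hypothetical helpers.

```python
import re
from typing import Dict, List

# Matches "{{ conversation }}" with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*conversation\s*\}\}")


def format_conversation(turns: List[Dict[str, str]]) -> str:
    """Flatten role/content turns into a plain-text transcript."""
    return "\n".join(f"{turn['role']}: {turn['content']}" for turn in turns)


def render_instructions(template: str, conversation: str) -> str:
    """Substitute the transcript for the conversation placeholder."""
    if not _PLACEHOLDER.search(template):
        raise ValueError("template has no {{ conversation }} placeholder")
    # A function replacement keeps backslashes in the transcript literal.
    return _PLACEHOLDER.sub(lambda _match: conversation, template)
```

A judge whose instructions lack the placeholder has nothing to tie it to the whole session, which is why the sketch raises an error in that case.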

### Retrieve the prompts for the judges

In [9]:
# First, let's see what the judges will evaluate
print("="*70)
print("AGENT PROMPT INSTRUCTIONS")
print("="*70)
print(get_support_agent_system_prompt())
print("\n" + "="*70)
print("COHERENCE JUDGE INSTRUCTIONS")
print("="*70)
print(get_coherence_judge_instructions())
print("\n" + "="*70)
print("CONTEXT RETENTION JUDGE INSTRUCTIONS")
print("="*70)
print(get_context_retention_judge_instructions())
print("\n" + "="*70)
print("\n💡 Both instructions contain {{ conversation }} - this makes them session-level!")

AGENT PROMPT INSTRUCTIONS
You are a helpful customer support agent for TechCorp.

Your responsibilities:
- Assist customers with technical issues
- Ask clarifying questions when needed
- Provide step-by-step troubleshooting
- Remember context from earlier in the conversation
- Be polite, professional, and concise

Guidelines:
- Keep responses under 100 words
- Reference previous messages when relevant
- Guide users through solutions systematically


COHERENCE JUDGE INSTRUCTIONS
You are evaluating the coherence of a multi-turn customer support conversation.

Conversation to evaluate:
{{ conversation }}

Evaluate whether the conversation flows logically:
- Does the agent maintain context across turns?
- Are responses relevant to previous messages?
- Does the conversation follow a logical progression?
- Are there any contradictions or confusing jumps?

Provide your evaluation as:

- Value: True if the conversation is coherent and flows naturally, False if there are significant coherence i

#### Configure the judge model URI

In [10]:
# Configure judge model URI
if PROVIDER == "databricks":
    os.environ["OPENAI_API_KEY"] = os.environ.get("DATABRICKS_TOKEN", "")
    os.environ["OPENAI_API_BASE"] = f"{config.databricks_host}/serving-endpoints"
    # judge_model_uri = f"openai:/{JUDGE_MODEL}"
    judge_model_uri = f"databricks:/{JUDGE_MODEL}"
else:
    judge_model_uri = JUDGE_MODEL

print(f"Judge model URI: {judge_model_uri}")

Judge model URI: databricks:/databricks-gemini-2-5-flash


#### Create our judges with the `make_judge` MLflow API

In [11]:
# Create judges using mlflow.genai.judges.make_judge()
print("Creating judges...\n")

coherence_judge = make_judge(
    name="conversation_coherence",
    model=judge_model_uri,
    instructions=get_coherence_judge_instructions(),
    feedback_value_type=bool
)

context_judge = make_judge(
    name="context_retention",
    model=judge_model_uri,
    instructions=get_context_retention_judge_instructions(),
    feedback_value_type=Literal["excellent", "good", "fair", "poor"]
)

print("✓ Judges created")
print(f"  Coherence judge is session-level: {coherence_judge.is_session_level_scorer}")
print(f"  Context judge is session-level: {context_judge.is_session_level_scorer}")

Creating judges...

✓ Judges created
  Coherence judge is session-level: True
  Context judge is session-level: True
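
The two judges return different kinds of feedback: a boolean for coherence and one of four labels for context retention. The sketch below shows one way downstream code might validate such a label and map it to an ordinal score for aggregation. The helper names and the score mapping are assumptions made for illustration and are not part of MLflow. It imports `Literal` from `typing` (Python 3.8+) rather than `typing_extensions`.

```python
from typing import Literal, get_args

# Same label set as the context retention judge's feedback_value_type.
ContextRating = Literal["excellent", "good", "fair", "poor"]


def is_valid_rating(value: object) -> bool:
    """Return True if value is one of the allowed rating labels."""
    return value in get_args(ContextRating)


def rating_to_score(value: str) -> int:
    """Map a label to an ordinal score (hypothetical: 3 = excellent, 0 = poor)."""
    labels = get_args(ContextRating)
    if value not in labels:
        raise ValueError(f"unexpected rating: {value!r}")
    return len(labels) - 1 - labels.index(value)
```

Constraining the judge to a closed label set is what makes a simple mapping like this possible. Free-text feedback would need parsing first.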


## 🔍 Step 5: Search for Traces with `mlflow.search_traces()`
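
The search in this step filters on the session metadata key. The sketch below wraps that filter string in a small helper. The helper name `session_filter` and its quote check are assumptions added for illustration. The filter syntax itself is the one this notebook uses.

```python
def session_filter(session_id: str) -> str:
    """Build the search_traces filter string for one session.

    Hypothetical helper: rejects IDs containing quote characters,
    since they would break the quoted filter expression.
    """
    if not session_id or "'" in session_id or "`" in session_id:
        raise ValueError(f"unsupported session_id: {session_id!r}")
    # The key contains dots, so it is wrapped in backticks.
    return f"metadata.`mlflow.trace.session` = '{session_id}'"
```

The returned string can be passed as `filter_string` to `mlflow.search_traces()`.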

In [12]:
# Get experiment and search for traces
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

session_traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{scenario['session_id']}'"
)

print(f"✓ Found {len(session_traces)} traces")
print(f"  Session ID: {scenario['session_id']}")
print("  Each trace = 1 conversation turn")
display(session_traces[["request_time", "request", "response"]].sort_values(by="request_time", ascending=True))

✓ Found 4 traces
  Session ID: session-printer-001
  Each trace = 1 conversation turn



Unnamed: 0,request_time,request,response
3,1770993972949,"{'message': 'My HP LaserJet 3000 won't turn on at all. The power light doesn't come on.', 'session_id': 'session-printer-001'}","Sorry you’re having trouble. Let’s narrow it down:\n\n1) Power source\n- Confirm the wall outlet works (test with another device).\n- If on a power strip/UPS, try a wall outlet directly.\n\n2) Cables and switches\n- Reseat the power cord at the printer and wall.\n- Inspect the cord for damage; try another IEC C13 cord if available.\n- Ensure the rear power switch (if present) is set to On (|).\n\n3) Power module\n- Unplug for 60 seconds, then hold the printer’s power button (if present) 10 seconds, reconnect, try again.\n\n4) Indicators\n- Any brief flicker, beeps, or fan spin?\n\nIf still dead, likely a failed power supply/low-voltage PSU. Do you hear/see anything at all, and can you try a different cord/outlet?"
2,1770993978312,"{'message': 'Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet.', 'session_id': 'session-printer-001'}","Thanks for checking that. Let’s continue systematically:\n\n1) Outlet test\n- Plug a lamp/phone charger into the same outlet to confirm it’s live.\n- If using a surge protector/UPS, bypass it and use the wall directly.\n\n2) Power cord\n- Try a different IEC C13 power cord if you have one.\n\n3) Rear power switch\n- Ensure the rear rocker switch is set to On (|).\n\n4) Power reset\n- Unplug the printer for 2 minutes. While unplugged, press/hold the power button (if present) for 10 seconds. Reconnect and try.\n\n5) Signs of life\n- On power-up, do you see any flicker, hear fans/relay clicks?\n\nLet me know results—next step is checking the internal low‑voltage power supply."
1,1770993988416,"{'message': 'I tried a different outlet in another room and still nothing. What should I try next?', 'session_id': 'session-printer-001'}","Thanks for testing another outlet. Next steps:\n\n1) Try a different IEC C13 power cord (from a PC/monitor) if you haven’t yet.\n2) Check the rear rocker switch is On (|). Toggle it Off/On once.\n3) Unplug power, remove any connected USB/Ethernet, wait 2 minutes, hold power button 10 seconds (if present), reconnect and try.\n4) Inspect the rear power inlet for looseness or burn marks.\n\nIf still completely dead (no flicker/click/fan), the low‑voltage power supply or DC controller likely failed. We can guide board replacement, or arrange service. Would you prefer DIY steps or a repair quote/pickup?"
0,1770993993108,"{'message': 'Okay, I'll contact HP support for a warranty replacement. Thanks for your help!', 'session_id': 'session-printer-001'}","You’re welcome! Before you call, have this ready:\n- Printer model/serial number (label on back/inside door)\n- Proof of purchase and your contact/shipping info\n- Description of steps you’ve tried (outlets, different cord, power reset)\n\nIf HP needs a case summary, I can draft it. If anything changes or you want DIY repair guidance, just let me know."
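
Each row above is one turn, and the index column suggests the search returned the newest turn first, so the turns need sorting by `request_time` before they read as a conversation. The sketch below rebuilds a chronological transcript from rows shaped like the ones in this table. `build_transcript` is a hypothetical helper, and the transcript format is only one possible choice.

```python
from typing import Any, Dict, List


def build_transcript(rows: List[Dict[str, Any]]) -> str:
    """Order turns by request_time and render them as a transcript.

    Assumes each row has 'request_time', a 'request' dict with a
    'message' key, and a 'response' string, as in the table above.
    """
    ordered = sorted(rows, key=lambda row: row["request_time"])
    lines = []
    for turn, row in enumerate(ordered, start=1):
        lines.append(f"[Turn {turn}] user: {row['request']['message']}")
        lines.append(f"[Turn {turn}] assistant: {row['response']}")
    return "\n".join(lines)
```

With a pandas DataFrame such as `session_traces`, the rows could be obtained with `session_traces.to_dict("records")`.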


## ⚖️ Step 6: Evaluate with `mlflow.genai.evaluate()`

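**Key MLflow API**: `mlflow.genai.evaluate()` runs every scorer in `scorers` over the traces passed as `data`. Because both judges here are session-level, each one scores the conversation as a whole rather than turn by turn.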

In [13]:
print("Evaluating conversation...\n")

with mlflow.start_run(run_name=scenario['name']) as run:
    eval_results = mlflow.genai.evaluate(
        data=session_traces,
        scorers=[coherence_judge, context_judge]
    )

print("✓ Evaluation complete")
result_df = eval_results.result_df
print(f"  Result DataFrame shape: {result_df.shape}")
if not using_databricks_mlflow:
    # Build the UI links from the actual experiment, session, and run IDs
    base_url = f"http://localhost:5000/#/experiments/{experiment.experiment_id}"
    print(f"See results at {base_url}/chat-sessions/{scenario['session_id']}")
    print(f"See results at {base_url}/evaluation-runs?selectedRunUuid={eval_results.run_id}")

2026/02/13 06:46:38 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.


Evaluating conversation...



✓ Evaluation complete
  Result DataFrame shape: (4, 14)
See results at http://localhost:5000/#/experiments/1/chat-sessions/session-printer-001
See results at http://localhost:5000/#/experiments/1/evaluation-runs?selectedRunUuid={eval_results.run_id}
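
The result DataFrame has one row per trace, so for session-level judges most rows may carry no feedback value. The sketch below collects the non-empty values for one judge. The column naming (`<judge name>/value`) is an assumption and has not been verified against `result_df`. Inspect `result_df.columns` before relying on it. `collect_feedback` is a hypothetical helper.

```python
import math
from typing import Any, Dict, List


def collect_feedback(
    rows: List[Dict[str, Any]], judge_name: str, suffix: str = "/value"
) -> List[Any]:
    """Return the non-missing feedback values for one judge.

    Assumes feedback lives in a column named '<judge_name><suffix>'.
    Missing values (None or NaN) are skipped.
    """
    column = f"{judge_name}{suffix}"
    values = []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        values.append(value)
    return values
```

A possible call, under the same column-name assumption, is `collect_feedback(result_df.to_dict("records"), "conversation_coherence")`.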


---

## 📊 What Just Happened?

We just demonstrated the complete MLflow evaluation workflow:

1. **`make_judge()`**: Created session-level judges with the `{{ conversation }}` template variable
2. **`mlflow.search_traces()`**: Found all traces for the conversation
3. **`mlflow.genai.evaluate()`**: Evaluated the full conversation (not individual turns)

**Key Insight**: Session-level evaluation assesses the entire conversation holistically, enabling evaluation of:
- Conversation coherence and flow
- Context retention across turns
- Logical progression of the discussion
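
The difference between turn-level and session-level evaluation comes down to the unit being scored. The sketch below groups per-turn trace records by session, so that a judge would be invoked once per conversation instead of once per turn. The record layout and `group_by_session` are hypothetical and only illustrate the grouping idea.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_session(traces: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group trace records by session ID, each group in chronological order."""
    sessions: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    for turns in sessions.values():
        turns.sort(key=lambda trace: trace["request_time"])
    return dict(sessions)
```

For the printer scenario in this notebook, four traces would collapse into a single group, so each session-level judge would produce one verdict for the whole conversation.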