# MLflow Evaluation Demo: Multi-Turn Conversation with Session-Level Judges

![Multi-turn evaluation judges](images/llm_as_judge.png)

This notebook demonstrates MLflow 3.9 evaluation capabilities by showing how to:
1. Set up session-level judges directly in the notebook for a multi-turn conversation
2. Use the `mlflow.genai.evaluate()` API to evaluate conversations
3. Extract and interpret evaluation results


**Note**: This notebook is intended to run in a Databricks environment.


## What You'll Learn

- How to create session-level judges with the `{{ conversation }}` template
- How to call `mlflow.genai.evaluate()` directly
- How to extract results from evaluation DataFrames
- Best practices for multi-turn conversation evaluation

---

In [0]:
%pip install mlflow==3.9.0


In [0]:
dbutils.library.restartPython()

In [0]:
import mlflow
print(mlflow.__version__)

3.9.0


## Setup: Import Dependencies

In [0]:
from genai.common.config import AgentConfig
from genai.agents.multi_turn.customer_support_agent_simple import CustomerSupportAgentSimple
from genai.agents.multi_turn.scenarios import get_scenario_printer_troubleshooting
from genai.agents.multi_turn.prompts import (
    get_coherence_judge_instructions,
    get_context_retention_judge_instructions,
)
import mlflow
from mlflow.genai.judges import make_judge
from typing_extensions import Literal
import os
from pathlib import Path

# Load environment variables
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print("ℹ️  No .env file found. Set environment variables manually.")
except ImportError:
    print("ℹ️  python-dotenv not installed. Set environment variables manually.")

# Convenience for display
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

ℹ️  No .env file found. Set environment variables manually.


## Configuration

In [0]:
# Provider and model configuration
PROVIDER = "databricks"  # or "openai"
AGENT_MODEL = "databricks-gpt-5"
JUDGE_MODEL = "databricks-gemini-2-5-flash"
TEMPERATURE = 1.0
EXPERIMENT_NAME = "customer-support-dev-connect-demo"

# Change to your workspace path here 
USER_WORKSPACE_PATH = "/Users/jules@databricks.com"

print("✓ Configuration:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model: {JUDGE_MODEL}")

✓ Configuration:
  Provider: databricks
  Agent Model: databricks-gpt-5
  Judge Model: databricks-gemini-2-5-flash


In [0]:
import os

# On Databricks, the token and host are available from the notebook context
os.environ["DATABRICKS_TOKEN"] = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
os.environ["DATABRICKS_HOST"] = spark.conf.get("spark.databricks.workspaceUrl")

print(f"✓ DATABRICKS_HOST set to: {os.environ['DATABRICKS_HOST']}")
print(f"✓ DATABRICKS_TOKEN set (length: {len(os.environ['DATABRICKS_TOKEN'])})")


✓ DATABRICKS_HOST set to: e2-dogfood.staging.cloud.databricks.com
✓ DATABRICKS_TOKEN set (length: 36)


## Step 1: Setup MLflow Tracking

In [0]:
# Autologging for OpenAI models
mlflow.openai.autolog()

using_databricks_mlflow = True
if using_databricks_mlflow:
    mlflow.set_tracking_uri("databricks")
    # Build the experiment path from the workspace path configured above
    EXPERIMENT_NAME = f"{USER_WORKSPACE_PATH}/{EXPERIMENT_NAME}"
    mlflow.set_experiment(EXPERIMENT_NAME)
else:
    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment(EXPERIMENT_NAME)

## Step 2: Initialize Agent (Conversation-Only)

Note: This agent handles conversations only. We'll set up evaluation separately.

In [0]:
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE,
    mlflow_experiment=EXPERIMENT_NAME
)

agent = CustomerSupportAgentSimple(config)

print("✓ Agent initialized (conversation-only)")
print(f"  Provider: {config.provider}")
print(f"  Model: {config.model}")

✓ Agent initialized (conversation-only)
  Provider: databricks
  Model: databricks-gpt-5


## Step 3: Run Conversation Scenario

In [0]:
# Load printer troubleshooting scenario
scenario = get_scenario_printer_troubleshooting()

print(f"Scenario: {scenario['name']}")
print(f"Turns: {len(scenario['messages'])}")
print(f"Session ID: {scenario['session_id']}")

Scenario: Printer Troubleshooting
Turns: 4
Session ID: session-printer-001
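
Traces are later looked up by session ID, so reusing a fixed ID such as `session-printer-001` across notebook runs means a search can return traces from earlier runs as well. One possible way around this, sketched below, is to append a short random suffix per run. The helper name `make_run_scoped_session_id` is hypothetical and is not part of the demo package or MLflow.

```python
import uuid


def make_run_scoped_session_id(base_session_id: str) -> str:
    """Return the base session ID with a random 8-character hex suffix.

    Hypothetical helper: keeps the readable scenario prefix while making
    each notebook run's session distinct.
    """
    return f"{base_session_id}-{uuid.uuid4().hex[:8]}"


session_id = make_run_scoped_session_id("session-printer-001")
print(session_id)
```

If you adopt something like this, pass the same generated ID to both `agent.run_conversation()` and the later trace search so the two stay in sync.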


In [0]:
# Run the conversation
# Under the hood, it tags messages with a session id (mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id}))
# Then it calls into OpenAI for responses
conv_result = agent.run_conversation(
    messages=scenario['messages'],
    session_id=scenario['session_id']
)

print("\n✓ Conversation completed")
print(f"  Turns: {conv_result['turns']}")


Running 4-turn conversation (Session: session-printer-001)

Turn 1/4
  User: My HP LaserJet 3000 won't turn on at all. The power light doesn't come on.
  Agent: Sorry you’re having trouble. Let’s narrow it down:

1) Power source
- Is the outlet working? Test with another device or try a different outlet.
- Is the printer’s power cable firmly seated on both ends? Inspect for damage.
- If there’s a surge protector/UPS, bypass it and plug directly into the wall.

2) Power switch
- Confirm the rear rocker switch (if present) is set to On.

3) Power reset
- Unplug printer for 60 seconds, press power button (if any) for 10 seconds, then reconnect and try again.

4) Indicators
- Do you hear any fans/clicks or see any lights at all?

If still dead, it may be a failed power supply/formatter. Share results and your setup (cables, surge protector, recent power events).

Turn 2/4
  User: Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet.
  Agent: T

[Trace(trace_id=tr-14fca2aa9c2b1345bb7a3ae45f9cb77f), Trace(trace_id=tr-71b5d34479723d2ca95410dee6932038), Trace(trace_id=tr-2861b7d80e25b9598774e59547b784d8), Trace(trace_id=tr-ced5472875d04138539fb05523d0162f)]

---

# 🎯 Evaluation Showcase: MLflow Methods

Now we'll demonstrate MLflow's evaluation capabilities step-by-step.

## Step 4: Create Session-Level Judges with `make_judge()`

This step uses MLflow's `make_judge()` API to define the judges.

### Retrieve the judge prompts

In [0]:
# First, let's see what the judges will evaluate
print("="*70)
print("COHERENCE JUDGE INSTRUCTIONS")
print("="*70)
print(get_coherence_judge_instructions())
print("\n" + "="*70)
print("CONTEXT RETENTION JUDGE INSTRUCTIONS")
print("="*70)
print(get_context_retention_judge_instructions())
print("\n" + "="*70)
print("\n💡 Both instructions contain {{ conversation }} - this makes them session-level!")

COHERENCE JUDGE INSTRUCTIONS
You are evaluating the coherence of a multi-turn customer support conversation.

Conversation to evaluate:
{{ conversation }}

Evaluate whether the conversation flows logically:
- Does the agent maintain context across turns?
- Are responses relevant to previous messages?
- Does the conversation follow a logical progression?
- Are there any contradictions or confusing jumps?

Provide your evaluation as:

- Value: True if the conversation is coherent and flows naturally, False if there are significant coherence issues.

- Rationale: Explain your reasoning in 2-3 sentences. Consider:
  - Context maintenance: Agent remembers what user said earlier
  - Logical flow: Each turn builds on previous turns
  - Relevance: Responses address the actual questions/issues raised
  - Consistency: No contradictions in advice or information


CONTEXT RETENTION JUDGE INSTRUCTIONS
You are evaluating context retention in a multi-turn customer support conversation.

Conversation 
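
The printed instructions embed `{{ conversation }}`, which the notebook identifies as what makes a judge session-level. A quick way to see which template variables a prompt uses is a small regular-expression check. This is only an illustrative sketch: `template_variables` is a hypothetical helper and the two sample prompts are made up. It is not how MLflow itself decides, and `is_session_level_scorer` on the created judge remains the authoritative answer.

```python
import re

# Matches {{ name }} with optional whitespace inside the braces.
TEMPLATE_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def template_variables(instructions: str) -> set[str]:
    """Return the set of template variable names found in a judge prompt."""
    return set(TEMPLATE_VAR.findall(instructions))


# Made-up sample prompts for illustration only.
session_prompt = "Conversation to evaluate:\n{{ conversation }}\nIs it coherent?"
turn_prompt = "Response to evaluate:\n{{ outputs }}\nIs it polite?"

print("conversation" in template_variables(session_prompt))
print("conversation" in template_variables(turn_prompt))
```

Running a check like this on `get_coherence_judge_instructions()` before calling `make_judge()` can catch a prompt that accidentally dropped the `{{ conversation }}` placeholder.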

#### Configure the judge model

In [0]:
# Configure judge model URI
if PROVIDER == "databricks":
    os.environ["OPENAI_API_KEY"] = os.environ.get("DATABRICKS_TOKEN", "")
    os.environ["OPENAI_API_BASE"] = f"{config.databricks_host}/serving-endpoints"
    # judge_model_uri = f"openai:/{JUDGE_MODEL}"
    judge_model_uri = f"databricks:/{JUDGE_MODEL}"
else:
    judge_model_uri = JUDGE_MODEL

print(f"Judge model URI: {judge_model_uri}")

Judge model URI: databricks:/databricks-gemini-2-5-flash


#### Create our judges with the `make_judge` MLflow API

In [0]:
# Create judges using mlflow.genai.judges.make_judge()
print("Creating judges...\n")

coherence_judge = make_judge(
    name="conversation_coherence",
    model=judge_model_uri,
    instructions=get_coherence_judge_instructions(),
    feedback_value_type=bool
)

context_judge = make_judge(
    name="context_retention",
    model=judge_model_uri,
    instructions=get_context_retention_judge_instructions(),
    feedback_value_type=Literal["excellent", "good", "fair", "poor"]
)

print("✓ Judges created")
print(f"  Coherence judge is session-level: {coherence_judge.is_session_level_scorer}")
print(f"  Context judge is session-level: {context_judge.is_session_level_scorer}")

Creating judges...

✓ Judges created
  Coherence judge is session-level: True
  Context judge is session-level: True
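
The context-retention judge returns one of four labels rather than a boolean. If you want to sort or aggregate those labels later, one option is to derive an ordering from the same `Literal` type. The sketch below assumes the labels are listed from best to worst, as they are in the cell above. `ContextRating` and `rating_rank` are hypothetical names introduced here.

```python
from typing import Literal, get_args

# Same labels as the context judge's feedback_value_type, best to worst.
ContextRating = Literal["excellent", "good", "fair", "poor"]
RATING_ORDER = get_args(ContextRating)


def rating_rank(value: str) -> int:
    """Return 0 for the best label and larger numbers for worse ones.

    Raises ValueError if the label is not one of the allowed values.
    """
    return RATING_ORDER.index(value)


print(sorted(["fair", "excellent", "poor", "good"], key=rating_rank))
```

Deriving the order from the type keeps the ranking and the judge definition from drifting apart if the label set changes.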


## 🔍︎ Step 5: Search for Traces with `mlflow.search_traces()`

In [0]:
# Get experiment and search for traces
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

session_traces = mlflow.search_traces(
    experiment_ids=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{scenario['session_id']}'"
)

print(f"✓ Found {len(session_traces)} traces")
print(f"  Session ID: {scenario['session_id']}")
print("  Each trace = 1 conversation turn")
display(session_traces[["request_time", "request", "response"]].sort_values(by="request_time", ascending=True))


✓ Found 16 traces
  Session ID: session-printer-001
  Each trace = 1 conversation turn


request_time,request,response
1770932632151,"List(My HP LaserJet 3000 won't turn on at all. The power light doesn't come on., session-printer-001)","I’m sorry—that’s frustrating. Let’s narrow it down: 1) Power source - Is the outlet working? Test with another device. - Try a different outlet (avoid power strips/surge protectors). 2) Power cable and switch - Ensure the power cord is firmly seated in the printer and wall. - Inspect the cord for damage; try another compatible cable if possible. - Make sure the rear power switch (if present) is set to On. 3) Power/reset - Unplug printer for 60 seconds, hold power button (if any) for 10 seconds, reconnect and try again. If still dead: do you hear any clicks/fans? Any LEDs flash briefly? Have there been recent power surges? This may indicate a failed power supply."
1770932640807,"List(Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet., session-printer-001)","Thanks for confirming. Let’s try a few targeted checks: 1) Outlet and cord - Please test that exact outlet with another device. - If possible, try a different power cord (standard IEC C13 “kettle” cable). 2) Power switch and reset - Ensure the rear rocker switch is On. - Unplug printer, hold the power button (if present) 10 seconds, wait 60 seconds, plug directly into wall (no strip), then power on. 3) Signs of life - When plugging in, do you hear any clicks/fans or see any brief LEDs? 4) Fuser door/cover - Ensure all doors/covers are fully closed; some models won’t power if interlocks aren’t engaged. Let me know results."
1770932645772,"List(I tried a different outlet in another room and still nothing. What should I try next?, session-printer-001)","Thanks—since a different outlet didn’t help, let’s isolate the printer: 1) Try a different IEC C13 power cable (from a PC/monitor) directly to a wall outlet. 2) Verify the rear rocker switch is firmly ON; toggle it OFF/ON once. 3) Remove all accessories: toner, paper tray, any USB/Ethernet cables. Close all doors, then plug in and try. 4) Look/listen on plug-in: any faint click, fan, or brief LED? If still completely dead, the low‑voltage power supply or DC controller may have failed. Do you want help locating a service center or part numbers for replacement?"
1770932650559,"List(Okay, I'll contact HP support for a warranty replacement. Thanks for your help!, session-printer-001)","You’re welcome! Before you call, note the serial number and model (HP LaserJet 3000) and describe the “no power, no LEDs, tested other outlet/cable” steps—you’ve done solid troubleshooting. If they ask, mention you power-cycled and removed accessories. Need help finding the serial or nearest service center? I can guide you."
1770957764890,"List(My HP LaserJet 3000 won't turn on at all. The power light doesn't come on., session-printer-001)",
1770957764941,"List(Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet., session-printer-001)",
1770957764942,"List(I tried a different outlet in another room and still nothing. What should I try next?, session-printer-001)",
1770957764944,"List(Okay, I'll contact HP support for a warranty replacement. Thanks for your help!, session-printer-001)",
1770957962275,"List(My HP LaserJet 3000 won't turn on at all. The power light doesn't come on., session-printer-001)","Let’s troubleshoot the power issue step by step: 1) Power source - Is the outlet working? Test with another device or try a different outlet. - Remove any surge protector or extension; plug the printer directly into the wall. 2) Power cable and connections - Reseat both ends of the power cord (printer and wall). - Inspect the cord and plug for damage. If available, try another compatible power cable. 3) Power switch and indicators - Ensure the rear power switch (if present) is set to On. - Any faint noises, smells, or flickers when switching on? 4) Power module/fuse - Some models have a replaceable power fuse or external power module—have you checked it? Tell me what you observe at each step."
1770957966792,"List(Yes, I've checked the power cable and it's plugged in securely to both the printer and wall outlet., session-printer-001)","Thanks for confirming the cable. Let’s narrow it down: 1) Outlet test - Plug another device (lamp/phone charger) into the same outlet. Does it work? - Try the printer in a different wall outlet, no surge protector/UPS. 2) Power switch and signs of life - Confirm the rear power switch is on. - When toggling it, any faint fan spin, clicks, or lights—even briefly? 3) Removable power components - This model may have an external power module or internal fuse. Do you see a removable power block or fuse access near the power inlet? 4) Reseat components - Power off/unplug, wait 60 seconds, hold the power button 10 seconds, then reconnect and try again. Report results for each step."
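
The search returns 16 traces even though the scenario has 4 turns, and the table shows the same user messages repeated at different times, which suggests the session ID was reused across several runs of the notebook. One way to isolate a single run is to split the traces wherever consecutive request times are far apart. The sketch below does this on the ten `request_time` values visible in the table above. `split_into_runs` and the 60-second gap are assumptions chosen for illustration.

```python
def split_into_runs(request_times_ms, gap_ms=60_000):
    """Group timestamps (in ms) into runs separated by more than gap_ms."""
    runs = []
    for ts in sorted(request_times_ms):
        if runs and ts - runs[-1][-1] <= gap_ms:
            runs[-1].append(ts)  # close enough to the previous trace: same run
        else:
            runs.append([ts])  # large gap: start a new run
    return runs


# request_time values copied from the rows shown in the table above.
request_times = [
    1770932632151, 1770932640807, 1770932645772, 1770932650559,
    1770957764890, 1770957764941, 1770957764942, 1770957764944,
    1770957962275, 1770957966792,
]

runs = split_into_runs(request_times)
print([len(run) for run in runs])  # [4, 4, 2]
```

With the real DataFrame, a filter along the lines of `session_traces[session_traces["request_time"] >= runs[-1][0]]` would keep only the most recent run before passing the traces to evaluation.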


## ⚖️ Step 6: Evaluate with `mlflow.genai.evaluate()`

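**Key MLflow API**: `mlflow.genai.evaluate()` runs both judges against the traces retrieved in Step 5, scoring the conversation as a whole rather than turn by turn.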

In [0]:
print("Evaluating conversation...\n")

with mlflow.start_run(run_name=scenario['name']) as run:
    eval_results = mlflow.genai.evaluate(
        data=session_traces,
        scorers=[coherence_judge, context_judge]
    )

print("✓ Evaluation complete")
result_df = eval_results.result_df
print(f"  Result DataFrame shape: {result_df.shape}")
if not using_databricks_mlflow:
    print("See results at http://localhost:5000/#/experiments/1/chat-sessions/session-printer-001")
    print(f"See results at http://localhost:5000/#/experiments/1/evaluation-runs?selectedRunUuid={eval_results.run_id}")

Evaluating conversation...



2026/02/13 22:05:47 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.


Evaluating:   0%|          | 0/17 [Elapsed: 00:00, Remaining: ?] 

✓ Evaluation complete
  Result DataFrame shape: (16, 14)


---

## 📊 What Just Happened?

We just demonstrated the complete MLflow evaluation workflow:

1. **`make_judge()`**: Created session-level judges with the `{{ conversation }}` template
2. **`mlflow.search_traces()`**: Found all traces for the conversation
3. **`mlflow.genai.evaluate()`**: Evaluated the full conversation (not individual turns)

**Key Insight**: Session-level evaluation assesses the entire conversation holistically, enabling evaluation of:
- Conversation coherence and flow
- Context retention across turns
- Logical progression of the discussion