# MLflow DeepEval Integration Demo: Multi-Turn Conversation Evaluation

This notebook demonstrates MLflow 3.8's DeepEval integration for session-level evaluation:
1. Use `mlflow.genai.scorers.deepeval` for industry-standard metrics
2. Evaluate multi-turn conversations with DeepEval scorers
3. Extract and interpret DeepEval evaluation results

**Key Feature**: MLflow 3.8+ provides native integration with DeepEval metrics through `mlflow.genai.scorers.deepeval`, enabling you to use industry-standard conversational AI metrics seamlessly within MLflow's evaluation framework.

## What You'll Learn

- How to use DeepEval scorers via MLflow integration
- How to evaluate session-level conversations with industry metrics
- How to extract DeepEval results from evaluation DataFrames
- Best practices for multi-turn conversation evaluation with DeepEval

---

## Setup: Import Dependencies

In [None]:
from genai.common.config import AgentConfig
from genai.agents.multi_turn.customer_support_agent_simple import CustomerSupportAgentSimple
from genai.agents.multi_turn.scenarios import get_scenario_account_access, get_scenario_printer_troubleshooting
import mlflow
from mlflow.genai.scorers.deepeval import (
    ConversationCompleteness,
    KnowledgeRetention,
    TopicAdherence
)
import os
from pathlib import Path

# Load environment variables
try:
    from dotenv import load_dotenv
    env_file = Path(".env")
    if env_file.exists():
        load_dotenv(env_file)
        print(f"✓ Loaded environment variables from {env_file.absolute()}")
    else:
        print("ℹ️  No .env file found. Set environment variables manually.")
except ImportError:
    print("ℹ️  python-dotenv not installed. Set environment variables manually.")

# Convenience for display
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

## Configuration

In [None]:
# Provider and model configuration
PROVIDER = "databricks"  # or "openai"
AGENT_MODEL = "databricks-gpt-5-2"
JUDGE_MODEL = "databricks-gemini-2-5-flash"  # Used by DeepEval scorers
TEMPERATURE = 1.0
EXPERIMENT_NAME = "customer-support-deepeval-demo"

print("✓ Configuration:")
print(f"  Provider: {PROVIDER}")
print(f"  Agent Model: {AGENT_MODEL}")
print(f"  Judge Model (for DeepEval): {JUDGE_MODEL}")

## Step 1: Setup MLflow Tracking

In [None]:
mlflow.openai.autolog()
using_databricks_mlflow = False

if using_databricks_mlflow:
    mlflow.set_tracking_uri("databricks")
    EXPERIMENT_NAME = f"/Users/your-username/{EXPERIMENT_NAME}"
    mlflow.set_experiment(EXPERIMENT_NAME)
else:
    mlflow.set_tracking_uri(None)
    mlflow.set_experiment(EXPERIMENT_NAME)

print("✓ MLflow tracking enabled")
print(f"  Experiment: {EXPERIMENT_NAME}")

## Step 2: Initialize Agent (Conversation-Only)

Note: This agent handles conversations only. We'll set up DeepEval evaluation separately.

In [None]:
config = AgentConfig(
    model=AGENT_MODEL,
    provider=PROVIDER,
    temperature=TEMPERATURE,
    mlflow_experiment=EXPERIMENT_NAME
)

agent = CustomerSupportAgentSimple(config)

print("✓ Agent initialized (conversation-only)")
print(f"  Provider: {config.provider}")
print(f"  Model: {config.model}")

## Step 3: Configure DeepEval Environment

DeepEval scorers need to know which model endpoint to use for evaluation.

In [None]:
# Configure environment for DeepEval to use Databricks/OpenAI endpoints
if PROVIDER == "databricks":
    databricks_host = os.environ.get("DATABRICKS_HOST", "")
    if databricks_host:
        os.environ["OPENAI_API_KEY"] = os.environ.get("DATABRICKS_TOKEN", "")
        os.environ["OPENAI_API_BASE"] = f"{databricks_host}/serving-endpoints"
        print("✓ DeepEval configured for Databricks")
        print(f"  Endpoint: {databricks_host}")
    judge_model_uri = f"openai:/{JUDGE_MODEL}"
else:
    judge_model_uri = JUDGE_MODEL
    print("✓ DeepEval configured for OpenAI")

print(f"  Judge Model URI: {judge_model_uri}")
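
The configuration cell above continues silently when `DATABRICKS_HOST` is unset, and a host value ending in `/` would produce `//serving-endpoints`. The following is a minimal sketch of a stricter variant, not part of the notebook's agent code: `resolve_judge_config` is a hypothetical helper name, it reuses the same environment variable names and `openai:/` prefix as the cell, and (as an added assumption) it requires the token as well as the host.

```python
from typing import Dict, Tuple


def resolve_judge_config(
    provider: str,
    judge_model: str,
    databricks_host: str = "",
    databricks_token: str = "",
) -> Tuple[Dict[str, str], str]:
    """Return (env_vars_to_set, judge_model_uri) for the given provider.

    Hypothetical helper: it mirrors the configuration cell but fails
    loudly when Databricks credentials are missing.
    """
    if provider != "databricks":
        # Other providers use the judge model name unchanged, as in the cell.
        return {}, judge_model

    if not databricks_host or not databricks_token:
        raise ValueError("DATABRICKS_HOST and DATABRICKS_TOKEN must both be set")

    # Strip trailing slashes so the base URL never contains "//serving-endpoints".
    base_url = databricks_host.rstrip("/") + "/serving-endpoints"
    env_vars = {
        "OPENAI_API_KEY": databricks_token,
        "OPENAI_API_BASE": base_url,
    }
    return env_vars, f"openai:/{judge_model}"
```

In the notebook this could be called with `os.environ.get("DATABRICKS_HOST", "")` and `os.environ.get("DATABRICKS_TOKEN", "")`, followed by `os.environ.update(env_vars)`. Keeping the function free of side effects makes it easy to check before any environment variable is changed.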

## Step 4: Run Conversation Scenario

In [None]:
# Load account access scenario
scenario = get_scenario_account_access()

print(f"Scenario: {scenario['name']}")
print(f"Description: {scenario['description']}")
print(f"Turns: {len(scenario['messages'])}")
print(f"Session ID: {scenario['session_id']}")

In [None]:
# Run the conversation
# Under the hood, the agent tags each turn's trace with the session ID via
# mlflow.update_current_trace(metadata={"mlflow.trace.session": session_id})
conv_result = agent.run_conversation(
    messages=scenario['messages'],
    session_id=scenario['session_id']
)

print("\n✓ Conversation completed")
print(f"  Turns: {conv_result['turns']}")
print(f"  Session ID: {scenario['session_id']}")
print("\nView traces: mlflow ui")
if not using_databricks_mlflow:
    experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
    print(f"  http://127.0.0.1:5000/#/experiments/{experiment.experiment_id}/chat-sessions/{scenario['session_id']}")

---

# 🎯 DeepEval Evaluation Showcase: MLflow Integration

Now we'll demonstrate MLflow's DeepEval integration step-by-step.

## Step 5: Create DeepEval Scorers via MLflow Integration

**Key MLflow 3.8 Feature**: Use `mlflow.genai.scorers.deepeval` for industry-standard metrics!

In [None]:
print("Creating DeepEval scorers...\n")

# ConversationCompleteness: Evaluates if conversation satisfies user's needs
completeness_scorer = ConversationCompleteness(
    model=judge_model_uri,
    include_reason=True
)

# KnowledgeRetention: Assesses ability to retain information across turns
knowledge_retention_scorer = KnowledgeRetention(
    model=judge_model_uri,
    include_reason=True
)

# TopicAdherence: Checks if conversation stays on relevant topics
topic_adherence_scorer = TopicAdherence(
    model=judge_model_uri,
    include_reason=True,
    relevant_topics=["customer support", "technical help", "account access"]
)

print("✓ DeepEval scorers created:")
print(f"  - ConversationCompleteness (session-level: {completeness_scorer.is_session_level_scorer})")
print(f"  - KnowledgeRetention (session-level: {knowledge_retention_scorer.is_session_level_scorer})")
print(f"  - TopicAdherence (session-level: {topic_adherence_scorer.is_session_level_scorer})")
print("\n💡 These conversational DeepEval scorers are session-level by default!")

## Step 6: Search for Traces with `mlflow.search_traces()`

In [None]:
# Get experiment and search for traces
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

session_traces = mlflow.search_traces(
    locations=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{scenario['session_id']}'"
)

print(f"✓ Found {len(session_traces)} traces for session '{scenario['session_id']}'")
print("  Each trace = 1 conversation turn")
print("\nTraces overview:")
display(session_traces[["request_time", "request", "response"]].sort_values(by="request_time", ascending=True))
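
The search cell interpolates the session ID directly into a single-quoted filter string, so an ID containing a quote or backtick would yield a malformed filter. As a minimal sketch, the hypothetical helper `session_filter` below builds the same string and rejects such IDs outright. It does not attempt escaping, because this notebook does not show MLflow's escaping rules.

```python
SESSION_METADATA_KEY = "mlflow.trace.session"


def session_filter(session_id: str) -> str:
    """Build the filter string used above for one session ID.

    Hypothetical helper: rejects IDs that would break the quoting
    instead of guessing at escaping rules.
    """
    if not session_id:
        raise ValueError("session_id must be a non-empty string")
    if "'" in session_id or "`" in session_id:
        raise ValueError("session_id must not contain quotes or backticks")
    # Same shape as the filter_string in the search cell.
    return f"metadata.`{SESSION_METADATA_KEY}` = '{session_id}'"
```

The result can be passed as `filter_string=session_filter(scenario['session_id'])` in the `mlflow.search_traces()` call.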

## Step 7: Evaluate with `mlflow.genai.evaluate()` + DeepEval Scorers

**Key MLflow API**: This is where MLflow's DeepEval integration shines!

In [None]:
print(f"Evaluating conversation with DeepEval scorers...\n")

with mlflow.start_run(run_name=f"{scenario['name']} - DeepEval") as run:
    eval_results = mlflow.genai.evaluate(
        data=session_traces,
        scorers=[
            completeness_scorer,
            knowledge_retention_scorer,
            topic_adherence_scorer
        ]
    )

print("✓ DeepEval evaluation complete")
print(f"  Run ID: {eval_results.run_id}")
if not using_databricks_mlflow:
    print("\nView results:")
    print(f"  http://localhost:5000/#/experiments/{experiment.experiment_id}/evaluation-runs?selectedRunUuid={eval_results.run_id}")

## Step 8: Extract DeepEval Results from DataFrame

DeepEval scores appear as columns in the result DataFrame, named `MetricName/value` and `MetricName/reason`.

In [None]:
result_df = eval_results.result_df

print(f"Result DataFrame shape: {result_df.shape}")
print(f"\nDeepEval metric columns:")
deepeval_cols = [col for col in result_df.columns if any(metric in col for metric in ['ConversationCompleteness', 'KnowledgeRetention', 'TopicAdherence'])]
for col in deepeval_cols:
    print(f"  - {col}")
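
The cell above finds metric columns by matching three hard-coded metric names. Given the `MetricName/value` and `MetricName/reason` naming described earlier, a more general approach is to split each column name on its last `/`. The sketch below assumes only that naming convention; `group_metric_columns` is a hypothetical helper, not an MLflow function.

```python
from typing import Dict, Iterable, List, Tuple


def group_metric_columns(
    columns: Iterable[str],
    suffixes: Tuple[str, ...] = ("value", "reason"),
) -> Dict[str, List[str]]:
    """Map each metric name to the suffixes found for it.

    Columns without a recognized "/suffix" ending are ignored.
    """
    grouped: Dict[str, List[str]] = {}
    for column in columns:
        # rpartition splits on the LAST "/" so metric names may contain "/".
        name, separator, suffix = column.rpartition("/")
        if separator and name and suffix in suffixes:
            grouped.setdefault(name, []).append(suffix)
    return grouped
```

Calling `group_metric_columns(result_df.columns)` would then list every scorer present in the results, including any added later, without editing the name list.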

### Extract Individual Metric Scores

In [None]:
import math

# Extract ConversationCompleteness
completeness_score = result_df['ConversationCompleteness/value'].iloc[0] if 'ConversationCompleteness/value' in result_df.columns else None
completeness_reason = result_df['ConversationCompleteness/reason'].iloc[0] if 'ConversationCompleteness/reason' in result_df.columns else None

print("="*70)
print("📊 ConversationCompleteness")
print("="*70)
if completeness_score is not None and not (isinstance(completeness_score, float) and math.isnan(completeness_score)):
    print(f"Score: {completeness_score}")
    if completeness_reason and not (isinstance(completeness_reason, float) and math.isnan(completeness_reason)):
        print(f"Reason: {completeness_reason}")
else:
    print("Score: N/A (evaluation may have failed)")
    print("💡 Tip: Check that the session contains multiple traces")

In [None]:
# Extract KnowledgeRetention
retention_score = result_df['KnowledgeRetention/value'].iloc[0] if 'KnowledgeRetention/value' in result_df.columns else None
retention_reason = result_df['KnowledgeRetention/reason'].iloc[0] if 'KnowledgeRetention/reason' in result_df.columns else None

print("="*70)
print("🧠 KnowledgeRetention")
print("="*70)
if retention_score is not None and not (isinstance(retention_score, float) and math.isnan(retention_score)):
    print(f"Score: {retention_score}")
    if retention_reason and not (isinstance(retention_reason, float) and math.isnan(retention_reason)):
        print(f"Reason: {retention_reason}")
else:
    print("Score: N/A (evaluation may have failed)")

In [None]:
# Extract TopicAdherence
topic_score = result_df['TopicAdherence/value'].iloc[0] if 'TopicAdherence/value' in result_df.columns else None
topic_reason = result_df['TopicAdherence/reason'].iloc[0] if 'TopicAdherence/reason' in result_df.columns else None

print("="*70)
print("🎯 TopicAdherence")
print("="*70)
if topic_score is not None and not (isinstance(topic_score, float) and math.isnan(topic_score)):
    print(f"Score: {topic_score}")
    if topic_reason and not (isinstance(topic_reason, float) and math.isnan(topic_reason)):
        print(f"Reason: {topic_reason}")
else:
    print("Score: N/A (evaluation may have failed)")
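
The three extraction cells repeat the same missing-value checks and read only the first row with `.iloc[0]`. If a session-level score happens to be stored on a different row of the session (an assumption, not something this notebook confirms), that first row may be `NaN` even though the evaluation succeeded. A minimal sketch of a shared helper that scans for the first usable value instead; `is_missing` and `first_valid` are hypothetical names.

```python
import math
from typing import Any, Iterable, Optional


def is_missing(value: Any) -> bool:
    """True for None and for float NaN; any other value counts as present."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def first_valid(values: Iterable[Any]) -> Optional[Any]:
    """Return the first value that is neither None nor NaN, else None.

    A score of 0.0 is a valid result and is returned as-is.
    """
    for value in values:
        if not is_missing(value):
            return value
    return None
```

For example, `first_valid(result_df['KnowledgeRetention/value'].tolist())` would replace the `.iloc[0]` lookup, and the same call on the `/reason` column retrieves the explanation.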

### View Aggregated Metrics

MLflow also provides aggregated metrics across all traces.

In [None]:
print("="*70)
print("📈 Aggregated Metrics (mean across traces)")
print("="*70)
print("\nMetrics summary:")
for key, value in eval_results.metrics.items():
    print(f"  {key}: {value}")

---

## 📊 What Just Happened?

We just demonstrated the complete MLflow + DeepEval workflow:

1. **Created DeepEval Scorers**: Used `mlflow.genai.scorers.deepeval` for industry-standard metrics
   - ConversationCompleteness
   - KnowledgeRetention
   - TopicAdherence

2. **Searched Traces**: Used `mlflow.search_traces()` to find all conversation turns

3. **Evaluated with DeepEval**: Used `mlflow.genai.evaluate()` with DeepEval scorers

4. **Extracted Results**: Retrieved scores and reasoning from the result DataFrame

**Key Insights**:
- DeepEval scorers integrate seamlessly with MLflow's evaluation framework
- All DeepEval conversational metrics are session-level by default
- Results include both scores (0.0-1.0) and human-readable reasoning
- MLflow automatically aggregates metrics across traces

**Benefits over MLflow Native Judges**:
- Industry-standard metrics with established definitions
- Pre-built conversational AI evaluation metrics
- Active DeepEval community and metric library
- Consistent scoring across different use cases

---

## 🔄 Try Another Scenario

Let's evaluate a different conversation to see how DeepEval metrics compare.

In [None]:
# Load printer troubleshooting scenario
scenario2 = get_scenario_printer_troubleshooting()

print(f"Scenario: {scenario2['name']}")
print(f"Description: {scenario2['description']}")
print(f"Session ID: {scenario2['session_id']}")

# Run conversation
conv_result2 = agent.run_conversation(
    messages=scenario2['messages'],
    session_id=scenario2['session_id']
)

print(f"\n✓ Conversation completed with {conv_result2['turns']} turns")

In [None]:
# Search traces for second scenario
session_traces2 = mlflow.search_traces(
    locations=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{scenario2['session_id']}'"
)

print(f"✓ Found {len(session_traces2)} traces for '{scenario2['name']}'")

In [None]:
# Evaluate with DeepEval
with mlflow.start_run(run_name=f"{scenario2['name']} - DeepEval") as run:
    eval_results2 = mlflow.genai.evaluate(
        data=session_traces2,
        scorers=[
            completeness_scorer,
            knowledge_retention_scorer,
            TopicAdherence(
                model=judge_model_uri,
                include_reason=True,
                relevant_topics=["customer support", "technical help", "printer problems"]
            )
        ]
    )

print("✓ Evaluation complete")
print("\nMetrics summary:")
for key, value in eval_results2.metrics.items():
    print(f"  {key}: {value}")
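
To compare the two scenarios side by side rather than reading two separate printouts, the metric dictionaries can be joined on their shared keys. This is a minimal sketch that assumes both `metrics` objects are plain name-to-number mappings; `compare_metrics` is a hypothetical helper. Note that the two runs configure `TopicAdherence` with different `relevant_topics`, so that metric is not a strict like-for-like comparison.

```python
from typing import Dict, Mapping, Tuple


def compare_metrics(
    first: Mapping[str, float],
    second: Mapping[str, float],
) -> Dict[str, Tuple[float, float, float]]:
    """Return {metric: (first_value, second_value, second - first)}.

    Only metrics present in both mappings are included, in sorted order.
    """
    shared_keys = sorted(set(first) & set(second))
    return {
        key: (first[key], second[key], second[key] - first[key])
        for key in shared_keys
    }
```

Iterating over `compare_metrics(eval_results.metrics, eval_results2.metrics).items()` gives each shared metric with both values and the difference between them.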

## 🎓 Key Takeaways

1. **MLflow 3.8+ DeepEval Integration**: Seamlessly use industry-standard metrics within MLflow

2. **Session-Level Evaluation**: DeepEval scorers automatically evaluate entire conversations

3. **Rich Results**: Get both quantitative scores (0.0-1.0) and qualitative reasoning

4. **Flexible Metrics**: Customize scorers (e.g., `relevant_topics` for TopicAdherence)

5. **Unified Workflow**: Same `mlflow.genai.evaluate()` API for both native judges and DeepEval scorers

---

## 📚 Next Steps

- Explore more DeepEval metrics (ContextualRecall, ContextualPrecision, etc.)
- Combine DeepEval scorers with MLflow native judges for hybrid evaluation
- Batch evaluate multiple conversation sessions for quality monitoring
- Use MLflow UI to compare evaluation results across different scenarios

**Documentation**:
- [MLflow DeepEval Integration](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/deepeval-scorers/)
- [DeepEval Metrics](https://docs.confident-ai.com/)
- [MLflow Session Tracking](https://mlflow.org/docs/latest/genai/tracing/track-users-sessions/)