# Tutorial 1.7: Evaluating Agents with MLflow

## LLM-as-Judge Evaluation for GenAI Applications

This notebook teaches you how to systematically evaluate agents and LLM applications using MLflow's evaluation framework. You'll learn to use built-in judges, create custom scorers, and integrate third-party evaluation libraries like DeepEval.

### What You'll Learn

- ✅ Why agent evaluation matters
- ✅ MLflow built-in scorers (RelevanceToQuery, Correctness, Guidelines, Safety)
- ✅ Creating custom scorers with the `@scorer` decorator
- ✅ DeepEval integration for conversational evaluation
- ✅ Session-level multi-turn evaluation

### Prerequisites
- Completed notebooks 1.1-1.6
- Understanding of MLflow tracing

### Estimated Time: 25-30 minutes

---
## Step 1: Why Evaluate Agents?

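Agents are hard to evaluate by inspection alone. The table below pairs each common challenge with the evaluation technique that addresses it: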

| Challenge | Solution |
|-----------|----------|
| Non-deterministic outputs | Statistical evaluation over multiple runs |
| Complex reasoning chains | Step-by-step quality assessment |
| Multi-turn conversations | Session-level coherence metrics |
| Safety and compliance | Automated guardrail checking |
| Regression detection | Baseline comparisons |
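
As a rough illustration of the first row, the sketch below summarizes repeated pass/fail verdicts for a single question. The verdict list is hypothetical data, not output from this notebook, and `summarize_runs` is an illustrative helper rather than an MLflow API.

```python
import math
import statistics


def summarize_runs(pass_flags: list[bool]) -> dict:
    """Summarize repeated pass/fail judgments of the same question."""
    scores = [1.0 if flag else 0.0 for flag in pass_flags]
    pass_rate = statistics.mean(scores)
    # Standard error of the mean; reported as 0.0 when there is a single run.
    if len(scores) > 1:
        stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    else:
        stderr = 0.0
    return {"runs": len(scores), "pass_rate": pass_rate, "stderr": stderr}


# Hypothetical verdicts from five runs of one question at temperature 0.7.
summary = summarize_runs([True, True, False, True, True])
print(summary)
```

A single run would report only pass or fail; repeating the run exposes how much the verdict varies.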

### LLM-as-Judge Pattern

Instead of brittle string matching, we use LLMs to evaluate LLM outputs:

```
Agent Output → Judge LLM → Score + Reasoning
```

![LLM as a judge concept](images/llm-as-judge.png)

MLflow provides:
1. **Built-in scorers** - Pre-configured judges for common metrics
2. **Custom scorers** - Define your own evaluation logic
3. **Third-party integrations** - DeepEval, RAGAS, Phoenix, and more
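
To make the pattern concrete, here is a minimal offline sketch of a judge call. The prompt wording, the `judge_response` helper, and the `fake_judge` stand-in are all assumptions made for illustration; MLflow's built-in scorers use their own prompts and parsing.

```python
import json
from typing import Callable

# Hypothetical grading prompt; real judges use more detailed instructions.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Reply with JSON: {{"score": "yes" or "no", "rationale": "<one sentence>"}}'
)


def judge_response(question: str, answer: str, call_judge: Callable[[str], str]) -> dict:
    """Ask a judge model to grade an answer and parse its JSON verdict."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    raw = call_judge(prompt)
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        # Judges sometimes reply with free text; surface that instead of crashing.
        return {"score": None, "rationale": f"unparseable judge output: {raw!r}"}
    return {"score": verdict.get("score"), "rationale": verdict.get("rationale", "")}


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    return json.dumps({"score": "yes", "rationale": "The answer addresses the question."})


verdict = judge_response("What is MLflow?", "An open source MLOps platform.", fake_judge)
print(verdict)
```

Swapping `fake_judge` for a function that calls a real model is the only change needed to turn the sketch into a live judge.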

---
## Step 2: Environment Setup

In [1]:
import mlflow
from dotenv import load_dotenv
from utils.clnt_utils import is_databricks_ai_gateway_client, get_databricks_ai_gateway_client, get_openai_client, get_ai_gateway_model_names

# Load environment
load_dotenv()

# Configure MLflow
EXPERIMENT_NAME = "1-agent-evaluation"
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment(EXPERIMENT_NAME)

# Check if we are using a Databricks AI Gateway client
use_databricks_provider = is_databricks_ai_gateway_client()
if use_databricks_provider:
    client = get_databricks_ai_gateway_client()
    models = get_ai_gateway_model_names()
    JUDGE_MODEL = models[2]
    AGENT_MODEL = models[0]
else:
    # Initialize as an OpenAI client
    client = get_openai_client()
    JUDGE_MODEL = "gpt-5.2"
    AGENT_MODEL = "gpt-5.2"

# Enable autologging
mlflow.openai.autolog()

print("✅ Environment configured for agent evaluation")
print(f"use_databricks_provider: {use_databricks_provider}")
print(f"MLflow tracking: {mlflow.get_tracking_uri()}")
print(f"Experiment: {EXPERIMENT_NAME}")
print(f"Agent model : {AGENT_MODEL}")
print(f"Judge model : {JUDGE_MODEL}")

2026/02/07 14:12:45 INFO mlflow.tracking.fluent: Experiment with name '1-agent-evaluation' does not exist. Creating a new experiment.


✅ Environment configured for agent evaluation
use_databricks_provider: False
MLflow tracking: http://localhost:5000
Experiment: 1-agent-evaluation
Agent model : gpt-5.2
Judge model : gpt-5.2


---
## Step 3: Create a Simple Agent for Evaluation

Let's create a simple Q&A agent to evaluate. It serves as the test subject for the built-in judges in Step 4 and the custom scorers in Step 7.

In [2]:
from typing import Any

class SimpleQAAgent:
    """
    A simple Q&A agent for demonstration.
    """
    
    def __init__(self, client: Any, model: str = "gpt-5.2"):
        self.model = model
        self.client = client
    
    @mlflow.trace(name="qa_agent", span_type="AGENT")
    def answer(self, question: str) -> str:
        """Answer a question."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": """You are a helpful Agent Assistant. Provide concise, accurate answers, 
                no hallucinations, with a focus on MLflow and GenAI."""},
                {"role": "user", "content": question}
            ],
            temperature=0.7,
        )
        return response.choices[0].message.content

# Initialize agent
agent = SimpleQAAgent(client=client, model=AGENT_MODEL)

# Test the agent with a single question
test_question = "What is MLflow for GenAI?"
test_response = agent.answer(test_question)
print(f"Question: {test_question}")
print(f"\nResponse: {test_response}")

Question: What is MLflow for GenAI?

Response: MLflow for GenAI is the set of MLflow capabilities designed to **build, evaluate, track, and deploy generative AI applications** (LLMs and agentic systems) with the same rigor you'd use for traditional ML, covering prompts, models, tools, retrieval, and outputs.

Key pieces:

- **Experiment tracking for LLM apps**: Log prompts, model/provider (e.g., OpenAI, Anthropic, local), parameters (temperature, max tokens), retrieved context, tool calls, and outputs, so runs are reproducible and comparable.
- **Prompt & model management**: Version and register prompts and GenAI "models"/chains so you can promote changes through dev → staging → prod with governance.
- **Evaluation (LLM evals)**: Run standardized evaluations on datasets (QA, summarization, RAG, safety), including automated metrics and LLM-as-judge style scoring, plus human review workflows where applicable.
- **Tracing/observability**: Capture traces of multi-step chains/ag

### Quick Evaluation: Is This Response Good?

Now that we have a response, how do we know if it's good? We could read it manually, but that doesn't scale. Instead, let's use MLflow's built-in `RelevanceToQuery` scorer to have an LLM judge evaluate the response.

In [3]:
import os
from mlflow.genai.scorers import RelevanceToQuery

# Configure judge model URI based on provider
if use_databricks_provider:
    databricks_token = os.environ.get("DATABRICKS_TOKEN")
    ai_gateway_base_url = os.environ.get("AI_GATEWAY_BASE_URL")
    os.environ["OPENAI_API_KEY"] = databricks_token
    os.environ["OPENAI_API_BASE"] = ai_gateway_base_url
    judge_model_uri = f"openai:/{JUDGE_MODEL}"
else:
    judge_model_uri = f"openai:/{JUDGE_MODEL}"

# Create a quick evaluation of the single test response
quick_eval_data = [{
    "inputs": {"question": test_question},
    "outputs": {"response": test_response}  # Pre-computed output, no predict_fn needed
}]

# Evaluate with RelevanceToQuery scorer
quick_scorer = RelevanceToQuery(model=judge_model_uri)

print("🔄 Evaluating the test response with RelevanceToQuery scorer...\n")

quick_result = mlflow.genai.evaluate(
    data=quick_eval_data,
    scorers=[quick_scorer]
)

# Show the result
print("📊 Quick Evaluation Result:")
print("-" * 40)
score = quick_result.metrics.get("relevance_to_query/mean", "N/A")
print(f"   Relevance Score: {score}")
print("\n✅ The LLM judge evaluated our agent's response!")
print("   Now let's learn about all the built-in scorers available...")

2026/02/07 14:13:01 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.


🔄 Evaluating the test response with RelevanceToQuery scorer...



Evaluating:   0%|          | 0/1 [Elapsed: 00:00, Remaining: ?] 

📊 Quick Evaluation Result:
----------------------------------------
   Relevance Score: 1.0

✅ The LLM judge evaluated our agent's response!
   Now let's learn about all the built-in scorers available...


---
## Step 4: MLflow Built-in Scorers

MLflow provides [pre-configured scorers](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/predefined/) for common evaluation needs:

| Scorer | Purpose | Inputs |
|--------|---------|--------|
| `RelevanceToQuery` | Is the response relevant to the question? | inputs, outputs |
| `Correctness` | Is the response factually correct? | inputs, outputs, expectations |
| `Guidelines` | Does the response follow specific guidelines? | outputs |
| `Safety` | Is the response safe and appropriate? | outputs |

In [9]:
import os
from mlflow.genai.scorers import (
    RelevanceToQuery,
    Correctness,
    Guidelines,
    Safety
)

if use_databricks_provider:
    databricks_token = os.environ.get("DATABRICKS_TOKEN")
    ai_gateway_base_url = os.environ.get("AI_GATEWAY_BASE_URL")
    
    # Try configuring as OpenAI-compatible endpoint
    os.environ["OPENAI_API_KEY"] = databricks_token
    os.environ["OPENAI_API_BASE"] = ai_gateway_base_url
    
    judge_model_uri = f"databricks:/{JUDGE_MODEL}"
    print("🔧 Configured for Databricks AI Gateway")
    print(f"   Base URL: {ai_gateway_base_url}")
    print(f"   Model: {JUDGE_MODEL}")
else:
    judge_model_uri = f"openai:/{JUDGE_MODEL}"

print(f"🔧 Judge model URI: {judge_model_uri}")

# Initialize built-in scorers
relevance_scorer = RelevanceToQuery(model=judge_model_uri)
correctness_scorer = Correctness(model=judge_model_uri)
safety_scorer = Safety(model=judge_model_uri)
guidelines_scorer = Guidelines(
    model=judge_model_uri,
    guidelines=[ """
Response should be appropriately detailed:
- Simple factual questions: < 200 words  
- Technical how-to questions: < 500 words
- Complex architectural questions: < 1000 words
"""
    ],
    name="custom_guidelines"
)

print("\n✅ Built-in scorers initialized:")
print("   - RelevanceToQuery, Correctness, Safety, Guidelines")

🔧 Judge model URI: openai:/gpt-5.2

✅ Built-in scorers initialized:
   - RelevanceToQuery, Correctness, Safety, Guidelines


---
## Step 5: Create Evaluation Dataset

This comprises three steps:
 1. Create an evaluation dataset of questions and expected answers for the scorers to judge against
 2. Create a prediction function that takes a question as input and returns the LLM's response
 3. Run the MLflow evaluation with all the scorers

### Why Create an Evaluation Dataset?

An evaluation dataset is the foundation of systematic agent testing. Without it, you're essentially "eyeballing" outputs, which doesn't scale and misses edge cases. The dataset enables **repeatable, automated evaluation** so you can detect regressions when you change prompts, models, or agent logic.

### What Purpose Does It Serve?

| Purpose | Benefit |
|---------|---------|
| **Benchmark Performance** | Measure how well your agent performs on known questions |
| **Detect Regressions** | Catch quality drops when updating prompts or models |
| **Compare Configurations** | A/B test different models, temperatures, or prompts |
| **Validate Edge Cases** | Include tricky questions that previously caused failures |
| **Document Expected Behavior** | Serve as living documentation of what your agent should do |

### How Many Evaluation Pairs?

**Rule of thumb: Start with 20-50 examples for development, scale to 100-500 for production.**

- **Minimum viable**: 10-20 pairs to catch obvious issues
- **Development**: 20-50 pairs covering main use cases
- **Pre-production**: 50-100 pairs including edge cases
- **Production monitoring**: 100-500+ pairs for statistical significance

The dataset below uses 8 examples for tutorial brevity, but real evaluations need more coverage.
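
To see why dataset size matters, the sketch below estimates the margin of error on a measured pass rate. It assumes the normal approximation to the binomial, which is only rough at very small sizes such as 8, and `pass_rate_margin` is an illustrative helper, not part of MLflow.

```python
import math


def pass_rate_margin(pass_rate: float, n_examples: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_examples)


# Worst case is a pass rate of 0.5, where the variance is largest.
for n in (8, 50, 500):
    margin = pass_rate_margin(0.5, n)
    print(f"n={n:>3}: 0.50 +/- {margin:.2f}")
```

Under these assumptions the margin shrinks from roughly 0.35 at 8 examples to roughly 0.04 at 500, so a small dataset can catch obvious failures but cannot resolve small quality differences.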

### Dataset Structure

- `inputs`: The questions/prompts sent to the agent (dict with your input field names)
- `expectations` (optional): Expected answers for correctness checking

**Important:** Use `expected_response` as the field name in expectations; this is what MLflow's `Correctness` scorer looks for. Alternatively, use `expected_facts` for fact-based evaluation.
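
A quick structural check can catch misnamed fields before any judge calls are spent. The helper below is a hypothetical pre-flight check written for this tutorial's record shape; it is not an MLflow function, and MLflow performs its own validation.

```python
def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems found in evaluation records."""
    problems = []
    for i, record in enumerate(records):
        inputs = record.get("inputs")
        if not isinstance(inputs, dict) or not inputs:
            problems.append(f"record {i}: 'inputs' must be a non-empty dict")
        expectations = record.get("expectations")
        if expectations is None:
            continue  # expectations are optional
        if not isinstance(expectations, dict):
            problems.append(f"record {i}: 'expectations' must be a dict")
        elif not (expectations.keys() & {"expected_response", "expected_facts"}):
            problems.append(
                f"record {i}: expectations has neither 'expected_response' nor 'expected_facts'"
            )
    return problems


sample_records = [
    {"inputs": {"question": "What is a span?"},
     "expectations": {"expected_response": "A unit of work in a trace."}},
    {"inputs": {"question": "What is a trace?"},
     "expectations": {"expected_answer": "Misnamed field."}},
]
for problem in validate_records(sample_records):
    print(problem)
```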

In [6]:
# Evaluation dataset focused on MLflow for GenAI and Agent Observability
# NOTE: Use 'expected_response' - this is the field name that MLflow's Correctness scorer expects
from mlflow.genai import create_dataset
eval_dataset = [
    {
        "inputs": {"question": "What is MLflow Tracing and why is it important for GenAI applications?"},
        "expectations": {"expected_response": "MLflow Tracing provides observability for GenAI applications by capturing the complete execution flow including LLM calls, retrieval steps, tool usage, and agent reasoning. It's important because it enables debugging, performance analysis, and understanding of complex AI pipelines."}
    },
    {
        "inputs": {"question": "How does MLflow help with prompt management in GenAI development?"},
        "expectations": {"expected_response": "MLflow's Prompt Registry allows you to version control prompts, tag and search prompt versions, link prompts to experiments, and collaborate with teams. This ensures reproducibility and systematic prompt engineering."}
    },
    {
        "inputs": {"question": "What are spans in MLflow Tracing and what types are available?"},
        "expectations": {"expected_response": "Spans are units of work captured during tracing. MLflow supports span types including LLM (for model calls), RETRIEVER (for RAG retrieval), TOOL (for function calls), AGENT (for agent orchestration), CHAIN (for sequential operations), and EMBEDDING (for vector operations)."}
    },
    {
        "inputs": {"question": "How can you evaluate LLM outputs using MLflow?"},
        "expectations": {"expected_response": "MLflow provides an evaluation framework with built-in scorers like RelevanceToQuery, Correctness, Guidelines, and Safety. You can also create custom scorers using the @scorer decorator or integrate third-party libraries like DeepEval and RAGAS."}
    },
    {
        "inputs": {"question": "What is the LLM-as-Judge pattern and how does MLflow support it?"},
        "expectations": {"expected_response": "LLM-as-Judge uses an LLM to evaluate outputs from another LLM, replacing brittle string matching with intelligent assessment. MLflow supports this through built-in scorers that use configurable judge models to provide scores and reasoning explanations."}
    },
    {
        "inputs": {"question": "How do you track costs and token usage in MLflow for GenAI?"},
        "expectations": {"expected_response": "MLflow automatically logs token usage (prompt tokens, completion tokens, total tokens) and can calculate costs based on model pricing. This data is captured in traces and experiment runs, enabling cost analysis and optimization across different models and configurations."}
    },
    {
        "inputs": {"question": "What frameworks does MLflow integrate with for GenAI auto-tracing?"},
        "expectations": {"expected_response": "MLflow provides auto-tracing for 40+ frameworks including OpenAI, Anthropic, LangChain, LlamaIndex, AWS Bedrock, Google Vertex AI, Cohere, Ollama, DSPy, AutoGen, and CrewAI. Auto-tracing automatically captures LLM calls without manual instrumentation."}
    },
    {
        "inputs": {"question": "How do you implement session-level tracing for multi-turn conversations?"},
        "expectations": {"expected_response": "Key concepts: (1) stable session identifier, (2) tag traces with session_id, (3) filter traces by session, (4) MLflow search capabilities"}
    },
]

# Create and register the dataset
dataset = create_dataset(
    name="regression_test_suite",
    experiment_id=mlflow.get_experiment_by_name(EXPERIMENT_NAME).experiment_id,
    tags={"type": "regression", "priority": "critical"},
)

dataset.merge_records(eval_dataset)

print(f"✅ Evaluation dataset created with {len(eval_dataset)} examples")
print("\n📋 Questions cover:")
print("   - MLflow Tracing fundamentals")
print("   - Prompt management")
print("   - Span types and structure")
print("   - LLM evaluation methods")
print("   - LLM-as-Judge pattern")
print("   - Cost and token tracking")
print("   - Framework integrations")
print("   - Session-level observability")

✅ Evaluation dataset created with 8 examples

📋 Questions cover:
   - MLflow Tracing fundamentals
   - Prompt management
   - Span types and structure
   - LLM evaluation methods
   - LLM-as-Judge pattern
   - Cost and token tracking
   - Framework integrations
   - Session-level observability


### How Does predict_fn Work?

#### When and Why to Use predict_fn
##### predict_fn is required when:

 * Your dataset only contains inputs (and optionally expectations)
 * You need MLflow to call your agent/model to generate the outputs for evaluation

##### predict_fn is NOT needed when:

 * Your data already contains pre-computed outputs
 * You're passing MLflow traces (which already contain the inputs/outputs)

When `mlflow.genai.evaluate()` runs:

1. It iterates through each item in `eval_dataset`
2. It unpacks `item["inputs"]` as keyword arguments and calls `predict_fn(**inputs)`
3. Your function returns the outputs (e.g., `{"response": "..."}`)
4. The scorers then evaluate each row using its inputs, outputs, and expectations

 `[inputs, outputs (from the LLM), expectations] --> scorers`

 - **eval_dataset**: Defines what to test (questions + expected answers)
 - **predict_fn**: Defines how to get answers (calls your agent)
 - **scorers**: Define how to judge quality

In [11]:
def predict_fn(question: str) -> dict:
    """
    Prediction function wrapper for evaluation.
    
    Note: mlflow.genai.evaluate() unpacks the 'inputs' dict as keyword arguments,
    so the function signature must match the keys in your dataset's 'inputs' field.
    
    Dataset: {"inputs": {"question": "..."}} 
    Called as: predict_fn(question="...")
    
    Args:
        question: The question string (unpacked from inputs dict)
        
    Returns:
        Dictionary with 'response' key
    """
    response = agent.answer(question)
    return {"response": response}

print("✅ Prediction function defined")
print("   Signature: predict_fn(question: str) -> dict")

✅ Prediction function defined
   Signature: predict_fn(question: str) -> dict


---
## Step 6: Run Evaluation with Built-in Scorers

In [12]:
# Run evaluation
print("🔄 Running evaluation with built-in scorers...\n")

results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[
        relevance_scorer,
        correctness_scorer,
        safety_scorer,
        guidelines_scorer
    ]
)

print("\n✅ Evaluation complete!")

# Display metrics
print("\n📊 Metrics Summary:")
print("-" * 50)
if results.metrics:
    for metric_name, value in results.metrics.items():
        if isinstance(value, float):
            print(f"  {metric_name}: {value:.3f}")
        else:
            print(f"  {metric_name}: {value}")
else:
    print("  No metrics returned")
    print("\n⚠️  Scorers returned None - this usually means the judge model call failed.")
    print("  Check the 'error_message' or similar columns above for details.")

2026/02/07 15:23:15 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.


🔄 Running evaluation with built-in scorers...



Evaluating:   0%|          | 0/8 [Elapsed: 00:00, Remaining: ?] 


✅ Evaluation complete!

📊 Metrics Summary:
--------------------------------------------------
  safety/mean: 1.000
  relevance_to_query/mean: 1.000
  custom_guidelines/mean: 0.571
  correctness/mean: 0.429


---
## Step 7: Custom Scorers with @scorer Decorator

Create your own evaluation logic using the `@scorer` decorator.

Beyond using built-in metrics, you can define custom scorers to capture specific subject matter expertise. This is particularly useful when standard scorers cannot effectively gauge unique or nuanced aspects of your model's responses. While this example uses simple scorers for brevity and demonstration, you should tailor your custom metrics to reflect the specialized requirements of your specific domain.

In [13]:
from mlflow.genai import scorer

@scorer
def response_length_check(outputs: dict) -> bool:
    """
    Check if response is within acceptable length.
    Returns True if response is between 20 and 500 characters.
    """
    response = outputs.get("response", "")
    length = len(response)
    return 20 <= length <= 500

@scorer
def contains_keywords(outputs: dict, expectations: dict) -> bool:
    """
    Check if response contains key terms from expected answer.
    """
    response = outputs.get("response", "").lower()
    # Use 'expected_response' to match the dataset field name
    expected = expectations.get("expected_response", "").lower()
    
    # Extract key words (simple approach)
    key_words = [word for word in expected.split() if len(word) > 4]
    
    # Check if at least 30% of key words are present
    # If no keywords to check, fail conservatively (may indicate data issue)
    if not key_words:
        return False
    
    matches = sum(1 for word in key_words if word in response)
    return matches / len(key_words) >= 0.3

@scorer
def no_hallucination_markers(outputs: dict) -> bool:
    """
    Check for common hallucination markers.
    """
    response = outputs.get("response", "").lower()
    
    hallucination_markers = [
        "i think",
        "i believe",
        "probably",
        "might be",
        "i'm not sure",
        "as far as i know"
    ]
    
    return not any(marker in response for marker in hallucination_markers)

print("✅ Custom scorers defined:")
print("   - response_length_check")
print("   - contains_keywords")
print("   - no_hallucination_markers")

✅ Custom scorers defined:
   - response_length_check
   - contains_keywords
   - no_hallucination_markers
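
One caveat with `contains_keywords`: `str.split()` keeps punctuation attached to words, so an expected word such as `tracing.` will not match `tracing` in the response. The variant below is a sketch that tokenizes both strings with a regular expression and compares whole tokens. It is an alternative under stated assumptions, not the scorer used in this notebook, and its scores will differ from the results shown later.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(response: str, expected: str, min_length: int = 5) -> float:
    """Fraction of long expected tokens that also appear in the response."""
    key_words = {word for word in tokenize(expected) if len(word) >= min_length}
    if not key_words:
        return 0.0  # nothing to compare against
    return len(key_words & tokenize(response)) / len(key_words)


expected_text = "Spans are units of work captured during tracing."
response_text = "Tracing records spans, each a unit of work."
overlap = keyword_overlap(response_text, expected_text)
print(f"overlap: {overlap:.2f}")
```

To use it as a scorer, the function body could be wrapped in a `@scorer` function that returns `keyword_overlap(...) >= 0.3`, mirroring the 30% threshold above.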


### Run Evaluation with Our Custom Scorers

In [14]:
# Run evaluation with custom scorers
print("🔄 Running evaluation with custom scorers...\n")

custom_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[
        response_length_check,
        contains_keywords,
        no_hallucination_markers
    ]
)

print("\n✅ Custom evaluation complete!")
print("\nCustom Metrics Summary:")
print("-" * 40)
for metric_name, value in custom_results.metrics.items():
    if isinstance(value, float):
        print(f"  {metric_name}: {value:.3f}")
    else:
        print(f"  {metric_name}: {value}")

2026/02/07 15:24:00 INFO mlflow.genai.utils.data_validation: Testing model prediction with the first sample in the dataset. To disable this check, set the MLFLOW_GENAI_EVAL_SKIP_TRACE_VALIDATION environment variable to True.


🔄 Running evaluation with custom scorers...



Evaluating:   0%|          | 0/8 [Elapsed: 00:00, Remaining: ?] 


✅ Custom evaluation complete!

Custom Metrics Summary:
----------------------------------------
  response_length_check/mean: 0.125
  no_hallucination_markers/mean: 1.000
  contains_keywords/mean: 0.750


---
## Step 8: DeepEval Integration

MLflow integrates with [DeepEval](https://mlflow.org/docs/latest/genai/eval-monitor/scorers/third-party/deepeval/) for advanced conversational AI evaluation metrics.

| DeepEval Scorer | Purpose |
|-----------------|----------|
| `ConversationCompleteness` | Did the conversation achieve its goal? |
| `KnowledgeRetention` | Does the agent remember context? |
| `TopicAdherence` | Does the agent stay on topic? |
| `Toxicity` | Is the agent's response harmful or toxic in tone? |

In [17]:
from mlflow.genai.scorers.deepeval import (
    ConversationCompleteness,
    KnowledgeRetention,
    TopicAdherence,
    Toxicity,
)

# Initialize DeepEval scorers
deepeval_judge_model_uri = f"openai:/{JUDGE_MODEL}"

completeness_scorer = ConversationCompleteness(model=deepeval_judge_model_uri, threshold=0.7, include_reason=True)
retention_scorer = KnowledgeRetention(model=deepeval_judge_model_uri, threshold=0.7, include_reason=True)
toxicity_scorer = Toxicity(model=deepeval_judge_model_uri, threshold=0.7, include_reason=True)
topic_scorer = TopicAdherence(model=deepeval_judge_model_uri, threshold=0.7, include_reason=True, relevant_topics=["MLflow", "machine learning", "AI", "data science", "genai", "agent", "observability", "prompt engineering", "prompt management", "prompt registry", "experiment tracking"])


print("✅ DeepEval scorers initialized:")
print("   - ConversationCompleteness")
print("   - KnowledgeRetention")
print("   - TopicAdherence")
print("   - Toxicity")
print("\n Next, let's evaluate a multi-turn conversation using these DeepEval scorers")

✅ DeepEval scorers initialized:
   - ConversationCompleteness
   - KnowledgeRetention
   - TopicAdherence
   - Toxicity

 Next, let's evaluate a multi-turn conversation using these DeepEval scorers


---
## Step 9: Multi-Turn Conversation Agent

For session-level evaluation, we need an agent that handles conversations.

In [18]:
import uuid

class ConversationalAgent:
    """
    An agent that maintains conversation history for multi-turn interactions.
    """
    
    def __init__(self, client: Any, model: str = "gpt-5.2"):
        self.model = model
        self.client = client
        self.conversation_history = []
        self.session_id = str(uuid.uuid4())
    
    def reset(self):
        """Reset conversation history and start new session."""
        self.conversation_history = []
        self.session_id = str(uuid.uuid4())
    
    @mlflow.trace(name="conversational_agent", span_type="AGENT")
    def chat(self, user_message: str) -> str:
        """
        Send a message and get a response, maintaining history.
        """
        # Tag trace with session ID for grouping
        mlflow.update_current_trace(metadata={
            "mlflow.trace.session": self.session_id,
            # Trace metadata values are stored as strings, so convert explicitly
            "turn_number": str(len(self.conversation_history) // 2 + 1)
        })
        
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        
        # Prepare messages with system prompt. Add previous context to the conversation.
        messages = [
            {"role": "system", 
            "content": """You are a helpful MLflow expert assistant. Answer questions about MLflow clearly, accurately,
                        and concisely, without hallucinations. Remember previous context in the conversation."""}
        ] + self.conversation_history
        
        # Get response from the Agent LLM
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.7,
        )
        
        assistant_message = response.choices[0].message.content
        
        # Add to history
        self.conversation_history.append({
            "role": "assistant",
            "content": assistant_message
        })
        
        return assistant_message

print("✅ ConversationalAgent defined with session tracking")

✅ ConversationalAgent defined with session tracking
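
The `turn_number` expression in `chat` relies on the history holding one user message and one assistant message per completed turn. The standalone sketch below replays that bookkeeping with stub replies; `next_turn_number` is an illustrative helper, not part of the agent class.

```python
def next_turn_number(history: list[dict]) -> int:
    """Turn number for the user message that is about to be appended."""
    return len(history) // 2 + 1


history: list[dict] = []
turns_seen = []
for user_msg in ["first question", "follow-up", "another follow-up"]:
    # Computed before appending, exactly as chat() does when tagging the trace.
    turns_seen.append(next_turn_number(history))
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": "stub reply"})

print(turns_seen)
```

One consequence of this arithmetic: if a model call fails after the user message is appended, the history has odd length and the next call reports the same turn number again.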


### Simulate a multi-turn conversation

In [19]:
# Simulate a multi-turn conversation
conv_agent = ConversationalAgent(client=client, model=AGENT_MODEL)

conversation_turns = [
    "What is MLflow for GenAI?",
    "What are its main GenAI components?",
    "Tell me more about the Tracing component.",
    "How does it compare to other tools?",
    "What is the difference between Tracing and Tracking?",
    "How do I get started with MLflow for GenAI?",
]

print("🗣️ Multi-Turn Conversation\n")
print("=" * 60)

for i, user_msg in enumerate(conversation_turns, 1):
    print(f"\n[Turn {i}]")
    print(f"User: {user_msg}")
    response = conv_agent.chat(user_msg)
    print(f"Agent: {response}")

print("\n" + "=" * 60)
print(f"\n✅ Conversation complete (Session: {conv_agent.session_id[:8]}...)")



🗣️ Multi-Turn Conversation


[Turn 1]
User: What is MLflow for GenAI?




Agent: MLflow for GenAI is the set of MLflow features designed to **build, evaluate, track, and deploy GenAI applications**, especially **LLM-based** systems like chatbots, RAG pipelines, and agentic workflows, using the same experiment tracking and lifecycle management approach MLflow provides for traditional ML.

Key pieces in MLflow for GenAI typically include:

- **Prompt & model tracking**: Log prompts, prompt templates, model/provider info (e.g., OpenAI, Anthropic, local models), parameters (temperature, max tokens), and outputs as part of MLflow runs so results are reproducible and comparable.
- **Evaluation for LLM apps**: Run structured evaluations for tasks like QA, summarization, classification, etc., including:
  - **Model-based / LLM-as-judge metrics** (e.g., relevance, correctness, groundedness)
  - **Heuristic metrics** (e.g., exact match, ROUGE, toxicity checks)
  - Dataset-driven comparisons across prompts/models
- **Tracing / observability** (where supported): Captu



Agent: Main MLflow-for-GenAI components (as they're commonly used today) are:

1) **Experiment Tracking (Runs)**
- Log prompts/prompt templates, model/provider, decoding params (temperature, max_tokens), inputs/outputs, artifacts (e.g., retrieved docs), and metrics.

2) **LLM Evaluation**
- Evaluate prompts/models on datasets.
- Supports task-oriented evaluators and metrics (including LLM-as-judge style metrics where configured), plus comparisons across runs.

3) **Tracing / Observability**
- Trace multi-step GenAI executions (e.g., RAG: retrieval → rerank → prompt → LLM → post-process).
- Useful for debugging quality, latency, and failures; helps attribute issues to specific steps.

4) **Model Packaging & Serving (GenAI apps)**
- Package LLM pipelines/apps for reproducible execution.
- Deploy/serve them via MLflow serving mechanisms (depending on your environment).

5) **Model Registry (Governance)**
- Version and register GenAI apps/models.
- Manage stage transitions (e.g.,



Agent: MLflow Tracing is the observability layer for GenAI apps. It records a **trace** for each request (or run) and breaks it into **spans** that represent steps in your pipeline, so you can see *what happened*, *where time was spent*, and *which step produced which output*.

### What it captures
A trace typically includes:

- **Spans (step-level events)** such as:
  - LLM call(s) (provider/model, parameters, prompt/messages)
  - Retrieval (query, top-k, returned document IDs/snippets)
  - Reranking, tool/function calls, parsing, post-processing
- **Inputs/outputs per span**
  - Prompts/messages in, responses out
  - Retrieved context, intermediate artifacts
- **Timing and performance**
  - Latency per span and end-to-end latency
- **Metadata**
  - Model name/version, temperature/max_tokens, request IDs, user/session IDs (if you add them), error details/stack traces

### Why it's useful
- **Debugging quality**: Identify whether bad answers are due to retrieval (wrong docs), prom



Agent: MLflow Tracing overlaps with "LLM observability" tools, but it's positioned differently: it's meant to **connect request-level traces to the MLflow lifecycle** (runs, evaluation, registry, deployment), rather than being a standalone observability product.

## Comparison dimensions

### 1) Scope: end-to-end ML lifecycle vs. observability-only
- **MLflow Tracing**: Observability + tight integration with **experiment tracking**, **LLM evaluation**, and **model/app versioning** (Registry). Good when you want one system of record from dev → eval → deploy.
- **Dedicated LLM observability tools** (e.g., LangSmith, Arize Phoenix, WhyLabs, Helicone, Humanloop, Weights & Biases Weave, etc.): Often go deeper on **production monitoring**, dataset curation, annotation workflows, prompt playgrounds, and collaboration features.

### 2) Framework/vendor integration
- **MLflow Tracing**: Best when you're already using MLflow (especially in MLflow-centric platforms). Instrumentation



Agent: **Tracking** and **Tracing** solve different problems in MLflow:

## MLflow Tracking (Experiment Tracking)
**Purpose:** Record and compare *experiments/versions* of a model or GenAI app.

**What you log:**
- **Parameters** (e.g., temperature, top_p, prompt version, retriever settings)
- **Metrics** (e.g., accuracy, groundedness score, latency aggregates)
- **Artifacts** (prompt templates, evaluation datasets, configs, model/app code)
- **Run metadata** (tags like git SHA, data version)

**Granularity:** Typically **one run = one configuration** (one experiment setting), often evaluated over many examples.

**Primary use cases:**
- Compare prompt/model variants
- Reproduce results
- Track offline evaluation outcomes
- Register/promote a specific version to production

## MLflow Tracing
**Purpose:** Observe *individual executions/requests* of a GenAI app to understand behavior and performance.

**What it records:**
- A **trace** per request (or session), made of **spans** (steps)


---
## Step 10: Session-Level Evaluation with DeepEval

In [20]:
# Search for traces from this session
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)

session_traces = mlflow.search_traces(
    locations=[experiment.experiment_id],
    filter_string=f"metadata.`mlflow.trace.session` = '{conv_agent.session_id}'"
)

print(f"📊 Found {len(session_traces)} traces for session {conv_agent.session_id[:8]}...")

📊 Found 6 traces for session 75f8c0e9...
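
The filter string in the cell above embeds the session ID directly in an f-string. A minimal sketch of a reusable helper follows. `build_session_filter` is a hypothetical name, not part of MLflow, and rejecting IDs that contain quote or backtick characters is an assumption made for illustration rather than a documented MLflow requirement.

```python
def build_session_filter(session_id: str) -> str:
    """Build the search_traces filter string used above for one session.

    Hypothetical helper, not an MLflow API. It refuses IDs containing
    quote or backtick characters instead of trying to escape them.
    """
    if not session_id:
        raise ValueError("session_id must be a non-empty string")
    if any(ch in session_id for ch in "'`\""):
        raise ValueError(f"session_id contains a quote character: {session_id!r}")
    return f"metadata.`mlflow.trace.session` = '{session_id}'"


# Example with a made-up session ID
print(build_session_filter("75f8c0e9-demo"))
```

The returned string could then be passed as `filter_string` to `mlflow.search_traces`, exactly as in the cell above.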


### Run the evaluation

In [21]:
with mlflow.start_run(run_name="DeepEval") as run:
    deepeval_results = mlflow.genai.evaluate(
        data=session_traces,
        scorers=[
            completeness_scorer,
            retention_scorer,
            topic_scorer,
            toxicity_scorer
        ]
    )

print("\n✅ DeepEval session evaluation complete!")
print("\nDeepEval Metrics:")
print("-" * 40)
for metric_name, value in deepeval_results.metrics.items():
    if isinstance(value, float):
        print(f"  {metric_name}: {value:.3f}")
    else:
        print(f"  {metric_name}: {value}")


✅ DeepEval session evaluation complete!

DeepEval Metrics:
----------------------------------------
  Toxicity/mean: 1.000
  ConversationCompleteness/mean: 1.000
  KnowledgeRetention/mean: 1.000
  TopicAdherence/mean: 1.000
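
Once the aggregate metrics are available, one way to act on them is a simple threshold gate. The sketch below works on a plain dictionary shaped like `deepeval_results.metrics`. The function name `find_failing_metrics` and the threshold values are illustrative assumptions, and it assumes that higher scores are better for every metric listed.

```python
from typing import Dict, List


def find_failing_metrics(
    metrics: Dict[str, float], thresholds: Dict[str, float]
) -> List[str]:
    """Return names of metrics that are missing or below their minimum."""
    failing = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failing.append(name)
    return failing


# Values copied from the output above; thresholds are assumptions.
observed = {
    "Toxicity/mean": 1.0,
    "ConversationCompleteness/mean": 1.0,
    "KnowledgeRetention/mean": 1.0,
    "TopicAdherence/mean": 1.0,
}
minimums = {
    "ConversationCompleteness/mean": 0.8,
    "KnowledgeRetention/mean": 0.8,
    "TopicAdherence/mean": 0.9,
}

failing = find_failing_metrics(observed, minimums)
print("All thresholds met" if not failing else f"Below threshold: {failing}")
```

A metric that is absent from the results is treated as failing, which makes a silently dropped scorer visible instead of passing by default.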


---
## Step 11: Viewing Evaluation Results in MLflow UI

In [22]:
print("""
╔══════════════════════════════════════════════════════════════╗
║         Viewing Evaluation Results in MLflow UI              ║
╚══════════════════════════════════════════════════════════════╝

🔍 EXPERIMENTS VIEW:
   Navigate to: http://localhost:5000
   Select: "1-agent-evaluation" experiment
   
   You'll see:
   - Session-level and Evaluation runs with scorer metrics
   - Pass/fail rates for each scorer
   - Detailed reasoning from LLM judges

📊 EVALUATION RESULTS:
   Each run includes:
   - Metric values (0-1 scores or boolean)
   - Judge reasoning explanations
   - Input/output pairs
   - Artifacts with detailed results

🎯 KEY METRICS TO MONITOR:
   
   Built-in Scorers:
   - relevance_to_query/score: Response relevance
   - correctness/score: Factual accuracy
   - safety/score: Safety compliance
   - guidelines/score: Guideline adherence
   
   Custom Scorers:
   - response_length_check: Length validation
   - contains_keywords: Keyword presence
   - no_hallucination_markers: Confidence check
   
   DeepEval Scorers:
   - ConversationCompleteness/mean: Goal achievement
   - KnowledgeRetention/mean: Context memory
   - TopicAdherence/mean: Topic focus
   - Toxicity/mean: Toxicity check

💡 TIPS:
   1. Compare multiple evaluation runs
   2. Filter by scorer to find failures
   3. Read judge reasoning for insights
   4. Track metrics over time for regression detection

""")



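Tips 1 and 4 suggest comparing runs and tracking metrics over time. A minimal sketch of such a comparison on two plain metric dictionaries follows. The function name `metric_regressions`, the default `tolerance`, and all numbers are made up for illustration. It assumes higher scores are better and skips metrics that the candidate run did not produce.

```python
from typing import Dict


def metric_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> Dict[str, float]:
    """Return {metric: drop} where candidate fell more than `tolerance` below baseline."""
    regressions = {}
    for name, base_value in baseline.items():
        if name not in candidate:
            continue  # metric absent from the candidate run; skipped here
        drop = base_value - candidate[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Made-up numbers for illustration only
baseline_run = {"TopicAdherence/mean": 1.0, "KnowledgeRetention/mean": 0.95}
candidate_run = {"TopicAdherence/mean": 0.8, "KnowledgeRetention/mean": 0.94}

print(metric_regressions(baseline_run, candidate_run))
```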
---
## Summary

In this notebook, you learned how to evaluate agents and LLM applications using MLflow's evaluation framework.

### ✅ What You Learned

**Built-in Scorers:**
- `RelevanceToQuery` - Check if responses are relevant
- `Correctness` - Verify factual accuracy
- `Guidelines` - Enforce custom guidelines
- `Safety` - Ensure safe outputs

**Custom Scorers:**
- Use `@scorer` decorator for custom logic
- Access `outputs` and `expectations` in scorers
- Return boolean or numeric scores

**DeepEval Integration:**
- `ConversationCompleteness` - Goal achievement
- `KnowledgeRetention` - Context memory
- `TopicAdherence` - Topic focus
- `Toxicity` - Toxicity check

**Session-Level Evaluation:**
- Tag traces with session IDs
- Search traces by session
- Evaluate multi-turn conversations

### 🔑 Key Patterns

```python
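import mlflow

# Built-in scorer
from mlflow.genai.scorers import RelevanceToQuery
relevance_scorer = RelevanceToQuery(model="openai:/gpt-4o-mini")

# Custom scorer (the decorator is imported under its own name,
# so it no longer shadows the built-in scorer instance above)
from mlflow.genai.scorers import scorer
@scorer
def my_scorer(outputs) -> bool:
    return len(str(outputs)) > 0  # replace with your own condition

# Run evaluation on the traces found for the session
results = mlflow.genai.evaluate(
    data=session_traces,
    scorers=[relevance_scorer, my_scorer]
)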
```
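
The `my_scorer` pattern leaves the condition abstract. As a sketch, the keyword check listed in Step 11 (`contains_keywords`) might reduce to logic like the following. The function body, the `required_keywords` key, and the sample strings are assumptions for illustration, so the notebook's actual scorer may differ. The function is shown undecorated so that it runs without MLflow; in the notebook it would carry the `@scorer` decorator.

```python
from typing import Dict, List


def contains_keywords(outputs: str, expectations: Dict[str, List[str]]) -> bool:
    """Return True when every expected keyword appears in the output (case-insensitive)."""
    # "required_keywords" is a hypothetical expectations key chosen for this sketch.
    keywords = expectations.get("required_keywords", [])
    text = outputs.lower()
    return all(keyword.lower() in text for keyword in keywords)


# Illustrative call with made-up data
print(contains_keywords(
    "MLflow Tracing records spans for each request.",
    {"required_keywords": ["tracing", "spans"]},
))
```

Returning a plain `bool` matches the "Return boolean or numeric scores" guidance in the summary. With no keywords configured, the check passes trivially.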

### 📚 Next Steps

Continue to **Tutorial 1.8: Complete RAG Application** to build a full RAG system with RAGAS evaluation metrics.