# Hands-On Lab: Building Agent Systems with Databricks

## MLflow Evaluations for GenAI Agents on Databricks
>
> **Goal**: Learn how to evaluate ML and GenAI models using **MLflow Evaluations** in Databricks, understand where results live in the UI, and see how to operationalize evaluation workflows.

## 1. Evaluations Overview
>
> **Goal**: Learn how to evaluate **GenAI / LLM applications** using **MLflow Evaluations** in Databricks, understand how these evaluations differ from classic ML, and where to inspect results in the Databricks UI.


## 2. What Are MLflow Evaluations?

MLflow Evaluations provide a **standardized framework** for evaluating:

- **GenAI / LLM applications** (text quality, relevance, safety)
- **Classic ML models** (classification, regression)

For agentic workflows, evaluations focus less on "accuracy" and more on **judgment-based metrics** that assess output quality.

Key MLflow Agent Evaluation benefits:

- Native support for **LLM-as-a-judge** evaluators
- Row-level qualitative and quantitative results
- Reproducible evaluation artifacts tied to prompts, models, and data

In agent evaluations, the central question shifts from *“Is this prediction correct?”* to *“Is this response good?”*

## 3. Where MLflow Evaluations Fit in a GenAI Workflow

In GenAI applications, MLflow Evaluations typically happen:

1. After prompt, chain, or agent changes
2. Before deploying or updating a serving endpoint
3. Continuously after deployment for quality regression detection

They integrate with:

- MLflow Experiments (prompt + model iteration)
- Unity Catalog Model Registry (governed promotion)
- Databricks Model Serving (endpoint-based evaluation)

![](./_images/mlflow3.jpg)

## 4. Navigating the User Interface

1. MLflow is available on Databricks out of the box.
2. The key components of the MLflow interface are:
   - Experiments
   - Models
   - Serving

Let's first navigate to **Experiments** in the left-hand navigation.

![](./_images/experiments.png)

## 5. Running an Evaluation (High-Level Flow)

At a high level, running an evaluation involves:

- A trained model (or endpoint)
- An evaluation dataset
- An evaluation configuration (metrics, evaluators)

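MLflow handles metric computation and logging automatically. In Databricks, tracing can be turned on with a single command, such as the `autolog()` call for the framework you are using (for example, `mlflow.openai.autolog()` or `mlflow.langchain.autolog()`).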

![](./_images/tracing_log.jpg)

## 6. Evaluating a GenAI / LLM Application

### What Is Being Evaluated?

In GenAI, you are typically evaluating:

- Prompt templates
- Foundation models or fine-tuned LLMs
- Chains, agents, or tools
- End-to-end application outputs

### Inputs to a GenAI Evaluation

- Input prompts or questions
- Optional reference answers
- Model or endpoint outputs
- One or more evaluators (LLM, rule-based, or human)



![](./_images/tracing.jpg)

### Try It: Sample Questions

Let's put this to the test by evaluating our multi-agent supervisor and knowledge assistant further. Below is a list of sample questions to ask:

- Why am I experiencing no service or intermittent connectivity on my device?
- Why are my data speeds slow or my calls dropping, and what can we check to fix this?
- Is there an outage or known network issue in my area affecting my service?
- Could this issue be caused by my device rather than the network, and how can we tell?
- Is there a problem with my activation, SIM/eSIM, or number porting?
- Can you explain these charges on my bill or whether my plan is set up correctly?
- Why does this issue keep happening even after I’ve contacted support before?
- Are there any troubleshooting steps I may have missed that could fix this quickly?
- Does this issue need to be escalated, and what will that process look like?
- Is there a way I could resolve this issue myself in the future without calling support?
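To run these questions as a repeatable evaluation rather than typing them one by one, they can be collected into a row-per-example dataset. The sketch below is a minimal illustration under stated assumptions: the `inputs` / `outputs` / `expectations` key names are modeled on the record layout that `mlflow.genai.evaluate` accepts in MLflow 3 (check the documentation for your MLflow version), and `stub_agent` is a hypothetical stand-in for a call to your deployed agent.

```python
from typing import Callable, Optional

# Three of the sample questions listed above.
SAMPLE_QUESTIONS = [
    "Why am I experiencing no service or intermittent connectivity on my device?",
    "Is there an outage or known network issue in my area affecting my service?",
    "Does this issue need to be escalated, and what will that process look like?",
]


def build_eval_records(
    questions: list[str],
    predict_fn: Callable[[str], str],
    reference_answers: Optional[dict[str, str]] = None,
) -> list[dict]:
    """Pair each question with the agent's answer, one record per question."""
    reference_answers = reference_answers or {}
    records = []
    for question in questions:
        record = {
            "inputs": {"question": question},
            "outputs": predict_fn(question),
        }
        # Reference answers are optional; attach one only when it exists.
        if question in reference_answers:
            record["expectations"] = {
                "expected_response": reference_answers[question]
            }
        records.append(record)
    return records


def stub_agent(question: str) -> str:
    """Hypothetical placeholder; replace with a call to your agent endpoint."""
    return f"[stub answer] {question}"


records = build_eval_records(SAMPLE_QUESTIONS, stub_agent)
print(len(records), "records built")
```

Keeping the dataset as plain records makes it easy to version alongside prompts and to rerun after every agent change.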

## 7. Exploring Evaluation Results in the UI

After the evaluation completes:

1. Navigate to the **Experiment Runs** page
2. Select the run associated with the Knowledge Assistant evaluation
3. Review:
   - Logs/Traces
   - Scorers
   - Prompts


## 8. GenAI Evaluators Explained

MLflow supports multiple evaluator types for GenAI:

### Automated LLM-as-a-Judge Evaluators

- Answer relevance
- Correctness (vs reference)
- Faithfulness / groundedness

### Safety & Policy Evaluators

- Toxicity
- Bias
- Harmful content

### Custom & Rule-Based Evaluators

- Regex or keyword checks
- Business logic validations
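Rule-based checks are the simplest evaluators to write because they are ordinary functions over the question and the answer. The following is a minimal sketch, not the lab's own evaluators: the ticket-ID pattern, the keyword list, and the function names are all hypothetical examples of a regex check and a business-logic validation.

```python
import re

# Hypothetical policy: answers must not leak internal ticket IDs like "INC-12345".
INTERNAL_TICKET_PATTERN = re.compile(r"\bINC-\d{4,}\b")

# Hypothetical business rule: escalation answers should name a next step.
ESCALATION_KEYWORDS = ("escalate", "supervisor", "tier 2")


def no_internal_ticket_ids(output: str) -> bool:
    """Regex check: pass when the answer contains no internal ticket ID."""
    return INTERNAL_TICKET_PATTERN.search(output) is None


def mentions_escalation_path(question: str, output: str) -> bool:
    """Business-logic check that only applies to escalation questions."""
    if "escalat" not in question.lower():
        return True  # Rule does not apply, so it cannot fail.
    lowered = output.lower()
    return any(keyword in lowered for keyword in ESCALATION_KEYWORDS)


def run_rule_checks(question: str, output: str) -> dict[str, bool]:
    """Run every rule and return a pass/fail flag per rule name."""
    return {
        "no_internal_ticket_ids": no_internal_ticket_ids(output),
        "mentions_escalation_path": mentions_escalation_path(question, output),
    }


result = run_rule_checks(
    "Does this issue need to be escalated, and what will that process look like?",
    "Yes. I will escalate this to a supervisor under ticket INC-48213.",
)
print(result)
# {'no_internal_ticket_ids': False, 'mentions_escalation_path': True}
```

Functions like these are deterministic and cheap, so they pair well with LLM judges: the rules catch hard policy violations while the judges assess quality. MLflow 3 provides a `scorer` decorator in `mlflow.genai.scorers` for registering custom checks; consult its documentation for the exact function signature it expects.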

## 9. Using GenAI Evaluations for Governance & Promotion

GenAI evaluations are critical for:

- Justifying prompt or model changes
- Preventing silent quality regressions
- Enabling governed promotion to Production

UI flow:

1. Register the model or GenAI pipeline from the run
2. View evaluation artifacts directly on the model version
3. Use metrics and qualitative examples during promotion reviews
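A promotion review can be backed by an automated gate that compares a candidate's evaluation metrics against the current baseline. The sketch below is one possible shape for such a gate, under assumptions: the metric names, floors, scores, and the `max_regression` tolerance are illustrative values, not figures from this lab.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reasons: list[str]


def promotion_gate(
    baseline: dict[str, float],
    candidate: dict[str, float],
    min_scores: dict[str, float],
    max_regression: float = 0.02,
) -> GateResult:
    """Fail if any metric is below its floor or regresses beyond the tolerance."""
    reasons = []
    for metric, floor in min_scores.items():
        score = candidate.get(metric)
        if score is None:
            reasons.append(f"{metric}: missing from candidate results")
            continue
        if score < floor:
            reasons.append(f"{metric}: {score:.2f} is below the floor of {floor:.2f}")
        previous = baseline.get(metric)
        if previous is not None and previous - score > max_regression:
            reasons.append(f"{metric}: regressed from {previous:.2f} to {score:.2f}")
    return GateResult(passed=not reasons, reasons=reasons)


# Illustrative numbers only.
baseline = {"relevance": 0.91, "groundedness": 0.88, "safety": 1.00}
candidate = {"relevance": 0.93, "groundedness": 0.81, "safety": 1.00}
min_scores = {"relevance": 0.85, "groundedness": 0.80, "safety": 0.99}

gate = promotion_gate(baseline, candidate, min_scores)
print(gate.passed)   # False
print(gate.reasons)  # ['groundedness: regressed from 0.88 to 0.81']
```

Note that the candidate clears every absolute floor yet still fails, because groundedness dropped relative to the baseline. Checking both conditions is what catches silent quality regressions.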

## 10. Best Practices Summary

**Standardize Evaluations**
- Use consistent datasets and metrics

**Log Early and Often**
- Treat evaluations as first-class artifacts

**Automate**
- Include evaluations in CI/CD or training pipelines

**Use Human Review Strategically**
- Especially for GenAI edge cases


## 11. Common GenAI Evaluation Pitfalls

- Relying on a single LLM judge
- Not versioning prompts or evaluation data
- Evaluating only averages instead of per-row failures
- Ignoring qualitative review for edge cases
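The averages pitfall is easy to see with a small example. In the sketch below, the questions are abridged from the sample list earlier in this lab, and the relevance scores and the 0.5 threshold are made up for illustration.

```python
from statistics import mean

# Illustrative per-row scores; not real evaluation output.
rows = [
    {"question": "Why are my data speeds slow?", "relevance": 0.95},
    {"question": "Is there an outage in my area?", "relevance": 0.92},
    {"question": "Can you explain these charges on my bill?", "relevance": 0.20},
    {"question": "Is there a problem with my SIM/eSIM?", "relevance": 0.94},
]


def failing_rows(rows: list[dict], metric: str, threshold: float) -> list[dict]:
    """Return every row whose score for `metric` falls below `threshold`."""
    return [row for row in rows if row[metric] < threshold]


average = mean(row["relevance"] for row in rows)
failures = failing_rows(rows, "relevance", threshold=0.5)

print(f"average relevance: {average:.2f}")
for row in failures:
    print(f"FAILED ({row['relevance']:.2f}): {row['question']}")
```

The average of roughly 0.75 looks acceptable, but it hides a billing question that the agent answered badly. Reviewing the failing rows individually is what surfaces that gap.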

## 12. Wrap-Up & Next Steps

In this workshop, we covered:

- How MLflow Evaluations support GenAI workflows
- How GenAI evaluation differs from classic ML
- Where to inspect qualitative and quantitative results in the UI
- How to use evaluations for governance and deployment decisions

**Next steps**:

- Add GenAI evaluations to prompt iteration loops
- Combine automated and human review
- Use evaluation artifacts to gate production changes