# MLflow 3
MLflow 3.0 brings traditional machine learning, deep learning, and generative AI development into a single platform, so you no longer need separate specialized tools for each. Its new GenAI capabilities include production-scale tracing, a redesigned quality evaluation experience, feedback collection APIs and UI, and version tracking for both prompts and applications. Together they support a complete GenAI development workflow: debug with tracing, measure quality using LLM judges, improve quality with expert feedback, track changes with versioning, and monitor your app in production, all illustrated with an e-commerce chatbot use case. In addition, the SDK has been simplified, the UI refreshed, and Agent Evaluation is now integrated into the Databricks-hosted MLflow 3.0 experience. These capabilities work regardless of where your app is deployed.

- 🔍 GenAI Observability at Scale: Monitor & debug GenAI apps anywhere - deployed on Databricks or ANY cloud - with production-scale real-time tracing and enhanced UIs. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/)
- 📊 Revamped GenAI Evaluation: Evaluate app quality using a brand-new SDK, a simpler evaluation interface, and a refreshed UI. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/)
- ⚙️ Customizable Evaluation: Tailor AI judges or custom metrics to your use case. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/)
- 👀 Monitoring: Schedule automatic quality evaluations (beta). [Link](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/run-scorer-in-prod)
- 🧪 Leverage Production Logs to Improve Quality: Turn real user traces into curated, versioned evaluation datasets to continuously improve app performance. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset)
- 📝 Close the Loop with Feedback: Capture end-user feedback from your app’s UI. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/collect-user-feedback/)
- 👥 Domain Expert Labeling: Send traces to human experts for ground truth or target output labeling. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces)
- 📁 Prompt Management: Prompt Registry for versioning. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/prompt-registry/create-and-edit-prompts)
- 🧩 App Version Tracking: Link app versions to quality evaluations. [Link](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/version-tracking/track-application-versions-with-mlflow)

🚀 To give MLflow 3.0 a try:

- MLflow 3 is enabled in your workspace. Run `pip install "mlflow[databricks]>=3.1"` to use it (the quotes keep the shell from treating `>` as a redirect).
- View the [documentation](https://docs.databricks.com/aws/en/mlflow3/genai/) or try the [quickstarts](https://docs.databricks.com/aws/en/mlflow3/genai/getting-started/)
- Migrate from Agent Evaluation or MLflow 2.0 by following the [migration guide](https://docs.databricks.com/aws/en/mlflow3/genai/agent-eval-migration). MLflow 3 is designed to minimize breaking changes.
- Read our [blog post](https://www.databricks.com/blog/mlflow-30-unified-ai-experimentation-observability-and-governance) or view our [website](https://www.managed-mlflow.com/genai)
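
After installing, a quick check can confirm that the environment actually picked up a 3.1 or later release. The snippet below is a minimal sketch using only the Python standard library; the helper names `parse_major_minor` and `mlflow_is_recent_enough` are made up for this example and are not part of MLflow.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_major_minor(version_string):
    """Return (major, minor) from a version string such as '3.1.4' or '3.2.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def mlflow_is_recent_enough(minimum=(3, 1)):
    """Return True if an installed mlflow distribution is at least `minimum`."""
    try:
        installed = version("mlflow")
    except PackageNotFoundError:
        return False  # mlflow is not installed in this environment
    return parse_major_minor(installed) >= minimum


if __name__ == "__main__":
    print("MLflow 3.1+ available:", mlflow_is_recent_enough())
```

In a notebook, restart the Python process after installing so that the new version is the one that gets imported.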

# MLflow 3 UI with the Knowledge Assistant Agent
Your tech support Knowledge Assistant agent is a convenient way to walk through the full feature set and workflow of the MLflow 3 user interface. The sections below use your deployed agent as a real-world example to illustrate each MLflow 3 UI capability.

## 1. Traces
Every question submitted to your Knowledge Assistant, whether about device issues or billing policies, automatically generates a detailed trace in MLflow 3. Each trace captures:
- The user input/query
- Construction of prompts/instructions
- Retrieval steps from various knowledge sources
- The final, delivered response
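
The Knowledge Assistant records these traces for you, so no code is required. To make the four elements above concrete, here is a minimal sketch of how a hand-written pipeline could produce a similar trace with MLflow's `mlflow.trace` decorator. The toy knowledge base and the functions `retrieve_docs`, `build_prompt`, and `answer_question` are hypothetical and only stand in for the agent's internals; the fallback decorator exists purely so the sketch also runs where `mlflow` is not installed.

```python
try:
    import mlflow

    trace = mlflow.trace  # real decorator: records one span per call
except ImportError:
    # Fallback so the sketch still runs without mlflow (no spans are recorded).
    def trace(func=None, **_options):
        return func if func is not None else (lambda f: f)


# Hypothetical stand-in for the agent's knowledge sources.
KNOWLEDGE_BASE = {
    "billing": "Invoices are issued on the first day of each month.",
    "device": "Hold the power button for ten seconds to restart the device.",
}


@trace(name="retrieve_docs", span_type="RETRIEVER")
def retrieve_docs(question):
    """Retrieval step: return documents whose topic appears in the question."""
    return [
        {"page_content": text, "metadata": {"topic": topic}}
        for topic, text in KNOWLEDGE_BASE.items()
        if topic in question.lower()
    ]


@trace(name="build_prompt")
def build_prompt(question, docs):
    """Prompt construction step: combine retrieved context with the question."""
    context = "\n".join(doc["page_content"] for doc in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


@trace(name="answer_question", span_type="AGENT")
def answer_question(question):
    """Root span: user input in, final response out."""
    docs = retrieve_docs(question)
    prompt = build_prompt(question, docs)
    # A real agent would send `prompt` to an LLM; this stub returns the top document.
    return docs[0]["page_content"] if docs else "I could not find an answer."


if __name__ == "__main__":
    print(answer_question("How do I restart my device?"))
```

With `mlflow` installed, each call to `answer_question` should yield one trace whose nested spans correspond to the retrieval and prompt-construction steps.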

Click the MLflow experiment created by the agent to open the Experiment interface. The first tab displayed is Traces.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_Traces.png width="60%">

### Monitoring Dashboard Overview
Once monitoring is enabled for your Knowledge Assistant agent, you'll see concise evaluation metrics for each response:

- Precision: How accurately the answer reflects facts from your knowledge base.
- Groundedness: Whether the response is directly supported by your uploaded sources.
- Relevance to Query: How well the response addresses the user's exact question.
- Chunk Relevance: Whether the content retrieved is most pertinent to the query.

To enable this feature, navigate to the Monitoring tab and select a Unity Catalog schema where traces will be archived.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_monitoring_setup.png width="60%">

Next, choose the AI judges and then update the metrics.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_monitoring_setup_2.png width="60%">

You can also define and track personalized metrics tailored to your specific business needs or quality standards. This flexibility ensures the dashboard reflects what matters most for your use case, helping you quickly identify strengths and improvement areas for your agent.

Note: To start capturing evaluation metrics, you need to update the agent’s configuration to enable evaluation.
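
Custom metrics such as those mentioned above can also be expressed in code. The following is a rough sketch, assuming the `scorer` decorator from `mlflow.genai.scorers` in MLflow 3; the business rule (the response should tell the user how to escalate) and the names `ESCALATION_HINTS`, `offers_escalation`, and `escalation_offered` are invented for illustration. The fallback decorator only keeps the sketch runnable where `mlflow` is not installed.

```python
try:
    from mlflow.genai.scorers import scorer  # MLflow 3 custom-scorer decorator
except ImportError:
    # Fallback so the sketch still runs without mlflow.
    def scorer(func):
        return func


# Hypothetical business rule: a good support answer mentions an escalation path.
ESCALATION_HINTS = ("support ticket", "contact support")


def offers_escalation(text):
    """Plain helper holding the metric logic, so it can be unit-tested on its own."""
    lowered = text.lower()
    return any(hint in lowered for hint in ESCALATION_HINTS)


@scorer
def escalation_offered(outputs) -> bool:
    """Custom metric: does the response tell the user how to escalate?"""
    return offers_escalation(str(outputs))
```

A scorer defined this way can be passed in the `scorers` list of `mlflow.genai.evaluate`. Keeping the rule in a plain helper makes it easy to test without any MLflow setup.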

### Charts 

In the Traces tab, you can view an operational metrics dashboard with key statistics, such as response time, tokens generated, and usage patterns, alongside each set of traces. This lets you monitor and analyze agent performance in real time and quickly spot trends, bottlenecks, or anomalies directly from the trace interface.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_Charts.png width="60%">

## 2. Evaluations
The Evaluation tab lets you systematically assess the quality of your GenAI app or agent using curated evaluation datasets. These datasets include real or synthetic inputs, optional ground truth expectations, and relevant metadata to guide meaningful quality measurement.

You can provide evaluation data in two ways:
- MLflow Evaluation Datasets (Recommended): Purpose-built and versioned datasets stored in Unity Catalog, supporting collaborative workflows, lineage tracking, seamless integration, and fine-grained visualization within the MLflow UI.
- Arbitrary Datasets: For quick prototyping, you can use standard data structures like lists of dicts, Pandas DataFrames, or Spark DataFrames.
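
For the quick-prototyping route, a list of dicts is enough. The sketch below assumes MLflow 3's `mlflow.genai.evaluate` together with the built-in `Correctness` and `RelevanceToQuery` scorers; the example rows and the stub `predict_fn` are hypothetical stand-ins for your own data and agent. The built-in judges call an LLM, so `run_quick_evaluation` is meant to be run in a configured Databricks workspace.

```python
# Arbitrary dataset: a plain list of dicts (hypothetical example rows).
eval_records = [
    {
        "inputs": {"question": "How do I reset my router?"},
        "expectations": {"expected_facts": ["Hold the reset button for ten seconds."]},
    },
    {
        "inputs": {"question": "When is my invoice issued?"},
        "expectations": {"expected_facts": ["Invoices are issued monthly."]},
    },
]


def predict_fn(question):
    """Hypothetical stand-in for the agent.

    The keys of each record's `inputs` dict are passed as keyword arguments.
    """
    return f"Stub answer to: {question}"


def run_quick_evaluation(records):
    """Run an evaluation over in-memory records (needs mlflow>=3 and judge access)."""
    # Imported lazily so the data above can be inspected without mlflow installed.
    import mlflow
    from mlflow.genai.scorers import Correctness, RelevanceToQuery

    return mlflow.genai.evaluate(
        data=records,
        predict_fn=predict_fn,
        scorers=[Correctness(), RelevanceToQuery()],
    )
```

The same `eval_records` list could later be moved into an MLflow Evaluation Dataset once the examples are worth versioning.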

### Adding an Evaluation Dataset Through the UI
When creating an evaluation dataset, either through the UI or by passing data directly, you’ll define the following core fields:
- `inputs`: The input(s) for your app (e.g., a user question or context), stored as a JSON-serializable dict.
- `expectations`: Ground truth labels, such as the correct answer or expected facts, stored as a JSON-serializable dict.
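
Because both fields must be JSON-serializable dicts, it can help to check records before uploading them. The helper below is a small sketch using only the standard library; `validate_record` is a hypothetical name, and treating `expectations` as optional is an assumption based on the description of evaluation datasets above.

```python
import json


def validate_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []

    if not isinstance(record.get("inputs"), dict):
        problems.append("`inputs` must be a dict")

    # Assumption: `expectations` is optional, but must be a dict when present.
    if not isinstance(record.get("expectations", {}), dict):
        problems.append("`expectations` must be a dict")

    try:
        json.dumps(record)  # raises if any value cannot be serialized
    except (TypeError, ValueError) as exc:
        problems.append(f"record is not JSON-serializable: {exc}")

    return problems


if __name__ == "__main__":
    good = {
        "inputs": {"question": "How do I reset my router?"},
        "expectations": {"expected_facts": ["Hold the reset button."]},
    }
    print(validate_record(good))
```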

If you completed the labeling exercise from the previous notebook, the labels you created will also appear in the evaluation section.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_evaluation_1.png width="60%">

You can also create a separate dataset to evaluate the model. To do this, you will need to create a schema and a table to store the input data used in the evaluation.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_evaluation_2.png width="60%">


<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_evaluation_3.png width="60%">


This works much like the labeling exercise we did earlier: you can add examples, expectations, and guidelines, and you can also add tags if needed.

<img src=https://raw.githubusercontent.com/chen-data-ai/Agent-Bricks-Workshop/refs/heads/main/resources/screenshots/Screenshot_evaluation_4.png width="60%">
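
As an alternative to the UI steps above, the dataset can likely be created from code as well. This is a tentative sketch that assumes `mlflow.genai.datasets.create_dataset` accepts a `uc_table_name` keyword and that the returned dataset exposes `merge_records`, as documented for MLflow 3.1; the keyword may differ in other releases. The catalog, schema, and table names are placeholders, the schema must already exist in Unity Catalog, and the call only works inside a Databricks workspace.

```python
def qualified_table_name(catalog, schema, table):
    """Build the three-level Unity Catalog name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def create_eval_dataset(catalog, schema, table, records):
    """Create an evaluation dataset table and add records to it.

    Assumption: the `uc_table_name` keyword follows the MLflow 3.1 documentation.
    """
    # Imported lazily: this part requires mlflow[databricks] and workspace access.
    import mlflow.genai.datasets as datasets

    dataset = datasets.create_dataset(
        uc_table_name=qualified_table_name(catalog, schema, table)
    )
    dataset.merge_records(records)  # each record holds `inputs` and optional `expectations`
    return dataset
```

For example, `create_eval_dataset("main", "support", "ka_eval", records)` would target the placeholder table `main.support.ka_eval`.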

