# Production Monitoring: Automated Quality at Scale

MLflow's production monitoring automatically runs quality assessments on a sample of your production traffic, so your GenAI app's quality is tracked without manual intervention. MLflow lets you use the **same** metrics in production that you defined for offline evaluation, which keeps quality evaluation consistent across the entire application lifecycle, from development to production.

Key benefits:

- **Automated evaluation** - Run LLM judges on production traces with configurable sampling rates
- **Continuous quality assessment** - Monitor quality metrics on live traffic without disrupting user experience
- **Cost-effective monitoring** - Smart sampling strategies to balance coverage with computational cost

Production monitoring lets you deploy with confidence: you detect quality issues proactively and can address them before they have a major impact on your users.

![monitoring-overview](https://i.imgur.com/wv4p562.gif)

This notebook demonstrates how to take the quality metrics you created in development and turn them into production monitoring with MLflow's new scorer registration system. You'll learn how the same scorers that work offline for evaluation automatically become online monitoring - no rebuilding required.


## Install packages (only required if running in a Databricks Notebook)

In [None]:
%pip install -U -r ../../requirements.txt
dbutils.library.restartPython()

## Environment Setup

Load environment variables and verify MLflow configuration.


In [None]:
import sys
sys.path.append('../')
sys.path.append('../../')

import os
from dotenv import load_dotenv
import mlflow
from mlflow_demo.utils import *

if mlflow.utils.databricks_utils.is_in_databricks_notebook():
  print("Running in Databricks Notebook")
  setup_databricks_notebook_env()
else:
  print("Running in Local IDE")
  setup_local_ide_env()

# Verify key variables are loaded
print('=== Environment Setup ===')
print(f'DATABRICKS_HOST: {os.getenv("DATABRICKS_HOST")}')
print(f'MLFLOW_EXPERIMENT_ID: {os.getenv("MLFLOW_EXPERIMENT_ID")}')
print(f'LLM_MODEL: {os.getenv("LLM_MODEL")}')
print(f'UC_CATALOG: {os.getenv("UC_CATALOG")}')
print(f'UC_SCHEMA: {os.getenv("UC_SCHEMA")}')
print('✅ Environment variables loaded successfully!')

import logging
logging.getLogger("urllib3").setLevel(logging.ERROR)
logging.getLogger("mlflow").setLevel(logging.ERROR)

In [None]:
# Get helper functions for showing links to generated traces
from mlflow_demo.utils import generate_trace_links

# 🔄 From Offline Evaluation to Online Monitoring

The key insight of MLflow's production monitoring is simple: **the same scorers you use for offline evaluation automatically become your online monitoring**. No need to rebuild, reconfigure, or rewrite your quality assessment logic.

## The "Same Metrics Everywhere" Approach

- **Development**: Use Guidelines, Safety, and custom scorers to evaluate quality
- **Testing**: Run the same scorers on evaluation datasets
- **Production**: Register those exact scorers for automated monitoring
- **Analysis**: Compare quality using consistent metrics across all stages

## From Notebook 2 to Production

In this tutorial, we'll take the scorers we created in **Notebook 2: Create Quality Metrics** and turn them into production monitoring in just a few lines of code.

**📚 Documentation**

- [**Run Scorers in Production**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/run-scorer-in-prod) - Production monitoring setup

**▶️ Run the next cells to recreate and register your quality metrics for production monitoring!**


# 📝 Step 1: Register Quality Scorers for Monitoring

Let's recreate the exact same scorers from Notebook 2 and register them for production monitoring. This demonstrates how your development metrics seamlessly become production monitoring.

**▶️ Run the next cells to register your scorers!**


In [None]:
from mlflow.genai.scorers import scorer
from mlflow.genai.scorers import Guidelines

# Copy / paste the exact same scorers from Notebook 2: Create Quality Metrics

# Tone of voice Guideline - Ensure professional tone
tone = Guidelines(
  name='tone',
  guidelines="""The response maintains a professional tone.""")

# Accuracy Guideline - Verify all facts come from provided data
@scorer
def accuracy(trace):
    """
    Custom accuracy scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the agent's JSON response and pull the retrieved facts from the first retriever span
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    accuracy_guideline = """The email_body correctly references all factual information from the provided_info based on these rules:
- All factual information must be directly sourced from the provided data with NO fabrication
- Names, dates, numbers, and company details must be 100% accurate with no errors
- Meeting discussions must be summarized with the exact same sentiment and priority as presented in the data
- Support ticket information must include correct ticket IDs, status, and resolution details when available
- All product usage statistics must be presented with the same metrics provided in the data
- No references to CloudFlow features, services, or offerings unless specifically mentioned in the customer data
- AUTOMATIC FAIL if any information is mentioned that is not explicitly provided in the data
- It is OK if the email_body follows the user_input request to omit certain facts, as long as no fabricated facts are introduced."""

    # Use the proven Guidelines judge with our extracted email body
    return meets_guidelines(guidelines=accuracy_guideline, context={'provided_info': input_facts, 'email': email_body, 'user_input': user_input})

# Personalization Guideline - Ensure emails are tailored to specific customers
@scorer
def personalized(trace):
    """
    Custom personalization scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the agent's JSON response and pull the retrieved facts from the first retriever span
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    personalized_guideline = """The email_body demonstrates clear personalization based on the provided_info based on these rules:
- Email must begin by referencing the most recent meeting/interaction
- Immediately next, the email must address the customer's MOST pressing concern as evidenced in the data
- Content structure must be customized based on the account's health status (critical issues first for "Fair" or "Poor" accounts)
- Industry-specific language must be used that reflects the customer's sector
- Recommendations must ONLY reference features that are:
  a) Listed as "least_used_features" in the data, AND
  b) Directly related to the "potential_opportunity" field
- Relationship history must be acknowledged (new vs. mature relationship)
- Deal stage must influence communication approach (implementation vs. renewal vs. growth)
- AUTOMATIC FAIL if recommendations could be copied to another customer in a different situation"""

    # Use the proven Guidelines judge with our extracted email body
    return meets_guidelines(guidelines=personalized_guideline, context={'provided_info': input_facts, 'email': email_body, 'user_input': user_input})

# Relevance Guideline - Prioritize content by urgency
@scorer
def relevance(trace):
    """
    Custom relevance scorer that evaluates only the email body content,
    excluding the subject line to avoid false negatives on creative/generic subjects.

    This demonstrates how to wrap the proven Guidelines judge with custom data extraction.
    """
    import json
    from mlflow.genai.judges import meets_guidelines
    # Parse the agent's JSON response and pull the retrieved facts from the first retriever span
    outputs = json.loads(trace.data.response)
    email_body = outputs.get('email_body')
    user_input = outputs.get('user_input')
    input_facts = trace.search_spans(span_type="RETRIEVER")[0].outputs

    relevance_guideline = """The email_body prioritizes content that matters to the recipient in the provided_info based on these rules:
- Critical support tickets (status="Open (Critical)") must be addressed after the greeting, reference to the most recent interaction, any pleasantries, and references to closed tickets
- Time-sensitive action items must be addressed before general updates
- Content must be ordered by descending urgency as defined by:
  1. Critical support issues
  2. Action items explicitly stated in most recent meeting
  3. Upcoming renewal if within 30 days
  4. Recently resolved issues
  5. Usage trends and recommendations
- No more than ONE feature recommendation for accounts with open critical issues
- No mentions of company news, product releases, or success stories not directly requested by the customer
- No calls to action unrelated to the immediate needs in the data
- AUTOMATIC FAIL if the email requests a meeting without being tied to a specific action item or opportunity in the data"""

    # Use the proven Guidelines judge with our extracted email body
    return meets_guidelines(guidelines=relevance_guideline, context={'provided_info': input_facts, 'email': email_body, 'user_input': user_input})
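
All three custom scorers above repeat the same extraction step, and each assumes the response is valid JSON and that at least one retriever span exists. The following is a minimal sketch of how that step could be factored out and made defensive. `extract_email_fields` is a hypothetical helper name, not part of MLflow, and `fake_trace` is a stand-in object used only so the example runs without a real trace.

```python
import json
from types import SimpleNamespace


def extract_email_fields(trace):
    """Return (email_body, user_input, input_facts), or None if the trace lacks them.

    Hypothetical helper that mirrors the extraction done inside each scorer above.
    """
    try:
        outputs = json.loads(trace.data.response)
    except (TypeError, json.JSONDecodeError):
        return None  # response was missing or not valid JSON
    if not isinstance(outputs, dict):
        return None  # response parsed, but is not a JSON object

    retriever_spans = trace.search_spans(span_type="RETRIEVER")
    if not retriever_spans:
        return None  # no retrieval step was traced

    return outputs.get("email_body"), outputs.get("user_input"), retriever_spans[0].outputs


# Stand-in trace for illustration only; in a scorer, MLflow passes the real trace.
fake_trace = SimpleNamespace(
    data=SimpleNamespace(response=json.dumps({"email_body": "Hi Ana", "user_input": "Be brief"})),
    search_spans=lambda span_type: [SimpleNamespace(outputs={"account": "EcomSolutions LLC"})],
)
print(extract_email_fields(fake_trace))
```

How a scorer should react to a `None` result (skip the trace, or report a failure) is a design choice this sketch leaves open.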



First, we delete all existing registered scorers, because the demo by default registers and starts monitoring with the same scorers that we register below.

In [None]:
from mlflow.genai.scorers import list_scorers, delete_scorer

registered_scorers = list_scorers()
for s in registered_scorers:
    print(f'Deleting existing registered scorer: {s.name}')
    delete_scorer(name=s.name)

Now, we will register and start the scorers.

In [None]:
from mlflow.genai.scorers import ScorerSamplingConfig

print('🚀 Registering and starting production monitoring scorers...')

tone.register()
tone.start(sampling_config=ScorerSamplingConfig(sample_rate=1)) # run on 100% of production traces
print('✅ Tone scorer registered and started!')

accuracy.register()
accuracy.start(sampling_config=ScorerSamplingConfig(sample_rate=1)) # run on 100% of production traces
print('✅ Accuracy scorer registered and started!')

personalized.register()
personalized.start(sampling_config=ScorerSamplingConfig(sample_rate=1)) # run on 100% of production traces
print('✅ Personalized scorer registered and started!')

relevance.register()
relevance.start(sampling_config=ScorerSamplingConfig(sample_rate=1)) # run on 100% of production traces
print('✅ Relevance scorer registered and started!')

print('🎯 All production monitoring scorers are now active!')
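
The cell above uses `sample_rate=1`, which scores every trace. At higher traffic volumes you may want lower rates for the more expensive scorers. The sketch below is a rough back-of-the-envelope estimate only: it assumes one judge call per scorer per sampled trace, and both the traffic volume and the rates are made-up illustrative numbers. `estimate_daily_judge_calls` is a hypothetical helper, not an MLflow function.

```python
def estimate_daily_judge_calls(traces_per_day, sample_rates):
    """Estimate judge calls per day, assuming one call per scorer per sampled trace."""
    for name, rate in sample_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"sample_rate for {name!r} must be between 0 and 1, got {rate}")
    return {name: round(traces_per_day * rate) for name, rate in sample_rates.items()}


# Hypothetical traffic volume and per-scorer sampling rates
rates = {"tone": 1.0, "accuracy": 0.5, "personalized": 0.25, "relevance": 0.25}
calls = estimate_daily_judge_calls(10_000, rates)
print(calls, "total:", sum(calls.values()))
```

With these numbers, sampling reduces the daily total from 40,000 judge calls (all four scorers at 100%) to 20,000.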

# 🔍 Step 2: View Monitoring Results in MLflow UI

Let's issue a few queries so we can see the monitoring results in the MLflow Trace UI. Monitoring runs roughly every 15 minutes, so wait about 20 minutes after running these queries before checking the UI.


In [None]:
from mlflow_demo.agent.email_generator import EmailGenerator

email_agent = EmailGenerator()


result = email_agent.generate_email_with_retrieval("EcomSolutions LLC", user_input="Include a product recommendation to improve their usage.")
generate_trace_links(result['trace_id']);

result = email_agent.generate_email_with_retrieval("LogiTrans Solutions", user_input="Talk in all caps!")
generate_trace_links(result['trace_id']);
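
Because monitoring results arrive with a delay, you may prefer to poll rather than check the UI by hand. The following is a generic polling sketch, not an MLflow API: `wait_for` is a hypothetical helper, and the `check` callable is something you would supply yourself, for example a function that looks up the trace and reports whether any assessments have been attached (that lookup is an assumption and is not shown here).

```python
import time


def wait_for(check, timeout_s, interval_s, sleep=time.sleep, clock=time.monotonic):
    """Call check() every interval_s seconds until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Illustration with a fake check that succeeds on its third call.
attempts = []


def fake_check():
    attempts.append(1)
    return len(attempts) >= 3


print(wait_for(fake_check, timeout_s=5, interval_s=0, sleep=lambda _: None))
```

For the delay described above, a call such as `wait_for(my_check, timeout_s=20 * 60, interval_s=60)` would poll once a minute for up to 20 minutes, where `my_check` is your own function.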



# 🎯 Summary and Next Steps

Congratulations! You've successfully implemented production monitoring for GenAI applications using MLflow 3.x.

## What You've Accomplished

✅ **Connected offline to online** - Used the same scorers from development in production monitoring  
✅ **Registered quality scorers** - Set up 4 scorers (tone, accuracy, personalized, relevance) for automated monitoring  

**📚 Continue Learning**

- [Production Monitoring Guide](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/run-scorer-in-prod) - Complete setup documentation
- [MLflow GenAI Evaluation](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/) - Full evaluation and monitoring workflow
