

# Production Monitoring: Automated Quality at Scale

MLflow's production monitoring automatically runs quality assessments on a sample of your production traffic, so you can check that your GenAI app keeps meeting your quality bar without manual review. You can reuse the same metrics you defined for offline evaluation, which keeps quality evaluation consistent across the entire application lifecycle, from development to production.

**Key benefits:**

- Automated evaluation - Run LLM judges on production traces with configurable sampling rates
- Continuous quality assessment - Monitor quality metrics in real time without disrupting the user experience
- Cost-effective monitoring - Tune the sampling rate to balance coverage against computational cost

Production monitoring lets you deploy with confidence: you detect quality issues proactively and can address them before they significantly affect your users.
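To make the idea of a configurable sampling rate concrete, here is a minimal sketch of one common way to pick a stable fraction of traces for evaluation. It is an illustration only, not MLflow's or Databricks' implementation: the `should_sample` helper and the `trace-N` IDs are hypothetical, and hashing the trace ID is just one possible strategy.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Return True if this trace falls inside the sampled fraction.

    Hypothetical helper: hashing the ID makes the decision deterministic,
    so the same trace is always either sampled or skipped.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # Map the trace ID to a stable number in [0, 1) using the first 4 hash bytes.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


# Made-up trace IDs, used only to show the effect of a 10% sampling rate.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_sample(t, 0.1)]
print(f"Sampled {len(sampled)} of {len(trace_ids)} traces")
```

A rate of `1.0` selects every trace and `0.0` selects none; values in between trade coverage for fewer judge evaluations.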

<img src="https://i.imgur.com/wv4p562.gif">

In [0]:
from databricks.agents.monitoring import (
  AssessmentsSuiteConfig,
  GuidelinesJudge,
  create_external_monitor,
)
# The built-in judges used below are not imported in the original cell; this
# assumes they are the MLflow 3 scorer classes of the same names.
from mlflow.genai.scorers import RelevanceToQuery, RetrievalGroundedness, RetrievalRelevance, Safety

# Create external monitor for automated production monitoring
external_monitor = create_external_monitor(
  # UC_CATALOG and UC_SCHEMA are assumed to be defined earlier in the notebook.
  # Point them at a Unity Catalog schema where you have CREATE TABLE permissions.
  catalog_name=UC_CATALOG,
  schema_name=UC_SCHEMA,
  assessments_config=AssessmentsSuiteConfig(
    sample=1.0,  # Fraction of traces to evaluate: 1.0 scores every trace; lower it to reduce judge cost
    assessments=[
        # Builtin judges
        RelevanceToQuery(),
        RetrievalGroundedness(),
        RetrievalRelevance(),
        Safety(),
        # Guidelines can refer to the request and response.
        GuidelinesJudge(guidelines={
            'accuracy': [
                """The response correctly references all factual information from the provided_info based on these rules:
- All factual information must be directly sourced from the provided data with NO fabrication
- Names, dates, numbers, and company details must be 100% accurate with no errors
- Meeting discussions must be summarized with the exact same sentiment and priority as presented in the data
- Support ticket information must include correct ticket IDs, status, and resolution details when available
- All product usage statistics must be presented with the same metrics provided in the data
- No references to CloudFlow features, services, or offerings unless specifically mentioned in the customer data
- AUTOMATIC FAIL if any information is mentioned that is not explicitly provided in the data"""
            ],
            'personalized': [
                """The response demonstrates clear personalization based on the provided_info based on these rules:
- Email must begin by referencing the most recent meeting/interaction
- Immediately next, the email must address the customer's MOST pressing concern as evidenced in the data
- Content structure must be customized based on the account's health status (critical issues first for "Fair" or "Poor" accounts)
- Industry-specific language must be used that reflects the customer's sector
- Recommendations must ONLY reference features that are:
  a) Listed as "least_used_features" in the data, AND
  b) Directly related to the "potential_opportunity" field
- Relationship history must be acknowledged (new vs. mature relationship)
- Deal stage must influence communication approach (implementation vs. renewal vs. growth)
- AUTOMATIC FAIL if recommendations could be copied to another customer in a different situation"""
            ],
            'relevance': [
                """The response prioritizes content that matters to the recipient in the provided_info based on these rules:
- Critical support tickets (status="Open (Critical)") must be addressed after the greeting, reference to the most recent interaction, any pleasantries, and references to closed tickets
    - it is ok if the name is slightly different as long as it is clearly the same issue as in the provided_info
- Time-sensitive action items must be addressed before general updates
- Content must be ordered by descending urgency as defined by:
  1. Critical support issues
  2. Action items explicitly stated in most recent meeting
  3. Upcoming renewal if within 30 days
  4. Recently resolved issues
  5. Usage trends and recommendations
- No more than ONE feature recommendation for accounts with open critical issues
- No mentions of company news, product releases, or success stories not directly requested by the customer
- No calls to action unrelated to the immediate needs in the data
- AUTOMATIC FAIL if the email requests a meeting without being tied to a specific action item or opportunity in the data"""
            ],
        }),
    ],
  ),
)

print(f"Monitor created: {external_monitor}")
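The monitor above uses `sample=1.0`, so every production trace is scored by every assessment. As a rough way to reason about the coverage-versus-cost trade-off, the sketch below estimates how many judge evaluations run per day at different sampling rates. It rests on assumptions not stated in this notebook: that each assessment costs one evaluation per sampled trace, that the three guideline sets count as three assessments, and that traffic is a made-up 50,000 traces per day. The `estimated_judge_calls` helper is hypothetical.

```python
def estimated_judge_calls(traces_per_day: int, sample_rate: float, num_assessments: int) -> int:
    """Rough daily count of judge evaluations.

    Assumes one evaluation per assessment per sampled trace (an assumption,
    not a documented billing or execution model).
    """
    return round(traces_per_day * sample_rate * num_assessments)


# 4 built-in judges + 3 guideline sets in the monitor configuration (assumed count).
NUM_ASSESSMENTS = 7
# Hypothetical traffic volume, for illustration only.
TRACES_PER_DAY = 50_000

for rate in (1.0, 0.5, 0.1):
    calls = estimated_judge_calls(TRACES_PER_DAY, rate, NUM_ASSESSMENTS)
    print(f"sample={rate}: ~{calls:,} judge evaluations/day")
```

Under these assumptions the evaluation volume scales linearly with `sample`, so halving the rate halves the judge workload while still covering a representative share of traffic.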
