# Find & Fix Quality Issues

## Systematic workflow for discovering problems, creating solutions, and deploying improvements safely

Learn to systematically find and fix quality issues in your GenAI application using a real-world scenario. In this hands-on workflow, we'll address accuracy problems where the email generator is "hallucinating" information not present in the customer data.

## The Problem Scenario

Your email generation system has been running in production, but users report accuracy issues. Some emails contain fabricated details about:

- Product features not mentioned in customer data
- Meeting discussions that didn't happen as described
- Support ticket information with incorrect details

## The Solution Workflow

Follow this systematic 6-step process to identify, fix, and safely deploy improvements:

1. **🔍 Discover quality issues** Find problematic traces using evaluation results
2. **📊 Create Evaluation Datasets** Turn bad traces into targeted evaluation sets, good traces into regression sets
3. **🎯 Iterate on changes** Use MLflow Prompt Registry to track your changes
4. **🧪 Evaluate whether changes improved quality** Test that your changes addressed the quality problems
5. **🛡️ Verify changes didn't cause a regression** Ensure fixes don't break user inputs that already work well
6. **🚀 Deploy**

This approach ensures evidence-based improvements rather than guesswork, with quantitative validation and safe deployment practices.

![test-new-prompt](https://i.imgur.com/4wlhT63.gif)

This notebook provides hands-on experience with MLflow's complete quality improvement workflow for GenAI applications. You'll learn to systematically identify quality issues in production traces, create evaluation datasets, test improved prompts, and make data-driven deployment decisions.


## Install packages (only required if running in a Databricks Notebook)

In [None]:
%pip install -U -r ../../requirements.txt
dbutils.library.restartPython()

## Environment Setup

Load environment variables and verify MLflow configuration.


In [None]:
import sys
sys.path.append('../')
sys.path.append('../../')

import os
from dotenv import load_dotenv
import mlflow
from mlflow_demo.utils import *

if mlflow.utils.databricks_utils.is_in_databricks_notebook():
  print("Running in Databricks Notebook")
  setup_databricks_notebook_env()
else:
  print("Running in Local IDE")
  setup_local_ide_env()

# Verify key variables are loaded
print('=== Environment Setup ===')
print(f'DATABRICKS_HOST: {os.getenv("DATABRICKS_HOST")}')
print(f'MLFLOW_EXPERIMENT_ID: {os.getenv("MLFLOW_EXPERIMENT_ID")}')
print(f'LLM_MODEL: {os.getenv("LLM_MODEL")}')
print(f'UC_CATALOG: {os.getenv("UC_CATALOG")}')
print(f'UC_SCHEMA: {os.getenv("UC_SCHEMA")}')
print('✅ Environment variables loaded successfully!')

import logging
logging.getLogger("urllib3").setLevel(logging.ERROR)
logging.getLogger("mlflow").setLevel(logging.ERROR)

In [None]:
# Get helper functions for showing links to generated traces
from mlflow_demo.utils import generate_trace_links, generate_dataset_link, generate_prompt_link, generate_evaluation_comparison_link, generate_evaluation_links

# 🔍 Step 1: Discover Quality Issues in Production Traces

Let's start by examining real production traces to identify quality problems. Once you have tracing in place and judges defined, you can systematically improve your GenAI application by analyzing where it's failing.

Here, we will:

1. Use the `accuracy` scorer that we created in the previous tutorial to identify problematic traces we need to fix
2. Use the `accuracy`, `personalized`, and `relevance` scorers to identify traces that are performing well so we can ensure our change doesn't regress quality
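
The selection rule in the two points above can be sketched in isolation. This is a minimal illustration, assuming each assessment is a dict with an `assessment_name` key and a nested `feedback.value` of `'yes'` or `'no'` (the shape the next cell reads). The `categorize` helper and the sample records are hypothetical and not part of MLflow.

```python
# Hypothetical helper: place one trace's assessments into a quality bucket.
REQUIRED_SCORERS = {"accuracy", "personalized", "relevance"}


def categorize(assessments):
    """Return 'failed_accuracy', 'high_quality', or 'other' for one trace."""
    passed = set()
    for assessment in assessments:
        name = assessment.get("assessment_name")
        value = (assessment.get("feedback") or {}).get("value")
        # A single failed accuracy assessment marks the trace as a problem case.
        if name == "accuracy" and value == "no":
            return "failed_accuracy"
        if name in REQUIRED_SCORERS and value == "yes":
            passed.add(name)
    # Only traces that pass all three scorers go into the regression set.
    return "high_quality" if passed == REQUIRED_SCORERS else "other"


# Made-up sample records, for illustration only.
sample = [
    {"assessment_name": "accuracy", "feedback": {"value": "yes"}},
    {"assessment_name": "personalized", "feedback": {"value": "yes"}},
    {"assessment_name": "relevance", "feedback": {"value": "yes"}},
]
print(categorize(sample))  # high_quality
```

Tracking passed scorers in a set, rather than counting, is a design choice here: a scorer that reports twice on the same trace is not counted twice.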

### **IMPORTANT**: In this notebook, we show you how to find traces using the MLflow SDKs. You can also perform these same steps using the MLflow Experiment UI - see the UI animation at the top of this notebook!

**📚 Documentation**

- [**Build Evaluation Datasets**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset) - Create datasets from production traces
- [**Search and Filter Traces**](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/manage-traces) - Find specific traces by criteria

**▶️ Run the next cells to search for traces with quality issues!**


In [None]:
import mlflow
import pandas as pd

print('🔍 Searching for production traces with evaluation results...')
# Get up to 50 traces from the current experiment
# The tag identifies the traces that the demo initially loaded - think of this as your production set of traces from real users.
traces = mlflow.search_traces(max_results=50, filter_string='tags.sample_data = "yes"')

print(f'📊 Found {len(traces)} total traces')

# Initialize lists to collect trace data
failed_accuracy_traces_data = []
high_quality_traces_data = []

print('\n🔍 Analyzing trace assessments for quality issues...')

# Iterate through DataFrame rows
for idx, trace_row in traces.iterrows():

  # Access trace info
  trace_info = trace_row.info

  # Get assessments - adapt based on actual data structure
  assessments = trace_row.assessments
  assessments_count = len(assessments)


  # Get trace_id - adapt based on actual data structure
  trace_id = getattr(trace_info, 'trace_id', None)
  if trace_id is None and 'trace_id' in trace_row:
    trace_id = trace_row['trace_id']

  if assessments_count == 0:
    if trace_id:
      print(f'⚠️  No assessments found for trace {trace_id[:8]}... (skipping)')
    continue

  # Track the number of passed assessments for this trace
  passed_assessments = 0
  failed_accuracy = False

  for assessment in assessments:
    # Get the scorer's assessments
    assessment_name = assessment.get('assessment_name')
    assessment_feedback = assessment.get('feedback', {}).get('value')

    if assessment_name == 'accuracy' and assessment_feedback == 'no':
      failed_accuracy = True
    elif (
      assessment_name in ['relevance', 'personalized', 'accuracy']
      and assessment_feedback == 'yes'
    ):
      passed_assessments += 1

  # Categorize traces based on quality and store full trace data, up to 5 of each
  if failed_accuracy and len(failed_accuracy_traces_data) < 5:
    failed_accuracy_traces_data.append(trace_row)
    if trace_id:
      _, trace_url = generate_trace_links(trace_id, print_urls=False)
      print(f'❌ Failed accuracy: {trace_id[:8]}... (view trace: {trace_url})')
  elif passed_assessments >= 3 and len(high_quality_traces_data) < 5:
    high_quality_traces_data.append(trace_row)
    if trace_id:
      _, trace_url = generate_trace_links(trace_id, print_urls=False)
      print(
        f'✅ High quality: {trace_id[:8]}... (passed {passed_assessments} assessments, view trace: {trace_url})'
      )

# Create DataFrames with same structure as search_traces(...) - this is required to create Evaluation Datasets in the next step
print('\n💾 Saving traces as DataFrames for dataset creation...')
failed_accuracy_traces = pd.DataFrame(failed_accuracy_traces_data)
high_quality_traces = pd.DataFrame(high_quality_traces_data)

# 📊 Step 2: Create Evaluation and Regression Datasets

Now that we've identified problematic traces, let's create two Evaluation Datasets. An MLflow Evaluation Dataset is an MLflow primitive that allows you to curate collections of traces to use for evaluation. Evaluation Datasets are stored as Delta Tables in Unity Catalog.

1. **Dataset with problematic traces**: Contains traces with accuracy issues (to test our improvements)
2. **Regression dataset with high-quality traces**: Contains high-quality traces (to ensure we don't break existing functionality)

These datasets enable systematic testing of prompt improvements and ensure we don't introduce new problems while fixing existing ones.
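
Because each dataset is stored as a Unity Catalog table, it needs a three-level name (`catalog.schema.table`) that does not collide with an existing table. A minimal sketch of one way to build such a name, assuming a timestamp suffix is acceptable; `unique_table_name` is a hypothetical helper, and `main` and `demo` are placeholder catalog and schema names.

```python
from datetime import datetime


def unique_table_name(catalog, schema, prefix, now=None):
    """Build a fully qualified Unity Catalog table name with a timestamp suffix."""
    # Accept an explicit timestamp so the result is reproducible in tests.
    now = now or datetime.now()
    return f"{catalog}.{schema}.{prefix}_{now.strftime('%Y%m%d%H%M%S')}"


# Placeholder catalog and schema names, for illustration only.
print(unique_table_name("main", "demo", "regression_dataset"))
```

Computing the timestamp once and passing it to both calls keeps the two dataset names on the same suffix, which makes the pair easier to find later.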

**📚 Documentation**

- [**Create Evaluation Datasets**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset) - Build datasets from traces
- [**Dataset Management**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset#manage-datasets) - Manage evaluation datasets

**▶️ Run the next cells to create these datasets!**


In [None]:
from mlflow.genai import datasets
from datetime import datetime
# Dataset locations (uses the same Unity Catalog from the initial setup - these can be changed if needed)
UC_CATALOG = os.getenv('UC_CATALOG', 'default_catalog')
UC_SCHEMA = os.getenv('UC_SCHEMA', 'default_schema')

# Use unique names for datasets to avoid conflicts - you can change these if desired
LOW_ACCURACY_DATASET_NAME = f'low_accuracy_dataset_{datetime.now().strftime("%Y%m%d%H%M%S")}'
REGRESSION_DATASET_NAME = f'regression_dataset_{datetime.now().strftime("%Y%m%d%H%M%S")}'

print(f'\n📊 Creating datasets in {UC_CATALOG}.{UC_SCHEMA}...')

# Create low accuracy dataset - this will fail if you've already created a dataset with this name
low_accuracy_dataset = datasets.create_dataset(
    uc_table_name=f'{UC_CATALOG}.{UC_SCHEMA}.{LOW_ACCURACY_DATASET_NAME}',
)
print(f'✅ Created new low accuracy dataset: {UC_CATALOG}.{UC_SCHEMA}.{LOW_ACCURACY_DATASET_NAME}')

# Add traces from step 1 with accuracy issues to the low accuracy dataset
low_accuracy_dataset.merge_records(failed_accuracy_traces)
print(f'📝 Added {len(failed_accuracy_traces)} records into low accuracy dataset')
generate_dataset_link(low_accuracy_dataset.dataset_id)

# Create regression dataset - this will fail if you've already created a dataset with this name
regression_dataset = datasets.create_dataset(
    uc_table_name=f'{UC_CATALOG}.{UC_SCHEMA}.{REGRESSION_DATASET_NAME}',
    )
print(f'✅ Created new regression dataset: {UC_CATALOG}.{UC_SCHEMA}.{REGRESSION_DATASET_NAME}')

# Add traces from step 1 with high quality to the regression dataset
regression_dataset.merge_records(high_quality_traces)
print(f'📝 Added {len(high_quality_traces)} records into regression dataset')
generate_dataset_link(regression_dataset.dataset_id)

# 🔍 Step 3: Investigate Quality Issues: Understanding What's Going Wrong

Now let's dive deeper into the problematic traces to understand the root causes of our quality issues. By analyzing the judge rationales and specific examples, we can identify patterns in the failures and design targeted improvements.

### **IMPORTANT**: In this notebook, we show you how to view this data using the MLflow SDKs. You can also perform these same steps using the MLflow Experiment UI:

![investigate](https://i.imgur.com/DFN91Pu.gif)

## What We'll Investigate

**Judge Rationales Analysis**: Examine why the accuracy scorer flagged these traces as problematic

- What specific fabrications or hallucinations occurred?
- Which types of information are being invented (features, meetings, tickets)?
- Are there patterns in the failure modes?

**Input vs Output Comparison**: Compare customer data with generated emails

- Identify where the model went beyond provided data
- Find instances of invented product features or capabilities
- Spot fabricated meeting details or support ticket information

**Root Cause Analysis**: Determine the underlying prompt issues

- Is the current prompt too vague about data constraints?
- Are there missing guardrails against fabrication?
- Does the prompt encourage creativity over accuracy?

This investigation will inform our prompt improvements in the next step, ensuring we address the actual causes rather than symptoms.
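
One rough way to look for patterns across many rationales is to tally which kinds of information they mention. The sketch below assumes simple keyword matching is good enough for a first pass. The keyword groups, the `tally_patterns` helper, and the sample rationales are all hypothetical; real judge rationales will be worded differently.

```python
from collections import Counter

# Hypothetical keyword groups; tune these to the rationales you actually see.
PATTERNS = {
    "features": ("feature", "capabilit"),
    "meetings": ("meeting", "discussion"),
    "tickets": ("ticket",),
}


def tally_patterns(rationales):
    """Count how many rationales mention each category at least once."""
    counts = Counter()
    for text in rationales:
        lowered = (text or "").lower()
        for label, keywords in PATTERNS.items():
            if any(keyword in lowered for keyword in keywords):
                counts[label] += 1
    return counts


# Made-up rationales, for illustration only.
sample_rationales = [
    "The email mentions a feature that is not in the customer data.",
    "The meeting summary adds a discussion that the input does not contain.",
    "The ticket status differs from the input, and an extra feature is described.",
]
print(tally_patterns(sample_rationales))
```

Keyword counts only point you at where to read more closely; they do not replace reviewing the individual traces.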

**📚 Documentation**

- [**Trace Analysis**](https://docs.databricks.com/aws/en/mlflow3/genai/tracing/manage-traces) - Analyze production traces
- [**Judge Feedback**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-scorers#judge-feedback) - Understanding evaluation rationales

**▶️ Click on the trace links above to examine specific failures in the MLflow UI, then run the next cell to print the judge rationales before continuing to Step 4!**


In [None]:
# Investigate failed traces and extract judge rationales
import textwrap

print('🔍 Investigating Failed Accuracy Traces:')
print('=' * 80)

for idx, failed_trace in failed_accuracy_traces.iterrows():
    # Get trace_id - it's stored in the info object
    trace_info = failed_trace.info
    trace_id = getattr(trace_info, 'trace_id', None)

    _, trace_url = generate_trace_links(trace_id, print_urls=False)
    print(f'\n❌ Trace ID: {trace_id[:8] if trace_id else "Unknown"}...')
    print(f'   🔗 View Trace: {trace_url}')

    # Extract and display accuracy assessment rationale
    assessments = failed_trace.assessments
    for assessment in assessments:
        assessment_name = assessment.get('assessment_name')
        assessment_rationale = assessment.get('rationale')
        assessment_feedback = assessment.get('feedback', {}).get('value')

        if assessment_name == 'accuracy' and assessment_feedback == 'no':
            # Print the judge's rationale for why this failed accuracy
            print('   📝 Accuracy Judge Rationale:')
            # Word wrap the rationale for better readability; fall back to a
            # placeholder because textwrap.fill() raises on a missing (None) rationale
            wrapped_rationale = textwrap.fill(
                assessment_rationale or 'No rationale provided',
                width=80,
                initial_indent='      ',
                subsequent_indent='      ',
            )
            print(wrapped_rationale)
            break


print('\n💡 Patterns we noticed when reviewing these rationales:')
print('   🚫 Fabricated product features not in customer data')
print('   🚫 Invented meeting details or discussions')
print('   🚫 Made-up support ticket information')
print('   🚫 References to CloudFlow capabilities not mentioned in input')
print('\n🎯 These patterns will inform our prompt improvements in Step 4!')
print('=' * 80)

# 🎯 Step 4: Develop and Test Improved Prompts

Now that we understand the problem (hallucinated information) and have datasets to test against, let's create an improved prompt. We'll use MLflow Prompt Registry to manage prompt versions and enable safe iteration.

Based on our root cause analysis, the main issue is that the current prompt doesn't explicitly prevent fabrication. Our improved prompt will include:

- **Explicit NO FABRICATION rules** to prevent hallucinations
- **Clear data-only requirements** for all factual references
- **Enhanced accuracy guidelines** with specific failure conditions
- **Structured content prioritization** for better focus
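
Registering the improved prompt under an alias lets the app pick up the new version by name rather than by a hard-coded version number. The sketch below only builds the `prompts:/<name>@<alias>` URI that MLflow uses to refer to an aliased prompt. `prompt_uri` is a hypothetical helper, `main`, `demo`, and `email_prompt` are placeholder names, and the commented `mlflow.genai.load_prompt` call assumes an MLflow 3 environment with a configured Prompt Registry.

```python
def prompt_uri(catalog, schema, prompt_name, alias):
    """Build the URI for a Unity Catalog prompt referenced by alias."""
    return f"prompts:/{catalog}.{schema}.{prompt_name}@{alias}"


# Placeholder names, for illustration only; test_prompt matches the alias set in the next cell.
uri = prompt_uri("main", "demo", "email_prompt", "test_prompt")
print(uri)  # prompts:/main.demo.email_prompt@test_prompt

# Assumption: with MLflow 3 and a registry connection, the aliased version
# could then be loaded like this (not executed here):
# import mlflow
# prompt = mlflow.genai.load_prompt(uri)
```

Pointing the alias at a different version later changes what the app loads without touching application code.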

**📚 Documentation**

- [**Create and Edit Prompts**](https://docs.databricks.com/aws/en/mlflow3/genai/prompt-version-mgmt/prompt-registry/create-and-edit-prompts) - Prompt Registry management

**▶️ Run the next cells to create and compare prompt versions!**


In [None]:
import mlflow
# Create improved prompt based on identified quality issues
FIXED_PROMPT_TEMPLATE = """You are an expert sales communication assistant for CloudFlow Inc. Your task is to generate a personalized, professional follow-up email for our sales representatives to send to their customers at the end of the day.

## CRITICAL: NO FABRICATION RULE
**ABSOLUTE REQUIREMENT**: You must ONLY reference information that is explicitly provided in the customer data. DO NOT:
- Invent or mention any CloudFlow features, services, or capabilities not listed in the data
- Fabricate any details about meetings, tickets, or usage that aren't provided
- Add any product recommendations beyond what's specifically mentioned in the customer data
- Create any information not directly sourced from the input JSON

**AUTOMATIC FAILURE** occurs if you mention anything not explicitly provided in the data.

## INPUT DATA
You will be provided with a JSON object containing:
- Account information
- Recent activity data (meetings, product usage, support tickets)
- Sales representative details

## EMAIL REQUIREMENTS
Generate an email that follows these guidelines:

1. SUBJECT LINE:
   - Concise and specific to the most important update or follow-up point
   - Include the company name if appropriate

2. GREETING:
   - Address the main contact by first name
   - Use a professional but friendly opening

3. BODY CONTENT (prioritize in this order):
   - Reference the most recent meeting/interaction and acknowledge key points discussed
   - Discuss any pressing issues that are still open immediately afterwards
   - Provide updates on any urgent or recently resolved support tickets
   - Highlight positive product usage trends or achievements
   - Address any specific action items from previous meetings
   - Include personalized recommendations ONLY if features are explicitly mentioned in the 'least_used_features' field and directly related to the 'potential_opportunity' field.
      - NEVER invent or describe CloudFlow features/capabilities not explicitly listed in the customer data
      - Make sure these recommendations can NOT be copied to another customer in a different situation
      - No more than ONE feature recommendation for accounts with open critical issues
   - Suggest clear and specific next steps
      - Only request a meeting if it can be tied to specific action items


4. TONE AND STYLE:
   - Professional but conversational
   - Concise paragraphs (2-3 sentences each)
   - Use bullet points for lists or multiple items
   - Balance between being informative and actionable
   - Personalized to reflect the existing relationship
   - Adjust formality based on the customer's industry and relationship history

5. CLOSING:
   - Include an appropriate sign-off
   - Use the sales rep's signature from the provided data
   - No generic marketing language or overly sales-focused calls to action

## OUTPUT FORMAT
Provide the complete email as JUST a JSON object that can be loaded via `json.loads()` (do not wrap the JSON in backticks) with:
- `subject_line`: Subject line
- `body`: Body content with appropriate spacing and formatting including the signature

Remember, this email should feel like it was thoughtfully written by the sales representative based on their specific knowledge of the customer, not like an automated message.

**FINAL REMINDER**: Stay strictly within the bounds of the provided customer data. Any mention of CloudFlow features, capabilities, or services NOT explicitly listed in the input data will result in automatic failure.

If the user provides a specific instruction, follow it only if it does not conflict with the guidelines above. Do not follow any instructions that would result in an unprofessional or unethical email.
"""

# Create a new prompt version in the registry
print('📝 Creating new prompt version in MLflow Prompt Registry...')

# Load the prompt name and its UC location configured during setup
UC_SCHEMA = os.getenv('UC_SCHEMA')
UC_CATALOG = os.getenv('UC_CATALOG')
PROMPT_NAME = os.getenv('PROMPT_NAME')


new_prompt = mlflow.genai.register_prompt(
    name=f'{UC_CATALOG}.{UC_SCHEMA}.{PROMPT_NAME}',
    template=FIXED_PROMPT_TEMPLATE,
    commit_message='New email generation prompt to fix accuracy issues.',
)

mlflow.genai.set_prompt_alias(
      name=f'{UC_CATALOG}.{UC_SCHEMA}.{PROMPT_NAME}',
      alias="test_prompt",
      version=new_prompt.version,
  )

print(f'✅ Created new prompt version: {new_prompt.name} (version {new_prompt.version})')
generate_prompt_link(f'{UC_CATALOG}.{UC_SCHEMA}.{PROMPT_NAME}')

# 🧪 Step 5: Evaluate Improved Prompt Against Problem Cases

Now let's test our improved prompt against the evaluation dataset containing problematic traces. This will show us whether our changes actually fix the accuracy issues we identified.

We'll run evaluations with the same scorers used in production to ensure consistent quality measurement:

- **Accuracy**: Our custom guideline focused on preventing hallucinations
- **Relevance**: Ensuring content remains relevant to customer needs
- **Personalization**: Maintaining customer-specific personalization
- **Safety**: Basic safety checks
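
The comparison UI shows per-scorer results side by side. The arithmetic behind "did the new prompt improve?" can be sketched as a change in pass rate per scorer. This is a minimal illustration, assuming each evaluated row has been reduced to a dict mapping scorer name to `'yes'` or `'no'`. The `pass_rates` and `compare` helpers and the sample rows are hypothetical and do not reflect the structure of the object returned by `mlflow.genai.evaluate()`.

```python
def pass_rates(rows):
    """Return the fraction of 'yes' results per scorer."""
    totals, passes = {}, {}
    for row in rows:
        for scorer, value in row.items():
            totals[scorer] = totals.get(scorer, 0) + 1
            passes[scorer] = passes.get(scorer, 0) + (value == "yes")
    return {scorer: passes[scorer] / totals[scorer] for scorer in totals}


def compare(old_rows, new_rows):
    """Return the change in pass rate (new minus old) for every scorer."""
    old, new = pass_rates(old_rows), pass_rates(new_rows)
    scorers = sorted(set(old) | set(new))
    return {scorer: new.get(scorer, 0.0) - old.get(scorer, 0.0) for scorer in scorers}


# Made-up results for two examples, for illustration only.
old_rows = [{"accuracy": "no", "relevance": "yes"}, {"accuracy": "no", "relevance": "yes"}]
new_rows = [{"accuracy": "yes", "relevance": "yes"}, {"accuracy": "no", "relevance": "yes"}]
print(compare(old_rows, new_rows))  # {'accuracy': 0.5, 'relevance': 0.0}
```

With only a handful of examples per dataset, treat these deltas as directional signals, not precise measurements.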

**📚 Documentation**

- [**Evaluate Apps**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-app) - Run evaluations on datasets

**▶️ Run the next cell to evaluate both the old and the improved prompt against the low-accuracy dataset!**


In [None]:
# Here, we import the scorers that we created in the previous tutorial.
from mlflow_demo.evaluation import SCORERS

# Import the Email Generation app
from mlflow_demo.agent.email_generator import EmailGenerator

print('🔧 Creating predict_fn for evaluation to call the email app...')

email_app_with_new_prompt = EmailGenerator(prompt_alias="test_prompt")
email_app_with_old_prompt = EmailGenerator()


# Create predict_fn to enable mlflow.genai.evaluate() to call our app
def predict_fn_new(customer_name: str, user_input: str):
  return email_app_with_new_prompt.generate_email_with_retrieval(customer_name, user_input)

def predict_fn_old(customer_name: str, user_input: str):
  return email_app_with_old_prompt.generate_email_with_retrieval(customer_name, user_input)

# Run evaluations on the low accuracy dataset with both prompts
print('🚀 Running evaluation of the old prompt...')

eval_results_old = mlflow.genai.evaluate(
    data=low_accuracy_dataset,
    predict_fn=predict_fn_old,
    scorers=SCORERS,
)

print('✅ Old prompt evaluation completed!')
print('🚀 Running evaluation of the new prompt...')
eval_results_new = mlflow.genai.evaluate(
    data=low_accuracy_dataset,
    predict_fn=predict_fn_new,
    scorers=SCORERS,
)

print('✅ New prompt evaluation completed!')
print('=' * 60)

# Use the MLflow UI to compare the results
print('✅ View the results in the MLflow comparison UI:')
generate_evaluation_comparison_link(eval_results_new.run_id, eval_results_old.run_id);

## Now, you can keep iterating on the prompt using the evaluation results and scorer outputs to guide your improvements!

# 🛡️ Step 6: Verify No Regressions on High-Quality Examples

Now we need to ensure our improvements don't break existing functionality. We'll run the improved prompt against our regression dataset containing high-quality traces to verify that:

- Existing good examples remain good
- Quality metrics don't decrease for well-performing cases
- New accuracy rules don't interfere with proper personalization and relevance

This step is crucial for safe deployment - we want to fix problems without creating new ones.
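
A regression, in this sense, is any example that passed a scorer before the change and no longer passes it afterwards. A minimal sketch of that check, assuming results have been reduced to a mapping from example id to per-scorer `'yes'`/`'no'` values. `find_regressions` and the sample data are hypothetical and not an MLflow API.

```python
def find_regressions(baseline, candidate):
    """List (example_id, scorer) pairs that passed in baseline but not in candidate."""
    regressions = []
    for example_id, before in baseline.items():
        after = candidate.get(example_id, {})
        for scorer, value in before.items():
            # A missing result after the change is treated as a regression too.
            if value == "yes" and after.get(scorer) != "yes":
                regressions.append((example_id, scorer))
    return regressions


# Made-up results, for illustration only.
baseline = {"ex1": {"accuracy": "yes", "personalized": "yes"}}
candidate = {"ex1": {"accuracy": "yes", "personalized": "no"}}
print(find_regressions(baseline, candidate))  # [('ex1', 'personalized')]
```

An empty list means every previously passing check still passes on the regression set.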

**▶️ Run the next cell to verify no regressions on high-quality examples!**

In [None]:
# Here, we re-use the app and scorers from above to verify no regressions on high-quality examples.

# Run the evaluation on the regression dataset (high-quality traces) with the new prompt
print('🚀 Running evaluation of the new prompt on the regression dataset...')

regression_results = mlflow.genai.evaluate(
    data=regression_dataset,
    predict_fn=predict_fn_new,
    scorers=SCORERS,
)

generate_evaluation_links(regression_results.run_id)


# 🚀 Step 7: Deploy the new version

Based on the evaluation results, we can now make an informed decision: either iterate further on the prompt to reach our quality bar, or deploy the new prompt with confidence that quality improves without regressions.
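
The decision can be written down as an explicit gate so that it is applied the same way on every iteration. This is a sketch under stated assumptions: the `should_deploy` helper and its default thresholds are hypothetical, and your own quality bar may differ. The commented promotion step reuses `mlflow.genai.set_prompt_alias` from Step 4 with an alias name (`production`) that is an assumption, not something this demo defines.

```python
def should_deploy(accuracy_gain, regression_count, min_gain=0.0, max_regressions=0):
    """Return True when accuracy improved and regressions stay within the limit."""
    return accuracy_gain > min_gain and regression_count <= max_regressions


print(should_deploy(accuracy_gain=0.5, regression_count=0))  # True
print(should_deploy(accuracy_gain=0.5, regression_count=1))  # False

# Assumption: if the gate passes, promotion could move a production alias to the
# new version, mirroring the set_prompt_alias call in Step 4 (not executed here):
# mlflow.genai.set_prompt_alias(name=..., alias="production", version=new_prompt.version)
```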

# 🎯 Tutorial Complete: Systematic GenAI Quality Improvement

Congratulations! You've successfully completed the end-to-end workflow for systematically improving GenAI quality using MLflow's evaluation and prompt management capabilities.

## What You've Accomplished

✅ **Identified Quality Issues** - Found traces with accuracy problems through systematic analysis  
✅ **Created Targeted Datasets** - Built evaluation datasets from problematic traces and regression datasets from high-quality examples  
✅ **Developed Improved Prompts** - Created enhanced prompts with explicit accuracy requirements using MLflow Prompt Registry  
✅ **Validated Improvements** - Tested improved prompts against problem cases and verified no regressions  

## The Complete MLflow Quality Improvement Workflow

1. **🔍 Discover quality issues** Find problematic traces using evaluation results
2. **📊 Create Evaluation Datasets** Turn bad traces into targeted evaluation sets, good traces into regression sets
3. **🎯 Iterate on changes** Use MLflow Prompt Registry to track your changes
4. **🧪 Evaluate whether changes improved quality** Test that your changes addressed the quality problems
5. **🛡️ Verify changes didn't cause a regression** Ensure fixes don't break user inputs that already work well
6. **🚀 Deploy**

This approach ensures evidence-based improvements rather than guesswork, with quantitative validation and safe deployment practices.


## Resources for Continued Learning

📚 **MLflow Documentation**

- [**Evaluation Guide**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/evaluate-app)
- [**Evaluation Datasets Guide**](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset)

🎯 **Remember**: Great evaluation is iterative. Start with the basics, learn from your results, and progressively build more sophisticated assessment as your understanding deepens.