# Human Review: Expert Feedback for GenAI Quality

App developers are often not domain experts in the business use case, so they need help from domain experts to distinguish low-quality from high-quality responses. MLflow's labeling capabilities let domain experts systematically review GenAI application traces and label them with ground truth. Those labels help you tune your automated quality metrics and understand how your application should respond.

Key benefits:

- **Expert validation** - Domain experts can review traces and provide structured feedback on quality dimensions
- **Systematic labeling** - Create consistent labeling schemas that capture business-critical quality aspects
- **Quality improvement loop** - Convert expert feedback into training data for LLM judges and evaluation benchmarks

MLflow's Review App provides a pre-built UI, designed for business users, that anyone in your company can use, even if they don't have access to your Databricks workspace.

![human-feedback-overview](https://i.imgur.com/7LNlgDP.gif)

This notebook demonstrates how to collect expert feedback through MLflow's labeling capabilities. You'll learn to create labeling sessions, add traces for review, and access the Review App for systematic quality improvement through human-in-the-loop evaluation.


## Install packages (only required if running in a Databricks Notebook)

In [1]:
%pip install -U -r ../../requirements.txt
dbutils.library.restartPython()


## Environment Setup

Load environment variables and verify MLflow configuration.


In [1]:
import sys
sys.path.append('../')
sys.path.append('../../')

import os
from dotenv import load_dotenv
import mlflow
from mlflow_demo.utils import *

if mlflow.utils.databricks_utils.is_in_databricks_notebook():
  print("Running in Databricks Notebook")
  setup_databricks_notebook_env()
else:
  print("Running in Local IDE")
  setup_local_ide_env()

# Verify key variables are loaded
print('=== Environment Setup ===')
print(f'DATABRICKS_HOST: {os.getenv("DATABRICKS_HOST")}')
print(f'MLFLOW_EXPERIMENT_ID: {os.getenv("MLFLOW_EXPERIMENT_ID")}')
print(f'LLM_MODEL: {os.getenv("LLM_MODEL")}')
print(f'UC_CATALOG: {os.getenv("UC_CATALOG")}')
print(f'UC_SCHEMA: {os.getenv("UC_SCHEMA")}')
print('✅ Environment variables loaded successfully!')

import logging
logging.getLogger("urllib3").setLevel(logging.ERROR)
logging.getLogger("mlflow").setLevel(logging.ERROR)

Running in Local IDE
=== Environment Setup ===
DATABRICKS_HOST: https://db-ml-models-prod-us-west.cloud.databricks.com
MLFLOW_EXPERIMENT_ID: 1203903883862022
LLM_MODEL: databricks-claude-3-7-sonnet
UC_CATALOG: ep
UC_SCHEMA: notes_test2
✅ Environment variables loaded successfully!
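
The setup cell prints `None` for any variable that is not set, which is easy to miss. A minimal sketch of a fail-fast check follows. It assumes all five variables printed above are required for this notebook, and the helper name `missing_env_vars` is hypothetical (it is not part of `mlflow_demo`).

```python
import os

# Variables printed by the setup cell; treating all of them as required is an assumption.
REQUIRED_VARS = [
    "DATABRICKS_HOST",
    "MLFLOW_EXPERIMENT_ID",
    "LLM_MODEL",
    "UC_CATALOG",
    "UC_SCHEMA",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.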


In [2]:
# Import helper functions that print links to labeling schemas and sessions in the MLflow UI
from mlflow_demo.utils import generate_labeling_schema_link, generate_labeling_session_link

# 🏷️ Step 1: Create Labeling Schemas

Define structured feedback schemas that capture the quality dimensions important to your business. These schemas guide reviewers to provide consistent, actionable feedback that can be used to improve automated quality assessment.

**Schema Design Best Practices**

- **Simple Yes/No questions** are easier for reviewers and provide clear signals
- **Detailed instructions** help ensure consistent labeling across reviewers
- **Enable comments** to capture qualitative insights beyond the binary rating
- **Focus on business-critical dimensions** rather than generic quality aspects

**📚 Documentation**

- [**Expert Feedback**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces)
- [**Labeling Schemas**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/labeling-schemas)

**▶️ Run the next cell to create email quality assessment schemas!**


In [3]:
from mlflow.genai import label_schemas

# Create comprehensive labeling schemas for email quality assessment
# Note that we intentionally use the same names as our scorers - this allows us to see the LLM scores alongside human labels in the UI.
schema_configs = {
  'accuracy': {
    'title': 'Are all facts accurate?',
    'instruction': 'Check that all information comes from customer data with no fabrication or errors.',
    'type': 'feedback',  # Subjective assessment
  },
  'personalized': {
    'title': 'Is this email personalized?',
    'instruction': "Evaluate if the email is tailored to this customer's specific situation and cannot be reused for others.",
    'type': 'feedback',
  },
  'relevance': {
    'title': 'Is the email relevant to this customer?',
    'instruction': 'Check if urgent issues are prioritized first and content follows proper importance order.',
    'type': 'feedback',
  },
}

print('📋 Creating labeling schemas for email quality assessment...\n')

# Create label schemas using MLflow's label_schemas API
created_schemas = {}
for schema_name, config in schema_configs.items():
  # Create schema with proper configuration following documentation patterns
  schema = label_schemas.create_label_schema(
    name=schema_name,
    type=config['type'],
    title=config['title'],
    input=label_schemas.InputCategorical(options=['yes', 'no']),  # Simple binary choice
    instruction=config['instruction'],
    enable_comment=True,  # Enable comments
    overwrite=True,  # Allow updating this schema if it exists
  )
  created_schemas[schema_name] = schema
  print(f'✅ Created schema: **{schema_name}**')
  print(f'   📝 Title: {config["title"]}')
  print(f'   💡 Type: {config["type"]}')
  print(f'   📋 Instructions: {config["instruction"][:80]}...')
  print()

generate_labeling_schema_link();

📋 Creating labeling schemas for email quality assessment...

✅ Created schema: **accuracy**
   📝 Title: Are all facts accurate?
   💡 Type: feedback
   📋 Instructions: Check that all information comes from customer data with no fabrication or error...

✅ Created schema: **personalized**
   📝 Title: Is this email personalized?
   💡 Type: feedback
   📋 Instructions: Evaluate if the email is tailored to this customer's specific situation and cann...

✅ Created schema: **relevance**
   📝 Title: Is the email relevant to this customer?
   💡 Type: feedback
   📋 Instructions: Check if urgent issues are prioritized first and content follows proper importan...

🔗 View labeling schemas in MLflow UI:
   🏷️ Label Schemas: https://db-ml-models-prod-us-west.cloud.databricks.com/ml/experiments/1203903883862022/label-schemas
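
Because the schemas are driven by a plain dictionary, a typo in one entry only surfaces when the API call fails. A minimal sketch of a pre-flight check for a `schema_configs`-style dictionary follows. The helper `validate_schema_configs` and the `tone` entry are hypothetical illustrations, and the allowed types are assumed to be `feedback` and `expectation`.

```python
# Keys each entry is expected to carry (mirrors the schema_configs cell above).
REQUIRED_KEYS = {"title", "instruction", "type"}
# Assumed label schema types: "feedback" is used above, "expectation" is for ground-truth labels.
ALLOWED_TYPES = {"feedback", "expectation"}


def validate_schema_configs(configs):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    for name, config in configs.items():
        missing = REQUIRED_KEYS - set(config)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        if config.get("type") not in ALLOWED_TYPES:
            problems.append(f"{name}: unexpected type {config.get('type')!r}")
    return problems


example_configs = {
    "accuracy": {
        "title": "Are all facts accurate?",
        "instruction": "Check that all information comes from customer data.",
        "type": "feedback",
    },
    # Deliberately broken entry (hypothetical) to show what the check reports.
    "tone": {"title": "Is the tone appropriate?", "type": "rating"},
}

for problem in validate_schema_configs(example_configs):
    print(problem)
```

Running a check like this before the creation loop keeps a malformed entry from leaving you with only some of the schemas created.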


# 🎯 Step 2: Create Labeling Session

Create a labeling session that combines the schemas. A Labeling Session is a special type of MLflow Run that organizes a set of traces for review by specific experts using selected labeling schemas. It acts as a queue for the review process.

**📚 Documentation**

- [**Expert Feedback**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces)
- [**Labeling Sessions**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/labeling-sessions)

**▶️ Run the next cell to create a session**


In [4]:
# Create labeling session following MLflow documentation patterns
import mlflow

print('📝 Creating labeling session...\n')

# Name the session (a fixed name is used here; append a timestamp if you need distinct names across runs)
session_name = 'Email quality review'

# Create labeling session with proper configuration
labeling_session = mlflow.genai.create_labeling_session(
  name=session_name,
  assigned_users=[],  # Users can be assigned later via the MLflow UI
  label_schemas=['accuracy', 'personalized', 'relevance'],  # Use the schemas created above
)

print(f'✅ Created labeling session: {session_name}')
generate_labeling_session_link(labeling_session.labeling_session_id);


📝 Creating labeling session...

✅ Created labeling session: Email quality review
🔗 View labeling sessions in MLflow UI:
   🏷️ Labeling Session: https://db-ml-models-prod-us-west.cloud.databricks.com/ml/experiments/1203903883862022/labeling-sessions?selectedLabelingSessionId=0ab987d3-42b7-4d45-9a93-c301d4e7b3e6
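
The session above uses a fixed name. If you rerun the notebook and want to tell the resulting sessions apart in the UI, one option is to append a timestamp to the name. This is a minimal sketch; the helper `unique_session_name` is hypothetical and the timestamp format is an arbitrary choice.

```python
from datetime import datetime


def unique_session_name(base_name, now=None):
    """Append a second-resolution timestamp so repeated runs yield distinct names."""
    now = datetime.now() if now is None else now
    return f"{base_name} {now.strftime('%Y-%m-%d %H:%M:%S')}"


# The result can be passed as the `name` argument when creating a labeling session.
print(unique_session_name("Email quality review"))
```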


# 📊 Step 3: Add Traces to Labeling Session

Now we'll add traces to our labeling session so domain experts can review them. This step demonstrates how to programmatically add specific traces that need expert feedback.

### **IMPORTANT**: In this notebook, we show you how to add traces using the MLflow SDK. You can also perform these same steps using the MLflow Experiment UI.

![add](https://i.imgur.com/mOrdoF5.gif)

**📚 Documentation**

- [**Add Traces to Labeling Session**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/expert-feedback/label-existing-traces#step-4-generate-traces-and-add-to-the-labeling-session)

**▶️ Run the next cell to add traces to the labeling session!**

In [5]:
print('🔍 Searching for production traces to add...')
# Get 5 traces from the current experiment
# The tag is used to identify the traces that the demo initially loaded - think of this as your production set of traces from real users.
traces = mlflow.search_traces(max_results=5, filter_string='tags.sample_data = "yes"')

labeling_session.add_traces(traces)

print(f'✅ Added {len(traces)} traces to labeling session: {session_name}')

generate_labeling_session_link(labeling_session.labeling_session_id);

🔍 Searching for production traces to add...
✅ Added 5 traces to labeling session: Email quality review
🔗 View labeling sessions in MLflow UI:
   🏷️ Labeling Session: https://db-ml-models-prod-us-west.cloud.databricks.com/ml/experiments/1203903883862022/labeling-sessions?selectedLabelingSessionId=0ab987d3-42b7-4d45-9a93-c301d4e7b3e6
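
The cell above selects traces with a hand-written `filter_string`. When you select by several tags, building the string from a dictionary can reduce quoting mistakes. The sketch below is pure string formatting. `build_tag_filter` is a hypothetical helper, the `env` tag is made up for illustration, and joining clauses with `AND` is assumed to be accepted by `mlflow.search_traces`. It does not escape quotes inside values or handle tag keys that need special quoting.

```python
def build_tag_filter(tags):
    """Build a filter string such as: tags.a = "x" AND tags.b = "y"."""
    clauses = [f'tags.{key} = "{value}"' for key, value in tags.items()]
    return " AND ".join(clauses)


# Same filter as the cell above, built from a dictionary.
print(build_tag_filter({"sample_data": "yes"}))

# Two tags combined; "env" is a made-up tag name.
print(build_tag_filter({"sample_data": "yes", "env": "prod"}))
```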


# 🖥️ Step 4: Share the Review App with your domain experts and label!

The Review App provides an intuitive UI for domain experts to review and label traces. It's specifically designed for business users who do not have direct access to your Databricks workspace but need to provide quality feedback.

**📚 Documentation**

- [**Review App**](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/review-app)

**▶️ Run the next cell to get a link to the Review App - open the app and label!**

In [6]:
print('\n📱 Review App URL (for domain experts):')
print(f'   🔗 {labeling_session.url}')
print('   ℹ️  Share this link with reviewers to label traces')

print('\n' + '=' * 60)


📱 Review App URL (for domain experts):
   🔗 https://db-ml-models-prod-us-west.cloud.databricks.com/ml/review-v2/94e770b23f174005a4b73dcb7cddb011/tasks/labeling/0ab987d3-42b7-4d45-9a93-c301d4e7b3e6
   ℹ️  Share this link with reviewers to label traces



# 📊 Step 5: Accessing and Using Labeling Results

After experts complete labeling, you can view the results in the MLflow UI or access them through the MLflow SDK, then use them to improve your application in several ways:

## Using Human Feedback Results

- **Understand quality patterns** - Identify common issues and strengths
- **Create training data** - Use labeled examples to train custom LLM judges
- **Build evaluation datasets** - Convert labeled traces into systematic test sets
- **Validate automated metrics** - Check if your automated judges align with human assessment
- **Improve prompts** - Use specific feedback to enhance prompt engineering
- **Guide model selection** - Compare model performance on human-validated examples

**▶️ Run the next cell to see how to programmatically access labeling results!**

In [9]:
# Real code examples for accessing labeling results
from mlflow_demo.utils.mlflow_helpers import get_mlflow_experiment_id

print('📊 ACCESSING LABELING RESULTS')
print('=' * 60)

print('\n🔍 After experts complete labeling, you can access results via the UI:\n')

generate_labeling_session_link(labeling_session.labeling_session_id)

print('\n🔍 After experts complete labeling, you can access results programmatically:\n')

traces = mlflow.search_traces(run_id=labeling_session.mlflow_run_id)

traces

📊 ACCESSING LABELING RESULTS

🔍 After experts complete labeling, you can access results via the UI:

🔗 View labeling sessions in MLflow UI:
   🏷️ Labeling Session: https://db-ml-models-prod-us-west.cloud.databricks.com/ml/experiments/1203903883862022/labeling-sessions?selectedLabelingSessionId=0ab987d3-42b7-4d45-9a93-c301d4e7b3e6

🔍 After experts complete labeling, you can access results programmatically:



Unnamed: 0,trace_id,trace,client_request_id,state,request_time,execution_duration,request,response,trace_metadata,tags,spans,assessments
0,tr-a85973d622b4aafae661f7b371cce37d,"{""info"": {""trace_id"": ""tr-a85973d622b4aafae661...",tr-a85973d622b4aafae661f7b371cce37d,TraceState.OK,1754453499937,13571,"{'customer_name': 'Energex Solutions', 'user_i...",{'email_subject': 'Energex Solutions: Implemen...,"{'mlflow.trace.sizeStats': '{""total_size_bytes...","{'mlflow.user': 'eric.peter@databricks.com', '...","[{'trace_id': 'qFlz1iK0qvrmYfezcczjfQ==', 'spa...",[{'assessment_id': 'a-1dbcaf6a953f45dcb29108da...
1,tr-80b769a1121711c366384ace202dc3f8,"{""info"": {""trace_id"": ""tr-80b769a1121711c36638...",tr-80b769a1121711c366384ace202dc3f8,TraceState.OK,1754453499582,13143,"{'customer_name': 'Skyline Realty', 'user_inpu...",{'email_subject': 'Skyline Realty: Follow-up o...,"{'mlflow.trace.sizeStats': '{""total_size_bytes...","{'mlflow.user': 'eric.peter@databricks.com', '...","[{'trace_id': 'gLdpoRIXEcNmOErOIC3D+A==', 'spa...",[{'assessment_id': 'a-fefc0700aece4a0093be6a46...
2,tr-1cb816d596868308611d24706e5e7369,"{""info"": {""trace_id"": ""tr-1cb816d596868308611d...",tr-1cb816d596868308611d24706e5e7369,TraceState.OK,1754453496025,11208,"{'customer_name': 'BrewMasters Co.', 'user_inp...",{'email_subject': 'BrewMasters Co. - Follow-up...,"{'mlflow.trace.sizeStats': '{""total_size_bytes...","{'mlflow.user': 'eric.peter@databricks.com', '...","[{'trace_id': 'HLgW1ZaGgwhhHSRwbl5zaQ==', 'spa...",[{'assessment_id': 'a-406d6e932fa941c5a400ae22...
3,tr-8c23c6e188b8f1d8c835b94553f42e10,"{""info"": {""trace_id"": ""tr-8c23c6e188b8f1d8c835...",tr-8c23c6e188b8f1d8c835b94553f42e10,TraceState.OK,1754453494921,10450,"{'customer_name': 'HealthFirst', 'user_input':...",{'email_subject': 'HealthFirst Strategic Plann...,"{'mlflow.trace.sizeStats': '{""total_size_bytes...","{'mlflow.user': 'eric.peter@databricks.com', '...","[{'trace_id': 'jCPG4Yi48djINblFU/QuEA==', 'spa...",[{'assessment_id': 'a-2b0131ba1a1645a9955ac9ca...
4,tr-f15608821136a5dd85b0792c64149eee,"{""info"": {""trace_id"": ""tr-f15608821136a5dd85b0...",tr-f15608821136a5dd85b0792c64149eee,TraceState.OK,1754453491676,18370,"{'customer_name': 'AeroDynamics', 'user_input'...",{'email_subject': 'AeroDynamics Risk Assessmen...,"{'mlflow.trace.sizeStats': '{""total_size_bytes...","{'mlflow.user': 'eric.peter@databricks.com', '...","[{'trace_id': '8VYIghE2pd2FsHksZBSe7g==', 'spa...",[{'assessment_id': 'a-beb09050606e4d229363fe6a...
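
Each row's `assessments` column holds a list of assessment dictionaries. To check whether automated judges align with human labels, you can compare the two sources per assessment name. The sketch below assumes each dictionary has `assessment_name`, `source.source_type` (`HUMAN` or `LLM_JUDGE`), and `feedback.value` keys. That layout is an assumption, since the output above is truncated, so inspect one real entry before relying on it. The helper `agreement_by_metric` is hypothetical and the sample data is made up.

```python
from collections import defaultdict


def agreement_by_metric(assessment_lists):
    """Count human/LLM-judge agreement per assessment name.

    `assessment_lists` yields one list of assessment dicts per trace
    (ASSUMED keys: assessment_name, source.source_type, feedback.value).
    Returns {assessment_name: (matches, comparisons)}.
    """
    results = defaultdict(lambda: [0, 0])
    for assessments in assessment_lists:
        by_name = defaultdict(dict)
        for a in assessments:
            value = a.get("feedback", {}).get("value")
            source_type = a.get("source", {}).get("source_type")
            if value is None or source_type is None:
                continue
            # If a source labeled the same name twice, the last entry wins.
            by_name[a["assessment_name"]][source_type] = str(value).lower()
        for name, values in by_name.items():
            # Only compare when both a human and a judge assessed this name.
            if "HUMAN" in values and "LLM_JUDGE" in values:
                results[name][1] += 1
                results[name][0] += int(values["HUMAN"] == values["LLM_JUDGE"])
    return {name: tuple(counts) for name, counts in results.items()}


# Made-up sample data in the assumed layout: two traces.
sample = [
    [
        {"assessment_name": "accuracy", "source": {"source_type": "HUMAN"}, "feedback": {"value": "yes"}},
        {"assessment_name": "accuracy", "source": {"source_type": "LLM_JUDGE"}, "feedback": {"value": "yes"}},
    ],
    [
        {"assessment_name": "accuracy", "source": {"source_type": "HUMAN"}, "feedback": {"value": "no"}},
        {"assessment_name": "accuracy", "source": {"source_type": "LLM_JUDGE"}, "feedback": {"value": "yes"}},
        {"assessment_name": "relevance", "source": {"source_type": "HUMAN"}, "feedback": {"value": "yes"}},
    ],
]

print(agreement_by_metric(sample))
```

With the DataFrame from the cell above, the call would be `agreement_by_metric(traces['assessments'])`. Values are compared as lowercased strings, so if your scorers return booleans, map them to `yes`/`no` first.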


# 🎯 Summary and Next Steps

Congratulations! You've successfully implemented human review workflows for GenAI quality improvement.

## What You've Accomplished

✅ **Added sample traces** to a labeling session for expert review  
✅ **Created labeling schemas** that capture business-critical quality aspects  
✅ **Set up labeling sessions** accessible through the Review App  
✅ **Learned the workflow** for collecting structured expert feedback

## Key Takeaways

- **Human expertise is invaluable** for understanding quality in your domain
- **Structured schemas** ensure consistent, actionable feedback
- **The Review App** makes it easy for non-technical users to contribute
- **Labels become training data** for automated quality assessment

**📚 Continue Learning**

- [Human Feedback Collection](https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/) - Complete guide
- [Building Evaluation Datasets](https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/build-eval-dataset) - Use labels for evaluation
