# 🏥 Pharmaceutical Field Force Effectiveness - Anomaly Detection Demo

## Business Context

**Scenario**: Monitor sales rep performance across regions to detect unusual patterns:
- 💰 Expense fraud or policy violations
- 📉 Territory coverage issues and productivity gaps
- 🚨 Unrealistic prescription claims
- 📚 Training needs identification

**Data**: Sales rep daily activity with calls, prescriptions, samples, expenses across US, EU, APAC

## What You'll Learn (45 min comprehensive demo)

1. **Auto-Discovery** - Zero-config vs manual tuning
2. **Segment-Based Monitoring** - Regional baselines (US vs EU vs APAC)
3. **Parameter Tuning** - Contamination, hyperparameters, model comparison
4. **Feature Contributions** - SHAP-based root cause analysis
5. **Drift Detection** - When to retrain models
6. **Multi-Type Features** - Numeric, categorical, datetime, boolean
7. **Production Integration** - DQEngine, YAML, quarantine workflows

---

**📋 Table of Contents:**
- Section 1: Setup & Data Generation (5 min)
- Section 2: Auto-Discovery vs Manual Tuning (12 min)
- Section 3: Segment-Based Monitoring (8 min)
- Section 4: Feature Contributions & Root Cause (8 min)
- Section 5: Drift Detection & Retraining (6 min)
- Section 6: Production Integration (6 min)


---

## Section 1: Setup & Data Generation (5 min)

First, install DQX with anomaly support if not already installed:
```bash
%pip install databricks-labs-dqx[anomaly]
```


In [None]:
# Imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import random
import numpy as np

from databricks.labs.dqx.anomaly import AnomalyEngine, has_no_anomalies, AnomalyParams, IsolationForestConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.check_funcs import is_not_null, is_in_range
from databricks.labs.dqx.config import OutputConfig
from databricks.sdk import WorkspaceClient

# Initialize
spark = SparkSession.builder.getOrCreate()
ws = WorkspaceClient()
dq_engine = DQEngine(ws)
anomaly_engine = AnomalyEngine(ws)

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"   Spark version: {spark.version}")


In [None]:
# === CONFIGURABLE PARAMETERS ===
# Adjust these to experiment with different thresholds

# Anomaly score threshold (0-1 scale)
# Lower = more sensitive (more anomalies detected, higher false positives)
# Higher = more specific (fewer anomalies, lower false positives)
ANOMALY_SCORE_THRESHOLD = 0.5

# Drift detection threshold (z-score)
# Typical range: 2.0 (sensitive) to 5.0 (conservative)
DRIFT_THRESHOLD = 3.0

print(f"📋 Configuration:")
print(f"   Anomaly Score Threshold: {ANOMALY_SCORE_THRESHOLD}")
print(f"   Drift Detection Threshold: {DRIFT_THRESHOLD}")
print(f"\n💡 You can change these values and re-run cells to see different results")
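
To make the drift threshold concrete, here is a minimal sketch of one common z-score formulation: the shift in a batch mean, measured in standard errors of the training baseline. This is an illustrative assumption, and DQX's own drift computation may differ. The helper `drift_z_score` and the sample values are hypothetical.

```python
import math
import statistics

def drift_z_score(baseline, batch):
    """Z-score of the batch mean against the baseline mean (hypothetical helper)."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.stdev(baseline)
    # Standard error of a mean computed over len(batch) observations
    std_error = base_std / math.sqrt(len(batch))
    return (statistics.mean(batch) - base_mean) / std_error

DRIFT_THRESHOLD = 3.0  # same value as the configuration cell

# Hypothetical daily expense values
baseline_expenses = [140.0, 150.0, 160.0, 145.0, 155.0, 150.0, 148.0, 152.0]
stable_batch = [149.0, 151.0, 147.0, 153.0]
shifted_batch = [180.0, 175.0, 185.0, 190.0]

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    z = drift_z_score(baseline_expenses, batch)
    print(f"{name}: z = {z:.2f}, drift = {abs(z) >= DRIFT_THRESHOLD}")
```

With a threshold of 2.0 smaller shifts would be reported as drift, and with 5.0 only large shifts would be.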


### Generate Realistic Sales Rep Activity Data

We'll create 10,000 rows of daily sales rep activity with:
- **Mixed data types**: Numeric, categorical, datetime, boolean
- **Regional patterns**: Different baselines for US (high expenses), EU (moderate), APAC (high volume)
- **Call type variations**: Promotional (higher cost), Educational (more samples), Follow-up (shorter)
- **Temporal trends**: Seasonal variations (Q4 boost), weekday patterns
- **Rep-specific behavior**: Consistent performers vs. inconsistent ones
- **Realistic correlations**: More calls → more prescriptions, promotional → higher expenses
- **Injected anomalies**: ~3% anomalous records (expense fraud, low productivity, data quality issues)
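
The patterns above combine multiplicatively in the generator. As a worked example, the sketch below computes the expected (pre-noise) calls and expenses for one case: a US rep on a promotional call, on a Monday in Q4. The baselines and modifiers are taken from the generator code that follows; the rep multiplier of 1.3 is a hypothetical value from the high-performer range.

```python
# Baselines and modifiers as defined in the data generator
pattern = {"calls": 8, "expenses": 150}            # US region
call_modifier = {"calls": 1.0, "expenses": 1.3}    # promotional call

rep_multiplier = 1.3        # hypothetical high performer (range 1.2-1.5)
seasonal_multiplier = 1.15  # Q4 boost
weekday_multiplier = 1.05   # Monday or Tuesday

combined = rep_multiplier * seasonal_multiplier * weekday_multiplier
calls_base = pattern["calls"] * call_modifier["calls"] * combined
expenses_base = pattern["expenses"] * call_modifier["expenses"] * combined

print(f"combined multiplier: {combined:.3f}")
print(f"expected calls: {calls_base:.1f}, expected expenses: ${expenses_base:.2f}")
```

The generator then draws the actual values from a normal distribution centered on these means (standard deviation 20% of the mean for calls, 15% for expenses).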


In [None]:
# Generate sales rep activity data with realistic patterns
def generate_field_force_data(num_rows=10000, anomaly_rate=0.03):
    """Generate pharmaceutical field force activity data with realistic patterns."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    num_reps = 100  # Number of distinct sales reps
    
    # Regional baseline patterns (realistic differences)
    regional_patterns = {
        "US": {"calls": 8, "prescriptions": 12, "samples": 25, "expenses": 150, "remote_rate": 0.4},
        "EU": {"calls": 6, "prescriptions": 9, "samples": 18, "expenses": 100, "remote_rate": 0.5},
        "APAC": {"calls": 10, "prescriptions": 15, "samples": 30, "expenses": 120, "remote_rate": 0.3},
    }
    
    # Call type modifiers (affect baseline metrics)
    call_type_modifiers = {
        "promotional": {"calls": 1.0, "prescriptions": 1.2, "samples": 1.4, "expenses": 1.3, "remote": 0.2},
        "educational": {"calls": 0.9, "prescriptions": 0.8, "samples": 1.6, "expenses": 1.1, "remote": 0.6},
        "follow_up": {"calls": 0.7, "prescriptions": 1.0, "samples": 0.6, "expenses": 0.8, "remote": 0.5},
    }
    
    # Rep performance profiles (some reps are consistently better/worse)
    rep_profiles = {}
    for rep_id in range(num_reps):
        # 70% average, 20% high performers, 10% low performers
        perf_type = np.random.choice(["average", "high", "low"], p=[0.7, 0.2, 0.1])
        if perf_type == "high":
            multiplier = np.random.uniform(1.2, 1.5)
        elif perf_type == "low":
            multiplier = np.random.uniform(0.6, 0.8)
        else:
            multiplier = np.random.uniform(0.9, 1.1)
        rep_profiles[f"REP{rep_id:03d}"] = multiplier
    
    start_date = datetime(2024, 1, 1)
    end_date = datetime(2024, 12, 31)
    total_days = (end_date - start_date).days
    
    for i in range(num_rows):
        rep_id = f"REP{i % num_reps:03d}"
        region = random.choice(regions)
        call_type = random.choice(call_types)
        
        # Get baseline patterns
        pattern = regional_patterns[region]
        call_modifier = call_type_modifiers[call_type]
        rep_multiplier = rep_profiles[rep_id]
        
        # Generate date with temporal trends
        days_offset = random.randint(0, total_days)
        call_date = start_date + timedelta(days=days_offset)
        
        # Seasonal multiplier (Q4 boost for pharma year-end push)
        month = call_date.month
        if month in [10, 11, 12]:  # Q4
            seasonal_multiplier = 1.15
        elif month in [1, 2]:  # Post-holiday slump
            seasonal_multiplier = 0.9
        else:
            seasonal_multiplier = 1.0
        
        # Weekday effect (lower activity on Fridays, slightly higher on Mondays and Tuesdays)
        weekday = call_date.weekday()
        if weekday == 4:  # Friday
            weekday_multiplier = 0.85
        elif weekday in [0, 1]:  # Monday, Tuesday
            weekday_multiplier = 1.05
        else:
            weekday_multiplier = 1.0
        
        # Track ground truth for validation
        anomaly_type_label = None
        
        # Normal patterns (97% of data)
        if random.random() > anomaly_rate:
            anomaly_type_label = "normal"
            
            # Base metrics with all multipliers
            combined_multiplier = rep_multiplier * seasonal_multiplier * weekday_multiplier
            
            # Calls (with correlation to rep performance)
            calls_base = pattern["calls"] * call_modifier["calls"] * combined_multiplier
            calls = max(1, int(np.random.normal(calls_base, calls_base * 0.2)))
            
            # Prescriptions (correlated with calls - more calls → more prescriptions)
            prescriptions_base = pattern["prescriptions"] * call_modifier["prescriptions"] * combined_multiplier
            # Add correlation: prescription rate increases slightly with more calls
            call_correlation = min(1.2, calls / calls_base)
            prescriptions = max(0, int(np.random.normal(prescriptions_base * call_correlation, prescriptions_base * 0.25)))
            
            # Samples (correlated with call type)
            samples_base = pattern["samples"] * call_modifier["samples"] * combined_multiplier
            samples = max(0, int(np.random.normal(samples_base, samples_base * 0.2)))
            
            # Expenses (correlated with calls and call type)
            expenses_base = pattern["expenses"] * call_modifier["expenses"] * combined_multiplier
            # Add correlation: more calls = slightly higher expenses
            expense_correlation = min(1.15, calls / calls_base)
            expenses = max(10.0, round(np.random.normal(expenses_base * expense_correlation, expenses_base * 0.15), 2))
            
            # Remote flag (depends on call type and region)
            is_remote = random.random() < (pattern["remote_rate"] * call_modifier["remote"])
            
        else:
            # Inject realistic anomalies (3% of data)
            anomaly_type = random.choice([
                "high_expense_fraud",
                "low_productivity",
                "unrealistic_prescriptions",
                "data_quality_issue"
            ])
            anomaly_type_label = anomaly_type
            
            if anomaly_type == "high_expense_fraud":
                # Excessive expenses with low output (potential fraud)
                calls = max(1, int(pattern["calls"] * 0.4))
                prescriptions = max(0, int(pattern["prescriptions"] * 0.3))
                samples = max(0, int(pattern["samples"] * 0.5))
                expenses = round(pattern["expenses"] * random.uniform(3.0, 5.0), 2)
                is_remote = False  # Fraudsters often claim in-person visits
                
            elif anomaly_type == "low_productivity":
                # Many calls but few results (training need or territory issue)
                calls = int(pattern["calls"] * random.uniform(2.0, 3.0))
                prescriptions = max(0, int(pattern["prescriptions"] * random.uniform(0.15, 0.3)))
                samples = int(pattern["samples"] * random.uniform(0.4, 0.6))
                expenses = round(pattern["expenses"] * random.uniform(1.3, 1.6), 2)
                is_remote = random.random() < 0.4
                
            elif anomaly_type == "unrealistic_prescriptions":
                # Suspiciously high prescription rate (investigation needed)
                calls = max(1, int(pattern["calls"] * random.uniform(0.8, 1.2)))
                prescriptions = int(pattern["prescriptions"] * random.uniform(3.0, 5.0))
                samples = int(pattern["samples"] * random.uniform(2.0, 3.0))
                expenses = round(pattern["expenses"] * random.uniform(0.9, 1.3), 2)
                is_remote = False
                
            else:  # data_quality_issue
                # Outliers that don't follow normal patterns (data entry errors)
                calls = random.choice([0, int(pattern["calls"] * 10)])  # Either 0 or way too high
                prescriptions = random.choice([int(pattern["prescriptions"] * -1) if random.random() < 0.3 else 0, 
                                              int(pattern["prescriptions"] * 8)])
                samples = random.choice([0, int(pattern["samples"] * 12)])
                expenses = float(random.choice([1, pattern["expenses"] * 20]))  # float() keeps the value compatible with DoubleType
                is_remote = random.random() < 0.5
        
        data.append((
            f"ACT{i:06d}",  # Unique activity_id (primary key)
            rep_id,
            region,
            call_date,
            calls,
            prescriptions,
            samples,
            expenses,
            is_remote,
            call_type,
            anomaly_type_label  # Ground truth for validation
        ))
    
    return data

# Generate data
print("🔄 Generating sales rep activity data with realistic patterns...")
field_force_data = generate_field_force_data(num_rows=10000, anomaly_rate=0.03)

schema = StructType([
    StructField("activity_id", StringType(), False),  # Primary key
    StructField("rep_id", StringType(), False),
    StructField("region", StringType(), False),
    StructField("call_date", DateType(), False),
    StructField("calls_made", IntegerType(), False),
    StructField("prescriptions_generated", IntegerType(), False),
    StructField("samples_distributed", IntegerType(), False),
    StructField("expenses", DoubleType(), False),
    StructField("is_remote", BooleanType(), False),
    StructField("call_type", StringType(), False),
    StructField("true_anomaly_type", StringType(), False),  # Ground truth
])

df_sales = spark.createDataFrame(field_force_data, schema)

print("\n📊 Sample of field force activity data:")
display(df_sales.orderBy("call_date").limit(10))

print(f"\n✅ Generated {df_sales.count()} rows with ~3% injected anomalies")
print(f"   Regions: {df_sales.select('region').distinct().count()}")
print(f"   Call types: {df_sales.select('call_type').distinct().count()}")
print(f"   Unique reps: {df_sales.select('rep_id').distinct().count()}")
print(f"   Date range: {df_sales.agg(F.min('call_date'), F.max('call_date')).first()}")
print(f"   Segments (region × call_type): {df_sales.select('region', 'call_type').distinct().count()}")


In [None]:
# Save to table for training
catalog = spark.sql("SELECT current_catalog()").first()[0]
schema_name = "dqx_demo"
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema_name}")

table_name = f"{catalog}.{schema_name}.field_force_activity"
df_sales.write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Data saved to: {table_name}")


In [None]:
# Split data into training (80%) and test (20%) sets
# Training: random 80% sample used to learn patterns (it still contains ~3% injected anomalies)
# Test: held-out 20% scored as if it were new production data

df_train, df_test = df_sales.randomSplit([0.8, 0.2], seed=42)

print("📊 Data Split:")
print(f"   Training set: {df_train.count()} rows")
print(f"   Test set: {df_test.count()} rows")
print("\n💡 We train on historical data and score on new data (like production)")


---

## Section 2: Auto-Discovery vs Manual Tuning (12 min)

### 2.1 Auto-Discovery (Zero Configuration)

Let's start with zero configuration - DQX will automatically select columns and detect segments.


In [None]:
# Let DQX automatically discover the best columns and segments
# Use exclude_columns to skip ID and ground truth columns
# This enables auto-discovery on remaining columns
model_name_auto = anomaly_engine.train(
    df=df_train,
    model_name="field_force_auto",
    exclude_columns=['activity_id', 'true_anomaly_type'],  # Exclude ID and ground truth
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)
print("\n📊 Auto-discovery complete!")
print(f"   Model: {model_name_auto}")


In [None]:
# Check what was auto-discovered
registry_df = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry")

# For segmented models, we query by the base model name
# The registry contains individual segment entries
base_name_parts = model_name_auto.split(".")
if len(base_name_parts) == 3:
    base_model_name_only = base_name_parts[2]  # Get just the model name without catalog.schema
else:
    base_model_name_only = model_name_auto

# Get a representative segment to show configuration
sample_model = registry_df.filter(
    F.col("model_name").like(f"%{base_model_name_only}%")
).orderBy("training_time").first()

print("\n📋 Auto-Discovered Configuration:")
if sample_model:
    print(f"   Model: {model_name_auto}")
    print(f"   Columns: {sample_model['columns']}")
    print(f"   Segments: {sample_model['segment_by']}")
    print(f"   Column types: {sample_model['column_types']}")
    
    # Count total segments if segmented model
    if sample_model['segment_by']:
        segment_count = registry_df.filter(
            F.col("model_name").like(f"%{base_model_name_only}%")
        ).count()
        print(f"   Total segments trained: {segment_count}")
else:
    print("   ⚠️ Model not found in registry")

print("\n💡 DQX prioritized: numeric > boolean > categorical > datetime")


In [None]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with auto-discovered model (just pass the model name!)
checks_auto = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_auto,  # Just pass the model name - segments are handled automatically!
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_auto = dq_engine.apply_checks(df_test, checks_auto)
anomalies_auto = df_scored_auto.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD)

print(f"\n⚠️  Auto-discovery found {anomalies_auto.count()} anomalies (threshold: {ANOMALY_SCORE_THRESHOLD}):\n")
display(anomalies_auto.orderBy(F.col("anomaly_score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).limit(10))


### 2.1.1 Explore the `_info` Column

DQX provides a structured `_info` column that contains all anomaly metadata in one place.


In [None]:
# Display the _info column structure
print("📊 The _info Column Structure:\n")
print("_info: struct {")
print("  anomaly: struct {")
print("    check_name: string        # Check function name")
print("    score: double              # Anomaly score (0-1)")
print("    is_anomaly: boolean        # True if score > threshold")
print("    threshold: double          # Detection threshold used")
print("    model: string              # Model name")
print("    segment: map<string,string> # Segment values (null for global)")
print("    contributions: map<string,double> # SHAP values (if requested)")
print("    confidence_std: double     # Ensemble std (if ensemble)")
print("  }")
print("}")

print("\n📋 Sample _info values for detected anomalies:\n")

# Show top 3 anomalies with their _info (using _info to filter)
sample_anomalies = df_scored_auto.filter(
    F.col("_info.anomaly.is_anomaly")  # ✅ Recommended way
).orderBy(F.col("_info.anomaly.score").desc()).limit(3)

for row in sample_anomalies.collect():
    print(f"Activity {row['activity_id']}:")
    print(f"  Region: {row['region']}, Calls: {row['calls_made']}, Expenses: ${row['expenses']:.2f}")
    
    # Extract the nested _info.anomaly struct
    anomaly_info = row['_info']['anomaly']
    print(f"  _info.anomaly:")
    print(f"    check_name: {anomaly_info['check_name']}")
    print(f"    score: {anomaly_info['score']:.3f}")
    print(f"    is_anomaly: {anomaly_info['is_anomaly']}")
    print(f"    threshold: {anomaly_info['threshold']}")
    print(f"    model: {anomaly_info['model']}")
    print(f"    segment: {anomaly_info['segment']}")
    print()

print("💡 Benefits of _info column:")
print("   ✅ All metadata in one place")
print("   ✅ Self-documenting schema")
print("   ✅ Easy to query: df.filter(col('_info.anomaly.is_anomaly'))")
print("   ✅ Extensible: Future checks (drift, profiling) can add their own keys")


### 2.1.2 Validate Auto-Discovery Performance

Let's validate how well auto-discovery worked by checking detection rate and false positives.


In [None]:
# === COMPREHENSIVE VALIDATION WITH GROUND TRUTH ===

print(f"🔍 Validation using threshold: {ANOMALY_SCORE_THRESHOLD}\n")

# Classify predictions (deduplication now handled in library)
df_classified = df_scored_auto.withColumn(
    "predicted_anomaly",
    F.when(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD, True).otherwise(False)
).withColumn(
    "is_true_anomaly",
    F.when(F.col("true_anomaly_type") != "normal", True).otherwise(False)
)

# === 1. Overall Confusion Matrix ===
print("📊 Confusion Matrix:")
confusion = df_classified.groupBy("is_true_anomaly", "predicted_anomaly").count()
display(confusion.orderBy("is_true_anomaly", "predicted_anomaly"))

# Calculate metrics
tp = df_classified.filter((F.col("is_true_anomaly") == True) & (F.col("predicted_anomaly") == True)).count()
fp = df_classified.filter((F.col("is_true_anomaly") == False) & (F.col("predicted_anomaly") == True)).count()
tn = df_classified.filter((F.col("is_true_anomaly") == False) & (F.col("predicted_anomaly") == False)).count()
fn = df_classified.filter((F.col("is_true_anomaly") == True) & (F.col("predicted_anomaly") == False)).count()

precision = tp / (tp + fp) if (tp + fp) > 0 else 0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0
f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

print("\n📈 Performance Metrics:")
metrics_df = spark.createDataFrame([
    ("Precision", round(precision * 100, 2)),
    ("Recall", round(recall * 100, 2)),
    ("F1-Score", round(f1_score * 100, 2))
], ["Metric", "Value_%"])
display(metrics_df)

# === 2. Detection Rate by Anomaly Type ===
print("\n🎯 Detection Performance by Anomaly Type:")

anomaly_type_stats = df_classified.filter(
    F.col("true_anomaly_type") != "normal"
).groupBy("true_anomaly_type").agg(
    F.count("*").alias("total_count"),
    F.sum(F.when(F.col("predicted_anomaly"), 1).otherwise(0)).alias("detected_count"),
    F.avg("anomaly_score").alias("avg_score"),
    F.max("anomaly_score").alias("max_score"),
    F.min("anomaly_score").alias("min_score")
).withColumn(
    "detection_rate",
    (F.col("detected_count") / F.col("total_count") * 100).cast("decimal(5,2)")
).orderBy(F.desc("detection_rate"))

display(anomaly_type_stats)

# === 3. Score Distribution by Type (Visualization) ===
print("\n📊 Score Distribution by Anomaly Type:")
score_by_type = df_classified.select(
    "true_anomaly_type",
    F.round("anomaly_score", 2).alias("anomaly_score")
)
display(score_by_type)  # Databricks will auto-create histogram

# === 4. False Positives Analysis ===
false_positives = df_classified.filter(
    (F.col("true_anomaly_type") == "normal") & 
    (F.col("predicted_anomaly") == True)
)

print(f"\n⚠️ False Positives: {false_positives.count()} records")
print(f"Top false positive scores:")
display(
    false_positives.select(
        "activity_id", "rep_id", "region", 
        "calls_made", "expenses",
        F.round("anomaly_score", 3).alias("score")
    ).orderBy(F.desc("anomaly_score")).limit(5)
)

# === 5. Threshold Sensitivity Analysis ===
print(f"\n🔬 Impact of Different Thresholds:")
threshold_analysis = []
for thresh in [0.3, 0.4, 0.5, 0.6, 0.7]:
    detected = df_classified.filter(
        (F.col("is_true_anomaly") == True) & 
        (F.col("anomaly_score") >= thresh)
    ).count()
    false_pos = df_classified.filter(
        (F.col("is_true_anomaly") == False) & 
        (F.col("anomaly_score") >= thresh)
    ).count()
    total_anomalies = df_classified.filter(F.col("is_true_anomaly") == True).count()
    
    threshold_analysis.append((
        thresh,
        detected,
        round(detected/total_anomalies*100, 1) if total_anomalies > 0 else 0,
        false_pos
    ))

threshold_df = spark.createDataFrame(
    threshold_analysis,
    ["Threshold", "Detected", "Detection_Rate_%", "False_Positives"]
)
display(threshold_df)

print(f"\n💡 Compare the row for the current threshold ({ANOMALY_SCORE_THRESHOLD}) with the other rows above.")
print(f"   Lower threshold = more sensitive (catches more, but more false alarms)")
print(f"   Higher threshold = more specific (misses some, but fewer false alarms)")

# Store metrics for comparison later
expected_anomalies = df_classified.filter(F.col("is_true_anomaly") == True).count()
normal_records_expected = df_classified.filter(F.col("is_true_anomaly") == False).count()
detected_anomalies = anomalies_auto.count()
detection_rate = (recall * 100)
fp_rate = (fp / normal_records_expected * 100) if normal_records_expected > 0 else 0

auto_metrics = {
    "detected": detected_anomalies,
    "detection_rate": detection_rate,
    "fp_rate": fp_rate
}
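
The precision, recall, and F1 formulas used in the validation cell can be checked by hand on a small confusion matrix. The sketch below applies the same formulas to hypothetical counts; `classification_metrics` is an illustrative helper, not part of DQX.

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts: 60 true anomalies in the test set, 45 of them flagged,
# plus 15 normal records flagged by mistake
precision, recall, f1 = classification_metrics(tp=45, fp=15, fn=15)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "how many flagged records were real anomalies", while recall answers "how many real anomalies were flagged"; lowering the score threshold typically raises recall at the cost of precision.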


In [None]:
# === SHAP FEATURE CONTRIBUTIONS BY ANOMALY TYPE ===
print("🔬 Computing SHAP contributions per anomaly type...")
print("   (This may take a moment...)\n")

# Re-score with SHAP enabled
checks_with_shap = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_auto,
            "score_threshold": ANOMALY_SCORE_THRESHOLD,
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"],
            "include_contributions": True  # <- Enable SHAP
        }
    )
]

df_with_shap = dq_engine.apply_checks(df_test, checks_with_shap)

# Use the scored DataFrame directly (deduplication now handled in library)
df_shap_truth = df_with_shap

# === Analyze Top Contributors Per Anomaly Type ===
for anomaly_type in ["high_expense_fraud", "low_productivity", "unrealistic_prescriptions", "data_quality_issue"]:
    print(f"\n{'='*60}")
    print(f"📊 SHAP Analysis: {anomaly_type.upper().replace('_', ' ')}")
    print(f"{'='*60}")
    
    # Get samples of this type that were detected
    samples = df_shap_truth.filter(
        (F.col("true_anomaly_type") == anomaly_type) &
        (F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD)
    ).select(
        "activity_id", 
        "anomaly_score",
        "anomaly_contributions",
        "calls_made", "prescriptions_generated", "expenses"
    ).limit(3)
    
    # Check if we have samples
    sample_count = samples.count()
    if sample_count == 0:
        print(f"   ⚠️ No detected anomalies of this type (try lowering threshold)")
        continue
    
    # Display the anomalies
    print(f"\n✅ Sample detected anomalies ({sample_count}):")
    display(samples.select("activity_id", "calls_made", "prescriptions_generated", "expenses", 
                           F.round("anomaly_score", 3).alias("score")))
    
    # Extract and display top SHAP contributors
    # Note: anomaly_contributions is a Map<String, Double>
    for row in samples.collect():
        contributions = row["anomaly_contributions"]
        if contributions:
            print(f"\n  🔍 Activity {row['activity_id']} - Top contributing features:")
            sorted_contrib = sorted(contributions.items(), key=lambda x: abs(x[1]), reverse=True)[:5]
            for feature, value in sorted_contrib:
                print(f"      {feature:30s}: {value*100:6.1f}%")

print(f"\n\n💡 Key Insights:")
print(f"   • high_expense_fraud: Usually driven by 'expenses' features")
print(f"   • low_productivity: Driven by calls vs prescriptions ratio")
print(f"   • unrealistic_prescriptions: Driven by prescription-related features")
print(f"   • data_quality_issue: Mix of features with extreme values")
print(f"\n✅ SHAP helps explain WHY each anomaly was detected!")


### 2.2 Manual Column Selection & Parameter Tuning

Now let's select specific columns by hand. This model keeps the default hyperparameters; Section 3 shows how to override them with `AnomalyParams`.


In [None]:
# Manually specify columns (no segmentation)
# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_manual = anomaly_engine.train(
    df=df_train,
    columns=["calls_made", "prescriptions_generated", "expenses", "samples_distributed"],
    model_name="field_force_manual",
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)


In [None]:
# Score with manually configured model
checks_manual = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_manual,
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_manual = dq_engine.apply_checks(df_test, checks_manual)
anomalies_manual = df_scored_manual.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD)

print(f"\n⚠️  Manual config found {anomalies_manual.count()} anomalies (threshold: {ANOMALY_SCORE_THRESHOLD}):\n")
display(anomalies_manual.orderBy(F.col("anomaly_score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).limit(10))


# Store metrics for comparison
# Use the ground-truth label to separate true detections from false positives,
# so these numbers are comparable with the auto-discovery metrics above
manual_detected = anomalies_manual.count()
manual_flagged = manual_detected
manual_tp = anomalies_manual.filter(F.col("true_anomaly_type") != "normal").count()
manual_fp = manual_flagged - manual_tp
manual_detection_rate = (manual_tp / expected_anomalies) * 100 if expected_anomalies > 0 else 0
manual_fp_rate = (manual_fp / normal_records_expected) * 100 if normal_records_expected > 0 else 0

manual_metrics = {
    "detected": manual_detected,
    "detection_rate": manual_detection_rate,
    "fp_rate": manual_fp_rate
}

print(f"\n📊 Manual Tuning Performance: {manual_detected} flagged ({manual_detection_rate:.1f}% detection rate), {manual_fp_rate:.2f}% FP")


### 2.3 Model Comparison

Let's compare the auto-discovered vs manually tuned models:


In [None]:
# Compare models
print("📊 Model Comparison:\n")
# Filter on model_name, the column used for registry lookups elsewhere in this notebook
comparison = registry_df.filter(
    F.col("model_name").isin([model_name_auto, model_name_manual])
).select(
    "model_name",
    "columns",
    "training_rows",
    "metrics"
).collect()

for model in comparison:
    print(f"{'='*60}")
    print(f"Model: {model['model_name']}")
    print(f"Columns: {model['columns']}")
    print(f"Training rows: {model['training_rows']}")
    print(f"Metrics: {model['metrics']}")
    print()

print("💡 Tuning Tips:")
print("   - contamination: Set to expected anomaly rate (0.01-0.1)")
print("   - num_trees: More trees = more stable (100-200)")
print("   - max_samples: Smaller = faster, larger = more accurate (256-1024)")
print("   - Start with auto-discovery, then refine based on domain knowledge")
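
To see how `max_samples` relates to a 0-1 anomaly score, the sketch below implements the standard Isolation Forest scoring formula, s = 2^(-E[h] / c(n)), where E[h] is a record's mean path length across trees and c(n) is the expected path length for a subsample of size n. Whether DQX reports exactly this raw score is an assumption, and the function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Isolation Forest score: above 0.5 = isolated quickly, below 0.5 = hard to isolate."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # subsample size, i.e. max_samples
for depth in (3.0, average_path_length(n), 14.0):
    print(f"mean path length {depth:5.2f} -> score {anomaly_score(depth, n):.3f}")
```

A record whose mean path length equals c(n) scores exactly 0.5, which is one reason 0.5 is a natural starting threshold. More trees reduce the variance of the mean path length, which makes scores more stable.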


---

## Section 3: Segment-Based Monitoring (8 min)

Different regions have different patterns. Train per-region models for accurate baselines.
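
A small numeric sketch shows why a per-region baseline matters. It uses plain z-scores on hypothetical expense values rather than an Isolation Forest, so it only illustrates the intuition: a value that looks ordinary against the pooled data can be extreme within its own region.

```python
import statistics

# Hypothetical daily expenses for two regions with different baselines
expenses = {
    "US": [140.0, 150.0, 160.0, 145.0, 155.0],
    "EU": [95.0, 100.0, 105.0, 98.0, 102.0],
}

def z_score(value, sample):
    """Distance of value from the sample mean, in population standard deviations."""
    return (value - statistics.mean(sample)) / statistics.pstdev(sample)

eu_claim = 135.0  # an EU expense claim to evaluate
pooled = expenses["US"] + expenses["EU"]

z_pooled = z_score(eu_claim, pooled)
z_regional = z_score(eu_claim, expenses["EU"])
print(f"pooled baseline:  z = {z_pooled:.2f}")
print(f"EU-only baseline: z = {z_regional:.2f}")
```

Against the pooled data the claim sits well within one standard deviation of the mean, while against the EU baseline it is a clear outlier.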


In [None]:
# Train with regional segmentation
print("🌍 Training region-specific anomaly models...\n")

# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_segmented = anomaly_engine.train(
    df=df_train,
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],  # Train separate model per region
    model_name="field_force_regional",
    params=AnomalyParams(
        algorithm_config=IsolationForestConfig(contamination=0.05, num_trees=150, random_seed=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Regional models trained!")
print("   DQX automatically trained 3 models (US, EU, APAC)")


In [None]:
# Compare regional baselines
regional_models = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry").filter(
    F.col("model_name") == "field_force_regional"
)

print("📊 Regional Model Baselines:\n")
for row in regional_models.select("segment_values", "training_rows", "baseline_stats").collect():
    region = row['segment_values']['region']
    print(f"Region: {region}")
    print(f"  Training rows: {row['training_rows']}")
    print(f"  Baseline stats: {row['baseline_stats']}")
    print()

print("🔍 Notice: Each region has different baselines!")
print("   US: Higher expenses ($150 avg)")
print("   EU: Lower expenses ($100 avg)")
print("   APAC: Highest volume (10 calls, 15 prescriptions avg)")


In [None]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with regional models (automatic routing)
checks_regional = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_regional = dq_engine.apply_checks(df_test, checks_regional)

print(f"⚠️  Regional anomalies by region (threshold: {ANOMALY_SCORE_THRESHOLD}):\n")
display(df_scored_regional.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD).groupBy("region").agg(
    F.count("*").alias("anomaly_count"),
    F.avg("anomaly_score").alias("avg_score"),
    F.max("anomaly_score").alias("max_score")
).orderBy("region"))

print("\n📋 Top regional anomalies:")
display(df_scored_regional.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD).orderBy(
    F.col("anomaly_score").desc()
).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).limit(10))


# Store metrics for comparison
# Note: "detected" counts every row at or above the threshold, so it can include false positives.
anomalies_regional = df_scored_regional.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD)
segmented_detected = anomalies_regional.count()
segmented_detection_rate = (segmented_detected / expected_anomalies) * 100 if expected_anomalies > 0 else 0
segmented_flagged = segmented_detected  # same filter as above; reuse the count instead of rescanning
segmented_fp = max(0, segmented_flagged - expected_anomalies)
segmented_fp_rate = (segmented_fp / normal_records_expected) * 100 if normal_records_expected > 0 else 0

segmented_metrics = {
    "detected": segmented_detected,
    "detection_rate": segmented_detection_rate,
    "fp_rate": segmented_fp_rate
}

print(f"\n📊 Segmented Performance: {segmented_detected} detected ({segmented_detection_rate:.1f}% rate), {segmented_fp_rate:.2f}% FP")


---

## Section 4: Feature Contributions & Root Cause (8 min)

**Why is a record anomalous?** Use SHAP to understand which columns drove the anomaly score.
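To make the idea behind SHAP concrete, here is a minimal sketch of exact Shapley values computed by hand. It is not how DQX or the `shap` library computes contributions: the scoring function is a toy sum of squared z-scores rather than an Isolation Forest, and the baseline means and standard deviations are hypothetical values chosen for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical baseline (mean, std) per feature; values are illustrative.
BASELINE = {
    "calls_made": (8.0, 2.0),
    "prescriptions_generated": (10.0, 3.0),
    "expenses": (150.0, 30.0),
}


def score(record: dict) -> float:
    """Toy anomaly score: sum of squared z-scores (not an Isolation Forest)."""
    return sum(((record[f] - m) / s) ** 2 for f, (m, s) in BASELINE.items())


def shapley_values(record: dict) -> dict:
    """Exact Shapley values; features outside a coalition are set to the baseline mean."""
    features = list(BASELINE)
    n = len(features)

    def value(coalition: set) -> float:
        masked = {f: record[f] if f in coalition else BASELINE[f][0] for f in features}
        return score(masked)

    contributions = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                total += weight * (value(set(subset) | {feature}) - value(set(subset)))
        contributions[feature] = total
    return contributions


record = {"calls_made": 9.0, "prescriptions_generated": 11.0, "expenses": 300.0}
contribs = shapley_values(record)
for feature, phi in sorted(contribs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature:30s}: {phi:.3f}")
```

The contributions sum to the difference between the record's score and the score of the all-baseline record, which is the property that makes Shapley-style attributions usable for root cause analysis.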


In [None]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with SHAP-based feature contributions
checks_with_contrib = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "include_contributions": True,  # Enable SHAP explanations
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_with_contrib = dq_engine.apply_checks(df_test, checks_with_contrib)

print(f"🔍 Top Anomalies with Feature Contributions (SHAP, threshold: {ANOMALY_SCORE_THRESHOLD}):\n")
anomalies_contrib = df_with_contrib.filter(
    F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD
).orderBy(F.col("anomaly_score").desc()).limit(10)

display(anomalies_contrib.select(
    "rep_id", "region",
    "calls_made", "prescriptions_generated", "samples_distributed", "expenses",
    F.round("anomaly_score", 3).alias("score"),
    "anomaly_contributions"
))


In [None]:
# Analyze contribution patterns for root cause
print("📊 Root Cause Analysis:\n")

top_anomaly = anomalies_contrib.first()
print(f"🔸 Top Anomaly: REP={top_anomaly['rep_id']}, Region={top_anomaly['region']}")
print(f"   Score: {top_anomaly['anomaly_score']:.3f}")
print(f"   Values:")
print(f"     • calls_made: {top_anomaly['calls_made']}")
print(f"     • prescriptions: {top_anomaly['prescriptions_generated']}")
print(f"     • samples: {top_anomaly['samples_distributed']}")
print(f"     • expenses: ${top_anomaly['expenses']:.2f}")
print("\n   📈 Feature Contributions (SHAP):")

if top_anomaly['anomaly_contributions']:
    sorted_contribs = sorted(
        top_anomaly['anomaly_contributions'].items(),
        key=lambda x: x[1],
        reverse=True
    )
    for feature, contribution in sorted_contribs:
        print(f"      {feature:30s}: {contribution:.3f} ({contribution*100:.1f}%)")

print("\n💡 Business Interpretation Examples:")
print("   • High 'expenses' contribution → Potential fraud or policy violation")
print("   • High 'calls_made' + low 'prescriptions' → Training need or territory issue")
print("   • High 'prescriptions' contribution → Unrealistic claims to investigate")
print("   • Balanced contributions → Multivariate anomaly (multiple factors)")


---

### 📊 Approach Comparison & Recommendations

Let's compare all three approaches to see which performed best.
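The comparison relies on two numbers per approach: a detection rate and a false-positive rate. When per-record ground truth is available (this demo's data carries a `true_anomaly_type` column), both can be computed directly from labels. The helper below is a hypothetical, plain-Python sketch of that calculation and is not part of DQX.

```python
def detection_metrics(flagged: list, is_true_anomaly: list) -> dict:
    """Detection rate (recall) and false-positive rate, in percent, from per-record booleans."""
    pairs = list(zip(flagged, is_true_anomaly))
    true_positives = sum(1 for f, t in pairs if f and t)
    false_positives = sum(1 for f, t in pairs if f and not t)
    anomalies = sum(1 for _, t in pairs if t)
    normals = len(pairs) - anomalies
    return {
        "detection_rate": 100 * true_positives / anomalies if anomalies else 0.0,
        "fp_rate": 100 * false_positives / normals if normals else 0.0,
    }


# Toy labels: 2 true anomalies among 10 records; the model flags 3 records.
flagged = [True, True, False, True, False, False, False, False, False, False]
truth = [True, False, True, False, False, False, False, False, False, False]
metrics = detection_metrics(flagged, truth)
print(metrics)
```

Counting true positives separately keeps the detection rate at or below 100%, which a ratio of flagged rows to expected anomalies does not guarantee.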


In [None]:
# Compare all three approaches
print(f"🏆 Performance Comparison (Threshold: {ANOMALY_SCORE_THRESHOLD})\n")
print("="*80)

comparison_data = [
    ("Auto-Discovery", auto_metrics['detected'], auto_metrics['detection_rate'], auto_metrics['fp_rate']),
    ("Manual Tuned", manual_metrics['detected'], manual_metrics['detection_rate'], manual_metrics['fp_rate']),
    ("Segmented (Regional)", segmented_metrics['detected'], segmented_metrics['detection_rate'], segmented_metrics['fp_rate']),
]

# Create DataFrame for comparison
comparison_df = spark.createDataFrame(comparison_data, ["Approach", "Detected", "Detection_Rate_%", "FP_Rate_%"])
display(comparison_df)

# Determine winner by detection rate (ties resolve in the order: segmented, manual, auto)
best_detection = max(auto_metrics['detection_rate'], manual_metrics['detection_rate'], segmented_metrics['detection_rate'])

print("\n🎯 Key Findings:\n")

if segmented_metrics['detection_rate'] == best_detection:
    print("✅ WINNER: Segmented approach has the BEST detection rate!")
    print(f"   {segmented_metrics['detection_rate']:.1f}% detection with {segmented_metrics['fp_rate']:.2f}% false positives")
elif manual_metrics['detection_rate'] == best_detection:
    print("✅ WINNER: Manual tuning has the BEST detection rate!")
    print(f"   {manual_metrics['detection_rate']:.1f}% detection with {manual_metrics['fp_rate']:.2f}% false positives")
else:
    print("✅ WINNER: Auto-discovery has the BEST detection rate!")
    print(f"   {auto_metrics['detection_rate']:.1f}% detection with {auto_metrics['fp_rate']:.2f}% false positives")

print("\n💡 Recommendations:\n")
print("| Approach | When to Use |")
print("|----------|-------------|")
print("| **Auto-Discovery** | Quick start, exploration, uniform data |")
print("| **Manual Tuned** | Production, known important features, single baseline |")
print("| **Segmented** | Multi-region/multi-product with different baselines |")
print("\n📈 Best Practice: Start with auto-discovery, validate results, then refine with")
print("   manual tuning or segmentation based on your business context.")

# Show what segmentation helps with
print("\n🌍 Why Segmentation Works:")
print("   • Different regions have different 'normal' patterns")
print("   • US avg expenses: $150, APAC: $120, EU: $100")
print("   • Segmented models catch region-specific anomalies better")
print("   • Reduces false positives from natural regional differences")

print(f"\n📝 Note: All approaches use the same threshold ({ANOMALY_SCORE_THRESHOLD}).")
print(f"   To experiment with different thresholds, change ANOMALY_SCORE_THRESHOLD at the top and re-run.")


---

## Section 5: Drift Detection & Retraining (6 min)

Data distributions change over time. DQX can detect when your model becomes stale.
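As a rough intuition for what a drift score measures, the sketch below compares current column means with baseline means in units of the baseline standard deviation. This is an assumed, simplified formulation for illustration; it is not necessarily how DQX's `compute_drift_score` works, and the function name, statistics, and threshold here are hypothetical.

```python
def mean_shift_drift(baseline: dict, current: dict, threshold: float) -> dict:
    """Per-column |mean shift| in baseline standard deviations; flags columns above threshold."""
    scores = {}
    for column, stats in baseline.items():
        std = stats["std"] or 1.0  # guard against zero variance
        scores[column] = abs(current[column]["mean"] - stats["mean"]) / std
    drifted = [c for c, s in scores.items() if s > threshold]
    return {"scores": scores, "drifted_columns": drifted, "drift_detected": bool(drifted)}


# Illustrative statistics (assumed, not taken from a trained model).
baseline = {"expenses": {"mean": 150.0, "std": 30.0}, "calls_made": {"mean": 8.0, "std": 2.0}}
current = {"expenses": {"mean": 100.0, "std": 20.0}, "calls_made": {"mean": 9.0, "std": 2.0}}

result = mean_shift_drift(baseline, current, threshold=1.0)
print(result["drifted_columns"], result["scores"])
```

With these assumed numbers, only `expenses` crosses the threshold, matching the kind of policy-driven shift simulated in this section.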


In [None]:
# Simulate drift: New patterns (more remote work, lower expenses post-policy change)
def generate_drifted_data(num_rows=200):
    """Generate Q3 data with shifted distribution (post-policy change)."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    
    # NEW PATTERNS: More remote work, lower expenses, similar productivity
    new_patterns = {
        "US": {"calls": (9, 2), "prescriptions": (12, 3), "samples": (20, 4), "expenses": (100, 20)},  # -33% expenses
        "EU": {"calls": (7, 1.5), "prescriptions": (9, 2), "samples": (15, 3), "expenses": (70, 15)},   # -30% expenses
        "APAC": {"calls": (11, 3), "prescriptions": (15, 4), "samples": (25, 6), "expenses": (85, 20)}, # -29% expenses
    }
    
    start_date = datetime(2024, 7, 1)  # Q3 data
    
    for i in range(num_rows):
        region = random.choice(regions)
        pattern = new_patterns[region]
        
        calls = max(1, int(np.random.normal(pattern["calls"][0], pattern["calls"][1])))
        prescriptions = max(0, int(np.random.normal(pattern["prescriptions"][0], pattern["prescriptions"][1])))
        samples = max(0, int(np.random.normal(pattern["samples"][0], pattern["samples"][1])))
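        # Keep expenses a Python float even when the floor applies (an int can fail a DoubleType schema)
        expenses = max(10.0, float(round(np.random.normal(pattern["expenses"][0], pattern["expenses"][1]), 2)))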
        is_remote = random.random() < 0.7  # 70% remote now (was 30%)
        call_type = random.choice(call_types)
        
        days_offset = random.randint(0, 90)
        call_date = start_date + timedelta(days=days_offset)
        
        # Add activity_id and true_anomaly_type to match schema (11 columns)
        data.append((
            f"ACT_DRIFT{i:06d}",  # activity_id (primary key)
            f"REP{i % 50:03d}",   # rep_id
            region,
            call_date,
            calls,
            prescriptions,
            samples,
            expenses,
            is_remote,
            call_type,
            "normal"  # true_anomaly_type (drift data is normal, just shifted distribution)
        ))
    
    return data

# Generate and compare
drifted_data = generate_drifted_data(num_rows=200)
df_drifted = spark.createDataFrame(drifted_data, schema)

print("📊 Original vs Drifted Data Comparison:\n")
print("Original (Q1-Q2 2024):") 
display(df_sales.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
))

print("Drifted (Q3 2024 - post policy change):")
display(df_drifted.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
))

print("✅ Distribution shifted:")
print("   • Expenses: -30% (policy change)")
print("   • Remote work: +133% (70% vs 30%)")


In [None]:
# === EXPLICIT DRIFT STATISTICS ===
# Python warnings can get lost in Databricks output, so let's explicitly compute drift

from databricks.labs.dqx.anomaly.drift_detector import compute_drift_score

print("📊 Explicit Drift Analysis:\n")
print("=" * 70)

# Get the trained model's baseline statistics
registry_df = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry")
regional_models = registry_df.filter(F.col("model_name").like("%field_force_regional%"))

# Check drift for each region
for row in regional_models.collect():
    region = row["segment_values"]["region"] if row["segment_values"] else "Global"
    baseline_stats = row["baseline_stats"]
    columns = row["columns"]
    
    # Filter drifted data for this region/segment
    if row["segment_values"]:
        segment_filter = " AND ".join([f"{k} = '{v}'" for k, v in row["segment_values"].items()])
        df_segment = df_drifted.filter(segment_filter)
    else:
        df_segment = df_drifted
    
    # Compute drift
    drift_result = compute_drift_score(
        df_segment.select(columns),
        columns,
        baseline_stats,
        DRIFT_THRESHOLD
    )
    
    print(f"\n🌍 Region: {region}")
    print(f"   Drift Score: {drift_result.drift_score:.2f}")
    print(f"   Drift Detected: {'🚨 YES' if drift_result.drift_detected else '✅ NO'}")
    
    if drift_result.drifted_columns:
        print(f"   Drifted Columns: {', '.join(drift_result.drifted_columns)}")
        
        # Show baseline vs current stats for drifted columns
        print(f"\n   📈 Baseline vs Current:")
        for col in drift_result.drifted_columns:
            if col in baseline_stats:
                baseline = baseline_stats[col]
                current = df_segment.select(col).agg(
                    F.avg(col).alias("mean"),
                    F.stddev(col).alias("stddev")
                ).first()
                
                print(f"      {col}:")
                print(f"         Baseline: mean={baseline['mean']:.2f}, std={baseline['std']:.2f}")
                print(f"         Current:  mean={current['mean']:.2f}, std={current['stddev']:.2f}")
                print(f"         Change:   {((current['mean'] - baseline['mean']) / baseline['mean'] * 100):.1f}%")

print("\n" + "=" * 70)
print(f"\n💡 Threshold: {DRIFT_THRESHOLD}")
print("   Drift score > threshold → Model needs retraining")
print("   High drift in 'expenses' is expected (policy change: -30% expenses)")


In [None]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score drifted data with drift detection enabled
checks_with_drift = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "drift_threshold": DRIFT_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

print(f"🔍 Scoring drifted data with drift detection (threshold: {DRIFT_THRESHOLD})...\n")

df_drift_scored = dq_engine.apply_checks(df_drifted, checks_with_drift)

print(f"\n💡 Drift score > {DRIFT_THRESHOLD} → Significant distribution shift, retrain recommended")
print("   DQX will show UserWarnings if drift is detected:")
print("   🚨 'DATA DRIFT DETECTED in columns: expenses (drift score: 4.2)...'")
print("\n✅ Check cell output above for any drift UserWarnings.")


In [None]:
# Retrain with combined data
df_combined = df_sales.union(df_drifted)

print("🔄 Retraining model with combined data (old + new patterns)...\n")

# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_retrained = anomaly_engine.train(
    df=df_combined,
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],
    model_name="field_force_regional",  # Same name = new version
    params=AnomalyParams(
        algorithm_config=IsolationForestConfig(contamination=0.05, num_trees=150, random_seed=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Model retrained!")
print("   • Old model automatically archived")
print("   • New model active and includes both historical and recent patterns")
print("   • Baseline updated to reflect the new expense policy (is_remote is not a model feature)")
print("\n💡 Best Practice: Set up drift monitoring in production, retrain monthly/quarterly")


---

## Section 6: Production Integration (6 min)

Integrate anomaly detection into your DQX workflows for automated monitoring.


In [None]:
# Combine anomaly detection with traditional DQ checks
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule

checks_combined = [
    # Traditional data quality checks - one per column
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="rep_id",
        name="rep_id_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="region",
        name="region_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="call_date",
        name="call_date_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_in_range,
        column="calls_made",
        name="calls_range",
        check_func_kwargs={"min_limit": 0, "max_limit": 50}
    ),
    DQRowRule(
        criticality="error",
        check_func=is_in_range,
        column="expenses",
        name="expenses_range",
        check_func_kwargs={"min_limit": 0, "max_limit": 1000}
    ),
    
    # ML-based anomaly detection with explanations
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "include_contributions": True,
            "drift_threshold": DRIFT_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

# Apply all checks together
df_full_dq = dq_engine.apply_checks(df_test, checks_combined)

# Summary
print(f"📊 Full Data Quality Summary (threshold: {ANOMALY_SCORE_THRESHOLD}):\n")
total_rows = df_full_dq.count()
anomalies_found = df_full_dq.filter(F.col("anomaly_score") >= ANOMALY_SCORE_THRESHOLD).count()

# Note: this summary covers the anomaly check only; failures of the traditional
# checks are reported in DQX's result columns (e.g. `_errors`) and are not counted here.
print(f"Total Rows: {total_rows}")
print(f"Anomalies Detected: {anomalies_found}")
print(f"Not Flagged as Anomalous: {total_rows - anomalies_found}")
print("\n✅ All checks applied in single pass!")


In [None]:
# === QUARANTINE WORKFLOW (DQX Standard Pattern) ===
# Use DQX's built-in split method to separate valid from quarantined records

print("🔀 Applying quarantine workflow...")
print(f"   Threshold: {ANOMALY_SCORE_THRESHOLD}\n")

# Split valid and quarantined data using DQX standard method
valid_df, quarantine_df = dq_engine.apply_checks_and_split(df_test, checks_combined)

print(f"✅ Valid records: {valid_df.count()}")
print(f"⚠️  Quarantined for review: {quarantine_df.count()}\n")

# Save both valid and quarantine data using DQX standard method
dq_engine.save_results_in_table(
    output_df=valid_df,
    quarantine_df=quarantine_df,
    output_config=OutputConfig(
        location=f"{catalog}.{schema_name}.field_force_clean",
        mode="overwrite"
    ),
    quarantine_config=OutputConfig(
        location=f"{catalog}.{schema_name}.field_force_quarantine",
        mode="overwrite"
    )
)

print(f"💾 Saved valid data to: {catalog}.{schema_name}.field_force_clean")
print(f"💾 Saved quarantine to: {catalog}.{schema_name}.field_force_quarantine")

# Display quarantine summary
print("\n📊 Quarantine Summary by Region:")
quarantine_summary = spark.table(f"{catalog}.{schema_name}.field_force_quarantine").groupBy("region").agg(
    F.count("*").alias("count"),
    F.avg("anomaly_score").alias("avg_score"),
    F.max("anomaly_score").alias("max_score")
).orderBy("region")
display(quarantine_summary)

# Show top quarantined records with explanations
print("\n📋 Top Quarantined Records (for Manual Review):")
display(
    spark.table(f"{catalog}.{schema_name}.field_force_quarantine")
    .orderBy(F.desc("anomaly_score"))
    .select(
        "activity_id", "rep_id", "region", "calls_made", "prescriptions_generated", 
        "expenses", F.round("anomaly_score", 3).alias("score"),
        "anomaly_contributions", "_errors"
    )
    .limit(10)
)

print("\n💡 Quarantine Workflow Best Practices:")
print("   1. Anomalies automatically sent to quarantine table")
print("   2. Review team investigates using anomaly_contributions")
print("   3. Check _errors column for all DQ violations")
print("   4. Confirmed issues → escalate to appropriate team")
print("   5. False positives → retune model or adjust threshold")


### YAML Configuration for Production

For automated workflows, define checks in YAML:

```yaml
run_configs:
  - name: field_force_monitoring
    input_config:
      location: catalog.schema.field_force_activity
    
    # Traditional checks
    quality_checks:
      - function: is_not_null
        arguments:
          columns: [rep_id, region, call_date]
      - function: is_in_range
        arguments:
          column: calls_made
          min_limit: 0
          max_limit: 50
      - function: is_in_range
        arguments:
          column: expenses
          min_limit: 0
          max_limit: 1000
    
    # Anomaly detection
    anomaly_config:
      columns: [calls_made, prescriptions_generated, samples_distributed, expenses]
      segment_by: [region]
      model_name: field_force_regional
      registry_table: catalog.schema.anomaly_model_registry
      params:
        algorithm_config:
          contamination: 0.05
          num_trees: 150
          random_seed: 42
        sample_fraction: 1.0
    
    # Quarantine configuration
    quarantine_config:
      enabled: true
      table: catalog.schema.field_force_quarantine
      
    # Output configuration
    output_config:
      location: catalog.schema.field_force_clean
      mode: overwrite
```

**Run with:**
```bash
# Train model (one-time or scheduled)
databricks bundle run anomaly_trainer

# Run quality checks (scheduled, e.g., daily)
databricks bundle run quality_checker
```


---

## 🎓 Summary

### What You Learned:

1. ✅ **Auto-Discovery vs Manual Tuning** - Start with zero-config, refine with domain knowledge
2. ✅ **Parameter Tuning** - contamination, num_trees, max_samples for better performance
3. ✅ **Segment-Based Monitoring** - Regional baselines prevent false positives (US vs EU vs APAC)
4. ✅ **Feature Contributions** - SHAP-based root cause analysis for investigation
5. ✅ **Drift Detection** - Automated signals for when to retrain models
6. ✅ **Multi-Type Features** - Numeric, categorical, datetime, boolean all work together
7. ✅ **Production Integration** - DQEngine + YAML workflows + quarantine handling

### Key Takeaways:

- **Start simple**: `train(df)` with auto-discovery, then refine
- **Tune parameters**: Set contamination to expected anomaly rate, increase num_trees for stability
- **Use segments**: Different baselines for different groups prevent false positives
- **Enable contributions**: Root cause analysis is critical for business value
- **Monitor drift**: Set up drift detection for automated retraining signals
- **Combine checks**: Anomaly detection complements traditional DQ rules
- **Quarantine workflow**: Automate review process with explanations

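The advice to set contamination to the expected anomaly rate can be read as choosing a score cutoff so that roughly that fraction of training rows is flagged. The snippet below is a generic sketch of that reading, not DQX's internals; the function name and the toy scores are hypothetical.

```python
import math


def threshold_from_contamination(scores: list, contamination: float) -> float:
    """Score cutoff such that roughly `contamination` of the rows fall at or above it."""
    if not 0 < contamination < 1:
        raise ValueError("contamination must be between 0 and 1")
    ranked = sorted(scores, reverse=True)
    n_flagged = max(1, math.floor(contamination * len(ranked)))
    return ranked[n_flagged - 1]


# 20 toy training scores; contamination=0.05 flags the single highest one.
training_scores = [0.30 + 0.01 * i for i in range(19)] + [0.85]
cutoff = threshold_from_contamination(training_scores, contamination=0.05)
flagged_scores = [s for s in training_scores if s >= cutoff]
print(cutoff, len(flagged_scores))
```

If the contamination value is set well above the true anomaly rate, the cutoff drops into the normal range and false positives rise, which is why it is worth estimating before tuning anything else.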
### Model Comparison Results:

| Approach | Columns | Segments | Tuning | Use Case |
|----------|---------|----------|--------|----------|
| Auto-discovery | Auto (priority-based) | Auto (if applicable) | Default | Quick start, exploration |
| Manual tuned | Hand-picked | Manual | Custom hyperparameters | Production, refined monitoring |
| Regional | Hand-picked | By region | Tuned contamination | Multi-region with different baselines |

### Next Steps:

1. **Apply to your data**: `train(df=spark.table("your_table"))`
2. **Set up YAML workflows**: Automate training and checking
3. **Integrate quarantine**: Build review process with feature contributions
4. **Schedule retraining**: Weekly/monthly based on drift monitoring
5. **Monitor metrics**: Track anomaly rates, drift scores, false positive rates

### Resources:

- [DQX Anomaly Detection Documentation](https://databrickslabs.github.io/dqx/guide/anomaly_detection)
- [API Reference](https://databrickslabs.github.io/dqx/reference/quality_checks#has_no_anomalies)
- [GitHub Repository](https://github.com/databrickslabs/dqx)

---

**Questions? Feedback?** Open an issue on GitHub or contact the DQX team!
