# 🏥 Pharmaceutical Field Force Effectiveness - Anomaly Detection Demo

## Business Context

**Scenario**: Monitor sales rep performance across regions to detect unusual patterns:
- 💰 Expense fraud or policy violations
- 📉 Territory coverage issues and productivity gaps
- 🚨 Unrealistic prescription claims
- 📚 Training needs identification

**Data**: Sales rep daily activity with calls, prescriptions, samples, expenses across US, EU, APAC

## What You'll Learn (45 min comprehensive demo)

1. **Auto-Discovery** - Zero-config vs manual tuning
2. **Segment-Based Monitoring** - Regional baselines (US vs EU vs APAC)
3. **Parameter Tuning** - Contamination, hyperparameters, model comparison
4. **Feature Contributions** - SHAP-based root cause analysis
5. **Drift Detection** - When to retrain models
6. **Multi-Type Features** - Numeric, categorical, datetime, boolean
7. **Production Integration** - DQEngine, YAML, quarantine workflows

---

**📋 Table of Contents:**
- Section 1: Setup & Realistic Data (5 min)
- Section 2: Auto-Discovery & Manual Tuning (12 min)
- Section 3: Segment-Based Monitoring (8 min)
- Section 4: Feature Contributions & Root Cause (8 min)
- Section 5: Drift Detection & Retraining (6 min)
- Section 6: Production Integration (6 min)


---

## Section 1: Setup & Data Generation (5 min)

First, install DQX with anomaly support if not already installed:
```bash
%pip install databricks-labs-dqx[anomaly]
```


In [0]:
# Imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import random
import numpy as np

from databricks.labs.dqx.anomaly import AnomalyEngine, has_no_anomalies, AnomalyParams, IsolationForestConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.check_funcs import is_not_null, is_in_range
from databricks.labs.dqx.config import OutputConfig
from databricks.sdk import WorkspaceClient

# Initialize
spark = SparkSession.builder.getOrCreate()
ws = WorkspaceClient()
dq_engine = DQEngine(ws)
anomaly_engine = AnomalyEngine(ws)

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"   Spark version: {spark.version}")


✅ Setup complete!
   Spark version: 4.0.0


In [0]:
# === CONFIGURABLE PARAMETERS ===
# Adjust these to experiment with different thresholds

# Anomaly score threshold (0-1 scale)
# Lower = more sensitive (more anomalies detected, higher false positives)
# Higher = more specific (fewer anomalies, lower false positives)
ANOMALY_SCORE_THRESHOLD = 0.5

# Drift detection threshold (z-score)
# Typical range: 2.0 (sensitive) to 5.0 (conservative)
DRIFT_THRESHOLD = 3.0

print(f"📋 Configuration:")
print(f"   Anomaly Score Threshold: {ANOMALY_SCORE_THRESHOLD}")
print(f"   Drift Detection Threshold: {DRIFT_THRESHOLD}")
print(f"\n💡 You can change these values and re-run cells to see different results")


📋 Configuration:
   Anomaly Score Threshold: 0.5
   Drift Detection Threshold: 3.0

💡 You can change these values and re-run cells to see different results


### Generate Realistic Sales Rep Activity Data

We'll create 10,000 rows of daily sales rep activity with:
- **Mixed data types**: Numeric, categorical, datetime, boolean
- **Regional patterns**: Different baselines for US (high expenses), EU (moderate), APAC (high volume)
- **Call type variations**: Promotional (higher cost), Educational (more samples), Follow-up (shorter)
- **Temporal trends**: Seasonal variations (Q4 boost), weekday patterns
- **Rep-specific behavior**: Consistent performers vs. inconsistent ones
- **Realistic correlations**: More calls → more prescriptions, promotional → higher expenses
- **Injected anomalies**: ~3% anomalous records (expense fraud, low productivity, data quality issues)


In [0]:
# Generate sales rep activity data with realistic patterns
def generate_field_force_data(num_rows=10000, anomaly_rate=0.03):
    """Generate pharmaceutical field force activity data with realistic patterns."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    num_reps = 100  # Increased from 50 to 100 reps
    
    # Regional baseline patterns (realistic differences)
    regional_patterns = {
        "US": {"calls": 8, "prescriptions": 12, "samples": 25, "expenses": 150, "remote_rate": 0.4},
        "EU": {"calls": 6, "prescriptions": 9, "samples": 18, "expenses": 100, "remote_rate": 0.5},
        "APAC": {"calls": 10, "prescriptions": 15, "samples": 30, "expenses": 120, "remote_rate": 0.3},
    }
    
    # Call type modifiers (affect baseline metrics)
    call_type_modifiers = {
        "promotional": {"calls": 1.0, "prescriptions": 1.2, "samples": 1.4, "expenses": 1.3, "remote": 0.2},
        "educational": {"calls": 0.9, "prescriptions": 0.8, "samples": 1.6, "expenses": 1.1, "remote": 0.6},
        "follow_up": {"calls": 0.7, "prescriptions": 1.0, "samples": 0.6, "expenses": 0.8, "remote": 0.5},
    }
    
    # Rep performance profiles (some reps are consistently better/worse)
    rep_profiles = {}
    for rep_id in range(num_reps):
        # 70% average, 20% high performers, 10% low performers
        perf_type = np.random.choice(["average", "high", "low"], p=[0.7, 0.2, 0.1])
        if perf_type == "high":
            multiplier = np.random.uniform(1.2, 1.5)
        elif perf_type == "low":
            multiplier = np.random.uniform(0.6, 0.8)
        else:
            multiplier = np.random.uniform(0.9, 1.1)
        rep_profiles[f"REP{rep_id:03d}"] = multiplier
    
    start_date = datetime(2024, 1, 1)
    end_date = datetime(2024, 12, 31)
    total_days = (end_date - start_date).days
    
    for i in range(num_rows):
        rep_id = f"REP{i % num_reps:03d}"
        region = random.choice(regions)
        call_type = random.choice(call_types)
        
        # Get baseline patterns
        pattern = regional_patterns[region]
        call_modifier = call_type_modifiers[call_type]
        rep_multiplier = rep_profiles[rep_id]
        
        # Generate date with temporal trends
        days_offset = random.randint(0, total_days)
        call_date = start_date + timedelta(days=days_offset)
        
        # Seasonal multiplier (Q4 boost for pharma year-end push)
        month = call_date.month
        if month in [10, 11, 12]:  # Q4
            seasonal_multiplier = 1.15
        elif month in [1, 2]:  # Post-holiday slump
            seasonal_multiplier = 0.9
        else:
            seasonal_multiplier = 1.0
        
        # Weekday effect (lower activity on Fridays, higher Mon-Thu)
        weekday = call_date.weekday()
        if weekday == 4:  # Friday
            weekday_multiplier = 0.85
        elif weekday in [0, 1]:  # Monday, Tuesday
            weekday_multiplier = 1.05
        else:
            weekday_multiplier = 1.0
        
        # Track ground truth for validation
        anomaly_type_label = None
        
        # Normal patterns (97% of data)
        if random.random() > anomaly_rate:
            anomaly_type_label = "normal"
            
            # Base metrics with all multipliers
            combined_multiplier = rep_multiplier * seasonal_multiplier * weekday_multiplier
            
            # Calls (with correlation to rep performance)
            calls_base = pattern["calls"] * call_modifier["calls"] * combined_multiplier
            calls = max(1, int(np.random.normal(calls_base, calls_base * 0.2)))
            
            # Prescriptions (correlated with calls - more calls → more prescriptions)
            prescriptions_base = pattern["prescriptions"] * call_modifier["prescriptions"] * combined_multiplier
            # Add correlation: prescription rate increases slightly with more calls
            call_correlation = min(1.2, calls / calls_base)
            prescriptions = max(0, int(np.random.normal(prescriptions_base * call_correlation, prescriptions_base * 0.25)))
            
            # Samples (correlated with call type)
            samples_base = pattern["samples"] * call_modifier["samples"] * combined_multiplier
            samples = max(0, int(np.random.normal(samples_base, samples_base * 0.2)))
            
            # Expenses (correlated with calls and call type)
            expenses_base = pattern["expenses"] * call_modifier["expenses"] * combined_multiplier
            # Add correlation: more calls = slightly higher expenses
            expense_correlation = min(1.15, calls / calls_base)
            expenses = max(10, round(np.random.normal(expenses_base * expense_correlation, expenses_base * 0.15), 2))
            
            # Remote flag (depends on call type and region)
            is_remote = random.random() < (pattern["remote_rate"] * call_modifier["remote"])
            
        else:
            # Inject realistic anomalies (3% of data)
            anomaly_type = random.choice([
                "high_expense_fraud",
                "low_productivity",
                "unrealistic_prescriptions",
                "data_quality_issue"
            ])
            anomaly_type_label = anomaly_type
            
            if anomaly_type == "high_expense_fraud":
                # Excessive expenses with low output (potential fraud)
                calls = max(1, int(pattern["calls"] * 0.4))
                prescriptions = max(0, int(pattern["prescriptions"] * 0.3))
                samples = max(0, int(pattern["samples"] * 0.5))
                expenses = round(pattern["expenses"] * random.uniform(3.0, 5.0), 2)
                is_remote = False  # Fraudsters often claim in-person visits
                
            elif anomaly_type == "low_productivity":
                # Many calls but few results (training need or territory issue)
                calls = int(pattern["calls"] * random.uniform(2.0, 3.0))
                prescriptions = max(0, int(pattern["prescriptions"] * random.uniform(0.15, 0.3)))
                samples = int(pattern["samples"] * random.uniform(0.4, 0.6))
                expenses = round(pattern["expenses"] * random.uniform(1.3, 1.6), 2)
                is_remote = random.random() < 0.4
                
            elif anomaly_type == "unrealistic_prescriptions":
                # Suspiciously high prescription rate (investigation needed)
                calls = max(1, int(pattern["calls"] * random.uniform(0.8, 1.2)))
                prescriptions = int(pattern["prescriptions"] * random.uniform(3.0, 5.0))
                samples = int(pattern["samples"] * random.uniform(2.0, 3.0))
                expenses = round(pattern["expenses"] * random.uniform(0.9, 1.3), 2)
                is_remote = False
                
            else:  # data_quality_issue
                # Outliers that don't follow normal patterns (data entry errors)
                calls = random.choice([0, int(pattern["calls"] * 10)])  # Either 0 or way too high
                prescriptions = random.choice([int(pattern["prescriptions"] * -1) if random.random() < 0.3 else 0, 
                                              int(pattern["prescriptions"] * 8)])
                samples = random.choice([0, int(pattern["samples"] * 12)])
                expenses = round(random.choice([1, pattern["expenses"] * 20]), 2)
                is_remote = random.random() < 0.5
        
        data.append((
            f"ACT{i:06d}",  # Unique activity_id (primary key)
            rep_id,
            region,
            call_date,
            calls,
            prescriptions,
            samples,
            expenses,
            is_remote,
            call_type,
            anomaly_type_label  # Ground truth for validation
        ))
    
    return data

# Generate data
print("🔄 Generating sales rep activity data with realistic patterns...")
field_force_data = generate_field_force_data(num_rows=10000, anomaly_rate=0.03)

schema = StructType([
    StructField("activity_id", StringType(), False),  # Primary key
    StructField("rep_id", StringType(), False),
    StructField("region", StringType(), False),
    StructField("call_date", DateType(), False),
    StructField("calls_made", IntegerType(), False),
    StructField("prescriptions_generated", IntegerType(), False),
    StructField("samples_distributed", IntegerType(), False),
    StructField("expenses", DoubleType(), False),
    StructField("is_remote", BooleanType(), False),
    StructField("call_type", StringType(), False),
    StructField("true_anomaly_type", StringType(), False),  # Ground truth
])

df_sales = spark.createDataFrame(field_force_data, schema)

print("\n📊 Sample of field force activity data:")
display(df_sales.orderBy("call_date").limit(10))

print(f"\n✅ Generated {df_sales.count()} rows with ~3% injected anomalies")
print(f"   Regions: {df_sales.select('region').distinct().count()}")
print(f"   Call types: {df_sales.select('call_type').distinct().count()}")
print(f"   Unique reps: {df_sales.select('rep_id').distinct().count()}")
print(f"   Date range: {df_sales.agg(F.min('call_date'), F.max('call_date')).first()}")
print(f"   Segments (region × call_type): {df_sales.select('region', 'call_type').distinct().count()}")


🔄 Generating sales rep activity data with realistic patterns...

📊 Sample of field force activity data:


activity_id,rep_id,region,call_date,calls_made,prescriptions_generated,samples_distributed,expenses,is_remote,call_type,true_anomaly_type
ACT008411,REP011,EU,2024-01-01,5,6,20,76.75,True,promotional,normal
ACT005556,REP056,US,2024-01-01,5,6,29,117.22,False,promotional,normal
ACT000962,REP062,APAC,2024-01-01,5,4,15,82.0,False,follow_up,normal
ACT002119,REP019,EU,2024-01-01,6,11,26,136.97,True,promotional,normal
ACT004378,REP078,US,2024-01-01,8,14,21,258.89,False,promotional,normal
ACT008520,REP020,US,2024-01-01,6,14,17,89.97,False,follow_up,normal
ACT004406,REP006,US,2024-01-01,4,8,23,103.4,False,follow_up,normal
ACT004426,REP026,APAC,2024-01-01,6,13,11,73.32,False,follow_up,normal
ACT008416,REP016,EU,2024-01-01,3,3,38,57.43,True,educational,normal
ACT007759,REP059,EU,2024-01-01,5,5,37,88.64,False,educational,normal



✅ Generated 10000 rows with ~3% injected anomalies
   Regions: 3
   Call types: 3
   Unique reps: 100
   Date range: Row(min(call_date)=datetime.date(2024, 1, 1), max(call_date)=datetime.date(2024, 12, 31))
   Segments (region × call_type): 9


In [0]:
# Save to table for training
catalog = "vbdemos"
schema_name = "dqx_demo"
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema_name}")

table_name = f"{catalog}.{schema_name}.field_force_activity"
df_sales.write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Data saved to: {table_name}")


✅ Data saved to: vbdemos.dqx_demo.field_force_activity


In [0]:
# Split data into training (80%) and test (20%) sets
# Training: historical "normal" data to learn patterns
# Test: new data to detect anomalies (simulates production)

df_train, df_test = df_sales.randomSplit([0.8, 0.2], seed=42)

print(f"📊 Data Split:")
print(f"   Training set: {df_train.count()} rows")
print(f"   Test set: {df_test.count()} rows")
print(f"\\n💡 We train on historical data and score on new data (like production)")


📊 Data Split:
   Training set: 8062 rows
   Test set: 1938 rows
\n💡 We train on historical data and score on new data (like production)


---

## Section 2: Auto-Discovery vs Manual Tuning (12 min)

### 2.1 Auto-Discovery (Zero Configuration)

Let's start with zero configuration - DQX will automatically select columns and detect segments.


In [0]:
# Let DQX automatically discover the best columns and segments
# Use exclude_columns to skip ID and ground truth columns
# This enables auto-discovery on remaining columns
model_name_auto = anomaly_engine.train(
    df=df_train,
    model_name="field_force_auto",
    exclude_columns=['activity_id', 'true_anomaly_type'],  # Exclude ID and ground truth
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)
print(f"\\n📊 Auto-discovery complete!")
print(f"   Model: {model_name_auto}")


Excluding 2 columns from auto-discovery: ['activity_id', 'true_anomaly_type']
Auto-selected 6 columns: ['calls_made', 'expenses', 'prescriptions_generated', 'samples_distributed', 'is_remote', 'call_date']
Auto-detected 2 segment columns: ['region', 'call_type'] (9 total segments)


  client.get_latest_versions(model_name, stages=None)


Training segment 1/9: region=EU_call_type=follow_up


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-46d44651b38147f48df030a7f6a10b97?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=EU_call_type=follow_up' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=eu_call_type=follow_up': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=eu_call_type=follow_up/version/3?o=984752964297111
[90m09:18:00[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 2/9: region=US_call_type=educational


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-8058043139184ea48e783e3238a9433d?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=US_call_type=educational' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=us_call_type=educational': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=us_call_type=educational/version/3?o=984752964297111
[90m09:19:45[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 3/9: region=US_call_type=promotional


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-2c80658977b54c3fb4ada2eeab1013eb?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=US_call_type=promotional' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=us_call_type=promotional': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=us_call_type=promotional/version/3?o=984752964297111
[90m09:20:50[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 4/9: region=US_call_type=follow_up


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-8823de62a99543e1a86df9efe08cad9c?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=US_call_type=follow_up' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=us_call_type=follow_up': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=us_call_type=follow_up/version/3?o=984752964297111
[90m09:21:33[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 5/9: region=APAC_call_type=promotional


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-36abd76189a9486bbfb72c43e55b177f?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=APAC_call_type=promotional' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=apac_call_type=promotional': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=apac_call_type=promotional/version/3?o=984752964297111
[90m09:22:10[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 6/9: region=EU_call_type=educational


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-cb0dbefd72c14d82ac03850d50f96c66?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=EU_call_type=educational' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=eu_call_type=educational': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=eu_call_type=educational/version/3?o=984752964297111
[90m09:22:53[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 7/9: region=APAC_call_type=educational


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-87e8c2aeeb2c4acf84a3d32bc4e93350?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=APAC_call_type=educational' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=apac_call_type=educational': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=apac_call_type=educational/version/3?o=984752964297111
[90m09:24:13[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 8/9: region=APAC_call_type=follow_up


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-ce9cb47d3eac4045aaa5ba8bbd477888?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=APAC_call_type=follow_up' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=apac_call_type=follow_up': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=apac_call_type=follow_up/version/3?o=984752964297111
[90m09:24:52[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 9/9: region=EU_call_type=promotional


  model_uri, was_skipped = _train_one_segment_with_validation(
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-cc995601113b4b63b6e4641a4566afca?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_auto__seg_region=EU_call_type=promotional' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '3' of model 'vbdemos.dqx_demo.field_force_auto__seg_region=eu_call_type=promotional': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_auto__seg_region=eu_call_type=promotional/version/3?o=984752964297111
[90m09:25:31[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


   Trained 9/9 segment models for: vbdemos.dqx_demo.field_force_auto
   Registry: vbdemos.dqx_demo.anomaly_model_registry
\n📊 Auto-discovery complete!
   Model: vbdemos.dqx_demo.field_force_auto


In [0]:
# Check what was auto-discovered
registry_df = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry")

# For segmented models, we query by the base model name
# The registry contains individual segment entries
base_name_parts = model_name_auto.split(".")
if len(base_name_parts) == 3:
    base_model_name_only = base_name_parts[2]  # Get just the model name without catalog.schema
else:
    base_model_name_only = model_name_auto

# Get a representative segment to show configuration
sample_model = registry_df.filter(
    (F.col("model_name").startswith(f"{base_model_name_only}__seg_")) &
    (F.col("status") == "active")
).orderBy("training_time").first()

print(f"\\n📋 Auto-Discovered Configuration:")
if sample_model:
    print(f"   Model: {model_name_auto}")
    print(f"   Columns: {sample_model['columns']}")
    print(f"   Segments: {sample_model['segment_by']}")
    print(f"   Column types: {sample_model['column_types']}")
    
    # Count total segments if segmented model
    if sample_model['segment_by']:
        segment_count = registry_df.filter(
            (F.col("model_name").startswith(f"{base_model_name_only}__seg_")) &
            (F.col("status") == "active")
        ).count()
        print(f"   Total segments trained: {segment_count}")
else:
    print("   ⚠️ Model not found in registry")

print(f"\\n💡 DQX prioritized: numeric > boolean > categorical > datetime")

# Diagnostic: Check for segment coverage
print(f"\\n🔍 Segment Coverage Diagnostic:")
trained_segments = registry_df.filter(
    (F.col("model_name").startswith(f"{base_model_name_only}__seg_")) &
    (F.col("status") == "active")
).select("segment_values").collect()

test_segments = df_test.select("region", "call_type").distinct().collect()

print(f"   Trained segments: {len(trained_segments)}")
print(f"   Test data segments: {len(test_segments)}")

# Find segments in test that aren't in training
trained_combos = set()
for row in trained_segments:
    if row["segment_values"]:
        combo = (row["segment_values"]["region"], row["segment_values"]["call_type"])
        trained_combos.add(combo)

test_combos = set((row["region"], row["call_type"]) for row in test_segments)
missing_in_training = test_combos - trained_combos

if missing_in_training:
    print(f"   ⚠️  WARNING: {len(missing_in_training)} segment(s) in test data NOT in training:")
    for region, call_type in missing_in_training:
        count = df_test.filter((F.col("region") == region) & (F.col("call_type") == call_type)).count()
        print(f"      - region={region}, call_type={call_type} ({count} rows will have null scores)")
else:
    print(f"   ✅ All test segments have trained models")


\n📋 Auto-Discovered Configuration:
   Model: vbdemos.dqx_demo.field_force_auto
   Columns: ['calls_made', 'expenses', 'prescriptions_generated', 'samples_distributed', 'is_remote', 'call_date']
   Segments: ['region', 'call_type']
   Column types: None
   Total segments trained: 27
\n💡 DQX prioritized: numeric > boolean > categorical > datetime


In [0]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with auto-discovered model (just pass the model name!)
checks_auto = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_auto,  # Just pass the model name - segments are handled automatically!
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_auto = dq_engine.apply_checks(df_test, checks_auto)

# Save to table to avoid re-computation in subsequent cells
auto_scored_table = f"{catalog}.{schema_name}.auto_scored_temp"
df_scored_auto.write.mode("overwrite").saveAsTable(auto_scored_table)
df_scored_auto = spark.table(auto_scored_table)

anomalies_auto = df_scored_auto.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD)

print(f"\\n⚠️  Auto-discovery found {anomalies_auto.count()} anomalies (threshold: {ANOMALY_SCORE_THRESHOLD}):\\n")
display(anomalies_auto.orderBy(F.col("_info.anomaly.score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(10))


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



\n⚠️  Auto-discovery found 694 anomalies (threshold: 0.5):\n


rep_id,region,calls_made,prescriptions_generated,expenses,score
REP000,APAC,0,120,2400.0,0.76
REP071,US,80,96,1.0,0.716
REP001,APAC,100,120,2400.0,0.707
REP098,EU,60,0,2000.0,0.702
REP033,APAC,100,120,1.0,0.688
REP001,APAC,100,0,2400.0,0.682
REP065,APAC,0,0,1.0,0.662
REP048,US,80,-12,1.0,0.659
REP037,EU,12,16,227.65,0.658
REP094,EU,60,-9,1.0,0.655


### 2.1.2 Validate Auto-Discovery Performance

Let's validate how well auto-discovery worked by checking detection rate and false positives.


In [0]:
# Display the _info column structure
print("📊 The _info Column Structure:\n")
print("_info: struct {")
print("  anomaly: struct {")
print("    check_name: string        # Check function name")
print("    score: double              # Anomaly score (0-1)")
print("    is_anomaly: boolean        # True if score > threshold")
print("    threshold: double          # Detection threshold used")
print("    model: string              # Model name")
print("    segment: map<string,string> # Segment values (null for global)")
print("    contributions: map<string,double> # SHAP values (if requested)")
print("    confidence_std: double     # Ensemble std (if ensemble)")
print("  }")
print("}")



print("\n📋 Sample _info values for detected anomalies:\n")

# Show top 3 anomalies with their _info (using _info to filter)
sample_anomalies = df_scored_auto.filter(
    F.col('_info.anomaly.is_anomaly')  # ✅ Recommended way
).orderBy(F.col("_info.anomaly.score").desc()).limit(3)

for row in sample_anomalies.collect():
    print(f"Activity {row['activity_id']}:")
    print(f"  Region: {row['region']}, Calls: {row['calls_made']}, Expenses: ${row['expenses']:.2f}")
    
    # Extract _info.anomaly array
    anomaly_info = row['_info']['anomaly']  # Get first element
    print(f"  _info.anomaly:")
    print(f"    check_name: {anomaly_info['check_name']}")
    print(f"    score: {anomaly_info['score']:.3f}")
    print(f"    is_anomaly: {anomaly_info['is_anomaly']}")
    print(f"    threshold: {anomaly_info['threshold']}")
    print(f"    model: {anomaly_info['model']}")
    print(f"    segment: {anomaly_info['segment']}")
    print()

print("💡 Benefits of _info column:")
print("   ✅ All metadata in one place")
print("   ✅ Self-documenting schema")
print("   ✅ Easy to query: df.filter(col('_info.anomaly.is_anomaly'))")
print("   ✅ Extensible: Future checks (drift, profiling) can add their own keys")


📊 The _info Column Structure:

_info: struct {
  anomaly: struct {
    check_name: string        # Check function name
    score: double              # Anomaly score (0-1)
    is_anomaly: boolean        # True if score > threshold
    threshold: double          # Detection threshold used
    model: string              # Model name
    segment: map<string,string> # Segment values (null for global)
    contributions: map<string,double> # SHAP values (if requested)
    confidence_std: double     # Ensemble std (if ensemble)
  }
}

📋 Sample _info values for detected anomalies:

Activity ACT002300:
  Region: APAC, Calls: 0, Expenses: $2400.00
  _info.anomaly:
    check_name: has_no_anomalies
    score: 0.760
    is_anomaly: True
    threshold: 0.5
    model: vbdemos.dqx_demo.field_force_auto
    segment: {'region': 'APAC', 'call_type': 'promotional'}

Activity ACT009671:
  Region: US, Calls: 80, Expenses: $1.00
  _info.anomaly:
    check_name: has_no_anomalies
    score: 0.716
    is_anomal

In [0]:
# === COMPREHENSIVE VALIDATION WITH GROUND TRUTH ===print(f"🔍 Validation using threshold: {ANOMALY_SCORE_THRESHOLD}\n")# Classify predictions (deduplication now handled in library)df_classified = df_scored_auto.withColumn(    "predicted_anomaly",    F.when(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD, True).otherwise(False)).withColumn(    "is_true_anomaly",    F.when(F.col("true_anomaly_type") != "normal", True).otherwise(False))# Save classified results to table for reuseclassified_table = f"{catalog}.{schema_name}.classified_temp"df_classified.write.mode("overwrite").saveAsTable(classified_table)df_classified = spark.table(classified_table)# Check for null scores (unseen segments)null_score_count = df_classified.filter(F.col("_info.anomaly.score").isNull()).count()if null_score_count > 0:    print(f"⚠️  Warning: {null_score_count} rows have null anomaly scores")    print(f"   This happens when test data has segment combinations not seen in training.")    print(f"   These rows are excluded from metrics calculations.\n")# === 1. Overall Confusion Matrix ===print("📊 Confusion Matrix:")confusion = df_classified.groupBy("is_true_anomaly", "predicted_anomaly").count()display(confusion.orderBy("is_true_anomaly", "predicted_anomaly"))# Calculate metricstp = df_classified.filter((F.col("is_true_anomaly") == True) & (F.col("predicted_anomaly") == True)).count()fp = df_classified.filter((F.col("is_true_anomaly") == False) & (F.col("predicted_anomaly") == True)).count()tn = df_classified.filter((F.col("is_true_anomaly") == False) & (F.col("predicted_anomaly") == False)).count()fn = df_classified.filter((F.col("is_true_anomaly") == True) & (F.col("predicted_anomaly") == False)).count()precision = tp / (tp + fp) if (tp + fp) > 0 else 0recall = tp / (tp + fn) if (tp + fn) > 0 else 0f1_score = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0print("\n📈 Performance Metrics:")metrics_df = spark.createDataFrame([    ("Precision", round(precision * 100, 2)),    ("Recall", round(recall * 100, 2)),    ("F1-Score", round(f1_score * 100, 2))], ["Metric", "Value_%"])display(metrics_df)# === 2. Detection Rate by Anomaly Type ===print("\n🎯 Detection Performance by Anomaly Type:")anomaly_type_stats = df_classified.filter(    F.col("true_anomaly_type") != "normal").groupBy("true_anomaly_type").agg(    F.count("*").alias("total_count"),    F.sum(F.when(F.col("predicted_anomaly"), 1).otherwise(0)).alias("detected_count"),    F.avg("_info.anomaly.score").alias("avg_score"),    F.max("_info.anomaly.score").alias("max_score"),    F.min("_info.anomaly.score").alias("min_score")).withColumn(    "detection_rate",    (F.col("detected_count") / F.col("total_count") * 100).cast("decimal(5,2)")).orderBy(F.desc("detection_rate"))display(anomaly_type_stats)# === 3. Score Distribution by Type (Visualization) ===print("\n📊 Score Distribution by Anomaly Type:")# Check for null scores (rows that didn't match any segment)null_scores = df_classified.filter(F.col("_info.anomaly.score").isNull()).count()if null_scores > 0:    print(f"   ⚠️ Note: {null_scores} rows have null scores (unseen segment combinations)\n")score_by_type = df_classified.filter(F.col("_info.anomaly.score").isNotNull()).select(    "true_anomaly_type",    F.round("_info.anomaly.score", 2).alias("anomaly_score"))display(score_by_type)# === 4. Threshold Sensitivity Analysis ===print("\n📊 Threshold Sensitivity Analysis (how detection rate changes with threshold):")threshold_analysis = []for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:    detected = df_classified.filter(        (F.col("true_anomaly_type") != "normal") &        (F.col("_info.anomaly.score") >= threshold)    ).count()        false_positives = df_classified.filter(        (F.col("true_anomaly_type") == "normal") &        (F.col("_info.anomaly.score") >= threshold)    ).count()        total_true_anomalies = df_classified.filter(F.col("true_anomaly_type") != "normal").count()    detection_rate = (detected / total_true_anomalies * 100) if total_true_anomalies > 0 else 0        threshold_analysis.append((threshold, detected, round(detection_rate, 1), false_positives))threshold_df = spark.createDataFrame(threshold_analysis,                                      ["Threshold", "Detected", "Detection_Rate_%", "False_Positives"])display(threshold_df)print("\n💡 Current threshold (0.5) is highlighted above.")print("   Lower threshold = more sensitive (catches more, but more false alarms)")print("   Higher threshold = more specific (misses some, but fewer false alarms)")

🔍 Validation using threshold: 0.5

📊 Confusion Matrix:


is_true_anomaly,predicted_anomaly,count
False,False,1244
False,True,632
True,True,62



📈 Performance Metrics:


Metric,Value_%
Precision,8.52
Recall,100.0
F1-Score,15.7



🎯 Detection Performance by Anomaly Type:


true_anomaly_type,total_count,detected_count,avg_score,max_score,min_score,detection_rate
high_expense_fraud,13,13,0.5638033680235439,0.6147081859344267,0.5188592573718921,100.0
unrealistic_prescriptions,18,18,0.5961840054812518,0.7183249548182099,0.5200788297227336,100.0
data_quality_issue,11,11,0.6935086022924349,0.8107061939032955,0.5830723081695247,100.0
low_productivity,13,13,0.5813169929021732,0.6237845427915375,0.5044568083092166,100.0



📊 Score Distribution by Anomaly Type:


true_anomaly_type,_info.anomaly.score
normal,0.47
normal,0.46
normal,0.45
normal,0.58
normal,0.5
normal,0.49
normal,0.46
normal,0.53
normal,0.45
normal,0.45



⚠️ False Positives: 666 records
Top false positive scores:


activity_id,rep_id,region,calls_made,expenses,score
ACT005159,REP059,EU,11,256.15,0.665
ACT008737,REP037,EU,12,227.65,0.658
ACT003463,REP063,EU,11,188.55,0.629
ACT006445,REP045,APAC,2,11.27,0.626
ACT003343,REP043,APAC,15,254.74,0.624



🔬 Impact of Different Thresholds:


Threshold,Detected,Detection_Rate_%,False_Positives
0.3,55,100.0,1890
0.4,55,100.0,1890
0.5,55,100.0,666
0.6,24,43.6,11
0.7,5,9.1,0



💡 Current threshold (0.5) is highlighted above.
   Lower threshold = more sensitive (catches more, but more false alarms)
   Higher threshold = more specific (misses some, but fewer false alarms)


In [0]:
# === SHAP FEATURE CONTRIBUTIONS BY ANOMALY TYPE ===
print("🔬 Computing SHAP contributions per anomaly type...")
print("   (This may take a moment...)\n")

# Re-score with SHAP enabled
checks_with_shap = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_auto,
            "score_threshold": ANOMALY_SCORE_THRESHOLD,
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"],
            "include_contributions": True  # <- Enable SHAP
        }
    )
]

df_with_shap = dq_engine.apply_checks(df_test, checks_with_shap)

# Save SHAP results to table for efficient reuse
shap_results_table = f"{catalog}.{schema_name}.shap_analysis_temp"
df_with_shap.write.mode("overwrite").saveAsTable(shap_results_table)
df_shap_truth = spark.table(shap_results_table)

# === Analyze Top Contributors Per Anomaly Type ===
for anomaly_type in ["high_expense_fraud", "low_productivity", "unrealistic_prescriptions", "data_quality_issue"]:
    print(f"\n{'='*60}")
    print(f"📊 SHAP Analysis: {anomaly_type.upper().replace('_', ' ')}")
    print(f"{'='*60}")
    
    # Get samples of this type that were detected (with ordering for deterministic results)
    samples_df = df_shap_truth.filter(
        (F.col("true_anomaly_type") == anomaly_type) &
        (F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD)
    ).select(
        "activity_id", 
        F.col("_info.anomaly.score").alias("anomaly_score"),
        F.col("_info.anomaly.contributions").alias("anomaly_contributions"),
        "calls_made", "prescriptions_generated", "expenses"
    ).orderBy(F.col("anomaly_score").desc()).limit(3)  # Order for deterministic results!
    
    # Collect once and reuse
    samples = samples_df.collect()
    
    if len(samples) == 0:
        print(f"   ⚠️ No detected anomalies of this type (try lowering threshold)")
        continue
    
    # Display the same samples we'll analyze
    print(f"\n✅ Sample detected anomalies ({len(samples)}):")
    display(spark.createDataFrame(samples).select(
        "activity_id", "calls_made", "prescriptions_generated", "expenses", 
        F.round("anomaly_score", 3).alias("score")
    ))
    
    # Extract and display top SHAP contributors (using the same collected samples)
    for row in samples:
        contributions = row["anomaly_contributions"]
        if contributions:
            print(f"\n  🔍 Activity {row['activity_id']} - Top contributing features:")
            sorted_contrib = sorted(contributions.items(), key=lambda x: abs(x[1]), reverse=True)[:5]
            for feature, value in sorted_contrib:
                print(f"      {feature:30s}: {value*100:6.1f}%")

print(f"\n\n💡 Key Insights:")
print(f"   • high_expense_fraud: Usually driven by 'expenses' features")
print(f"   • low_productivity: Driven by calls vs prescriptions ratio")
print(f"   • unrealistic_prescriptions: Driven by prescription-related features")
print(f"   • data_quality_issue: Mix of features with extreme values")
print(f"\n✅ SHAP helps explain WHY each anomaly was detected!")


🔬 Computing SHAP contributions per anomaly type...
   (This may take a moment...)



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]




📊 SHAP Analysis: HIGH EXPENSE FRAUD

✅ Sample detected anomalies (3):


activity_id,calls_made,prescriptions_generated,expenses,score
ACT008127,4,4,525.66,0.523
ACT004153,2,2,368.89,0.612
ACT000949,3,3,488.37,0.555



  🔍 Activity ACT009483 - Top contributing features:
      expenses                      :   44.2%
      prescriptions_generated       :   11.6%
      call_date_dow_cos             :    8.3%
      is_remote_bool                :    7.9%
      call_date_is_weekend          :    5.9%

  🔍 Activity ACT008127 - Top contributing features:
      expenses                      :   41.2%
      prescriptions_generated       :   12.3%
      call_date_month_cos           :   11.9%
      call_date_month_sin           :    7.5%
      is_remote_bool                :    7.3%

  🔍 Activity ACT006199 - Top contributing features:
      expenses                      :   38.7%
      samples_distributed           :   23.5%
      calls_made                    :    9.4%
      prescriptions_generated       :    7.7%
      call_date_is_weekend          :    5.9%

📊 SHAP Analysis: LOW PRODUCTIVITY

✅ Sample detected anomalies (3):


activity_id,calls_made,prescriptions_generated,expenses,score
ACT009650,29,2,185.96,0.599
ACT006404,17,2,226.64,0.584



  🔍 Activity ACT001047 - Top contributing features:
      calls_made                    :   32.6%
      samples_distributed           :   22.1%
      prescriptions_generated       :   18.9%
      call_date_is_weekend          :    7.4%
      is_remote_bool                :    5.3%

  🔍 Activity ACT005975 - Top contributing features:
      calls_made                    :   34.9%
      call_date_is_weekend          :   15.8%
      prescriptions_generated       :   15.7%
      samples_distributed           :   13.6%
      call_date_month_cos           :    6.4%

  🔍 Activity ACT006088 - Top contributing features:
      calls_made                    :   31.1%
      samples_distributed           :   19.6%
      call_date_month_cos           :   11.9%
      prescriptions_generated       :    8.7%
      call_date_is_weekend          :    8.3%

📊 SHAP Analysis: UNREALISTIC PRESCRIPTIONS

✅ Sample detected anomalies (3):


activity_id,calls_made,prescriptions_generated,expenses,score
ACT000696,11,61,117.04,0.594
ACT003894,6,39,156.33,0.567
ACT000999,5,28,122.84,0.624



  🔍 Activity ACT000696 - Top contributing features:
      prescriptions_generated       :   58.9%
      samples_distributed           :   11.1%
      call_date_month_sin           :    7.3%
      call_date_dow_sin             :    5.8%
      is_remote_bool                :    4.5%

  🔍 Activity ACT004912 - Top contributing features:
      prescriptions_generated       :   42.0%
      call_date_is_weekend          :   17.7%
      call_date_month_cos           :   13.5%
      samples_distributed           :    9.3%
      is_remote_bool                :    4.3%

  🔍 Activity ACT007263 - Top contributing features:
      prescriptions_generated       :   59.7%
      samples_distributed           :    9.9%
      call_date_dow_sin             :    7.3%
      is_remote_bool                :    5.1%
      call_date_is_weekend          :    4.7%

📊 SHAP Analysis: DATA QUALITY ISSUE

✅ Sample detected anomalies (2):


activity_id,calls_made,prescriptions_generated,expenses,score
ACT008305,60,72,2000.0,0.788
ACT001999,100,-15,1.0,0.646
ACT003395,0,0,1.0,0.673



  🔍 Activity ACT002048 - Top contributing features:
      calls_made                    :   33.2%
      samples_distributed           :   33.0%
      expenses                      :    9.2%
      prescriptions_generated       :    8.8%
      call_date_is_weekend          :    6.7%

  🔍 Activity ACT001513 - Top contributing features:
      expenses                      :   37.8%
      prescriptions_generated       :   31.1%
      samples_distributed           :   15.1%
      calls_made                    :    8.3%
      is_remote_bool                :    2.8%

  🔍 Activity ACT005067 - Top contributing features:
      prescriptions_generated       :   40.9%
      expenses                      :   17.2%
      samples_distributed           :   16.1%
      calls_made                    :   15.0%
      is_remote_bool                :    3.6%


💡 Key Insights:
   • high_expense_fraud: Usually driven by 'expenses' features
   • low_productivity: Driven by calls vs prescriptions ratio
   • unr

### 2.2 Manual Column Selection & Parameter Tuning

Now let's manually select specific columns and tune hyperparameters for better performance.


In [0]:
# Manually specify columns (no segmentation)
# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_manual = anomaly_engine.train(
    df=df_train,
    columns=["calls_made", "prescriptions_generated", "expenses", "samples_distributed"],
    model_name="field_force_manual",
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)


  client.get_latest_versions(model_name, stages=None)
🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-87061547503d44a0abb6f6fb9a600833?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_manual' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '2' of model 'vbdemos.dqx_demo.field_force_manual': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_manual/version/2?o=984752964297111
[90m09:43:11[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


   Model trained: vbdemos.dqx_demo.field_force_manual
   Model URI: models:/vbdemos.dqx_demo.field_force_manual/2
   Registry: vbdemos.dqx_demo.anomaly_model_registry


In [0]:
# Score with manually configured model
checks_manual = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": model_name_manual,
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_manual = dq_engine.apply_checks(df_test, checks_manual)
anomalies_manual = df_scored_manual.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD)

print(f"\\n⚠️  Manual config found {anomalies_manual.count()} anomalies (threshold: {ANOMALY_SCORE_THRESHOLD}):\\n")
display(anomalies_manual.orderBy(F.col("_info.anomaly.score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(10))


# Store metrics for comparison
manual_detected = anomalies_manual.count()
manual_detection_rate = (manual_detected / expected_anomalies) * 100 if expected_anomalies > 0 else 0
manual_flagged = df_scored_manual.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD).count()
manual_fp = max(0, manual_flagged - expected_anomalies)
manual_fp_rate = (manual_fp / normal_records_expected) * 100 if normal_records_expected > 0 else 0

manual_metrics = {
    "detected": manual_detected,
    "detection_rate": manual_detection_rate,
    "fp_rate": manual_fp_rate
}

print(f"\n📊 Manual Tuning Performance: {manual_detected} detected ({manual_detection_rate:.1f}% rate), {manual_fp_rate:.2f}% FP")


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



\n⚠️  Manual config found 182 anomalies (threshold: 0.5):\n


rep_id,region,calls_made,prescriptions_generated,expenses,score
REP020,APAC,100,120,2400.0,0.87
REP046,APAC,0,120,2400.0,0.852
REP000,APAC,0,120,2400.0,0.852
REP052,APAC,100,0,2400.0,0.837
REP082,US,80,96,3000.0,0.836
REP015,APAC,100,120,1.0,0.832
REP018,EU,60,-9,2000.0,0.821
REP010,EU,60,72,2000.0,0.817
REP068,EU,0,72,2000.0,0.795
REP001,US,0,96,1.0,0.79



📊 Manual Tuning Performance: 182 detected (330.9% rate), 6.72% FP


### 2.3 Model Comparison

Let's compare the auto-discovered vs manually tuned models:


In [0]:
# Compare models
print("📊 Model Comparison:\\n")
comparison = registry_df.filter(
    F.col("model_uri").isin([model_name_auto, model_name_manual])
).select(
    "model_name",
    "columns",
    "training_rows",
    "metrics"
).collect()

for model in comparison:
    print(f"{'='*60}")
    print(f"Model: {model['model_name']}")
    print(f"Columns: {model['columns']}")
    print(f"Training rows: {model['training_rows']}")
    print(f"Metrics: {model['metrics']}")
    print()

print("💡 Tuning Tips:")
print("   - contamination: Set to expected anomaly rate (0.01-0.1)")
print("   - num_trees: More trees = more stable (100-200)")
print("   - max_samples: Smaller = faster, larger = more accurate (256-1024)")
print("   - Start with auto-discovery, then refine based on domain knowledge")


📊 Model Comparison:\n
💡 Tuning Tips:
   - contamination: Set to expected anomaly rate (0.01-0.1)
   - num_trees: More trees = more stable (100-200)
   - max_samples: Smaller = faster, larger = more accurate (256-1024)
   - Start with auto-discovery, then refine based on domain knowledge


---

## Section 3: Segment-Based Monitoring (8 min)

Different regions have different patterns. Train per-region models for accurate baselines.


In [0]:
# Train with regional segmentation
print("🌍 Training region-specific anomaly models...\\n")

# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_segmented = anomaly_engine.train(
    df=df_train,
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],  # Train separate model per region
    model_name="field_force_regional",
    params=AnomalyParams(
        algorithm_config=IsolationForestConfig(contamination=0.05, num_trees=150, random_seed=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print(f"\\n✅ Regional models trained!")
print("   DQX automatically trained 3 models (US, EU, APAC)")


🌍 Training region-specific anomaly models...\n


  client.get_latest_versions(model_name, stages=None)


Training segment 1/3: region=US


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-c84b0be096944a34965e4f520fe89f6c?o=984752964297111
Successfully registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=us'.


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '1' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=us': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=us/version/1?o=984752964297111
[90m09:43:55[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 2/3: region=EU


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-c3061a462684448580e2f74ddbcdd36f?o=984752964297111
Successfully registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=eu'.


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '1' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=eu': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=eu/version/1?o=984752964297111
[90m09:44:32[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 3/3: region=APAC


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-68b94590822941beaa07bdd0e1d3aa87?o=984752964297111
Successfully registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=apac'.


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '1' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=apac': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=apac/version/1?o=984752964297111
[90m09:45:04[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


   Trained 3/3 segment models for: vbdemos.dqx_demo.field_force_regional
   Registry: vbdemos.dqx_demo.anomaly_model_registry
\n✅ Regional models trained!
   DQX automatically trained 3 models (US, EU, APAC)


In [0]:
# Compare regional baselines
regional_models = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry").filter(
    (F.col("model_name").startswith(f"{catalog}.{schema_name}.field_force_regional__seg_")) &
    (F.col("status") == "active")
)

print("📊 Regional Model Baselines:\\n")
for row in regional_models.select("segment_values", "training_rows", "baseline_stats").collect():
    region = row['segment_values']['region']
    print(f"Region: {region}")
    print(f"  Training rows: {row['training_rows']}")
    print(f"  Baseline stats: {row['baseline_stats']}")
    print()

print("🔍 Notice: Each region has different baselines!")
print("   US: Higher expenses ($150 avg)")
print("   EU: Lower expenses ($100 avg)")
print("   APAC: Highest volume (10 calls, 15 prescriptions avg)")


📊 Regional Model Baselines:\n
🔍 Notice: Each region has different baselines!
   US: Higher expenses ($150 avg)
   EU: Lower expenses ($100 avg)
   APAC: Highest volume (10 calls, 15 prescriptions avg)


In [0]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with regional models (automatic routing)
checks_regional = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_scored_regional = dq_engine.apply_checks(df_test, checks_regional)

print(f"⚠️  Regional anomalies by region (threshold: {ANOMALY_SCORE_THRESHOLD}):\\n")
display(df_scored_regional.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD).groupBy("region").agg(
    F.count("*").alias("anomaly_count"),
    F.avg("_info.anomaly.score").alias("avg_score"),
    F.max("_info.anomaly.score").alias("max_score")
).orderBy("region"))

print("\\n📋 Top regional anomalies:")
display(df_scored_regional.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD).orderBy(
    F.col("_info.anomaly.score").desc()
).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(10))


# Store metrics for comparison
anomalies_regional = df_scored_regional.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD)
segmented_detected = anomalies_regional.count()
segmented_detection_rate = (segmented_detected / expected_anomalies) * 100 if expected_anomalies > 0 else 0
segmented_flagged = df_scored_regional.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD).count()
segmented_fp = max(0, segmented_flagged - expected_anomalies)
segmented_fp_rate = (segmented_fp / normal_records_expected) * 100 if normal_records_expected > 0 else 0

segmented_metrics = {
    "detected": segmented_detected,
    "detection_rate": segmented_detection_rate,
    "fp_rate": segmented_fp_rate
}

print(f"\n📊 Segmented Performance: {segmented_detected} detected ({segmented_detection_rate:.1f}% rate), {segmented_fp_rate:.2f}% FP")


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



⚠️  Regional anomalies by region (threshold: 0.5):\n


region,anomaly_count,avg_score,max_score
APAC,67,0.5984046077271297,0.8663759876236616
EU,73,0.5836047866287966,0.8304100136027274
US,69,0.5945251031475103,0.8319096168135492


\n📋 Top regional anomalies:


rep_id,region,calls_made,prescriptions_generated,expenses,score
REP020,APAC,100,120,2400.0,0.866
REP052,APAC,100,0,2400.0,0.842
REP046,APAC,0,120,2400.0,0.836
REP000,APAC,0,120,2400.0,0.836
REP082,US,80,96,3000.0,0.832
REP010,EU,60,72,2000.0,0.83
REP015,APAC,100,120,1.0,0.826
REP018,EU,60,-9,2000.0,0.8
REP068,EU,0,72,2000.0,0.799
REP001,US,0,96,1.0,0.783



📊 Segmented Performance: 209 detected (380.0% rate), 8.15% FP


---

## Section 4: Feature Contributions & Root Cause (8 min)

**Why is a record anomalous?** Use SHAP to understand which columns drove the anomaly score.


In [0]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score with SHAP-based feature contributions
checks_with_contrib = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "include_contributions": True,  # Enable SHAP explanations
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

df_with_contrib = dq_engine.apply_checks(df_test, checks_with_contrib)

# Save regional contrib results to table for efficient reuse
contrib_table = f"{catalog}.{schema_name}.regional_contrib_temp"
df_with_contrib.write.mode("overwrite").saveAsTable(contrib_table)
df_with_contrib = spark.table(contrib_table)

print(f"🔍 Top Anomalies with Feature Contributions (SHAP, threshold: {ANOMALY_SCORE_THRESHOLD}):\\n")
anomalies_contrib = df_with_contrib.filter(
    F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD
).orderBy(F.col("_info.anomaly.score").desc()).limit(10)

display(anomalies_contrib.select(
    "rep_id", "region",
    "calls_made", "prescriptions_generated", "samples_distributed", "expenses",
    F.round("_info.anomaly.score", 3).alias("score"),
    "_info.anomaly.contributions"
))


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



🔍 Top Anomalies with Feature Contributions (SHAP, threshold: 0.5):\n


rep_id,region,calls_made,prescriptions_generated,samples_distributed,expenses,score,contributions
REP020,APAC,100,120,360,2400.0,0.866,"Map(calls_made -> 0.2614958715579856, prescriptions_generated -> 0.24900054891493292, samples_distributed -> 0.21462117233552688, expenses -> 0.2748824071915546)"
REP052,APAC,100,0,360,2400.0,0.842,"Map(calls_made -> 0.3223438812156769, prescriptions_generated -> 0.07296493858403298, samples_distributed -> 0.2585777456035029, expenses -> 0.3461134345967871)"
REP046,APAC,0,120,360,2400.0,0.836,"Map(calls_made -> 0.08209630538106558, prescriptions_generated -> 0.3059512470410685, samples_distributed -> 0.27031216723221535, expenses -> 0.34164028034565064)"
REP000,APAC,0,120,360,2400.0,0.836,"Map(calls_made -> 0.08209630538106558, prescriptions_generated -> 0.3059512470410685, samples_distributed -> 0.27031216723221535, expenses -> 0.34164028034565064)"
REP082,US,80,96,0,3000.0,0.832,"Map(calls_made -> 0.2691688729771543, prescriptions_generated -> 0.3167999566233494, samples_distributed -> 0.11261951013137486, expenses -> 0.30141166026812155)"
REP010,EU,60,72,0,2000.0,0.83,"Map(calls_made -> 0.27798522819379506, prescriptions_generated -> 0.3134029908368244, samples_distributed -> 0.0922568509351584, expenses -> 0.3163549300342222)"
REP015,APAC,100,120,360,1.0,0.826,"Map(calls_made -> 0.3269197050892823, prescriptions_generated -> 0.3252647670353453, samples_distributed -> 0.26089554082678945, expenses -> 0.08691998704858288)"
REP018,EU,60,-9,216,2000.0,0.8,"Map(calls_made -> 0.30059533189420745, prescriptions_generated -> 0.0844229297167624, samples_distributed -> 0.24312370363852387, expenses -> 0.37185803475050644)"
REP068,EU,0,72,0,2000.0,0.799,"Map(calls_made -> 0.12218082127226514, prescriptions_generated -> 0.3803792960205415, samples_distributed -> 0.1008197641062422, expenses -> 0.3966201186009511)"
REP001,US,0,96,300,1.0,0.783,"Map(calls_made -> 0.12991245563073936, prescriptions_generated -> 0.4373705683250926, samples_distributed -> 0.29650177380031123, expenses -> 0.13621520224385666)"


In [0]:
# Analyze contribution patterns for root cause
print("📊 Root Cause Analysis:\\n")

top_anomaly = anomalies_contrib.first()
print(f"🔸 Top Anomaly: REP={top_anomaly['rep_id']}, Region={top_anomaly['region']}")
print(f"   Score: {top_anomaly['_info']['anomaly']['score']:.3f}")
print(f"   Values:")
print(f"     • calls_made: {top_anomaly['calls_made']}")
print(f"     • prescriptions: {top_anomaly['prescriptions_generated']}")
print(f"     • samples: {top_anomaly['samples_distributed']}")
print(f"     • expenses: ${top_anomaly['expenses']:.2f}")
print(f"\\n   📈 Feature Contributions (SHAP):")

if top_anomaly['_info']['anomaly']['contributions']:
    sorted_contribs = sorted(
        top_anomaly['_info']['anomaly']['contributions'].items(),
        key=lambda x: x[1],
        reverse=True
    )
    for feature, contribution in sorted_contribs:
        print(f"      {feature:30s}: {contribution:.3f} ({contribution*100:.1f}%)")

print("\\n💡 Business Interpretation Examples:")
print("   • High 'expenses' contribution → Potential fraud or policy violation")
print("   • High 'calls_made' + low 'prescriptions' → Training need or territory issue")
print("   • High 'prescriptions' contribution → Unrealistic claims to investigate")
print("   • Balanced contributions → Multivariate anomaly (multiple factors)")


📊 Root Cause Analysis:\n
🔸 Top Anomaly: REP=REP020, Region=APAC
   Score: 0.866
   Values:
     • calls_made: 100
     • prescriptions: 120
     • samples: 360
     • expenses: $2400.00
\n   📈 Feature Contributions (SHAP):
      expenses                      : 0.275 (27.5%)
      calls_made                    : 0.261 (26.1%)
      prescriptions_generated       : 0.249 (24.9%)
      samples_distributed           : 0.215 (21.5%)
\n💡 Business Interpretation Examples:
   • High 'expenses' contribution → Potential fraud or policy violation
   • High 'calls_made' + low 'prescriptions' → Training need or territory issue
   • High 'prescriptions' contribution → Unrealistic claims to investigate
   • Balanced contributions → Multivariate anomaly (multiple factors)


---

### 📊 Approach Comparison & Recommendations

Let's compare all three approaches to see which performed best.


In [0]:
# Compare all three approaches
print(f"🏆 Performance Comparison (Threshold: {ANOMALY_SCORE_THRESHOLD})\\n")
print("="*80)

comparison_data = [
    ("Auto-Discovery", auto_metrics['detected'], auto_metrics['detection_rate'], auto_metrics['fp_rate']),
    ("Manual Tuned", manual_metrics['detected'], manual_metrics['detection_rate'], manual_metrics['fp_rate']),
    ("Segmented (Regional)", segmented_metrics['detected'], segmented_metrics['detection_rate'], segmented_metrics['fp_rate']),
]

# Create DataFrame for comparison
comparison_df = spark.createDataFrame(comparison_data, ["Approach", "Detected", "Detection_Rate_%", "FP_Rate_%"])
display(comparison_df)

# Determine winner
best_detection = max(auto_metrics['detection_rate'], manual_metrics['detection_rate'], segmented_metrics['detection_rate'])
best_fp = min(auto_metrics['fp_rate'], manual_metrics['fp_rate'], segmented_metrics['fp_rate'])

print("\\n🎯 Key Findings:\\n")

if segmented_metrics['detection_rate'] == best_detection:
    print("✅ WINNER: Segmented approach has the BEST detection rate!")
    print(f"   {segmented_metrics['detection_rate']:.1f}% detection with {segmented_metrics['fp_rate']:.2f}% false positives")
elif manual_metrics['detection_rate'] == best_detection:
    print("✅ WINNER: Manual tuning has the BEST detection rate!")
    print(f"   {manual_metrics['detection_rate']:.1f}% detection with {manual_metrics['fp_rate']:.2f}% false positives")
else:
    print("✅ WINNER: Auto-discovery has the BEST detection rate!")
    print(f"   {auto_metrics['detection_rate']:.1f}% detection with {auto_metrics['fp_rate']:.2f}% false positives")

print("\\n💡 Recommendations:\\n")
print("| Approach | When to Use |")
print("|----------|-------------|")
print("| **Auto-Discovery** | Quick start, exploration, uniform data |")
print("| **Manual Tuned** | Production, known important features, single baseline |")
print("| **Segmented** | Multi-region/multi-product with different baselines |")
print("\\n📈 Best Practice: Start with auto-discovery, validate results, then refine with")
print("   manual tuning or segmentation based on your business context.")

# Show what segmentation helps with
print("\\n🌍 Why Segmentation Works:")
print(f"   • Different regions have different 'normal' patterns")
print(f"   • US avg expenses: $150, APAC: $120, EU: $100")
print(f"   • Segmented models catch region-specific anomalies better")
print(f"   • Reduces false positives from natural regional differences")

print(f"\n📝 Note: All approaches use the same threshold ({ANOMALY_SCORE_THRESHOLD}).")
print(f"   To experiment with different thresholds, change ANOMALY_SCORE_THRESHOLD at the top and re-run.")


🏆 Performance Comparison (Threshold: 0.5)\n


Approach,Detected,Detection_Rate_%,FP_Rate_%
Auto-Discovery,721,1310.909090909091,35.23809523809524
Manual Tuned,182,330.90909090909093,6.71957671957672
Segmented (Regional),209,380.0,8.148148148148149


\n🎯 Key Findings:\n
✅ WINNER: Auto-discovery has the BEST detection rate!
   1310.9% detection with 35.24% false positives
\n💡 Recommendations:\n
| Approach | When to Use |
|----------|-------------|
| **Auto-Discovery** | Quick start, exploration, uniform data |
| **Manual Tuned** | Production, known important features, single baseline |
| **Segmented** | Multi-region/multi-product with different baselines |
\n📈 Best Practice: Start with auto-discovery, validate results, then refine with
   manual tuning or segmentation based on your business context.
\n🌍 Why Segmentation Works:
   • Different regions have different 'normal' patterns
   • US avg expenses: $150, APAC: $120, EU: $100
   • Segmented models catch region-specific anomalies better
   • Reduces false positives from natural regional differences

📝 Note: All approaches use the same threshold (0.5).
   To experiment with different thresholds, change ANOMALY_SCORE_THRESHOLD at the top and re-run.


---

## Section 5: Drift Detection & Retraining (6 min)

Data distributions change over time. DQX can detect when your model becomes stale.


In [0]:
# Simulate drift: New patterns (more remote work, lower expenses post-policy change)
def generate_drifted_data(num_rows=200):
    """Generate Q3 data with shifted distribution (post-policy change)."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    
    # NEW PATTERNS: More remote work, lower expenses, similar productivity
    new_patterns = {
        "US": {"calls": (9, 2), "prescriptions": (12, 3), "samples": (20, 4), "expenses": (100, 20)},  # -33% expenses
        "EU": {"calls": (7, 1.5), "prescriptions": (9, 2), "samples": (15, 3), "expenses": (70, 15)},   # -30% expenses
        "APAC": {"calls": (11, 3), "prescriptions": (15, 4), "samples": (25, 6), "expenses": (85, 20)}, # -29% expenses
    }
    
    start_date = datetime(2024, 7, 1)  # Q3 data
    
    for i in range(num_rows):
        region = random.choice(regions)
        pattern = new_patterns[region]
        
        calls = max(1, int(np.random.normal(pattern["calls"][0], pattern["calls"][1])))
        prescriptions = max(0, int(np.random.normal(pattern["prescriptions"][0], pattern["prescriptions"][1])))
        samples = max(0, int(np.random.normal(pattern["samples"][0], pattern["samples"][1])))
        expenses = max(10, round(np.random.normal(pattern["expenses"][0], pattern["expenses"][1]), 2))
        is_remote = random.random() < 0.7  # 70% remote now (was 30%)
        call_type = random.choice(call_types)
        
        days_offset = random.randint(0, 90)
        call_date = start_date + timedelta(days=days_offset)
        
        # Add activity_id and true_anomaly_type to match schema (11 columns)
        data.append((
            f"ACT_DRIFT{i:06d}",  # activity_id (primary key)
            f"REP{i % 50:03d}",   # rep_id
            region,
            call_date,
            calls,
            prescriptions,
            samples,
            expenses,
            is_remote,
            call_type,
            "normal"  # true_anomaly_type (drift data is normal, just shifted distribution)
        ))
    
    return data

# Generate and compare
drifted_data = generate_drifted_data(num_rows=200)
df_drifted = spark.createDataFrame(drifted_data, schema)

print("📊 Original vs Drifted Data Comparison:\\n")
print("Original (Q1-Q2 2024):") 
display(df_sales.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
))

print("Drifted (Q3 2024 - post policy change):")
display(df_drifted.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
))

print("✅ Distribution shifted:")
print("   • Expenses: -30% (policy change)")
print("   • Remote work: +133% (70% vs 30%)")


📊 Original vs Drifted Data Comparison:\n
Original (Q1-Q2 2024):


avg_expenses,remote_rate
138.774208,0.1796


Drifted (Q3 2024 - post policy change):


avg_expenses,remote_rate
82.71954999999994,0.685


✅ Distribution shifted:
   • Expenses: -30% (policy change)
   • Remote work: +133% (70% vs 30%)


In [0]:
# === EXPLICIT DRIFT STATISTICS ===
# Python warnings can get lost in Databricks output, so let's explicitly compute drift

from databricks.labs.dqx.anomaly.drift_detector import compute_drift_score

print("📊 Explicit Drift Analysis:\n")
print("=" * 70)

# Get the trained model's baseline statistics
registry_df = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry")
regional_models = registry_df.filter(
    (F.col("model_name").startswith(f"{catalog}.{schema_name}.field_force_regional__seg_")) &
    (F.col("status") == "active")
)

# Check drift for each region
for row in regional_models.collect():
    region = row["segment_values"]["region"] if row["segment_values"] else "Global"
    baseline_stats = row["baseline_stats"]
    columns = row["columns"]
    
    # Filter drifted data for this region/segment
    if row["segment_values"]:
        segment_filter = " AND ".join([f"{k} = '{v}'" for k, v in row["segment_values"].items()])
        df_segment = df_drifted.filter(segment_filter)
    else:
        df_segment = df_drifted
    
    # Compute drift
    drift_result = compute_drift_score(
        df_segment.select(columns),
        columns,
        baseline_stats,
        DRIFT_THRESHOLD
    )
    
    print(f"\n🌍 Region: {region}")
    print(f"   Drift Score: {drift_result.drift_score:.2f}")
    print(f"   Drift Detected: {'🚨 YES' if drift_result.drift_detected else '✅ NO'}")
    
    if drift_result.drifted_columns:
        print(f"   Drifted Columns: {', '.join(drift_result.drifted_columns)}")
        
        # Show baseline vs current stats for drifted columns
        print(f"\n   📈 Baseline vs Current:")
        for col in drift_result.drifted_columns:
            if col in baseline_stats:
                baseline = baseline_stats[col]
                current = df_segment.select(col).agg(
                    F.avg(col).alias("mean"),
                    F.stddev(col).alias("stddev")
                ).first()
                
                print(f"      {col}:")
                print(f"         Baseline: mean={baseline['mean']:.2f}, std={baseline['std']:.2f}")
                print(f"         Current:  mean={current['mean']:.2f}, std={current['stddev']:.2f}")
                print(f"         Change:   {((current['mean'] - baseline['mean']) / baseline['mean'] * 100):.1f}%")

print("\n" + "=" * 70)
print(f"\n💡 Threshold: {DRIFT_THRESHOLD}")
print("   Drift score > threshold → Model needs retraining")
print("   High drift in 'expenses' is expected (policy change: -30% expenses)")


📊 Explicit Drift Analysis:


🌍 Region: US
   Drift Score: 0.71
   Drift Detected: ✅ NO

🌍 Region: EU
   Drift Score: 0.60
   Drift Detected: ✅ NO

🌍 Region: APAC
   Drift Score: 0.53
   Drift Detected: ✅ NO


💡 Threshold: 3.0
   Drift score > threshold → Model needs retraining
   High drift in 'expenses' is expected (policy change: -30% expenses)


In [0]:
from databricks.labs.dqx.rule import DQDatasetRule

# Score drifted data with drift detection enabled
checks_with_drift = [
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "drift_threshold": DRIFT_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

print(f"🔍 Scoring drifted data with drift detection (threshold: {DRIFT_THRESHOLD})...\n")

df_drift_scored = dq_engine.apply_checks(df_drifted, checks_with_drift)

print(f"\n💡 Drift score > {DRIFT_THRESHOLD} → Significant distribution shift, retrain recommended")
print("   DQX will show UserWarnings if drift is detected:")
print("   🚨 'DATA DRIFT DETECTED in columns: expenses (drift score: 4.2)...'")
print("\n✅ Check cell output above for any drift UserWarnings.")


🔍 Scoring drifted data with drift detection (threshold: 3.0)...



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]




💡 Drift score > 3.0 → Significant distribution shift, retrain recommended
   🚨 'DATA DRIFT DETECTED in columns: expenses (drift score: 4.2)...'



In [0]:
# Retrain with combined data
df_combined = df_sales.union(df_drifted)

print("🔄 Retraining model with combined data (old + new patterns)...\\n")

# Note: We explicitly select features, excluding activity_id and true_anomaly_type
model_name_retrained = anomaly_engine.train(
    df=df_combined,
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],
    model_name="field_force_regional",  # Same name = new version
    params=AnomalyParams(
        algorithm_config=IsolationForestConfig(contamination=0.05, num_trees=150, random_seed=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\\n✅ Model retrained!")
print("   • Old model automatically archived")
print("   • New model active and includes both historical and recent patterns")
print("   • Baseline updated to reflect new expense policy and remote work rates")
print("\\n💡 Best Practice: Set up drift monitoring in production, retrain monthly/quarterly")


🔄 Retraining model with combined data (old + new patterns)...\n


  client.get_latest_versions(model_name, stages=None)


Training segment 1/3: region=US


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-338be17046a1494586569989feb6d991?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=US' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '2' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=us': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=us/version/2?o=984752964297111
[90m09:46:38[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 2/3: region=EU


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-33059f789ddd42f2babd17cde4efc424?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=EU' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '2' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=eu': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=eu/version/2?o=984752964297111
[90m09:47:13[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


Training segment 3/3: region=APAC


🔗 View Logged Model at: https://adb-984752964297111.11.azuredatabricks.net/ml/experiments/3582259051881804/models/m-f5fe525ecca740b692a2673325cc4f04?o=984752964297111
Registered model 'vbdemos.dqx_demo.field_force_regional__seg_region=APAC' already exists. Creating a new version of this model...


Uploading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]

🔗 Created version '2' of model 'vbdemos.dqx_demo.field_force_regional__seg_region=apac': https://adb-984752964297111.11.azuredatabricks.net/explore/data/models/vbdemos/dqx_demo/field_force_regional__seg_region=apac/version/2?o=984752964297111
[90m09:47:50[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.anomaly_model_registry table[0m


   Trained 3/3 segment models for: vbdemos.dqx_demo.field_force_regional
   Registry: vbdemos.dqx_demo.anomaly_model_registry
\n✅ Model retrained!
   • Old model automatically archived
   • New model active and includes both historical and recent patterns
   • Baseline updated to reflect new expense policy and remote work rates
\n💡 Best Practice: Set up drift monitoring in production, retrain monthly/quarterly


---

## Section 6: Production Integration (6 min)

Integrate anomaly detection into your DQX workflows for automated monitoring.


In [0]:
# Combine anomaly detection with traditional DQ checks
from databricks.labs.dqx.rule import DQRowRule, DQDatasetRule

checks_combined = [
    # Traditional data quality checks - one per column
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="rep_id",
        name="rep_id_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="region",
        name="region_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_not_null,
        column="call_date",
        name="call_date_not_null"
    ),
    DQRowRule(
        criticality="error",
        check_func=is_in_range,
        column="calls_made",
        name="calls_range",
        check_func_kwargs={"min_limit": 0, "max_limit": 50
        }
    ),
    DQRowRule(
        criticality="error",
        check_func=is_in_range,
        column="expenses",
        name="expenses_range",
        check_func_kwargs={"min_limit": 0, "max_limit": 1000}
    ),
    
    # ML-based anomaly detection with explanations
    DQDatasetRule(
        criticality="error",
        check_func=has_no_anomalies,
        check_func_kwargs={
            "model": "field_force_regional",
            "score_threshold": ANOMALY_SCORE_THRESHOLD,  # Use configurable threshold
            "include_contributions": True,
            "drift_threshold": DRIFT_THRESHOLD,  # Use configurable threshold
            "registry_table": f"{catalog}.{schema_name}.anomaly_model_registry",
            "merge_columns": ["activity_id"]
        }
    )
]

# Apply all checks together
df_full_dq = dq_engine.apply_checks(df_test, checks_combined)

# Summary
print(f"📊 Full Data Quality Summary (threshold: {ANOMALY_SCORE_THRESHOLD}):\\n")
total_rows = df_full_dq.count()
anomalies_found = df_full_dq.filter(F.col("_info.anomaly.score") >= ANOMALY_SCORE_THRESHOLD).count()

# Note: Traditional check condition columns would have specific names based on implementation
print(f"Total Rows: {total_rows}")
print(f"Anomalies Detected: {anomalies_found}")
print(f"Clean Records: {total_rows - anomalies_found}")
print(f"\\n✅ All checks applied in single pass!")


Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



📊 Full Data Quality Summary (threshold: 0.5):\n
Total Rows: 1979
Anomalies Detected: 170
Clean Records: 1809
\n✅ All checks applied in single pass!


In [0]:
# === QUARANTINE WORKFLOW (DQX Standard Pattern) ===
# Use DQX's built-in split method to separate valid from quarantined records

print("🔀 Applying quarantine workflow...")
print(f"   Threshold: {ANOMALY_SCORE_THRESHOLD}\n")

# Split valid and quarantined data using DQX standard method
valid_df, quarantine_df = dq_engine.apply_checks_and_split(df_test, checks_combined)

print(f"✅ Valid records: {valid_df.count()}")
print(f"⚠️  Quarantined for review: {quarantine_df.count()}\n")

# Save both valid and quarantine data using DQX standard method
dq_engine.save_results_in_table(
    output_df=valid_df,
    quarantine_df=quarantine_df,
    output_config=OutputConfig(
        location=f"{catalog}.{schema_name}.field_force_clean",
        mode="overwrite"
    ),
    quarantine_config=OutputConfig(
        location=f"{catalog}.{schema_name}.field_force_quarantine",
        mode="overwrite"
    )
)

print(f"💾 Saved valid data to: {catalog}.{schema_name}.field_force_clean")
print(f"💾 Saved quarantine to: {catalog}.{schema_name}.field_force_quarantine")

# Display quarantine summary
print("\n📊 Quarantine Summary by Region:")
quarantine_summary = spark.table(f"{catalog}.{schema_name}.field_force_quarantine").groupBy("region").agg(
    F.count("*").alias("count"),
    F.avg("_info.anomaly.score").alias("avg_score"),
    F.max("_info.anomaly.score").alias("max_score")
).orderBy("region")
display(quarantine_summary)

# Show top quarantined records with explanations
print("\n📋 Top Quarantined Records (for Manual Review):")
display(
    spark.table(f"{catalog}.{schema_name}.field_force_quarantine")
    .orderBy(F.desc("_info.anomaly.score"))
    .select(
        "activity_id", "rep_id", "region", "calls_made", "prescriptions_generated", 
        "expenses", F.round("_info.anomaly.score", 3).alias("score"),
        "_info.anomaly.contributions", "_errors"
    )
    .limit(10)
)

print("\n💡 Quarantine Workflow Best Practices:")
print("   1. Anomalies automatically sent to quarantine table")
print("   2. Review team investigates using anomaly_contributions")
print("   3. Check _errors column for all DQ violations")
print("   4. Confirmed issues → escalate to appropriate team")
print("   5. False positives → retune model or adjust threshold")


🔀 Applying quarantine workflow...
   Threshold: 0.5



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



Downloading artifacts:   0%|          | 0/9 [00:00<?, ?it/s]



✅ Valid records: 1809


[90m09:48:33[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.field_force_clean table[0m


⚠️  Quarantined for review: 170



[90m09:48:43[0m [1m[32m INFO[0m [1m[d.l.dqx.io] Saving data to vbdemos.dqx_demo.field_force_quarantine table[0m


💾 Saved valid data to: vbdemos.dqx_demo.field_force_clean
💾 Saved quarantine to: vbdemos.dqx_demo.field_force_quarantine

📊 Quarantine Summary by Region:


region,count,avg_score,max_score
APAC,54,0.5903794234472398,0.8322849405615649
EU,60,0.582417352234857,0.8330360961263729
US,56,0.5822045923808167,0.835045329010688



📋 Top Quarantined Records (for Manual Review):


activity_id,rep_id,region,calls_made,prescriptions_generated,expenses,score,contributions,_errors
ACT009415,REP015,US,0,96,3000.0,0.835,"Map(calls_made -> 0.10631982243496092, prescriptions_generated -> 0.33254312831768346, samples_distributed -> 0.24195372061025172, expenses -> 0.3191833286371039)","List(List(expenses_range, Value '3000.0' in Column 'expenses' not in range: [0, 1000], List(expenses), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT006910,REP010,EU,60,72,2000.0,0.833,"Map(calls_made -> 0.2658435459989941, prescriptions_generated -> 0.31564225571973165, samples_distributed -> 0.0900326772224565, expenses -> 0.3284815210588178)","List(List(calls_range, Value '60' in Column 'calls_made' not in range: [0, 50], List(calls_made), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(expenses_range, Value '2000.0' in Column 'expenses' not in range: [0, 1000], List(expenses), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT008305,REP005,EU,60,72,2000.0,0.833,"Map(calls_made -> 0.2658435459989941, prescriptions_generated -> 0.31564225571973165, samples_distributed -> 0.0900326772224565, expenses -> 0.3284815210588178)","List(List(calls_range, Value '60' in Column 'calls_made' not in range: [0, 50], List(calls_made), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(expenses_range, Value '2000.0' in Column 'expenses' not in range: [0, 1000], List(expenses), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT001513,REP013,APAC,0,120,2400.0,0.832,"Map(calls_made -> 0.10525298671848196, prescriptions_generated -> 0.30818490535889204, samples_distributed -> 0.2518655965132138, expenses -> 0.33469651140941226)","List(List(expenses_range, Value '2400.0' in Column 'expenses' not in range: [0, 1000], List(expenses), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT005715,REP015,APAC,100,120,1.0,0.817,"Map(calls_made -> 0.3129244778050766, prescriptions_generated -> 0.3356937305839634, samples_distributed -> 0.26127784623923506, expenses -> 0.09010394537172497)","List(List(calls_range, Value '100' in Column 'calls_made' not in range: [0, 50], List(calls_made), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT009843,REP043,EU,0,0,2000.0,0.79,"Map(calls_made -> 0.11824994156010901, prescriptions_generated -> 0.09014817426975365, samples_distributed -> 0.3386912082537647, expenses -> 0.4529106759163727)","List(List(expenses_range, Value '2000.0' in Column 'expenses' not in range: [0, 1000], List(expenses), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT002048,REP048,US,80,-12,1.0,0.746,"Map(calls_made -> 0.4232513837088163, prescriptions_generated -> 0.12280773581009478, samples_distributed -> 0.3434950714184874, expenses -> 0.1104458090626015)","List(List(calls_range, Value '80' in Column 'calls_made' not in range: [0, 50], List(calls_made), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT001999,REP099,APAC,100,-15,1.0,0.746,"Map(calls_made -> 0.5483104892953596, prescriptions_generated -> 0.1605520203185372, samples_distributed -> 0.16361866226667346, expenses -> 0.1275188281194297)","List(List(calls_range, Value '100' in Column 'calls_made' not in range: [0, 50], List(calls_made), null, is_in_range, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()), List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT005436,REP036,US,0,96,1.0,0.736,"Map(calls_made -> 0.17760590292499587, prescriptions_generated -> 0.569141004374334, samples_distributed -> 0.13142613562558944, expenses -> 0.12182695707508057)","List(List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"
ACT005067,REP067,EU,0,72,1.0,0.723,"Map(calls_made -> 0.16434198042508835, prescriptions_generated -> 0.5912580550484519, samples_distributed -> 0.13540963849064191, expenses -> 0.10899032603581778)","List(List(has_anomalies, Anomaly score exceeded threshold 0.5, null, null, has_no_anomalies, 2025-12-23T09:48:44.895Z, 26e05245-c51b-4ed2-ba0c-e03f8c026fb1, Map()))"



💡 Quarantine Workflow Best Practices:
   1. Anomalies automatically sent to quarantine table
   2. Review team investigates using anomaly_contributions
   3. Check _errors column for all DQ violations
   4. Confirmed issues → escalate to appropriate team
   5. False positives → retune model or adjust threshold


### YAML Configuration for Production

For automated workflows, define checks in YAML:

```yaml
run_configs:
  - name: field_force_monitoring
    input_config:
      location: catalog.schema.field_force_activity
    
    # Traditional checks
    quality_checks:
      - function: is_not_null
        arguments:
          columns: [rep_id, region, call_date]
      - function: is_in_range
        arguments:
          column: calls_made
          min_value: 0
          max_value: 50
      - function: is_in_range
        arguments:
          column: expenses
          min_value: 0
          max_value: 1000
    
    # Anomaly detection
    anomaly_config:
      columns: [calls_made, prescriptions_generated, samples_distributed, expenses]
      segment_by: [region]
      model_name: field_force_regional
      registry_table: catalog.schema.anomaly_model_registry
      params:
        algorithm_config:
          contamination: 0.05
          num_trees: 150
          random_state: 42
        sample_fraction: 1.0
    
    # Quarantine configuration
    quarantine_config:
      enabled: true
      table: catalog.schema.field_force_quarantine
      
    # Output configuration
    output_config:
      location: catalog.schema.field_force_clean
      save_mode: overwrite
```

**Run with:**
```bash
# Train model (one-time or scheduled)
databricks bundle run anomaly_trainer

# Run quality checks (scheduled, e.g., daily)
databricks bundle run quality_checker
```


---

## 🎓 Summary

### What You Learned:

1. ✅ **Auto-Discovery vs Manual Tuning** - Start with zero-config, refine with domain knowledge
2. ✅ **Parameter Tuning** - contamination, num_trees, max_samples for better performance
3. ✅ **Segment-Based Monitoring** - Regional baselines prevent false positives (US vs EU vs APAC)
4. ✅ **Feature Contributions** - SHAP-based root cause analysis for investigation
5. ✅ **Drift Detection** - Automated signals for when to retrain models
6. ✅ **Multi-Type Features** - Numeric, categorical, datetime, boolean all work together
7. ✅ **Production Integration** - DQEngine + YAML workflows + quarantine handling

### Key Takeaways:

- **Start simple**: `train(df)` with auto-discovery, then refine
- **Tune parameters**: Set contamination to expected anomaly rate, increase num_trees for stability
- **Use segments**: Different baselines for different groups prevent false positives
- **Enable contributions**: Root cause analysis is critical for business value
- **Monitor drift**: Set up drift detection for automated retraining signals
- **Combine checks**: Anomaly detection complements traditional DQ rules
- **Quarantine workflow**: Automate review process with explanations

### Model Comparison Results:

| Approach | Columns | Segments | Tuning | Use Case |
|----------|---------|----------|--------|----------|
| Auto-discovery | Auto (priority-based) | Auto (if applicable) | Default | Quick start, exploration |
| Manual tuned | Hand-picked | Manual | Custom hyperparameters | Production, refined monitoring |
| Regional | Hand-picked | By region | Tuned contamination | Multi-region with different baselines |

### Next Steps:

1. **Apply to your data**: `train(df=spark.table("your_table"))`
2. **Set up YAML workflows**: Automate training and checking
3. **Integrate quarantine**: Build review process with feature contributions
4. **Schedule retraining**: Weekly/monthly based on drift monitoring
5. **Monitor metrics**: Track anomaly rates, drift scores, false positive rates

### Resources:

- [DQX Anomaly Detection Documentation](https://databrickslabs.github.io/dqx/guide/anomaly_detection)
- [API Reference](https://databrickslabs.github.io/dqx/reference/quality_checks#has_no_anomalies)
- [GitHub Repository](https://github.com/databrickslabs/dqx)

---

**Questions? Feedback?** Open an issue on GitHub or contact the DQX team!


In [None]:
# === CLEANUP: Drop temporary tables ===

temp_tables = [
    f"{catalog}.{schema_name}.auto_scored_temp",
    f"{catalog}.{schema_name}.classified_temp",
    f"{catalog}.{schema_name}.shap_analysis_temp",
    f"{catalog}.{schema_name}.regional_contrib_temp"
]

print("🧹 Cleaning up temporary tables...\n")
for table in temp_tables:
    try:
        spark.sql(f"DROP TABLE IF EXISTS {table}")
        print(f"   ✅ Dropped {table}")
    except Exception as e:
        print(f"   ⚠️  Could not drop {table}: {e}")

print("\n✅ Cleanup complete!")