# 🏥 Pharmaceutical Field Force Effectiveness - Anomaly Detection Demo

## Business Context

**Scenario**: Monitor sales rep performance across regions to detect unusual patterns:
- 💰 Expense fraud or policy violations
- 📉 Territory coverage issues and productivity gaps
- 🚨 Unrealistic prescription claims
- 📚 Training needs identification

**Data**: Sales rep daily activity with calls, prescriptions, samples, expenses across US, EU, APAC

## What You'll Learn (45 min comprehensive demo)

1. **Auto-Discovery** - Zero-config vs manual tuning
2. **Segment-Based Monitoring** - Regional baselines (US vs EU vs APAC)
3. **Parameter Tuning** - Contamination, hyperparameters, model comparison
4. **Feature Contributions** - SHAP-based root cause analysis
5. **Drift Detection** - When to retrain models
6. **Multi-Type Features** - Numeric, categorical, datetime, boolean
7. **Production Integration** - DQEngine, YAML, quarantine workflows

---

**📋 Table of Contents:**
- Section 1: Setup & Data Generation (5 min)
- Section 2: Auto-Discovery vs Manual Tuning (12 min)
- Section 3: Segment-Based Monitoring (8 min)
- Section 4: Feature Contributions & Root Cause (8 min)
- Section 5: Drift Detection & Retraining (6 min)
- Section 6: Production Integration (6 min)


---

## Section 1: Setup & Data Generation (5 min)

First, install DQX with anomaly support if not already installed:
```bash
%pip install databricks-labs-dqx[anomaly]
```


In [None]:
# Imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import (
    BooleanType, DateType, DoubleType, IntegerType, StringType, StructField, StructType,
)
from datetime import datetime, timedelta
import random
import numpy as np

from databricks.labs.dqx.anomaly import train, has_no_anomalies, AnomalyParams, IsolationForestConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.check_funcs import is_not_null, is_in_range
from databricks.sdk import WorkspaceClient

# Initialize
spark = SparkSession.builder.getOrCreate()
ws = WorkspaceClient()
dq_engine = DQEngine(ws)

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"   Spark version: {spark.version}")


### Generate Realistic Sales Rep Activity Data

We'll create 1000 rows of daily sales rep activity with:
- **Mixed data types**: Numeric, categorical, datetime, boolean
- **Regional patterns**: Different baselines for US (high expenses), EU (moderate), APAC (high volume)
- **Injected anomalies**: ~5% anomalous records (expense fraud, low productivity, unrealistic prescriptions)


In [None]:
# Generate sales rep activity data with realistic patterns
def generate_field_force_data(num_rows=1000, anomaly_rate=0.05):
    """Generate pharmaceutical field force activity data."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    
    # Regional baseline patterns (realistic differences)
    regional_patterns = {
        "US": {"calls": (8, 2), "prescriptions": (12, 3), "samples": (25, 5), "expenses": (150, 30)},
        "EU": {"calls": (6, 1.5), "prescriptions": (9, 2), "samples": (18, 4), "expenses": (100, 20)},
        "APAC": {"calls": (10, 3), "prescriptions": (15, 4), "samples": (30, 7), "expenses": (120, 25)},
    }
    
    start_date = datetime(2024, 1, 1)
    
    for i in range(num_rows):
        region = random.choice(regions)
        pattern = regional_patterns[region]
        
        # Normal patterns (95% of data)
        if random.random() > anomaly_rate:
            calls = max(1, int(np.random.normal(pattern["calls"][0], pattern["calls"][1])))
            prescriptions = max(0, int(np.random.normal(pattern["prescriptions"][0], pattern["prescriptions"][1])))
            samples = max(0, int(np.random.normal(pattern["samples"][0], pattern["samples"][1])))
            # Cast to a plain Python float: DoubleType rejects numpy.float64 and int values
            expenses = max(10.0, round(float(np.random.normal(pattern["expenses"][0], pattern["expenses"][1])), 2))
            is_remote = random.random() < 0.3
            call_type = random.choice(call_types)
        else:
            # Inject realistic anomalies (5% of data)
            anomaly_type = random.choice(["high_expense", "low_productivity", "unrealistic_prescriptions"])
            
            if anomaly_type == "high_expense":
                # Excessive expenses with low output (potential fraud)
                calls = max(1, int(np.random.normal(pattern["calls"][0] * 0.5, 1)))
                prescriptions = max(0, int(np.random.normal(pattern["prescriptions"][0] * 0.4, 1)))
                samples = max(0, int(np.random.normal(pattern["samples"][0] * 0.6, 2)))
                expenses = round(pattern["expenses"][0] * random.uniform(2.5, 4.0), 2)  # 2.5-4x normal
                is_remote = False
                call_type = "promotional"
            
            elif anomaly_type == "low_productivity":
                # Many calls but few results (training need)
                calls = int(pattern["calls"][0] * random.uniform(1.8, 2.5))
                prescriptions = max(0, int(pattern["prescriptions"][0] * random.uniform(0.2, 0.4)))
                samples = int(pattern["samples"][0] * random.uniform(0.3, 0.5))
                expenses = round(pattern["expenses"][0] * random.uniform(1.2, 1.5), 2)
                is_remote = random.random() < 0.5
                call_type = "follow_up"
            
            else:  # unrealistic_prescriptions
                # Suspiciously high prescription rate (investigation needed)
                calls = max(1, int(np.random.normal(pattern["calls"][0], 1)))
                prescriptions = int(pattern["prescriptions"][0] * random.uniform(2.5, 4.0))
                samples = int(pattern["samples"][0] * random.uniform(1.5, 2.0))
                expenses = round(pattern["expenses"][0] * random.uniform(0.8, 1.2), 2)
                is_remote = False
                call_type = "promotional"
        
        days_offset = random.randint(0, 180)
        call_date = start_date + timedelta(days=days_offset)
        
        data.append((
            f"REP{i % 50:03d}",  # 50 unique reps
            region,
            call_date,
            calls,
            prescriptions,
            samples,
            expenses,
            is_remote,
            call_type
        ))
    
    return data

# Generate data
print("🔄 Generating sales rep activity data...")
field_force_data = generate_field_force_data(num_rows=1000, anomaly_rate=0.05)

schema = StructType([
    StructField("rep_id", StringType(), False),
    StructField("region", StringType(), False),
    StructField("call_date", DateType(), False),
    StructField("calls_made", IntegerType(), False),
    StructField("prescriptions_generated", IntegerType(), False),
    StructField("samples_distributed", IntegerType(), False),
    StructField("expenses", DoubleType(), False),
    StructField("is_remote", BooleanType(), False),
    StructField("call_type", StringType(), False),
])

df_sales = spark.createDataFrame(field_force_data, schema)

print("\n📊 Sample of field force activity data:")
df_sales.orderBy("call_date").show(10, truncate=False)

print(f"\n✅ Generated {df_sales.count()} rows with ~5% injected anomalies")
print(f"   Regions: {df_sales.select('region').distinct().count()}")
print(f"   Unique reps: {df_sales.select('rep_id').distinct().count()}")
print(f"   Date range: {df_sales.agg(F.min('call_date'), F.max('call_date')).first()}")


In [None]:
# Save to table for training
catalog = spark.sql("SELECT current_catalog()").first()[0]
schema_name = "dqx_demo"
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema_name}")

table_name = f"{catalog}.{schema_name}.field_force_activity"
df_sales.write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Data saved to: {table_name}")


---

## Section 2: Auto-Discovery vs Manual Tuning (12 min)

### 2.1 Auto-Discovery (Zero Configuration)

Let's start with zero configuration - DQX will automatically select columns and detect segments.
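
As a rough mental model of priority-based column selection, the sketch below ranks columns by type in the order this demo reports (numeric > boolean > categorical > datetime) and keeps the first few. The `pick_columns` helper, the `TYPE_PRIORITY` table, and the `max_columns` cap are hypothetical names used for illustration only; DQX's actual selection logic may differ.

```python
# Hypothetical sketch of priority-based column selection; not DQX's actual code.
TYPE_PRIORITY = {"numeric": 0, "boolean": 1, "categorical": 2, "datetime": 3}


def pick_columns(column_types, max_columns=5):
    """Return up to max_columns names, ordered by type priority, then by name."""
    ranked = sorted(
        column_types,
        key=lambda name: (TYPE_PRIORITY[column_types[name]], name),
    )
    return ranked[:max_columns]


# Column types from the field force table (region is left out as a likely segment column).
demo_types = {
    "call_date": "datetime",
    "call_type": "categorical",
    "is_remote": "boolean",
    "expenses": "numeric",
    "calls_made": "numeric",
    "prescriptions_generated": "numeric",
    "samples_distributed": "numeric",
}

selected = pick_columns(demo_types, max_columns=5)
print(selected)
```

With a cap of five columns, the four numeric columns are kept first and the boolean fills the last slot, so the categorical and datetime columns are dropped.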


In [None]:
# Train with ZERO configuration (auto-discovery)
print("🎯 Training with AUTO-DISCOVERY (zero config)...\n")

model_uri_auto = train(
    df=spark.table(table_name),
    # NO columns specified - auto-discovered!
    # NO segments specified - auto-discovered!
    # Model name auto-generated!
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Auto-discovery model trained!")
print(f"   Model URI: {model_uri_auto}")

# Check what was auto-discovered
registry_df = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry")
auto_model = registry_df.filter(F.col("model_uri") == model_uri_auto).first()

print("\n📋 Auto-Discovered Configuration:")
print(f"   Columns: {auto_model['columns']}")
print(f"   Segments: {auto_model['segment_by']}")
print(f"   Column types: {auto_model['column_types']}")
print("\n💡 DQX prioritized: numeric > boolean > categorical > datetime")


In [None]:
# Score with auto-discovered model
checks_auto = [
    has_no_anomalies(
        score_threshold=0.5,
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

df_scored_auto = dq_engine.apply_checks_by_metadata(df_sales, checks_auto)
anomalies_auto = df_scored_auto.filter(F.col("anomaly_score") >= 0.5)

print(f"\n⚠️  Auto-discovery found {anomalies_auto.count()} anomalies:\n")
anomalies_auto.orderBy(F.col("anomaly_score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).show(10, truncate=False)


### 2.2 Manual Column Selection & Parameter Tuning

Now let's manually select specific columns and tune hyperparameters for better performance.
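
To build intuition for `contamination`: scikit-learn's `IsolationForest` turns it into a score cutoff by taking the matching percentile of the training scores, so roughly that fraction of training rows ends up flagged. The sketch below mimics that idea with made-up scores where higher means more anomalous. Whether `IsolationForestConfig` follows the same convention is an assumption, and `threshold_for_contamination` is a hypothetical helper, not a DQX function.

```python
def threshold_for_contamination(scores, contamination):
    """Return the score cutoff that flags roughly `contamination` of the rows."""
    if not 0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    ranked = sorted(scores, reverse=True)  # most anomalous first
    n_flagged = max(1, round(contamination * len(ranked)))
    return ranked[n_flagged - 1]


# Made-up anomaly scores for 20 training rows.
scores = [
    0.31, 0.33, 0.35, 0.36, 0.38, 0.39, 0.40, 0.41, 0.42, 0.43,
    0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.52, 0.58, 0.66, 0.81,
]

cutoff_5pct = threshold_for_contamination(scores, 0.05)
cutoff_10pct = threshold_for_contamination(scores, 0.10)
flagged_5pct = [s for s in scores if s >= cutoff_5pct]
flagged_10pct = [s for s in scores if s >= cutoff_10pct]
print(cutoff_5pct, len(flagged_5pct))
print(cutoff_10pct, len(flagged_10pct))
```

Raising `contamination` lowers the cutoff and flags more rows, which is why setting it well above the true anomaly rate tends to produce false positives.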


In [None]:
# Train with MANUAL configuration and tuned parameters
print("🎯 Training with MANUAL tuning...\n")

model_uri_manual = train(
    df=spark.table(table_name),
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],  # Manual selection
    model_name="field_force_tuned",
    params=AnomalyParams(
        isolation_forest=IsolationForestConfig(
            contamination=0.05,  # Expected 5% anomaly rate (matches our data)
            n_estimators=150,    # More trees for stability (default 100)
            max_samples=512,     # Subsample size for speed
            random_state=42      # Reproducibility
        ),
        sample_fraction=1.0,     # Use all data (no sampling)
        max_rows=None            # No row limit
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Manual tuned model trained!")
print(f"   Model URI: {model_uri_manual}")


In [None]:
# Score with tuned model
checks_manual = [
    has_no_anomalies(
        model="field_force_tuned",
        score_threshold=0.5,
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

df_scored_manual = dq_engine.apply_checks_by_metadata(df_sales, checks_manual)
anomalies_manual = df_scored_manual.filter(F.col("anomaly_score") >= 0.5)

print(f"\n⚠️  Manual tuned model found {anomalies_manual.count()} anomalies:\n")
anomalies_manual.orderBy(F.col("anomaly_score").desc()).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).show(10, truncate=False)


### 2.3 Model Comparison

Let's compare the auto-discovered vs manually tuned models:


In [None]:
# Compare models
print("📊 Model Comparison:\n")
comparison = registry_df.filter(
    F.col("model_uri").isin([model_uri_auto, model_uri_manual])
).select(
    "model_name",
    "columns",
    "training_rows",
    "metrics"
).collect()

for model in comparison:
    print(f"{'='*60}")
    print(f"Model: {model['model_name']}")
    print(f"Columns: {model['columns']}")
    print(f"Training rows: {model['training_rows']}")
    print(f"Metrics: {model['metrics']}")
    print()

print("💡 Tuning Tips:")
print("   - contamination: Set to expected anomaly rate (0.01-0.1)")
print("   - n_estimators: More trees = more stable (100-200)")
print("   - max_samples: Smaller = faster, larger = more accurate (256-1024)")
print("   - Start with auto-discovery, then refine based on domain knowledge")


---

## Section 3: Segment-Based Monitoring (8 min)

Different regions have different patterns. Train per-region models for accurate baselines.
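
A small numeric sketch of why pooled baselines hide regional outliers: the same expense value is scored against a pooled baseline and against an EU-only baseline. The expense figures are made up for illustration, and a plain z-score stands in for the isolation forest, so treat this as intuition rather than a description of how DQX scores rows.

```python
from statistics import mean, pstdev

# Made-up daily expenses per region, loosely following the demo's regional patterns.
expenses_by_region = {
    "US": [120, 135, 150, 165, 180],
    "EU": [80, 90, 100, 110, 120],
    "APAC": [95, 110, 120, 130, 145],
}


def z_score(value, sample):
    """Distance of value from the sample mean, in population standard deviations."""
    return (value - mean(sample)) / pstdev(sample)


pooled = [x for values in expenses_by_region.values() for x in values]
candidate = 170.0  # an expense claim filed by an EU rep

z_pooled = z_score(candidate, pooled)
z_eu = z_score(candidate, expenses_by_region["EU"])
print(f"pooled z = {z_pooled:.2f}, EU-only z = {z_eu:.2f}")
```

Against the pooled data, the claim sits inside the normal spread because US expenses stretch the baseline. Against EU data alone, it is far outside the usual range.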


In [None]:
# Train with regional segmentation
print("🌍 Training region-specific anomaly models...\n")

model_uri_segmented = train(
    df=spark.table(table_name),
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],  # Train separate model per region
    model_name="field_force_regional",
    params=AnomalyParams(
        isolation_forest=IsolationForestConfig(contamination=0.05, n_estimators=150, random_state=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Regional models trained!")
print("   DQX automatically trained 3 models (US, EU, APAC)")


In [None]:
# Compare regional baselines
regional_models = spark.table(f"{catalog}.{schema_name}.anomaly_model_registry").filter(
    F.col("model_name") == "field_force_regional"
)

print("📊 Regional Model Baselines:\n")
for row in regional_models.select("segment_values", "training_rows", "baseline_stats").collect():
    region = row['segment_values']['region']
    print(f"Region: {region}")
    print(f"  Training rows: {row['training_rows']}")
    print(f"  Baseline stats: {row['baseline_stats']}")
    print()

print("🔍 Notice: Each region has different baselines!")
print("   US: Higher expenses ($150 avg)")
print("   EU: Lower expenses ($100 avg)")
print("   APAC: Highest volume (10 calls, 15 prescriptions avg)")


In [None]:
# Score with regional models (automatic routing)
checks_regional = [
    has_no_anomalies(
        model="field_force_regional",
        score_threshold=0.5,
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

df_scored_regional = dq_engine.apply_checks_by_metadata(df_sales, checks_regional)

print("⚠️  Anomalies by region:\n")
df_scored_regional.filter(F.col("anomaly_score") >= 0.5).groupBy("region").agg(
    F.count("*").alias("anomaly_count"),
    F.avg("anomaly_score").alias("avg_score"),
    F.max("anomaly_score").alias("max_score")
).orderBy("region").show()

print("\n📋 Top regional anomalies:")
df_scored_regional.filter(F.col("anomaly_score") >= 0.5).orderBy(
    F.col("anomaly_score").desc()
).select(
    "rep_id", "region", "calls_made", "prescriptions_generated", "expenses",
    F.round("anomaly_score", 3).alias("score")
).show(10, truncate=False)


---

## Section 4: Feature Contributions & Root Cause (8 min)

**Why is a record anomalous?** Use SHAP to understand which columns drove the anomaly score.
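
SHAP attributes a model's output to its input features by estimating Shapley values. The sketch below computes exact Shapley values for a toy anomaly score, defined here as the Euclidean distance from the baseline in z-score space, by averaging each feature's marginal contribution over every feature ordering. The toy score and the z-scores are assumptions made for illustration; DQX's isolation forest score and its SHAP implementation are not reproduced here.

```python
from itertools import permutations
from math import sqrt


def toy_score(z, active):
    """Toy anomaly score: distance from the baseline using only the active features."""
    return sqrt(sum(z[name] ** 2 for name in active))


def shapley_values(z):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(z)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        active = []
        previous = 0.0
        for name in order:
            active.append(name)
            current = toy_score(z, active)
            totals[name] += current - previous
            previous = current
    return {name: total / len(orderings) for name, total in totals.items()}


# Made-up z-scores for one suspicious record (expenses far above the regional mean).
z = {
    "calls_made": -1.0,
    "prescriptions_generated": -1.5,
    "samples_distributed": -0.5,
    "expenses": 4.0,
}

contributions = shapley_values(z)
full_score = toy_score(z, list(z))
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:30s} {value:.3f} ({value / full_score:.1%})")
```

The contributions add up to the full score, which is the property that lets them be read as shares of the anomaly. Enumerating all orderings only works for a handful of features, so libraries rely on faster approximations or tree-specific algorithms.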


In [None]:
# Score with SHAP-based feature contributions
checks_with_contrib = [
    has_no_anomalies(
        model="field_force_regional",
        score_threshold=0.5,
        include_contributions=True,  # Enable SHAP explanations
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

df_with_contrib = dq_engine.apply_checks_by_metadata(df_sales, checks_with_contrib)

print("🔍 Top Anomalies with Feature Contributions (SHAP):\n")
anomalies_contrib = df_with_contrib.filter(
    F.col("anomaly_score") >= 0.5
).orderBy(F.col("anomaly_score").desc()).limit(10)

anomalies_contrib.select(
    "rep_id", "region",
    "calls_made", "prescriptions_generated", "samples_distributed", "expenses",
    F.round("anomaly_score", 3).alias("score"),
    "anomaly_contributions"
).show(10, truncate=False)


In [None]:
# Analyze contribution patterns for root cause
print("📊 Root Cause Analysis:\n")

top_anomaly = anomalies_contrib.first()
print(f"🔸 Top Anomaly: REP={top_anomaly['rep_id']}, Region={top_anomaly['region']}")
print(f"   Score: {top_anomaly['anomaly_score']:.3f}")
print("   Values:")
print(f"     • calls_made: {top_anomaly['calls_made']}")
print(f"     • prescriptions: {top_anomaly['prescriptions_generated']}")
print(f"     • samples: {top_anomaly['samples_distributed']}")
print(f"     • expenses: ${top_anomaly['expenses']:.2f}")
print("\n   📈 Feature Contributions (SHAP):")

if top_anomaly['anomaly_contributions']:
    sorted_contribs = sorted(
        top_anomaly['anomaly_contributions'].items(),
        key=lambda x: x[1],
        reverse=True
    )
    for feature, contribution in sorted_contribs:
        print(f"      {feature:30s}: {contribution:.3f} ({contribution*100:.1f}%)")

print("\n💡 Business Interpretation Examples:")
print("   • High 'expenses' contribution → Potential fraud or policy violation")
print("   • High 'calls_made' + low 'prescriptions' → Training need or territory issue")
print("   • High 'prescriptions' contribution → Unrealistic claims to investigate")
print("   • Balanced contributions → Multivariate anomaly (multiple factors)")


---

## Section 5: Drift Detection & Retraining (6 min)

Data distributions change over time. DQX can detect when your model becomes stale.
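
One simple way to picture a z-score drift check is to compare the mean of a new batch with the training baseline, scaled by the standard error of the batch mean. The sketch below does that for US expenses, using the baseline mean and standard deviation from this demo's Q1-Q2 pattern. The batch values and the `drift_z_score` helper are illustrative assumptions; the exact statistic DQX computes behind `drift_threshold` is not specified in this notebook.

```python
from math import sqrt
from statistics import mean


def drift_z_score(baseline_mean, baseline_std, batch):
    """Z-score of the batch mean against the training baseline (standard-error scaling)."""
    standard_error = baseline_std / sqrt(len(batch))
    return abs(mean(batch) - baseline_mean) / standard_error


# US expenses baseline from the Q1-Q2 pattern used in this demo.
baseline_mean, baseline_std = 150.0, 30.0

# Made-up batches: one consistent with the baseline, one after the expense policy change.
stable_batch = [140.0, 155.0, 148.0, 162.0, 151.0, 144.0, 158.0, 150.0]
drifted_batch = [95.0, 110.0, 102.0, 88.0, 105.0, 99.0, 112.0, 93.0]

DRIFT_THRESHOLD = 3.0
for label, batch in [("stable", stable_batch), ("drifted", drifted_batch)]:
    score = drift_z_score(baseline_mean, baseline_std, batch)
    status = "retrain recommended" if score > DRIFT_THRESHOLD else "ok"
    print(f"{label}: drift score {score:.2f} -> {status}")
```

Because the standard error shrinks as the batch grows, larger batches make the same shift in the mean easier to detect.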


In [None]:
# Simulate drift: New patterns (more remote work, lower expenses post-policy change)
def generate_drifted_data(num_rows=200):
    """Generate Q3 data with shifted distribution (post-policy change)."""
    data = []
    regions = ["US", "EU", "APAC"]
    call_types = ["promotional", "educational", "follow_up"]
    
    # NEW PATTERNS: More remote work, lower expenses, similar productivity
    new_patterns = {
        "US": {"calls": (9, 2), "prescriptions": (12, 3), "samples": (20, 4), "expenses": (100, 20)},  # -33% expenses
        "EU": {"calls": (7, 1.5), "prescriptions": (9, 2), "samples": (15, 3), "expenses": (70, 15)},   # -30% expenses
        "APAC": {"calls": (11, 3), "prescriptions": (15, 4), "samples": (25, 6), "expenses": (85, 20)}, # -29% expenses
    }
    
    start_date = datetime(2024, 7, 1)  # Q3 data
    
    for i in range(num_rows):
        region = random.choice(regions)
        pattern = new_patterns[region]
        
        calls = max(1, int(np.random.normal(pattern["calls"][0], pattern["calls"][1])))
        prescriptions = max(0, int(np.random.normal(pattern["prescriptions"][0], pattern["prescriptions"][1])))
        samples = max(0, int(np.random.normal(pattern["samples"][0], pattern["samples"][1])))
        # Cast to a plain Python float: DoubleType rejects numpy.float64 and int values
        expenses = max(10.0, round(float(np.random.normal(pattern["expenses"][0], pattern["expenses"][1])), 2))
        is_remote = random.random() < 0.7  # 70% remote now (was 30%)
        call_type = random.choice(call_types)
        
        days_offset = random.randint(0, 90)
        call_date = start_date + timedelta(days=days_offset)
        
        data.append((f"REP{i % 50:03d}", region, call_date, calls, prescriptions, samples, expenses, is_remote, call_type))
    
    return data

# Generate and compare
drifted_data = generate_drifted_data(num_rows=200)
df_drifted = spark.createDataFrame(drifted_data, schema)

print("📊 Original vs Drifted Data Comparison:\n")
print("Original (Q1-Q2 2024):") 
df_sales.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
).show()

print("Drifted (Q3 2024 - post policy change):")
df_drifted.agg(
    F.avg("expenses").alias("avg_expenses"),
    F.avg(F.col("is_remote").cast("int")).alias("remote_rate")
).show()

print("✅ Distribution shifted:")
print("   • Expenses: about -30% (policy change)")
print("   • Remote work: +133% (70% vs 30%)")


In [None]:
# Score drifted data with drift detection enabled
checks_with_drift = [
    has_no_anomalies(
        model="field_force_regional",
        drift_threshold=3.0,  # Z-score threshold (default)
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

print("🔍 Scoring drifted data with drift detection...\n")
print("⚠️  Watch for drift warnings in output!\n")

df_drift_scored = dq_engine.apply_checks_by_metadata(df_drifted, checks_with_drift)

print("\nℹ️  Drift warnings appear as UserWarnings like:")
print("   'Data drift detected in columns: expenses (drift score: 4.2)'")
print("   'Model may be stale. Retrain using: train(...)'")
print("\n💡 Drift score > 3.0 → Significant distribution shift, retrain recommended")


In [None]:
# Retrain with combined data
df_combined = df_sales.union(df_drifted)

print("🔄 Retraining model with combined data (old + new patterns)...\n")

model_uri_retrained = train(
    df=df_combined,
    columns=["calls_made", "prescriptions_generated", "samples_distributed", "expenses"],
    segment_by=["region"],
    model_name="field_force_regional",  # Same name = new version
    params=AnomalyParams(
        isolation_forest=IsolationForestConfig(contamination=0.05, n_estimators=150, random_state=42)
    ),
    registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
)

print("\n✅ Model retrained!")
print("   • Old model automatically archived")
print("   • New model active and includes both historical and recent patterns")
print("   • Baseline updated to reflect the new expense policy (is_remote is not a trained column)")
print("\n💡 Best Practice: Set up drift monitoring in production, retrain monthly/quarterly")


---

## Section 6: Production Integration (6 min)

Integrate anomaly detection into your DQX workflows for automated monitoring.


In [None]:
# Combine anomaly detection with traditional DQ checks
checks_combined = [
    # Traditional data quality checks
    # is_not_null takes a single column, so build one check per column
    *[is_not_null(c) for c in ["rep_id", "region", "call_date"]],
    is_in_range(column="calls_made", min_limit=0, max_limit=50),
    is_in_range(column="expenses", min_limit=0, max_limit=1000),
    
    # ML-based anomaly detection with explanations
    has_no_anomalies(
        model="field_force_regional",
        score_threshold=0.5,
        include_contributions=True,
        drift_threshold=3.0,
        registry_table=f"{catalog}.{schema_name}.anomaly_model_registry"
    )
]

# Apply all checks together
df_full_dq = dq_engine.apply_checks_by_metadata(df_sales, checks_combined)

# Summary
print("📊 Full Data Quality Summary:\n")
total_rows = df_full_dq.count()
anomalies_found = df_full_dq.filter(F.col("anomaly_score") >= 0.5).count()

# Note: Traditional check condition columns would have specific names based on implementation
print(f"Total Rows: {total_rows}")
print(f"Anomalies Detected: {anomalies_found}")
print(f"Rows Below Anomaly Threshold: {total_rows - anomalies_found}")  # ignores null/range check failures
print("\n✅ All checks applied in single pass!")


In [None]:
# Quarantine anomalies for review
quarantine_table = f"{catalog}.{schema_name}.field_force_quarantine"

quarantine_df = df_full_dq.filter(
    F.col("anomaly_score") >= 0.5
).select(
    "*",
    F.current_timestamp().alias("quarantine_timestamp"),
    F.lit("anomaly_detected").alias("quarantine_reason")
)

quarantine_df.write.mode("overwrite").saveAsTable(quarantine_table)

print(f"✅ Quarantined {quarantine_df.count()} anomalies to: {quarantine_table}")
print("\n📋 Quarantine Summary by Region:")
spark.table(quarantine_table).groupBy("region").agg(
    F.count("*").alias("count"),
    F.avg("anomaly_score").alias("avg_score"),
    F.max("anomaly_score").alias("max_score")
).orderBy("region").show()

print("\n💡 Quarantine Workflow:")
print("   1. Anomalies automatically sent to quarantine table")
print("   2. Review team investigates using anomaly_contributions")
print("   3. Confirmed issues → escalate to appropriate team")
print("   4. False positives → retune model or adjust threshold")


### YAML Configuration for Production

For automated workflows, define checks in YAML:

```yaml
run_configs:
  - name: field_force_monitoring
    input_config:
      location: catalog.schema.field_force_activity
    
    # Traditional checks
    quality_checks:
      - function: is_not_null
        arguments:
          columns: [rep_id, region, call_date]
      - function: is_in_range
        arguments:
          column: calls_made
          min_limit: 0
          max_limit: 50
      - function: is_in_range
        arguments:
          column: expenses
          min_limit: 0
          max_limit: 1000
    
    # Anomaly detection
    anomaly_config:
      columns: [calls_made, prescriptions_generated, samples_distributed, expenses]
      segment_by: [region]
      model_name: field_force_regional
      registry_table: catalog.schema.anomaly_model_registry
      params:
        isolation_forest:
          contamination: 0.05
          n_estimators: 150
          random_state: 42
        sample_fraction: 1.0
    
    # Quarantine configuration
    quarantine_config:
      enabled: true
      table: catalog.schema.field_force_quarantine
      
    # Output configuration
    output_config:
      location: catalog.schema.field_force_clean
      save_mode: overwrite
```

**Run with:**
```bash
# Train model (one-time or scheduled)
databricks bundle run anomaly_trainer

# Run quality checks (scheduled, e.g., daily)
databricks bundle run quality_checker
```


---

## 🎓 Summary

### What You Learned:

1. ✅ **Auto-Discovery vs Manual Tuning** - Start with zero-config, refine with domain knowledge
2. ✅ **Parameter Tuning** - contamination, n_estimators, max_samples for better performance
3. ✅ **Segment-Based Monitoring** - Regional baselines prevent false positives (US vs EU vs APAC)
4. ✅ **Feature Contributions** - SHAP-based root cause analysis for investigation
5. ✅ **Drift Detection** - Automated signals for when to retrain models
6. ✅ **Multi-Type Features** - Numeric, categorical, datetime, boolean all work together
7. ✅ **Production Integration** - DQEngine + YAML workflows + quarantine handling

### Key Takeaways:

- **Start simple**: `train(df)` with auto-discovery, then refine
- **Tune parameters**: Set contamination to expected anomaly rate, increase n_estimators for stability
- **Use segments**: Different baselines for different groups prevent false positives
- **Enable contributions**: Root cause analysis is critical for business value
- **Monitor drift**: Set up drift detection for automated retraining signals
- **Combine checks**: Anomaly detection complements traditional DQ rules
- **Quarantine workflow**: Automate review process with explanations

### Model Comparison Results:

| Approach | Columns | Segments | Tuning | Use Case |
|----------|---------|----------|--------|----------|
| Auto-discovery | Auto (priority-based) | Auto (if applicable) | Default | Quick start, exploration |
| Manual tuned | Hand-picked | None | Custom hyperparameters | Production, refined monitoring |
| Regional | Hand-picked | By region | Custom hyperparameters | Multi-region with different baselines |

### Next Steps:

1. **Apply to your data**: `train(df=spark.table("your_table"))`
2. **Set up YAML workflows**: Automate training and checking
3. **Integrate quarantine**: Build review process with feature contributions
4. **Schedule retraining**: Monthly/quarterly, or sooner when drift monitoring flags a shift
5. **Monitor metrics**: Track anomaly rates, drift scores, false positive rates

### Resources:

- [DQX Anomaly Detection Documentation](https://databrickslabs.github.io/dqx/guide/anomaly_detection)
- [API Reference](https://databrickslabs.github.io/dqx/reference/quality_checks#has_no_anomalies)
- [GitHub Repository](https://github.com/databrickslabs/dqx)

---

**Questions? Feedback?** Open an issue on GitHub or contact the DQX team!

