# 📊 Simple Anomaly Detection Demo

## Learn Anomaly Detection in 15 Minutes

This beginner-friendly demo shows you how to:
- Understand what anomaly detection is and why it matters
- Detect unusual patterns in your data with zero configuration
- Tune detection sensitivity to your needs
- Understand why specific records are flagged
- Integrate anomaly detection into production workflows

**Dataset**: Simple sales transactions (universally relatable, no domain expertise required)

**Time**: 12-17 minutes


---

## Section 0: What is Anomaly Detection in Data Quality?

Before we dive into the code, let's understand what anomaly detection is and why it's valuable.

### The Data Quality Challenge: Known vs Unknown Issues

#### 🎯 Known Unknowns (Traditional Data Quality)

These are issues **you can anticipate** and write rules for:

| Issue Type | Example | Solution |
|------------|---------|----------|
| Null values | `amount` is NULL | `is_not_null(column="amount")` |
| Out of range | Price is negative | `is_in_range(column="price", min_limit=0)` |
| Invalid format | Email without @ symbol | Regex validation |

**Works great when you know what to look for!**

#### 🔍 Unknown Unknowns (Anomaly Detection)

These are issues **you DON'T know to look for**:

- Unusual **patterns** across multiple columns
- Outlier **combinations** that are individually valid
- Subtle **data corruption** that passes all rules

**Problem**: You can't write rules for things you haven't thought of!

**Solution**: ML-based anomaly detection learns "normal" patterns from your data and flags deviations.

### Concrete Example

```
Known Unknown:  "Amount must be positive"
                → is_in_range(min_limit=0)
                ✅ Catches: amount = -50

Unknown Unknown: "Transaction for $47,283 at 3am on Sunday for 2 items"
                 → Anomaly detection
                 ✅ Catches: All fields valid individually, but pattern is unusual
```
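To make the contrast concrete, here is a minimal, self-contained sketch. It is not how DQX scores records: the learned model is replaced by a toy stand-in (a median-absolute-deviation check on `amount / quantity`), and the data and helper names are hypothetical. It only illustrates that a record can pass every per-column rule while its combination of values is far from typical.

```python
from statistics import median

# Hypothetical toy data: (amount, quantity) pairs for typical transactions.
normal = [(200.0, 5), (150.0, 4), (180.0, 4), (220.0, 6), (190.0, 5), (160.0, 4)]
suspect = (47283.0, 2)  # the "unknown unknown" from the example above


def passes_rules(amount, quantity):
    """Rule-based checks: each field is validated on its own."""
    return 0 <= amount <= 100_000 and 1 <= quantity <= 1_000


def unit_price_deviation(row, reference):
    """Toy stand-in for a learned model (NOT the DQX algorithm).

    Measures how far amount/quantity is from the reference median,
    in units of the median absolute deviation (MAD).
    """
    ratios = [a / q for a, q in reference]
    med = median(ratios)
    mad = median(abs(r - med) for r in ratios)
    return abs(row[0] / row[1] - med) / mad


rules_ok = passes_rules(*suspect)
deviation = unit_price_deviation(suspect, normal)
print(f"passes rules: {rules_ok}, deviation: {deviation:.0f} MADs")
```

The record passes both range rules, yet its unit price is thousands of MADs away from the reference. A real multi-column model generalizes this idea without you having to pick the ratio in advance.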

### Why Anomaly Detection Matters

- ✅ **Catches issues before they become problems** - Early warning system
- ✅ **No need to anticipate every failure mode** - Adapts to your data
- ✅ **Learns patterns automatically** - No manual rule writing
- ✅ **Complements rule-based checks** - Use both together for comprehensive quality

### Unity Catalog Integration

#### Built-in Quality Monitoring (Unity Catalog)

Unity Catalog includes **table-level** anomaly detection:
- Monitors column statistics and distributions
- Alerts on schema changes, cardinality shifts
- Tracks null rate changes over time
- Great for monitoring table health

#### When to Use DQX Anomaly Detection

DQX provides **row-level** anomaly detection:
- Detect unusual individual **records/transactions**
- **Multi-column pattern** detection (e.g., price + quantity + time)
- **Custom models per segment** (e.g., different regions, categories)
- **Feature contributions** to understand WHY records are anomalous
- **Integration** with existing DQX quality pipelines

#### Complementary Approach

```
┌─────────────────────────────────────────────────────────┐
│  Rule-Based Checks (Known Unknowns)                     │
│  • is_not_null, is_in_range, regex validation           │
│  • Schema validation, referential integrity             │
└─────────────────────────────────────────────────────────┘
                           +
┌─────────────────────────────────────────────────────────┐
│  ML Anomaly Detection (Unknown Unknowns)                │
│  • Pattern detection, outlier identification            │
│  • Multi-column relationship validation                 │
│  DQX: Row-level anomaly detection                       │
└─────────────────────────────────────────────────────────┘
                           +
┌─────────────────────────────────────────────────────────┐
│  Unity Catalog Monitoring (Table Health)                │
│  • Schema drift, cardinality changes                    │
│  • Column statistics, metadata tracking                 │
│  UC: Table freshness and completeness                   │
└─────────────────────────────────────────────────────────┘
```

### Key Takeaways

- 💡 **Anomaly detection finds issues you didn't know to look for**
- 💡 **Complements (doesn't replace) rule-based checks - use both!**
- 💡 **Unity Catalog monitors table freshness and completeness; DQX monitors the data inside the tables**
- 💡 **Together, they provide comprehensive quality coverage**

Let's see how easy it is to add anomaly detection to your pipeline! 🚀


---

## Prerequisites: Install DQX with Anomaly Support

Before running this demo, install DQX with anomaly detection extras:

```python
%pip install 'databricks-labs-dqx[anomaly]'
dbutils.library.restartPython()
```

**What's included in `[anomaly]` extras:**
- `scikit-learn` - Machine learning algorithms (Isolation Forest)
- `mlflow` - Model tracking and registry
- `shap` - Feature contributions for explainability
- `cloudpickle` - Model serialization

**Note**: On ML Runtimes and Serverless compute, most dependencies are already pre-installed.
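If you are unsure whether your compute already has these dependencies, a quick check like the sketch below can tell you before you install anything. It uses only the standard library; the `missing_packages` helper and the `ANOMALY_EXTRAS` mapping are illustrative names, not part of DQX, and the mapping simply mirrors the list above.

```python
import importlib.util

# Package name -> importable module name (mirrors the extras listed above).
ANOMALY_EXTRAS = {
    "scikit-learn": "sklearn",
    "mlflow": "mlflow",
    "shap": "shap",
    "cloudpickle": "cloudpickle",
}


def missing_packages(modules=ANOMALY_EXTRAS):
    """Return the package names whose module cannot be found on this cluster."""
    return sorted(
        pkg for pkg, mod in modules.items()
        if importlib.util.find_spec(mod) is None
    )


missing = missing_packages()
if missing:
    print(f"Missing: {', '.join(missing)} -> run the %pip install cell")
else:
    print("All anomaly extras are importable")
```

This only confirms that the modules can be imported; it does not verify that their versions match what DQX expects.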


In [None]:
# OPTIONAL: Install DQX with anomaly extras
# Uncomment and run if you haven't installed DQX yet

# %pip install 'databricks-labs-dqx[anomaly]'
# dbutils.library.restartPython()

# Configure widgets for catalog and schema
dbutils.widgets.text("demo_catalog", "main", "Catalog Name")
dbutils.widgets.text("demo_schema", "dqx_demo", "Schema Name")


---

## Configuration: Catalog and Schema

Configure where to store demo data and models. By default, the demo uses the `main` catalog and the `dqx_demo` schema.
You can change these using the widgets above if needed.

---

## Section 1: Setup & Data Generation

First, let's set up our environment and create simple sales transaction data.


In [None]:
# Imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DoubleType, IntegerType
)
from datetime import datetime, timedelta
import random
import numpy as np

from databricks.labs.dqx.anomaly import AnomalyEngine, has_no_anomalies
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQDatasetRule, DQRowRule
from databricks.labs.dqx.check_funcs import is_not_null, is_in_range
from databricks.sdk import WorkspaceClient

# Initialize
spark = SparkSession.builder.getOrCreate()
ws = WorkspaceClient()
anomaly_engine = AnomalyEngine(ws)
dq_engine = DQEngine(ws)

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"   Spark version: {spark.version}")


In [None]:
# Generate simple sales transaction data
def generate_sales_data(num_rows=1000, anomaly_rate=0.05):
    """
    Generate sales transaction data with injected anomalies.
    
    Normal patterns:
    - Amount: roughly $128-$253 per transaction (regional base +/-15%)
    - Quantity: 1-10 items
    - Business hours: 9am-6pm weekdays
    - Regional consistency
    
    Anomalies (5% - matches default expected_anomaly_rate):
    - Pricing errors (VERY extreme: 40-50x or 1/40 of normal amounts)
    - Quantity spikes (bulk orders 100-150 items = 20-30x normal)
    - Timing anomalies (off-hours + 5-8x amount + 25-40 quantity)
    - Multi-factor (6-10x amount + 35-60 quantity + always off-hours)
    """
    data = []
    categories = ["Electronics", "Clothing", "Food", "Books", "Home"]
    regions = ["North", "South", "East", "West"]
    
    # Regional pricing patterns (normal baseline)
    region_patterns = {
        "North": {"base_amount": 200, "quantity": 5},
        "South": {"base_amount": 150, "quantity": 4},
        "East": {"base_amount": 180, "quantity": 4},
        "West": {"base_amount": 220, "quantity": 6},
    }
    
    start_date = datetime(2024, 1, 1, 9, 0)  # Jan 1, 2024, 9am
    
    for i in range(num_rows):
        transaction_id = f"TXN{i:06d}"
        category = random.choice(categories)
        region = random.choice(regions)
        pattern = region_patterns[region]
        
        # Generate timestamp (mostly business hours weekdays)
        days_offset = random.randint(0, 90)  # 3 months of data
        hours_offset = random.randint(0, 9)  # 9am-6pm = 9 hours
        date = start_date + timedelta(days=days_offset, hours=hours_offset)
        
        # Skip weekends for normal transactions
        if date.weekday() >= 5:  # Saturday=5, Sunday=6
            date = date - timedelta(days=date.weekday() - 4)  # Move to Friday
        
        # Inject anomalies (VERY extreme to reliably exceed 0.60 threshold)
        if random.random() < anomaly_rate:
            # Bias towards more extreme anomaly types for better detection
            anomaly_type = random.choices(
                ["pricing", "quantity", "timing", "multi_factor"],
                weights=[2, 2, 1, 3]  # Favor pricing, quantity, and multi-factor
            )[0]
            
            if anomaly_type == "pricing":
                # Pricing error: VERY extreme amounts (40-50x or 1/40 of normal)
                multiplier = random.choice([random.uniform(40, 50), 1.0 / random.uniform(35, 45)])
                amount = round(pattern["base_amount"] * multiplier, 2)
                quantity = int(np.random.normal(pattern["quantity"], 0.3))  # Near-normal quantity
            
            elif anomaly_type == "quantity":
                # Bulk order spike (100-150 items = 20-30x normal) 
                amount = round(pattern["base_amount"] * random.uniform(0.95, 1.05), 2)  # Normal amount
                quantity = random.randint(100, 150)
            
            elif anomaly_type == "timing":
                # Off-hours transaction WITH very unusual amount (multi-factor)
                amount = round(pattern["base_amount"] * random.uniform(5.0, 8.0), 2)  # 5-8x normal
                quantity = random.randint(25, 40)  # 5-8x normal
                date = date.replace(hour=random.choice([2, 3, 4, 22, 23]))  # Late night/early morning
                # Or make it weekend
                if random.random() > 0.5:
                    date = date + timedelta(days=(5 - date.weekday()))  # Move to Saturday
            
            else:  # multi-factor: EXTREME multi-dimensional anomaly
                # Extreme regional mismatch + very unusual quantity + off-hours
                other_region = random.choice([r for r in regions if r != region])
                amount = round(region_patterns[other_region]["base_amount"] * random.uniform(6.0, 10.0), 2)
                quantity = random.randint(35, 60)  # 7-12x normal
                date = date.replace(hour=random.choice([2, 3, 4, 22, 23]))  # Always off-hours
        
        else:
            # Normal transaction (tighter variance for more consistent patterns)
            amount = round(pattern["base_amount"] * random.uniform(0.85, 1.15), 2)
            quantity = max(1, int(np.random.normal(pattern["quantity"], 1)))
        
        # Ensure valid ranges
        amount = max(10, min(10000, amount))
        quantity = max(1, min(150, quantity))  # Allow bulk orders up to 150
        
        data.append((transaction_id, date, amount, quantity, category, region))
    
    return data

# Generate data
print("🔄 Generating sales transaction data...\n")
sales_data = generate_sales_data(num_rows=1000, anomaly_rate=0.05)

schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("date", TimestampType(), False),
    StructField("amount", DoubleType(), False),
    StructField("quantity", IntegerType(), False),
    StructField("category", StringType(), False),
    StructField("region", StringType(), False),
])

df_sales = spark.createDataFrame(sales_data, schema)

print("📊 Sample of sales transactions:")
display(df_sales.orderBy("date"))

print(f"\n✅ Generated {df_sales.count()} sales transactions")
print(f"   Expected anomalies: ~{int(df_sales.count() * 0.05)} (5%)")
print(f"\n💡 Data includes:")
print(f"   • Normal patterns: Business hours, typical amounts (roughly 128-253), reasonable quantities (around 4-6)")
print(f"   • Injected anomalies: VERY extreme deviations (40-50x pricing, 100-150 quantity, multi-factor)")
print(f"\n🎯 Anomaly detection will identify patterns that deviate significantly from normal behavior")
print(f"\n📌 Note: 5% anomaly rate matches the model's default 'expected_anomaly_rate' parameter")


In [None]:
# Get catalog and schema from widgets
catalog = dbutils.widgets.get("demo_catalog")
schema_name = dbutils.widgets.get("demo_schema")

print(f"📂 Using catalog: {catalog}")
print(f"📂 Using schema: {schema_name}\n")

# Create schema if it doesn't exist
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema_name}")

# Save data to table
table_name = f"{catalog}.{schema_name}.sales_transactions"
df_sales.write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Data saved to: {table_name}")


In [None]:
# Set up registry table for tracking trained models
registry_table = f"{catalog}.{schema_name}.anomaly_model_registry_101"
print(f"📋 Model registry table: {registry_table}")

# Clean up any existing registry from previous runs
spark.sql(f"DROP TABLE IF EXISTS {registry_table}")
print(f"✅ Registry ready for new models")


---

## Section 2: Combined Quality Checks - Rule-Based + ML Anomaly Detection

Let's build a comprehensive quality pipeline that combines:
1. **Rule-based checks** (known unknowns): nulls, ranges, formats
2. **ML anomaly detection** (unknown unknowns): unusual patterns

With **ZERO configuration**, the system will:
- Automatically select relevant columns for anomaly detection
- Auto-detect if segmentation is needed (e.g., separate models per region)
- Train ensemble models (2 models by default for robustness)
- Score all transactions with both rule-based AND anomaly checks
- Provide feature contributions to explain WHY records are flagged

**You provide**: Just the data and rules  
**DQX provides**: Everything else, optimized for performance!


In [None]:
# Train anomaly detection model with zero configuration
print("🎯 Training anomaly detection model...")
print("   DQX will automatically discover patterns in your data\n")

model_uri_auto = anomaly_engine.train(
    df=spark.table(table_name),
    model_name="sales_auto",
    registry_table=registry_table
)

print(f"✅ Model trained successfully!")
print(f"   Model URI: {model_uri_auto}")

# View what DQX created for you
print(f"\n📋 Trained Models:\n")

display(
    spark.table(registry_table)
    .filter(F.col("identity.model_name").startswith("sales_auto"))
    .select(
        "identity.model_name",
        "training.columns", 
        "segmentation.segment_by",
        "segmentation.segment_values",
        "training.training_rows",
        "training.training_time",
        "identity.status"
    )
    .orderBy("identity.model_name")
)

print("\n💡 Understanding the Results:")
print("   • DQX automatically found patterns in your data")
print("   • If 'segment_by' has values, DQX created separate models for different groups")
print("   • Each row is a trained model ready to score new data")


### 💡 Viewing Models in Databricks UI

Your trained models are automatically registered in **Unity Catalog Model Registry**. Here's how to view them:

**Option 1: Catalog Explorer**
1. Click **Catalog** in the left sidebar
2. Navigate to your catalog → schema
3. Look for models named `sales_auto` (or `sales_auto_ensemble_0`, `sales_auto_ensemble_1` for ensemble models)
4. Click on a model to see:
   - Model versions
   - MLflow run details (parameters, metrics)
   - Model lineage and schema

**Option 2: MLflow Experiments**
1. Click **Experiments** in the left sidebar
2. Find your notebook's experiment (automatically created per notebook)
3. View all training runs with:
   - Hyperparameters (contamination, num_trees, etc.)
   - Validation metrics (precision, recall, F1)
   - Model artifacts and signatures

**What DQX Logs Automatically:**
- ✅ **Parameters**: contamination, num_trees, subsampling_rate, random_seed
- ✅ **Metrics**: precision, recall, F1 score, validation accuracy
- ✅ **Model Signature**: Input/output schemas for Unity Catalog
- ✅ **Model Artifacts**: Serialized sklearn model + feature metadata

**Model URI Format:**
```
models:/<catalog>.<schema>.<model_name>/<version>
```
Example: `models:/main.dqx_demo.sales_auto/1`
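If you need to build or pick apart these URIs in your own code, a small helper pair like the sketch below is enough. Both function names are hypothetical (they are not part of DQX or MLflow), and the sketch assumes the three-level `<catalog>.<schema>.<model_name>` naming shown above.

```python
def build_model_uri(catalog, schema, model_name, version):
    """Assemble a URI in the models:/<catalog>.<schema>.<model_name>/<version> format."""
    return f"models:/{catalog}.{schema}.{model_name}/{version}"


def parse_model_uri(uri):
    """Split a model URI into its parts; raises ValueError on unexpected input."""
    prefix = "models:/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a model URI: {uri!r}")
    full_name, sep, version = uri[len(prefix):].rpartition("/")
    parts = full_name.split(".")
    if not sep or len(parts) != 3:
        raise ValueError(f"Expected <catalog>.<schema>.<model_name>/<version>: {uri!r}")
    catalog, schema, model_name = parts
    return {
        "catalog": catalog,
        "schema": schema,
        "model_name": model_name,
        "version": version,
    }


uri = build_model_uri("main", "dqx_demo", "sales_auto", 1)
print(uri)
print(parse_model_uri(uri))
```

The version is kept as a string when parsing, since registries may also accept aliases in that position.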


In [None]:
# Apply quality checks: combine rule-based + ML anomaly detection
print("🔍 Applying quality checks to all transactions...\n")

# Define all quality checks
checks_combined = [
    # Rule-based checks for known issues
    DQRowRule(check_func=is_not_null, check_func_kwargs={"column": "transaction_id"}),
    DQRowRule(check_func=is_not_null, check_func_kwargs={"column": "amount"}),
    DQRowRule(check_func=is_in_range, check_func_kwargs={"column": "amount", "min_limit": 0, "max_limit": 100000}),
    DQRowRule(check_func=is_in_range, check_func_kwargs={"column": "quantity", "min_limit": 1, "max_limit": 1000}),

    # ML anomaly detection for unusual patterns
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_auto",
            "registry_table": registry_table
            # Default: 2 models for confidence, explains why data is anomalous, threshold 0.60
        }
    )
]

df_scored = dq_engine.apply_checks(df_sales, checks_combined)

# Records that fail any check land in _errors. The generated data always
# satisfies the rule-based checks above, so these are the anomaly flags.
anomalies = df_scored.filter(F.size(F.col("_errors")) > 0)

print(f"✅ Quality checks complete!")
print(f"\n📊 Results:")
print(f"   Total transactions: {df_scored.count()}")
print(f"   Anomalies found: {anomalies.count()} ({(anomalies.count() / df_scored.count()) * 100:.1f}%)")
print(f"\n🔝 Top 10 anomalies:\n")

display(anomalies.orderBy(F.col("_info.anomaly.score").desc()).select(
    "transaction_id", "date", "amount", "quantity", "category", "region",
    F.round("_info.anomaly.score", 3).alias("anomaly_score"),
    F.col("_info.anomaly.contributions").alias("why_anomalous")
).limit(10))

print("\n💡 What Just Happened:")
print("   • Rule-based checks caught known issues (nulls, out-of-range values)")
print("   • Anomaly detection found unusual patterns you didn't explicitly define")
print("   • The 'why_anomalous' column explains what made each record unusual")
print("   • Threshold of 0.60 balances finding issues vs false alarms")


---

## Section 3: Understanding Your Results

Let's explore the anomalies we found and learn how to interpret anomaly scores.

**What you'll learn:**
- How anomaly scores work (0 to 1 scale, based on Isolation Forest)
- What makes a score "high" vs "normal"  
- Why certain records were flagged as unusual

**Important**: Anomaly scores are NOT probabilities or confidence levels! They measure how "easy" it is to separate a record from the rest of your data. Think of it as: "How different is this record from normal patterns?"

### üìä How Isolation Forest Works

The algorithm builds decision trees and measures how many "splits" are needed to isolate each record:

- **Anomalies**: Isolated near the top of the tree with few splits → High score
- **Normal data**: Requires many splits deep in the tree → Low score

*[Image: Add Isolation Forest visualization here showing anomaly isolation vs normal data]*

This is why the score represents "isolation ease" rather than statistical confidence!
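The link between "number of splits" and a 0-to-1 score can be sketched with the standard Isolation Forest formula from the original algorithm: `score = 2 ** (-mean_path_length / c(n))`, where `c(n)` is the average path length expected for a sample of `n` points. The code below is a plain-Python illustration of that formula. Whether DQX applies exactly this normalization is an assumption, and `n = 256` is just a commonly used subsample size, not a documented DQX setting.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): expected path length for isolating a point among n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def isolation_score(mean_path_length, n):
    """Standard Isolation Forest score: closer to 1 means easier to isolate."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


n = 256  # assumed subsample size per tree
few_splits = isolation_score(2.0, n)    # isolated almost immediately
typical = isolation_score(average_path_length(n), n)  # average depth
many_splits = isolation_score(14.0, n)  # buried deep in the tree

print(f"few splits:  {few_splits:.3f}")
print(f"typical:     {typical:.3f}")
print(f"many splits: {many_splits:.3f}")
```

A record isolated at exactly the average depth scores 0.5; shorter paths push the score toward 1 and longer paths push it toward 0, which is why the score is a measure of isolation rather than a probability.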


In [None]:
# Analyze score distribution
print("📊 Anomaly Score Distribution:\n")

score_stats = df_scored.select("_info.anomaly.score").describe()
display(score_stats)

# Show score ranges (aligned with 0.60 threshold)
print("📈 Score Range Breakdown:\n")

score_ranges = df_scored.select(
    F.count(F.when(F.col("_info.anomaly.score") < 0.4, 1)).alias("normal_0.0_0.4"),
    F.count(F.when((F.col("_info.anomaly.score") >= 0.4) & (F.col("_info.anomaly.score") < 0.6), 1)).alias("borderline_0.4_0.6"),
    F.count(F.when((F.col("_info.anomaly.score") >= 0.6) & (F.col("_info.anomaly.score") < 0.75), 1)).alias("flagged_0.6_0.75"),
    F.count(F.when(F.col("_info.anomaly.score") >= 0.75, 1)).alias("highly_anomalous_0.75_1.0"),
).first()

total = df_scored.count()
print(f"Normal (0.0-0.4):             {score_ranges['normal_0.0_0.4']:4d} ({score_ranges['normal_0.0_0.4']/total*100:5.1f}%) ← Not flagged")
print(f"Borderline (0.4-0.6):         {score_ranges['borderline_0.4_0.6']:4d} ({score_ranges['borderline_0.4_0.6']/total*100:5.1f}%) ← Near threshold (not flagged)")
print(f"Flagged (0.6-0.75):           {score_ranges['flagged_0.6_0.75']:4d} ({score_ranges['flagged_0.6_0.75']/total*100:5.1f}%) ← ANOMALIES (flagged)")
print(f"Highly Anomalous (0.75-1.0):  {score_ranges['highly_anomalous_0.75_1.0']:4d} ({score_ranges['highly_anomalous_0.75_1.0']/total*100:5.1f}%) ← ANOMALIES (extreme)")

print(f"\n💡 What Do These Scores Mean?")
print(f"   • Scores are based on how 'isolated' a record is from normal patterns")
print(f"   • Low scores (0.0-0.4): Blends in with normal data (NOT flagged)")
print(f"   • Borderline (0.4-0.6): Near the 0.60 threshold (NOT flagged)")
print(f"   • High scores (≥0.6): Stands out as different (FLAGGED as anomalies)")
print(f"   • This is NOT a probability - it's based on how many 'splits' are needed to isolate the record")
print(f"   • The threshold (0.60) is tuned empirically, not a statistical significance level")


In [None]:
# Compare normal vs anomalous transactions (using 0.60 threshold)
print("🔍 Normal vs Anomalous Transaction Comparison:\n")

normal_stats = df_scored.filter(F.col("_info.anomaly.score") < 0.6).agg(
    F.avg("amount").alias("avg_amount"),
    F.avg("quantity").alias("avg_quantity"),
    F.count("*").alias("count")
).first()

anomaly_stats = df_scored.filter(F.col("_info.anomaly.score") >= 0.6).agg(
    F.avg("amount").alias("avg_amount"),
    F.avg("quantity").alias("avg_quantity"),
    F.count("*").alias("count")
).first()

print("Normal Transactions (score < 0.60):")
print(f"   Count: {normal_stats['count']} ({normal_stats['count']/df_scored.count()*100:.1f}%)")
print(f"   Avg Amount: ${normal_stats['avg_amount']:.2f}")
print(f"   Avg Quantity: {normal_stats['avg_quantity']:.1f}")

print("\nFlagged Anomalies (score ≥ 0.60):")
print(f"   Count: {anomaly_stats['count']} ({anomaly_stats['count']/df_scored.count()*100:.1f}%)")
print(f"   Avg Amount: ${anomaly_stats['avg_amount']:.2f}")
print(f"   Avg Quantity: {anomaly_stats['avg_quantity']:.1f}")

print("\n💡 Expected Results:")
print("   • Normal transactions should be ~95% of data with typical amounts/quantities")
print("   • Anomalies should be ~5% with extreme or unusual patterns")


---

## Section 4: Tuning the Threshold

The threshold controls which records get flagged as anomalies. It's like setting a "sensitivity dial":

- **Lower threshold** (e.g., 0.4-0.5): More sensitive, flags more records as unusual
- **Higher threshold** (e.g., 0.65-0.75): Less sensitive, flags only very unusual records

**The default of 0.60** was chosen through testing across various datasets. It balances:
- Finding real issues (recall)
- Avoiding false alarms (precision)

**Remember**: This is NOT a statistical confidence level! It's a cutoff on the "isolation score" that determines what's unusual enough to investigate.
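To see the recall/precision trade-off in numbers, here is a small sketch with made-up scores and ground-truth labels. In real pipelines you rarely have labels, so treat this purely as an illustration; the `precision_recall` helper and the sample values are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) when flagging every score >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and is_bad for f, is_bad in zip(flagged, labels))
    fp = sum(f and not is_bad for f, is_bad in zip(flagged, labels))
    fn = sum((not f) and is_bad for f, is_bad in zip(flagged, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Hypothetical anomaly scores and whether each record was truly a problem.
scores = [0.35, 0.42, 0.48, 0.52, 0.58, 0.63, 0.66, 0.71, 0.82, 0.90]
labels = [False, False, False, False, True, False, True, True, True, True]

for threshold in (0.5, 0.6, 0.7):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold to 0.5 catches every true issue but adds false alarms, while raising it to 0.7 removes the false alarms at the cost of missing some real issues.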

Let's see how changing the threshold affects results!


In [None]:
# Try different thresholds
print("🎚️  Testing Different Thresholds:\n")
print("Threshold | Anomalies | % of Data | Interpretation")
print("-" * 78)

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
total_count = df_scored.count()

for threshold in thresholds:
    anomaly_count = df_scored.filter(F.col("_info.anomaly.score") >= threshold).count()
    percentage = (anomaly_count / total_count) * 100
    
    # Interpretation based on actual detection rate (data-driven)
    if percentage >= 90:
        interpretation = "Extremely sensitive (flags almost everything)"
    elif percentage >= 50:
        interpretation = "Too sensitive (flags majority of data)"
    elif percentage >= 10:
        interpretation = "Moderate (may need adjustment)"
    elif percentage >= 1:
        interpretation = "Balanced (good starting point)"
    elif percentage > 0:
        interpretation = "Very strict (only extreme cases)"
    else:
        interpretation = "Too strict (misses all anomalies)"
    
    print(f"   {threshold:.1f}   |   {anomaly_count:4d}    |  {percentage:5.1f}%  | {interpretation}")

print("\n💡 How to Choose Your Threshold:")
print("   • Start with default 0.60 (typically catches 1-5% of data)")
print("   • Too many alerts to investigate? → Increase to 0.65 or 0.70")
print("   • Missing real issues? → Decrease to 0.50 or 0.55")
print("   • The 'right' threshold depends on:")
print("     - Your investigation capacity (how many alerts can you handle?)")
print("     - Your risk tolerance (cost of missing an issue vs false alarm)")
print("\n💡 This table shows how YOUR data responds to different thresholds")
print("   Use it to find the sweet spot for your use case!")


In [None]:
# Let's look at borderline cases near the 0.60 threshold
print("🔍 Examining Borderline Cases (scores 0.55-0.65):\n")

borderline = df_scored.filter(
    (F.col("_info.anomaly.score") >= 0.55) & 
    (F.col("_info.anomaly.score") <= 0.65)
).orderBy(F.col("_info.anomaly.score").desc())

borderline_count = borderline.count()
print(f"Found {borderline_count} borderline transactions near the 0.60 threshold:\n")

if borderline_count > 0:
    display(borderline.select(
        "transaction_id", "amount", "quantity", "category", "region",
        F.round("_info.anomaly.score", 3).alias("score")
    ).limit(10))
    
    print("\n💡 These are on the edge - slight threshold changes will include/exclude them")
    print("   Review these to calibrate your threshold for your use case")
else:
    print("   No records in this range - try adjusting the range!")
    print("   This suggests a clear separation between normal and anomalous data")


---

## Section 5: Manual Column Selection (Optional - Advanced)

**Note**: This section is optional and shows advanced features. Feel free to skip to Section 7 for production patterns!

Auto-discovery is great for exploration, but for production you might want explicit control over which features the model uses.

Let's train a model with **manually selected columns**.


In [None]:
# Train with manual column selection
print("🎯 Training model with manual column selection...\n")

model_uri_manual = anomaly_engine.train(
    df=spark.table(table_name),
    columns=["amount", "quantity", "date"],  # Explicitly specify which columns to use
    model_name="sales_manual",
    registry_table=registry_table
)

print(f"✅ Manual model trained!")
print(f"   Model URI: {model_uri_manual}")

# Compare auto vs manual in the registry
print(f"\n📊 Auto vs Manual Comparison:")
print(f"   View both models side-by-side in the registry:\n")

display(
    spark.table(registry_table)
    .filter(
        (F.col("identity.model_name") == "sales_auto") | 
        (F.col("identity.model_name") == "sales_manual")
    )
    .select(
        "identity.model_name",
        "training.columns",
        "segmentation.segment_by",
        "training.training_rows",
        "identity.status"
    )
    .orderBy("identity.model_name", "training.training_time")
)

print(f"\n💡 Key Differences:")
print(f"   • Auto model: Discovered columns automatically + may have segmentation")
print(f"   • Manual model: You explicitly chose 3 columns (amount, quantity, date)")
print(f"\n💡 When to use each approach:")
print(f"   • Auto-discovery: Exploration, quick start, don't know what matters")
print(f"   • Manual selection: Production, control features, domain knowledge")
print(f"   • Both are valid! Start with auto, refine with manual")


In [None]:
# Score with manual model
print("🔍 Scoring with manual model...\n")

checks_manual = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_manual",
            "score_threshold": 0.5,
            "registry_table": registry_table
        }
    )
]

df_scored_manual = dq_engine.apply_checks(df_sales, checks_manual)
# Filter by _errors column (standard DQX pattern) to get flagged anomalies
anomalies_manual = df_scored_manual.filter(F.size(F.col("_errors")) > 0)

print(f"⚠️  Manual model found {anomalies_manual.count()} anomalies")
print(f"   (Auto model found {anomalies.count()} anomalies)")
print(f"\n🔝 Top 5 anomalies from manual model:\n")

display(anomalies_manual.orderBy(F.col("_info.anomaly.score").desc()).select(
    "transaction_id", "amount", "quantity", "date",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(5))

print("\n💡 Results may differ slightly because we're using different features")
print("   This is normal and expected!")


---

## Section 6: Deep Dive - Feature Contributions (Optional - Advanced)

**Note**: This section shows detailed analysis of contributions. Skip to Section 7 for production patterns!

**Reminder**: Contributions are now enabled by default (you already saw them in Section 2), but this section shows how to analyze them in depth.

Finding anomalies is great, but **understanding WHY** they're anomalous is crucial for investigation. Feature contributions show which columns drove each anomaly score.
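As a rough illustration of how a contributions map can be turned into a short triage list, the sketch below normalizes the values into shares and keeps only the features above a cutoff. The `top_contributors` helper, the feature names, and the numbers are all hypothetical; the actual keys and scale of `_info.anomaly.contributions` depend on your model.

```python
def top_contributors(contributions, min_share=0.05):
    """Rank features by their share of total absolute contribution.

    Returns (feature, share) pairs, largest first, dropping shares <= min_share.
    """
    total = sum(abs(v) for v in contributions.values())
    if total == 0:
        return []
    shares = {name: abs(v) / total for name, v in contributions.items()}
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [(name, share) for name, share in ranked if share > min_share]


# Hypothetical contributions for one flagged transaction.
example = {
    "amount": 0.62,
    "quantity": 0.25,
    "date_hour": 0.09,
    "category_Electronics": 0.03,
    "category_Food": 0.01,
}

for name, share in top_contributors(example):
    print(f"{name}: {share * 100:.1f}%")
```

Filtering out small shares keeps the one-hot category columns from cluttering the explanation when they contribute little.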


In [None]:
# Score with feature contributions
print("🔍 Scoring with feature contributions (explainability)...\n")

checks_with_contrib = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_manual",
            "score_threshold": 0.5,
            "include_contributions": True,  # Enabled by default; set explicitly here for clarity
            "registry_table": registry_table
        }
    )
]

df_with_contrib = dq_engine.apply_checks(df_sales, checks_with_contrib)

print("✅ Scored with feature contributions!")
print("\n🎯 Top Anomalies with Explanations:\n")

# Filter by _errors column (standard DQX pattern) to get flagged anomalies
anomalies_explained = df_with_contrib.filter(
    F.size(F.col("_errors")) > 0
).orderBy(F.col("_info.anomaly.score").desc()).limit(5)

display(anomalies_explained.select(
    "transaction_id",
    "amount",
    "quantity",
    F.date_format("date", "yyyy-MM-dd HH:mm").alias("date"),
    F.round("_info.anomaly.score", 3).alias("score"),
    F.col("_info.anomaly.contributions").alias("contributions")
))

print("\n💡 How to Read Contributions:")
print("   • Contributions show which features made this transaction unusual")
print("   • Higher contribution = that feature is more responsible for the anomaly")
print("   • Use this to triage and investigate efficiently!")
print("\n   Example: If 'amount' has high contribution → pricing issue")
print("            If 'quantity' has high contribution → bulk order anomaly")
print("            If 'date' has high contribution → timing anomaly")
print("\n💡 About Category Contributions:")
print("   • You'll see contributions from ALL category features (one-hot encoded)")
print("   • Non-matching categories (e.g., 'category_Electronics' when item is Clothing)")
print("     show small contributions representing the feature's ABSENCE")
print("   • Focus on features with >5% contribution for investigation")
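
The reading guidance above can be turned into a small helper. The sketch below is plain Python and assumes that `contributions` arrives as a dict of feature name to fractional contribution (as in the detailed example cell) and that one-hot features are named `<column>_<value>`. The `summarize_contributions` name, the `categorical_columns` argument, and the choice to sum one-hot shares per source column are illustrative assumptions, not part of DQX.

```python
from collections import defaultdict


def summarize_contributions(contributions, categorical_columns=("category",), min_share=0.05):
    """Group one-hot features by source column and keep shares of at least min_share.

    Hypothetical helper: assumes values are fractional contributions and that
    one-hot features are named '<column>_<value>'.
    """
    grouped = defaultdict(float)
    for feature, value in contributions.items():
        if value is None:
            continue
        # Map 'category_Electronics' back to 'category'; leave other features as-is
        source = next(
            (col for col in categorical_columns if feature.startswith(f"{col}_")),
            feature,
        )
        grouped[source] += abs(value)

    significant = {name: share for name, share in grouped.items() if share >= min_share}
    return sorted(significant.items(), key=lambda item: item[1], reverse=True)


# Made-up contribution values for illustration
example = {
    "amount": 0.62,
    "quantity": 0.21,
    "category_Electronics": 0.04,
    "category_Clothing": 0.03,
    "hour": 0.02,
}
for name, share in summarize_contributions(example):
    print(f"{name}: {share * 100:.1f}%")
```

Summing the one-hot shares is only one possible design; it keeps a categorical column visible when each individual encoded feature falls below the 5% cutoff.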


In [None]:
# Show one detailed example
print("🔎 Detailed Example - Top Anomaly:\n")

# Extract flat columns for easier access
anomalies_flattened = anomalies_explained.select(
    "transaction_id",
    "amount",
    "quantity",
    "date",
    F.col("_info.anomaly.score").alias("score"),
    F.col("_info.anomaly.contributions").alias("contributions")
)

top_anomaly = anomalies_flattened.first()

print(f"Transaction ID: {top_anomaly['transaction_id']}")
print(f"Anomaly Score: {top_anomaly['score']:.3f}")
print(f"\nTransaction Details:")
print(f"   Amount: ${top_anomaly['amount']:.2f}")
print(f"   Quantity: {top_anomaly['quantity']}")
print(f"   Date: {top_anomaly['date']}")
print(f"\nFeature Contributions:")

contributions = top_anomaly['contributions']
if contributions:
    # Sort features by absolute contribution, largest first
    sorted_contrib = sorted(contributions.items(), key=lambda x: abs(x[1]), reverse=True)
    for feature, value in sorted_contrib[:3]:  # Top 3
        print(f"   {feature}: {abs(value)*100:.1f}% contribution")

    print("\n🎯 Investigation Tip:")
    top_feature = sorted_contrib[0][0]
    if "amount" in top_feature:
        print("   → Check for pricing errors or incorrect price feeds")
    elif "quantity" in top_feature:
        print("   → Investigate bulk order or inventory issue")
    elif "date" in top_feature or "hour" in top_feature:
        print("   → Review transaction timing - off-hours activity?")
    else:
        # Fallback so the tip is never empty (e.g. when a category feature ranks first)
        print(f"   → Review '{top_feature}' against typical values for this column")
else:
    print("   (No detailed contributions available)")


---

## Section 7: Using in Production

Ready to use anomaly detection in real pipelines? This section shows you how.

**What you'll learn:**
- How to automatically separate good data from bad data
- How to route anomalies to a quarantine table for investigation
- Best practices for production data quality pipelines


In [None]:
# Pattern 1: Automatically separate clean data from anomalies
print("📦 Pattern 1: Separate Clean Data from Anomalies\n")

# Split data into good and bad records
good_df, bad_df = dq_engine.apply_checks_and_split(df_sales, checks_combined)

# Save quarantined records (anomalies + rule violations)
quarantine_table = f"{catalog}.{schema_name}.sales_anomalies_quarantine"
bad_df_with_metadata = bad_df.select(
    "*",
    F.current_timestamp().alias("quarantine_timestamp"),
    F.lit("quality_check_failed").alias("quarantine_reason")
)
# "overwrite" keeps this demo rerunnable; in production you may prefer "append" to retain history
bad_df_with_metadata.write.mode("overwrite").saveAsTable(quarantine_table)

print("✅ Automatically split data using apply_checks_and_split():")
print(f"   Clean records: {good_df.count()}")
print(f"   Quarantined: {bad_df.count()}")
print("\n💡 Benefits of apply_checks_and_split():")
print("   • Automatically routes failed checks to quarantine")
print("   • Handles both anomalies AND rule violations")
print("   • No manual filtering needed - just specify the checks!")
print("\n📋 Access quarantined records:")
print(f"   spark.table('{quarantine_table}')")


In [None]:
# Pattern 2: Use in downstream pipelines
print("🔄 Pattern 2: Integrate with Downstream Pipelines\n")

# Use the clean data (good_df) for downstream processing
print("💡 Best Practices:")
print("   ✅ Use good_df for downstream analytics, ML training, reporting")
print("   ✅ Route bad_df to investigation/remediation workflows")
print("   ✅ Monitor quarantine table for trends and retraining signals")
print("   ✅ Combine rule-based + anomaly checks (shown in Section 2)")
print("\n📊 Production Flow:")
print("   1. Apply checks (Section 2) → rule-based + anomaly detection")
print("   2. Split data (this section) → good vs bad records")
print("   3. Process good_df → downstream systems")
print("   4. Investigate bad_df → manual review or auto-remediation")
print("\n✨ With new defaults, you get:")
print("   • Ensemble models (confidence scores)")
print("   • Feature contributions (explainability)")
print("   • Optimized performance (10-15x faster than baseline)")
print("   • Production-ready out-of-the-box!")
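
Monitoring the quarantine table for trends can start with a simple rate check. The sketch below is plain Python over per-run record counts (which could come from `good_df.count()` and `bad_df.count()`). The `quarantine_rate_alert` helper, the three-run baseline, and the 2x factor are assumptions chosen for illustration, not DQX features.

```python
from statistics import mean


def quarantine_rate_alert(history, baseline_runs=3, factor=2.0):
    """Return (latest_rate, baseline_rate, alert) for a list of (clean, quarantined) counts.

    Hypothetical helper: flags the latest run when its quarantine rate exceeds
    `factor` times the mean rate of the preceding `baseline_runs` runs.
    """
    if len(history) < baseline_runs + 1:
        raise ValueError("need at least baseline_runs + 1 runs")

    rates = [bad / (good + bad) if (good + bad) else 0.0 for good, bad in history]
    baseline = mean(rates[-(baseline_runs + 1):-1])  # runs just before the latest one
    latest = rates[-1]
    return latest, baseline, latest > factor * baseline


# Made-up counts per pipeline run: (clean records, quarantined records)
runs = [(980, 20), (985, 15), (975, 25), (900, 100)]
latest, baseline, alert = quarantine_rate_alert(runs)
print(f"latest={latest:.1%} baseline={baseline:.1%} retrain_signal={alert}")
```

A sustained jump in the quarantine rate may mean the data has genuinely degraded or that "normal" has shifted and the model needs retraining; the check only says that someone should look.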


---

## Summary & Next Steps

### 🎓 What You Learned

1. **✅ Anomaly Detection Concepts**
   - Known unknowns (rule-based checks) vs Unknown unknowns (ML anomaly detection)
   - Unity Catalog monitors tables, DQX monitors individual records
   - Use both approaches together for comprehensive quality

2. **✅ Zero-Config Quick Start**
   - Train models with auto-discovery (no column selection needed)
   - Score data with one function call
   - Detect unusual patterns automatically

3. **✅ Interpret Results**
   - Anomaly scores range from 0 to 1; this demo flags records at a 0.5 threshold
   - Adjust threshold based on precision/recall needs
   - Compare normal vs anomalous patterns

4. **✅ Control & Tune**
   - Manual column selection for production
   - Threshold tuning for sensitivity
   - Feature contributions for investigation

5. **✅ Production Integration**
   - Quarantine workflow for anomalies
   - Combine with traditional DQ checks
   - Easy integration with existing pipelines

### 💡 Key Takeaways

- **Start simple**: Use auto-discovery first, then refine with manual selection
- **Threshold matters**: Adjust based on your tolerance for false positives
- **Contributions are crucial**: Use them to triage and investigate efficiently
- **Complement, don't replace**: Use both rule-based checks and anomaly detection
- **Unity Catalog + DQX**: Together they provide comprehensive data quality coverage

### 🚀 Next Steps

#### 1. Apply to Your Data
```python
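# Replace the table, model name, and ID column with your own
your_df = spark.table("your_catalog.your_schema.your_table")

model = anomaly_engine.train(
    df=your_df,
    model_name="your_model_name"
)

# Wrap has_no_anomalies in DQDatasetRule, as in the cells above
checks = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["your_id_column"],
            "model": "your_model_name"
        }
    )
]
df_scored = dq_engine.apply_checks(your_df, checks)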
```

#### 2. Explore Advanced Features
- **Segmented models**: Train separate models per region, category, etc.
- **Drift detection**: Monitor when models become stale
- **Ensemble models**: Get confidence intervals on scores
- See the pharma and investment banking demos for examples!

#### 3. Set Up Production Workflows
- Automate model training (weekly/monthly)
- Schedule scoring (hourly/daily)
- Build investigation workflow around quarantine table
- Integrate with alerting (Slack, PagerDuty, etc.)

#### 4. Monitor & Iterate
- Review flagged anomalies regularly
- Adjust thresholds based on false positive rate
- Retrain models as patterns change
- Combine with Unity Catalog's table-level monitoring
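
Adjusting the threshold from the false positive rate becomes concrete once some flagged records have been reviewed. The sketch below is plain Python and assumes you have `(score, is_real_issue)` pairs from manual review. The `sweep_thresholds` helper and the sample data are hypothetical, and the `score >= threshold` comparison is an assumption about how flagging works.

```python
def sweep_thresholds(reviewed, thresholds=(0.5, 0.6, 0.7, 0.8)):
    """For each threshold, report flagged count, precision, and recall.

    `reviewed` is a list of (score, is_real_issue) pairs from manual review.
    """
    total_real = sum(1 for _, is_real in reviewed if is_real)
    rows = []
    for threshold in thresholds:
        # Assumption: a record is flagged when its score is at or above the threshold
        flagged = [is_real for score, is_real in reviewed if score >= threshold]
        true_pos = sum(flagged)
        precision = true_pos / len(flagged) if flagged else 0.0
        recall = true_pos / total_real if total_real else 0.0
        rows.append((threshold, len(flagged), precision, recall))
    return rows


# Made-up review outcomes: (anomaly score, confirmed as a real issue?)
reviewed = [
    (0.95, True), (0.88, True), (0.81, True), (0.74, False),
    (0.69, True), (0.63, False), (0.58, False), (0.52, False),
]
for threshold, flagged, precision, recall in sweep_thresholds(reviewed):
    print(f"threshold={threshold:.1f} flagged={flagged} precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold in this sample trades recall for precision, which is the same trade-off to weigh when choosing `score_threshold` for your own data.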

### 📚 Resources

- [DQX Anomaly Detection Documentation](https://databrickslabs.github.io/dqx/guide/anomaly_detection)
- [API Reference](https://databrickslabs.github.io/dqx/reference/quality_checks#has_no_anomalies)
- [Unity Catalog Anomaly Detection](https://docs.databricks.com/aws/en/data-quality-monitoring/anomaly-detection/#-table-quality-details)
- [GitHub Repository](https://github.com/databrickslabs/dqx)

### 🎉 You're Ready!

You now understand:
- ✅ What anomaly detection is and when to use it
- ✅ How to implement it with minimal configuration
- ✅ How to interpret and tune results
- ✅ How to integrate it into production

**Start detecting anomalies in your data today!** 🚀

---

*Questions? Feedback? Open an issue on [GitHub](https://github.com/databrickslabs/dqx) or contact the DQX team!*
