# 📊 Simple Anomaly Detection Demo

## Learn Anomaly Detection in 15 Minutes

This beginner-friendly demo shows you how to:
- Understand what anomaly detection is and why it matters
- Detect unusual patterns in your data with zero configuration
- Tune detection sensitivity to your needs
- Understand why specific records are flagged
- Integrate anomaly detection into production workflows

**Dataset**: Simple sales transactions (universally relatable, no domain expertise required)

**Time**: 12-17 minutes


---

## Section 0: What is Anomaly Detection in Data Quality?

Before we dive into the code, let's understand what anomaly detection is and why it's valuable.

### The Data Quality Challenge: Known vs Unknown Issues

#### 🎯 Known Unknowns (Traditional Data Quality)

These are issues **you can anticipate** and write rules for:

| Issue Type | Example | Solution |
|------------|---------|----------|
| Null values | `amount` is NULL | `is_not_null(column="amount")` |
| Out of range | Price is negative | `is_in_range(column="price", min=0)` |
| Invalid format | Email without @ symbol | Regex validation |

**Works great when you know what to look for!**

#### 🔍 Unknown Unknowns (Anomaly Detection)

These are issues **you DON'T know to look for**:

- Unusual **patterns** across multiple columns
- Outlier **combinations** that are individually valid
- Subtle **data corruption** that passes all rules

**Problem**: You can't write rules for things you haven't thought of!

**Solution**: ML-based anomaly detection learns "normal" patterns from your data and flags deviations.

### Concrete Example

```
Known Unknown:  "Amount must be positive"
                → is_in_range(min=0)
                ✅ Catches: amount = -50

Unknown Unknown: "Transaction for $47,283 at 3am on Sunday for 2 items"
                 → Anomaly detection
                 ✅ Catches: All fields valid individually, but pattern is unusual
```
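
The contrast can be sketched in a few lines of standard-library Python. This is a toy illustration, not DQX's algorithm: the transactions, the `passes_range_rules` helper, and the cutoff of 5 are all assumptions made up for the example. Every row passes the per-column range rules, yet a robust check on price per item singles out the one row whose columns do not fit together.

```python
from statistics import median

# Illustrative (amount, quantity) pairs; normal rows cost about $20 per item.
transactions = [
    (42.0, 2), (57.0, 3), (84.0, 4), (95.0, 5), (126.0, 6),
    (133.0, 7), (168.0, 8), (171.0, 9), (210.0, 10),
    (190.0, 1),  # each field is in range, but the combination is unusual
]


def passes_range_rules(amount, quantity):
    """Rule-based check: validates each column on its own."""
    return 0 < amount <= 500 and 1 <= quantity <= 10


# Pattern check: how far is each row's unit price from the typical unit price?
unit_prices = [amount / quantity for amount, quantity in transactions]
typical = median(unit_prices)
# Median absolute deviation (MAD): a spread estimate that outliers barely move.
mad = median(abs(price - typical) for price in unit_prices)

CUTOFF = 5  # hypothetical cutoff, measured in MAD units
flagged = [
    row
    for row, price in zip(transactions, unit_prices)
    if abs(price - typical) / mad > CUTOFF
]

print(all(passes_range_rules(a, q) for a, q in transactions))  # True
print(flagged)  # [(190.0, 1)]
```

A real detector learns relationships across many columns at once; the single derived ratio here only stands in for that idea.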

### Why Anomaly Detection Matters

- ✅ **Catches issues before they become problems** - Early warning system
- ✅ **No need to anticipate every failure mode** - Adapts to your data
- ✅ **Learns patterns automatically** - No manual rule writing
- ✅ **Complements rule-based checks** - Use both together for comprehensive quality

### Unity Catalog Integration

#### Built-in Quality Monitoring (Unity Catalog)

Unity Catalog includes **table-level** anomaly detection:
- Monitors column statistics and distributions
- Alerts on schema changes, cardinality shifts
- Tracks null rate changes over time
- Great for monitoring table health

#### When to Use DQX Anomaly Detection

DQX provides **row-level** anomaly detection:
- Detect unusual individual **records/transactions**
- **Multi-column pattern** detection (e.g., price + quantity + time)
- **Custom models per segment** (e.g., different regions, categories)
- **Feature contributions** to understand WHY records are anomalous
- **Integration** with existing DQX quality pipelines

#### Complementary Approach

```
┌─────────────────────────────────────────────────────────┐
│  Rule-Based Checks (Known Unknowns)                     │
│  • is_not_null, is_in_range, regex validation           │
│  • Schema validation, referential integrity             │
└─────────────────────────────────────────────────────────┘
                           +
┌─────────────────────────────────────────────────────────┐
│  ML Anomaly Detection (Unknown Unknowns)                │
│  • Pattern detection, outlier identification            │
│  • Multi-column relationship validation                 │
│  DQX: Row-level anomaly detection                       │
└─────────────────────────────────────────────────────────┘
                           +
┌─────────────────────────────────────────────────────────┐
│  Unity Catalog Monitoring (Table Health)                │
│  • Schema drift, cardinality changes                    │
│  • Column statistics, metadata tracking                 │
│  UC: Table-level anomaly detection                      │
└─────────────────────────────────────────────────────────┘
```

### Key Takeaways

- 💡 **Anomaly detection finds issues you didn't know to look for**
- 💡 **Complements (doesn't replace) rule-based checks - use both!**
- 💡 **Unity Catalog monitors tables, DQX monitors individual records**
- 💡 **Together, they provide comprehensive quality coverage**

Let's see how easy it is to add anomaly detection to your pipeline! 🚀


---

## Section 1: Setup & Data Generation

First, let's set up our environment and create simple sales transaction data.


In [None]:
# Imports
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType, IntegerType
from datetime import datetime, timedelta
import random
import numpy as np

from databricks.labs.dqx.anomaly import AnomalyEngine, has_no_anomalies
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx.check_funcs import is_not_null, is_in_range
from databricks.sdk import WorkspaceClient

# Initialize
spark = SparkSession.builder.getOrCreate()
ws = WorkspaceClient()
anomaly_engine = AnomalyEngine(ws)
dq_engine = DQEngine(ws)

# Set seeds for reproducibility
random.seed(42)
np.random.seed(42)

print("✅ Setup complete!")
print(f"   Spark version: {spark.version}")


In [None]:
# Generate simple sales transaction data
def generate_sales_data(num_rows=1000, anomaly_rate=0.04):
    """
    Generate sales transaction data with injected anomalies.
    
    Normal patterns:
    - Amount: roughly $105-$286 per transaction (regional base amount ±30%)
    - Quantity: typically 1-10 items
    - Business hours: 9am-6pm weekdays
    - Regional consistency
    
    Anomalies (4%):
    - Pricing errors (extremely high/low amounts)
    - Quantity spikes (bulk orders 50-100 items)
    - Timing anomalies (3am transactions, weekend B2B)
    - Regional outliers (unusual amounts for region)
    """
    data = []
    categories = ["Electronics", "Clothing", "Food", "Books", "Home"]
    regions = ["North", "South", "East", "West"]
    
    # Regional pricing patterns (normal baseline)
    region_patterns = {
        "North": {"base_amount": 200, "quantity": 5},
        "South": {"base_amount": 150, "quantity": 4},
        "East": {"base_amount": 180, "quantity": 4},
        "West": {"base_amount": 220, "quantity": 6},
    }
    
    start_date = datetime(2024, 1, 1, 9, 0)  # Jan 1, 2024, 9am
    
    for i in range(num_rows):
        transaction_id = f"TXN{i:06d}"
        category = random.choice(categories)
        region = random.choice(regions)
        pattern = region_patterns[region]
        
        # Generate timestamp (mostly business hours weekdays)
        days_offset = random.randint(0, 90)  # 3 months of data
        hours_offset = random.randint(0, 9)  # 9am-6pm = 9 hours
        date = start_date + timedelta(days=days_offset, hours=hours_offset)
        
        # Skip weekends for normal transactions
        if date.weekday() >= 5:  # Saturday=5, Sunday=6
            date = date - timedelta(days=date.weekday() - 4)  # Move to Friday
        
        # Inject anomalies
        if random.random() < anomaly_rate:
            anomaly_type = random.choice(["pricing", "quantity", "timing", "regional"])
            
            if anomaly_type == "pricing":
                # Pricing error: extreme amounts
                amount = round(random.choice([pattern["base_amount"] * 10, pattern["base_amount"] / 10]), 2)
                quantity = int(np.random.normal(pattern["quantity"], 1))
            
            elif anomaly_type == "quantity":
                # Bulk order spike
                amount = round(pattern["base_amount"] * random.uniform(0.9, 1.1), 2)
                quantity = random.randint(50, 100)  # 10-20x normal
            
            elif anomaly_type == "timing":
                # Off-hours or weekend transaction
                amount = round(pattern["base_amount"] * random.uniform(0.9, 1.1), 2)
                quantity = int(np.random.normal(pattern["quantity"], 1))
                date = date.replace(hour=random.choice([2, 3, 4, 22, 23]))  # Late night/early morning
                # Or make it weekend
                if random.random() > 0.5:
                    date = date + timedelta(days=(5 - date.weekday()))  # Move to Saturday
            
            else:  # regional outlier
                # Amount unusual for this region (but normal for another)
                other_region = random.choice([r for r in regions if r != region])
                amount = round(region_patterns[other_region]["base_amount"] * random.uniform(0.9, 1.1), 2)
                quantity = int(np.random.normal(pattern["quantity"], 1))
        
        else:
            # Normal transaction
            amount = round(pattern["base_amount"] * random.uniform(0.7, 1.3), 2)
            quantity = max(1, int(np.random.normal(pattern["quantity"], 2)))
        
        # Ensure valid ranges
        amount = max(10, min(10000, amount))
        quantity = max(1, min(100, quantity))
        
        data.append((transaction_id, date, amount, quantity, category, region))
    
    return data

# Generate data
print("🔄 Generating sales transaction data...\n")
sales_data = generate_sales_data(num_rows=1000, anomaly_rate=0.04)

schema = StructType([
    StructField("transaction_id", StringType(), False),
    StructField("date", TimestampType(), False),
    StructField("amount", DoubleType(), False),
    StructField("quantity", IntegerType(), False),
    StructField("category", StringType(), False),
    StructField("region", StringType(), False),
])

df_sales = spark.createDataFrame(sales_data, schema)

print("📊 Sample of sales transactions:")
display(df_sales.orderBy("date"))

print(f"\n✅ Generated {df_sales.count()} sales transactions")
print(f"   Expected anomalies: ~{int(df_sales.count() * 0.04)} (4%)")
print(f"\n💡 Data includes:")
print(f"   • Normal patterns: Business hours, typical amounts, reasonable quantities")
print(f"   • Injected anomalies: Pricing errors, bulk orders, off-hours, regional outliers")


In [None]:
# Save to table for later reference
catalog = spark.sql("SELECT current_catalog()").first()[0]
schema_name = "dqx_demo"
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema_name}")

table_name = f"{catalog}.{schema_name}.sales_transactions"
df_sales.write.mode("overwrite").saveAsTable(table_name)

print(f"✅ Data saved to: {table_name}")


In [None]:
# Define unique registry table for this demo (to avoid conflicts with other demos)
registry_table = f"{catalog}.{schema_name}.anomaly_model_registry_101"
print(f"📋 Model registry: {registry_table}")

# Clean up old table if it exists (ensures new nested schema)
spark.sql(f"DROP TABLE IF EXISTS {registry_table}")
print(f"🗑️  Cleaned up old registry table (if existed)")


---

## Section 2: Quick Start - Auto-Discovery

Let's detect anomalies with **ZERO configuration**. The system will:
1. Automatically select relevant columns
2. Auto-detect if segmentation is needed (e.g., separate models per region)
3. Train model(s) on normal patterns
4. Score all transactions for anomalies

**You provide**: Just the data  
**DQX provides**: Everything else!

**Note**: The system may auto-create segmented models if it detects distinct groups in your data. You'll see this in the registry table below.
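
Why segmentation can matter is easiest to see with numbers. The sketch below uses plain z-scores from the standard library rather than DQX's models, and the regions, amounts, and the `z_score` helper are invented for illustration. A $500 sale is unremarkable against the pooled history but lies far outside the South region's own history.

```python
from statistics import mean, pstdev

# Hypothetical historical amounts per region (illustrative values only).
history = {
    "North": [480.0, 490.0, 500.0, 510.0, 520.0],
    "South": [45.0, 48.0, 50.0, 52.0, 55.0],
}


def z_score(value, baseline):
    """Distance of value from the baseline mean, in standard deviations."""
    return abs(value - mean(baseline)) / pstdev(baseline)


new_region, new_amount = "South", 500.0

# One global baseline: pool every region together.
pooled = [amount for amounts in history.values() for amount in amounts]
global_z = z_score(new_amount, pooled)

# One baseline per segment: compare only against the same region.
segment_z = z_score(new_amount, history[new_region])

print(f"global z-score:  {global_z:.1f}")
print(f"segment z-score: {segment_z:.1f}")
```

With a conventional cutoff of 3 standard deviations, only the per-region baseline flags the record, which is the kind of situation where separate models per segment help.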


In [None]:
# Step 1: Train model with auto-discovery (zero config!)
print("🎯 Training anomaly detection model...")
print("   (Auto-discovering columns, segments, and patterns)\n")

model_uri_auto = anomaly_engine.train(
    df=spark.table(table_name),
    model_name="sales_auto",
    registry_table=registry_table
)

print(f"✅ Model trained successfully!")
print(f"   Model URI: {model_uri_auto}")

# Show the registry table to see what was created
print(f"\n📋 Model Registry Contents:")
print(f"   Explore the registry table to see all trained models and their configurations:\n")

display(
    spark.table(registry_table)
    .filter(F.col("model_name").startswith("sales_auto"))
    .select(
        "model_name",
        "columns", 
        "segment_by",
        "segment_values",
        "training_rows",
        "training_time",
        "status"
    )
    .orderBy("model_name")
)

print("\n💡 What you're seeing:")
print("   • If segment_by is populated, DQX auto-created separate models per segment")
print("   • Each row is a trained model (global or segment-specific)")
print("   • Status 'active' means this model is used for scoring")


In [None]:
# Step 2: Score transactions for anomalies
print("🔍 Scoring transactions for anomalies...\n")

checks_auto = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_auto",
            "registry_table": registry_table
        }
    )
]

df_scored = dq_engine.apply_checks(df_sales, checks_auto)

# Filter to anomalies (score >= 0.5 is considered anomalous)
anomalies = df_scored.filter(F.col("_info.anomaly.score") >= 0.5)

print(f"⚠️  Found {anomalies.count()} anomalies out of {df_scored.count()} transactions")
print(f"   Detection rate: {(anomalies.count() / df_scored.count()) * 100:.1f}%")
print(f"\n🔝 Top 10 anomalies (by score):\n")

display(anomalies.orderBy(F.col("_info.anomaly.score").desc()).select(
    "transaction_id", "date", "amount", "quantity", "category", "region",
    F.round("_info.anomaly.score", 3).alias("anomaly_score")
).limit(10))

print("\n💡 Key Point: Anomaly score ranges from 0 to 1")
print("   • Score >= 0.5: Considered anomalous (flagged)")
print("   • Score < 0.5: Normal transaction")
print("   • Higher score = more unusual")


---

## Section 3: Understanding the Results

Let's dig deeper into what we found and how to interpret anomaly scores.


In [None]:
# Analyze score distribution
print("📊 Anomaly Score Distribution:\n")

score_stats = df_scored.select("_info.anomaly.score").describe()
display(score_stats)

# Show score ranges
print("📈 Score Range Breakdown:\n")

score_ranges = df_scored.select(
    F.count(F.when(F.col("_info.anomaly.score") < 0.3, 1)).alias("normal_0.0_0.3"),
    F.count(F.when((F.col("_info.anomaly.score") >= 0.3) & (F.col("_info.anomaly.score") < 0.5), 1)).alias("borderline_0.3_0.5"),
    F.count(F.when((F.col("_info.anomaly.score") >= 0.5) & (F.col("_info.anomaly.score") < 0.7), 1)).alias("anomalous_0.5_0.7"),
    F.count(F.when(F.col("_info.anomaly.score") >= 0.7, 1)).alias("highly_anomalous_0.7_1.0"),
).first()

total = df_scored.count()
print(f"Normal (0.0-0.3):           {score_ranges['normal_0.0_0.3']:4d} ({score_ranges['normal_0.0_0.3']/total*100:5.1f}%)")
print(f"Borderline (0.3-0.5):       {score_ranges['borderline_0.3_0.5']:4d} ({score_ranges['borderline_0.3_0.5']/total*100:5.1f}%)")
print(f"Anomalous (0.5-0.7):        {score_ranges['anomalous_0.5_0.7']:4d} ({score_ranges['anomalous_0.5_0.7']/total*100:5.1f}%)")
print(f"Highly Anomalous (0.7-1.0): {score_ranges['highly_anomalous_0.7_1.0']:4d} ({score_ranges['highly_anomalous_0.7_1.0']/total*100:5.1f}%)")

print(f"\n💡 Interpretation:")
print(f"   • Most transactions score low (normal behavior)")
print(f"   • Threshold of 0.5 separates normal from anomalous")
print(f"   • You can adjust this threshold based on your needs!")


In [None]:
# Compare normal vs anomalous transactions
print("🔍 Normal vs Anomalous Transaction Comparison:\n")

normal_stats = df_scored.filter(F.col("_info.anomaly.score") < 0.5).agg(
    F.avg("amount").alias("avg_amount"),
    F.avg("quantity").alias("avg_quantity"),
    F.count("*").alias("count")
).first()

anomaly_stats = df_scored.filter(F.col("_info.anomaly.score") >= 0.5).agg(
    F.avg("amount").alias("avg_amount"),
    F.avg("quantity").alias("avg_quantity"),
    F.count("*").alias("count")
).first()

print("Normal Transactions:")
print(f"   Count: {normal_stats['count']}")
print(f"   Avg Amount: ${normal_stats['avg_amount']:.2f}")
print(f"   Avg Quantity: {normal_stats['avg_quantity']:.1f}")

print("\nAnomalous Transactions:")
print(f"   Count: {anomaly_stats['count']}")
print(f"   Avg Amount: ${anomaly_stats['avg_amount']:.2f}")
print(f"   Avg Quantity: {anomaly_stats['avg_quantity']:.1f}")

print("\n✅ Anomalies have different patterns - mission accomplished!")


---

## Section 4: Tuning the Threshold

The threshold (default 0.5) controls how sensitive anomaly detection is:
- **Lower threshold** (e.g., 0.3): More sensitive, flags more anomalies (higher recall)
- **Higher threshold** (e.g., 0.7): Less sensitive, flags only severe anomalies (higher precision)

Let's see how changing the threshold affects results!
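
The precision/recall trade-off can be made concrete with a small hand-labeled sample. The sketch below is plain Python with made-up scores and review outcomes; the `precision_recall` helper is hypothetical and not part of DQX. It assumes you have reviewed some flagged records and know which ones were real issues.

```python
# (anomaly_score, confirmed_anomaly) pairs; values are made up for illustration.
reviewed = [
    (0.92, True), (0.81, True), (0.66, True), (0.58, False), (0.52, True),
    (0.47, False), (0.41, True), (0.35, False), (0.22, False), (0.10, False),
]


def precision_recall(records, threshold):
    """Return (precision, recall) when flagging scores >= threshold."""
    flagged = [is_anomaly for score, is_anomaly in records if score >= threshold]
    true_positives = sum(flagged)
    total_anomalies = sum(is_anomaly for _, is_anomaly in records)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_anomalies if total_anomalies else 0.0
    return precision, recall


for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall(reviewed, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

On this sample, lowering the threshold to 0.3 catches every confirmed anomaly at the cost of more false alarms, while raising it to 0.7 leaves only confirmed anomalies among the flags but misses more than half of them.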


In [None]:
# Try different thresholds
print("🎚️  Testing Different Thresholds:\n")
print("Threshold | Anomalies | % of Data | Interpretation")
print("-" * 70)

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
total_count = df_scored.count()

for threshold in thresholds:
    anomaly_count = df_scored.filter(F.col("_info.anomaly.score") >= threshold).count()
    percentage = (anomaly_count / total_count) * 100
    
    if threshold <= 0.3:
        interpretation = "Very sensitive (many alerts)"
    elif threshold <= 0.5:
        interpretation = "Balanced (recommended start)"
    elif threshold <= 0.7:
        interpretation = "Conservative (fewer alerts)"
    else:
        interpretation = "Very strict (critical only)"
    
    print(f"   {threshold:.1f}   |   {anomaly_count:4d}    |  {percentage:5.1f}%  | {interpretation}")

print("\n💡 How to Choose Your Threshold:")
print("   • Start with 0.5 (balanced)")
print("   • Too many false positives? → Increase threshold (0.6, 0.7)")
print("   • Missing real issues? → Decrease threshold (0.4, 0.3)")
print("   • Adjust based on investigation capacity and tolerance for risk")


In [None]:
# Let's look at borderline cases
print("🔍 Examining Borderline Cases (scores 0.45-0.55):\n")

borderline = df_scored.filter(
    (F.col("_info.anomaly.score") >= 0.45) & 
    (F.col("_info.anomaly.score") <= 0.55)
).orderBy(F.col("_info.anomaly.score").desc())

print(f"Found {borderline.count()} borderline transactions:\n")
display(borderline.select(
    "transaction_id", "amount", "quantity", "category", "region",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(10))

print("\n💡 These are on the edge - slight threshold changes will include/exclude them")
print("   Review these to calibrate your threshold for your use case")


---

## Section 5: Manual Column Selection

Auto-discovery is great for exploration, but for production you might want explicit control.

Let's train a model with **manually selected columns**.


In [None]:
# Train with manual column selection
print("🎯 Training model with manual column selection...\n")

model_uri_manual = anomaly_engine.train(
    df=spark.table(table_name),
    columns=["amount", "quantity", "date"],  # Explicitly specify columns
    model_name="sales_manual",
    registry_table=registry_table
)

print(f"✅ Manual model trained!")
print(f"   Model URI: {model_uri_manual}")

# Compare auto vs manual in the registry
print(f"\n📊 Auto vs Manual Comparison:")
print(f"   View both models side-by-side in the registry:\n")

display(
    spark.table(registry_table)
    .filter(
        (F.col("model_name") == "sales_auto") | 
        (F.col("model_name") == "sales_manual")
    )
    .select(
        "model_name",
        "columns",
        "segment_by",
        "training_rows",
        "status"
    )
    .orderBy("model_name", "training_time")
)

print(f"\n💡 Key Differences:")
print(f"   • Auto model: Discovered columns automatically + may have segmentation")
print(f"   • Manual model: You explicitly chose 3 columns (amount, quantity, date)")
print(f"\n💡 When to use each approach:")
print(f"   • Auto-discovery: Exploration, quick start, don't know what matters")
print(f"   • Manual selection: Production, control features, domain knowledge")
print(f"   • Both are valid! Start with auto, refine with manual")


In [None]:
# Score with manual model
print("🔍 Scoring with manual model...\n")

checks_manual = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_manual",
            "score_threshold": 0.5,
            "registry_table": registry_table
        }
    )
]

df_scored_manual = dq_engine.apply_checks(df_sales, checks_manual)
anomalies_manual = df_scored_manual.filter(F.col("_info.anomaly.score") >= 0.5)

print(f"⚠️  Manual model found {anomalies_manual.count()} anomalies")
print(f"   (Auto model found {anomalies.count()} anomalies)")
print(f"\n🔝 Top 5 anomalies from manual model:\n")

display(anomalies_manual.orderBy(F.col("_info.anomaly.score").desc()).select(
    "transaction_id", "amount", "quantity", "date",
    F.round("_info.anomaly.score", 3).alias("score")
).limit(5))

print("\n💡 Results may differ slightly because we're using different features")
print("   This is normal and expected!")


---

## Section 6: Why Is This Anomalous?

Finding anomalies is great, but **understanding WHY** they're anomalous is crucial for investigation!

Let's add **feature contributions** to see which columns drove each anomaly score.
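
To read contributions consistently, it can help to convert them into shares that sum to 100% and rank them. The sketch below assumes contributions arrive as a feature-to-number mapping, which may not match the exact structure DQX returns; `rank_contributions` and the sample values are hypothetical.

```python
def rank_contributions(contributions):
    """Turn raw contributions into shares of the total, largest first.

    Uses absolute values so that negative contributions still count
    toward how much a feature influenced the score.
    """
    total = sum(abs(value) for value in contributions.values())
    if total == 0:
        return []
    shares = {feature: abs(value) / total for feature, value in contributions.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw contributions for one flagged transaction.
example = {"amount": 3.0, "quantity": -0.5, "date": 1.5}

for feature, share in rank_contributions(example):
    print(f"{feature}: {share:.0%}")
```

Here `amount` accounts for the largest share, so a pricing problem would be the first thing to check for this record.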


In [None]:
# Score with feature contributions
print("🔍 Scoring with feature contributions (explainability)...\n")

checks_with_contrib = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_manual",
            "score_threshold": 0.5,
            "include_contributions": True,  # Add this to get explanations!
            "registry_table": registry_table
        }
    )
]

df_with_contrib = dq_engine.apply_checks(df_sales, checks_with_contrib)

print("✅ Scored with feature contributions!")
print("\n🎯 Top Anomalies with Explanations:\n")

anomalies_explained = df_with_contrib.filter(
    F.col("_info.anomaly.score") >= 0.5
).orderBy(F.col("_info.anomaly.score").desc()).limit(5)

display(anomalies_explained.select(
    "transaction_id",
    "amount",
    "quantity",
    F.date_format("date", "yyyy-MM-dd HH:mm").alias("date"),
    F.round("_info.anomaly.score", 3).alias("score"),
    F.col("_info.anomaly.contributions").alias("contributions")
))

print("\n💡 How to Read Contributions:")
print("   • Contributions show which features made this transaction unusual")
print("   • Higher contribution = that feature is more responsible for the anomaly")
print("   • Use this to triage and investigate efficiently!")
print("\n   Example: If 'amount' has high contribution → pricing issue")
print("            If 'quantity' has high contribution → bulk order anomaly")
print("            If 'date' has high contribution → timing anomaly")


In [None]:
# Show one detailed example
print("🔎 Detailed Example - Top Anomaly:\n")

top_anomaly = anomalies_explained.first()

# anomalies_explained keeps the raw DQX output, so the score and contributions
# sit inside the nested _info.anomaly struct (the aliases above exist only in display())
anomaly_info = top_anomaly["_info"]["anomaly"]

print(f"Transaction ID: {top_anomaly['transaction_id']}")
print(f"Anomaly Score: {anomaly_info['score']:.3f}")
print(f"\nTransaction Details:")
print(f"   Amount: ${top_anomaly['amount']:.2f}")
print(f"   Quantity: {top_anomaly['quantity']}")
print(f"   Date: {top_anomaly['date']}")
print(f"\nFeature Contributions:")

contributions = anomaly_info["contributions"]
if contributions:
    # Sort by absolute contribution value, largest first
    sorted_contrib = sorted(contributions.items(), key=lambda x: abs(x[1]), reverse=True)
    for feature, value in sorted_contrib[:3]:  # Top 3
        print(f"   {feature}: {abs(value)*100:.1f}% contribution")
    
    print(f"\n🎯 Investigation Tip:")
    top_feature = sorted_contrib[0][0]
    if "amount" in top_feature:
        print(f"   → Check for pricing errors or incorrect price feeds")
    elif "quantity" in top_feature:
        print(f"   → Investigate bulk order or inventory issue")
    elif "date" in top_feature or "hour" in top_feature:
        print(f"   → Review transaction timing - off-hours activity?")
else:
    print("   (No detailed contributions available)")


---

## Section 7: Production Integration

Now that you understand anomaly detection, let's see how to use it in production workflows.

Two common patterns:
1. **Quarantine anomalies** for review
2. **Combine with traditional DQ checks** for comprehensive quality
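
Both patterns come down to one routing decision per record: does it violate a rule, does it score as anomalous, or neither? The following is a minimal plain-Python sketch of that decision, independent of Spark and DQX. The record fields, reason labels, and `route_record` helper are assumptions for illustration only.

```python
def route_record(record, score_threshold=0.5):
    """Return ("quarantine" or "clean", reasons) for one scored record."""
    reasons = []

    # Rule-based checks (known unknowns)
    if record.get("amount") is None:
        reasons.append("amount_is_null")
    elif record["amount"] < 0:
        reasons.append("amount_out_of_range")

    # ML-based check (unknown unknowns)
    if record.get("anomaly_score", 0.0) >= score_threshold:
        reasons.append("anomaly_detected")

    return ("quarantine" if reasons else "clean"), reasons


# Hypothetical scored records.
records = [
    {"transaction_id": "TXN000001", "amount": 185.0, "anomaly_score": 0.12},
    {"transaction_id": "TXN000002", "amount": -50.0, "anomaly_score": 0.20},
    {"transaction_id": "TXN000003", "amount": 2000.0, "anomaly_score": 0.83},
]

routed = {record["transaction_id"]: route_record(record) for record in records}
for transaction_id, (destination, reasons) in routed.items():
    print(transaction_id, destination, reasons)
```

Keeping the reasons alongside the destination means a reviewer can tell at a glance whether a quarantined record failed a rule, looked anomalous, or both.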


In [None]:
# Pattern 1: Quarantine anomalies
print("📦 Pattern 1: Quarantine Anomalies for Investigation\n")

# Filter anomalies and save to quarantine table
quarantine_table = f"{catalog}.{schema_name}.sales_anomalies_quarantine"

anomalies_to_quarantine = df_with_contrib.filter(
    F.col("_info.anomaly.score") >= 0.5
).select(
    "*",
    F.current_timestamp().alias("quarantine_timestamp"),
    F.lit("anomaly_detected").alias("quarantine_reason")
)

anomalies_to_quarantine.write.mode("overwrite").saveAsTable(quarantine_table)

print(f"✅ Quarantined {anomalies_to_quarantine.count()} anomalies to: {quarantine_table}")
print(f"\n💡 Use Case:")
print(f"   • Automatically route unusual transactions for manual review")
print(f"   • Prevent bad data from reaching downstream systems")
print(f"   • Build investigation workflow around quarantine table")
print(f"\n📋 Access quarantined records:")
print(f"   spark.table('{quarantine_table}')")


In [None]:
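# Pattern 2: Combine with traditional DQ checks
from databricks.labs.dqx.rule import DQRowRule

print("🔄 Pattern 2: Combine Rule-Based + ML Anomaly Detection\n")

# Comprehensive quality checks
checks_combined = [
    # Traditional rule-based checks (known unknowns), one rule per column
    *[
        DQRowRule(check_func=is_not_null, column=col_name)
        for col_name in ["transaction_id", "amount", "quantity"]
    ],
    DQRowRule(
        check_func=is_in_range,
        column="amount",
        check_func_kwargs={"min_limit": 0, "max_limit": 100000},
    ),
    DQRowRule(
        check_func=is_in_range,
        column="quantity",
        check_func_kwargs={"min_limit": 1, "max_limit": 1000},
    ),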
    
    # ML-based anomaly detection (unknown unknowns)
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["transaction_id"],
            "model": "sales_manual",
            "score_threshold": 0.5,
            "include_contributions": True,
            "registry_table": registry_table
        }
    )
]

# Apply all checks in one pass
df_full_quality = dq_engine.apply_checks(df_sales, checks_combined)

print("✅ Applied all quality checks (rule-based + ML)!")
print(f"\n📊 Quality Summary:")

total = df_full_quality.count()
anomalies_found = df_full_quality.filter(F.col("_info.anomaly.score") >= 0.5).count()
clean_records = total - anomalies_found

print(f"   Total records: {total}")
print(f"   Clean records: {clean_records} ({clean_records/total*100:.1f}%)")
print(f"   Anomalies: {anomalies_found} ({anomalies_found/total*100:.1f}%)")

print(f"\n💡 Best Practice:")
print(f"   ✅ Use rule-based checks for known issues (nulls, ranges, formats)")
print(f"   ✅ Use anomaly detection for unknown patterns")
print(f"   ✅ Apply both together for comprehensive quality coverage!")

# Show combined results
print(f"\n🔝 Records with issues (either rule violations or anomalies):\n")
issues = df_full_quality.filter(
    (F.col("_info.anomaly.score") >= 0.5) |
    (F.size(F.col("_info.failed_checks")) > 0)
)

if issues.count() > 0:
    display(issues.select(
        "transaction_id", "amount", "quantity",
        F.round("_info.anomaly.score", 3).alias("anomaly_score"),
        F.size("_info.failed_checks").alias("rule_violations")
    ).limit(10))
else:
    print("   No issues found! All data passed quality checks.")


---

## Summary & Next Steps

### 🎓 What You Learned

1. **✅ Anomaly Detection Concepts**
   - Known unknowns (rule-based checks) vs Unknown unknowns (ML anomaly detection)
   - Unity Catalog monitors tables, DQX monitors individual records
   - Use both approaches together for comprehensive quality

2. **✅ Zero-Config Quick Start**
   - Train models with auto-discovery (no column selection needed)
   - Score data with one function call
   - Detect unusual patterns automatically

3. **✅ Interpret Results**
   - Anomaly scores range 0-1 (0.5 threshold)
   - Adjust threshold based on precision/recall needs
   - Compare normal vs anomalous patterns

4. **✅ Control & Tune**
   - Manual column selection for production
   - Threshold tuning for sensitivity
   - Feature contributions for investigation

5. **✅ Production Integration**
   - Quarantine workflow for anomalies
   - Combine with traditional DQ checks
   - Easy integration with existing pipelines

### 💡 Key Takeaways

- **Start simple**: Use auto-discovery first, then refine with manual selection
- **Threshold matters**: Adjust based on your tolerance for false positives
- **Contributions are crucial**: Use them to triage and investigate efficiently
- **Complement, don't replace**: Use both rule-based checks and anomaly detection
- **Unity Catalog + DQX**: Together they provide comprehensive data quality coverage

### 🚀 Next Steps

#### 1. Apply to Your Data
```python
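# Replace with your table
your_df = spark.table("your_catalog.your_schema.your_table")

model = anomaly_engine.train(
    df=your_df,
    model_name="your_model_name"
)

# Wrap has_no_anomalies in a DQDatasetRule, as in the demo cells above
checks = [
    DQDatasetRule(
        check_func=has_no_anomalies,
        check_func_kwargs={
            "merge_columns": ["your_id_column"],
            "model": "your_model_name"
        }
    )
]
df_scored = dq_engine.apply_checks(your_df, checks)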
```

#### 2. Explore Advanced Features
- **Segmented models**: Train separate models per region, category, etc.
- **Drift detection**: Monitor when models become stale
- **Ensemble models**: Get confidence intervals on scores
- See the pharma and investment banking demos for examples!

#### 3. Set Up Production Workflows
- Automate model training (weekly/monthly)
- Schedule scoring (hourly/daily)
- Build investigation workflow around quarantine table
- Integrate with alerting (Slack, PagerDuty, etc.)

#### 4. Monitor & Iterate
- Review flagged anomalies regularly
- Adjust thresholds based on false positive rate
- Retrain models as patterns change
- Combine with Unity Catalog's table-level monitoring
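
One lightweight way to notice that patterns have changed is to track the share of records flagged over time. The sketch below is a plain-Python heuristic, not DQX's drift detection; the `flagged_rate` and `rate_has_shifted` helpers, the factor of 2, and the score lists are all assumptions for illustration.

```python
def flagged_rate(scores, threshold=0.5):
    """Fraction of scores at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def rate_has_shifted(baseline_rate, recent_rate, factor=2.0):
    """True when the recent rate is more than `factor` times above or below baseline."""
    if baseline_rate == 0:
        return recent_rate > 0
    ratio = recent_rate / baseline_rate
    return ratio > factor or ratio < 1 / factor


# Hypothetical scores: 4% flagged at training time, 15% flagged this week.
baseline_scores = [0.1] * 96 + [0.8] * 4
recent_scores = [0.1] * 85 + [0.8] * 15

baseline = flagged_rate(baseline_scores)
recent = flagged_rate(recent_scores)
print(f"baseline={baseline:.0%} recent={recent:.0%} review={rate_has_shifted(baseline, recent)}")
```

A sustained shift like this does not say whether the data or the model is at fault, only that the flagged records and the threshold deserve a fresh look before the next retraining.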

### 📚 Resources

- [DQX Anomaly Detection Documentation](https://databrickslabs.github.io/dqx/guide/anomaly_detection)
- [API Reference](https://databrickslabs.github.io/dqx/reference/quality_checks#has_no_anomalies)
- [Unity Catalog Anomaly Detection](https://docs.databricks.com/aws/en/data-quality-monitoring/anomaly-detection/#-table-quality-details)
- [GitHub Repository](https://github.com/databrickslabs/dqx)

### 🎉 You're Ready!

You now understand:
- ✅ What anomaly detection is and when to use it
- ✅ How to implement it with minimal configuration
- ✅ How to interpret and tune results
- ✅ How to integrate it into production

**Start detecting anomalies in your data today!** 🚀

---

*Questions? Feedback? Open an issue on [GitHub](https://github.com/databrickslabs/dqx) or contact the DQX team!*
