# 4.2 Declarative Data Quality with Delta Live Tables (DLT) Expectations

This notebook demonstrates how to implement declarative data quality with Delta Live Tables (DLT) expectations, using a functional programming approach. We'll explore how DLT expectations align with functional principles and provide automated data quality management.

## Learning Objectives

By the end of this notebook, you will understand how to:
- Design functional data pipelines with DLT expectations
- Implement declarative data quality rules using expectations
- Create composable expectation patterns
- Handle data quality violations with different strategies (warn, drop, fail)
- Monitor and alert on data quality metrics
- Test DLT pipelines with functional approaches
- Compare declarative vs imperative data quality patterns

## Prerequisites

- Understanding of PySpark DataFrames
- Knowledge of functional programming concepts
- Familiarity with Delta Lake basics
- Experience with data validation patterns

## 1. Introduction to Delta Live Tables and Expectations

Delta Live Tables (DLT) is a declarative framework for building reliable data pipelines on Databricks. DLT expectations provide automated data quality testing built directly into the pipeline definition.

### Why DLT Expectations?

**Traditional Imperative Approach:**
```python
from pyspark.sql.functions import col

# Scattered validation logic throughout code
if df.filter(col("age") < 0).count() > 0:
    raise ValueError("Invalid ages found")

# Manual error handling and logging
# Difficult to track quality metrics
# Quality checks can be missed
```

**DLT Declarative Approach:**
```python
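import dlt

@dlt.expect("valid_age", "age >= 0")
# Four backslashes: Python and the SQL string parser each consume one level of escaping
@dlt.expect("valid_email", "email RLIKE '^[^@]+@[^@]+\\\\.[^@]+$'")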
@dlt.table
def customers():
    return spark.read.table("source_customers")
```

### Functional Programming Alignment

DLT expectations embody functional programming principles:
- **Declarative**: Define *what* quality rules should be enforced, not *how*
- **Composable**: Stack multiple expectations on the same table
- **Immutable**: Expectations are fixed when the table is declared and never mutate the input data
- **Side-effect isolation**: Quality actions (warn/drop/fail) are explicitly declared
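
To make the composability point concrete, here is a minimal Spark-free sketch of how stacked decorators can attach rules to a function as metadata without changing what the function returns. The `expect` decorator and the `__expectations__` attribute are hypothetical stand-ins for illustration only; they are not the `dlt` API.

```python
from typing import Callable, List, Tuple

Rule = Tuple[str, str, str]  # (name, constraint, action)


def expect(name: str, constraint: str, action: str = "warn") -> Callable:
    """Hypothetical decorator: records a rule on the function, leaves its behavior untouched."""
    def decorator(fn: Callable) -> Callable:
        existing: List[Rule] = list(getattr(fn, "__expectations__", []))
        # Decorators apply bottom-up, so prepend to keep top-to-bottom reading order
        fn.__expectations__ = [(name, constraint, action)] + existing
        return fn
    return decorator


@expect("valid_age", "age >= 0")
@expect("valid_email", "email IS NOT NULL", action="drop")
def customers():
    return [{"age": 30, "email": "a@example.com"}]


# The rules are plain data that a framework could inspect; the function body is unchanged
print(customers.__expectations__)
print(customers())
```

Keeping rules as data, separate from the transformation, is what allows a runtime to collect metrics and apply warn/drop/fail actions uniformly.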

In [None]:
# Essential imports for DLT pipeline demonstration
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import *
import pyspark.sql.functions as F
from typing import Dict, List, Tuple, Optional, Callable, Any
from dataclasses import dataclass
from enum import Enum
import json

# Initialize Spark session (if not already available)
try:
    spark
except NameError:
    spark = SparkSession.builder.appName("DLTExpectations").getOrCreate()

print("✅ Setup complete - Ready for DLT expectations demonstration!")

## 2. DLT Expectation Types and Strategies

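DLT provides three expectation decorators, each with a different violation-handling strategy. Each also has an `expect_all` variant (`@dlt.expect_all`, `@dlt.expect_all_or_drop`, `@dlt.expect_all_or_fail`) that accepts a dictionary of named constraints: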

| Expectation Type | Behavior | Use Case | Functional Impact |
|-----------------|----------|----------|-------------------|
| `@dlt.expect()` | **Warn** - Records violations in metrics | Monitoring non-critical quality | Pure observation, no data modification |
| `@dlt.expect_or_drop()` | **Drop** - Removes violating records | Enforce quality constraints | Transforms data by filtering |
| `@dlt.expect_or_fail()` | **Fail** - Stops pipeline execution | Critical quality gates | Halts execution on violation |
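
The three behaviors in the table can be sketched without Spark. The `apply_rule` helper and the `ExpectationFailed` exception below are hypothetical names used only for illustration; real DLT applies these semantics inside the pipeline runtime.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class ExpectationFailed(Exception):
    """Raised by the 'fail' strategy in this sketch."""


def apply_rule(rows: List[Row], predicate: Callable[[Row], bool],
               action: str) -> Tuple[List[Row], int]:
    """Return (output_rows, violation_count) for one rule under the given strategy."""
    violations = [r for r in rows if not predicate(r)]
    if action == "fail" and violations:
        raise ExpectationFailed(f"{len(violations)} violating row(s)")
    if action == "drop":
        return [r for r in rows if predicate(r)], len(violations)
    # "warn" (or a passing "fail"): keep every row, only report the count
    return list(rows), len(violations)


def valid_age(row: Row) -> bool:
    return 0 <= row["age"] <= 120


rows = [{"age": 28}, {"age": -5}, {"age": 150}]

warned, n_warn = apply_rule(rows, valid_age, "warn")
dropped, n_drop = apply_rule(rows, valid_age, "drop")
print(len(warned), n_warn)    # all rows kept, violations counted
print(len(dropped), n_drop)   # violating rows removed
```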

### Conceptual DLT Pipeline Structure

Because DLT runs only inside a Databricks pipeline, the cells below build functional patterns in plain PySpark that mirror DLT's declarative approach.

In [None]:
# Simulate DLT expectation framework with functional patterns

class ExpectationType(Enum):
    """Types of expectation enforcement strategies"""
    WARN = "warn"           # Record violation, continue processing
    DROP = "drop"           # Remove violating records
    FAIL = "fail"           # Stop pipeline on violation

@dataclass(frozen=True)
class Expectation:
    """Immutable expectation definition"""
    name: str
    constraint: str
    expectation_type: ExpectationType
    
    def evaluate(self, df: DataFrame) -> Tuple[DataFrame, 'ExpectationResult']:
        """
        Evaluate the expectation on a DataFrame.
        Returns (transformed_df, result) and never mutates the input; note that
        it triggers Spark actions (count) to compute the metrics.
        """
        total_count = df.count()
        
        # Count rows that satisfy the constraint. Rows where the constraint
        # evaluates to NULL (e.g. a NULL age) are not kept by filter(), so they
        # count as violations here, matching what the DROP branch removes.
        valid_count = df.filter(self.constraint).count()
        violating_count = total_count - valid_count
        
        # Create result
        result = ExpectationResult(
            expectation_name=self.name,
            constraint=self.constraint,
            expectation_type=self.expectation_type,
            total_records=total_count,
            valid_records=valid_count,
            violated_records=violating_count,
            passed=violating_count == 0
        )
        
        # Apply transformation based on expectation type
        if self.expectation_type == ExpectationType.WARN:
            # Return original DataFrame unchanged
            return df, result
        
        elif self.expectation_type == ExpectationType.DROP:
            # Return filtered DataFrame with violating records removed
            filtered_df = df.filter(self.constraint)
            return filtered_df, result
        
        elif self.expectation_type == ExpectationType.FAIL:
            # Return DataFrame but signal failure in result
            return df, result

@dataclass(frozen=True)
class ExpectationResult:
    """Immutable expectation evaluation result"""
    expectation_name: str
    constraint: str
    expectation_type: ExpectationType
    total_records: int
    valid_records: int
    violated_records: int
    passed: bool
    
    @property
    def violation_rate(self) -> float:
        """Calculate percentage of records violating the expectation"""
        return (self.violated_records / self.total_records * 100) if self.total_records > 0 else 0.0
    
    def __str__(self) -> str:
        status = "✅ PASSED" if self.passed else "❌ FAILED"
        return (f"{status} - {self.expectation_name} ({self.expectation_type.value})\n"
                f"  Constraint: {self.constraint}\n"
                f"  Valid: {self.valid_records:,} / {self.total_records:,} "
                f"({100-self.violation_rate:.1f}%)\n"
                f"  Violations: {self.violated_records:,} ({self.violation_rate:.1f}%)")

print("✅ Expectation framework classes defined")
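
One subtlety when counting violations with SQL predicates: a constraint such as `age >= 0` evaluates to NULL (not false) when `age` is NULL, and a filter keeps only rows where the predicate is true. The Spark-free sketch below models SQL's three-valued logic with `None` to show why filtering on `NOT (constraint)` can miss such rows. The helper names are hypothetical.

```python
from typing import Callable, List, Optional


def sql_ge(value: Optional[int], bound: int) -> Optional[bool]:
    """SQL-style comparison: any comparison with NULL yields NULL (None)."""
    return None if value is None else value >= bound


def sql_not(flag: Optional[bool]) -> Optional[bool]:
    """SQL-style NOT: NOT NULL is still NULL."""
    return None if flag is None else not flag


def sql_filter(values: List[Optional[int]],
               predicate: Callable[[Optional[int]], Optional[bool]]) -> List[Optional[int]]:
    """A WHERE clause keeps a row only when the predicate is exactly true."""
    return [v for v in values if predicate(v) is True]


ages = [28, -5, None]
passing = sql_filter(ages, lambda a: sql_ge(a, 0))
failing = sql_filter(ages, lambda a: sql_not(sql_ge(a, 0)))

# The NULL row appears in neither list, so counting violations via
# NOT (constraint) undercounts relative to the rows a DROP-style filter removes.
print(passing, failing)
print(len(ages) - len(passing), "row(s) would be removed by filtering on the constraint")
```

Deriving the violation count as `total - valid` keeps the reported metric consistent with the rows that are actually removed.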

## 3. Creating Sample Data with Quality Issues

Let's create realistic sample data that contains various data quality issues to demonstrate expectation handling.

In [None]:
def create_customer_data_with_quality_issues():
    """
    Pure function to create customer data with intentional quality issues.
    Demonstrates various types of data quality violations.
    """
    
    # Mix of valid and invalid records
    data = [
        # Valid records
        (1, "Alice Johnson", "alice@example.com", 28, "2020-01-15", "Premium", 1500.00, "US"),
        (2, "Bob Smith", "bob.smith@company.com", 35, "2019-06-20", "Standard", 800.00, "CA"),
        (3, "Carol Davis", "carol.d@email.com", 42, "2021-03-10", "Premium", 2000.00, "UK"),
        (4, "David Wilson", "david.w@example.org", 31, "2020-11-05", "Standard", 950.00, "US"),
        (5, "Emma Brown", "emma.brown@mail.com", 29, "2022-01-18", "Premium", 1800.00, "AU"),
        
        # Records with quality issues
        (6, "Frank Miller", "invalid-email", 45, "2021-07-22", "Standard", 700.00, "US"),  # Invalid email
        (7, "Grace Lee", "grace@example.com", -5, "2020-09-15", "Premium", 1200.00, "CA"),  # Negative age
        (8, "Henry Chen", "henry.chen@email.com", 150, "2019-12-01", "Standard", 600.00, "CN"),  # Age > 120
        (9, None, "unknown@email.com", 30, "2021-04-10", "Standard", 500.00, "UK"),  # Null name
        (10, "Isabel Garcia", None, 33, "2020-08-25", "Premium", 1600.00, "ES"),  # Null email
        (11, "Jack Taylor", "jack@example.com", 28, "invalid-date", "Standard", 850.00, "US"),  # Invalid date
        (12, "Kate Anderson", "kate@email.com", 40, "2021-02-14", "InvalidTier", 1100.00, "AU"),  # Invalid tier
        (13, "Leo Martinez", "leo@example.com", 35, "2020-05-30", "Premium", -500.00, "MX"),  # Negative balance
        (14, "Maria Rodriguez", "maria@email.com", 38, "2019-10-18", "Standard", None, "BR"),  # Null balance
        (15, "Nathan White", "nathan@example.com", 27, "2022-03-22", "Premium", 1700.00, None),  # Null country
        
        # More valid records
        (16, "Olivia Harris", "olivia.h@email.com", 32, "2021-11-08", "Standard", 920.00, "US"),
        (17, "Paul Clark", "paul.clark@company.com", 44, "2020-04-17", "Premium", 1950.00, "CA"),
        (18, "Quinn Lewis", "quinn@example.com", 36, "2021-08-29", "Standard", 780.00, "UK"),
    ]
    
    schema = StructType([
        StructField("customer_id", IntegerType(), False),
        StructField("name", StringType(), True),
        StructField("email", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("signup_date", StringType(), True),
        StructField("tier", StringType(), True),
        StructField("account_balance", DoubleType(), True),
        StructField("country", StringType(), True)
    ])
    
    return spark.createDataFrame(data, schema)

# Create sample data
customers_df = create_customer_data_with_quality_issues()

print(f"Created customer dataset with {customers_df.count()} records")
print("\nSample data (showing quality issues):")
customers_df.show(truncate=False)

print("\nData quality issues present:")
print("  ✗ Invalid email formats")
print("  ✗ Invalid ages (negative, > 120)")
print("  ✗ Null values in required fields")
print("  ✗ Invalid date formats")
print("  ✗ Invalid tier values")
print("  ✗ Negative account balances")

## 4. Implementing DLT-Style Expectations

Let's create expectations for our customer data using the three different strategies.

In [None]:
# Define expectations for customer data quality

# WARN expectations - Monitor quality issues without blocking
warn_expectations = [
    Expectation(
        name="valid_email_format",
        constraint="email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}$'",
        expectation_type=ExpectationType.WARN
    ),
    Expectation(
        name="valid_signup_date",
        constraint="signup_date RLIKE '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'",
        expectation_type=ExpectationType.WARN
    ),
]

# DROP expectations - Remove records that violate constraints
drop_expectations = [
    Expectation(
        name="valid_age_range",
        constraint="age >= 0 AND age <= 120",
        expectation_type=ExpectationType.DROP
    ),
    Expectation(
        name="valid_tier",
        constraint="tier IN ('Standard', 'Premium')",
        expectation_type=ExpectationType.DROP
    ),
    Expectation(
        name="non_negative_balance",
        constraint="account_balance >= 0",
        expectation_type=ExpectationType.DROP
    ),
]

# FAIL expectations - Critical quality gates that must pass
fail_expectations = [
    Expectation(
        name="required_name",
        constraint="name IS NOT NULL AND length(name) > 0",
        expectation_type=ExpectationType.FAIL
    ),
    Expectation(
        name="required_email",
        constraint="email IS NOT NULL AND length(email) > 0",
        expectation_type=ExpectationType.FAIL
    ),
]

print("✅ Expectations defined:")
print(f"  - WARN expectations: {len(warn_expectations)}")
print(f"  - DROP expectations: {len(drop_expectations)}")
print(f"  - FAIL expectations: {len(fail_expectations)}")
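
Before relying on a pattern-based constraint, it can help to check the pattern against known-good and known-bad values. The sketch below uses Python's `re` module as a stand-in for Spark's `RLIKE`; this assumes the pattern uses only syntax that both regex engines share, which appears to hold for this simple email pattern. The sample values come from the dataset in section 3.

```python
import re

# Same pattern as the valid_email_format expectation, as the regex engine sees it
EMAIL_PATTERN = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

samples = {
    "alice@example.com": True,
    "bob.smith@company.com": True,
    "invalid-email": False,
}

for value, expected in samples.items():
    # RLIKE succeeds on a partial match, so search() is the closer analogue
    matched = EMAIL_PATTERN.search(value) is not None
    print(f"{value!r}: matched={matched}, expected={expected}")
```

Note the layers of escaping in the `valid_email_format` constraint string: Python reduces `\\\\.` to `\\.`, and, assuming Spark's default string-literal parsing, the SQL parser reduces that to `\.` before the regex engine sees it.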

## 5. DLT Pipeline Simulation

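Let's create a functional pipeline that applies expectations in a fixed order, FAIL → DROP → WARN, so that critical gates are checked before any rows are filtered. This ordering is a design choice of the simulation.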

In [None]:
class DLTPipeline:
    """
    Functional-style DLT pipeline simulation that applies expectations declaratively.
    The add_* methods return a new pipeline rather than mutating this one.
    """
    
    def __init__(self, name: str):
        self.name = name
        self.expectations: List[Expectation] = []
        self.results: List[ExpectationResult] = []
    
    def add_expectation(self, expectation: Expectation) -> 'DLTPipeline':
        """Add an expectation to the pipeline (returns new pipeline)"""
        new_pipeline = DLTPipeline(self.name)
        new_pipeline.expectations = self.expectations + [expectation]
        new_pipeline.results = self.results.copy()
        return new_pipeline
    
    def add_expectations(self, expectations: List[Expectation]) -> 'DLTPipeline':
        """Add multiple expectations (returns new pipeline)"""
        new_pipeline = DLTPipeline(self.name)
        new_pipeline.expectations = self.expectations + expectations
        new_pipeline.results = self.results.copy()
        return new_pipeline
    
    def execute(self, df: DataFrame) -> Tuple[DataFrame, List[ExpectationResult]]:
        """
        Execute pipeline with all expectations.
        Returns the transformed DataFrame and results without mutating the input;
        progress is printed as a side effect for demonstration purposes.
        """
        current_df = df
        all_results = []
        
        # Sort expectations by type: FAIL first, then DROP, then WARN
        expectation_order = {
            ExpectationType.FAIL: 0,
            ExpectationType.DROP: 1,
            ExpectationType.WARN: 2
        }
        sorted_expectations = sorted(
            self.expectations,
            key=lambda e: expectation_order[e.expectation_type]
        )
        
        print(f"\n🔄 Executing DLT Pipeline: {self.name}")
        print(f"Total expectations: {len(sorted_expectations)}\n")
        
        for expectation in sorted_expectations:
            # Evaluate expectation
            transformed_df, result = expectation.evaluate(current_df)
            all_results.append(result)
            
            # Print result
            print(f"{'='*70}")
            print(result)
            print()
            
            # Handle FAIL expectation
            if expectation.expectation_type == ExpectationType.FAIL and not result.passed:
                print("🚨 PIPELINE FAILED - Critical expectation violated!")
                print(f"   Stopping execution at expectation: {expectation.name}\n")
                return current_df, all_results
            
            # Update current DataFrame
            current_df = transformed_df
        
        # Count once and reuse; each count() triggers a Spark job
        input_count = df.count()
        output_count = current_df.count()
        
        print(f"{'='*70}")
        print("✅ Pipeline completed successfully")
        print(f"   Input records: {input_count:,}")
        print(f"   Output records: {output_count:,}")
        print(f"   Records dropped: {input_count - output_count:,}\n")
        
        return current_df, all_results

# Create and configure DLT pipeline
customer_pipeline = (
    DLTPipeline("customer_quality_pipeline")
    .add_expectations(fail_expectations)   # Critical gates first
    .add_expectations(drop_expectations)   # Filter violations
    .add_expectations(warn_expectations)   # Monitor quality
)

print(f"✅ Pipeline configured with {len(customer_pipeline.expectations)} expectations")
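
If strict immutability matters, one option is to model the pipeline configuration as a frozen dataclass holding a tuple of rules. This is a minimal Spark-free sketch: `PipelineSpec` and `with_rules` are hypothetical names, and rules are represented as plain `(name, constraint, action)` tuples rather than the `Expectation` class above.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Rule = Tuple[str, str, str]  # (name, constraint, action)

# Lower number runs first: critical gates, then filters, then monitors
PRIORITY = {"fail": 0, "drop": 1, "warn": 2}


@dataclass(frozen=True)
class PipelineSpec:
    name: str
    rules: Tuple[Rule, ...] = ()

    def with_rules(self, *new_rules: Rule) -> "PipelineSpec":
        """Return a new spec; the original is never modified."""
        return replace(self, rules=self.rules + tuple(new_rules))

    def ordered_rules(self) -> Tuple[Rule, ...]:
        """Stable sort by action priority, preserving declaration order within a group."""
        return tuple(sorted(self.rules, key=lambda rule: PRIORITY[rule[2]]))


base = PipelineSpec("customer_quality_pipeline")
spec = (
    base
    .with_rules(("valid_email_format", "email IS NOT NULL", "warn"))
    .with_rules(("valid_age_range", "age >= 0 AND age <= 120", "drop"),
                ("required_name", "name IS NOT NULL", "fail"))
)

print(len(base.rules), len(spec.rules))
print([rule[0] for rule in spec.ordered_rules()])
```

Because the dataclass is frozen and the rules live in a tuple, accidental mutation raises an error instead of silently changing a shared pipeline definition.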

In [None]:
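# Execute the DLT pipeline
cleaned_customers_df, pipeline_results = customer_pipeline.execute(customers_df)

# Note: the sample data contains a NULL name, so the FAIL expectation
# `required_name` is violated and execute() returns the input unchanged.
print("\n📊 Pipeline Output:")
cleaned_customers_df.show(truncate=False)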

## 6. Quality Metrics and Monitoring

DLT expectations automatically generate quality metrics. Let's create a functional metrics reporting system.

In [None]:
@dataclass(frozen=True)
class QualityReport:
    """Immutable quality report aggregating expectation results"""
    pipeline_name: str
    results: List[ExpectationResult]
    input_record_count: int
    output_record_count: int
    
    @property
    def total_expectations(self) -> int:
        return len(self.results)
    
    @property
    def passed_expectations(self) -> int:
        return sum(1 for r in self.results if r.passed)
    
    @property
    def failed_expectations(self) -> int:
        return sum(1 for r in self.results if not r.passed)
    
    @property
    def records_dropped(self) -> int:
        return self.input_record_count - self.output_record_count
    
    @property
    def drop_rate(self) -> float:
        return (self.records_dropped / self.input_record_count * 100) if self.input_record_count > 0 else 0.0
    
    def get_results_by_type(self, expectation_type: ExpectationType) -> List[ExpectationResult]:
        """Filter results by expectation type"""
        return [r for r in self.results if r.expectation_type == expectation_type]
    
    def print_summary(self):
        """Print comprehensive quality report"""
        print(f"\n{'='*80}")
        print(f"📊 DATA QUALITY REPORT: {self.pipeline_name}")
        print(f"{'='*80}")
        
        print(f"\n📈 Pipeline Statistics:")
        print(f"  Input Records:     {self.input_record_count:,}")
        print(f"  Output Records:    {self.output_record_count:,}")
        print(f"  Records Dropped:   {self.records_dropped:,} ({self.drop_rate:.1f}%)")
        
        print(f"\n✅ Expectation Summary:")
        print(f"  Total Expectations: {self.total_expectations}")
        print(f"  Passed:            {self.passed_expectations}")
        print(f"  Failed:            {self.failed_expectations}")
        
        # Break down by expectation type
        for exp_type in ExpectationType:
            type_results = self.get_results_by_type(exp_type)
            if type_results:
                passed = sum(1 for r in type_results if r.passed)
                total = len(type_results)
                print(f"\n  {exp_type.value.upper()} Expectations: {passed}/{total} passed")
                
                for result in type_results:
                    status = "✅" if result.passed else "❌"
                    print(f"    {status} {result.expectation_name}: "
                          f"{result.violated_records:,} violations ({result.violation_rate:.1f}%)")
        
        print(f"\n{'='*80}\n")
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert report to dictionary for JSON serialization"""
        return {
            "pipeline_name": self.pipeline_name,
            "input_record_count": self.input_record_count,
            "output_record_count": self.output_record_count,
            "records_dropped": self.records_dropped,
            "drop_rate": self.drop_rate,
            "total_expectations": self.total_expectations,
            "passed_expectations": self.passed_expectations,
            "failed_expectations": self.failed_expectations,
            "expectations": [
                {
                    "name": r.expectation_name,
                    "type": r.expectation_type.value,
                    "constraint": r.constraint,
                    "passed": r.passed,
                    "total_records": r.total_records,
                    "valid_records": r.valid_records,
                    "violated_records": r.violated_records,
                    "violation_rate": r.violation_rate
                }
                for r in self.results
            ]
        }

# Generate quality report
quality_report = QualityReport(
    pipeline_name=customer_pipeline.name,
    results=pipeline_results,
    input_record_count=customers_df.count(),
    output_record_count=cleaned_customers_df.count()
)

quality_report.print_summary()

# Export to JSON for monitoring systems
print("📄 Quality Report (JSON format for monitoring):")
print(json.dumps(quality_report.to_dict(), indent=2))
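
A JSON-style report like this can feed a simple threshold check. The sketch below is one possible shape and is not part of DLT: `find_alerts`, the threshold defaults, and the sample numbers are hypothetical, and the dictionary keys mirror those produced by `to_dict()` above.

```python
from typing import Any, Dict, List


def find_alerts(report: Dict[str, Any],
                max_drop_rate: float = 5.0,
                max_violation_rate: float = 10.0) -> List[str]:
    """Return human-readable alerts for metrics that exceed the given percentages."""
    alerts = []
    if report["drop_rate"] > max_drop_rate:
        alerts.append(f"drop rate {report['drop_rate']:.1f}% exceeds {max_drop_rate:.1f}%")
    for exp in report["expectations"]:
        if exp["type"] == "fail" and not exp["passed"]:
            alerts.append(f"critical expectation failed: {exp['name']}")
        elif exp["violation_rate"] > max_violation_rate:
            alerts.append(f"{exp['name']} violation rate {exp['violation_rate']:.1f}%")
    return alerts


# Hypothetical report values, for illustration only
sample_report = {
    "drop_rate": 12.5,
    "expectations": [
        {"name": "required_name", "type": "fail", "passed": True, "violation_rate": 0.0},
        {"name": "valid_age_range", "type": "drop", "passed": False, "violation_rate": 12.5},
        {"name": "valid_signup_date", "type": "warn", "passed": False, "violation_rate": 6.3},
    ],
}

for alert in find_alerts(sample_report):
    print("ALERT:", alert)
```

Keeping the check a pure function of the report dictionary makes the thresholds easy to unit-test before connecting them to a notification channel.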

## 7. Actual DLT Pipeline Code Examples

The examples below show the Python you would write in a Databricks notebook configured as a DLT pipeline. They are wrapped in string literals here so that this notebook still runs without the DLT runtime.

In [None]:
# NOTE: This code is for demonstration and would run in a Databricks DLT pipeline notebook
# It will not execute in a standard notebook without DLT runtime

# Example 1: Bronze Layer with WARN expectations
# Monitor data quality issues without blocking ingestion

"""
import dlt
from pyspark.sql.functions import col, to_date

@dlt.table(
    name="customers_bronze",
    comment="Raw customer data with quality monitoring"
)
@dlt.expect("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
@dlt.expect("valid_date_format", "signup_date RLIKE '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'")
def customers_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/source/customers/")
    )
"""

print("Example 1: Bronze layer with WARN expectations")
print("  - Monitors email format quality")
print("  - Monitors date format compliance")
print("  - Records metrics but doesn't drop records")
print()

In [None]:
# Example 2: Silver Layer with DROP expectations
# Remove records that violate business rules

"""
@dlt.table(
    name="customers_silver",
    comment="Cleaned customer data with enforced quality constraints"
)
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
@dlt.expect_or_drop("valid_tier", "tier IN ('Standard', 'Premium', 'Enterprise')")
@dlt.expect_or_drop("non_negative_balance", "account_balance >= 0")
@dlt.expect_or_drop("valid_country", "country IS NOT NULL AND length(country) = 2")
def customers_silver():
    return (
        dlt.read_stream("customers_bronze")
        .select(
            "customer_id",
            "name",
            "email",
            col("age").cast("int"),
            to_date("signup_date").alias("signup_date"),
            "tier",
            col("account_balance").cast("double"),
            "country"
        )
    )
"""

print("Example 2: Silver layer with DROP expectations")
print("  - Enforces age range (18-120)")
print("  - Validates tier values")
print("  - Ensures non-negative account balance")
print("  - Validates country code format")
print("  - Drops records that violate any constraint")
print()

In [None]:
# Example 3: Gold Layer with FAIL expectations
# Critical quality gates for business-critical tables

"""
@dlt.table(
    name="customers_gold",
    comment="Production-ready customer data with strict quality guarantees"
)
@dlt.expect_or_fail("no_nulls_in_key_fields",
                     "customer_id IS NOT NULL AND name IS NOT NULL AND email IS NOT NULL")
# Row-level expectations cannot verify uniqueness; the groupBy below provides it
@dlt.expect("high_quality_data", "age BETWEEN 18 AND 120")
def customers_gold():
    return (
        dlt.read("customers_silver")
        .groupBy("customer_id")  # Ensure uniqueness
        .agg(
            F.first("name").alias("name"),
            F.first("email").alias("email"),
            F.first("age").alias("age"),
            F.first("signup_date").alias("signup_date"),
            F.first("tier").alias("tier"),
            F.sum("account_balance").alias("total_balance"),
            F.first("country").alias("country")
        )
    )
"""

print("Example 3: Gold layer with FAIL expectations")
print("  - Fails pipeline if key fields are null")
print("  - Ensures customer_id uniqueness through the groupBy aggregation")
print("  - Monitors (but doesn't fail on) age quality")
print("  - Stops pipeline execution on critical violations")
print()

In [None]:
# Example 4: Advanced DLT patterns with custom Python functions

"""
# Custom validation function (pure function)
from typing import List

def is_valid_email_domain(email: str, allowed_domains: List[str]) -> bool:
    \"\"\"Pure function to validate email domain\"\"\"
    if not email or '@' not in email:
        return False
    domain = email.split('@')[1]
    return domain in allowed_domains

# Register under a SQL-callable name so the expectation string can reference it;
# udf(...) alone only creates a DataFrame-API function, not a SQL function.
# (Assumes the pipeline runtime permits registered Python UDFs in expectations.)
from pyspark.sql.types import BooleanType

spark.udf.register("is_valid_email_domain", is_valid_email_domain, BooleanType())

@dlt.table(
    name="customers_with_domain_validation",
    comment="Customers with domain-specific validation"
)
@dlt.expect_or_drop(
    "approved_email_domain",
    "is_valid_email_domain(email, array('example.com', 'company.com', 'email.com'))"
)
def customers_with_domain_validation():
    return (
        dlt.read("customers_silver")
        .filter(col("tier") == "Premium")  # Only Premium customers need domain validation
    )
"""

print("Example 4: Advanced patterns with custom functions")
print("  - Uses custom Python functions for complex validation")
print("  - Combines functional programming with DLT expectations")
print("  - Demonstrates composable validation logic")
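
Because `is_valid_email_domain` is a pure Python function, its logic can be checked without Spark or a DLT runtime. The snippet below restates the function from Example 4 and exercises it with a few inputs; the test values are illustrative.

```python
from typing import List, Optional


def is_valid_email_domain(email: Optional[str], allowed_domains: List[str]) -> bool:
    """Pure function to validate email domain (same logic as Example 4)."""
    if not email or "@" not in email:
        return False
    domain = email.split("@")[1]
    return domain in allowed_domains


allowed = ["example.com", "company.com", "email.com"]

print(is_valid_email_domain("alice@example.com", allowed))
print(is_valid_email_domain("frank@other.org", allowed))
print(is_valid_email_domain(None, allowed))
```

The comparison is case-sensitive as written, so an address such as `Alice@Example.COM` would be rejected; lowercasing the domain before the lookup may be worth considering.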

## 8. Comparing Imperative vs Declarative Approaches

Let's contrast imperative data quality checks with DLT's declarative approach.

In [None]:
print("="*80)
print("IMPERATIVE vs DECLARATIVE DATA QUALITY")
print("="*80)

print("\n❌ IMPERATIVE APPROACH (Anti-Pattern):")
print("""
def process_customers_imperative(df):
    # Scattered validation logic with side effects
    
    # Manual null checks
    null_names = df.filter(col("name").isNull()).count()
    if null_names > 0:
        print(f"WARNING: {null_names} records with null names")
        df = df.filter(col("name").isNotNull())
    
    # Manual age validation
    invalid_ages = df.filter((col("age") < 0) | (col("age") > 120)).count()
    if invalid_ages > 0:
        print(f"WARNING: {invalid_ages} records with invalid ages")
        df = df.filter((col("age") >= 0) & (col("age") <= 120))
    
    # Manual email validation
    email_pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'
    invalid_emails = df.filter(~col("email").rlike(email_pattern)).count()
    if invalid_emails > 0:
        print(f"WARNING: {invalid_emails} records with invalid emails")
        # Maybe drop, maybe keep - inconsistent handling
    
    # Manual tier validation
    df = df.filter(col("tier").isin(['Standard', 'Premium']))
    
    return df

# Issues:
# 1. Side effects (printing) mixed with transformation logic
# 2. Inconsistent error handling (some drop, some warn)
# 3. No centralized quality metrics
# 4. Hard to test individual validations
# 5. Validation logic scattered throughout code
# 6. No automatic quality monitoring
""")

print("\n✅ DECLARATIVE APPROACH (Best Practice):")
print("""
@dlt.table(name="customers_clean")
@dlt.expect_or_fail("valid_name", "name IS NOT NULL")
@dlt.expect_or_drop("valid_age", "age >= 0 AND age <= 120")
@dlt.expect("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
@dlt.expect_or_drop("valid_tier", "tier IN ('Standard', 'Premium')")
def customers_clean():
    return spark.table("customers_source")

# Benefits:
# 1. Clear separation of concerns (expectations vs transformations)
# 2. Consistent, explicit error handling strategy
# 3. Automatic quality metrics collection
# 4. Easy to test and reason about
# 5. Self-documenting quality requirements
# 6. Built-in monitoring and alerting
# 7. Composable and reusable
""")

print("\n📊 Key Differences:")
print("""
┌─────────────────────┬──────────────────────────┬───────────────────────────┐
│ Aspect              │ Imperative               │ Declarative (DLT)         │
├─────────────────────┼──────────────────────────┼───────────────────────────┤
│ Code Style          │ How to validate          │ What to validate          │
│ Side Effects        │ Mixed with logic         │ Isolated in decorators    │
│ Quality Metrics     │ Manual tracking          │ Automatic collection      │
│ Consistency         │ Varies by implementation │ Standardized              │
│ Testability         │ Difficult                │ Easy                      │
│ Monitoring          │ Custom implementation    │ Built-in                  │
│ Composability       │ Low                      │ High                      │
│ Maintainability     │ Low                      │ High                      │
└─────────────────────┴──────────────────────────┴───────────────────────────┘
""")

## 9. Best Practices and Patterns

Guidelines for effective use of DLT expectations in production pipelines.

In [None]:
print("="*80)
print("DLT EXPECTATIONS BEST PRACTICES")
print("="*80)

print("""
1. ✅ CHOOSE THE RIGHT EXPECTATION TYPE

   @dlt.expect() - WARN
   • Use for: Monitoring data quality trends
   • Example: Email format compliance, date format standardization
   • Benefit: Visibility without blocking data flow
   
   @dlt.expect_or_drop() - DROP
   • Use for: Business rule enforcement
   • Example: Valid age ranges, approved categories, positive amounts
   • Benefit: Clean data without pipeline failures
   
   @dlt.expect_or_fail() - FAIL
   • Use for: Critical quality gates
   • Example: Required fields, data corruption detection
   • Benefit: Prevent bad data from reaching production

2. ✅ LAYER YOUR EXPECTATIONS

   Bronze Layer:
   • Use mostly WARN expectations
   • Monitor raw data quality issues
   • Maintain complete audit trail
   
   Silver Layer:
   • Use DROP expectations for business rules
   • Apply data cleansing and standardization
   • Create clean, validated datasets
   
   Gold Layer:
   • Use FAIL expectations for critical constraints
   • Enforce strict quality for business-critical tables
   • Ensure production-ready data quality

3. ✅ WRITE CLEAR, TESTABLE CONSTRAINTS

   Good:
   @dlt.expect("positive_amount", "amount > 0")
   @dlt.expect("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+$'")
   
   Better:
   # More descriptive names and comprehensive checks
   @dlt.expect("transaction_amount_within_limits", "amount > 0 AND amount < 1000000")
   @dlt.expect("customer_email_format_valid", 
               "email IS NOT NULL AND email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$'")

4. ✅ USE MEANINGFUL EXPECTATION NAMES

   Pattern: [entity]_[field]_[validation_type]
   
   Good names:
   • customer_age_within_valid_range
   • order_total_positive_value
   • product_sku_unique_identifier
   • transaction_date_not_future
   
   Avoid:
   • check1, validation_rule_2
   • test, verify
   • age_check (too vague)

5. ✅ COMPOSE EXPECTATIONS FUNCTIONALLY

   # Create reusable expectation rule sets (name -> constraint)
   def standard_customer_drop_rules():
       return {
           "valid_age": "age >= 18 AND age <= 120",
           "valid_tier": "tier IN ('Standard', 'Premium')",
       }

   def standard_customer_fail_rules():
       return {"valid_email": "email IS NOT NULL"}
   
   # Apply to multiple tables with the expect_all variants
   @dlt.table(name="us_customers")
   @dlt.expect_all_or_drop(standard_customer_drop_rules())
   @dlt.expect_all_or_fail(standard_customer_fail_rules())
   def us_customers():
       return spark.table("customers").filter(col("country") == "US")

6. ✅ MONITOR AND ALERT ON QUALITY METRICS

   • Set up dashboards for expectation violations
   • Configure alerts for critical expectation failures
   • Track quality trends over time
   • Review violation patterns regularly
   • Use metrics to improve upstream data sources

7. ✅ TEST EXPECTATIONS BEFORE PRODUCTION

   • Use sample data to validate expectation logic
   • Test with both valid and invalid records
   • Verify expectation names and constraints
   • Ensure appropriate expectation types
   • Check drop rates are reasonable

8. ✅ DOCUMENT BUSINESS RULES IN EXPECTATIONS

   # Business rule: customers must be 18 or older per legal requirements
   @dlt.table(
       name="customers",
       comment="Customer master table; customers must be 18 or older per legal requirements"
   )
   @dlt.expect_or_drop("customer_age_legal_minimum", "age >= 18")
   def customers():
       return spark.table("source_customers")
""")

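Rule sets expressed as plain dictionaries compose with ordinary Python. The sketch below merges reusable rule groups and rejects duplicate names; `merge_rules` and the rule groups are hypothetical helpers. In a DLT pipeline, a merged dictionary of this shape could then be passed to `dlt.expect_all`, `dlt.expect_all_or_drop`, or `dlt.expect_all_or_fail`, which accept a name-to-constraint mapping.

```python
from typing import Dict

Rules = Dict[str, str]  # expectation name -> SQL constraint


def merge_rules(*groups: Rules) -> Rules:
    """Combine rule groups into one dict, refusing silently overwritten names."""
    merged: Rules = {}
    for group in groups:
        duplicates = merged.keys() & group.keys()
        if duplicates:
            raise ValueError(f"duplicate expectation names: {sorted(duplicates)}")
        merged.update(group)
    return merged


AGE_RULES = {"customer_age_within_valid_range": "age >= 18 AND age <= 120"}
TIER_RULES = {"customer_tier_known_value": "tier IN ('Standard', 'Premium')"}

customer_drop_rules = merge_rules(AGE_RULES, TIER_RULES)
print(customer_drop_rules)
```

Rejecting duplicate names up front avoids one rule quietly replacing another when groups from different teams are combined.
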
## 10. Anti-Patterns to Avoid

In [None]:
print("="*80)
print("DLT EXPECTATIONS ANTI-PATTERNS")
print("="*80)

print("""
❌ 1. OVERLY COMPLEX EXPECTATIONS

Bad:
@dlt.expect("complex_validation",
           "(age >= 18 AND tier = 'Premium' AND balance > 1000) OR "
           "(age >= 21 AND tier = 'Standard' AND balance > 500) OR "
           "(country IN ('US', 'CA') AND age >= 16)")

Better:
# Break into separate, testable expectations
# (these per-tier rules are not logically equivalent to the OR-combined
# expression above, so confirm the intended business rule before splitting)
@dlt.expect_or_drop("minimum_age", "age >= 16")
@dlt.expect_or_drop("premium_requirements",
                     "tier != 'Premium' OR (age >= 18 AND balance > 1000)")
@dlt.expect_or_drop("standard_requirements",
                     "tier != 'Standard' OR (age >= 21 AND balance > 500)")

❌ 2. USING WRONG EXPECTATION TYPE

Bad:
# Using FAIL for non-critical quality monitoring
@dlt.expect_or_fail("preferred_email_domain", 
                     "email LIKE '%@company.com'")

Better:
# Use WARN for monitoring, not enforcement
@dlt.expect("preferred_email_domain",
           "email LIKE '%@company.com'")

❌ 3. MIXING VALIDATION WITH TRANSFORMATION

Bad:
@dlt.table(name="customers")
@dlt.expect_or_drop("valid_age", "age >= 18")
def customers():
    # Don't mix validation logic in the transformation
    df = spark.table("source")
    df = df.filter(col("country") == "US")  # ❌ Buried business logic
    df = df.filter(col("status") == "active")  # ❌ Hard to track
    return df

Better:
@dlt.table(name="customers")
@dlt.expect_or_drop("valid_age", "age >= 18")
@dlt.expect_or_drop("us_customers_only", "country = 'US'")
@dlt.expect_or_drop("active_customers_only", "status = 'active'")
def customers():
    # Pure transformation, validation in expectations
    return spark.table("source")

❌ 4. NO MONITORING OR ALERTING

Bad:
# Define expectations but never check the results
@dlt.expect("quality_check", "amount > 0")
# No dashboard, no alerts, no review process

Better:
# Set up comprehensive monitoring
# - Create quality dashboards
# - Configure alerts for critical violations
# - Regular review of quality metrics
# - Automated reports on data quality trends

❌ 5. VAGUE EXPECTATION NAMES

Bad:
@dlt.expect("check1", "age > 0")
@dlt.expect("validation", "email IS NOT NULL")
@dlt.expect("test", "balance >= 0")

Better:
@dlt.expect("customer_age_positive", "age > 0")
@dlt.expect("customer_email_required", "email IS NOT NULL")
@dlt.expect("account_balance_non_negative", "balance >= 0")

❌ 6. IGNORING DROPPED RECORDS

Bad:
@dlt.expect_or_drop("valid_data", "complex_condition")
# Never check how many records are being dropped
# Could be silently losing important data

Better:
# Monitor drop rates
# Alert if drop rate exceeds threshold (e.g., >5%)
# Investigate root causes of violations
# Fix upstream data issues

❌ 7. DUPLICATE VALIDATION LOGIC

Bad:
# Same validation in multiple places
@dlt.table(name="customers_us")
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
def customers_us():
    return spark.table("source").filter(col("country") == "US")

@dlt.table(name="customers_ca")
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")  # ❌ Duplicate
def customers_ca():
    return spark.table("source").filter(col("country") == "CA")

Better:
# Create reusable expectation definitions
STANDARD_AGE_VALIDATION = ("valid_age", "age >= 18 AND age <= 120")

@dlt.table(name="customers_us")
@dlt.expect_or_drop(*STANDARD_AGE_VALIDATION)
def customers_us():
    return spark.table("source").filter(col("country") == "US")
""")

## 11. Testing DLT Pipelines

This section tests expectations against small, hand-built DataFrames with known violations, so that the rules can be verified before the pipeline is deployed.

In [None]:
# Testing framework for DLT expectations

def test_expectation(expectation: Expectation, test_data: DataFrame,
                     expected_violations: int) -> bool:
    """
    Check an expectation against test data.
    Prints the outcome and returns True if the number of violated
    records matches expected_violations, False otherwise.
    """
    _, result = expectation.evaluate(test_data)

    passed = result.violated_records == expected_violations

    # Report the same detail line for both outcomes
    status = "✅ PASS" if passed else "❌ FAIL"
    print(f"{status}: {expectation.name}")
    print(f"   Expected {expected_violations} violations, got {result.violated_records}")

    return passed

# Create test datasets
print("="*80)
print("TESTING DLT EXPECTATIONS")
print("="*80)

# Test 1: Valid age expectation
print("\nTest 1: Age validation expectation")
age_test_data = spark.createDataFrame([
    (1, 25),   # Valid
    (2, 30),   # Valid
    (3, -5),   # Invalid - negative
    (4, 150),  # Invalid - too high
    (5, 45),   # Valid
], ["id", "age"])

age_expectation = Expectation(
    name="valid_age_range",
    constraint="age >= 0 AND age <= 120",
    expectation_type=ExpectationType.DROP
)

test_expectation(age_expectation, age_test_data, expected_violations=2)

# Test 2: Email format expectation
print("\nTest 2: Email format validation")
email_test_data = spark.createDataFrame([
    (1, "valid@example.com"),      # Valid
    (2, "another@test.org"),       # Valid
    (3, "invalid-email"),           # Invalid
    (4, "missing-at-sign.com"),    # Invalid
    (5, "user@domain.co.uk"),      # Valid
], ["id", "email"])

email_expectation = Expectation(
    name="valid_email_format",
    constraint="email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\\\.[a-zA-Z]{2,}$'",
    expectation_type=ExpectationType.WARN
)

test_expectation(email_expectation, email_test_data, expected_violations=2)

# Test 3: Tier validation
print("\nTest 3: Tier value validation")
tier_test_data = spark.createDataFrame([
    (1, "Standard"),     # Valid
    (2, "Premium"),      # Valid
    (3, "InvalidTier"),  # Invalid
    (4, "Standard"),     # Valid
    (5, None),           # Invalid - NULL: the IN check yields NULL, not FALSE
], ["id", "tier"])

tier_expectation = Expectation(
    name="valid_tier",
    constraint="tier IN ('Standard', 'Premium')",
    expectation_type=ExpectationType.DROP
)

test_expectation(tier_expectation, tier_test_data, expected_violations=2)

print("\n" + "="*80)
print("✅ All expectation tests completed")

## Summary

In this notebook, we explored declarative data quality with Delta Live Tables expectations:

### Key Concepts Covered

1. **DLT Expectations Framework**
   - Three expectation types: WARN (`@dlt.expect`), DROP (`@dlt.expect_or_drop`), FAIL (`@dlt.expect_or_fail`)
   - Declarative quality rules defined as decorators
   - Automatic quality metrics collection

2. **Functional Programming Alignment**
   - Declarative vs imperative approaches
   - Pure functions for expectation evaluation
   - Immutable expectation and result objects
   - Composable quality rules

3. **Quality Metrics and Monitoring**
   - Automated violation tracking
   - Quality reports and dashboards
   - Alerting on critical failures

4. **Layered Quality Strategy**
   - Bronze: WARN expectations for monitoring
   - Silver: DROP expectations for cleansing
   - Gold: FAIL expectations for critical gates

5. **Testing and Validation**
   - Functional testing patterns for expectations
   - Test data generation
   - Expectation verification before deployment

### Best Practices Demonstrated

- ✅ **Declarative Quality**: Define what quality means, not how to enforce it
- ✅ **Separation of Concerns**: Quality rules separate from transformation logic
- ✅ **Composability**: Reusable expectation patterns across tables
- ✅ **Observability**: Built-in quality metrics and monitoring
- ✅ **Testability**: Pure functions for expectation validation

### Functional Programming Benefits

- **Declarative**: Expectations describe desired quality state
- **Immutable**: Expectations don't modify pipeline definitions
- **Composable**: Stack multiple expectations on tables
- **Testable**: Pure evaluation functions easy to test
- **Maintainable**: Clear, self-documenting quality requirements

### Next Steps

- Practice defining expectations for your data
- Set up DLT pipelines in Databricks
- Create quality monitoring dashboards
- Implement automated quality alerts
- Build reusable expectation libraries

Delta Live Tables expectations keep quality rules declarative, composable, and observable, which makes them a natural fit for functional pipeline design.

## Exercises

Practice implementing DLT-style expectations for your own data.

In [None]:
print("="*80)
print("EXERCISES: Practice DLT Expectations")
print("="*80)

print("""
Exercise 1: Create Transaction Validation Expectations
--------------------------------------------------------
Create a set of expectations for a transaction dataset:
- transaction_id must not be null (FAIL); uniqueness needs a separate aggregate check, since expectations are evaluated per row
- amount must be positive (DROP)
- transaction_date must be a valid date (DROP)
- payment_method must be in approved list (DROP)
- Monitor transaction amounts > $10,000 (WARN)

Exercise 2: Implement Multi-Layer Quality Strategy
---------------------------------------------------
Design bronze/silver/gold expectations for product data:
Bronze:
- Monitor SKU format compliance
- Track missing product descriptions

Silver:
- Drop products with invalid categories
- Remove products with negative prices

Gold:
- Fail if required fields are null
- Ensure price consistency across systems

Exercise 3: Create Reusable Expectation Patterns
-------------------------------------------------
Build a library of reusable expectations:
- Email validation pattern
- Phone number validation pattern
- Date range validation pattern
- Amount range validation pattern

Exercise 4: Implement Quality Monitoring
-----------------------------------------
Create a quality monitoring dashboard:
- Track violation rates over time
- Alert when drop rate exceeds threshold
- Generate quality trend reports
- Identify top quality issues

Exercise 5: Test Your Expectations
-----------------------------------
Write comprehensive tests for your expectations:
- Create test data with known violations
- Verify expectation behavior
- Test edge cases
- Validate quality metrics
""")

# Exercise templates (implement these!)

def create_transaction_expectations() -> List[Expectation]:
    """
    YOUR TASK: Create expectations for transaction data
    """
    # TODO: Implement transaction expectations
    raise NotImplementedError("Exercise 1: implement transaction expectations")

def create_product_expectations_bronze() -> List[Expectation]:
    """
    YOUR TASK: Create bronze layer expectations for products
    """
    # TODO: Implement bronze layer expectations
    raise NotImplementedError("Exercise 2: implement bronze layer expectations")

def create_reusable_email_expectation(column_name: str) -> Expectation:
    """
    YOUR TASK: Create reusable email validation expectation
    """
    # TODO: Implement reusable email expectation
    raise NotImplementedError("Exercise 3: implement reusable email expectation")

print("\n📝 Complete the exercises above to master DLT expectations!")