# 6.4 Data Quality with Expectations in Lakeflow

This notebook demonstrates how to implement declarative data quality using expectations in Lakeflow Declarative Pipelines. We'll explore the migration from legacy Delta Live Tables (`dlt`) expectations to modern `pyspark.pipelines` expectations, implement layered quality strategies, and build composable quality rules following functional programming principles.

## Learning Objectives

By the end of this notebook, you will understand how to:
- Migrate from `@dlt.expect*` to `@dp.expect*` decorators
- Implement three expectation strategies: WARN, DROP, and FAIL
- Design layered quality strategies (Bronze → Silver → Gold)
- Create composable and reusable expectation patterns
- Monitor quality metrics and track violations
- Test expectations before production deployment
- Apply functional programming to data quality validation
- Build declarative quality rules that align with business requirements

## Prerequisites

- Completion of Notebooks 6.1, 6.2, and 6.3
- Understanding of Delta Lake and data quality concepts
- Familiarity with SQL expressions and constraints
- Knowledge of functional programming principles

In [None]:
# Platform setup detection
# In Databricks: Keep commented
# In Local: Uncomment this line
# %run 00_Environment_Setup.ipynb

In [None]:
# Essential imports
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import *
from typing import Dict, List, Optional, Callable
from dataclasses import dataclass
from enum import Enum

# In a real Lakeflow pipeline:
# from pyspark import pipelines as dp

print("✅ Imports complete - Ready for expectations demonstration!")

## 1. Migration from DLT to Lakeflow Expectations

### Legacy DLT Expectations

**Before (Delta Live Tables)**:
```python
import dlt
from pyspark.sql import functions as F

# WARN: Monitor violations without blocking
@dlt.table
@dlt.expect("valid_email", "email IS NOT NULL")
def customers_bronze():
    return spark.table("raw.customers")

# DROP: Remove violating records
@dlt.table
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
def customers_silver():
    return dlt.read("customers_bronze")

# FAIL: Stop pipeline on violation
@dlt.table
@dlt.expect_or_fail("customer_id_present", "customer_id IS NOT NULL")
def customers_gold():
    return dlt.read("customers_silver")
```

### Modern Lakeflow Expectations

**After (Lakeflow Declarative Pipelines)**:
```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# WARN: Monitor violations without blocking
@dp.table
@dp.expect("valid_email", "email IS NOT NULL")
def customers_bronze():
    return spark.table("raw.customers")

# DROP: Remove violating records
@dp.table
@dp.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
def customers_silver():
    return spark.read.table("customers_bronze")

# FAIL: Stop the update on violation
@dp.table
@dp.expect_or_fail("customer_id_present", "customer_id IS NOT NULL")
def customers_gold():
    return spark.read.table("customers_silver")
```

### Migration Changes

| Legacy DLT | Modern Lakeflow | Notes |
|------------|-----------------|-------|
| `import dlt` | `from pyspark import pipelines as dp` | Module import change |
| `@dlt.expect(...)` | `@dp.expect(...)` | WARN strategy |
| `@dlt.expect_or_drop(...)` | `@dp.expect_or_drop(...)` | DROP strategy |
| `@dlt.expect_or_fail(...)` | `@dp.expect_or_fail(...)` | FAIL strategy |
| `dlt.read("table")` | `spark.read.table("table")` | Reading dependencies; Databricks recommends the `spark.read.table` form |

**Migration is mostly mechanical**: change the import, replace the `dlt.` prefix with `dp.` on decorators, and read upstream datasets with `spark.read.table(...)`. Expectation names and constraint strings carry over unchanged.
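For larger codebases, the renames in the table above can be scripted. The following is a minimal sketch, assuming pipeline sources are plain Python text and that the legacy module is imported as `import dlt` (aliased imports are not handled). `migrate_dlt_source` is a hypothetical helper, not part of any library. It rewrites `dlt.read(...)` to `spark.read.table(...)`, the form Databricks recommends for reading pipeline datasets, and should be followed by a manual review of the result.

```python
import re


def migrate_dlt_source(source: str) -> str:
    """Rewrite legacy `dlt` references in pipeline source text (hypothetical helper)."""
    # 1. Swap the module import (only the exact `import dlt` form).
    source = re.sub(
        r"^import dlt[ \t]*$",
        "from pyspark import pipelines as dp",
        source,
        flags=re.MULTILINE,
    )
    # 2. Rewrite dataset reads first, so step 3 does not turn them into dp.read.
    source = re.sub(r"\bdlt\.read_stream\(", "spark.readStream.table(", source)
    source = re.sub(r"\bdlt\.read\(", "spark.read.table(", source)
    # 3. Every remaining `dlt.` prefix (decorators, helpers) becomes `dp.`.
    return re.sub(r"\bdlt\.", "dp.", source)


legacy = '''import dlt

@dlt.table
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
def customers_silver():
    return dlt.read("customers_bronze")
'''

migrated = migrate_dlt_source(legacy)
print(migrated)
```

A text rewrite like this is only a starting point: run the migrated pipeline against test data and compare expectation metrics with the legacy pipeline before switching over.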

## 2. Three Expectation Strategies

### Strategy 1: WARN (`@dp.expect`)

**Purpose**: Monitor quality issues without impacting data flow

**Behavior**:
- Records violations in quality metrics
- Does NOT drop records
- Does NOT fail pipeline
- All data passes through unchanged

**Use Cases**:
- Monitoring data quality trends
- Non-critical quality checks
- Exploratory data quality assessment
- Gradual quality improvement tracking

```python
@dp.table(
    name="bronze_events",
    comment="Raw events with quality monitoring"
)
# [.] matches a literal dot without relying on backslash escapes inside the SQL string
@dp.expect("valid_email_format",
           "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$'")
@dp.expect("valid_timestamp_format",
           "timestamp RLIKE '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}'")
@dp.expect("reasonable_amount",
           "amount >= 0 AND amount < 1000000")
def bronze_events():
    """
    Bronze layer: Monitor quality, don't block ingestion.
    All records stored, violations tracked for analysis.
    """
    return spark.table("raw.events")

# Result:
# - All records ingested (even with violations)
# - Quality metrics show violation rates
# - Can analyze patterns in violations
```

### Strategy 2: DROP (`@dp.expect_or_drop`)

**Purpose**: Enforce quality by removing violating records

**Behavior**:
- Filters out records that violate constraint
- Continues processing with valid records
- Tracks drop rate in metrics
- Pipeline does NOT fail

**Use Cases**:
- Enforcing business rules
- Data cleansing in Silver layer
- Removing clearly invalid data
- Preventing downstream contamination

```python
@dp.table(
    name="silver_customers",
    comment="Cleaned customers with enforced quality"
)
@dp.expect_or_drop("valid_age_range",
                    "age >= 18 AND age <= 120")
@dp.expect_or_drop("valid_country_code",
                    "country IN ('US', 'CA', 'UK', 'AU', 'DE', 'FR', 'JP')")
@dp.expect_or_drop("non_negative_account_balance",
                    "account_balance >= 0")
@dp.expect_or_drop("non_null_email",
                    "email IS NOT NULL AND length(email) > 0")
def silver_customers():
    """
    Silver layer: Remove invalid records.
    Only valid customers proceed to downstream tables.
    """
    return spark.read.table("bronze_customers")

# Result:
# - Invalid records dropped automatically
# - Pipeline continues with clean data
# - Metrics show how many dropped
```

### Strategy 3: FAIL (`@dp.expect_or_fail`)

**Purpose**: Critical quality gates that must pass

**Behavior**:
- Fails the update as soon as ANY record violates the constraint
- Rolls back the table update atomically, so no partial data is written
- Provides clear error message
- Requires manual intervention to fix

**Use Cases**:
- Critical data integrity checks
- Non-null key columns (expectations are evaluated per row, so uniqueness needs a separate aggregate check)
- Required fields for downstream systems
- Preventing data corruption

```python
@dp.table(
    name="gold_customers",
    comment="Production-ready customers with strict quality gates"
)
@dp.expect_or_fail("customer_id_present",
                    "customer_id IS NOT NULL")
@dp.expect_or_fail("required_fields_present",
                    "customer_id IS NOT NULL AND name IS NOT NULL AND email IS NOT NULL")
@dp.expect_or_fail("no_future_dates",
                    "signup_date <= current_date()")
def gold_customers():
    """
    Gold layer: Fail fast on critical violations.
    Guarantees downstream systems receive valid data.
    """
    return spark.read.table("silver_customers")

# Result:
# - ANY violation stops pipeline
# - No partial data written
# - Alert triggers for investigation
```

### Strategy Decision Matrix

| Criteria | WARN | DROP | FAIL |
|----------|------|------|------|
| **Impact** | None | Removes records | Stops pipeline |
| **Data Flow** | All data passes | Only valid data | No data on violation |
| **Use Case** | Monitoring | Cleansing | Critical gates |
| **Typical Layer** | Bronze | Silver | Gold |
| **Business Tolerance** | High | Medium | Zero |
| **Example** | Email format | Age range | Non-null ID |
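
The matrix above can be made concrete with a small in-memory simulation. This is a sketch of the three behaviors on plain Python dictionaries, not how Lakeflow implements them; `apply_expectation` and `ExpectationFailed` are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class ExpectationFailed(Exception):
    """Raised by the FAIL strategy when any row violates the constraint."""


def apply_expectation(
    rows: List[Row],
    predicate: Callable[[Row], bool],
    action: str,
) -> Tuple[List[Row], Dict[str, int]]:
    """Simulate WARN / DROP / FAIL on in-memory rows (illustration only)."""
    passed = [r for r in rows if predicate(r)]
    metrics = {
        "passed_records": len(passed),
        "failed_records": len(rows) - len(passed),
    }
    if action == "warn":
        return rows, metrics    # keep everything, only record metrics
    if action == "drop":
        return passed, metrics  # keep only rows that satisfy the constraint
    if action == "fail":
        if metrics["failed_records"]:
            raise ExpectationFailed(
                f"{metrics['failed_records']} row(s) violated the constraint"
            )
        return rows, metrics
    raise ValueError(f"unknown action: {action}")


rows = [{"id": 1, "age": 25}, {"id": 2, "age": 17}, {"id": 3, "age": 45}]
is_adult = lambda r: 18 <= r["age"] <= 120

warned, warn_metrics = apply_expectation(rows, is_adult, "warn")
dropped, drop_metrics = apply_expectation(rows, is_adult, "drop")

try:
    apply_expectation(rows, is_adult, "fail")
except ExpectationFailed as exc:
    print(f"FAIL strategy stopped processing: {exc}")
```

All three strategies produce the same metrics for the same input; they differ only in what happens to the violating rows.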


## 3. Layered Quality Strategy: Bronze → Silver → Gold

### Architecture Pattern

```
┌─────────────────────────────────────────────────────────────┐
│                  LAYERED QUALITY STRATEGY                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  RAW DATA                                                    │
│     ↓                                                        │
│  ┌──────────────────────────────────────────────┐           │
│  │ BRONZE: Monitor (WARN)                       │           │
│  │ • Ingest all data                            │           │
│  │ • Track quality issues                       │           │
│  │ • No data loss                               │           │
│  │ • Complete audit trail                       │           │
│  └──────────────────────────────────────────────┘           │
│     ↓                                                        │
│  ┌──────────────────────────────────────────────┐           │
│  │ SILVER: Cleanse (DROP)                       │           │
│  │ • Remove invalid records                     │           │
│  │ • Apply business rules                       │           │
│  │ • Standardize formats                        │           │
│  │ • Track drop rates                           │           │
│  └──────────────────────────────────────────────┘           │
│     ↓                                                        │
│  ┌──────────────────────────────────────────────┐           │
│  │ GOLD: Enforce (FAIL)                         │           │
│  │ • Strict quality gates                       │           │
│  │ • Production guarantees                      │           │
│  │ • Zero tolerance for violations              │           │
│  │ • Fail fast on issues                        │           │
│  └──────────────────────────────────────────────┘           │
│     ↓                                                        │
│  BUSINESS APPLICATIONS                                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

### Complete Example: Customer Pipeline

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# BRONZE LAYER: Monitor quality, ingest everything
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@dp.table(
    name="bronze_customers",
    comment="Raw customer data with quality monitoring"
)
@dp.expect("has_email", "email IS NOT NULL")
@dp.expect("has_name", "name IS NOT NULL")
@dp.expect("email_format_looks_valid",
           "email RLIKE '^[^@]+@[^@]+[.][^@]+$'")
@dp.expect("date_format_valid",
           "signup_date RLIKE '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'")
def bronze_customers():
    """
    Bronze: Complete data ingestion with monitoring.
    - All records stored (even invalid)
    - Quality issues tracked for root cause analysis
    - Provides audit trail
    """
    return spark.table("raw.customers")

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# SILVER LAYER: Clean and standardize
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@dp.table(
    name="silver_customers",
    comment="Cleaned customers with enforced business rules"
)
@dp.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
@dp.expect_or_drop("valid_country",
                    "country IN ('US', 'CA', 'UK', 'AU', 'DE', 'FR', 'JP', 'CN', 'IN', 'BR')")
@dp.expect_or_drop("valid_tier",
                    "tier IN ('Free', 'Basic', 'Premium', 'Enterprise')")
@dp.expect_or_drop("non_negative_balance",
                    "account_balance >= 0")
@dp.expect_or_drop("valid_email_format",
                    "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$'")
def silver_customers():
    """
    Silver: Business rule enforcement.
    - Invalid records dropped
    - Data standardized
    - Clean dataset for analytics
    """
    return (
        spark.read.table("bronze_customers")
        .withColumn("country", F.upper(F.trim(F.col("country"))))  # Standardize
        .withColumn("email", F.lower(F.trim(F.col("email"))))      # Normalize
        .withColumn("signup_date", F.to_date("signup_date"))       # Parse date
    )

# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
# GOLD LAYER: Production guarantees
# ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@dp.table(
    name="gold_customers",
    comment="Production-ready customers with strict quality gates"
)
@dp.expect_or_fail("customer_id_present",
                    "customer_id IS NOT NULL")
@dp.expect_or_fail("required_fields_complete",
                    "customer_id IS NOT NULL AND name IS NOT NULL AND email IS NOT NULL")
# Expectations are evaluated per row, so they cannot detect duplicates;
# uniqueness is handled by dropDuplicates() in the query below
@dp.expect("high_quality_emails",  # WARN in gold for monitoring
           "email LIKE '%@gmail.com' OR email LIKE '%@yahoo.com' OR email LIKE '%@company.com'")
def gold_customers():
    """
    Gold: Zero-defect production data.
    - Critical fields guaranteed non-null
    - Deduplication applied
    - Safe for downstream consumption
    """
    return (
        spark.read.table("silver_customers")
        .dropDuplicates(["customer_id"])  # Ensure uniqueness
        .select(
            "customer_id",
            "name",
            "email",
            "age",
            "country",
            "tier",
            "account_balance",
            "signup_date"
        )
    )
```

## 4. Composable and Reusable Expectations

### Pattern 1: Standard Expectation Library

```python
# Define reusable expectation configurations
class StandardExpectations:
    """Library of reusable expectation definitions"""
    
    # Email validation ([.] matches a literal dot without SQL backslash escapes)
    EMAIL_FORMAT = (
        "valid_email_format",
        "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$'"
    )

    # Phone validation (US format)
    PHONE_FORMAT = (
        "valid_phone_format",
        "phone RLIKE '^[+]?1?[0-9]{10}$'"
    )

    # Age range
    ADULT_AGE = (
        "adult_age_requirement",
        "age >= 18 AND age <= 120"
    )

    # Non-negative amounts
    NON_NEGATIVE_AMOUNT = (
        "non_negative_monetary_amount",
        "amount >= 0 AND amount < 1000000000"
    )

    # Date not in future (parameterized so it works for any date column)
    @staticmethod
    def not_future(field_name: str):
        return (
            f"{field_name}_not_future",
            f"{field_name} <= current_date()"
        )
    
    # Non-null required field
    @staticmethod
    def required_field(field_name: str):
        return (
            f"{field_name}_required",
            f"{field_name} IS NOT NULL AND length({field_name}) > 0"
        )
    
    # Value in allowed list
    @staticmethod
    def allowed_values(field_name: str, values: List[str]):
        values_str = "', '".join(values)
        return (
            f"{field_name}_allowed_values",
            f"{field_name} IN ('{values_str}')"
        )

# Use standard expectations
@dp.table
@dp.expect_or_drop(*StandardExpectations.EMAIL_FORMAT)
@dp.expect_or_drop(*StandardExpectations.ADULT_AGE)
@dp.expect_or_drop(*StandardExpectations.required_field("customer_id"))
@dp.expect_or_drop(*StandardExpectations.allowed_values(
    "country", ["US", "CA", "UK", "AU"]
))
def validated_customers():
    return spark.read.table("raw_customers")
```
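
One caveat with `allowed_values` above: values are interpolated into the SQL text as-is, so a value containing a single quote (for example `O'Brien`) produces a broken constraint. The sketch below shows one way to harden it. `sql_string_literal` and `allowed_values_safe` are hypothetical names, and the escaping assumes Spark's default string-literal parsing, where a backslash escapes the next character.

```python
import re
from typing import List, Tuple

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sql_string_literal(value: str) -> str:
    """Quote a Python string as a SQL literal using backslash escapes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def allowed_values_safe(field_name: str, values: List[str]) -> Tuple[str, str]:
    """Build an IN-list expectation, validating the column name and escaping values."""
    if not _IDENTIFIER.fullmatch(field_name):
        raise ValueError(f"not a simple column name: {field_name!r}")
    if not values:
        raise ValueError("values must not be empty")
    in_list = ", ".join(sql_string_literal(v) for v in values)
    return (f"{field_name}_allowed_values", f"{field_name} IN ({in_list})")


name, constraint = allowed_values_safe("surname", ["Smith", "O'Brien"])
print(name)
print(constraint)
```

Rejecting unexpected column names up front also keeps expectation names predictable, which matters when they are later used as keys in quality metrics.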

### Pattern 2: Expectation Builders

```python
from typing import Tuple

class ExpectationBuilder:
    """Functional builder for creating expectations"""
    
    @staticmethod
    def range_check(field: str, min_val: float, max_val: float) -> Tuple[str, str]:
        """Create range validation expectation"""
        return (
            f"{field}_range_check",
            f"{field} >= {min_val} AND {field} <= {max_val}"
        )
    
    @staticmethod
    def not_null(field: str) -> Tuple[str, str]:
        """Create not-null expectation"""
        return (
            f"{field}_not_null",
            f"{field} IS NOT NULL"
        )
    
    @staticmethod
    def regex_match(field: str, pattern: str, description: str) -> Tuple[str, str]:
        """Create regex validation expectation"""
        return (
            f"{field}_{description}",
            f"{field} RLIKE '{pattern}'"
        )
    
    @staticmethod
    def in_list(field: str, values: List[str]) -> Tuple[str, str]:
        """Create enumeration expectation"""
        values_str = "', '".join(values)
        return (
            f"{field}_in_allowed_list",
            f"{field} IN ('{values_str}')"
        )
    
    @staticmethod
    def date_range(field: str, start_date: str, end_date: str) -> Tuple[str, str]:
        """Create date range expectation"""
        return (
            f"{field}_date_range",
            f"{field} >= '{start_date}' AND {field} <= '{end_date}'"
        )

# Usage
@dp.table
@dp.expect_or_drop(*ExpectationBuilder.range_check("age", 18, 120))
@dp.expect_or_drop(*ExpectationBuilder.not_null("customer_id"))
@dp.expect_or_drop(*ExpectationBuilder.regex_match(
    "email",
    "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$",  # [.] = literal dot
    "email_format"
))
@dp.expect_or_drop(*ExpectationBuilder.in_list(
    "tier", ["Free", "Basic", "Premium"]
))
def customers_with_validations():
    return spark.read.table("raw_customers")
```

### Pattern 3: Domain-Specific Expectation Sets

```python
class CustomerExpectations:
    """Customer-specific expectation patterns"""
    
    @staticmethod
    def bronze_layer():
        """Expectations for bronze customer data"""
        return [
            ("has_id", "customer_id IS NOT NULL"),
            ("has_email", "email IS NOT NULL"),
            ("has_name", "name IS NOT NULL"),
        ]
    
    @staticmethod
    def silver_layer():
        """Expectations for silver customer data"""
        return [
            ("valid_age", "age >= 18 AND age <= 120"),
            ("valid_email", "email RLIKE '^[^@]+@[^@]+[.][^@]+$'"),
            ("valid_tier", "tier IN ('Free', 'Basic', 'Premium', 'Enterprise')"),
            ("positive_balance", "account_balance >= 0"),
        ]
    
    @staticmethod
    def gold_layer():
        """Expectations for gold customer data"""
        return [
            ("id_required", "customer_id IS NOT NULL"),
            ("complete_profile", 
             "customer_id IS NOT NULL AND name IS NOT NULL AND email IS NOT NULL"),
        ]

# Apply expectation sets with the expect_all* decorators,
# which take a dict of {name: constraint}
@dp.table
@dp.expect_all(dict(CustomerExpectations.bronze_layer()))
def bronze_customers():
    return spark.table("raw.customers")

@dp.table
@dp.expect_all_or_drop(dict(CustomerExpectations.silver_layer()))
def silver_customers():
    return spark.read.table("bronze_customers")

@dp.table
@dp.expect_all_or_fail(dict(CustomerExpectations.gold_layer()))
def gold_customers():
    return spark.read.table("silver_customers")
```
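
When several expectation sets apply to one table, it helps to combine them into a single name-to-constraint dictionary, the shape accepted by dict-based decorators such as `@dp.expect_all`. The sketch below shows one way to do that while catching accidental name clashes; `merge_expectations` is a hypothetical helper, and the rule lists are illustrative.

```python
from typing import Dict, Iterable, Tuple

Expectation = Tuple[str, str]


def merge_expectations(*sets: Iterable[Expectation]) -> Dict[str, str]:
    """Combine (name, constraint) lists into one dict, rejecting name clashes."""
    merged: Dict[str, str] = {}
    for expectation_set in sets:
        for name, constraint in expectation_set:
            if name in merged and merged[name] != constraint:
                raise ValueError(
                    f"conflicting definitions for expectation '{name}'"
                )
            merged[name] = constraint
    return merged


identity_rules = [("has_id", "customer_id IS NOT NULL")]
contact_rules = [
    ("has_email", "email IS NOT NULL"),
    ("has_name", "name IS NOT NULL"),
]

bronze_rules = merge_expectations(identity_rules, contact_rules)
print(bronze_rules)
# In a pipeline, the merged dict could be passed to a dict-based decorator,
# for example: @dp.expect_all(bronze_rules)
```

Failing on conflicting definitions is a deliberate choice here: two rules sharing a name would be indistinguishable in the quality metrics.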

## 5. Quality Metrics and Monitoring

### What Metrics Are Tracked?

Lakeflow records pass and fail counts for each expectation in the pipeline event log. Each `flow_progress` event carries a `data_quality.expectations` array with entries shaped like this:

```python
# One entry from details:flow_progress.data_quality.expectations
{
    "name": "valid_email",
    "dataset": "silver_customers",
    "passed_records": 9850,
    "failed_records": 150
}
# Totals and rates are derived from the counts, not stored:
#   total_records = passed_records + failed_records = 10000
#   fail_rate     = failed_records / total_records  = 0.015
```
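
Pass and fail rates follow directly from the two record counts. As a small sketch, the record above can be wrapped in a dataclass that derives them on demand; `ExpectationResult` is a hypothetical type introduced here, not a Lakeflow class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectationResult:
    """Pass/fail counts for one expectation on one dataset (hypothetical type)."""
    name: str
    dataset: str
    passed_records: int
    failed_records: int

    @property
    def total_records(self) -> int:
        return self.passed_records + self.failed_records

    @property
    def fail_rate(self) -> float:
        # An empty batch has no failures; avoid dividing by zero.
        if self.total_records == 0:
            return 0.0
        return self.failed_records / self.total_records

    @property
    def pass_rate(self) -> float:
        return 1.0 - self.fail_rate


result = ExpectationResult("valid_email", "silver_customers", 9850, 150)
print(result.total_records, result.fail_rate)
```

Deriving rates from counts, instead of averaging precomputed rates, keeps aggregates correct when batches differ in size.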

### Accessing Quality Metrics

```python
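# In Databricks, metrics are available via:
# 1. Pipeline UI (data quality metrics for each dataset)
# 2. The pipeline event log, queried with the event_log() table-valued function

# Example: aggregate expectation results from the event log.
# <pipeline-id> is a placeholder for your pipeline's ID.
quality_metrics = spark.sql("""
    SELECT
        e.dataset,
        e.name AS expectation_name,
        SUM(e.passed_records) AS total_passed,
        SUM(e.failed_records) AS total_failures,
        try_divide(SUM(e.failed_records),
                   SUM(e.passed_records + e.failed_records)) AS fail_rate,
        MAX(timestamp) AS last_check
    FROM (
        SELECT
            timestamp,
            explode(from_json(
                details:flow_progress:data_quality:expectations,
                'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
            )) AS e
        FROM event_log('<pipeline-id>')
        WHERE event_type = 'flow_progress'
    )
    GROUP BY e.dataset, e.name
    HAVING fail_rate > 0.01  -- Flag expectations failing on more than 1% of records
    ORDER BY fail_rate DESC
""")

quality_metrics.display()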
```

### Setting Up Alerts

```python
# Monitor expectations and alert on issues
@dp.table
@dp.expect("positive_amount", "amount > 0")
def monitored_transactions():
    return spark.read.table("raw_transactions")

# Alert rules to configure on top of the quality metrics, for example
# as Databricks SQL alerts on an event log query:
# - Alert if fail_rate > 5%
# - Alert if failed_records > 1000
# - Alert on any FAIL expectation violation (the update fails)
# - Send to: Slack, Email, PagerDuty, etc.
```
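
The threshold rules listed above are simple enough to prototype in plain Python before wiring them into an alerting tool. The following sketch evaluates them against a list of per-expectation counts. `find_alerts` is a hypothetical helper, the sample counts are made up for illustration, and the default thresholds mirror the 5% and 1000-record rules.

```python
from typing import Any, Dict, List


def find_alerts(
    results: List[Dict[str, Any]],
    max_fail_rate: float = 0.05,
    max_failed_records: int = 1000,
) -> List[str]:
    """Return one message per expectation result that crosses a threshold."""
    alerts = []
    for r in results:
        failed = r["failed_records"]
        total = failed + r["passed_records"]
        fail_rate = failed / total if total else 0.0
        label = f"{r['dataset']}.{r['name']}"
        if fail_rate > max_fail_rate:
            alerts.append(f"{label}: fail rate {fail_rate:.1%} is above {max_fail_rate:.0%}")
        elif failed > max_failed_records:
            alerts.append(f"{label}: {failed} failed records is above {max_failed_records}")
    return alerts


# Illustrative counts only
results = [
    {"dataset": "silver_customers", "name": "valid_age",
     "passed_records": 900, "failed_records": 100},
    {"dataset": "bronze_events", "name": "reasonable_amount",
     "passed_records": 998500, "failed_records": 1500},
    {"dataset": "gold_customers", "name": "customer_id_present",
     "passed_records": 5000, "failed_records": 0},
]

alerts = find_alerts(results)
for message in alerts:
    print(message)
```

Checking both a rate and an absolute count catches two different problems: a small table with a high failure share, and a large table where a low share still means many bad records.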

### Quality Dashboards

```python
# Build a quality monitoring dataset from the event log.
# Assumption: this runs outside the pipeline (for example, in a scheduled
# notebook); <pipeline-id> is a placeholder for your pipeline's ID.
def quality_dashboard():
    """
    Aggregated quality metrics across all expectations
    for the last 7 days of pipeline updates.
    """
    return spark.sql("""
        SELECT
            e.dataset,
            e.name AS expectation_name,
            COUNT(*) AS check_count,
            SUM(e.passed_records + e.failed_records) AS total_records_checked,
            SUM(e.failed_records) AS total_failures,
            try_divide(SUM(e.failed_records),
                       SUM(e.passed_records + e.failed_records)) AS overall_failure_rate,
            MIN(timestamp) AS first_check,
            MAX(timestamp) AS last_check
        FROM (
            SELECT
                timestamp,
                explode(from_json(
                    details:flow_progress:data_quality:expectations,
                    'array<struct<name: string, dataset: string, passed_records: int, failed_records: int>>'
                )) AS e
            FROM event_log('<pipeline-id>')
            WHERE event_type = 'flow_progress'
              AND timestamp >= current_date() - INTERVAL 7 DAYS
        )
        GROUP BY e.dataset, e.name
        ORDER BY overall_failure_rate DESC
    """)

# Visualize in dashboards (Databricks SQL, Power BI, Tableau, etc.)
```

## 6. Testing Expectations Before Deployment

### Pattern 1: Unit Testing Expectation Logic

```python
import pytest
from pyspark.sql import SparkSession

def test_age_expectation(spark: SparkSession):
    """Test age range expectation with known data"""
    
    # Create test data with known violations
    test_data = spark.createDataFrame([
        (1, "Alice", 25),    # Valid
        (2, "Bob", 17),      # Invalid (too young)
        (3, "Carol", 45),    # Valid
        (4, "David", 150),   # Invalid (too old)
        (5, "Eve", 30),      # Valid
    ], ["id", "name", "age"])
    
    # Apply expectation constraint
    constraint = "age >= 18 AND age <= 120"
    result = test_data.filter(constraint)
    
    # Verify correct filtering
    assert result.count() == 3  # Only 3 valid records
    valid_ids = [row.id for row in result.collect()]
    assert set(valid_ids) == {1, 3, 5}

def test_email_format_expectation(spark: SparkSession):
    """Test email format validation"""
    
    test_data = spark.createDataFrame([
        (1, "alice@example.com"),     # Valid
        (2, "invalid-email"),          # Invalid
        (3, "bob@test.org"),           # Valid
        (4, "@no-user.com"),           # Invalid
        (5, "carol@company.co.uk"),    # Valid
    ], ["id", "email"])
    
    # [.] matches a literal dot without relying on SQL backslash escapes
    constraint = "email RLIKE '^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$'"
    result = test_data.filter(constraint)
    
    assert result.count() == 3
    valid_ids = [row.id for row in result.collect()]
    assert set(valid_ids) == {1, 3, 5}
```
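
Regex constraints can also be sanity-checked without a Spark session. `RLIKE` uses Java regular expressions, but for a simple pattern like this one (character classes, quantifiers, and anchors) Python's `re` module behaves the same way, so a quick pre-check is possible. This is a sketch under that assumption; `matches_email_pattern` is a hypothetical helper, and the Spark-based tests above remain the authoritative check.

```python
import re

# Same pattern as the expectation; [.] matches a literal dot
EMAIL_PATTERN = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z]{2,}$"


def matches_email_pattern(value: str) -> bool:
    """Mirror RLIKE's 'pattern found anywhere in the string' semantics."""
    return re.search(EMAIL_PATTERN, value) is not None


cases = {
    "alice@example.com": True,
    "carol@company.co.uk": True,
    "invalid-email": False,     # no @ sign
    "@no-user.com": False,      # empty local part
    "dave@localhost": False,    # no dot-separated top-level domain
}

for email, expected in cases.items():
    assert matches_email_pattern(email) == expected, email

print("all email pattern cases behave as expected")
```

A table of inputs and expected outcomes like `cases` doubles as documentation of what the business rule is meant to accept.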

### Pattern 2: Integration Testing Full Pipeline

```python
def test_customer_pipeline_quality(spark: SparkSession):
    """Test complete pipeline with expectations"""
    
    # Create bronze data with various quality issues
    bronze_data = spark.createDataFrame([
        (1, "Alice", "alice@example.com", 25, "US", 1000.0),     # All valid
        (2, "Bob", "invalid", 17, "CA", -100.0),                # Multiple issues
        (3, "Carol", "carol@test.com", 45, "UK", 2000.0),       # All valid
        (4, None, "david@example.com", 30, "InvalidCountry", 500.0),  # Invalid name and country
        (5, "Eve", "eve@example.com", 150, "AU", 3000.0),       # Invalid age
    ], ["customer_id", "name", "email", "age", "country", "balance"])
    
    # Simulate silver layer expectations
    silver_data = (
        bronze_data
        .filter("age >= 18 AND age <= 120")  # Drop invalid ages
        .filter("country IN ('US', 'CA', 'UK', 'AU', 'DE')")  # Drop invalid countries
        .filter("balance >= 0")  # Drop negative balances
        .filter("name IS NOT NULL")  # Drop null names
    )
    
    # Verify results
    assert silver_data.count() == 2  # Only Alice and Carol pass
    valid_names = [row.name for row in silver_data.collect()]
    assert set(valid_names) == {"Alice", "Carol"}
```

### Pattern 3: Expectation Coverage Testing

```python
def test_all_expectations_covered():
    """Ensure critical fields have expectations defined"""
    
    # Define required expectations
    required_expectations = {
        "bronze_customers": ["has_id", "has_email"],
        "silver_customers": ["valid_age", "valid_email", "valid_tier"],
        "gold_customers": ["id_required", "complete_profile"],
    }
    
    # Verify expectations are defined
    # (In real implementation, parse pipeline definition)
    defined_expectations = get_defined_expectations()  # Custom function
    
    for table, expectations in required_expectations.items():
        for expectation in expectations:
            assert expectation in defined_expectations[table], \
                f"Missing expectation '{expectation}' for table '{table}'"
```
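
The coverage test above relies on a custom `get_defined_expectations()` function. One possible shape for it, assuming expectations are kept in a central registry that the pipeline code also reads from, is sketched below. `EXPECTATION_REGISTRY` and `missing_expectations` are hypothetical names, and the registry contents are illustrative.

```python
from typing import Dict, List, Tuple

Expectation = Tuple[str, str]

# Hypothetical registry: table name -> (expectation name, constraint) pairs.
# In a real project the pipeline decorators would be built from this same object.
EXPECTATION_REGISTRY: Dict[str, List[Expectation]] = {
    "bronze_customers": [
        ("has_id", "customer_id IS NOT NULL"),
        ("has_email", "email IS NOT NULL"),
    ],
    "silver_customers": [
        ("valid_age", "age >= 18 AND age <= 120"),
        ("valid_tier", "tier IN ('Free', 'Basic', 'Premium', 'Enterprise')"),
    ],
}


def get_defined_expectations() -> Dict[str, List[str]]:
    """Return the expectation names defined for each table."""
    return {
        table: [name for name, _ in rules]
        for table, rules in EXPECTATION_REGISTRY.items()
    }


def missing_expectations(required: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """List (table, expectation) pairs that are required but not defined."""
    defined = get_defined_expectations()
    return [
        (table, name)
        for table, names in required.items()
        for name in names
        if name not in defined.get(table, [])
    ]


required = {
    "bronze_customers": ["has_id", "has_email"],
    "silver_customers": ["valid_age", "valid_email"],
}
missing = missing_expectations(required)
print(missing)
```

Returning the full list of gaps, instead of stopping at the first missing expectation, gives a complete report in a single test run.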

## Summary

In this notebook, we explored data quality with expectations in Lakeflow:

### Key Concepts Covered

1. **Migration from DLT to Lakeflow**
   - Mechanical change: swap the import and replace the `dlt.` prefix with `dp.`
   - Expectation decorators keep the same names, arguments, and behavior
   - Constraint strings carry over unchanged

2. **Three Expectation Strategies**
   - WARN (`@dp.expect`): Monitor without blocking
   - DROP (`@dp.expect_or_drop`): Remove violating records
   - FAIL (`@dp.expect_or_fail`): Stop pipeline on violations

3. **Layered Quality Architecture**
   - Bronze: WARN expectations for monitoring
   - Silver: DROP expectations for cleansing
   - Gold: FAIL expectations for guarantees

4. **Composable Expectations**
   - Standard expectation libraries
   - Expectation builders for reusability
   - Domain-specific expectation sets

5. **Quality Metrics and Monitoring**
   - Automatic metrics collection
   - Quality dashboards and reports
   - Alert configuration for violations

6. **Testing Strategies**
   - Unit testing expectation logic
   - Integration testing full pipelines
   - Coverage testing for completeness

### Functional Programming Benefits

- **Declarative**: Expectations describe desired quality state
- **Composable**: Expectations stack on table definitions
- **Immutable**: Expectations don't modify pipeline code
- **Pure**: Constraint evaluation is deterministic

### Best Practices

✅ Use layered strategy (WARN → DROP → FAIL)
✅ Create reusable expectation libraries
✅ Monitor metrics and set up alerts
✅ Test expectations with synthetic data
✅ Document business rules in expectation names
✅ Start with WARN, tighten to DROP/FAIL gradually
✅ Keep constraints simple and testable

### Next Steps

- **6.5**: Flows and advanced CDC patterns
- **6.6**: Best practices and anti-patterns


## Exercises

Practice implementing expectations:

**Exercise 1: Basic Expectations**
- Define a table with 3 WARN expectations
- Add 2 DROP expectations for business rules
- Include 1 FAIL expectation for critical field

**Exercise 2: Layered Pipeline**
- Create bronze/silver/gold pipeline
- Apply appropriate expectation strategy at each layer
- Track quality metrics across layers

**Exercise 3: Reusable Library**
- Build expectation library for your domain
- Create at least 5 reusable expectations
- Use them across multiple tables

**Exercise 4: Quality Monitoring**
- Set up quality metrics dashboard
- Identify top 3 quality issues
- Configure alerts for critical violations

**Exercise 5: Testing Strategy**
- Write unit tests for 3 expectations
- Create integration test for pipeline
- Verify expected drop rates

**Exercise 6: Migration Practice**
- Take existing DLT pipeline with expectations
- Migrate to Lakeflow (`dlt` → `dp`)
- Verify identical behavior
