# 6.6 Best Practices and Anti-Patterns for Lakeflow Declarative Pipelines

**Learning Objectives**:
- Understand prohibited operations in declarative pipeline definitions
- Apply decision criteria for table vs view vs temporary view selection
- Recognize and avoid common performance anti-patterns
- Implement effective testing strategies for declarative pipelines
- Execute systematic migration from imperative to declarative patterns
- Troubleshoot common pipeline issues with proven techniques

**Prerequisites**: Completion of notebooks 6.1-6.5

**Key Takeaway**: Following declarative best practices ensures maintainable, performant, and reliable data pipelines that leverage Lakeflow's automatic optimization and orchestration capabilities.

In [None]:
# Platform setup: Uncomment for local development, keep commented in Databricks
# %run ./00_Environment_Setup.ipynb

In [None]:
from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType, DoubleType
from pyspark import pipelines as dp
from datetime import datetime, timedelta
from typing import List, Dict, Tuple

## 1. Prohibited Operations in Pipeline Definitions

### The Declarative Paradigm Contract

Lakeflow declarative pipelines follow a **strict declarative contract**: pipeline definitions should describe **what** to compute, not **how** or **when** to execute operations.

**Core Principle**: Pipeline functions return DataFrame transformations; the platform handles execution, materialization, and scheduling.
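
To make the contract concrete, the toy registry below mimics how a declarative framework can collect definitions first and execute them later. It is a plain-Python sketch only: `table`, `REGISTRY`, `EXECUTION_LOG`, and `run_all` are hypothetical names invented for illustration, and nothing here reflects how Lakeflow is actually implemented.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps dataset name -> function that builds it.
REGISTRY: Dict[str, Callable[[], List[int]]] = {}
EXECUTION_LOG: List[str] = []


def table(func: Callable[[], List[int]]) -> Callable[[], List[int]]:
    """Toy decorator: records the definition, does NOT execute it."""
    REGISTRY[func.__name__] = func
    return func


@table
def doubled_values() -> List[int]:
    # The body only runs when the "platform" decides to call it.
    EXECUTION_LOG.append("doubled_values executed")
    return [x * 2 for x in [1, 2, 3]]


def run_all() -> Dict[str, List[int]]:
    """Stand-in for the platform: executes every registered definition."""
    return {name: func() for name, func in REGISTRY.items()}


# Defining the table executed nothing.
log_after_definition = list(EXECUTION_LOG)

# Execution happens only when the runner is invoked.
results = run_all()
```

The definition describes what `doubled_values` contains; when it runs is decided entirely by `run_all`, which is the role the platform plays in a real pipeline.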

### Prohibited Operations

The following operations **violate the declarative contract**, either by triggering immediate execution or by bypassing the lifecycle that the platform manages:

1. **Actions that Materialize Data**:
   - `.collect()`, `.take()`, `.head()`, `.first()`, `.show()`
   - `.count()`
   - `.toPandas()`
   
2. **Write Operations**:
   - `.write.format().save()`
   - `.write.saveAsTable()`
   - `.createOrReplaceTempView()`
   
3. **Cache Operations**:
   - `.cache()`, `.persist()`
   - `.unpersist()`
   
4. **Checkpoint Operations**:
   - Manual `.checkpoint()` (Lakeflow manages checkpoints automatically)

**Why Prohibited?**
- **Breaks lazy evaluation**: Lakeflow optimizes the entire DAG; premature materialization prevents optimization
- **Prevents orchestration**: Lakeflow manages when and how to execute; manual execution breaks dependency tracking
- **Resource inefficiency**: Materializing data inside pipeline definitions wastes compute during the planning phase
- **Inconsistent state**: Manual writes bypass Lakeflow's transaction and consistency guarantees
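
One way to catch these calls before deployment is a small static check over the pipeline source. The sketch below uses only the standard-library `ast` module; `PROHIBITED_METHODS`, `FUNCTION_MODULE_ALIASES`, and `find_prohibited_calls` are hypothetical names. It is a name-based heuristic: it assumes `F` is the alias for `pyspark.sql.functions` (so `F.count(...)` is not flagged), and it can still report false positives such as `some_list.count(x)`.

```python
import ast
from typing import List, Tuple

# Method names taken from the prohibited-operations list; extend as needed.
PROHIBITED_METHODS = {
    "collect", "take", "head", "first", "show", "count", "toPandas",
    "save", "saveAsTable", "createOrReplaceTempView",
    "cache", "persist", "unpersist", "checkpoint",
}

# Assumption: column functions are imported under this alias.
FUNCTION_MODULE_ALIASES = {"F"}


def find_prohibited_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, method_name) for each suspicious method call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # Match calls of the form <expr>.<method>(...)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            receiver = node.func.value
            # Skip aggregate functions such as F.count("order_id")
            if isinstance(receiver, ast.Name) and receiver.id in FUNCTION_MODULE_ALIASES:
                continue
            if node.func.attr in PROHIBITED_METHODS:
                findings.append((node.lineno, node.func.attr))
    return sorted(findings)


sample = '''
def bad_validation_pattern():
    df = spark.table("raw.customers")
    if df.count() == 0:
        raise ValueError("No customer data found")
    return df.groupBy("country").agg(F.count("customer_id"))
'''

findings = find_prohibited_calls(sample)
```

Here `findings` contains one entry for the `df.count()` call, while the `F.count(...)` aggregate is left alone. A check like this could run in CI alongside the unit tests.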

### ❌ Anti-Pattern: Materializing Data in Pipeline Definitions

In [None]:
# ❌ ANTI-PATTERN: Using .count() to validate data
# @dp.table
# def bad_validation_pattern():
#     df = spark.table("raw.customers")
#     
#     # WRONG: This materializes data during pipeline definition
#     if df.count() == 0:
#         raise ValueError("No customer data found")
#     
#     return df.filter(F.col("status") == "active")

# ❌ ANTI-PATTERN: Using .collect() for branching logic
# @dp.table
# def bad_branching_pattern():
#     config_df = spark.table("config.settings")
#     
#     # WRONG: Materializes data and breaks declarative contract
#     config = config_df.collect()[0].asDict()
#     threshold = config["revenue_threshold"]
#     
#     return (
#         spark.table("sales")
#         .filter(F.col("revenue") > threshold)
#     )

# ❌ ANTI-PATTERN: Manual write operations
# @dp.table
# def bad_write_pattern():
#     df = spark.table("raw.events")
#     
#     # WRONG: Manual write bypasses Lakeflow orchestration
#     df.write.format("delta").mode("append").save("/mnt/manual/events")
#     
#     return df

print("❌ These anti-patterns are commented out because they violate declarative principles")

### ✅ Correct Pattern: Pure Declarative Transformations

In [None]:
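# ✅ CORRECT: Use expectations for data validation
# Expectations are evaluated per row: this fails the update when any row has a
# NULL customer_id. It does not, by itself, detect an empty source table.
@dp.table
@dp.expect_or_fail("valid_customer_id", "customer_id IS NOT NULL")
def good_validation_pattern():
    """Row-level data validation through declarative expectations."""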
    return (
        spark.table("raw.customers")
        .filter(F.col("status") == "active")
    )

# ✅ CORRECT: Use joins for configuration-driven logic
@dp.table
def config_driven_thresholds():
    """Configuration-driven filtering using a join.

    Assumes config.settings holds exactly one row; extra rows would
    duplicate sales records in the output.
    """
    threshold = spark.table("config.settings").select("revenue_threshold")
    return (
        spark.table("sales")
        .crossJoin(F.broadcast(threshold))  # Single-row config: cheap to broadcast
        .filter(F.col("revenue") > F.col("revenue_threshold"))
        .drop("revenue_threshold")
    )

# ✅ CORRECT: Let Lakeflow handle writes
@dp.table(
    name="processed_events",
    path="/mnt/processed/events"  # Lakeflow writes to this location
)
def good_write_pattern():
    """Lakeflow manages writes automatically."""
    return spark.table("raw.events")

print("✅ Declarative patterns preserve lazy evaluation and platform orchestration")

## 2. Table vs View vs Temporary View Selection Criteria

### Decision Matrix

| Criterion | @dp.table | @dp.materialized_view | @dp.temporary_view |
|-----------|-----------|----------------------|--------------------|
| **Persistence** | Durable storage | Durable storage | Logical only (no storage) |
| **Performance** | Pre-computed | Pre-computed | Computed on-demand |
| **Downstream Reads** | Many | Many | Few (1-2) |
| **Update Frequency** | Any | Any | Not applicable |
| **Storage Cost** | Yes | Yes | No |
| **Compute Cost** | Once per update | Once per update | Every read |
| **Use Case** | Source of truth | Aggregations | Intermediate logic |
| **Incremental** | Yes (streaming) | No | No |
| **Partitioning** | Yes | Yes | No |
| **Time Travel** | Yes | Yes | No |

### Selection Guidelines

**Use `@dp.table` when**:
- This is a **source of truth** for downstream consumers
- Data will be **read multiple times** by different pipelines
- You need **time travel** or **data versioning**
- You need **partitioning** for performance
- Data requires **incremental updates** (streaming)

**Use `@dp.materialized_view` when**:
- This is an **expensive aggregation** used by multiple consumers
- Source data changes **less often** than the results are read
- You want **pre-computed results** for BI dashboards
- Storage cost is acceptable for **query performance gains**

**Use `@dp.temporary_view` when**:
- This is **intermediate logic** used by only 1-2 downstream tables
- Computation is **cheap** (simple filters, selects)
- You want to **avoid storage costs**
- Logic is only for **code organization** (breaking complex logic into steps)
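
The guidelines above can be condensed into a small decision helper. This is a rough sketch, not an official rule set: `DatasetProfile` and `recommend_dataset_type` are hypothetical names, and the reader-count cut-offs are assumptions loosely derived from the "1-2 downstream tables" guidance.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    """Hypothetical description of a dataset, used only for this sketch."""
    downstream_readers: int
    is_expensive: bool        # e.g. large joins or aggregations
    needs_incremental: bool   # streaming / incremental updates
    needs_time_travel: bool   # versioning or compliance requirements


def recommend_dataset_type(profile: DatasetProfile) -> str:
    """Map the selection guidelines to a decorator name."""
    # Incremental updates or time travel call for a table.
    if profile.needs_incremental or profile.needs_time_travel:
        return "@dp.table"
    # Expensive results shared by several consumers are worth pre-computing.
    if profile.is_expensive and profile.downstream_readers >= 2:
        return "@dp.materialized_view"
    # Cheap logic with one or two readers can stay logical.
    if not profile.is_expensive and profile.downstream_readers <= 2:
        return "@dp.temporary_view"
    # Remaining cases (cheap but widely read, expensive with one reader): persist.
    return "@dp.table"


customer_master = DatasetProfile(5, False, False, True)
lifetime_metrics = DatasetProfile(4, True, False, False)
active_filter = DatasetProfile(1, False, False, False)

choices = [
    recommend_dataset_type(customer_master),
    recommend_dataset_type(lifetime_metrics),
    recommend_dataset_type(active_filter),
]
```

The three profiles correspond to a source-of-truth table, an expensive shared aggregation, and a cheap single-use filter. Real decisions usually also weigh storage cost and refresh latency, which this helper ignores.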

### Pattern Examples with Selection Rationale

In [None]:
# ✅ Table: Source of truth, read by many pipelines
@dp.table(
    partition_cols=["country", "year"],
    comment="Customer master table - source of truth"
)
def customers():
    """
    Why @dp.table?
    - Source of truth for customer data
    - Read by multiple downstream pipelines (orders, analytics, ML)
    - Needs partitioning for performance
    - Requires time travel for compliance
    """
    return (
        spark.table("raw.customers")
        .withColumn("year", F.year("signup_date"))
    )

# ✅ Temporary View: Simple filter, used by one downstream table
@dp.temporary_view
def active_customers():
    """
    Why @dp.temporary_view?
    - Simple filter operation (cheap to compute)
    - Only used by premium_customers table below
    - No need to persist (avoids storage cost)
    - Logical view for code organization
    """
    return dp.read("customers").filter(F.col("status") == "active")

# ✅ Table: Final output consumed by analytics
@dp.table
def premium_customers():
    """
    Why @dp.table?
    - Final output for analytics dashboards
    - Multiple downstream consumers (BI, ML, reports)
    - Needs durability and performance
    """
    return (
        dp.read("active_customers")
        .filter(F.col("lifetime_value") > 10000)
    )

# ✅ Materialized View: Expensive aggregation, read frequently
@dp.materialized_view
def customer_lifetime_metrics():
    """
    Why @dp.materialized_view?
    - Expensive aggregation across orders (millions of rows)
    - Read by multiple dashboards and reports
    - Source data (orders) changes less frequently than reads
    - Pre-computation saves significant query time
    """
    return (
        dp.read("customers")
        .join(dp.read("orders"), "customer_id")
        .groupBy("customer_id", "country")
        .agg(
            F.sum("order_total").alias("lifetime_value"),
            F.count("order_id").alias("order_count"),
            F.avg("order_total").alias("avg_order_value"),
            F.min("order_date").alias("first_order_date"),
            F.max("order_date").alias("last_order_date")
        )
    )

print("✅ Selection criteria:")
print("   - Table: Source of truth, multiple readers, needs persistence")
print("   - Materialized View: Expensive aggregation, read frequently")
print("   - Temporary View: Simple logic, single reader, avoid storage cost")

## 3. Performance Anti-Patterns and Optimization Strategies

### Common Performance Anti-Patterns

#### Anti-Pattern 1: Redundant Materialization

In [None]:
# ❌ ANTI-PATTERN: Materializing simple filters as tables
# @dp.table
# def customers_us():  # Unnecessary table
#     return dp.read("customers").filter(F.col("country") == "US")
# 
# @dp.table
# def customers_uk():  # Unnecessary table
#     return dp.read("customers").filter(F.col("country") == "UK")
# 
# @dp.table
# def customers_ca():  # Unnecessary table
#     return dp.read("customers").filter(F.col("country") == "CA")

# ✅ CORRECT: Keep one partitioned table and filter downstream
@dp.table(
    partition_cols=["country"],  # Partition for efficient filtering
    comment="All customers partitioned by country"
)
def customers_partitioned():
    """Single table with efficient partitioning."""
    return spark.table("raw.customers")

# Downstream consumers filter efficiently using partition pruning
@dp.table
def us_customer_analysis():
    """Partition pruning eliminates need for separate table."""
    return (
        dp.read("customers_partitioned")
        .filter(F.col("country") == "US")  # Efficient: reads only US partition
    )

print("✅ Optimization: Use partitioning instead of separate tables")
print("   - Reduces storage cost (no duplicated data)")
print("   - Reduces maintenance (single table to manage)")
print("   - Maintains performance (partition pruning is efficient)")

#### Anti-Pattern 2: Over-Materialization of Intermediate Steps

In [None]:
# ❌ ANTI-PATTERN: Materializing every intermediate step
# @dp.table  # Unnecessary materialization
# def step1_filter():
#     return spark.table("raw.orders").filter(F.col("status") != "cancelled")
# 
# @dp.table  # Unnecessary materialization
# def step2_enrich():
#     return (
#         dp.read("step1_filter")
#         .withColumn("order_year", F.year("order_date"))
#     )
# 
# @dp.table  # Only this needs materialization
# def step3_aggregate():
#     return (
#         dp.read("step2_enrich")
#         .groupBy("order_year").agg(F.sum("total").alias("yearly_revenue"))
#     )

# ✅ CORRECT: Use temporary views for intermediate steps
@dp.temporary_view
def filtered_orders():
    """Temporary view: cheap filter, single use."""
    return spark.table("raw.orders").filter(F.col("status") != "cancelled")

@dp.temporary_view
def enriched_orders():
    """Temporary view: simple transformation, single use."""
    return (
        dp.read("filtered_orders")
        .withColumn("order_year", F.year("order_date"))
    )

@dp.table  # Only final result materialized
def yearly_revenue():
    """Table: final aggregation for analytics consumption."""
    return (
        dp.read("enriched_orders")
        .groupBy("order_year")
        .agg(F.sum("total").alias("yearly_revenue"))
    )

print("✅ Optimization: Temporary views for linear pipelines")
print("   - Catalyst optimizer fuses operations into single plan")
print("   - Reduces storage cost (no intermediate tables)")
print("   - Maintains or improves performance (fewer reads/writes)")

#### Anti-Pattern 3: Missing Partitioning on Large Tables

In [None]:
# ❌ ANTI-PATTERN: No partitioning on large time-series table
# @dp.table  # Missing partition_cols
# def events_no_partitions():
#     """Problem: Full table scans for date-based queries."""
#     return spark.table("raw.events")

# ✅ CORRECT: Partition by common filter columns
@dp.table(
    partition_cols=["event_date"],  # Partition by date for time-series queries
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def events_partitioned():
    """
    Partitioning strategy:
    - event_date: Common filter in analytics queries
    - Enables partition pruning (reads only relevant partitions)
    - Auto-optimize maintains healthy file sizes
    """
    return (
        spark.table("raw.events")
        .withColumn("event_date", F.to_date("event_timestamp"))
    )

# Downstream query benefits from partition pruning
@dp.table
def last_7_days_events():
    """
    Efficient: Reads only the most recent daily partitions instead of the entire table.
    """
    return (
        dp.read("events_partitioned")
        .filter(F.col("event_date") >= F.date_sub(F.current_date(), 7))
    )

print("✅ Optimization: Strategic partitioning")
print("   - Choose partition columns based on common filter patterns")
print("   - Avoid over-partitioning (target: 128MB-1GB per partition)")
print("   - Enable auto-optimize for partition management")

#### Anti-Pattern 4: Inefficient Join Patterns

In [None]:
# ❌ ANTI-PATTERN: Large table joined to small table without broadcast hint
# @dp.table
# def orders_with_country():
#     """Problem: Spark may choose a shuffle join if it cannot tell the table is small."""
#     return (
#         dp.read("orders")  # Large: millions of rows
#         .join(
#             dp.read("countries"),  # Small: ~200 rows, but not broadcast
#             "country_code"
#         )
#     )

# ✅ CORRECT: Broadcast small dimension tables
@dp.table
def orders_with_country_optimized():
    """
    Optimization: Broadcast small dimension table.
    - Avoids shuffle of large orders table
    - Reduces network I/O
    - Faster execution
    """
    return (
        dp.read("orders")
        .join(
            F.broadcast(dp.read("countries")),  # Broadcast hint for small table
            "country_code"
        )
    )

# ✅ ALTERNATIVE: Configure auto-broadcast threshold
# In pipeline configuration (not in function definition):
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")  # 10MB

print("✅ Optimization: Broadcast joins for dimension tables")
print("   - Use F.broadcast() for small tables (<10MB)")
print("   - Avoids expensive shuffle operations")
print("   - Configure autoBroadcastJoinThreshold for automatic detection")

## 4. Testing Strategies for Declarative Pipelines

### Testing Philosophy

Declarative pipelines require a **different testing approach** than imperative code:

1. **Unit Tests**: Test pure transformation functions in isolation
2. **Integration Tests**: Test table dependencies and data flow
3. **Expectation Tests**: Validate data quality rules before deployment
4. **End-to-End Tests**: Test full pipeline execution in development environment

### Pattern 1: Unit Testing Pure Transformations

In [None]:
# Pipeline definition with pure transformation
@dp.table
def customer_segments():
    """Segment customers by lifetime value."""
    return apply_segmentation(spark.table("raw.customers"))

def apply_segmentation(df: DataFrame) -> DataFrame:
    """
    Pure function: Testable independently of pipeline.
    
    Segments:
    - Premium: lifetime_value >= 10000
    - Standard: 1000 <= lifetime_value < 10000
    - Basic: lifetime_value < 1000
    """
    return df.withColumn(
        "segment",
        F.when(F.col("lifetime_value") >= 10000, "Premium")
        .when(F.col("lifetime_value") >= 1000, "Standard")
        .otherwise("Basic")
    )

# Unit test (in tests/test_transformations.py)
"""
import pytest
from pyspark.sql import SparkSession
from chispa.dataframe_comparer import assert_df_equality

def test_apply_segmentation(spark: SparkSession):
    # Arrange: Create test data
    input_data = [
        (1, "Alice", 15000),
        (2, "Bob", 5000),
        (3, "Charlie", 500)
    ]
    input_df = spark.createDataFrame(
        input_data, 
        ["customer_id", "name", "lifetime_value"]
    )
    
    # Act: Apply transformation
    result_df = apply_segmentation(input_df)
    
    # Assert: Validate segmentation
    expected_data = [
        (1, "Alice", 15000, "Premium"),
        (2, "Bob", 5000, "Standard"),
        (3, "Charlie", 500, "Basic")
    ]
    expected_df = spark.createDataFrame(
        expected_data,
        ["customer_id", "name", "lifetime_value", "segment"]
    )
    
    assert_df_equality(result_df, expected_df, ignore_nullable=True)
"""

print("✅ Testing Pattern: Extract pure functions for unit testing")
print("   - Transformation logic in pure function (testable)")
print("   - Pipeline definition delegates to pure function")
print("   - Use pytest + chispa for DataFrame assertions")

### Pattern 2: Testing Expectations Before Deployment

In [None]:
# Pipeline with expectations
@dp.table
# [.] matches a literal dot without relying on backslash escaping in the SQL string literal
@dp.expect("valid_email", "email RLIKE '^[^@]+@[^@]+[.][^@]+'")
@dp.expect_or_drop("adult_age", "age >= 18 AND age <= 120")
@dp.expect_or_fail("required_fields", "customer_id IS NOT NULL AND name IS NOT NULL")
def validated_customers():
    """Customer table with quality expectations."""
    return spark.table("raw.customers")

# Test expectations against sample data (in tests/test_expectations.py)
"""
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

def test_customer_expectations(spark: SparkSession):
    # Test data with various quality issues
    test_data = [
        (1, "Alice", "alice@example.com", 25),      # Valid
        (2, "Bob", "invalid-email", 30),            # Invalid email (WARN)
        (3, "Charlie", "charlie@example.com", 15),  # Underage (DROP)
        (None, "Dave", "dave@example.com", 35),     # Null ID (FAIL)
    ]
    test_df = spark.createDataFrame(
        test_data,
        ["customer_id", "name", "email", "age"]
    )
    
    # Test expectations
    # 1. WARN expectation: count violations but don't drop
    invalid_email_count = test_df.filter(
        ~F.col("email").rlike("^[^@]+@[^@]+\\.[^@]+")
    ).count()
    assert invalid_email_count == 1, "Expected 1 invalid email"
    
    # 2. DROP expectation: verify rows would be dropped
    underage_count = test_df.filter(
        ~((F.col("age") >= 18) & (F.col("age") <= 120))
    ).count()
    assert underage_count == 1, "Expected 1 underage customer to be dropped"
    
    # 3. FAIL expectation: verify pipeline would fail
    null_id_count = test_df.filter(
        F.col("customer_id").isNull() | F.col("name").isNull()
    ).count()
    assert null_id_count == 1, "Expected 1 null ID to cause pipeline failure"
"""

print("✅ Testing Pattern: Validate expectations with test data")
print("   - Create test datasets with known quality issues")
print("   - Verify expectation logic before deployment")
print("   - Prevent surprises in production pipelines")

### Pattern 3: Integration Testing with Mock Dependencies

In [None]:
# Pipeline with dependencies
@dp.table
def customer_order_summary():
    """Aggregate orders by customer."""
    return (
        dp.read("customers")
        .join(dp.read("orders"), "customer_id")
        .groupBy("customer_id", "name")
        .agg(
            F.sum("order_total").alias("total_spent"),
            F.count("order_id").alias("order_count")
        )
    )

# Integration test with mock tables (in tests/test_integration.py)
"""
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from chispa.dataframe_comparer import assert_df_equality

@pytest.fixture
def mock_customers(spark: SparkSession):
    # Mock customers table (a comment rather than a docstring: nested triple
    # quotes would end the enclosing string)
    data = [
        (1, "Alice"),
        (2, "Bob")
    ]
    df = spark.createDataFrame(data, ["customer_id", "name"])
    df.createOrReplaceTempView("customers")
    return df

@pytest.fixture
def mock_orders(spark: SparkSession):
    # Mock orders table
    data = [
        (101, 1, 100.0),
        (102, 1, 200.0),
        (103, 2, 150.0)
    ]
    df = spark.createDataFrame(data, ["order_id", "customer_id", "order_total"])
    df.createOrReplaceTempView("orders")
    return df

def test_customer_order_summary(spark: SparkSession, mock_customers, mock_orders):
    # Act: Execute pipeline logic
    result_df = (
        spark.table("customers")
        .join(spark.table("orders"), "customer_id")
        .groupBy("customer_id", "name")
        .agg(
            F.sum("order_total").alias("total_spent"),
            F.count("order_id").alias("order_count")
        )
    )
    
    # Assert: Validate aggregation
    expected_data = [
        (1, "Alice", 300.0, 2),
        (2, "Bob", 150.0, 1)
    ]
    expected_df = spark.createDataFrame(
        expected_data,
        ["customer_id", "name", "total_spent", "order_count"]
    )
    
    assert_df_equality(result_df, expected_df, ignore_nullable=True)
"""

print("✅ Testing Pattern: Mock dependencies for integration tests")
print("   - Use pytest fixtures to create mock tables")
print("   - Test joins and aggregations with controlled data")
print("   - Validate end-to-end data flow")

## 5. Migration Checklist: Imperative to Declarative

### Step-by-Step Migration Process

**Phase 1: Preparation**
1. ✅ **Audit current pipeline**: Document all transformations, dependencies, and schedules
2. ✅ **Identify prohibited operations**: Find `.collect()`, `.count()`, `.write`, and `.cache()` usages
3. ✅ **Review table materialization**: Identify which intermediate steps need persistence
4. ✅ **Document data quality rules**: Extract validation logic for expectation migration

**Phase 2: Code Migration**
5. ✅ **Update imports**: Change `import dlt` to `from pyspark import pipelines as dp`
6. ✅ **Convert decorators**: Replace `@dlt.table` with `@dp.table`, etc.
7. ✅ **Remove prohibited operations**: Refactor materialization and writes
8. ✅ **Add table configurations**: Include partition_cols, table_properties, paths
9. ✅ **Migrate expectations**: Convert validation logic to `@dp.expect` patterns
10. ✅ **Extract pure functions**: Separate transformation logic for testing

**Phase 3: Optimization**
11. ✅ **Apply table type optimization**: Convert unnecessary tables to temporary views
12. ✅ **Add partitioning**: Partition large tables by common filter columns
13. ✅ **Optimize joins**: Add broadcast hints for dimension tables
14. ✅ **Configure auto-optimize**: Enable Delta Lake auto-compaction

**Phase 4: Testing**
15. ✅ **Create unit tests**: Test pure transformation functions
16. ✅ **Test expectations**: Validate quality rules with sample data
17. ✅ **Integration tests**: Test table dependencies with mocks
18. ✅ **Development deployment**: Run pipeline in development environment

**Phase 5: Deployment**
19. ✅ **Deploy to staging**: Validate with production-like data
20. ✅ **Monitor metrics**: Check execution time, data quality, resource usage
21. ✅ **Production deployment**: Blue-green or canary deployment strategy
22. ✅ **Post-deployment validation**: Verify data correctness and performance
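
Steps 5 and 6 of Phase 2 are mechanical enough to script. The sketch below is a deliberately simple textual rewrite using the standard-library `re` module; `migrate_dlt_source` is a hypothetical helper. It only handles the plain `import dlt` form, it will also rewrite `dlt.` inside comments or strings, and whether every `dlt.*` call has a `dp.*` counterpart should be checked against the API reference before relying on the output.

```python
import re


def migrate_dlt_source(source: str) -> str:
    """Rewrite `import dlt` and `dlt.` references to the `dp` alias."""
    # Replace the import line (only the plain `import dlt` form is handled).
    source = re.sub(
        r"^import dlt[ \t]*$",
        "from pyspark import pipelines as dp",
        source,
        flags=re.MULTILINE,
    )
    # Replace attribute access such as `@dlt.table` or `@dlt.expect(...)`.
    return re.sub(r"\bdlt\.", "dp.", source)


legacy = (
    "import dlt\n"
    "\n"
    "@dlt.table\n"
    "@dlt.expect('valid_id', 'customer_id IS NOT NULL')\n"
    "def customers():\n"
    "    return spark.table('raw.customers')\n"
)

migrated = migrate_dlt_source(legacy)
```

Treat the result as a starting point for review: removing prohibited operations and converting validation logic to expectations (steps 7 and 9) still require manual changes.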

### Migration Example: Before and After

In [None]:
# ❌ BEFORE: Imperative pipeline with anti-patterns
"""
import dlt
from pyspark.sql import functions as F

@dlt.table
def raw_customers():
    # Read data
    df = spark.read.format("parquet").load("/mnt/raw/customers")
    
    # Prohibited operation: count for validation
    if df.count() == 0:
        raise ValueError("No customer data")
    
    return df

@dlt.table
def filtered_customers():  # Unnecessary materialization
    return (
        dlt.read("raw_customers")
        .filter(F.col("status") == "active")
    )

@dlt.table
def premium_customers():  # No expectations
    df = (
        dlt.read("filtered_customers")
        .filter(F.col("lifetime_value") > 10000)
    )
    
    # Prohibited operation: manual write
    df.write.format("delta").mode("overwrite").save("/mnt/gold/premium")
    
    return df
"""

# ✅ AFTER: Declarative pipeline with best practices
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table(
    name="raw_customers",
    partition_cols=["country"],
    comment="Raw customer data from source system"
)
@dp.expect_or_fail("valid_customer_id", "customer_id IS NOT NULL")  # Row-level check; does not detect an empty source
def raw_customers():
    """Source table with data quality enforcement."""
    return spark.read.format("parquet").load("/mnt/raw/customers")

@dp.temporary_view  # Optimized: temporary view for simple filter
def filtered_customers():
    """Active customers - logical view for code organization."""
    return dp.read("raw_customers").filter(F.col("status") == "active")

@dp.table(
    name="premium_customers",
    path="/mnt/gold/premium",  # Lakeflow manages writes
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    },
    comment="Premium tier customers (LTV > $10,000)"
)
@dp.expect("high_value", "lifetime_value > 10000")  # Quality expectation
@dp.expect_or_drop("valid_contact", "email IS NOT NULL AND phone IS NOT NULL")
def premium_customers():
    """Premium customer segment with quality guarantees."""
    return apply_premium_filter(dp.read("filtered_customers"))

def apply_premium_filter(df):
    """Pure function: testable premium customer logic."""
    return df.filter(F.col("lifetime_value") > 10000)

print("✅ Migration improvements:")
print("   - Removed prohibited operations (.count(), .write)")
print("   - Replaced validation with expectations")
print("   - Converted intermediate table to temporary view")
print("   - Added partitioning and table properties")
print("   - Extracted testable pure function")
print("   - Added comprehensive metadata (comments, expectations)")

## 6. Troubleshooting Common Pipeline Issues

### Issue 1: Pipeline Fails with "Cannot perform action in table definition"

**Symptom**: Error during pipeline creation or deployment
```
Error: Cannot perform action 'count' in table definition for 'my_table'
```

**Root Cause**: Prohibited operation (`.count()`, `.collect()`, `.show()`) in pipeline function

**Solution**: Remove action and use declarative alternatives

In [None]:
# ❌ Problematic code
# @dp.table
# def my_table():
#     df = spark.table("source")
#     if df.count() == 0:  # Prohibited action
#         raise ValueError("No data")
#     return df

# ✅ Solution: Use expectations
@dp.table
@dp.expect_or_fail("valid_id", "id IS NOT NULL")
def my_table_fixed():
    """Pipeline fails if any row has a NULL id (declarative, row-level validation)."""
    return spark.table("source")

print("✅ Solution: Replace actions with expectations")

### Issue 2: Slow Pipeline Performance

**Symptom**: Pipeline takes significantly longer than expected

**Diagnosis Checklist**:
1. ✅ Check for missing partitioning on large tables
2. ✅ Verify broadcast hints for small dimension joins
3. ✅ Look for over-materialization (too many tables instead of views)
4. ✅ Review shuffle operations in Spark UI
5. ✅ Confirm auto-optimize is enabled

**Solution Pattern**: Add partitioning and broadcast hints
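
Items 1 and 2 of the checklist can be sanity-checked with simple arithmetic before touching the pipeline. The sketch below uses the rules of thumb quoted in this notebook (128 MB to 1 GB per partition, broadcast for tables under 10 MB); `assess_partitioning` and `should_broadcast` are hypothetical helpers, and the partition estimate assumes data is spread evenly across partition values.

```python
MB = 1024 * 1024
GB = 1024 * MB

# Thresholds quoted in this notebook; treat them as rules of thumb.
MIN_PARTITION_BYTES = 128 * MB
MAX_PARTITION_BYTES = 1 * GB
BROADCAST_LIMIT_BYTES = 10 * MB


def assess_partitioning(table_bytes: int, distinct_partition_values: int) -> str:
    """Classify the average partition size, assuming evenly distributed data."""
    avg = table_bytes / distinct_partition_values
    if avg < MIN_PARTITION_BYTES:
        return "over-partitioned"
    if avg > MAX_PARTITION_BYTES:
        return "under-partitioned"
    return "ok"


def should_broadcast(table_bytes: int) -> bool:
    """True when the table fits under the 10 MB broadcast guideline."""
    return table_bytes < BROADCAST_LIMIT_BYTES


# A 10 GB table split into 1,000 partition values averages about 10 MB each.
small_partitions = assess_partitioning(10 * GB, 1000)

# A 200 KB dimension table is a broadcast candidate; a 50 MB one is not.
broadcast_small = should_broadcast(200 * 1024)
broadcast_large = should_broadcast(50 * MB)
```

Skewed data breaks the even-distribution assumption, so the Spark UI remains the authority on actual partition and shuffle sizes.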

In [None]:
# ❌ Slow: Missing optimizations
# @dp.table  # No partitioning
# def slow_events():
#     return (
#         spark.table("raw.events")  # Large table
#         .join(spark.table("dim.countries"), "country_code")  # Small table, no broadcast
#     )

# ✅ Fast: Optimized version
@dp.table(
    partition_cols=["event_date"],  # Partition for queries
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def fast_events():
    """Optimized with partitioning and broadcast join."""
    return (
        spark.table("raw.events")
        .withColumn("event_date", F.to_date("event_timestamp"))
        .join(
            F.broadcast(spark.table("dim.countries")),  # Broadcast small table
            "country_code"
        )
    )

print("✅ Solution: Add partitioning and broadcast joins")

### Issue 3: Expectation Failures in Production

**Symptom**: Pipeline fails with expectation violations
```
Error: Expectation 'valid_age' failed: 150 violations found
```

**Diagnosis**:
1. Review expectation metrics in Lakeflow UI
2. Query violation records for patterns
3. Assess if expectations are too strict or data has quality issues
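
When weighing the options, it helps to see what each action does to the same input. The toy model below reproduces the WARN, DROP, and FAIL behaviours in plain Python; `apply_expectation`, `ExpectationFailed`, and `valid_age` are hypothetical names, and the model treats a NULL age as a violation, which is an assumption about how the SQL condition evaluates. It illustrates the semantics only and is not the Lakeflow implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[int]]


class ExpectationFailed(Exception):
    """Raised by the toy 'fail' action."""


def apply_expectation(
    rows: List[Row], predicate: Callable[[Row], bool], action: str
) -> Tuple[List[Row], int]:
    """Return (surviving_rows, violation_count) for 'warn', 'drop', or 'fail'."""
    violations = [r for r in rows if not predicate(r)]
    if action == "warn":
        return rows, len(violations)   # keep every row, record the metric
    if action == "drop":
        kept = [r for r in rows if predicate(r)]
        return kept, len(violations)   # remove violating rows
    if action == "fail":
        if violations:
            raise ExpectationFailed(f"{len(violations)} violations found")
        return rows, 0
    raise ValueError(f"unknown action: {action}")


def valid_age(row: Row) -> bool:
    """Mirror of 'age >= 0 AND age <= 120'; None counts as a violation."""
    age = row["age"]
    return age is not None and 0 <= age <= 120


rows: List[Row] = [{"age": 25}, {"age": 150}, {"age": None}]

warn_rows, warn_violations = apply_expectation(rows, valid_age, "warn")
drop_rows, drop_violations = apply_expectation(rows, valid_age, "drop")

try:
    apply_expectation(rows, valid_age, "fail")
    fail_raised = False
except ExpectationFailed:
    fail_raised = True
```

All three actions report the same two violations; they differ only in whether the rows survive and whether the update stops.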

**Solution Options**:

In [None]:
# ✅ Option 1: Change FAIL to WARN for monitoring
@dp.table
@dp.expect("valid_age", "age >= 0 AND age <= 120")  # WARN: monitor but don't fail
def customers_with_monitoring():
    """Monitor age violations without failing pipeline."""
    return spark.table("raw.customers")

# ✅ Option 2: Change FAIL to DROP for cleansing
@dp.table
@dp.expect_or_drop("valid_age", "age >= 0 AND age <= 120")  # DROP: remove bad rows
def customers_with_cleansing():
    """Drop invalid age records automatically."""
    return spark.table("raw.customers")

# ✅ Option 3: Adjust expectation logic
@dp.table
@dp.expect_or_fail(
    "valid_age_or_null",
    "age IS NULL OR (age >= 0 AND age <= 120)"  # Allow nulls
)
def customers_with_relaxed_rule():
    """Accept null ages, fail only on invalid values."""
    return spark.table("raw.customers")

print("✅ Solution: Adjust expectation strategy based on business requirements")

### Issue 4: Table Dependency Errors

**Symptom**: Error about missing or circular dependencies
```
Error: Table 'orders_summary' depends on 'orders' which is not defined
```

**Root Causes**:
1. Table name mismatch (function name vs configured name)
2. Circular dependency (A depends on B, B depends on A)
3. Cross-pipeline dependency not properly configured

**Solution**: Verify table names and dependency graph
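
Both undefined references and cycles can be detected offline from a plain description of the graph. The sketch below uses the standard-library `graphlib` module (Python 3.9+); `validate_dependencies` is a hypothetical helper, and the dependency dictionary is assumed to be maintained by hand or extracted by your own tooling.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def validate_dependencies(
    graph: Dict[str, Set[str]], external_sources: Set[str]
) -> List[str]:
    """Return problems found: undefined references and circular dependencies."""
    problems = []
    defined = set(graph)
    for table, deps in sorted(graph.items()):
        for dep in sorted(deps):
            # A dependency must be a pipeline table or a known external source.
            if dep not in defined and dep not in external_sources:
                problems.append(f"'{table}' depends on '{dep}' which is not defined")
    try:
        # static_order() raises CycleError if the graph contains a cycle.
        list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        cycle = " -> ".join(err.args[1])
        problems.append(f"circular dependency: {cycle}")
    return problems


# Name mismatch: the table is registered as 'orders_clean' but read as 'orders'.
mismatch = validate_dependencies(
    {"orders_clean": {"raw.orders"}, "orders_summary": {"orders"}},
    external_sources={"raw.orders"},
)

# Circular dependency: a reads b, b reads a.
circular = validate_dependencies({"a": {"b"}, "b": {"a"}}, external_sources=set())
```

Each key is a configured table name and each value is the set of names it reads, so the name-mismatch case shows up as an undefined reference.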

In [None]:
# ❌ Problem: Name mismatch
# @dp.table(name="orders_clean")  # Configured name
# def orders():  # Function name
#     return spark.table("raw.orders")
# 
# @dp.table
# def orders_summary():
#     return dp.read("orders")  # Error: references function name, not configured name

# ✅ Solution: Use consistent naming
@dp.table(name="orders_clean")
def orders_clean_table():
    """Use configured name consistently."""
    return spark.table("raw.orders")

@dp.table
def orders_summary_fixed():
    """Reference configured name, not function name."""
    return dp.read("orders_clean")  # Correct: uses configured name

print("✅ Solution: Match dp.read() calls to configured table names")

### Issue 5: Streaming Table Checkpoint Errors

**Symptom**: Streaming table fails with checkpoint errors
```
Error: Checkpoint directory '/tmp/checkpoints/my_stream' is corrupt
```

**Root Causes**:
1. Schema changes in source without checkpoint reset
2. Manual checkpoint directory deletion
3. Multiple pipelines using same checkpoint location

**Solution**: Let Lakeflow manage checkpoints, and allow a full refresh to recover from incompatible schema changes

In [None]:
# ✅ Best practice: let Lakeflow manage the checkpoint; do not set a checkpoint path yourself
@dp.table(
    name="events_stream",
    table_properties={
        "pipelines.reset.allowed": "true"  # Permit a full refresh, which also resets the checkpoint
    }
)
def events_stream():
    """
    Streaming table with a pipeline-managed checkpoint.
    - Lakeflow keeps a separate checkpoint per streaming flow, so paths cannot collide
    - reset.allowed permits a full refresh to recover from incompatible schema changes
    """
    return spark.readStream.table("raw.events")

print("✅ Solution: Let Lakeflow manage checkpoints and allow resets")
print("   - Never share or hand-edit checkpoint directories")
print("   - Set reset.allowed so a full refresh can recover from schema changes")
print("   - Monitor streaming health in the Lakeflow UI")
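
Root cause 1 is the easiest to detect ahead of time. As a rough sketch, assuming source schemas are available as plain column-name to type-name mappings, a check like the following could flag changes that are likely to need a reset. The name `classify_schema_change` and its three labels are hypothetical, and the rule that only added columns count as compatible is a simplification, not Lakeflow's actual schema evolution logic.

```python
def classify_schema_change(old: dict, new: dict) -> str:
    """Classify a source schema change as 'unchanged', 'additive', or 'breaking'.

    Simplifying assumption: added columns are safe, while dropped columns or
    changed types call for a reset (full refresh) of the streaming table.
    """
    dropped = old.keys() - new.keys()
    retyped = {name for name in old.keys() & new.keys() if old[name] != new[name]}
    if dropped or retyped:
        return "breaking"
    if new.keys() - old.keys():
        return "additive"
    return "unchanged"


# Hypothetical schemas for raw.events before and after a source change.
before = {"event_id": "string", "ts": "timestamp"}
after = {"event_id": "string", "ts": "string"}
change = classify_schema_change(before, after)
```

Running such a check in CI against the previous schema gives an early warning before the streaming table fails in production.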

## Summary: Lakeflow Best Practices

### ✅ Do's

**Pipeline Design**:
- ✅ Use pure functions that return DataFrames
- ✅ Let Lakeflow manage execution, writes, and checkpoints
- ✅ Use expectations for data quality validation
- ✅ Choose appropriate table types (table vs materialized view vs temporary view)

**Performance Optimization**:
- ✅ Partition large tables by common filter columns
- ✅ Use broadcast joins for small dimension tables
- ✅ Enable Delta Lake auto-optimize
- ✅ Use temporary views for simple intermediate steps

**Testing & Quality**:
- ✅ Extract pure transformation functions for unit testing
- ✅ Test expectations with sample data before deployment
- ✅ Use integration tests with mock dependencies
- ✅ Monitor expectation metrics in production

**Code Organization**:
- ✅ Add comprehensive metadata (comments, table properties)
- ✅ Use consistent naming between function names and table names
- ✅ Document dependencies and data lineage
- ✅ Version control pipeline code

### ❌ Don'ts

**Prohibited Operations**:
- ❌ Never use `.collect()`, `.count()`, `.show()` in pipeline functions
- ❌ Never call `.write`, `.save()`, or `.saveAsTable()` in pipeline functions; let Lakeflow manage writes
- ❌ Never use `.cache()` or `.persist()` - Lakeflow optimizes caching
- ❌ Never manually manage checkpoints for streaming tables

**Anti-Patterns**:
- ❌ Don't materialize every intermediate step as a table
- ❌ Don't skip partitioning on large tables
- ❌ Don't use shuffle joins for small dimension tables
- ❌ Don't mix imperative and declarative patterns

**Testing & Deployment**:
- ❌ Don't deploy without testing expectations
- ❌ Don't skip integration testing with realistic data
- ❌ Don't use the FAIL strategy (`expect_or_fail`) for exploratory data quality checks; start with warn-only expectations
- ❌ Don't ignore expectation violation metrics
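
The last two points depend on how strictly an expectation is enforced. The pure-Python sketch below imitates the three enforcement modes (warn, drop, fail) on a list of dictionaries. It only illustrates the semantics and is not Lakeflow's implementation; `apply_expectation`, `ExpectationFailed`, and the `mode` values are hypothetical names.

```python
from typing import Callable


class ExpectationFailed(Exception):
    """Raised in 'fail' mode when at least one row violates the rule."""


def apply_expectation(rows: list, rule: Callable[[dict], bool], mode: str = "warn"):
    """Return (rows kept, number of violations) for a single expectation.

    warn: keep every row and only count violations.
    drop: remove violating rows and count them.
    fail: raise if any row violates the rule.
    """
    if mode not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown mode: {mode}")
    violations = [row for row in rows if not rule(row)]
    if mode == "fail" and violations:
        raise ExpectationFailed(f"{len(violations)} row(s) violated the rule")
    kept = [row for row in rows if rule(row)] if mode == "drop" else list(rows)
    return kept, len(violations)


def is_adult(row: dict) -> bool:
    return row["age"] >= 18


sample = [{"age": 25}, {"age": 12}, {"age": 40}]
kept_warn, bad_warn = apply_expectation(sample, is_adult, "warn")
kept_drop, bad_drop = apply_expectation(sample, is_adult, "drop")
```

With exploratory data, the warn mode reports the violation count without stopping the update, which is why the fail mode is better reserved for rules that are already well understood.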

### Alignment with Functional Programming Principles

Lakeflow declarative pipelines **embody functional programming principles**:

1. **Pure Functions**: Pipeline functions return DataFrames without side effects
2. **Immutability**: DataFrames are immutable; transformations create new DataFrames
3. **Lazy Evaluation**: Transformations are not executed when defined, so Lakeflow can analyze and optimize the entire DAG before execution
4. **Declarative Composition**: Describe what to compute, not how to execute
5. **Testability**: Pure functions are naturally testable in isolation

**Result**: Maintainable, performant, and reliable data pipelines with automatic optimization.

## Exercises

### Exercise 1: Identify Anti-Patterns

Review the following pipeline code and identify all anti-patterns. Then refactor to best practices.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def raw_sales():
    df = spark.read.parquet("/mnt/raw/sales")
    if df.count() < 1000:
        print("Warning: Low sales volume")
    return df

@dlt.table
def filtered_sales():
    return dlt.read("raw_sales").filter(F.col("amount") > 0)

@dlt.table
def enriched_sales():
    return (
        dlt.read("filtered_sales")
        .withColumn("year", F.year("sale_date"))
    )

@dlt.table
def sales_summary():
    df = (
        dlt.read("enriched_sales")
        .groupBy("year").agg(F.sum("amount").alias("total"))
    )
    df.write.format("delta").mode("overwrite").save("/mnt/gold/summary")
    return df
```

**Tasks**:
1. List all anti-patterns found
2. Refactor to use `pyspark.pipelines` (dp)
3. Add appropriate table types (table vs temporary view)
4. Add data quality expectations
5. Add performance optimizations

### Exercise 2: Design Table Type Strategy

Given the following pipeline requirements, choose the appropriate table type for each and justify your decision:

1. **customer_master**: Customer records (10M rows), read by 5+ downstream pipelines, needs partitioning by country
2. **active_customers**: Simple filter of customer_master (status='active'), used by only 1 downstream table
3. **customer_segments**: Expensive segmentation logic joining customers with order history, used by multiple dashboards
4. **temp_date_filter**: Adds year/month columns for filtering, used only in next transformation step
5. **gold_customer_analytics**: Final output for BI consumption

**Tasks**:
- For each dataset, choose `@dp.table`, `@dp.materialized_view`, or `@dp.temporary_view`
- Justify your choice using the decision matrix
- Add appropriate configurations (partitioning, table properties)

### Exercise 3: Testing Strategy Implementation

Create a complete testing suite for the following pipeline:

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F  # needed for F.sum below

@dp.table
@dp.expect("valid_email", "email RLIKE '^[^@]+@[^@]+\\.[^@]+'")
@dp.expect_or_drop("adult_age", "age >= 18")
def validated_users():
    return spark.table("raw.users")

@dp.table
def user_purchase_summary():
    return (
        dp.read("validated_users")
        .join(dp.read("purchases"), "user_id")
        .groupBy("user_id", "email")
        .agg(F.sum("amount").alias("total_spent"))
    )
```

**Tasks**:
1. Write unit tests for transformation logic
2. Create expectation tests with edge cases (invalid email, underage users)
3. Build integration tests with mock tables
4. Document test data requirements

### Exercise 4: Performance Optimization

Optimize the following slow pipeline:

```python
from pyspark import pipelines as dp

@dp.table
def slow_events():
    return spark.table("raw.events")  # 100M rows, daily queries by date

@dp.table
def events_with_location():
    return (
        dp.read("slow_events")
        .join(spark.table("dim.locations"), "location_id")  # 500 rows
    )
```

**Tasks**:
1. Add partitioning strategy
2. Optimize the join
3. Add Delta Lake optimization settings
4. Estimate performance improvement

### Exercise 5: Migration Planning

Plan a migration from the following imperative code to declarative Lakeflow:

```python
# Current imperative code
from pyspark.sql import functions as F
df_raw = spark.read.parquet("/mnt/raw/transactions")
df_filtered = df_raw.filter(F.col("amount") > 0)
df_filtered.write.format("delta").mode("append").save("/mnt/bronze/transactions")

df_bronze = spark.read.format("delta").load("/mnt/bronze/transactions")
if df_bronze.filter(F.col("customer_id").isNull()).count() > 0:
    raise ValueError("Null customer_id found")

df_enriched = df_bronze.join(
    spark.table("dim.customers"),
    "customer_id"
)
df_enriched.write.format("delta").mode("overwrite").save("/mnt/silver/transactions")
```

**Tasks**:
1. Create migration checklist
2. Identify all prohibited operations
3. Design declarative pipeline with expectations
4. Add testing strategy
5. Document rollback plan

## Next Steps

**Congratulations!** You've completed Section 6 on Declarative Pipelines with `pyspark.pipelines`.

**You now understand**:
- ✅ Prohibited operations in declarative pipelines and their declarative alternatives
- ✅ Decision criteria for table vs materialized view vs temporary view selection
- ✅ Common performance anti-patterns and optimization strategies
- ✅ Comprehensive testing approaches for declarative pipelines
- ✅ Systematic migration process from imperative to declarative patterns
- ✅ Troubleshooting techniques for common pipeline issues

**Continue your learning**:
- **Appendix 1.1**: Modular Design and Project Structure
- **Appendix 1.2**: Dependency Management and Package Distribution
- **Practice**: Apply these patterns to your production pipelines
- **Databricks Documentation**: [Lakeflow Declarative Pipelines Guide](https://docs.databricks.com/workflows/delta-live-tables/index.html)

**Recommended actions**:
1. Review your existing pipelines for anti-patterns
2. Create a migration plan for imperative code
3. Implement a testing suite for critical pipelines
4. Monitor expectation metrics in production

**Remember**: Declarative pipelines leverage functional programming principles to create maintainable, testable, and performant data workflows with automatic platform optimization.