# 6.5 Flows and Advanced CDC Patterns

This notebook explores advanced data integration patterns using Lakeflow flows and change data capture (CDC). We'll learn how to use `dp.append_flow()` for incremental loading, `dp.create_auto_cdc_flow()` for automatic change tracking, and implement Type 1 and Type 2 slowly changing dimensions following functional programming principles.

## Learning Objectives

By the end of this notebook, you will understand how to:
- Use `dp.append_flow()` for efficient incremental data loading
- Implement automatic CDC with `dp.create_auto_cdc_flow()`
- Process CDC from snapshot tables with `dp.create_auto_cdc_from_snapshot_flow()`
- Handle inserts, updates, and deletes declaratively
- Implement Type 1 slowly changing dimensions (overwrite)
- Build Type 2 slowly changing dimensions (history tracking)
- Design merge strategies and handle conflicts
- Apply functional programming to CDC workflows

## Prerequisites

- Completion of Notebooks 6.1-6.4
- Understanding of change data capture concepts
- Knowledge of slowly changing dimensions
- Familiarity with Delta Lake merge operations

```python
# Platform setup detection
# In Databricks: Keep commented
# In Local: Uncomment this line
# %run 00_Environment_Setup.ipynb
```

```python
# Essential imports
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F
from pyspark.sql.types import *
from typing import Dict, List, Optional
from dataclasses import dataclass
from datetime import datetime, timedelta

# In a real Lakeflow pipeline:
# from pyspark import pipelines as dp

print("✅ Imports complete - Ready for flows and CDC demonstration!")
```

## 1. Understanding Flows in Lakeflow

### What Are Flows?

**Flows** are high-level abstractions for common data integration patterns. They encapsulate complex logic into simple, declarative API calls.

### Types of Flows

| Flow Type | Purpose | Use Case |
|-----------|---------|----------|
| **append_flow** | Incremental append | Growing event logs, audit trails |
| **create_auto_cdc_flow** | Automatic CDC | Database replication, real-time sync |
| **create_auto_cdc_from_snapshot_flow** | CDC from snapshots | Daily dumps, batch CDC |
| **create_sink** | Custom output | External system integration |

### Flows vs Tables

**Traditional Table Definition**:
```python
@dp.streaming_table
def my_table():
    return spark.readStream.table("source")
    # Manual logic for transformations
```

**Flow-Based Pattern**:
```python
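# The target streaming table must exist before a flow can write to it
dp.create_streaming_table("my_table")

# append_flow is a decorator: the function body defines what gets appended
@dp.append_flow(target="my_table")
def my_table_from_source():
    return spark.readStream.table("source").select("col1", "col2", "col3")
# Several append flows can share one target; no deduplication is applied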
```

**Benefits of Flows**:
- Less boilerplate code
- Built-in best practices
- Automatic optimization
- Standardized patterns
- Easier maintenance

## 2. Incremental Loading with append_flow()

### Basic Append Flow

```python
from pyspark import pipelines as dp

# Source: New events arriving continuously
@dp.streaming_table
def bronze_events():
    return spark.readStream.table("raw.events")

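# Target: the silver table must be declared before a flow writes to it
dp.create_streaming_table("silver_events")

# Append only new events to the silver layer
@dp.append_flow(target="silver_events")
def append_bronze_events():
    return (
        spark.readStream.table("bronze_events")
        .select("event_id", "event_type", "user_id", "timestamp", "metadata")
    )

# What this append flow does:
# 1. Reads new records from the source incrementally
# 2. Selects only the specified columns
# 3. Appends them to the target table
# 4. Processes each source record once (tracked by streaming checkpoints);
#    duplicates already present in the source are not removed
# 5. Does not guarantee row order in the target (sort at query time)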
```

### Append Flow with Transformations

```python
# Create source with transformations
@dp.streaming_table
def bronze_logs_transformed():
    """
    Apply transformations before append.
    Keep transformations pure and testable.
    """
    return (
        spark.readStream.table("raw.application_logs")
        .withColumn(
            "log_timestamp",
            F.to_timestamp("timestamp_str", "yyyy-MM-dd HH:mm:ss")
        )
        .withColumn(
            "log_level",
            F.upper(F.col("level"))
        )
        .withColumn(
            "processed_at",
            F.current_timestamp()
        )
    )

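# Declare the target, then append the transformed data
dp.create_streaming_table("silver_application_logs")

@dp.append_flow(target="silver_application_logs")
def append_transformed_logs():
    return (
        spark.readStream.table("bronze_logs_transformed")
        .select(
            "log_id",
            "log_timestamp",
            "log_level",
            "message",
            "application",
            "processed_at",
        )
    )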
```

### Append Flow for Batch Data

```python
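# Daily files land in batches; Auto Loader picks up only files it has not read yet.
# (A plain spark.read over the glob would re-read every file on each run.)
@dp.table
def daily_sales_batch():
    """Daily sales files ingested incrementally"""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("/mnt/sales/daily/")
    )

# Declare the target, then append to historical sales
dp.create_streaming_table("historical_sales")

@dp.append_flow(target="historical_sales")
def append_daily_sales():
    return (
        spark.readStream.table("daily_sales_batch")
        .select("sale_id", "sale_date", "customer_id", "product_id", "quantity", "amount")
    )
# Each pipeline run appends only the newly arrived daily files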
```

### When to Use append_flow()

**✅ Use append_flow for:**
- Event logs (clickstreams, audit logs, IoT data)
- Transaction records (orders, payments, shipments)
- Time-series data (metrics, sensor readings)
- Append-only data sources
- Growing datasets without updates

**❌ Don't use append_flow for:**
- Data requiring updates (use CDC flows)
- Dimension tables with changes (use SCD patterns)
- Data with deletions (use merge operations)
- Full snapshot replacements (use regular tables)

## 3. Automatic CDC with create_auto_cdc_flow()

### What is Change Data Capture (CDC)?

CDC tracks changes (inserts, updates, deletes) in source systems and replicates them to target systems.

**CDC Change Events:**
```
Operation | customer_id | name    | email              | _change_type
----------|-------------|---------|--------------------|--------------
INSERT    | 1           | Alice   | alice@example.com  | insert
UPDATE    | 1           | Alice J | alice@example.com  | update
DELETE    | 1           | Alice J | alice@example.com  | delete
```
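
To make the semantics concrete, here is a minimal plain-Python sketch of how the three events above could be folded into a current-state table. It is an illustration only, not Lakeflow code: `apply_cdc_event` and the dict-based `target` are hypothetical names, and a real pipeline would also order events by a sequence column.

```python
from typing import Any, Dict, List

def apply_cdc_event(target: Dict[int, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Fold one change event into `target`, keyed by customer_id."""
    key = event["customer_id"]
    if event["_change_type"] == "delete":
        target.pop(key, None)  # deleting a missing key is a no-op
    else:
        # insert and update are both upserts on the key
        target[key] = {"name": event["name"], "email": event["email"]}

events: List[Dict[str, Any]] = [
    {"customer_id": 1, "name": "Alice", "email": "alice@example.com", "_change_type": "insert"},
    {"customer_id": 1, "name": "Alice J", "email": "alice@example.com", "_change_type": "update"},
    {"customer_id": 1, "name": "Alice J", "email": "alice@example.com", "_change_type": "delete"},
]

target: Dict[int, Dict[str, Any]] = {}
states = []
for event in events:
    apply_cdc_event(target, event)
    states.append(dict(target))  # snapshot of the state after each event
```

After the insert the table holds one row for customer 1, the update overwrites its name, and the delete leaves the table empty.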

### Basic CDC Flow

```python
from pyspark import pipelines as dp

# Source: CDC stream from database
@dp.streaming_table
def bronze_customer_changes():
    """
    CDC events from the source database.
    Needs key columns and a sequencing column; a change-type column
    (here _change_type) is needed only so deletes can be identified.
    """
    return (
        spark.readStream
        .format("delta")
        .table("raw.customer_cdc")
    )

# The target streaming table must exist before the CDC flow is defined.
# Shorter examples below assume the target has been declared the same way.
dp.create_streaming_table("silver_customers")

# Apply CDC events to maintain current state
dp.create_auto_cdc_flow(
    source="bronze_customer_changes",
    target="silver_customers",
    keys=["customer_id"],              # Primary key for matching
    sequence_by="update_timestamp",    # Order operations
    apply_as_deletes=F.expr("_change_type = 'delete'"),  # Which events are deletes
    stored_as_scd_type=1               # Type 1: overwrite on update
)

# What create_auto_cdc_flow does:
# 1. Reads the CDC stream from source
# 2. Upserts events by key (inserts new keys, updates existing ones)
# 3. Deletes rows for events matching apply_as_deletes
# 4. Handles out-of-order events using sequence_by
# 5. Keeps one current row per key in the target
```

### CDC with Change Type Column

```python
# A _change_type (or operation) column lets the flow tell deletes from upserts
@dp.streaming_table
def order_cdc_stream():
    return (
        spark.readStream.table("raw.orders_cdc")
        # Expected schema:
        # - order_id (key)
        # - customer_id, product_id, amount, status (data)
        # - _change_type: 'insert', 'update', 'delete'
        # - update_timestamp (sequence)
    )

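dp.create_auto_cdc_flow(
    source="order_cdc_stream",
    target="current_orders",
    keys=["order_id"],
    sequence_by="update_timestamp",
    # Without apply_as_deletes, delete events would be upserted like any other row
    apply_as_deletes=F.expr("_change_type = 'delete'")
)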
```

### Handling Multiple Keys

```python
# Composite key example
@dp.streaming_table
def order_line_items_cdc():
    return spark.readStream.table("raw.order_line_items_cdc")

dp.create_auto_cdc_flow(
    source="order_line_items_cdc",
    target="current_order_line_items",
    keys=["order_id", "line_item_id"],  # Composite key
    sequence_by="last_updated"
)
```

### CDC Flow Parameters

```python
dp.create_auto_cdc_flow(
    target="target_table",              # Target streaming table (create it first)
    source="source_table",              # Source table or view name
    keys=["id"],                        # Primary/composite keys
    sequence_by="updated_at",           # Column (or struct) for ordering changes
    stored_as_scd_type=1,               # 1 or 2 (default: 1)
    track_history_column_list=None,     # Columns to track (SCD Type 2 only)
    track_history_except_column_list=None,  # Or: columns not to track (SCD Type 2 only)
    ignore_null_updates=False,          # True: NULLs in an update keep the existing value
    apply_as_deletes=None,              # Condition marking an event as a delete
    apply_as_truncates=None,            # Condition marking a full truncate (SCD Type 1 only)
    column_list=None,                   # Subset of columns to sync
    except_column_list=None             # Or: columns to exclude (not both)
)
```

## 4. CDC from Snapshots

### Snapshot-Based CDC Pattern

When source system provides full snapshots instead of change events:

```
Day 1 Snapshot:          Day 2 Snapshot:
id | name  | status      id | name    | status
1  | Alice | active      1  | Alice J | active  <- Updated
2  | Bob   | active      2  | Bob     | inactive <- Updated
                         3  | Carol   | active  <- Inserted
                         (a Day 1 id absent from Day 2 would be a Delete)
```
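
The comparison can be sketched in plain Python. This is a simplified illustration of snapshot diffing under the assumption that rows are matched on a single key column; `diff_snapshots` is a hypothetical helper, not a Lakeflow function, and the "Day 3" snapshot is invented here only to show a delete.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]

def diff_snapshots(previous: List[Row], current: List[Row], key: str) -> Tuple[List[Row], List[Row], List[Any]]:
    """Classify changes between two full snapshots as inserts, updates, and deleted keys."""
    prev_by_key = {row[key]: row for row in previous}
    curr_by_key = {row[key]: row for row in current}
    inserts = [row for k, row in curr_by_key.items() if k not in prev_by_key]
    updates = [row for k, row in curr_by_key.items() if k in prev_by_key and row != prev_by_key[k]]
    deletes = [k for k in prev_by_key if k not in curr_by_key]
    return inserts, updates, deletes

day1 = [
    {"id": 1, "name": "Alice", "status": "active"},
    {"id": 2, "name": "Bob", "status": "active"},
]
day2 = [
    {"id": 1, "name": "Alice J", "status": "active"},
    {"id": 2, "name": "Bob", "status": "inactive"},
    {"id": 3, "name": "Carol", "status": "active"},
]
inserts, updates, deletes = diff_snapshots(day1, day2, key="id")

# Hypothetical Day 3 snapshot in which Carol (id=3) no longer appears
day3 = day2[:2]
_, _, deleted_on_day3 = diff_snapshots(day2, day3, key="id")
```

Between Day 1 and Day 2 this yields one insert (id 3), two updates (ids 1 and 2), and no deletes; the invented Day 3 snapshot yields a delete for id 3.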

### Using create_auto_cdc_from_snapshot_flow()

```python
from pyspark import pipelines as dp

# Source: Daily snapshots
@dp.table
def daily_customer_snapshot():
    """
    Complete customer snapshot loaded daily.
    Lakeflow will compare with previous snapshot to detect changes.
    """
    return (
        spark.read
        .format("parquet")
        .load("/mnt/snapshots/customers/latest/")
    )

# The target streaming table must exist before the flow is defined
dp.create_streaming_table("customers_current_state")

# Automatically detect and apply changes.
# There is no sequence_by here: each pipeline update compares the
# current snapshot with the state already in the target.
dp.create_auto_cdc_from_snapshot_flow(
    source="daily_customer_snapshot",
    target="customers_current_state",
    keys=["customer_id"],
    stored_as_scd_type=1
)

# What happens automatically:
# 1. Compares new snapshot to previous state
# 2. Detects inserts (new customer_ids)
# 3. Detects updates (changed column values)
# 4. Detects deletes (missing customer_ids)
# 5. Applies changes to maintain current state
```

### Snapshot CDC with Partitioning

```python
# Partitioned snapshots (common pattern)
@dp.table
def partitioned_product_snapshot():
    """
    Daily snapshots partitioned by date.
    Lakeflow processes only latest partition.
    """
    return (
        spark.read
        .format("delta")
        .table("raw.product_snapshots")
        .filter(F.col("snapshot_date") == F.current_date())
    )

dp.create_auto_cdc_from_snapshot_flow(
    source="partitioned_product_snapshot",
    target="products_current",
    keys=["product_id"],
    stored_as_scd_type=1  # Snapshot flows take no sequence_by argument
)
```

### Handling Soft Deletes

```python
# Source uses soft deletes (is_deleted flag)
@dp.table
def employee_snapshot_with_soft_deletes():
    # Drop soft-deleted rows so they are absent from the snapshot;
    # a NULL flag is treated as "not deleted"
    return (
        spark.table("raw.employee_snapshots")
        .filter(~F.coalesce(F.col("is_deleted"), F.lit(False)))
    )

dp.create_auto_cdc_from_snapshot_flow(
    source="employee_snapshot_with_soft_deletes",
    target="employees_active",
    keys=["employee_id"],
    stored_as_scd_type=1
)
# Snapshot flows have no apply_as_deletes argument: keys missing from the
# filtered snapshot are removed from the target
```

## 5. Slowly Changing Dimensions (SCD)

### SCD Type 1: Overwrite (No History)

**Behavior**: Updates overwrite existing values, no history maintained.

```python
# Example: Customer current state
# Updates simply overwrite old values

@dp.streaming_table
def customer_cdc():
    return spark.readStream.table("raw.customer_changes")

dp.create_auto_cdc_flow(
    source="customer_cdc",
    target="customers",  # Current state only
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=1  # Type 1: Overwrite
)

# Result in target table:
# customer_id | name    | email              | tier
# 1           | Alice J | alice@example.com  | Premium
#             ^^^^^^^^^ Updated value (old value lost)
```

**When to Use SCD Type 1:**
- Current state is all that matters
- Historical changes not needed
- Storage efficiency important
- Examples: Contact information, preferences, current status

### SCD Type 2: Historical Tracking

**Behavior**: Maintains complete history of changes with effective dates.

```python
# Example: Customer history with all changes tracked

@dp.streaming_table
def customer_cdc_for_history():
    return spark.readStream.table("raw.customer_changes")

dp.create_auto_cdc_flow(
    source="customer_cdc_for_history",
    target="customers_history",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=2,  # Type 2: Track history
    track_history_column_list=["name", "tier", "status"]  # Changes to these start a new version
)

# Result in target table:
# customer_id | name    | tier    | status  | __START_AT          | __END_AT
# 1           | Alice   | Free    | active  | 2024-01-01 00:00:00 | 2024-06-01 00:00:00
# 1           | Alice   | Premium | active  | 2024-06-01 00:00:00 | 2024-10-01 00:00:00
# 1           | Alice J | Premium | active  | 2024-10-01 00:00:00 | NULL
#                                                                    ^^^^
#                                                  Current record (no end date)
```

**SCD Type 2 Metadata Columns:**
- `__START_AT`: When this version became effective (taken from `sequence_by`)
- `__END_AT`: When this version was superseded (NULL for the current record)

There is no separate "current" flag column; the current record is the one whose `__END_AT` is NULL.

**Querying SCD Type 2 Tables:**
```python
# Get current state only
current_customers = (
    spark.table("customers_history")
    .filter(F.col("__END_AT").isNull())
)

# Get state at specific point in time
customers_on_date = (
    spark.table("customers_history")
    .filter(
        (F.col("__START_AT") <= F.lit("2024-06-15")) &
        ((F.col("__END_AT") > F.lit("2024-06-15")) | F.col("__END_AT").isNull())
    )
)

# Get complete history for a customer
customer_history = (
    spark.table("customers_history")
    .filter(F.col("customer_id") == 1)
    .orderBy("__START_AT")
)
```
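
The point-in-time filter treats each version as a half-open interval: a row is effective from its start up to, but not including, its end. The plain-Python sketch below mirrors that logic on an in-memory copy of the example history. The `start_at` and `end_at` keys are stand-ins for the start and end metadata columns, and `current_rows` and `as_of` are hypothetical helpers.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# In-memory copy of the example history rows (ISO dates compare correctly as strings)
history: List[Row] = [
    {"customer_id": 1, "name": "Alice", "tier": "Free", "start_at": "2024-01-01", "end_at": "2024-06-01"},
    {"customer_id": 1, "name": "Alice", "tier": "Premium", "start_at": "2024-06-01", "end_at": "2024-10-01"},
    {"customer_id": 1, "name": "Alice J", "tier": "Premium", "start_at": "2024-10-01", "end_at": None},
]

def current_rows(rows: List[Row]) -> List[Row]:
    """Current versions are the ones that have not been closed."""
    return [r for r in rows if r["end_at"] is None]

def as_of(rows: List[Row], at: str) -> List[Row]:
    """Versions effective at `at`, using the half-open interval [start_at, end_at)."""
    return [
        r for r in rows
        if r["start_at"] <= at and (r["end_at"] is None or at < r["end_at"])
    ]

mid_june = as_of(history, "2024-06-15")
on_boundary = as_of(history, "2024-06-01")
```

Because the interval is half-open, a lookup exactly on a boundary date returns only the newer version, so no date ever matches two versions of the same key.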

**When to Use SCD Type 2:**
- Historical analysis required
- Audit trail needed
- Regulatory compliance (track changes)
- Point-in-time queries important
- Examples: Pricing history, customer tier changes, product classifications

### SCD Type 1 vs Type 2 Decision Matrix

| Criteria | SCD Type 1 | SCD Type 2 |
|----------|------------|------------|
| **History** | Not tracked | Full history |
| **Storage** | Minimal | Higher |
| **Queries** | Simple | More complex |
| **Updates** | Overwrite | Insert new version |
| **Use Case** | Current state | Historical analysis |
| **Example** | Contact info | Pricing history |

## 6. Advanced CDC Patterns

### Pattern 1: Multi-Tier CDC

```python
# Bronze: Raw CDC events
@dp.streaming_table
def bronze_customer_cdc():
    return spark.readStream.table("raw.customer_cdc")

# Silver: Apply CDC to maintain current state (Type 1)
dp.create_auto_cdc_flow(
    source="bronze_customer_cdc",
    target="silver_customers_current",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=1  # Current state
)

# Gold: Maintain history (Type 2) for analytics
dp.create_auto_cdc_flow(
    source="bronze_customer_cdc",
    target="gold_customers_history",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=2,  # Track history
    track_history_column_list=["tier", "status", "country"]
)
```

### Pattern 2: Selective Column Tracking

```python
# Track history only for specific columns
@dp.streaming_table
def product_cdc():
    return spark.readStream.table("raw.product_changes")

dp.create_auto_cdc_flow(
    source="product_cdc",
    target="products_with_price_history",
    keys=["product_id"],
    sequence_by="updated_at",
    stored_as_scd_type=2,
    track_history_column_list=["price", "currency"]  # Only track price changes
    # Other columns (name, description) updated in place (Type 1)
)
```
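
The difference between tracked and untracked columns can be sketched in plain Python. This is a simplified model of the behavior described in the comments above, not Lakeflow's implementation; `apply_with_tracking`, the `start_at`/`end_at` keys, and the product rows are all hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def apply_with_tracking(history: List[Row], attrs: Row, at: str, tracked: List[str]) -> None:
    """Start a new version only when a tracked column changes; otherwise update in place."""
    key = attrs["product_id"]
    current = next(
        (r for r in history if r["product_id"] == key and r["end_at"] is None), None
    )
    if current is not None and all(current[c] == attrs[c] for c in tracked):
        current.update(attrs)  # untracked change: overwrite in place, no new version
        return
    if current is not None:
        current["end_at"] = at  # tracked change: close the open version
    history.append({**attrs, "start_at": at, "end_at": None})

history: List[Row] = []
tracked = ["price", "currency"]
apply_with_tracking(history, {"product_id": 7, "name": "Mug", "price": 10, "currency": "USD"}, "2024-01-01", tracked)
# Name change only: no new version
apply_with_tracking(history, {"product_id": 7, "name": "Large Mug", "price": 10, "currency": "USD"}, "2024-02-01", tracked)
# Price change: new version
apply_with_tracking(history, {"product_id": 7, "name": "Large Mug", "price": 12, "currency": "USD"}, "2024-03-01", tracked)
```

The rename is absorbed into the existing version, while the price change closes that version and opens a second one, so the history ends with two rows.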

### Pattern 3: Ignore Null Updates

```python
# Don't update when change event has null values
dp.create_auto_cdc_flow(
    source="customer_cdc",
    target="customers",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=1,
    ignore_null_updates=True  # Keep existing value if update is NULL
)
# Useful when CDC sends partial updates
```

### Pattern 4: Custom Delete Conditions

```python
# Define custom logic for what constitutes a "delete"
@dp.streaming_table
def account_cdc():
    return spark.readStream.table("raw.account_changes")

dp.create_auto_cdc_flow(
    source="account_cdc",
    target="active_accounts",
    keys=["account_id"],
    sequence_by="modified_at",
    apply_as_deletes="status = 'CLOSED' OR is_deleted = true"
    # Remove records when status is CLOSED or is_deleted flag set
)
```

### Pattern 5: Column Subsetting

```python
# Sync only specific columns to target
dp.create_auto_cdc_flow(
    source="customer_cdc",
    target="customer_pii_subset",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=1,
    column_list=["customer_id", "name", "email", "phone"]  # Only these columns
    # Alternative (pass one list or the other, not both):
    # except_column_list=["internal_notes", "risk_score"]  # Exclude these columns
)
```

### Pattern 6: Handling Late Arrivals

```python
# CDC with watermarking for late events
@dp.streaming_table
def order_cdc_with_watermark():
    return (
        spark.readStream.table("raw.order_cdc")
        .withWatermark("event_timestamp", "1 hour")  # Bounds state for stateful steps; does not reorder events
    )

dp.create_auto_cdc_flow(
    source="order_cdc_with_watermark",
    target="orders_current",
    keys=["order_id"],
    sequence_by="event_timestamp"  # Watermarked column
)
# Ordering comes from sequence_by: a late event that is older than the
# stored row does not overwrite it
```

## 7. Merge Strategies and Conflict Resolution

### Understanding Merge Conflicts

**Conflict Scenarios:**
```
Scenario 1: Out-of-order updates
Event 1: customer_id=1, tier='Premium', timestamp=10:00
Event 2: customer_id=1, tier='Free', timestamp=09:00  <- Older event arrives late
Solution: Use sequence_by to apply in correct order

Scenario 2: Duplicate keys in batch
Event 1: customer_id=1, name='Alice'
Event 2: customer_id=1, name='Alice J'  <- Same key, different value
Solution: Lakeflow deduplicates based on sequence_by (keeps latest)

Scenario 3: Concurrent updates
Source A: customer_id=1, email='alice@new.com', timestamp=10:00:00
Source B: customer_id=1, phone='555-1234', timestamp=10:00:00  <- Same timestamp
Solution: Define tie-breaker logic or merge both updates
```
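
Scenario 1 can be sketched in plain Python: whatever the arrival order, the event with the greatest sequence value per key wins. `resolve_by_sequence` is a hypothetical helper, and the tie policy here (the first arrival wins on equal sequence values) is an arbitrary choice for the sketch, not documented Lakeflow behavior.

```python
from typing import Any, Dict, List

def resolve_by_sequence(events: List[Dict[str, Any]], key: str, sequence: str) -> Dict[Any, Dict[str, Any]]:
    """Keep, per key, the event with the greatest sequence value, regardless of arrival order."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        k = event[key]
        if k not in latest or event[sequence] > latest[k][sequence]:
            latest[k] = event
    return latest

arrivals = [
    {"customer_id": 1, "tier": "Premium", "timestamp": "10:00"},
    {"customer_id": 1, "tier": "Free", "timestamp": "09:00"},  # older event arrives late
]
state = resolve_by_sequence(arrivals, key="customer_id", sequence="timestamp")
state_reversed = resolve_by_sequence(list(reversed(arrivals)), key="customer_id", sequence="timestamp")
```

Both arrival orders leave customer 1 on the `Premium` tier, because the 09:00 event never outranks the 10:00 one.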

### Sequence-Based Conflict Resolution

```python
# Use timestamp to order operations
dp.create_auto_cdc_flow(
    source="customer_cdc",
    target="customers",
    keys=["customer_id"],
    sequence_by="update_timestamp",  # Latest timestamp wins
    stored_as_scd_type=1
)

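# If multiple events for one key share the same sequence value, the outcome is
# not well defined: expect one distinct update per key per sequence value.
# Add a tie-breaker by sequencing on a struct, for example:
# sequence_by=F.struct("update_timestamp", "event_id")  # event_id is illustrative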
```

### Deduplication Strategy

```python
# Lakeflow automatically deduplicates based on:
# 1. Keys (primary key match)
# 2. sequence_by (latest value kept)

@dp.streaming_table
def deduplicated_cdc():
    """
    If CDC source has duplicates,
    pre-deduplicate before auto_cdc_flow
    """
    return (
        spark.readStream.table("raw.cdc_with_duplicates")
        .withWatermark("event_time", "10 minutes")
        .dropDuplicates(["customer_id", "event_time"])  # Remove exact duplicates
    )

dp.create_auto_cdc_flow(
    source="deduplicated_cdc",
    target="customers",
    keys=["customer_id"],
    sequence_by="event_time"
)
```

### Handling Partial Updates

```python
# CDC sends only changed columns (nulls for unchanged)
dp.create_auto_cdc_flow(
    source="partial_update_cdc",
    target="customers",
    keys=["customer_id"],
    sequence_by="update_timestamp",
    stored_as_scd_type=1,
    ignore_null_updates=True  # Don't overwrite with NULL
)

# Example:
# Existing: {id=1, name='Alice', email='alice@old.com', tier='Free'}
# Update:   {id=1, name=NULL, email='alice@new.com', tier=NULL}
# Result:   {id=1, name='Alice', email='alice@new.com', tier='Free'}
#                  ^^^^^ Kept    ^^^^^^^^^^^^^ Updated  ^^^^ Kept
```
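
The merge rule in the example above can be sketched in plain Python: a NULL (here `None`) in the update means "no change" for that column. `merge_ignoring_nulls` is a hypothetical helper written for illustration, not part of the Lakeflow API.

```python
from typing import Any, Dict

def merge_ignoring_nulls(existing: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new row where only non-None values from `update` replace existing ones."""
    merged = dict(existing)  # copy so the existing row is left untouched
    for column, value in update.items():
        if value is not None:
            merged[column] = value
    return merged

existing = {"id": 1, "name": "Alice", "email": "alice@old.com", "tier": "Free"}
update = {"id": 1, "name": None, "email": "alice@new.com", "tier": None}
result = merge_ignoring_nulls(existing, update)
```

One consequence of this rule is that a source can no longer set a column to NULL on purpose, since a NULL is always read as "unchanged".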

## Summary

In this notebook, we explored flows and advanced CDC patterns in Lakeflow:

### Key Concepts Covered

1. **Lakeflow Flows**
   - High-level abstractions for common patterns
   - Less code, built-in best practices
   - Automatic optimization

2. **Append Flow (`dp.append_flow()`)**
   - Incremental data loading
   - Event logs and time-series data
   - Efficient append-only pattern

3. **Automatic CDC (`dp.create_auto_cdc_flow()`)**
   - Change data capture automation
   - Handles inserts, updates, deletes
   - Sequence-based ordering
   - Key-based deduplication

4. **Snapshot CDC (`dp.create_auto_cdc_from_snapshot_flow()`)**
   - CDC from full snapshots
   - Automatic change detection
   - Soft delete handling

5. **Slowly Changing Dimensions**
   - Type 1: Current state (overwrite)
   - Type 2: Historical tracking (versioning)
   - Selective column tracking
   - Point-in-time queries

6. **Advanced CDC Patterns**
   - Multi-tier CDC architectures
   - Selective column synchronization
   - Custom delete conditions
   - Late arrival handling

7. **Merge Strategies**
   - Conflict resolution with sequence_by
   - Deduplication strategies
   - Partial update handling
   - Null value management

### Functional Programming Benefits

- **Declarative**: Define what to sync, not how
- **Composable**: Flows integrate with table definitions
- **Immutable**: Source data unchanged, target reflects changes
- **Deterministic**: Sequence-based ordering ensures consistency

### Best Practices

✅ Use append_flow for append-only data
✅ Choose SCD Type 1 for current state, Type 2 for history
✅ Always specify sequence_by for create_auto_cdc_flow, and keep it unique per key
✅ Define clear primary keys
✅ Rely on sequence_by for late and out-of-order events; use watermarks to bound deduplication state
✅ Use ignore_null_updates for partial updates
✅ Monitor CDC lag and apply rates

### Next Steps

- **6.6**: Best practices and anti-patterns for Lakeflow


## Exercises

Practice CDC patterns:

**Exercise 1: Append Flow**
- Create append flow for event log data
- Apply transformations before appending
- Monitor append rates and volumes

**Exercise 2: Basic CDC**
- Implement auto_cdc_flow with sample data
- Test insert, update, and delete operations
- Verify correct sequence handling

**Exercise 3: SCD Type 1 vs Type 2**
- Create same source with both Type 1 and Type 2
- Apply sample changes
- Compare storage and query patterns

**Exercise 4: Snapshot CDC**
- Process daily snapshots with CDC
- Detect and apply all change types
- Handle soft deletes

**Exercise 5: Conflict Resolution**
- Create CDC stream with out-of-order events
- Implement sequence-based resolution
- Test late arrival handling

**Exercise 6: Multi-Tier CDC**
- Design bronze/silver/gold CDC architecture
- Apply Type 1 for current, Type 2 for history
- Monitor quality at each layer
