
# Lakeflow Auto CDC: Change Data Capture Made Easy 🔄

Welcome to the **Auto CDC** demo! This notebook teaches you how to automatically track and process data changes in your Lakeflow pipelines.

---

## What you'll learn:

* 📚 **Part 1:** Understanding CDC concepts
* 🔍 **Part 2:** How Auto CDC works
* 🥉 **Part 3:** Basic Auto CDC setup
* 🎯 **Part 4:** Advanced CDC features (SCD Type 1 & Type 2)
* 🏗️ **Part 5:** Handling deletes and complex scenarios
* ✅ **Part 6:** Best practices and monitoring

---

## Prerequisites:

* Complete [Lakeflow Pipeline Fundamentals](#notebook/2846436383063456)
* Complete [Lakeflow Expectations](#notebook/2846436383063443)
* Understanding of streaming tables
* Basic SQL knowledge

---

**Let's get started!** 🚀

## Part 1: Understanding Change Data Capture (CDC) 📚

Before using Auto CDC, let's understand what CDC is and why it matters.

---


### 🔄 What is Change Data Capture?

**Definition:**
* CDC tracks **changes** to data over time
* Captures INSERT, UPDATE, and DELETE operations
* Maintains history of how data evolves
* Essential for data warehousing and analytics

---

### **The Problem Without CDC:**

Imagine a customer table that gets updated daily:

**Day 1:**
```
customer_id | name      | email              | status
1           | Alice     | alice@email.com    | active
2           | Bob       | bob@email.com      | active
```

**Day 2 (Bob's email changed):**
```
customer_id | name      | email              | status
1           | Alice     | alice@email.com    | active
2           | Bob       | bob_new@email.com  | active
```

**Problems:**
* ❌ Lost Bob's old email address
* ❌ Don't know WHEN it changed
* ❌ Can't track history
* ❌ Can't audit changes

---

### **The Solution With CDC:**

CDC captures every change:

```
customer_id | name  | email              | status  | operation | timestamp
1           | Alice | alice@email.com    | active  | INSERT    | 2026-01-20
2           | Bob   | bob@email.com      | active  | INSERT    | 2026-01-20
2           | Bob   | bob_new@email.com  | active  | UPDATE    | 2026-01-21
```

**Benefits:**
* ✅ Complete history preserved
* ✅ Know what changed and when
* ✅ Can replay changes
* ✅ Full audit trail


### 📝 CDC Operation Types

CDC tracks three types of operations:

---

### **1. INSERT - New Records**

**What it means:**
* A new record was added
* First time seeing this key

**Example:**
```
customer_id | name    | operation | timestamp
3           | Charlie | INSERT    | 2026-01-22
```

---

### **2. UPDATE - Modified Records**

**What it means:**
* An existing record was changed
* Key exists, but values changed

**Example:**
```
customer_id | name  | email              | operation | timestamp
2           | Bob   | bob_new@email.com  | UPDATE    | 2026-01-21
```

---

### **3. DELETE - Removed Records**

**What it means:**
* A record was deleted from source
* Mark as deleted (soft delete) or remove (hard delete)

**Example:**
```
customer_id | operation | timestamp
1           | DELETE    | 2026-01-23
```

---

### **How CDC Identifies Operations:**

**Requires:**
1. **Primary Key** - Unique identifier (e.g., `customer_id`)
2. **Sequence Column** - Order of changes (e.g., `timestamp`, `version`)
3. **Operation Column** (optional) - Explicit operation type
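
When a feed has no operation column, the operation can be inferred from the key and the sequence alone. The following is a minimal plain-Python sketch of that idea, not how Lakeflow implements it; `infer_operations` and the sample events are hypothetical names used only for illustration.

```python
def infer_operations(events, key, seq):
    """Label each event INSERT or UPDATE from key and sequence alone."""
    seen_keys = set()
    labelled = []
    # Process events in sequence order so "first time seen" is well defined
    for event in sorted(events, key=lambda e: e[seq]):
        operation = "UPDATE" if event[key] in seen_keys else "INSERT"
        seen_keys.add(event[key])
        labelled.append({**event, "operation": operation})
    return labelled


# Hypothetical change events without an explicit operation column
events = [
    {"customer_id": 2, "email": "bob_new@email.com", "version": 2},
    {"customer_id": 1, "email": "alice@email.com", "version": 1},
    {"customer_id": 2, "email": "bob@email.com", "version": 1},
]

labelled = infer_operations(events, key="customer_id", seq="version")
for row in labelled:
    print(row["customer_id"], row["version"], row["operation"])
```

Deletes cannot be inferred this way, which is why a delete needs an explicit marker such as an operation column.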


### 🏛️ Slowly Changing Dimensions (SCD)

CDC is often used to implement **Slowly Changing Dimensions** - a data warehousing pattern.

---

### **SCD Type 1 - Overwrite (No History)**

**Behavior:**
* Overwrites old values with new values
* No history maintained
* Only current state exists

**Use case:** When history doesn't matter (e.g., fixing typos)

**Example:**
```
Before UPDATE:
customer_id | name  | email
2           | Bob   | bob@email.com

After UPDATE:
customer_id | name  | email
2           | Bob   | bob_new@email.com  ← Old email gone
```

---

### **SCD Type 2 - Track History (Full History)**

**Behavior:**
* Keeps all historical versions
* Adds new row for each change
* Tracks validity periods

**Use case:** When you need complete audit trail

**Example:**
```
customer_id | email              | valid_from  | valid_to    | is_current
2           | bob@email.com      | 2026-01-20  | 2026-01-21  | false
2           | bob_new@email.com  | 2026-01-21  | NULL        | true
```
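
The validity-period bookkeeping behind SCD Type 2 can be sketched in plain Python. This is an illustrative model under simple assumptions (events are complete rows and sequence values are unique per key), not Lakeflow's implementation; `build_scd2` is a hypothetical helper.

```python
def build_scd2(events, key, seq):
    """Turn change events into SCD Type 2 rows with validity periods."""
    rows = []
    open_rows = {}  # key value -> the row that is currently open for that key
    for event in sorted(events, key=lambda e: e[seq]):
        previous = open_rows.get(event[key])
        if previous is not None:
            # Close the old version at the moment the new one starts
            previous["valid_to"] = event[seq]
            previous["is_current"] = False
        row = {k: v for k, v in event.items() if k != seq}
        row.update({"valid_from": event[seq], "valid_to": None, "is_current": True})
        rows.append(row)
        open_rows[event[key]] = row
    return rows


events = [
    {"customer_id": 2, "email": "bob@email.com", "ts": "2026-01-20"},
    {"customer_id": 2, "email": "bob_new@email.com", "ts": "2026-01-21"},
]

history = build_scd2(events, key="customer_id", seq="ts")
for row in history:
    print(row)
```

Each new version closes the previous one, so at most one row per key has `valid_to` set to `None`.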

---

**Lakeflow Auto CDC supports both SCD Type 1 and Type 2!**


## Part 2: How Lakeflow Auto CDC Works 🔍

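Lakeflow provides **automatic CDC processing** through the Auto CDC APIs: `create_auto_cdc_flow()` in Python and `AUTO CDC ... INTO` in SQL. These replace the older `apply_changes()` and `APPLY CHANGES INTO` names.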

---


### 🆕 The New Auto CDC API

**Important:** This demo uses the **new Auto CDC API** introduced in recent Databricks releases.

---

### **Key Differences from Legacy API:**

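**Old API (legacy):**
```python
import dlt

# The legacy API also needed the target declared before applying changes
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by="timestamp"
)
```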

**New API (current):**
```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

dp.create_streaming_table("customers")

dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp")
)
```

---

### **What Changed:**

1. **Import statement:** `from pyspark import pipelines as dp` instead of `import dlt`
2. **Decorators:** `@dp.view`, `@dp.table` instead of `@dlt.view`, `@dlt.table`
3. **Target creation:** `dp.create_streaming_table()` instead of `dlt.create_streaming_table()` (the target must be declared before the flow in both APIs)
4. **CDC function:** `dp.create_auto_cdc_flow()` instead of `dlt.apply_changes()`
5. **Sequence column:** This notebook passes `col("column_name")`; per the Databricks documentation, a plain column-name string is also accepted
6. **SCD type:** This notebook passes the string `"1"` or `"2"`; the integer form is also accepted
7. **Delete condition:** This notebook passes `expr("condition")`; a SQL string is also accepted

---

**This new API is more explicit and aligns with Databricks' modern pipeline architecture.**


### ⚡ What is Auto CDC?

**Auto CDC** is a Lakeflow feature that automatically:
* Processes change data from streaming sources
* Applies INSERT, UPDATE, DELETE operations
* Maintains SCD Type 1 or Type 2 tables
* Handles out-of-order data
* Deduplicates changes

---

### **Traditional CDC (Manual):**

```python
# Complex manual logic needed:
# 1. Read changes
# 2. Identify operation type
# 3. Handle duplicates
# 4. Merge with target table
# 5. Track history (if SCD Type 2)
# 6. Handle deletes
# ... 100+ lines of code
```

---

### **Auto CDC (Declarative):**

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Create target table
dp.create_streaming_table("customers")

# Create Auto CDC flow
dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp")
)
```

**That's it!** Lakeflow handles all the complexity.

---

### **Key Benefits:**

* ✅ **Simple** - Declarative syntax
* ✅ **Automatic** - Handles merges, deduplication
* ✅ **Reliable** - Handles out-of-order data
* ✅ **Flexible** - SCD Type 1 or Type 2
* ✅ **Performant** - Optimized for streaming


### 🧩 Auto CDC Components

Auto CDC requires these components:

---

### **1. Source View (Changes)**

**What it is:**
* View with change data, read as a stream
* Contains INSERT, UPDATE, DELETE operations

**Example:**
```python
@dp.view
def customer_changes():
    return spark.readStream.table("bronze_customer_changes")
```

---

### **2. Target Streaming Table (Current State)**

**What it is:**
* The table to apply changes to
* Maintains current state (SCD Type 1) or history (SCD Type 2)
* Must be created BEFORE the Auto CDC flow

**Example:**
```python
dp.create_streaming_table("customers")
```

---

### **3. Auto CDC Flow**

**What it is:**
* The CDC processing logic
* Defined using `dp.create_auto_cdc_flow()`

**Required parameters:**
* `target` - Target table name
* `source` - Source view name
* `keys` - Primary key columns (list)
* `sequence_by` - Column to order changes (use `col()` function)

**Example:**
```python
from pyspark.sql.functions import col

dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp")
)
```


## Part 3: Basic Auto CDC Setup (SCD Type 1) 🥉

Let's build our first Auto CDC pipeline with **SCD Type 1** (overwrite, no history).

**Scenario:** Track customer information with latest values only.

---

```python
from pyspark import pipelines as dp
# expr and when are used by later cells in this notebook
from pyspark.sql.functions import col, current_timestamp, expr, lit, when

# ============================================
# BRONZE LAYER: Ingest CDC Changes
# ============================================

@dp.view
def bronze_customer_changes():
    """
    Ingest customer changes from source.
    Each row represents a change event (INSERT, UPDATE, DELETE).
    """
    return (
        spark.readStream
        .table("samples.tpch.customer")
        .select(
            col("c_custkey").alias("customer_id"),
            col("c_name").alias("customer_name"),
            col("c_address").alias("address"),
            col("c_phone").alias("phone"),
            col("c_mktsegment").alias("market_segment"),
            current_timestamp().alias("change_timestamp"),
            lit("INSERT").alias("operation")  # Simulating CDC operation
        )
    )
```

```python
# ============================================
# SILVER LAYER: Apply CDC Changes (SCD Type 1)
# ============================================

# Create the target streaming table
dp.create_streaming_table("silver_customers")

# Create Auto CDC flow to apply changes (SCD Type 1)
dp.create_auto_cdc_flow(
    target="silver_customers",
    source="bronze_customer_changes",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    stored_as_scd_type="1"  # SCD Type 1: Overwrite (no history)
)
```


### 💡 Understanding the Auto CDC Code

**What we did:**

---

### **1. Created Bronze Source View:**
```python
@dp.view
def bronze_customer_changes():
    return spark.readStream.table("samples.tpch.customer")
```
* Ingests raw change data as a view
* Each row is a change event
* Includes operation type (INSERT, UPDATE, DELETE)
* Uses `@dp.view` decorator from the new pipelines API

---

### **2. Created Target Streaming Table:**
```python
dp.create_streaming_table("silver_customers")
```
* Creates the target table for CDC changes
* Must be created BEFORE the Auto CDC flow
* Will contain the current state of data

---

### **3. Created Auto CDC Flow:**
```python
dp.create_auto_cdc_flow(
    target="silver_customers",           # Target table to update
    source="bronze_customer_changes",    # Source view with changes
    keys=["customer_id"],                # Primary key
    sequence_by=col("change_timestamp"), # Order changes by timestamp
    stored_as_scd_type="1"               # SCD Type 1 (overwrite)
)
```

---

### **Key Parameters Explained:**

**`target`:**
* Name of the target streaming table
* Must be created first with `create_streaming_table()`
* Contains current state of data

**`source`:**
* Name of the source view (created with `@dp.view`)
* Contains change events
* Must be a streaming source

**`keys`:**
* Primary key column(s) - list of column names
* Used to identify which record to update
* Can be composite key: `["customer_id", "order_id"]`

**`sequence_by`:**
* Column to order changes (use `col()` function)
* Ensures changes applied in correct order
* Handles out-of-order data automatically

**`stored_as_scd_type="1"`:**
* SCD Type 1: Overwrites old values
* No history maintained
* Only current state exists
* Use `"2"` for SCD Type 2 (history tracking)


### 🔄 How SCD Type 1 Processes Changes

**Example scenario:**

---

### **Initial State (Empty Table):**
```
silver_customers: (empty)
```

---

### **Change 1 - INSERT:**
```
customer_id | name    | email           | operation | timestamp
1           | Alice   | alice@email.com | INSERT    | 10:00
```

**Result:**
```
silver_customers:
customer_id | name    | email
1           | Alice   | alice@email.com
```

---

### **Change 2 - INSERT:**
```
customer_id | name  | email         | operation | timestamp
2           | Bob   | bob@email.com | INSERT    | 10:01
```

**Result:**
```
silver_customers:
customer_id | name    | email
1           | Alice   | alice@email.com
2           | Bob     | bob@email.com
```

---

### **Change 3 - UPDATE:**
```
customer_id | name  | email             | operation | timestamp
2           | Bob   | bob_new@email.com | UPDATE    | 10:02
```

**Result (SCD Type 1 - Overwrite):**
```
silver_customers:
customer_id | name    | email
1           | Alice   | alice@email.com
2           | Bob     | bob_new@email.com  ← Updated (old value gone)
```

**Note:** Old email `bob@email.com` is lost - no history maintained.
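
The three changes above can be replayed in plain Python to see the overwrite behavior. This is a rough model of SCD Type 1 semantics for illustration, not Lakeflow's implementation; `apply_scd1` is a hypothetical helper.

```python
def apply_scd1(target, change, key="customer_id"):
    """Upsert one change event into `target`, keeping only the latest values."""
    values = {k: v for k, v in change.items() if k not in ("operation", "timestamp")}
    target[change[key]] = values  # INSERT adds the key, UPDATE overwrites it
    return target


changes = [
    {"customer_id": 1, "name": "Alice", "email": "alice@email.com",
     "operation": "INSERT", "timestamp": "10:00"},
    {"customer_id": 2, "name": "Bob", "email": "bob@email.com",
     "operation": "INSERT", "timestamp": "10:01"},
    {"customer_id": 2, "name": "Bob", "email": "bob_new@email.com",
     "operation": "UPDATE", "timestamp": "10:02"},
]

silver_customers = {}
for change in sorted(changes, key=lambda c: c["timestamp"]):
    apply_scd1(silver_customers, change)

print(silver_customers[2]["email"])  # bob_new@email.com
```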


### 🎯 Challenge 1: Create Auto CDC for Products

**Your task:**

Create an Auto CDC pipeline for product data using SCD Type 1.

---

**Requirements:**

1. **Bronze table:** `bronze_product_changes`
   * Ingest from `samples.tpch.part`
   * Map columns:
     * `p_partkey` → `product_id`
     * `p_name` → `product_name`
     * `p_brand` → `brand`
     * `p_size` → `size`
     * `p_retailprice` → `price`
   * Add `change_timestamp` (current timestamp)
   * Add `operation` (set to "INSERT")

2. **Streaming source:** `product_changes_stream`
   * Read from `bronze_product_changes`

3. **Apply CDC:**
   * Target: `silver_products`
   * Keys: `["product_id"]`
   * Sequence by: `change_timestamp`
   * SCD Type 1

---

**Write your code below:** 👇

```python
# ============================================
# CHALLENGE 1: Your solution here
# ============================================

# TODO: Create bronze_product_changes table


# TODO: Create product_changes_stream table


# TODO: Apply CDC changes to silver_products
```



## Part 4: Advanced CDC - SCD Type 2 (History Tracking) 🎯

Now let's implement **SCD Type 2** to maintain complete history of changes.

**Scenario:** Track customer information with full audit trail.

---


### 📜 SCD Type 2 - History Tracking

**What is SCD Type 2?**
* Maintains **complete history** of all changes
* Creates **new row** for each change
* Tracks **validity periods** for each version
* Marks **current** vs **historical** records

---

### **SCD Type 2 Columns:**

Auto CDC automatically adds these columns:

**`__START_AT`:**
* When this version became valid
* Takes its value from the `sequence_by` column

**`__END_AT`:**
* When this version became invalid
* NULL for current version

Auto CDC does not add a separate current-row flag. A row is the current version of its key when `__END_AT` is NULL.

---

### **Example:**

**Changes:**
```
Timestamp | customer_id | name  | email
10:00     | 2           | Bob   | bob@email.com
10:05     | 2           | Bob   | bob_new@email.com
10:10     | 2           | Bob   | bob_final@email.com
```

**Result (SCD Type 2):**
```
customer_id | name | email               | __START_AT | __END_AT
2           | Bob  | bob@email.com       | 10:00      | 10:05
2           | Bob  | bob_new@email.com   | 10:05      | 10:10
2           | Bob  | bob_final@email.com | 10:10      | NULL
```

**Benefits:**
* ✅ Complete audit trail
* ✅ Can query historical state
* ✅ Know exactly when changes occurred
* ✅ Can reconstruct the state as of any point in time

```python
# ============================================
# SILVER LAYER: Apply CDC Changes (SCD Type 2)
# ============================================

# Create the target streaming table for SCD Type 2
dp.create_streaming_table("silver_customers_history")

# Create Auto CDC flow with SCD Type 2 (history tracking)
dp.create_auto_cdc_flow(
    target="silver_customers_history",
    source="bronze_customer_changes",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    stored_as_scd_type="2"  # SCD Type 2: Track history
)
```


### 💡 SCD Type 2 - What Changed?

**The only difference from SCD Type 1:**

```python
stored_as_scd_type="2"  # Changed from "1" to "2"
```

**That's it!** Auto CDC handles all the complexity:

---

### **What Auto CDC Does Automatically:**

1. **Creates history rows:**
   * New row for each change
   * Preserves all historical versions

2. **Adds tracking columns:**
   * `__START_AT` - When version became valid
   * `__END_AT` - When version became invalid

3. **Manages validity periods:**
   * Sets `__END_AT` on old versions
   * Leaves `__END_AT` as NULL on the current version

4. **Handles out-of-order data:**
   * Uses `sequence_by` to order changes
   * Correctly updates validity periods

---

### **Querying SCD Type 2 Tables:**

**Get current records only:**
```sql
SELECT * FROM silver_customers_history
WHERE __END_AT IS NULL
```

**Get all history:**
```sql
SELECT * FROM silver_customers_history
ORDER BY customer_id, __START_AT
```

**Get state at specific time:**
```sql
SELECT * FROM silver_customers_history
WHERE __START_AT <= '2026-01-25 10:00:00'
  AND (__END_AT > '2026-01-25 10:00:00' OR __END_AT IS NULL)
```
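
The point-in-time filter treats each version as a half-open interval: the start is inclusive and the end is exclusive. A small plain-Python sketch of the same predicate, using the sample history from earlier in this part (`as_of` is a hypothetical helper, and the times are simplified to strings):

```python
history = [
    {"customer_id": 2, "email": "bob@email.com", "__START_AT": "10:00", "__END_AT": "10:05"},
    {"customer_id": 2, "email": "bob_new@email.com", "__START_AT": "10:05", "__END_AT": "10:10"},
    {"customer_id": 2, "email": "bob_final@email.com", "__START_AT": "10:10", "__END_AT": None},
]


def as_of(rows, point_in_time):
    """Return the versions valid at `point_in_time` (start inclusive, end exclusive)."""
    return [
        row for row in rows
        if row["__START_AT"] <= point_in_time
        and (row["__END_AT"] is None or row["__END_AT"] > point_in_time)
    ]


print(as_of(history, "10:07")[0]["email"])  # bob_new@email.com
```

Because the end is exclusive, a lookup at exactly `10:05` returns only the version that starts at `10:05`, never both neighbors.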


### 🎯 Challenge 2: Create SCD Type 2 for Suppliers

**Your task:**

Create an Auto CDC pipeline for supplier data with full history tracking (SCD Type 2).

---

**Requirements:**

1. **Bronze table:** `bronze_supplier_changes`
   * Ingest from `samples.tpch.supplier`
   * Map columns:
     * `s_suppkey` → `supplier_id`
     * `s_name` → `supplier_name`
     * `s_address` → `address`
     * `s_phone` → `phone`
     * `s_acctbal` → `account_balance`
   * Add `change_timestamp` (current timestamp)
   * Add `operation` (set to "INSERT")

2. **Streaming source:** `supplier_changes_stream`
   * Read from `bronze_supplier_changes`

3. **Apply CDC with SCD Type 2:**
   * Target: `silver_suppliers_history`
   * Keys: `["supplier_id"]`
   * Sequence by: `change_timestamp`
   * SCD Type 2 (history tracking)

---

**Write your code below:** 👇

```python
# ============================================
# CHALLENGE 2: Your solution here
# ============================================

# TODO: Create bronze_supplier_changes table


# TODO: Create supplier_changes_stream table


# TODO: Apply CDC changes with SCD Type 2
```



## Part 5: Handling Deletes and Complex Scenarios 🏗️

Let's explore advanced CDC features: handling deletes, column tracking, and filtering.

---


### 🗑️ Handling DELETE Operations

Auto CDC can handle DELETE operations in two ways:

---

### **1. Soft Delete (Default for SCD Type 2)**

**Behavior:**
* Closes the current version instead of removing it
* Keeps the record's history in the table
* Sets the `__END_AT` timestamp
* Leaves the key with no open (`__END_AT IS NULL`) version

**Use case:** Maintain complete audit trail including deletions

**Example:**
```
Before DELETE:
customer_id | name  | __START_AT | __END_AT
2           | Bob   | 10:00      | NULL

After DELETE:
customer_id | name  | __START_AT | __END_AT
2           | Bob   | 10:00      | 10:15
```

---

### **2. Hard Delete (SCD Type 1)**

**Behavior:**
* Removes the record from the target table
* No history of the record is kept

**Use case:** Compliance (GDPR), data retention policies

**Configuration:**
```python
dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp"),
    stored_as_scd_type="1",
    apply_as_deletes=expr("operation = 'DELETE'"),  # Specify delete condition
    apply_as_truncates=expr("operation = 'TRUNCATE'")  # Optional: truncate condition
)
```

---

### **Identifying Deletes:**

Auto CDC needs to know which records are deletes:

**Option 1: Operation column**
```python
apply_as_deletes="operation = 'DELETE'"
```

**Option 2: Deleted flag**
```python
apply_as_deletes="is_deleted = true"
```

**Option 3: Null values**
```python
apply_as_deletes="customer_name IS NULL"
```
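
Each option is just a predicate evaluated against every incoming event. The plain-Python sketch below mirrors the three conditions and applies one of them with SCD Type 1 semantics; all function names are hypothetical and the code only models the idea.

```python
def apply_event(target, event, is_delete, key="customer_id"):
    """Apply one event SCD Type 1 style: drop the key on delete, else upsert."""
    if is_delete(event):
        target.pop(event[key], None)
    else:
        target[event[key]] = event
    return target


# Plain-Python equivalents of the three delete conditions above
def by_operation(event):
    return event.get("operation") == "DELETE"


def by_flag(event):
    return event.get("is_deleted") is True


def by_null_name(event):
    return event.get("customer_name") is None


events = [
    {"customer_id": 1, "customer_name": "Alice", "operation": "INSERT"},
    {"customer_id": 2, "customer_name": "Bob", "operation": "INSERT"},
    {"customer_id": 2, "customer_name": "Bob", "operation": "DELETE"},
]

customers = {}
for event in events:
    apply_event(customers, event, is_delete=by_operation)

print(sorted(customers))  # [1]
```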

```python
# ============================================
# CDC with DELETE Handling
# ============================================

# Bronze layer with delete operations
@dp.view
def bronze_customer_changes_with_deletes():
    """
    Simulating CDC feed with INSERT, UPDATE, and DELETE operations.
    """
    return (
        spark.readStream
        .table("samples.tpch.customer")
        .select(
            col("c_custkey").alias("customer_id"),
            col("c_name").alias("customer_name"),
            col("c_address").alias("address"),
            col("c_phone").alias("phone"),
            current_timestamp().alias("change_timestamp"),
            # Mark every 100th customer as deleted for the demo (deterministic, not random)
            when(col("c_custkey") % 100 == 0, lit("DELETE"))
            .otherwise(lit("INSERT"))
            .alias("operation")
        )
    )

# Create target streaming table
dp.create_streaming_table("silver_customers_with_deletes")

# Apply changes with delete handling
dp.create_auto_cdc_flow(
    target="silver_customers_with_deletes",
    source="bronze_customer_changes_with_deletes",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    stored_as_scd_type="2",
    apply_as_deletes=expr("operation = 'DELETE'")  # Handle DELETE operations
)
```


### 📊 Advanced Features: Column Tracking and Filtering

---

### **1. Track History on Specific Columns**

You can exclude columns from history tracking (SCD Type 2):

```python
dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp"),
    stored_as_scd_type="2",
    track_history_except_column_list=["last_login", "login_count"]  # Don't track these
)
```

**Use case:**
* Ignore changes to non-critical columns
* Reduce storage for SCD Type 2
* Focus on important business attributes

**Example:**
* Track changes to `email`, `phone`, `address`
* Ignore changes to `last_login`, `login_count` (frequently changing, not important)
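
The effect of excluding columns from history tracking can be modeled in plain Python. The sketch assumes that a change touching only untracked columns updates the current version in place, and that events arrive in sequence order; `apply_tracked` is a hypothetical helper, not part of the Lakeflow API.

```python
def apply_tracked(history, event, key, seq, untracked):
    """Open a new version only when a tracked column changes."""
    current = next(
        (r for r in history if r[key] == event[key] and r["__END_AT"] is None),
        None,
    )
    payload = {k: v for k, v in event.items() if k != seq}
    if current is not None:
        tracked_changed = any(
            current.get(name) != value
            for name, value in payload.items()
            if name not in untracked
        )
        if not tracked_changed:
            current.update(payload)  # untracked change: overwrite in place
            return history
        current["__END_AT"] = event[seq]  # tracked change: close old version
    history.append({**payload, "__START_AT": event[seq], "__END_AT": None})
    return history


events = [
    {"customer_id": 1, "email": "a@email.com", "last_login": "mon", "ts": 1},
    {"customer_id": 1, "email": "a@email.com", "last_login": "tue", "ts": 2},
    {"customer_id": 1, "email": "b@email.com", "last_login": "tue", "ts": 3},
]

history = []
for event in events:
    apply_tracked(history, event, key="customer_id", seq="ts", untracked={"last_login"})

print(len(history))  # 2
```

The login-only change at `ts=2` produces no new row, while the email change at `ts=3` does.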

---

### **2. Exclude Columns from Target**

Exclude columns from the target table:

```python
dp.create_auto_cdc_flow(
    target="customers",
    source="customer_changes",
    keys=["customer_id"],
    sequence_by=col("timestamp"),
    except_column_list=["operation", "source_system"]  # Don't include these
)
```

**Use case:**
* Remove CDC metadata columns
* Exclude technical columns
* Keep target table clean

---

### **3. Filter Source Data**

Apply filters before CDC processing:

```python
@dp.view
def customer_changes_filtered():
    return (
        spark.readStream.table("bronze_customer_changes")
        .filter(col("country") == "USA")  # Only process USA customers
    )

dp.create_streaming_table("silver_customers_usa")

dp.create_auto_cdc_flow(
    target="silver_customers_usa",
    source="customer_changes_filtered",
    keys=["customer_id"],
    sequence_by=col("timestamp")
)
```


### 🔑 Composite Keys (Multiple Columns)

Some tables require multiple columns as primary key:

---

### **Example: Order Line Items**

**Scenario:**
* Each order has multiple line items
* Primary key: `order_id` + `line_number`

**Implementation:**
```python
@dp.view
def order_line_changes():
    return (
        spark.readStream
        .table("samples.tpch.lineitem")
        .select(
            col("l_orderkey").alias("order_id"),
            col("l_linenumber").alias("line_number"),
            col("l_partkey").alias("product_id"),
            col("l_quantity").alias("quantity"),
            col("l_extendedprice").alias("price"),
            current_timestamp().alias("change_timestamp")
        )
    )

# Create target table
dp.create_streaming_table("silver_order_lines")

# Create Auto CDC flow with composite key
dp.create_auto_cdc_flow(
    target="silver_order_lines",
    source="order_line_changes",
    keys=["order_id", "line_number"],  # Composite key
    sequence_by=col("change_timestamp"),
    stored_as_scd_type="1"
)
```

**Key points:**
* `keys` accepts list of multiple columns
* All key columns must be present in source
* Combination must be unique
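
Conceptually, a composite key identifies a row by the tuple of its key columns. A short plain-Python illustration with made-up order lines (`upsert` is a hypothetical helper):

```python
def upsert(target, row, keys):
    """Upsert `row` into `target`, identified by the tuple of its key columns."""
    composite_key = tuple(row[k] for k in keys)
    target[composite_key] = row
    return target


rows = [
    {"order_id": 100, "line_number": 1, "quantity": 5},
    {"order_id": 100, "line_number": 2, "quantity": 3},
    {"order_id": 100, "line_number": 1, "quantity": 7},  # update to line 1
]

order_lines = {}
for row in rows:
    upsert(order_lines, row, keys=["order_id", "line_number"])

print(len(order_lines))                   # 2
print(order_lines[(100, 1)]["quantity"])  # 7
```

Using `order_id` alone as the key would collapse both line items into one record.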


### 🎯 Challenge 3: Advanced CDC with Composite Keys

**Your task:**

Create an Auto CDC pipeline for order line items with composite keys and delete handling.

---

**Requirements:**

1. **Bronze table:** `bronze_orderline_changes`
   * Ingest from `samples.tpch.lineitem`
   * Map columns:
     * `l_orderkey` → `order_id`
     * `l_linenumber` → `line_number`
     * `l_partkey` → `product_id`
     * `l_quantity` → `quantity`
     * `l_extendedprice` → `extended_price`
     * `l_linestatus` → `status`
   * Add `change_timestamp` (current timestamp)
   * Add `operation` column:
     * Set to "DELETE" if `l_linestatus = 'F'` (finished)
     * Otherwise set to "INSERT"

2. **Streaming source:** `orderline_changes_stream`
   * Read from `bronze_orderline_changes`

3. **Apply CDC:**
   * Target: `silver_order_lines`
   * Composite keys: `["order_id", "line_number"]`
   * Sequence by: `change_timestamp`
   * SCD Type 1
   * Handle deletes: `operation = 'DELETE'`
   * Exclude columns: `["operation"]`

---

**Write your code below:** 👇

```python
# ============================================
# CHALLENGE 3: Your solution here
# ============================================

# TODO: Create bronze_orderline_changes table


# TODO: Create orderline_changes_stream table


# TODO: Apply CDC with composite keys and delete handling
```



## Part 6: Best Practices and Monitoring ✅

Let's cover best practices for production Auto CDC pipelines.

---

### 🎯 Auto CDC Best Practices

---

### **1. Choose the Right SCD Type**

**SCD Type 1 (Overwrite):**
* ✅ Use when: History not needed
* ✅ Use when: Storage is a concern
* ✅ Use when: Only current state matters
* ✅ Examples: Reference data, dimension corrections

**SCD Type 2 (History):**
* ✅ Use when: Audit trail required
* ✅ Use when: Compliance needs
* ✅ Use when: Historical analysis needed
* ✅ Examples: Customer data, pricing, product attributes

---

### **2. Sequence Column Selection**

**Good sequence columns:**
* ✅ Timestamp with high precision
* ✅ Monotonically increasing version number
* ✅ Transaction ID or log sequence number

**Bad sequence columns:**
* ❌ Date only (no time)
* ❌ Non-unique values
* ❌ Can decrease or reset

**Example:**
```python
from pyspark.sql.functions import col

# Good
sequence_by=col("updated_timestamp")  # Timestamp with milliseconds
sequence_by=col("version_number")     # Incrementing integer

# Bad
sequence_by=col("updated_date")       # Date only, multiple changes per day
```

---

### **3. Primary Key Selection**

**Requirements:**
* ✅ Must be unique
* ✅ Must be immutable (never changes)
* ✅ Must be present in all records
* ✅ Should be business meaningful

**Examples:**
```python
# Good
keys=["customer_id"]                    # Natural key
keys=["order_id", "line_number"]        # Composite key

# Avoid if possible
keys=["surrogate_key"]                  # Generated key (less meaningful)
```

---

### **4. Handle Out-of-Order Data**

Auto CDC automatically handles out-of-order data using `sequence_by`:

```
# Changes arrive out of order:
Timestamp | customer_id | email
10:05     | 1           | new@email.com
10:00     | 1           | old@email.com   ← Arrives late

# Auto CDC applies in correct order:
# 1. old@email.com (10:00)
# 2. new@email.com (10:05)
```

**Best practice:** `sequence_by` is required, so pick a column that reflects the true order of changes at the source.
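
The reason arrival order does not matter can be shown with a small plain-Python model: the winner for each key is decided by the sequence value, not by when the event shows up. This is a sketch of the idea for SCD Type 1, with `latest_by_sequence` as a hypothetical helper.

```python
def latest_by_sequence(events, key, seq):
    """Keep, per key, the event with the highest sequence value."""
    latest = {}
    for event in events:  # iterate in arrival order, not sequence order
        held = latest.get(event[key])
        if held is None or event[seq] > held[seq]:
            latest[event[key]] = event
    return latest


arrivals = [
    {"customer_id": 1, "email": "new@email.com", "ts": "10:05"},
    {"customer_id": 1, "email": "old@email.com", "ts": "10:00"},  # arrives late
]

result = latest_by_sequence(arrivals, key="customer_id", seq="ts")
print(result[1]["email"])  # new@email.com
```

Reversing `arrivals` gives the same result, which is the property a good sequence column provides.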

---

### **5. Data Quality with Expectations**

Combine Auto CDC with expectations:

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Expectations are applied with decorators rather than keyword arguments
@dp.view
@dp.expect_all_or_drop({
    "valid_key": "customer_id IS NOT NULL",
    "valid_sequence": "change_timestamp IS NOT NULL"
})
def customer_changes_validated():
    return spark.readStream.table("bronze_customer_changes")

dp.create_streaming_table("silver_customers")

dp.create_auto_cdc_flow(
    target="silver_customers",
    source="customer_changes_validated",
    keys=["customer_id"],
    sequence_by=col("change_timestamp")
)
```

---

### **6. Performance Optimization**

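**Partition target tables:**
```python
# Partitioning and table properties go on the CDC target itself.
# "change_date" is a placeholder for a column that exists in your target.
dp.create_streaming_table(
    name="silver_customers",
    partition_cols=["change_date"],
    table_properties={
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
```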

**Use appropriate cluster size:**
* Start small for development
* Scale up for production based on data volume

**Monitor pipeline metrics:**
* Check processing latency
* Monitor backlog size
* Track error rates


### 📊 Monitoring Auto CDC Pipelines

---

### **1. Pipeline Event Log**

View CDC operations in the event log:

```sql
SELECT
    timestamp,
    origin.flow_name AS flow_name,
    details:flow_progress.status AS status,
    details:flow_progress.metrics AS metrics
FROM event_log(TABLE(silver_customers))
WHERE event_type = 'flow_progress'
ORDER BY timestamp DESC
```

---

### **2. Check SCD Type 2 History**

**Count versions per key:**
```sql
SELECT 
    customer_id,
    COUNT(*) as version_count,
    MIN(__START_AT) as first_seen,
    MAX(__START_AT) as last_changed
FROM silver_customers_history
GROUP BY customer_id
ORDER BY version_count DESC
```

**Find current records:**
```sql
SELECT COUNT(*) as current_records
FROM silver_customers_history
WHERE __END_AT IS NULL
```

**Find deleted records (keys with no open version):**
```sql
SELECT
    customer_id,
    MAX(__END_AT) as deleted_at
FROM silver_customers_history
GROUP BY customer_id
HAVING COUNT_IF(__END_AT IS NULL) = 0
ORDER BY deleted_at DESC
```

---

### **3. Validate CDC Processing**

**Compare source and target counts:**
```sql
-- Distinct keys in the source changes (deleted keys will not appear in the target)
SELECT COUNT(DISTINCT customer_id) as source_count
FROM bronze_customer_changes;

-- Target records (current only for SCD Type 2)
SELECT COUNT(*) as target_count
FROM silver_customers_history
WHERE __END_AT IS NULL;
```

**Check for duplicates:**
```sql
SELECT 
    customer_id,
    COUNT(*) as duplicate_count
FROM silver_customers_history
WHERE __END_AT IS NULL
GROUP BY customer_id
HAVING COUNT(*) > 1
```

---

### **4. Common Issues and Solutions**

**Issue: Duplicate keys in target**
* **Cause:** Multiple open versions (`__END_AT IS NULL`) for the same key
* **Solution:** Check `sequence_by` column has unique values per key

**Issue: Missing records**
* **Cause:** Records filtered out or failed expectations
* **Solution:** Check expectations and source data quality

**Issue: Out-of-order processing**
* **Cause:** `sequence_by` column not properly ordered
* **Solution:** Use timestamp with high precision or version number

**Issue: Slow processing**
* **Cause:** Large backlog or insufficient resources
* **Solution:** Scale up cluster, optimize source queries, add partitioning


### 🏆 Complete Production-Ready Example

Here's a complete Auto CDC pipeline with all best practices:

---

In [0]:
# ============================================
# PRODUCTION-READY AUTO CDC PIPELINE
# ============================================

from pyspark import pipelines as dp
from pyspark.sql.functions import *

# ============================================
# BRONZE: Ingest with data quality checks
# ============================================

@dp.table(
    name="bronze_customer_changes_prod",
    comment="Production customer CDC feed with quality checks",
    expect={
        "valid_timestamp": "change_timestamp IS NOT NULL",
        "valid_operation": "operation IN ('INSERT', 'UPDATE', 'DELETE')"
    }
)
def bronze_customer_changes_prod():
    """
    Ingest customer changes with monitoring.
    """
    return (
        spark.readStream
        .table("samples.tpch.customer")
        .select(
            col("c_custkey").alias("customer_id"),
            col("c_name").alias("customer_name"),
            col("c_address").alias("address"),
            col("c_phone").alias("phone"),
            col("c_mktsegment").alias("market_segment"),
            col("c_acctbal").alias("account_balance"),
            current_timestamp().alias("change_timestamp"),
            lit("INSERT").alias("operation"),
            lit("tpch_source").alias("source_system")
        )
    )

# ============================================
# SILVER: Validated view for CDC
# ============================================

@dp.view(
    name="customer_changes_validated",
    comment="Validated customer changes ready for CDC"
)
# Drop any row that fails one of these checks before it reaches the CDC flow
@dp.expect_all_or_drop({
    "valid_key": "customer_id IS NOT NULL",
    "valid_sequence": "change_timestamp IS NOT NULL",
    "valid_name": "customer_name IS NOT NULL AND LENGTH(customer_name) > 0"
})
def customer_changes_validated():
    """
    Validate and clean changes before CDC processing.
    """
    return (
        spark.readStream.table("bronze_customer_changes_prod")
        .filter(col("operation").isin(["INSERT", "UPDATE", "DELETE"]))
    )

# ============================================
# SILVER: Apply CDC with SCD Type 2
# ============================================

# Create target streaming table
dp.create_streaming_table("silver_customers_prod")

# Create Auto CDC flow with advanced features
dp.create_auto_cdc_flow(
    target="silver_customers_prod",
    source="customer_changes_validated",
    keys=["customer_id"],
    sequence_by=col("change_timestamp"),
    stored_as_scd_type="2",
    apply_as_deletes=expr("operation = 'DELETE'"),
    except_column_list=["operation", "source_system"],  # Exclude metadata
    track_history_except_column_list=["account_balance"]  # Don't track history for this column
)


## Summary: Auto CDC Mastery 🎓

Congratulations! You've learned how to use the **new Auto CDC API** in Lakeflow pipelines.

---

### **Key Concepts:**

**1. CDC Basics:**
* ✅ Tracks INSERT, UPDATE, DELETE operations
* ✅ Maintains data history
* ✅ Essential for data warehousing

**2. New Auto CDC API:**
* ✅ Import: `from pyspark import pipelines as dp`
* ✅ Create target: `dp.create_streaming_table()`
* ✅ Create flow: `dp.create_auto_cdc_flow()`
* ✅ Automatic merge and deduplication
* ✅ Handles out-of-order data

**3. SCD Types:**
* ✅ **Type 1:** Overwrite (no history) - `stored_as_scd_type="1"`
* ✅ **Type 2:** Track history (audit trail) - `stored_as_scd_type="2"`

**4. Key Parameters:**
* ✅ `target` - Target table name (string)
* ✅ `source` - Source view name (string)
* ✅ `keys` - Primary key columns (list)
* ✅ `sequence_by` - Column that orders changes (use `col()` function)
* ✅ `stored_as_scd_type` - `1` or `2` (integer or string)
* ✅ `apply_as_deletes` - Condition that marks a change as a delete (use `expr()` function)
* ✅ `except_column_list` - Exclude columns from target
* ✅ `track_history_except_column_list` - Exclude columns from history tracking
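
To make the difference between the two SCD types concrete, here is a small conceptual model in plain Python. It is only a sketch of the semantics summarized above, not how Lakeflow implements them: the function names, the integer `seq` column, and the sample events are hypothetical, and only the `__START_AT` / `__END_AT` column names mirror what an SCD Type 2 target table contains.

```python
def is_delete(event):
    """Plays the role of apply_as_deletes: which events are deletes."""
    return event["operation"] == "DELETE"


def apply_scd1(events, key, seq):
    """SCD Type 1: keep only the latest version per key; deletes remove the row."""
    current = {}
    for event in sorted(events, key=lambda e: e[seq]):
        if is_delete(event):
            current.pop(event[key], None)
        else:
            current[event[key]] = event
    return current


def apply_scd2(events, key, seq):
    """SCD Type 2: keep every version; __END_AT is None while a row is current."""
    history = []
    open_rows = {}
    for event in sorted(events, key=lambda e: e[seq]):
        previous = open_rows.pop(event[key], None)
        if previous is not None:
            previous["__END_AT"] = event[seq]  # close the old version
        if not is_delete(event):
            row = {**event, "__START_AT": event[seq], "__END_AT": None}
            history.append(row)
            open_rows[event[key]] = row
    return history


# Hypothetical events, deliberately listed out of order
events = [
    {"customer_id": 2, "email": "bob_new@email.com", "operation": "UPDATE", "seq": 2},
    {"customer_id": 1, "email": "alice@email.com", "operation": "INSERT", "seq": 1},
    {"customer_id": 2, "email": "bob@email.com", "operation": "INSERT", "seq": 1},
    {"customer_id": 1, "email": None, "operation": "DELETE", "seq": 3},
]

scd1 = apply_scd1(events, "customer_id", "seq")
scd2 = apply_scd2(events, "customer_id", "seq")

print("SCD1:", {k: v["email"] for k, v in scd1.items()})
for row in scd2:
    print("SCD2:", row["customer_id"], row["email"], row["__START_AT"], row["__END_AT"])
```

Because both functions order events by the sequence column before applying them, the arrival order of the input does not affect the result, which is the role `sequence_by` plays in the real flow.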

---

### **Best Practices:**

* ✅ Choose appropriate SCD type
* ✅ Use reliable sequence column with `col()`
* ✅ Validate data with expectations
* ✅ Monitor pipeline metrics
* ✅ Handle deletes appropriately
* ✅ Use composite keys when needed

---

### **Next Steps:**

1. **Create your pipeline:**
   * Save this notebook
   * Create a Lakeflow pipeline (formerly Delta Live Tables)
   * Configure target catalog and schema

2. **Run and monitor:**
   * Start the pipeline
   * Check event logs
   * Validate CDC processing

3. **Explore more:**
   * Combine with expectations
   * Add data quality rules
   * Optimize performance

---

### **Resources:**

* [Lakeflow Pipeline Fundamentals](#notebook/2846436383063456)
* [Lakeflow Expectations](#notebook/2846436383063443)
* [Databricks Auto CDC Documentation](https://docs.databricks.com/aws/en/ldp/cdc/)

---

**Happy CDC processing!** 🚀


## Creating Your Auto CDC Pipeline 🛠️

Follow these steps to create and run your pipeline:

---

### **Step 1: Save This Notebook**

1. Click **"Save"** or press `Ctrl+S`
2. Note the notebook path

---

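### **Step 2: Create a Lakeflow Pipeline**

1. Click **"Jobs & Pipelines"** in the left sidebar (**"Workflows"** in older workspaces)
2. Open the pipelines list (the **"Delta Live Tables"** tab in older workspaces)
3. Click **"Create Pipeline"** (shown as **"Create"** > **"ETL pipeline"** in newer workspaces)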

---

### **Step 3: Configure Pipeline**

**General settings:**
* **Pipeline name:** `auto_cdc_demo`
* **Product edition:** Advanced (CDC needs Pro or Advanced; the expectations in this notebook need Advanced)
* **Notebook libraries:** Add this notebook

**Destination:**
* **Catalog:** Your catalog name
* **Target schema:** `auto_cdc_demo`

**Compute:**
* **Pipeline mode:** Triggered (for learning)
* **Cluster mode:** Enhanced autoscaling
* **Min workers:** 1
* **Max workers:** 2
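
If you prefer to review these settings as JSON (for example in the pipeline's JSON settings view), they could look roughly like the sketch below. The field names follow the Databricks pipeline settings format as I understand it; treat them as assumptions and compare against the JSON your workspace generates. The `<catalog>` and notebook path values are placeholders.

```python
import json

# Sketch of the Step 3 settings as a pipeline settings document.
# Field names are assumptions to verify against your workspace's generated JSON.
pipeline_settings = {
    "name": "auto_cdc_demo",
    "edition": "ADVANCED",
    "catalog": "<catalog>",            # placeholder: your catalog name
    "schema": "auto_cdc_demo",         # target schema
    "continuous": False,               # False = triggered mode
    "libraries": [
        {"notebook": {"path": "<path-to-this-notebook>"}}  # placeholder path
    ],
    "clusters": [
        {
            "label": "default",
            "autoscale": {"min_workers": 1, "max_workers": 2, "mode": "ENHANCED"},
        }
    ],
}

print(json.dumps(pipeline_settings, indent=2))
```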

---

### **Step 4: Start Pipeline**

1. Click **"Start"**
2. Watch the pipeline execute
3. View the lineage graph

---

### **Step 5: Explore Results**

**Query SCD Type 1 table:**
```sql
SELECT * FROM <catalog>.auto_cdc_demo.silver_customers
LIMIT 10
```

**Query SCD Type 2 table:**
```sql
-- Current records only (open rows have no end timestamp)
SELECT * FROM <catalog>.auto_cdc_demo.silver_customers_history
WHERE __END_AT IS NULL
LIMIT 10;

-- All history
SELECT * FROM <catalog>.auto_cdc_demo.silver_customers_history
ORDER BY customer_id, __START_AT
LIMIT 20;
```
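
Once you have queried the history table, a quick sanity check is that each key has at most one open row (a row whose `__END_AT` is null). The sketch below runs that check on plain dictionaries; the helper name and sample rows are hypothetical. In a notebook you could build `rows` from a query result, for example with `[r.asDict() for r in spark.sql("...").collect()]` on a small sample.

```python
from collections import defaultdict


def keys_with_multiple_open_rows(rows, key_col="customer_id"):
    """Return the sorted keys that have more than one row with __END_AT set to None."""
    open_counts = defaultdict(int)
    for row in rows:
        if row["__END_AT"] is None:
            open_counts[row[key_col]] += 1
    return sorted(key for key, n in open_counts.items() if n > 1)


# Hypothetical history rows; customer 3 wrongly has two open versions
rows = [
    {"customer_id": 1, "__START_AT": "2026-01-20", "__END_AT": None},
    {"customer_id": 2, "__START_AT": "2026-01-20", "__END_AT": "2026-01-21"},
    {"customer_id": 2, "__START_AT": "2026-01-21", "__END_AT": None},
    {"customer_id": 3, "__START_AT": "2026-01-20", "__END_AT": None},
    {"customer_id": 3, "__START_AT": "2026-01-20", "__END_AT": None},
]

problem_keys = keys_with_multiple_open_rows(rows)
print(problem_keys)
```

An empty result means every key has a single current version; any key listed is a candidate for the duplicate-record troubleshooting steps earlier in this notebook.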

---

**You're ready to build production CDC pipelines!** 🎉