# Lecture - Change Data Capture (CDC) Review

This lecture provides a comprehensive review of Change Data Capture (CDC) concepts and implementation patterns in the Lakehouse. You'll explore how CDC enables real-time data synchronization and learn about different approaches to handling changing data.

## Learning Objectives

By the end of this lecture, you will be able to:

1. **Define** Change Data Capture (CDC) and explain its role in data synchronization
2. **Distinguish** between SCD Type 1 and SCD Type 2 patterns and their use cases
3. **Analyze** how SCD Type 1 overwrites existing data while SCD Type 2 preserves history
4. **Identify** when to use each SCD type based on business requirements
5. **Recognize** how `AUTO CDC INTO` simplifies CDC implementation in Lakeflow Spark Declarative Pipelines

## A. What is Change Data Capture?

<img src="./Includes/images/cdc_lecture/01-cdcoverview-review.png" alt="CDC Overview" width="1100">


Let's review **Change Data Capture (CDC)**, a foundational concept for keeping data synchronized across systems.

#### CDC Definition and Purpose

Change Data Capture is a technique used to track and capture changes in data sources like databases, Lakehouses, or data warehouses, and then apply those changes to a target table to ensure it reflects the latest state of the source.

#### Slowly Changing Dimensions (SCDs)

CDC is closely tied to the concept of **Slowly Changing Dimensions (SCDs)**, which describe how historical data changes are handled in your target system.

We'll focus on two main types:
- **SCD Type 1** - Overwrites existing data with new values (no history tracking)  
- **SCD Type 2** - Preserves history by storing previous versions of records

#### Real-World Example

Imagine a **customer** table where new customers are added, existing customer information is updated, or some customers are deleted. CDC ensures those changes flow into the target table using either SCD Type 1 or Type 2 logic, keeping the target table continuously up to date.

**Think about it:** What types of data changes occur frequently in your organization that would benefit from automated CDC processing?

#### Documentation Resources
- [What is change data capture (CDC)?](https://docs.databricks.com/aws/en/ldp/what-is-change-data-capture)

## B. Slowly Changing Dimensions (SCD) Type 1

### B1. SCD Type 1 - Overview

<img src="./Includes/images/cdc_lecture/02-scd-type-1-02-review-slide.png" alt="SCD Type 1 Example" width="1100">


SCD Type 1 updates the **target table** by **overwriting existing rows** with the latest values. No version history is maintained - only the current state matters.

#### Key Characteristics:
- **Updates:** When a record is updated by its key(s), the existing row is replaced with new values
- **Deletes:** When a record is deleted by its key(s), it is removed from the target table  
- **Current State Only:** Only the most recent version of each record is stored
- **No History:** Previous changes and historical versions are not retained

#### Our Scenario Setup

**Target Table (customers)** - Current customer data:

| CustomerID | Name   | Address | ProcessDate |
|------------|--------|---------|-------------|
| 1 | Peter | 1 Blue Rd. | 5/1/2025 |
| 2 | Samarth | 22 Front St | 5/1/2025 |

**Source Updates** - Incoming changes:

| Change Type | Details |
|-------------|---------|
| **Update** | Peter has two address updates (5/15 and 5/20) for `CustomerID = 1` |
| **Delete** | Samarth requests account removal (`CustomerID = 2`) |
| **Insert** | New customer Kostas joins (`CustomerID = 3`) |

**Goal:** Apply updates so the target table reflects the latest state of all customers.

### B2. SCD Type 1 - Implementation Example

<img src="./Includes/images/cdc_lecture/02-scd-type-1-01-review-slide.png" alt="SCD Type 1 Overview" width="1100">

When we apply **SCD Type 1**, the target table is updated with the latest customer information using the **CustomerID** (key column) and **ProcessDate** (sequence column) to determine the most recent changes.

#### Processing Logic:

1. **Peter (CustomerID 1):** 
   - Multiple address updates exist (5/15 and 5/20)
   - Only the latest update (5/20) with address *123 Main St.* is applied
   - Previous address history is lost

2. **Samarth (CustomerID 2):** 
   - Marked for deletion
   - Entire row is removed from the target table
   - No trace of the customer remains

3. **Kostas (CustomerID 3):** 
   - New customer record
   - Inserted as a new row in the target table
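
The processing logic above can be sketched in plain Python. This is only an illustration of SCD Type 1 semantics, not how the pipeline engine is implemented. The `operation` column, the 5/15 address, Kostas's address, and the dates on the delete and insert events are hypothetical placeholders; the scenario does not specify them.

```python
# Target table before the changes, keyed by CustomerID (values from the scenario).
target = {
    1: {"CustomerID": 1, "Name": "Peter", "Address": "1 Blue Rd.", "ProcessDate": "2025-05-01"},
    2: {"CustomerID": 2, "Name": "Samarth", "Address": "22 Front St", "ProcessDate": "2025-05-01"},
}

# Incoming change events. Hypothetical values: the "operation" column, the 5/15
# address, the delete and insert dates, and Kostas's address.
updates = [
    {"CustomerID": 1, "Name": "Peter", "Address": "9 Elm Ct.", "ProcessDate": "2025-05-15", "operation": "UPDATE"},
    {"CustomerID": 1, "Name": "Peter", "Address": "123 Main St.", "ProcessDate": "2025-05-20", "operation": "UPDATE"},
    {"CustomerID": 2, "Name": "Samarth", "Address": None, "ProcessDate": "2025-05-18", "operation": "DELETE"},
    {"CustomerID": 3, "Name": "Kostas", "Address": "7 Hill Ln.", "ProcessDate": "2025-05-22", "operation": "INSERT"},
]


def apply_scd_type_1(target, events):
    """Return a new target in which each key holds only its latest state."""
    result = {key: dict(row) for key, row in target.items()}
    # Apply events oldest-first so the newest event per key is applied last.
    for event in sorted(events, key=lambda e: e["ProcessDate"]):
        key = event["CustomerID"]
        if event["operation"] == "DELETE":
            result.pop(key, None)  # hard delete: the row disappears
        else:
            # Insert or overwrite; the operation flag is not stored in the target.
            result[key] = {col: val for col, val in event.items() if col != "operation"}
    return result


customers = apply_scd_type_1(target, updates)
for customer_id in sorted(customers):
    print(customers[customer_id])
```

Only Peter (with the 5/20 address) and Kostas remain. The intermediate 5/15 address and Samarth's row are gone, which is exactly the loss of history that SCD Type 1 accepts.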

#### Final Result:
The customers table contains only the most current snapshot of active customers. This approach is ideal when:
- Historical data is not required for business operations
- Storage efficiency is prioritized
- Regulatory compliance doesn't require audit trails

**Use SCD Type 1 when:** You only need the most up-to-date and accurate information without historical context.

#### Documentation
- [Use SCD Type 1 to keep only the latest data](https://docs.databricks.com/aws/en/ldp/what-is-change-data-capture#step-2-use-scd-type-1-to-keep-only-the-latest-data)

## C. Slowly Changing Dimensions (SCD) Type 2

### C1. SCD Type 2 - Scenario

<img src="./Includes/images/cdc_lecture/03-auto-cdc-examplescenario.png" alt="SCD Type 2 - Scenario" width="1100">

### C2. SCD Type 2 - Implementation
<img src="./Includes/images/cdc_lecture/02-scd-type-2-01-review-slide.png" alt="SCD Type 2" width="1100">




**Slowly Changing Dimensions Type 2 (SCD Type 2)** introduces **historical tracking and versioning** of records, preserving a complete audit trail of all changes over time.

#### Core Principles:

When a record changes:
- **Historical Preservation:** The old record is preserved with metadata columns showing its validity period
- **New Version Creation:** A new record is inserted with the updated information
- **Soft Deletes:** Deleted records remain in the table but are flagged as inactive

#### Metadata Columns Added:
- **`__START_AT`** - Timestamp when the row became active  
- **`__END_AT`** - Timestamp when the row became inactive  
  - `NULL` **value** = currently active record
  - **Non-**`NULL` **value** = inactive/historical record

#### Detailed Example Breakdown:

**Customer ID 1 - Peter:**  
- **Two records exist** for Peter in the final table
- **Active record:** Shows Peter's current address with `__END_AT = NULL`
- **Historical record:** Preserves his previous address with `__END_AT` populated

**Customer ID 2 - Samarth:**  
- **Account deletion** processed as a soft delete
- **Original record remains** but `__END_AT` is populated with deletion timestamp
- **No new record created** since this was a deletion operation

**Customer ID 3 - Kostas:**  
- **New customer insertion** creates active record
- **`__START_AT`** marks when he joined; **`__END_AT`** remains `NULL`
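
The breakdown above can be sketched in plain Python as a minimal illustration of SCD Type 2 bookkeeping; it is not the engine's actual implementation. All dates, the `operation` column, and Kostas's address are hypothetical placeholders, and Peter is given a single address change so that the result matches the two records described above.

```python
# Existing history table: one active version per customer (__END_AT is None).
history = [
    {"CustomerID": 1, "Name": "Peter", "Address": "1 Blue Rd.", "__START_AT": "2025-05-01", "__END_AT": None},
    {"CustomerID": 2, "Name": "Samarth", "Address": "22 Front St", "__START_AT": "2025-05-01", "__END_AT": None},
]

# Incoming change events (hypothetical dates, operation column, and Kostas's address).
events = [
    {"CustomerID": 1, "Name": "Peter", "Address": "123 Main St.", "ProcessDate": "2025-05-20", "operation": "UPDATE"},
    {"CustomerID": 2, "Name": "Samarth", "Address": None, "ProcessDate": "2025-05-18", "operation": "DELETE"},
    {"CustomerID": 3, "Name": "Kostas", "Address": "7 Hill Ln.", "ProcessDate": "2025-05-22", "operation": "INSERT"},
]


def apply_scd_type_2(history, events):
    """Return a new history in which every change closes the old version
    and, unless it is a delete, opens a new active version."""
    rows = [dict(row) for row in history]
    for event in sorted(events, key=lambda e: e["ProcessDate"]):
        key, seq = event["CustomerID"], event["ProcessDate"]
        # Close the currently active version of this key, if there is one.
        for row in rows:
            if row["CustomerID"] == key and row["__END_AT"] is None:
                row["__END_AT"] = seq
        # A delete only closes the old version; inserts and updates open a new one.
        if event["operation"] != "DELETE":
            rows.append({
                "CustomerID": key,
                "Name": event["Name"],
                "Address": event["Address"],
                "__START_AT": seq,
                "__END_AT": None,
            })
    return rows


customers_history = apply_scd_type_2(history, events)
for row in sorted(customers_history, key=lambda r: (r["CustomerID"], r["__START_AT"])):
    print(row)
```

The output has two rows for Peter (one closed, one active), one closed row for Samarth, and one active row for Kostas.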

#### Business Value:
SCD Type 2 enables complete historical analysis, allowing you to:
- Track how customer attributes evolved over time
- Perform point-in-time analysis for any historical date
- Maintain compliance with audit requirements
- Support advanced analytics on changing dimensions

**Use SCD Type 2 when:** Historical data tracking is essential for business intelligence, compliance, or analytical requirements.
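
To illustrate point-in-time analysis, here is a minimal Python sketch that filters an SCD Type 2 table on its validity columns. The rows and dates are hypothetical sample data, and treating each validity period as a half-open interval (start inclusive, end exclusive) is an assumption of this sketch.

```python
from datetime import date

# Hypothetical SCD Type 2 table; __END_AT of None marks the active version.
history = [
    {"CustomerID": 1, "Name": "Peter", "Address": "1 Blue Rd.", "__START_AT": date(2025, 5, 1), "__END_AT": date(2025, 5, 20)},
    {"CustomerID": 1, "Name": "Peter", "Address": "123 Main St.", "__START_AT": date(2025, 5, 20), "__END_AT": None},
    {"CustomerID": 2, "Name": "Samarth", "Address": "22 Front St", "__START_AT": date(2025, 5, 1), "__END_AT": date(2025, 5, 18)},
    {"CustomerID": 3, "Name": "Kostas", "Address": "7 Hill Ln.", "__START_AT": date(2025, 5, 22), "__END_AT": None},
]


def as_of(history, when):
    """Return the versions that were valid on the given date.

    A version is valid when __START_AT <= when < __END_AT, or when it
    has started and is still open (__END_AT is None).
    """
    return [
        row for row in history
        if row["__START_AT"] <= when
        and (row["__END_AT"] is None or when < row["__END_AT"])
    ]


def current(history):
    """Return only the active versions (the SCD Type 1 view of the same data)."""
    return [row for row in history if row["__END_AT"] is None]


for row in as_of(history, date(2025, 5, 10)):
    print("2025-05-10:", row["Name"], row["Address"])
for row in current(history):
    print("current:   ", row["Name"], row["Address"])
```

On 5/10 the table shows Peter at his old address and Samarth still active; the current view shows Peter at the new address and Kostas. A Type 1 table could only answer the second question.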

#### Documentation
- [Use SCD Type 2 to keep historical data](https://docs.databricks.com/aws/en/ldp/what-is-change-data-capture#step-3-use-scd-type-2-to-keep-historical-data)

## D. Implementing CDC with `AUTO CDC INTO` in Lakeflow Spark Declarative Pipelines (SCD Type 1 Example)

<img src="./Includes/images/cdc_lecture/03-auto-cdc-example.png" alt="AUTO CDC Example" width="1100">

Now that we've covered CDC concepts and both SCD patterns, let's explore how **Lakeflow Spark Declarative Pipelines** simplifies CDC implementation with the `AUTO CDC INTO` statement.

**NOTE:** `AUTO CDC INTO` was previously named `APPLY CHANGES INTO`. Only the name changed; the syntax and functionality are the same.

#### AUTO CDC INTO Syntax Breakdown

```sql
CREATE OR REFRESH STREAMING TABLE customers;

CREATE FLOW scd_type_1_flow AS
AUTO CDC INTO customers 
 FROM STREAM updates
 KEYS (CustomerID)                              
 APPLY AS DELETE WHEN operation = "DELETE"     
 SEQUENCE BY ProcessDate                 
 COLUMNS * EXCEPT (operation)  
 STORED AS SCD TYPE 1;
```

- **`AUTO CDC INTO customers`** - Specifies the target table for CDC operations
- **`FROM STREAM updates`** - Defines the source stream containing CDC events
- **`KEYS (CustomerID)`** - Establishes unique key(s) for matching source and target records
- **`APPLY AS DELETE WHEN operation = "DELETE"`** - Defines deletion logic based on operation column
- **`SEQUENCE BY ProcessDate`** - Names the column that orders change events, so the latest change for each key wins even if events arrive out of order
- **`COLUMNS * EXCEPT (operation)`** - Includes all columns except operational metadata
- **`STORED AS SCD TYPE 1`** - Specifies the SCD Type 1 pattern (SCD Type 1 is the default)
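
To build intuition for what these clauses mean together, the following Python sketch mimics their semantics for a stream whose events can arrive in any order. It is a conceptual model under stated assumptions, not a description of the engine's internals: the helper names, the `INSERT`/`UPDATE` operation values, and all sample addresses and dates other than Peter's are hypothetical.

```python
from itertools import permutations

KEY = "CustomerID"         # KEYS (CustomerID)
SEQUENCE = "ProcessDate"   # SEQUENCE BY ProcessDate
EXCLUDED = {"operation"}   # COLUMNS * EXCEPT (operation)


def is_delete(event):
    # APPLY AS DELETE WHEN operation = "DELETE"
    return event["operation"] == "DELETE"


def apply_event(rows, last_seq, event):
    """Apply one change event with SCD Type 1 semantics, ignoring stale events."""
    key, seq = event[KEY], event[SEQUENCE]
    # last_seq remembers the highest sequence seen per key, even after a delete,
    # so an older event that arrives late cannot resurrect or overwrite a row.
    if key in last_seq and seq <= last_seq[key]:
        return
    last_seq[key] = seq
    if is_delete(event):
        rows.pop(key, None)
    else:
        rows[key] = {col: val for col, val in event.items() if col not in EXCLUDED}


def run(events):
    """Process a sequence of events in arrival order and return the target rows."""
    rows, last_seq = {}, {}
    for event in events:
        apply_event(rows, last_seq, event)
    return rows


events = [
    {"CustomerID": 1, "Name": "Peter", "Address": "1 Blue Rd.", "ProcessDate": "2025-05-01", "operation": "INSERT"},
    {"CustomerID": 2, "Name": "Samarth", "Address": "22 Front St", "ProcessDate": "2025-05-01", "operation": "INSERT"},
    {"CustomerID": 1, "Name": "Peter", "Address": "9 Elm Ct.", "ProcessDate": "2025-05-15", "operation": "UPDATE"},
    {"CustomerID": 1, "Name": "Peter", "Address": "123 Main St.", "ProcessDate": "2025-05-20", "operation": "UPDATE"},
    {"CustomerID": 2, "Name": "Samarth", "Address": None, "ProcessDate": "2025-05-18", "operation": "DELETE"},
    {"CustomerID": 3, "Name": "Kostas", "Address": "7 Hill Ln.", "ProcessDate": "2025-05-22", "operation": "INSERT"},
]

expected = run(events)
# Every arrival order of the same events yields the same target table.
order_independent = all(run(order) == expected for order in permutations(events))
print(expected)
print("order independent:", order_independent)
```

The point of the sketch is the role of the sequence column: because each key keeps track of the newest sequence value it has seen, the final table does not depend on arrival order. Hand-written merge logic has to reproduce this bookkeeping itself.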

#### Key Advantages

`AUTO CDC INTO` provides significant benefits over traditional approaches:

1. **Simplified Implementation:** Eliminates complex `MERGE INTO` logic
2. **Automatic Ordering:** Uses the `SEQUENCE BY` column to handle out-of-order events automatically
3. **Built-in SCD Support:** Native support for both Type 1 and Type 2 patterns
4. **Streaming Integration:** Works seamlessly with both streaming and batch sources
5. **Error Handling:** Includes robust error handling and recovery mechanisms

**Reflection Question:** How might `AUTO CDC INTO` simplify your current data pipeline maintenance compared to custom merge logic?

## E. Documentation and Next Steps

#### Key Resources:
- [The AUTO CDC APIs: Simplify change data capture with Lakeflow Declarative Pipelines](https://docs.databricks.com/aws/en/ldp/cdc)
- [AUTO CDC INTO (Lakeflow Declarative Pipelines)](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-apply-changes-into)

#### Coming Up Next:
In the following demonstration, you'll get hands-on experience implementing both SCD Type 1 and Type 2 patterns using `AUTO CDC INTO` with real customer data scenarios.

## F. Summary and Key Takeaways

### What We Covered:

1. **Change Data Capture (CDC)** enables automated synchronization of data changes between systems
2. **SCD Type 1** overwrites existing data, maintaining only current state (no history)
3. **SCD Type 2** preserves complete historical versions using metadata columns
4. **`AUTO CDC INTO`** provides declarative, simplified CDC implementation in Lakeflow pipelines

### Decision Framework:

**Choose SCD Type 1 when:**
- Only current data state is needed
- Storage efficiency is prioritized
- Historical tracking is not required

**Choose SCD Type 2 when:**
- Historical analysis is essential
- Audit trails are required for compliance
- Point-in-time reporting is needed

### Preparation for Demo - Automating SCD Type 2 with AUTO CDC in Lakeflow Spark Declarative Pipelines
You're now ready to put these concepts into practice in the upcoming hands-on demonstration, where you'll build working CDC pipelines using both SCD patterns.