# 6.1 Introduction to pyspark.pipelines and Databricks Lakeflow Architecture

This notebook introduces Apache Spark 4.1's declarative pipelines framework (`pyspark.pipelines`) and Databricks Lakeflow Declarative Pipelines, the evolution of Delta Live Tables (DLT). We'll explore the migration path from legacy `dlt` to the new `pyspark.pipelines` module, understanding how this aligns with functional programming principles.

## Learning Objectives

By the end of this notebook, you will understand:
- The evolution from Delta Live Tables (`dlt`) to Lakeflow Declarative Pipelines (`pyspark.pipelines`)
- Apache Spark 4.1's open-source declarative pipelines vs Databricks Lakeflow extensions
- Core architectural concepts: flows, tables, views, sinks
- Why declarative pipelines align with functional programming
- Migration strategies from legacy DLT code
- The Lakeflow platform ecosystem

## Prerequisites

- Understanding of PySpark DataFrames and transformations
- Familiarity with functional programming concepts (pure functions, immutability)
- Knowledge of Delta Lake basics
- Experience with data pipeline patterns

## Important Note

**Platform Requirements**: This notebook demonstrates concepts that require:
- Apache Spark 4.1+ (for open-source `pyspark.pipelines`)
- Databricks Runtime (for full Lakeflow features)
- Pipeline execution mode (not standard interactive notebook mode)

Code examples show both the declarative syntax and functional simulation patterns for educational purposes.

## 1. The Evolution: From DLT to Lakeflow Declarative Pipelines

### Historical Context

**Delta Live Tables (Legacy)**:
```python
import dlt  # Legacy module

@dlt.table
def customers():
    return spark.read.table("source_customers")
```

**Lakeflow Declarative Pipelines (Modern)**:
```python
from pyspark import pipelines as dp  # New module in Spark 4.1+

@dp.table
def customers():
    return spark.read.table("source_customers")
```

### Key Changes and Timeline

| Milestone | Description | Significance |
|-----------|-------------|---------------|
| **2021** | Delta Live Tables (DLT) introduced | Proprietary Databricks framework for declarative pipelines |
| **2025** | Databricks Data + AI Summit | DLT rebranded as "Lakeflow Declarative Pipelines"; Databricks announces it will contribute the framework to Apache Spark |
| **2025** | Apache Spark 4.1 released | Declarative pipelines (`pyspark.pipelines`) become open source |
| **Current** | Shared core API | The same core API is available in open-source Spark and on Databricks |

### Why the Change?

1. **Open Source Alignment**: Makes declarative pipelines available to the entire Spark community
2. **Standardization**: `pyspark.pipelines` becomes the standard, not a Databricks-specific API
3. **Portability**: Code that sticks to the open-source subset of `pyspark.pipelines` can run on any Spark 4.1+ environment
4. **Innovation**: Databricks' contribution opens the framework to improvements from the wider data engineering community
5. **Future-Proofing**: Aligns with Apache Spark's long-term evolution

### Backward Compatibility

**Good News**: On Databricks, existing `dlt` code continues to work!
```python
# Both imports work, but dp is recommended for new code
import dlt                          # Legacy: still supported
from pyspark import pipelines as dp  # Modern: recommended

# Old decorator names still work
@dlt.table              # ✓ Supported (legacy)
@dp.table               # ✓ Recommended (modern)
```

**Migration Recommendation**: 
- Existing pipelines: No immediate action required
- New development: Use `from pyspark import pipelines as dp`
- Gradual migration: Update imports as you enhance pipelines

## 2. Open Source vs Databricks Extensions

Understanding what's in open-source Spark vs Databricks-specific extensions:

### Open Source Apache Spark 4.1 (`pyspark.pipelines`)

**Core Capabilities**:
```python
from pyspark import pipelines as dp

# ✓ Available in open-source Spark 4.1+
@dp.table                    # Define materialized tables
@dp.streaming_table          # Define streaming tables
@dp.materialized_view        # Define materialized views
@dp.temporary_view           # Define temporary views

# Quality expectations
@dp.expect()                 # Monitor quality (WARN)
@dp.expect_or_drop()         # Filter violations (DROP)
@dp.expect_or_fail()         # Stop on violations (FAIL)
```

**What You Get**:
- Declarative table definitions
- Automatic dependency resolution
- Built-in data quality expectations
- Streaming and batch processing
- Optimized execution planning

**Runs On**:
- Any Apache Spark 4.1+ cluster
- Local development environments
- Kubernetes-based Spark
- Cloud Spark services (EMR, HDInsight, Dataproc)
- Databricks (extended features)

### Databricks Lakeflow Extensions

**Additional Features** (Databricks-specific):
```python
from pyspark import pipelines as dp

# ⚡ Databricks-specific enhancements
dp.create_auto_cdc_flow()              # Automatic CDC pattern
dp.create_auto_cdc_from_snapshot_flow() # CDC from snapshots
dp.create_sink()                        # Advanced sink configuration
dp.append_flow()                        # Incremental append patterns

# Unity Catalog integration
# Photon acceleration
# Serverless compute
# Advanced monitoring and lineage
```

**Databricks Platform Integration**:
- **Unity Catalog**: Fine-grained access control, data lineage, audit logs
- **Photon Engine**: Vectorized query engine that can speed up pipeline operations; gains depend on the workload
- **Serverless Compute**: Auto-scaling, instant start, pay-per-use
- **Lakeflow UI**: Visual pipeline designer, monitoring dashboards, alerting
- **Lakeflow Connect**: High-throughput connectors for 100+ data sources
- **Lakeflow Jobs**: Workflow orchestration, scheduling, error handling

### Code Portability Strategy

**Pattern 1: Pure Open Source** (maximum portability)
```python
from pyspark import pipelines as dp

@dp.table
def bronze_events():
    return spark.readStream.table("raw.events")

# ✓ Runs anywhere with Spark 4.1+
```

**Pattern 2: Databricks-Enhanced** (optimized for Databricks)
```python
from pyspark import pipelines as dp

# Use Databricks-specific features when available
dp.create_auto_cdc_flow(
    source="bronze.customers",
    target="silver.customers",
    keys=["customer_id"],
    sequence_by="update_timestamp"  # Orders changes so the latest one wins
)

# ⚡ Leverages Databricks optimizations
```

**Recommendation**: Use open-source APIs for core logic, Databricks extensions for optimization.
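
One way to keep core logic portable is to check at runtime whether a Databricks-specific function is present before relying on it. The sketch below is an illustration only: `has_feature` and `plan_cdc_strategy` are hypothetical helpers, the `"manual_merge"` label is a made-up name for hand-written fallback logic, and the `SimpleNamespace` objects stand in for the real `pyspark.pipelines` module so the example runs without Spark.

```python
from types import SimpleNamespace


def has_feature(module, name: str) -> bool:
    """Return True if `module` exposes a callable attribute called `name`."""
    return callable(getattr(module, name, None))


def plan_cdc_strategy(pipelines_module) -> str:
    """Pick a CDC strategy based on what the pipelines module offers."""
    if has_feature(pipelines_module, "create_auto_cdc_flow"):
        return "auto_cdc"      # Databricks extension is available
    return "manual_merge"      # Hypothetical label for a hand-written fallback


# Stand-ins for the real module, used only to exercise the helpers.
open_source_like = SimpleNamespace(table=lambda fn: fn)
databricks_like = SimpleNamespace(
    table=lambda fn: fn,
    create_auto_cdc_flow=lambda **kwargs: None,
)

print(plan_cdc_strategy(open_source_like))  # manual_merge
print(plan_cdc_strategy(databricks_like))   # auto_cdc
```

In a real pipeline you would pass the imported `dp` module instead of the stand-ins, keeping the branch in one place rather than scattering environment checks through the table definitions.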

## 3. Core Architecture: Declarative vs Imperative Pipelines

### The Paradigm Shift

**Imperative Approach** (Traditional PySpark):
```python
# ❌ You define HOW to build the pipeline
from pyspark.sql.functions import col, count

# Step 1: Read raw data
raw_df = spark.read.format("json").load("/data/raw/events/")
raw_df.write.format("delta").mode("overwrite").save("/data/bronze/events")

# Step 2: Transform to silver
bronze_df = spark.read.format("delta").load("/data/bronze/events")
silver_df = bronze_df.filter(col("event_type").isNotNull())
silver_df.write.format("delta").mode("append").save("/data/silver/events")

# Step 3: Aggregate to gold
silver_df = spark.read.format("delta").load("/data/silver/events")
gold_df = silver_df.groupBy("date").agg(count("*").alias("event_count"))
gold_df.write.format("delta").mode("overwrite").save("/data/gold/daily_events")

# Issues:
# - Must manually orchestrate execution order
# - Need custom dependency management
# - Error handling is manual
# - Retry logic is custom
# - Monitoring requires separate implementation
```

**Declarative Approach** (`pyspark.pipelines`):
```python
# ✅ You define WHAT tables you want, Spark figures out HOW
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.table
def bronze_events():
    """Raw events from source system"""
    return spark.read.format("json").load("/data/raw/events/")

@dp.table
@dp.expect_or_drop("valid_event_type", "event_type IS NOT NULL")
def silver_events():
    """Cleaned events with quality filters"""
    return dp.read("bronze_events")

@dp.table
def gold_daily_events():
    """Daily event aggregations"""
    return (
        dp.read("silver_events")
        .groupBy("date")
        .agg(F.count("*").alias("event_count"))
    )

# Benefits:
# ✓ Automatic dependency resolution (gold depends on silver depends on bronze)
# ✓ Built-in error handling and retries
# ✓ Automatic quality monitoring
# ✓ Intelligent parallelization
# ✓ Incremental processing
```

### Functional Programming Alignment

Declarative pipelines naturally embody functional programming principles:

1. **Pure Functions**: Dataset definitions behave like pure functions: they return a DataFrame and perform no writes themselves
   ```python
   @dp.table
   def customers():  # Pure function: input → output
       return spark.table("raw.customers")  # No side effects
   ```

2. **Immutability**: Your code never mutates a dataset in place; each transformation defines a new dataset
   ```python
   # bronze_events is immutable
   # silver_events is a NEW table, not a mutation
   ```

3. **Composition**: Tables compose naturally through dependencies
   ```python
   @dp.table
   def final_result():
       return dp.read("intermediate1").join(dp.read("intermediate2"))
   ```

4. **Declarative**: Focus on WHAT, not HOW
   - Catalyst optimizer handles execution strategy
   - Pipeline engine handles orchestration
   - Platform handles infrastructure

5. **No Side Effects in Definitions**: Actions (write, collect) are prohibited
   ```python
   @dp.table
   def valid_table():
       return spark.table("source")  # ✓ Returns DataFrame
   
   @dp.table
   def invalid_table():
       df = spark.table("source")
       df.write.save("somewhere")    # ❌ Side effect forbidden!
       return df
   ```
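
The principles above can be simulated in plain Python, without Spark. This is a teaching sketch, not how the pipeline engine is implemented: `REGISTRY` and the local `table` decorator are hypothetical stand-ins, and rows are plain dictionaries. It shows that decorating a function only registers the definition, and that a downstream definition builds new data instead of mutating its input.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the engine's catalogue of dataset definitions.
REGISTRY: Dict[str, Callable[[], List[dict]]] = {}


def table(fn):
    """Register a dataset definition without running it."""
    REGISTRY[fn.__name__] = fn
    return fn


calls = []  # Records when a definition is actually evaluated


@table
def customers():
    calls.append("customers")
    return [{"id": 1, "status": "active"}, {"id": 2, "status": "inactive"}]


@table
def active_customers():
    calls.append("active_customers")
    # Builds a new list; the upstream rows are never mutated.
    return [row for row in REGISTRY["customers"]() if row["status"] == "active"]


# Decoration only registered the functions; nothing has run yet.
assert calls == []

result = REGISTRY["active_customers"]()
print(result)  # [{'id': 1, 'status': 'active'}]
print(calls)   # ['active_customers', 'customers']
```

Because evaluation is deferred until something asks for a dataset, the engine is free to decide when and in what order definitions run.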

## 4. Core Concepts: Tables, Views, and Flows

### Table Types in Lakeflow

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# 1. Materialized Table (@dp.table)
@dp.table(comment="Customer master table")
def customers():
    """
    Fully materialized Delta table.
    - Data is physically stored
    - Can be queried independently
    - Batch-processed by default
    - Best for: Reference data, slowly changing dimensions
    """
    return spark.read.table("raw.customers")

# 2. Streaming Table (@dp.streaming_table)
@dp.streaming_table(comment="Real-time events stream")
def events_stream():
    """
    Streaming Delta table with incremental processing.
    - Continuously processes new data
    - Maintains checkpoints
    - Exactly-once guarantees
    - Best for: Real-time data, event streams, CDC
    """
    return spark.readStream.format("delta").table("raw.events")

# 3. Materialized View (@dp.materialized_view)
@dp.materialized_view(comment="Customer order summary")
def customer_orders():
    """
    Pre-computed aggregate or join.
    - Stored like a table
    - Refreshed on pipeline run
    - Optimized for read performance
    - Best for: Complex aggregations, frequently-queried joins
    """
    return (
        dp.read("customers")
        .join(dp.read("orders"), "customer_id")
        .groupBy("customer_id")
        .agg(F.sum("order_total").alias("total_spent"))
    )

# 4. Temporary View (@dp.temporary_view)
@dp.temporary_view(comment="Filtered active customers")
def active_customers():
    """
    Logical view, not materialized.
    - No physical storage
    - Computed on-demand when referenced
    - Saves storage for intermediate transformations
    - Best for: Intermediate filters, reusable subqueries
    """
    return dp.read("customers").filter(F.col("status") == "active")
```

### Decision Matrix: Which Type to Use?

| Criteria | `@dp.table` | `@dp.streaming_table` | `@dp.materialized_view` | `@dp.temporary_view` |
|----------|-------------|-----------------------|-------------------------|----------------------|
| **Storage** | Materialized | Materialized | Materialized | None |
| **Processing** | Batch | Streaming | Batch | On-demand |
| **Use Case** | Source/dimension tables; static or slow-changing data | Real-time data; events, CDC | Complex aggregations; pre-computed joins | Intermediate logic; filters, subqueries |
| **Query Performance** | Fast | Fast | Fast | Depends on complexity |
| **Storage Cost** | Moderate | Moderate | Moderate | None |
| **Refresh Pattern** | Full or incremental | Continuous | On pipeline run | Every query |

### Flows: Coordinating Multiple Tables

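**Concept**: A "flow" is the unit of processing in a pipeline: it reads from a source, applies your transformation logic, and writes the result into a target dataset. Several flows can write to the same target.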

```python
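# Example: Append Flow (for growing datasets)
dp.create_streaming_table("silver_logs")  # Target must exist before flows write to it

@dp.append_flow(target="silver_logs")
def append_bronze_logs():
    return (
        spark.readStream.table("bronze_logs")
        .select("timestamp", "user_id", "action", "processed_at")
    )
# Only new records from bronze are appended to silver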

# Example: CDC Flow (change data capture)
dp.create_auto_cdc_flow(
    source="bronze.customer_changes",
    target="silver.customers",
    keys=["customer_id"],
    sequence_by="update_timestamp"
)
# Automatically applies inserts, updates, deletes
```
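
To make the two patterns concrete, here is a minimal simulation on plain Python rows. It is a sketch of the semantics only, not the engine's implementation: `append_only` and `apply_cdc` are hypothetical helpers, and the `op` field marking deletes is an assumed convention for this example.

```python
from typing import Dict, Iterable, List


def append_only(target: List[dict], new_rows: Iterable[dict]) -> List[dict]:
    """Append pattern: return a new list with the new rows added at the end."""
    return target + list(new_rows)


def apply_cdc(changes: Iterable[dict], key: str, sequence_by: str) -> Dict[object, dict]:
    """CDC pattern: keep the latest change per key, then drop deleted keys."""
    latest: Dict[object, dict] = {}
    for change in changes:
        current = latest.get(change[key])
        # A change wins only if it is newer than what we already hold.
        if current is None or change[sequence_by] > current[sequence_by]:
            latest[change[key]] = change
    return {
        k: {field: value for field, value in row.items() if field != "op"}
        for k, row in latest.items()
        if row["op"] != "delete"
    }


changes = [
    {"customer_id": 1, "name": "Ada", "update_timestamp": 1, "op": "upsert"},
    {"customer_id": 2, "name": "Bob", "update_timestamp": 1, "op": "upsert"},
    {"customer_id": 1, "name": "Ada L.", "update_timestamp": 3, "op": "upsert"},
    {"customer_id": 2, "name": "Bob", "update_timestamp": 2, "op": "delete"},
    {"customer_id": 1, "name": "Ada (stale)", "update_timestamp": 2, "op": "upsert"},
]

print(apply_cdc(changes, key="customer_id", sequence_by="update_timestamp"))
# {1: {'customer_id': 1, 'name': 'Ada L.', 'update_timestamp': 3}}
```

The late-arriving record with timestamp 2 is ignored because a newer version of customer 1 was already seen, which is the role `sequence_by` plays in the CDC flow.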

### Dependency Resolution

```python
# Lakeflow automatically resolves dependencies
@dp.table
def a():
    return spark.table("source_a")

@dp.table
def b():
    return dp.read("a")  # Depends on table 'a'

@dp.table
def c():
    return dp.read("a").union(dp.read("b"))  # Depends on both 'a' and 'b'

# Execution order: a → b → c (automatically determined)
# Parallelization: none here, because the chain is linear; tables with no
# dependency path between them can run concurrently
```
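
The ordering logic can be reproduced with the standard library's `graphlib`. This is a sketch of the idea rather than the engine's scheduler: `execution_batches` is a hypothetical helper, and table `d` is an extra table added here so that one batch contains two independent tables.

```python
from graphlib import TopologicalSorter

# Each dataset maps to the set of datasets it reads from.
# 'd' is a hypothetical extra table added to show a parallel batch.
DEPENDENCIES = {
    "a": set(),
    "b": {"a"},
    "c": {"a", "b"},
    "d": {"a"},
}


def execution_batches(dependencies):
    """Group datasets into batches; members of a batch do not depend on each other."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # Raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(DEPENDENCIES))
# [['a'], ['b', 'd'], ['c']]
```

Tables in the same batch have no dependency path between them, so they are the candidates for concurrent execution; `a`, `b`, and `c` on their own always land in three separate batches.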

## 5. The Lakeflow Platform Ecosystem

### Three Pillars of Lakeflow

```
┌─────────────────────────────────────────────────────────────────┐
│                     LAKEFLOW PLATFORM                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────┐  │
│  │  Lakeflow        │  │  Lakeflow        │  │  Lakeflow    │  │
│  │  Connect         │  │  Declarative     │  │  Jobs        │  │
│  │                  │  │  Pipelines       │  │              │  │
│  │  [Ingestion]     │→ │  [Transform]     │→ │ [Orchestrate]│  │
│  └──────────────────┘  └──────────────────┘  └──────────────┘  │
│                                                                  │
│  • 100+ connectors   │  • pyspark.pipelines │  • DAG workflows │
│  • Change tracking   │  • Data quality      │  • Scheduling    │
│  • Schema detection  │  • Lineage tracking  │  • Monitoring    │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### 1. Lakeflow Connect (Ingestion Layer)

**Purpose**: High-throughput data ingestion from diverse sources

**Key Features**:
- 100+ pre-built connectors (Salesforce, SAP, Oracle, MySQL, etc.)
- Auto-schema detection and evolution
- Change data capture (CDC) built-in
- Incremental loading strategies
- Error handling and dead-letter queues

**Example Integration**:
```python
# Lakeflow Connect ingests data to bronze layer
# Then Lakeflow Declarative Pipelines transforms it

@dp.streaming_table
def bronze_salesforce():
    # Data ingested by Lakeflow Connect
    return spark.readStream.table("connect.salesforce_accounts")
```

### 2. Lakeflow Declarative Pipelines (Transformation Layer)

**Purpose**: Declarative data transformations with built-in quality

**What We're Learning in This Section**:
- Table definitions with `@dp.table`, `@dp.streaming_table`, etc.
- Data quality expectations
- Dependency management
- Streaming and batch processing

### 3. Lakeflow Jobs (Orchestration Layer)

**Purpose**: Workflow orchestration and scheduling

**Key Features**:
- Pipeline scheduling (cron, event-driven)
- Cross-pipeline dependencies
- Retry policies and error handling
- Monitoring and alerting
- Resource management

**Example Workflow**:
```python
# Lakeflow Jobs orchestrates:
# 1. Run Lakeflow Connect ingestion → bronze tables
# 2. Trigger Lakeflow Declarative Pipeline → silver/gold tables
# 3. Send notifications on completion/failure
# 4. Trigger downstream ML or BI workloads
```

### Integration Pattern

```python
# Full Lakeflow workflow example
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# Step 1: Lakeflow Connect ingests to bronze (configured via UI)

# Step 2: Lakeflow Declarative Pipeline transforms
@dp.streaming_table
def bronze_customers():
    """Ingested by Lakeflow Connect"""
    return spark.readStream.table("connect.customers")

@dp.table
@dp.expect_or_drop("valid_email", "email IS NOT NULL")
def silver_customers():
    """Cleaned and validated"""
    return dp.read("bronze_customers").filter(F.col("country") == "US")

@dp.materialized_view
def gold_customer_metrics():
    """Business aggregates"""
    return (
        dp.read("silver_customers")
        .groupBy("segment")
        .agg(F.count("*").alias("customer_count"))
    )

# Step 3: Lakeflow Jobs schedules and monitors execution
```

## 6. Migration Guide: From dlt to pyspark.pipelines

### Import Changes

```python
# OLD (Legacy DLT)
import dlt
from pyspark.sql import functions as F

# NEW (Lakeflow Declarative Pipelines)
from pyspark import pipelines as dp  # Changed import
from pyspark.sql import functions as F  # No change
```

### Decorator Migration

| Legacy DLT | Modern Lakeflow | Notes |
|------------|-----------------|-------|
| `@dlt.table` | `@dp.table` | Direct 1:1 replacement |
| `@dlt.view` | `@dp.temporary_view` | Renamed; both define a view that is not materialized |
| `dlt.read("table")` | `dp.read("table")` | Function call replacement |
| `dlt.read_stream("table")` | `dp.read_stream("table")` | Streaming reads |
| `@dlt.expect()` | `@dp.expect()` | Direct replacement |
| `@dlt.expect_or_drop()` | `@dp.expect_or_drop()` | Direct replacement |
| `@dlt.expect_or_fail()` | `@dp.expect_or_fail()` | Direct replacement |

### Complete Migration Example

**Before (Legacy DLT)**:
```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="customers_bronze",
    comment="Raw customer data"
)
def customers_bronze():
    return spark.read.table("raw.customers")

@dlt.table(
    name="customers_silver",
    comment="Cleaned customer data"
)
@dlt.expect_or_drop("valid_email", "email IS NOT NULL")
@dlt.expect_or_drop("valid_age", "age >= 18 AND age <= 120")
def customers_silver():
    return dlt.read("customers_bronze")

@dlt.view(
    name="customer_summary",
    comment="Customer metrics by country"
)
def customer_summary():
    return (
        dlt.read("customers_silver")
        .groupBy("country")
        .agg(F.count("*").alias("customer_count"))
    )
```

**After (Modern Lakeflow)**:
```python
from pyspark import pipelines as dp  # Changed: new import
from pyspark.sql import functions as F

@dp.table(  # Changed: dlt → dp
    name="customers_bronze",
    comment="Raw customer data"
)
def customers_bronze():
    return spark.read.table("raw.customers")

@dp.table(  # Changed: dlt → dp
    name="customers_silver",
    comment="Cleaned customer data"
)
@dp.expect_or_drop("valid_email", "email IS NOT NULL")  # Changed: dlt → dp
@dp.expect_or_drop("valid_age", "age >= 18 AND age <= 120")  # Changed: dlt → dp
def customers_silver():
    return dp.read("customers_bronze")  # Changed: dlt → dp

@dp.temporary_view(  # Changed: @dlt.view → @dp.temporary_view
    name="customer_summary",
    comment="Customer metrics by country"
)
def customer_summary():
    return (
        dp.read("customers_silver")  # Changed: dlt → dp
        .groupBy("country")
        .agg(F.count("*").alias("customer_count"))
    )
```

### Migration Automation

**Simple Find-and-Replace Strategy**:
```text
# In your pipeline notebooks/files, apply in this order (most specific first):
1. Replace: import dlt → from pyspark import pipelines as dp
2. Replace: @dlt.view → @dp.temporary_view
3. Replace: @dlt. → @dp.
4. Replace: dlt.read_stream → dp.read_stream
5. Replace: dlt.read → dp.read
```
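
The replacements can also be scripted. The sketch below is a plain text rewrite with `re`, not an official migration tool: `migrate_source` and `RULES` are hypothetical names, and the rule table is an assumption of this example. It maps `@dlt.view` to `@dp.temporary_view` on the grounds that a DLT view is not materialized; swap the target if your migration uses a different decorator. The key point is ordering: specific patterns must run before the generic `dlt.` rule.

```python
import re

# Ordered rewrite rules: most specific patterns first.
RULES = [
    (r"^import dlt[ \t]*$", "from pyspark import pipelines as dp"),
    (r"@dlt\.view\b", "@dp.temporary_view"),
    (r"\bdlt\.", "dp."),  # Generic rule last, or it would produce @dp.view
]


def migrate_source(source: str, rules=RULES) -> str:
    """Apply each rewrite rule in order and return the migrated source text."""
    for pattern, replacement in rules:
        source = re.sub(pattern, replacement, source, flags=re.MULTILINE)
    return source


LEGACY = '''import dlt

@dlt.table
def bronze():
    return spark.read.table("raw.customers")

@dlt.view
def summary():
    return dlt.read("bronze")
'''

migrated = migrate_source(LEGACY)
print(migrated)
```

Because this is a textual rewrite, it also touches `dlt.` inside comments and string literals, so review the diff before committing it.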

**Testing After Migration**:
```python
# Both versions should produce identical results
# 1. Run legacy pipeline in test mode
# 2. Run migrated pipeline in test mode
# 3. Compare output tables (should be identical)
```
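
Step 3 needs a comparison that ignores row order but still notices duplicated or missing rows. The sketch below shows the idea on plain Python rows; `rows_match` is a hypothetical helper and the sample data is made up. On Spark DataFrames, the analogous check would be `exceptAll` in both directions, with both results empty.

```python
from collections import Counter


def rows_match(left, right) -> bool:
    """Order-insensitive comparison that still respects duplicate rows."""
    def freeze(row: dict) -> tuple:
        # Dicts are unhashable, so turn each row into a sorted tuple of items.
        return tuple(sorted(row.items()))

    return Counter(map(freeze, left)) == Counter(map(freeze, right))


legacy_output = [
    {"country": "US", "customer_count": 3},
    {"country": "DE", "customer_count": 1},
]
migrated_output = [
    {"country": "DE", "customer_count": 1},
    {"country": "US", "customer_count": 3},
]

print(rows_match(legacy_output, migrated_output))                         # True
print(rows_match(legacy_output, migrated_output + migrated_output[:1]))  # False
```

Counting rows rather than comparing sets matters here: a migration that accidentally duplicates records would pass a set-based check.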

## 7. Quick Reference: API Cheat Sheet

### Essential Imports
```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
```

### Table Definition Decorators
```python
@dp.table                    # Materialized batch table
@dp.streaming_table          # Streaming table with checkpoints
@dp.materialized_view        # Pre-computed view (refreshed on run)
@dp.temporary_view           # Logical view (not materialized)
```

### Reading Tables
```python
dp.read("table_name")        # Read batch table
dp.read_stream("table_name") # Read streaming table
```

### Data Quality Expectations
```python
@dp.expect("name", "constraint")           # WARN: Monitor violations
@dp.expect_or_drop("name", "constraint")   # DROP: Remove violating records
@dp.expect_or_fail("name", "constraint")   # FAIL: Stop pipeline on violation
```
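
The three behaviours can be simulated on plain Python rows to see how they differ. This is a sketch of the semantics described above, not the library's implementation: the three functions are local stand-ins, constraints are Python predicates instead of SQL strings, and the sample rows are made up.

```python
from typing import Callable, Iterable, List, Tuple

Row = dict
Predicate = Callable[[Row], bool]


def expect(rows: Iterable[Row], predicate: Predicate) -> Tuple[List[Row], int]:
    """WARN: keep every row and report how many violate the constraint."""
    rows = list(rows)
    violations = sum(1 for row in rows if not predicate(row))
    return rows, violations


def expect_or_drop(rows: Iterable[Row], predicate: Predicate) -> Tuple[List[Row], int]:
    """DROP: keep only rows that satisfy the constraint."""
    rows = list(rows)
    kept = [row for row in rows if predicate(row)]
    return kept, len(rows) - len(kept)


def expect_or_fail(rows: Iterable[Row], name: str, predicate: Predicate) -> List[Row]:
    """FAIL: raise on the first violating row."""
    rows = list(rows)
    for row in rows:
        if not predicate(row):
            raise ValueError(f"expectation {name!r} violated by {row}")
    return rows


customers = [
    {"email": "a@example.com", "age": 34},
    {"email": None, "age": 51},
    {"email": "c@example.com", "age": 17},
]
has_email = lambda row: row["email"] is not None

_, warned = expect(customers, has_email)
kept, dropped = expect_or_drop(customers, has_email)
print(warned, dropped, len(kept))  # 1 1 2
```

Calling `expect_or_fail(customers, "valid_email", has_email)` on the same data raises a `ValueError`, which mirrors the pipeline stopping on a violation.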

### Flow Creation (Databricks-specific)
```python
dp.append_flow()                           # Incremental append pattern
dp.create_auto_cdc_flow()                  # Automatic CDC
dp.create_sink()                           # Custom sink configuration
```

### Prohibited Operations in Definitions
```python
# ❌ NEVER use these in @dp.table functions:
df.collect()          # Action: triggers computation
df.count()            # Action: triggers computation
df.toPandas()         # Action: collects to driver
df.write.save()       # Side effect: writes data
df.write.saveAsTable()  # Side effect: creates table
df.writeStream.start()  # Side effect: starts stream

# ✅ Only return DataFrames:
return df  # Correct: returns DataFrame for Lakeflow to materialize
```

## Summary

In this notebook, we explored the evolution from Delta Live Tables to Lakeflow Declarative Pipelines:

### Key Concepts Covered

1. **Evolution Path**
   - Legacy: `import dlt` (Databricks proprietary)
   - Modern: `from pyspark import pipelines as dp` (Apache Spark 4.1+)
   - Backward compatible: Both work, new code should use `dp`

2. **Open Source vs Databricks**
   - Core pipelines API: Open source in Spark 4.1+
   - Databricks extensions: CDC flows, Unity Catalog, Photon, serverless
   - Code portability: Core API works anywhere

3. **Declarative Paradigm**
   - Define WHAT tables you want, not HOW to build them
   - Automatic dependency resolution
   - Built-in error handling and retries
   - Aligns with functional programming principles

4. **Table Types**
   - `@dp.table`: Materialized batch tables
   - `@dp.streaming_table`: Real-time streaming tables
   - `@dp.materialized_view`: Pre-computed aggregations
   - `@dp.temporary_view`: Logical views (no storage)

5. **Lakeflow Platform**
   - Lakeflow Connect: Data ingestion
   - Lakeflow Declarative Pipelines: Transformations
   - Lakeflow Jobs: Orchestration and scheduling

6. **Migration Strategy**
   - Simple find-and-replace: `dlt` → `dp`
   - No breaking changes: Legacy code continues to work
   - Gradual adoption recommended

### Functional Programming Benefits

- **Pure Functions**: Table definitions are pure (no side effects)
- **Immutability**: Tables are immutable, transformations create new tables
- **Composition**: Tables compose through dependency graph
- **Declarative**: Focus on outcome, not implementation

### Next Steps

In upcoming notebooks, we'll dive deeper into:
- **6.2**: Defining tables, views, and sinks in detail
- **6.3**: Streaming tables and real-time processing patterns
- **6.4**: Data quality expectations and validation
- **6.5**: Advanced flows and CDC patterns
- **6.6**: Best practices and anti-patterns

Lakeflow Declarative Pipelines represents a significant evolution in data engineering, bringing declarative, functional paradigms to large-scale data processing with Apache Spark.

## Exercises

Practice understanding the evolution from DLT to Lakeflow:

**Exercise 1: Identify Migration Patterns**
- Review your existing DLT code (if available)
- Identify which decorators need to change
- Create a migration checklist

**Exercise 2: Table Type Selection**
- For each scenario below, choose the appropriate decorator:
  - Real-time clickstream events
  - Daily customer dimension table
  - Complex 5-table join used in multiple places
  - Intermediate filter step (active users only)

**Exercise 3: Declarative vs Imperative**
- Take an existing imperative PySpark pipeline
- Redesign it using `@dp` decorators
- Compare code readability and maintainability

**Exercise 4: Dependency Mapping**
- Draw the dependency graph for a multi-table pipeline
- Identify which tables can run in parallel
- Understand how Lakeflow resolves execution order

**Exercise 5: Migration Planning**
- Estimate the effort to migrate an existing DLT pipeline
- Identify any Databricks-specific features used
- Plan a phased migration approach

In the next notebook, we'll implement actual Lakeflow pipelines with hands-on code examples!