# 4.1 Implementing Schema Enforcement and Constraints with Delta Lake

This notebook demonstrates how to implement robust data quality controls using Delta Lake's schema enforcement and constraint capabilities.

## Learning Objectives
- Understand Delta Lake's schema enforcement mechanisms
- Implement CHECK constraints for data validation
- Handle schema evolution safely
- Use Delta Lake features for data integrity
- Build declarative data quality pipelines

## Introduction to Delta Lake Schema Enforcement

Delta Lake provides strong schema enforcement that prevents data corruption by ensuring all writes conform to the table's schema. This is a cornerstone of data reliability in the lakehouse architecture.
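
As a rough mental model, enforcement on append can be pictured as comparing the incoming columns against the table's columns before any data is committed. The sketch below is plain Python and is not Delta Lake's actual implementation; `check_write_compatible` and the dict-based schemas are hypothetical names used only for illustration.

```python
def check_write_compatible(table_schema, incoming_schema):
    """Return a list of problems; an empty list means the write is compatible.

    Both arguments map column name -> type name, e.g. {"age": "int"}.
    This is an illustrative model only, not Delta Lake's real rule set.
    """
    problems = []
    for name, dtype in incoming_schema.items():
        if name not in table_schema:
            problems.append(f"extra column: {name}")
        elif table_schema[name] != dtype:
            problems.append(
                f"type mismatch on {name}: table has {table_schema[name]}, got {dtype}"
            )
    return problems


table = {"customer_id": "int", "name": "string", "age": "int"}

# Same columns and types: no problems reported
print(check_write_compatible(table, dict(table)))

# A string customer_id and an unknown column are both reported
print(check_write_compatible(
    table, {"customer_id": "string", "name": "string", "city": "string"}
))
```

The cells that follow exercise the same two failure modes (extra column, wrong type) against a real Delta table.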

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from delta import *

# Configure Spark for Delta Lake
# Note: In Databricks, this is pre-configured
spark = (SparkSession.builder
         .appName("DeltaLakeSchemaEnforcement")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Sample customer data with various data quality issues
customer_data = [
    (1, "John Doe", "john.doe@email.com", 28, "Premium", 1200.50, "2023-01-15"),
    (2, "Jane Smith", "jane.smith@email.com", 35, "Standard", 850.75, "2023-01-16"),
    (3, "Bob Johnson", "bob@company.com", 42, "Premium", 2100.00, "2023-01-17"),
    (4, "Alice Brown", "alice.brown@service.org", 29, "Gold", 1500.25, "2023-01-18"),
    (5, "Charlie Wilson", "charlie@email.com", 33, "Standard", 750.00, "2023-01-19")
]

customer_schema = StructType([
    StructField("customer_id", IntegerType(), False),  # NOT NULL
    StructField("name", StringType(), False),          # NOT NULL
    StructField("email", StringType(), False),         # NOT NULL
    StructField("age", IntegerType(), True),
    StructField("tier", StringType(), True),
    StructField("account_balance", DoubleType(), True),
    StructField("created_date", StringType(), True)
])

customers_df = spark.createDataFrame(customer_data, customer_schema)

print("Customer data created:")
customers_df.show()
customers_df.printSchema()

## Creating Delta Tables with Schema Enforcement

Let's create a Delta table and demonstrate basic schema enforcement:

In [None]:
# Create Delta table with initial schema
delta_table_path = "/tmp/delta_customers"

# Clear existing data for demo
try:
    dbutils.fs.rm(delta_table_path, True)
except Exception:
    pass  # dbutils is only defined on Databricks; elsewhere, clear the path manually

print("=== Creating Initial Delta Table ===")

# Write initial data to create the Delta table
(customers_df
 .withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))
 .write
 .format("delta")
 .mode("overwrite")
 .save(delta_table_path))

print(f"Delta table created at: {delta_table_path}")

# Read the table back to verify
delta_df = spark.read.format("delta").load(delta_table_path)
print("\nTable contents:")
delta_df.show()
print("\nTable schema:")
delta_df.printSchema()

## Demonstrating Schema Enforcement

Now let's see how Delta Lake prevents schema violations:

In [None]:
print("=== Schema Enforcement Demonstrations ===")

# 1. Adding extra columns (will fail without schema evolution)
print("\n1. Testing extra column addition:")

bad_data_extra_col = [
    (6, "David Lee", "david@email.com", 31, "Premium", 900.00, "2023-01-20", "New York")  # Extra column
]

bad_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("tier", StringType(), True),
    StructField("account_balance", DoubleType(), True),
    StructField("created_date", StringType(), True),
    StructField("city", StringType(), True)  # Extra column!
])

bad_df = spark.createDataFrame(bad_data_extra_col, bad_schema)
bad_df = bad_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (bad_df
     .write
     .format("delta")
     .mode("append")
     .save(delta_table_path))
    print("❌ Unexpected: Write succeeded")
except Exception as e:
    print(f"✅ Expected: Schema enforcement prevented write - {str(e)[:100]}...")

In [None]:
# 2. Wrong data types (will fail)
print("\n2. Testing wrong data types:")

bad_data_types = [
    ("seven", "Eve Green", "eve@email.com", 25, "Standard", 600.00, "2023-01-21")  # String instead of Int
]

bad_type_df = spark.createDataFrame(bad_data_types, 
                                   ["customer_id", "name", "email", "age", "tier", "account_balance", "created_date"])
bad_type_df = bad_type_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (bad_type_df
     .write
     .format("delta")
     .mode("append")
     .save(delta_table_path))
    print("❌ Unexpected: Write succeeded")
except Exception as e:
    print(f"✅ Expected: Type mismatch prevented write - {str(e)[:100]}...")

In [None]:
# 3. Missing columns (rejected only when an omitted column is NOT NULL)
print("\n3. Testing missing columns:")

incomplete_data = [
    (7, "Frank Miller", 40, "Premium", 1100.00)  # No email or created_date
]

incomplete_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    # Missing email field!
    StructField("age", IntegerType(), True),
    StructField("tier", StringType(), True),
    StructField("account_balance", DoubleType(), True)
])

incomplete_df = spark.createDataFrame(incomplete_data, incomplete_schema)

# Delta fills columns that are absent from the DataFrame with NULL, so this
# append is rejected only if the table enforces NOT NULL on `email`. A table
# created by a DataFrame write may not keep the nullable=False flags, so
# check which branch actually runs in your environment.
try:
    (incomplete_df
     .write
     .format("delta")
     .mode("append")
     .save(delta_table_path))
    print("⚠️  Write succeeded: missing columns were filled with NULL")
except Exception as e:
    print(f"✅ Write rejected (NOT NULL enforced on a missing column) - {str(e)[:100]}...")

## Safe Schema Evolution

When you need to evolve schemas, Delta Lake provides safe mechanisms:

In [None]:
print("=== Safe Schema Evolution ===")

# 1. Adding optional columns with mergeSchema option
print("\n1. Adding new optional columns:")

new_customer_data = [
    (6, "David Lee", "david@email.com", 31, "Premium", 900.00, "2023-01-20", "New York", "Engineering")
]

evolved_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("email", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("tier", StringType(), True),
    StructField("account_balance", DoubleType(), True),
    StructField("created_date", StringType(), True),
    StructField("city", StringType(), True),        # New optional column
    StructField("department", StringType(), True)   # New optional column
])

evolved_df = spark.createDataFrame(new_customer_data, evolved_schema)
evolved_df = evolved_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

# Enable schema merging
(evolved_df
 .write
 .format("delta")
 .mode("append")
 .option("mergeSchema", "true")  # This allows schema evolution
 .save(delta_table_path))

print("✅ Schema evolution successful!")

# Verify the evolved schema
evolved_table = spark.read.format("delta").load(delta_table_path)
print("\nEvolved table schema:")
evolved_table.printSchema()

print("\nTable contents after evolution:")
evolved_table.show()
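
The additive evolution shown above can be pictured with a small plain-Python model: new columns are appended, and a change to an existing column's type is refused. This is a simplified sketch under those assumptions, not Delta Lake's merge logic (the real rules are more nuanced); `merge_schemas` is a hypothetical helper.

```python
def merge_schemas(table_schema, incoming_schema):
    """Return the evolved schema, or raise if an existing column changes type.

    Schemas are dicts mapping column name -> type name. Simplified model only.
    """
    merged = dict(table_schema)  # copy so the caller's schema is not mutated
    for name, dtype in incoming_schema.items():
        if name not in merged:
            merged[name] = dtype  # new column is appended after existing ones
        elif merged[name] != dtype:
            raise ValueError(
                f"cannot change type of {name}: {merged[name]} -> {dtype}"
            )
    return merged


current = {"customer_id": "int", "tier": "string"}
incoming = {"customer_id": "int", "tier": "string", "city": "string", "department": "string"}

evolved = merge_schemas(current, incoming)
print(list(evolved))
```

Rows written before the evolution simply have no value for the new columns, which is why the earlier customers show NULL for `city` and `department` in the output above.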

## Implementing CHECK Constraints

Delta Lake supports CHECK constraints to enforce business rules at the table level:

In [None]:
print("=== Implementing CHECK Constraints ===")

# Create a new table with constraints for demonstration
constrained_table_path = "/tmp/delta_customers_constrained"

try:
    dbutils.fs.rm(constrained_table_path, True)
except Exception:
    pass  # dbutils is only defined on Databricks

# Create table using SQL for easier constraint definition
spark.sql(f"""
CREATE TABLE delta.`{constrained_table_path}` (
    customer_id INT NOT NULL,
    name STRING NOT NULL,
    email STRING NOT NULL,
    age INT,
    tier STRING,
    account_balance DOUBLE,
    created_date DATE
) USING DELTA
""")

print("Table created with basic schema")

In [None]:
# Add CHECK constraints
print("\nAdding CHECK constraints:")

# 1. Age constraint
spark.sql(f"""
ALTER TABLE delta.`{constrained_table_path}`
ADD CONSTRAINT age_check CHECK (age >= 18 AND age <= 120)
""")
print("✅ Age constraint added: age >= 18 AND age <= 120")

# 2. Account balance constraint
spark.sql(f"""
ALTER TABLE delta.`{constrained_table_path}`
ADD CONSTRAINT balance_check CHECK (account_balance >= 0)
""")
print("✅ Balance constraint added: account_balance >= 0")

# 3. Tier validation constraint
spark.sql(f"""
ALTER TABLE delta.`{constrained_table_path}`
ADD CONSTRAINT tier_check CHECK (tier IN ('Standard', 'Premium', 'Gold'))
""")
print("✅ Tier constraint added: tier IN ('Standard', 'Premium', 'Gold')")

# 4. Email format constraint (basic)
spark.sql(f"""
ALTER TABLE delta.`{constrained_table_path}`
ADD CONSTRAINT email_format_check CHECK (email LIKE '%@%')
""")
print("✅ Email format constraint added: email LIKE '%@%'")

In [None]:
# View table constraints
# Delta stores each CHECK constraint as a `delta.constraints.<name>` table property
print("\n=== Current Table Constraints ===")
properties_df = spark.sql(f"SHOW TBLPROPERTIES delta.`{constrained_table_path}`")
properties_df.filter(F.col("key").startswith("delta.constraints.")).show(truncate=False)
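
Delta Lake records each CHECK constraint as a table property whose key has the form `delta.constraints.<name>`, so a property listing can be reduced to just the constraints. The sketch below does that in plain Python; `extract_constraints` is a hypothetical helper, and `sample_properties` is made-up data standing in for the key/value pairs that `SHOW TBLPROPERTIES` would return.

```python
CONSTRAINT_PREFIX = "delta.constraints."


def extract_constraints(properties):
    """Map constraint name -> expression from a dict of table properties."""
    return {
        key[len(CONSTRAINT_PREFIX):]: value
        for key, value in properties.items()
        if key.startswith(CONSTRAINT_PREFIX)
    }


# Hypothetical sample; real values come from the table's properties
sample_properties = {
    "delta.minReaderVersion": "1",
    "delta.minWriterVersion": "3",
    "delta.constraints.age_check": "age >= 18 AND age <= 120",
    "delta.constraints.balance_check": "account_balance >= 0",
}

for name, expr in extract_constraints(sample_properties).items():
    print(f"{name}: {expr}")
```

With a live session, the input dict could be built from the collected rows of a `SHOW TBLPROPERTIES` query; the exact text of each stored expression may be normalized by the engine.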

## Testing Constraint Enforcement

Let's test how constraints prevent invalid data from entering the table:

In [None]:
print("=== Testing Constraint Enforcement ===")

# 1. Valid data (should succeed)
print("\n1. Inserting valid data:")

valid_data = [
    (1, "John Doe", "john@email.com", 25, "Standard", 500.00, "2023-01-15"),
    (2, "Jane Smith", "jane@email.com", 35, "Premium", 1500.00, "2023-01-16")
]

valid_df = spark.createDataFrame(valid_data, customer_schema)
valid_df = valid_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (valid_df
     .write
     .format("delta")
     .mode("append")
     .save(constrained_table_path))
    print("✅ Valid data inserted successfully")
except Exception as e:
    print(f"❌ Unexpected failure: {e}")

# Verify insertion
result_df = spark.read.format("delta").load(constrained_table_path)
print(f"Current record count: {result_df.count()}")

In [None]:
# 2. Invalid age (should fail)
print("\n2. Testing age constraint violation:")

invalid_age_data = [
    (3, "Too Young", "young@email.com", 15, "Standard", 100.00, "2023-01-17")  # Age < 18
]

invalid_age_df = spark.createDataFrame(invalid_age_data, customer_schema)
invalid_age_df = invalid_age_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (invalid_age_df
     .write
     .format("delta")
     .mode("append")
     .save(constrained_table_path))
    print("❌ Unexpected: Invalid age data was inserted")
except Exception as e:
    print(f"✅ Expected: Age constraint prevented insertion - {str(e)[:100]}...")

In [None]:
# 3. Invalid balance (should fail)
print("\n3. Testing balance constraint violation:")

invalid_balance_data = [
    (4, "Negative Balance", "negative@email.com", 30, "Standard", -100.00, "2023-01-18")
]

invalid_balance_df = spark.createDataFrame(invalid_balance_data, customer_schema)
invalid_balance_df = invalid_balance_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (invalid_balance_df
     .write
     .format("delta")
     .mode("append")
     .save(constrained_table_path))
    print("❌ Unexpected: Negative balance data was inserted")
except Exception as e:
    print(f"✅ Expected: Balance constraint prevented insertion - {str(e)[:100]}...")

In [None]:
# 4. Invalid tier (should fail)
print("\n4. Testing tier constraint violation:")

invalid_tier_data = [
    (5, "Invalid Tier", "invalid@email.com", 25, "Platinum", 2000.00, "2023-01-19")  # Invalid tier
]

invalid_tier_df = spark.createDataFrame(invalid_tier_data, customer_schema)
invalid_tier_df = invalid_tier_df.withColumn("created_date", F.to_date(F.col("created_date"), "yyyy-MM-dd"))

try:
    (invalid_tier_df
     .write
     .format("delta")
     .mode("append")
     .save(constrained_table_path))
    print("❌ Unexpected: Invalid tier data was inserted")
except Exception as e:
    print(f"✅ Expected: Tier constraint prevented insertion - {str(e)[:100]}...")

## Advanced Constraint Patterns

Let's explore more sophisticated constraint patterns:

In [None]:
print("=== Advanced Constraint Patterns ===")

# Create a more complex table for advanced constraints
advanced_table_path = "/tmp/delta_orders_advanced"

try:
    dbutils.fs.rm(advanced_table_path, True)
except Exception:
    pass  # dbutils is only defined on Databricks

# Create orders table
spark.sql(f"""
CREATE TABLE delta.`{advanced_table_path}` (
    order_id STRING NOT NULL,
    customer_id INT NOT NULL,
    order_date DATE NOT NULL,
    ship_date DATE,
    order_amount DECIMAL(10,2) NOT NULL,
    discount_percent DECIMAL(5,2),
    status STRING NOT NULL,
    priority STRING,
    created_timestamp TIMESTAMP NOT NULL
) USING DELTA
""")

print("Advanced orders table created")

In [None]:
# Add sophisticated constraints
print("\nAdding advanced constraints:")

# 1. Date logic constraint
spark.sql(f"""
ALTER TABLE delta.`{advanced_table_path}`
ADD CONSTRAINT date_logic_check 
CHECK (ship_date IS NULL OR ship_date >= order_date)
""")
print("✅ Date logic constraint: ship_date >= order_date")

# 2. Amount and discount relationship
spark.sql(f"""
ALTER TABLE delta.`{advanced_table_path}`
ADD CONSTRAINT discount_logic_check 
CHECK (discount_percent IS NULL OR (discount_percent >= 0 AND discount_percent <= 50))
""")
print("✅ Discount constraint: 0 <= discount_percent <= 50")

# 3. Allowed status values
spark.sql(f"""
ALTER TABLE delta.`{advanced_table_path}`
ADD CONSTRAINT status_check 
CHECK (status IN ('Pending', 'Processing', 'Shipped', 'Delivered', 'Cancelled'))
""")
print("✅ Status constraint: Valid status values")

# 4. Order ID format constraint
spark.sql(f"""
ALTER TABLE delta.`{advanced_table_path}`
ADD CONSTRAINT order_id_format_check 
CHECK (order_id RLIKE '^ORD-[0-9]{{8}}$')
""")
print("✅ Order ID format constraint: ORD-########")

# 5. Minimum order amount
spark.sql(f"""
ALTER TABLE delta.`{advanced_table_path}`
ADD CONSTRAINT minimum_order_check 
CHECK (order_amount >= 0.01)
""")
print("✅ Minimum order amount constraint: >= $0.01")
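
Before attaching a pattern such as the order ID rule to a table, it can help to try it against a few sample values. Spark's `RLIKE` uses Java regular expressions, but for a pattern this simple Python's `re` module should behave the same way, so a quick local check is a reasonable sketch. `is_valid_order_id` is a hypothetical helper, and the sample IDs are invented.

```python
import re

# Same pattern as the order_id_format_check constraint
ORDER_ID_PATTERN = re.compile(r"^ORD-[0-9]{8}$")


def is_valid_order_id(order_id):
    """Return True when order_id is 'ORD-' followed by exactly eight digits."""
    return ORDER_ID_PATTERN.search(order_id) is not None


samples = ["ORD-00000001", "INVALID-ID", "ORD-1234567", "ORD-123456789", "XORD-00000001"]
for value in samples:
    print(f"{value!r}: {is_valid_order_id(value)}")
```

Only the first sample satisfies the pattern; the others are too short, too long, or carry a different prefix.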

In [None]:
# Test advanced constraints
print("\n=== Testing Advanced Constraints ===")

# Valid order data
valid_orders = [
    ("ORD-00000001", 1, "2023-01-15", "2023-01-18", 125.99, 5.0, "Shipped", "High", "2023-01-15 10:30:00"),
    ("ORD-00000002", 2, "2023-01-16", None, 89.50, None, "Processing", "Normal", "2023-01-16 14:20:00")
]

orders_schema = StructType([
    StructField("order_id", StringType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("order_date", StringType(), False),
    StructField("ship_date", StringType(), True),
    StructField("order_amount", DoubleType(), False),
    StructField("discount_percent", DoubleType(), True),
    StructField("status", StringType(), False),
    StructField("priority", StringType(), True),
    StructField("created_timestamp", StringType(), False)
])

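valid_orders_df = spark.createDataFrame(valid_orders, orders_schema)
# Convert to the table's column types; schema enforcement would otherwise
# reject the DoubleType amounts against the DECIMAL columns.
valid_orders_df = (valid_orders_df
                  .withColumn("order_date", F.to_date("order_date"))
                  .withColumn("ship_date", F.to_date("ship_date"))
                  .withColumn("order_amount", F.col("order_amount").cast("decimal(10,2)"))
                  .withColumn("discount_percent", F.col("discount_percent").cast("decimal(5,2)"))
                  .withColumn("created_timestamp", F.to_timestamp("created_timestamp")))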

# Insert valid data
try:
    (valid_orders_df
     .write
     .format("delta")
     .mode("append")
     .save(advanced_table_path))
    print("✅ Valid orders inserted successfully")
except Exception as e:
    print(f"❌ Unexpected failure: {e}")

# Show current data
current_orders = spark.read.format("delta").load(advanced_table_path)
print("\nCurrent orders:")
current_orders.show()

In [None]:
# Test constraint violations
print("\n=== Testing Constraint Violations ===")

# 1. Invalid order ID format
print("\n1. Testing invalid order ID format:")
invalid_id_orders = [
    ("INVALID-ID", 3, "2023-01-17", None, 50.00, None, "Pending", "Low", "2023-01-17 09:00:00")
]

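invalid_id_df = spark.createDataFrame(invalid_id_orders, orders_schema)
# Cast to the table's DECIMAL types so only the constraint can reject the row
invalid_id_df = (invalid_id_df
                .withColumn("order_date", F.to_date("order_date"))
                .withColumn("ship_date", F.to_date("ship_date"))
                .withColumn("order_amount", F.col("order_amount").cast("decimal(10,2)"))
                .withColumn("discount_percent", F.col("discount_percent").cast("decimal(5,2)"))
                .withColumn("created_timestamp", F.to_timestamp("created_timestamp")))

try:
    (invalid_id_df
     .write
     .format("delta")
     .mode("append")
     .save(advanced_table_path))
    print("❌ Unexpected: Invalid order ID was accepted")
except Exception as e:
    # Show the error so a schema mismatch is not mistaken for a constraint violation
    print(f"✅ Expected: Order ID format constraint prevented insertion - {str(e)[:100]}...")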

# 2. Invalid date logic (ship_date < order_date)
print("\n2. Testing invalid date logic:")
invalid_date_orders = [
    ("ORD-00000003", 4, "2023-01-20", "2023-01-18", 75.00, None, "Shipped", "Normal", "2023-01-20 11:00:00")
]

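invalid_date_df = spark.createDataFrame(invalid_date_orders, orders_schema)
# Cast to the table's DECIMAL types so only the constraint can reject the row
invalid_date_df = (invalid_date_df
                  .withColumn("order_date", F.to_date("order_date"))
                  .withColumn("ship_date", F.to_date("ship_date"))
                  .withColumn("order_amount", F.col("order_amount").cast("decimal(10,2)"))
                  .withColumn("discount_percent", F.col("discount_percent").cast("decimal(5,2)"))
                  .withColumn("created_timestamp", F.to_timestamp("created_timestamp")))

try:
    (invalid_date_df
     .write
     .format("delta")
     .mode("append")
     .save(advanced_table_path))
    print("❌ Unexpected: Invalid date logic was accepted")
except Exception as e:
    # Show the error so a schema mismatch is not mistaken for a constraint violation
    print(f"✅ Expected: Date logic constraint prevented insertion - {str(e)[:100]}...")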

## Building Declarative Data Quality Functions

Let's create reusable functions that leverage Delta Lake's constraints:

In [None]:
print("=== Declarative Data Quality Functions ===")

class DeltaTableManager:
    """
    Utility class for managing Delta tables with constraints
    """
    
    @staticmethod
    def create_customer_table(table_path, constraints=True):
        """
        Create a customer table with optional constraints
        """
        # Create base table
        spark.sql(f"""
        CREATE TABLE IF NOT EXISTS delta.`{table_path}` (
            customer_id INT NOT NULL,
            name STRING NOT NULL,
            email STRING NOT NULL,
            age INT,
            tier STRING,
            account_balance DOUBLE,
            created_date DATE,
            city STRING,
            department STRING
        ) USING DELTA
        """)
        
        if constraints:
            DeltaTableManager.add_customer_constraints(table_path)
    
    @staticmethod
    def add_customer_constraints(table_path):
        """
        Add standard customer constraints to a Delta table
        """
        constraints = [
            ("age_check", "age >= 18 AND age <= 120"),
            ("balance_check", "account_balance >= 0"),
            ("tier_check", "tier IN ('Standard', 'Premium', 'Gold')"),
            ("email_format_check", "email LIKE '%@%'")
        ]
        
        for constraint_name, constraint_expr in constraints:
            try:
                spark.sql(f"""
                ALTER TABLE delta.`{table_path}`
                ADD CONSTRAINT {constraint_name} CHECK ({constraint_expr})
                """)
                print(f"✅ Added constraint: {constraint_name}")
            except Exception as e:
                if "already exists" in str(e).lower():
                    print(f"⚠️  Constraint {constraint_name} already exists")
                else:
                    print(f"❌ Failed to add constraint {constraint_name}: {e}")
    
    @staticmethod
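    def validate_before_write(df, table_path):
        """
        Validate a DataFrame against the customer rules before writing.
        This is a proactive validation approach: the checks below are hard-coded
        to mirror add_customer_constraints (table_path is not read), and rows
        with NULL in a checked column are not counted as invalid.
        """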
        print("=== Pre-write Validation ===")
        
        validation_results = {}
        
        # Age validation
        if "age" in df.columns:
            invalid_age_count = df.filter((F.col("age") < 18) | (F.col("age") > 120)).count()
            validation_results["age_check"] = invalid_age_count == 0
            if invalid_age_count > 0:
                print(f"❌ Age validation failed: {invalid_age_count} records with invalid age")
        
        # Balance validation
        if "account_balance" in df.columns:
            invalid_balance_count = df.filter(F.col("account_balance") < 0).count()
            validation_results["balance_check"] = invalid_balance_count == 0
            if invalid_balance_count > 0:
                print(f"❌ Balance validation failed: {invalid_balance_count} records with negative balance")
        
        # Tier validation
        if "tier" in df.columns:
            valid_tiers = ['Standard', 'Premium', 'Gold']
            invalid_tier_count = df.filter(~F.col("tier").isin(valid_tiers)).count()
            validation_results["tier_check"] = invalid_tier_count == 0
            if invalid_tier_count > 0:
                print(f"❌ Tier validation failed: {invalid_tier_count} records with invalid tier")
        
        # Email validation
        if "email" in df.columns:
            invalid_email_count = df.filter(~F.col("email").contains("@")).count()
            validation_results["email_format_check"] = invalid_email_count == 0
            if invalid_email_count > 0:
                print(f"❌ Email validation failed: {invalid_email_count} records with invalid email")
        
        all_passed = all(validation_results.values())
        
        if all_passed:
            print("✅ All validations passed")
        else:
            print("❌ Some validations failed")
        
        return all_passed, validation_results
    
    @staticmethod
    def safe_write_to_delta(df, table_path, mode="append", validate=True):
        """
        Safely write DataFrame to Delta table with validation
        """
        if validate:
            is_valid, validation_results = DeltaTableManager.validate_before_write(df, table_path)
            if not is_valid:
                raise ValueError("Data validation failed. Cannot write to Delta table.")
        
        try:
            (df
             .write
             .format("delta")
             .mode(mode)
             .save(table_path))
            print(f"✅ Successfully wrote {df.count()} records to {table_path}")
        except Exception as e:
            print(f"❌ Write failed: {e}")
            raise

print("DeltaTableManager utility class defined")
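
One way to keep constraint definitions declarative is to hold them as data and generate the `ALTER TABLE` statements from that data. The sketch below only builds SQL strings, so it runs without a Spark session; `build_constraint_statements` is a hypothetical helper, and the identifier check is an added safeguard that is not part of the class above.

```python
import re

# Constraint names are interpolated into SQL, so restrict them to plain identifiers
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_constraint_statements(table_path, constraints):
    """Return one ALTER TABLE ... ADD CONSTRAINT statement per entry.

    `constraints` maps constraint name -> CHECK expression.
    """
    statements = []
    for name, expr in constraints.items():
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"invalid constraint name: {name!r}")
        statements.append(
            f"ALTER TABLE delta.`{table_path}` ADD CONSTRAINT {name} CHECK ({expr})"
        )
    return statements


customer_constraints = {
    "age_check": "age >= 18 AND age <= 120",
    "balance_check": "account_balance >= 0",
}

for statement in build_constraint_statements("/tmp/delta_customers_managed", customer_constraints):
    print(statement)
```

Each generated string could then be passed to `spark.sql(...)`, which keeps the rule definitions in one place and makes them easy to review or test on their own.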

In [None]:
# Test the declarative data quality functions
print("=== Testing Declarative Data Quality ===")

managed_table_path = "/tmp/delta_customers_managed"

try:
    dbutils.fs.rm(managed_table_path, True)
except Exception:
    pass  # dbutils is only defined on Databricks

# Create managed table with constraints
DeltaTableManager.create_customer_table(managed_table_path, constraints=True)

# Test with valid data
print("\n1. Testing with valid data:")
valid_customers = [
    (1, "Alice Johnson", "alice@email.com", 28, "Premium", 1200.50, "2023-01-15", "New York", "Engineering"),
    (2, "Bob Smith", "bob@email.com", 35, "Gold", 2500.00, "2023-01-16", "California", "Sales")
]

valid_df = spark.createDataFrame(valid_customers, 
                                ["customer_id", "name", "email", "age", "tier", 
                                 "account_balance", "created_date", "city", "department"])
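# Python ints are inferred as bigint; cast to match the table's INT columns
valid_df = (valid_df
            .withColumn("customer_id", F.col("customer_id").cast("int"))
            .withColumn("age", F.col("age").cast("int"))
            .withColumn("created_date", F.to_date("created_date")))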

DeltaTableManager.safe_write_to_delta(valid_df, managed_table_path)

# Test with invalid data (pre-validation should catch this)
print("\n2. Testing with invalid data:")
invalid_customers = [
    (3, "Too Young", "young@email.com", 15, "Standard", 100.00, "2023-01-17", "Texas", "Support"),  # Invalid age
    (4, "Negative Balance", "negative@email.com", 30, "Premium", -500.00, "2023-01-18", "Florida", "Marketing")  # Invalid balance
]

invalid_df = spark.createDataFrame(invalid_customers,
                                  ["customer_id", "name", "email", "age", "tier", 
                                   "account_balance", "created_date", "city", "department"])
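# Python ints are inferred as bigint; cast to match the table's INT columns
invalid_df = (invalid_df
              .withColumn("customer_id", F.col("customer_id").cast("int"))
              .withColumn("age", F.col("age").cast("int"))
              .withColumn("created_date", F.to_date("created_date")))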

try:
    DeltaTableManager.safe_write_to_delta(invalid_df, managed_table_path)
except ValueError as e:
    print(f"✅ Expected: Pre-validation prevented write - {e}")

# Show final table contents
final_table = spark.read.format("delta").load(managed_table_path)
print(f"\nFinal table record count: {final_table.count()}")
final_table.show()

## Summary

**Key Takeaways:**

1. **Schema Enforcement Benefits**:
   - Prevents data corruption at write time
   - Ensures data consistency across the lakehouse
   - Reduces the need for application-level schema validation

2. **Delta Lake Constraints**:
   - CHECK constraints enforce business rules declaratively
   - Constraints are evaluated at write time
   - Support complex expressions and relationships

3. **Schema Evolution**:
   - Use `.option("mergeSchema", "true")` on the write for additive schema evolution
   - Add optional columns without breaking existing queries
   - Maintain backward compatibility

4. **Best Practices**:
   - Define constraints early in table lifecycle
   - Use descriptive constraint names
   - Implement pre-write validation for better error messages
   - Build reusable constraint management utilities

5. **Functional Programming Alignment**:
   - Constraints are declarative (what, not how)
   - Immutable data guarantees with schema enforcement
   - Composable validation functions
   - Pure functions for constraint checking
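
The "composable" and "pure function" points can be illustrated with a small plain-Python sketch that builds row-level rules from reusable pieces. The names `in_range`, `one_of`, and `all_of` are hypothetical, and rows are modeled as dicts rather than Spark rows. In this sketch a missing value fails a rule; that is a choice made here, and how NULL interacts with a real CHECK constraint should be confirmed against the Delta Lake documentation.

```python
def in_range(column, low, high):
    """Build a rule that passes when low <= row[column] <= high."""
    def check(row):
        value = row.get(column)
        return value is not None and low <= value <= high
    return check


def one_of(column, allowed):
    """Build a rule that passes when row[column] is one of the allowed values."""
    def check(row):
        return row.get(column) in allowed
    return check


def all_of(*checks):
    """Combine rules; the result passes only when every rule passes."""
    def check(row):
        return all(c(row) for c in checks)
    return check


customer_rule = all_of(
    in_range("age", 18, 120),
    in_range("account_balance", 0, float("inf")),
    one_of("tier", {"Standard", "Premium", "Gold"}),
)

rows = [
    {"customer_id": 1, "age": 28, "account_balance": 1200.50, "tier": "Premium"},
    {"customer_id": 3, "age": 15, "account_balance": 100.00, "tier": "Standard"},
]

rejected = [row["customer_id"] for row in rows if not customer_rule(row)]
print(rejected)
```

Because each rule is a function of the row alone, rules can be tested in isolation and recombined for other tables without touching the write path.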

**Benefits of Declarative Data Quality**:
- Centralized data quality rules
- Automatic enforcement without application logic
- Clear error messages for quality violations
- Consistent quality across all write paths
- Self-documenting data contracts

**Next Steps**: In the next notebook, we'll explore Delta Live Tables (DLT) for even more advanced declarative data quality patterns and automated pipeline management.

## Exercise

Practice implementing schema enforcement and constraints:

1. Create a Delta table for your domain with appropriate constraints
2. Test constraint violations with realistic bad data
3. Implement safe schema evolution by adding new optional columns
4. Build a validation utility class for your specific use case
5. Create a complete data quality pipeline with pre-write validation

In [None]:
# Your exercise code here

# 1. Create your domain-specific Delta table
def create_your_table(table_path):
    """
    Create a Delta table for your specific domain
    """
    # Your table creation logic
    pass

# 2. Define domain-specific constraints
def add_your_constraints(table_path):
    """
    Add constraints relevant to your domain
    """
    # Your constraint logic
    pass

# 3. Create validation functions
def validate_your_data(df):
    """
    Pre-write validation for your data
    """
    # Your validation logic
    pass

# 4. Test with sample data
# your_test_data = [...]
# test_df = spark.createDataFrame(your_test_data, your_schema)
# validate_your_data(test_df)