# Run 2 Validation

This notebook validates the output tables after the Run 2 (Day 2) data load.

## Expected Changes in Run 2:
- **Customer Updates**:
  - John (ID=1): Email changed from john.doe@example.com to jdoe@example.com
  - Richard (ID=10): Marked for deletion (`DELETE_FLAG=1`)
  - New customers: Alice (ID=3), Joe (ID=4)
- **Customer Address Updates**:
  - Jane (ID=2): City changed to Perth, WA
  - New: Alice (ID=3) in Sydney, NSW
- **SCD2 Behavior**: Previous versions should be closed, new versions created
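
The close-and-reopen behavior listed above can be sketched in plain Python. This is only an illustration of SCD2 semantics, not the pipeline's implementation: `Version` and `apply_change` are hypothetical names, the sequence numbers stand in for real timestamps, and every email except John's two addresses is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Version:
    customer_id: int
    email: Optional[str]
    start_at: int                 # sequence number of the change that opened this version
    end_at: Optional[int] = None  # None means the version is still active


def apply_change(history: List[Version], customer_id: int, email: Optional[str],
                 seq: int, delete: bool = False) -> None:
    """Close the active version for customer_id, then open a new one unless deleting."""
    for version in history:
        if version.customer_id == customer_id and version.end_at is None:
            version.end_at = seq
    if not delete:
        history.append(Version(customer_id, email, seq))


history: List[Version] = []

# Run 1 (Day 1): initial load. Jane's and Richard's emails are placeholders.
apply_change(history, 1, "john.doe@example.com", seq=1)
apply_change(history, 2, "jane@example.com", seq=1)
apply_change(history, 10, "richard@example.com", seq=1)

# Run 2 (Day 2): one update, one delete, two inserts (placeholder emails for 3 and 4).
apply_change(history, 1, "jdoe@example.com", seq=2)
apply_change(history, 10, None, seq=2, delete=True)
apply_change(history, 3, "alice@example.com", seq=2)
apply_change(history, 4, "joe@example.com", seq=2)

active = [v for v in history if v.end_at is None]
closed = [v for v in history if v.end_at is not None]
print(len(active), len(closed))  # 4 2
```

John ends up with one closed and one active version, and Richard's only version is closed without a replacement, which is the pattern the silver-layer checks look for.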


In [None]:
%run "./initialize"


In [None]:
# Import validation utilities
from validation_utils import ValidationRunner

# Initialize validation runner with spark session
v = ValidationRunner(spark)


## Bronze Layer Validations

After Run 2:
- `customer`: 7 rows total (3 original + 4 new CDC records)
- `customer_address`: 6 rows (4 original + 2 new)
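
The `customer` row count follows from simple addition if bronze keeps every CDC record. A minimal sketch of that arithmetic, assuming append-only bronze tables; the operation labels and variable names are illustrative and do not come from the source data:

```python
# Hypothetical CDC batches, one tuple per record landed in bronze: (CUSTOMER_ID, operation).
run1_customer = [(1, "insert"), (2, "insert"), (10, "insert")]
run2_customer = [(1, "update"), (10, "delete"), (3, "insert"), (4, "insert")]

# Assumption: bronze appends every CDC record and never overwrites earlier ones.
bronze_customer = run1_customer + run2_customer

expected_rows = len(run1_customer) + len(run2_customer)
distinct_ids = sorted({customer_id for customer_id, _ in bronze_customer})

print(expected_rows)  # 7
print(distinct_ids)   # [1, 2, 3, 4, 10]
```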


In [None]:
print("=" * 60)
print("BRONZE LAYER - Base Samples")
print("=" * 60)

# Bronze customer - 3 original + 4 new (John update, Alice, Joe, Richard delete marker)
v.validate_row_count(f"{bronze_schema}.customer", 7, "Total customer CDC records")

# All customer IDs including new ones
v.validate_values_exist(f"{bronze_schema}.customer", "CUSTOMER_ID", [1, 2, 3, 4, 10], "All customer IDs")

# Bronze customer_address - 4 original + 2 new (Jane update, Alice new)
v.validate_row_count(f"{bronze_schema}.customer_address", 6, "Total address CDC records")


## Silver Layer Validations

### SCD2 Behavior After Run 2:
- **customer**: 
  - John (ID=1): 1 closed (old email), 1 active (new email)
  - Jane (ID=2): 1 active (unchanged)
  - Alice (ID=3): 1 active (new)
  - Joe (ID=4): 1 active (new)
  - Richard (ID=10): 1 closed (deleted via `apply_as_deletes`)
  - Total: 4 active, 2 closed = 6 rows
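
Active and closed versions are distinguished by whether `__END_AT` is null. The sketch below reproduces the expected counts with an in-memory SQLite table so it can run without Spark. The table contents are hand-written to match the list above, the `day1`/`day2` end markers and all emails other than John's are placeholders, and the queries are only an assumption about what checks such as `validate_active_scd2_count` might run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (CUSTOMER_ID INTEGER, EMAIL TEXT, __END_AT TEXT)")

# Expected silver state after Run 2; __END_AT is NULL for active versions.
rows = [
    (1, "john.doe@example.com", "day2"),  # John, old email, closed by the update
    (1, "jdoe@example.com", None),        # John, new email, active
    (2, "jane@example.com", None),        # Jane, unchanged
    (3, "alice@example.com", None),       # Alice, new
    (4, "joe@example.com", None),         # Joe, new
    (10, "richard@example.com", "day2"),  # Richard, closed by the delete
]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

active = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE __END_AT IS NULL"
).fetchone()[0]
closed = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE __END_AT IS NOT NULL"
).fetchone()[0]
john_email = conn.execute(
    "SELECT EMAIL FROM customer WHERE CUSTOMER_ID = 1 AND __END_AT IS NULL"
).fetchone()[0]

print(active, closed, john_email)  # 4 2 jdoe@example.com
```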


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Base Samples")
print("=" * 60)

# Silver customer SCD2 after Day 2:
# Active: Jane (unchanged), Alice (new), Joe (new), John (updated) = 4
# Closed: John (old version), Richard (deleted) = 2
v.validate_active_scd2_count(f"{silver_schema}.customer", 4, "__END_AT")
v.validate_closed_scd2_count(f"{silver_schema}.customer", 2, "__END_AT")

# Validate John's current email is updated
v.validate_column_value(
    f"{silver_schema}.customer",
    "CUSTOMER_ID = 1 AND __END_AT IS NULL",
    "EMAIL",
    "jdoe@example.com",
    "John's updated email (active record)"
)

# Validate that active customer IDs exist, including new customers Alice (3) and Joe (4)
v.validate_values_exist(f"{silver_schema}.customer", "CUSTOMER_ID", [1, 2, 3, 4], "Active customer IDs")

# Silver customer_address SCD2:
# Jane (ID=2): 1 closed (Melbourne), 1 active (Perth)
# Alice (ID=3): 1 active (new)
# Original addresses for IDs 1, 4, 10 remain active
v.validate_min_closed_scd2_count(f"{silver_schema}.customer_address", 1, "__END_AT")


### Multi-Source Streaming Pipeline

The multi-source streaming table merges customer and customer_address data.


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Multi-Source Streaming")
print("=" * 60)

# After Run 2, multi-source should have:
# Active IDs: 1, 2, 3, 4 (10 was deleted)
# Should have historical records from changes
v.validate_active_scd2_count(
    f"{silver_schema}.customer_ms_basic",
    4,
    "__END_AT"
)

# Should have at least 1 closed record from changes
v.validate_min_closed_scd2_count(
    f"{silver_schema}.customer_ms_basic",
    1,
    "__END_AT"
)

print("\n" + "=" * 60)
print("OPERATIONAL METADATA - meta_load_details Validation")
print("=" * 60)

# Validate that the nested meta_load_details.pipeline_start_timestamp field is not null
# in each silver table (a single struct field is checked, not every field in the struct)
v.validate_column_not_null(
    f"{silver_schema}.customer",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_address",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_address meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_ms_basic",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_ms_basic meta_load_details"
)


## Gold Layer Validations


In [None]:
print("\n" + "=" * 60)
print("GOLD LAYER - Stream-Static")
print("=" * 60)

# Gold dimension should reflect silver layer changes
v.validate_min_row_count(
    f"{gold_schema}.dim_customer_sql_sample",
    4,  # At least 4 rows (active records)
    "Dimension records"
)


## Feature Samples - Snapshots


In [None]:
print("\n" + "=" * 60)
print("BRONZE LAYER - Feature Samples (Snapshots)")
print("=" * 60)

# Periodic snapshot should show SCD2 changes after snapshot update
# Run 2 overwrites snapshot source, so periodic snapshot should detect changes
v.validate_min_row_count(
    f"{bronze_schema}.feature_periodic_snapshot_scd2",
    3,  # At least 3 records from initial + any changes
    "Periodic snapshot records"
)


## Validation Summary


In [None]:
v.print_summary()
