# Run 3 Validation

This notebook validates the output tables after the Run 3 (Day 3) data load.

## Expected Changes in Run 3:
- **Snapshot Source Updates**:
  - customer_snapshot_source is overwritten: John (ID=1) is removed, leaving only Jane, Alice, Joe, and Richard
  - A new address arrives for customer ID=1 (Brisbane, QLD)
- **SCD2 Behavior**:
  - Periodic snapshots should detect the removal of John from the snapshot
  - Customer address should show a new version for ID=1


In [None]:
%run "./initialize"


In [None]:
# Import validation utilities
from validation_utils import ValidationRunner

# Initialize validation runner with spark session
v = ValidationRunner(spark)


## Bronze Layer Validations

Run 3 adds only one new address record, for customer ID=1 (Brisbane, QLD).


In [None]:
print("=" * 60)
print("BRONZE LAYER - Base Samples")
print("=" * 60)

# Bronze customer - no new customer records in Run 3, still 7
v.validate_row_count(f"{bronze_schema}.customer", 7, "Customer CDC records (no change)")

# Bronze customer_address - 6 from Run 2 + 1 new (ID=1 Brisbane)
v.validate_row_count(f"{bronze_schema}.customer_address", 7, "Address CDC records (+1 from Run 3)")


## Silver Layer Validations

### SCD2 Changes:
- Customer table: unchanged (no new customer data in Run 3)
- Customer address: ID=1 gets a new version (Brisbane), and the old version (Melbourne) is closed
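
The close-and-open behaviour described above can be sketched in plain Python. This is only an illustration of the SCD2 idea, not the pipeline's actual implementation: `apply_scd2_change` is a hypothetical helper, the in-memory list stands in for the silver table, the `__START_AT` column is assumed by analogy with `__END_AT`, and the timestamps are made up.

```python
from datetime import datetime


def apply_scd2_change(history, customer_id, city, change_ts):
    """Close the active version for customer_id, then append the new version.

    Hypothetical helper: history is a list of dicts standing in for table rows.
    """
    updated = []
    for row in history:
        if row["CUSTOMER_ID"] == customer_id and row["__END_AT"] is None:
            # Close the previously active version at the change timestamp.
            row = {**row, "__END_AT": change_ts}
        updated.append(row)
    # Open the new active version (an open-ended row has __END_AT = None).
    updated.append({
        "CUSTOMER_ID": customer_id,
        "CITY": city,
        "__START_AT": change_ts,
        "__END_AT": None,
    })
    return updated


# Illustrative data only: ID=1 starts in Melbourne, then moves to Brisbane.
history = [
    {"CUSTOMER_ID": 1, "CITY": "Melbourne",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": None},
]
history = apply_scd2_change(history, 1, "Brisbane", datetime(2024, 1, 3))

active = [r for r in history if r["CUSTOMER_ID"] == 1 and r["__END_AT"] is None]
closed = [r for r in history if r["__END_AT"] is not None]
print(active[0]["CITY"], len(closed))
```

The validations in the next cell check the same two facts against the real table: the active row for ID=1 has `CITY = Brisbane`, and at least the expected number of rows have a non-null `__END_AT`.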


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Base Samples")
print("=" * 60)

# Silver customer - unchanged from Run 2
v.validate_active_scd2_count(f"{silver_schema}.customer", 4, "__END_AT")

# Silver customer_address - ID=1 now has new active version
# Should have at least 2 closed records now (Jane's Melbourne + ID=1's original Melbourne)
v.validate_min_closed_scd2_count(f"{silver_schema}.customer_address", 2, "__END_AT")

# Validate ID=1's current city is Brisbane
v.validate_column_value(
    f"{silver_schema}.customer_address",
    "CUSTOMER_ID = 1 AND __END_AT IS NULL",
    "CITY",
    "Brisbane",
    "Customer 1's updated city (active record)"
)

print("\n" + "=" * 60)
print("OPERATIONAL METADATA - meta_load_details Validation")
print("=" * 60)

# Validate that meta_load_details is populated in each silver table.
# Only pipeline_start_timestamp is checked here; the wildcard check over
# every field in the struct is disabled (see the TODO below).
v.validate_column_not_null(
    f"{silver_schema}.customer",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_address",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_address meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_ms_basic",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_ms_basic meta_load_details"
)

# TODO: Add validation for all operational metadata fields. The wildcard check below intermittently fails on pipeline_update_id.
# v.validate_column_not_null(
#     f"{silver_schema}.customer_ms_basic",
#     "meta_load_details.*",
#     "Silver customer_ms_basic meta_load_details"
# )


## Feature Samples - Periodic Snapshot

The periodic snapshot should detect the change in customer_snapshot_source (John, ID=1, removed).
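
As a rough illustration of how comparing two consecutive snapshots surfaces a removed key, here is a minimal sketch in plain Python. It is not the pipeline's snapshot logic: `diff_snapshots` is a hypothetical helper, the IDs 2 to 5 assigned to Jane, Alice, Joe, and Richard are assumptions, and the previous snapshot is assumed to have contained all five customers.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by customer ID (hypothetical helper)."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        # Keys present before but missing now: SCD2 should close these rows.
        "deleted": sorted(prev_ids - curr_ids),
        # Keys that appear for the first time in the current snapshot.
        "inserted": sorted(curr_ids - prev_ids),
        # Keys present in both snapshots whose attributes differ.
        "updated": sorted(
            k for k in prev_ids & curr_ids if previous[k] != current[k]
        ),
    }


# Assumed contents: only John's ID=1 comes from the notebook text above.
previous_snapshot = {1: "John", 2: "Jane", 3: "Alice", 4: "Joe", 5: "Richard"}
current_snapshot = {2: "Jane", 3: "Alice", 4: "Joe", 5: "Richard"}

changes = diff_snapshots(previous_snapshot, current_snapshot)
print(changes)
```

Under these assumptions, only ID=1 lands in `deleted`. This is the case the validation below relies on when it expects the SCD2 table to keep its historical rows rather than shrink.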


In [None]:
print("\n" + "=" * 60)
print("BRONZE LAYER - Feature Samples (Snapshots)")
print("=" * 60)

# Periodic snapshot should have processed another snapshot
# John (ID=1) was removed from snapshot source in Run 3
# This should result in a delete operation in SCD2
v.validate_min_row_count(
    f"{bronze_schema}.feature_periodic_snapshot_scd2",
    4,  # At least 4 rows (original + changes from snapshots)
    "Periodic snapshot records after Run 3"
)


## Gold Layer Validations


In [None]:
print("\n" + "=" * 60)
print("GOLD LAYER - Stream-Static")
print("=" * 60)

# Gold dimension should now include additional history rows from the address change
v.validate_min_row_count(
    f"{gold_schema}.dim_customer_sql_sample",
    5,  # At least 5 rows including history
    "Dimension records with history"
)


## Validation Summary


In [None]:
v.print_summary()
