# Run 4 Validation

This notebook validates the output tables after the Run 4 (Day 4) data load.

## Expected Changes in Run 4:
- **Customer Updates**:
  - John (ID=1): Email changed from jdoe@example.com to john.doe@another.example.com
  - DELETE_FLAG set to False (was NULL)
- **SCD2 Behavior**: 
  - John should have another version created in silver.customer
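
The SCD2 behavior described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the pipeline's implementation: the `apply_scd2_update` helper, the `__START_AT` column, and the timestamps are hypothetical. Only `CUSTOMER_ID`, `EMAIL`, and `__END_AT` appear in this notebook.

```python
from datetime import datetime


def apply_scd2_update(history, customer_id, new_email, changed_at):
    """Close the active version for customer_id and append a new one.

    Hypothetical helper for illustration only. Returns a new list and
    leaves the input rows untouched.
    """
    updated = []
    for row in history:
        if row["CUSTOMER_ID"] == customer_id and row["__END_AT"] is None:
            # Close the currently active version at the change timestamp.
            row = {**row, "__END_AT": changed_at}
        updated.append(row)
    # Open the new active version (no end timestamp yet).
    updated.append({
        "CUSTOMER_ID": customer_id,
        "EMAIL": new_email,
        "__START_AT": changed_at,
        "__END_AT": None,
    })
    return updated


# John's active version before Run 4 (timestamps are made up).
history = [
    {"CUSTOMER_ID": 1, "EMAIL": "jdoe@example.com",
     "__START_AT": datetime(2024, 1, 3), "__END_AT": None},
]

history = apply_scd2_update(
    history, 1, "john.doe@another.example.com", datetime(2024, 1, 4)
)

active = [r for r in history if r["__END_AT"] is None]
closed = [r for r in history if r["__END_AT"] is not None]
```

After the update, the previous `jdoe@example.com` row is closed and the new email is the only active row for John, which is the shape the silver validations below check for.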


In [None]:
%run "./initialize"


In [None]:
# Import validation utilities
from validation_utils import ValidationRunner

# Initialize validation runner with spark session
v = ValidationRunner(spark)


## Bronze Layer Validations

Run 4 adds 1 new customer CDC record (John's email update); customer_address is unchanged.


In [None]:
print("=" * 60)
print("BRONZE LAYER - Base Samples")
print("=" * 60)

# Bronze customer - 7 from Run 3 + 1 new (John's email update)
v.validate_row_count(f"{bronze_schema}.customer", 8, "Customer CDC records (+1 John update)")

# Bronze customer_address - unchanged from Run 3
v.validate_row_count(f"{bronze_schema}.customer_address", 7, "Address CDC records (unchanged)")


## Silver Layer Validations

### SCD2 Changes:
- John (ID=1): Another version created (email changed to john.doe@another.example.com)
- Now John has 3 versions: original, jdoe, and john.doe@another.example.com
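
As a rough cross-check of the counts validated in the next cell, the expected state can be modeled as a small Python list. This is a sketch, not how `ValidationRunner` works internally: the `is_closed` flag stands in for "`__END_AT` is not null", and the version labels are descriptive placeholders rather than real column values.

```python
from collections import Counter

# Expected silver.customer rows after Run 4, based on the notes in this notebook.
# is_closed is a stand-in for "__END_AT IS NOT NULL" (assumption for illustration).
expected_rows = [
    {"customer": "John", "version": "original", "is_closed": True},
    {"customer": "John", "version": "jdoe", "is_closed": True},
    {"customer": "John", "version": "john.doe@another", "is_closed": False},
    {"customer": "Jane", "version": "current", "is_closed": False},
    {"customer": "Alice", "version": "current", "is_closed": False},
    {"customer": "Joe", "version": "current", "is_closed": False},
    {"customer": "Richard", "version": "deleted", "is_closed": True},
]

active_count = sum(1 for r in expected_rows if not r["is_closed"])
closed_count = sum(1 for r in expected_rows if r["is_closed"])
versions_per_customer = Counter(r["customer"] for r in expected_rows)
```

Under these assumptions there are 4 active rows and 3 closed rows. Other customers may carry additional closed versions from earlier runs, which is presumably why the closed-record check is a minimum rather than an exact count.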


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Base Samples")
print("=" * 60)

# Silver customer - still 4 active (John, Jane, Alice, Joe)
# Richard was deleted in Run 2
v.validate_active_scd2_count(f"{silver_schema}.customer", 4, "__END_AT")

# Silver customer - should now have at least 3 closed records:
# 1. John's original version
# 2. John's jdoe version
# 3. Richard (deleted)
v.validate_min_closed_scd2_count(f"{silver_schema}.customer", 3, "__END_AT")

# Validate John's current email is the new one
v.validate_column_value(
    f"{silver_schema}.customer",
    "CUSTOMER_ID = 1 AND __END_AT IS NULL",
    "EMAIL",
    "john.doe@another.example.com",
    "John's latest email (active record)"
)

# Customer address unchanged from Run 3
v.validate_min_closed_scd2_count(f"{silver_schema}.customer_address", 2, "__END_AT")


## Gold Layer Validations


In [None]:
print("\n" + "=" * 60)
print("GOLD LAYER - Stream-Static")
print("=" * 60)

# TODO: Fix the Gold pattern for DWH; it does not currently behave as intended,
# so the checks below stay disabled until it is fixed.
# Gold dimension should have accumulated more history
# v.validate_min_row_count(
#     f"{gold_schema}.dim_customer_sql_sample",
#     6,  # At least 6 rows including all history
#     "Final dimension records with full history"
# )

# Verify active record count
# v.validate_active_scd2_count(
#     f"{gold_schema}.dim_customer_sql_sample",
#     4,  # 4 active customers (John, Jane, Alice, Joe) - Richard deleted
#     "__END_AT"
# )


## Multi-Source Streaming Final State and Operational Metadata


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Multi-Source Streaming Final State")
print("=" * 60)

# Multi-source streaming should have accumulated history
v.validate_active_scd2_count(
    f"{silver_schema}.customer_ms_basic",
    4,  # 4 active (John, Jane, Alice, Joe)
    "__END_AT"
)

# Should have multiple closed records from all the changes
v.validate_min_closed_scd2_count(
    f"{silver_schema}.customer_ms_basic",
    2,  # At least 2 closed (Richard deleted, plus changes)
    "__END_AT"
)

print("\n" + "=" * 60)
print("OPERATIONAL METADATA - meta_load_details Validation")
print("=" * 60)

# Validate that the nested meta_load_details.pipeline_start_timestamp field is not null
# (one struct field is checked per silver table; no wildcard is used)
v.validate_column_not_null(
    f"{silver_schema}.customer",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_address",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_address meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_ms_basic",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_ms_basic meta_load_details"
)


## Validation Summary


In [None]:
v.print_summary()
