# Run 1 Validation

This notebook validates the output tables after Run 1 (Day 1) data load.

## Expected State After Run 1
- **Staging**: Initial customer data loaded (3 customers, 5 addresses, purchases)
- **Bronze**: Streaming tables populated from staging sources
- **Silver**: SCD2 tables with initial versions (all active, no history)
- **Gold**: Dimension tables populated from silver


In [None]:
%run "./initialize"


In [None]:
# Import validation utilities
from validation_utils import ValidationRunner

# Initialize validation runner with spark session
v = ValidationRunner(spark)


## Bronze Layer Validations

### Base Samples Pipeline
- `customer`: Streaming from staging.customer (3 rows: John, Jane, Richard)
- `customer_address`: Streaming with data quality, quarantine mode=table
  - 5 rows in staging; 1 has a NULL CUSTOMER_ID and should be quarantined, leaving 4 valid rows
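
The table-mode quarantine split described above can be sketched in plain Python. This is only an illustration of the expected counts, not the pipeline's implementation: the `split_quarantine` helper and the `ADDRESS_ID` field are hypothetical, and the rows are stand-ins that reproduce only the NULL `CUSTOMER_ID` pattern.

```python
# Hypothetical staging rows: 5 addresses, one of which has no CUSTOMER_ID.
staging_rows = [
    {"ADDRESS_ID": 1, "CUSTOMER_ID": 1},
    {"ADDRESS_ID": 2, "CUSTOMER_ID": 2},
    {"ADDRESS_ID": 3, "CUSTOMER_ID": 4},
    {"ADDRESS_ID": 4, "CUSTOMER_ID": 10},
    {"ADDRESS_ID": 5, "CUSTOMER_ID": None},
]


def split_quarantine(rows, required_column):
    """Split rows into (valid, quarantined) on a NOT NULL rule (illustrative helper)."""
    valid = [r for r in rows if r[required_column] is not None]
    quarantined = [r for r in rows if r[required_column] is None]
    return valid, quarantined


valid_rows, quarantined_rows = split_quarantine(staging_rows, "CUSTOMER_ID")
print(len(valid_rows), len(quarantined_rows))
```

Under these assumptions the target table receives 4 rows and the quarantine table receives 1, which matches the counts asserted in the next cell.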


In [None]:
print("=" * 60)
print("BRONZE LAYER - Base Samples")
print("=" * 60)

# Bronze customer table - should have 3 rows (John, Jane, Richard)
v.validate_row_count(f"{bronze_schema}.customer", 3, "Initial customer load")

# Validate customer IDs present
v.validate_values_exist(f"{bronze_schema}.customer", "CUSTOMER_ID", [1, 2, 10], "Customer IDs")

# Validate specific customer data
v.validate_column_value(
    f"{bronze_schema}.customer",
    "CUSTOMER_ID = 1",
    "EMAIL",
    "john.doe@example.com",
    "John's email"
)

# Bronze customer_address - should have 4 valid rows (NULL CUSTOMER_ID quarantined)
v.validate_row_count(f"{bronze_schema}.customer_address", 4, "Valid addresses (1 quarantined)")

# Quarantine table should have 1 row (NULL CUSTOMER_ID)
v.validate_quarantine_count(f"{bronze_schema}.customer_address_quarantine", 1)


### Feature Samples - Data Quality Pipeline

- `feature_quarantine_table`: Same source as customer_address, quarantine mode=table
- `feature_quarantine_flag`: Quarantine mode=flag (adds __IS_QUARANTINED column)
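
As a rough contrast between the two modes, the sketch below assumes that flag mode keeps every row in the target table and marks failing rows in `__IS_QUARANTINED`, rather than routing them to a separate table. That behavior, the `flag_quarantine` helper, and the sample rows are assumptions made for illustration.

```python
# Hypothetical source rows: same NULL CUSTOMER_ID pattern as customer_address.
source_rows = [
    {"CUSTOMER_ID": 1},
    {"CUSTOMER_ID": 2},
    {"CUSTOMER_ID": 4},
    {"CUSTOMER_ID": 10},
    {"CUSTOMER_ID": None},
]


def flag_quarantine(rows, required_column):
    """Return all rows with an added boolean __IS_QUARANTINED column (illustrative helper)."""
    return [{**r, "__IS_QUARANTINED": r[required_column] is None} for r in rows]


flagged_rows = flag_quarantine(source_rows, "CUSTOMER_ID")
flagged_count = sum(r["__IS_QUARANTINED"] for r in flagged_rows)
print(len(flagged_rows), flagged_count)
```

If the assumption holds, a flag-mode table would contain all 5 rows with exactly 1 flagged, whereas the table-mode target contains only the 4 valid rows.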


In [None]:
print("\n" + "=" * 60)
print("BRONZE LAYER - Feature Samples (Data Quality)")
print("=" * 60)

# Feature quarantine table - same as customer_address
v.validate_row_count(f"{bronze_schema}.feature_quarantine_table", 4, "Valid records in quarantine table feature")
v.validate_quarantine_count(f"{bronze_schema}.feature_quarantine_table_quarantine", 1)


### Feature Samples - Snapshots Pipeline

- Historical snapshots from files and tables
- Periodic snapshot SCD2
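
To make the idea of deriving SCD2 history from successive snapshots concrete, here is a minimal plain-Python sketch. It is not the pipeline's algorithm: the `snapshots_to_scd2` helper, the `__START_AT` column name, and the attribute values are hypothetical, and only the three snapshot dates and the customer IDs come from this notebook.

```python
from datetime import date

# Hypothetical snapshots: snapshot date -> {customer key: tracked attribute}.
snapshots = {
    date(2024, 1, 1): {1: "a", 2: "b", 10: "c"},
    date(2024, 1, 4): {1: "a", 2: "b2", 10: "c"},
    date(2024, 2, 10): {1: "a", 2: "b2", 10: "c2"},
}


def snapshots_to_scd2(snapshots):
    """Build SCD2 versions by comparing each snapshot with the open versions."""
    versions = []     # every version, open or closed
    open_by_key = {}  # key -> currently open version
    for snap_date in sorted(snapshots):
        snap = snapshots[snap_date]
        # Close open versions whose key disappeared or whose value changed.
        for key, version in list(open_by_key.items()):
            if key not in snap or snap[key] != version["value"]:
                version["__END_AT"] = snap_date
                del open_by_key[key]
        # Open a version for each key that has no open version.
        for key, value in snap.items():
            if key not in open_by_key:
                version = {"key": key, "value": value,
                           "__START_AT": snap_date, "__END_AT": None}
                versions.append(version)
                open_by_key[key] = version
    return versions


versions = snapshots_to_scd2(snapshots)
active = [v for v in versions if v["__END_AT"] is None]
print(len(versions), len(active))
```

With this invented data there are 5 versions in total and 3 active ones. The total depends on how many attributes change between snapshots, which is one reason a minimum row count is a safer assertion than an exact count for the historical table.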


In [None]:
print("\n" + "=" * 60)
print("BRONZE LAYER - Feature Samples (Snapshots)")
print("=" * 60)

# Historical snapshot from table - processes all historical snapshots
# Initial load has data with 3 timestamps (2024-01-01, 2024-01-04, 2024-02-10)
# SCD2 should track changes across snapshots
v.validate_min_row_count(
    f"{bronze_schema}.feature_historical_snapshot_table_datetime",
    3,  # At minimum 3 customers from initial snapshot
    "Historical snapshot from table"
)

# Periodic snapshot SCD2 - initial snapshot with 3 customers
v.validate_active_scd2_count(
    f"{bronze_schema}.feature_periodic_snapshot_scd2",
    3,
    "__END_AT"
)


## Silver Layer Validations

### Base Samples Pipeline
- `customer`: SCD2 with CDC from bronze.customer
- `customer_address`: SCD2 with CDC from bronze.customer_address
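
The active and closed counts checked for these tables can be illustrated as follows. The sketch assumes that an active SCD2 version is one whose `__END_AT` is NULL and a closed version is one whose `__END_AT` is set; the helper functions and the `__START_AT` values are hypothetical and are not taken from `validation_utils`.

```python
# Hypothetical silver.customer rows after the first load: one open version per key.
scd2_rows = [
    {"CUSTOMER_ID": 1, "__START_AT": "2024-01-01", "__END_AT": None},
    {"CUSTOMER_ID": 2, "__START_AT": "2024-01-01", "__END_AT": None},
    {"CUSTOMER_ID": 10, "__START_AT": "2024-01-01", "__END_AT": None},
]


def count_active(rows, end_column="__END_AT"):
    """Count versions that are still open (end column is NULL)."""
    return sum(1 for r in rows if r[end_column] is None)


def count_closed(rows, end_column="__END_AT"):
    """Count versions that have been superseded (end column is set)."""
    return sum(1 for r in rows if r[end_column] is not None)


active_count = count_active(scd2_rows)
closed_count = count_closed(scd2_rows)
print(len(scd2_rows), active_count, closed_count)
```

After an initial load with no updates, every row is an open version, so the total and active counts are equal and the closed count is zero. In SQL terms the same split would correspond to filtering on `__END_AT IS NULL` versus `__END_AT IS NOT NULL`.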


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Base Samples")
print("=" * 60)

# Silver customer - SCD2, all records should be active (no changes yet)
v.validate_row_count(f"{silver_schema}.customer", 3, "Initial SCD2 records")
v.validate_active_scd2_count(f"{silver_schema}.customer", 3)
v.validate_closed_scd2_count(f"{silver_schema}.customer", 0)

# Silver customer_address - SCD2, 4 valid records from bronze (quarantine filtered)
v.validate_row_count(f"{silver_schema}.customer_address", 4, "Initial SCD2 records")
v.validate_active_scd2_count(f"{silver_schema}.customer_address", 4)
v.validate_closed_scd2_count(f"{silver_schema}.customer_address", 0)


### Multi-Source Streaming Pipeline

- `customer_ms_basic`: Merges customer and customer_address streams
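
The expected count of 4 follows from taking the union of the keys seen on both streams. A small sketch of that arithmetic, using the customer IDs listed in this notebook (the variable names are illustrative):

```python
# Keys arriving from each source stream after Run 1.
customer_ids = {1, 2, 10}
address_ids = {1, 2, 4, 10}  # valid addresses only; the NULL key is quarantined upstream

# A merge keyed on CUSTOMER_ID yields one active record per distinct key.
merged_ids = customer_ids | address_ids
print(sorted(merged_ids))  # [1, 2, 4, 10]
```

Customer 4 appears only in the address stream, which is why the merged table has one more active record than `customer`.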


In [None]:
print("\n" + "=" * 60)
print("SILVER LAYER - Multi-Source Streaming")
print("=" * 60)

# Multi-source streaming combines customer and customer_address
# Customer IDs: 1, 2, 10 from customer + 1, 2, 4, 10 from address (deduplicated)
# Unique IDs: 1, 2, 4, 10 = 4 active records
v.validate_active_scd2_count(
    f"{silver_schema}.customer_ms_basic",
    4,  # Unique customer IDs from both sources
    "__END_AT"
)

print("\n" + "=" * 60)
print("OPERATIONAL METADATA - meta_load_details Validation")
print("=" * 60)

# Validate that the nested meta_load_details struct is populated in each silver table
# by checking one representative field: pipeline_start_timestamp
v.validate_column_not_null(
    f"{silver_schema}.customer",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_address",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_address meta_load_details"
)

v.validate_column_not_null(
    f"{silver_schema}.customer_ms_basic",
    "meta_load_details.pipeline_start_timestamp",
    "Silver customer_ms_basic meta_load_details"
)


## Gold Layer Validations

### Stream-Static Pipeline
- `dim_customer_sql_sample`: Dimension built from silver customer data


In [None]:
print("\n" + "=" * 60)
print("GOLD LAYER - Stream-Static")
print("=" * 60)

# Gold dimension - should have SCD2 records from silver
v.validate_active_scd2_count(
    f"{gold_schema}.dim_customer_sql_sample",
    4,  # Combined unique keys from customer and customer_address
    "__END_AT"
)


## YAML Sample Validations

The YAML sample pipeline mirrors the bronze base samples but is defined in YAML format, so its `customer` table should match `bronze.customer`.


In [None]:
print("\n" + "=" * 60)
print("YAML SAMPLE")
print("=" * 60)

# YAML customer - same as bronze.customer
v.validate_row_count(f"{yaml_schema}.customer", 3, "YAML customer table")


## Validation Summary


In [None]:
v.print_summary()
