# WaDi A1 - Pipeline Notebook 3: Curate and Validate  
* **Input:** Injected parquet from Notebook 2
* **Scope:** Curates the injected dataset into a clean warehouse parquet, runs validation contracts, performs leakage audit, and writes artifacts.
* **Output:** Warehouse parquet + validation report ready for feature engineering.

# Stage 0 - Setup

## 0.1 - Imports and Paths

In [1]:
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timezone
import json

import numpy as np
import pandas as pd
from IPython.display import display

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 180)

# Paths 
WORK_DIR      = Path("work")
PROJECT_DIR   = WORK_DIR / "wadi_A1"
DATA_DIR      = PROJECT_DIR / "data"
INJECTED_DIR  = DATA_DIR / "injected"
WH_DIR        = DATA_DIR / "warehouse"
REF_DIR       = DATA_DIR / "reference"
RUN_DIR       = REF_DIR / "pipeline_runs"

for p in [WH_DIR, RUN_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Project:   ", PROJECT_DIR)
print("Injected:  ", INJECTED_DIR)
print("Warehouse: ", WH_DIR)
print("Reference: ", REF_DIR)

Project:    work/wadi_A1
Injected:   work/wadi_A1/data/injected
Warehouse:  work/wadi_A1/data/warehouse
Reference:  work/wadi_A1/data/reference


## 0.2 - Helper Utilities

In [2]:
class PipelineError(RuntimeError):
    pass

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def write_json(path: Path, obj: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, default=str))

def read_json(path: Path) -> dict:
    return json.loads(path.read_text())

print("Helpers ready.")

Helpers ready.


## 0.3 - Configuration

In [3]:
DATASET_NAME = "WaDi.A1_9 Oct 2017"

# Empirical range bound margin — applied to normal operation min/max
RANGE_MARGIN = 0.10   # 10% margin beyond observed normal bounds

# Validation thresholds
RANGE_WARN_PCT  = 5.0   # warn if >5% of rows violate range bounds
RANGE_FAIL_PCT  = 20.0  # fail if >20% of rows violate range bounds

# Columns that must never enter the feature matrix
FAULT_META_COLS = [
    "fault_type", "fault_sensor", "fault_start",
    "fault_end",  "fault_severity"
]

RUN_ID = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_utc")

print(f"Dataset:       {DATASET_NAME}")
print(f"Range margin:  {RANGE_MARGIN:.0%}")
print(f"Warn thresh:   {RANGE_WARN_PCT}%")
print(f"Fail thresh:   {RANGE_FAIL_PCT}%")
print(f"Run ID:        {RUN_ID}")

Dataset:       WaDi.A1_9 Oct 2017
Range margin:  10%
Warn thresh:   5.0%
Fail thresh:   20.0%
Run ID:        20260223_142029_utc


# Stage 1 - Load Injected Data  
Loads the injected parquet from Notebook 2 and the canonical SENSOR_COLS reference from Notebook 1 

## 1.1 - Load Parquet

In [4]:
# Load most recent injected parquet
injected_files = sorted(INJECTED_DIR.glob("wadi_injected_*.parquet"))
if not injected_files:
    raise PipelineError(f"No injected parquet found in {INJECTED_DIR}")

injected_path = injected_files[-1]
print(f"Loading: {injected_path}")

df = pd.read_parquet(injected_path)
print(f"Shape:   {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

Loading: work/wadi_A1/data/injected/wadi_injected_20260223_141149_utc.parquet
Shape:   (1480008, 108)

Columns: ['timestamp', 'observation_day', 'seconds_since_start', 'split', 'label', '1_AIT_001_PV', '1_AIT_002_PV', '1_AIT_003_PV', '1_AIT_004_PV', '1_AIT_005_PV', '1_FIT_001_PV', '1_LT_001_PV', '1_MV_001_STATUS', '1_MV_002_STATUS', '1_MV_003_STATUS', '1_MV_004_STATUS', '1_P_001_STATUS', '1_P_003_STATUS', '1_P_005_STATUS', '1_P_006_STATUS', '2_DPIT_001_PV', '2_FIC_101_CO', '2_FIC_101_PV', '2_FIC_101_SP', '2_FIC_201_CO', '2_FIC_201_PV', '2_FIC_201_SP', '2_FIC_301_CO', '2_FIC_301_PV', '2_FIC_301_SP', '2_FIC_401_CO', '2_FIC_401_PV', '2_FIC_401_SP', '2_FIC_501_CO', '2_FIC_501_PV', '2_FIC_501_SP', '2_FIC_601_CO', '2_FIC_601_PV', '2_FIC_601_SP', '2_FIT_001_PV', '2_FIT_002_PV', '2_FIT_003_PV', '2_FQ_101_PV', '2_FQ_201_PV', '2_FQ_301_PV', '2_FQ_401_PV', '2_FQ_501_PV', '2_FQ_601_PV', '2_LS_101_AH', '2_LS_101_AL', '2_LS_201_AH', '2_LS_201_AL', '2_LS_301_AH', '2_LS_301_AL', '2_LS_401_AH', '2_LS_4

## 1.2 - Load Canonical Sensor Column List

In [5]:
# Load canonical sensor column list
sensor_cols_path = REF_DIR / "sensor_cols.json"
if not sensor_cols_path.exists():
    raise PipelineError(f"sensor_cols.json not found at {sensor_cols_path}")

sensor_ref  = read_json(sensor_cols_path)
SENSOR_COLS = sensor_ref["sensor_cols"]

print(f"SENSOR_COLS loaded: {len(SENSOR_COLS)} columns")
print(f"Source run ID:      {sensor_ref['run_id']}")

missing = [c for c in SENSOR_COLS if c not in df.columns]
if missing:
    raise PipelineError(f"SENSOR_COLS missing from injected data: {missing}")

print("All SENSOR_COLS present in injected data.")

SENSOR_COLS loaded: 98 columns
Source run ID:      20260223_140105_utc
All SENSOR_COLS present in injected data.


## 1.3 - Label Distribution

In [6]:
print("Label distribution:")
for label_val, label_name in [(0, "normal"), (1, "attack"), (2, "fault")]:
    n   = (df["label"] == label_val).sum()
    pct = n / len(df) * 100
    print(f"  {label_name:<8} ({label_val}): {n:>9,}  ({pct:.2f}%)")

print(f"\nSplit distribution:")
for split in ["train", "test"]:
    n = (df["split"] == split).sum()
    print(f"  {split:<6}: {n:>9,}")

print(f"\nTime range: {df['timestamp'].min()} → {df['timestamp'].max()}")
print(f"\nFault metadata columns present:")
for col in FAULT_META_COLS:
    present = col in df.columns
    print(f"  {col:<20s}  {'YES' if present else 'NO'}")

Label distribution:
  normal   (0): 1,209,601  (81.73%)
  attack   (1):   172,801  (11.68%)
  fault    (2):    97,606  (6.59%)

Split distribution:
  train : 1,307,207
  test  :   172,801

Time range: 2017-09-25 18:00:00+00:00 → 2017-10-11 18:00:00+00:00

Fault metadata columns present:
  fault_type            YES
  fault_sensor          YES
  fault_start           YES
  fault_end             YES
  fault_severity        YES


# Stage 2 - Curate  
Prepares the final warehouse dataset. Drops column that must not enter the feature matrix, enforces canonical column ordering, confirms dtypes, and writes the warehouse parquet.

## 2.1 - Drop Leakage and Metadata Columns  
Fault metadata columns are injection artifacts. They are not observable signals in a real deployment and must not enter the feature matrix.  Dropping them here ensures downstream notebooks cannot accidentally use them.

In [7]:
# Identify columns to drop 
cols_to_drop = [c for c in FAULT_META_COLS if c in df.columns]

print(f"Dropping {len(cols_to_drop)} fault metadata columns:")
for c in cols_to_drop:
    print(f"  {c}")

df_curated = df.drop(columns=cols_to_drop)
print(f"\nShape after drop: {df_curated.shape}")

Dropping 5 fault metadata columns:
  fault_type
  fault_sensor
  fault_start
  fault_end
  fault_severity

Shape after drop: (1480008, 103)


## 2.2 - Enforce Column Ordering  
Canonical order:  
* timestamp
* observation_day
* seconds_since_start
* split
* label
* `SENSOR_COLS` alphabetically

Consistent ordering across all notebooks prevents column mismatch bugs downstream

In [8]:
META_COLS = ["timestamp", "observation_day", "seconds_since_start", "split", "label"]

# Verify all expected columns are present
missing_meta = [c for c in META_COLS if c not in df_curated.columns]
if missing_meta:
    raise PipelineError(f"Missing meta columns: {missing_meta}")

missing_sensors = [c for c in SENSOR_COLS if c not in df_curated.columns]
if missing_sensors:
    raise PipelineError(f"Missing sensor columns: {missing_sensors}")

# Enforce ordering
df_curated = df_curated[META_COLS + SENSOR_COLS]

print(f"Column ordering enforced.")
print(f"  Meta columns:   {len(META_COLS)}")
print(f"  Sensor columns: {len(SENSOR_COLS)}")
print(f"  Total columns:  {len(df_curated.columns)}")
print(f"\nFirst 8 columns: {df_curated.columns.tolist()[:8]}")

Column ordering enforced.
  Meta columns:   5
  Sensor columns: 98
  Total columns:  103

First 8 columns: ['timestamp', 'observation_day', 'seconds_since_start', 'split', 'label', '1_AIT_001_PV', '1_AIT_002_PV', '1_AIT_003_PV']


## 2.3 - Confirm and Cast Dtypes

In [9]:
# Cast dtypes
df_curated["label"]               = df_curated["label"].astype("int8")
df_curated["seconds_since_start"] = df_curated["seconds_since_start"].astype("float32")

for col in SENSOR_COLS:
    df_curated[col] = df_curated[col].astype("float32")

# Report dtype summary
dtype_counts = df_curated.dtypes.value_counts()
print("Dtype summary:")
for dtype, count in dtype_counts.items():
    print(f"  {str(dtype):<15s}  {count} columns")

# Spot check
print(f"\nSpot check:")
print(f"  label dtype:               {df_curated['label'].dtype}")
print(f"  seconds_since_start dtype: {df_curated['seconds_since_start'].dtype}")
print(f"  first sensor dtype:        {df_curated[SENSOR_COLS[0]].dtype}")

Dtype summary:
  float32          99 columns
  object           2 columns
  datetime64[ns, UTC]  1 columns
  int8             1 columns

Spot check:
  label dtype:               int8
  seconds_since_start dtype: float32
  first sensor dtype:        float32


## 2.4 - Resolve Timestamp Conflicts  
When a fault window overlaps in time with a normal row at the same timestamp, the normal row is dropped. A timestamp cannot simultaneously be normal and faulted from the system classifier's perspective. Dropping the normal row produces unambiguous ground truth for model training and evaluation.

In [10]:
# Drop normal rows where a fault row exists at the same timestamp ────────────
# Identify timestamps that have at least one fault row per split
fault_timestamps_by_split = {}
for split in ["train", "test"]:
    fault_ts = set(
        df_curated[
            (df_curated["split"] == split) & (df_curated["label"] == 2)
        ]["timestamp"].astype(str)
    )
    fault_timestamps_by_split[split] = fault_ts

drop_mask = pd.Series(False, index=df_curated.index)

for split in ["train", "test"]:
    fault_ts = fault_timestamps_by_split[split]
    mask = (
        (df_curated["split"] == split) &
        (df_curated["label"] == 0) &
        (df_curated["timestamp"].astype(str).isin(fault_ts))
    )
    drop_mask |= mask

n_dropped = drop_mask.sum()
n_normal_conflict_dropped = n_dropped
print(f"Normal rows dropped due to fault timestamp conflict: {n_dropped:,}")

df_curated = df_curated[~drop_mask].reset_index(drop=True)

print(f"\nShape after deduplication: {df_curated.shape}")
print(f"\nLabel distribution after deduplication:")
for label_val, label_name in [(0, "normal"), (1, "attack"), (2, "fault")]:
    n   = (df_curated["label"] == label_val).sum()
    pct = n / len(df_curated) * 100
    print(f"  {label_name:<8} ({label_val}): {n:>9,}  ({pct:.2f}%)")

Normal rows dropped due to fault timestamp conflict: 93,350

Shape after deduplication: (1386658, 103)

Label distribution after deduplication:
  normal   (0): 1,116,251  (80.50%)
  attack   (1):   172,801  (12.46%)
  fault    (2):    97,606  (7.04%)


In [11]:
# Drop duplicate fault rows at the same timestamp within a split ─────────────
before = len(df_curated)

# For fault rows with duplicate timestamps in same split, keep first occurrence
fault_mask    = df_curated["label"] == 2
non_fault     = df_curated[~fault_mask]
fault_only    = df_curated[fault_mask]

fault_deduped = fault_only.drop_duplicates(
    subset=["timestamp", "split"], keep="first"
)

df_curated = pd.concat([non_fault, fault_deduped], ignore_index=True)
df_curated = df_curated.sort_values("timestamp").reset_index(drop=True)

n_dropped = before - len(df_curated)
n_fault_dup_dropped = n_dropped
print(f"Duplicate fault rows dropped: {n_dropped:,}")
print(f"Shape after fault dedup:      {df_curated.shape}")
print(f"\nLabel distribution:")
for label_val, label_name in [(0, "normal"), (1, "attack"), (2, "fault")]:
    n   = (df_curated["label"] == label_val).sum()
    pct = n / len(df_curated) * 100
    print(f"  {label_name:<8} ({label_val}): {n:>9,}  ({pct:.2f}%)")

Duplicate fault rows dropped: 4,256
Shape after fault dedup:      (1382402, 103)

Label distribution:
  normal   (0): 1,116,251  (80.75%)
  attack   (1):   172,801  (12.50%)
  fault    (2):    93,350  (6.75%)


## 2.5 - Write Warehouse Parquet  

In [12]:
wh_path = WH_DIR / f"wadi_curated_{RUN_ID}.parquet"
df_curated.to_parquet(wh_path, index=False)

size_mb = wh_path.stat().st_size / 1e6
print(f"Warehouse parquet written: {wh_path}")
print(f"Size:                      {size_mb:.1f} MB")
print(f"Shape:                     {df_curated.shape}")

Warehouse parquet written: work/wadi_A1/data/warehouse/wadi_curated_20260223_142029_utc.parquet
Size:                      100.5 MB
Shape:                     (1382402, 103)


# Stage 3 - Validate  
Runs validation contracts against the warehouse parquet. Checks are grouped into three categories:  
* structural integrity
* label integrity
* sensor range plausibility

Results are collected and reported together

## 3.1 - Define Validation Contracts

In [13]:
# Validation result collector
results = []

def record(check: str, passed: bool, detail: str = "") -> None:
    status = "PASS" if passed else "FAIL"
    results.append({"check": check, "status": status, "detail": detail})
    print(f"  [{status}] {check}" + (f" — {detail}" if detail else ""))

print("Validation contracts defined.")

Validation contracts defined.


## 3.2 - Compute Empirical Range Bounds  
Range bounds derived from normal operation rows in the train split only. A 10% margin is applied beyond the observed min/max to allow for natural variation in val/test normal rows without triggering false failures.  Fault and attack rows are permitted to exceed these bounds by design.

In [14]:
# Compute bounds from normal train rows only
normal_train = df_curated[
    (df_curated["label"] == 0) & (df_curated["split"] == "train")
]

sensor_bounds = {}
for col in SENSOR_COLS:
    series = normal_train[col].dropna()
    if len(series) == 0:
        continue
    col_min = float(series.min())
    col_max = float(series.max())
    margin  = (col_max - col_min) * RANGE_MARGIN
    sensor_bounds[col] = {
        "observed_min": col_min,
        "observed_max": col_max,
        "bound_min":    col_min - margin,
        "bound_max":    col_max + margin,
    }

print(f"Range bounds computed for {len(sensor_bounds)} sensors")
print(f"Source: normal train rows ({len(normal_train):,} rows)")
print(f"Margin: {RANGE_MARGIN:.0%} of observed range")

Range bounds computed for 98 sensors
Source: normal train rows (1,116,251 rows)
Margin: 10% of observed range


## 3.3 - Run Validation Checks

In [15]:
print("Running validation checks...\n")

# Structural checks 
print("Structural:")

# S1: Required columns present
required = ["timestamp", "observation_day", "seconds_since_start", "split", "label"]
missing  = [c for c in required + SENSOR_COLS if c not in df_curated.columns]
record("S1 Required columns present", len(missing) == 0,
       f"missing: {missing}" if missing else f"{len(required) + len(SENSOR_COLS)} columns confirmed")

# S2: No duplicate timestamps within each split-label combination
dup_count = 0
for split in ["train", "test"]:
    for label_val in [0, 1, 2]:
        subset = df_curated[
            (df_curated["split"] == split) & (df_curated["label"] == label_val)
        ]
        dups = subset["timestamp"].duplicated().sum()
        dup_count += dups
record("S2 No duplicate timestamps per split-label", dup_count == 0,
       f"{dup_count} duplicates found" if dup_count else "clean")

# S3: Fault metadata columns absent
meta_present = [c for c in FAULT_META_COLS if c in df_curated.columns]
record("S3 Fault metadata columns absent", len(meta_present) == 0,
       f"still present: {meta_present}" if meta_present else "all removed")

# S4: Timestamp monotonic increasing overall
record("S4 Timestamps monotonic increasing",
       df_curated["timestamp"].is_monotonic_increasing,
       "not monotonic" if not df_curated["timestamp"].is_monotonic_increasing else "clean")

# Label integrity checks 
print("\nLabel integrity:")

# L1: Only valid label values
valid_labels = {0, 1, 2}
actual_labels = set(df_curated["label"].unique())
record("L1 Only valid label values (0,1,2)", actual_labels <= valid_labels,
       f"unexpected: {actual_labels - valid_labels}" if not actual_labels <= valid_labels else str(actual_labels))

# L2: Expected labels present in each split
expected_labels = {
    "train": {0, 2},   # normal and fault
    "test":  {1},      # attack only
}
all_present  = True
detail_parts = []
for split, exp in expected_labels.items():
    split_labels = set(df_curated[df_curated["split"] == split]["label"].unique())
    if not exp.issubset(split_labels):
        all_present = False
        detail_parts.append(f"{split}: expected {exp}, got {split_labels}")
record("L2 Expected labels present per split", all_present,
       ", ".join(detail_parts) if detail_parts else "confirmed")

# L3: No NaN labels
n_null_labels = df_curated["label"].isna().sum()
record("L3 No NaN labels", n_null_labels == 0,
       f"{n_null_labels} NaN labels" if n_null_labels else "clean")

# Sensor range checks
print("\nSensor ranges (normal rows only):")

# R1: Normal rows within empirical bounds
normal_rows = df_curated[df_curated["label"] == 0]
total_violations = 0
violation_cols   = []

for col, bounds in sensor_bounds.items():
    col_data = normal_rows[col].dropna()
    n_violations = ((col_data < bounds["bound_min"]) |
                    (col_data > bounds["bound_max"])).sum()
    if n_violations > 0:
        total_violations += n_violations
        violation_cols.append(col)

violation_pct = total_violations / (len(normal_rows) * len(sensor_bounds)) * 100
passed = violation_pct < RANGE_FAIL_PCT
record("R1 Normal rows within empirical bounds",
       passed,
       f"{total_violations:,} violations ({violation_pct:.2f}%) across "
       f"{len(violation_cols)} sensors")

# R2: Sensor NaN rate in normal rows acceptable (<5%)
high_nan_sensors = []
for col in SENSOR_COLS:
    nan_pct = normal_rows[col].isna().mean() * 100
    if nan_pct > 5.0:
        high_nan_sensors.append(f"{col} ({nan_pct:.1f}%)")
record("R2 Sensor NaN rate in normal rows <5%", len(high_nan_sensors) == 0,
       f"high NaN: {high_nan_sensors}" if high_nan_sensors else "all sensors clean")

Running validation checks...

Structural:
  [PASS] S1 Required columns present — 103 columns confirmed
  [PASS] S2 No duplicate timestamps per split-label — clean
  [PASS] S3 Fault metadata columns absent — all removed
  [PASS] S4 Timestamps monotonic increasing — clean

Label integrity:
  [PASS] L1 Only valid label values (0,1,2) — {np.int8(0), np.int8(1), np.int8(2)}
  [PASS] L2 Expected labels present per split — confirmed
  [PASS] L3 No NaN labels — clean

Sensor ranges (normal rows only):
  [PASS] R1 Normal rows within empirical bounds — 0 violations (0.00%) across 0 sensors
  [PASS] R2 Sensor NaN rate in normal rows <5% — all sensors clean


## 3.4 - Anomaly Summary

In [16]:
# Overall NaN summary across all rows
null_counts  = df_curated[SENSOR_COLS].isnull().sum()
null_sensors = null_counts[null_counts > 0].sort_values(ascending=False)

print(f"Sensors with any NaN values: {len(null_sensors)} of {len(SENSOR_COLS)}")
if len(null_sensors) > 0:
    print(f"\nTop 10 by NaN count:")
    for col, count in null_sensors.head(10).items():
        pct = count / len(df_curated) * 100
        print(f"  {col:<35s}  {count:>9,}  ({pct:.2f}%)")

Sensors with any NaN values: 46 of 98

Top 10 by NaN count:
  2A_AIT_001_PV                              514  (0.04%)
  2_FIC_601_SP                               483  (0.03%)
  2_MCV_101_CO                               442  (0.03%)
  2_MCV_301_CO                               420  (0.03%)
  2_FQ_101_PV                                414  (0.03%)
  2_FIC_401_PV                               389  (0.03%)
  2_FQ_601_PV                                352  (0.03%)
  2_FIC_301_SP                               340  (0.02%)
  2_FIC_501_SP                               339  (0.02%)
  2B_AIT_001_PV                              330  (0.02%)


## 3.5 - Canary Checks  
Spot checks against known dataset properties to confirm the pipeline has not silently corrupted the data.

In [17]:
print("Canary checks:\n")

# C1: Total row count
expected_rows = 1_382_402
record("C1 Total row count", len(df_curated) == expected_rows,
       f"expected {expected_rows:,}, got {len(df_curated):,}")

# C2: Normal row count
expected_normal = 1_116_251
actual_normal   = (df_curated["label"] == 0).sum()
record("C2 Normal row count", actual_normal == expected_normal,
       f"expected {expected_normal:,}, got {actual_normal:,}")

# C3: Attack row count
expected_attack = 172801
actual_attack   = (df_curated["label"] == 1).sum()
record("C3 Attack row count", actual_attack == expected_attack,
       f"expected {expected_attack:,}, got {actual_attack:,}")

# C4: Dataset time range
expected_start = pd.Timestamp("2017-09-25 18:00:00+00:00")
expected_end   = pd.Timestamp("2017-10-11 18:00:00+00:00")
record("C4 Dataset time range",
       df_curated["timestamp"].min() == expected_start and
       df_curated["timestamp"].max() == expected_end,
       f"{df_curated['timestamp'].min()} → {df_curated['timestamp'].max()}")

# C5: Sensor column count
record("C5 Sensor column count", len(SENSOR_COLS) == 98,
       f"expected 98, got {len(SENSOR_COLS)}")

# C6: Fault row count
expected_fault = 93_350
actual_fault   = (df_curated["label"] == 2).sum()
record("C6 Fault row count", actual_fault == expected_fault,
       f"expected {expected_fault:,}, got {actual_fault:,}")

# Validation summary 
print("\n" + "=" * 60)
passed = [r for r in results if r["status"] == "PASS"]
failed = [r for r in results if r["status"] == "FAIL"]
print(f"Validation summary: {len(passed)} passed, {len(failed)} failed")
if failed:
    print("\nFailed checks:")
    for r in failed:
        print(f"  {r['check']} — {r['detail']}")

Canary checks:

  [PASS] C1 Total row count — expected 1,382,402, got 1,382,402
  [PASS] C2 Normal row count — expected 1,116,251, got 1,116,251
  [PASS] C3 Attack row count — expected 172,801, got 172,801
  [PASS] C4 Dataset time range — 2017-09-25 18:00:00+00:00 → 2017-10-11 18:00:00+00:00
  [PASS] C5 Sensor column count — expected 98, got 98
  [PASS] C6 Fault row count — expected 93,350, got 93,350

Validation summary: 15 passed, 0 failed


# Stage 4 - Leakage Audit  
Documents and closes out all deferred leakage risks identified. Each risk is explicitly resolved or accepted with justification

## 4.1 - Audit

In [18]:
leakage_audit = [
    {
        "risk": "Normalization statistics computed on full dataset",
        "status": "DEFERRED — OPEN",
        "resolution": "Normalization must be fit on train split only in the "
                      "feature engineering notebook. No normalization has been "
                      "applied in Notebooks 1-3.",
    },
    {
        "risk": "Rolling window features using future observations",
        "status": "DEFERRED — OPEN",
        "resolution": "Rolling features not yet computed. Feature engineering "
                      "notebook must use only backward-looking windows "
                      "(no center=True in rolling calls).",
    },
    {
        "risk": "dataset_id column encoding label directly",
        "status": "CLOSED",
        "resolution": "dataset_id dropped in Notebook 1 Stage 2.3 immediately "
                      "after label assignment. Not present in staged, injected, "
                      "or warehouse parquets.",
    },
    {
        "risk": "Fault metadata columns leaking injection details",
        "status": "CLOSED",
        "resolution": "fault_type, fault_sensor, fault_start, fault_end, "
                      "fault_severity dropped in Notebook 3 Stage 2.1. "
                      "Confirmed absent by validation check S3.",
    },
    {
        "risk": "Fault injection using information from attack labels",
        "status": "CLOSED",
        "resolution": "Fault injection in Notebook 2 operates only on normal "
                      "rows (label=0). Attack rows were never used to inform "
                      "fault window placement or parameters.",
    },
    {
        "risk": "Cross-split leakage in fault injection",
        "status": "CLOSED",
        "resolution": "Fault injection run on train split only with deterministic "
                      "seed (42). Test split contains only real attack rows — "
                      "no synthetic faults injected into test.",
    },
    {
        "risk": "Temporal leakage in train/test split",
        "status": "CLOSED",
        "resolution": "Split follows standard WaDi protocol: all normal operation "
                      "rows assigned to train, all attack period rows assigned to "
                      "test. Confirmed by temporal boundary check in Notebook 1 "
                      "Stage 2.8.",
    },
    {
        "risk": "Time features encoding clock-based patterns",
        "status": "CLOSED",
        "resolution": "Hour-of-day, day-of-week, and weekend flags excluded. "
                      "Only observation_day and seconds_since_start retained. "
                      "These have no physical relationship to faults or attacks.",
    },
]

print("Leakage Audit")
print("=" * 60)
open_risks   = [r for r in leakage_audit if r["status"].startswith("DEFERRED")]
closed_risks = [r for r in leakage_audit if r["status"].startswith("CLOSED")]

print(f"\nOpen (deferred to downstream): {len(open_risks)}")
for r in open_risks:
    print(f"\n  [{r['status']}]")
    print(f"  Risk:       {r['risk']}")
    print(f"  Resolution: {r['resolution']}")

print(f"\nClosed: {len(closed_risks)}")
for r in closed_risks:
    print(f"\n  [{r['status']}]")
    print(f"  Risk:       {r['risk']}")
    print(f"  Resolution: {r['resolution']}")

Leakage Audit

Open (deferred to downstream): 2

  [DEFERRED — OPEN]
  Risk:       Normalization statistics computed on full dataset
  Resolution: Normalization must be fit on train split only in the feature engineering notebook. No normalization has been applied in Notebooks 1-3.

  [DEFERRED — OPEN]
  Risk:       Rolling window features using future observations
  Resolution: Rolling features not yet computed. Feature engineering notebook must use only backward-looking windows (no center=True in rolling calls).

Closed: 6

  [CLOSED]
  Risk:       dataset_id column encoding label directly
  Resolution: dataset_id dropped in Notebook 1 Stage 2.3 immediately after label assignment. Not present in staged, injected, or warehouse parquets.

  [CLOSED]
  Risk:       Fault metadata columns leaking injection details
  Resolution: fault_type, fault_sensor, fault_start, fault_end, fault_severity dropped in Notebook 3 Stage 2.1. Confirmed absent by validation check S3.

  [CLOSED]
  Risk:      

# Stage 5 - Artifacts  
Writes the validation report and run log

## 5.1 - Validation Report

In [19]:
validation_report = {
    "run_id":          RUN_ID,
    "created_at_utc":  utc_now_iso(),
    "dataset":         DATASET_NAME,
    "warehouse_file":  str(wh_path),
    "validation_summary": {
        "total_checks": len(results),
        "passed":       len([r for r in results if r["status"] == "PASS"]),
        "failed":       len([r for r in results if r["status"] == "FAIL"]),
    },
    "checks":          results,
    "leakage_audit": {
        "open":   len(open_risks),
        "closed": len(closed_risks),
        "items":  leakage_audit,
    },
    "dataset_summary": {
        "total_rows":    len(df_curated),
        "total_cols":    len(df_curated.columns),
        "n_sensor_cols": len(SENSOR_COLS),
        "label_counts": {
            int(k): int(v)
            for k, v in df_curated["label"].value_counts().sort_index().items()
        },
        "split_label_counts": {
            split: {
                int(k): int(v)
                for k, v in df_curated[df_curated["split"] == split]["label"]
                .value_counts().sort_index().items()
            }
            for split in ["train", "test"]
        },
        "time_start": str(df_curated["timestamp"].min()),
        "time_end":   str(df_curated["timestamp"].max()),
    },
    "deduplication": {
        "normal_rows_dropped_conflict": int(n_normal_conflict_dropped),
        "fault_rows_dropped_duplicate": int(n_fault_dup_dropped),
        "rationale": (
            "A timestep cannot simultaneously be normal and faulted from the "
            "system classifier perspective. Normal rows were dropped where a "
            "fault row existed at the same timestamp. Duplicate fault rows "
            "arising from multiple sensors faulted at the same second were "
            "resolved by keeping the first occurrence."
        ),
    },
}

report_path = REF_DIR / f"validation_report_{RUN_ID}.json"
write_json(report_path, validation_report)
print(f"Validation report written: {report_path}")

Validation report written: work/wadi_A1/data/reference/validation_report_20260223_142029_utc.json


## 5.2 - Run Log

In [20]:
run_log = {
    "run_id":          RUN_ID,
    "created_at_utc":  utc_now_iso(),
    "stage":           "Notebook 3 — Curate and Validate",
    "dataset":         DATASET_NAME,
    "inputs": {
        "injected_parquet": str(injected_path),
        "sensor_cols_ref":  str(sensor_cols_path),
    },
    "outputs": {
        "warehouse_parquet":  str(wh_path),
        "validation_report":  str(report_path),
    },
    "dataset_summary": {
        "total_rows":    len(df_curated),
        "total_cols":    len(df_curated.columns),
        "n_sensor_cols": len(SENSOR_COLS),
        "label_counts": {
            int(k): int(v)
            for k, v in df_curated["label"].value_counts().sort_index().items()
        },
    },
    "validation": {
        "passed": len([r for r in results if r["status"] == "PASS"]),
        "failed": len([r for r in results if r["status"] == "FAIL"]),
    },
    "leakage_audit": {
        "open":   len(open_risks),
        "closed": len(closed_risks),
    },
    "notes": [
        "Fault metadata columns dropped — not observable signals in deployment.",
        "Normal rows dropped where fault rows exist at same timestamp — "
        "system-level classifier requires unambiguous ground truth.",
        "Duplicate fault rows at same timestamp resolved by keeping first occurrence.",
        "Two leakage risks deferred to feature engineering notebook — "
        "normalization and rolling window features.",
    ],
}

run_log_path = RUN_DIR / f"run_{RUN_ID}.json"
write_json(run_log_path, run_log)
print(f"Run log written: {run_log_path}")

Run log written: work/wadi_A1/data/reference/pipeline_runs/run_20260223_142029_utc.json


# Stage 6 - Reflection
Documents key decisions, assumptions, and risks from this notebook.

In [21]:
reflection = [
    ("Row definition",
     "Each row represents one second of sensor readings from the WaDi water "
     "distribution testbed. Label 0=normal operation, 1=cyber attack (original "
     "WaDi labels), 2=injected sensor fault. The warehouse parquet is the "
     "authoritative dataset for all downstream feature engineering and modeling."),

    ("Fault metadata removal",
     "Five fault metadata columns dropped: fault_type, fault_sensor, fault_start, "
     "fault_end, fault_severity. These are injection artifacts — not observable "
     "signals in a real deployment. Retaining them would allow the model to "
     "trivially identify fault rows without learning any sensor patterns."),

    ("Timestamp deduplication",
     f"Two deduplication passes applied. "
     f"First: {n_normal_conflict_dropped:,} normal rows dropped where "
     f"a fault row existed at the same timestamp — a system-level classifier "
     f"requires unambiguous ground truth. "
     f"Second: {n_fault_dup_dropped:,} duplicate fault rows dropped where "
     f"multiple sensors were faulted at the same second — first occurrence kept."),

    ("Validation",
     "Validation checks run across structural, label integrity, sensor range, "
     "and canary categories. Sensor range bounds computed from train normal rows. "
     "Range check on normal rows confirms sensor values within expected bounds."),

    ("Leakage audit",
     "8 leakage risks reviewed. 6 closed with documented resolutions. "
     "2 appropriately deferred to the feature engineering notebook: "
     "normalization must be fit on train split only, and rolling window "
     "features must use only backward-looking windows."),

    ("Final dataset composition",
     f"{len(df_curated):,} total rows across {len(df_curated.columns)} columns "
     f"(5 meta + {len(SENSOR_COLS)} sensors). "
     f"Train: {(df_curated[df_curated['split']=='train']['label']==0).sum():,} normal "
     f"({(df_curated[df_curated['split']=='train']['label']==0).mean()*100:.2f}%) and "
     f"{(df_curated[df_curated['split']=='train']['label']==2).sum():,} fault "
     f"({(df_curated[df_curated['split']=='train']['label']==2).mean()*100:.2f}%) rows. "
     f"Test: {(df_curated['label']==1).sum():,} attack rows only."),

    ("Known limitations",
     "All attack rows originate from the same two-day window (Oct 9-11 2017) — "
     "an inherent constraint of the WaDi dataset structure shared across all ICS "
     "security testbed datasets. Documented as a field-wide limitation, not a "
     "methodology weakness. Synthetic faults injected into train split only — "
     "test split contains only real attack data, matching the standard WaDi "
     "evaluation protocol."),

    ("Next step",
     "Feature engineering notebook loads wadi_curated_*.parquet and computes "
     "time-window statistics and cross-sensor features. Normalization fit on "
     "train split only. Rolling windows use backward-looking windows only. "
     "Output is a feature matrix ready for model training."),
]

print("Pipeline Reflection")
print("=" * 60)
for title, content in reflection:
    print(f"\n[{title}]")
    print(f"  {content}")

Pipeline Reflection

[Row definition]
  Each row represents one second of sensor readings from the WaDi water distribution testbed. Label 0=normal operation, 1=cyber attack (original WaDi labels), 2=injected sensor fault. The warehouse parquet is the authoritative dataset for all downstream feature engineering and modeling.

[Fault metadata removal]
  Five fault metadata columns dropped: fault_type, fault_sensor, fault_start, fault_end, fault_severity. These are injection artifacts — not observable signals in a real deployment. Retaining them would allow the model to trivially identify fault rows without learning any sensor patterns.

[Timestamp deduplication]
  Two deduplication passes applied. First: 93,350 normal rows dropped where a fault row existed at the same timestamp — a system-level classifier requires unambiguous ground truth. Second: 4,256 duplicate fault rows dropped where multiple sensors were faulted at the same second — first occurrence kept.

[Validation]
  Validation 

# Continues in WaDi A1 - Pipeline Notebook 4: Feature Engineering