# WaDi A1 - Pipeline Notebook 1: Ingest & Stage

The notebook will cover the following:  
* **Dataset:** WaDi.A1_9 Oct 2017 - Water Distribution testbed, iTrust Labs, SUTD
* **Scope:** Stages 0-2. Ingests raw CSV files, cleans and types the data, drops uninformative sensor columns, and assigns a temporal train/test split following the standard WaDi evaluation protocol.
* **Split strategy:** Normal operation rows (label=0) -> train. Attack period rows (label=1) -> test. This matches the convention used in the ICS anomaly detection literature.
* **Output:** A staged Parquet file with a `split` column, ready for fault injection.

# Stage 0 - Setup

## 0.1 - Imports and Paths

In [1]:
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timezone
import json
import hashlib

import numpy as np
import pandas as pd
from IPython.display import display

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 180)

# ── Paths ─────────────────────────────────────────────────────────────────────
WORK_DIR    = Path("work")
PROJECT_DIR = WORK_DIR / "wadi_A1"
DATA_DIR    = PROJECT_DIR / "data"
RAW_DIR     = DATA_DIR / "raw"
STAGED_DIR  = DATA_DIR / "staged"
REF_DIR     = DATA_DIR / "reference"
RUN_DIR     = REF_DIR / "pipeline_runs"

for p in [RAW_DIR, STAGED_DIR, REF_DIR, RUN_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Project:", PROJECT_DIR)
print("Raw:    ", RAW_DIR)
print("Staged: ", STAGED_DIR)
print("Ref:    ", REF_DIR)

Project: work/wadi_A1
Raw:     work/wadi_A1/data/raw
Staged:  work/wadi_A1/data/staged
Ref:     work/wadi_A1/data/reference


## 0.2 Helper Utilities  
Reusable functions used throughout the pipeline

In [2]:
class PipelineError(RuntimeError):
    pass

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def sha16(x: str) -> str:
    return hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]

def write_json(path: Path, obj: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, default=str))

def read_json(path: Path) -> dict:
    return json.loads(path.read_text())

def require_columns(df: pd.DataFrame, cols: list[str], context: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise PipelineError(f"[{context}] Missing required columns: {missing}")

print("Helpers ready.")

Helpers ready.
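
As a quick illustration of two of the helpers, the sketch below re-declares `sha16` and shows why `write_json` passes `default=str` to `json.dumps`: values such as `datetime` objects are not JSON-serializable on their own. The sample payload is invented for the demonstration and is not part of the pipeline.

```python
import hashlib
import json
from datetime import datetime, timezone

def sha16(x: str) -> str:
    # Same definition as the helper above: first 16 hex chars of SHA-256.
    return hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]

# Illustrative payload (an assumption for this demo only).
payload = {"created": datetime(2017, 10, 9, 18, 0, tzinfo=timezone.utc), "n": 3}

# Without default=str, json.dumps raises TypeError on the datetime value.
try:
    json.dumps(payload)
    plain_ok = True
except TypeError:
    plain_ok = False

# With default=str, unsupported values fall back to their str() form.
encoded = json.dumps(payload, default=str)
decoded = json.loads(encoded)

digest = sha16("WADI_14days.csv")
print(plain_ok, decoded["created"], len(digest))
```

Note that the round trip is lossy: the datetime comes back as a plain string, so readers of these JSON files need to parse such fields themselves.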


## 0.3 - Configuration  
Pipeline constants. 

In [3]:
# Dataset Identity
DATASET_NAME   = "WaDi.A1_9 Oct 2017"
DATA_SOURCE    = "iTrust Labs, SUTD"
DATASET_URL    = "https://itrust.sutd.edu.sg/itrust-labs_datasets/dataset_info/"
DOWNLOAD_DATE  = "2026-02-06"

# Expected raw files
EXPECTED_FILES = [
    "WADI_14days.csv",
    "WADI_attackdata.csv",
    "attack_description.xlsx",
    "table_WADI.pdf",
]

# Run ID
RUN_ID = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_utc")

print(f"Dataset:  {DATASET_NAME}")
print(f"Source:   {DATA_SOURCE}")
print(f"Run ID:   {RUN_ID}")
print(f"Split:    normal operation → train | attack period → test")

Dataset:  WaDi.A1_9 Oct 2017
Source:   iTrust Labs, SUTD
Run ID:   20260223_140105_utc
Split:    normal operation → train | attack period → test


# Stage 1 - Ingest: Acquire Raw Data  
Verifies all expected raw files are present and documents their provenance.  
The raw files are not modified - this stage is read-only

## 1.1 - Verify Raw Files

In [4]:
# Verify all expected raw files are present
missing = [f for f in EXPECTED_FILES if not (RAW_DIR / f).exists()]

if missing:
    raise PipelineError(f"Missing required files in {RAW_DIR}: {missing}")

print("All required files present:\n")
for fname in EXPECTED_FILES:
    fpath = RAW_DIR / fname
    size_mb = fpath.stat().st_size / 1e6
    print(f"  {fname:<35s}  {size_mb:>8.1f} MB")

All required files present:

  WADI_14days.csv                         777.5 MB
  WADI_attackdata.csv                     111.2 MB
  attack_description.xlsx                   0.0 MB
  table_WADI.pdf                            0.0 MB


## 1.2 - Write Raw Metadata  
Documents the source, download date, and file fingerprints for reproducibility

In [5]:
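def file_sha16(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file's raw bytes in chunks so large CSVs are never fully loaded."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:16]

file_meta = {}
for fname in EXPECTED_FILES:
    fpath = RAW_DIR / fname
    file_meta[fname] = {
        "size_bytes": fpath.stat().st_size,
        "sha16":      file_sha16(fpath),
        "path":       str(fpath),
    }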

metadata_raw = {
    "run_id":        RUN_ID,
    "created_at_utc": utc_now_iso(),
    "dataset_name":  DATASET_NAME,
    "source":        DATA_SOURCE,
    "source_url":    DATASET_URL,
    "download_date": DOWNLOAD_DATE,
    "files":         file_meta,
    "notes": [
        "Dataset requires registration with iTrust Labs.",
        "Files manually downloaded and placed in RAW_DIR.",
        "Normal operations: 14 days. Attack scenarios: 2 days.",
        "Attack labels in WADI_attackdata.csv cover Oct 9–11 2017.",
    ],
}

meta_path = RAW_DIR / f"metadata_raw_{RUN_ID}.json"
write_json(meta_path, metadata_raw)
print(f"Metadata written: {meta_path}")

Metadata written: work/wadi_A1/data/raw/metadata_raw_20260223_140105_utc.json
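
The stored fingerprints are only useful if a later run compares against them. The following is a minimal sketch of such a check, not part of the original pipeline: `verify_fingerprints` and `file_sha16` are hypothetical names, and the default fingerprint (SHA-256 over the raw bytes, truncated to 16 hex characters) is an assumption that must match whatever scheme produced the `sha16` values in the metadata file.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def file_sha16(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: SHA-256 over the file's raw bytes, first 16 hex chars."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:16]

def verify_fingerprints(meta_path: Path, fingerprint=file_sha16) -> list[str]:
    """Return names of files whose size or fingerprint no longer matches the metadata."""
    meta = json.loads(meta_path.read_text())
    changed = []
    for fname, info in meta["files"].items():
        fpath = Path(info["path"])
        if (not fpath.exists()
                or fpath.stat().st_size != info["size_bytes"]
                or fingerprint(fpath) != info["sha16"]):
            changed.append(fname)
    return changed

# Toy demonstration in a temporary directory (no real WaDi files involved).
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "toy.csv"
    raw.write_text("a,b\n1,2\n")
    meta_path = Path(tmp) / "metadata_raw_demo.json"
    meta_path.write_text(json.dumps({"files": {"toy.csv": {
        "size_bytes": raw.stat().st_size,
        "sha16": file_sha16(raw),
        "path": str(raw),
    }}}))
    unchanged = verify_fingerprints(meta_path)   # file untouched
    raw.write_text("a,b\n1,3\n")                 # same size, different content
    modified = verify_fingerprints(meta_path)

print(unchanged, modified)
```

The second check shows why the size alone is not enough: the edited file has the same byte count but a different fingerprint.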


# Stage 2 - Stage: Parse and Normalize  
* Loads raw CSVs
* Cleans column names
* Parses timestamps
* Combines normal and attack data into a single DataFrame
* Drops uninformative sensor columns
* Defines the SENSOR_COLS list
* Adds time features
* Assigns the train/test split (normal operation -> train, attack period -> test)
* Writes the staged Parquet file

## 2.1 - Define Staging Function  
Cleans the raw WaDi CSV structure:  
* Strips the Windows path prefix from column names
* Parses the separate Date + Time columns into a single timestamp and labels it as UTC (the raw clock values are kept as-is; no timezone conversion is applied)
* Drops the row-number column
* Casts all sensor columns to float32

In [6]:
# Windows path prefix present on all sensor column names
WADI_COL_PREFIX = "\\\\WIN-25J4RO10SBF\\LOG_DATA\\SUTD_WADI\\LOG_DATA\\"

def stage_wadi_data(df_raw: pd.DataFrame, dataset_id: str) -> pd.DataFrame:
    """
    Parse and normalize one raw WaDi CSV (normal or attack).
    Returns a clean, typed DataFrame with UTC timestamps.
    """
    df = df_raw.copy()

    # ── Strip Windows path prefix from column names ───────────────────────────
    df.columns = [
        c.replace(WADI_COL_PREFIX, "").strip()
        if c.startswith("\\\\") else c.strip()
        for c in df.columns
    ]

    # ── Drop row-number column (unnamed or 'Row') ─────────────────────────────
    drop_candidates = [c for c in df.columns if c.strip() in ("", "Row") or
                       c.startswith("Unnamed")]
    df = df.drop(columns=drop_candidates, errors="ignore")

    # ── Parse timestamp ───────────────────────────────────────────────────────
    # Normal file has 'Date' + 'Time', attack file has the same structure
    df["timestamp"] = pd.to_datetime(
        df["Date"].astype(str) + " " + df["Time"].astype(str),
        format="mixed",
        dayfirst=False,
    ).dt.tz_localize("UTC")

    df = df.drop(columns=["Date", "Time"], errors="ignore")

    # ── Add source identifier ─────────────────────────────────────────────────
    df["dataset_id"] = dataset_id

    # ── Sort by timestamp ─────────────────────────────────────────────────────
    df = df.sort_values("timestamp").reset_index(drop=True)

    # ── Cast sensor columns to float32 ────────────────────────────────────────
    meta_cols = {"timestamp", "dataset_id"}
    sensor_cols = [c for c in df.columns if c not in meta_cols]
    for col in sensor_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce").astype("float32")

    return df

print("Staging function defined.")

Staging function defined.
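
To make the individual transformations concrete, here is a small sketch that applies the same three steps (prefix stripping, timestamp parsing, numeric coercion) to a two-row toy frame shaped like the raw file. The toy values and the explicit `format` string are assumptions for illustration; the staging function itself uses `format="mixed"`.

```python
import pandas as pd

PREFIX = "\\\\WIN-25J4RO10SBF\\LOG_DATA\\SUTD_WADI\\LOG_DATA\\"

# Toy frame mimicking the raw layout; "bad" stands in for an unparseable reading.
toy = pd.DataFrame({
    "Row": [1, 2],
    "Date": ["9/25/2017", "9/25/2017"],
    "Time": ["6:00:00.000 PM", "6:00:01.000 PM"],
    PREFIX + "1_AIT_001_PV": ["171.155", "bad"],
})

# 1. Strip the Windows path prefix from the column names.
toy.columns = [c.replace(PREFIX, "").strip() for c in toy.columns]

# 2. Combine Date + Time and label the result as UTC (no conversion applied).
ts = pd.to_datetime(
    toy["Date"] + " " + toy["Time"],
    format="%m/%d/%Y %I:%M:%S.%f %p",  # assumed format for this toy input
).dt.tz_localize("UTC")

# 3. Coerce readings to float32; unparseable values become NaN.
values = pd.to_numeric(toy["1_AIT_001_PV"], errors="coerce").astype("float32")

print(list(toy.columns))
print(ts.iloc[0])
print(values.tolist())
```

Because `errors="coerce"` silently turns bad readings into NaN, counting NaNs before and after staging is a cheap way to see how much was coerced.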


## 2.2 - Load and Stage Raw Data  
Loads both CSV files and applies the staging function. The normal operations file begins with four preamble lines (three metadata lines and a blank line) that must be skipped before the header row.

In [7]:
# Peek at the first 10 raw lines to understand the file structure
with open(RAW_DIR / "WADI_14days.csv", "r", encoding="utf-8", errors="replace") as f:
    for i, line in enumerate(f):
        print(f"Line {i}: {line[:120]!r}")
        if i >= 9:
            break

Line 0: 'Created: 10/9/2017 6:05:57.359 PM Malay Peninsula Standard Time                       \n'
Line 1: 'Number of rows: 1.2096E+6\n'
Line 2: 'Interpolation interval: 1 seconds\n'
Line 3: '\n'
Line 4: 'Row,Date,Time,\\\\WIN-25J4RO10SBF\\LOG_DATA\\SUTD_WADI\\LOG_DATA\\1_AIT_001_PV,\\\\WIN-25J4RO10SBF\\LOG_DATA\\SUTD_WADI\\LOG_DATA\\1'
Line 5: '1,9/25/2017,6:00:00.000 PM,171.155,0.619473,11.5759,504.645,0.318319,0.00115685,0,0,47.8911,1,1,1,1,1,1,1,1,2,1,2464.88,'
Line 6: '2,9/25/2017,6:00:01.000 PM,171.155,0.619473,11.5759,504.645,0.318319,0.00115685,0,0,47.8911,1,1,1,1,1,1,1,1,2,1,2464.88,'
Line 7: '3,9/25/2017,6:00:02.000 PM,171.155,0.619473,11.5759,504.645,0.318319,0.00115685,0,0,47.8911,1,1,1,1,1,1,1,1,2,1,2464.88,'
Line 8: '4,9/25/2017,6:00:03.000 PM,171.155,0.607477,11.5725,504.673,0.318438,0.00120685,0,0,47.7503,1,1,1,1,1,1,1,1,2,1,2477.67,'
Line 9: '5,9/25/2017,6:00:04.000 PM,171.155,0.607477,11.5725,504.673,0.318438,0.00120685,0,0,47.7503,1,1,1,1,1,1,1,1,2,1,2477.67,'
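
The peek above shows the header on line 4. Rather than hardcoding that offset, one option is to locate the header programmatically. The sketch below is an assumption-laden illustration: `find_header_row` is a hypothetical helper, and it relies on the header being the first line that starts with `Row,`.

```python
import io

def find_header_row(lines, first_column: str = "Row") -> int:
    """Return the 0-based index of the first line starting with the header's first column."""
    for i, line in enumerate(lines):
        if line.startswith(first_column + ","):
            return i
    raise ValueError(f"No header line starting with {first_column!r} found")

# Toy input mirroring the layout printed above (sensor names shortened).
sample = io.StringIO(
    "Created: 10/9/2017 6:05:57.359 PM Malay Peninsula Standard Time\n"
    "Number of rows: 1.2096E+6\n"
    "Interpolation interval: 1 seconds\n"
    "\n"
    "Row,Date,Time,1_AIT_001_PV\n"
    "1,9/25/2017,6:00:00.000 PM,171.155\n"
)
header_idx = find_header_row(sample)
print(header_idx)  # 4
```

The returned index could then be passed as `skiprows` to `pd.read_csv`, which would also cover a file that has no preamble at all (index 0).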


In [8]:
# 1. Load normal operations CSV 
print("Loading normal operations data...")
df_raw_normal = pd.read_csv(RAW_DIR / "WADI_14days.csv", skiprows=4, header=0, low_memory=False)
print(f"  Raw shape: {df_raw_normal.shape}")

# 2. Load attack data CSV
print("\nLoading attack data...")
df_raw_attack = pd.read_csv(RAW_DIR / "WADI_attackdata.csv", low_memory=False)
print(f"  Raw shape: {df_raw_attack.shape}")

# 3. Stage both 
print("\nStaging normal data...")
df_normal = stage_wadi_data(df_raw_normal, dataset_id="normal")
print(f"  Staged shape: {df_normal.shape}")
print(f"  Time range: {df_normal['timestamp'].min()} → {df_normal['timestamp'].max()}")

print("\nStaging attack data...")
df_attack = stage_wadi_data(df_raw_attack, dataset_id="attack")
print(f"  Staged shape: {df_attack.shape}")
print(f"  Time range: {df_attack['timestamp'].min()} → {df_attack['timestamp'].max()}")

Loading normal operations data...
  Raw shape: (1209601, 130)

Loading attack data...
  Raw shape: (172801, 130)

Staging normal data...
  Staged shape: (1209601, 129)
  Time range: 2017-09-25 18:00:00+00:00 → 2017-10-09 18:00:00+00:00

Staging attack data...
  Staged shape: (172801, 129)
  Time range: 2017-10-09 18:00:00+00:00 → 2017-10-11 18:00:00+00:00


## 2.3 - Combine and Assign Labels  
* Concatenates normal and attack DataFrames into a single dataset
* Assigns the binary label column: 0=normal, 1=attack. The third class (2=fault) is added later by the fault injection notebook.
* Drops the dataset_id column. It directly encodes the label and would be data leakage if carried into the feature matrix.

In [9]:
# 1. Defragment both DataFrames before combining
df_normal = df_normal.copy()
df_attack = df_attack.copy()

df_normal["label"] = 0
df_attack["label"] = 1

# 2. Combine 
df = pd.concat([df_normal, df_attack], ignore_index=True)
df = df.sort_values("timestamp").reset_index(drop=True)

# 3. Drop dataset_id — encodes the label, would be direct leakage 
df = df.drop(columns=["dataset_id"])

# 4. Cast label to int8 
df["label"] = df["label"].astype("int8")

print(f"Combined shape: {df.shape}")
print(f"Time range:     {df['timestamp'].min()} → {df['timestamp'].max()}")
print(f"\nLabel counts:")
print(f"  Normal (0): {(df['label']==0).sum():>9,}")
print(f"  Attack (1): {(df['label']==1).sum():>9,}")

Combined shape: (1382402, 129)
Time range:     2017-09-25 18:00:00+00:00 → 2017-10-11 18:00:00+00:00

Label counts:
  Normal (0): 1,209,601
  Attack (1):   172,801
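
The two time ranges printed in section 2.2 share the endpoint 2017-10-09 18:00:00, so that timestamp appears once with label 0 and once with label 1 in the combined frame. A minimal sketch of how such shared timestamps could be surfaced, using invented toy frames rather than the real data:

```python
import pandas as pd

# Toy frames: the last "normal" second equals the first "attack" second.
normal = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-10-09 17:59:59", "2017-10-09 18:00:00"], utc=True),
    "label": [0, 0],
})
attack = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-10-09 18:00:00", "2017-10-09 18:00:01"], utc=True),
    "label": [1, 1],
})
combined = pd.concat([normal, attack], ignore_index=True)

# keep=False marks every row that shares its timestamp with another row.
dupes = combined[combined["timestamp"].duplicated(keep=False)]
n_shared = dupes["timestamp"].nunique()

print(dupes)
print("Shared timestamps:", n_shared)
```

Whether to keep both rows or drop one is a pipeline decision; the point of the check is only to make the overlap visible before anything indexes on `timestamp`.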


## 2.4 - Drop Uninformative Sensor Columns  
Identifies and removes sensor columns that carry no predictive signal:  
* **100% NaN** - sensor was never recorded or permanently offline
* **Constant value** - no variation across the entire dataset, zero discriminative power

These columns are documented explicitly before removal so the decision is reproducible and citable. They will be excluded from SENSOR_COLS permanently.  

In [10]:
# Identify all sensor columns (everything except timestamp and label)
meta_cols = {"timestamp", "label"}
all_sensor_candidates = [c for c in df.columns if c not in meta_cols]

# Find 100% NaN columns 
null_counts = df[all_sensor_candidates].isnull().sum()
fully_null = null_counts[null_counts == len(df)].index.tolist()

# Find constant columns (zero variance)
constant = [
    c for c in all_sensor_candidates
    if c not in fully_null and df[c].nunique(dropna=True) <= 1
]

print(f"100% NaN columns ({len(fully_null)}):")
for c in fully_null:
    print(f"  {c}")

print(f"\nConstant-value columns ({len(constant)}):")
for c in constant:
    val = df[c].dropna().unique()
    print(f"  {c:<35s}  value={val[0] if len(val) else 'NaN'}")

print(f"\nTotal to drop: {len(fully_null) + len(constant)}")

100% NaN columns (4):
  2_LS_001_AL
  2_LS_002_AL
  2_P_001_STATUS
  2_P_002_STATUS

Constant-value columns (25):
  1_LS_001_AL                          value=0.0
  1_LS_002_AL                          value=0.0
  1_P_002_STATUS                       value=1.0
  1_P_004_STATUS                       value=1.0
  2_MV_001_STATUS                      value=1.0
  2_MV_002_STATUS                      value=1.0
  2_MV_004_STATUS                      value=2.0
  2_MV_005_STATUS                      value=2.0
  2_MV_009_STATUS                      value=2.0
  2_P_004_STATUS                       value=1.0
  2_SV_101_STATUS                      value=1.0
  2_SV_201_STATUS                      value=1.0
  2_SV_301_STATUS                      value=1.0
  2_SV_401_STATUS                      value=1.0
  2_SV_501_STATUS                      value=1.0
  2_SV_601_STATUS                      value=1.0
  3_LS_001_AL                          value=1.0
  3_MV_001_STATUS                      value=1.0
  3_

In [11]:
# Drop uninformative columns 
cols_to_drop = fully_null + constant

# Document what we're dropping before removing
drop_log = {
    "fully_null":  fully_null,
    "constant":    constant,
    "total_dropped": len(cols_to_drop),
}

df = df.drop(columns=cols_to_drop)

n_before = len(df.columns) + len(cols_to_drop)
print(f"Columns before drop: {n_before}")
print(f"Columns dropped:     {len(cols_to_drop)}")
print(f"Columns remaining:   {len(df.columns)}")
print(f"\nRemaining columns: timestamp, label + {len(df.columns) - 2} sensor columns")

Columns before drop: 129
Columns dropped:     29
Columns remaining:   100

Remaining columns: timestamp, label + 98 sensor columns
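
One subtlety of the rule above is that `nunique(dropna=True) <= 1` also flags a column holding a single value interspersed with missing readings. The toy frame below (invented column names and values) sketches how each kind of column is classified:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "all_nan":        [np.nan, np.nan, np.nan, np.nan],
    "constant":       [1.0, 1.0, 1.0, 1.0],
    "constant_gappy": [2.0, np.nan, 2.0, np.nan],  # one value plus missing readings
    "varying":        [0.1, 0.2, 0.1, 0.3],
})

# Same two rules as the cell above, applied to the toy frame.
fully_null = [c for c in toy.columns if toy[c].isna().all()]
constant = [
    c for c in toy.columns
    if c not in fully_null and toy[c].nunique(dropna=True) <= 1
]
kept = [c for c in toy.columns if c not in fully_null + constant]

print(fully_null, constant, kept)
```

If the pattern of missingness itself were informative, a column like `constant_gappy` might deserve a second look before being dropped.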


## 2.5 - Define and Freeze SENSOR_COLS  
Defines the canonical list of sensor columns used by all downstream modules.  
Written to a reference JSON so fault injection and curation modules load the same column list without hardcoding it.

In [12]:
# Define canonical sensor column list 
SENSOR_COLS = [c for c in df.columns if c not in {"timestamp", "label"}]

print(f"SENSOR_COLS count: {len(SENSOR_COLS)}")
print(f"\nFirst 10: {SENSOR_COLS[:10]}")
print(f"Last 10:  {SENSOR_COLS[-10:]}")

# Write to reference JSON for downstream notebooks 
sensor_cols_path = REF_DIR / "sensor_cols.json"
write_json(sensor_cols_path, {
    "run_id":       RUN_ID,
    "dataset":      DATASET_NAME,
    "sensor_cols":  SENSOR_COLS,
    "n_sensors":    len(SENSOR_COLS),
    "dropped_fully_null": drop_log["fully_null"],
    "dropped_constant":   drop_log["constant"],
    "notes": [
        "100% NaN columns dropped — sensor never recorded or permanently offline.",
        "Constant-value columns dropped — no variation, zero discriminative power.",
        "This list is the canonical feature set for all downstream notebooks.",
    ]
})

print(f"\nSENSOR_COLS written to: {sensor_cols_path}")

SENSOR_COLS count: 98

First 10: ['1_AIT_001_PV', '1_AIT_002_PV', '1_AIT_003_PV', '1_AIT_004_PV', '1_AIT_005_PV', '1_FIT_001_PV', '1_LT_001_PV', '1_MV_001_STATUS', '1_MV_002_STATUS', '1_MV_003_STATUS']
Last 10:  ['2B_AIT_004_PV', '3_AIT_001_PV', '3_AIT_002_PV', '3_AIT_003_PV', '3_AIT_004_PV', '3_AIT_005_PV', '3_FIT_001_PV', '3_LT_001_PV', 'LEAK_DIFF_PRESSURE', 'TOTAL_CONS_REQUIRED_FLOW']

SENSOR_COLS written to: work/wadi_A1/data/reference/sensor_cols.json
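
A downstream notebook might consume the reference file along the following lines. This is a sketch under assumptions: `load_sensor_cols` is a hypothetical helper, and it relies only on the `sensor_cols` and `n_sensors` keys written above.

```python
import json
import tempfile
from pathlib import Path

def load_sensor_cols(path: Path, available_columns) -> list[str]:
    """Hypothetical loader: read the frozen list and confirm every column is present."""
    ref = json.loads(path.read_text())
    cols = ref["sensor_cols"]
    if len(cols) != ref["n_sensors"]:
        raise ValueError("n_sensors does not match the length of sensor_cols")
    available = set(available_columns)
    missing = [c for c in cols if c not in available]
    if missing:
        raise ValueError(f"Columns missing from the data: {missing}")
    return cols

# Toy demonstration with a two-sensor reference file.
with tempfile.TemporaryDirectory() as tmp:
    ref_path = Path(tmp) / "sensor_cols.json"
    ref_path.write_text(json.dumps({
        "sensor_cols": ["1_AIT_001_PV", "1_FIT_001_PV"],
        "n_sensors": 2,
    }))
    frame_columns = ["timestamp", "label", "1_AIT_001_PV", "1_FIT_001_PV"]
    cols = load_sensor_cols(ref_path, frame_columns)

    # A frame that lacks one of the frozen columns is rejected.
    try:
        load_sensor_cols(ref_path, ["timestamp", "1_AIT_001_PV"])
        raised = False
    except ValueError:
        raised = True

print(cols, raised)
```

Failing loudly on a missing column keeps a stale reference file from silently shrinking the feature set.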


## 2.6 - Add Time Features  
Adds two time-derived columns:  
* `observation_day` - calendar date, used for canary checks and daily row counts
* `seconds_since_start` - ordinal position from dataset start, captures startup vs steady-state behavior without encoding clock time.


In [13]:
# Add time features 
df["observation_day"] = df["timestamp"].dt.date

t0 = df["timestamp"].min()
df["seconds_since_start"] = (
    (df["timestamp"] - t0).dt.total_seconds().astype("float32")
)

print(f"observation_day range:     {df['observation_day'].min()} → {df['observation_day'].max()}")
print(f"seconds_since_start range: {df['seconds_since_start'].min():.0f} → {df['seconds_since_start'].max():.0f}")
print(f"  ({df['seconds_since_start'].max() / 86400:.1f} days)")

observation_day range:     2017-09-25 → 2017-10-11
seconds_since_start range: 0 → 1382400
  (16.0 days)
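
The cast of `seconds_since_start` to float32 is safe here because float32 represents every integer up to 2^24 = 16,777,216 exactly, and 16 days of whole-second offsets top out at 1,382,400. A short check of that arithmetic, assuming the offsets are whole seconds as the printed range suggests:

```python
import numpy as np

max_seconds = 16 * 86_400   # 16 days of 1-second samples = 1,382,400
exact_limit = 2 ** 24       # float32 holds every integer up to this value exactly

as_f32 = np.float32(max_seconds)
fits_exactly = int(as_f32) == max_seconds and max_seconds < exact_limit

# Past the limit, consecutive whole seconds start to collide.
collides = bool(np.float32(exact_limit + 1) == np.float32(exact_limit))

print(fits_exactly, collides)
```

At one sample per second the limit corresponds to roughly 194 days, so a much longer recording would need float64 or an integer type.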


## 2.7 - Train/Test Split

Normal operation rows (label=0) are assigned to train.
Attack period rows (label=1) are assigned to test.

This matches the standard WaDi evaluation protocol used in the literature:
train on normal behavior, evaluate on the attack period. It also avoids
the temporal distribution shift that arises from splitting within the
normal operation period.

In [14]:
# Assign split: normal -> train, attack -> test
df["split"] = df["label"].map({0: "train", 1: "test"})

print("Split distribution:")
for split in ["train", "test"]:
    n = (df["split"] == split).sum()
    print(f"  {split:<6}: {n:>9,}")

print("\nSplit distribution by class:")
for label_val, label_name in [(0, "normal"), (1, "attack")]:
    counts = df[df["label"] == label_val]["split"].value_counts()
    print(f"\n  {label_name} (label={label_val}):")
    for split, n in counts.items():
        print(f"    {split:<6}: {n:>9,}")

Split distribution:
  train : 1,209,601
  test  :   172,801

Split distribution by class:

  normal (label=0):
    train : 1,209,601

  attack (label=1):
    test  :   172,801


## 2.8 - Validate Split  
Confirms the split assignment is correct before saving.  
Checks that every row has a split, that both splits exist, that normal rows appear only in train and attack rows only in test, and that no train timestamp is later than the earliest test timestamp.  

In [15]:
errors = []

# Check all rows assigned
n_unassigned = df["split"].isna().sum()
if n_unassigned > 0:
    errors.append(f"  {n_unassigned} rows have no split assignment")

# Check expected splits exist
for expected_split in ["train", "test"]:
    if expected_split not in df["split"].values:
        errors.append(f"  Split '{expected_split}' is missing entirely")

# Check normal rows only in train
normal_in_test = ((df["label"] == 0) & (df["split"] == "test")).sum()
if normal_in_test > 0:
    errors.append(f"  {normal_in_test} normal rows found in test split")

# Check attack rows only in test
attack_in_train = ((df["label"] == 1) & (df["split"] == "train")).sum()
if attack_in_train > 0:
    errors.append(f"  {attack_in_train} attack rows found in train split")

# Check temporal order: no train timestamp may be later than the earliest test timestamp
train_max = df[df["split"] == "train"]["timestamp"].max()
test_min  = df[df["split"] == "test"]["timestamp"].min()
if train_max > test_min:
    errors.append(f"  Train/test temporal boundary violated: "
                  f"train_max={train_max}, test_min={test_min}")

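# Report; stop the run on failure so a bad split is never written to disk
if errors:
    print("VALIDATION FAILED:")
    for e in errors:
        print(e)
    raise PipelineError("Split validation failed; see messages above.")
else:
    print("Split validation PASSED — all checks clean.")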

print(f"\nTemporal boundary:")
print(f"  Train: {df[df['split']=='train']['timestamp'].min().date()} → "
      f"{df[df['split']=='train']['timestamp'].max().date()}")
print(f"  Test:  {df[df['split']=='test']['timestamp'].min().date()} → "
      f"{df[df['split']=='test']['timestamp'].max().date()}")

Split validation PASSED — all checks clean.

Temporal boundary:
  Train: 2017-09-25 → 2017-10-09
  Test:  2017-10-09 → 2017-10-11


## 2.9 - Write Staged Parquet and Run Log  
Writes the staged dataset to disk and documents the pipeline run.  
The staged parquet is the input to the fault injection module.

In [16]:
# Final column ordering 
ordered_cols = ["timestamp", "observation_day", "seconds_since_start",
                "split", "label"] + SENSOR_COLS
df_staged = df[ordered_cols].copy()

# Write staged parquet 
staged_path = STAGED_DIR / f"wadi_staged_{RUN_ID}.parquet"
df_staged.to_parquet(staged_path, index=False)

size_mb = staged_path.stat().st_size / 1e6
print(f"Staged parquet written: {staged_path}")
print(f"Size:                   {size_mb:.1f} MB")
print(f"Shape:                  {df_staged.shape}")

# Write run log 
run_log = {
    "run_id":           RUN_ID,
    "created_at_utc":   utc_now_iso(),
    "stage":            "Notebook 1 — Ingest and Stage",
    "dataset":          DATASET_NAME,
    "source":           DATA_SOURCE,
    "download_date":    DOWNLOAD_DATE,
    "inputs": {
        "normal_csv":   str(RAW_DIR / "WADI_14days.csv"),
        "attack_csv":   str(RAW_DIR / "WADI_attackdata.csv"),
    },
    "outputs": {
        "staged_parquet":  str(staged_path),
        "sensor_cols_ref": str(sensor_cols_path),
    },
    "dataset_summary": {
        "total_rows":    len(df_staged),
        "total_cols":    len(df_staged.columns),
        "n_sensor_cols": len(SENSOR_COLS),
        "label_counts":  df_staged["label"].value_counts().sort_index().to_dict(),
        "split_counts":  df_staged["split"].value_counts().sort_index().to_dict(),
        "time_start":    str(df_staged["timestamp"].min()),
        "time_end":      str(df_staged["timestamp"].max()),
    },
    "columns_dropped": {
        "fully_null":  drop_log["fully_null"],
        "constant":    drop_log["constant"],
        "total":       drop_log["total_dropped"],
    },
    "split_strategy": "normal_operation=train, attack_period=test",
    
    "notes": [
        "dataset_id dropped — directly encodes label, would be leakage.",
        "observation_hour, dayofweek, is_weekend excluded — no physical relationship to faults or attacks.",
        "Split follows standard WaDi protocol: normal operation → train, attack period → test.",
        "Matches literature convention for direct comparability with prior WaDi work.",
        "Fault injection notebook reads this parquet and injects label=2 rows into train and test splits.",
    ],
}

run_log_path = RUN_DIR / f"run_{RUN_ID}.json"
write_json(run_log_path, run_log)
print(f"\nRun log written: {run_log_path}")

Staged parquet written: work/wadi_A1/data/staged/wadi_staged_20260223_140105_utc.parquet
Size:                   100.2 MB
Shape:                  (1382402, 103)

Run log written: work/wadi_A1/data/reference/pipeline_runs/run_20260223_140105_utc.json
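
Because the run ID uses a `%Y%m%d_%H%M%S` timestamp, staged file names sort chronologically as plain strings. A downstream notebook could use that to pick the most recent staged file. The helper below is a hypothetical sketch, and the second file name in the demo is invented:

```python
import tempfile
from pathlib import Path

def latest_staged_path(staged_dir: Path) -> Path:
    """Hypothetical helper: pick the staged parquet with the most recent run ID."""
    candidates = sorted(staged_dir.glob("wadi_staged_*.parquet"))
    if not candidates:
        raise FileNotFoundError(f"No staged parquet found in {staged_dir}")
    return candidates[-1]

# Toy demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    staged_dir = Path(tmp)
    for name in [
        "wadi_staged_20260223_140105_utc.parquet",
        "wadi_staged_20260101_090000_utc.parquet",  # invented earlier run
    ]:
        (staged_dir / name).touch()
    latest_name = latest_staged_path(staged_dir).name

print(latest_name)
```

In a real notebook the returned path would be passed to `pd.read_parquet`; pinning an explicit run ID instead is the safer choice when exact reproducibility matters.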


# Stage 3 - Pipeline Reflection
Documents key decisions, assumptions, and risks from this notebook.

In [17]:
reflection = [
    ("Row definition",
     "Each row represents one second of sensor readings from the WaDi water "
     "distribution testbed. Label 0=normal operation, 1=cyber attack (original "
     "WaDi labels). Label 2=injected sensor fault will be added by the fault "
     "injection notebook."),

    ("Columns dropped",
     "29 of 127 sensor columns removed before defining SENSOR_COLS: 4 columns "
     "were 100% NaN across the entire dataset (sensors never recorded), and 25 "
     "were constant-valued (no variation, zero discriminative power). All are "
     "documented in the run log and sensor_cols.json."),

    ("Time features",
     "Only observation_day and seconds_since_start are retained. Hour-of-day, "
     "day-of-week, and weekend flags were deliberately excluded — they have no "
     "physical relationship to sensor faults or cyber attacks in an ICS "
     "environment and would add spurious signal."),

    ("dataset_id dropped",
     "The dataset_id column (normal/attack) directly encodes the label and was "
     "dropped immediately after label assignment. Retaining it would be direct "
     "label leakage into the feature matrix."),

    ("Train/test split",
     "Normal operation rows (label=0) assigned to train; attack period rows "
     "(label=1) assigned to test. This follows the standard WaDi evaluation "
     "protocol used in the ICS anomaly detection literature and avoids temporal "
     "distribution shift. Train covers 2017-09-25 to 2017-10-09 (14 days); "
     "test covers 2017-10-09 to 2017-10-11 (2 days). No validation split is "
     "used at this stage — cross-validation is handled in the modeling notebook."),

    ("Known limitation",
     "All attack rows come from the same two-day window (Oct 9–11) and are "
     "assigned to test; train contains no attack rows. Evaluation therefore "
     "reflects system conditions from a single short period, which may not "
     "generalize to other operating conditions. This is documented as a "
     "limitation of the WaDi dataset structure, not of the methodology."),

    ("Leakage risks",
     "Normalization stats must be fit on train split only — deferred to feature "
     "engineering notebook. Rolling window features must use only past "
     "observations — deferred to feature engineering notebook. Fault injection "
     "must not use information from attack labels — enforced in injection notebook."),

    ("Next step",
     "Fault injection notebook reads wadi_staged_*.parquet and injects synthetic "
     "sensor failures into normal rows within the train split only, producing a "
     "three-class dataset with label 2=fault. The curate/validate notebook then "
     "picks up from the injected parquet."),
]

print("Pipeline Reflection")
print("=" * 60)
for title, content in reflection:
    print(f"\n[{title}]")
    print(f"  {content}")

Pipeline Reflection

[Row definition]
  Each row represents one second of sensor readings from the WaDi water distribution testbed. Label 0=normal operation, 1=cyber attack (original WaDi labels). Label 2=injected sensor fault will be added by the fault injection notebook.

[Columns dropped]
  29 of 127 sensor columns removed before defining SENSOR_COLS: 4 columns were 100% NaN across the entire dataset (sensors never recorded), and 25 were constant-valued (no variation, zero discriminative power). All are documented in the run log and sensor_cols.json.

[Time features]
  Only observation_day and seconds_since_start are retained. Hour-of-day, day-of-week, and weekend flags were deliberately excluded — they have no physical relationship to sensor faults or cyber attacks in an ICS environment and would add spurious signal.

[dataset_id dropped]
  The dataset_id column (normal/attack) directly encodes the label and was dropped immediately after label assignment. Retaining it would be dire

# This Continues with WaDi A2 Pipeline Notebook 2 - Fault Injection