# Data Pipeline Template  
A repeatable pipeline for building analysis-ready datasets with validation.

## Dataset Overview  

**Describe the dataset here:**
* What does a row represent?
* What are the key variables?
* What makes this data messy/realistic?
* What is the time range?

## What this Pipeline Produces

**Artifacts**
* `data/raw/` - raw data snapshots + metadata
* `data/staged/` - parsed/normalized table (typed, missingness normalized)
* `data/warehouse/` - curated table (Parquet; optionally partitioned)
* `data/reference/validation_report.json` - contracts + anomaly rates + canaries
* `data/reference/pipeline_runs/` - run logs for reproducibility

# 0. Setup

## 0.1 Directory Structure

Create the project directories for each pipeline layer (raw, staged, warehouse, reference, and run logs).

In [2]:
# Project structure setup

# Import Libraries
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timedelta, timezone
import json
import hashlib
import math

import numpy as np
import pandas as pd

from IPython.display import display

pd.set_option("display.max_columns", 180)
pd.set_option("display.width", 180)

# Define Paths
WORK_DIR = Path("work")
PROJECT_DIR = WORK_DIR / "hai_21_03"
DATA_DIR = PROJECT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
STAGED_DIR = DATA_DIR / "staged"
WH_DIR = DATA_DIR / "warehouse"
REF_DIR = DATA_DIR / "reference"
RUN_DIR = REF_DIR / "pipeline_runs"

# Create Directories
for p in [RAW_DIR, STAGED_DIR, WH_DIR, REF_DIR, RUN_DIR]:
    p.mkdir(parents=True, exist_ok=True)

# Output
print("Project:", PROJECT_DIR)
print("Raw:", RAW_DIR)
print("Staged:", STAGED_DIR)
print("Warehouse:", WH_DIR)
print("Reference:", REF_DIR)
print("Runs:", RUN_DIR)


Project: work/hai_21_03
Raw: work/hai_21_03/data/raw
Staged: work/hai_21_03/data/staged
Warehouse: work/hai_21_03/data/warehouse
Reference: work/hai_21_03/data/reference
Runs: work/hai_21_03/data/reference/pipeline_runs


## 0.2 Helper Utilities

Define reusable helper functions for the pipeline.

In [3]:
# Helper Functions
class PipelineError(RuntimeError):
    pass

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def sha16(x: str) -> str:
    return hashlib.sha256(x.encode('utf-8')).hexdigest()[:16]

def write_json(path: Path, obj: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, default=str))

def read_json(path: Path) -> dict:
    return json.loads(path.read_text())

def require_columns(df: pd.DataFrame, cols: list[str], context: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise PipelineError(f'[{context}] Missing required columns: {missing}')

def require_unique(df: pd.DataFrame, key: str, context: str) -> None:
    if key not in df.columns:
        raise PipelineError(f'[{context}] Missing key column "{key}"')
    dupes = int(df[key].duplicated().sum())
    if dupes:
        raise PipelineError(f'[{context}] Key "{key}" has {dupes} duplicates')

print("Helpers ready.")

Helpers ready.


## 0.3 Configuration

Set pipeline configuration constants.

In [16]:
# Pipeline Configuration

DATASET_NAME = "HAI-21.03"
DATASET_VERSION = "21.03"
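# NOTE: RUN_ID changes every time this cell is executed. Run it once per pipeline run;
# re-running it mid-pipeline makes later stages read and write files under a new ID.
RUN_ID = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_utc")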

TRAIN_FILES = ["train1.csv.gz", "train2.csv.gz", "train3.csv.gz"]
TEST_FILES = ["test1.csv.gz", "test2.csv.gz", "test3.csv.gz", "test4.csv.gz", "test5.csv.gz"]
ALL_FILES = TRAIN_FILES + TEST_FILES

TIMESTAMP_COL = "timestamp"
ATTACK_LABEL_COLS = ["attack", "attack_P1", "attack_P2", "attack_P3"]

---

# 1. Ingest: Acquire Raw Data

Download or load the raw data and save it with metadata.

## 1.1 Fetch Raw Data

Download the raw dataset files, skipping any file already saved for the current run.

In [10]:
# Fetch Dataset

import requests
import shutil

BASE_URL = "https://raw.githubusercontent.com/icsdataset/hai/master/hai-21.03/"

raw_file_meta = {}

for file_name in ALL_FILES:
    url = BASE_URL + file_name
    dest = RAW_DIR / f"hai_{file_name.replace('.csv.gz', '')}_run_{RUN_ID}.csv.gz"

    if dest.exists():
        print(f"  [SKIP] {file_name} already exists")
    else:
        print(f"  [DOWNLOAD] {file_name} ...")
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(dest, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        print(f"  [DONE] saved to {dest}")

    raw_file_meta[file_name] = {
        "source_url": url,
        "local_path": str(dest),
        "size_bytes": dest.stat().st_size,
        "format": "csv.gz",
        "decompression": "handled automatically by pandas at read time",
        "split": "train" if file_name in TRAIN_FILES else "test"
    }

  [DOWNLOAD] train1.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_train1_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] train2.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_train2_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] train3.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_train3_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] test1.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_test1_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] test2.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_test2_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] test3.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_test3_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] test4.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_test4_run_20260213_042316_utc.csv.gz
  [DOWNLOAD] test5.csv.gz ...
  [DONE] saved to work/hai_21_03/data/raw/hai_test5_run_20260213_042316_utc.csv.gz


## 1.2 Write Raw Metadata

Document the raw data source and retrieval details.

In [11]:
# Create Raw Metadata

ingest_meta = {
    "run_id": RUN_ID,
    "dataset": DATASET_NAME,
    "version": DATASET_VERSION,
    "fetched_at_utc": datetime.now(timezone.utc).isoformat(),
    "source": "https://github.com/icsdataset/hai",
    "base_url": BASE_URL,
    "files": raw_file_meta,   
    "summary": {
        "total_files": len(ALL_FILES),
        "train_files": len(TRAIN_FILES),
        "test_files": len(TEST_FILES),
        "total_size_bytes": sum(v["size_bytes"] for v in raw_file_meta.values())
    }
}

meta_path = RAW_DIR / f"hai_ingest_meta_run_{RUN_ID}.json"
write_json(meta_path, ingest_meta)

print(f"Saved: {meta_path}")
print(f"Total files: {ingest_meta['summary']['total_files']}")
print(f"Total size: {ingest_meta['summary']['total_size_bytes'] / 1e6:.1f} MB")

Saved: work/hai_21_03/data/raw/hai_ingest_meta_run_20260213_042316_utc.json
Total files: 8
Total size: 187.2 MB


---

# 2. Stage: Parse and Normalize

Convert raw data into a clean, typed DataFrame.

## 2.1 Read Raw Data

**How do we read it?**

In [12]:
# Read raw data

def read_raw_data(path: Path, split: str) -> pd.DataFrame:
    df = pd.read_csv(path, compression='gzip')
    df["file_source"] = path.stem.replace(".csv", "")
    df["split"] = split
    return df

In [13]:
# Load and concatenate all 8 files

dfs = []
for file_name in TRAIN_FILES:
    path = RAW_DIR / f"hai_{file_name.replace('.csv.gz', '')}_run_{RUN_ID}.csv.gz"
    dfs.append(read_raw_data(path, split="train"))

for file_name in TEST_FILES:
    path = RAW_DIR / f"hai_{file_name.replace('.csv.gz', '')}_run_{RUN_ID}.csv.gz"
    dfs.append(read_raw_data(path, split="test"))

df_raw = pd.concat(dfs, ignore_index=True)
print(f"Combined shape: {df_raw.shape}")
print(df_raw.dtypes.value_counts())
display(df_raw.head(3))

Combined shape: (1323608, 86)
float64    57
int64      26
object      3
Name: count, dtype: int64


Unnamed: 0,time,P1_B2004,P1_B2016,P1_B3004,P1_B3005,P1_B4002,P1_B4005,P1_B400B,P1_B4022,P1_FCV01D,P1_FCV01Z,P1_FCV02D,P1_FCV02Z,P1_FCV03D,P1_FCV03Z,P1_FT01,P1_FT01Z,P1_FT02,P1_FT02Z,P1_FT03,P1_FT03Z,P1_LCV01D,P1_LCV01Z,P1_LIT01,P1_PCV01D,P1_PCV01Z,P1_PCV02D,P1_PCV02Z,P1_PIT01,P1_PIT02,P1_PP01AD,P1_PP01AR,P1_PP01BD,P1_PP01BR,P1_PP02D,P1_PP02R,P1_STSP,P1_TIT01,P1_TIT02,P2_24Vdc,P2_ASD,P2_AutoGO,P2_CO_rpm,P2_Emerg,P2_HILout,P2_MSD,P2_ManualGO,P2_OnOff,P2_RTR,P2_SIT01,P2_SIT02,P2_TripEx,P2_VT01,P2_VTR01,P2_VTR02,P2_VTR03,P2_VTR04,P2_VXT02,P2_VXT03,P2_VYT02,P2_VYT03,P3_FIT01,P3_LCP01D,P3_LCV01D,P3_LH,P3_LIT01,P3_LL,P3_PIT01,P4_HT_FD,P4_HT_LD,P4_HT_PO,P4_HT_PS,P4_LD,P4_ST_FD,P4_ST_GOV,P4_ST_LD,P4_ST_PO,P4_ST_PS,P4_ST_PT01,P4_ST_TT01,attack,attack_P1,attack_P2,attack_P3,file_source,split
0,2020-07-11 00:00:00,0.10121,1.29784,397.63785,1001.99799,33.6555,100.0,2847.02539,37.14706,100.0,100.0,0.0,-1.87531,51.58201,52.80456,166.74039,808.2962,1973.19031,2847.02539,246.43968,1000.44769,8.79882,8.46252,395.19528,39.09198,40.49072,12.0,12.01782,1.3681,0.27786,540833,540833,0,0,1,1,1,35.437,35.74219,28.02645,0,1,54074.0,0,712.07275,763.19324,0,1,2880,780.0,779.59595,1,11.89504,10,10,10,10,-3.066,-1.2648,4.1758,6.0951,4795.0,10832.0,608.0,70,15454.0,20,815.0,-0.00072,0.06511,4.01474,0,301.01636,-0.00297,16495.0,301.35992,305.03113,0,10052.0,27610.0,0,0,0,0,hai_train1_run_20260213_042316_utc,train
1,2020-07-11 00:00:01,0.10121,1.29692,397.63785,1001.99799,33.6555,100.0,2839.5852,37.14477,100.0,100.0,0.0,-1.88294,51.60648,52.78931,168.64778,819.16809,1975.479,2839.5852,246.43968,1000.0127,8.78811,8.47015,395.1442,39.0568,40.49072,12.0,12.01782,1.3681,0.27634,540833,540833,0,0,1,1,1,35.45227,35.74219,28.02473,0,1,54089.0,0,708.52661,763.19324,0,1,2880,781.0,780.67328,1,11.93421,10,10,10,10,-2.9721,-1.3147,3.9259,5.9262,4835.0,10984.0,528.0,70,15461.0,20,883.0,-0.00051,0.0434,3.74347,0,297.43567,0.00072,16402.0,297.43567,304.27161,0,10052.0,27610.0,0,0,0,0,hai_train1_run_20260213_042316_utc,train
2,2020-07-11 00:00:02,0.10121,1.29631,397.63785,1001.99799,33.6555,100.0,2833.26807,37.14325,100.0,100.0,0.0,-1.88294,51.5779,52.79694,168.83849,823.51697,1972.42725,2833.26807,246.05821,1000.88245,8.81787,8.47015,395.1442,38.97124,40.49835,12.0,12.01782,1.36734,0.27634,540833,540833,0,0,1,1,1,35.45227,35.74219,28.02817,0,1,54124.0,0,709.15527,763.19324,0,1,2880,780.0,780.06574,1,11.9703,10,10,10,10,-2.9857,-1.4032,3.6489,5.8101,4961.0,11120.0,464.0,70,15462.0,20,956.0,-0.00043,0.0434,3.43603,0,298.84619,-0.00145,16379.0,298.66534,303.89179,0,10050.0,27617.0,0,0,0,0,hai_train1_run_20260213_042316_utc,train


## 2.2 Create Staged DataFrame

**How do we transform it?**

In [17]:
# Create staged DataFrame

def staged_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Parse timestamp
    df["timestamp"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M:%S", utc=True)
    df = df.drop(columns=["time"])

    # 2. Sort by file_source then timestamp (preserve time order within each file)
    df = df.sort_values(["file_source", "timestamp"]).reset_index(drop=True)

    # 3. Derive observation_day (for partitioning and canary checks later)
    df["observation_day"] = df["timestamp"].dt.date

    # 4. Normalize attack label cols to int8
    for col in ATTACK_LABEL_COLS:
        df[col] = df[col].astype("int8")

    # 5. Create unified attack label (1 if ANY subsystem under attack)
    df["is_attack"] = (df[ATTACK_LABEL_COLS].any(axis=1)).astype("int8")

    # 6. Cast sensor columns to float32
    sensor_cols = [c for c in df.columns
                    if c not in ATTACK_LABEL_COLS + 
                    ["timestamp", "observation_day", "file_source", "split", "is_attack"]]
    df[sensor_cols] = df[sensor_cols].astype("float32")

    return df

## 2.3 Test Staging Function

**Does it work?**

In [18]:
# Test staged function

df_staged = staged_dataframe(df_raw)
print(f"Staged_shape: {df_staged.shape}")
print(df_staged.dtypes.value_counts())
display(df_staged.head(3))

Staged_shape: (1323608, 88)
float32                79
int8                    5
object                  3
datetime64[ns, UTC]     1
Name: count, dtype: int64


Unnamed: 0,P1_B2004,P1_B2016,P1_B3004,P1_B3005,P1_B4002,P1_B4005,P1_B400B,P1_B4022,P1_FCV01D,P1_FCV01Z,P1_FCV02D,P1_FCV02Z,P1_FCV03D,P1_FCV03Z,P1_FT01,P1_FT01Z,P1_FT02,P1_FT02Z,P1_FT03,P1_FT03Z,P1_LCV01D,P1_LCV01Z,P1_LIT01,P1_PCV01D,P1_PCV01Z,P1_PCV02D,P1_PCV02Z,P1_PIT01,P1_PIT02,P1_PP01AD,P1_PP01AR,P1_PP01BD,P1_PP01BR,P1_PP02D,P1_PP02R,P1_STSP,P1_TIT01,P1_TIT02,P2_24Vdc,P2_ASD,P2_AutoGO,P2_CO_rpm,P2_Emerg,P2_HILout,P2_MSD,P2_ManualGO,P2_OnOff,P2_RTR,P2_SIT01,P2_SIT02,P2_TripEx,P2_VT01,P2_VTR01,P2_VTR02,P2_VTR03,P2_VTR04,P2_VXT02,P2_VXT03,P2_VYT02,P2_VYT03,P3_FIT01,P3_LCP01D,P3_LCV01D,P3_LH,P3_LIT01,P3_LL,P3_PIT01,P4_HT_FD,P4_HT_LD,P4_HT_PO,P4_HT_PS,P4_LD,P4_ST_FD,P4_ST_GOV,P4_ST_LD,P4_ST_PO,P4_ST_PS,P4_ST_PT01,P4_ST_TT01,attack,attack_P1,attack_P2,attack_P3,file_source,split,timestamp,observation_day,is_attack
0,0.10178,1.58771,403.788544,985.373535,32.595268,100.0,2839.585205,36.810101,100.0,99.916077,0.0,-1.86768,50.907261,51.950069,176.086426,845.695496,1978.721558,2843.375488,243.388016,989.141174,10.8929,10.8429,402.709473,40.741249,41.32233,12.0,12.26196,1.34293,0.27557,540833.0,540833.0,0.0,0.0,1.0,1.0,1.0,34.887699,35.147099,28.03162,0.0,1.0,54116.0,0.0,725.213623,763.193237,0.0,1.0,2880.0,790.0,789.765076,1.0,11.9104,10.0,10.0,10.0,10.0,-2.8687,-1.0189,3.7751,5.633,-25.0,688.0,15888.0,70.0,18082.0,20.0,-23.0,0.00029,76.801208,73.585808,0.0,464.066101,0.0047,20469.0,386.266663,380.316833,0.0,10044.0,27567.0,0,0,0,0,hai_test1_run_20260213_042316_utc,test,2020-07-07 15:00:00+00:00,2020-07-07,0
1,0.10178,1.58725,403.788544,985.373535,32.595268,100.0,2843.375488,36.808949,100.0,99.916077,0.0,-1.86768,50.746071,51.965328,173.797562,840.477051,1986.923218,2845.060059,243.006561,992.620178,10.80512,10.8429,402.811737,40.86124,41.32233,12.0,12.26196,1.34216,0.2771,540833.0,540833.0,0.0,0.0,1.0,1.0,1.0,34.887699,35.147099,28.02301,0.0,1.0,54114.0,0.0,721.740723,763.193237,0.0,1.0,2880.0,789.0,789.13147,1.0,11.98856,10.0,10.0,10.0,10.0,-2.9842,-1.2637,3.1689,5.4158,-25.0,648.0,15952.0,70.0,18043.0,20.0,-23.0,0.00051,76.924187,73.89325,0.0,464.228882,0.0021,20489.0,386.302856,380.027466,0.0,10040.0,27564.0,0,0,0,0,hai_test1_run_20260213_042316_utc,test,2020-07-07 15:00:01+00:00,2020-07-07,0
2,0.10178,1.59519,403.788544,985.373535,32.595268,100.0,2845.060059,36.828789,100.0,99.916077,0.0,-1.86768,50.662289,51.965328,174.560516,835.258423,1978.721558,2837.339111,242.815857,993.924683,10.80029,10.8429,402.76062,41.02906,41.32233,12.0,12.26196,1.34369,0.2771,540833.0,540833.0,0.0,0.0,1.0,1.0,1.0,34.887699,35.147099,28.02993,0.0,1.0,54082.0,0.0,718.157959,763.193237,0.0,1.0,2880.0,786.0,785.816528,1.0,11.974,10.0,10.0,10.0,10.0,-3.4939,-1.5398,2.9615,5.5532,-25.0,616.0,16000.0,70.0,18024.0,20.0,-23.0,0.00022,77.04715,74.200684,0.0,466.905334,0.0013,20604.0,389.738831,381.528503,0.0,10037.0,27565.0,0,0,0,0,hai_test1_run_20260213_042316_utc,test,2020-07-07 15:00:02+00:00,2020-07-07,0


## 2.4 Write Staged Outputs

**How do we save it?**

In [19]:
# Write staged outputs


# Save staged Parquet
staged_path = STAGED_DIR / f"hai_staged_run_{RUN_ID}.parquet"
df_staged.to_parquet(staged_path, index=False)
print(f"Saved staged: {staged_path}")
print(f"File size: {staged_path.stat().st_size / 1e6:.1f} MB")

# Write staged metadata
staged_meta = {
    "run_id": RUN_ID,
    "dataset": DATASET_NAME,
    "staged_at_utc": datetime.now(timezone.utc).isoformat(),
    "source_files": len(ALL_FILES),
    "staged_path": str(staged_path),
    "shape": {
        "rows": len(df_staged),
        "cols": len(df_staged.columns)
    },
    "splits": df_staged["split"].value_counts().to_dict(),
    "is_attack_counts": df_staged["is_attack"].value_counts().to_dict(),
    "dtypes": df_staged.dtypes.astype(str).to_dict(),
    "columns": df_staged.columns.tolist()
}

staged_meta_path = STAGED_DIR / f"hai_staged_meta_run_{RUN_ID}.json"
write_json(staged_meta_path, staged_meta)
print(f"Saved metadata: {staged_meta_path}")

Saved staged: work/hai_21_03/data/staged/hai_staged_run_20260213_044331_utc.parquet
File size: 119.9 MB
Saved metadata: work/hai_21_03/data/staged/hai_staged_meta_run_20260213_044331_utc.json


---

# 3. Curate: Analysis-Ready Features

Add engineered features and data quality flags.

## 3.1 Create Curated DataFrame

In [14]:
# Create curated DataFrame

# >>> TODO: Define curate_data(df_staged) function
# Time Features:
# - observation_day (date only)
# - observation_hour (0-23)
# - dayofweek (0=Monday, 6=Sunday)
# - is_weekend (binary flag)
#
# Domain Specific:
# - Add flags (e.g., high_value, anomaly_indicator)
# - Add derived metrics (e.g., deltas, ratios, rolling averages)
# - Add categorical encodings if needed
#
# Return curated DataFrame
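
The time features listed above can be sketched as follows. This is a minimal illustration, not the notebook's final implementation: it assumes the staged frame has a timezone-aware `timestamp` column (as produced in section 2.2), and the demo frame with `sensor_a` is made-up data. Domain-specific flags and derived metrics are deliberately left out.

```python
import pandas as pd


def curate_data(df_staged: pd.DataFrame, ts_col: str = "timestamp") -> pd.DataFrame:
    """Return a copy of the staged frame with calendar features added."""
    df = df_staged.copy()
    ts = df[ts_col]

    df["observation_day"] = ts.dt.date                      # date only
    df["observation_hour"] = ts.dt.hour.astype("int8")      # 0-23
    df["dayofweek"] = ts.dt.dayofweek.astype("int8")        # 0=Monday, 6=Sunday
    df["is_weekend"] = (df["dayofweek"] >= 5).astype("int8")  # binary flag
    return df


# Tiny illustrative frame (hypothetical values, not HAI data)
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-07-11 00:00:00", "2020-07-13 15:30:00"], utc=True
    ),
    "sensor_a": [1.0, 2.0],
})
curated_demo = curate_data(demo)
print(curated_demo[["observation_hour", "dayofweek", "is_weekend"]])
```

The function copies its input so the staged frame stays unchanged, which keeps the stage and curate layers independently re-runnable.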


## 3.2 Test Curation Function

In [15]:
# Test Curation Function

# >>> TODO: 
# Run curation
# Display results
# Verify features

## 3.3 Choose Curated Columns and Write

In [16]:
# Choose Curated Columns
# >>> TODO: Select curated data
# 1. Define curated_columns list
# - Identifiers
# - Timestamps
# - Core measurements
# - Engineered features
# - Flags
#
# 2. Select columns: df_final = df_curated[curated_columns]

# Write to warehouse as Parquet

# Optional - Write partitioned by date
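
One possible shape for the selection and write steps is sketched below. The helper names `select_curated` and `write_warehouse` are hypothetical, the demo columns are made up, and the Parquet write assumes a Parquet engine such as `pyarrow` is installed. The demo only exercises the column selection; the write helper is defined but not called here.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def select_curated(df: pd.DataFrame, curated_columns: list) -> pd.DataFrame:
    """Keep only the curated columns, failing loudly if any is absent."""
    missing = [c for c in curated_columns if c not in df.columns]
    if missing:
        raise KeyError(f"Curated columns not found: {missing}")
    return df[curated_columns].copy()


def write_warehouse(df: pd.DataFrame, out_dir: Path, name: str,
                    partition_col: Optional[str] = None) -> Path:
    """Write a single Parquet file, or a partitioned dataset directory."""
    out_dir.mkdir(parents=True, exist_ok=True)
    if partition_col is None:
        target = out_dir / f"{name}.parquet"
        df.to_parquet(target, index=False)
    else:
        target = out_dir / name
        # Cast the partition key to string so directory names are predictable.
        keyed = df.assign(**{partition_col: df[partition_col].astype(str)})
        keyed.to_parquet(target, index=False, partition_cols=[partition_col])
    return target


# Hypothetical demo frame; "scratch" stands in for a column we do not keep.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2020-07-11 00:00:00"], utc=True),
    "sensor_a": [1.0],
    "scratch": [0],
    "is_attack": [0],
})
curated_columns = ["timestamp", "sensor_a", "is_attack"]
df_final = select_curated(demo, curated_columns)
print(df_final.columns.tolist())
```

In the notebook itself, a call such as `write_warehouse(df_final, WH_DIR, f"hai_curated_run_{RUN_ID}", partition_col="observation_day")` would be one way to produce the optional date-partitioned layout, provided `observation_day` is among the selected columns.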

---

# 4. Validate: Contracts + Anomalies + Canaries

Define validation rules and check data quality.

## 4.1 Define Contracts

**Define Rules**

In [17]:
# Define contracts

# >>> TODO: 
# Required columns (must exist and have data):
# required_cols = ['id', 'timestamp', 'key_metric_1', ...]

# Optional columns (monitor but do not fail):
# optional_cols = ['optional_metric1', ...]
 
# Plausible ranges (for validation)
# range_checks = {
#     'metric_1': (min_val, max_val),
#     'metric_2': (min_val, max_val),
#     ...
# }
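
As a starting point, the contract could be filled in with the column names this notebook already defines. The sketch below is an assumption about what a reasonable contract looks like, not a definitive one: it only range-checks the attack labels (treated here as binary 0/1 flags) and leaves sensor ranges out, because those require domain knowledge that this notebook does not provide.

```python
# Names below mirror the configuration and staging cells of this notebook.
TIMESTAMP_COL = "timestamp"
ATTACK_LABEL_COLS = ["attack", "attack_P1", "attack_P2", "attack_P3"]

# Required: validation should fail if any of these is absent or empty.
required_cols = [TIMESTAMP_COL, "file_source", "split", "is_attack", *ATTACK_LABEL_COLS]

# Optional: monitored, but a problem here should only produce a warning.
optional_cols = ["observation_day"]

# Plausible ranges as inclusive (min, max) pairs. Only the label columns are
# listed; sensor ranges are intentionally omitted (assumption: add them once
# plausible physical limits are known).
range_checks = {col: (0, 1) for col in [*ATTACK_LABEL_COLS, "is_attack"]}

print(required_cols)
print(range_checks)
```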


## 4.2 Implement Validation Checks

**Implement Checker**

In [18]:
# Implement validation checks

# >>> TODO: Define validate_data(df, required_cols, optional_cols, range_checks)
# Check 1. Required columns exist
# Check 2. Required columns have data (< 99% missing)
# Check 3. Range violations (< 5% out of range)
# Check 4. Uniqueness (if applicable)
# Check 5. No future timestamps (if applicable)

# Return validation report with:
# - passed (True/False)
# - failures (list)
# - warnings (list)
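
The five checks above could be implemented along these lines. This is a sketch under stated assumptions: the extra parameters `key_col`, `ts_col`, `max_missing`, and `max_out_of_range` are not part of the template signature, the timestamp column is assumed to be timezone-aware UTC, and the demo frame uses placeholder metric names and made-up values.

```python
import pandas as pd


def validate_data(df, required_cols, optional_cols, range_checks,
                  key_col=None, ts_col=None,
                  max_missing=0.99, max_out_of_range=0.05) -> dict:
    failures, warnings = [], []

    # Check 1: required columns exist
    missing_cols = [c for c in required_cols if c not in df.columns]
    if missing_cols:
        failures.append(f"Missing required columns: {missing_cols}")

    # Check 2: required columns have data (fail at >= 99% missing)
    for col in required_cols:
        if col in df.columns:
            rate = float(df[col].isna().mean())
            if rate >= max_missing:
                failures.append(f"{col}: {rate:.1%} missing")

    # Optional columns are monitored but never fail the run
    for col in optional_cols:
        if col not in df.columns:
            warnings.append(f"Optional column absent: {col}")

    # Check 3: range violations (fail at >= 5% of non-null values out of range)
    for col, (lo, hi) in range_checks.items():
        if col not in df.columns:
            continue
        values = df[col].dropna()
        if values.empty:
            continue
        rate = float((~values.between(lo, hi)).mean())
        if rate >= max_out_of_range:
            failures.append(f"{col}: {rate:.1%} outside [{lo}, {hi}]")
        elif rate > 0:
            warnings.append(f"{col}: {rate:.1%} outside [{lo}, {hi}]")

    # Check 4: uniqueness of the key column, if one is given
    if key_col is not None and key_col in df.columns:
        dupes = int(df[key_col].duplicated().sum())
        if dupes:
            failures.append(f"{key_col}: {dupes} duplicate keys")

    # Check 5: no timestamps in the future, if a timestamp column is given
    if ts_col is not None and ts_col in df.columns:
        future = int((df[ts_col] > pd.Timestamp.now(tz="UTC")).sum())
        if future:
            failures.append(f"{ts_col}: {future} future timestamps")

    return {"passed": not failures, "failures": failures, "warnings": warnings}


# Hypothetical demo: metric_2 is required but absent, and one metric_1 value is out of range.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-07-11 00:00:00", "2020-07-11 00:00:01", "2020-07-11 00:00:02"], utc=True
    ),
    "metric_1": [0.5, 0.7, 250.0],
})
report = validate_data(
    demo,
    required_cols=["timestamp", "metric_1", "metric_2"],
    optional_cols=[],
    range_checks={"metric_1": (0.0, 100.0)},
    key_col="timestamp",
    ts_col="timestamp",
)
print(report)
```

Returning a plain dict rather than raising keeps the report serializable, so it can be written to `validation_report.json` later even when the run fails.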

## 4.3 Run Validation

**Run Checker**

In [19]:
# Validation

# >>> TODO: Call validate_data()
# print results

## 4.4 Anomaly Flags and Investigation

**Define Anomalies**

In [20]:
# Anomalies

# >>> TODO: Define create_anomaly_summary(df, range_checks)
# Calculate (but do not add to df)
# - Value anomalies (out of range counts)
# - Missingness rates by column
# - Suspicious row count

# Return summary dict
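
A minimal sketch of the anomaly summary follows. It assumes that a "suspicious row" means a row with at least one non-null value outside its configured range; the template does not define the term, so treat that as one possible interpretation. The demo columns and values are made up.

```python
import pandas as pd


def create_anomaly_summary(df: pd.DataFrame, range_checks: dict) -> dict:
    """Summarize anomalies without adding any columns to df."""
    out_of_range = {}
    suspicious = pd.Series(False, index=df.index)

    for col, (lo, hi) in range_checks.items():
        if col not in df.columns:
            continue
        # Missing values are counted under missingness, not as range violations.
        bad = df[col].notna() & ~df[col].between(lo, hi)
        out_of_range[col] = int(bad.sum())
        suspicious = suspicious | bad

    return {
        "n_rows": int(len(df)),
        "out_of_range_counts": out_of_range,
        "missingness_rate": {c: float(r) for c, r in df.isna().mean().items()},
        "suspicious_rows": int(suspicious.sum()),
    }


# Hypothetical demo frame
demo = pd.DataFrame({
    "metric_1": [0.5, None, 250.0],
    "metric_2": [1.0, -3.0, 2.0],
})
anomaly_summary = create_anomaly_summary(
    demo, {"metric_1": (0.0, 100.0), "metric_2": (0.0, 10.0)}
)
print(anomaly_summary)
```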

## 4.5 Run Anomaly Analysis

In [21]:
# Anomaly Analysis

# >>> TODO: Call create_anomaly_summary()
# print results

## 4.6 Canary Checks

**Define Canaries**

In [22]:
# Define Canaries

# >>> TODO: Define create_canary_summary(df)
# Group by observation_day and check:
# - Observations per day (min/median/max)
# - Drops (days with < 50% of median)
# - Overall missingness by column
# - Worst day missingness per column
# - Days with high missingness (> 30%)

# Return canary summary dict
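
The per-day canaries above might be computed as in the sketch below. The thresholds are exposed as parameters (`drop_frac`, `high_missing`) that are not part of the template signature, and the demo data is invented to trigger one drop day and one high-missingness day.

```python
from datetime import date

import pandas as pd


def create_canary_summary(df: pd.DataFrame, day_col: str = "observation_day",
                          drop_frac: float = 0.5, high_missing: float = 0.30) -> dict:
    # Observations per day, and days that fall below a fraction of the median
    per_day = df.groupby(day_col).size()
    median = float(per_day.median())
    drops = per_day[per_day < drop_frac * median]

    # Missingness overall and per day for every column except the day key
    value_cols = [c for c in df.columns if c != day_col]
    is_missing = df[value_cols].isna()
    daily_missing = is_missing.groupby(df[day_col]).mean()
    high_days = daily_missing[(daily_missing > high_missing).any(axis=1)]

    return {
        "obs_per_day": {
            "min": int(per_day.min()),
            "median": median,
            "max": int(per_day.max()),
        },
        "drop_days": [str(d) for d in drops.index],
        "overall_missingness": {c: float(v) for c, v in is_missing.mean().items()},
        "worst_day_missingness": {c: float(v) for c, v in daily_missing.max().items()},
        "high_missing_days": [str(d) for d in high_days.index],
    }


# Hypothetical demo: day 3 has too few rows, day 2 has 50% missing values.
demo = pd.DataFrame({
    "observation_day": [date(2020, 7, 11)] * 4 + [date(2020, 7, 12)] * 4 + [date(2020, 7, 13)],
    "metric_1": [1.0, 2.0, 3.0, 4.0, None, None, 3.0, 4.0, 5.0],
})
canary_summary = create_canary_summary(demo)
print(canary_summary)
```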

## 4.7 Run Canary Analysis

In [23]:
# Canary Analysis
# >>> TODO: Call create_canary_summary()
# print results

---

# 5. Leakage Audit (Conceptual)

Document potential temporal leakage risks.

## 5.1 Write Leakage Checklist

In [1]:
# Leakage Checklist

# >>> TODO: Create leakage checklist 
# leakage_checklist = [
#     "If building rolling features, are they computed using only past data?",
#     "If you standardize/normalize, are stats computed on TRAIN only?",
#     "If you impute missing values, does the method avoid future observations?",
#     "Do any daily aggregates include data that is unavailable at prediction time?",
#     "Is prediction time defined clearly?",
#     # ... add more as needed
# ]
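
Two of the checklist items can be made concrete with a short example. The sketch below shows one way to build a rolling feature from past rows only (shift by one step before rolling, within each source file) and one way to standardize with statistics taken from the training split alone. The column names `sensor`, `sensor_roll_mean_past`, and `sensor_z` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical demo: file "a" is the train split, file "b" is the test split.
demo = pd.DataFrame({
    "file_source": ["a"] * 4 + ["b"] * 3,
    "split": ["train"] * 4 + ["test"] * 3,
    "sensor": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0],
})

# Rolling mean over the two PREVIOUS rows only: shift(1) removes the current
# row from its own window, and the groupby stops windows crossing file boundaries.
demo["sensor_roll_mean_past"] = demo.groupby("file_source")["sensor"].transform(
    lambda s: s.shift(1).rolling(window=2, min_periods=1).mean()
)

# Standardize with TRAIN statistics only, then apply them to every row.
train_mask = demo["split"] == "train"
mu = demo.loc[train_mask, "sensor"].mean()
sigma = demo.loc[train_mask, "sensor"].std()
demo["sensor_z"] = (demo["sensor"] - mu) / sigma

print(demo)
```

The first row of each file has no past data, so its rolling feature is NaN rather than a value borrowed from the current or a neighboring file.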

---

# 6. Write Final Artifacts

Consolidate validation results and create the run log.

## 6.1 Write Validation Report

In [2]:
# Validation Report

# >>> TODO: Define write_validation_report(...)
# Consolidate: 
# - Contracts (validation_report)
# - Anomalies (anomaly_summary)
# - Canaries (canary_summary)
# - Leakage checklist

# Write to: data/reference/validation_report.json
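
A possible consolidation step is sketched below. The function signature and the report keys (`contracts`, `anomalies`, `canaries`, `leakage_checklist`) are assumptions chosen to match the bullet list above, and the demo writes to a temporary directory with placeholder inputs rather than to the real `data/reference/` folder.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_validation_report(path: Path, validation_report: dict, anomaly_summary: dict,
                            canary_summary: dict, leakage_checklist: list,
                            run_id=None) -> dict:
    """Merge all validation outputs into one JSON document and write it."""
    report = {
        "run_id": run_id,
        "generated_at_utc": datetime.now(timezone.utc).isoformat(),
        "contracts": validation_report,
        "anomalies": anomaly_summary,
        "canaries": canary_summary,
        "leakage_checklist": leakage_checklist,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps non-JSON types (dates, paths) from breaking the write
    path.write_text(json.dumps(report, indent=2, default=str))
    return report


# Demo with placeholder inputs, written to a temporary directory
out_dir = Path(tempfile.mkdtemp())
report_path = out_dir / "reference" / "validation_report.json"
written = write_validation_report(
    report_path,
    validation_report={"passed": True, "failures": [], "warnings": []},
    anomaly_summary={"suspicious_rows": 0},
    canary_summary={"drop_days": []},
    leakage_checklist=["Is prediction time defined clearly?"],
    run_id="demo_run",
)
print(report_path.read_text())
```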

## 6.2 Create Run Log

In [3]:
# Run log

# >>> TODO: Define create_run_log(...)
# Document:
# - run_id
# - generated_at_utc
# - inputs (raw data paths, sizes, query fingerprint)
# - outputs (staged, curated, validation paths)
# - row_definition
# - notes

# Write to: data/reference/pipeline_runs/{RUN_ID}.json
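
The run log could follow the same pattern. In this sketch the "fingerprint" is assumed to be a short SHA-256 digest of the inputs dictionary, which is only one way to interpret that field; the paths, sizes, and row definition in the demo are placeholders.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_run_log(run_dir: Path, run_id: str, inputs: dict, outputs: dict,
                   row_definition: str, notes: str = "") -> Path:
    """Write one JSON log per run and return its path."""
    # Stable digest of the inputs, so two runs over identical inputs can be matched.
    payload = json.dumps(inputs, sort_keys=True, default=str).encode("utf-8")
    fingerprint = hashlib.sha256(payload).hexdigest()[:16]

    log = {
        "run_id": run_id,
        "generated_at_utc": datetime.now(timezone.utc).isoformat(),
        "inputs": inputs,
        "inputs_fingerprint": fingerprint,
        "outputs": outputs,
        "row_definition": row_definition,
        "notes": notes,
    }
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / f"{run_id}.json"
    path.write_text(json.dumps(log, indent=2, default=str))
    return path


# Demo with placeholder values, written to a temporary directory
run_dir = Path(tempfile.mkdtemp()) / "pipeline_runs"
log_path = create_run_log(
    run_dir,
    run_id="demo_run",
    inputs={"data/raw/example.csv.gz": {"size_bytes": 123}},
    outputs={"staged": "data/staged/example.parquet"},
    row_definition="placeholder: describe what one row represents",
    notes="illustrative run only",
)
print(log_path.read_text())
```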

---

# 7. Self-Check and Reflection

Document insights and potential issues.

## 7.1 Pipeline Reflection

In [4]:
# Pipeline reflection

# >>> TODO: Write reflection
# reflection = [
#     "Row definition: ...",
#     "Required columns/sensors: ...",
#     "Range checks: ...",
#     "Biggest anomaly found: ...",
#     "Likely breakage scenario: ...",
#     "Temporal leakage risk: ...",
#     "Data quality insight: ...",
#     "Pipeline reproducibility: ..."
# ]

# Print formatted reflection

---

# 8. Final Summary

## Outputs Created:

- `data/raw/` — raw data snapshot + metadata
- `data/staged/` — parsed, typed DataFrame
- `data/warehouse/` — analysis-ready curated data
- `data/reference/validation_report.json` — full validation results
- `data/reference/pipeline_runs/{run_id}.json` — execution log

## Next Steps:

1. Use curated data for analysis or modeling
2. Review validation report for data quality issues
3. Check leakage checklist before ML model development
4. Re-run pipeline with new data using same structure