# Homework 2: NOAA Buoy Data Pipeline (NDBC)

You will build a repeatable pipeline that ingests **real sensor data**, handles missingness + sentinel values,
creates an analysis-ready table, and writes a validation report you could run every day.

## Dataset Overview: NOAA NDBC Buoy Observations

This project uses data from the **NOAA National Data Buoy Center (NDBC)**, which operates a network of ocean buoys and coastal stations that continuously measure atmospheric and oceanographic conditions.

Each buoy is a **physical sensor platform** deployed at a fixed geographic location. It records observations at regular time intervals (often hourly), transmitting them to NOAA for operational use in weather forecasting, marine safety, and climate research.

---

### What a single row represents

In this dataset:

> **Each row represents one sensor observation at a specific buoy at a specific UTC timestamp.**

This makes the data:
- **Time series** (ordered in time)
- **Stateful** (conditions at a moment, not events)
- **Naturally indexed by time**

There is no “target variable” yet — this dataset is about *measurement*, not decisions.

---

### Core variables (typical)

Not every buoy reports every variable, but common fields include:

- **Wind**
  - `WDIR` — wind direction (degrees)
  - `WSPD` — wind speed (m/s)
  - `GST` — wind gust (m/s)
- **Waves**
  - `WVHT` — significant wave height (m)
  - `DPD` — dominant wave period (s)
  - `APD` — average wave period (s)
- **Atmosphere**
  - `PRES` — sea-level pressure (hPa)
  - `ATMP` — air temperature (°C)
- **Ocean**
  - `WTMP` — water temperature (°C)

Missing values are common and usually reflect **sensor downtime**, **transmission issues**, or **environmental constraints**, not data entry errors.

---

### Why this dataset is realistic (and messy)

This is **real operational sensor data**, not a curated research dataset. As a result:

- Missing values are encoded as sentinels (e.g. `MM`)
- Sensors may fail temporarily or permanently
- Some variables appear or disappear over time
- Units and ranges must be interpreted using domain knowledge
- The dataset contains a rolling time window, not full history

These characteristics make the dataset ideal for practicing **data ingestion, validation, and pipeline design**.
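
As a minimal sketch of sentinel handling, the hypothetical helper `parse_reading` below (not part of the assignment code) maps the `MM` sentinel to `None` and any other field to a float. It assumes `MM` is the only missing-value marker in the feed.

```python
from typing import Optional

MISSING_SENTINEL = "MM"  # assumption: the only missing-value marker in the feed


def parse_reading(token: str) -> Optional[float]:
    """Convert one raw field to a float, or None if it is the missing sentinel."""
    token = token.strip()
    if token == MISSING_SENTINEL:
        return None
    return float(token)


# Fields in the order WDIR, WSPD, GST, WVHT, taken from a sample observation row
raw_fields = ["90", "3.0", "4.0", "MM"]
readings = [parse_reading(tok) for tok in raw_fields]
print(readings)  # [90.0, 3.0, 4.0, None]
```

Keeping the sentinel check explicit makes the assumption visible: if a new marker ever appears, `float()` raises instead of silently producing a wrong number.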

---

### Important constraint: rolling history

The data used here comes from NOAA’s `realtime2` endpoint, which provides a **rolling window of recent observations** (typically ~30–45 days).

This means:
- You are *not* requesting a specific date range
- Older observations drop out of the file as new ones arrive
- Historical backfills require a different NOAA endpoint

This is intentional and mirrors how real production systems separate **realtime feeds** from **historical archives**.
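
One way to keep history despite the rolling window is to merge each new snapshot into a local archive keyed by timestamp. The sketch below is illustrative only: `merge_snapshots` is a hypothetical helper, and the readings are made-up values rather than real buoy data.

```python
from __future__ import annotations


def merge_snapshots(archive: dict[str, dict], snapshot: dict[str, dict]) -> dict[str, dict]:
    """Return a new archive keyed by UTC timestamp; snapshot rows win on conflict."""
    merged = dict(archive)
    merged.update(snapshot)
    return merged


# Two overlapping pulls of the rolling window (illustrative values)
first_pull = {
    "2026-02-14T00:00Z": {"WSPD": 5.0},
    "2026-02-15T00:00Z": {"WSPD": 3.0},
}
second_pull = {
    "2026-02-15T00:00Z": {"WSPD": 3.0},
    "2026-02-16T00:00Z": {"WSPD": 4.0},
}

archive = merge_snapshots(first_pull, second_pull)
print(sorted(archive))
```

Because the timestamp is the key, re-running the merge with the same snapshot does not create duplicates.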

---

The goal is not to “clean it perfectly,” but to **build trust in the parts you use** — and to document the assumptions you make along the way.


## What you will produce

Artifacts (under a project folder):

- `data/raw/` — raw station snapshot(s) + metadata (station id, URL, timestamp)
- `data/staged/` — parsed/normalized table (typed, missingness normalized)
- `data/warehouse/` — curated table (Parquet; optionally partitioned by day)
- `data/reference/validation_report.json` — contracts + anomaly rates + canaries
- `data/reference/pipeline_runs/` — run logs for reproducibility

> Principle: In sensor data, “cleaning” is mostly about **assumptions** (units, ranges, sentinel values)
> and **guardrails** (contracts + anomaly flags), not deleting rows.
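
The principle above can be sketched as follows: instead of dropping an implausible reading, add flag columns and keep every row. The column names `wspd_out_of_range` and `wspd_missing`, the bounds, and the readings are all assumptions made for illustration.

```python
import pandas as pd

# Illustrative readings (not real buoy data); 75.0 m/s is outside the assumed range
df = pd.DataFrame({"WSPD": [3.0, 75.0, None, 4.0]})

WSPD_RANGE = (0.0, 50.0)  # assumed plausible bounds in m/s

# between() is False for NaN, so require notna() to avoid flagging missing values
df["wspd_out_of_range"] = df["WSPD"].notna() & ~df["WSPD"].between(*WSPD_RANGE)
df["wspd_missing"] = df["WSPD"].isna()

print(df)
```

All four rows survive, so a later analysis can decide how to treat flagged values instead of inheriting a silent deletion.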


## 0) Setup

Create a project folder somewhere, for example at:

`~/work/homework_2_noaa/`

Create the five directories listed above.


In [31]:
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timezone
import json
import hashlib

import numpy as np
import pandas as pd

from IPython.display import display

pd.set_option("display.max_columns", 160)
pd.set_option("display.width", None)

WORK_DIR = Path("../work")
PROJECT_DIR = WORK_DIR / "HW2_NOAA"

DATA_DIR = PROJECT_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
STAGED_DIR = DATA_DIR / "staged"
WH_DIR = DATA_DIR / "warehouse"
REF_DIR = DATA_DIR / "reference"
RUN_DIR = REF_DIR / "pipeline_runs"

for p in [RAW_DIR, STAGED_DIR, WH_DIR, REF_DIR, RUN_DIR]:
    p.mkdir(parents=True, exist_ok=True)

print("Project:", PROJECT_DIR)
print("Raw:", RAW_DIR)
print("Staged:", STAGED_DIR)
print("Warehouse:", WH_DIR)
print("Reference:", REF_DIR)
print("Runs:", RUN_DIR)


Project: ../work/HW2_NOAA
Raw: ../work/HW2_NOAA/data/raw
Staged: ../work/HW2_NOAA/data/staged
Warehouse: ../work/HW2_NOAA/data/warehouse
Reference: ../work/HW2_NOAA/data/reference
Runs: ../work/HW2_NOAA/data/reference/pipeline_runs


### Helper utilities

Define small helper utilities for timestamps, hashing, file I/O, and basic contract checks.

In [32]:
class PipelineError(RuntimeError):
    pass

def utc_now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()

def sha16(x: str) -> str:
    return hashlib.sha256(x.encode("utf-8")).hexdigest()[:16]

def write_txt(path: Path, content: str) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    # Explicit UTF-8 so the snapshot does not depend on the platform default encoding
    path.write_text(content, encoding="utf-8")

def write_json(path: Path, obj: dict) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(obj, indent=2, default=str), encoding="utf-8")

def read_json(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))

def require_columns(df: pd.DataFrame, cols: list[str], context: str) -> None:
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise PipelineError(f"[{context}] Missing required columns: {missing}")

def require_unique(df: pd.DataFrame, key: str, context: str) -> None:
    if key not in df.columns:
        raise PipelineError(f"[{context}] Missing key column '{key}'")
    dupes = int(df[key].duplicated().sum())
    if dupes:
        raise PipelineError(f"[{context}] Key '{key}' has {dupes} duplicates")

print("Helpers ready.")


Helpers ready.


## 1) Ingest: download latest buoy observations

NOAA NDBC provides station “realtime2” text files, e.g.

- `https://www.ndbc.noaa.gov/data/realtime2/44013.txt`

These files are human-readable but still messy:
- header lines starting with `#`
- missing values as sentinel strings like `MM`
- sometimes extra columns depending on station/sensor

**Station 44013 — Boston 16 NM East of Boston, MA** 

We will use data from buoy station `44013` in this homework for these reasons:
- Location in the North Atlantic, a region with rich weather variability (storms, seasonal shifts).
- Long historical coverage (data available back into the 1980s/1990s).
- Strong mix of variables: wind, wave height, pressures, temperatures, etc.
- Very useful for seasonality, trend analysis, anomaly detection, and combining meteorological + oceanographic features.
- This station is especially popular for regional marine research and forecasting, so its data patterns can be both interesting and instructive for data science exercises.

Look at the station information [here](https://www.ndbc.noaa.gov/station_page.php?station=44013).


✅ **Exercise 1.1 — fetch raw data and write snapshot**

Download the file, then write:

- raw text: `data/raw/ndbc_<station>_<runid>.txt`
- raw metadata: `data/raw/ndbc_meta_<station>_<runid>.json`

**Hint:** use `requests.get(url).text` and save as UTF-8.
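
A hedged sketch of a slightly more defensive fetch follows. It uses the standard library's `urllib.request` rather than `requests` so it runs without extra dependencies, and it records a SHA-256 of the raw text in the metadata. The function names and metadata fields are hypothetical, and the demonstration uses a tiny stand-in string instead of a live download.

```python
import hashlib
import urllib.request
from datetime import datetime, timezone


def fetch_text(url: str, timeout_s: float = 30.0) -> str:
    """Download a text file; urlopen raises HTTPError on 4xx/5xx responses."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


def build_snapshot_metadata(raw_text: str, station_id: str, url: str) -> dict:
    """Describe a raw snapshot so a later run can verify it was not altered."""
    return {
        "station_id": station_id,
        "url": url,
        "fetched_at_utc": datetime.now(timezone.utc).isoformat(),
        "n_chars": len(raw_text),
        "n_lines": len(raw_text.splitlines()),
        "sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
    }


# Offline demonstration with a tiny stand-in for the downloaded text
sample = "#YY  MM DD hh mm WDIR\n2026 02 15 21 30  90\n"
meta = build_snapshot_metadata(
    sample, "44013", "https://www.ndbc.noaa.gov/data/realtime2/44013.txt"
)
print(meta["n_lines"], meta["sha256"][:16])
```

Storing a content hash next to the snapshot makes it cheap to detect whether two runs fetched identical data.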


In [42]:
import requests

# Setup constants
STATION_ID = "44013"  
NOAA_URL = f"https://www.ndbc.noaa.gov/data/realtime2/{STATION_ID}.txt"
RUN_ID = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S_utc")

print(f"Station: {STATION_ID}")
print(f"URL: {NOAA_URL}")
print(f"run_id: {RUN_ID}")

# Retrieve dataset and save
raw_data_path = RAW_DIR / f"ndbc_{STATION_ID}_run_{RUN_ID}.txt"
response = requests.get(NOAA_URL, timeout=30)
response.raise_for_status()  # fail fast on HTTP errors instead of saving an error page
raw_data = response.text
write_txt(raw_data_path, raw_data)
print(f"Wrote: {raw_data_path}")

# Create metadata and save
raw_metadata = {
    "run_id": RUN_ID,
    "generated_at_utc": utc_now_iso(),
    "query": {
        "url": NOAA_URL,
        "station_id": STATION_ID
    },
    "n_chars": len(raw_data),
    "source": "National Data Buoy Center (NOAA)"
}
raw_meta_path = RAW_DIR / f"ndbc_{STATION_ID}_meta_{RUN_ID}.json"
write_json(raw_meta_path, raw_metadata)
print(f"Wrote: {raw_meta_path}")

# Print first few rows of raw data
print('\n'.join(raw_data.splitlines()[:10]))

Station: 44013
URL: https://www.ndbc.noaa.gov/data/realtime2/44013.txt
run_id: 20260215_220552_utc
Wrote: ../work/HW2_NOAA/data/raw/ndbc_44013_run_20260215_220552_utc.txt
Wrote: ../work/HW2_NOAA/data/raw/ndbc_44013_meta_20260215_220552_utc.json
#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS PTDY  TIDE
#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC  nmi  hPa    ft
2026 02 15 21 30  90  3.0  4.0    MM    MM    MM  MM 1021.9  -0.9   3.0 -10.8   MM   MM    MM
2026 02 15 21 20  80  2.0  4.0   0.8     6   4.6  35 1021.7  -0.9   3.0 -10.7   MM   MM    MM
2026 02 15 21 10  80  3.0  4.0    MM    MM    MM  MM 1021.6  -0.9   3.0 -10.7   MM   MM    MM
2026 02 15 21 00  80  3.0  4.0    MM    MM    MM  MM 1021.4  -1.0   3.0 -10.6   MM +0.9    MM
2026 02 15 20 50  70  3.0  4.0   0.8     6   4.5  39 1021.2  -0.9   3.0 -10.3   MM   MM    MM
2026 02 15 20 40  80  3.0  4.0    MM    MM    MM  MM 1021.1  -1.0   3.0 -10.3   MM   MM    MM
202

## 2) Stage: parse + normalize missingness + types

NDBC realtime2 files have:
- one header line naming columns (after `#`)
- data rows with whitespace-separated values

Typical columns include:
- `YY MM DD hh mm` (timestamp components, UTC)
- `WDIR` wind direction (deg)
- `WSPD` wind speed (m/s)
- `GST` gust (m/s)
- `WVHT` wave height (m)
- `DPD` dominant period (s)
- `APD` average period (s)
- `PRES` pressure (hPa)
- `ATMP` air temp (C)
- `WTMP` water temp (C)

But not every station has every column.

✅ **Exercise 2.1 — parse the raw file into a DataFrame**

Implement the function `read_ndbc_txt(path)` that:
- reads the file
- returns a DataFrame

**Hints:**
- Many files have a commented header and sometimes a units line.
- A robust approach:
  - Find the first line that starts with `#` (the column header)
  - If the next line also starts with `#`, it is the units line; skip it
  - Parse the remaining lines with `sep=r"\s+"` (`delim_whitespace=True` is deprecated in recent pandas releases)
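
The parsing approach in these hints can be sketched with the standard library alone. `split_ndbc_text` is a hypothetical helper, and the sample is an abbreviated version of the file (first eight columns only).

```python
from __future__ import annotations


def split_ndbc_text(text: str) -> tuple[list[str], list[list[str]]]:
    """Return (column names, data rows) from realtime2-style text."""
    columns: list[str] = []
    rows: list[list[str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            if not columns:
                # The first commented line names the columns
                columns = line.lstrip("#").split()
            # Any later commented line (e.g. the units line) is ignored
            continue
        rows.append(line.split())
    if not columns:
        raise ValueError("No header line found (no line starts with '#').")
    return columns, rows


sample = """#YY  MM DD hh mm WDIR WSPD GST
#yr  mo dy hr mn degT m/s  m/s
2026 02 15 21 30  90  3.0  4.0
2026 02 15 21 20  80  2.0  4.0
"""
columns, rows = split_ndbc_text(sample)
print(columns)
print(rows[0])
```

Values stay as strings here on purpose; type coercion and sentinel handling belong to the staging step.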


In [43]:
def make_unique(cols):
    seen = {}
    out = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            out.append(f"{c}_{seen[c]}")
        else:
            seen[c] = 0
            out.append(c)
    return out

def read_ndbc_txt(path: Path) -> pd.DataFrame:
    # Find header = first line that starts with '#'
    header_line = None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                header_line = line
                break

    if header_line is None:
        raise ValueError("No header line found (no line starts with '#').")

    columns = make_unique(header_line.lstrip("#").split())

    # Read data: skip any commented lines (header + units) automatically
    df = pd.read_csv(
        path,
        sep=r"\s+",
        comment="#",
        names=columns,
        header=None,
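        # realtime2 files mark missing values with "MM" only. Treating 99/999/9999
        # as missing would also erase real readings (e.g. a pressure of 999.0 hPa).
        na_values=["MM"],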
    )
    return df


In [44]:
df_raw = read_ndbc_txt(raw_data_path)
print(f"Raw flattened shape: {df_raw.shape}")
display(df_raw.head(10))

Raw flattened shape: (6563, 19)


Unnamed: 0,YY,MM,DD,hh,mm,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE
0,2026,2,15,21,30,90.0,3.0,4.0,,,,,1021.9,-0.9,3.0,-10.8,,,
1,2026,2,15,21,20,80.0,2.0,4.0,0.8,6.0,4.6,35.0,1021.7,-0.9,3.0,-10.7,,,
2,2026,2,15,21,10,80.0,3.0,4.0,,,,,1021.6,-0.9,3.0,-10.7,,,
3,2026,2,15,21,0,80.0,3.0,4.0,,,,,1021.4,-1.0,3.0,-10.6,,0.9,
4,2026,2,15,20,50,70.0,3.0,4.0,0.8,6.0,4.5,39.0,1021.2,-0.9,3.0,-10.3,,,
5,2026,2,15,20,40,80.0,3.0,4.0,,,,,1021.1,-1.0,3.0,-10.3,,,
6,2026,2,15,20,30,60.0,3.0,4.0,,,,,1021.0,-1.1,3.0,-10.6,,,
7,2026,2,15,20,20,60.0,3.0,5.0,0.9,5.0,4.4,32.0,1021.0,-1.2,3.0,-10.6,,,
8,2026,2,15,20,10,60.0,3.0,4.0,,,,,1020.9,-1.2,3.0,-10.4,,,
9,2026,2,15,20,0,40.0,4.0,6.0,,,,,1020.7,-1.2,3.0,-10.5,,0.0,


✅ **Exercise 2.2 — construct a UTC timestamp + normalize missing values**

Create a staged table that includes:

- `station_id`
- `time_utc` as timezone-aware datetime
- numeric sensor fields coerced to numeric
- missing values: `MM` becomes NaN (via numeric coercion)

**Hints:**
- Timestamp columns can be `YY MM DD hh mm` or sometimes `YYYY MM DD hh mm`.
- The month column is usually `MM` and minute is `mm` (case matters).
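
A minimal sketch of the timestamp construction, using only the standard library. `build_time_utc` is a hypothetical helper, and the handling of two-digit years (mapping them to the 1900s) is an assumption that should be checked against the files you actually ingest.

```python
from datetime import datetime, timezone


def build_time_utc(row: dict) -> datetime:
    """Build a timezone-aware UTC timestamp from NDBC date/time fields."""
    year_key = "YYYY" if "YYYY" in row else "YY"
    year = int(row[year_key])
    if year < 100:
        # Assumption: two-digit years belong to the 1900s
        year += 1900
    return datetime(
        year,
        int(row["MM"]),  # month (upper case)
        int(row["DD"]),
        int(row["hh"]),
        int(row["mm"]),  # minute (lower case)
        tzinfo=timezone.utc,
    )


row = {"YY": "2026", "MM": "02", "DD": "15", "hh": "21", "mm": "30"}
ts = build_time_utc(row)
print(ts.isoformat())  # 2026-02-15T21:30:00+00:00
```

Attaching `timezone.utc` at construction avoids naive datetimes leaking into later comparisons.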


In [45]:
def stage_ndbc_data(df_raw: pd.DataFrame, station_id: str) -> pd.DataFrame:
    df = df_raw.copy()
    
    # Station ID
    df['station_id'] = station_id
    
    # Identify year column
    if 'YYYY' in df.columns:
        year_col = 'YYYY'
    elif 'YY' in df.columns:
        year_col = 'YY'
    else:
        raise ValueError("No year column found")

    # Create timezone-aware UTC timestamp
    time_df = pd.DataFrame({
        'year': df[year_col],
        'month': df['MM'],
        'day': df['DD'],
        'hour': df['hh'],
        'minute': df['mm']
    })
    df['time_utc'] = pd.to_datetime(time_df).dt.tz_localize('UTC')
    
    # Identify sensor columns
    time_cols = {year_col, 'MM', 'DD', 'hh', 'mm', 'time_utc', 'station_id'}
    sensor_cols = [col for col in df.columns if col not in time_cols]

    # Coerce sensor columns to numeric (MM -> NaN)
    for col in sensor_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # Select final columns: station_id, time_utc, then all sensors
    final_cols = ['station_id', 'time_utc'] + sensor_cols
    df_staged = df[final_cols]
    
    return df_staged

print("Staging function ready")

Staging function ready


In [46]:
df_staged = stage_ndbc_data(df_raw, station_id = STATION_ID)
print(f"Staged shape: {df_staged.shape}")
display(df_staged.head(10))

Staged shape: (6563, 16)


Unnamed: 0,station_id,time_utc,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE
0,44013,2026-02-15 21:30:00+00:00,90.0,3.0,4.0,,,,,1021.9,-0.9,3.0,-10.8,,,
1,44013,2026-02-15 21:20:00+00:00,80.0,2.0,4.0,0.8,6.0,4.6,35.0,1021.7,-0.9,3.0,-10.7,,,
2,44013,2026-02-15 21:10:00+00:00,80.0,3.0,4.0,,,,,1021.6,-0.9,3.0,-10.7,,,
3,44013,2026-02-15 21:00:00+00:00,80.0,3.0,4.0,,,,,1021.4,-1.0,3.0,-10.6,,0.9,
4,44013,2026-02-15 20:50:00+00:00,70.0,3.0,4.0,0.8,6.0,4.5,39.0,1021.2,-0.9,3.0,-10.3,,,
5,44013,2026-02-15 20:40:00+00:00,80.0,3.0,4.0,,,,,1021.1,-1.0,3.0,-10.3,,,
6,44013,2026-02-15 20:30:00+00:00,60.0,3.0,4.0,,,,,1021.0,-1.1,3.0,-10.6,,,
7,44013,2026-02-15 20:20:00+00:00,60.0,3.0,5.0,0.9,5.0,4.4,32.0,1021.0,-1.2,3.0,-10.6,,,
8,44013,2026-02-15 20:10:00+00:00,60.0,3.0,4.0,,,,,1020.9,-1.2,3.0,-10.4,,,
9,44013,2026-02-15 20:00:00+00:00,40.0,4.0,6.0,,,,,1020.7,-1.2,3.0,-10.5,,0.0,


✅ **Exercise 2.3 — write staged outputs**


In [47]:
# Define output path and write data
staged_data_path = STAGED_DIR / f"ndbc_staged_{STATION_ID}_run_{RUN_ID}.parquet"
df_staged.to_parquet(staged_data_path, index=False)
print(f"Wrote: {staged_data_path}")

# Create staged metadata
staged_metadata = {
    "station_id": STATION_ID,
    "source_file": raw_data_path.name,
    "staged_at": datetime.now(timezone.utc).isoformat(),
    "row_count": len(df_staged),
    "column_count": len(df_staged.columns),
    "time_range": {
        "start": df_staged['time_utc'].min().isoformat(),
        "end": df_staged['time_utc'].max().isoformat(),
        "span_days": (df_staged['time_utc'].max() - df_staged['time_utc'].min()).days
    },
    "columns": df_staged.columns.tolist(),
    "data_types": {col: str(dtype) for col, dtype in df_staged.dtypes.items()},
}

# Write metadata
staged_meta_path = STAGED_DIR / f"ndbc_staged_meta_{STATION_ID}_run_{RUN_ID}.json"
write_json(staged_meta_path, staged_metadata)
print(f"Wrote: {staged_meta_path}")

Wrote: ../work/HW2_NOAA/data/staged/ndbc_staged_44013_run_20260215_220552_utc.parquet
Wrote: ../work/HW2_NOAA/data/staged/ndbc_staged_meta_44013_run_20260215_220552_utc.json


## 3) Curate: analysis-ready features

✅ **Exercise 3.1 — time features + flags**


In [48]:
def curate_ndbc_data(df_staged: pd.DataFrame) -> pd.DataFrame:
    df = df_staged.copy()

    # Time features
    df['observation_day'] = df['time_utc'].dt.date
    df['observation_hour'] = df['time_utc'].dt.hour
    df['dayofweek'] = df['time_utc'].dt.dayofweek
    df['is_weekend'] = (df['dayofweek'] >= 5).astype(int)

    # Derived features (not quality flags).
    # A missing WSPD compares False, so wind_high is 0 when wind speed is NaN.
    df['wind_high'] = (df['WSPD'] > 10.0).astype(int)
    df['temp_gap_c'] = df['ATMP'] - df['WTMP']

    return df

print("Curate function ready")

Curate function ready


In [49]:
df_curated = curate_ndbc_data(df_staged)
display(df_curated.head(10))
print(df_curated.columns.tolist())

Unnamed: 0,station_id,time_utc,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE,observation_day,observation_hour,dayofweek,is_weekend,wind_high,temp_gap_c
0,44013,2026-02-15 21:30:00+00:00,90.0,3.0,4.0,,,,,1021.9,-0.9,3.0,-10.8,,,,2026-02-15,21,6,1,0,-3.9
1,44013,2026-02-15 21:20:00+00:00,80.0,2.0,4.0,0.8,6.0,4.6,35.0,1021.7,-0.9,3.0,-10.7,,,,2026-02-15,21,6,1,0,-3.9
2,44013,2026-02-15 21:10:00+00:00,80.0,3.0,4.0,,,,,1021.6,-0.9,3.0,-10.7,,,,2026-02-15,21,6,1,0,-3.9
3,44013,2026-02-15 21:00:00+00:00,80.0,3.0,4.0,,,,,1021.4,-1.0,3.0,-10.6,,0.9,,2026-02-15,21,6,1,0,-4.0
4,44013,2026-02-15 20:50:00+00:00,70.0,3.0,4.0,0.8,6.0,4.5,39.0,1021.2,-0.9,3.0,-10.3,,,,2026-02-15,20,6,1,0,-3.9
5,44013,2026-02-15 20:40:00+00:00,80.0,3.0,4.0,,,,,1021.1,-1.0,3.0,-10.3,,,,2026-02-15,20,6,1,0,-4.0
6,44013,2026-02-15 20:30:00+00:00,60.0,3.0,4.0,,,,,1021.0,-1.1,3.0,-10.6,,,,2026-02-15,20,6,1,0,-4.1
7,44013,2026-02-15 20:20:00+00:00,60.0,3.0,5.0,0.9,5.0,4.4,32.0,1021.0,-1.2,3.0,-10.6,,,,2026-02-15,20,6,1,0,-4.2
8,44013,2026-02-15 20:10:00+00:00,60.0,3.0,4.0,,,,,1020.9,-1.2,3.0,-10.4,,,,2026-02-15,20,6,1,0,-4.2
9,44013,2026-02-15 20:00:00+00:00,40.0,4.0,6.0,,,,,1020.7,-1.2,3.0,-10.5,,0.0,,2026-02-15,20,6,1,0,-4.2


['station_id', 'time_utc', 'WDIR', 'WSPD', 'GST', 'WVHT', 'DPD', 'APD', 'MWD', 'PRES', 'ATMP', 'WTMP', 'DEWP', 'VIS', 'PTDY', 'TIDE', 'observation_day', 'observation_hour', 'dayofweek', 'is_weekend', 'wind_high', 'temp_gap_c']


✅ **Exercise 3.2 — choose curated columns and write Parquet**


In [50]:
curated_columns = [
    # Identifiers
    'station_id',
    'time_utc',
    
    # Sensor measurements
    'WDIR', 
    'WSPD', 
    'GST',
    'WVHT', 
    'DPD', 
    'APD', 
    'MWD',
    'PRES', 
    'ATMP', 
    'WTMP', 
    'DEWP',
    'VIS', 
    'PTDY', 
    'TIDE',
    
    # Engineered time features
    'observation_day',
    'observation_hour',
    'dayofweek',
    'is_weekend',
    
    # Derived features
    'wind_high',
    'temp_gap_c'
]

df_final = df_curated[curated_columns]

# Write to warehouse
curated_data_path = WH_DIR / f"ndbc_curated_{STATION_ID}.parquet"
df_final.to_parquet(curated_data_path, index=False)

# Calculate statistics
num_rows = len(df_final)
num_cols = len(df_final.columns)
num_days = df_final['observation_day'].nunique()

# The table is written as a single Parquet file (not partitioned), so report distinct days
print(f"Wrote: {curated_data_path} | rows: {num_rows} | cols: {num_cols}")
print(f"Distinct observation days: {num_days}")

Wrote: ../work/HW2_NOAA/data/warehouse/ndbc_curated_44013.parquet | rows: 6563 | cols: 22
Distinct observation days: 46


## 4) Validate: contracts + anomalies + canaries

✅ **Exercise 4.1 — required columns + plausible ranges**


In [51]:
# Required columns
required_cols = [
    'station_id',
    'time_utc',
    'WSPD',
    'PRES',
    'ATMP',
    'WTMP'
]

# Optional columns
optional_cols = [
    'WVHT',
    'GST',
    'WDIR'
]

# Plausible ranges for validation
range_checks = {
    'WSPD': (0, 50),      
    'GST': (0, 60),       
    'WVHT': (0, 20),      
    'PRES': (900, 1100),  
    'ATMP': (-50, 50),    
    'WTMP': (-5, 40),     
    'WDIR': (0, 360)      
}

print("Required cols (contract):", required_cols)
print("Optional cols (monitor):", optional_cols)
print("Range checks will apply to:", list(range_checks.keys()))
print("\nNote: WVHT exists as a column but is often missing for some stations. We'll monitor it, not fail the run.")


Required cols (contract): ['station_id', 'time_utc', 'WSPD', 'PRES', 'ATMP', 'WTMP']
Optional cols (monitor): ['WVHT', 'GST', 'WDIR']
Range checks will apply to: ['WSPD', 'GST', 'WVHT', 'PRES', 'ATMP', 'WTMP', 'WDIR']

Note: WVHT exists as a column but is often missing for some stations. We'll monitor it, not fail the run.


✅ **Exercise 4.2 — implement validation checks**


In [52]:
def validate_curated_data(df: pd.DataFrame, required_cols: list, optional_cols: list, range_checks: dict) -> dict:

    validation_report = {
        "validated_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(df),
        "passed": True,
        "failures": [],
        "warnings": []
    }

    # 1: Required columns must exist
    missing_required = [col for col in required_cols if col not in df.columns]
    if missing_required:
        validation_report['passed'] = False
        validation_report['failures'].append({
            'check': 'required_columns',
            'issue': f"Missing required columns: {missing_required}"
        })

    # 2: Required columns must have some data (not all NaN)
    for col in required_cols:
        if col in df.columns:
            missing_rate = df[col].isna().sum() / len(df)
            if missing_rate >= 0.99:  # 99% or more missing = FAIL
                validation_report['passed'] = False
                validation_report['failures'].append({
                    'check': f'missing_rate_required:{col}',
                    'missing_rate': missing_rate
                })

    # 3: Range violations for all columns
    for col, (min_val, max_val) in range_checks.items():
        if col not in df.columns:
            continue
        
        valid_data = df[col].dropna()
        if len(valid_data) == 0:
            continue  
        
        out_of_range = ((valid_data < min_val) | (valid_data > max_val)).sum()
        violation_rate = out_of_range / len(valid_data)
        
        # FAIL if >5% of values out of range
        if violation_rate > 0.05:
            validation_report['passed'] = False
            validation_report['failures'].append({
                'check': f'range_violation:{col}',
                'out_of_range': int(out_of_range),
                'total': len(valid_data),
                'violation_rate': violation_rate,
                'range': [min_val, max_val]
            })

    # 4: Optional column warnings (high missing rate)
    for col in optional_cols:
        if col in df.columns:
            missing_rate = df[col].isna().sum() / len(df)
            if missing_rate > 0.3:  # WARN if >30% missing
                validation_report['warnings'].append({
                    'check': f'missing_rate_optional:{col}',
                    'missing_rate': missing_rate,
                    'warn_if_gt': 0.3
                })


    return validation_report

print("Validation function ready")

Validation function ready


In [53]:
# Run validation
validation_report = validate_curated_data(df_curated, required_cols, optional_cols, range_checks)

# Print results
print(f"Validation passed? {validation_report['passed']}")
print(f"Failures: {len(validation_report['failures'])} | Warnings: {len(validation_report['warnings'])}")

if validation_report['failures']:
    print("\n--- FAILURES ---")
    for failure in validation_report['failures']:
        print(f"- {failure['check']} {failure}")

if validation_report['warnings']:
    print("\n--- WARNINGS ---")
    for warning in validation_report['warnings']:
        print(f"- {warning['check']} {warning}")

Validation passed? True
Failures: 0 | Warnings: 1

--- WARNINGS ---
- missing_rate_optional:WVHT {'check': 'missing_rate_optional:WVHT', 'missing_rate': np.float64(0.5933262227639798), 'warn_if_gt': 0.3}


✅ **Exercise 4.3 — anomaly flags + investigation table**


In [54]:
def create_anomaly_summary(df: pd.DataFrame, range_checks: dict) -> dict:

    summary = {
        "row_count": len(df),
        "value_anomalies": {},
        "missingness_rates": {},
        "suspicious_row_count": 0
    }
    
    # --- Value anomalies (out of range) ---
    # Row-level mask, so a row with several bad columns is counted once.
    any_anomaly = pd.Series(False, index=df.index)
    for col, (min_val, max_val) in range_checks.items():
        if col not in df.columns:
            continue

        # NaN compares False on both sides, so missing values are not flagged here.
        out_of_range_mask = (df[col] < min_val) | (df[col] > max_val)
        out_of_range = int(out_of_range_mask.sum())
        if out_of_range > 0:
            flag_name = f'anom_{col.lower()}'
            summary['value_anomalies'][flag_name] = out_of_range
            any_anomaly |= out_of_range_mask

    # --- Missingness rates ---
    key_sensor_cols = ['WVHT', 'WTMP', 'ATMP', 'WSPD', 'PRES']
    for col in key_sensor_cols:
        if col in df.columns:
            missing_rate = df[col].isna().mean()
            if missing_rate > 0:
                flag_name = f'miss_{col.lower()}'
                summary['missingness_rates'][flag_name] = round(float(missing_rate), 5)

    # Suspicious row count (rows with ANY value anomaly)
    summary['suspicious_row_count'] = int(any_anomaly.sum())
    
    return summary

print("Anomaly summary function ready")

Anomaly summary function ready


In [55]:
anomaly_summary = create_anomaly_summary(df_curated, range_checks)

# Print formatted output
print("=== Row-level anomaly summary ===")
print(f"Rows: {anomaly_summary['row_count']}")

print("\nValue anomaly counts (top):")
if anomaly_summary['value_anomalies']:
    sorted_anoms = sorted(anomaly_summary['value_anomalies'].items(), 
                         key=lambda x: x[1], reverse=True)
    for flag, count in sorted_anoms[:5]:
        print(f"- {flag}: {count}")
else:
    print("- None triggered by current thresholds.")

print("\nMissingness rates (informational):")
if anomaly_summary['missingness_rates']:
    sorted_miss = sorted(anomaly_summary['missingness_rates'].items(), 
                        key=lambda x: x[1], reverse=True)
    for flag, rate in sorted_miss:
        print(f"- {flag}: {rate*100:.3f}%")

print(f"\nSuspicious rows (value anomalies only): {anomaly_summary['suspicious_row_count']}")

=== Row-level anomaly summary ===
Rows: 6563

Value anomaly counts (top):
- None triggered by current thresholds.

Missingness rates (informational):
- miss_wvht: 59.333%
- miss_wtmp: 1.341%
- miss_pres: 0.335%
- miss_atmp: 0.229%
- miss_wspd: 0.137%

Suspicious rows (value anomalies only): 0


✅ **Exercise 4.4 — canaries + spike/drop detection**


In [56]:
def create_canary_summary(df: pd.DataFrame) -> dict:

    summary = {
        "day_count": 0,
        "obs_per_day": {},
        "drops": [],
        "overall_missingness": {},
        "worst_day_missingness": {},
        "high_missingness_days": {}
    }
    
    # Group by observation_day
    daily = df.groupby('observation_day')
    obs_counts = daily.size()
    
    # Basic stats
    summary['day_count'] = len(obs_counts)
    summary['obs_per_day'] = {
        'min': int(obs_counts.min()),
        'median': float(obs_counts.median()),
        'max': int(obs_counts.max())
    }
    
    # Drop detection: days with < 50% of median observations 
    median_obs = obs_counts.median()
    drop_threshold = median_obs * 0.5  
    
    drops = obs_counts[obs_counts < drop_threshold]
    summary['drops'] = [
        {'obs_day': day, 'n_obs': int(count)}
        for day, count in drops.items()
    ]

    # Overall missingness rates 
    key_cols = ['WSPD', 'WVHT', 'PRES', 'ATMP', 'WTMP']
    for col in key_cols:
        if col in df.columns:
            missing_rate = df[col].isna().mean()
            summary['overall_missingness'][col] = round(missing_rate, 3)


    # Worst day missingness per column
    for col in key_cols:
        if col in df.columns:
            daily_missing = df.groupby('observation_day')[col].apply(lambda x: x.isna().mean())
            worst_day = daily_missing.idxmax()
            worst_rate = daily_missing.max()
            
            summary['worst_day_missingness'][col] = {
                'rate': round(worst_rate, 3),
                'day': worst_day
            }

    # Days with high missingness (>30%) 
    for col in key_cols:
        if col in df.columns:
            daily_missing = df.groupby('observation_day')[col].apply(lambda x: x.isna().mean())
            high_miss_days = daily_missing[daily_missing > 0.30].sort_values(ascending=False)
            
            if len(high_miss_days) > 0:
                summary['high_missingness_days'][col] = {
                    'count': len(high_miss_days),
                    'examples': [
                        {'obs_day': day, 'missing_rate': float(rate)}
                        for day, rate in high_miss_days.head(5).items()
                    ]
                }
    
    return summary

print("Canary summary function ready")

Canary summary function ready


In [57]:
# Create canary summary
canary_summary = create_canary_summary(df_curated)

# Print formatted output
print("=== Canary summary ===")
print(f"Days: {canary_summary['day_count']}")
print(f"Obs/day (min/median/max): {canary_summary['obs_per_day']['min']} / "
      f"{canary_summary['obs_per_day']['median']} / {canary_summary['obs_per_day']['max']}")

print("\nDrops (unexpectedly few rows):")
if canary_summary['drops']:
    for drop in canary_summary['drops']:
        print(f"- {drop}")
else:
    print("- None detected")

print("\nOverall missingness (fraction of rows missing):")
for col, rate in canary_summary['overall_missingness'].items():
    print(f"- {col}: {rate:.3f}")

print("\nWorst day missingness per column:")
for col, info in canary_summary['worst_day_missingness'].items():
    print(f"- {col}: {info['rate']:.3f} on {info['day']}")

print("\nDays with missingness > 0.30:")
if canary_summary['high_missingness_days']:
    for col, info in canary_summary['high_missingness_days'].items():
        print(f"- {col}: {info['count']} days (showing up to 5)")
        for example in info['examples']:
            print(f"    {example}")
else:
    print("- None detected")

=== Canary summary ===
Days: 46
Obs/day (min/median/max): 130 / 143.0 / 144

Drops (unexpectedly few rows):
- None detected

Overall missingness (fraction of rows missing):
- WSPD: 0.001
- WVHT: 0.593
- PRES: 0.003
- ATMP: 0.002
- WTMP: 0.013

Worst day missingness per column:
- WSPD: 0.014 on 2026-01-23
- WVHT: 0.636 on 2026-01-05
- PRES: 0.070 on 2026-02-07
- ATMP: 0.014 on 2026-01-23
- WTMP: 0.035 on 2026-01-05

Days with missingness > 0.30:
- WVHT: 46 days (showing up to 5)
    {'obs_day': datetime.date(2026, 1, 5), 'missing_rate': 0.6363636363636364}
    {'obs_day': datetime.date(2026, 2, 15), 'missing_rate': 0.6230769230769231}
    {'obs_day': datetime.date(2026, 1, 13), 'missing_rate': 0.6223776223776224}
    {'obs_day': datetime.date(2026, 1, 6), 'missing_rate': 0.6153846153846154}
    {'obs_day': datetime.date(2026, 1, 16), 'missing_rate': 0.6153846153846154}
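
The canary summary above covers drops in row counts. For spikes in sensor values, one possible sketch is a step check between consecutive valid readings. `find_spikes` is a hypothetical helper, the 5.0 hPa threshold is an assumption, and the pressure values are illustrative. The readings must be sorted oldest first, which is the reverse of the raw file order.

```python
import math


def find_spikes(values: list, max_step: float) -> list:
    """Return indices where the jump from the previous valid reading exceeds max_step."""
    spikes = []
    prev = None
    for i, v in enumerate(values):
        if v is None or math.isnan(v):
            continue  # missing readings neither trigger nor reset the comparison
        if prev is not None and abs(v - prev) > max_step:
            spikes.append(i)
        prev = v
    return spikes


# Illustrative pressure series in hPa, oldest first, with one glitch and one gap
pres = [1021.0, 1021.2, 1035.0, 1021.4, float("nan"), 1021.6]
print(find_spikes(pres, max_step=5.0))  # [2, 3]
```

A single bad reading is reported twice, once for the jump up and once for the jump back, which makes isolated glitches easy to tell apart from a sustained level shift.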


## 5) Leakage audit (conceptual)

✅ **Exercise 5.1 — write a leakage checklist**


In [58]:
# Leakage Audit Checklist
leakage_checklist = [
    "If building rolling features, are they computed using only past data relative to prediction time?",
    "If you standardize/normalize sensors, are stats computed on TRAIN only?",
    "If you impute missing values, does the method avoid using future observations?",
    "Are you aggregating by day in a way that would be unavailable at prediction time?",
    "Is prediction time defined (e.g., predict next-hour wind speed using prior hours only)?"
]

print("Leakage checklist:")
for i, item in enumerate(leakage_checklist, 1):
    print(f"{i}. {item}")


Leakage checklist:
1. If building rolling features, are they computed using only past data relative to prediction time?
2. If you standardize/normalize sensors, are stats computed on TRAIN only?
3. If you impute missing values, does the method avoid using future observations?
4. Are you aggregating by day in a way that would be unavailable at prediction time?
5. Is prediction time defined (e.g., predict next-hour wind speed using prior hours only)?
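
Checklist items 1 and 2 can be made concrete with a short sketch. This is an illustration only, using the standard library and made-up wind speeds; the helper names are hypothetical. In pandas, the past-only rolling mean roughly corresponds to calling `shift(1)` before `rolling(...)`, which keeps the current observation out of its own feature.

```python
from statistics import mean, pstdev


def past_only_rolling_mean(values, window):
    """Rolling mean at index i over values[i-window:i], excluding values[i]."""
    out = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]
        out.append(mean(past) if past else None)  # no history at i == 0
    return out


def chronological_split(values, train_fraction=0.8):
    """Split in time order: earliest rows train, latest rows test."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def standardize_with_train_stats(train, test):
    """Scale both splits using mean/std computed on TRAIN only."""
    mu = mean(train)
    sigma = pstdev(train)
    scale = sigma if sigma > 0 else 1.0  # guard against a constant series
    return (
        [(v - mu) / scale for v in train],
        [(v - mu) / scale for v in test],
    )


# Made-up WSPD readings in time order (m/s).
wspd = [4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 6.0, 9.0, 10.0, 12.0]

lagged = past_only_rolling_mean(wspd, window=3)
train, test = chronological_split(wspd)
train_z, test_z = standardize_with_train_stats(train, test)

print(lagged[:4])
print(len(train), len(test))
```

Because the test rows are scaled with training statistics, their standardized values are not centered on zero; that is expected and is what the model would see at prediction time.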


## 6) Write validation report + run log


In [59]:
def write_validation_report(
    df: pd.DataFrame,
    validation_report: dict,
    anomaly_summary: dict,
    canary_summary: dict,
    leakage_checklist: list,
    station_id: str
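) -> dict:
    """Assemble the complete validation report as a JSON-serializable dict.

    Despite the name, this function does not write to disk; the caller
    persists the returned dict.
    """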
    complete_report = {
        "station_id": station_id,
        "validated_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(df),
        "column_count": len(df.columns),
        
        # Contracts (from validation_report)
        "contracts": {
            "passed": validation_report['passed'],
            "failures": validation_report['failures'],
            "warnings": validation_report['warnings']
        },
        
        # Anomalies (from anomaly_summary)
        "anomalies": {
            "value_anomalies": anomaly_summary['value_anomalies'],
            "missingness_rates": anomaly_summary['missingness_rates'],
            "suspicious_row_count": anomaly_summary['suspicious_row_count']
        },
        
        # Canaries (from canary_summary)
        "canaries": {
            "day_count": canary_summary['day_count'],
            "obs_per_day": canary_summary['obs_per_day'],
            "drops": canary_summary['drops'],
            "overall_missingness": canary_summary['overall_missingness'],
            "worst_day_missingness": canary_summary['worst_day_missingness'],
            "high_missingness_days": canary_summary['high_missingness_days']
        },
        
        # Leakage checklist
        "leakage_checklist": leakage_checklist
    }
    
    return complete_report

print("Validation report function ready")

Validation report function ready


In [60]:
# Create complete validation report
validation_report_full = write_validation_report(
    df_curated,
    validation_report,
    anomaly_summary,
    canary_summary,
    leakage_checklist,
    STATION_ID
)

# Write to reference directory
validation_path = REF_DIR / f"validation_report_run_{RUN_ID}.json"
write_json(validation_path, validation_report_full)
print(f"Validation report written to: {validation_path}")

Validation report written to: ../work/HW2_NOAA/data/reference/validation_report_run_20260215_220552_utc.json
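
For a pipeline that runs every day, the saved report can also act as a gate for downstream steps. The sketch below is one possible approach, not part of the assignment: `load_report` and `gate_on_report` are hypothetical helpers, and it assumes `failures` and `drops` are lists, as the report structure above suggests.

```python
import json
import tempfile
from pathlib import Path


def load_report(path: Path) -> dict:
    """Read a validation report previously saved as JSON."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def gate_on_report(report: dict) -> tuple:
    """Return (ok, reasons); ok is False if any blocking condition holds."""
    reasons = []
    contracts = report["contracts"]
    if not contracts["passed"]:
        failures = contracts["failures"] or ["(no failure details recorded)"]
        reasons.extend(f"contract failure: {item}" for item in failures)
    drops = report["canaries"]["drops"]
    if drops:
        reasons.append(f"row-count drops on {len(drops)} day(s)")
    return (len(reasons) == 0, reasons)


# Hypothetical report with one contract failure and no row-count drops.
demo_report = {
    "station_id": "44013",
    "contracts": {
        "passed": False,
        "failures": ["WSPD column missing"],
        "warnings": [],
    },
    "canaries": {"drops": []},
}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "validation_report_demo.json"
    demo_path.write_text(json.dumps(demo_report), encoding="utf-8")
    ok, reasons = gate_on_report(load_report(demo_path))

print(ok, reasons)
```

Warnings are deliberately left out of the gate, so a known condition such as high WVHT missingness is reported without blocking the run.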


✅ **Exercise 6.1 — run log**


In [61]:
def create_run_log(
    run_id: str,
    raw_data_path: Path,
    raw_meta_path: Path,
    staged_data_path: Path,
    curated_data_path: Path,
    validation_path: Path,
    query_params: dict = None
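) -> dict:
    """Build a run-log dict describing one pipeline execution.

    Optional artifacts (raw metadata, staged data) are recorded as None when
    the file does not exist. `partition_root` is derived from the global WH_DIR.
    """
    # Create query fingerprint (hash of the query parameters; MD5 serves only
    # as a short identifier here, not as a security measure)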
    if query_params:
        query_str = json.dumps(query_params, sort_keys=True)
        query_fingerprint = hashlib.md5(query_str.encode()).hexdigest()[:16]
    else:
        query_fingerprint = "no_query"
    
    run_log = {
        "run_id": run_id,
        "generated_at_utc": datetime.now(timezone.utc).isoformat(),
        
        "inputs": {
            "query_fingerprint": query_fingerprint,
            "raw_txt_path": str(raw_data_path.absolute()),
            "raw_size_bytes": raw_data_path.stat().st_size if raw_data_path.exists() else 0,
            "raw_meta_path": str(raw_meta_path.absolute()) if raw_meta_path.exists() else None
        },
        
        "outputs": {
            "staged_path": str(staged_data_path.absolute()) if staged_data_path.exists() else None,
            "curated_path": str(curated_data_path.absolute()),
            "partition_root": str(WH_DIR / "partitions"),
            "validation_report_path": str(validation_path.absolute())
        },
        
        "row_definition": "Each row is one buoy observation at time_utc for a given station_id.",
        
        "notes": [
            "Sensor values may be missing due to instrument downtime or transmission errors.",
            "Treat validations and anomaly flags as guardrails; investigate upstream before changing thresholds."
        ]
    }
    
    return run_log

print("Run log function ready")

Run log function ready
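
The fingerprint logic inside `create_run_log` can be checked in isolation. The sketch below repeats the same `json.dumps(..., sort_keys=True)` plus MD5 approach in a standalone helper; the name `query_fingerprint` and the parameter names `station_id` and `days` are hypothetical and chosen only for illustration.

```python
import hashlib
import json


def query_fingerprint(params: dict) -> str:
    """Hash query parameters in a key-order-independent way."""
    # sort_keys=True gives a canonical string regardless of insertion order.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode()).hexdigest()[:16]


same_a = query_fingerprint({"station_id": "44013", "days": 45})
same_b = query_fingerprint({"days": 45, "station_id": "44013"})
different = query_fingerprint({"station_id": "44013", "days": 30})

print(same_a == same_b)     # same parameters, different key order
print(same_a == different)  # different parameter value
```

If a stronger hash is preferred, `hashlib.sha256` can be swapped in without changing the rest of the function.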


In [62]:
# Create run log
run_log = create_run_log(
    run_id=RUN_ID,
    raw_data_path=raw_data_path,
    raw_meta_path=RAW_DIR / f"ndbc_meta_{STATION_ID}_{RUN_ID}.json",  # Adjust if you created this earlier
    staged_data_path=STAGED_DIR / f"{STATION_ID}_staged.parquet",
    curated_data_path=curated_data_path,
    validation_path=validation_path
)

# Save to pipeline_runs directory
run_log_path = RUN_DIR / f"{RUN_ID}.json"
write_json(run_log_path, run_log)

print(f"Saved: {run_log_path}\n")
display(run_log)

Saved: ../work/HW2_NOAA/data/reference/pipeline_runs/20260215_220552_utc.json



{'run_id': '20260215_220552_utc',
 'generated_at_utc': '2026-02-15T22:07:06.697783+00:00',
 'inputs': {'query_fingerprint': 'no_query',
  'raw_txt_path': '/home/glake/Nextcloud/Classwork/CS6678 - Advanced Machine Learning/Homework/HW2/../work/HW2_NOAA/data/raw/nbdc_44013_run_20260215_220552_utc.txt',
  'raw_size_bytes': 617110,
  'raw_meta_path': None},
 'outputs': {'staged_path': None,
  'curated_path': '/home/glake/Nextcloud/Classwork/CS6678 - Advanced Machine Learning/Homework/HW2/../work/HW2_NOAA/data/warehouse/ndbc_curated_44013.parquet',
  'partition_root': '../work/HW2_NOAA/data/warehouse/partitions',
  'validation_report_path': '/home/glake/Nextcloud/Classwork/CS6678 - Advanced Machine Learning/Homework/HW2/../work/HW2_NOAA/data/reference/validation_report_run_20260215_220552_utc.json'},
 'row_definition': 'Each row is one buoy observation at time_utc for a given station_id.',
 'notes': ['Sensor values may be missing due to instrument downtime or transmission errors.',
  'Treat validations and anomaly flags as guardrails; investigate upstream before changing thresholds.']}
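
The run log above records `raw_meta_path` and `staged_path` as `None` and the fingerprint as `no_query`, which limits how much of the run can be traced later. A small check along these lines could surface such gaps; `missing_artifacts` is a hypothetical helper and the example log uses made-up paths.

```python
def missing_artifacts(run_log: dict) -> list:
    """Return dotted names of run-log entries that were not recorded."""
    missing = []
    for section in ("inputs", "outputs"):
        for key, value in run_log.get(section, {}).items():
            if key.endswith("_path") and value is None:
                missing.append(f"{section}.{key}")
    # "no_query" is the placeholder used when no query parameters were passed.
    if run_log.get("inputs", {}).get("query_fingerprint") == "no_query":
        missing.append("inputs.query_fingerprint")
    return missing


# Made-up run log with the same shape as the one displayed above.
example_log = {
    "inputs": {
        "query_fingerprint": "no_query",
        "raw_txt_path": "/data/raw/example.txt",
        "raw_meta_path": None,
    },
    "outputs": {
        "staged_path": None,
        "curated_path": "/data/warehouse/example.parquet",
    },
}

gaps = missing_artifacts(example_log)
print(gaps)
```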

## 7) Self-check + reflection


In [63]:
reflection = [
    "Row definition: Each row is one buoy observation at time_utc for station 44013, with sensors measuring wind, waves, pressure, and temperature.",
    
    "Required sensors: WSPD, PRES, ATMP, WTMP must exist and have <99% missing data to pass validation.",
    
    "Range checks: WSPD [0-50 m/s], PRES [900-1100 hPa], ATMP [-50 to 50°C], WTMP [-5 to 40°C], WVHT [0-20 m], GST [0-60 m/s], WDIR [0-360°].",
    
    "Biggest anomaly: WVHT has ~60% missing data across all days (59.8% overall), with 46/46 days exceeding 30% missingness threshold. This is expected for this buoy station and triggers warnings but not failures.",
    
    "Likely breakage scenario: If NDBC changes file format (e.g., removes units row, changes column names, or switches from space-delimited to comma-delimited), the read_ndbc_txt() function would fail. Check: Validate header structure before parsing.",
    
    "Another breakage scenario: If a sensor starts reporting values outside plausible ranges due to calibration drift or malfunction (e.g., ATMP = 100°C), range checks would catch it. The validation fails if >5% of values are out of range.",
    
    "Temporal leakage risk: If building time-series forecasting models, ensure rolling features use only past data, train/test splits are chronological, and no future observations leak into imputation or aggregation.",
    
    "Data quality insight: Observation counts were stable across all 46 days (min 130, median 143, max 144 per day), and the drop canary detected no days with unexpectedly few rows.",
    
    "Pipeline reproducibility: Run logs track inputs, outputs, and query fingerprints, allowing exact reconstruction of any pipeline execution from raw data to curated output."
]

print("="*80)
print("SELF-CHECK AND REFLECTION")
print("="*80)
for i, item in enumerate(reflection, 1):
    print(f"\n{i}. {item}")
print("\n" + "="*80)

SELF-CHECK AND REFLECTION

1. Row definition: Each row is one buoy observation at time_utc for station 44013, with sensors measuring wind, waves, pressure, and temperature.

2. Required sensors: WSPD, PRES, ATMP, WTMP must exist and have <99% missing data to pass validation.

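3. Range checks: WSPD [0-50 m/s], PRES [900-1100 hPa], ATMP [-50 to 50°C], WTMP [-5 to 40°C], WVHT [0-20 m], GST [0-60 m/s], WDIR [0-360°].

4. Biggest anomaly: WVHT has ~60% missing data across all days (59.8% overall), with 46/46 days exceeding 30% missingness threshold. This is expected for this buoy station and triggers warnings but not failures.
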
5. Likely breakage scenario: If NDBC changes file format (e.g., removes units row, changes column names, or switches from space-delimited to comma-delimited), the read_ndbc_txt() function would fail. Check: Validate header structure before parsing.

6. Another breakage scenario: If a sensor starts reporting values outside plausible ranges due to calibration drift or malfunction (e.g., ATMP = 100°C), range checks would catch it. The validation fails if >5% of values are out of range.

7. Temporal leakage risk: If building time-series forecasting models, ensure rolling feature