# Homework 2: NOAA Buoy Data Pipeline (NDBC)

You will build a repeatable pipeline that ingests **real sensor data**, handles missingness + sentinel values,
creates an analysis-ready table, and writes a validation report you could run every day.

## Dataset Overview: NOAA NDBC Buoy Observations

This project uses data from the **NOAA National Data Buoy Center (NDBC)**, which operates a network of ocean buoys and coastal stations that continuously measure atmospheric and oceanographic conditions.

Each buoy is a **physical sensor platform** deployed at a fixed geographic location. It records observations at regular time intervals (often hourly), transmitting them to NOAA for operational use in weather forecasting, marine safety, and climate research.

---

### What a single row represents

In this dataset:

> **Each row represents one sensor observation at a specific buoy at a specific UTC timestamp.**

This makes the data:
- **Time series** (ordered in time)
- **Stateful** (conditions at a moment, not events)
- **Naturally indexed by time**

There is no “target variable” yet — this dataset is about *measurement*, not decisions.

---

### Core variables (typical)

Not every buoy reports every variable, but common fields include:

- **Wind**
  - `WDIR` — wind direction (degrees)
  - `WSPD` — wind speed (m/s)
  - `GST` — wind gust (m/s)
- **Waves**
  - `WVHT` — significant wave height (m)
  - `DPD` — dominant wave period (s)
  - `APD` — average wave period (s)
- **Atmosphere**
  - `PRES` — sea-level pressure (hPa)
  - `ATMP` — air temperature (°C)
- **Ocean**
  - `WTMP` — water temperature (°C)

Missing values are common and usually reflect **sensor downtime**, **transmission issues**, or **environmental constraints**, not data entry errors.

---

### Why this dataset is realistic (and messy)

This is **real operational sensor data**, not a curated research dataset. As a result:

- Missing values are encoded as sentinels (e.g. `MM`)
- Sensors may fail temporarily or permanently
- Some variables appear or disappear over time
- Units and ranges must be interpreted using domain knowledge
- The dataset contains a rolling time window, not full history

These characteristics make the dataset ideal for practicing **data ingestion, validation, and pipeline design**.

---

### Important constraint: rolling history

The data used here comes from NOAA’s `realtime2` endpoint, which provides a **rolling window of recent observations** (typically ~30–45 days).

This means:
- You are *not* requesting a specific date range
- Older observations roll off the file as new ones arrive
- Historical backfills require a different NOAA endpoint

This is intentional and mirrors how real production systems separate **realtime feeds** from **historical archives**.

---

The goal is not to “clean it perfectly,” but to **build trust in the parts you use** — and to document the assumptions you make along the way.


## What you will produce

Artifacts (under a project folder):

- `data/raw/` — raw station snapshot(s) + metadata (station id, URL, timestamp)
- `data/staged/` — parsed/normalized table (typed, missingness normalized)
- `data/warehouse/` — curated table (Parquet; optionally partitioned by day)
- `data/reference/validation_report.json` — contracts + anomaly rates + canaries
- `data/reference/pipeline_runs/` — run logs for reproducibility

> Principle: In sensor data, “cleaning” is mostly about **assumptions** (units, ranges, sentinel values)
> and **guardrails** (contracts + anomaly flags), not deleting rows.


## 0) Setup

Create a project folder somewhere, for example at:

`~/work/homework_2_noaa/`

Create the five directories listed above.


In [1]:
from __future__ import annotations

from pathlib import Path
from datetime import datetime, timezone
import json
import hashlib

import numpy as np
import pandas as pd



Project: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa
Raw: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/raw
Staged: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/staged
Warehouse: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/warehouse
Reference: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/reference
Runs: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/reference/pipeline_runs
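A minimal sketch of one way to create the layout printed above. The function name `make_layout` and the dictionary keys are illustrative choices, not part of the assignment; the project path is only the example location suggested earlier.

```python
from pathlib import Path

# Assumption: the example location from the setup text; change as needed.
PROJECT = Path.home() / "work" / "homework_2_noaa"


def make_layout(project: Path) -> dict:
    """Create the five pipeline directories and return them by short name."""
    data = project / "data"
    dirs = {
        "raw": data / "raw",
        "staged": data / "staged",
        "warehouse": data / "warehouse",
        "reference": data / "reference",
        "runs": data / "reference" / "pipeline_runs",
    }
    for path in dirs.values():
        # parents=True builds missing parents; exist_ok=True makes re-runs safe.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling `make_layout(PROJECT)` once at the top of the notebook keeps every later cell pointing at the same folders.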


### Helper utilities

Create helper utilities.

Helpers ready.
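The helpers are left to you. As a hedged starting point, the sketch below shows three small utilities that fit the outputs seen later in this notebook: a run id shaped like `run_20260125_061431_utc`, a short hash, and a JSON writer. The names `make_run_id`, `fingerprint`, and `write_json` are hypothetical, and the exact recipe behind the `query_fingerprint` value in the run log is not shown in this notebook, so the hashing here is an assumption.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def make_run_id(now: Optional[datetime] = None) -> str:
    """Return an id like run_20260125_061431_utc built from a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("run_%Y%m%d_%H%M%S_utc")


def fingerprint(text: str, n_chars: int = 16) -> str:
    """Short, stable SHA-256 prefix for tagging an input such as a URL."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n_chars]


def write_json(path: Path, obj: dict) -> Path:
    """Write a dict as indented UTF-8 JSON, creating parent folders first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str lets datetimes and Paths serialize without custom encoders.
    path.write_text(json.dumps(obj, indent=2, default=str), encoding="utf-8")
    return path
```

Generating the run id once per run and reusing it in every file name is what ties the raw, staged, and log artifacts together.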


## 1) Ingest: download latest buoy observations

NOAA NDBC provides station “realtime2” text files, e.g.

- `https://www.ndbc.noaa.gov/data/realtime2/41002.txt`

These files are human-readable but still messy:
- header lines starting with `#`
- missing values as sentinel strings like `MM`
- sometimes extra columns depending on station/sensor

**Station 44013 — Boston 16 NM East of Boston, MA** 

We will use data from buoy station `44013` in this homework for these reasons:
- Location in the North Atlantic, a region with rich weather variability (storms, seasonal shifts).
- Long historical coverage (data available back into the 1980s/1990s).
- Strong mix of variables: wind, wave height, pressure, temperatures, etc.
- Very useful for seasonality, trend analysis, anomaly detection, and combining meteorological + oceanographic features.
- This station is especially popular for regional marine research and forecasting, so its data patterns can be both interesting and instructive for data science exercises.

Look at the station information [here](https://www.ndbc.noaa.gov/station_page.php?station=44013).


✅ **Exercise 1.1 — fetch raw data and write snapshot**

Download the file, then write:

- raw text: `data/raw/ndbc_<station>_<runid>.txt`
- raw metadata: `data/raw/ndbc_meta_<station>_<runid>.json`

**Hint:** use `requests.get(url).text` and save as UTF-8.
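One possible shape for this exercise, offered as a sketch rather than the reference solution. The function names and the metadata keys (`fetched_at_utc`, `n_lines`, and so on) are assumptions; only the URL pattern and the two file-name patterns come from the text above. The download is kept separate from the write so the write can be exercised without network access.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def realtime2_url(station_id: str) -> str:
    """Build the realtime2 URL for a station id such as '44013'."""
    return f"https://www.ndbc.noaa.gov/data/realtime2/{station_id}.txt"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a text file and raise on HTTP errors."""
    import requests  # imported here so write_snapshot stays usable offline

    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.text


def write_snapshot(text: str, station_id: str, run_id: str, url: str, raw_dir: Path):
    """Write the raw text and a small metadata JSON; return both paths."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    txt_path = raw_dir / f"ndbc_{station_id}_{run_id}.txt"
    meta_path = raw_dir / f"ndbc_meta_{station_id}_{run_id}.json"
    txt_path.write_text(text, encoding="utf-8")
    meta = {
        "station_id": station_id,
        "url": url,
        "run_id": run_id,
        "fetched_at_utc": datetime.now(timezone.utc).isoformat(),
        "n_lines": len(text.splitlines()),
    }
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return txt_path, meta_path
```

A typical call would be `write_snapshot(fetch_text(url), STATION_ID, run_id, url, raw_dir)`, where `run_id` and `raw_dir` come from your own setup cells.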


In [3]:
import requests

STATION_ID = "44013"  



Station: 44013
URL: https://www.ndbc.noaa.gov/data/realtime2/44013.txt
run_id: run_20260125_061431_utc
Wrote: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/raw/ndbc_44013_run_20260125_061431_utc.txt
Wrote: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/raw/ndbc_meta_44013_run_20260125_061431_utc.json
First 15 lines:

#YY  MM DD hh mm WDIR WSPD GST  WVHT   DPD   APD MWD   PRES  ATMP  WTMP  DEWP  VIS PTDY  TIDE
#yr  mo dy hr mn degT m/s  m/s     m   sec   sec degT   hPa  degC  degC  degC  nmi  hPa    ft
2026 01 25 05 40 330  9.0 12.0    MM    MM    MM  MM 1037.3 -10.4   4.9 -15.6   MM   MM    MM
2026 01 25 05 30 330 10.0 12.0    MM    MM    MM  MM 1037.3 -10.4   4.9 -15.7   MM   MM    MM
2026 01 25 05 20 330  9.0 12.0   1.2     5   3.8 306 1037.2 -10.2   4.9 -15.6   MM   MM    MM
2026 01 25 05 10 330  9.0 12.0    MM    MM    MM  MM 1037.4 -10.1   4.9 -15.5   MM   MM    MM
2026 01 25 05 00 320  9.0 12.0    MM    MM    MM  MM 1037.4 -10.2   4.8 -16.0   MM +0.0    MM
20

## 2) Stage: parse + normalize missingness + types

NDBC realtime2 files have:
- two header lines starting with `#`: column names first, then units
- data rows with whitespace-separated values

Typical columns include:
- `YY MM DD hh mm` (timestamp components, UTC)
- `WDIR` wind direction (deg)
- `WSPD` wind speed (m/s)
- `GST` gust (m/s)
- `WVHT` wave height (m)
- `DPD` dominant period (s)
- `APD` average period (s)
- `PRES` pressure (hPa)
- `ATMP` air temp (C)
- `WTMP` water temp (C)

But not every station has every column.

✅ **Exercise 2.1 — parse the raw file into a DataFrame**

Implement the function `read_ndbc_txt(path)` that:
- reads the file
- returns a DataFrame

**Hints:**
- Files start with a commented header line (column names), usually followed by a commented units line.
- A robust approach:
  - Take the first line starting with `#` as the header and strip the `#`
  - Skip any further `#` lines (such as the units line)
  - Parse the remaining lines with `sep=r"\s+"` (`delim_whitespace=True` is deprecated in pandas 2.2+)
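A hedged sketch of `read_ndbc_txt`, written against the sample file shown in Exercise 1.1, where both the column-name line and the units line begin with `#`. Reading every column as a string is a design choice made here, not a requirement: it keeps the `MM` sentinel visible until the staging step, whereas the preview below was produced with pandas' default type inference.

```python
from io import StringIO
from pathlib import Path

import pandas as pd


def read_ndbc_txt(path) -> pd.DataFrame:
    """Parse an NDBC realtime2 text file into an all-string DataFrame."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    comment_lines = [ln for ln in lines if ln.startswith("#")]
    data_lines = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    if not comment_lines:
        raise ValueError("expected a '#'-prefixed header line")

    # The first comment line names the columns; later ones (units) are ignored.
    columns = comment_lines[0].lstrip("#").split()

    return pd.read_csv(
        StringIO("\n".join(data_lines)),
        sep=r"\s+",             # any run of whitespace separates fields
        header=None,
        names=columns,
        dtype=str,              # keep 'MM' sentinels as text for now
        keep_default_na=False,  # do not auto-convert anything to NaN yet
    )
```

Note that the month column is named `MM`, the same string as the missing-value sentinel, so it is safer to convert sentinels column by column later than to pass `na_values="MM"` without checking what it touches.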


In [4]:
from io import StringIO



Raw parsed shape: (6470, 19)


Unnamed: 0,YY,MM,DD,hh,mm,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE
0,2026,1,25,5,40,330,9.0,12.0,MM,MM,MM,MM,1037.3,-10.4,4.9,-15.6,MM,MM,MM
1,2026,1,25,5,30,330,10.0,12.0,MM,MM,MM,MM,1037.3,-10.4,4.9,-15.7,MM,MM,MM
2,2026,1,25,5,20,330,9.0,12.0,1.2,5,3.8,306,1037.2,-10.2,4.9,-15.6,MM,MM,MM
3,2026,1,25,5,10,330,9.0,12.0,MM,MM,MM,MM,1037.4,-10.1,4.9,-15.5,MM,MM,MM
4,2026,1,25,5,0,320,9.0,12.0,MM,MM,MM,MM,1037.4,-10.2,4.8,-16.0,MM,+0.0,MM


✅ **Exercise 2.2 — construct a UTC timestamp + normalize missing values**

Create a staged table that includes:

- `station_id`
- `time_utc` as timezone-aware datetime
- numeric sensor fields coerced to numeric
- missing values: `MM` becomes NaN (via numeric coercion)

**Hints:**
- Timestamp columns are typically `YY MM DD hh mm` (in the sample above, `YY` holds a four-digit year); some files label the year `YYYY` instead.
- The month column is usually `MM` and minute is `mm` (case matters).
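As an illustration of the two steps, the sketch below assembles `time_utc` from the five component columns and coerces every remaining column to numeric, which turns `MM` into NaN. It assumes the year column is named `YY` and already holds four-digit years, as in the sample file; the function name `stage` is hypothetical.

```python
import pandas as pd

# Assumption: these are the time-component column names in the parsed table.
TIME_COLS = ["YY", "MM", "DD", "hh", "mm"]


def stage(raw: pd.DataFrame, station_id: str) -> pd.DataFrame:
    """Return a typed table with station_id, time_utc, and numeric sensors."""
    # pandas can assemble datetimes from columns named year/month/day/...
    parts = raw[TIME_COLS].apply(pd.to_numeric, errors="coerce")
    parts.columns = ["year", "month", "day", "hour", "minute"]
    time_utc = pd.to_datetime(parts).dt.tz_localize("UTC")

    # Non-numeric strings such as the 'MM' sentinel become NaN here.
    staged = raw.drop(columns=TIME_COLS).apply(pd.to_numeric, errors="coerce")

    staged.insert(0, "time_utc", time_utc)
    staged.insert(0, "station_id", station_id)
    return staged
```

Because the time columns are dropped before coercion, the month column named `MM` is never confused with the `MM` sentinel in the sensor columns.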


Staged shape: (6470, 16)


Unnamed: 0,station_id,time_utc,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE
0,44013,2026-01-25 05:40:00+00:00,330.0,9.0,12.0,,,,,1037.3,-10.4,4.9,-15.6,,,
1,44013,2026-01-25 05:30:00+00:00,330.0,10.0,12.0,,,,,1037.3,-10.4,4.9,-15.7,,,
2,44013,2026-01-25 05:20:00+00:00,330.0,9.0,12.0,1.2,5.0,3.8,306.0,1037.2,-10.2,4.9,-15.6,,,
3,44013,2026-01-25 05:10:00+00:00,330.0,9.0,12.0,,,,,1037.4,-10.1,4.9,-15.5,,,
4,44013,2026-01-25 05:00:00+00:00,320.0,9.0,12.0,,,,,1037.4,-10.2,4.8,-16.0,,0.0,


✅ **Exercise 2.3 — write staged outputs**


Wrote: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/staged/ndbc_staged_44013_run_20260125_061431_utc.parquet
Wrote: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/staged/ndbc_staged_meta_44013_run_20260125_061431_utc.json


## 3) Curate: analysis-ready features

✅ **Exercise 3.1 — time features + flags**


Unnamed: 0,station_id,time_utc,WDIR,WSPD,GST,WVHT,DPD,APD,MWD,PRES,ATMP,WTMP,DEWP,VIS,PTDY,TIDE,obs_day,obs_hour,dayofweek,is_weekend,wind_high,temp_gap_c
0,44013,2026-01-25 05:40:00+00:00,330.0,9.0,12.0,,,,,1037.3,-10.4,4.9,-15.6,,,,2026-01-25,5,6,1,0,-15.3
1,44013,2026-01-25 05:30:00+00:00,330.0,10.0,12.0,,,,,1037.3,-10.4,4.9,-15.7,,,,2026-01-25,5,6,1,0,-15.3
2,44013,2026-01-25 05:20:00+00:00,330.0,9.0,12.0,1.2,5.0,3.8,306.0,1037.2,-10.2,4.9,-15.6,,,,2026-01-25,5,6,1,0,-15.1
3,44013,2026-01-25 05:10:00+00:00,330.0,9.0,12.0,,,,,1037.4,-10.1,4.9,-15.5,,,,2026-01-25,5,6,1,0,-15.0
4,44013,2026-01-25 05:00:00+00:00,320.0,9.0,12.0,,,,,1037.4,-10.2,4.8,-16.0,,0.0,,2026-01-25,5,6,1,0,-15.0
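The preview above can be reproduced with a few vectorized column operations. The sketch below is one way to do it; `temp_gap_c` is taken to be `ATMP - WTMP`, which matches the values shown (for example -10.4 - 4.9 = -15.3), while the `wind_high` threshold is not stated anywhere in this notebook, so `WIND_HIGH_MS` is purely an assumption.

```python
import pandas as pd

# Assumption: illustrative threshold in m/s; pick and document your own.
WIND_HIGH_MS = 15.0


def add_features(staged: pd.DataFrame) -> pd.DataFrame:
    """Add calendar features and simple flags to a staged table."""
    out = staged.copy()
    t = out["time_utc"]
    out["obs_day"] = t.dt.date
    out["obs_hour"] = t.dt.hour
    out["dayofweek"] = t.dt.dayofweek  # Monday=0 ... Sunday=6
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    # NaN wind speed compares as False, so missing readings are flagged 0.
    out["wind_high"] = (out["WSPD"] >= WIND_HIGH_MS).astype(int)
    out["temp_gap_c"] = out["ATMP"] - out["WTMP"]
    return out
```

All calendar features here are in UTC; if local time matters for your analysis, convert `time_utc` before deriving them and say so in your assumptions.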


✅ **Exercise 3.2 — choose curated columns and write Parquet**


Wrote: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/warehouse/ndbc_curated_44013.parquet | rows: 6470 | cols: 17
Partitioned days: 46


## 4) Validate: contracts + anomalies + canaries

✅ **Exercise 4.1 — required columns + plausible ranges**


In [9]:
# Note: WVHT exists as a column but is often missing for some stations. We'll monitor it, not fail the run.


Required cols (contract): ['station_id', 'time_utc', 'WSPD', 'PRES', 'ATMP', 'WTMP']
Optional cols (monitor): ['WVHT', 'GST', 'WDIR']
Range checks will apply to: ['WSPD', 'GST', 'WVHT', 'PRES', 'ATMP', 'WTMP', 'WDIR']

Note: WVHT exists as a column but is often missing for some stations. We'll monitor it, not fail the run.


✅ **Exercise 4.2 — implement validation checks**


Validation passed? True

- missing_rate_optional:WVHT {'missing_rate': 0.5979907264296754, 'warn_if_gt': 0.3}
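A sketch of how the contract and checks could fit together. The required and optional column lists and the 0.3 warning threshold mirror the output above; the numeric bounds in `RANGES` are illustrative plausibility limits chosen for this example, not official NDBC limits, and the result dictionary layout is an assumption.

```python
import pandas as pd

REQUIRED = ["station_id", "time_utc", "WSPD", "PRES", "ATMP", "WTMP"]
OPTIONAL = ["WVHT", "GST", "WDIR"]
# Assumption: rough physical plausibility bounds (units as in the file header).
RANGES = {
    "WSPD": (0, 75), "GST": (0, 100), "WVHT": (0, 30),
    "PRES": (850, 1100), "ATMP": (-50, 60), "WTMP": (-5, 45), "WDIR": (0, 360),
}
WARN_MISSING_GT = 0.3


def validate(df: pd.DataFrame) -> dict:
    """Fail on contract violations; warn on high optional missingness."""
    failures, warnings = [], []

    for col in REQUIRED:
        if col not in df.columns:
            failures.append(f"missing_required:{col}")

    for col, (lo, hi) in RANGES.items():
        if col in df.columns:
            vals = df[col].dropna()  # missing values are handled separately
            n_bad = int(((vals < lo) | (vals > hi)).sum())
            if n_bad:
                failures.append(f"out_of_range:{col}:{n_bad}")

    for col in OPTIONAL:
        if col in df.columns:
            rate = float(df[col].isna().mean())
            if rate > WARN_MISSING_GT:
                warnings.append({"check": f"missing_rate_optional:{col}",
                                 "missing_rate": rate,
                                 "warn_if_gt": WARN_MISSING_GT})

    return {"passed": not failures, "failures": failures, "warnings": warnings}
```

Separating failures from warnings is what lets a sparse optional sensor such as `WVHT` be monitored without failing the run.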


✅ **Exercise 4.3 — anomaly flags + investigation table**


=== Row-level anomaly summary ===
Rows: 6470

Value anomaly counts (top):
- None triggered by current thresholds.

Missingness rates (informational):
- miss_wvht: 59.799%
- miss_wtmp: 1.206%
- miss_atmp: 0.155%
- miss_wspd: 0.093%
- miss_pres: 0.046%

Suspicious rows (value anomalies only): 0


✅ **Exercise 4.4 — canaries + spike/drop detection**


=== Canary summary ===
Days: 46
Obs/day (min/median/max): 35 / 143.0 / 144

Drops (unexpectedly few rows):
- {'obs_day': datetime.date(2026, 1, 25), 'n_obs': 35}

Overall missingness (fraction of rows missing):
- WSPD: 0.001
- WVHT: 0.598
- PRES: 0.000
- ATMP: 0.002
- WTMP: 0.012

Worst day missingness per column:
- WSPD: 0.014 on 2026-01-23
- WVHT: 0.657 on 2026-01-25
- PRES: 0.014 on 2026-01-23
- ATMP: 0.014 on 2026-01-23
- WTMP: 0.035 on 2026-01-05

Days with missingness > 0.30:
- WVHT: 46 days (showing up to 5)
    {'obs_day': datetime.date(2026, 1, 25), 'missing_rate': 0.6571428571428571}
    {'obs_day': datetime.date(2025, 12, 28), 'missing_rate': 0.6503496503496503}
    {'obs_day': datetime.date(2025, 12, 14), 'missing_rate': 0.6363636363636364}
    {'obs_day': datetime.date(2026, 1, 5), 'missing_rate': 0.6363636363636364}
    {'obs_day': datetime.date(2025, 12, 13), 'missing_rate': 0.6363636363636364}
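The canary summary above boils down to two grouped aggregations: rows per day and missing rate per day. The sketch below shows one way to compute them. The drop rule (fewer than half the median daily count) is an assumption chosen for illustration; under it, a partial most-recent day such as the 35-row day above would be flagged. The function name and return layout are hypothetical.

```python
import pandas as pd


def daily_canaries(df: pd.DataFrame, cols, drop_frac: float = 0.5,
                   miss_warn: float = 0.3) -> dict:
    """Summarize observations per day and per-day missingness for cols."""
    per_day = df.groupby("obs_day").size()
    median = float(per_day.median())

    # Assumption: a "drop" is any day below drop_frac of the median count.
    drops = per_day[per_day < drop_frac * median]

    # Fraction of rows missing, per day and per column.
    miss_by_day = df[list(cols)].isna().groupby(df["obs_day"]).mean()

    return {
        "n_days": int(per_day.shape[0]),
        "obs_per_day": {"min": int(per_day.min()), "median": median,
                        "max": int(per_day.max())},
        "drops": [{"obs_day": day, "n_obs": int(n)} for day, n in drops.items()],
        "overall_missing": {c: float(df[c].isna().mean()) for c in cols},
        "high_missing_days": {c: int((miss_by_day[c] > miss_warn).sum())
                              for c in cols},
    }
```

Because the realtime2 file is a rolling window, the first and last days are often partial, so a drop on those days is expected and worth distinguishing from a mid-window outage.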


## 5) Leakage audit (conceptual)

✅ **Exercise 5.1 — write a leakage checklist**


['If building rolling features, are they computed using only past data relative to prediction time?',
 'If you standardize/normalize sensors, are stats computed on TRAIN only?',
 'If you impute missing values, does the method avoid using future observations?',
 'Are you aggregating by day in a way that would be unavailable at prediction time?',
 'Is prediction time defined (e.g., predict next-hour wind speed using prior hours only)?']
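To make the first checklist item concrete, here is a small sketch of a rolling feature that uses only past observations. The key moves are sorting by time (the raw file is newest-first) and shifting by one row before rolling, so the current row's own value never enters its feature. The function and output column names are hypothetical, and the window is counted in rows, not in clock time.

```python
import pandas as pd


def add_past_only_rolling_mean(df: pd.DataFrame, col: str, window: int) -> pd.DataFrame:
    """Add a rolling mean of col computed from strictly earlier rows."""
    out = df.sort_values("time_utc").copy()  # oldest first
    out[f"{col}_past_mean_{window}"] = (
        out[col]
        .shift(1)                        # exclude the current observation
        .rolling(window, min_periods=1)  # then average up to `window` prior rows
        .mean()
    )
    return out
```

For example, with wind speeds 1, 2, 3, 4 in time order and `window=2`, the feature is NaN, 1.0, 1.5, 2.5: each value depends only on rows that came before it.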

## 6) Write validation report + run log


Saved: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/reference/validation_report.json


✅ **Exercise 6.1 — run log**


Saved: /home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/reference/pipeline_runs/run_20260125_061431_utc.json


{'run_id': 'run_20260125_061431_utc',
 'generated_at_utc': '2026-01-25T06:14:32.587786+00:00',
 'inputs': {'query_fingerprint': '3e4f1ba7b7be289b',
  'raw_txt_path': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/raw/ndbc_44013_run_20260125_061431_utc.txt',
  'raw_size_bytes': 608370,
  'raw_meta_path': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/raw/ndbc_meta_44013_run_20260125_061431_utc.json'},
 'outputs': {'staged_path': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/staged/ndbc_staged_44013_run_20260125_061431_utc.parquet',
  'curated_path': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/warehouse/ndbc_curated_44013.parquet',
  'partition_root': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/warehouse/partitions',
  'validation_report_path': '/home/jupyter-kerblesl@isu.edu/work/homework_2_noaa/data/reference/validation_report.json'},
 'row_definition': 'Each row is one buoy observation at time_utc for a given station_id.',


## 7) Self-check + reflection


In [16]:
reflection = [
    "You fill this out",  # "Row definition: ...",
    # "Required sensors: ...",
    # "Range checks: ...",
    # "Biggest anomaly: ...",
    # "Likely breakage + check: ...",
]
reflection


['You fill this out']