# 00 — Load NHANES data (starter notebook)

This notebook demonstrates how to load a single NHANES file (either **XPT** or **CSV**) into pandas, do a few quick sanity checks, and save a lightweight copy for downstream analysis.

**How to use**
1. Put your raw NHANES file in `data/raw/` (e.g., `data/raw/DEMO_G.XPT`).
2. Set `DATA_FILE` below to the filename you want to load.
3. Run the cells.
4. Feather and Parquet copies will be written to `data/processed/`.

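> Works offline; no internet connection is required. Reading `.XPT` needs no extra packages (pandas reads it via `read_sas(..., format="xport")`), but writing Feather requires `pyarrow`, and writing Parquet requires `pyarrow` or `fastparquet`.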

In [None]:
# --- Imports & display options
from pathlib import Path
import pandas as pd

pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 120)

In [None]:
# --- Project paths
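# Assumes the notebook is started from the repository root (Path.cwd()).
# Adjust ROOT or the folder names below if your repository is laid out differently.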
ROOT = Path.cwd().resolve()
DATA_RAW = ROOT / "data" / "raw"
DATA_PROCESSED = ROOT / "data" / "processed"
DATA_PROCESSED.mkdir(parents=True, exist_ok=True)

ROOT, DATA_RAW, DATA_PROCESSED

In [None]:
# --- Pick the file you want to load
# Example: 'DEMO_G.XPT' (Demographics), or 'example.csv'
DATA_FILE = "DEMO_G.XPT"  # <-- change me

raw_path = (DATA_RAW / DATA_FILE).resolve()
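# Raise explicitly rather than assert, so the check still runs when Python is started with -O
if not raw_path.exists():
    raise FileNotFoundError(f"File not found: {raw_path}. Place it in data/raw/ and try again.")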
raw_path

In [None]:
# --- Helper: load either XPT or CSV based on extension
def load_nhanes(path: Path) -> pd.DataFrame:
    ext = path.suffix.lower()
    if ext == ".xpt":
        # NHANES publishes SAS XPORT (.XPT). pandas can read it with read_sas(format="xport").
        df = pd.read_sas(path, format="xport", encoding="utf-8")
    elif ext == ".csv":
        df = pd.read_csv(path)
    else:
        raise ValueError(f"Unsupported file extension: {ext}. Use .XPT or .CSV")
    # standardise column names: lowercase
    df.columns = [c.lower() for c in df.columns]
    return df

df = load_nhanes(raw_path)
df.shape, list(df.columns)[:10]
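
One thing worth checking after loading an XPT file: the XPORT format stores every numeric value as floating point, so `read_sas` returns `seqn` as `float64` (`1.0` rather than `1`). The sketch below casts the ID to pandas' nullable integer type. The helper name `with_integer_seqn` is hypothetical, and the small frame is made-up illustration data, not real NHANES records.

```python
import pandas as pd

def with_integer_seqn(df: pd.DataFrame, id_col: str = "seqn") -> pd.DataFrame:
    """Return a copy of df with the ID column cast to nullable Int64 (hypothetical helper)."""
    out = df.copy()
    # Refuse to cast if any non-missing ID has a fractional part.
    ids = out[id_col].dropna()
    if not (ids == ids.round()).all():
        raise ValueError(f"{id_col} contains non-integer values")
    out[id_col] = out[id_col].astype("Int64")
    return out

# Made-up rows mimicking what read_sas returns for numeric columns (all float64)
demo = pd.DataFrame({"seqn": [1.0, 2.0, 3.0], "ridageyr": [34.0, 61.0, 8.0]})
demo_int = with_integer_seqn(demo)
```

In the notebook this would be applied as `df = with_integer_seqn(df)` right after `load_nhanes`, assuming the file contains a `seqn` column.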

In [None]:
# --- Quick look
df.head()

In [None]:
# --- Basic info & missingness summary
df.info()  # prints its summary and returns None, so no display() wrapper is needed
missing_summary = df.isna().mean().sort_values(ascending=False)
missing_summary.head(20)

In [None]:
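# --- Optional: select a few useful columns (example)
# Columns are matched by prefix, so "ridreth" picks up any column starting with it (e.g. ridreth1).
# Update the prefixes below to match your analysis needs.
example_prefixes = ("seqn", "riagendr", "ridageyr", "ridreth")
example_cols = [c for c in df.columns if c.startswith(example_prefixes)]
# Fall back to the full frame if none of the prefixes match
df_small = df[example_cols].copy() if example_cols else df.copy()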
df_small.head()

In [None]:
# --- Save lightweight copies for faster reloads
# Requires pyarrow (Parquet can alternatively use fastparquet)
feather_path = DATA_PROCESSED / (raw_path.stem.lower() + ".feather")
parquet_path = DATA_PROCESSED / (raw_path.stem.lower() + ".parquet")

# Reset the index once so both files are written with a default RangeIndex
df_out = df_small.reset_index(drop=True)
df_out.to_feather(feather_path)
df_out.to_parquet(parquet_path)

feather_path, parquet_path

## Next steps
- Repeat with other NHANES files (copy this notebook or parameterise `DATA_FILE`).
- Join/merge on the participant ID (`SEQN`, lowercased to `seqn` by `load_nhanes` above).
- Add a small data dictionary (codebook) for selected variables in `docs/`.
- Consider using **Jupytext** to keep a `.py` pair for clean diffs in Git.
- When ready, promote any repeatable logic into `src/fb3pfb_nhanes/loader.py` and write tests in `tests/`.
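
The merge step in the list above might look like the following sketch. The two frames are made-up stand-ins for a demographics file and a second NHANES file, both already loaded with lowercased columns. The column `bmxbmi` and the variable names are illustrative assumptions, not part of this notebook.

```python
import pandas as pd

# Made-up stand-ins for two loaded NHANES files (not real records)
demo = pd.DataFrame({"seqn": [1, 2, 3], "ridageyr": [34, 61, 8]})
exam = pd.DataFrame({"seqn": [2, 3, 4], "bmxbmi": [27.1, 16.4, 22.9]})

# Left join keeps every demographics row. validate raises MergeError if seqn
# is duplicated on either side; indicator adds a _merge column showing
# whether each row was found in both frames or only in the left one.
merged = demo.merge(exam, on="seqn", how="left", validate="one_to_one", indicator=True)

# Count participants with no matching row in the second file
n_unmatched = int((merged["_merge"] == "left_only").sum())
```

A left join from the demographics frame is one reasonable default, since a component file may not contain every participant. Checking `n_unmatched` before dropping the `_merge` column makes silent row loss easier to spot.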
