
# 📘 FB3PFB — NHANES (Intro notebook)

**Goal:** a very simple, step-by-step analysis that runs in Google Colab with no extra installations and no Google Drive mount.  
What we’ll do:
- Load minimal libraries.
- Download **DEMO_J**, **HDL_J**, **TRIGLY_J** directly from the CDC site.
- Recode `riagendr` → `sex` (Male/Female).
- Convert HDL and triglycerides from **mg/dL** → **mmol/L**.
- Show mean/SD/median/IQR for **age** by **sex**.
- Draw a **density plot** of HDL by sex.
- Compute **difference in HDL** (Female − Male) and a **t-test**.
- Make a **scatter plot** of HDL vs Age.
- Fit a tiny **OLS regression**: HDL ~ sex + age.

> This notebook is intentionally simple (no custom functions unless necessary).


## 1) Load libraries

In [None]:
# === Core libraries ===========================================================
import sys                  # lets us inspect the runtime (e.g., detect Colab vs local)
from pathlib import Path    # safer, cleaner file paths than using raw strings

import pandas as pd         # tables/data frames, reading XPT/CSV, cleaning, groupby, etc.
import numpy as np          # basic numerical helpers (arrays, NaNs); pandas uses it under the hood

# === Downloading files from the web (CDC) =====================================
import requests             # simple HTTP library to fetch files from URLs

# === Plotting =================================================================
import matplotlib.pyplot as plt  # basic plotting (histograms, scatter, density via pandas)

# === Statistics ================================================================
from scipy import stats          # t-tests and other classic stats tests
import statsmodels.formula.api as smf  # regression with R-style formulas (e.g., y ~ x1 + x2)

# Show versions & environment
print(f"Python: {sys.version.split()[0]}")
print(f"pandas: {pd.__version__} | numpy: {np.__version__}")
try:
    import scipy, statsmodels
    print(f"scipy: {scipy.__version__} | statsmodels: {statsmodels.__version__}")
except Exception:
    pass


## 2) Download DEMO, HDL, TRIGLY from CDC

In Colab, files are saved to `/content/data/raw` (Colab's temporary filesystem); on a local machine they go to `data/raw` under the repository root.  
We use a small dictionary (key = filename, value = URL) and loop over it to download each file.
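
An HTTP success status does not by itself prove that the downloaded bytes are a SAS transport file, since a server can return an HTML page and still report success. The sketch below is a minimal sanity check, assuming the standard XPORT layout in which a file begins with a fixed `HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!` marker. The helper name `looks_like_xport` and the sample payloads are hypothetical and not part of this notebook.

```python
# Hypothetical helper (not part of the original notebook): a cheap sanity check
# on downloaded bytes before they are written to disk.
XPORT_MAGIC = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"

def looks_like_xport(data: bytes) -> bool:
    """Return True if the payload starts with the SAS XPORT library header."""
    return data.startswith(XPORT_MAGIC)

# Made-up payloads for illustration only.
good = XPORT_MAGIC + b"0" * 30 + b"  "
bad = b"<!DOCTYPE html><html><body>Page not found</body></html>"

print(looks_like_xport(good))  # True
print(looks_like_xport(bad))   # False
```

In a download loop, a check like this could be applied to `r.content` before `dest.write_bytes(...)`, raising an error when it returns `False`.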


In [None]:
# ==== Paths + Download (works in BOTH Colab and local Jupyter) =================
# This cell:
#   1) Detects whether we are running in Google Colab or on your own computer.
#   2) Chooses a safe place for data:
#        - In Colab: /content/data/raw   (a temporary folder inside the VM)
#        - Locally:  <your-repo>/data/raw  (at the REPOSITORY ROOT, not notebooks/)
#   3) Creates data folders if they don't exist.
#   4) Downloads three NHANES XPORT files (DEMO_J, HDL_J, TRIGLY_J) from CDC
#      if they are not already present.
#   5) Prints a small directory listing so you can SEE what was downloaded.
#


# --- STEP 1: Are we running in Google Colab? ----------------------------------
# When you run in Colab, the special module "google.colab" is available.
# We use that fact to decide which folders to use later on.
IN_COLAB = "google.colab" in sys.modules
print("Running in Colab?", IN_COLAB)

# --- STEP 2: Find a good project "ROOT" folder --------------------------------
# • In Colab we keep things under /content (this is the writable VM workspace).
# • Locally we want the repository ROOT (not the notebooks/ subfolder).
#   To find the repo root, we walk UP the directory tree until we see a ".git" folder.
def find_repo_root(start: Path) -> Path:
    """
    Walk upwards from 'start' until we find a '.git' folder (the repo root).
    If we can't find one within ~10 levels, just return 'start' as a fallback.
    """
    p = start.resolve()  # resolve = make absolute and clean up any ".."
    for _ in range(10):  # try up to 10 parent directories
        if (p / ".git").exists():   # does this folder contain a .git directory?
            return p                # yes → this is the repo root
        if p.parent == p:           # we've reached the filesystem root ("/")
            break
        p = p.parent                # go up one level and try again
    return start.resolve()          # fallback if no .git was found

if IN_COLAB:
    ROOT = Path("/content")         # Colab’s working area
else:
    ROOT = find_repo_root(Path.cwd())  # current working dir → repo root

print("ROOT folder:", ROOT)

# --- STEP 3: Define data folders and make sure they exist ---------------------
# We keep raw downloads in data/raw and any processed outputs in data/processed.

DATA_DIR = ROOT / "data" / "raw"
PROC_DIR = ROOT / "data" / "processed"

# Create the folders if they don't exist.
# parents=True means: also create any missing parent folders.
# exist_ok=True means: don't crash if the folder already exists.
DATA_DIR.mkdir(parents=True, exist_ok=True)
PROC_DIR.mkdir(parents=True, exist_ok=True)

print("DATA_DIR :", DATA_DIR)
print("PROC_DIR :", PROC_DIR)

# --- STEP 4: Download NHANES XPT files from CDC if missing --------------------
# We keep a small dictionary (a “dict” = key→value map):
#   key   = filename we want to save as
#   value = the URL to download from
FILES = {
    "DEMO_J.xpt":   "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/DEMO_J.xpt",
    "HDL_J.xpt":    "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/HDL_J.XPT",
    "TRIGLY_J.xpt": "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/TRIGLY_J.XPT",
}

# Loop through the dict. In Python, "for fname, url in FILES.items()" means:
# take each (key, value) pair from the dict as (fname, url).
for fname, url in FILES.items():
    dest = DATA_DIR / fname                 # full path to where the file should live
    if dest.exists():
        # If we already have the file, don’t download again (saves time/bandwidth).
        print(f"Already have {fname}")
        continue

    print(f"Downloading {fname} …")
    # requests.get(...) fetches the file from the web.
    # timeout=60 → give up if there's no response for 60 seconds (avoids hanging forever).
    r = requests.get(url, timeout=60)
    r.raise_for_status()                    # if the server responded with an error, raise now
    dest.write_bytes(r.content)             # write the downloaded bytes to disk

# --- STEP 5: Show a simple directory listing of what we have ------------------
print("\nFiles found in DATA_DIR:")
for p in sorted(DATA_DIR.iterdir()):        # iterdir() lists children; sorted(...) makes it alphabetical
    print(" -", p.name)

# At this point:
# • Your three .xpt files should exist in DATA_DIR.
# • The rest of the notebook can refer to DATA_DIR to load them with pandas.
#   Example:
#       import pandas as pd
#       df_demo = pd.read_sas(DATA_DIR / "DEMO_J.xpt", format="xport", encoding="utf-8")
#       df_demo.columns = [c.lower() for c in df_demo.columns]


## 3) Load the three files into DataFrames

We read **XPT** files with `pandas.read_sas(..., format="xport")` and lower-case the columns.


In [None]:
# Straight loading (no helpers)
p_demo   = DATA_DIR / "DEMO_J.xpt"
p_hdl    = DATA_DIR / "HDL_J.xpt"
p_trigly = DATA_DIR / "TRIGLY_J.xpt"

df_demo = pd.read_sas(p_demo, format="xport", encoding="utf-8")
df_hdl  = pd.read_sas(p_hdl,  format="xport", encoding="utf-8")
df_trig = pd.read_sas(p_trigly, format="xport", encoding="utf-8")

# Lowercase all columns
df_demo.columns = [c.lower() for c in df_demo.columns]
df_hdl.columns  = [c.lower() for c in df_hdl.columns]
df_trig.columns = [c.lower() for c in df_trig.columns]

print("Shapes:", df_demo.shape, df_hdl.shape, df_trig.shape)
df_demo.head()

## 4) Minimal merges (left-join labs onto demographics by `seqn`)

NHANES uses `SEQN` (lowercased here to `seqn`) as the participant ID.
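
As a sketch of what a left join on `seqn` does, the example below merges two tiny made-up tables (the values are illustrative, not NHANES data). It also uses two optional `merge` arguments that this notebook does not use: `validate="one_to_one"`, which raises an error if an ID repeats on either side, and `indicator=True`, which records whether each row found a match.

```python
import pandas as pd

# Tiny made-up tables standing in for DEMO_J and HDL_J (values are illustrative).
demo = pd.DataFrame({"seqn": [1, 2, 3], "riagendr": [1, 2, 2]})
labs = pd.DataFrame({"seqn": [1, 3], "lbdhdd": [50.0, 62.0]})

merged = demo.merge(
    labs,
    on="seqn",
    how="left",               # keep every demographics row
    validate="one_to_one",    # raise MergeError if seqn repeats on either side
    indicator=True,           # adds a "_merge" column: "both" or "left_only"
)

print(merged)
print(merged["_merge"].value_counts())
```

Participant 2 has no lab row, so the lab value is missing and `_merge` reads `left_only`; this is the same pattern that produces missing HDL values after a left join onto demographics.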


In [None]:
# Keep essentials from DEMO to stay tidy
demo_keep = ["seqn", "riagendr", "ridageyr"]
d = df_demo[demo_keep].copy()

# Left-join HDL and triglycerides (drop_duplicates just in case)
d = d.merge(df_hdl.drop_duplicates(subset=["seqn"]), on="seqn", how="left")
d = d.merge(df_trig.drop_duplicates(subset=["seqn"]), on="seqn", how="left")

# Peek at the lab columns the merges added. NHANES variable names are codes
# (such as lbdhdd or lbxtr), so searching for "hdl" or "trig" finds nothing.
print([c for c in d.columns if c not in demo_keep])

## 5) Recode sex; convert HDL & triglycerides to SI units

- `riagendr`: **1 → Male**, **2 → Female**  
- HDL: **mg/dL → mmol/L**, multiply by **0.02586**  
- Triglycerides: **mg/dL → mmol/L**, multiply by **0.01129**
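
The two factors follow from molar mass: 1 mg/dL equals 10 mg/L, and dividing by the molar mass in g/mol gives mmol/L. The sketch below reproduces them, assuming the commonly used values of 386.65 g/mol for cholesterol and about 885.7 g/mol for an average triglyceride. These molar masses are assumptions and are not stated in this notebook.

```python
# Assumed molar masses (g/mol); not given in the notebook itself.
CHOLESTEROL_G_PER_MOL = 386.65
TRIGLYCERIDE_G_PER_MOL = 885.7   # an average value; real triglycerides vary

def mgdl_to_mmoll_factor(molar_mass: float) -> float:
    """mg/dL -> mmol/L: multiply by 10 (dL -> L), divide by molar mass."""
    return 10.0 / molar_mass

hdl_factor = mgdl_to_mmoll_factor(CHOLESTEROL_G_PER_MOL)
tg_factor = mgdl_to_mmoll_factor(TRIGLYCERIDE_G_PER_MOL)

print(f"HDL factor: {hdl_factor:.5f}")
print(f"TG factor:  {tg_factor:.5f}")

# Worked example: an HDL of 50 mg/dL expressed in mmol/L.
print(f"50 mg/dL HDL = {50 * hdl_factor:.2f} mmol/L")
```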


In [None]:
# Recode sex
d["riagendr"] = pd.to_numeric(d["riagendr"], errors="coerce")
d["sex"] = d["riagendr"].map({1: "Male", 2: "Female"})

# Identify HDL & triglyceride columns: list everything that is not demographics
lab_cols = [c for c in d.columns if c not in ("seqn", "riagendr", "ridageyr", "sex")]
print("Lab columns:", lab_cols)

# Pick specific columns (edit if needed)
hdl_mgdl_col  = "lbdhdd"   # direct HDL cholesterol, mg/dL
trig_mgdl_col = "lbxtr"    # triglycerides, mg/dL

# Fail early with a readable message if a name is wrong for this file
for col in (hdl_mgdl_col, trig_mgdl_col):
    if col not in d.columns:
        raise KeyError(f"Expected column {col!r} not found; check the list above")

# Ensure numeric
d[hdl_mgdl_col]  = pd.to_numeric(d[hdl_mgdl_col],  errors="coerce")
d[trig_mgdl_col] = pd.to_numeric(d[trig_mgdl_col], errors="coerce")
d["ridageyr"]    = pd.to_numeric(d["ridageyr"],    errors="coerce")

# Create SI-unit columns
d["hdl_mmol_l"]  = d[hdl_mgdl_col]  * 0.02586
d["tg_mmol_l"]   = d[trig_mgdl_col] * 0.01129

d[["sex", "ridageyr", hdl_mgdl_col, "hdl_mmol_l", trig_mgdl_col, "tg_mmol_l"]].head()

## 6) Show mean/SD/median/IQR for **age**, by **sex**

We loop over the two sex groups and print simple summaries.
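
The same summaries can also be produced in one call with `groupby(...).agg(...)`. The sketch below shows that alternative on a small made-up table (the ages are illustrative, not NHANES values); it is not the approach this notebook uses, which keeps the explicit loop for readability.

```python
import pandas as pd

# Made-up ages for illustration; the notebook's data frame is `d`.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "ridageyr": [20, 30, 40, 25, 35, 60],
})

summary = toy.groupby("sex")["ridageyr"].agg(
    n="count",
    mean="mean",
    sd="std",                        # sample SD (ddof=1), as in the loop
    median="median",
    q25=lambda s: s.quantile(0.25),  # lower bound of the IQR
    q75=lambda s: s.quantile(0.75),  # upper bound of the IQR
)
print(summary.round(1))
```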


In [None]:
print("=== Age summary by sex ===")
for grp in ["Male", "Female"]:
    sub = d.loc[d["sex"] == grp, "ridageyr"].dropna()
    if sub.empty:
        print(f"\n{grp}: no data")
        continue

    n = sub.size
    mean = sub.mean()
    sd   = sub.std(ddof=1)
    median = sub.median()
    q25 = sub.quantile(0.25)
    q75 = sub.quantile(0.75)

    print(f"\n{grp}: n={n}")
    print(f"  Mean ± SD:    {mean:.1f} ± {sd:.1f}")
    print(f"  Median [IQR]: {median:.1f} [{q25:.1f}, {q75:.1f}]")

## 7) Density plot (HDL in mmol/L by sex)

Two kernel density curves, one per sex.
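
pandas draws `kind="kde"` curves using SciPy's `gaussian_kde`. To make that step less of a black box, the sketch below evaluates a kernel density estimate directly on made-up values (drawn from a normal distribution for illustration, not real HDL measurements) and checks that the estimated curve has total area 1.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up HDL-like values (mmol/L) for illustration only.
rng = np.random.default_rng(0)
values = rng.normal(loc=1.4, scale=0.4, size=500)

kde = gaussian_kde(values)          # default bandwidth: Scott's rule
grid = np.linspace(0.0, 3.0, 7)
density = kde(grid)                 # estimated density at each grid point

for x, y in zip(grid, density):
    print(f"x={x:.1f}  density={y:.3f}")

# A density integrates to 1 over the whole real line.
total_area = kde.integrate_box_1d(-np.inf, np.inf)
print(f"Total area under the curve: {total_area:.3f}")
```

Because each curve has area 1 regardless of group size, density plots compare the shapes of two distributions rather than their counts.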


In [None]:
# Prepare series (drop missing)
hdl_f = d.loc[d["sex"] == "Female", "hdl_mmol_l"].dropna()
hdl_m = d.loc[d["sex"] == "Male",   "hdl_mmol_l"].dropna()

plt.figure()
hdl_f.plot(kind="kde", linewidth=2, label="Female")
hdl_m.plot(kind="kde", linewidth=2, label="Male")
plt.xlabel("HDL (mmol/L)")
plt.ylabel("Density")
plt.title("HDL distribution by sex")
plt.legend()
plt.show()

## 8) Difference in HDL (Female − Male)

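Simple unweighted difference in mean HDL (mmol/L). It ignores the NHANES survey weights, so it describes this sample rather than the US population.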


In [None]:
hdl_mean_f = hdl_f.mean()
hdl_mean_m = hdl_m.mean()
diff = hdl_mean_f - hdl_mean_m

print(f"Mean HDL (mmol/L): Female={hdl_mean_f:.2f}, Male={hdl_mean_m:.2f}")
print(f"Difference (Female − Male): {diff:.2f} mmol/L")

## 9) t-test (Welch’s, safer default)

Welch’s t-test does not assume equal variances.
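
To show what the test computes, the sketch below works through Welch's statistic by hand on two small made-up samples (not NHANES values) and compares it with `scipy.stats.ttest_ind(..., equal_var=False)`. Each group contributes its own variance term, and the degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

# Made-up samples for illustration only (not NHANES values).
a = np.array([1.6, 1.4, 1.8, 1.5, 1.7, 1.3])
b = np.array([1.2, 1.1, 1.5, 1.0, 1.3])

# Per-group variance of the mean: s^2 / n, with the sample variance (ddof=1).
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_manual = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)   # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)

print(f"manual: t={t_manual:.4f}, df={df_manual:.2f}, p={p_manual:.4g}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.4g}")
```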


In [None]:
t_stat, p_val = stats.ttest_ind(hdl_f, hdl_m, equal_var=False, nan_policy="omit")
print("Welch's t-test on HDL (mmol/L): Female vs Male")
print(f"  t = {t_stat:.3f}, p = {p_val:.3g}")
print(f"  n_female = {hdl_f.size}, n_male = {hdl_m.size}")

## 10) Scatter plot: HDL (mmol/L) vs Age (years)

Points colored by sex.


In [None]:
plt.figure()
for label, sub in d.dropna(subset=["ridageyr", "hdl_mmol_l"]).groupby("sex", dropna=False):
    plt.scatter(sub["ridageyr"], sub["hdl_mmol_l"], s=10, alpha=0.4, label=str(label))
plt.xlabel("Age (years)")
plt.ylabel("HDL (mmol/L)")
plt.title("HDL vs Age by sex")
plt.legend()
plt.show()

## 11) Regression: HDL (mmol/L) ~ sex + age

We treat `sex` as categorical using `C(sex)` and drop rows with missing values in model variables.
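
With `C(sex)`, the formula interface picks a reference level and reports the other level relative to it. The sketch below illustrates this on made-up data generated from a known rule (the numbers are illustrative, not estimates from NHANES), so the fitted coefficients can be checked against the rule.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data built from a known rule, for illustration only:
# hdl = 1.5 - 0.3 * (sex == "Male") + 0.002 * age
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "age": [20, 30, 40, 50, 60, 70],
})
toy["hdl"] = 1.5 - 0.3 * (toy["sex"] == "Male") + 0.002 * toy["age"]

fit = smf.ols("hdl ~ C(sex) + age", data=toy).fit()

# Levels are sorted alphabetically, so "Female" is the reference and the
# sex coefficient is reported for "Male" relative to "Female".
print(fit.params)
```

The sex term appears as `C(sex)[T.Male]`, meaning Male minus Female at a given age. That is the opposite sign convention from the Female − Male difference computed in section 8.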


In [None]:
reg = d[["hdl_mmol_l", "sex", "ridageyr"]].dropna().copy()
model = smf.ols("hdl_mmol_l ~ C(sex) + ridageyr", data=reg).fit()
print(model.summary())