# 03 · Exposure analysis — intake vs biomarker

> **Purpose**: characterise dietary exposures and examine construct validity via biomarkers.

> **Learning objectives**
>
> - Inspect distributions of key exposures (energy, fruit & veg, red meat, salt, alcohol) and biomarkers.
> - Demonstrate energy scaling (per 1000 kcal) and interpret implications.
> - Show intake–biomarker alignment: fruit & veg → vitamin C; salt → urinary sodium.
> - Probe non-linearity with splines/bins and report simple effect estimates.

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
        # clone only if missing
        if not (pathlib.Path("/content") / REPO_NAME).exists():
            subprocess.run(["git", "clone", REPO_URL], check=True)  # fail loudly if the clone fails
        os.chdir(f"/content/{REPO_NAME}")
        sys.path.append(os.getcwd())
        return pathlib.Path.cwd()
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
# Bootstrap: ensure repo root on sys.path, then import init
import sys, pathlib
sys.path.append(str(pathlib.Path.cwd().parent))
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) Variables of interest

In [None]:
expo = ['energy_kcal','fruit_veg_g_d','red_meat_g_d','salt_g_d','alcohol_units_wk','fibre_g_d','ssb_ml_d']
biom = ['plasma_vitC_umol_L','urinary_sodium_mmol_L']
core = ['age','sex','BMI','SES_class']
avail = [c for c in expo+biom+core if c in df.columns]
df[avail].describe(include='all').T.head(12)

## 2) Distributions (orientation, not inference)
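Look for skew or heavy tails that might motivate a transformation later (e.g. log1p). Note that z-scoring only recentres and rescales a variable; it does not change the shape of its distribution.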
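
To make these transformations concrete, here is a minimal sketch on synthetic, right-skewed data (a log-normal stand-in for something like weekly alcohol units, not the course dataset; the variable names and distribution parameters are assumptions for illustration). It compares skewness before and after each transformation.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "intake" (assumed log-normal; illustration only)
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=1.5, sigma=0.8, size=1000))

log1p_vals = np.log1p(raw)                # log(1 + x): compresses the right tail, defined at 0
z_vals = (raw - raw.mean()) / raw.std()   # z-score: mean 0, SD 1, same shape as raw

# Sample skewness of each version, side by side
skew_table = pd.DataFrame(
    {"raw": raw, "log1p": log1p_vals, "zscore": z_vals}
).skew().round(3)
print(skew_table)
```

In this sketch the `log1p` column should be much less skewed than `raw`, while `zscore` keeps the skewness of `raw` because standardising only shifts and rescales.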

In [None]:
import matplotlib.pyplot as plt
plot_cols = [c for c in ['energy_kcal','fruit_veg_g_d','red_meat_g_d','salt_g_d','alcohol_units_wk','plasma_vitC_umol_L','urinary_sodium_mmol_L'] if c in df]
for col in plot_cols:
    x = df[col].dropna()
    plt.figure(figsize=(5.2,4))
    plt.hist(x, bins=40, alpha=0.9)
    plt.xlabel(col); plt.ylabel('count'); plt.title(col)
    plt.tight_layout(); plt.show()

## 3) Energy scaling
Dietary components tend to correlate with total energy. Normalise selected intakes per **1000 kcal** to reduce confounding by energy intake (a simple density approach; not universally appropriate but useful as a teaching baseline).

In [None]:
import numpy as np, pandas as pd
d = df.copy()
if 'energy_kcal' in d:
    # Treat non-positive energy as missing so densities become NaN instead of exploding
    energy = d['energy_kcal'].where(d['energy_kcal'] > 0)
    for v in ['fruit_veg_g_d','red_meat_g_d','salt_g_d','fibre_g_d','ssb_ml_d']:
        if v in d:
            d[v+'_per_1k'] = d[v] / energy * 1000

# Correlations with energy (raw vs density), for orientation only
corrs = {}
for v in ['fruit_veg_g_d','red_meat_g_d','salt_g_d']:
    # Skip quietly if either the intake or energy column is absent
    if v in d and 'energy_kcal' in d:
        cr = d[[v,'energy_kcal']].corr().iloc[0,1]
        dv = v+'_per_1k'
        cr_den = d[[dv,'energy_kcal']].corr().iloc[0,1] if dv in d else np.nan
        corrs[v] = {'raw_vs_energy': round(cr,3), 'density_vs_energy': round(cr_den,3)}
pd.DataFrame(corrs).T

**Interpretation prompt**: Do density adjustments reduce the apparent energy correlation as expected? When might density scaling be inappropriate (e.g., if energy is on the causal pathway for your question)?
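
As a hedged illustration of what density scaling does, the sketch below simulates salt intake that is proportional to energy plus independent noise (synthetic data; the coefficient 0.003 g per kcal and the noise levels are assumptions, not estimates from the course dataset). Under that assumption the raw intake correlates strongly with energy, while the per-1000-kcal density is close to uncorrelated with it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
energy = rng.normal(2000, 400, n).clip(800, 4000)   # kcal/day (synthetic)
salt = 0.003 * energy + rng.normal(0, 1, n)         # g/day: proportional to energy plus noise

sim = pd.DataFrame({"energy_kcal": energy, "salt_g_d": salt})
sim["salt_g_d_per_1k"] = sim["salt_g_d"] / sim["energy_kcal"] * 1000

# Pearson correlations with energy: raw intake vs density
r_raw = sim["salt_g_d"].corr(sim["energy_kcal"])
r_density = sim["salt_g_d_per_1k"].corr(sim["energy_kcal"])
print(round(r_raw, 3), round(r_density, 3))
```

The result depends on the proportionality assumption: if an intake has a large energy-independent component, dividing by energy can induce a negative correlation between the density and energy instead of removing it.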

## 4) Intake ↔ biomarker alignment (construct validity)
Two expected monotone patterns:

- Fruit & veg (g/day) → **plasma vitamin C** (µmol/L)
- Salt (g/day) → **urinary sodium** (mmol/L)

We’ll check both **quintile monotonicity** and **scatter with linear fit**, and then a simple **partialled** estimate adjusting for energy and basic covariates.

In [None]:
import pandas as pd

out = {}
if {'fruit_veg_g_d','plasma_vitC_umol_L'} <= set(d):
    q = pd.qcut(d['fruit_veg_g_d'], 5, duplicates='drop')
    fv_tab = d.groupby(q, observed=True)['plasma_vitC_umol_L'].agg(['mean','std','count']).round(2)
    out['fruitveg→vitC'] = fv_tab
    display(fv_tab)
    assert fv_tab['mean'].is_monotonic_increasing, "Vitamin C should increase across fruit/veg quintiles"

if {'salt_g_d','urinary_sodium_mmol_L'} <= set(d):
    q = pd.qcut(d['salt_g_d'], 5, duplicates='drop')
    na_tab = d.groupby(q, observed=True)['urinary_sodium_mmol_L'].agg(['mean','std','count']).round(2)
    out['salt→urNa'] = na_tab
    display(na_tab)
    assert na_tab['mean'].is_monotonic_increasing, "Urinary sodium should increase across salt quintiles"

print('Monotone checks passed where applicable ✅')

### Scatter + fitted line
Note the **noise**: within-person variation, assay error, and reporting error all dilute the signal. We use simple least squares for the visual trend (not a causal estimate).
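
One of these noise sources, reporting error in the intake variable, can be sketched with a small simulation. The data are synthetic and every number (true slope 0.1, error SD equal to the SD of true intake) is an assumption chosen for illustration. With classical error of that size, the fitted slope on reported intake is expected to be roughly half the slope on true intake.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
true_intake = rng.normal(300, 100, n)                        # g/day (synthetic "truth")
biomarker = 30 + 0.1 * true_intake + rng.normal(0, 10, n)    # true slope = 0.1
reported = true_intake + rng.normal(0, 100, n)               # classical reporting error

# Least-squares slopes (np.polyfit returns [slope, intercept] for degree 1)
slope_true = np.polyfit(true_intake, biomarker, 1)[0]
slope_reported = np.polyfit(reported, biomarker, 1)[0]

# Expected attenuation factor = var(true) / (var(true) + var(error))
attenuation = 100**2 / (100**2 + 100**2)
print(slope_true, slope_reported, attenuation)
```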

In [None]:
import numpy as np, matplotlib.pyplot as plt
pairs = [
    ('fruit_veg_g_d','plasma_vitC_umol_L','Fruit & veg (g/day)','Plasma vitamin C (µmol/L)'),
    ('salt_g_d','urinary_sodium_mmol_L','Salt (g/day)','Urinary sodium (mmol/L)')
]
for xcol,ycol,xlab,ylab in pairs:
    if {xcol,ycol} <= set(d):
        xy = d[[xcol,ycol]].dropna()
        if xy.empty: continue
        plt.figure(figsize=(5.4,4))
        plt.scatter(xy[xcol], xy[ycol], s=10, alpha=0.6)
        b1,b0 = np.polyfit(xy[xcol], xy[ycol], 1)
        grid = np.linspace(xy[xcol].min(), xy[xcol].max(), 120)
        plt.plot(grid, b1*grid+b0)
        plt.xlabel(xlab); plt.ylabel(ylab); plt.title(f"{xcol} vs {ycol}")
        plt.tight_layout(); plt.show()

### Partialled estimates (adjusting for energy and basics)
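This is not causal inference. The aim is only to show that the **exposure–biomarker signal persists** after adjusting for **energy**, **age**, **sex**, and **SES**. We report the adjusted slope, i.e. the change in biomarker per one-unit increase in exposure (for example µmol/L of vitamin C per g/day of fruit & veg), with its 95% confidence interval. Adjusters that are missing from the data are skipped.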

In [None]:
import statsmodels.api as sm
from patsy import dmatrix

def partial_slope(df_, x, y, adjust=('energy_kcal','age','sex','SES_class')):
    cols = [c for c in [x,y,*adjust] if c in df_.columns]
    dd = df_[cols].dropna().copy()
    # encode categoricals
    rhs_terms = []
    for v in adjust:
        if v in dd.columns:
            if dd[v].dtype=='object' or str(dd[v].dtype).startswith('category'):
                rhs_terms.append(f'C({v})')
            else:
                rhs_terms.append(v)
    RHS = ' + '.join(rhs_terms) if rhs_terms else '1'
    # Build design by hand: y ~ x + adjust
    X = dmatrix('1 + ' + x + (' + ' + RHS if RHS!='1' else ''), data=dd, return_type='dataframe')
    mod = sm.OLS(dd[y], X).fit()
    coef = mod.params.get(x, float('nan'))
    ci = mod.conf_int().loc[x].tolist() if x in mod.params.index else [float('nan'), float('nan')]
    return {'n': len(dd), 'beta': coef, 'lo': ci[0], 'hi': ci[1]}

res = []
if {'fruit_veg_g_d','plasma_vitC_umol_L'} <= set(d):
    res.append(('fruit_veg_g_d → plasma_vitC_umol_L', partial_slope(d,'fruit_veg_g_d','plasma_vitC_umol_L')))
if {'salt_g_d','urinary_sodium_mmol_L'} <= set(d):
    res.append(('salt_g_d → urinary_sodium_mmol_L', partial_slope(d,'salt_g_d','urinary_sodium_mmol_L')))

pd.DataFrame([{'Relation': k, **v} for k,v in res]).round(3)
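
To show what "partialled" means in the estimates above, the following sketch uses synthetic data with a single adjuster (energy); all coefficients are assumptions for illustration. It checks numerically that the exposure coefficient from the adjusted model equals the slope obtained by first removing energy from both exposure and biomarker and then regressing residual on residual (the Frisch–Waugh–Lovell result).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
energy = rng.normal(2000, 400, n)                   # adjuster (kcal/day, synthetic)
fruit_veg = 0.1 * energy + rng.normal(0, 80, n)     # exposure, correlated with energy
vit_c = 20 + 0.08 * fruit_veg + 0.005 * energy + rng.normal(0, 10, n)

def ols_coefs(X, y):
    """Least-squares coefficients for y ~ X (X already contains the intercept column)."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

ones = np.ones(n)

# (a) Exposure coefficient from the full adjusted model
beta_full = ols_coefs(np.column_stack([ones, fruit_veg, energy]), vit_c)[1]

# (b) Residualise exposure and biomarker on the adjuster, then regress residual on residual
Z = np.column_stack([ones, energy])
rx = fruit_veg - Z @ ols_coefs(Z, fruit_veg)
ry = vit_c - Z @ ols_coefs(Z, vit_c)
beta_partial = (rx @ ry) / (rx @ rx)

print(beta_full, beta_partial)
```

The two numbers should agree to floating-point precision, which is why the adjusted slope can be read as the exposure–biomarker association among people with the same energy intake.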

## 5) Non-linearity check (bins & splines)
Visualise possible curvature with **exposure bins** and optionally a **spline** term (for flexible fit). Use this to **motivate** transformations in later modelling, not as proof of causality.

In [None]:
from patsy import bs, dmatrix, build_design_matrices
import numpy as np, matplotlib.pyplot as plt

def binned_means(df_, x, y, k=10):
    dd = df_[[x,y]].dropna().copy()
    dd['bin'] = pd.qcut(dd[x], q=k, duplicates='drop')
    return dd.groupby('bin', observed=True)[y].mean()

checks = [
    ('fruit_veg_g_d','plasma_vitC_umol_L','Fruit & veg (g/day)','Vit C (µmol/L)'),
    ('salt_g_d','urinary_sodium_mmol_L','Salt (g/day)','Urinary Na (mmol/L)')
]
for x,y,xlab,ylab in checks:
    if {x,y} <= set(d):
        dd = d[[x,y]].dropna().copy()
        # Binned means
        bm = binned_means(dd, x, y, k=10)
        # Spline fit (df=4)
        Xs = dmatrix('1 + bs('+x+', df=4)', data=dd, return_type='dataframe')
        fit = sm.OLS(dd[y], Xs).fit()
        grid = np.linspace(dd[x].min(), dd[x].max(), 120)
        # Rebuild the design from the fitted spec so the grid uses the SAME knots as the data;
        # a fresh dmatrix on the grid would choose new knots and misalign the coefficients
        Xg = build_design_matrices([Xs.design_info], {x: grid}, return_type='dataframe')[0]
        yg = fit.predict(Xg)

        plt.figure(figsize=(5.6,4.2))
        # plot binned means
        ctrs = bm.index.map(lambda c: 0.5*(c.left+c.right))
        plt.scatter(list(ctrs), bm.values, s=24, alpha=0.8, label='Binned means')
        # spline
        plt.plot(grid, yg, label='Spline (df=4)')
        plt.xlabel(xlab); plt.ylabel(ylab); plt.title(f"Non-linearity check: {x} → {y}")
        plt.legend(); plt.tight_layout(); plt.show()

## 6) TODO: your turn
1. **Energy scaling**: Create `red_meat_g_d_per_1k` and `salt_g_d_per_1k` if not already present. Compare their correlations with energy vs the raw variables. Write one sentence interpreting the change.
2. **Biomarker alignment**: Compute **Spearman** correlations for (fruit&veg, vit C) and (salt, urinary Na). Are results consistent with the monotone checks?
3. **Adjusted slope**: Re-run `partial_slope` adding `BMI` to the adjusters. Does the exposure coefficient change meaningfully?
4. **Non-linearity**: Using the spline plot you created, argue briefly (2–3 sentences) whether a log or spline term is warranted in later models.

In [None]:
# (2) Spearman correlations (example)
import scipy.stats as st
pairs = []
if {'fruit_veg_g_d','plasma_vitC_umol_L'} <= set(d):
    a = d[['fruit_veg_g_d','plasma_vitC_umol_L']].dropna()
    rho,p = st.spearmanr(a['fruit_veg_g_d'], a['plasma_vitC_umol_L'])
    pairs.append({'pair':'fruit&veg ~ vitC','rho':round(rho,3),'p':p})
if {'salt_g_d','urinary_sodium_mmol_L'} <= set(d):
    b = d[['salt_g_d','urinary_sodium_mmol_L']].dropna()
    rho,p = st.spearmanr(b['salt_g_d'], b['urinary_sodium_mmol_L'])
    pairs.append({'pair':'salt ~ urinaryNa','rho':round(rho,3),'p':p})
import pandas as pd
pd.DataFrame(pairs)

> ## Key takeaways
>
> - Energy drives many diet correlations; density scaling can clarify patterns but must align with your causal story.
> - Construct validity checks (intake ↔ biomarker) should show **monotone** trends despite noise.
> - Simple partialling (energy, age, sex, SES) helps show the signal is not just a by-product of total energy intake or basic demographics.
> - Non-linearity diagnostics guide later modelling choices (transform, polynomial, or spline).