# 01 · Introduction to Nutritional Epidemiology

> **Learning objectives**
- Define epidemiology and its scope within nutrition.
- Contrast landmark effects (e.g., smoking & lung cancer) with subtler nutrition effects.
- Recognise nutrition-specific challenges: complexity, misreporting, long latency.
- Load and inspect the FB2NEP synthetic cohort (N≈25k) and perform simple checks.

---

In [None]:
# one-time per runtime
!git clone https://github.com/ggkuhnle/fb2nep-epi.git
%cd fb2nep-epi

from scripts.bootstrap import init
df, ctx = init()
print(df.shape, "dataset ready")
df.head()  # last expression in the cell, so the notebook displays it

## What is epidemiology?
Epidemiology studies the **distribution** and **determinants** of health-related states in populations and applies that knowledge to the control of health problems.

In nutrition, exposures are complex (patterns, foods, nutrients), often **misreported**, and effects can be **modest** and **slow to emerge** compared with hazards like active smoking. This makes design, measurement, and interpretation particularly demanding.
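
To make the effect of misreporting concrete, here is a minimal simulation sketch. It does not use the FB2NEP cohort: the variables `true_intake`, `reported` and `biomarker`, and all numeric parameters, are hypothetical values chosen for illustration. It assumes classical (random, non-differential) error in self-report and shows how such error pulls an estimated slope towards zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 25_000  # similar in size to the course cohort, but fully simulated

# Hypothetical "true" usual intake (g/day) and a biomarker that depends on it.
true_intake = rng.normal(400, 120, n)
true_slope = 0.05
biomarker = 30 + true_slope * true_intake + rng.normal(0, 10, n)

# Self-report = truth + classical error (independent of truth and outcome).
error_sd = 120
reported = true_intake + rng.normal(0, error_sd, n)

# Fit a straight line to each version of the exposure.
slope_true = np.polyfit(true_intake, biomarker, 1)[0]
slope_reported = np.polyfit(reported, biomarker, 1)[0]

# Expected attenuation under classical error: var(true) / (var(true) + var(error)).
attenuation = 120**2 / (120**2 + error_sd**2)

print("slope using true intake:    ", round(slope_true, 4))
print("slope using reported intake:", round(slope_reported, 4))
print("observed ratio:", round(slope_reported / slope_true, 2),
      "| expected ratio:", attenuation)
```

With equal variances for true intake and error, the slope estimated from reported intake is roughly half the slope estimated from true intake. Real dietary misreporting is often not purely random, so the bias in practice can be larger, smaller, or in the other direction.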

## First look at the dataset

In [None]:
df.head(3)

In [None]:
df.shape, sorted(df.columns)[:12], sorted(df.columns)[12:24]

In [None]:
# Summary (numeric + selected categoricals)
desc_num = df.select_dtypes(include="number").describe().T  # all numeric dtypes, not only float64/int64
desc_cat = df.select_dtypes(include=["object","category"]).describe().T
desc_num.head(10), desc_cat.head(10)

### Sanity checks (lightweight)
Quick, defensible checks that catch obvious mistakes in a synthetic cohort.

In [None]:
# Age range & cohort minimum
assert df['age'].min() >= 40, "Cohort should be age ≥ 40"
# Core columns exist
required = {'id','age','sex','BMI','IMD_quintile','smoking_status','fruit_veg_g_d','red_meat_g_d','salt_g_d','plasma_vitC_umol_L','urinary_sodium_mmol_L','CVD_incident','Cancer_incident'}
missing = required - set(df.columns)
assert not missing, f"Missing columns: {missing}"
# Incident flags are 0/1
for f in ['CVD_incident','Cancer_incident']:
    vals = set(df[f].dropna().unique().tolist())
    assert vals.issubset({0,1}), f"{f} should be binary 0/1"
print("Basic checks passed ✅")

## Measurement error: self-report vs biomarkers
A simple **construct-validity** signal should be visible: higher fruit & veg intake → higher plasma vitamin C. The relation is noisy (within-person variation, assay error, reporting bias), but **monotonic on average**.

In [None]:
import pandas as pd
# Quintiles of fruit & veg and mean vitamin C per quintile
q = pd.qcut(df['fruit_veg_g_d'], 5, duplicates='drop')
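# observed=True keeps only the bins that occur and avoids the pandas FutureWarning for categorical groupers
res = df.groupby(q, observed=True)['plasma_vitC_umol_L'].agg(['mean','std','count']).round(2)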
display(res)
assert res['mean'].is_monotonic_increasing, "Mean vit C should increase across fruit/veg quintiles"
print("Construct-validity check passed ✅")

## Visual cue: intake vs biomarker (scatter + fitted line)

In [None]:
import numpy as np, matplotlib.pyplot as plt
xy = df[['fruit_veg_g_d','plasma_vitC_umol_L']].dropna()
plt.figure(figsize=(5.2,4))
plt.scatter(xy['fruit_veg_g_d'], xy['plasma_vitC_umol_L'], s=10, alpha=0.6)
coef = np.polyfit(xy['fruit_veg_g_d'], xy['plasma_vitC_umol_L'], 1)
grid = np.linspace(0, xy['fruit_veg_g_d'].max(), 120)
plt.plot(grid, coef[0]*grid + coef[1])
plt.xlabel('Fruit & veg (g/day)')
plt.ylabel('Plasma vitamin C (µmol/L)')
plt.title('Signal & noise in exposure measurement')
plt.tight_layout(); plt.show()

## Nutrition-specific challenges (discussion)
- **Complex exposures**: foods/nutrients cluster; energy intake scales many variables.
- **Measurement error**: self-report misreporting (differential/non-differential); biomarkers imperfect.
- **Latency & modest effects**: relative risks are small and outcomes take years to appear, so long follow-up is needed, during which diet and confounders can change.
- **Confounding**: lifestyle and socioeconomic variables correlate with diet and outcomes.
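
The confounding point in the list above can be illustrated with a small simulation. This is a sketch under stated assumptions, not an analysis of the FB2NEP data: the data frame `sim`, its columns, the helper `risk_ratio` and all probabilities are hypothetical. Smoking is made to raise both the chance of high red meat intake and the cancer risk, while red meat itself has no effect on cancer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: 30% of the simulated population smoke.
smoker = rng.random(n) < 0.3
# High red meat intake is more common among smokers (60% vs 30%).
high_meat = rng.random(n) < np.where(smoker, 0.6, 0.3)
# Cancer risk depends on smoking only (15% vs 5%); red meat has NO effect here.
cancer = rng.random(n) < np.where(smoker, 0.15, 0.05)

sim = pd.DataFrame({"smoker": smoker, "high_meat": high_meat, "cancer": cancer})

def risk_ratio(d):
    """Risk of cancer in high vs low red meat consumers."""
    risk_high = d.loc[d["high_meat"], "cancer"].mean()
    risk_low = d.loc[~d["high_meat"], "cancer"].mean()
    return risk_high / risk_low

crude_rr = risk_ratio(sim)
rr_smokers = risk_ratio(sim[sim["smoker"]])
rr_nonsmokers = risk_ratio(sim[~sim["smoker"]])

print("crude RR:          ", round(crude_rr, 2))
print("RR in smokers:     ", round(rr_smokers, 2))
print("RR in non-smokers: ", round(rr_nonsmokers, 2))
```

The crude risk ratio is clearly above 1 even though red meat has no effect in this simulation, whereas the risk ratios within smokers and within non-smokers are both close to 1. Stratifying on (or adjusting for) the confounder removes the spurious association.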

_Prompt:_ Which of these would most bias an association between **red meat** and **cancer** upwards? Downwards? Why?

## TODO · Quick computations
Answer directly in code/comments; asserts give gentle feedback.

In [None]:
# a) Crude incidences (expect around 10–12%)
p_cvd = df['CVD_incident'].mean()
p_cancer = df['Cancer_incident'].mean()
print({"CVD": round(p_cvd,4), "Cancer": round(p_cancer,4)})
assert 0.06 <= p_cvd <= 0.20, "CVD incidence should be plausible"
assert 0.06 <= p_cancer <= 0.20, "Cancer incidence should be plausible"

In [None]:
# b) Salt intake vs urinary sodium — expect positive correlation
sub = df[['salt_g_d','urinary_sodium_mmol_L']].dropna()
r = sub.corr().iloc[0,1]
print("r(salt, urinary Na) =", round(r,3))
assert r > 0.25, "Expect a positive association between salt and urinary sodium"

In [None]:
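# c) Event-date integrity: if incident==1 there must be a date
for flag, date in [("CVD_incident","CVD_date"),("Cancer_incident","Cancer_date")]:
    is_event = df[flag].eq(1)  # missing flags count as "no event" here
    # A date is present if it is neither NaN/NaT nor an empty string
    has_date = df[date].notna() & df[date].astype(str).str.strip().ne("")
    n_bad = int((is_event & ~has_date).sum())
    assert n_bad == 0, f"{date}: {n_bad} incident=1 rows missing a date"
print("Event-date integrity checks passed ✅")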

## Short exercise (reflective)
In **3–5 sentences**, explain why strong biological plausibility (e.g., salt → blood pressure) does not automatically translate into large, decisive observational associations for **hard outcomes** (e.g., CVD events). Consider measurement error, latency, confounding, and competing risks.

_Write your answer in a new Markdown cell below._

### Checkpoint
- Note any odd ranges or surprising values you want to revisit.
- Identify two variables you expect to adjust for in a **red meat → cancer** model.
- Next session: we formalise study designs and exposure/outcome assessment.