# 02 · Study Designs, Exposures & Outcomes

> **Learning objectives**
- Distinguish major epidemiological designs (cross-sectional, cohort, case–control, RCT).
- Explain why nutrition studies are often observational.
- Contrast self-reported vs biomarker exposures.
- Recognise outcome definitions and risks of misclassification.
- Illustrate design logic using DAGs (Directed Acyclic Graphs).

---

In [None]:
# Bootstrap (works in Colab or locally) — loads df via scripts/bootstrap.py
import runpy, pathlib
for p in ["scripts/bootstrap.py","../scripts/bootstrap.py","../../scripts/bootstrap.py"]:
    if pathlib.Path(p).exists():
        print(f"Bootstrapping via: {p}")
        runpy.run_path(p)
        break
else:
    raise FileNotFoundError("scripts/bootstrap.py not found")

print(df.shape, "— dataset ready")

## 1. Study designs in nutrition research

| Design              | Strengths                                | Weaknesses                              | Typical use |
|---------------------|------------------------------------------|-----------------------------------------|-------------|
| Cross-sectional     | Quick, cheap; hypothesis generation      | No temporality, prone to confounding     | Surveys     |
| Cohort              | Temporal sequence; multiple outcomes     | Costly; loss to follow-up; confounding   | Diet & CVD  |
| Case–control        | Efficient for rare outcomes              | Recall & selection bias                  | Cancer      |
| Nested case–control | Stored biosamples; efficient             | Smaller sample; matching issues          | Biomarkers  |
| RCT                 | Randomisation reduces confounding        | Expensive, short-term; ethics            | Salt/BP     |
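The "randomisation reduces confounding" entry can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the course dataset: the variable names, the effect size of 1.0 and the strength of the confounder are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000
true_effect = 1.0  # assumed causal effect of exposure on outcome (arbitrary units)

# A single unmeasured confounder that raises the outcome in both scenarios
confounder = rng.normal(size=n)

# Observational setting: people with a higher confounder value are more likely to be exposed
x_obs = (confounder + rng.normal(size=n) > 0).astype(int)
y_obs = true_effect * x_obs + 2.0 * confounder + rng.normal(size=n)

# Randomised setting: exposure is assigned by coin flip, independent of the confounder
x_rct = rng.integers(0, 2, size=n)
y_rct = true_effect * x_rct + 2.0 * confounder + rng.normal(size=n)

def crude_difference(x, y):
    """Unadjusted difference in mean outcome, exposed minus unexposed."""
    return y[x == 1].mean() - y[x == 0].mean()

obs_diff = crude_difference(x_obs, y_obs)
rct_diff = crude_difference(x_rct, y_rct)
print(f"True effect:            {true_effect:.2f}")
print(f"Observational estimate: {obs_diff:.2f}")
print(f"Randomised estimate:    {rct_diff:.2f}")
```

Under these assumptions the crude observational contrast overstates the effect, while the randomised contrast lands close to the true value, because randomisation breaks the link between confounder and exposure.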

### Discussion
- Why are RCTs difficult in nutrition? (cost, compliance, long-term exposure).
- Why are cross-sectional designs weak for causal inference?
- Why do we still rely on cross-sectional data for policy signals?

## 2. Exposures: self-report vs biomarkers
- **Self-report**: 24-hour recall (24HR), food frequency questionnaire (FFQ), diet diaries → cheap and scalable, but prone to **misreporting**.
- **Biomarkers**: recovery (urinary nitrogen), concentration (plasma vitamin C), replacement (urinary sodium) → more objective, but costly and sometimes invasive.

🟢 **Practice**: check that fruit & veg intake (self-report) aligns with plasma vitamin C (biomarker).

In [None]:
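import pandas as pd
# Split self-reported fruit & veg intake into quintiles (fewer bins if cut points tie)
q = pd.qcut(df['fruit_veg_g_d'], 5, duplicates='drop')
# Mean plasma vitamin C per quintile; a rising trend suggests the two measures agree
df.groupby(q, observed=True)['plasma_vitC_umol_L'].mean().round(1)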
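To see why a noisy self-report weakens agreement with a biomarker, the following sketch simulates the situation. Everything here is hypothetical: the intake distribution, the biomarker relationship and the size of the reporting error are assumptions, and the error is taken to be classical (random and independent of true intake), which real dietary misreporting often is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical "true" usual intake in g/day (mean about 300, SD about 150)
true_intake = rng.gamma(shape=4.0, scale=75.0, size=n)
# Hypothetical biomarker that tracks true intake, plus biological variation
biomarker = 20 + 0.08 * true_intake + rng.normal(scale=10.0, size=n)
# Self-report = truth + random reporting error (assumed classical error)
self_report = true_intake + rng.normal(scale=150.0, size=n)

sim = pd.DataFrame({
    "true_intake": true_intake,
    "self_report": self_report,
    "biomarker": biomarker,
})

# Pearson correlation of each intake measure with the biomarker
r_true = sim["true_intake"].corr(sim["biomarker"])
r_reported = sim["self_report"].corr(sim["biomarker"])
print(f"Correlation, true intake vs biomarker: {r_true:.2f}")
print(f"Correlation, self-report vs biomarker: {r_reported:.2f}")
```

In this setup the self-reported measure still correlates with the biomarker, but more weakly than true intake does. A modest correlation in real data is therefore not, by itself, evidence that the two measures disagree about the underlying diet.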

## 3. Outcomes: definitions & misclassification
- **Clinical**: myocardial infarction, cancer diagnosis (often registry-confirmed).
- **Intermediate**: blood pressure, cholesterol.
- Misclassification can be:
  - *Non-differential* (error unrelated to exposure): usually biases towards the null.
  - *Differential*: can bias in either direction.
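The two cases can be worked through with simple arithmetic. The sketch below assumes a true risk of 10% in the unexposed and 20% in the exposed (risk ratio 2.0) and applies hypothetical sensitivity and specificity values for outcome ascertainment; none of these numbers come from the course dataset.

```python
def observed_risk(true_risk, sensitivity, specificity):
    """Apparent risk after imperfect outcome ascertainment.

    True cases are detected with probability `sensitivity`;
    non-cases are wrongly labelled as cases with probability 1 - specificity.
    """
    return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

risk_unexposed, risk_exposed = 0.10, 0.20  # assumed true risks
rr_true = risk_exposed / risk_unexposed

# Non-differential: same sensitivity and specificity in both exposure groups
rr_nondiff = (
    observed_risk(risk_exposed, sensitivity=0.7, specificity=0.95)
    / observed_risk(risk_unexposed, sensitivity=0.7, specificity=0.95)
)

# Differential: cases are detected more reliably among the exposed
rr_diff = (
    observed_risk(risk_exposed, sensitivity=0.9, specificity=0.95)
    / observed_risk(risk_unexposed, sensitivity=0.6, specificity=0.95)
)

print(f"True RR:             {rr_true:.2f}")
print(f"Non-differential RR: {rr_nondiff:.2f}")
print(f"Differential RR:     {rr_diff:.2f}")
```

With these inputs the non-differential error pulls the risk ratio towards 1, while the differential error pushes it above the true value. Swapping the two sensitivities in the differential case would bias it in the opposite direction.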

_Check integrity:_ if `CVD_incident` = 1 then `CVD_date` must not be empty (and likewise for `Cancer_incident` and `Cancer_date`).

In [None]:
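for flag, date in [("CVD_incident", "CVD_date"), ("Cancer_incident", "Cancer_date")]:
    is_case = df[flag].astype(int) == 1
    # A date counts as present if it is neither missing (NaN/NaT) nor an empty string
    has_date = df[date].notna() & (df[date].astype(str).str.strip() != "")
    # Implication "case => date": fails only for an incident case without a date
    assert (~is_case | has_date).all(), f"{date}: missing dates for incident cases"
print("Outcome-date integrity OK ✅")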

## 4. DAG illustration (exposure → outcome)
We can represent causal assumptions with a **Directed Acyclic Graph (DAG)**. Example: red meat → cancer, confounded by SES and smoking.

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
G.add_edges_from([
    ("SES","red_meat"),("SES","Cancer"),
    ("Smoking","red_meat"),("Smoking","Cancer"),
    ("red_meat","Cancer")
])
pos = {"SES":(-1,1),"Smoking":(1,1),"red_meat":(0,0),"Cancer":(0,-1)}
plt.figure(figsize=(4,4))
nx.draw(G,pos,with_labels=True,node_color="#e0f7fa",node_size=2000,arrows=True)
plt.title("DAG: confounding in red meat–cancer")
plt.show()
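A DAG can also be queried programmatically. The sketch below rebuilds the same graph and lists nodes that are ancestors of both exposure and outcome. This common-ancestor rule is a simplified heuristic for spotting candidate confounders in a small graph like this one, not a full implementation of the backdoor criterion.

```python
import networkx as nx

# Same edges as the DAG drawn in this section
G = nx.DiGraph()
G.add_edges_from([
    ("SES", "red_meat"), ("SES", "Cancer"),
    ("Smoking", "red_meat"), ("Smoking", "Cancer"),
    ("red_meat", "Cancer"),
])

# A causal diagram must contain no directed cycles
assert nx.is_directed_acyclic_graph(G)

exposure, outcome = "red_meat", "Cancer"

# Heuristic: a candidate confounder is a common cause (shared ancestor)
# of the exposure and the outcome
candidate_confounders = nx.ancestors(G, exposure) & nx.ancestors(G, outcome)
print("Candidate confounders:", sorted(candidate_confounders))
```

For larger graphs with mediators or colliders, a dedicated causal-inference tool is a safer choice than this shortcut.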

## TODO exercises
1. Compute crude incidence (%) of cancer by **SES group** (ABC1 vs C2DE). Do you see a gradient?
2. Stratify red meat intake by smoking status (never/former/current). Comment on possible confounding.
3. Draft (Markdown cell) one sentence describing the **strength** of cohort vs case–control for studying diet and cancer.

In [None]:
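# 1. Crude cancer incidence by SES (share of participants with an incident cancer)
res = df.groupby('SES')['Cancer_incident'].mean().round(3)
print((res * 100).round(1))  # shown as a percentage, as the exercise asks
# Sanity check on the lowest-incidence group only; bounds are proportions, not percentages
assert 0.05 <= res.min() <= 0.20, "SES-specific incidence should be plausible"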

In [None]:
# 2. Red meat intake by smoking
df.groupby('smoking_status')['red_meat_g_d'].mean().round(1)
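One way to reason about confounding in exercise 2 is to compare a crude association with the same association inside each smoking stratum. The sketch below uses synthetic data in which cancer risk depends only on smoking and red meat has no effect at all. The column `high_red_meat` and all probabilities are hypothetical and are not taken from the course dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200_000

smoking = rng.choice(["never", "former", "current"], size=n, p=[0.5, 0.3, 0.2])
# Assumed: smokers are more likely to eat a lot of red meat and more likely to develop cancer
p_high_meat = pd.Series(smoking).map({"never": 0.3, "former": 0.5, "current": 0.7}).to_numpy()
p_cancer = pd.Series(smoking).map({"never": 0.05, "former": 0.10, "current": 0.20}).to_numpy()

sim = pd.DataFrame({
    "smoking_status": smoking,
    "high_red_meat": (rng.random(n) < p_high_meat).astype(int),
    "cancer": (rng.random(n) < p_cancer).astype(int),
})

def risk_difference(d):
    """Cancer risk in high red meat eaters minus risk in low red meat eaters."""
    risk = d.groupby("high_red_meat")["cancer"].mean()
    return risk.loc[1] - risk.loc[0]

crude_rd = risk_difference(sim)
stratified_rd = pd.Series(
    {level: risk_difference(group) for level, group in sim.groupby("smoking_status")}
)
print(f"Crude risk difference: {crude_rd:.3f}")
print(stratified_rd.round(3))
```

Here the crude comparison suggests that red meat raises cancer risk, yet within each smoking stratum the difference is close to zero. That pattern is what confounding by smoking looks like under these assumptions.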

> ## Key takeaways
>
> - Different designs answer different questions; no single design is sufficient.
> - Nutrition research leans heavily on **cohorts** due to long-term exposures/outcomes.
> - Exposures: **self-report** is noisy; **biomarkers** strengthen inference but add cost.
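> - Outcomes: careful definitions matter; non-differential misclassification usually dilutes effects, while differential misclassification can bias in either direction.
> - DAGs clarify assumptions: which variables to adjust for and which to leave unadjusted.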

### Checkpoint
- Which design is best for **rare cancers**? Why?
- Which design is best for **short-term effects of diet on BP**? Why?
- Next session: Data foundations — cleaning, types, missingness.