# FB2NEP Workbook 10 – Causal Inference in Nutritional Epidemiology: Where Next?

Version 0.0.5

This workbook provides an accessible introduction to **modern causal inference approaches** that extend beyond traditional regression.

By the end of this workbook, you should understand the principles behind:

- **Mendelian randomisation (MR)**: principles, required assumptions, and a practical demonstration using simulated genetic instruments.
- **Negative controls**: using exposures or outcomes that should not be associated to detect uncontrolled confounding.
- **G-methods** (brief introduction): g-formula, inverse probability weighting (IPW), and marginal structural models (MSM).
- **Trial emulation using cohort data**: designing an observational analysis that mimics a randomised controlled trial.

The approach is conceptual. The aim is not to perform analyses that require external consortium-level infrastructure, but rather to show **how these methods work**, how they relate to familiar epidemiological ideas, and how they can be implemented in practice.

For clarity:

- All examples use the **synthetic FB2NEP cohort**.
- MR sections use only *simulated* genetic instruments; no real genetic data are distributed to students.
- Where specialist software is required (for example, `TwoSampleMR`), we illustrate the workflow but avoid large downloads.




## 0. Initialise the repository and load the synthetic FB2NEP cohort

Run this cell first. It ensures that the notebook can run on Google Colab or in a local Jupyter environment.


In [1]:
# FB2NEP bootstrap cell – use in *all* workbooks
#
# This cell initialises the repository context and loads the synthetic cohort
# into a DataFrame called df. It tries a few possible locations for scripts/bootstrap.py.

import pathlib
import runpy

bootstrap_candidates = [
    "scripts/bootstrap.py",
    "../scripts/bootstrap.py",
    "../../scripts/bootstrap.py",
]

bootstrap_ns = None

for rel in bootstrap_candidates:
    p = pathlib.Path(rel)
    if p.exists():
        print(f"Loading bootstrap from: {p}")
        bootstrap_ns = runpy.run_path(str(p))
        break
else:
    raise FileNotFoundError(
        "Could not find scripts/bootstrap.py. "
        "Please check that you are running this notebook inside fb2nep-epi."
    )

if "init" not in bootstrap_ns:
    raise RuntimeError("bootstrap.py does not define init().")

df, CTX = bootstrap_ns["init"]()

REPO_ROOT = CTX.repo_root
CSV_REL = CTX.csv_rel
IN_COLAB = CTX.in_colab

print("Repository root:", REPO_ROOT)
print("Main dataset:", CSV_REL)
print("df shape:", df.shape)
print("IN_COLAB:", IN_COLAB)

Loading bootstrap from: ../scripts/bootstrap.py
Dataset found: data/synthetic/fb2nep.csv ✅
Repository root: /Users/gunter/Documents/fb2nep-epi
Main dataset: data/synthetic/fb2nep.csv
df shape: (25000, 27)
IN_COLAB: False


## 1. Mendelian randomisation (MR)

### 1.1 Conceptual introduction

Mendelian randomisation exploits the **random allocation of alleles at conception** as a natural experiment. Genetic variants associated with an exposure of interest (for example, LDL-cholesterol or alcohol consumption) are used as **instrumental variables**.

MR relies on three key assumptions:

1. **Relevance**: The genetic variant is associated with the exposure.
2. **Independence**: The genetic variant is independent of confounders of the exposure–outcome relationship.
3. **Exclusion restriction**: The variant affects the outcome *only* through the exposure of interest.
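
For a single variant $Z$, exposure $X$ and outcome $Y$, these assumptions motivate the Wald ratio estimator. The notation below is generic and is not taken from the workbook code:

$$
\hat{\beta}_{\mathrm{IV}} = \frac{\hat{\beta}_{ZY}}{\hat{\beta}_{ZX}}
$$

Here $\hat{\beta}_{ZY}$ is the estimated association of the variant with the outcome and $\hat{\beta}_{ZX}$ its estimated association with the exposure. Interpreting the ratio as a causal effect typically requires a further assumption, such as linear and homogeneous effects.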

In nutritional epidemiology, MR can be informative for questions such as:

- Does higher BMI cause increased risk of type 2 diabetes?
- Does alcohol intake causally affect blood pressure?
- Do circulating nutrient levels (for example, vitamin D) have causal effects on health outcomes?

Because the FB2NEP cohort does not contain real genetic data, we simulate a small set of plausible genetic instruments to illustrate the workflow.


### 1.2 Simulated genetic instruments

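We create three simulated single-nucleotide polymorphisms (SNPs), coded 0/1/2 for the number of effect alleles. In a real analysis, instruments are chosen because they are associated with the exposure (here, a continuous phenotype such as BMI). Note that the genotypes in the next cell are drawn independently of every cohort variable, so any association with BMI in the first-stage regression arises by chance alone.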


In [None]:
import numpy as np
import pandas as pd

np.random.seed(11088)

# Copy the main dataset to avoid accidental modification.
df_mr = df.copy()

# Simulate three SNPs assuming Hardy–Weinberg equilibrium with varying allele frequencies.
allele_freqs = [0.15, 0.25, 0.40]

for i, p in enumerate(allele_freqs, start=1):
    # Each SNP coded 0/1/2 for number of effect alleles.
    df_mr[f"SNP{i}"] = np.random.binomial(n=2, p=p, size=len(df_mr))

# Display the first rows to verify structure.
df_mr[["SNP1", "SNP2", "SNP3"]].head()
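
For comparison, the following minimal sketch simulates SNPs that do influence a BMI-like exposure, and checks the genotype frequencies of the first SNP against Hardy–Weinberg expectations. The per-allele effects, the baseline of 27, the noise level and the names `snp_effects` and `bmi_sim` are illustrative assumptions; nothing here uses or modifies the FB2NEP cohort.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 25_000
allele_freqs = [0.15, 0.25, 0.40]

# Genotypes coded 0/1/2, drawn under Hardy-Weinberg equilibrium.
G = np.column_stack([rng.binomial(2, p, size=n) for p in allele_freqs])

# Hypothetical per-allele effects on a BMI-like exposure (illustrative values only).
snp_effects = np.array([0.5, 0.4, 0.3])
bmi_sim = 27.0 + G @ snp_effects + rng.normal(0, 4.0, size=n)

# Expected genotype frequencies under HWE for SNP1: (1-p)^2, 2p(1-p), p^2.
p = allele_freqs[0]
expected = np.array([(1 - p) ** 2, 2 * p * (1 - p), p ** 2])
observed = np.bincount(G[:, 0], minlength=3) / n

print("Expected:", np.round(expected, 3))
print("Observed:", np.round(observed, 3))
```

Because the exposure is built from the genotypes, a first-stage regression on these simulated data would show a genuine, if modest, association.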

### 1.3 First-stage regression (SNP → exposure)

In MR, the first step is to show that the instrument is associated with the exposure. Here we demonstrate the principle using BMI as the exposure.


In [None]:
import statsmodels.api as sm

X = df_mr[["SNP1", "SNP2", "SNP3"]]
X = sm.add_constant(X)
y = df_mr["BMI"]

model_first = sm.OLS(y, X).fit()
model_first.summary()
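
Instrument strength is often summarised by the first-stage F-statistic, with values above roughly 10 used as a common rule of thumb. The sketch below computes it by hand on simulated data so that the arithmetic is visible. The data, effect sizes and variable names are assumptions for illustration. On the cohort, the corresponding quantity should be available as `model_first.fvalue`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25_000

# Simulated genotypes (0/1/2) and an exposure with hypothetical SNP effects.
G = np.column_stack([rng.binomial(2, p, size=n) for p in (0.15, 0.25, 0.40)])
exposure = 27.0 + G @ np.array([0.5, 0.4, 0.3]) + rng.normal(0, 4.0, size=n)

# First-stage OLS (intercept plus three SNPs) via least squares.
X = np.column_stack([np.ones(n), G])
coef, *_ = np.linalg.lstsq(X, exposure, rcond=None)
resid = exposure - X @ coef

# R^2 and the overall F-statistic for k instruments.
k = G.shape[1]
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((exposure - exposure.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(f"R^2 = {r2:.4f}, F = {f_stat:.1f}")
```

Note that a large F-statistic can coexist with a very small R²: in large samples, instruments explaining around 1% of the variance in the exposure can still be strong enough.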

### 1.4 Second-stage regression (exposure → outcome) using predicted exposure

We predict BMI using the genetic instruments and then regress the health outcome on the **genetically predicted BMI**, not the observed BMI.

We choose systolic blood pressure (SBP) as the synthetic outcome.


In [None]:
# Predicted BMI from stage 1
df_mr["BMI_hat"] = model_first.predict(X)

# Second-stage regression of outcome on predicted exposure
Y = df_mr["SBP"]
X2 = sm.add_constant(df_mr["BMI_hat"])

model_second = sm.OLS(Y, X2).fit()
model_second.summary()
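
To see why the genetically predicted exposure is used, it helps to simulate data in which the true effect is known and an unmeasured confounder is present. The following sketch is entirely synthetic: `true_effect`, the confounder `U` and all coefficients are assumed values. It compares ordinary regression with the two-stage estimate and with the equivalent Wald ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_effect = 1.0  # hypothetical causal effect of exposure on outcome

G = rng.binomial(2, 0.3, size=n).astype(float)       # instrument (one SNP)
U = rng.normal(size=n)                               # unmeasured confounder
X = 0.5 * G + U + rng.normal(size=n)                 # exposure
Y = true_effect * X + 2.0 * U + rng.normal(size=n)   # outcome


def slope(x, y):
    """OLS slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


# Conventional regression: biased by the unmeasured confounder U.
beta_ols = slope(X, Y)

# Stage 1: predict the exposure from the instrument.
X_hat = X.mean() + slope(G, X) * (G - G.mean())

# Stage 2: regress the outcome on the predicted exposure.
beta_2sls = slope(X_hat, Y)

# With a single instrument, two-stage least squares equals the Wald ratio.
beta_wald = slope(G, Y) / slope(G, X)

print(f"OLS:  {beta_ols:.2f}")
print(f"2SLS: {beta_2sls:.2f}")
print(f"Wald: {beta_wald:.2f}")
```

In this set-up the ordinary regression estimate is pulled away from the true effect by `U`, whereas the instrument-based estimates should lie close to it.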

### 1.5 Interpretation and limitations

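This two-stage approach illustrates the logic of MR but is not a robust MR analysis. In particular, the standard errors from a manually run second stage are not valid, because they treat the predicted exposure as if it were observed. Real pipelines use specialised estimators, multiple sensitivity analyses, and large GWAS summary statistics.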

Key limitations to emphasise to students:

- Pleiotropy can invalidate the exclusion restriction.
- Weak instruments can bias results; in a one-sample analysis such as this one, the bias is towards the confounded observational estimate.
- MR typically estimates lifetime or long-term effects, not short-term nutritional interventions.

Nevertheless, MR provides an important complement to traditional epidemiology.
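
One widely used summary-statistic estimator is the inverse-variance weighted (IVW) estimate, which combines per-SNP Wald ratios. The sketch below derives the SNP associations from a single simulated sample, purely to keep the example self-contained; real two-sample MR would take them from separate GWAS. All data, effect sizes and the helper `slope_and_se` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
true_effect = 0.5  # hypothetical causal effect

G = np.column_stack(
    [rng.binomial(2, p, size=n) for p in (0.15, 0.25, 0.40)]
).astype(float)
U = rng.normal(size=n)                                    # unmeasured confounder
X = G @ np.array([0.5, 0.4, 0.3]) + U + rng.normal(size=n)
Y = true_effect * X + U + rng.normal(size=n)


def slope_and_se(g, y):
    """Slope and standard error from a simple regression of y on g."""
    gc = g - g.mean()
    yc = y - y.mean()
    b = gc @ yc / (gc @ gc)
    resid = yc - b * gc
    se = np.sqrt(resid @ resid / (len(y) - 2) / (gc @ gc))
    return b, se


# Per-SNP associations with the exposure (bx) and the outcome (by, se_by).
bx = np.array([slope_and_se(G[:, j], X)[0] for j in range(3)])
by, se_by = np.array([slope_and_se(G[:, j], Y) for j in range(3)]).T

# Wald ratio per SNP, combined with inverse-variance weights.
wald = by / bx
w = bx ** 2 / se_by ** 2
beta_ivw = np.sum(w * wald) / np.sum(w)
se_ivw = np.sqrt(1 / np.sum(w))  # fixed-effect standard error

print("Wald ratios:", np.round(wald, 2))
print(f"IVW estimate: {beta_ivw:.2f} (SE {se_ivw:.3f})")
```

Marked disagreement between the individual Wald ratios would be one informal warning sign of pleiotropy.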


## 2. Negative controls

Negative controls are variables that should **not** be causally related to the outcome (or the exposure) of interest if the assumed causal model is correct, but that are subject to the same sources of bias as the main analysis.

- **Negative control exposure**: an exposure that should have no effect on the outcome.
- **Negative control outcome**: an outcome that should not plausibly be affected by the exposure.

If a strong association is observed nevertheless, this indicates **residual confounding**, measurement error, or selection bias.

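In the FB2NEP cohort we illustrate a negative control exposure: a synthetic variable designed to be unrelated to the outcome. Because this variable is pure noise, it shares no confounding structure with any real exposure; it demonstrates the mechanics of the check rather than a realistic negative control.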


In [None]:
np.random.seed(11088)

# Create a synthetic variable unrelated to health outcomes.
df_mr["NC_exposure"] = np.random.normal(loc=0, scale=1, size=len(df_mr))

X = sm.add_constant(df_mr["NC_exposure"])
Y = df_mr["SBP"]

model_nc = sm.OLS(Y, X).fit()
model_nc.summary()
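
A negative control is most informative when it shares the confounding structure of the main analysis. The following sketch simulates a negative control outcome that is driven by the same confounder as the real outcome but is not affected by the exposure. The confounder `U`, the coefficients and the helper `ols` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

U = rng.normal(size=n)                                    # shared confounder
exposure = 0.8 * U + rng.normal(size=n)
outcome = 0.5 * exposure + 1.0 * U + rng.normal(size=n)   # truly affected by exposure
nc_outcome = 1.0 * U + rng.normal(size=n)                 # NOT affected by exposure


def ols(y, *cols):
    """Return OLS coefficients (intercept first) of y on the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Unadjusted: the exposure appears to 'affect' the negative control outcome.
b_unadj = ols(nc_outcome, exposure)[1]

# Adjusted for the confounder: the spurious association disappears.
b_adj = ols(nc_outcome, exposure, U)[1]

print(f"Unadjusted: {b_unadj:.2f}")
print(f"Adjusted:   {b_adj:.2f}")
```

A clearly non-zero unadjusted coefficient for the negative control outcome signals that the main exposure–outcome estimate is likely to be confounded as well.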

## 3. G-methods: a brief introduction

G-methods were developed for settings in which traditional regression fails because **time-varying confounders** are themselves affected by prior exposure.

Three key ideas:

- **G-formula**: models the outcome conditional on exposure and covariates, then averages the predicted outcomes over the covariate distribution under each exposure strategy to obtain counterfactual means.
- **Inverse probability weighting (IPW)**: weights individuals by the inverse of the probability of receiving their observed exposure pattern.
- **Marginal structural models (MSMs)**: use IPW to estimate causal effects in the presence of time-varying confounding.
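
The g-formula idea can be sketched for a single time point as standardisation: fit an outcome model, predict everyone's outcome under exposure and under no exposure, and average the difference. The data below are simulated, and `true_effect`, the confounder `L` and the helper `design` are hypothetical; the time-varying case requires considerably more machinery.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
true_effect = -3.0  # hypothetical effect of the exposure on the outcome

L = rng.normal(size=n)                              # baseline confounder
A = rng.binomial(1, 1 / (1 + np.exp(-0.8 * L)))     # exposure depends on L
Y = 130 + true_effect * A + 4.0 * L + rng.normal(0, 5, size=n)


def design(a, l):
    """Design matrix with intercept, exposure, confounder and their interaction."""
    return np.column_stack([np.ones(len(l)), a, l, a * l])


# Step 1: fit the outcome model on the observed data.
coef, *_ = np.linalg.lstsq(design(A, L), Y, rcond=None)

# Step 2: predict each person's outcome under both exposure levels.
y1 = design(np.ones(n), L) @ coef
y0 = design(np.zeros(n), L) @ coef

# Step 3: average over the observed confounder distribution.
ate_gformula = np.mean(y1 - y0)
naive = Y[A == 1].mean() - Y[A == 0].mean()

print(f"Naive difference: {naive:.2f}")
print(f"G-formula ATE:    {ate_gformula:.2f}")
```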

Here we illustrate IPW using a simplified binary exposure.


In [None]:
# For demonstration, define a binary exposure: high physical activity
df_ipw = df.copy()
df_ipw["highPA_bin"] = (df_ipw["highPA"] > df_ipw["highPA"].median()).astype(int)

# Step 1: Estimate propensity scores
import statsmodels.api as sm

X = sm.add_constant(df_ipw[["age", "BMI", "SES"]])  # illustrative covariates
y = df_ipw["highPA_bin"]

ps_model = sm.Logit(y, X).fit()
df_ipw["ps"] = ps_model.predict(X)

# Step 2: Compute inverse probability weights
df_ipw["w"] = np.where(df_ipw["highPA_bin"] == 1,
                        1 / df_ipw["ps"],
                        1 / (1 - df_ipw["ps"]))

# Step 3: Weighted regression of outcome (e.g. SBP) on exposure
Y = df_ipw["SBP"]
X = sm.add_constant(df_ipw["highPA_bin"])

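# Use robust (sandwich) standard errors: the default WLS errors treat the
# weights as known precision weights, which is not appropriate for IPW.
ipw_model = sm.WLS(Y, X, weights=df_ipw["w"]).fit(cov_type="HC1")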
ipw_model.summary()
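
In practice, weights are usually stabilised and inspected before use, because extreme weights make estimates unstable. The sketch below uses simulated data with a single binary confounder, so the propensity score can be estimated from stratum-specific proportions without fitting a model. All variables and values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
true_effect = -3.0  # hypothetical effect of the exposure on the outcome

L = rng.binomial(1, 0.4, size=n)                   # binary confounder
A = rng.binomial(1, np.where(L == 1, 0.7, 0.3))    # exposure more likely if L == 1
Y = 130 + true_effect * A + 6.0 * L + rng.normal(0, 5, size=n)

# Propensity score: proportion exposed within each stratum of L.
ps = np.where(L == 1, A[L == 1].mean(), A[L == 0].mean())

# Stabilised weights: marginal probability of the observed exposure
# divided by its conditional probability given L.
p_marg = A.mean()
sw = np.where(A == 1, p_marg / ps, (1 - p_marg) / (1 - ps))

# Diagnostics: the mean should be close to 1 and the range should be modest.
print(f"mean = {sw.mean():.3f}, min = {sw.min():.3f}, max = {sw.max():.3f}")


def wmean(y, w):
    """Weighted mean of y with weights w."""
    return np.sum(w * y) / np.sum(w)


ate_ipw = wmean(Y[A == 1], sw[A == 1]) - wmean(Y[A == 0], sw[A == 0])
naive = Y[A == 1].mean() - Y[A == 0].mean()

print(f"Naive difference: {naive:.2f}")
print(f"IPW estimate:     {ate_ipw:.2f}")
```

With continuous covariates, the stratum proportions would be replaced by predictions from a propensity model such as the logistic regression used in the cell above.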

## 4. Trial emulation using cohort data

Randomised controlled trials are often infeasible in nutrition. Trial emulation attempts to reproduce, as closely as possible, the key design features of an RCT within observational data.

**Core components:**

1. Eligibility criteria
2. Time zero (clear baseline)
3. Well-defined exposure strategies
4. Follow-up and censoring rules
5. Pre-specified causal contrast
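
The first four components can be made concrete as explicit data-preparation steps. The following sketch uses a tiny made-up table; the column names (`prior_cvd`, `pa_minutes_week`, `baseline_date`), the age range, the 150-minute threshold and the five-year follow-up are all hypothetical choices, not features of the FB2NEP cohort.

```python
import pandas as pd

# Hypothetical mini-cohort for illustration only.
cohort = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "age": [45, 72, 58, 39, 63],
    "prior_cvd": [0, 0, 1, 0, 0],
    "pa_minutes_week": [200, 40, 160, 90, 30],
    "baseline_date": pd.to_datetime(
        ["2015-01-10", "2015-02-01", "2015-03-15", "2015-01-20", "2015-04-02"]
    ),
})

# 1. Eligibility: aged 40-70 with no prior cardiovascular disease.
is_eligible = cohort["age"].between(40, 70) & (cohort["prior_cvd"] == 0)
eligible = cohort[is_eligible].copy()

# 2. Time zero: eligibility, exposure assignment and start of follow-up coincide.
eligible["time_zero"] = eligible["baseline_date"]

# 3. Exposure strategies, defined from information available at time zero.
eligible["strategy"] = (eligible["pa_minutes_week"] >= 150).map(
    {True: "high_PA", False: "usual_PA"}
)

# 4. Follow-up: administrative end five years after time zero.
eligible["followup_end"] = eligible["time_zero"] + pd.DateOffset(years=5)

print(eligible[["id", "strategy", "time_zero", "followup_end"]])
```

Writing the protocol as code in this way makes it harder to define exposure or eligibility using information from after time zero.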

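Here we illustrate a very simple emulation: a trial of increasing daily physical activity to reduce systolic blood pressure. The model below is deliberately minimal and unadjusted: it defines the 'intervention' groups but does not yet emulate randomisation, which would require adjustment for baseline confounders.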


In [None]:
# Simplified trial emulation: treat highPA (>median) as 'intervention'

df_te = df.copy()
df_te["intervention"] = (df_te["highPA"] > df_te["highPA"].median()).astype(int)

# Model the outcome at follow-up (synthetic: SBP)
Y = df_te["SBP"]
X = sm.add_constant(df_te["intervention"])

trial_model = sm.OLS(Y, X).fit()
trial_model.summary()

## 5. Summary and further reading

This workbook introduced several modern causal inference approaches that are increasingly relevant in nutritional epidemiology.

Students should take away the following key points:

- MR uses genetic variants as instruments but requires strong assumptions.
- Negative controls help detect unmeasured confounding.
- G-methods allow estimation of causal effects under time-varying confounding.
- Trial emulation provides a rigorous framework for analysing cohort data in a quasi-experimental way.

For further reading:

- Hernán and Robins: *Causal Inference – What If?*
- Davey Smith and Ebrahim: Mendelian randomisation foundational papers.
- Danaei et al.: Trial emulation case studies using large observational cohorts.
