# FB2NEP Workbook 9 – Causal Inference in Nutritional Epidemiology: Where Next?

This workbook is mainly conceptual. We briefly illustrate:

- Mendelian randomisation (using a small simulated example).
- The idea of negative controls.
- G-methods and trial emulation (described in text).

The synthetic FB2NEP cohort is available as `df` if you wish to explore further.

In [None]:
import os
import sys
import runpy
import pathlib
import subprocess

REPO_URL = "https://github.com/ggkuhnle/fb2nep-epi.git"
REPO_NAME = "fb2nep-epi"

# 1. If we are in Colab and scripts/bootstrap.py is not present,
#    clone the repository and change into it.
if "google.colab" in sys.modules and not pathlib.Path("scripts/bootstrap.py").exists():
    root = pathlib.Path("/content")
    repo_dir = root / REPO_NAME

    if not repo_dir.exists():
        print(f"Cloning {REPO_URL} …")
        subprocess.run(["git", "clone", REPO_URL], check=True)

    os.chdir(repo_dir)
    print("Changed working directory to:", os.getcwd())

# 2. Now try to locate and run scripts/bootstrap.py
for p in ["scripts/bootstrap.py", "../scripts/bootstrap.py", "../../scripts/bootstrap.py"]:
    if pathlib.Path(p).exists():
        print(f"Bootstrapping via: {p}")
        runpy.run_path(p)
        break
else:
    print("⚠️ scripts/bootstrap.py not found – "
          "please check that the FB2NEP repository is available.")


In [None]:
import pandas as pd

# Load the main synthetic cohort used in all FB2NEP workbooks
df = pd.read_csv("data/synthetic/fb2nep.csv")

# Quick check: first rows
df.head()

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

## 1. Mendelian randomisation (toy example)

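This is a simple two-stage least squares (2SLS) example with a genetic instrument `G`
(allele count: 0, 1 or 2), an exposure `X`, and an outcome `Y`. The simulated true
effect of `X` on `Y` is 0.8. No confounder is simulated, so the naïve and IV estimates
should both be close to 0.8, with the IV estimate being less precise.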

In [None]:
np.random.seed(11088)
n = 1000
G = np.random.binomial(2, 0.3, size=n)
X = 0.5 * G + np.random.normal(0, 1, size=n)
Y = 0.8 * X + np.random.normal(0, 1, size=n)
df_iv = pd.DataFrame({"G": G, "X": X, "Y": Y})

# Naïve regression of Y on X
X_design = sm.add_constant(df_iv["X"])
model_naive = sm.OLS(df_iv["Y"], X_design).fit()
print("Naïve association (Y on X):")
print(model_naive.params)

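# Two-stage least squares (simple IV), done by hand for transparency
# Stage 1: regress the exposure X on the instrument G
X1 = sm.add_constant(df_iv["G"])
stage1 = sm.OLS(df_iv["X"], X1).fit()
X_hat = stage1.predict(X1)  # genetically predicted exposure

# Stage 2: regress the outcome Y on the predicted exposure.
# The point estimate is valid, but the standard errors from this manual
# second stage are not, because they ignore the uncertainty in X_hat.
X2 = sm.add_constant(X_hat)
stage2 = sm.OLS(df_iv["Y"], X2).fit()
print("\nIV estimate (two-stage):")
print(stage2.params)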
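
The toy simulation above contains no confounder, so it cannot show why an instrument is useful. The following sketch adds a hypothetical unmeasured confounder `U` (not part of the original example; all coefficients and the sample size are illustrative assumptions) and compares the naïve regression slope with the Wald ratio, which equals the 2SLS estimate when there is a single instrument.

```python
import numpy as np

rng = np.random.default_rng(11088)
n = 50_000  # larger sample than the toy example, to reduce IV noise

# Hypothetical unmeasured confounder U that affects both X and Y
U = rng.normal(0, 1, size=n)
G = rng.binomial(2, 0.3, size=n)                   # instrument, independent of U
X = 0.5 * G + 1.0 * U + rng.normal(0, 1, size=n)   # exposure
Y = 0.8 * X + 1.0 * U + rng.normal(0, 1, size=n)   # outcome; true effect of X is 0.8

# Naive OLS slope of Y on X (confounded by U)
naive_slope = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)

# Wald ratio: cov(G, Y) / cov(G, X)
wald_ratio = np.cov(G, Y)[0, 1] / np.cov(G, X)[0, 1]

print(f"Naive slope: {naive_slope:.2f}")
print(f"Wald ratio (IV): {wald_ratio:.2f}")
```

Under these assumptions the naïve slope should be clearly above 0.8, while the Wald ratio should be close to 0.8, because `G` is independent of `U` and affects `Y` only through `X`.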

## 2. Negative control designs

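A negative control outcome is an outcome that the exposure of interest cannot plausibly
cause; a negative control exposure is an exposure that cannot plausibly cause the outcome
of interest. Ideally, each shares the confounding structure of the real exposure-outcome
pair. If an association is nevertheless observed, this suggests residual confounding or
other bias.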

In the synthetic FB2NEP cohort, you might search for a variable that should not
plausibly be affected by a dietary exposure of interest and use it as a negative
control outcome.
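
A minimal sketch of the logic, using simulated data rather than the FB2NEP cohort: the variable names (`ses`, `exposure`, `nc_outcome`) and all coefficients are hypothetical. The exposure has no effect on the negative control outcome, but both depend on a shared confounder, so a crude association appears and then disappears after adjustment.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20_000

# Hypothetical confounder (for example, socio-economic status)
ses = rng.normal(0, 1, size=n)
# Dietary exposure, influenced by the confounder
exposure = 0.6 * ses + rng.normal(0, 1, size=n)
# Negative control outcome: depends on the confounder, NOT on the exposure
nc_outcome = 0.5 * ses + rng.normal(0, 1, size=n)


def ols_coefs(y, *covariates):
    """Return OLS coefficients (intercept first) via least squares."""
    design = np.column_stack([np.ones(len(y)), *covariates])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


crude = ols_coefs(nc_outcome, exposure)[1]          # exposure coefficient, unadjusted
adjusted = ols_coefs(nc_outcome, exposure, ses)[1]  # exposure coefficient, adjusted for ses

print(f"Crude association: {crude:.3f}")
print(f"Adjusted for confounder: {adjusted:.3f}")
```

In a real analysis the confounder is often unmeasured, so only the crude association is available; a clearly non-zero value for a negative control outcome is then a warning sign for the main analysis.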

## 3. G-methods and trial emulation

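G-methods (for example, the g-formula, inverse probability weighting of marginal
structural models, and g-estimation) and target trial emulation are designed for
longitudinal data with time-varying exposures and confounders. They are needed in
particular when a time-varying confounder is itself affected by earlier exposure, a
situation in which conventional regression adjustment can give biased estimates.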

For FB2NEP the main learning goal is to recognise these methods in the literature
and understand their assumptions, rather than to implement them in detail.
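
Purely to make the weighting idea concrete, here is a minimal sketch of inverse probability weighting, the approach usually used to fit marginal structural models. It is deliberately simplified to a single time point with one binary confounder; the variable names (`L`, `A`, `Y`) and all values are illustrative assumptions, and real applications involve repeated measurements and modelled weights.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# Binary confounder L, binary exposure A (more likely when L = 1), outcome Y
L = rng.binomial(1, 0.5, size=n)
p_A = np.where(L == 1, 0.7, 0.3)
A = rng.binomial(1, p_A)
Y = 1.0 * A + 2.0 * L + rng.normal(0, 1, size=n)  # true effect of A is 1.0

# Crude comparison of exposed and unexposed (confounded by L)
crude = Y[A == 1].mean() - Y[A == 0].mean()

# Estimate P(A = 1 | L) within each stratum of L
ps = np.where(L == 1, A[L == 1].mean(), A[L == 0].mean())

# Weight each person by 1 / P(received their own exposure level | L)
w = np.where(A == 1, 1 / ps, 1 / (1 - ps))

# Weighted difference in means: in the weighted pseudo-population,
# L is balanced between exposed and unexposed
ipw = (np.average(Y[A == 1], weights=w[A == 1])
       - np.average(Y[A == 0], weights=w[A == 0]))

print(f"Crude difference: {crude:.2f}")
print(f"IPW estimate: {ipw:.2f}")
```

Under these assumptions the crude difference overstates the effect, while the weighted estimate should be close to the simulated value of 1.0. The result relies on `L` being the only confounder and on every stratum of `L` containing both exposed and unexposed people.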