# FB2NEP Workbook 7 – Confounding, DAGs, and Causal Structure

This workbook builds on Workbook 6.

In Workbook 6, we treated regression as a practical tool for estimating associations between an exposure and an outcome, and we focused on model types (linear, logistic, Cox), assumptions, diagnostics, and basic interpretation of coefficients (β, OR, RR, HR).

In this workbook, we move from **association** to **causal thinking**:

- What **confounding** is and why it biases estimates.
- What **colliders** are and how conditioning on them can create associations.
- What it means for a variable to be a **mediator**.
- How **directed acyclic graphs (DAGs)** help to clarify causal structure.
- How to decide **which variables to adjust for** in regression models.
- How this links to ideas such as **Bradford–Hill considerations** and **counterfactuals**.

We will use the synthetic *FB2NEP cohort* throughout. The precise variable names may differ slightly from those used here; if you obtain an error (for example, `KeyError`), carefully check the column names of the dataset and adapt the code accordingly.

As always, read the explanatory text carefully before running the code. The aim is not only to obtain numbers, but to understand what they *mean*.

In [None]:
# FB2NEP bootstrap cell – use in *all* workbooks
#
# This cell initialises the repository context and loads the synthetic cohort
# into a DataFrame called df. It tries a few possible locations for scripts/bootstrap.py.

import pathlib
import runpy

bootstrap_candidates = [
    "scripts/bootstrap.py",
    "../scripts/bootstrap.py",
    "../../scripts/bootstrap.py",
]

bootstrap_ns = None

for rel in bootstrap_candidates:
    p = pathlib.Path(rel)
    if p.exists():
        print(f"Loading bootstrap from: {p}")
        bootstrap_ns = runpy.run_path(str(p))
        break
else:
    raise FileNotFoundError(
        "Could not find scripts/bootstrap.py. "
        "Please check that you are running this notebook inside fb2nep-epi."
    )

if "init" not in bootstrap_ns:
    raise RuntimeError("bootstrap.py does not define init().")

df, CTX = bootstrap_ns["init"]()

REPO_ROOT = CTX.repo_root
CSV_REL = CTX.csv_rel
IN_COLAB = CTX.in_colab

print("Repository root:", REPO_ROOT)
print("Main dataset:", CSV_REL)
print("df shape:", df.shape)
print("IN_COLAB:", IN_COLAB)

In [None]:
r"""
Imports and quick inspection
============================

In this cell we:

- Import common packages used in this workbook.
- Display basic information about the dataset to confirm that it is loaded.

The imports are deliberately explicit. Many students using this workbook
will not yet have much experience with Python, so we avoid implicit
magic and keep the code readable.
"""

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import display

print("DataFrame shape (rows, columns):", df.shape)
print("\nFirst five rows of the dataset:")
display(df.head())

print("\nVariable types (first 20 columns):")
display(df.dtypes.head(20))

## 1. Association, causation, and causal structure

Regression models quantify **associations** between variables:
- In linear regression, we interpret a coefficient as the expected change in the outcome per unit change in the exposure.
- Logistic and Cox regression express associations as odds ratios and hazard ratios, respectively.

However, public health decisions concern **causal effects**:
> *If we changed this exposure (for example, salt intake), what would happen to the outcome?*

Causal interpretation requires assumptions about the **structure** of the system (what causes what) and an understanding of:
- Confounders
- Colliders
- Mediators
- Approaches to adjustment
- Counterfactual reasoning

The following sections introduce these ideas and show how they relate to regression models.

## 2. Confounding

### 2.1 Definition

A variable is a **confounder** if it:
1. Is associated with the exposure.
2. Is a cause of, or associated with a cause of, the outcome.
3. Is not on the causal pathway from exposure to outcome.

Uncontrolled confounding leads to **biased** effect estimates.
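
To make the bias concrete, the following is a minimal simulation sketch in which the true effect is known. Everything in it is an assumption made for illustration: the variable names `z`, `x`, `y`, the coefficients, and the helper `ols_slopes` are hypothetical and are not part of the FB2NEP cohort or its code.

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed, for reproducibility only
n = 20_000

# Hypothetical data-generating process: z causes both x and y.
z = rng.normal(0, 1, n)                          # confounder
x = 0.8 * z + rng.normal(0, 1, n)                # exposure depends on z
true_effect = 0.5                                # assumed causal effect of x on y
y = true_effect * x + 1.0 * z + rng.normal(0, 1, n)

def ols_slopes(outcome, *predictors):
    """Return least-squares slopes (intercept dropped) for the given predictors."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1:]

beta_crude = ols_slopes(y, x)[0]        # z omitted: confounded estimate
beta_adjusted = ols_slopes(y, x, z)[0]  # z included: confounding removed

print(f"True effect:       {true_effect:.3f}")
print(f"Crude estimate:    {beta_crude:.3f}")
print(f"Adjusted estimate: {beta_adjusted:.3f}")
```

In this set-up the crude slope overstates the effect because part of the `x`–`y` association is carried by `z`; including `z` in the model should bring the estimate close to the assumed value of 0.5.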

### 2.2 Example

Age may influence both fruit intake and BMI. If we estimate:

```text
BMI ~ fruit intake
```
without adjusting for age, our estimate may be distorted. We now demonstrate this.

In [None]:
OUTCOME_VAR = "bmi"
EXPOSURE_VAR = "fruit_g_day"
CONFOUND_VAR = "age_years"

for var in [OUTCOME_VAR, EXPOSURE_VAR, CONFOUND_VAR]:
    if var not in df.columns:
        raise KeyError(f"Variable '{var}' not found. Available: {list(df.columns)[:20]}")

df_conf = df[[OUTCOME_VAR, EXPOSURE_VAR, CONFOUND_VAR]].dropna()
print(f"Sample size: {df_conf.shape[0]} observations\n")

formula_crude = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR}"
formula_adj = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR} + {CONFOUND_VAR}"

model_crude = smf.ols(formula_crude, data=df_conf).fit()
model_adj = smf.ols(formula_adj, data=df_conf).fit()

def extract_effect(model, var):
    """Return coefficient, 95% CI bounds, and p-value for one model term."""
    beta = model.params[var]
    lo, hi = model.conf_int().loc[var]
    return beta, lo, hi, model.pvalues[var]

beta_c, lo_c, hi_c, p_c = extract_effect(model_crude, EXPOSURE_VAR)
beta_a, lo_a, hi_a, p_a = extract_effect(model_adj, EXPOSURE_VAR)

print("Crude:")
print(f"  β = {beta_c:.4f} (95% CI {lo_c:.4f} to {hi_c:.4f}); p={p_c:.4g}\n")

print("Adjusted (for age):")
print(f"  β = {beta_a:.4f} (95% CI {lo_a:.4f} to {hi_a:.4f}); p={p_a:.4g}\n")

print("Interpretation: compare crude vs adjusted. A material change in β is consistent with confounding by age.")

## 3. Colliders and selection bias

A **collider** is a variable that is *caused* by two others:

```text
exercise → fitness ← genes
```

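Conditioning on a collider (for example, selecting only high-fitness individuals) induces an association between its causes, even when they are independent in the full population. Intuitively, among highly fit people, those who exercise little are more likely to owe their fitness to their genes.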

Selection into a study is often a collider, leading to **selection bias**.

In [None]:
rng = np.random.default_rng(11088)
n = 5000

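# exercise and genes are generated independently, so they are uncorrelated by design
exercise = rng.normal(0, 1, n)
genes = rng.normal(0, 1, n)
# fitness is the collider: it is caused by both exercise and genes
fitness = 0.8 * exercise + 0.8 * genes + rng.normal(0, 1, n)
# risk depends on genes only; exercise has no causal effect on risk
risk_lin = 1.2 * genes + rng.normal(0, 1, n)
risk = (risk_lin > 0).astype(int)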

dfc = pd.DataFrame({"exercise": exercise, "genes": genes, "fitness": fitness, "risk": risk})

print("Correlation exercise–genes (should be ~0):")
print(dfc[["exercise","genes"]].corr(), "\n")

print("Full sample: exercise → risk (logistic):")
m_full = smf.logit("risk ~ exercise", dfc).fit(disp=False)
print(m_full.summary().tables[1])

thr = dfc["fitness"].quantile(0.70)
df_high = dfc[dfc["fitness"] >= thr]
print("\nHigh fitness sample size:", df_high.shape[0])

print("High fitness: exercise → risk (logistic):")
m_cond = smf.logit("risk ~ exercise", df_high).fit(disp=False)
print(m_cond.summary().tables[1])

print("\nInterpretation: conditioning on fitness (a collider) induces a spurious exercise-risk association, although exercise has no effect on risk in this simulation.")

## 4. Mediation

A **mediator** lies *on* the causal pathway:

```text
salt intake → blood pressure → stroke
```

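Adjusting for a mediator removes the part of the exposure effect that operates through it (the *indirect* pathway). When the aim is to estimate the **total** effect, mediators should *not* be adjusted for.

Whether to adjust for a variable therefore depends on the causal question being asked, which is why this distinction matters for model specification.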
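
The difference between total and direct effects can be sketched with a small simulation. The data below are entirely hypothetical: the names `x` (exposure), `m` (mediator), `y` (outcome), the coefficients, and the helper `ols_slopes` are assumptions chosen for illustration, and the linear, interaction-free structure is what makes the simple decomposition work.

```python
import numpy as np

rng = np.random.default_rng(314)  # arbitrary seed
n = 20_000

# Hypothetical linear structure: x -> m -> y, plus a direct path x -> y.
x = rng.normal(0, 1, n)                       # exposure
m = 0.7 * x + rng.normal(0, 1, n)             # mediator, caused by x
direct_effect = 0.3
y = direct_effect * x + 0.5 * m + rng.normal(0, 1, n)
total_effect = direct_effect + 0.7 * 0.5      # direct + indirect (0.35) = 0.65

def ols_slopes(outcome, *predictors):
    """Return least-squares slopes (intercept dropped) for the given predictors."""
    design = np.column_stack([np.ones(len(outcome)), *predictors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1:]

beta_total = ols_slopes(y, x)[0]       # mediator not adjusted for
beta_direct = ols_slopes(y, x, m)[0]   # mediator adjusted for

print(f"Unadjusted for m: {beta_total:.3f} (assumed total effect  {total_effect:.2f})")
print(f"Adjusted for m:   {beta_direct:.3f} (assumed direct effect {direct_effect:.2f})")
```

Under these assumptions, the model without `m` should recover roughly the total effect, whereas adjusting for `m` leaves only the direct path. In real data, mediator–outcome confounding can complicate this picture.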

## 5. Directed acyclic graphs (DAGs)

DAGs visually encode our causal assumptions.

They help identify:
- Backdoor (non-causal) paths.
- Variables that require adjustment.
- Variables that we must avoid conditioning on.

Example structure:

```text
age → fruit_intake
age → bmi
fruit_intake → bmi
```

Minimal sufficient adjustment set: `{ age }`.
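
As a rough illustration of what a "backdoor path" is, the sketch below encodes the example DAG as a list of edges and lists every path from exposure to outcome that starts with an arrow pointing *into* the exposure. The function names `simple_paths` and `backdoor_paths` are hypothetical helpers written for this workbook. The sketch only enumerates candidate paths; it does not check whether a path is already blocked by a collider, so it is not a substitute for a dedicated tool.

```python
# Edges of the example DAG as (cause, effect) pairs.
edges = [
    ("age", "fruit_intake"),
    ("age", "bmi"),
    ("fruit_intake", "bmi"),
]

def simple_paths(edges, start, end):
    """All paths from start to end that visit no node twice, ignoring arrow direction."""
    neighbours = {}
    for cause, effect in edges:
        neighbours.setdefault(cause, set()).add(effect)
        neighbours.setdefault(effect, set()).add(cause)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt in path:
                continue                      # do not revisit nodes
            if nxt == end:
                paths.append(path + [nxt])
            else:
                stack.append(path + [nxt])
    return paths

def backdoor_paths(edges, exposure, outcome):
    """Paths whose first edge points into the exposure (second node -> exposure)."""
    edge_set = set(edges)
    return [p for p in simple_paths(edges, exposure, outcome)
            if (p[1], p[0]) in edge_set]

paths = backdoor_paths(edges, "fruit_intake", "bmi")
for p in paths:
    print(" - ".join(p))
```

For this DAG the only backdoor path runs through `age`, which is why adjusting for age is sufficient here.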

## 6. DAG construction using `dagitty` syntax

Example DAG:

```text
dag {
    age -> fruit_intake
    age -> bmi
    fruit_intake -> bmi
}
```

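Paste this into the DAGitty web interface (`dagitty.net`) to visualise the graph. After marking `fruit_intake` as the exposure and `bmi` as the outcome, the tool reports the adjustment sets.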

## 7. Adjustment strategies

- **Regression modelling**: include confounders as covariates.
- **Stratification**: analyse within levels of the confounder; watch out for small strata.
- **Standardisation** (conceptual): re-weight stratum-specific estimates to a common (standard) confounder distribution; this can be done directly or indirectly.
- **Avoid inappropriate adjustment**: do not adjust for colliders; avoid adjusting for mediators when estimating total effects.
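
Stratification and direct standardisation can be sketched on simulated data as follows. The binary variables `older` and `exposed`, the outcome scale, and all coefficients are assumptions made for illustration; they do not correspond to columns of the FB2NEP cohort.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)  # arbitrary seed
n = 40_000

# Hypothetical data: being older raises both the chance of exposure and the outcome.
older = rng.binomial(1, 0.5, n)
exposed = rng.binomial(1, np.where(older == 1, 0.7, 0.3))
true_effect = -1.0
outcome = 25 + true_effect * exposed + 3.0 * older + rng.normal(0, 2, n)

sim = pd.DataFrame({"older": older, "exposed": exposed, "outcome": outcome})

# Crude comparison: ignores the confounder.
crude_means = sim.groupby("exposed")["outcome"].mean()
crude_diff = crude_means.loc[1] - crude_means.loc[0]

# Stratification: exposed-minus-unexposed difference within each level of `older`.
stratum_means = sim.groupby(["older", "exposed"])["outcome"].mean().unstack("exposed")
stratum_diff = stratum_means[1] - stratum_means[0]

# Direct standardisation: weight stratum differences by each stratum's share of the sample.
weights = sim["older"].value_counts(normalize=True).sort_index()
standardised_diff = float((stratum_diff * weights).sum())

print(f"Crude difference:        {crude_diff:.3f}")
print("Stratum-specific differences:")
print(stratum_diff.round(3))
print(f"Standardised difference: {standardised_diff:.3f}")
```

In this simulation the crude difference has the wrong sign, while the stratum-specific and standardised differences should lie close to the assumed effect of −1. With many strata or sparse data, some cells may be empty, which is one reason regression is often preferred in practice.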

## 8. Bradford–Hill considerations and counterfactuals

### Bradford–Hill

Hill proposed nine considerations that *support* causal judgement, including:
- Strength
- Consistency
- Temporality
- Biological plausibility
- Coherence

They are not a checklist; they are used to weigh alternative explanations for an observed association.

### Counterfactuals

Modern causal inference frames causal effects as comparisons of *potential outcomes* under different exposure levels.

Adjustment strategies help approximate these comparisons under plausible assumptions.
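
The potential-outcomes idea can be made tangible in a simulation, because there (unlike in real data) both potential outcomes of every individual are visible. The sketch below is built on assumptions throughout: `y0` and `y1` are hypothetical potential outcomes, `z` is a hypothetical binary confounder, and the individual effect is set to a constant 1.5.

```python
import numpy as np

rng = np.random.default_rng(99)  # arbitrary seed
n = 50_000

z = rng.binomial(1, 0.5, n)                      # confounder
y0 = 10 + 2 * z + rng.normal(0, 1, n)            # outcome if unexposed
y1 = y0 + 1.5                                    # outcome if exposed
a = rng.binomial(1, np.where(z == 1, 0.8, 0.2))  # exposure depends on z

# In reality only one potential outcome per person is observed.
y_obs = np.where(a == 1, y1, y0)

true_ate = float(np.mean(y1 - y0))               # known only because we simulated
naive_diff = y_obs[a == 1].mean() - y_obs[a == 0].mean()

# Within levels of z, exposure is unrelated to the potential outcomes,
# so stratum-specific contrasts can be averaged over the distribution of z.
adjusted_diff = 0.0
for level in (0, 1):
    in_stratum = z == level
    diff = y_obs[in_stratum & (a == 1)].mean() - y_obs[in_stratum & (a == 0)].mean()
    adjusted_diff += diff * in_stratum.mean()

print(f"True average effect: {true_ate:.3f}")
print(f"Naive difference:    {naive_diff:.3f}")
print(f"Adjusted difference: {adjusted_diff:.3f}")
```

The naive comparison mixes the effect of exposure with the effect of `z`; the adjusted contrast should approximate the true average effect, but only because the simulation guarantees that `z` is the sole confounder.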

In [None]:
OUTCOME_VAR  = "bmi"
EXPOSURE_VAR = "fruit_g_day"
AGE_VAR      = "age_years"
SEX_VAR      = "sex"

for var in [OUTCOME_VAR, EXPOSURE_VAR, AGE_VAR, SEX_VAR]:
    if var not in df.columns:
        raise KeyError(f"Variable '{var}' not found.")

df_work = df[[OUTCOME_VAR, EXPOSURE_VAR, AGE_VAR, SEX_VAR]].dropna()
print(f"Sample size: {df_work.shape[0]} observations\n")

formula_crude = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR}"
formula_adj = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR} + {AGE_VAR} + C({SEX_VAR})"

m_crude = smf.ols(formula_crude, data=df_work).fit()
m_adj = smf.ols(formula_adj, data=df_work).fit()

def summarise(model, var, label):
    beta = model.params[var]
    lo, hi = model.conf_int().loc[var]
    p = model.pvalues[var]
    print(f"{label}:")
    print(f"  β = {beta:.4f} (95% CI {lo:.4f} to {hi:.4f}); p={p:.4g}\n")

summarise(m_crude, EXPOSURE_VAR, "Crude model")
summarise(m_adj, EXPOSURE_VAR, "Adjusted model (age + sex)")

print("Interpretation: a material difference between the crude and adjusted β is consistent with confounding by age and/or sex.")

## 9. Reflection and exercises

1. **Definitions**: Provide examples of confounders, mediators, and colliders.
2. **Draw a DAG**: Include diet quality, physical activity, age, and CVD.
3. **Explain collider bias**: Why does conditioning on a collider induce associations?
4. **Re-run example**: Use a different exposure and outcome.
5. **Bradford–Hill**: Apply three considerations to a published association.

In [None]:
import pathlib

results_dir = pathlib.Path("results_wb7")
results_dir.mkdir(exist_ok=True)

summary = pd.DataFrame({
    "model": ["crude", "adjusted"],
    "beta": [m_crude.params[EXPOSURE_VAR], m_adj.params[EXPOSURE_VAR]],
    "p": [m_crude.pvalues[EXPOSURE_VAR], m_adj.pvalues[EXPOSURE_VAR]]
})

out = results_dir / "wb7_summary.csv"
summary.to_csv(out, index=False)
print(f"Saved: {out}")