# FB2NEP Workbook 7 – Confounding, DAGs, and Causal Structure

Version 0.0.5

This workbook builds on Workbook 6.

In Workbook 6, we treated regression as a practical tool for estimating associations
between an exposure and an outcome, and we focused on model types (linear, logistic,
Cox), assumptions, diagnostics, and basic interpretation of coefficients (β, OR, RR, HR).

In this workbook, we move from **association** to **causal thinking**. We introduce:

- Confounders.
- Colliders.
- Mediators.
- Directed acyclic graphs (DAGs) as a way to formalise causal assumptions.
- Approaches to adjustment (stratification and regression).
- The special role of **energy intake** in nutritional epidemiology.
- A brief introduction to **counterfactual** thinking.

A more formal treatment of causal inference, including modern notation and methods,
is given in **Workbook 9**. Here we focus on intuition and on how causal structure
affects regression analyses in practice.

We will use the synthetic *FB2NEP cohort* throughout. The precise variable names may
differ slightly from those used here; if you obtain an error (for example, `KeyError`),
carefully check the column names of the dataset and adapt the code accordingly.

In [None]:
# FB2NEP bootstrap cell – use in *all* workbooks
#
# This cell initialises the repository context and loads the synthetic cohort
# into a DataFrame called df. It tries a few possible locations for scripts/bootstrap.py.

import pathlib
import runpy

bootstrap_candidates = [
    "scripts/bootstrap.py",
    "../scripts/bootstrap.py",
    "../../scripts/bootstrap.py",
]

bootstrap_ns = None

for rel in bootstrap_candidates:
    p = pathlib.Path(rel)
    if p.exists():
        print(f"Loading bootstrap from: {p}")
        bootstrap_ns = runpy.run_path(str(p))
        break
else:
    raise FileNotFoundError(
        "Could not find scripts/bootstrap.py. "
        "Please check that you are running this notebook inside fb2nep-epi."
    )

if "init" not in bootstrap_ns:
    raise RuntimeError("bootstrap.py does not define init().")

df, CTX = bootstrap_ns["init"]()

REPO_ROOT = CTX.repo_root
CSV_REL = CTX.csv_rel
IN_COLAB = CTX.in_colab

print("Repository root:", REPO_ROOT)
print("Main dataset:", CSV_REL)
print("df shape:", df.shape)
print("IN_COLAB:", IN_COLAB)

In [None]:
"""
Imports and quick inspection
============================

In this cell we:

- Import common packages used in this workbook.
- Display basic information about the dataset to confirm that it is loaded.

The imports are deliberately explicit. Many students using this workbook
will not yet have much experience with Python, so we avoid implicit
magic and keep the code readable.
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from IPython.display import display

print("DataFrame shape (rows, columns):", df.shape)
print("\nFirst five rows of the dataset:")
display(df.head())

print("\nVariable types (first 20 columns):")
display(df.dtypes.head(20))

## 1. Association and causation

Regression models (linear, logistic, Cox) quantify **associations**:

- In linear regression, a coefficient describes how the *mean* outcome changes
  with the exposure.
- In logistic regression, we obtain *odds ratios*.
- In Cox regression, we obtain *hazard ratios*.

However, public health decisions concern **causal effects**:

> *If we changed this exposure (for example, salt intake), what would happen to the outcome?*

In observational data, a non-zero regression coefficient does **not** automatically
imply a causal effect. Several types of variables can distort or create associations:

- **Confounders**: common causes of exposure and outcome that, if unadjusted,
  bias estimated effects.
- **Colliders**: variables that are caused by two other variables; conditioning
  on them can create spurious associations.
- **Mediators**: variables lying *on* the causal pathway; adjusting for them can
  remove part of a genuine effect.

To move from association to causation we need to consider the **causal structure**
of the variables. In this workbook we introduce DAGs and basic adjustment strategies.
Workbook 9 provides a more formal framework for causal inference.

## 2. DAGs for model development and identification of confounders, colliders, and mediators

A **directed acyclic graph (DAG)** is a diagram with arrows that represents
assumptions about which variables cause which. It is:

- **Directed**: arrows have a direction (cause → effect).
- **Acyclic**: there are no feedback loops.

DAGs are useful because they make assumptions **explicit** and allow us to reason
about which variables we should adjust for in regression models.

### 2.1 Constructing a DAG

To construct a DAG for a research question:

1. **List key variables**
   - Exposure (for example, red meat intake).
   - Outcome (for example, incident cancer).
   - Plausible causes of exposure and outcome (for example, socio-economic status,
     age, smoking).

2. **Draw arrows according to subject-matter knowledge**
   - If variable A can plausibly influence variable B, draw `A → B`.
   - Do **not** add arrows simply because two variables are correlated in the data.

3. **Identify paths between exposure and outcome**
   - Causal paths (exposure → … → outcome).
   - Non-causal “backdoor” paths (exposure ← … → outcome) that create confounding.

4. **Decide on an adjustment set**
   - Choose variables to adjust for so that all non-causal backdoor paths are blocked,
     without conditioning on colliders or mediators.

![DAG](../_assets/epi_dag.png)
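
The four steps above can be sketched in plain Python. This is only an illustrative sketch: the edge list is a hypothetical DAG for the red meat example (it is not the FB2NEP data-generating model), and the helper names `all_paths`, `classify`, and `format_path` are invented for this example. The code enumerates the paths between exposure and outcome and labels each one as causal or backdoor. It does not check whether a path is already blocked by a collider.

```python
# Hypothetical DAG (assumed edges, for illustration only): cause -> effect
EDGES = [
    ("age", "red_meat"),
    ("smoking", "red_meat"),
    ("red_meat", "cancer"),
    ("age", "cancer"),
    ("smoking", "cancer"),
]


def all_paths(edges, exposure, outcome):
    """Return all simple paths from exposure to outcome, ignoring arrow direction.

    Each path is a list of (arrow, node) steps, where arrow is "->" if we
    travel along an arrow and "<-" if we travel against it.
    """
    neighbours = {}
    for cause, effect in edges:
        neighbours.setdefault(cause, []).append((effect, "->"))
        neighbours.setdefault(effect, []).append((cause, "<-"))

    paths = []

    def walk(node, visited, steps):
        if node == outcome:
            paths.append(steps)
            return
        for nxt, arrow in neighbours.get(node, []):
            if nxt not in visited:
                walk(nxt, visited | {nxt}, steps + [(arrow, nxt)])

    walk(exposure, {exposure}, [])
    return paths


def classify(steps):
    """Label a path as causal, backdoor, or other non-causal."""
    if all(arrow == "->" for arrow, _ in steps):
        return "causal"
    if steps[0][0] == "<-":
        return "backdoor"  # starts with an arrow pointing INTO the exposure
    return "other non-causal"


def format_path(exposure, steps):
    return exposure + "".join(f" {arrow} {node}" for arrow, node in steps)


for steps in all_paths(EDGES, "red_meat", "cancer"):
    print(f"{classify(steps):<17} {format_path('red_meat', steps)}")
```

In this toy DAG, adjusting for `age` and `smoking` would block both backdoor paths. For real analyses, dedicated DAG software is a safer choice than a hand-written path search.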

### 2.2 Why we do not include everything

It might be tempting to adjust for **every available variable**. This is usually
a bad idea because:

- Adjusting for **colliders** can *create* bias.
- Adjusting for **mediators** can remove part of the effect we are interested in
  (for example, when estimating the total effect of an exposure).
- Adjusting for variables that are neither confounders nor mediators can increase
  variance and complicate interpretation without reducing bias.

A good DAG includes **enough** variables to capture the main causal structure, but
not every variable in the dataset. We use subject-matter knowledge and parsimony:
include variables that are plausible causes of exposure and outcome, and that are
important for the research question.

## 3. Confounders

### 3.1 Definition and informal examples

A variable is a **confounder** for the association between an exposure and an outcome if:

1. It is associated with the exposure.
2. It is a cause of, or associated with a cause of, the outcome.
3. It is not on the causal pathway from exposure to outcome.

Intuitively, a confounder is a variable that makes exposed and unexposed individuals
systematically different, in a way that also affects the outcome.

Classic informal examples include:

- **Number of children and breast cancer risk**:
  women with more children tend to be older, and age affects the probability of
  having developed breast cancer; age can confound the association between
  “number of children” and “current breast cancer status”.

- **Hair length and income**:
  there may appear to be an association between hair length and income if, in a
  given setting, women tend to have longer hair than men and also have different
  average incomes; sex is a confounder.



### Hippo example (confounding)

**Hippo size and daily grass intake**

Suppose we observe that *larger hippos eat more grass per day*.  
We might be tempted to conclude that **being large makes hippos eat more**.

But consider the underlying biology:

- Older hippos tend to be larger (simply through growth).
- Older hippos also spend more hours grazing because they no longer play in the water as much as juveniles.

**Age** is therefore a confounder:

```text
       age
      /    \
 hippo size  grass intake
```
In causal terms:

Size ← Age → Grass intake

If we ignore age, the association between hippo size and grass intake partly reflects the fact that older hippos both weigh more and eat more, not necessarily that body size itself increases grazing behaviour.
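
A small simulation makes the hippo example concrete. All numbers below are invented for illustration (they are not real hippo data), and `ols` is a hypothetical helper built on `numpy.linalg.lstsq`. Size and grass intake both depend on age only, so the true effect of size on grass intake is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 5000

# Assumed data-generating mechanism: Size <- Age -> Grass intake
age = rng.uniform(1, 40, n)                      # years
size = 500 + 40 * age + rng.normal(0, 100, n)    # kg, caused by age only
grass = 20 + 0.5 * age + rng.normal(0, 3, n)     # kg/day, caused by age only


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) for y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_crude = ols(grass, size)[1]        # grass ~ size
b_adj = ols(grass, size, age)[1]     # grass ~ size + age

print(f"Crude slope for size:        {b_crude:.4f} kg grass per kg body mass")
print(f"Age-adjusted slope for size: {b_adj:.4f} kg grass per kg body mass")
```

The crude slope is clearly positive even though size has no effect in this simulation; once age is included, the slope for size should be close to zero.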

### 3.2 Approaches to adjustment

There are two basic ways to adjust for confounders:

- **Stratification**
  - Analyse the exposure–outcome association *within levels* of the confounder.
  - For example, estimate the association separately in high and low socio-economic
    status (SES) groups.

- **Inclusion in a regression model**
  - Include the confounder as a **covariate** (predictor) in the regression model.
  - For a linear model:
    $$
    Y = \beta_0 + \beta_1 X + \beta_2 C + \varepsilon,
    $$
    where $ X $ is the exposure and $ C $ is the confounder.
  - The coefficient $ \beta_1 $ is then interpreted as the association between
    $ X $ and $ Y $ **for individuals with the same value of $ C $**.

In practice, regression models with appropriate covariates are the most common
approach, but stratified analyses are useful for checking assumptions and for
illustrating confounding.
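
Stratification can be illustrated with a short simulation. This is a sketch under assumed, invented parameters: a binary confounder `c` makes exposure `x` more likely and also raises the outcome `y`, while `x` itself has no effect. The helper `mean_diff` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20000

# Assumed mechanism: C -> X and C -> Y, with no arrow from X to Y
c = rng.binomial(1, 0.5, n)                        # binary confounder
x = rng.binomial(1, np.where(c == 1, 0.7, 0.3))    # exposure depends on C
y = 10 + 5 * c + rng.normal(0, 2, n)               # outcome depends on C only


def mean_diff(outcome, exposure):
    """Difference in mean outcome between exposed and unexposed."""
    return outcome[exposure == 1].mean() - outcome[exposure == 0].mean()


crude = mean_diff(y, x)
by_stratum = {
    level: mean_diff(y[c == level], x[c == level]) for level in (0, 1)
}

print(f"Crude difference (ignoring C): {crude:.2f}")
for level, diff in by_stratum.items():
    print(f"Difference within C = {level}:     {diff:.2f}")
```

Within each level of `c` the exposed and unexposed groups are comparable, so the stratum-specific differences should be near zero, whereas the crude difference is not.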

In the next section we use the FB2NEP cohort to demonstrate confounding in a setting
where the data-generating mechanism includes known confounders.

### 3.3 Example: salt intake, blood pressure, and incident CVD

We now consider a more realistic example using the FB2NEP cohort.

- **Exposure**: daily salt intake (`salt_g_d`).
- **Outcome**: incident cardiovascular disease during follow-up (`CVD_incident`).
- **Potential covariate**: systolic blood pressure (`SBP`).

From a clinical perspective, it is plausible that:

- Higher salt intake increases SBP.
- Higher SBP increases CVD risk.
- Salt may also have a direct effect on CVD risk (beyond SBP).

A very simple causal diagram (DAG) might be:

```text
salt_g_d  →  SBP  →  CVD_incident
   ↘                ↑
     ───────────────
```

In addition, other variables (such as age, BMI, smoking) may influence both SBP and CVD:

```text
age  →  SBP  →  CVD_incident
  ↘           ↑
    ──────────

BMI  →  SBP  →  CVD_incident
  ↘           ↑
    ──────────
```
In this section, we:

1. Fit a crude logistic regression model of CVD on salt.
2. Fit an adjusted model that includes SBP.
3. Optionally explore stratification by SBP.

The aim is to see how adjustment for a strongly related risk factor (here SBP) changes
the estimated association between salt intake and incident CVD, and to reflect on whether
we are estimating a total or a direct effect.

In [None]:
from scripts.helpers_tables import summarise_logit_coef


OUTCOME_VAR   = "CVD_incident"
EXPOSURE_VAR  = "salt_g_d"
COVARIATE_VAR = "SBP"

# ---------------------------------------------------------------------
# 1. Check variables and prepare analysis dataset
# ---------------------------------------------------------------------
for v in [OUTCOME_VAR, EXPOSURE_VAR, COVARIATE_VAR]:
    if v not in df.columns:
        raise KeyError(f"Variable '{v}' not found in df. "
                       f"Available columns (first 20): {list(df.columns)[:20]}")

df_salt = df[[OUTCOME_VAR, EXPOSURE_VAR, COVARIATE_VAR]].dropna().copy()
print(f"Complete-case sample size: {df_salt.shape[0]} observations\n")

# Optional: quick descriptive summaries
print("Descriptive statistics (salt_g_d and SBP):\n")
display(df_salt[[EXPOSURE_VAR, COVARIATE_VAR]].describe().T)

# ---------------------------------------------------------------------
# 2. Fit crude and SBP-adjusted logistic regression models
# ---------------------------------------------------------------------
formula_crude = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR}"
formula_adj   = f"{OUTCOME_VAR} ~ {EXPOSURE_VAR} + {COVARIATE_VAR}"

model_crude = smf.logit(formula_crude, data=df_salt).fit(disp=False)
model_adj   = smf.logit(formula_adj,   data=df_salt).fit(disp=False)

# ---------------------------------------------------------------------
# 3. Summarise the effect of salt_g_d in each model
# ---------------------------------------------------------------------
rows = []
rows.append(
    summarise_logit_coef(
        model_crude,
        var_name=EXPOSURE_VAR,
        label="Crude model (CVD_incident ~ salt_g_d)"
    )
)
rows.append(
    summarise_logit_coef(
        model_adj,
        var_name=EXPOSURE_VAR,
        label="Adjusted model (CVD_incident ~ salt_g_d + SBP)"
    )
)

summary_salt = pd.DataFrame(rows)

print("\nAssociation between salt intake and incident CVD:")
print("Crude vs SBP-adjusted logistic regression\n")
display(summary_salt.round(3))


### Interpreting crude vs SBP-adjusted models

The table above shows the estimated association between salt intake (`salt_g_d`)
and incident CVD, first **without** and then **with** adjustment for SBP.

Focus on the rows for `salt_g_d`:

- Compare the **odds ratios (OR)** from the crude and adjusted models.
- Compare their **95 % confidence intervals**.
- Check whether the p-value changes meaningfully.

Typical questions to ask:

1. **Direction and magnitude**  
   - Is the adjusted OR further from 1.0 than the crude OR, or closer to 1.0?
   - Does adjustment for SBP increase or reduce the apparent effect of salt?

2. **Statistical evidence**  
   - Does the p-value change from “non-significant” to “significant”, or vice versa?
   - Do the confidence intervals overlap substantially?

3. **Causal interpretation**  
   - If SBP lies on the causal pathway from salt to CVD (salt → SBP → CVD),
     adjusting for SBP shifts attention from the **total effect** of salt to the
     **direct effect** not mediated by SBP.
   - If SBP is also influenced by other common causes of salt and CVD, adjustment
     can partly control for confounding, but it may also remove some of the effect
     we wish to measure.

In practice, one should be clear whether the primary target is:

- the **total effect** of salt on CVD (do *not* adjust for mediators such as SBP), or
- a **direct effect** of salt that is not explained by SBP (do adjust for SBP).

For this workbook, the numerical results are less important than the logic:

> Adjustment can change both the size and interpretation of an effect estimate.  
> Understanding the causal role of the covariate is essential before deciding
> whether it should be included in a regression model.
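
The distinction between total and direct effects can be checked in a simulation where the truth is known. The sketch below uses invented coefficients and, for simplicity, a continuous risk score instead of a binary CVD outcome (odds ratios do not decompose as neatly). The helper `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

# Assumed mechanism: salt -> SBP -> risk score, plus a direct salt -> risk arrow
direct = 0.10        # direct effect of salt on the risk score
a = 2.0              # effect of salt on SBP (mmHg per g/day)
b = 0.05             # effect of SBP on the risk score

salt = rng.normal(8, 2, n)
sbp = 110 + a * salt + rng.normal(0, 10, n)
risk_score = direct * salt + b * sbp + rng.normal(0, 1, n)


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) for y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


total_est = ols(risk_score, salt)[1]          # not adjusted for the mediator
direct_est = ols(risk_score, salt, sbp)[1]    # adjusted for the mediator

print(f"True total effect:  {direct + a * b:.2f}, estimated: {total_est:.3f}")
print(f"True direct effect: {direct:.2f}, estimated: {direct_est:.3f}")
```

Both estimates are correct answers to different questions: the unadjusted slope targets the total effect (direct plus `a * b`), and the SBP-adjusted slope targets the direct effect.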


### Stratification by SBP

Instead of (or in addition to) regression adjustment, we can explore the
association between salt intake and incident CVD **within strata of SBP**.

A simple approach is:

1. Divide SBP into **tertiles** (low, medium, high).
2. Within each tertile, fit a logistic regression model:
   $$
   \text{CVD\_incident} \sim \text{salt\_g\_d}
   $$
3. Compare the odds ratios for `salt_g_d` across SBP strata.

This illustrates two ideas:

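- **Holding SBP approximately constant**: Within a narrow SBP stratum, participants
  have similar blood pressure, so little variation in SBP remains to distort (or
  carry) the salt–CVD association. As with regression adjustment, the
  stratum-specific estimates describe the association at a given SBP level, not
  the total effect of salt.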
- **Effect modification**: If the salt–CVD association appears stronger
  in one SBP stratum than another, this suggests that the effect of salt
  may depend on baseline blood pressure.


In [None]:
"""
Stratified analysis by SBP tertiles.

We:

1. Create SBP tertiles (low, medium, high).
2. Within each tertile, fit a logistic regression:
       CVD_incident ~ salt_g_d
3. Summarise the effect of salt_g_d in each stratum.

This allows visual comparison with the crude and SBP-adjusted models.
"""

# Copy the complete-case dataset from above
df_strat = df_salt.copy()

# ---------------------------------------------------------------
# 1. Create SBP tertiles
# ---------------------------------------------------------------
# qcut assigns approximately equal numbers of observations to each group.
df_strat["SBP_tertile"] = pd.qcut(
    df_strat[COVARIATE_VAR],
    q=3,
    labels=["low SBP", "medium SBP", "high SBP"]
)

print("SBP tertile counts:\n")
print(df_strat["SBP_tertile"].value_counts().sort_index(), "\n")

# ---------------------------------------------------------------
# 2. Fit logistic regression within each SBP stratum
# ---------------------------------------------------------------
rows_strata = []

for level in df_strat["SBP_tertile"].cat.categories:
    df_t = df_strat[df_strat["SBP_tertile"] == level]

    print(f"Fitting model in stratum: {level} (n = {df_t.shape[0]})")

    # Check for sufficient variation in outcome and exposure
    if df_t[OUTCOME_VAR].nunique() < 2 or df_t[EXPOSURE_VAR].nunique() < 2:
        print("  Not enough variation in outcome or exposure to fit the model.\n")
        continue

    m_t = smf.logit(f"{OUTCOME_VAR} ~ {EXPOSURE_VAR}", data=df_t).fit(disp=False)

    rows_strata.append(
        summarise_logit_coef(
            m_t,
            var_name=EXPOSURE_VAR,
            label=f"SBP tertile: {level}"
        )
    )

# Combine into a summary table, if any strata were analysable
if rows_strata:
    summary_strata = pd.DataFrame(rows_strata)
    print("\nStratum-specific odds ratios for salt_g_d (by SBP tertile):\n")
    display(summary_strata.round(3))




### Interpreting the stratified results by SBP tertile

The table shows the association between salt intake (`salt_g_d`) and incident CVD
within strata of SBP:

- **Low SBP tertile**  
  - OR ≈ 0.99 (95 % CI 0.95 to 1.02), p ≈ 0.45  
  - Point estimate slightly below 1.0, but the confidence interval is narrow and
    clearly includes 1.0.

- **Medium SBP tertile**  
  - OR ≈ 1.02 (95 % CI 0.99 to 1.06), p ≈ 0.14  
  - Point estimate slightly above 1.0, with a confidence interval that again
    includes 1.0 and is compatible with no association.

- **High SBP tertile**  
  - OR ≈ 1.00 (95 % CI 0.98 to 1.03), p ≈ 0.99  
  - Point estimate essentially equal to 1.0, with a narrow confidence interval
    centred on no association.

Taken together:

- All three odds ratios are **very close to 1.0**, and all confidence intervals
  comfortably include 1.0.
- There is **no clear pattern** of a stronger or weaker salt–CVD association in
  any SBP tertile.
- The small differences in point estimates (slightly <1 in the low SBP group,
  slightly >1 in the medium SBP group) are entirely compatible with **random
  variation**.

For this synthetic dataset, the stratified analysis suggests that:

- There is **no strong evidence** of an association between salt intake and
  incident CVD within SBP strata.
- There is also **no evidence of effect modification** by SBP: the salt–CVD
  association does not appear meaningfully different between low, medium, and
  high SBP groups.

This complements the regression-based adjustment:

> Both adjustment and stratification lead us to the same conclusion: in this
> particular synthetic cohort, salt intake is not an important predictor of CVD
> once SBP (and other built-in risk structure) is taken into account.


### Predicted probability of incident CVD across salt intake

So far we have interpreted regression coefficients and odds ratios. It can also be
helpful to translate a logistic regression model into **predicted probabilities**.

In this section we use the **SBP-adjusted** logistic model for incident CVD and:

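- Fix SBP at a reference value (the code uses the sample median; a round value
  such as 130 mmHg would work equally well).
- Vary daily salt intake (`salt_g_d`) across the central part of its observed
  range (5th to 95th percentile).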
- Compute the **predicted probability** of incident CVD for each salt value.
- Plot this probability curve.

This produces a smooth graph showing how the model predicts CVD risk to change as
salt intake increases, **conditional on SBP being held constant**. The plot does not
prove causality, but it is a powerful way to:

- Visualise the *shape* of the association implied by the fitted model.
- Relate abstract model coefficients to changes in predicted risk on the probability
  scale.
- Compare different models (for example, crude vs adjusted) by inspecting how their
  prediction curves differ.
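
To see what a predicted probability is, it helps to compute one by hand. The sketch below uses made-up coefficients (they are not estimates from the FB2NEP cohort) and a hypothetical helper `inv_logit` to convert the linear predictor of a logistic model into probabilities.

```python
import numpy as np


def inv_logit(z):
    """Inverse logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients, for illustration only
b0, b_salt, b_sbp = -6.0, 0.02, 0.03

sbp_ref = 130.0                               # SBP held constant (mmHg)
salt_values = np.array([4.0, 8.0, 12.0])      # g/day

log_odds = b0 + b_salt * salt_values + b_sbp * sbp_ref
p = inv_logit(log_odds)
odds = p / (1 - p)

for s, prob in zip(salt_values, p):
    print(f"salt = {s:4.1f} g/day -> predicted probability = {prob:.3f}")

# The odds ratio for a 4 g/day increase is exp(4 * b_salt), whatever the SBP value
print(f"Odds ratio per 4 g/day: {odds[1] / odds[0]:.3f}")
```

The odds ratio is constant along the curve, but the change in probability is not, which is why plotting on the probability scale adds information.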


In [None]:
"""
Predicted probability of incident CVD across salt intake.

We:

- Use the SBP-adjusted logistic model.
- Fix SBP at a reference value (here, the sample median).
- Vary salt_g_d between its 5th and 95th percentiles.
- Plot the predicted probability of CVD.

This visualises the (conditional) effect of salt for a given SBP.
"""


# Reference value for SBP: the sample median (a fixed value such as 130 would also work)
sbp_ref = df_salt[COVARIATE_VAR].median()

# Construct a grid of salt values over the central range
salt_grid = np.linspace(
    df_salt[EXPOSURE_VAR].quantile(0.05),
    df_salt[EXPOSURE_VAR].quantile(0.95),
    100
)

pred_df = pd.DataFrame({
    EXPOSURE_VAR: salt_grid,
    COVARIATE_VAR: sbp_ref,
})

pred_df["p_cvd"] = model_adj.predict(pred_df)

fig, ax = plt.subplots(figsize=(6, 4))

ax.plot(pred_df[EXPOSURE_VAR], pred_df["p_cvd"], linewidth=2)

ax.set_xlabel("Salt intake (g/day)")
ax.set_ylabel("Predicted probability of incident CVD")
ax.set_title(f"Adjusted logistic model: CVD_incident ~ salt_g_d + SBP\n(SBP fixed at {sbp_ref:.1f} mmHg)")

plt.tight_layout()
plt.show()


### Interpreting the predicted probability curve

When you run the prediction plot, you will notice that the **predicted probability
of incident CVD decreases slightly as salt intake increases** (with SBP fixed at the
reference value, here the sample median).

This negative slope is often counter-intuitive, so it is important to understand why
it appears.

#### 1. SBP is held constant — the main causal pathway is removed
In the FB2NEP synthetic dataset:

- Salt intake has only a **modest direct effect** on CVD.
- The *major* pathway from salt to CVD operates through **raising SBP**.
- When we fix SBP at a constant value (for example 130 mmHg), we deliberately
  remove this pathway.

What remains in the model is a small **residual association**, which may even be
slightly negative because:

- Salt intake is correlated with other factors (such as physical activity or SES)
  in the synthetic data generation.
- Once SBP is controlled for, these correlations can produce a small negative
  conditional association.

This is an excellent example of the difference between:

- **Total effect** (salt → SBP → CVD + any direct effect), and
- **Direct effect** (salt → CVD, *not* via SBP).

The plot reflects **only the SBP-conditional association**, because SBP is held
constant. This corresponds to the direct effect only if no residual confounding remains.

#### 2. A negative slope does *not* imply that salt is protective
The model is telling us:

> *Given two individuals with the same SBP, slightly higher salt intake does not
> meaningfully increase CVD risk in this dataset.*

This is not a biological claim; it is an artefact of the **structural choices**
in the synthetic dataset:

- The direct salt → CVD term in the data generator is deliberately small.
- SBP carries most of the signal.
- Conditioning on SBP absorbs nearly all of salt’s effect.

#### 3. The educational message

This visualisation reinforces two important lessons:

- **Adjustment changes the estimand.**  
  By fixing SBP, we move from the total effect to the direct effect.

- **Interpreting adjusted prediction curves requires causal thinking.**  
  Without a DAG, one might wrongly conclude that salt is “protective”.

In reality, the example shows how removing the primary pathway of influence can
make an exposure appear unrelated or even slightly inversely related to an
outcome in a regression model.

> The key takeaway: always decide which effect (total or direct) you are trying
> to estimate.


## 4. Special case: energy intake in nutritional epidemiology

### 4.1 Why total energy intake is different

In nutritional epidemiology, **total energy intake** (for example, `energy_kcal`)
is not a classical confounder in the usual sense. Instead, it is a kind of
“scaling” variable:

- Individuals who eat **more total energy** tend to consume more of many nutrients
  and foods simply because they eat more food.
- Many nutrients are also biologically related to energy intake (for example,
  higher energy intake is often associated with higher body size and physical
  activity).

If we ignore total energy intake, we may incorrectly attribute the effect of
“eating more food overall” to a specific nutrient or food.

### 4.2 Common energy-adjustment methods

Several approaches are used to adjust nutrient intakes for total energy:

1. **Nutrient density method**
   - Express the nutrient per unit of energy, for example g/MJ or % of energy.
   - Example: grams of fibre per 1000 kcal.

2. **Residual method**
   - Regress the nutrient of interest on total energy intake.
   - Use the **residuals** (observed minus expected nutrient intake given energy)
     as an energy-adjusted exposure.
   - This removes the part of the nutrient intake that is explained by total
     energy intake.

3. **Energy-adjusted models**
   - Include both the nutrient and total energy intake as covariates in the
     regression model of interest.

Each method has advantages and disadvantages. The residual method and energy-adjusted
models are particularly useful when working with food-frequency questionnaires (FFQs),
where measurement error and strong correlations between nutrients can be substantial.
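
The three methods can be compared side by side on simulated data. This is a minimal sketch with invented numbers (fibre, energy, and BMI values are not taken from FB2NEP), and `ols` is a hypothetical helper. It also illustrates a standard least-squares result: the nutrient coefficient from the energy-adjusted model equals the coefficient obtained with the residual method.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Simulated intakes and outcome (assumed values, for illustration only)
energy = np.clip(rng.normal(2000, 400, n), 800, None)             # kcal/day
fibre = np.clip(0.009 * energy + rng.normal(0, 4, n), 1, None)    # g/day
bmi = 22 + 0.002 * energy - 0.10 * fibre + rng.normal(0, 3, n)    # kg/m^2


def ols(y, *predictors):
    """Least-squares coefficients (intercept first) for y on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# 1. Nutrient density: grams of fibre per 1000 kcal
fibre_density = fibre / energy * 1000

# 2. Residual method: observed minus expected fibre given energy
intercept, slope = ols(fibre, energy)
fibre_resid = fibre - (intercept + slope * energy)

# 3. Energy-adjusted model: BMI ~ fibre + energy
b_joint = ols(bmi, fibre, energy)[1]
b_resid = ols(bmi, fibre_resid)[1]      # BMI ~ fibre residual

print(f"Mean fibre density: {fibre_density.mean():.1f} g per 1000 kcal")
print(f"corr(raw fibre, energy):      {np.corrcoef(fibre, energy)[0, 1]:.3f}")
print(f"corr(fibre residual, energy): {np.corrcoef(fibre_resid, energy)[0, 1]:.3f}")
print(f"Fibre coefficient, energy-adjusted model: {b_joint:.4f}")
print(f"Fibre coefficient, residual method:       {b_resid:.4f}")
```

By construction the residuals are uncorrelated with energy, so they capture variation in fibre intake that is independent of how much a person eats overall.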

### 4.3 Special case of FFQs

FFQs typically record **relative** frequencies of consumption over long periods.
Reported intakes of many foods and nutrients are highly correlated, and systematic
measurement error is common. Adjusting for total energy intake can:

- Reduce measurement error that is common to many foods (for example, general
  over-reporting or under-reporting).
- Focus analyses on **diet composition** rather than total amount of food.

We now illustrate the residual method using `salt_g_d` and `energy_kcal` in the FB2NEP cohort.

In [None]:
"""Energy-adjustment example using the residual method.

We:
- Consider salt intake (salt_g_d) and total energy intake (energy_kcal).
- Regress salt_g_d on energy_kcal.
- Compute residuals as energy-adjusted salt intake.
- Illustrate the effect of adjustment with a simple association with BMI.
"""

for var in ["salt_g_d", "energy_kcal", "BMI"]:
    if var not in df.columns:
        raise KeyError(f"Variable '{var}' not found in df.")

df_energy = df[["salt_g_d", "energy_kcal", "BMI"]].dropna().copy()
print(f"Sample size (complete cases): {df_energy.shape[0]} observations\n")

# Fit linear model: salt_g_d ~ energy_kcal
model_salt_energy = smf.ols("salt_g_d ~ energy_kcal", data=df_energy).fit()
print("Linear regression: salt_g_d ~ energy_kcal\n")
display(model_salt_energy.summary().tables[1])

# Compute residuals: energy-adjusted salt intake
df_energy["salt_residual"] = model_salt_energy.resid

print("\nFirst five rows with residuals:")
display(df_energy.head())

# Compare associations of BMI with raw and energy-adjusted salt
m_raw = smf.ols("BMI ~ salt_g_d", data=df_energy).fit()
m_adj = smf.ols("BMI ~ salt_residual", data=df_energy).fit()

print("\nAssociation with BMI (raw salt):")
display(m_raw.summary().tables[1])

print("\nAssociation with BMI (energy-adjusted salt, residuals):")
display(m_adj.summary().tables[1])

print("\nInterpretation: the residual approach removes the part of salt intake that is"
      " explained by total energy. The comparison of the two models illustrates how"
      " energy adjustment can change estimated associations.")

## 5. Colliders and mediators (brief overview)

### 5.1 Colliders

A **collider** is a variable that is *caused* by two (or more) other variables.
In a simple diagram:

```text
exercise → fitness ← genes
```

Here, `fitness` is a collider on the path between `exercise` and `genes`. If we
**condition** on fitness (for example, by restricting the analysis to individuals
with high fitness, or adjusting for fitness in a model), we can induce an
association between exercise and genes even if none exists in the population.

This is known as **collider bias**. When the collider is related to being included
in the study, it is usually called **selection bias**.

### 5.2 Mediators

A **mediator** lies *on* the causal pathway between exposure and outcome:

```text
salt intake → blood pressure → stroke
```

If we are interested in the **total effect** of salt intake on stroke risk,
we should **not** adjust for blood pressure, because this would remove part of
the genuine effect (the indirect pathway through blood pressure).

If we are specifically interested in the **direct effect** of salt that is not
mediated by blood pressure, then adjusting for blood pressure is appropriate,
but the interpretation changes.

The key message is that we should adjust for **confounders**, avoid adjusting
for **colliders**, and think carefully before adjusting for **mediators**.
DAGs help us to reason about which variables fall into which category.

In [None]:
"""Short simulation to illustrate collider bias.

We simulate:
- exercise and genes as independent variables.
- fitness as a function of both exercise and genes (a collider).
- risk as a function of genes only.

We then:
- Fit a logistic model of risk on exercise in the full sample.
- Restrict to high fitness (conditioning on the collider) and refit.

Conditioning on fitness induces a spurious association between exercise and risk.
"""

rng = np.random.default_rng(11088)
n = 5000

exercise = rng.normal(0, 1, n)
genes = rng.normal(0, 1, n)
fitness = 0.8 * exercise + 0.8 * genes + rng.normal(0, 1, n)
risk_lin = 1.2 * genes + rng.normal(0, 1, n)
risk = (risk_lin > 0).astype(int)

dfc = pd.DataFrame({"exercise": exercise, "genes": genes, "fitness": fitness, "risk": risk})

print("Correlation exercise–genes (should be ~0):")
print(dfc[["exercise", "genes"]].corr(), "\n")

# Full sample
print("Full sample: exercise → risk (logistic):")
m_full = smf.logit("risk ~ exercise", dfc).fit(disp=False)
print(m_full.summary().tables[1])

# Condition on high fitness (top 30 %)
thr = dfc["fitness"].quantile(0.70)
df_high = dfc[dfc["fitness"] >= thr]
print("\nHigh fitness sample size:", df_high.shape[0])

print("High fitness: exercise → risk (logistic):")
m_cond = smf.logit("risk ~ exercise", df_high).fit(disp=False)
print(m_cond.summary().tables[1])

print("\nInterpretation: conditioning on fitness (a collider) induces an apparent"
      " association between exercise and risk even though exercise does not cause"
      " risk in the data-generating mechanism.")

## 6. Counterfactuals

Modern causal inference often uses **counterfactual** or **potential outcome**
language. For each individual we imagine:

- $ Y(1) $: the outcome that would occur if the individual were exposed.
- $ Y(0) $: the outcome that would occur if the same individual were not exposed.

The **causal effect** for that individual is the (usually unobservable) difference
$ Y(1) - Y(0) $. In practice we cannot observe both outcomes for the same
person, so we rely on comparisons between groups, together with assumptions about
confounding, measurement, and model specification.

Adjustment strategies (for example, regression with appropriate covariates based on
a sensible DAG) are used to make the exposed and unexposed groups more comparable,
so that the difference in observed outcomes approximates the difference between
counterfactual outcomes.
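
In a simulation we can do what is impossible in real data: generate both potential outcomes for every individual. The sketch below uses invented numbers and a hypothetical common cause called `frailty`. It compares the true average effect with a naive comparison in confounded observational data and with a comparison under random assignment.

```python
import numpy as np

rng = np.random.default_rng(99)
n = 50000

# Hypothetical common cause of exposure and outcome
frailty = rng.normal(0, 1, n)

# Both potential outcomes for every individual (only possible in a simulation)
y0 = 120 + 5 * frailty + rng.normal(0, 5, n)    # outcome if not exposed
y1 = y0 - 4                                     # outcome if exposed: 4 units lower
true_ate = (y1 - y0).mean()

# Observational setting: frail individuals are more likely to be exposed
p_exposed = 1 / (1 + np.exp(-1.5 * frailty))
a_obs = rng.binomial(1, p_exposed)
y_obs = np.where(a_obs == 1, y1, y0)            # only one outcome is observed
naive = y_obs[a_obs == 1].mean() - y_obs[a_obs == 0].mean()

# Randomised setting: exposure assigned by coin flip, independent of frailty
a_rct = rng.binomial(1, 0.5, n)
y_rct = np.where(a_rct == 1, y1, y0)
randomised = y_rct[a_rct == 1].mean() - y_rct[a_rct == 0].mean()

print(f"True average causal effect:      {true_ate:.2f}")
print(f"Naive observational difference:  {naive:.2f}")
print(f"Difference under randomisation:  {randomised:.2f}")
```

The naive comparison is biased because exposed and unexposed individuals differ in `frailty`; adjusting for it (by stratification or regression) aims to recover the comparability that randomisation provides by design.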

Workbook 9 returns to these ideas and introduces more formal notation and methods
for estimating causal effects under explicit assumptions.

## 7. Reflection and exercises

1. **Draw a DAG** for the association between red meat intake and incident cancer
   in the FB2NEP cohort. Include at least age, sex, `SES_class`, `IMD_quintile`,
   `smoking_status`, and family history. Identify plausible confounders,
   colliders, and mediators.

2. **Confounders in practice**: Choose a different exposure (for example,
   `fruit_veg_g_d` or `salt_g_d`) and a relevant outcome. Propose at least two
   variables as potential confounders based on subject-matter knowledge. Fit
   crude and adjusted models and compare the estimates.

3. **Energy adjustment**: Using `energy_kcal` and a nutrient of your choice
   (for example, `fibre_g_d`), implement the nutrient density method and the
   residual method. Compare the associations with BMI or another suitable
   outcome for the raw, density-based, and residual-based exposures.

4. **Collider bias**: Modify the collider simulation to use a different
   collider (for example, an indicator of study participation) and show how
   conditioning on participation can induce associations between variables
   that are otherwise independent.

5. **Mediators**: For a hypothetical causal chain in nutrition (for example,
   `diet quality → BMI → blood pressure → CVD`), decide which variables you
   would adjust for when estimating the total effect of diet quality on CVD,
   and which you would *not* adjust for. Explain your reasoning.

6. **Counterfactual thinking**: In your own words, describe what it would mean
   to say that “reducing salt intake by 2 g/day would reduce average SBP by
   5 mmHg” in terms of potential outcomes. What assumptions would be needed for
   this statement to be interpreted causally in an observational study?