# 07 · Advanced topics — confounding, colliders, mediation, imputation

> **Purpose**: sharpen causal thinking (confounders vs colliders vs mediators), show how (mis)adjustment shifts estimates, and run simple missing-data sensitivity checks.

> **Learning objectives**
- Identify **confounders**, **colliders**, and **mediators** in nutrition questions.
- Observe how estimates change when you (in)correctly adjust variables.
- Compare **complete-case** vs **simple imputation** (teaching) and reflect on MNAR.
- (Optional) Visualise DAGs to justify adjustment sets.

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
        # clone only if missing
        if not (pathlib.Path("/content") / REPO_NAME).exists():
            # check=True: fail loudly here rather than at the chdir below
            subprocess.run(["git", "clone", REPO_URL], cwd="/content", check=True)
        os.chdir(f"/content/{REPO_NAME}")
        sys.path.append(os.getcwd())
        return pathlib.Path.cwd()
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
# Bootstrap: ensure repo root on path, then import init
import sys, pathlib
sys.path.append(str(pathlib.Path.cwd().parent))
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) Definitions (succinct)
- **Confounder**: causes both exposure and outcome; opens a backdoor path. *Adjust for it.*
- **Collider**: is caused by two (or more) variables; conditioning opens a spurious path. *Do not adjust.*
- **Mediator**: lies on the causal path from exposure to outcome. Adjusting estimates the **direct** effect (not total). *Adjust only if your target is the direct effect.*

**Example questions**
- *Red meat → Cancer*: likely confounded by **SES**, **smoking**, **age**; uncertain role for **BMI** (pathway vs confounding).
- *Salt → CVD*: **age**/**SES** confound; **SBP** is a plausible **mediator**.
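
The three roles defined above can be made concrete with a small simulation. This is a minimal sketch on synthetic data, not on the course cohort: the effect sizes, the helper names `slope_on_first` and `simulate_structures`, and the use of linear regression (instead of the logistic models used later) are all assumptions chosen to keep the example short and dependent on NumPy only.

```python
import numpy as np

def slope_on_first(y, *cols):
    """OLS coefficient on the first predictor in cols (an intercept is added)."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])

def simulate_structures(n=50_000, seed=0):
    """Return (unadjusted, adjusted) slopes of Y on X for three toy DAGs."""
    rng = np.random.default_rng(seed)
    out = {}

    # Confounder: Z -> X and Z -> Y; X has NO effect on Y
    z = rng.normal(size=n)
    x = 0.8 * z + rng.normal(size=n)
    y = 0.8 * z + rng.normal(size=n)
    out['confounder'] = (slope_on_first(y, x), slope_on_first(y, x, z))

    # Mediator: X -> M -> Y plus a direct X -> Y path
    # total effect = 0.5 + 0.6 * 0.5 = 0.8, direct effect = 0.5
    x = rng.normal(size=n)
    med = 0.6 * x + rng.normal(size=n)
    y = 0.5 * x + 0.5 * med + rng.normal(size=n)
    out['mediator'] = (slope_on_first(y, x), slope_on_first(y, x, med))

    # Collider: X -> C <- Y; X and Y are independent
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    c = x + y + rng.normal(size=n)
    out['collider'] = (slope_on_first(y, x), slope_on_first(y, x, c))
    return out

for name, (crude, adjusted) in simulate_structures().items():
    print(f'{name:10s} unadjusted={crude:+.2f}  adjusted={adjusted:+.2f}')
```

Under these assumed coefficients, theory gives roughly 0.39 → 0 for the confounder (adjustment removes a spurious association), 0.8 → 0.5 for the mediator (adjustment isolates the direct effect), and 0 → −0.5 for the collider (adjustment creates an association that is not there).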

## 2) Adjustment demo — red meat → Cancer
We’ll fit three logistic models and compare ORs for `red_meat_g_d`:
1) **Unadjusted**  
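2) **+ Confounders**: age, sex, SES, IMD, smoking, BMI (BMI is treated as a confounder here, although it may partly lie on the causal pathway)  
3) **+ Mediator candidate**: add `SBP`. SBP is a plausible mediator for salt → CVD; for red meat → cancer it is only a *didactic* overadjustment example.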

_Interpretation focus_: how the OR shifts and why that could happen under the DAG.

In [None]:
import pandas as pd, numpy as np, statsmodels.api as sm
from patsy import dmatrices

OUTCOME = 'Cancer_incident'
EXPOSURE = 'red_meat_g_d'
conf = ['age','BMI','sex','SES_class','IMD_quintile','smoking_status']

m = df[[OUTCOME, EXPOSURE] + conf + ['SBP']].dropna().copy()
print('n (complete-cases for this block):', len(m))

def cat(v):
    return f"C({v})" if (m[v].dtype=='object' or str(m[v].dtype).startswith('category')) else v

def tidy_or(fit):
    OR = np.exp(fit.params).rename('OR')
    CI = np.exp(fit.conf_int()).rename(columns={0:'2.5%',1:'97.5%'})
    return pd.concat([OR,CI], axis=1).round(3)

# 1) Unadjusted
y1, X1 = dmatrices(f'{OUTCOME} ~ {EXPOSURE}', data=m, return_type='dataframe')
fit1 = sm.Logit(y1, X1).fit(disp=False)

# 2) + Confounders
rhs2 = ' + '.join([EXPOSURE] + [cat(v) for v in conf])
y2, X2 = dmatrices(f'{OUTCOME} ~ ' + rhs2, data=m, return_type='dataframe')
fit2 = sm.Logit(y2, X2).fit(disp=False)

# 3) + Mediator candidate (overadjustment example)
rhs3 = rhs2 + ' + SBP'
y3, X3 = dmatrices(f'{OUTCOME} ~ ' + rhs3, data=m, return_type='dataframe')
fit3 = sm.Logit(y3, X3).fit(disp=False)

t1, t2, t3 = tidy_or(fit1), tidy_or(fit2), tidy_or(fit3)
# Stack the exposure row from each model into one comparison table
pd.concat({'Unadjusted': t1.loc[[EXPOSURE]],
           '+ Confounders': t2.loc[[EXPOSURE]],
           '+ SBP': t3.loc[[EXPOSURE]]})

**Reading the shift**
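- If the OR moves **towards 1** after adjusting for confounders, this is consistent with confounding having inflated the crude association. Because the OR is non-collapsible, small shifts can also occur without any confounding.
- If adding a **mediator** (like SBP in salt→CVD; didactic here) pulls the OR towards 1, you are estimating something closer to the **direct** effect, because the part of the total effect that runs through the mediator is absorbed.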
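
One way to put a number on "how far the OR shifts" is the relative change in the log-OR between two models. The helper below is a minimal sketch: the function name `pct_change_log_or` and the example ORs are hypothetical (not results from this cohort), and the common rule of thumb of roughly 10% as a noteworthy change is a heuristic, not a test.

```python
import math

def pct_change_log_or(or_before, or_after):
    """Percent change in log(OR) when moving from one model to another."""
    b_before, b_after = math.log(or_before), math.log(or_after)
    if b_before == 0:
        raise ValueError('log(OR) of the first model is 0; relative change is undefined')
    return 100 * (b_after - b_before) / b_before

# Hypothetical ORs for illustration only: crude 1.50, adjusted 1.20
print(round(pct_change_log_or(1.50, 1.20), 1))  # -55.0
```

Working on the log scale keeps the measure symmetric around the null: an OR of 0.5 and an OR of 2.0 are equally far from 1.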

## 3) Collider caution (mini simulation within our cohort)
Create an **artificial collider** `CL` that is influenced by both the exposure and the outcome, then show that conditioning on it distorts the exposure-outcome association even after adjusting for the confounders. This is an illustration only; do not add such variables to your real models.

In [None]:
rng = np.random.default_rng(11088)
d = df[[EXPOSURE, OUTCOME] + conf].dropna().copy()

# Build a collider CL that is more likely when exposure high AND outcome=1
x = (d[EXPOSURE] - d[EXPOSURE].mean())/d[EXPOSURE].std()
p = 1/(1+np.exp(-(0.6*x + 1.0*d[OUTCOME] - 0.1)))
d['CL'] = (rng.uniform(size=len(d)) < p).astype(int)

# Covariate terms shared by both models (categorical variables wrapped in C());
# sm and dmatrices were imported in the adjustment-demo cell above.
rhs_conf = ' + '.join(
    f'C({v})' if d[v].dtype=='object' or str(d[v].dtype).startswith('category') else v
    for v in conf
)

# Model without conditioning on CL
y_nc, X_nc = dmatrices(f'{OUTCOME} ~ {EXPOSURE} + {rhs_conf}', data=d, return_type='dataframe')
fit_nc = sm.Logit(y_nc, X_nc).fit(disp=False)
# Model conditioning on the collider (WRONG)
y_c, X_c = dmatrices(f'{OUTCOME} ~ {EXPOSURE} + CL + {rhs_conf}', data=d, return_type='dataframe')
fit_c = sm.Logit(y_c, X_c).fit(disp=False)

def or_of(term, fit):
    # OR and 95% CI for a single model term (np and pd are already imported above)
    OR = np.exp(fit.params[term])
    lo, hi = np.exp(fit.conf_int().loc[term].values)
    return pd.Series({'OR': round(float(OR),3), '2.5%': round(float(lo),3), '97.5%': round(float(hi),3)})

or_nc = or_of(EXPOSURE, fit_nc)
or_c  = or_of(EXPOSURE, fit_c)
print('Without collider conditioning (correct):\n', or_nc.to_dict())
print('With collider conditioning (WRONG):\n', or_c.to_dict())

_Expectation_: adding `CL` typically **distorts** the exposure OR compared with the non-collider model. In real work, colliders are often inadvertently introduced via restricting the sample or adjusting for variables affected by both exposure and outcome (e.g., conditioning on a selection mechanism).
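
Restricting the sample is the same mechanism in disguise. The sketch below uses synthetic data only: exposure and outcome are independent by construction, and the selection model (a logistic function of their sum) and the function name `selection_demo` are assumptions made for illustration.

```python
import numpy as np

def selection_demo(n=100_000, seed=1):
    """Correlation of two independent variables, before and after selection."""
    rng = np.random.default_rng(seed)
    exposure = rng.normal(size=n)
    outcome = rng.normal(size=n)          # independent of exposure by construction

    # Selection depends on BOTH variables, so "selected" is a collider
    p_select = 1 / (1 + np.exp(-(exposure + outcome)))
    selected = rng.uniform(size=n) < p_select

    r_all = np.corrcoef(exposure, outcome)[0, 1]
    r_selected = np.corrcoef(exposure[selected], outcome[selected])[0, 1]
    return float(r_all), float(r_selected)

r_all, r_selected = selection_demo()
print(f'full sample r = {r_all:+.3f}; selected only r = {r_selected:+.3f}')
```

In the full sample the correlation should sit near zero; among the selected it turns negative, even though nothing links exposure and outcome except the selection step.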

## 4) Missing data — complete-case vs simple imputation (teaching)
We’ll compare **complete-case (CC)** analysis to a crude **median/mode imputation**. This is for **teaching only**; in practice prefer **multiple imputation** (not covered here to avoid new dependencies).

In [None]:
vars_model = [OUTCOME, EXPOSURE] + conf
cc = df[vars_model].dropna().copy()
print('CC n =', len(cc))

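imp = df[vars_model].copy()
# Impute exposure and covariates only. Rows with a missing outcome stay missing
# and are dropped by patsy when the model is fitted.
to_impute = [c for c in imp.columns if c != OUTCOME]
for c in imp[to_impute].select_dtypes(include='number').columns:
    imp[c] = imp[c].fillna(imp[c].median())
for c in imp[to_impute].select_dtypes(include=['object','category']).columns:
    md = imp[c].mode(dropna=True)
    if len(md): imp[c] = imp[c].fillna(md.iloc[0])
print('Imputed n =', len(imp) - imp.isna().sum(axis=1).gt(0).sum())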

def fit_logit(dat):
    def cat(v):
        return f"C({v})" if (dat[v].dtype=='object' or str(dat[v].dtype).startswith('category')) else v
    rhs = ' + '.join([EXPOSURE] + [cat(v) for v in conf])
    y, X = dmatrices(f'{OUTCOME} ~ ' + rhs, data=dat, return_type='dataframe')
    return sm.Logit(y, X).fit(disp=False)

fit_cc  = fit_logit(cc)
fit_imp = fit_logit(imp)

def tidy_or_table(term, *fits):
    rows = []
    for tag, ft in fits:
        OR = np.exp(ft.params[term])
        lo, hi = np.exp(ft.conf_int().loc[term].values)
        rows.append({'Model': tag, 'OR': round(float(OR),3), '2.5%': round(float(lo),3), '97.5%': round(float(hi),3)})
    return pd.DataFrame(rows)

tidy_or_table(EXPOSURE, ('Complete-case', fit_cc), ('Median/mode impute', fit_imp))

**Interpretation**
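- Differences between CC and imputed estimates indicate **sensitivity to how missingness is handled**. Even if MAR is plausible, a poor imputation model (median/mode ignores the other variables) can leave bias, and single imputation understates uncertainty.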
- **MNAR** (e.g., sicker participants underreport diet) cannot be diagnosed from observed data alone — acknowledge and explore bounds if critical to inference.
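
One simple way to "explore bounds" is a worst-case calculation for a binary variable: assume every missing value is 0, then assume every missing value is 1. The sketch below is illustrative only; the function names and the toy data are hypothetical, and the approach is a crude bound, not a substitute for a proper sensitivity analysis.

```python
import math

def is_missing(v):
    """True for None or NaN."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def worst_case_bounds(values):
    """Bounds on the proportion of 1s if missing entries were all 0 or all 1."""
    values = list(values)
    n = len(values)
    if n == 0:
        raise ValueError('no observations')
    n_missing = sum(is_missing(v) for v in values)
    n_events = sum(1 for v in values if not is_missing(v) and v == 1)
    return n_events / n, (n_events + n_missing) / n

# Toy data: 3 events, 5 non-events, 2 missing
toy = [1, 1, 1, 0, 0, 0, 0, 0, None, float('nan')]
print(worst_case_bounds(toy))  # (0.3, 0.5)
```

The complete-case proportion here is 3/8 = 0.375, which lies inside the bounds. The more data are missing, the wider the interval, which is exactly the uncertainty that MNAR leaves unresolved.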

## 5) (Optional) DAG visual to justify adjustment set
Use a small DAG to state your assumptions for **red meat → Cancer** (or your chosen pair), then argue your adjustment set from the graph.

In [None]:
try:
    import networkx as nx, matplotlib.pyplot as plt
    G = nx.DiGraph()
    G.add_edges_from([
        ('SES','red_meat_g_d'),('SES','Cancer_incident'),
        ('smoking_status','red_meat_g_d'),('smoking_status','Cancer_incident'),
        ('age','red_meat_g_d'),('age','Cancer_incident'),
        ('red_meat_g_d','Cancer_incident')
    ])
    pos = {'SES':(-1,1),'smoking_status':(1,1),'age':(0,1.4),'red_meat_g_d':(0,0),'Cancer_incident':(0,-1)}
    plt.figure(figsize=(5.5,4.2))
    nx.draw(G, pos, with_labels=True, node_size=1600, node_color='#e6f2ff', arrows=True)
    plt.title('DAG: confounding in red meat → cancer'); plt.axis('off'); plt.tight_layout(); plt.show()
except Exception as e:
    print('DAG skipped (networkx optional):', e)
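
To argue an adjustment set from the graph, it helps to list the backdoor paths and check which ones a candidate set blocks. The sketch below does this in plain Python for the same edges as the DAG above. The helper names (`backdoor_paths`, `descendants`, `is_blocked`) are hypothetical, and the code checks path blocking only; it does not verify that the candidate set avoids descendants of the exposure.

```python
dag_edges = [
    ('SES', 'red_meat_g_d'), ('SES', 'Cancer_incident'),
    ('smoking_status', 'red_meat_g_d'), ('smoking_status', 'Cancer_incident'),
    ('age', 'red_meat_g_d'), ('age', 'Cancer_incident'),
    ('red_meat_g_d', 'Cancer_incident'),
]

def backdoor_paths(edges, exposure, outcome):
    """Simple paths exposure ... outcome whose first edge points INTO the exposure."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    edge_set, paths = set(edges), []

    def walk(path):
        if path[-1] == outcome:
            paths.append(path)
            return
        for nxt in sorted(nbrs.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])

    for first in sorted(nbrs.get(exposure, ())):
        if (first, exposure) in edge_set:      # arrow into the exposure
            walk([exposure, first])
    return paths

def descendants(edges, node):
    """All nodes reachable from node by following arrows."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def is_blocked(path, edges, adjust):
    """True if the path is blocked given the adjustment set (d-separation rules)."""
    edge_set, adjust = set(edges), set(adjust)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        if (prev, mid) in edge_set and (nxt, mid) in edge_set:   # collider on the path
            if mid not in adjust and not (descendants(edges, mid) & adjust):
                return True
        elif mid in adjust:                                      # adjusted non-collider
            return True
    return False

candidate = {'SES', 'smoking_status', 'age'}
for path in backdoor_paths(dag_edges, 'red_meat_g_d', 'Cancer_incident'):
    status = 'blocked' if is_blocked(path, dag_edges, candidate) else 'OPEN'
    print(' - '.join(path), '->', status)
```

Try removing a variable from `candidate` to see which path reopens; that is the argument you would write down to justify keeping it in the model.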

## 6) TODO: your practice
1. Choose a primary question (e.g., `salt_g_d → CVD_incident`). Specify a **minimal sufficient adjustment set** in Markdown and justify with a DAG sketch (optional cell below).
2. Fit **unadjusted**, **confounder-adjusted**, and **mediator-adjusted** (if applicable) models; compare ORs.
3. Repeat with **complete-case** vs **simple imputation**; summarise how sensitive your estimate is to missingness handling.
4. In 3–5 sentences, explain a plausible **MNAR** mechanism for your exposure and what direction of bias it would induce.

> ## Key takeaways
>
> - Causal clarity first: **confounders** in, **colliders** out; **mediators** only if your estimand is the *direct* effect.
> - (Mis)adjustment visibly moves estimates — always link choices back to a DAG.
> - Missing data strategy matters; complete-case vs crude imputation can differ. Real work uses **multiple imputation** with a rich imputation model.
> - Be explicit about **assumptions** and **sensitivity** — that’s the craft of epidemiology.