# 04 · Theoretical model building — DAGs

> **Purpose**: formalise assumptions with Directed Acyclic Graphs (DAGs) and identify adjustment sets.

> **Learning objectives**
- Understand confounding, colliders, mediators through DAGs.
- Draw a simple DAG for a nutrition question (e.g., red meat → cancer).
- Propose a minimal sufficient adjustment set.

---

In [None]:
# Make sure the repo root (which has scripts/bootstrap.py) is on sys.path.
import sys, os, pathlib, subprocess

REPO_NAME = "fb2nep-epi"
REPO_URL  = "https://github.com/ggkuhnle/fb2nep-epi.git"
IN_COLAB  = "google.colab" in sys.modules

def ensure_repo_on_path():
    here = pathlib.Path.cwd()
    # Walk up a few levels to find scripts/bootstrap.py
    for p in [here, *here.parents]:
        if (p / "scripts" / "bootstrap.py").exists():
            os.chdir(p)                 # normalise CWD to repo root
            sys.path.append(str(p))     # ensure imports like "from scripts..." work
            return p
    # Not found locally: if on Colab, clone then chdir
    if IN_COLAB:
        target = pathlib.Path("/content") / REPO_NAME
        # Clone only if missing, into an explicit target so the chdir below cannot miss it;
        # check=True surfaces a failed clone here instead of as a confusing chdir error.
        if not target.exists():
            subprocess.run(["git", "clone", REPO_URL, str(target)], check=True)
        os.chdir(target)
        sys.path.append(str(target))
        return target
    # Otherwise, we can’t proceed
    raise FileNotFoundError("Could not find repo root containing scripts/bootstrap.py")

repo_root = ensure_repo_on_path()
print("Repo root:", repo_root)

In [None]:
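# Bootstrap: make sure the repo root is importable, then import init.
# The CWD is normally the repo root already; its parent is kept as a
# fallback for runs started from a subfolder such as notebooks/.
import sys, pathlib
for candidate in (pathlib.Path.cwd(), pathlib.Path.cwd().parent):
    if str(candidate) not in sys.path:
        sys.path.append(str(candidate))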
from scripts.bootstrap import init
df, ctx = init()
df.head(2)

## 1) DAG basics (very short)
- Nodes = variables; arrows = assumed direct causal effects.
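- **Backdoor paths** are non-causal paths between exposure and outcome that begin with an arrow pointing *into* the exposure; if left open, they create confounding.
- A **minimal sufficient adjustment set** blocks every backdoor path without conditioning on colliders, mediators, or other descendants of the exposure, and stops being sufficient if any variable is dropped from it.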

## 2) Example DAG: red meat → cancer
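Assume that SES, smoking, and age each confound the association: each affects both red meat intake and cancer risk. Depending on the hypothesis, BMI may instead lie on the causal pathway from red meat to cancer (a mediator), so it is left out of the DAG below.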

In [None]:
try:
    import networkx as nx
    import matplotlib.pyplot as plt
except ImportError as e:
    raise RuntimeError("Install the optional dependencies networkx and matplotlib to draw the DAGs") from e

# Build DAG (as a DiGraph for visual)
G = nx.DiGraph()
G.add_edges_from([
    ("SES","red_meat"),("SES","Cancer"),
    ("Smoking","red_meat"),("Smoking","Cancer"),
    ("Age","red_meat"),("Age","Cancer"),
    ("red_meat","Cancer")
])

pos = {"SES":(-1,1),"Smoking":(1,1),"Age":(0,1.4),"red_meat":(0,0),"Cancer":(0,-1)}
plt.figure(figsize=(5.5,4.2))
nx.draw(G, pos, with_labels=True, node_size=1600, node_color="#e6f2ff", arrows=True)
plt.title("DAG: confounding in red meat → cancer")
plt.axis('off'); plt.tight_layout(); plt.show()

## 3) Identify a minimal sufficient adjustment set
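In the DAG above, `Age`, `SES`, and `Smoking` each open a separate backdoor path of the form `red_meat ← X → Cancer`, so all three are needed and `{Age, SES, Smoking}` is the minimal sufficient adjustment set. Avoid adjusting for **colliders** (which opens a non-causal path) or **mediators** (which blocks part of the effect you want to estimate).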

**Task**: In your own words, state a minimal sufficient set for your primary analysis (you can include BMI if you argue it confounds rather than mediates in your causal story).
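
One way to check a proposed set against a DAG is to apply the backdoor criterion mechanically. The following is a minimal sketch, not a replacement for dedicated tools: the helper names `path_is_blocked`, `satisfies_backdoor`, and `minimal_adjustment_sets` are hypothetical, the search is brute force (fine for small teaching DAGs only), and the graph is rebuilt here as `red_meat_dag` with the same edges as the red meat example so the block runs on its own.

```python
from itertools import combinations
import networkx as nx

def path_is_blocked(dag, path, adjusted):
    """True if conditioning on `adjusted` blocks this path (d-separation rules)."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = dag.has_edge(prev, node) and dag.has_edge(nxt, node)
        if is_collider:
            # A collider blocks the path unless it, or one of its descendants, is adjusted for.
            if node not in adjusted and not (nx.descendants(dag, node) & adjusted):
                return True
        elif node in adjusted:
            # A non-collider (chain or fork) blocks the path when adjusted for.
            return True
    return False

def satisfies_backdoor(dag, exposure, outcome, adjusted):
    """Backdoor criterion: no descendants of the exposure, and every backdoor path blocked."""
    adjusted = set(adjusted)
    if adjusted & nx.descendants(dag, exposure):
        return False
    for path in nx.all_simple_paths(dag.to_undirected(), exposure, outcome):
        is_backdoor = dag.has_edge(path[1], path[0])
        if is_backdoor and not path_is_blocked(dag, path, adjusted):
            return False
    return True

def minimal_adjustment_sets(dag, exposure, outcome):
    """Brute-force search, smallest sets first; supersets of a found set are skipped."""
    candidates = sorted(set(dag.nodes) - {exposure, outcome} - nx.descendants(dag, exposure))
    found = []
    for size in range(len(candidates) + 1):
        for subset in map(set, combinations(candidates, size)):
            if any(smaller <= subset for smaller in found):
                continue
            if satisfies_backdoor(dag, exposure, outcome, subset):
                found.append(subset)
    return found

# Same edges as the red meat example in this notebook.
red_meat_dag = nx.DiGraph([
    ("SES", "red_meat"), ("SES", "Cancer"),
    ("Smoking", "red_meat"), ("Smoking", "Cancer"),
    ("Age", "red_meat"), ("Age", "Cancer"),
    ("red_meat", "Cancer"),
])

print(satisfies_backdoor(red_meat_dag, "red_meat", "Cancer", {"Age", "SES"}))  # False: Smoking path still open
print([sorted(s) for s in minimal_adjustment_sets(red_meat_dag, "red_meat", "Cancer")])
# [['Age', 'SES', 'Smoking']]
```

The result is only as good as the DAG: if you add BMI or other nodes to your own causal story, rerun the check on that graph rather than reusing this answer.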

## 4) Practice: DAG for salt → CVD via SBP
Sketch a DAG with: `Salt` (exposure), `CVD` (outcome), `SBP` (mediator), `Age`, `SES`, `Smoking` (potential confounders). Decide whether `BMI` is a confounder or mediator in your story and justify.
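
Whether `BMI` is a confounder or a mediator is decided by the arrows you draw, not by the data. The sketch below shows this with two alternative causal stories; the `BMI` edges in both are assumptions made up for illustration, and `classify` is a hypothetical helper that labels a node by its position relative to exposure and outcome. The remaining edges match the salt → CVD cell in this notebook.

```python
import networkx as nx

def classify(dag, exposure, outcome, node):
    """Label `node` as 'mediator', 'common cause', or 'neither' for exposure -> outcome."""
    # Mediator: lies on a directed path exposure -> ... -> node -> ... -> outcome.
    if node in nx.descendants(dag, exposure) and node in nx.ancestors(dag, outcome):
        return "mediator"
    # Common cause: causes the exposure AND reaches the outcome without going through it.
    without_exposure = dag.copy()
    without_exposure.remove_node(exposure)
    if node in nx.ancestors(dag, exposure) and node in nx.ancestors(without_exposure, outcome):
        return "common cause"
    return "neither"

base_edges = [
    ("Age", "Salt"), ("Age", "CVD"),
    ("SES", "Salt"), ("SES", "CVD"),
    ("Smoking", "CVD"),
    ("Salt", "SBP"), ("SBP", "CVD"),
]

# Story A (assumed): BMI affects both salt intake and CVD.
story_a = nx.DiGraph(base_edges + [("BMI", "Salt"), ("BMI", "CVD")])
# Story B (assumed): salt intake affects BMI, which in turn affects CVD.
story_b = nx.DiGraph(base_edges + [("Salt", "BMI"), ("BMI", "CVD")])

print(classify(story_a, "Salt", "CVD", "BMI"))  # common cause
print(classify(story_b, "Salt", "CVD", "BMI"))  # mediator
print(classify(story_a, "Salt", "CVD", "SBP"))  # mediator
```

Under story A, `BMI` belongs in the adjustment set for the total effect; under story B, adjusting for it would remove part of that effect.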

In [None]:
# OPTIONAL: draw your salt → CVD DAG
H = nx.DiGraph()
H.add_edges_from([
    ("Age","Salt"),("Age","CVD"),
    ("SES","Salt"),("SES","CVD"),
    ("Smoking","CVD"),
    ("Salt","SBP"),("SBP","CVD")
])
pos2 = {"Age":(-1,1),"SES":(1,1),"Smoking":(2,0.8),"Salt":(-0.3,0),"SBP":(0.7,-0.2),"CVD":(0.2,-1)}
plt.figure(figsize=(5.8,4.2))
nx.draw(H, pos2, with_labels=True, node_size=1600, node_color="#eef7e9", arrows=True)
plt.title("DAG: salt → SBP → CVD (with confounding)")
plt.axis('off'); plt.tight_layout(); plt.show()

## 5) TODO: short exercises
1. In a markdown cell, specify your adjustment set for **red_meat → Cancer** and justify briefly.
2. In a markdown cell, specify your adjustment set for **Salt → CVD** given the DAG above.
3. Which variables in either DAG act as **colliders** on some path? Explain what would go wrong if you mistakenly adjusted for them.

> ## Key takeaways
>
> - DAGs make assumptions explicit and guide adjustment choices.
> - Aim to block **backdoor paths** without opening new non-causal paths by conditioning on colliders, and without blocking part of the effect by conditioning on mediators.
> - Your statistical model should follow your DAG, not the other way around.