In [None]:
%load_ext autoreload
%autoreload 2

# Directed Acyclic Graphs

> **Reference:** *Causal Inference: The Mixtape*, Chapter 3: Directed Acyclic Graphs (pp. 96-118)

This lecture introduces directed acyclic graphs (DAGs) as a tool for reasoning about causal relationships. We apply these concepts using the Online Retail Simulator to answer: **Why does our naive analysis suggest marketing spend hurts sales?**

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from support import (
    draw_dag,
    draw_dag_with_paths,
    simulate_confounded_data,
    add_collider,
    run_regression,
    regression_comparison_table,
    plot_regression_comparison,
    plot_partial_regression,
    print_path_analysis,
    plot_collider_bias_demo,
    draw_movie_star_example,
)

np.random.seed(42)

---

# Part I: Theory

This section covers the theoretical foundations of directed acyclic graphs as presented in Cunningham's *Causal Inference: The Mixtape*, Chapter 3.

## 1. Introduction to DAG Notation

A **directed acyclic graph (DAG)** is a visual representation of causal relationships between variables.

### Core Components

| Element | Representation | Meaning |
|---------|----------------|----------|
| **Node** | Circle | A random variable |
| **Arrow** | Directed edge (→) | Direct causal effect |
| **Path** | Sequence of edges | Connection between variables |

### Key Properties

1. **Directed**: Arrows point in one direction (cause → effect)
2. **Acyclic**: No directed path leads from a variable back to itself (no loops), so no variable can cause itself
3. **Causality flows forward**: Arrows run from cause to effect, and causes precede their effects in time
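
The acyclic property can be checked mechanically. The following is a minimal sketch using Kahn's algorithm on a plain list of `(cause, effect)` pairs; the helper name `is_acyclic` is hypothetical and is not part of the `support` module.

```python
from collections import deque


def is_acyclic(edges):
    """Return True if the directed graph given by (cause, effect) pairs has no cycle."""
    nodes = {n for edge in edges for n in edge}
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for cause, effect in edges:
        children[cause].append(effect)
        indegree[effect] += 1

    # Kahn's algorithm: repeatedly remove nodes that have no remaining parents.
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # If every node could be removed, the graph contains no directed cycle.
    return removed == len(nodes)


triangle = [("X", "D"), ("X", "Y"), ("D", "Y")]
feedback = [("D", "Y"), ("Y", "D")]
print(is_acyclic(triangle))  # True
print(is_acyclic(feedback))  # False
```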

### What DAGs Encode

DAGs encode **qualitative causal knowledge**:
- What IS happening: drawn arrows
- What is NOT happening: missing arrows (equally important!)

A missing arrow from A to B claims that A does not directly cause B.

### Simple DAG: Treatment → Outcome

In [None]:
# Simplest DAG: D causes Y
draw_dag(
    edges=[("D", "Y")],
    node_labels={"D": "Treatment\n(D)", "Y": "Outcome\n(Y)"},
    title="Simple Causal Relationship: D → Y",
)

## 2. Paths: Direct and Backdoor

A **path** is any sequence of edges connecting two nodes, regardless of arrow direction.

### Types of Paths

| Path Type | Direction | Interpretation |
|-----------|-----------|----------------|
| **Direct/Causal** | D → ... → Y | The causal effect we want |
| **Backdoor** | D ← ... → Y | Spurious correlation (bias!) |
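
For small graphs, the paths between treatment and outcome can be listed by brute force. The sketch below is illustrative only: `undirected_paths` and `classify` are hypothetical helpers, and the third label, `"non-causal"`, covers paths that leave the treatment but do not follow the arrows all the way (for example, a path through a collider).

```python
def undirected_paths(edges, start, end):
    """List every simple path from start to end, ignoring arrow direction."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    paths = []

    def walk(path):
        if path[-1] == end:
            paths.append(path)
            return
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])

    walk([start])
    return paths


def classify(path, edges):
    """Label a path that starts at the treatment as backdoor, causal, or non-causal."""
    edge_set = set(edges)
    # A backdoor path begins with an arrow pointing INTO the treatment.
    if (path[1], path[0]) in edge_set:
        return "backdoor"
    # Otherwise it is causal only if every arrow points forward along the path.
    all_forward = all((a, b) in edge_set for a, b in zip(path, path[1:]))
    return "causal" if all_forward else "non-causal"


edges = [("X", "D"), ("X", "Y"), ("D", "Y")]
for path in undirected_paths(edges, "D", "Y"):
    print(" - ".join(path), "->", classify(path, edges))
```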

### The Backdoor Problem

Backdoor paths create **spurious correlations** between D and Y:
- They make D and Y appear related even without a causal effect
- This is the graphical representation of **selection bias**
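
A small simulation makes the point concrete. In this sketch the coefficients and noise scales are arbitrary assumptions: `D` has no effect on `Y` at all, yet the two are clearly correlated because `X` drives both. Removing the part of each variable explained by `X` makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

x = rng.normal(size=n)            # confounder
d = 1.0 * x + rng.normal(size=n)  # treatment depends on X only
y = 2.0 * x + rng.normal(size=n)  # outcome depends on X only: D has NO effect on Y

raw_corr = np.corrcoef(d, y)[0, 1]

# Remove the part of D and Y explained by X, then correlate what is left.
d_resid = d - np.polyval(np.polyfit(x, d, 1), x)
y_resid = y - np.polyval(np.polyfit(x, y, 1), x)
partial_corr = np.corrcoef(d_resid, y_resid)[0, 1]

print(f"corr(D, Y)           = {raw_corr:.2f}")
print(f"corr(D, Y) given X   = {partial_corr:.2f}")
```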

In [None]:
# DAG with confounder: classic triangle
draw_dag(
    edges=[("X", "D"), ("X", "Y"), ("D", "Y")],
    node_labels={"X": "Confounder\n(X)", "D": "Treatment\n(D)", "Y": "Outcome\n(Y)"},
    title="Confounded Relationship: D ← X → Y",
)

In [None]:
# Path analysis
print_path_analysis(
    paths=["D → Y", "D ← X → Y"],
    descriptions=["Direct causal path (what we want to estimate)", "Backdoor path (creates bias)"],
)

## 3. Confounders

A **confounder** is a variable that:
1. Causes the treatment (D)
2. Causes the outcome (Y)
3. Is NOT on the causal path from D to Y

### Observed vs. Unobserved

| Type | In DAG | Implication |
|------|--------|-------------|
| **Observed** | Solid circle | Can condition on it |
| **Unobserved** | Dashed circle | Cannot directly control |

### Classic Example: Education and Earnings

Consider estimating the return to education:
- **Treatment**: Years of education
- **Outcome**: Earnings
- **Confounders**: Ability, family background, motivation

People with higher ability tend to:
- Get more education (ability → education)
- Earn more regardless of education (ability → earnings)

This creates a backdoor path that inflates naive estimates of education's effect.

In [None]:
# Education-Earnings DAG with confounders
draw_dag(
    edges=[
        ("Ability", "Education"),
        ("Ability", "Earnings"),
        ("Background", "Education"),
        ("Background", "Earnings"),
        ("Education", "Earnings"),
    ],
    title="Education and Earnings: Multiple Confounders",
)

## 4. Colliders and Collider Bias

A **collider** is a variable on a path where two arrows point INTO it:

$$D \rightarrow C \leftarrow Y$$

### Key Insight About Colliders

Colliders have a **special property**: they naturally BLOCK paths!

| Situation | Path Status |
|-----------|-------------|
| Leave collider alone | Path is CLOSED (blocked) |
| Condition on collider | Path is OPENED (creates bias!) |

### Why Conditioning Opens Colliders

Conditioning on a collider makes its causes appear correlated, even if they're independent in the population.

**Intuition**: If you know someone is a movie star (success), and they're not particularly beautiful, you infer they must be talented. This creates an artificial negative correlation between talent and beauty among stars.
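
The intuition can be reproduced in a few lines. This is a sketch under stated assumptions: beauty and talent are independent standard normal draws, and the rule that the top 15% of their sum become stars is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

beauty = rng.normal(size=n)
talent = rng.normal(size=n)  # drawn independently of beauty
score = beauty + talent
star = score > np.quantile(score, 0.85)  # assumed rule: top 15% become stars

corr_all = np.corrcoef(beauty, talent)[0, 1]
corr_stars = np.corrcoef(beauty[star], talent[star])[0, 1]

print(f"Everyone:   corr(beauty, talent) = {corr_all:.2f}")
print(f"Stars only: corr(beauty, talent) = {corr_stars:.2f}")
```

Restricting the sample to stars is a form of conditioning on the collider, which is why the second correlation is strongly negative while the first is close to zero.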

In [None]:
# Collider structure
draw_dag(
    edges=[("D", "C"), ("Y", "C")],
    node_labels={"D": "Treatment\n(D)", "C": "Collider\n(C)", "Y": "Outcome\n(Y)"},
    title="Collider Structure: D → C ← Y",
)

In [None]:
# Movie star example
draw_movie_star_example()

## 5. The Backdoor Criterion

The **backdoor criterion** provides a systematic way to identify what variables to condition on.

### Definition

A set of variables $Z$ satisfies the backdoor criterion relative to $(D, Y)$ if:

1. No variable in $Z$ is a descendant of $D$
2. $Z$ blocks every backdoor path from $D$ to $Y$

### How to Block Paths

| Node Type | To Block | To Open |
|-----------|----------|----------|
| **Non-collider** | Condition on it | Leave alone |
| **Collider** | Leave alone | Condition on it |
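
The table's rules can be applied to a single path programmatically. The sketch below uses hypothetical helpers (`descendants`, `path_is_blocked`) and adds one standard refinement the table leaves implicit: conditioning on a descendant of a collider also opens the path.

```python
def descendants(node, edges):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for cause, effect in edges:
            if cause == current and effect not in found:
                found.add(effect)
                stack.append(effect)
    return found


def path_is_blocked(path, edges, conditioned):
    """Apply the blocking rules to each interior node of one path."""
    edge_set = set(edges)
    conditioned = set(conditioned)
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, node) in edge_set and (nxt, node) in edge_set
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            opened = (node in conditioned) or bool(descendants(node, edges) & conditioned)
            if not opened:
                return True
        elif node in conditioned:
            # A conditioned non-collider blocks the path.
            return True
    return False


confounded = [("X", "D"), ("X", "Y"), ("D", "Y")]
collider = [("D", "C"), ("Y", "C")]

print(path_is_blocked(["D", "X", "Y"], confounded, conditioned=[]))     # False: backdoor is open
print(path_is_blocked(["D", "X", "Y"], confounded, conditioned=["X"]))  # True: conditioning blocks it
print(path_is_blocked(["D", "C", "Y"], collider, conditioned=[]))       # True: collider blocks by default
print(path_is_blocked(["D", "C", "Y"], collider, conditioned=["C"]))    # False: conditioning opens it
```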

### Important Implications

1. **Not all controls are good controls**: Conditioning on a collider creates bias
2. **Minimal sufficiency**: You don't need to condition on everything—just enough to block backdoors
3. **Multiple solutions**: Often several valid conditioning sets exist

## 6. Classic Examples

### Example 1: The Movie Star Paradox

In Hollywood:
- Beauty and Talent are independent in the general population
- Both contribute to becoming a Star
- Among Stars, Beauty and Talent appear negatively correlated

**DAG**: Beauty → Star ← Talent

If you condition on Star (study only celebrities), you create a spurious negative correlation.

### Example 2: Discrimination in Wages

Consider studying gender discrimination in wages:
- Gender → Occupation (women steered to lower-paying jobs)
- Gender → Wages (direct discrimination)
- Occupation → Wages

**Question**: Should we control for occupation?

**Answer**: It depends on what effect we want to measure!
- **Total effect**: Don't control (captures both direct and indirect discrimination)
- **Direct effect**: Control for occupation (discrimination within same job)
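
The difference between the two estimands shows up clearly in simulated data. In this sketch every coefficient is an assumption chosen for illustration: gender lowers the occupation index by 1, occupation raises wages by 2 per unit, and the direct wage penalty is 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

female = rng.integers(0, 2, size=n)
# Assumed coefficients, chosen only for illustration.
occupation = -1.0 * female + rng.normal(size=n)                # sorting into lower-paying jobs
wages = -1.0 * female + 2.0 * occupation + rng.normal(size=n)  # direct effect is -1


def ols(y, *regressors):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


total = ols(wages, female)[1]               # direct + indirect: about -1 + 2 * (-1) = -3
direct = ols(wages, female, occupation)[1]  # within-occupation gap: about -1

print(f"Total effect  (no control):          {total:.2f}")
print(f"Direct effect (control occupation):  {direct:.2f}")
```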

In [None]:
# Discrimination example
draw_dag(
    edges=[
        ("Gender", "Occupation"),
        ("Gender", "Wages"),
        ("Occupation", "Wages"),
    ],
    title="Gender Discrimination: Direct vs. Total Effect",
)

---

# Part II: Application

We now apply DAG concepts to diagnose and solve a confounding problem using simulated data.

## 1. Business Context: The Marketing Paradox

An e-commerce company notices something puzzling in their data:

> **Products with higher marketing spend tend to have LOWER sales.**

The marketing team is confused. Should they cut marketing budgets?

### The Underlying Reality

What's actually happening:
- **Struggling products** (low quality) receive more marketing investment
- **Strong products** (high quality) sell well with minimal marketing
- Marketing **does** increase sales (true causal effect is positive)

But the **confounding** from product quality creates a **negative spurious correlation** that overwhelms the positive causal effect.

## 2. Drawing the DAG

Let's represent this situation graphically:

- **Quality** (Q): Product quality/strength
- **Marketing** (M): Marketing spend
- **Sales** (S): Revenue

Relationships:
1. Quality → Sales (+): Better products sell more
2. Quality → Marketing (−): Struggling products get more marketing
3. Marketing → Sales (+): Marketing increases sales (TRUE causal effect)
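
One way to see how the sign can flip is to assume, purely for illustration, a linear model $S = \alpha + \beta M + \gamma Q + \varepsilon$ with $\varepsilon$ uncorrelated with $M$ and $Q$. The linearity and the symbols $\alpha$, $\beta$, $\gamma$ are assumptions here, not something taken from the simulator. Under that model, the slope from regressing $S$ on $M$ alone converges to

$$\frac{\mathrm{Cov}(M, S)}{\mathrm{Var}(M)} = \beta + \gamma \, \frac{\mathrm{Cov}(M, Q)}{\mathrm{Var}(M)}$$

The relationships above imply $\beta > 0$, $\gamma > 0$, and $\mathrm{Cov}(M, Q) < 0$, so the second term is negative. The naive slope turns negative whenever the magnitude of that term exceeds $\beta$.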

In [None]:
# Marketing paradox DAG
draw_dag(
    edges=[
        ("Quality", "Marketing"),
        ("Quality", "Sales"),
        ("Marketing", "Sales"),
    ],
    title="The Marketing Paradox: Quality Confounds Marketing-Sales Relationship",
)

In [None]:
# Identify paths from Marketing to Sales
print_path_analysis(
    paths=["Marketing → Sales", "Marketing ← Quality → Sales"],
    descriptions=[
        "DIRECT PATH: True causal effect (+$2 per $ spent)",
        "BACKDOOR PATH: Creates negative bias (quality confounding)",
    ],
)

## 3. Simulating Confounding

We'll generate data that follows our DAG structure:
- True marketing effect: +$2 of sales per $1 of marketing spend
- Quality → Marketing: -$1.50 of marketing spend per unit of quality (lower quality → more marketing)
- Quality → Sales: +$10 of sales per unit of quality
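
To make the structure concrete without relying on the helper, here is a hypothetical data-generating process with the same three coefficients. The intercepts, noise scales, and sample size are assumptions, and the real `simulate_confounded_data` may differ. With these choices, Var(marketing) is 1.5² + 1 = 3.25 and Cov(marketing, quality) is -1.5, so the naive slope should land near 2 + 10 × (-1.5) / 3.25 ≈ -2.6.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical data-generating process; the real simulate_confounded_data may differ.
quality = rng.normal(size=n)
marketing = 10 - 1.5 * quality + rng.normal(size=n)
sales = 50 + 2 * marketing + 10 * quality + rng.normal(scale=5, size=n)


def ols(y, *regressors):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


naive = ols(sales, marketing)[1]              # backdoor path through quality left open
adjusted = ols(sales, marketing, quality)[1]  # backdoor path blocked

print(f"Naive slope:    {naive:.2f}")
print(f"Adjusted slope: {adjusted:.2f}")
```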

In [None]:
# Generate confounded data
df = simulate_confounded_data(
    n_products=500,
    quality_effect_marketing=-1.5,  # Negative: low quality products get more marketing
    quality_effect_sales=10,  # Positive: high quality products sell more
    marketing_effect=2,  # TRUE CAUSAL EFFECT: $2 per $ of marketing
    seed=42,
)

print(f"Number of products: {len(df)}")
print("\nVariable summary:")
df[["quality", "marketing_spend", "sales"]].describe().round(2)

In [None]:
# Check correlations
print("Correlation matrix:")
print(df[["quality", "marketing_spend", "sales"]].corr().round(3))

print("\nNote the NEGATIVE correlation between marketing_spend and sales!")
print("This is the paradox we need to explain.")

## 4. The Naive Estimate (Biased)

Let's start with what a naive analyst might do: regress Sales on Marketing.

In [None]:
# Naive regression: Sales ~ Marketing
naive_results = run_regression(df, "sales", ["marketing_spend"])

print("Naive Regression: Sales ~ Marketing Spend")
print("=" * 50)
print(naive_results.summary().tables[1])

print(f"\nNaive estimate: {naive_results.params['marketing_spend']:.2f}")
print(f"True effect:    {df['true_marketing_effect'].iloc[0]:.2f}")
print("\nThe naive estimate suggests marketing HURTS sales!")

In [None]:
# Visualize the paradox
fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(df["marketing_spend"], df["sales"], c=df["quality"], cmap="RdYlGn", alpha=0.6, s=40)
plt.colorbar(scatter, label="Product Quality")

# Add naive regression line
z = np.polyfit(df["marketing_spend"], df["sales"], 1)
p = np.poly1d(z)
x_line = np.linspace(df["marketing_spend"].min(), df["marketing_spend"].max(), 100)
ax.plot(x_line, p(x_line), "r--", linewidth=2, label=f"Naive slope = {z[0]:.2f}")

ax.set_xlabel("Marketing Spend ($)")
ax.set_ylabel("Sales ($)")
ax.set_title("The Marketing Paradox: Higher Spend → Lower Sales?")
ax.legend()
plt.tight_layout()
plt.show()

print("\nNotice: Red points (low quality) cluster at high marketing, low sales")
print("Green points (high quality) cluster at low marketing, high sales")

## 5. Applying the Backdoor Criterion

### Step 1: List all paths from Marketing to Sales

1. **Marketing → Sales** (direct, causal)
2. **Marketing ← Quality → Sales** (backdoor, non-causal)

### Step 2: Identify which paths are open/closed

- Path 1: Open; this is the causal path, and we want to keep it open
- Path 2: Open, because Quality is a non-collider on this path and we have not conditioned on it

### Step 3: Find conditioning set to block backdoors

To block the backdoor path **Marketing ← Quality → Sales**:
- Condition on **Quality**

This satisfies the backdoor criterion:
- ✓ Quality is not a descendant of Marketing
- ✓ Conditioning on Quality blocks the backdoor path

## 6. Recovering the Causal Effect

Now we condition on quality to close the backdoor path.

In [None]:
# Conditioned regression: Sales ~ Marketing + Quality
conditioned_results = run_regression(df, "sales", ["marketing_spend", "quality"])

print("Conditioned Regression: Sales ~ Marketing Spend + Quality")
print("=" * 60)
print(conditioned_results.summary().tables[1])

print(f"\nConditioned estimate: {conditioned_results.params['marketing_spend']:.2f}")
print(f"True effect:          {df['true_marketing_effect'].iloc[0]:.2f}")
print("\nBy conditioning on quality, we recover the TRUE positive effect!")

In [None]:
# Compare naive vs conditioned estimates
results_dict = {
    "Naive (Sales ~ Marketing)": naive_results,
    "Conditioned (Sales ~ Marketing + Quality)": conditioned_results,
}

comparison = regression_comparison_table(results_dict, true_effect=df["true_marketing_effect"].iloc[0])
print(comparison.to_string(index=False))

In [None]:
# Visual comparison
plot_regression_comparison(comparison, title="Recovering the Causal Effect by Conditioning on Quality")

In [None]:
# Partial regression plot
plot_partial_regression(
    df,
    outcome="sales",
    treatment="marketing_spend",
    confounder="quality",
    title="From Negative Bias to True Positive Effect",
)

## 7. Collider Bias Demonstration

Now let's see what happens when we condition on a **collider**.

### Adding a Collider: "Featured" Status

Suppose products get "Featured" status if they have:
- High marketing spend, OR
- High sales

This creates a collider:
$$\text{Marketing} \rightarrow \text{Featured} \leftarrow \text{Sales}$$
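
The sketch below shows one way such a rule could bias an otherwise correct regression. Everything in it is hypothetical: the data-generating process, the noise scales, and the reading of "high" as "above the 70th percentile" are assumptions, not a description of how `add_collider` works.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process and featuring rule, for illustration only.
quality = rng.normal(size=n)
marketing = -1.5 * quality + rng.normal(size=n)
sales = 2 * marketing + 10 * quality + rng.normal(scale=5, size=n)

# Featured if marketing OR sales is above its own 70th percentile.
featured = (marketing > np.percentile(marketing, 70)) | (sales > np.percentile(sales, 70))


def marketing_coef(mask):
    """Coefficient on marketing from sales ~ marketing + quality, on the masked rows."""
    X = np.column_stack([np.ones(mask.sum()), marketing[mask], quality[mask]])
    coef, *_ = np.linalg.lstsq(X, sales[mask], rcond=None)
    return coef[1]


full_sample = marketing_coef(np.ones(n, dtype=bool))
featured_only = marketing_coef(featured)

print(f"All products:        {full_sample:.2f}")
print(f"Featured products:   {featured_only:.2f}")
```

Both regressions control for quality, so the backdoor path is blocked in each. The only difference is the restriction to featured products, which selects on a variable caused by both marketing and sales.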

In [None]:
# DAG with collider
draw_dag(
    edges=[
        ("Quality", "Marketing"),
        ("Quality", "Sales"),
        ("Marketing", "Sales"),
        ("Marketing", "Featured"),
        ("Sales", "Featured"),
    ],
    title="Extended DAG: Featured Status is a Collider",
)

In [None]:
# Add collider variable
df = add_collider(df, marketing_pct=70, sales_pct=70, seed=42)

print(f"Featured products: {df['featured'].sum()} ({df['featured'].mean() * 100:.1f}%)")

In [None]:
# What happens if we condition on the collider?
featured_df = df[df["featured"].astype(bool)]  # works whether the flag is stored as bool or 0/1

# Regression on featured products only
collider_results = run_regression(featured_df, "sales", ["marketing_spend", "quality"])

print("Regression CONDITIONING ON COLLIDER (Featured products only):")
print("=" * 60)
print(f"Marketing effect estimate: {collider_results.params['marketing_spend']:.2f}")
print(f"True effect:               {df['true_marketing_effect'].iloc[0]:.2f}")
print("\nConditioning on the collider BIASES our estimate!")

In [None]:
# Full demonstration
plot_collider_bias_demo(df, title="Three Ways to Estimate Marketing Effect")

In [None]:
# Final comparison of all estimates
print("\n" + "=" * 60)
print("SUMMARY: Marketing Effect Estimates")
print("=" * 60)
print(f"True causal effect:                    ${df['true_marketing_effect'].iloc[0]:.2f}")
print(f"\nNaive (ignoring confounding):          ${naive_results.params['marketing_spend']:.2f}  ← WRONG SIGN!")
print(f"Conditioned on quality (correct):      ${conditioned_results.params['marketing_spend']:.2f}  ← CORRECT!")
print(f"Conditioned on collider (featured):    ${collider_results.params['marketing_spend']:.2f}  ← BIASED!")
print("=" * 60)

---

# Conclusion

## Key Takeaways

1. **DAGs encode causal assumptions visually**: They make your causal model explicit and help identify sources of bias.

2. **Backdoor paths create spurious correlations**: When a confounder affects both treatment and outcome, naive estimates are biased.

3. **The backdoor criterion tells us what to condition on**: Block every backdoor path without conditioning on descendants of the treatment, so the causal path stays open.

4. **Not all controls are good controls**: Conditioning on colliders OPENS paths and creates bias—the opposite of what we want.

5. **"Controlling for everything" is dangerous**: You need to understand the causal structure before adding controls.

## Connection to Lecture 01

The selection bias we saw in the potential outcomes framework can be represented graphically:

- **Selection bias** = Open backdoor path
- **Randomization** = Severs all backdoor paths (makes D independent of confounders)
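
A short simulation illustrates the second point. In this sketch, with assumed coefficients and noise scales, marketing spend is drawn without reference to quality, which removes the Quality → Marketing arrow. The naive regression then recovers the causal effect even though quality is never controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

quality = rng.normal(size=n)
# Randomized spend: drawn without reference to quality, so Quality -> Marketing is removed.
marketing = rng.normal(size=n)
sales = 2 * marketing + 10 * quality + rng.normal(scale=5, size=n)

naive_slope = np.polyfit(marketing, sales, 1)[0]
corr_mq = np.corrcoef(marketing, quality)[0, 1]

print(f"corr(marketing, quality) = {corr_mq:.3f}")
print(f"Naive slope              = {naive_slope:.2f}")
```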

DAGs and potential outcomes are complementary frameworks for the same underlying problem.

## What's Next

The remaining lectures cover methods for identifying causal effects when we can't randomize:

- **Matching & Propensity Scores**: Create comparable groups by conditioning on observables
- **Regression Discontinuity**: Exploit sharp cutoffs that create local randomization
- **Instrumental Variables**: Use exogenous variation that affects the outcome only through the treatment
- **Difference-in-Differences**: Leverage the parallel trends assumption