# Notebook 1: Data Exploration & Causal DAG

**Causal Question:** What is the causal impact of smoking on health outcomes (health score, cancer)?

This notebook generates and explores synthetic BRFSS-like data, demonstrates
confounding between treatment groups, and constructs a causal DAG to identify
the backdoor adjustment set needed for unbiased causal estimation.

In [None]:
import sys
sys.path.insert(0, "..")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.preprocessing import generate_synthetic_brfss, clean_data, engineer_features
from src.utils.config import (
    TREATMENT_COL, OUTCOME_HEALTH, OUTCOME_CANCER,
    COVARIATE_COLS, TRUE_ATE_HEALTH, RANDOM_SEED,
)
from src.utils.dag import CausalDAG
from src.utils.visualization import plot_covariate_distributions

sns.set_style("whitegrid")
print("Setup complete.")

## 1. Data Generation

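We generate 10,000 synthetic observations from a known data-generating process,
so every estimate can be checked against the ground truth. The true average
treatment effect (ATE) on health score is **-5.0**: smoking lowers the health
score by 5 points on average.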
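
To make "known data-generating process" concrete, here is a minimal, dependency-free sketch of how a confounded dataset with a fixed treatment effect could be simulated. It is not the implementation of `generate_synthetic_brfss`: the function name `simulate_confounded`, the single confounder (age), and every coefficient other than the -5.0 effect are assumptions made for illustration.

```python
import math
import random
import statistics

TRUE_EFFECT = -5.0  # matches the ATE stated above; everything else is illustrative


def simulate_confounded(n, seed=0):
    """Toy DGP: age raises the chance of smoking and lowers the health score."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 80)
        # Assumption: older people are more likely to smoke in this toy model.
        p_smoke = 1 / (1 + math.exp(-(age - 50) / 10))
        smoker = 1 if rng.random() < p_smoke else 0
        # Health falls with age and shifts by TRUE_EFFECT for smokers.
        health = 90 - 0.3 * age + TRUE_EFFECT * smoker + rng.gauss(0, 5)
        rows.append((age, smoker, health))
    return rows


rows = simulate_confounded(10_000)
smokers = [h for _, s, h in rows if s == 1]
non_smokers = [h for _, s, h in rows if s == 0]
naive_diff = statistics.mean(smokers) - statistics.mean(non_smokers)
print(f"Naive difference: {naive_diff:.2f} (true effect: {TRUE_EFFECT})")
```

In this toy model smokers are older on average, so the naive difference comes out more negative than the true effect even though the effect itself is fixed at -5.0.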

In [None]:
df_raw = generate_synthetic_brfss(n=10_000)
print(f"Dataset shape: {df_raw.shape}")
print(f"True ATE (health score): {TRUE_ATE_HEALTH}")
df_raw.head()

In [None]:
df_raw.describe().round(3)

## 2. Treatment Distribution

In [None]:
counts = df_raw[TREATMENT_COL].value_counts().sort_index()
print(f"Non-smokers: {counts[0]:,}  |  Smokers: {counts[1]:,}")
print(f"Smoking prevalence: {df_raw[TREATMENT_COL].mean():.1%}")

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(["Non-smoker", "Smoker"], counts.values, color=["#4C72B0", "#DD8452"])
ax.set_ylabel("Count")
ax.set_title("Treatment Distribution")
plt.tight_layout()
plt.show()

## 3. Outcome Distributions

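> **Warning:** The group differences shown below are naive comparisons and are **biased by confounding**: smokers and non-smokers also differ in covariates that affect the outcomes.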

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Health score
for label, color, name in [(0, "#4C72B0", "Non-smoker"), (1, "#DD8452", "Smoker")]:
    subset = df_raw.loc[df_raw[TREATMENT_COL] == label, OUTCOME_HEALTH]
    axes[0].hist(subset, bins=40, alpha=0.55, label=f"{name} (mean={subset.mean():.1f})", color=color)
axes[0].set_xlabel("Health Score")
axes[0].set_title("Health Score by Smoking Status")
axes[0].legend()

# Cancer rate
cancer_rates = df_raw.groupby(TREATMENT_COL)[OUTCOME_CANCER].mean()
axes[1].bar(["Non-smoker", "Smoker"], cancer_rates.values, color=["#4C72B0", "#DD8452"])
axes[1].set_ylabel("Cancer Rate")
axes[1].set_title("Cancer Rate by Smoking Status")

plt.tight_layout()
plt.show()

In [None]:
# Naive mean differences
naive_health = (df_raw.loc[df_raw[TREATMENT_COL]==1, OUTCOME_HEALTH].mean()
                - df_raw.loc[df_raw[TREATMENT_COL]==0, OUTCOME_HEALTH].mean())
print(f"Naive health-score difference: {naive_health:.3f}")
print(f"True ATE:                      {TRUE_ATE_HEALTH}")
print(f"Bias:                          {naive_health - TRUE_ATE_HEALTH:.3f}")
print("\nThe naive estimate is biased because confounders differ between groups.")
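
The "Bias" line printed above has a standard potential-outcomes decomposition. Writing $Y(1)$ and $Y(0)$ for the health score a person would have with and without smoking, and $T$ for smoking status (notation introduced here for illustration, not used elsewhere in the notebook):

```latex
E[Y \mid T=1] - E[Y \mid T=0]
  = \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{ATT}}
  + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}
```

If the effect is the same for everyone (an assumption; the notebook does not state this), the ATT equals the ATE, and the printed bias estimates the second term: how much the two groups' health scores would differ even if nobody smoked.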

## 4. Covariate Exploration

In [None]:
# Covariates to compare across treatment groups (kept only if present in the data)
numeric_covs = [c for c in ["age", "female", "education", "income", "urban", "state_id"]
                if c in df_raw.columns]
# Plot the first four available covariates by treatment group
plot_covariate_distributions(df_raw, numeric_covs[:4], TREATMENT_COL, save=False)
plt.show()

In [None]:
# Cross-tab: smoking by education
ct = pd.crosstab(df_raw["education"], df_raw[TREATMENT_COL], margins=True)
ct.columns = ["Non-smoker", "Smoker", "Total"]
print("Smoking counts by education level:")
print(ct)
print("\nSmoking prevalence by education level:")
print(df_raw.groupby("education")[TREATMENT_COL].mean().round(3))
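
Group differences like the ones plotted and tabulated above are often summarized with the standardized mean difference (SMD). The sketch below uses plain Python lists rather than the notebook's DataFrame. The helper name `standardized_mean_difference` and the toy ages are hypothetical, and the notebook's own code does not compute this quantity.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in group means divided by the pooled standard deviation."""
    mean_diff = statistics.mean(treated) - statistics.mean(control)
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return mean_diff / pooled_sd


# Toy ages for illustration only (not drawn from the synthetic BRFSS data).
smoker_ages = [52, 61, 47, 58, 66, 49]
non_smoker_ages = [34, 45, 29, 51, 38, 42]

smd = standardized_mean_difference(smoker_ages, non_smoker_ages)
print(f"SMD for age: {smd:.2f}")
```

A common rule of thumb reads an absolute SMD above roughly 0.1 as a meaningful imbalance between groups.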

## 5. Data Cleaning & Feature Engineering

In [None]:
df_clean = clean_data(df_raw)
df = engineer_features(df_clean)
print(f"Shape after cleaning + engineering: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Save processed data
import os
os.makedirs("../data/processed", exist_ok=True)
df.to_csv("../data/processed/cleaned_data.csv", index=False)
print("\nSaved to ../data/processed/cleaned_data.csv")

## 6. Causal DAG

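The DAG encodes our assumptions about the causal structure. Using the
**backdoor criterion**, we identify a set of variables that blocks every
non-causal (backdoor) path from Smoking to HealthScore. Conditioning on that
set removes confounding bias, provided the DAG is correct and every variable
in the set is observed.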
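
When every direct cause of the treatment is observed, one simple sufficient adjustment set is the set of the treatment's parents: conditioning on them blocks every path that enters the treatment through an incoming arrow. The sketch below shows this on a small hypothetical graph. The edge list is an assumption for illustration and is not necessarily the graph returned by `CausalDAG.smoking_health_dag()`; `parents` is a helper defined here, not part of `src.utils.dag`, and `dag.backdoor_variables` may use a different rule.

```python
# Hypothetical (cause, effect) edges; the project's actual DAG may differ.
EDGES = [
    ("Age", "Smoking"), ("Age", "HealthScore"),
    ("Education", "Smoking"), ("Education", "HealthScore"),
    ("Income", "Smoking"), ("Income", "HealthScore"),
    ("Smoking", "HealthScore"),
]


def parents(node, edges):
    """Return the direct causes of `node` in a list of (cause, effect) edges."""
    return {cause for cause, effect in edges if effect == node}


# Parents of the treatment form a valid backdoor set if all are observed.
adjustment_set = parents("Smoking", EDGES)
print(sorted(adjustment_set))
```

The parent set is not always the smallest valid set, but it never contains a descendant of the treatment, which is the other requirement of the backdoor criterion.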

In [None]:
dag = CausalDAG.smoking_health_dag()
dag.plot(save=True)
plt.show()

In [None]:
backdoor = dag.backdoor_variables("Smoking", "HealthScore")
print("Backdoor adjustment set:")
for v in sorted(backdoor):
    print(f"  - {v}")
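
Once an adjustment set $Z$ satisfying the backdoor criterion is available, the effect of smoking is identified from observational quantities by the standard backdoor adjustment formula. Here $T$ stands for Smoking, $Y$ for HealthScore, and $Z$ is treated as discrete; the symbols are introduced for illustration only.

```latex
E[Y \mid do(T=t)] = \sum_{z} E[Y \mid T=t, Z=z] \, P(Z=z)

\text{ATE} = \sum_{z} \bigl( E[Y \mid T=1, Z=z] - E[Y \mid T=0, Z=z] \bigr) \, P(Z=z)
```

In words: compare smokers and non-smokers within each stratum of $Z$, then average those within-stratum differences over the population distribution of $Z$.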

## Summary

1. **Confounding is present** — covariates differ between smokers and non-smokers.
2. **Naive comparisons are biased** — the raw mean difference ≠ true ATE.
3. **The DAG identifies** the variables we must adjust for.
4. **Next:** Notebook 2 applies causal estimators: propensity score matching (PSM), inverse probability weighting (IPW), and doubly robust estimation.