8. Example 1: Causal Analysis on Primary Dataset (Synthetic Medical Study)

8.1 Dataset Description

For our primary worked example, we use a synthetic medical dataset representing a Type 2 Diabetes treatment study. Because the data are synthetic, we can define the "Ground Truth" causal effect explicitly (something real-world observational data never provides) and therefore check objectively whether our causal inference methods remove the bias.

Scenario:
We observe 1,000 patients who either received a new_drug or standard care.

Source: Generated procedurally within this notebook to ensure reproducibility and stability.

Relevance: This mimics a classic Observational Study where Randomized Controlled Trials (RCTs) are not feasible. In such studies, "Confounding by Indication" is common: doctors tend to prescribe new, stronger treatments to patients who are already sicker or older.

Variable Definitions:

Treatment ($T$): new_drug (Binary: 1 if treated, 0 if control).

Outcome ($Y$): hba1c_reduction (Continuous: reduction in blood sugar levels; higher is better).

Confounders ($W$):
- age: Older patients are more likely to receive the drug.
- bmi: Body Mass Index. In the generated data it influences neither treatment assignment nor the outcome, so it acts as an irrelevant covariate.
- severity: Baseline disease severity (sicker patients are more likely to receive the drug).

8.2 Problem Setup 

Causal Question: Does the new_drug cause a greater reduction in HbA1c levels compared to standard care, after adjusting for patient characteristics?

The Bias Mechanism: The data generation process introduces intentional Selection Bias. The probability of receiving the drug (prob_drug) is positively correlated with age and severity.
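To make the assignment mechanism concrete, the sketch below evaluates the logistic treatment probability for two patients. The coefficients (-5, 0.05, 0.3) are the ones used in this example's data-generation code; the two patient profiles and the function name `treatment_probability` are illustrative assumptions, not part of the dataset.

```python
import math


def treatment_probability(age: float, severity: float) -> float:
    """Logistic assignment model with the coefficients from the data-generation step."""
    logit = -5 + 0.05 * age + 0.3 * severity
    return 1 / (1 + math.exp(-logit))


# Two hypothetical patients (illustrative values, not rows from the dataset)
p_younger_milder = treatment_probability(age=50, severity=3)
p_older_sicker = treatment_probability(age=75, severity=8)

print(f"Younger, milder case: {p_younger_milder:.2f}")
print(f"Older, sicker case:   {p_older_sicker:.2f}")
```

The older, sicker patient is several times more likely to be treated (roughly 0.76 versus 0.17), which is the imbalance that a naive comparison of group means inherits.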

Hypothesis: A Naive Comparison (simple difference in means) will underestimate the drug's effectiveness because the treated group starts with a disadvantage (they are older and sicker). A Causal Model using Propensity Score Stratification should correct this bias and recover the true effect.
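The direction of the bias can be worked out from the outcome model. The following is a sketch that assumes the linear outcome equation used in this example's data-generation code, $Y = 2 + 1.5\,T - 0.02\,\text{age} - 0.1\,\text{severity} + \varepsilon$, with noise $\varepsilon$ independent of $T$:

```latex
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0]
  &= 1.5
   - 0.02\,\bigl(E[\text{age} \mid T=1] - E[\text{age} \mid T=0]\bigr) \\
  &\quad - 0.1\,\bigl(E[\text{severity} \mid T=1] - E[\text{severity} \mid T=0]\bigr)
\end{aligned}
```

Because treated patients are on average older and sicker, both bracketed differences are positive, so the naive difference falls below the true effect of 1.5.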

8.3 Step-by-Step Causal Analysis

We will perform the analysis using the DoWhy library following these steps:

Data Generation & Exploration: Generate the data with a known Ground Truth Effect of 1.5.

Naive Estimation: Calculate $E[Y|T=1] - E[Y|T=0]$ to demonstrate the magnitude of the bias.

Modeling: Define a Causal Graph (DAG) identifying age, bmi, and severity as common causes (confounders) of both new_drug and hba1c_reduction.

Identification: Use the Backdoor Criterion to identify the causal effect.

Estimation: Apply Propensity Score Stratification. This method estimates the probability of treatment (propensity score) for each patient, groups them into strata of similar probabilities, and computes the effect within these balanced groups.

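Refutation: Perform a Placebo Treatment Test. We replace the true treatment with a randomly permuted copy of it, which cannot have any causal effect on the outcome. A robust model should then return an effect estimate close to 0, and the refuter's p-value should exceed 0.05, indicating that the model is not finding patterns in random noise.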
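The stratification idea in the Estimation step can be sketched by hand. The code below is a simplified illustration, not DoWhy's implementation: it uses only the standard library, regenerates data from the same assignment and outcome equations as this example, and stratifies on the true propensity score (known here only because the data are synthetic), whereas DoWhy estimates the score from the data. The sample size of 5,000 and the choice of 10 strata are assumptions made to keep the sketch stable.

```python
import math
import random

random.seed(0)
TRUE_EFFECT = 1.5
n = 5000  # larger than the example's 1,000 patients, to reduce noise in this sketch

rows = []  # each row: (propensity, treated, outcome)
for _ in range(n):
    age = random.gauss(60, 10)
    severity = random.gauss(5, 2)
    # Same assignment and outcome equations as the example's data generation
    propensity = 1 / (1 + math.exp(-(-5 + 0.05 * age + 0.3 * severity)))
    treated = 1 if random.random() < propensity else 0
    outcome = (2 + TRUE_EFFECT * treated - 0.02 * age - 0.1 * severity
               + random.gauss(0, 0.5))
    rows.append((propensity, treated, outcome))


def diff_in_means(sample):
    """Mean outcome of treated minus controls; None if either group is empty."""
    treated_y = [y for _, t, y in sample if t == 1]
    control_y = [y for _, t, y in sample if t == 0]
    if not treated_y or not control_y:
        return None
    return sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)


naive = diff_in_means(rows)

# Sort by propensity score and cut into equal-sized strata
num_strata = 10
rows.sort(key=lambda r: r[0])
size = n // num_strata
weighted_sum, used = 0.0, 0
for k in range(num_strata):
    stratum = rows[k * size:(k + 1) * size]
    effect = diff_in_means(stratum)
    if effect is not None:  # skip strata lacking treated or control units
        weighted_sum += effect * len(stratum)
        used += len(stratum)

# Size-weighted average of the within-stratum effects
stratified = weighted_sum / used
print(f"Naive: {naive:.2f}  Stratified: {stratified:.2f}  Truth: {TRUE_EFFECT:.2f}")
```

Within each stratum, treated and control patients have similar treatment probabilities, so their ages and severities are roughly balanced and the within-stratum difference in means is far less confounded than the overall one.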

In [7]:
!pip install dowhy

In [8]:
import dowhy
from dowhy import CausalModel
import pandas as pd
import numpy as np

# 1. Generate synthetic data (seeded for reproducibility)
np.random.seed(42)
n = 1000

# Confounders
age = np.random.normal(60, 10, n)
bmi = np.random.normal(30, 5, n)
severity = np.random.normal(5, 2, n)

# Treatment Assignment Mechanism (Bias)
# Probability of getting drug increases with Age and Severity
prob_drug = 1 / (1 + np.exp(-( -5 + 0.05*age + 0.3*severity )))
treatment = np.random.binomial(1, prob_drug)

# Outcome Generation (Ground Truth)
# True Causal Effect = 1.5 units of reduction
# Age and severity both lower the expected reduction, independently of treatment
outcome = 2 + (1.5 * treatment) - (0.02 * age) - (0.1 * severity) + np.random.normal(0, 0.5, n)

df_medical = pd.DataFrame({
    'age': age,
    'bmi': bmi,
    'severity': severity,
    'new_drug': treatment,
    'hba1c_reduction': outcome
})

print("Dataset Head:")
print(df_medical.head())

# 2. Naive Comparison
naive_effect = df_medical[df_medical['new_drug']==1]['hba1c_reduction'].mean() - \
               df_medical[df_medical['new_drug']==0]['hba1c_reduction'].mean()

print(f"\nNaive Difference in Means: {naive_effect:.2f}")
print("Observation: The naive estimate is likely skewed lower than the truth (1.5) due to severity bias.")

# 3. Define Causal Model (DAG)
model = CausalModel(
    data=df_medical,
    treatment='new_drug',
    outcome='hba1c_reduction',
    common_causes=['age', 'bmi', 'severity']
)

# 4. Identification & Estimation
identified_estimand = model.identify_effect()

# Method: Propensity Score Stratification
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_stratification"
)

print(f"\nCausal Estimate (DoWhy): {estimate.value:.2f}")
print(f"Ground Truth Effect:     1.50")

# 5. Refutation (Robustness Check)
# Placebo treatment refuter: replace the drug with a permuted (random) copy of itself.
# If the model is sound, the estimated effect of this placebo should vanish.
refute = model.refute_estimate(
    identified_estimand, 
    estimate, 
    method_name="placebo_treatment_refuter",
    placebo_type="permute"
)

# The placebo refuter stores its significance test result, including a p-value
print(f"\nRefutation p-value: {refute.refutation_result['p_value']:.2f}")
print("(Interpretation: A p-value > 0.05 means our estimate is robust. We want to fail to reject the null for the placebo.)")

Dataset Head:
         age        bmi  severity  new_drug  hba1c_reduction
0  64.967142  36.996777  3.649643         1         1.926224
1  58.617357  34.623168  4.710963         1         2.508196
2  66.476885  30.298152  3.415160         0         0.622029
3  75.230299  26.765316  4.384077         0        -0.149221
4  57.658466  33.491117  1.212771         0         0.854153

Naive Difference in Means: 1.32
Observation: The naive estimate is likely skewed lower than the truth (1.5) due to severity bias.

Causal Estimate (DoWhy): 1.50
Ground Truth Effect:     1.50

Refutation p-value: 0.84
(Interpretation: A p-value > 0.05 means our estimate is robust. We want to fail to reject the null for the placebo.)
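The logic behind the placebo test can be illustrated without DoWhy. The sketch below is an assumption-laden simplification: it regenerates data from this example's equations using only the standard library, and it uses a plain difference in means as the estimator instead of propensity score stratification. Shuffling the treatment labels breaks any link between treatment and outcome, so the estimated "effect" of the shuffled labels should average out near zero. The 200 repetitions are an arbitrary choice.

```python
import math
import random

random.seed(1)
n = 1000

# Simulate treatment and outcome with the example's equations
treatment, outcome = [], []
for _ in range(n):
    age = random.gauss(60, 10)
    severity = random.gauss(5, 2)
    p = 1 / (1 + math.exp(-(-5 + 0.05 * age + 0.3 * severity)))
    t = 1 if random.random() < p else 0
    treatment.append(t)
    outcome.append(2 + 1.5 * t - 0.02 * age - 0.1 * severity + random.gauss(0, 0.5))


def diff_in_means(t_values, y_values):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated_y = [y for t, y in zip(t_values, y_values) if t == 1]
    control_y = [y for t, y in zip(t_values, y_values) if t == 0]
    return sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)


observed = diff_in_means(treatment, outcome)

# Placebo: shuffle the treatment labels and re-estimate, many times
placebo_effects = []
for _ in range(200):
    placebo = treatment[:]
    random.shuffle(placebo)
    placebo_effects.append(diff_in_means(placebo, outcome))

mean_placebo = sum(placebo_effects) / len(placebo_effects)
print(f"Observed effect: {observed:.2f}")
print(f"Mean placebo effect over 200 shuffles: {mean_placebo:.3f}")
```

A placebo effect far from zero would suggest that the estimator picks up structure unrelated to the treatment itself.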


## 9. Example 2: Regression Discontinuity Design (RDD)

### 9.1 Dataset Description
For this example, we move away from "Confounding" (Selection Bias) and explore a **Regression Discontinuity** structure. The key difference is that treatment assignment is not probabilistic: it is determined entirely by whether a score crosses a strict **cutoff**.

**Scenario:**
A university grants a **Merit Scholarship** to students who score **80 points or higher** on an entrance exam. We want to know whether receiving this scholarship causes students to earn more money in their first job.

### 9.2 Problem Setup
* **Running Variable ($X$):** `Entrance_Score` (0-100).
* **Treatment ($T$):** `Scholarship` (1 if Score >= 80, else 0).
* **Outcome ($Y$):** `Future_Income` (in dollars).
* **Causal Logic:** Students who scored 79 (control) and 81 (treated) should be, on average, nearly identical in intelligence and motivation, provided they cannot precisely manipulate their scores. The only systematic difference between them is the scholarship. By comparing students *just around the cutoff*, we can estimate a causal effect at the cutoff without needing to control for every background variable.
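The setup above can be sketched as a small simulation. Everything numeric here is a hypothetical assumption chosen for illustration (a true scholarship effect of 5,000 dollars, a linear income trend, a bandwidth of 10 points, 4,000 students); only the cutoff of 80 comes from the scenario. The sketch fits a separate straight line on each side of the cutoff within the bandwidth and takes the gap between the two fitted values at the cutoff as the effect estimate. It uses only the standard library and is not a DoWhy call.

```python
import random

random.seed(2)
CUTOFF = 80
TRUE_EFFECT = 5000  # hypothetical scholarship effect on income, in dollars
BANDWIDTH = 10      # hypothetical window of scores kept on each side of the cutoff

# Simulate students: income rises smoothly with score, plus a jump at the cutoff
data = []
for _ in range(4000):
    entrance_score = random.uniform(0, 100)
    scholarship = 1 if entrance_score >= CUTOFF else 0
    future_income = (30000 + 200 * entrance_score
                     + TRUE_EFFECT * scholarship + random.gauss(0, 2000))
    data.append((entrance_score, future_income))


def intercept_at_cutoff(points):
    """Least-squares fit of income = a + b * (score - CUTOFF); returns a."""
    xs = [score - CUTOFF for score, _ in points]
    ys = [income for _, income in points]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar


# Keep only students close to the cutoff, split by side
below = [(s, y) for s, y in data if CUTOFF - BANDWIDTH <= s < CUTOFF]
above = [(s, y) for s, y in data if CUTOFF <= s <= CUTOFF + BANDWIDTH]

# The jump between the two fitted lines at the cutoff is the RDD estimate
rdd_estimate = intercept_at_cutoff(above) - intercept_at_cutoff(below)
print(f"RDD estimate: {rdd_estimate:.0f}  (simulated true effect: {TRUE_EFFECT})")
```

The bandwidth is a trade-off: a narrower window makes the students on either side more comparable but leaves fewer observations, so the estimate becomes noisier.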

