# Causal Inference Methods: An Overview

## 1. Randomized Controlled Trials (RCTs)

Randomized Controlled Trials (RCTs) are the bedrock of causal inference. Participants are randomly assigned to **treatment** and **control** groups, so the two groups are comparable on average and systematic differences in outcomes can be attributed to the treatment rather than to confounding.

### Formal Setup

Let $ Y_i(1) $ denote the potential outcome for unit $ i $ if treated, and $ Y_i(0) $ the potential outcome if untreated.

The **Average Treatment Effect (ATE)** is defined as:

$$
ATE = \mathbb{E}[Y_i(1) - Y_i(0)]
$$

Because randomization ensures independence between treatment assignment $ D_i $ and potential outcomes:

$$
D_i \perp (Y_i(0), Y_i(1))
$$

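the difference in conditional means identifies the ATE, and its sample analogue is an unbiased estimator:

$$
\widehat{ATE} = \frac{1}{n_1} \sum_{i: D_i = 1} Y_i - \frac{1}{n_0} \sum_{i: D_i = 0} Y_i
$$

where $ n_1 $ and $ n_0 $ are the numbers of treated and untreated units.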
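
The step from independence to a difference in means can be written out explicitly. This short derivation uses the observed outcome $ Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0) $, the standard consistency relation, in the first line and the independence condition in the second:

$$
\begin{aligned}
\mathbb{E}[Y_i | D_i = 1] - \mathbb{E}[Y_i | D_i = 0]
&= \mathbb{E}[Y_i(1) | D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0] \\
&= \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] \\
&= ATE
\end{aligned}
$$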

### Pros

* Gold standard for causal inference due to random assignment.
* Provides high internal validity, helping establish strong causality.
* Results generalize to a larger population only when the study sample is representative of it (external validity).

### Cons

* May be expensive, time-consuming, or ethically challenging.
* Not always feasible for all research questions.

### Real-life Applications

* **Pharmaceutical trials:** Used to assess the efficacy of new drugs by comparing treated and placebo groups.
* **Education interventions:** Used to evaluate innovative teaching methods on student performance.

### Interview Question

**Q:** How do randomized controlled trials ensure unbiased causal inference?

**A:** Random assignment makes treatment independent of potential outcomes, so the treatment and control groups are comparable on average in both observed and unobserved characteristics. Any systematic difference in outcomes can then be attributed to the treatment.
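
The contrast between random assignment and self-selection can be illustrated with a small simulation. This is a sketch under assumed values: the confounder `ability`, the effect size of 5, and the selection rule are all hypothetical and chosen only to make the bias visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder that raises the outcome (assumed, for illustration)
ability = rng.normal(0, 1, n)
effect = 5.0
y0 = 50 + 10 * ability + rng.normal(0, 5, n)  # potential outcome if untreated
y1 = y0 + effect                              # potential outcome if treated

# Random assignment: treatment is independent of ability
d_rand = rng.binomial(1, 0.5, n)
# Self-selection: higher-ability units are more likely to take the treatment
d_self = rng.binomial(1, 1 / (1 + np.exp(-2 * ability)))

def diff_in_means(d, y1, y0):
    """Difference in observed mean outcomes between treated and untreated."""
    y = np.where(d == 1, y1, y0)
    return y[d == 1].mean() - y[d == 0].mean()

est_rand = diff_in_means(d_rand, y1, y0)
est_self = diff_in_means(d_self, y1, y0)

# Covariate gap between groups: near zero only under randomization
gap_rand = ability[d_rand == 1].mean() - ability[d_rand == 0].mean()
gap_self = ability[d_self == 1].mean() - ability[d_self == 0].mean()

print(f"Randomized:    estimate = {est_rand:.2f}, ability gap = {gap_rand:.3f}")
print(f"Self-selected: estimate = {est_self:.2f}, ability gap = {gap_self:.3f}")
```

Under randomization the ability gap is close to zero and the estimate is close to the true effect; under self-selection the same estimator absorbs the ability gap and overstates the effect.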

```python
# Python example: difference in means in a simulated randomized experiment

import numpy as np
import pandas as pd

# Simulate randomized experiment
np.random.seed(42)
n = 1000
treatment = np.random.binomial(1, 0.5, n)
y0 = np.random.normal(50, 10, n)
treatment_effect = 5
y1 = y0 + treatment_effect
outcome = np.where(treatment == 1, y1, y0)

df = pd.DataFrame({'D': treatment, 'Y': outcome})
ate = df.loc[df.D == 1, 'Y'].mean() - df.loc[df.D == 0, 'Y'].mean()
print(f"Estimated ATE = {ate:.3f}")
```

## 2. Matching Methods

Matching methods are used when randomization is not feasible. Researchers pair treated and untreated units with similar characteristics to approximate a randomized design.

### Mathematical Idea

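Let $ X_i $ denote observed covariates, and $ D_i $ the treatment indicator. Matching does not guarantee comparability by itself; it relies on the conditional independence (unconfoundedness) assumption that, among units with the same $ X_i $, treatment is as good as random:

$$
\mathbb{E}[Y_i(0) | D_i = 1, X_i] = \mathbb{E}[Y_i(0) | D_i = 0, X_i]
$$

and likewise for $ Y_i(1) $. Together with overlap (both treated and untreated units exist at each value of $ X_i $), treatment effects can be estimated as:

$$
ATE = \mathbb{E}_{X}\big[\mathbb{E}[Y_i | D_i=1, X_i] - \mathbb{E}[Y_i | D_i=0, X_i]\big]
$$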
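
With a discrete covariate, the ATE formula can be computed directly by exact matching on $ X_i $ (stratification): take the treated-minus-control difference within each stratum, then average over the distribution of $ X_i $. The sketch below assumes a single three-level covariate and a true effect of 2; all names and parameter values are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Assumed data-generating process: X drives both treatment and outcome
x = rng.integers(0, 3, n)                  # discrete covariate with 3 strata
p_treat = np.array([0.2, 0.5, 0.8])[x]     # treatment probability rises with X
d = rng.binomial(1, p_treat)
y = 2.0 * d + 3.0 * x + rng.normal(0, 1, n)  # true effect = 2

df = pd.DataFrame({"X": x, "D": d, "Y": y})

# Naive comparison ignores X and is confounded
naive = df.loc[df.D == 1, "Y"].mean() - df.loc[df.D == 0, "Y"].mean()

# Inner expectation: E[Y | D=1, X] - E[Y | D=0, X] for each stratum
cell_means = df.groupby(["X", "D"])["Y"].mean().unstack("D")
within = cell_means[1] - cell_means[0]

# Outer expectation: weight each stratum by its share of the sample
weights = df["X"].value_counts(normalize=True).sort_index()
ate_strat = (within * weights).sum()

print(f"Naive difference in means = {naive:.3f}")
print(f"Stratified ATE estimate   = {ate_strat:.3f}")
```

Weighting by the treated units' distribution of $ X_i $ instead of the full-sample distribution would give the ATT rather than the ATE.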

### Pros

* Reduces selection bias by balancing covariates between treatment and control groups.
* Suitable for observational studies where random assignment isn’t possible.

### Cons

* Requires careful selection of matching variables.
* Cannot control for unobserved confounders.

### Real-life Applications

* **Job training programs:** Compare employment outcomes between trained and untrained individuals with similar characteristics.
* **Healthcare outcomes:** Match patients by medical history to estimate treatment effects.

### Interview Question

**Q:** What is selection bias, and how do matching methods address it?

**A:** Selection bias arises when treated and control groups differ systematically. Matching reduces it by ensuring groups are similar on observed characteristics.

```python
# Python example: propensity score matching (estimates the ATT)

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Simulated observational data
np.random.seed(0)
n = 500
X = np.random.normal(size=(n, 2))
# Treatment probability: logistic in X0 - 0.5*X1
p = 1 / (1 + np.exp(-X[:,0] + 0.5*X[:,1]))
D = np.random.binomial(1, p)
# X0 affects both treatment and outcome, so it confounds; true effect = 2
Y = 2*D + X[:,0] + np.random.normal(0, 1, n)

# Propensity score estimation
model = LogisticRegression().fit(X, D)
pscore = model.predict_proba(X)[:,1]

# Nearest neighbor matching on propensity score (with replacement)
treated = np.where(D==1)[0]
control = np.where(D==0)[0]
nbrs = NearestNeighbors(n_neighbors=1).fit(pscore[control].reshape(-1,1))
_, idx = nbrs.kneighbors(pscore[treated].reshape(-1,1))
matched_control = control[idx.flatten()]

# Average treated-minus-matched-control difference over treated units
att = np.mean(Y[treated] - Y[matched_control])
print(f"Estimated ATT (Matching) = {att:.3f}")
```
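
Whether matching actually balanced the covariates can be checked with a standardized mean difference (SMD) before and after matching. The sketch below re-creates a similar simulation so that it runs on its own; the helper `smd` and the sample size are assumptions for illustration, not part of the example above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))  # assumed treatment model
D = rng.binomial(1, p)

# Estimate propensity scores and match each treated unit to its nearest control
pscore = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
treated = np.flatnonzero(D == 1)
control = np.flatnonzero(D == 0)
nn = NearestNeighbors(n_neighbors=1).fit(pscore[control].reshape(-1, 1))
_, idx = nn.kneighbors(pscore[treated].reshape(-1, 1))
matched = control[idx.ravel()]

def smd(a, b):
    """Standardized mean difference between two samples (hypothetical helper)."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Compare treated units with all controls, then with matched controls only
smd_before = np.array([smd(X[treated, j], X[control, j]) for j in range(2)])
smd_after = np.array([smd(X[treated, j], X[matched, j]) for j in range(2)])

print("SMD before matching:", np.round(smd_before, 3))
print("SMD after matching: ", np.round(smd_after, 3))
```

Absolute SMDs closer to zero after matching indicate better balance on the observed covariates; they say nothing about unobserved confounders.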

## 3. Instrumental Variables (IV)

Instrumental Variables are used when treatment $ D_i $ is endogenous (correlated with unobserved factors). An **instrument** $ Z_i $ is used that affects $ D_i $ but not the outcome $ Y_i $ directly.

### Assumptions

1. **Relevance:**
   $$
   \text{Cov}(Z_i, D_i) \neq 0
   $$
2. **Exogeneity (Exclusion Restriction):**
   $$
   \text{Cov}(Z_i, \varepsilon_i) = 0
   $$
   where $ \varepsilon_i $ is the error term in the outcome equation.

### Two-Stage Least Squares (2SLS)

1. First stage:
   $$
   D_i = \pi_0 + \pi_1 Z_i + v_i
   $$
2. Second stage:
   $$
   Y_i = \beta_0 + \beta_1 \hat{D_i} + \varepsilon_i
   $$

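With a constant treatment effect, $ \hat{\beta}_1 $ estimates that effect. With heterogeneous effects, a binary instrument, and the additional monotonicity assumption (no defiers), it estimates the **Local Average Treatment Effect (LATE)**: the average effect for compliers, the units whose treatment status is changed by the instrument.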
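
With a single instrument and a single endogenous regressor, the 2SLS coefficient reduces to the ratio $ \text{Cov}(Z_i, Y_i) / \text{Cov}(Z_i, D_i) $. The sketch below computes that ratio next to the OLS slope on simulated data; the data-generating process and the true coefficient of 2 are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

z = rng.binomial(1, 0.5, n)                      # instrument
u = rng.normal(0, 1, n)                          # unobserved confounder
d = 0.8 * z + 0.5 * u + rng.normal(0, 0.2, n)    # endogenous regressor
y = 2.0 * d + u + rng.normal(0, 0.2, n)          # outcome, true coefficient = 2

# OLS slope: Cov(D, Y) / Var(D), biased because D is correlated with u
beta_ols = np.cov(d, y)[0, 1] / np.var(d, ddof=1)

# IV (ratio) estimator: Cov(Z, Y) / Cov(Z, D)
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print(f"OLS estimate = {beta_ols:.3f}")
print(f"IV estimate  = {beta_iv:.3f}")
```

The OLS slope is pulled upward by the shared dependence on `u`, while the ratio uses only the variation in `d` that comes from `z`.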

### Pros

* Addresses endogeneity using an external instrument.
* Useful when randomization is not feasible.

### Cons

* Requires a valid instrument that affects treatment but not outcome directly.
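* Weak instruments (a small first-stage correlation between $ Z_i $ and $ D_i $) bias 2SLS toward the OLS estimate and make inference unreliable.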

### Real-life Applications

* **Education and income:** Use proximity to a college as an instrument for education.
* **Economic policies:** Use interest rate changes as instruments for studying consumer spending.

### Interview Question

**Q:** How does an instrumental variable differ from a confounding variable?

**A:** An instrumental variable affects the treatment but not the outcome directly, while a confounding variable affects both treatment and outcome, biasing estimates.

```python
# Python example: 2SLS by hand, as two OLS regressions in statsmodels

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulate endogenous treatment
np.random.seed(1)
n = 1000
Z = np.random.binomial(1, 0.5, n)              # Instrument
u = np.random.normal(0, 1, n)                  # Unobserved confounder (enters D and Y)
D = 0.8*Z + 0.5*u + np.random.normal(0, 0.2, n)  # Endogenous regressor
Y = 2*D + u + np.random.normal(0, 0.2, n)       # Outcome, true coefficient on D = 2

# Prepare data for 2SLS
df = pd.DataFrame({'Y': Y, 'D': D, 'Z': Z})
df['const'] = 1  # add constant manually

# First stage: regress D on Z
first_stage = sm.OLS(df['D'], df[['const', 'Z']]).fit()
df['D_hat'] = first_stage.fittedvalues

# Second stage: regress Y on predicted D
# Note: the point estimate is the 2SLS estimate, but the standard errors
# reported by this second OLS are not valid 2SLS standard errors.
second_stage = sm.OLS(df['Y'], df[['const', 'D_hat']]).fit()

print("First stage summary:")
print(first_stage.summary())
print("\nSecond stage (2SLS) summary:")
print(second_stage.summary())
```


```python
# Python example: 2SLS in one step with statsmodels' IV2SLS

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.regression.gmm import IV2SLS

# Simulate endogenous treatment (same setup as the manual example)
np.random.seed(1)
n = 1000
Z = np.random.binomial(1, 0.5, n)              # Instrument
u = np.random.normal(0, 1, n)                  # Unobserved confounder (enters D and Y)
D = 0.8*Z + 0.5*u + np.random.normal(0, 0.2, n)  # Endogenous regressor
Y = 2*D + u + np.random.normal(0, 0.2, n)       # Outcome, true coefficient on D = 2

# Prepare data
df = pd.DataFrame({'Y': Y, 'D': D, 'Z': Z})
df['const'] = 1  # intercept

# Run 2SLS directly: arguments are outcome, regressors, instruments
iv = IV2SLS(df['Y'], df[['const', 'D']], df[['const', 'Z']]).fit()

# Display results
print(iv.summary())
```


## 4. Difference-in-Differences (DiD)

Difference-in-Differences compares the **change in outcomes** over time between treated and control groups.

### Model

Let $ Y_{it} $ be the outcome for unit $ i $ at time $ t $, let $ D_i $ indicate membership in the treated group, and let $ T_t $ indicate the post-treatment period.

$$
Y_{it} = \alpha + \delta D_i + \gamma T_t + \beta (D_i \times T_t) + \varepsilon_{it}
$$

The causal effect is captured by:

$$
\hat{\beta}_{DiD} = ( \overline{Y}_{t=1}^{Treated} - \overline{Y}_{t=0}^{Treated} ) - ( \overline{Y}_{t=1}^{Control} - \overline{Y}_{t=0}^{Control} )
$$
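
In the two-group, two-period case the interaction coefficient from the regression model equals the double difference of group means exactly. The sketch below fits the regression with the statsmodels formula API and compares the two; the simulated baseline gap, common trend of 1.5, and effect of 3 are assumed values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(10)
n = 2000

d = rng.binomial(1, 0.5, n)
# Treated units start at a higher level (a time-invariant group difference)
unit_level = rng.normal(50, 5, n) + 4 * d
y_pre = unit_level + rng.normal(0, 2, n)
# Common trend of 1.5 for everyone, plus a treatment effect of 3 for the treated
y_post = unit_level + 1.5 + 3.0 * d + rng.normal(0, 2, n)

# Long format: one row per unit and period
long = pd.DataFrame({
    "Y": np.concatenate([y_pre, y_post]),
    "D": np.tile(d, 2),
    "T": np.repeat([0, 1], n),
})

# Regression form: Y = alpha + delta*D + gamma*T + beta*(D x T) + error
fit = smf.ols("Y ~ D + T + D:T", data=long).fit()
beta_did = fit.params["D:T"]

# Double difference of the four group-by-period means
means = long.groupby(["D", "T"])["Y"].mean()
double_diff = (means.loc[(1, 1)] - means.loc[(1, 0)]) - (means.loc[(0, 1)] - means.loc[(0, 0)])

print(f"Regression interaction coefficient = {beta_did:.3f}")
print(f"Double difference of means         = {double_diff:.3f}")
```

The regression form is convenient when adding covariates or clustering standard errors by unit, neither of which is done in this sketch.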

### Assumption

* **Parallel Trends:** In absence of treatment, both groups would follow the same trend in outcomes.

### Pros

* Captures treatment effects over time.
* Controls for time-invariant differences.

### Cons

* Requires parallel trends assumption.
* May not control for time-varying confounders.

### Real-life Applications

* **Policy evaluations:** Measure impact of minimum wage laws on employment.
* **Public health:** Assess effectiveness of anti-smoking campaigns.

### Interview Question

**Q:** What does “difference-in-differences” mean?

**A:** It is a difference between two differences: the before-to-after change in the treated group's average outcome minus the before-to-after change in the control group's average outcome.

```python
# Python example: two-period difference-in-differences

import numpy as np
import pandas as pd

# Simulate panel data
np.random.seed(10)
n = 500
pre = np.random.normal(50, 5, n)           # pre-period outcome
post = pre + np.random.normal(0, 2, n)     # post-period outcome without treatment
treatment = np.random.binomial(1, 0.5, n)
post_treated = post + treatment * 3 # treatment effect = 3

df = pd.DataFrame({
    'Y_pre': pre, 'Y_post': post_treated, 'D': treatment
})

# (change in treated group) minus (change in control group)
did = (df.loc[df.D==1, 'Y_post'].mean() - df.loc[df.D==1, 'Y_pre'].mean()) - \
      (df.loc[df.D==0, 'Y_post'].mean() - df.loc[df.D==0, 'Y_pre'].mean())
print(f"Estimated DiD Effect = {did:.3f}")
```

## 5. Regression Discontinuity Design (RDD)

RDD exploits **natural thresholds** that determine treatment assignment.

### Model

Suppose treatment occurs if the running variable $ X_i $ exceeds a cutoff $ c $:

$$
D_i = 1(X_i \geq c)
$$

Then the treatment effect is estimated as the discontinuity at $ X_i = c $:

$$
\tau_{RDD} = \lim_{x \to c^+} \mathbb{E}[Y_i | X_i = x] - \lim_{x \to c^-} \mathbb{E}[Y_i | X_i = x]
$$

### Pros

* Exploits clear cutoffs for causal inference.
* Controls for unobserved confounders near the threshold.

### Cons

* Only applies near the cutoff (local effect).
* Requires smoothness of other variables around threshold.

### Real-life Applications

* **Elections:** Study the causal effect of political victories on policy outcomes.
* **Education:** Evaluate effects of scholarships awarded based on grade cutoffs.

### Interview Question

**Q:** How does RDD work, and why is the threshold important?

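**A:** RDD identifies causal effects by comparing units just above and just below a threshold, assuming they are otherwise similar. The threshold matters because it alone determines treatment: units on either side are nearly identical in the running variable, so a jump in the outcome at the cutoff can be attributed to the treatment.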

## Summary

Causal inference methods provide tools to uncover cause-and-effect relationships:

| Method   | Key Idea                     | Assumptions                 | Typical Use          |
| -------- | ---------------------------- | --------------------------- | -------------------- |
| RCT      | Random assignment            | Independence                | Clinical trials      |
| Matching | Pair similar units           | No unobserved confounders   | Observational data   |
| IV       | Use external instrument      | Valid, exogenous instrument | Endogeneity problems |
| DiD      | Compare before–after changes | Parallel trends             | Policy evaluation    |
| RDD      | Exploit threshold cutoffs    | Continuity near cutoff      | Local policy effects |

| Method   | Estimator Formula | Assumptions | Typical Python Tool |
| -------- | ----------------- | ----------- | ------------------- |
| RCT      | $ \mathbb{E}[Y \mid D=1] - \mathbb{E}[Y \mid D=0] $ | Randomization | NumPy, Pandas |
| Matching | $ \frac{1}{N_T} \sum_{i: D_i=1} (Y_i - Y_{match(i)}) $ | Conditional Independence | scikit-learn |
| IV       | $ \frac{Cov(Z,Y)}{Cov(Z,D)} $ | Valid Instrument | statsmodels, linearmodels |
| DiD      | Double difference over time | Parallel Trends | Pandas |
| RDD      | $ \lim_{x \downarrow c} \mathbb{E}[Y \mid X=x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X=x] $ | Continuity near cutoff | statsmodels |
