# EconML API Demonstration

**Goal.** Introduce the core API of the [EconML](https://www.pywhy.org/EconML/) library and show how this project wraps that API in `econml_utils.py` to make it easier to:

- Estimate **average treatment effects (ATE)** and
- Explore **heterogeneous treatment effects (CATE)** across student subgroups.

This notebook demonstrates the **native EconML API** on a tiny synthetic example, then the **wrapper API** on the real UCI Student Performance dataset.

In [1]:
# If needed, install econml and ucimlrepo.
# (Once the Docker image has them, you can comment these out.)
# !pip install -q econml ucimlrepo
# !pip install ipywidgets

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

from econml.dml import LinearDML, CausalForestDML, SparseLinearDML

from econml_utils import (
    load_student_data,
    clean_student_data,
    make_default_config,
    fit_econml_estimator,
    estimate_ate,
    estimate_cate_by_subgroup,
    summarize_treatment,
)

# Notebook Outline

This notebook is organized into two main sections:

1. **Native EconML API (synthetic data)**  
   - Build a simple synthetic dataset with a binary treatment.  
   - Fit a `LinearDML` model directly using the EconML API.  
   - Inspect the estimated average treatment effect (ATE).

2. **Wrapper API on real data (Student Performance)**  
   - Load and clean the UCI Student Performance dataset via `ucimlrepo`.
   - Use `EconMLEducationConfig` to define outcome, treatment, and covariates.  
   - Fit different DML estimators (`LinearDML`, `CausalForestDML`, `SparseLinearDML`).  
   - Estimate ATE and subgroup CATEs using helper functions in `econml_utils.py`.

# Native EconML API on a synthetic dataset

To understand the EconML API, we start with a very small **toy example**. Because we control the data-generating process, we know the "true" treatment effect and can check the estimate against it.

We will:

1. Simulate feature vectors `X`, a binary treatment `T`, and an outcome `Y`.
2. Use `LinearDML` directly (without the wrapper) to estimate the treatment effect.
3. Compare EconML's ATE estimate with the true effect we built into the data.

Key idea: **Double Machine Learning (DML)** uses machine learning models to separately learn:
- how covariates predict the outcome, and
- how covariates predict treatment,

then estimates treatment effects by regressing the outcome residuals on the treatment residuals.
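The residual idea can be sketched by hand with NumPy alone. This is an illustration under stated assumptions, not how EconML is implemented: the nuisance models here are plain least-squares fits (EconML uses configurable machine learning models), cross-fitting uses two folds, and the helper name `residualize` is hypothetical. The data follow the same formulas as the synthetic example in this section, but with a separate seed.

```python
import numpy as np

def residualize(target, features, n_folds=2, seed=0):
    """Hypothetical helper: target minus out-of-fold linear predictions."""
    n = len(target)
    # Random, balanced fold labels; the same seed gives the same folds.
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    design = np.column_stack([np.ones(n), features])
    residuals = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Fit on the other fold(s), predict on the held-out fold.
        coef, *_ = np.linalg.lstsq(design[train], target[train], rcond=None)
        residuals[test] = target[test] - design[test] @ coef
    return residuals

# Synthetic data: treatment probability and outcome both depend on X.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))
T = rng.binomial(1, 1 / (1 + np.exp(-0.5 * X.ravel())))
Y = 5.0 + X.ravel() + (2.0 + 0.5 * X.ravel()) * T + rng.normal(size=n)

# Step 1: remove what X explains from the outcome and from the treatment.
y_res = residualize(Y, X)
t_res = residualize(T.astype(float), X)

# Step 2: regress outcome residuals on treatment residuals (no intercept).
theta_hat = np.sum(t_res * y_res) / np.sum(t_res ** 2)
print(f"Residual-on-residual effect estimate: {theta_hat:.3f}")
```

The single slope `theta_hat` summarizes the effect as one number; estimators such as `LinearDML` extend the final step so that the effect can vary with `X`.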

In [2]:
# 2.1 Simulate a simple DML-friendly dataset
rng = np.random.RandomState(42)
n = 2000

# One continuous feature: study-related "ability" score
X = rng.normal(size=(n, 1))

# Binary treatment with probability depending on X
# (students with higher ability are slightly more likely to take the treatment)
p_t = 1 / (1 + np.exp(-0.5 * X.ravel()))
T = rng.binomial(1, p_t)

# True treatment effect is mildly heterogeneous in X
true_tau = 2.0 + 0.5 * X.ravel()

# Baseline outcome that also depends on X
baseline = 5.0 + 1.0 * X.ravel()

# Outcome with noise
Y = baseline + true_tau * T + rng.normal(scale=1.0, size=n)

# 2.2 Fit LinearDML directly with EconML
# In a typical DML setup:
# - Y: outcome
# - T: treatment
# - X: features driving heterogeneity
# - W: additional controls (optional; here it is left empty)

est_linear_native = LinearDML(
    discrete_treatment=True,
    random_state=42,
)

est_linear_native.fit(Y, T, X=X, W=None)

# 2.3 Estimate ATE and compare with the "true" average

ate_est = est_linear_native.ate(X=X)
# alpha=0.05 requests a 95% interval explicitly rather than relying on the default
ate_ci_low, ate_ci_high = est_linear_native.ate_interval(X=X, alpha=0.05)

true_ate = np.mean(true_tau)

print(f"True ATE (simulated): {true_ate:.3f}")
print(f"EconML LinearDML ATE: {ate_est:.3f}")
print(f"95% CI for EconML LinearDML ATE: [{ate_ci_low:.3f}, {ate_ci_high:.3f}]")

True ATE (simulated): 2.023
EconML LinearDML ATE: 1.970
95% CI for EconML LinearDML ATE: [1.881, 2.060]


## Interpretation

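- **True ATE**: comes from the known data-generating process used to build `Y`.
- **Estimated ATE (LinearDML)**: uses the EconML `LinearDML` estimator to recover the causal effect from data in which treatment assignment depends on `X`, as it would in an observational study. Here the reported interval [1.881, 2.060] contains the true value 2.023.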
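To see why an adjustment is needed on data like this, a naive difference in group means can be compared with the known effect. The sketch below is only an illustration: it regenerates data from the same formulas as the synthetic example with its own seed, so its numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Same structure as the synthetic example: x drives both treatment and outcome.
x = rng.normal(size=n)
t = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
tau = 2.0 + 0.5 * x
y = 5.0 + x + tau * t + rng.normal(size=n)

true_ate = tau.mean()

# Naive comparison: ignores that treated units tend to have higher x.
naive_diff = y[t == 1].mean() - y[t == 0].mean()

print(f"True ATE (simulated):       {true_ate:.3f}")
print(f"Naive difference in means:  {naive_diff:.3f}")
```

Because units with higher `x` are both more likely to be treated and have higher baseline outcomes, the naive difference overstates the effect; adjusting for `x` is what removes that bias.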

The key takeaway is not the exact numbers but the **API pattern**:

```
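est = LinearDML(discrete_treatment=True, random_state=42)
est.fit(Y, T, X=X, W=W)               # W=None when there are no extra controls
est.ate(X=X)                          # point estimate of the ATE
est.ate_interval(X=X, alpha=0.05)     # 95% confidence interval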
```