# CausalML Lecture 5: Heterogeneous Treatment Effects

"The average causal effect T is an _average_ and as such enjoys all the advantages and disadvantages of averages." – P. W. Holland

**Giorgio Coppola**  
**Xiaohan Wu**

## Setup Instructions

To replicate this environment and run this notebook with the exact same package versions, use Poetry as in Lecture 1:

1. **Install dependencies**:
   ```bash
   poetry install
   ```
   This will create a virtual environment and install all dependencies specified in `pyproject.toml`. Learn more about [dependency management](https://python-poetry.org/docs/dependency-specification/).

2. **Create and install the Jupyter kernel**:
   ```bash
   poetry run python -m ipykernel install --user --name=lecture-5-env --display-name="CausalML Lecture 5"
   ```
   The `poetry run` command executes commands within the Poetry virtual environment. See [Poetry environment management](https://python-poetry.org/docs/managing-environments/).

3. **Start Jupyter**:
   ```bash
   poetry run jupyter notebook
   ```

4. **Select the kernel**: In Jupyter, go to Kernel → Change Kernel → "CausalML Lecture 5"

For more information about Poetry and its features, visit the [official Poetry documentation](https://python-poetry.org/docs/).

## Quick recap from Lecture 5 and previous lectures

Until now, we **focused on the ATE** in experimental or nonrandomized observational settings, using **global information!**  

For this, **modeling (including ML) helps:** we can adjust for pre-treatment covariates, balance with weights, or do both. These approaches help because they **exploit covariate information to approximate missing counterfactuals** and to implement the assumption of conditional ignorability when it’s plausible.  

In **experimental settings,** ignorability holds by design; covariate adjustment mainly improves precision (and can mitigate sample imbalance and noncompliance), while weighting rebalances covariates to address observed selection bias when randomization fails (if we have overlap). Doubly robust estimators combine outcome and treatment models and are consistent if either is correctly specified.  

In **observational settings,** identification relies on **conditional ignorability** given selected pre-treatment covariates (plus overlap and SUTVA). Models and weights (or both), if they are correctly specified, can help replace the assumption of independence with a less stringent conditional version, yielding a consistent estimate of the ATE.  

**With HTE, this is harder:** we need local information, but **local information is scarce!**   

We want to estimate **causal effects that vary across units**. Problem: we cannot make causal claims about specific individuals with certainty!  

When estimating HTE, we are interested in the **ITE or CATE conditional on groups defined by X**. We cannot estimate the ITE, so the CATE is a compromise. However, as X-defined groups get finer, we can lose overlap locally: we no longer have enough treated and control units in each X-defined group &rarr; **curse of dimensionality**.
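
The loss of local overlap can be illustrated with a small sketch. It assumes a hypothetical setup that is not part of the lecture's simulation: binary covariates, a randomized treatment with probability 0.5, and a fixed sample size. The function name and all parameter values are illustrative.

```python
import numpy as np

def share_of_cells_with_overlap(n_covariates, n=2000, seed=0):
    """Share of non-empty X-defined cells that contain both treated and control units."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, n_covariates))  # hypothetical binary covariates
    A = rng.integers(0, 2, size=n)                  # randomized treatment, p = 0.5
    # Encode each covariate profile as a single integer cell id.
    cell = X @ (2 ** np.arange(n_covariates))
    n_cells = 2 ** n_covariates
    total = np.bincount(cell, minlength=n_cells)               # units per cell
    treated = np.bincount(cell, weights=A, minlength=n_cells)  # treated units per cell
    nonempty = total > 0
    both_arms = (treated > 0) & (treated < total)
    return both_arms.sum() / nonempty.sum()

# Finer groups (more covariates) leave fewer cells with both arms represented.
for d in (2, 6, 10, 14):
    print(d, round(share_of_cells_with_overlap(d), 3))
```

Even under perfect randomization, the share of cells where a within-cell comparison is possible shrinks quickly as the number of covariates grows while the sample size stays fixed.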

Because **we don't have an observable target** (the counterfactual), we have no natural loss function and no validation set. On top of that, the true HTE signal is typically small relative to the noise, so naïve ML approaches like "T-learners" or "S-learners" are prone to high variance/overfitting. They often mistake prognostic signal (features that predict baseline risk) and treatment assignment signal (propensity) for causal moderation (features that change how much the treatment effect varies), so you might highlight **spurious heterogeneity**.

The challenge is to **recover CATEs without chasing noise** and to report uncertainty that’s credible. To do so, we need methods that build a valid loss with orthogonal/pseudo-outcomes and cross-fitting, like **causal forests, X-learners, and DR-learners**, so we can target the CATE τ(x) rather than modeling the outcome Y alone.  
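
The idea of a pseudo-outcome with cross-fitting can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's implementation: the data generating process is hypothetical, ordinary least squares stands in for the ML base learners, and the propensity model is just the training-fold treatment share (reasonable only because treatment is randomized here).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 0.5, size=n)             # randomized treatment (assumption)
tau = 1.0 + X[:, 0]                          # hypothetical true CATE
Y = X[:, 1] + tau * A + rng.normal(size=n)

def design(x):
    """Add an intercept column."""
    return np.column_stack([np.ones(len(x)), x])

def ols(x, y):
    """Least-squares coefficients; a stand-in for any ML regressor."""
    return np.linalg.lstsq(design(x), y, rcond=None)[0]

folds = rng.permutation(n) % 2               # two-fold cross-fitting
phi = np.empty(n)                            # doubly robust pseudo-outcome
for k in (0, 1):
    train, test = folds != k, folds == k
    # Nuisance models are fit on the training fold only.
    b1 = ols(X[train & (A == 1)], Y[train & (A == 1)])
    b0 = ols(X[train & (A == 0)], Y[train & (A == 0)])
    e_hat = A[train].mean()                  # constant propensity estimate
    mu1, mu0 = design(X[test]) @ b1, design(X[test]) @ b0
    a, y = A[test], Y[test]
    phi[test] = (mu1 - mu0
                 + a * (y - mu1) / e_hat
                 - (1 - a) * (y - mu0) / (1 - e_hat))

# Second stage: regress the pseudo-outcome on X to target tau(x) directly.
coef = ols(X, phi)
tau_hat = design(X) @ coef
rmse = np.sqrt(np.mean((tau_hat - tau) ** 2))
print("second-stage coefficients:", coef)
print("RMSE against the true CATE:", rmse)
```

The second-stage regression has the pseudo-outcome as its target, which gives a usable loss for τ(x) even though the individual effect is never observed.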

We are going to show some of them, and how they can help!



## Simulation Plan

### Aims

Show, via a fast Monte Carlo, that:  
1) **Naïve ML for HTE (S/T-learners)** mostly learns prognosis or amplifies noise.  
2) **Causal Forests** (Wager–Athey, 2018) recover heterogeneity and enable inference when overlap is decent.  
3) **Meta/DR learners** (X-learner; DR-learner à la Kennedy, 2020) explicitly target the effect and outperform S/T, especially with imbalance.  


### Data generating mechanism

We will use a **DGP** ...  

The data generating process is defined as follows:  

$$some function$$  

**Key insight**:  

### Estimand

- **CATE:** `tau(x)` evaluated on the test set (truth known in simulation).  
- **ATE:** `E[tau(X)]` (check).  
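
As a sketch of how the two estimands relate in a simulation, the snippet below uses a hypothetical `tau` function and a hypothetical randomized design. It is not the lecture's data generating process, and all names and values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def tau(x):
    """Hypothetical true CATE, used only for illustration."""
    return 1.0 + 0.5 * x[:, 0]

n_test = 100_000
X_test = rng.normal(size=(n_test, 3))
A = rng.binomial(1, 0.5, size=n_test)              # randomized treatment
Y = X_test[:, 1] + tau(X_test) * A + rng.normal(size=n_test)

tau_true = tau(X_test)        # CATE estimand, evaluated unit by unit on the test set
ate_true = tau_true.mean()    # ATE as the sample analogue of E[tau(X)]

# Check: under randomization the difference in means should be close to E[tau(X)].
ate_dim = Y[A == 1].mean() - Y[A == 0].mean()
print("E[tau(X)]:", ate_true)
print("difference in means:", ate_dim)
```

Because the truth is known in a simulation, `tau_true` can later serve as the benchmark against which each estimator's predictions are compared.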

### Methods

**Baselines (to fail)**  
- **S-learner:** one outcome model `m(a,x)` with `A` as a feature; `tau_hat(x) = m(1,x) − m(0,x)`.  
- **T-learner:** separate outcome models by arm; subtract predictions.
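
The two baselines can be sketched in a few lines. The example assumes a hypothetical setting with a constant true effect and uses ordinary least squares as a stand-in base learner (the lecture's base learners may differ), so any variation in the estimated effects is noise rather than real heterogeneity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 20                                  # hypothetical sizes
X = rng.normal(size=(n, p))
A = rng.integers(0, 2, size=n)                  # randomized treatment
Y = 2.0 * X[:, 0] + 1.0 * A + rng.normal(size=n)   # true tau(x) = 1 for everyone

def fit_ols(features, y):
    """Least-squares fit with an intercept; stand-in for any regressor."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def predict_ols(beta, features):
    return np.column_stack([np.ones(len(features)), features]) @ beta

# S-learner: one model m(a, x) with A as an extra feature.
beta_s = fit_ols(np.column_stack([A, X]), Y)
tau_s = (predict_ols(beta_s, np.column_stack([np.ones(n), X]))
         - predict_ols(beta_s, np.column_stack([np.zeros(n), X])))

# T-learner: separate models by arm, then subtract predictions.
beta_1 = fit_ols(X[A == 1], Y[A == 1])
beta_0 = fit_ols(X[A == 0], Y[A == 0])
tau_t = predict_ols(beta_1, X) - predict_ols(beta_0, X)

print("S-learner: mean, sd =", tau_s.mean(), tau_s.std())
print("T-learner: mean, sd =", tau_t.mean(), tau_t.std())
```

With this linear base learner and no interaction terms, the S-learner can only return a constant effect, while the T-learner reports a spread of effects even though the true effect does not vary. Flexible base learners behave differently in detail, but both failure modes come from the same source: neither approach targets τ(x) directly.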

**Solutions**  
- **Causal Forest** (Wager–Athey, 2018).  
- **X-learner** (Künzel et al., 2019) with RF base.  
- **DR-learner** (Kennedy, 2020).  
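
As one illustration of a learner that targets the effect, the X-learner steps can be sketched as below. The sketch assumes a hypothetical imbalanced design with a known propensity of 0.1 and uses ordinary least squares in place of the random forest base learner mentioned above, so it shows the structure of the method rather than the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
e = 0.1                                        # known propensity: few treated units
X = rng.normal(size=(n, 2))
A = rng.binomial(1, e, size=n)
tau = 1.0 + X[:, 0]                            # hypothetical true CATE
Y = X[:, 1] + tau * A + rng.normal(size=n)

def design(x):
    """Add an intercept column."""
    return np.column_stack([np.ones(len(x)), x])

def ols(x, y):
    """Least-squares coefficients; stand-in for an RF base learner."""
    return np.linalg.lstsq(design(x), y, rcond=None)[0]

X1, Y1 = X[A == 1], Y[A == 1]
X0, Y0 = X[A == 0], Y[A == 0]

# Stage 1: outcome models by arm.
b1, b0 = ols(X1, Y1), ols(X0, Y0)

# Stage 2: imputed individual effects, using the other arm's model.
D1 = Y1 - design(X1) @ b0                      # treated: observed minus imputed control
D0 = design(X0) @ b1 - Y0                      # control: imputed treated minus observed

# Stage 3: model the imputed effects in each arm.
t1, t0 = ols(X1, D1), ols(X0, D0)

# Stage 4: combine, weighting by the propensity score.
tau_hat = e * (design(X) @ t0) + (1 - e) * (design(X) @ t1)
rmse = np.sqrt(np.mean((tau_hat - tau) ** 2))
print("RMSE against the true CATE:", rmse)
```

The propensity weighting puts most weight on the effect model fit in the treated arm, whose imputed effects rely on the control-arm outcome model estimated from the larger group. This is the intuition for why the X-learner tends to help under imbalance.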

### Performance

?
