III Effect Heterogeneity and Personalization

# 6 Effect Heterogeneity 

Which "unit" should you treat when treatment affects different people differently? 

## Why Prediction Is Not the Answer 

**Definition:** The <u>individual treatment effect</u> (ITE) of individual $i$ is $\frac{\Delta Y_i}{\Delta T}$, the change in that individual's outcome per unit change in treatment. 


We want to group individuals by their  ITE, but this quantity is not observable for an individual since it is defined in terms of counterfactuals. Thus it is not possible to have labeled data with ITE for each individual.

To estimate the ITE we assume that units with the same $X$ values have the same ITE; we plot $Y$ vs $T$ with a bunch of data points colored by their $X$ values and find a different slope for each group of $X$ values. 

**Definition:** The <u>conditional average treatment effect</u> (CATE) for $\{i|X_i =x\}$ is $\frac{\partial }{\partial t} E[Y|X=x]$. 

<img src="images/slope_segmentation.png" width="300">

## CATE with Regression

Instead of a model of the form 
$$
y_i = β_0 + β_1t_i + β_2X_i + e_i
$$
we will use a model (with an interaction term) of the form
$$
y_i = β_0 + β_1t_i + β_2X_i + β_3t_iX_i + e_i
$$
so that 
$$
\frac{\partial y_i}{\partial t} = \beta_1 +\beta_3 X_i
$$
Thus,

- $\hat{y}(x)$ is the model for the outcome.
- $\widehat{\frac{\partial y}{\partial t}}(x)$ is the model for the CATE. 
- $\beta_3$ is the rate of change of the effect with $X$. 

e.g. Let the treatment $t$ be the discount and the outcome $S$ be sales. We seek
$$
\frac{\partial }{\partial t} E[S|X=x]
$$

We will assume a randomized treatment $S_0,S_1 \perp T|X$.

In patsy formula syntax, the \* operator expands into main effects plus their interaction:
e.g. a\*b will include the terms $a$, $b$ and $a b$ in your regression. 

```python
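import statsmodels.formula.api as smf

# `data` is assumed to be a DataFrame holding sales, discounts, and the columns named in X.
X = ["C(month)", "C(weekday)", "is_holiday", "competitors_price"]
# discounts*(...) expands to discounts, every covariate, and every discounts:covariate term.
regr_cate = smf.ols(f"sales ~ discounts*({'+'.join(X)})", data=data).fit()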
```

If you only want the multiplicative term, you can use the : operator inside the formula.

You can get the CATE function $\widehat{\frac{\partial y}{\partial t}} = \hat{\beta}_1 + \hat{\beta}_3 X$ in any of three ways:
1. read the coefficients $\hat{\beta}_1,~ \hat{\beta}_3$ off the fit;
2. apply the definition of the derivative numerically, $\frac{\hat{y}(t+\epsilon) - \hat{y}(t)}{\epsilon}$ with $\epsilon \to 0$;
3. use the linearity of $\hat{y}$ in $t$, which gives exactly
$\widehat{\frac{\partial y}{\partial t}} = \hat{y}(t+1) - \hat{y}(t)$.
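
The following is a minimal sketch of the coefficient route and the prediction-difference route, not the book's code. It assumes synthetic data with a single covariate `x`, a made-up data-generating process, and plain `numpy` least squares standing in for the `statsmodels` fit; all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000
x = rng.normal(size=n)            # a single covariate
t = rng.uniform(0, 5, size=n)     # randomized continuous treatment
# Hypothetical data-generating process: the true effect of t is 1 + 2 * x.
y = 3 + (1 + 2 * x) * t + 0.5 * x + rng.normal(scale=0.1, size=n)


def design(t, x):
    """Design matrix for y = b0 + b1*t + b2*x + b3*t*x."""
    return np.column_stack([np.ones_like(t), t, x, t * x])


beta, *_ = np.linalg.lstsq(design(t, x), y, rcond=None)

# Route 1: read the CATE off the coefficients.
cate_from_coefs = beta[1] + beta[3] * x

# Route 3: difference of predictions at t + 1 and at t (exact for a model linear in t).
cate_from_preds = design(t + 1, x) @ beta - design(t, x) @ beta
```

Both arrays agree up to floating point error. With a fitted formula model such as `regr_cate`, the third route would amount to predicting on a copy of the data with the treatment column increased by one and subtracting the original predictions.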




**Definition:** In <u>price discrimination</u> firms segment consumers by willingness to pay and charge more to those who are willing to pay more.

e.g. half-price entry tickets for students. In this case, the company knows that students make less money on average, meaning they have less to spend.

**Definition:** In <u>intertemporal price discrimination</u> a company distinguishes the price sensitivity of customers based on time. 

For example, airline tickets bought just before the flight cost much more than those bought a few months early. 

## Effect by Model Quantile 

It would be very useful if you could somehow order units by ITE. We can order them by CATE and hope that this gives a similar ordering. 

1. From the predicted $\widehat{\text{CATE}}(x_i)$, calculate quantile intervals $Q_k$.
2. Label each unit $i$ by the midpoint $m_k$ of the quantile interval $Q_k$ that its $\widehat{\text{CATE}}(x_i)$ falls in.
3. You now have groups $G_k = \{i|\widehat{\text{CATE}}(x_i) \in Q_k\}$, one for each quantile, labeled by $m_k$.
4. For each quantile group $G_k$ calculate $\hat{\beta}_{1,k}$ from the regression $y=\beta_0+\beta_1 t$ over the points in $G_k$
    - that is, $\hat{\beta}_{1,k} = \frac{\sum_{i\in G_k}(t_i - \bar{t}_k) y_i}{\sum_{i\in G_k}(t_i - \bar{t}_k)^2}$, where $\bar{t}_k$ is the mean treatment within $G_k$.
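
The steps above can be sketched as follows. This is an illustration under assumptions, not the book's implementation: the data are synthetic, `cate_pred` is a stand-in for the predictions of a fitted CATE model, and the function names are made up for this example.

```python
import numpy as np


def slope(t, y):
    """OLS slope of y on t: sum((t - tbar) * y) / sum((t - tbar)**2)."""
    t_centered = t - t.mean()
    return np.sum(t_centered * y) / np.sum(t_centered ** 2)


def effect_by_quantile(cate_pred, t, y, q=5):
    """Slope of y on t within each of q quantile groups of cate_pred."""
    # Interior quantile edges split the predictions into q roughly equal groups.
    edges = np.quantile(cate_pred, np.linspace(0, 1, q + 1)[1:-1])
    group = np.digitize(cate_pred, edges)  # group index 0 .. q-1
    return np.array([slope(t[group == k], y[group == k]) for k in range(q)])


rng = np.random.default_rng(1)
n = 10_000
x = rng.uniform(0, 1, size=n)
t = rng.uniform(0, 5, size=n)
y = 2 * x * t + rng.normal(size=n)   # hypothetical: the true effect is 2 * x
cate_pred = 2 * x                    # stand-in for a fitted model's CATE predictions

effects = effect_by_quantile(cate_pred, t, y, q=5)
```

If the predictions order the units well, `effects` should increase from the lowest quantile group to the highest, which is the staircase pattern in the bar plot.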


<img src="images/CATE_quantile_vs_id.png" width="300"> <img src="images/CATE_quantile_bar.png" width="300">

--- 
Me: This seems redundant. 
1. We calculate CATE for each x value. 
2. We partition by CATE quantile.
3. we calculate CATE for each quantile. 

There are two reasons I can see for doing this; 
1. the set of units in a quantile group $G_k$ is going to be bigger than the set of units in a group for a single value of $X$. So, fitting on the bigger group gives the ATE on the larger group in a natural way in the sense of regression, as opposed to an artificial weighted average of CATEs over $X$ values. 
2. A sanity check of some kind... on 

- the $y=\beta_0 + \beta_1t+\beta_2tX$ model with derivative $\frac{ \partial y}{\partial t} = \beta_1+\beta_2 X$, which you might not believe in so we compare it to 
- the $y=\beta_0+\tau t$ model, which we believe from earlier on one quantile.

At any rate, when ordering by CATE quantile and then plotting the per-quantile effect, we expect nearly $y=x$ behavior. 

--- 

## Cumulative Effect

In the previous section 
- we formed groups $G_k$ from quantiles of the predicted CATE, $\widehat{\frac{ \partial y}{\partial t} }(x)= \hat{\beta}_1+\hat{\beta}_3 x$ 
- we fit $y=\beta_0+\beta_1 t$ over the points in $G_k$ to obtain the ATE $\hat{\beta}_{1,k}$ of group $G_k$.

Here, 
- accumulate one group on top of the other to form the sequence $H_k = \underset{k'\leq k}{\cup} G_{k'}$. 
- Fit $y=\delta_0+\delta_1 t$ over the points in $H_k$ to obtain the ATE $\hat{\delta}_{1,k}$ of the cumulative group $H_k$.
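
A minimal sketch of the accumulation, assuming units are sorted with the highest predicted CATE first and using synthetic data. The names `cumulative_effect_curve`, `steps`, and `min_rows` are illustrative choices, not taken from a library.

```python
import numpy as np


def slope(t, y):
    """OLS slope of y on t."""
    t_centered = t - t.mean()
    return np.sum(t_centered * y) / np.sum(t_centered ** 2)


def cumulative_effect_curve(cate_pred, t, y, steps=20, min_rows=30):
    """Slope of y on t over the top-k units by predicted CATE, for growing k."""
    order = np.argsort(-cate_pred)            # highest predicted CATE first
    t_sorted, y_sorted = t[order], y[order]
    # Group sizes grow from min_rows up to the full sample.
    sizes = np.unique(np.linspace(min_rows, len(y), steps).astype(int))
    effects = np.array([slope(t_sorted[:k], y_sorted[:k]) for k in sizes])
    return sizes, effects


rng = np.random.default_rng(2)
n = 10_000
x = rng.uniform(0, 1, size=n)
t = rng.uniform(0, 5, size=n)
y = 2 * x * t + rng.normal(size=n)   # hypothetical: the true effect is 2 * x
cate_pred = 2 * x                    # stand-in for a fitted model's predictions

sizes, effects = cumulative_effect_curve(cate_pred, t, y, min_rows=200)
```

The last point of the curve uses every unit, so it equals the slope over the whole sample; the first points use few units and are the noisiest.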


<img src="images/cumulative_effect_curve.png" width="300"> 

The grouping criterion is up to you; here we group by the top predicted ITE, but you might decide to order or group by some other criterion. (An example would be great, but alas.) If your grouping does a good job of sorting by CATE, the resulting curve starts above the average effect and gradually approaches it as more units are included.

The goal is to find groups that have an ATE well above the average ATE. Thus, you might want to take the area between the cumulative effect curve and the average ATE over the whole population.

The groups on the left are very small and thus have very high variance. 

There is a solution to this: weight the cumulative effect by the cumulative population fraction to obtain the cumulative gain.

## Cumulative Gain

Let $N_{\text{cum} , k } = | H_k |$. 

Multiply each point in the cumulative effect curve by $\frac{N_{\text{cum},k}}{N}$. 

<img src="images/cumulative_gain.png" width="300">

Subtracting the average CATE (that is, the ATE) before applying this factor gives the "normalized" cumulative gain curve, in which the area of interest is more readily seen. 
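
As a rough sketch of the weighting and normalization just described, under the same assumptions as before (synthetic data, highest predicted CATE first, illustrative names):

```python
import numpy as np


def slope(t, y):
    """OLS slope of y on t."""
    t_centered = t - t.mean()
    return np.sum(t_centered * y) / np.sum(t_centered ** 2)


def cumulative_gain_curve(cate_pred, t, y, steps=20, min_rows=30, normalize=True):
    """Cumulative effect weighted by the cumulative population fraction."""
    order = np.argsort(-cate_pred)            # highest predicted CATE first
    t_sorted, y_sorted = t[order], y[order]
    n = len(y)
    sizes = np.unique(np.linspace(min_rows, n, steps).astype(int))
    effects = np.array([slope(t_sorted[:k], y_sorted[:k]) for k in sizes])
    baseline = slope(t, y) if normalize else 0.0   # subtract the ATE if normalizing
    fraction = sizes / n
    return fraction, (effects - baseline) * fraction


rng = np.random.default_rng(2)
n = 10_000
x = rng.uniform(0, 1, size=n)
t = rng.uniform(0, 5, size=n)
y = 2 * x * t + rng.normal(size=n)   # hypothetical: the true effect is 2 * x
cate_pred = 2 * x                    # stand-in for a fitted model's predictions

frac, gain = cumulative_gain_curve(cate_pred, t, y, min_rows=200)
# Trapezoid-rule area under the normalized curve, a single-number ordering score.
area = np.sum(np.diff(frac) * (gain[1:] + gain[:-1]) / 2)
```

The normalized curve ends at zero because the last group is the whole population. A larger area suggests a better ordering; a random ordering should give an area near zero.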

The area between is the same for both curves, and can be thought of as an indicator of a good ordering of the units; ordering so that we can see the level of effect above random. 

Evaluation of causal models is an area of research that is still developing and has many blind spots. E.g., how do we check for correct CATE prediction instead of this check for a good ordering? We have no analog of $R^2$ or MSE. 

One thing we do have is Target Transformation:


## Target Transformation

The idea is to use an unbiased estimator of $\tau$ as a stand-in target, so that an MSE can be estimated.
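
One concrete instance, offered as a sketch rather than as the method these notes had in mind: for a binary treatment assigned at random with known probability $e$, the transformed target $Y^* = Y\frac{T-e}{e(1-e)}$ satisfies $E[Y^*|X] = \tau(X)$. The data-generating process and all names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.uniform(0, 1, size=n)
e = 0.5                                   # known treatment probability (assumed)
t = rng.binomial(1, e, size=n)
tau = 2 * x                               # hypothetical true CATE
y = 1 + tau * t + rng.normal(size=n)

# Transformed target: noisy for each unit, but unbiased for tau given x.
y_star = y * (t - e) / (e * (1 - e))


def transformed_mse(cate_pred):
    """MSE of CATE predictions measured against the transformed target."""
    return np.mean((y_star - cate_pred) ** 2)


mse_good = transformed_mse(2 * x)           # predictions equal to the true CATE
mse_bad = transformed_mse(np.full(n, 1.0))  # constant prediction at the ATE
```

Because $Y^*$ is very noisy, the absolute MSE values are dominated by that noise; only the comparison between models is informative, and it needs a large sample.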

## Binary Outcomes

$E[ Y |T=t]$ vs $t$ has an S shape, possibly heavy in the top or bottom. 

<img src="images/binary_outcome_shapes.png" width="500">

For us, likely, most customers will be non-switchers, so we will be bottom heavy.

Intuitively, the closer to the middle of the S shape (the tipping point, $E[Y|T=t]^{-1}(0.50)$) the higher the effect will be since the slope is largest there.
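
To make the intuition concrete, here is a small sketch that assumes the S shape is a logistic curve, $E[Y|T=t] = \sigma(a + bt)$, with made-up parameters `a` and `b`. In that case the slope is $b\,p(1-p)$, which peaks where $p = 0.5$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical logistic response: P(Y=1 | T=t) = sigmoid(a + b * t).
a, b = -4.0, 1.0
t = np.linspace(0, 8, 801)
p = sigmoid(a + b * t)
slope = b * p * (1 - p)          # derivative of sigmoid(a + b * t) with respect to t

t_steepest = t[np.argmax(slope)]  # the tipping point
p_steepest = p[np.argmax(slope)]  # outcome probability at the tipping point
```

Here the steepest point sits at $t = -a/b = 4$, where the outcome probability is one half; far from it the slope, and hence the effect, shrinks toward zero.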



It <a href="https://dl.acm.org/doi/abs/10.5555/3586589.3586648">turns out</a> that these models can do very well in some circumstances. 

## CATE for Decision Making



# 7 Metalearners 

Metalearners are an effortless way to leverage off-the-shelf predictive machine learning algorithms for approximating treatment effects.

All predictive models, such as linear regression, boosted decision trees, neural networks, or Gaussian processes, can be repurposed for causal inference using the approaches described in this chapter.

At the time of this writing (2023), all the causal inference packages for Python are at an early stage. Their metalearners are slow because much of the work runs in pure Python, so they are not appropriate for large data applications. For that reason, it is not a good idea for a book like this to promote a causal inference Python package at this time.

We will build our metalearners from the ground up. 

## T-Learner 

The T is for two as in the binary treatment case. There are two models to create, $\{ \mu_t(x) | t\in\{1,0\}\} $.  

$$
\mu_0 (x)=E[Y|T=0, X=x]\\
\mu_1 (x)=E[Y|T=1, X=x]
$$

and of course
$$
\hat{\tau}(x) = 
\hat{\mu}_1(x) -\hat{\mu}_0(x)
$$

Me: I'm a little annoyed that we do not call these $\hat{Y}_0(x)$ and $\hat{Y}_1(x)$. I really think that conflating expectation values with models is getting something wrong. 

For example, you can use boosted tree regression for both. However, tree regressors self-regularize: when there are fewer points to train on, there are fewer pieces in the piecewise function. This means that when the treated and untreated groups are of very different sizes, say 25 treated and 10,000 untreated, one of these functions will be able to pick up nonlinearities that the other can't, and in those regions of $x$ values there is bias in $\hat{\tau}(x)$. The X-learner helps with this.

Generalizing beyond the binary treatment case, you can build T learners for T categorical with any number of categories. 
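
A from-scratch sketch of the binary T-learner. To stay dependency-free it uses ordinary least squares as the base learner, which is a stand-in for whatever regressor you prefer (such as the boosted trees mentioned above); the helper `fit_ols`, the function `t_learner`, and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def t_learner(X, t, y, fit=fit_ols):
    """Fit one outcome model per treatment arm; return a CATE predictor."""
    mu0 = fit(X[t == 0], y[t == 0])   # trained on untreated units only
    mu1 = fit(X[t == 1], y[t == 1])   # trained on treated units only
    return lambda X_new: mu1(X_new) - mu0(X_new)


rng = np.random.default_rng(4)
n = 5_000
X = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.5, size=n)
tau = 1 + X[:, 0]                     # hypothetical true CATE
y = X[:, 1] + tau * t + rng.normal(scale=0.5, size=n)

cate_hat = t_learner(X, t, y)(X)
```

Any object that can be fit on `(X, y)` and then predict on new `X` could replace `fit_ols`; the structure of the learner does not change.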

## X-Learner

This is hard to understand, expect to need a re-read. 

In the case where one treatment group is much larger than the other X-learners do much better than T-learners. This is because they correct for inaccurate imputed potential values in the under-represented treatment group. 

An X-learner has two stages, plus a propensity score model and its own final model. 

1. The first stage is a T-learner, $\mu_0,\mu_1$. 
    - use these to impute all missing potential outcome  values $$\hat{\mu}_1(X,T=0)\\
    \hat{\mu}_0(X,T=1)$$
    - from the imputed potential outcomes calculate TUT and TT estimates
$$
\hat{\tau}(X,T=0) := \widehat{\text{TUT}} =  \hat{\mu}_1(X,T=0) - Y_{T=0}\\
\hat{\tau}(X,T=1) := \widehat{\text{TT}} = Y_{T=1} - \hat{\mu}_0(X,T=1) 
$$
2. The second stage
    - create two more ML models, one for TUT and one for TT, and train them to predict the $\hat{\tau}$s above. We call these the second stage models. 
$$
\hat{\mu}_{\tau 0}(X) \approx \text{TUT} = E[\hat{\tau}(X) | T=0]\\
\hat{\mu}_{\tau 1}(X) \approx \text{TT} = E[\hat{\tau}(X) | T=1]
$$
3. The propensity score model: $\hat{e}(x) \approx  P[T=1|X=x]$.
4. The X-learner model $\hat{\tau}(x)$ is a propensity score weighted combination of the second stage models:
$$
\hat{\tau}(x) = \hat{\mu}_{\tau 0}(x)\hat{e}(x) + \hat{\mu}_{\tau 1}(x) (1-\hat{e} (x))
$$
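
The four steps can be sketched as below. Two simplifications are assumed and should be swapped out in real use: ordinary least squares stands in for the ML models, and the propensity score is taken to be constant (the treated fraction), which is only reasonable when treatment assignment does not depend on $X$. The names and synthetic data are illustrative.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def x_learner(X, t, y, fit=fit_ols):
    # Stage 1: a T-learner.
    mu0 = fit(X[t == 0], y[t == 0])
    mu1 = fit(X[t == 1], y[t == 1])

    # Imputed effects: observed outcome versus imputed counterfactual.
    tau_untreated = mu1(X[t == 0]) - y[t == 0]   # targets for the TUT model
    tau_treated = y[t == 1] - mu0(X[t == 1])     # targets for the TT model

    # Stage 2: model the imputed effects as functions of X.
    mu_tau0 = fit(X[t == 0], tau_untreated)
    mu_tau1 = fit(X[t == 1], tau_treated)

    # Propensity score. ASSUMPTION: treatment probability does not depend on X,
    # so the treated fraction is used; in general, fit a classifier here.
    e_hat = t.mean()

    return lambda X_new: e_hat * mu_tau0(X_new) + (1 - e_hat) * mu_tau1(X_new)


rng = np.random.default_rng(5)
n = 20_000
X = rng.normal(size=(n, 2))
t = rng.binomial(1, 0.05, size=n)     # only about 5% of units are treated
tau = 1 + X[:, 0]                     # hypothetical true CATE
y = X[:, 1] + tau * t + rng.normal(scale=0.5, size=n)

cate_hat = x_learner(X, t, y)(X)
```

With so few treated units, `e_hat` is small, so nearly all of the weight goes to `mu_tau1`, the model built from the well-estimated `mu0`.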

Why do we need all this? 

In our example, 
- $\hat{\mu}_1$ was fitted on a very small sample and so imputes inaccurately. 
    - Therefore $\hat{\tau}( X, T = 0) \sim TUT$ is inaccurate. 
- $\hat{\mu}_0$ was fitted on a large sample and so is accurate.
    - Therefore $\hat{\tau}(X,T=1) \sim TT $ is accurate. 

We want to combine the $\hat{\tau}$s in a way that gives more weight to the accurate model. The propensity score model helps sort out where the two models are accurate.

Thinking through it:
- In places where the probability of treatment is low, the TT model provides most of the estimate.
- In places where the probability of treatment is high, the TUT model provides most of the estimate.

In our example there were very few treated units, so the probability of treatment is low for all $x$, and so 
- TT provides the majority of the estimate, and 
- TUT plays a minor role. 

## Metalearners for Continuous Treatments 
The thing about continuous treatments is
- there are infinitely many values $t$ for $T$,
- for each $i$ only one outcome $Y(t)$ can be observed,
- the treatment effect is based on those unobservables.

So, the goal is to impute all the unobserved outcomes for each $i$. 

### S-Learner

This is the simplest learner there is. 

A single (hence the S) machine learning model $\hat{\mu}_s$ is trained on $X$ and $T$ to estimate $Y$, giving the estimate
$$
\hat{\mu}_s(x,t) \approx E[Y|T=t,X=x]
$$
whose predictions are counterfactual potential outcomes almost everywhere, since each unit is observed at only one value of $t$.

After you train the model, putting these counterfactual predictions into a DataFrame requires a bit of tabular data trickery:
- Partition the continuum of $T$ values into $n$ parts $t_k$. 
- For each row in the original df
    - replicate the row once for each part.
- Add a new column and fill it with the prediction values $\hat{\mu}_s(x,t_k)$.

For a fixed $i$ you can then plot the $n$ points of $\hat{Y}_i(t_k)$.
You can also calculate the regression slope through those points, the estimated ITE for each $i$, if you want. 
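
A sketch of the replicate-and-predict trick, using plain `numpy` arrays in place of a DataFrame. The single model here is least squares on hand-built features that include an `x * t` column so the effect can vary with `x`; that feature choice, the synthetic data, and the names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3_000
x = rng.uniform(0, 1, size=n)
t = rng.uniform(0, 5, size=n)
y = 1 + 2 * x * t + rng.normal(scale=0.5, size=n)   # hypothetical: effect is 2 * x


def features(x, t):
    # One model sees both X and T; the x * t column lets the effect vary with x.
    return np.column_stack([np.ones_like(x), x, t, x * t])


coef, *_ = np.linalg.lstsq(features(x, t), y, rcond=None)

# Replicate every unit once per grid value of t, then predict.
t_grid = np.linspace(t.min(), t.max(), 10)
x_rep = np.repeat(x, len(t_grid))        # x_0, x_0, ..., x_1, x_1, ...
t_rep = np.tile(t_grid, n)               # t_0, ..., t_9, t_0, ..., t_9, ...
y_pred = (features(x_rep, t_rep) @ coef).reshape(n, len(t_grid))

# Per-unit regression slope of the predicted outcome on t.
t_centered = t_grid - t_grid.mean()
unit_effect = (y_pred * t_centered).sum(axis=1) / (t_centered ** 2).sum()
```

Row `i` of `y_pred` is the predicted outcome curve for unit `i` over the grid, and `unit_effect[i]` is its slope, which can then be fed into the ordering and gain curves from chapter 6.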

Now you can order units by predicted effect quantile, average their estimated ITE over their quantile for an estimated ATE(>p), make a normalized cumulative gain curve, and use the area under it as a metric of how well you put the units in order (as in chapter 6).

S-learner is a good first bet for any causal problem:
- it tends to perform OK, even if it doesn't have randomized data to train on;
- it supports binary, discrete numerical, and continuous treatments.

Downside: 
- the S-learner tends to bias the treatment effect toward zero (as opposed to up or down), because model regularization can restrict the estimated treatment effect. 

<img src="images/S_learner_bias.png" width="300">

A way around this, proposed in the same paper by Chernozhukov et al., is Double/Debiased Machine Learning, or the R-learner.

## Double/Debiased Machine Learning

aka the R-learner..... I'll revisit if needed.