# Session 6:
## Machine learning for causal inference


## Agenda

1. [A general problem](#A-general-problem)
1. [Inference with Lasso](#Inference-with-Lasso)
1. [Double Machine Learning](#Double-Machine-Learning)


# A general problem


## Estimate treatment effect with selection

Often we are interested in estimating the average treatment effect (ATE) of $T$ on $y$ in observational data.

- Matching has long been the de facto standard
  - (assuming we measure enough variables to make assignment conditionally random)
- Can we somehow solve the problem using machine learning?

## Estimate treatment effect with selection
[Chernozhukov et al. (2018)](http://economics.mit.edu/files/12538) assume that we have the following data generating process
\begin{align}
y=& T\theta_0+g_0(X)+U\\
T=&m_0(X) + V\\
E[U|X,T]=&0\\
E[V|X]=&0\\
\end{align}

Basic model properties:
- The outcome $y$ is confounded through an unknown nuisance function $g_0(\cdot)$
- The treatment $T$ suffers from selection on observables, where $m_0(\cdot)$ is an unknown propensity function
  - Note: we assume no selection on unobservables (only a "mild" econometric problem)
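
To make the setup concrete, here is a minimal simulation sketch of this data generating process. The functional forms chosen for $g_0$ and $m_0$, the value $\theta_0=0.5$, and the helper name `simulate_plm` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def simulate_plm(n=2000, p=5, theta0=0.5, seed=0):
    """Draw (y, T, X) from y = T*theta0 + g0(X) + U and T = m0(X) + V."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    g0 = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2   # assumed nuisance function
    m0 = 0.8 * X[:, 0] + 0.3 * X[:, 1]          # assumed treatment equation
    V = rng.normal(size=n)                      # E[V|X] = 0
    U = rng.normal(size=n)                      # E[U|X,T] = 0
    T = m0 + V
    y = T * theta0 + g0 + U
    return y, T, X

y, T, X = simulate_plm()

# Regressing y on T alone ignores g0(X), so the slope is confounded.
naive_slope = np.cov(T, y)[0, 1] / np.var(T, ddof=1)
print(f"slope of y on T without controls: {naive_slope:.3f} (true theta0 = 0.5)")
```

Because $X$ drives both $T$ and $y$ in this sketch, the uncontrolled slope lands away from $\theta_0$; this is the selection problem the rest of the session tries to solve.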

## Estimate treatment effect with selection

Multiple ways have been proposed to solve the problem:
- *Directly* modify ML to allow for estimation of treatment effect
- *Indirectly* modify estimation procedure to incorporate ML 


# Inference with Lasso

## Treatment effects in linear models 

Suppose we want to estimate a linear model parameter for the causal effect of treatment $T_i$ on $y_i$. 

\begin{equation}y_i=\alpha T_i+x_i\beta+r_{yi}+\zeta_i\end{equation}

- We follow the notation in [Belloni et al., 2015](https://doi.org/10.1257/jep.28.2.29)
- We let $r_{yi}$ be an approximation error (we don't know the functional form)


## Treatment effects in linear models (2)

How do we select the model, i.e. the subset of $x$?

- Classic econometrics: 
  - Use **OLS** and include covariates based on theory or inference
  - Problem: how do we drop covariates systematically? Should we adjust for multiple hypothesis testing?
  
- Machine learning:
  - Use **LASSO** to perform covariate selection
  - Note: the estimates are biased towards zero!
    - Problem: we may omit relevant variables!
    - LASSO excludes possible confounders if they have little predictive power for $y_i$.
    - Excluded variables may still affect the estimate through $T_i$, e.g. covariates correlated with treatment.


## Fixing the LASSO

A simple solution suggested by [Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29) is to use Post-LASSO to correct for the bias:
- Step 1: estimate two LASSO models
    - a) Regress $y_i$ on $x_i$ 
    - b) Regress $T_i$ on $x_i$ 
- Step 2: run OLS using only the variables that were kept in either LASSO
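
The two steps above can be sketched as follows. This is a minimal illustration on simulated data, not the authors' implementation: the sparse design, the true effect of 0.5, and the use of scikit-learn's `LassoCV` (a cross-validated penalty rather than a theory-driven plug-in penalty) are all assumptions made here.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
# Assumed sparse design: x0 and x1 drive T; x1 has only a weak direct effect on y.
T = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
y = 0.5 * T + 1.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

# Step 1: two LASSO regressions, each on x only.
sel_y = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)   # a) y on x
sel_T = np.flatnonzero(LassoCV(cv=5).fit(X, T).coef_)   # b) T on x
selected = np.union1d(sel_y, sel_T)

# Step 2: OLS of y on T and the union of the selected covariates.
Z = np.column_stack([np.ones(n), T, X[:, selected]])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
alpha_hat = coef[1]
print(f"selected {selected.size} covariates, alpha_hat = {alpha_hat:.3f}")
```

Taking the union matters: a covariate that only weakly predicts $y_i$ but strongly predicts $T_i$ is still kept as a control.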

What about inference?

- We need further assumptions on sparsity.
- See [Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) online appendix for details.


## Fixing the LASSO (2)
The LASSO can select variables even if there are more variables than observations!
- In high dimensions the Post-LASSO step can still fail: the selected set may contain more variables than OLS can estimate

# Double Machine Learning

## Linear DML

[Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29), [Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) write down the two "prediction equations":

\begin{align}
y_i=&\alpha T_i+x_i^{\prime}\theta_y+r_{yi}+\zeta_i\\
T_{i}=&x_{i}^{\prime} \theta_{t}+r_{t i}+v_{i}
\end{align}

The two equations can be combined into a single reduced-form equation (substitute $T_i$ into $y_i$):

\begin{align}y_{i}=&x_{i}^{\prime}\left(\alpha \theta_{t}+\theta_{y}\right)+\left(\alpha r_{t i}+r_{y i}\right)+\left(\alpha v_{i}+\zeta_{i}\right)\\=&x_{i}^{\prime} \pi+r_{c i}+\varepsilon_{i}\end{align}


## Linear DML (2)

[Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) state the following algorithm:

1. run the two LASSO regressions (as in Post-LASSO) and obtain the residuals
   - $\hat{\rho}_i^y$ from $y_i$ on $x_i$
   - $\hat{\rho}_i^d$ from $T_i$ on $x_i$
1. run a regression of $\hat{\rho}_i^y$ on $\hat{\rho}_i^d$; the slope is the estimate of $\alpha$

What is the intuition?

- Similar to Frisch-Waugh-Lovell, where we partial out effects.
  - We partial out the effect of $x_i$ on both $y_i$ and $T_i$ separately
- Innovation: we perform double selection of variables using LASSO
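
The partialling-out idea can be sketched in a few lines. This is an illustration under assumptions: the simulated design and true effect of 0.5 are made up, and for brevity the residuals come from plain `LassoCV` predictions rather than from a Post-LASSO refit.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.normal(size=(n, p))
T = X[:, 0] + X[:, 1] + rng.normal(size=n)               # assumed treatment equation
y = 0.5 * T + X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

# Step 1: LASSO of y on x and of T on x; keep the residuals.
rho_y = y - LassoCV(cv=5).fit(X, y).predict(X)
rho_d = T - LassoCV(cv=5).fit(X, T).predict(X)

# Step 2: regress residual on residual. Both residuals have mean zero
# in-sample (the LASSO fits include an intercept), so no constant is needed.
alpha_hat = (rho_d @ rho_y) / (rho_d @ rho_d)
print(f"alpha_hat = {alpha_hat:.3f} (true alpha = 0.5)")
```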


## Linear DML (3)

[Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29) simulate the performance of post-selection estimators:

<center><img src='beh2014_fig1.JPG' alt="Drawing" style="width: 800px;"/></center>


## Naive solution

What happens when we use a machine learning estimator to directly estimate $\theta_0$ and $g_0(\cdot)$ in $y=T\theta_0+g_0(X)+U$ to control for confounders?
- We use an auxiliary subsample $I^c$ to compute $\hat{g}_0(\cdot)$ using a possibly non-linear model.
- Assume the auxiliary subsample is half of the sample; the other half is the main sample $I$ with $n$ observations.


$$\hat{\theta}_0=\frac{\frac{1}{n}\sum_{i\in I}T_i(y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I}T_i^2}$$

## Naive solution

We decompose the estimator into scaled estimation error terms:

$$\sqrt{n}(\hat{\theta}_0-\theta_0)=
\frac{\frac{1}{\sqrt{n}}\sum_{i\in I}T_iU_i}{\frac{1}{n}\sum_{i\in I}T_i^2}+\frac{\frac{1}{\sqrt{n}}\sum_{i\in I}T_i(g_0(X_i)-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I}T_i^2}
$$

What could be problematic here?

## Naive problem

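The issue is that $\hat{g}_0$ will be systematically biased because we curb overfitting, e.g. through regularization.
- The same problem arises for tree-based and neural network models.
- The estimator will have a bias term that is not centered at zero and diverges:
  - Since $T_i=m_0(X_i)+V_i$, the sum below has mean $\sqrt{n}E[m_0(X_i)(g_0(X_i)-\hat{g}_0(X_i))]$, which grows with $n$ when $\hat{g}_0$ converges more slowly than $n^{-1/2}$.

$$\frac{\frac{1}{\sqrt{n}}\sum_{i\in I}T_i(g_0(X_i)-\hat{g}_0(X_i))}{E[T_i^2]}
$$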
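
The regularization bias can be reproduced in a small simulation. The sketch below makes several assumptions for illustration: $g_0$ and $m_0$ are linear, a closed-form ridge regression (the hypothetical helper `ridge_fit`) stands in for a generic regularized ML learner, and the penalty is deliberately heavy and untuned so that the bias is visible.

```python
import numpy as np

def ridge_fit(Z, target, lam, penalize):
    """Closed-form ridge; `penalize` flags (0/1) which columns are shrunk."""
    A = Z.T @ Z + lam * np.diag(penalize)
    return np.linalg.solve(A, Z.T @ target)

rng = np.random.default_rng(0)
N, p, theta0 = 20_000, 10, 0.5
beta = gamma = np.ones(p) / np.sqrt(p)            # assumed linear g0 and m0
X = rng.normal(size=(N, p))
T = X @ gamma + rng.normal(size=N)                # T = m0(X) + V
y = theta0 * T + X @ beta + rng.normal(size=N)    # y = T*theta0 + g0(X) + U

main, aux = np.arange(N // 2), np.arange(N // 2, N)   # I and I^c

# Fit y ~ T*theta + X*b on I^c, shrinking only b, and keep g_hat(X) = X*b_hat.
coef = ridge_fit(np.column_stack([T[aux], X[aux]]), y[aux],
                 lam=0.1 * aux.size, penalize=np.r_[0.0, np.ones(p)])
g_hat = X[main] @ coef[1:]

# Naive estimator on the main sample I.
theta_naive = T[main] @ (y[main] - g_hat) / (T[main] @ T[main])
print(f"naive estimate: {theta_naive:.3f} (true theta0 = {theta0})")
```

The shrunken $\hat{g}_0$ leaves part of $g_0(X)$ in the residual, and because $T$ is correlated with $X$ that leftover is attributed to the treatment.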


## Orthogonalization

Suppose we also estimate $\hat{m}_0(\cdot)$ on the auxiliary sample $I^c$. We can then form the following estimate:

$$\check{\theta}_0=\frac{\frac{1}{n}\sum_{i\in I}\hat{V}_i(y_i-\hat{g}_0(X_i))}{\frac{1}{n}\sum_{i\in I}\hat{V}_iT_i}$$

where we use the residualized treatment $\hat{V}_i=T_i-\hat{m}_0(X_i)$.

## Orthogonalization

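We decompose the estimator into scaled estimation error terms:

$$\sqrt{n}(\check{\theta}_0-\theta_0)\approx
\frac{\frac{1}{\sqrt{n}}\sum_{i\in I}V_iU_i}{E[V_i^2]}+\frac{\frac{1}{\sqrt{n}}\sum_{i\in I}(m_0(X_i)-\hat{m}_0(X_i))(g_0(X_i)-\hat{g}_0(X_i))}{E[V_i^2]}
$$

The remaining terms vanish in probability because $\hat{m}_0$ and $\hat{g}_0$ are estimated on the independent auxiliary sample.

This solves the problem: the bias term is now a product of two estimation errors, which vanishes as long as the product of the two convergence rates is faster than $n^{-1/2}$ (e.g. each of $\hat{m}_0$ and $\hat{g}_0$ converges faster than $n^{-1/4}$).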
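
A small simulation can compare the naive and the orthogonalized estimator side by side. As before this is a sketch under assumptions: linear $g_0$ and $m_0$, a closed-form ridge (hypothetical helper `ridge_fit`) as the regularized learner with a deliberately heavy penalty, and the denominator $\frac{1}{n}\sum_{i\in I}\hat{V}_iT_i$ as in Chernozhukov et al. (2018).

```python
import numpy as np

def ridge_fit(Z, target, lam, penalize):
    """Closed-form ridge; `penalize` flags (0/1) which columns are shrunk."""
    A = Z.T @ Z + lam * np.diag(penalize)
    return np.linalg.solve(A, Z.T @ target)

rng = np.random.default_rng(0)
N, p, theta0 = 20_000, 10, 0.5
beta = gamma = np.ones(p) / np.sqrt(p)            # assumed linear g0 and m0
X = rng.normal(size=(N, p))
T = X @ gamma + rng.normal(size=N)
y = theta0 * T + X @ beta + rng.normal(size=N)

main, aux = np.arange(N // 2), np.arange(N // 2, N)   # I and I^c
lam = 0.1 * aux.size                                  # heavy, untuned penalty

# Both nuisance functions are fitted on the auxiliary sample only.
coef = ridge_fit(np.column_stack([T[aux], X[aux]]), y[aux],
                 lam, np.r_[0.0, np.ones(p)])
g_hat = X[main] @ coef[1:]
m_hat = X[main] @ ridge_fit(X[aux], T[aux], lam, np.ones(p))
V_hat = T[main] - m_hat                               # residualized treatment

theta_naive = T[main] @ (y[main] - g_hat) / (T[main] @ T[main])
theta_check = V_hat @ (y[main] - g_hat) / (V_hat @ T[main])
print(f"naive: {theta_naive:.3f}, orthogonalized: {theta_check:.3f}, "
      f"true: {theta0}")
```

Both learners are biased here, yet the orthogonalized estimate only inherits the product of the two biases, which is much smaller than either one.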

## Orthogonalization

The first major contribution of [Chernozhukov et al. (2018)](http://economics.mit.edu/files/12538) is to show that, in general, the second (orthogonalized) debiasing procedure leads to consistent estimates and can be used to estimate average treatment effects.

- The proof depends on sample splitting - using an independent auxiliary sample for estimating $\hat{m}_0,\hat{g}_0$.

## Implementation details

Problem: what do we do with the auxiliary sample? Using it only for the nuisance functions wastes half of the data.

To regain efficiency, [Chernozhukov et al. (2018)](http://economics.mit.edu/files/12538) propose the following:
- We rotate the samples using **cross-fitting**: first use one part as the auxiliary sample, then the other. This is like cross-validation in supervised ML.
- This is the second major contribution.
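
Two-fold cross-fitting can be sketched as below. The helper names `ridge_fit` and `dml_one_split`, the linear simulated design, the ridge learner, and the choice to combine the folds by a simple average are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(Z, target, lam, penalize):
    """Closed-form ridge; `penalize` flags (0/1) which columns are shrunk."""
    A = Z.T @ Z + lam * np.diag(penalize)
    return np.linalg.solve(A, Z.T @ target)

def dml_one_split(y, T, X, main, aux, lam_ratio=0.1):
    """Orthogonalized estimate on `main`, with nuisance fits on `aux`."""
    p = X.shape[1]
    lam = lam_ratio * aux.size
    coef = ridge_fit(np.column_stack([T[aux], X[aux]]), y[aux],
                     lam, np.r_[0.0, np.ones(p)])
    g_hat = X[main] @ coef[1:]
    V_hat = T[main] - X[main] @ ridge_fit(X[aux], T[aux], lam, np.ones(p))
    return V_hat @ (y[main] - g_hat) / (V_hat @ T[main])

rng = np.random.default_rng(0)
N, p, theta0 = 20_000, 10, 0.5
beta = gamma = np.ones(p) / np.sqrt(p)            # assumed linear g0 and m0
X = rng.normal(size=(N, p))
T = X @ gamma + rng.normal(size=N)
y = theta0 * T + X @ beta + rng.normal(size=N)

# Split once at random, then let each half play both roles.
idx = rng.permutation(N)
fold_a, fold_b = idx[: N // 2], idx[N // 2:]
theta_a = dml_one_split(y, T, X, main=fold_a, aux=fold_b)
theta_b = dml_one_split(y, T, X, main=fold_b, aux=fold_a)
theta_cf = 0.5 * (theta_a + theta_b)              # average over the two folds
print(f"fold estimates: {theta_a:.3f}, {theta_b:.3f}; cross-fitted: {theta_cf:.3f}")
```

Every observation is used once for the final moment and once for fitting the nuisance functions, so no data is discarded.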

## Implementation details

How do we estimate $\hat{m}_0,\hat{g}_0$? This can be done using cross-validation on the auxiliary sample $I^c$. Available estimators:

- linear/logistic models, including regularized ones
- tree-based models, including random forests and boosted trees
- neural networks
- kernel models, including support vector machines
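
Swapping learners is straightforward with scikit-learn. The sketch below is an illustration under assumptions: the simulated design is made up, and it uses the partialling-out form (residualize both $y$ and $T$ on $X$, then regress residual on residual) rather than the exact estimator written above. `cross_val_predict` with `cv=2` returns out-of-fold predictions, which amounts to two-fold cross-fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p, theta0 = 2000, 5, 0.5
X = rng.normal(size=(n, p))
T = 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)           # assumed m0
y = (theta0 * T + np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2           # assumed g0
     + rng.normal(size=n))

learners = {
    "lasso": LassoCV(cv=5),
    "random forest": RandomForestRegressor(
        n_estimators=100, min_samples_leaf=5, random_state=0),
}

estimates = {}
for name, learner in learners.items():
    # Out-of-fold predictions: each observation is predicted by a model
    # that was fitted on the other half of the data.
    y_res = y - cross_val_predict(learner, X, y, cv=2)
    T_res = T - cross_val_predict(learner, X, T, cv=2)
    estimates[name] = (T_res @ y_res) / (T_res @ T_res)
    print(f"{name}: theta_hat = {estimates[name]:.3f}")
```

Any regressor with `fit` and `predict` methods could be dropped into the `learners` dictionary in the same way.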


## Extensions

The DML approach is extended in the paper to:
- compute Local Average Treatment Effects (LATE) 
- estimate instrumental variable (IV) models

## The end
