# Part 1

# Why tree-based methods for econometrics

## Average Joe and beyond
Tree-based methods for 
- computing the average treatment effect 
- computing personalized treatment effects

Problem with ordinary methods

## Two approaches for heterogeneity

- Data driven vs a priori sensible
- When to choose which

## Issues with causal forest
Limited to situations with unconfoundedness given covariates

- Going beyond: handling unconfoundedness and heterogeneity simultaneously

# The Generalized Random Forest

## Value of trees

From splitting data to computing weights

<center><img src='https://raw.githubusercontent.com/abjer/sds_eml_2020/master/material/session_5/partitions_weights.JPG' alt="Drawing" style="width: 1000px;"/></center>



## Overall procedure

1. Repeatedly grow trees whose splits are based on the estimating equation $\psi$, and use them to obtain weights $\alpha$. 
1. Re-solve the estimating equation $\psi$ on the entire sample, weighting observations by the forest-based weights $\alpha$.

## Estimating equations
The general estimating equation
  - $\mathbb{E}\left[\psi_{\theta(x), \nu(x)}\left(O_{i}\right) | X_{i}=x\right]=0, \quad \forall x.$
    
where the estimating function $\psi$ maps parameters and data into moment equations
  - Parameters
    - $\theta$ is the parameter we want to estimate 
    - $\nu$ is a nuisance parameter we want to "partial out" (optional)
  - Data     
    - $O_i$ main objects we are interested in modelling, e.g. $Y_i, D_i$
    - $X_i$ covariates
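
As a minimal illustration of the notation (an assumed example, chosen for simplicity), estimating a conditional mean $\mu(x)=\mathbb{E}[Y_i \mid X_i=x]$ needs no nuisance parameter and corresponds to the estimating function

\begin{equation}
\psi_{\mu(x)}\left(Y_{i}\right)=Y_{i}-\mu(x),
\end{equation}

since $\mathbb{E}\left[Y_{i}-\mu(x) \mid X_{i}=x\right]=0$ holds exactly when $\mu(x)$ is the conditional mean.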


## Computing weights $\alpha$

Given a subsample $\mathcal{I}$ of data.

1. Split subsample $\mathcal{I}$ into $\mathcal{J}_1,\mathcal{J}_2$
1. Estimate trees on $\mathcal{J}_1$
  1. Compute estimating equations on subsample at each parent node (before split)
  1. Evaluate different splits - use approximate solutions with gradients
1. Estimate weights using forests on $\mathcal{J}_2$.
  \begin{equation}\alpha_i(x)=\frac{1}{B}\sum_{b=1}^{B}\frac{\mathbb{1}(X_i\in L_b(x))}{|L_b(x)|}\end{equation}
    - where $L_b(x)$ is the set of training examples falling in the same leaf as $x$ in tree $b$, and $B$ is the number of trees
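
The weight formula can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function `forest_weights` and the toy leaf assignments are hypothetical, and it assumes the leaf membership of every training point (and of the query point $x$) in each tree is already known.

```python
def forest_weights(train_leaves, query_leaves):
    """Compute alpha_i(x) from leaf memberships.

    train_leaves[b][i] is the leaf id of training point i in tree b,
    query_leaves[b] is the leaf id of the query point x in tree b.
    (Hypothetical input layout, chosen for illustration.)
    """
    n_trees = len(train_leaves)
    n_obs = len(train_leaves[0])
    alpha = [0.0] * n_obs
    for leaves, leaf_x in zip(train_leaves, query_leaves):
        # Training points sharing a leaf with x in this tree: L_b(x)
        members = [i for i, leaf in enumerate(leaves) if leaf == leaf_x]
        for i in members:
            # Each tree spreads a total weight of 1/B equally over L_b(x)
            alpha[i] += 1.0 / (n_trees * len(members))
    return alpha


# Toy example: 4 training points, 2 trees
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1]]
query_leaves = [0, 1]
alpha = forest_weights(train_leaves, query_leaves)
```

Whenever every tree has at least one training point in the leaf of $x$, the weights sum to one, so they can be used directly as kernel-style weights in the local estimating equation.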

## Estimating equations

Given $x$, solve the ***local*** estimating equation using weights $\alpha_i(x)$ on the entire sample:

\begin{equation}
(\hat{\theta}(x), \hat{\nu}(x)) \in \operatorname{argmin}_{\theta, \nu}\left\{\left\|\sum_{i=1}^{n} \alpha_{i}(x) \psi_{\theta, \nu}\left(O_{i}\right)\right\|_{2}\right\}
\end{equation}

Examples of applications in Athey et al. (2019):
- Conditional Average Treatment Effects
- Instrumental Variables
- Quantile Regressions


## Estimating equations for CATE

What does the local estimating equation look like under CATE?

- the estimating equations $\psi$ with possibly multi-dimensional treatments 
<br>
<br>
\begin{align}\psi_{\beta(x), c(x)}\left(Y_{i}, W_{i}\right)=\left(Y_{i}-\beta(x) \cdot W_{i}-c(x)\right)\left(1 \quad W_{i}\right)\end{align}
<br>
- Note: the factor $\left(1 \quad W_{i}\right)$ implies that $\psi$ is a vector of equations
  

## Estimating equations for CATE (2)

What is the solution?
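- Run a local regression of $Y_i$ on $W_i$ with weights $\alpha_i(x)$
  
\begin{align}\hat{\theta}(x)=\xi^{\top}\left(\sum_{i=1}^{n} \alpha_{i}(x)\left(W_{i}-\bar{W}_{\alpha}\right)^{\otimes 2}\right)^{-1} \sum_{i=1}^{n} \alpha_{i}(x)\left(W_{i}-\bar{W}_{\alpha}\right)\left(Y_{i}-\bar{Y}_{\alpha}\right)\end{align}

- Here $\bar{W}_{\alpha}=\sum_{i} \alpha_{i}(x) W_{i}$ and $\bar{Y}_{\alpha}=\sum_{i} \alpha_{i}(x) Y_{i}$ are weighted means, and $\xi$ selects the contrast of interest, $\theta(x)=\xi^{\top}\beta(x)$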
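
For a single scalar treatment the formula reduces to a weighted covariance divided by a weighted variance. The sketch below illustrates this special case; the function name `local_cate` and the toy data are hypothetical, and the weights are assumed to sum to one.

```python
def local_cate(alpha, w, y):
    """Weighted local regression slope of y on a scalar treatment w.

    alpha: weights alpha_i(x), assumed to sum to one
    w, y:  treatment and outcome for each observation
    """
    w_bar = sum(a * wi for a, wi in zip(alpha, w))  # weighted mean of W
    y_bar = sum(a * yi for a, yi in zip(alpha, y))  # weighted mean of Y
    cov = sum(a * (wi - w_bar) * (yi - y_bar) for a, wi, yi in zip(alpha, w, y))
    var = sum(a * (wi - w_bar) ** 2 for a, wi in zip(alpha, w))
    return cov / var


# Toy example with equal weights and a binary treatment
alpha = [0.25, 0.25, 0.25, 0.25]
w = [0, 1, 0, 1]
y = [1.0, 3.0, 2.0, 4.0]
theta_hat = local_cate(alpha, w, y)
```

With equal weights and a binary treatment, the estimate coincides with the difference between the treated and control means (here 3.5 minus 1.5).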

## Main results 

Athey et al. (2019) show that Generalized Random Forests have the following properties:

- Estimates, $\hat{\theta}(x)$, are consistent (Theorem 3)
- Asymptotic normality of estimates (Theorems 5,6)

# Comparison 
### Causal Forests and Generalized Random Forests

## Causal Trees and Forests
Strong econometric properties
- unbiased and consistent (trees and forests)
- asymptotic normality given $x$ (forests)

- weaknesses: 
  - handles either unconfoundedness or heterogeneity, not both
  - we "use" data to buy honesty at the price of statistical power
      

## Generalized Random Forest 
Difference from Causal Forest - trees are used for constructing weights!
- strengths: 
  - unconfoundedness (propensity) *AND* heterogeneity
  - additional uses 
      - quantile regression
      - instrumental variables  
      - clustered standard errors
      - and more
- weakness: 
  - we "waste" data on honesty

# Part 2

# Linear ML for econometrics

## Treatment effects in linear models 

Suppose we want to estimate a linear model parameter for the causal effect of treatment $d_i$ on $y_i$. 

\begin{equation}y_i=\alpha d_i+x_i\beta+r_{yi}+\zeta_i\end{equation}

- We follow notation in [Belloni et al., 2015](https://doi.org/10.1257/jep.28.2.29)
- We let $r_{yi}$ be an approximation error (we don't know the functional form)


## Treatment effects in linear models (2)

How do we select the model, i.e. the subset of $x$?

- Classic econometrics: 
  - Use **OLS** and include covariates based on theory or inference
  - Problem: how do we drop covariates systematically? How do we adjust for multiple hypothesis testing?
  
- Machine learning:
  - Use **LASSO** to perform covariate selection
  - Note: estimates are biased towards zero!  
    - Problem: we may omit relevant variables!
    - LASSO may exclude confounders that have little predictive power for $y_i$. 
    - Excluded variables may still matter through $d_i$, e.g. covariates correlated with treatment.


## Fixing the LASSO

A simple solution suggested by [Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29) is to use Post-LASSO to correct for bias:
- Step 1: estimate two LASSO models
    - a) Regress $y_i$ on $x_i$ 
    - b) Regress $d_i$ on $x_i$ 
- Step 2: run OLS of $y_i$ on $d_i$ and only those variables that were kept in either LASSO
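
The two steps can be sketched with NumPy alone. This is a rough sketch under stated assumptions, not the authors' implementation: `lasso_cd` is a hypothetical, bare-bones coordinate-descent LASSO, the penalty `lam=0.2` is picked by hand rather than by a data-driven rule, and the data are simulated with a true effect of 1.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Assumes y and the columns of X are centered, so no intercept is needed.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]      # add back feature j's current fit
            rho = X[:, j] @ resid / n
            # Soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid -= X[:, j] * b[j]
    return b


def post_double_selection(y, d, X, lam):
    """Step 1: LASSO of y on x and of d on x. Step 2: OLS on the union."""
    Xc, yc, dc = X - X.mean(axis=0), y - y.mean(), d - d.mean()
    keep = (lasso_cd(Xc, yc, lam) != 0) | (lasso_cd(Xc, dc, lam) != 0)
    Z = np.column_stack([dc, Xc[:, keep]])
    coef, *_ = np.linalg.lstsq(Z, yc, rcond=None)
    return coef[0], keep               # coefficient on d, selected columns


# Simulated data: column 0 is a confounder, true treatment effect is 1.0
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 1.0 * d + 0.2 * X[:, 0] + rng.normal(size=n)
alpha_hat, keep = post_double_selection(y, d, X, lam=0.2)
```

Taking the union of the two selected sets is what protects against dropping a covariate that matters mainly through $d_i$.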

What about inference?

- We need further assumptions on sparsity.
- See [Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) online appendix for details.


## A general solution

[Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29) and [Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) write down the two "prediction equations":

\begin{align}
y_i=&\alpha d_i+x_i^{\prime}\theta_y+r_{yi}+\zeta_i\\
d_{i}=&x_{i}^{\prime} \theta_{d}+r_{d i}+v_{i}
\end{align}

The two equations can be combined into a single reduced-form equation (substitute $d_i$ into $y_i$):

\begin{align}y_{i}=&x_{i}^{\prime}\left(\alpha \theta_{d}+\theta_{y}\right)+\left(\alpha r_{d i}+r_{y i}\right)+\left(\alpha v_{i}+\zeta_{i}\right)\\=&x_{i}^{\prime} \pi+r_{c i}+\varepsilon_{i}\end{align}


## A general solution (2)

[Chernozhukov et al. (2015)](https://doi.org/10.1257/aer.p20151022) state the following algorithm for estimating $\alpha$:

1. Run the two LASSO regressions (as in Post-LASSO) and obtain residuals 
  - $\hat{\rho}_i^y$ from $y_i$ on $x_i$
  - $\hat{\rho}_i^d$ from $d_i$ on $x_i$
1. run a regression of $\hat{\rho}_i^y$ on $\hat{\rho}_i^d$

What is the intuition?

- Similar to Frisch-Waugh-Lovell where we partial out effects.
  - We partial out the effect of $x_i$ on both $y_i$ and $d_i$ separately 
- Innovation: we perform double selection of variables using LASSO
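
The partialling-out intuition can be checked numerically. The sketch below is illustrative only: it uses plain OLS rather than LASSO in the first step, which is the classical Frisch-Waugh-Lovell setting, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))
d = x @ np.array([1.0, -0.5, 0.0]) + rng.normal(size=n)
y = 2.0 * d + x @ np.array([0.5, 0.0, 1.0]) + rng.normal(size=n)


def ols(Z, t):
    """Least-squares coefficients of t on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, t, rcond=None)
    return coef


X1 = np.column_stack([np.ones(n), x])    # controls plus intercept

# Full regression of y on d and the controls
alpha_full = ols(np.column_stack([d, X1]), y)[0]

# Partial out the controls from y and d, then regress residual on residual
rho_y = y - X1 @ ols(X1, y)
rho_d = d - X1 @ ols(X1, d)
alpha_fwl = (rho_d @ rho_y) / (rho_d @ rho_d)
```

Both routes give the same coefficient on $d_i$ up to floating-point error. The algorithm above replaces the OLS first step with LASSO so that it remains feasible with many covariates.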


## A general solution (3)

[Belloni et al. (2015)](https://doi.org/10.1257/jep.28.2.29) simulate the performance of post-selection estimators

<center><img src='beh2014_fig1.JPG' alt="Drawing" style="width: 800px;"/></center>


## Beyond linear post double selection estimator

[Chernozhukov et al. (2018)](https://doi.org/10.1111/ectj.12097) show that non-linear prediction approaches can be used in the prediction steps

- tree-based methods, including random forests and boosted trees
- neural networks 
- kernel models, including support vector machines 



# Part 3

# Overview: applications of machine learning in econometrics

## Applications of ML: for estimation 

We have seen two major frameworks

- causal and generalized random forests 
- double selection 

There are also new applications for time series and difference-in-differences, see references in [Athey and Imbens (2019)](https://doi.org/10.1146/annurev-economics-080217-053433)

## Applications of ML: model evaluation

Making and gauging predictive models (SDS Intro): 

- Assess the performance of our models
- Choose the parameters that help estimate the best performing model (using validation set)

We used two approaches to decomposing models: 

- Make inference when comparing the generalization error of different models (Nadeau and Bengio)
- Smart ways of gauging predictive contribution, e.g. SHAP 
  - Note: we do not expect you to be able to apply this

## Applications of ML: new data

Machine learning can help us *'fill in the blanks'* and impute missing data. Examples?

- E.g. cell phone and smartwatch data
    - Inferring poverty ([Blumenstock et al., 2015](https://doi.org/10.1126/science.aac4420))
    - Inferring mode of transportation ([Reddy et al., 2010](https://doi.org/10.1145/1689239.1689243); Bjerre-Nielsen et al., 2020)
    - Measuring sleep and other behavioral traces 

Problem: measurement error creates econometric problems 
- predicted data may have subtle biases
- may need to correct for uncertainty 
    - NOTE: interesting subject for thesis?

## Applications of ML: policy targeting

[Kleinberg et al. (2015)](https://doi.org/10.1257/aer.p20151023) argue that in many policy applications we are **mainly** concerned with prediction. Examples: 



- Short-term weather: we are (almost) exclusively interested in prediction 
    - predicting whether it rains vs. making it rain (causal effect)
    - on a longer horizon: evidence of climate change 
- Knee surgeries
  - predict mortality before surgery to avoid operating on the terminally ill
- Audits for taxes
  - predicting fraud vs. information campaigns 
  - behavioral effects in subsequent years
