# Machine learning 3

Econometrics and machine learning

**Andreas Bjerre-Nielsen**




# Plan

- 13-14: Lecture on machine learning 3
- 14-15: Lecture by David D. Lassen 
- Work on assignment 2 + project



# Overview

How to use **machine learning** (*ML*) as a social scientist?
- Direct applications
- Prediction policy problems
- Machine learning in estimation
  - Instrumental variables
  - Matching with machine learning




## Recap 

What are some strengths of ML?


- Better predictions out-of-sample (lower error)


How do we do it?


- Select the model from the data and handle non-linearities
- Cross validation, ensemble learning, regularization


# Direct applications of machine learning

## Direct applications of ML (1): testing

ML helps us with making predictive models: 

- Assess the performance of our models
- Choose the parameters that help estimate the best performing model 

Can we use ML to help us clarify whether a new feature set is relevant for prediction?




## Direct applications of ML (2): testing

Using an ML framework, we can test whether a model's out-of-sample performance changes when a feature set is included.

- We can bootstrap the standard error of the performance measure and build a test from it.

Examples of implementation:

- Moritz, B. and Zimmermann, T., 2016. Tree-based conditional portfolio sorts: The relation between past and future stock returns.

- Bjerre-Nielsen, A. et al., 2018. Wi-Finding: Urban Transportation Sensing Using Crowdsourced Wi-Fi.


## Direct applications of ML (3): testing

How does this work? 

- Fit models with and without the feature set and compare their out-of-sample performance
- Bootstrap the standard error of the difference to judge whether it is distinguishable from zero
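
The comparison above can be sketched in a few lines. This is a minimal illustration on simulated data, not the procedure of either cited paper: the data-generating process, the OLS models, and the names `fit_ols`, `sq_errors` and `gain` are all assumptions made for the example. The bootstrap resamples only the test observations, so it takes the fitted models as given.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): y depends on a base feature and on a new feature.
n = 600
X_base = rng.normal(size=(n, 1))
X_new = rng.normal(size=(n, 1))
y = 1.0 * X_base[:, 0] + 0.8 * X_new[:, 0] + rng.normal(size=n)


def fit_ols(X, y):
    """Least-squares coefficients with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def sq_errors(coef, X, y):
    """Squared prediction error for each observation."""
    A = np.column_stack([np.ones(len(X)), X])
    return (y - A @ coef) ** 2


# Fit on the first half, score on the second half.
train, test = np.arange(n // 2), np.arange(n // 2, n)
X_full = np.column_stack([X_base, X_new])

err_without = sq_errors(fit_ols(X_base[train], y[train]), X_base[test], y[test])
err_with = sq_errors(fit_ols(X_full[train], y[train]), X_full[test], y[test])

# Per-observation reduction in squared error from adding the feature set.
gain = err_without - err_with
mean_gain = gain.mean()

# Bootstrap the test observations to get a standard error for the mean gain.
B = 2000
boot = np.array(
    [gain[rng.integers(0, len(gain), len(gain))].mean() for _ in range(B)]
)
se_gain = boot.std(ddof=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"MSE gain: {mean_gain:.3f}, bootstrap SE: {se_gain:.3f}, "
      f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the confidence interval for the gain excludes zero, the feature set improves prediction in this sample; with a real model one would typically repeat the exercise across cross-validation folds.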

## Direct applications of ML (4): new data

Can machine learning help us get new data?


Suppose we do not know the socioeconomic composition of a neighborhood. Could machine learning help us?

- Naik, Raskar, Hidalgo (2016): Cities Are Physical Too: Using Computer Vision to Measure the Quality and Impact of Urban Appearance

## Direct applications of ML (5): new data

Machine learning can help us *'fill in the blanks'* and impute missing data

Cell phone data
- Inferring poverty.
  - Blumenstock, Cadamuro, On (2015): Predicting Poverty and Wealth from Mobile Phone Metadata

- Inferring mode of transportation.
  - Reddy et al. (2010): Using mobile phones to determine transportation modes
  - Bjerre-Nielsen et al. (2018): Wi-Finding: Urban Transportation Sensing Using Crowdsourced Wi-Fi
  
- Sleep
    - Cuttone et al. (2017): SensibleSleep: A Bayesian Model for Learning Sleep Patterns


## Direct applications of ML (6): new data

Facebook data can help infer
- Psychological profile, demographics (Cambridge Analytica)
- Socioeconomic status 
  - Facebook has a new patent, see [here](https://www.cbinsights.com/research/facebook-patent-socioeconomic-detection/?utm_source=CB+Insights+Newsletter&utm_campaign=fa48df10a8-ThursNL_02_01_2018&utm_medium=email&utm_term=0_9dc0513989-fa48df10a8-89375681)
- Other: voting, mood

# Prediction policy problems

## Prediction policy problems (1)

Social scientists are often involved in policies aimed at: 
- alleviating poverty, reducing drop-out, reducing crime, etc.

The efficacy of these programs depends on targeting the right individuals:
- who is poorest, who is most at risk of dropping out?

## Prediction policy problems (2)

Kleinberg et al. (2015) state the problem as: 

- choose $X_0$ to maximize welfare $\pi(\cdot)$
- $\pi$ is a function of:
  - $Y$, the outcome variable, which depends on the policy in an unknown way
  - $X_0$, the policy 

E.g. hiring an additional teacher - how does this affect average teacher ability? 
- Need to predict new teacher ability

## Prediction policy problems (3)

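The optimal policy follows from the total derivative of welfare with respect to the policy:

\begin{align}
\frac{d\pi(X_0,Y)}{dX_0}=\frac{\partial\pi}{\partial X_0}\underset{\text{predict}}{\underbrace{(Y)}} + \frac{\partial\pi}{\partial Y} \cdot \underset{\text{causal effect}}{\underbrace{ \frac{\partial Y}{\partial X_0}}}\end{align}

The first term is the direct effect of the policy evaluated at the level of $Y$, so it requires a *prediction* of $Y$. The second term requires the *causal effect* of the policy on $Y$.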

## Prediction policy problems (4)

Example: joint replacement surgery (knee and hip). 

- many patients die shortly after surgery
- predict mortality risk
    - top 1 pct. riskiest: 44 pct. mortality rate, \$30M saved
    - top 5 pct. riskiest: 35 pct. mortality rate, \$121M saved
    - top 10 pct. riskiest: 24 pct. mortality rate, \$158M saved
    - top 20 pct. riskiest: 15 pct. mortality rate, \$185M saved
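
A table like the one above comes from ranking individuals by predicted risk and computing the outcome rate within the top shares. The sketch below shows the mechanics on simulated data only: the risk distribution, the noisy `score` standing in for a model prediction, and the helper `rate_in_top` are assumptions, and the output does not reproduce the figures above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: each person has a true risk; the model sees a noisy score.
n = 100_000
true_risk = rng.beta(1, 9, size=n)              # mean risk of 0.10
score = true_risk + rng.normal(0, 0.05, n)      # stand-in for a model prediction
died = rng.random(n) < true_risk                # realized outcome


def rate_in_top(score, outcome, share):
    """Outcome rate among the `share` of individuals with the highest scores."""
    k = int(np.ceil(share * len(score)))
    top = np.argsort(score)[-k:]
    return outcome[top].mean()


shares = [0.01, 0.05, 0.10, 0.20]
rates = [rate_in_top(score, died, s) for s in shares]

for s, r in zip(shares, rates):
    print(f"top {s:.0%} riskiest: outcome rate {r:.1%}")
print(f"overall: {died.mean():.1%}")
```

The better the prediction, the steeper the fall in the outcome rate as the targeted share grows, which is what makes targeting valuable.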

## Prediction policy problems (5)

Other issues:
- discrimination?
    - gender, race, socio-economic status
- GDPR: profiling
- faith in selection algorithm?
- should we incentivize local authorities to use private information?

# Machine learning in estimation

## ML in econometric methods

Overview of econometric tools

- instrumental variable
- matching

# Instrumental variables and machine learning

## Instrumental variables (1)

Standard problem
- We are interested in the effect of $X$ on $Y$
    - Model $Y = \beta X + u$ 
    - However, $X$ and the error term $u$ are correlated (endogeneity), so OLS estimates of $\beta$ are biased.

## Instrumental variables (2)

Linear two-stage approach (two-stage least squares):
- 1st stage: predict $X$ from $Z$, call this $\hat{X}$
    - standard approach: regress covariate $X$ on exogenous instruments $Z$
    - requires that i) $Z$ and $u$ are uncorrelated (exogeneity), ii) $Z$ and $X$ are correlated (relevance)
- 2nd stage:
    - regress $Y$ on $\hat{X}$
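
The two stages can be written out directly with least squares. This is a minimal sketch on simulated data; the coefficients, the sample size, and the helper `ols` are assumptions chosen so that OLS is visibly biased while the two-stage estimate recovers the true $\beta = 2$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (hypothetical): u affects both X and Y, so OLS is biased.
n = 5_000
beta = 2.0
Z = rng.normal(size=n)                       # instrument: moves X, unrelated to u
u = rng.normal(size=n)                       # structural error
X = 0.8 * Z + 0.6 * u + rng.normal(size=n)   # endogenous regressor
Y = beta * X + u


def ols(A, b):
    """Least-squares coefficients of b on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef


const = np.ones(n)

# Naive OLS of Y on X.
beta_ols = ols(np.column_stack([const, X]), Y)[1]

# 1st stage: regress X on Z and form the fitted values X_hat.
Z_mat = np.column_stack([const, Z])
X_hat = Z_mat @ ols(Z_mat, X)

# 2nd stage: regress Y on X_hat.
beta_2sls = ols(np.column_stack([const, X_hat]), Y)[1]

print(f"true beta: {beta}, OLS: {beta_ols:.3f}, 2SLS: {beta_2sls:.3f}")
```

Running the second stage by hand gives the right point estimate but the wrong standard errors, because the residuals are computed from $\hat{X}$ rather than $X$; dedicated IV routines correct for this.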
   


## Instrumental variables (3)

The issue:
- 1st stage prediction of $\hat{X}$ may have (very) poor fit
- causes: 
    - sample size is low
    - the number of instruments is high
    - or the instruments are weak

## Instrumental variables (4)

Cross validation solutions
- Split sample (training, estimation)
    - Angrist and Krueger 1995
- Jackknife (leave one out)
    - Angrist, Imbens, and Krueger 1999
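
The split-sample idea can be sketched as follows: fit the first stage on one half of the data, predict $X$ in the other half, and use that out-of-sample prediction as the instrument there. This is an illustration under assumed simulated data with many mostly irrelevant instruments, not a reproduction of the cited estimators; variables are mean zero by construction, so intercepts are omitted, and the names `first_stage` and `iv_slope` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated data (hypothetical): 200 instruments, only the first 5 are relevant.
n, k, beta = 4_000, 200, 1.0
Z = rng.normal(size=(n, k))
pi = np.zeros(k)
pi[:5] = 0.2
u = rng.normal(size=n)
X = Z @ pi + 0.8 * u + rng.normal(size=n)
Y = beta * X + u


def first_stage(Z, X):
    """First-stage least-squares coefficients of X on Z (no intercept)."""
    coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return coef


def iv_slope(instrument, X, Y):
    """IV estimate of the slope using one constructed instrument."""
    return (instrument @ Y) / (instrument @ X)


# Ordinary 2SLS: first stage fitted and predicted on the same observations,
# so X_hat_in partly fits the noise in X that is correlated with u.
X_hat_in = Z @ first_stage(Z, X)
beta_2sls = iv_slope(X_hat_in, X, Y)

# Split sample: fit the first stage on half a, predict on half b.
idx = rng.permutation(n)
a, b = idx[: n // 2], idx[n // 2:]
X_hat_out = Z[b] @ first_stage(Z[a], X[a])
beta_split = iv_slope(X_hat_out, X[b], Y[b])

print(f"true beta: {beta}, 2SLS: {beta_2sls:.3f}, split-sample: {beta_split:.3f}")
```

Because the first-stage coefficients never see the observations they are applied to, the constructed instrument is not mechanically correlated with $u$ in the estimation half; the price is a noisier estimate.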

## Instrumental variables (5)

ML enhanced solutions
- Regularization:    
    - LASSO: Belloni, Chen, Chernozhukov, and Hansen 2012
    - Ridge regression: Carrasco 2012; Hansen and Kozbur 2014
- Neural network:
    - Hartford, Leyton-Brown, and Taddy 2017
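
To give a flavour of a regularized first stage, the sketch below fits a ridge regression of $X$ on many instruments, chooses the penalty by cross validation, and uses out-of-fold predictions as the instrument. It is written in the spirit of the regularization approaches listed above and does not reproduce any of the cited estimators; the simulated data, the penalty grid `lams`, and the helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data (hypothetical): 100 instruments, only the first 5 are relevant.
n, k, beta = 3_000, 100, 1.0
Z = rng.normal(size=(n, k))
pi = np.zeros(k)
pi[:5] = 0.3
u = rng.normal(size=n)
X = Z @ pi + 0.8 * u + rng.normal(size=n)
Y = beta * X + u


def ridge_fit(Z, X, lam):
    """Closed-form ridge coefficients: (Z'Z + lam * I)^{-1} Z'X."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ X)


def out_of_fold_predictions(Z, X, lam, folds):
    """Predict X in each fold from a ridge fit on the remaining folds."""
    pred = np.empty(len(X))
    for f in np.unique(folds):
        held_out = folds == f
        pred[held_out] = Z[held_out] @ ridge_fit(Z[~held_out], X[~held_out], lam)
    return pred


folds = rng.integers(0, 5, size=n)          # 5 random folds
lams = [0.1, 1.0, 10.0, 100.0, 1000.0]      # assumed penalty grid

# Choose the penalty with the lowest out-of-fold first-stage MSE.
cv_mse = [np.mean((X - out_of_fold_predictions(Z, X, lam, folds)) ** 2)
          for lam in lams]
best_lam = lams[int(np.argmin(cv_mse))]

# Use the out-of-fold prediction of X as the instrument.
X_hat = out_of_fold_predictions(Z, X, best_lam, folds)
beta_iv = (X_hat @ Y) / (X_hat @ X)

print(f"chosen penalty: {best_lam}, IV estimate: {beta_iv:.3f} (true beta: {beta})")
```

The penalty shrinks the coefficients on irrelevant instruments toward zero, which improves the out-of-sample fit of the first stage when instruments are many or weak.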

# Econometric matching and machine learning

## Matching and treatment (1)

Aim: understand the effect of a policy treatment $D$ on an outcome $Y$:

- Variable of interest (often called *treatment*): $D_i$

- Outcome of interest: $Y_i$

**Potential outcome framework**
$$
Y_i = \left\{
\begin{array}{rl}
Y_{1i} & \text{if } D_i = 1,\\
Y_{0i} & \text{if } D_i = 0
\end{array} \right.
$$

The observed outcome $Y_i$ can be written in terms of potential outcomes as
$$ Y_i = Y_{0i} + (Y_{1i}-Y_{0i})D_i$$

$Y_{1i}-Y_{0i}$ is the *causal* effect of $D_i$ on $Y_i$. 

## Matching and treatment (2)

Can we measure causal effect?

Problem: we never observe the same individual $i$ in both states. This is the **fundamental problem of causal inference** (Holland, 1986). Implication: we cannot take the difference within an individual across states.

Need to estimate the state we do not observe (the ***counterfactual***). 

Can we use a naive comparison of averages by treatment status, i.e. $E[Y_i|D_i = 1] - E[Y_i|D_i = 0]$?

- Yes, if treatment is **randomly** assigned.

## Matching and treatment (3)

Suppose assignment is not random. We can decompose the naive comparison into:

\begin{align}E[Y_i|D_i = 1] - E[Y_i|D_i = 0] =&  
\underset{\text{causal effect}}{\underbrace{E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1]}} + \underset{\text{selection bias}}{\underbrace{E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]}}\end{align}

The decomposition:

- $E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] = E[Y_{1i} - Y_{0i}|D_i = 1]$: the average *causal* effect of $D_i$ on $Y_i$ among the treated. 

- $E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]$: the difference in average $Y_{0i}$ between the two groups. It is likely to differ from 0 when individuals can self-select into treatment. Often referred to as ***selection bias***. 
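
The decomposition can be checked numerically in a simulation, where both potential outcomes are known. Everything about the data-generating process below (the `ability` confounder, the effect sizes, the selection rule) is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 200_000
ability = rng.normal(size=n)                  # hypothetical confounder
Y0 = ability + rng.normal(size=n)             # potential outcome without treatment
Y1 = Y0 + 1.0 + 0.5 * ability                 # potential outcome with treatment

# Self-selection: high-ability individuals are more likely to take the treatment.
D = (ability + rng.normal(size=n) > 0).astype(int)
Y = Y0 + (Y1 - Y0) * D                        # observed outcome

naive = Y[D == 1].mean() - Y[D == 0].mean()
causal_effect_treated = (Y1 - Y0)[D == 1].mean()
selection_bias = Y0[D == 1].mean() - Y0[D == 0].mean()

print(f"naive difference:        {naive:.3f}")
print(f"causal effect (treated): {causal_effect_treated:.3f}")
print(f"selection bias:          {selection_bias:.3f}")

# Under random assignment the naive comparison recovers the average effect.
D_rand = rng.integers(0, 2, size=n)
Y_rand = Y0 + (Y1 - Y0) * D_rand
naive_rand = Y_rand[D_rand == 1].mean() - Y_rand[D_rand == 0].mean()
ate = (Y1 - Y0).mean()

print(f"random assignment: naive {naive_rand:.3f} vs average effect {ate:.3f}")
```

In the self-selected sample the naive difference equals the sum of the two terms exactly; with random assignment the selection term vanishes up to sampling noise.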

## Matching and treatment (4)

Selection bias can be overcome by **matching**: find similar observations to construct the counterfactual

- propensity score (inferred treatment assignment probability)
- exact / nearest neighbor
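
A minimal sketch of nearest-neighbor matching on an estimated propensity score is given below. It assumes simulated data with a single observed confounder `x`, estimates the score with a hand-written logistic regression (`fit_logit`, a hypothetical helper), and matches each treated unit to the control with the closest score, with replacement. Matching only removes selection on observed covariates.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data (hypothetical): x drives both treatment take-up and the outcome.
n = 3_000
true_effect = 2.0
x = rng.normal(size=n)
D = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
Y = true_effect * D + x + rng.normal(size=n)


def fit_logit(A, d, steps=25):
    """Logistic regression coefficients via Newton-Raphson."""
    w = np.zeros(A.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-A @ w))
        grad = A.T @ (d - p)
        hess = (A * (p * (1 - p))[:, None]).T @ A
        w += np.linalg.solve(hess, grad)
    return w


# Propensity score: estimated probability of treatment given x.
A = np.column_stack([np.ones(n), x])
pscore = 1 / (1 + np.exp(-A @ fit_logit(A, D)))

treated = np.flatnonzero(D == 1)
control = np.flatnonzero(D == 0)

# For each treated unit, pick the control with the closest propensity score.
dist = np.abs(pscore[treated][:, None] - pscore[control][None, :])
matches = control[dist.argmin(axis=1)]

att_matched = (Y[treated] - Y[matches]).mean()
naive = Y[D == 1].mean() - Y[D == 0].mean()

print(f"true effect: {true_effect}, naive: {naive:.3f}, matched: {att_matched:.3f}")
```

The logistic regression is the step that a more flexible ML model could replace when treatment assignment depends on many covariates in a non-linear way.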

## Matching and treatment (5)

ML can help estimate the propensity score:

- Using cross validation and various models including random forest:
    - Lee, Lessler, and Stuart (2010)
- Debiased machine learning:
    - Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (forthcoming)


## Matching and treatment (6)

When treatment is randomly assigned, ML can help us estimate heterogeneous treatment effects:

- Causal Trees:
    - Athey and Imbens (2016)
- Causal Forests:
    - Athey, Tibshirani, Wager (forthcoming)
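
The intuition behind tree-based effect estimation can be illustrated with a single split. The sketch below is a heavily simplified stand-in, not the causal tree or causal forest algorithm of the cited papers: it uses one half of a simulated randomized sample to choose a threshold on a covariate and the other half to estimate the treatment effect in each leaf. The splitting score, the candidate grid, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated randomized experiment (hypothetical): the effect is 2 when x > 0, else 0.
n = 20_000
x = rng.uniform(-1, 1, size=n)
D = rng.integers(0, 2, size=n)
tau = np.where(x > 0, 2.0, 0.0)
Y = x + tau * D + rng.normal(size=n)


def diff_in_means(y, d):
    """Treated-minus-control difference in mean outcomes."""
    return y[d == 1].mean() - y[d == 0].mean()


def leaf_effects(threshold, rows):
    """Estimated effect and size of the left (x <= t) and right (x > t) leaf."""
    leaves = [rows[x[rows] <= threshold], rows[x[rows] > threshold]]
    effects = np.array([diff_in_means(Y[leaf], D[leaf]) for leaf in leaves])
    sizes = np.array([len(leaf) for leaf in leaves])
    return effects, sizes


def split_score(threshold, rows):
    """Size-weighted sum of squared leaf effects; larger means more heterogeneity."""
    effects, sizes = leaf_effects(threshold, rows)
    return np.sum(sizes * effects ** 2)


# One half chooses the split, the other half estimates the effects.
idx = rng.permutation(n)
split_half, est_half = idx[: n // 2], idx[n // 2:]

candidates = np.linspace(-0.8, 0.8, 17)
best_threshold = max(candidates, key=lambda c: split_score(c, split_half))

effects_est, _ = leaf_effects(best_threshold, est_half)
tau_left, tau_right = effects_est

print(f"chosen threshold: {best_threshold:.2f}")
print(f"estimated effect, x <= threshold: {tau_left:.3f}")
print(f"estimated effect, x >  threshold: {tau_right:.3f}")
```

Separating the data used to choose the split from the data used to estimate the leaf effects keeps the estimates from being inflated by the search over thresholds.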