# Double/Debiased Machine Learning for the Partially Linear Regression Model

This notebook gives a simple implementation of Double/Debiased Machine Learning (DML) for the Partially Linear Regression Model and applies DML inference to estimate the causal effect of countries' initial wealth on their rate of economic growth.


References:

- https://arxiv.org/abs/1608.00060
- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778

The code is based on the book.

In [1]:
install.packages("xtable")
install.packages("hdm")
install.packages("randomForest")
install.packages("glmnet")
install.packages("sandwich")

library(xtable)
library(randomForest)
library(hdm)
library(glmnet)
library(sandwich)

set.seed(1)


randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.

Loading required package: Matrix

Loaded glmnet 4.1-8



In [2]:
file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/GrowthData.csv"
data <- read.csv(file)
data <- subset(data, select = -1) # get rid of index column
head(data)
dim(data)

,Outcome,intercept,gdpsh465,bmp1l,freeop,freetar,h65,hm65,hf65,p65,...,seccf65,syr65,syrm65,syrf65,teapri65,teasec65,ex1,im1,xr65,tot1
,<dbl>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,...,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,-0.02433575,1,6.591674,0.2837,0.153491,0.043888,0.007,0.013,0.001,0.29,...,0.04,0.033,0.057,0.01,47.6,17.3,0.0729,0.0667,0.348,-0.014727
2,0.10047257,1,6.829794,0.6141,0.313509,0.061827,0.019,0.032,0.007,0.91,...,0.64,0.173,0.274,0.067,57.1,18.0,0.094,0.1438,0.525,0.00575
3,0.06705148,1,8.895082,0.0,0.204244,0.009186,0.26,0.325,0.201,1.0,...,18.14,2.573,2.478,2.667,26.5,20.7,0.1741,0.175,1.082,-0.01004
4,0.06408917,1,7.565275,0.1997,0.248714,0.03627,0.061,0.07,0.051,1.0,...,2.63,0.438,0.453,0.424,27.8,22.7,0.1265,0.1496,6.625,-0.002195
5,0.02792955,1,7.162397,0.174,0.299252,0.037367,0.017,0.027,0.007,0.82,...,2.11,0.257,0.287,0.229,34.5,17.6,0.1211,0.1308,2.5,0.003283
6,0.04640744,1,7.21891,0.0,0.258865,0.02088,0.023,0.038,0.006,0.5,...,1.46,0.16,0.174,0.146,34.3,8.1,0.0634,0.0762,1.0,-0.001747


In [3]:
y = as.matrix(data[,1])         # outcome: growth rate
d = as.matrix(data[,3])         # treatment: initial wealth
x = as.matrix(data[,-c(1,2,3)]) # controls: country characteristics

# some summary statistics
cat(sprintf("\nThe length of y is %g \n", length(y) ))
cat(sprintf("\nThe number of features in x is %g \n", dim(x)[2] ))

lres=summary(lm(y~d +x))$coef[2,1:2]
cat(sprintf("\nNaive OLS that uses all features w/o cross-fitting Y ~ D+X yields: \ncoef (se) = %g (%g)\n", lres[1] , lres[2]))


The length of y is 90 



The number of features in x is 60 

Naive OLS that uses all features w/o cross-fitting Y ~ D+X yields: 
coef (se) = -0.00937799 (0.0298877)


# DML algorithm

Here we estimate and conduct inference on the predictive coefficient $\alpha$ in the partially linear statistical model,
$$
Y = D\alpha + g(X) + U, \quad E (U | D, X) = 0.
$$
For $\tilde Y = Y- E(Y|X)$ and $\tilde D= D- E(D|X)$, we can write
$$
\tilde Y = \alpha \tilde D + U, \quad E (U |\tilde D) =0.
$$
The parameter $\alpha$ is then estimated by regressing $\tilde Y$ on $\tilde D$, with both residuals obtained by cross-fitting.
The algorithm consumes $Y, D, X$ and machine learning methods for predicting $Y$ and $D$ from $X$. The residuals $\tilde Y$ and $\tilde D$ for each fold are computed from models fitted on the remaining folds,
so no observation is predicted by a model that was trained on it.
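
The step from the first display to the second can be spelled out. The following is a short sketch that uses only the assumptions already stated; the observation index $i$ and sample size $n$ are notation introduced here for illustration. Taking conditional expectations given $X$ in the model, and using $E(U|X) = E(E(U|D,X)|X) = 0$,
$$
E(Y|X) = E(D|X)\alpha + g(X).
$$
Subtracting this from the model removes $g(X)$:
$$
Y - E(Y|X) = \big(D - E(D|X)\big)\alpha + U, \quad \text{that is,} \quad \tilde Y = \alpha \tilde D + U.
$$
Given residuals $\tilde D_i$ and $\tilde Y_i$ for observations $i = 1, \dots, n$, the least-squares slope of $\tilde Y$ on $\tilde D$ without an intercept is
$$
\hat\alpha = \Big(\sum_{i=1}^n \tilde D_i^2\Big)^{-1} \sum_{i=1}^n \tilde D_i \tilde Y_i .
$$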

The statistical parameter $\alpha$ has a causal interpretation as the effect of $D$ on $Y$ in the causal DAG $D\to Y, \ X\to (D,Y)$, or in the counterfactual outcome model with conditionally exogenous (conditionally random) assignment of treatment $D$ given $X$:
$$
Y(d) = d\alpha + g(X) + U(d),\quad  U(d) \text{ indep } D |X, \quad Y = Y(D), \quad U = U(D).
$$


In [4]:
DML2.for.PLM <- function(x, d, y, dreg, yreg, nfold=2) {
  nobs <- nrow(x) # number of observations
  foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)] # random fold label for each observation
  I <- split(1:nobs, foldid)  # split observation indices into folds
  ytil <- dtil <- rep(NA, nobs)
  cat("fold: ")
  for (b in seq_along(I)) {
    dfit <- dreg(x[-I[[b]],], d[-I[[b]]]) # fit the D model with fold b held out
    yfit <- yreg(x[-I[[b]],], y[-I[[b]]]) # fit the Y model with fold b held out
    dhat <- predict(dfit, x[I[[b]],], type="response") # predict D on the held-out fold
    yhat <- predict(yfit, x[I[[b]],], type="response") # predict Y on the held-out fold
    dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record D residual for the held-out fold
    ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record Y residual for the held-out fold
    cat(b," ")
  }
  rfit <- lm(ytil ~ dtil)    # estimate the main parameter by regressing one residual on the other
  coef.est <- coef(rfit)[2]  # extract the coefficient on dtil
  se <- sqrt(vcovHC(rfit)[2,2]) # heteroskedasticity-robust standard error (sandwich::vcovHC, default type HC3)
  cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est , se))  # print the estimate
  return( list(coef.est =coef.est , se=se, dtil=dtil, ytil=ytil) ) # return estimate, standard error, and residuals
}
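
To see the cross-fitting logic in isolation, here is a minimal sketch on simulated data that uses only base R. The data-generating process, the true value `alpha = 0.5`, the sample size, the number of folds, and the OLS first stage via `lm.fit` are all assumptions made for illustration; none of them come from the growth data.

```r
# Simulated partially linear model (all values below are illustrative assumptions)
set.seed(123)
n <- 500; p <- 5
x <- matrix(rnorm(n * p), n, p)
alpha <- 0.5                                            # true effect in the simulation
d <- as.numeric(x %*% rep(0.3, p) + rnorm(n))           # treatment depends on x
y <- as.numeric(alpha * d + x %*% rep(0.2, p) + rnorm(n))

# Assign each observation to one of nfold folds at random
nfold <- 5
foldid <- sample(rep(seq_len(nfold), length.out = n))
ytil <- dtil <- rep(NA_real_, n)

for (b in seq_len(nfold)) {
  test  <- which(foldid == b)
  train <- which(foldid != b)
  # First stage: OLS of d and y on x, fitted on the training folds only
  bd <- lm.fit(cbind(1, x[train, ]), d[train])$coefficients
  by <- lm.fit(cbind(1, x[train, ]), y[train])$coefficients
  # Residuals on the held-out fold
  dtil[test] <- d[test] - as.numeric(cbind(1, x[test, ]) %*% bd)
  ytil[test] <- y[test] - as.numeric(cbind(1, x[test, ]) %*% by)
}

# Final stage: regress one residual on the other
alpha_hat <- unname(coef(lm(ytil ~ dtil))[2])
cat(sprintf("true alpha = %g, estimate = %g\n", alpha, alpha_hat))
```

Because the first stage is correctly specified here, the estimate should land near the simulated `alpha`; the point of the sketch is only that every residual comes from a model that never saw that observation.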


We now run DML with the following first-stage models:
 1. OLS
 2. (Rigorous) Lasso
 3. Random Forests
 4. Mix of Random Forest and Lasso

In [5]:
#DML with OLS
cat(sprintf("\nDML with OLS w/o feature selection \n"))
dreg <- function(x,d){ glmnet(x, d, lambda = 0) } #ML method= OLS using glmnet; using lm gives bugs
yreg <- function(x,y){ glmnet(x, y, lambda = 0) } #ML method = OLS
DML2.OLS = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10)


#DML with Lasso:
cat(sprintf("\nDML with Lasso \n"))
dreg <- function(x,d){ rlasso(x,d, post=FALSE) } #ML method= lasso from hdm
yreg <- function(x,y){ rlasso(x,y, post=FALSE) } #ML method = lasso from hdm
DML2.lasso = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10)


#DML with Random Forest:
cat(sprintf("\nDML with Random Forest \n"))
dreg <- function(x,d){ randomForest(x, d) } #ML method=Forest
yreg <- function(x,y){ randomForest(x, y) } #ML method=Forest
DML2.RF = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10)

#DML MIX:
cat(sprintf("\nDML with Lasso for D and Random Forest for Y \n"))
dreg <- function(x,d){ rlasso(x,d, post=FALSE) } #ML method= lasso from hdm
yreg <- function(x,y){ randomForest(x, y) } #ML method=Forest
DML2.mix = DML2.for.PLM(x, d, y, dreg, yreg, nfold=10)



DML with OLS w/o feature selection 
fold: 1  2  3  4  5  6  7  8  9  10  
coef (se) = 0.01013 (0.0167061)

DML with Lasso 
fold: 1  2  3  4  5  6  7  8  9  10  
coef (se) = -0.0417523 (0.016092)

DML with Random Forest 
fold: 1  2  3  4  5  6  7  8  9  10  
coef (se) = -0.038251 (0.0149232)

DML with Lasso for D and Random Forest for Y 
fold: 1  2  3  4  5  6  7  8  9  10  
coef (se) = -0.040615 (0.0130871)
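
The OLS cell notes that plugging in `lm` directly causes problems. One likely reason (an assumption; the notebook does not say) is that `predict.lm` expects `newdata` as a data frame whose columns match the formula, whereas `DML2.for.PLM` passes a plain matrix. A minimal sketch of a workaround is a small S3 learner with a matrix interface. The names `ols_learner` and `predict.ols_learner` are hypothetical helpers introduced here, not part of the notebook or of any package.

```r
# Hypothetical OLS learner with the (x, y) matrix interface used by dreg / yreg
ols_learner <- function(x, y) {
  fit <- lm.fit(cbind(1, x), y)                 # base-R least squares with an intercept column
  structure(list(coef = fit$coefficients), class = "ols_learner")
}

# predict method; extra arguments such as type = "response" are accepted and ignored
predict.ols_learner <- function(object, newdata, ...) {
  if (is.null(dim(newdata))) newdata <- matrix(newdata, nrow = 1)  # a single row arrives as a vector
  b <- object$coef
  b[is.na(b)] <- 0                              # lm.fit returns NA for aliased columns; treat them as zero
  as.numeric(cbind(1, newdata) %*% b)
}

# Small demonstration on made-up data
set.seed(42)
x_demo <- matrix(rnorm(200), nrow = 50, ncol = 4)
y_demo <- as.numeric(x_demo %*% c(1, -1, 0.5, 0) + rnorm(50))
fit_demo  <- ols_learner(x_demo, y_demo)
pred_demo <- predict(fit_demo, x_demo, type = "response")
head(pred_demo)
```

Under these assumptions, `dreg <- function(x,d){ ols_learner(x, d) }` could stand in for the `glmnet(..., lambda = 0)` learner, though the two need not agree to the last digit because `glmnet` solves the problem iteratively.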


Next we compare the out-of-sample RMSE of the first-stage predictions of D and Y to see which method performs best in the first stage. All results are collected in the table below.

In [6]:
prRes.D<- c( mean((DML2.OLS$dtil)^2), mean((DML2.lasso$dtil)^2), mean((DML2.RF$dtil)^2), mean((DML2.mix$dtil)^2));
prRes.Y<- c(mean((DML2.OLS$ytil)^2), mean((DML2.lasso$ytil)^2),mean((DML2.RF$ytil)^2),mean((DML2.mix$ytil)^2));
prRes<- rbind(sqrt(prRes.D), sqrt(prRes.Y));
rownames(prRes)<- c("RMSE D", "RMSE Y");
colnames(prRes)<- c("OLS", "Lasso", "RF", "Mix")

In [7]:
table <- matrix(0,4,4) # 4 methods (rows) by 4 statistics (columns)

# Point Estimate
table[1,1] <- as.numeric(DML2.OLS$coef.est)
table[2,1] <- as.numeric(DML2.lasso$coef.est)
table[3,1] <- as.numeric(DML2.RF$coef.est)
table[4,1]   <- as.numeric(DML2.mix$coef.est)

# SE
table[1,2] <- as.numeric(DML2.OLS$se)
table[2,2] <- as.numeric(DML2.lasso$se)
table[3,2] <- as.numeric(DML2.RF$se)
table[4,2]   <- as.numeric(DML2.mix$se)

# RMSE Y
table[1,3] <- as.numeric(prRes[2,1])
table[2,3] <- as.numeric(prRes[2,2])
table[3,3] <- as.numeric(prRes[2,3])
table[4,3]   <- as.numeric(prRes[2,4])

# RMSE D
table[1,4] <- as.numeric(prRes[1,1])
table[2,4] <- as.numeric(prRes[1,2])
table[3,4] <- as.numeric(prRes[1,3])
table[4,4]   <- as.numeric(prRes[1,4])



# print results
colnames(table) <- c("Estimate","Standard Error", "RMSE Y", "RMSE D")
rownames(table) <- c("OLS", "Lasso", "RF", "RF/Lasso Mix")
table

,Estimate,Standard Error,RMSE Y,RMSE D
OLS,0.01013,0.01670607,0.05372882,0.4667511
Lasso,-0.04175226,0.01609199,0.05259734,0.3709391
RF,-0.03825099,0.01492318,0.04574408,0.3806255
RF/Lasso Mix,-0.04061496,0.01308714,0.04595863,0.367047


In [8]:
print(table, digits=3)

             Estimate Standard Error RMSE Y RMSE D
OLS            0.0101         0.0167 0.0537  0.467
Lasso         -0.0418         0.0161 0.0526  0.371
RF            -0.0383         0.0149 0.0457  0.381
RF/Lasso Mix  -0.0406         0.0131 0.0460  0.367
