# Final Exam

The exam is open book, open note, and open Google. However, you are not allowed outside help from another person. All work, including coding, must be yours alone. Remember to turn in both the written portion and this coding portion. Turn in this coding portion by downloading your completed Colab notebook as a .ipynb file and submitting it via Learning Suite. To get full credit, the completed notebook should run from top to bottom and produce the results asked for in the prompt below.

## The Question

An important question in public economics is whether eligibility for publicly provided health insurance (Medicaid) increases individuals' use of the healthcare system. In this final exam you will use machine learning to estimate the causal effect of Medicaid eligibility on how many times an individual visits the doctor. In the Econ 484 Google Drive "datasets" folder you will find a dataset called "oregon.csv" and the associated codebook "oregon_codebook.txt", which gives some information about each of the variables. The outcome variable is the number of primary care doctor visits. The treatment variable is an indicator for ever being enrolled in Medicaid. A possible instrumental variable is an indicator for being selected in the Oregon Medicaid Experiment lottery. Additional covariates are number of household members, gender, age, employment status, and race indicators.

# Data Manipulation

In [1]:
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Econ 484/datasets'

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
/content/gdrive/My Drive/Econ 484/datasets


In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('oregon.csv') #read in the data
df = df.dropna(axis=0) #drop rows with missing values
print(df.columns) #print columns and first few rows to get a sense of the data
df.head(5)

Index(['wonlottery', 'numhhmembers', 'female_6m', 'birthyear_6m',
       'employ_hrs_6m', 'race_hisp_6m', 'race_white_6m', 'race_black_6m',
       'race_amerindian_6m', 'race_asian_6m', 'race_pacific_6m',
       'race_other_qn_6m', 'doctor_visits', 'hasmedicaid'],
      dtype='object')


Unnamed: 0,wonlottery,numhhmembers,female_6m,birthyear_6m,employ_hrs_6m,race_hisp_6m,race_white_6m,race_black_6m,race_amerindian_6m,race_asian_6m,race_pacific_6m,race_other_qn_6m,doctor_visits,hasmedicaid
7,0,2,0.0,1968.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
26,0,2,1.0,1962.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
27,0,1,1.0,1954.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0
41,1,1,1.0,1946.0,2.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0
77,1,1,1.0,1982.0,3.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0


In [3]:
#get outcome, treatment, instrument, and covariates subset

y= df['doctor_visits']
D= df['hasmedicaid']
Z= df['wonlottery']
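#NOTE: 'wonlotter' below is misspelled, so 'wonlottery' is not filtered out and stays in X (12 columns, as printed below)
X= df.loc[:,[x for x in df.columns if x not in ('doctor_visits','hasmedicaid','wonlotter')]]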

print(D.shape)
print(y.shape)
print(X.shape)
print(Z.shape)

(4666,)
(4666,)
(4666, 12)
(4666,)


In [4]:
#standardize X matrix. Leave y, D and Z as is since D and Z are dummies and y is an outcome we want to leave as is

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #create scaler object
scaler.fit(X) #feed the scaler object the x
x_scaled = scaler.transform(X)
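
The scaler above is fit on every row before any train/test split, so the held-out rows influence the means and standard deviations used for scaling. One hedged alternative is to put the scaler inside a scikit-learn pipeline so it is fit on training rows only. The sketch below uses simulated stand-in data, not the Oregon data; the names `X_demo`, `y_demo`, `pipe`, and the alpha grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in data (NOT the Oregon data): 500 rows, 4 covariates
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(500, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(size=500)

# Split first, so the test rows play no part in fitting the scaler
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline learns means/SDs from the training rows only, then fits ridge
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
pipe.fit(Xtr, ytr)

# score() applies the training-set scaling to the test rows before predicting
test_r2 = pipe.score(Xte, yte)
print("held-out R^2:", round(test_r2, 3))
```

With only a dozen low-dimensional covariates the leakage from scaling on the full sample is probably small, so this is a best-practice habit more than a correction of the estimates.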

## The Task

Estimate the causal effect of Medicaid eligibility on doctor visits in two ways:

1) Via OLS regression where you use machine learning to control for the additional covariates.

2) Via instrumental variables regression using the lottery indicator as an instrument for Medicaid eligibility where you use machine learning to control for the additional covariates in both the first stage and reduced form. Note that this is a different application of machine learning to instrumental variables than we learned in class. We are not using machine learning to solve the "many instruments" problem. Rather we are using machine learning to flexibly control for additional covariates in the instrumental variables estimation.
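
The two estimators in the task can be written compactly as follows. This is a sketch under a partially linear model, which the prompt does not state explicitly; the symbols $\theta$, $g$, $\varepsilon$, and the tilde notation for residuals are assumptions introduced here. The hats on the conditional expectations stand for whatever machine learning prediction is used.

$$
y_i = \theta D_i + g(X_i) + \varepsilon_i, \qquad
\tilde{y}_i = y_i - \hat{E}[y \mid X_i], \quad
\tilde{D}_i = D_i - \hat{E}[D \mid X_i], \quad
\tilde{Z}_i = Z_i - \hat{E}[Z \mid X_i]
$$

$$
\hat{\theta}_{OLS} = \frac{\sum_i \tilde{D}_i \tilde{y}_i}{\sum_i \tilde{D}_i^2},
\qquad
\hat{\theta}_{IV}
= \frac{\sum_i \tilde{Z}_i \tilde{y}_i \,/\, \sum_i \tilde{Z}_i^2}{\sum_i \tilde{Z}_i \tilde{D}_i \,/\, \sum_i \tilde{Z}_i^2}
= \frac{\sum_i \tilde{Z}_i \tilde{y}_i}{\sum_i \tilde{Z}_i \tilde{D}_i}
$$

The numerator of $\hat{\theta}_{IV}$ is the reduced form (the instrument's "effect" on the outcome) and the denominator is the first stage (the instrument's "effect" on the treatment), each after the covariates have been partialled out.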

# Double Debiased with sample splitting (OLS regression method)

In [5]:
#import the packages you need
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.linear_model import LinearRegression

In [6]:
#I'll compare how well lasso and ridge predict y by splitting the data into training and test sets and seeing which performs better on the test set.
#I will then use the better-performing model for double debiased machine learning (DDML),
#which lets me control for many covariates with machine learning and use OLS on the residuals to get the causal effect.
#I'll first do it on the full sample and then with sample splitting.

In [7]:
#get train test split
from sklearn.model_selection import train_test_split
X_train, X_test, D_train, D_test, y_train, y_test = train_test_split(x_scaled,D,y,random_state=42)

In [8]:
#start with lasso cross validation
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

lassoxy = LassoCV(cv=5, random_state=0).fit(X_train, y_train)

print(lassoxy.score(X_test, y_test))

lasso_predict = lassoxy.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_predict)
print("lasso mse",lasso_mse)
lassoxy.coef_

0.025083292065102847
lasso mse 8.000721197957448


array([ 0.06647315, -0.06206118,  0.1660925 , -0.10408818, -0.25206037,
        0.        ,  0.15258452, -0.        ,  0.04064783, -0.        ,
       -0.        ,  0.        ])

In [9]:
#Do CV for best parameters of ridge of x on y
bestRidgey = RidgeCV(alphas=[.01, .1, .3,.5,.8,1]).fit(X_train, y_train)

print(bestRidgey.score(X_test, y_test))

ridge_predict = bestRidgey.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_predict)
print("ridge mse",ridge_mse)

print("Best Alpha: ", bestRidgey.alpha_)
print(bestRidgey.coef_)

#Ridge had a slightly better score and mse after cross validation so I will use that model going forward

0.025855805655442765
ridge mse 7.99438151190259
Best Alpha:  1.0
[ 0.09030249 -0.0786514   0.18514746 -0.12438025 -0.26906333  0.01795229
  0.19691888 -0.00268324  0.06862871  0.00897948 -0.01047143  0.02656548]
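
In the ridge output above, the selected alpha (1.0) is the largest value in the grid, which can mean the best penalty lies outside the range searched. A minimal sketch of the same held-out comparison with a wider, log-spaced grid and a boundary check is below. It runs on simulated data, not the Oregon data; `alpha_grid`, `candidates`, and the `_demo` arrays are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Simulated stand-in data (NOT the Oregon data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 8))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(size=600)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=42)

# Log-spaced grid spanning six orders of magnitude (an assumed range)
alpha_grid = np.logspace(-3, 3, 25)
candidates = {
    "lasso": LassoCV(cv=5, random_state=0),
    "ridge": RidgeCV(alphas=alpha_grid),
}

# Fit each candidate on the training rows and score it on the held-out rows
test_mse = {}
for name, model in candidates.items():
    model.fit(Xtr, ytr)
    test_mse[name] = mean_squared_error(yte, model.predict(Xte))
    print(name, "test MSE:", round(test_mse[name], 3))

# If the chosen ridge alpha sits on the edge of the grid, widen the grid
ridge_alpha = candidates["ridge"].alpha_
at_edge = ridge_alpha in (alpha_grid[0], alpha_grid[-1])
print("ridge alpha:", ridge_alpha, "| on grid boundary:", at_edge)

best_name = min(test_mse, key=test_mse.get)
print("lowest held-out MSE:", best_name)
```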


In [10]:
#refit the ridge regression (y on X) on the whole data set and show how well it does
#residualizing y on X is the first step of DDML
bestRidgey.fit(x_scaled,y)

ridge_predict = bestRidgey.predict(x_scaled)
ridge_mse = mean_squared_error(y, ridge_predict)
print("ridge mse",ridge_mse)

print("Best Alpha: ", bestRidgey.alpha_)
print('coefficients',bestRidgey.coef_)
print('score',bestRidgey.score(x_scaled, y))


ridge mse 6.853705456031021
Best Alpha:  1.0
coefficients [ 0.13735499 -0.1003574   0.19094071 -0.1197136  -0.26133075  0.01402469
  0.18067523 -0.00554334  0.03792032 -0.00613201  0.00860593  0.02579008]
score 0.029235067825198247


In [11]:
#get residuals from the regression
yresid=y-bestRidgey.predict(x_scaled)

In [12]:
#Do CV for best parameters of ridge of x on d
bestRidged = RidgeCV(alphas=[.01, .1, .3,.5,.8,1]).fit(X_train, D_train)

print(bestRidged.score(X_test, D_test))

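#predict D with the treatment model (bestRidged), not the outcome model; the "ridge mse" recorded in the output below was computed from bestRidgey predictions
ridge_predict = bestRidged.predict(X_test)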
ridge_mse = mean_squared_error(D_test, ridge_predict)
print("ridge mse",ridge_mse)

print("Best Alpha: ", bestRidged.alpha_)
print(bestRidged.coef_)

0.16094459461786537
ridge mse 2.6119442666522885
Best Alpha:  1.0
[ 0.14305    -0.02057878  0.01382168  0.03507404 -0.07554521 -0.00899107
 -0.00523283  0.00057069  0.00038693  0.00017928 -0.00461706  0.00485552]


In [13]:
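#refit the ridge regression (D on X) on the whole data set and show how well it does
#residualizing D on X is the second step of DDML
bestRidged.fit(x_scaled,D)

#predict D with the treatment model (bestRidged); the "ridge mse" recorded in the output below was computed from bestRidgey predictions
ridge_predict = bestRidged.predict(x_scaled)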
ridge_mse = mean_squared_error(D, ridge_predict)
print("ridge mse",ridge_mse)

print("Best Alpha: ", bestRidged.alpha_)
print('coefficients',bestRidged.coef_)
print('score',bestRidged.score(x_scaled, D))

ridge mse 2.6299602628607865
Best Alpha:  1.0
coefficients [ 0.14291013 -0.02170197  0.01407619  0.03555553 -0.076947   -0.00578546
  0.00717022  0.00587518  0.00578178  0.00204862 -0.00622726  0.01108803]
score 0.15429044285108928


In [14]:
# get residuals of D from d_hat

dresid = D - bestRidged.predict(x_scaled)

#OLS regress y residuals on d residuals
ddmlregtest =LinearRegression().fit(dresid.to_numpy().reshape(-1,1),yresid)
ddmlregtest.coef_[0]

#now that I've created an estimate the basic way, I'll try sample splitting

1.0690422147587486
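
The cell above reports only a point estimate. A hedged sketch of how a heteroskedasticity-robust (HC0) standard error could accompany the residual-on-residual slope is below. The helper `slope_with_robust_se` is hypothetical, not part of the notebook, and the demonstration uses simulated residuals with a true slope of 1.0 instead of `dresid` and `yresid`.

```python
import numpy as np

def slope_with_robust_se(d_resid, y_resid):
    """Slope of y_resid on d_resid (with intercept) and its HC0 standard error."""
    d = np.asarray(d_resid, dtype=float)
    y = np.asarray(y_resid, dtype=float)
    # Centering both series is equivalent to including an intercept
    d = d - d.mean()
    y = y - y.mean()
    theta = (d @ y) / (d @ d)
    u = y - theta * d  # regression residuals
    # HC0 sandwich formula for a single regressor
    se = np.sqrt(np.sum(d**2 * u**2)) / (d @ d)
    return theta, se

# Simulated residuals (NOT the notebook's): true slope is 1.0 by construction
rng = np.random.default_rng(2)
d_demo = rng.normal(size=2000)
y_demo = 1.0 * d_demo + rng.normal(size=2000)

theta_demo, se_demo = slope_with_robust_se(d_demo, y_demo)
print("slope:", round(theta_demo, 3), "robust SE:", round(se_demo, 3))
```

In the notebook this would be called as `slope_with_robust_se(dresid, yresid)`. The formula ignores the fact that the residuals were themselves estimated, which is one of the reasons sample splitting is used in the next cell.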

In [15]:
x_scaled = pd.DataFrame(x_scaled, columns = list(X.columns))

In [16]:


#try with sample splitting

# create our sample splitting "object"
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# confirm the number of splits (this call only returns 5; it does not change x_scaled)
kf.get_n_splits(x_scaled)

# create an array to hold each fold's regression coefficient
coeffs=np.zeros(5)

# loop through each fold
ii=0
for train_index, test_index in kf.split(x_scaled):
  X_train, X_test = x_scaled.iloc[train_index,:], x_scaled.iloc[test_index,:]
  y_train, y_test = y.iloc[train_index], y.iloc[test_index]
  d_train, d_test = D.iloc[train_index], D.iloc[test_index]

  #double debiased machine learning process:

  # Ridge y on training folds:
  bestRidgey.fit(X_train, y_train)

  # but get residuals in test set
  yresid=y_test-bestRidgey.predict(X_test)
  
  #Ridge d on training folds
  bestRidged.fit(X_train, d_train)

  #but get residuals in test set
  dresid=d_test-bestRidged.predict(X_test)

  # regress resids on resids
  ddml=LinearRegression().fit(dresid.to_numpy().reshape(-1,1),yresid)

  # save coefficient in a vector
  coeffs[ii]=ddml.coef_[0]
  ii+=1

# Take average
print("Double-Debiased Machine Learning effect of medicaid: {:.3f}".format(np.mean(coeffs)))
coeffs

#the results with and without sample splitting are very similar, both very close to 1.06

Double-Debiased Machine Learning effect of medicaid: 1.056


array([1.10510928, 0.87950401, 1.01571054, 1.32930309, 0.94879214])
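
The loop above averages five per-fold coefficients. A common variant pools the out-of-fold residuals and runs a single residual-on-residual regression. The sketch below shows that variant with scikit-learn's `cross_val_predict`. It is an illustration on simulated data with a true effect of 1.0, not a rerun of the notebook, and the `_demo` names and alpha grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict

# Simulated stand-in data (NOT the Oregon data); true effect of d on y is 1.0
rng = np.random.default_rng(3)
n = 2000
X_demo = rng.normal(size=(n, 5))
d_demo = 0.8 * X_demo[:, 0] + rng.normal(size=n)
y_demo = 1.0 * d_demo + 2.0 * X_demo[:, 0] + rng.normal(size=n)

# A fixed random_state means both calls below use the same five folds
folds = KFold(n_splits=5, shuffle=True, random_state=42)
learner = RidgeCV(alphas=np.logspace(-3, 3, 13))

# Each row's prediction comes from a model that never saw that row
y_res = y_demo - cross_val_predict(learner, X_demo, y_demo, cv=folds)
d_res = d_demo - cross_val_predict(learner, X_demo, d_demo, cv=folds)

# One pooled residual-on-residual slope (no intercept; residuals are near mean zero)
theta_pooled = (d_res @ y_res) / (d_res @ d_res)
print("pooled cross-fitted estimate:", round(theta_pooled, 3))
```

Pooling and averaging usually give similar answers when the folds are of equal size; pooling has the practical advantage of yielding one set of residuals on which a standard error can be computed.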

# Instrumental Variables Method

This IV implementation differs from what we did in class. The instructions say to include the X covariates in both the first stage and the second stage. Assuming sparsity, and to guard against overfitting, I'll use lasso to decide which variables to keep in each stage. Then I'll run a simple OLS in each stage, which lets me compare the coefficients to get the ratio (Wald) estimator for the instrument.

In [20]:
#First Stage predict D with Z to get D-hat
from sklearn.linear_model import Lasso
#create matrix with z and x


print(Z.shape)
print(X.shape)
X['wonlottery'] = Z
X.shape

fs_x = X
print(fs_x.columns)
lassofs = Lasso(alpha=.02).fit(fs_x,D) #lasso to see which vars to keep
print(lassofs.coef_)

retained_fs = fs_x.iloc[:,(lassofs.coef_!=0)]
retained_fs.columns #This leaves only three columns which we can use in an OLS first stage

(4666,)
(4666, 12)
Index(['wonlottery', 'numhhmembers', 'female_6m', 'birthyear_6m',
       'employ_hrs_6m', 'race_hisp_6m', 'race_white_6m', 'race_black_6m',
       'race_amerindian_6m', 'race_asian_6m', 'race_pacific_6m',
       'race_other_qn_6m'],
      dtype='object')
[ 0.20868556 -0.          0.          0.00269882 -0.04942463 -0.
  0.          0.          0.         -0.         -0.          0.        ]


Index(['wonlottery', 'birthyear_6m', 'employ_hrs_6m'], dtype='object')

In [21]:
#now do OLS first stage with retained x's and z on d
#get retained x's
ols_fs = LinearRegression().fit(retained_fs,D) #using ols allows me to still claim an effect that isn't biased under the right assumptions
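z_coef = ols_fs.coef_[list(retained_fs.columns).index('wonlottery')] #coefficient on the instrument, looked up by column name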
#get d_hat
d_hat = ols_fs.predict(retained_fs)

In [22]:
#second stage, predict y with D-hat and X's

#first get the matrix of x's and d-hat

X = X.drop(['wonlottery'], axis=1)
X['d_hat'] = d_hat

ss_x = X
print(ss_x.columns)


Index(['numhhmembers', 'female_6m', 'birthyear_6m', 'employ_hrs_6m',
       'race_hisp_6m', 'race_white_6m', 'race_black_6m', 'race_amerindian_6m',
       'race_asian_6m', 'race_pacific_6m', 'race_other_qn_6m', 'd_hat'],
      dtype='object')


In [23]:
#then do lasso to see which variables should be retained

lasso_ss = Lasso(alpha=.02).fit(ss_x,y)

print(lasso_ss.coef_)

retained_ss = ss_x.iloc[:,(lasso_ss.coef_!=0)]
retained_ss.columns #This leaves six columns (five covariates plus d_hat) which we can use in an OLS second stage


[-0.14463014  0.31425175 -0.00986284 -0.19370512 -0.          0.30020979
 -0.          0.         -0.          0.         -0.          0.0280126 ]


Index(['numhhmembers', 'female_6m', 'birthyear_6m', 'employ_hrs_6m',
       'race_white_6m', 'd_hat'],
      dtype='object')

In [24]:
#do ols second stage with retained variables
ols_ss = LinearRegression().fit(retained_ss,y)
d_hat_coef = ols_ss.coef_[list(retained_ss.columns).index('d_hat')] #coefficient on d_hat, looked up by column name

#divide the two coefficients to get the effect
effect = d_hat_coef/z_coef
print('effect:',effect)

effect: 3.231792596213835


## Hints and Requirements

*   Thoroughly document your code with comments explaining what each part of your code is doing

*   Be sure to "print" all of the relevant results after estimating/calculating them

*   Use best practices that we have learned this semester, including pre-processing variables as necessary and choosing tuning parameters.

*   Hint: the dataset contains missing values for many of the variables and many of the observations. You may assume that observations are missing at random, and therefore restricting your dataset to observations for which no variables are missing is appropriate.

*   Hint: the instrumental variables setting here is somewhat different from the "many instruments" situation we discussed during the semester. Here, you calculate the instrumental variables estimate by estimating the "effect" of the instrument on the outcome (the reduced form), and the "effect" of the instrument on the treatment variable (the first stage), and your instrumental variables estimate is the ratio of the reduced form to the first stage. You will use machine learning to control for the additional covariates in both the reduced form and first stage.

*   Choose the machine learning method(s) you use based on what yields the best out-of-sample accuracy among at least two different machine learning methods, where out-of-sample accuracy is assessed using a held out test set.
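
The ratio described in the instrumental variables hint can be sketched as follows: partial the covariates out of the outcome, the treatment, and the instrument with a machine learning model, then divide the reduced-form slope by the first-stage slope. This is a minimal illustration on simulated data with an unobserved confounder and a true effect of 1.0 by construction. It is not the notebook's implementation, and the `_demo` names, the choice of `RidgeCV`, and the data-generating process are all assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict

# Simulated stand-in data (NOT the Oregon data)
rng = np.random.default_rng(4)
n = 20000
X_demo = rng.normal(size=(n, 5))
u = rng.normal(size=n)                                # unobserved confounder
z_demo = rng.binomial(1, 0.5, size=n).astype(float)   # randomized lottery-style instrument
# Binary treatment that depends on the instrument, a covariate, and the confounder
d_demo = (1.0 * z_demo + 0.5 * X_demo[:, 0] + u > 0.5).astype(float)
# Outcome with a constant treatment effect of 1.0
y_demo = 1.0 * d_demo + X_demo[:, 0] + u + 0.5 * rng.normal(size=n)

folds = KFold(n_splits=5, shuffle=True, random_state=42)
learner = RidgeCV(alphas=np.logspace(-3, 3, 13))

def out_of_fold_residual(target):
    # Residual of target after removing its cross-fitted prediction from X
    return target - cross_val_predict(learner, X_demo, target, cv=folds)

y_res = out_of_fold_residual(y_demo)
d_res = out_of_fold_residual(d_demo)
z_res = out_of_fold_residual(z_demo)

# First stage: instrument -> treatment; reduced form: instrument -> outcome
first_stage = (z_res @ d_res) / (z_res @ z_res)
reduced_form = (z_res @ y_res) / (z_res @ z_res)
theta_iv = reduced_form / first_stage

# For contrast: residual-on-residual OLS, which the confounder biases
theta_naive = (d_res @ y_res) / (d_res @ d_res)

print("first stage:", round(first_stage, 3))
print("reduced form:", round(reduced_form, 3))
print("IV (ratio) estimate:", round(theta_iv, 3))
print("naive OLS-style estimate:", round(theta_naive, 3))
```

Because the simulated effect is the same for everyone, the ratio targets that single effect. With heterogeneous effects the same ratio is usually interpreted as an effect for those whose treatment status responds to the instrument.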