# Solution Workgroup 6 - Python script

## I. Debiased Machine Learning (DML) applied to the Convergence Hypothesis

We will use the DML algorithm we have learned using The Testing Convergence Hypothesis Lab and the next variables for the main analysis:

$y$ = outcome: growth rate

$d$ = treatment: initial wealth

$x$ = controls: country characteristics

We will run the next regressions:

- OLS without including the country characteristics.
- OLS including the country characteristics.
- DML using Lasso to predict y an d.
- DML using Post-Lasso to predict y an d.
- DML using Elastic Net to predict y an d.
- DML using Ridge to predict y an d.
- DML using Random Forest to predict y an d.  

We will run the best method and show our results in a table.

In [1]:
from numpy import loadtxt
from keras.models import Sequential
from keras.layers import Dense

In [2]:
# Import relevant packages
import pandas as pd
import numpy as np
import pyreadr
from sklearn import preprocessing
import patsy

from numpy import loadtxt

import math
import hdmpy
import numpy as np
import random
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
import warnings
warnings.filterwarnings('ignore')

In [3]:
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, ElasticNetCV
from sklearn.linear_model import LinearRegression
from sklearn import linear_model
import itertools
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
from pandas.api.types import is_categorical_dtype
from itertools import compress
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.feature_selection import SelectFromModel

Load the data

In [4]:
# downloading the data that the author used
growth_read = pyreadr.read_r('../data/GrowthData.RData')

# extracting the data frame from rdata_read
data = growth_read[ 'GrowthData' ]



In [5]:
data.head(5)

Unnamed: 0,Outcome,intercept,gdpsh465,bmp1l,freeop,freetar,h65,hm65,hf65,p65,...,seccf65,syr65,syrm65,syrf65,teapri65,teasec65,ex1,im1,xr65,tot1
0,-0.024336,1,6.591674,0.2837,0.153491,0.043888,0.007,0.013,0.001,0.29,...,0.04,0.033,0.057,0.01,47.6,17.3,0.0729,0.0667,0.348,-0.014727
1,0.100473,1,6.829794,0.6141,0.313509,0.061827,0.019,0.032,0.007,0.91,...,0.64,0.173,0.274,0.067,57.1,18.0,0.094,0.1438,0.525,0.00575
2,0.067051,1,8.895082,0.0,0.204244,0.009186,0.26,0.325,0.201,1.0,...,18.14,2.573,2.478,2.667,26.5,20.7,0.1741,0.175,1.082,-0.01004
3,0.064089,1,7.565275,0.1997,0.248714,0.03627,0.061,0.07,0.051,1.0,...,2.63,0.438,0.453,0.424,27.8,22.7,0.1265,0.1496,6.625,-0.002195
4,0.02793,1,7.162397,0.174,0.299252,0.037367,0.017,0.027,0.007,0.82,...,2.11,0.257,0.287,0.229,34.5,17.6,0.1211,0.1308,2.5,0.003283


In [6]:
# data cleaning
y = data['Outcome']
d = data['gdpsh465']
x = data.drop(['Outcome','gdpsh465','intercept'], 1)

### 1. OLS without including the country characteristics.


In [7]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy

We make the model equation without control variables

In [8]:
model = "y ~ d"
baseline_ols = smf.ols(model , data=data).fit()
baseline_ols_table = baseline_ols.summary2().tables[1]
print( baseline_ols_table.iloc[ 1 , 4:] )
baseline_ols_table.iloc[1, :]

[0.025   -0.010810
0.975]    0.013444
Name: d, dtype: float64


Coef.       0.001317
Std.Err.    0.006102
t           0.215777
P>|t|       0.829661
[0.025     -0.010810
0.975]      0.013444
Name: d, dtype: float64

In [9]:
baseline_ols.summary2().tables[1]['Coef.']['d']

0.0013167126134460663

In [10]:
baseline_ols_table.loc[['d'],:]

Unnamed: 0,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
d,0.001317,0.006102,0.215777,0.829661,-0.01081,0.013444


An increase by 1% in the initial wealth cause an increase in the outcome by 0.001317.

### 2. OLS including the country characteristics.

- We select only the country characteristics variables to include in the model equation
- Also run the model including the country characteristics

In [11]:
control_formula = "y~ d + x"

In [12]:
control_ols = smf.ols( control_formula , data=data).fit()
control_ols_table = control_ols.summary2().tables[1]
print( control_ols_table.iloc[ 1 , 4:] )
control_ols_table.iloc[1, :]

[0.025   -0.070600
0.975]    0.051844
Name: d, dtype: float64


Coef.      -0.009378
Std.Err.    0.029888
t          -0.313774
P>|t|       0.756019
[0.025     -0.070600
0.975]      0.051844
Name: d, dtype: float64

In [13]:
control_ols_table.loc[['d'],:]

Unnamed: 0,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
d,-0.009378,0.029888,-0.313774,0.756019,-0.0706,0.051844


This can interpret as if the initial wealth increase by 1%, then the outcome decrease by -0.0093.

## 3. DML algorithm

Here we perform inference of the predictive coefficient $\beta$ in our partially linear statistical model, 

$$
Y = D\beta + g(Z) + \epsilon, \quad E (\epsilon | D, Z) = 0,
$$

using the **double machine learning (DML)** approach. 

For $\tilde Y = Y- E(Y|Z)$ and $\tilde D= D- E(D|Z)$, we can write
$$
\tilde Y = \alpha \tilde D + \epsilon, \quad E (\epsilon |\tilde D) =0.
$$

Using cross-fitting, we employ modern regression methods
to build estimators $\hat \ell(Z)$ and $\hat m(Z)$ of $\ell(Z):=E(Y|Z)$ and $m(Z):=E(D|Z)$ to obtain the estimates of the residualized quantities:

$$
\tilde Y_i = Y_i  - \hat \ell (Z_i),   \quad \tilde D_i = D_i - \hat m(Z_i), \quad \text{ for each } i = 1,\dots,n.
$$

Finally, using ordinary least squares of $\tilde Y_i$ on $\tilde D_i$, we obtain the 
estimate of $\beta$.

First, we create the base fuction to use in the **DML**.

In [14]:
def DML2_for_PLM(x, d, y, dreg, yreg, nfold):
    
    # Num ob observations
    nobs = x.shape[0]
    
    # Define folds indices 
    list_1 = [*range(0, nfold, 1)]*nobs
    sample = np.random.choice(nobs,nobs, replace=False).tolist()
    foldid = [list_1[index] for index in sample]

    # Create split function(similar to R)
    def split(x, f):
        count = max(f) + 1
        return tuple( list(itertools.compress(x, (el == i for el in f))) for i in range(count) ) 

    # Split observation indices into folds 
    list_2 = [*range(0, nobs, 1)]
    I = split(list_2, foldid)
    
    # Create array to save errors 
    dtil = np.zeros( len(x) ).reshape( len(x) , 1 )
    ytil = np.zeros( len(x) ).reshape( len(x) , 1 )
    
    # loop to save results
    for b in range(0,len(I)):
    
        # Split data - index to keep are in mask as booleans
        include_idx = set(I[b])  #Here should go I[b] Set is more efficient, but doesn't reorder your elements if that is desireable
        mask = np.array([(i in include_idx) for i in range(len(x))])

        # Lasso regression, excluding folds selected 
        dfit = dreg(x[~mask,], d[~mask,])
        yfit = yreg(x[~mask,], y[~mask,])

        # predict estimates using the 
        dhat = dfit.predict( x[mask,] )
        yhat = yfit.predict( x[mask,] )

        # save errors  
        dtil[mask] =  d[mask,] - dhat.reshape( len(I[b]) , 1 )
        ytil[mask] = y[mask,] - yhat.reshape( len(I[b]) , 1 )
        print(b, " ")
    
    # Create dataframe 
    data_2 = pd.DataFrame(np.concatenate( ( ytil, dtil), axis = 1), columns = ['ytil','dtil'])
   
    # OLS clustering at the County level
    model = "ytil ~ dtil"
    baseline_ols = smf.ols(model , data = data_2 ).fit()
    coef_est = baseline_ols.summary2().tables[1]['Coef.']['dtil']
    se = baseline_ols.summary2().tables[1]['Std.Err.']['dtil']
    
    Final_result = { 'coef_est' : coef_est , 'se' : se , 'dtil' : dtil , 'ytil' : ytil }

    print("Coefficient is {}, SE is equal to {}".format(coef_est, se))
    
    return Final_result

Now that we have the function, the next step is to use the differents machine learning method 

In [15]:
# converting varianles into a matrix
y = y.to_numpy().reshape( len(y) , 1 )
d = d.to_numpy().reshape( len(y) , 1 )
x = x.to_numpy()

## 3. DML using Lasso to predict $y$ an $d$

### 3.1 Lasso Using scikit-learn


In [16]:
def dreg(x,d):
    alpha=0.00000001
    result = linear_model.Lasso(alpha = alpha).fit(x, d)
    return result

def yreg(x,y):
    alpha=0.00000001
    result = linear_model.Lasso(alpha = alpha).fit(x, y)
    return result

DML2_lasso = DML2_for_PLM(x, d, y, dreg, yreg, 10)

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is 0.015428236453425687, SE is equal to 0.010473537645094125


###  3.2 Lasso using hdmpy 

In [17]:
import hdmpy
from statsmodels.tools import add_constant

In [18]:
class rlasso_hdmy:
    
    def __init__(self, post ):
        self.post = post
       
    def fit( self, X, Y ):
        
        self.X = X
        self.Y = Y
        
        # Standarization of X and Y
        self.rlasso_model = hdmpy.rlasso( X , Y , post = self.post )                
        return self
    
    def predict( self , X_1 ):
        self.X_1 = X_1
        beta = self.rlasso_model.est['coefficients'].to_numpy()
        
        if beta.sum() == 0:
            prediction = np.repeat( self.rlasso_model.est['intercept'] , self.X_1.shape[0] )
        
        else:
            prediction = ( add_constant( self.X_1 , has_constant = 'add') @ beta ).flatten()
                
        return prediction

In [19]:
# Post = false
def dreg(x, d):
    result = rlasso_hdmy( post = False ).fit( x , d )
    return result

def yreg(x,y):
    result = rlasso_hdmy( post = False ).fit( x , y )
    return result

DML2_lasso_hdmpy = DML2_for_PLM(x, d, y, dreg, yreg, 10)

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is -0.04032748554785723, SE is equal to 0.0144203550955758


## 4. DML using Post-Lasso to predict $y$ an $d$

### 4.1 Post - Lasso Using scikit-learn


In [20]:
class Lasso_post:
    
    def __init__(self, alpha ):
        self.alpha = alpha

        
    def fit( self, X, Y ):
        self.X = X
        self.Y = Y
        lasso = linear_model.Lasso( alpha = self.alpha ).fit( X , Y )
        model = SelectFromModel( lasso , prefit = True )
        X_new = model.transform( X )
        # Gettin indices from columns which has variance for regression
        index_X = model.get_support()
        
        self.index = index_X
        new_x = X[ : ,  index_X ]
        
        lasso2 = linear_model.Lasso( alpha = self.alpha ).fit( new_x , Y )
        self.model = lasso2
        
        return self
    
    def predict( self , X ):
        
        dropped_X = X[ : , self.index ]
        
        predictions = self.model.predict( dropped_X )
        
        return predictions

In [21]:
def dreg(x,d):
    alpha=0.00000001
    result = Lasso_post( alpha = alpha ).fit( x , d )
    return result

def yreg( x , y ):
    alpha = 0.00000001
    result = Lasso_post( alpha = alpha ).fit( x , y )
    return result

DML2_lasso_post = DML2_for_PLM(x, d, y, dreg, yreg, 10)

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is 0.034208318739032864, SE is equal to 0.010565179461344037


### 4.2 Post - Lasso using hdmpy

In [22]:
import hdmpy
from statsmodels.tools import add_constant

In [23]:
class rlasso_hdmy:
    
    def __init__(self, post ):
        self.post = post
       
    def fit( self, X, Y ):
        
        self.X = X
        self.Y = Y
        
        # Standarization of X and Y
        self.rlasso_model = hdmpy.rlasso( X , Y , post = self.post )                
        return self
    
    def predict( self , X_1 ):
        self.X_1 = X_1
        beta = self.rlasso_model.est['coefficients'].to_numpy()
        
        if beta.sum() == 0:
            prediction = np.repeat( self.rlasso_model.est['intercept'] , self.X_1.shape[0] )
        
        else:
            prediction = ( add_constant( self.X_1 , has_constant = 'add') @ beta ).flatten()
                
        return prediction

In [24]:
# Post = True
def dreg(x, d):
    result = rlasso_hdmy( post = True ).fit( x , d )
    return result

def yreg(x,y):
    result = rlasso_hdmy( post = True ).fit( x , y )
    return result

DML2_lasso_post_hdmpy = DML2_for_PLM(x, d, y, dreg, yreg, 10)

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is -0.03956162960966979, SE is equal to 0.0136955657056328


## 5. DML using Elastic Net

In [25]:
class standard_skl_model:
    
    def __init__(self, model ):
        self.model = model
       
    def fit( self, X, Y ):
        
        # Standarization of X and Y
        self.scaler_X = StandardScaler()
        self.scaler_X.fit( X )
        std_X = self.scaler_X.transform( X )
                
        self.model.fit( std_X , Y )
                
        return self
    
    def predict( self , X ):
        
        self.scaler_X = StandardScaler()
        self.scaler_X.fit( X )
        std_X = self.scaler_X.transform( X )
        
        prediction = self.model.predict( std_X )
        
        return prediction

In [26]:
# DML with cross-validated Elastic Net:
def dreg(x,d):
    result = standard_skl_model( ElasticNetCV( cv = 10 , random_state = 0 , l1_ratio = 0.5, max_iter = 100000 ) ).fit( x, d )
    return result

def yreg(x,y):
    result = standard_skl_model( ElasticNetCV( cv = 10 , random_state = 0 , l1_ratio = 0.5, max_iter = 100000 ) ).fit( x, y )
    return result

DML2_elnet = DML2_for_PLM(x, d, y, dreg, yreg, 10 )

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is 0.0014287915437003508, SE is equal to 0.010706472503853476


## 6. DML using Ridge to predict

In [27]:
# DML with cross-validated Ridge:
def dreg(x,d):
    result = standard_skl_model( ElasticNetCV( cv = 10 ,  random_state = 0 , l1_ratio = 0.0001 ) ).fit( x, d )
    return result

def yreg(x,y):
    result = standard_skl_model( ElasticNetCV( cv = 10 , random_state = 0 , l1_ratio = 0.0001 ) ).fit( x, y )
    return result

DML2_ridge = DML2_for_PLM(x, d, y, dreg, yreg, 10)

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is -0.01689585663596966, SE is equal to 0.009329337211001655


## 7. DML using Random Forest

In [28]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [29]:
# DML with Random Forest:
def dreg(x,d):
    result = RandomForestRegressor( random_state = 0 , n_estimators = 500 , max_features = 60 , n_jobs = 4 , min_samples_leaf = 5 ).fit( x, d )
    return result

def yreg(x,y):
    result = RandomForestRegressor( random_state = 0 , n_estimators = 500 , max_features = 60 , n_jobs = 4 , min_samples_leaf = 5 ).fit( x, y )
    return result

DML2_RF = DML2_for_PLM(x, d, y, dreg, yreg, 10)   # set to 2 due to computation time

0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is -0.04532271079660437, SE is equal to 0.012279553131739403


Now we can join all the results for every method

In [30]:
mods = [DML2_lasso, DML2_lasso_hdmpy, DML2_lasso_post ,DML2_lasso_post_hdmpy, DML2_ridge, DML2_elnet, DML2_RF]
mods_name = ["DML2_lasso", 'DML2_lasso_hdmy',"DML2_lasso_post",'DML2_lasso_post_hdmpy', 'DML2_ridge', 'DML2_elnet', 'DML2_RF']

def mdl( model , model_name ):
    
    RMSEY = np.sqrt( np.mean( model[ 'ytil' ] ) ** 2 )
    RMSED = np.sqrt( np.mean( model[ 'dtil' ] ) ** 2 ) 
    
    result = pd.DataFrame( { model_name : [ RMSEY , RMSED ]} , index = [ 'RMSEY' , 'RMSED' ])
    return result

RES = [ mdl( model , name ) for model, name in zip( mods , mods_name ) ]
    
pr_Res = pd.concat( RES, axis = 1)

pr_Res

Unnamed: 0,DML2_lasso,DML2_lasso_hdmy,DML2_lasso_post,DML2_lasso_post_hdmpy,DML2_ridge,DML2_elnet,DML2_RF
RMSEY,0.00225,6.3e-05,0.00237,0.001528,4.54883e-18,3.5696750000000005e-17,0.000735
RMSED,0.005838,0.007499,0.002246,0.000624,3.355341e-16,7.697546e-16,0.02539


## 8. Run the best method i.e. the best combination of methods to predict $y$ an $d$

In [31]:
def dreg(x,d):
    result = standard_skl_model(  ElasticNetCV(cv = 10 , random_state = 0 , alphas = [0]) ).fit( x, d )
    return result


def yreg(x,y):
    result = standard_skl_model( ElasticNetCV( cv = 10 ,  random_state = 0 , l1_ratio = 0.0001 ) ).fit( x, y )
    return result

DML2_best = DML2_for_PLM(x, d, y , dreg, yreg, 10)



0  
1  
2  
3  
4  
5  
6  
7  
8  
9  
Coefficient is 0.006406830446803534, SE is equal to 0.006843369328091645


##  9. Show your results in a table

In [32]:
table = np.zeros( ( 10 , 2 ))
table[ 0 , 0] = baseline_ols.summary2().tables[1]['Coef.']['d']
table[ 1 , 0] = control_ols.summary2().tables[1]['Coef.']['d']
table[ 2 , 0] = DML2_lasso['coef_est']
table[ 3 , 0] = DML2_lasso_hdmpy['coef_est']
table[ 4 , 0] = DML2_lasso_post['coef_est']
table[ 5 , 0] = DML2_lasso_post_hdmpy['coef_est']
table[ 6 , 0] = DML2_ridge['coef_est']
table[ 7 , 0] = DML2_elnet['coef_est']
table[ 8 , 0] = DML2_RF['coef_est']
table[ 9 , 0] = DML2_best['coef_est']
table[ 0 , 1] = baseline_ols.summary2().tables[1]['Std.Err.']['d']
table[ 1 , 1] = control_ols.summary2().tables[1]['Std.Err.']['d']
table[ 2 , 1] = DML2_lasso['se']
table[ 3 , 1] = DML2_lasso_hdmpy['se']
table[ 4 , 1] = DML2_lasso_post['se']
table[ 5 , 1] = DML2_lasso_post_hdmpy['se']
table[ 6 , 1] = DML2_ridge['se']
table[ 7 , 1] = DML2_elnet['se']
table[ 8 , 1] = DML2_RF['se']
table[ 9 , 1] = DML2_best['se']

In [33]:
table = pd.DataFrame(table, index = [ "Baseline OLS", "OLS with controls", "Lasso", 'Lasso using hdmy', \
                                       "Post-Lasso",'Post-Lasso using hdmy',"CV Elnet", "CV Ridge", \
                                       "Random Forest", "Best" ] , \
                     columns = ["Estimate","Standard Error"] )
table.round( 3 )

Unnamed: 0,Estimate,Standard Error
Baseline OLS,0.001,0.006
OLS with controls,-0.009,0.03
Lasso,0.015,0.01
Lasso using hdmy,-0.04,0.014
Post-Lasso,0.034,0.011
Post-Lasso using hdmy,-0.04,0.014
CV Elnet,-0.017,0.009
CV Ridge,0.001,0.011
Random Forest,-0.045,0.012
Best,0.006,0.007
