# HIGH-DIMENSIONAL METRICS IN JULIA

## 2. How to get started

In [1]:
import Pkg; using Pkg

In [251]:
# Pkg.rm("HDMjl")

In [18]:
# Pkg.add(url = "https://github.com/d2cml-ai/HDMjl.jl", rev = "prueba2")

In [20]:
Pkg.add(url = "https://github.com/d2cml-ai/HDMjl.jl")

    Updating git-repo `https://github.com/d2cml-ai/HDMjl.jl`
   Resolving package versions...
    Updating `C:\Users\User\.julia\environments\v1.8\Project.toml`
  [8de29b41] + HDMjl v0.0.11 `https://github.com/d2cml-ai/HDMjl.jl#main`
    Updating `C:\Users\User\.julia\environments\v1.8\Manifest.toml`
  [8de29b41] + HDMjl v0.0.11 `https://github.com/d2cml-ai/HDMjl.jl#main`
Precompiling project...
  ✓ HDMjl
  1 dependency successfully precompiled in 42 seconds. 297 already precompiled.
  1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version


In [2]:
using HDMjl, CodecXz, RData, DataFrames, StatsModels, Statistics, Distributions, PrettyTables, GLM, CSV, LinearAlgebra

## 3. Prediction using Approximate Sparsity

### 3.2. A Joint Significance Test for Lasso Regression.

Example. (Prediction Using Lasso and Post-Lasso) Consider generated data from a sparse linear model:

In [6]:
dta = get_data("seed_100")
n, p = size(dta);
Y = dta[:,1];
X = dta[:,2:end];

Next, we estimate the model, print the results, and make in-sample and out-of-sample predictions. We can use the functions r_print and r_summary to display the results; the option all can be set to false to limit the output to the non-zero coefficients.


In [10]:
lasso_reg = rlasso(X, Y, post = false);
sum_lasso = r_summary(lasso_reg)


    Post-Lasso Estimation: false
    Total number of variables: 100
    Number of selected variables: 11
    ---
     
  Variable    Estimate
  Intercept   0.056855
  X2          4.77121
  X3          4.69284
  X4          4.76568
  X14         -0.0453685
  X16         -0.0467382
  X17         -0.00499617
  X20         -0.0922336
  X23         -0.0272553
  X41         -0.0105032
  X62         0.113585
  X101        -0.0247296

    ----
    Multiple R-squared: 0.9912720815874809
    Adjusted R-squared: 0.9901810917859161
    

In [15]:
new_dta = get_data("seed_200")
Xnew = new_dta[:, Not(1)]
Ynew = new_dta[:, 1]
yhat_lasso_new = r_predict(lasso_reg, xnew = Matrix(Xnew))
post_lasso_reg = rlasso(X, Y, post = true)
y_hat_postlasso = r_predict(post_lasso_reg, xnew = Matrix(Xnew))
r_summary(post_lasso_reg)


    Post-Lasso Estimation: true
    Total number of variables: 100
    Number of selected variables: 3
    ---
     
  Variable    Estimate
  Intercept   0.0341043
  X2          4.92413
  X3          4.85787
  X4          4.96442

    ----
    Multiple R-squared: 0.9906284190077158
    Adjusted R-squared: 0.990335557101707
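
The difference between the two fits above is that Post-Lasso discards the shrunken Lasso coefficients and refits ordinary least squares on the selected variables only. The following is a minimal sketch of that refitting step on synthetic, noiseless data. It is not the HDMjl implementation: `beta_lasso` is a hypothetical stand-in for a Lasso estimate, and all names are chosen for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n, p = 50, 10
X = randn(n, p)
beta_true = zeros(p); beta_true[1:3] .= 5.0
y = X * beta_true                      # noiseless, so the refit is exact

# Hypothetical Lasso output: correct support, coefficients shrunk towards zero.
beta_lasso = 0.9 .* beta_true

# Post-Lasso: ordinary least squares on the selected columns only.
support = findall(!iszero, beta_lasso)
beta_post = zeros(p)
beta_post[support] = X[:, support] \ y
```

In this toy setting the refit undoes the shrinkage and recovers the coefficients on the selected columns, which mirrors the larger Post-Lasso estimates for X2, X3 and X4 in the summaries above.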
    

In [16]:
yhat_postlasso = r_predict(post_lasso_reg) # in-sample prediction
yhat_postlasso_new = r_predict(post_lasso_reg, xnew = Matrix(Xnew)) # out-of-sample prediction
;

In [17]:
MAE = mean(eachrow(hcat(abs.(Ynew - yhat_lasso_new), abs.(Ynew - yhat_postlasso_new))))
MAE = DataFrame([[MAE[1]], [MAE[2]]], :auto)
MAE = rename!(MAE, ["lasso MAE", "Post-lasso MAE"])
pretty_table(MAE, tf = tf_simple, nosubheader = true)

  lasso MAE   Post-lasso MAE
   0.879583          0.78017
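
The mean absolute error reported above is the average of the absolute prediction errors over the new sample. A minimal sketch with made-up numbers (the vectors `y_obs` and `y_pred` are hypothetical and not taken from the data above):

```julia
using Statistics

# Mean absolute error between observed and predicted values.
mae(y, yhat) = mean(abs.(y .- yhat))

# Hypothetical values for illustration only.
y_obs  = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.0, 4.5]

mae_value = mae(y_obs, y_pred)   # (0.5 + 0.0 + 1.0 + 0.5) / 4
```

With the objects from the cells above, the same helper could be applied as `mae(Ynew, yhat_lasso_new)` and `mae(Ynew, yhat_postlasso_new)`.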


## 4. Inference on Target Regression Coefficients

### 4.1. Intuition for the Orthogonality Principle in Linear Models via Partialling Out.

In [18]:
dta = get_data("seed_300")
n, p = size(dta);
y = dta[:,"y"];
d = dta[:,"d"];
x = dta[:,3:end];

We can estimate $\alpha_0$ by running full least squares:

In [19]:
full_fit = lm(hcat(ones(length(y)), Matrix(dta[:,2:end])), y);
DataFrame(
    Estimate = coef(full_fit)[2], 
    Std_Error = stderror(full_fit)[2])

 Row  Estimate  Std_Error
      Float64   Float64
   1  0.978075  0.0137122


Another way to estimate $\alpha_0$ is to first partial out the x-variables from $y_i$ and $d_i$, and run least squares on the residuals:

In [20]:
rY_1 = lm(hcat(ones(length(y)), Matrix(dta[:,3:end])), y);
rY = y - predict(rY_1)
rD_1 = lm(hcat(ones(length(y)), Matrix(dta[:,3:end])), d);
rD = d - predict(rD_1);

In [21]:
partial_fit_ls = lm(hcat(ones(length(y)), rD), rY)
DataFrame(Estimate = coef(partial_fit_ls)[2], Std_Error = stderror(partial_fit_ls)[2])

 Row  Estimate  Std_Error
      Float64   Float64
   1  0.978075  0.0136862
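
The two point estimates agree because of the Frisch-Waugh-Lovell theorem: regressing the residualized outcome on the residualized treatment reproduces the coefficient from the full regression. The sketch below checks this identity on synthetic data using only the standard library; the data-generating process and all variable names are assumptions made for illustration.

```julia
using LinearAlgebra, Random

Random.seed!(2)
n = 200
x = randn(n, 5)                          # controls
d = x[:, 1] .+ randn(n)                  # treatment, correlated with the controls
y = 1.0 .* d .+ x * ones(5) .+ randn(n)  # outcome, true effect of d is 1

W = hcat(ones(n), x)                     # intercept plus controls

# Full regression of y on (d, intercept, controls); the first coefficient is alpha.
alpha_full = (hcat(d, W) \ y)[1]

# Partial out W from y and d, then regress residual on residual.
rY = y - W * (W \ y)
rD = d - W * (W \ d)
alpha_partial = dot(rD, rY) / dot(rD, rD)
```

The two numbers coincide up to floating-point error. The standard errors in the outputs above differ slightly because the residual regression does not account for the degrees of freedom used up in the partialling-out step.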


In high-dimensional settings, we can no longer rely on full least squares and may instead rely on
Lasso or Post-Lasso for partialling out:

In [22]:
rY_1 = rlasso(hcat(ones(length(y)), Matrix(dta[:,3:end])), y);
rY = rY_1["residuals"]
rD_1 = rlasso(hcat(ones(length(y)), Matrix(dta[:,3:end])), d);
rD = rD_1["residuals"]
partial_fit_postlasso = lm(hcat(ones(length(y)), rD), vec(rY))
DataFrame(Estimate = coef(partial_fit_postlasso)[2], Std_Error = stderror(partial_fit_postlasso)[2])

 Row  Estimate  Std_Error
      Float64   Float64
   1  0.972739  0.0136868


The orthogonal estimating equations method, based on partialling out via Lasso or Post-Lasso, is
implemented by the function rlassoEffect, using method = "partialling out":

In [23]:
Eff = rlassoEffect(x, y, d, method = "partialling out");
r_summary(Eff);

Estimates and significance testing of the effect of target variables
  Row   Estimate.   Std. Error    t value   Pr(>|t|)

    1     0.97274      0.01369   71.05478    0.0 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




Another orthogonal estimating equations method, based on the double selection of covariates, is implemented by the function rlassoEffect, using method = "double selection":

In [24]:
Eff = rlassoEffect(Matrix(x), y, d, method = "double selection");
r_summary(Eff);

Estimates and significance testing of the effect of target variables
  Row   Estimate.   Std. Error    t value   Pr(>|t|)

    1     0.97807      0.01416   69.07274    0.0 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


### 4.2. Inference: Confidence Intervals and Significance Testing. The function rlassoEffects

In [31]:
data = get_data("seed_400")
n, p = size(data);
y = data[:,1];
#d = dta[:,"d"];
x = data[:,2:end];

We can do inference on a set of variables of interest, e.g. the first, second, third, and the fiftieth:

In [32]:
lasso_effects = rlassoEffects(x, y, index = [1,2,3,50]);

In [34]:
r_print(lasso_effects, digits = 4)

Coefficients:

     X1          X2          X3          X50

   2.9445      3.0413      2.9754       0.072


In [36]:
r_summary(lasso_effects);

Estimates and significance testing of the effect of target variables
                    Estimate.   Std. Error    t value   Pr(>|t|)

   X1     2.94448     2.94448      0.08815   33.40306    0.0 ***
   X2     3.04127     3.04127      0.08389   36.25307    0.0 ***
   X3      2.9754      2.9754      0.07804    38.1266    0.0 ***
  X50   0.0719553     0.07196      0.07765    0.92672   0.35407
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1




In [37]:
r_confint(lasso_effects);

              2.5%      97.5%

   X1      2.77171    3.11724
   X2      2.87685     3.2057
   X3      2.82245    3.12836
  X50   -0.0802271   0.224138
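
The pointwise bounds above are consistent with the usual normal-approximation interval, estimate ± 1.96 × standard error. The sketch below reproduces the X1 row from the estimate and standard error printed by r_summary. That r_confint uses exactly this formula internally is an assumption; the check only shows that the numbers agree up to rounding.

```julia
# Estimate and standard error of X1 as printed by r_summary above.
estimate  = 2.94448
std_error = 0.08815

# 97.5% quantile of the standard normal distribution, hard-coded here;
# quantile(Normal(), 0.975) from Distributions gives the same number.
z975 = 1.959964

lower = estimate - z975 * std_error
upper = estimate + z975 * std_error
```

Both bounds match the X1 row of the table (2.77171 and 3.11724) to about four decimal places; the small remaining gap comes from using the rounded inputs.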


Joint confidence intervals are requested with the option joint = true. We will also demonstrate them in an empirical application in
the next section.

In [39]:
r_confint(lasso_effects, joint = true);

             2.5%      97.5%

   X1     2.72666    3.16229
   X2     2.83591    3.24664
   X3     2.78218    3.16863
  X50   -0.116561   0.260472
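
Joint intervals are wider than pointwise ones because they must cover all selected coefficients simultaneously. One way to see this is to back out the critical value implied by each interval, that is, its half-width divided by the standard error. The sketch below does this for X1 using only numbers printed above; it says nothing about how r_confint obtains the joint critical value.

```julia
# Joint and pointwise bounds for X1, copied from the two tables above.
joint_lower, joint_upper = 2.72666, 3.16229
point_lower, point_upper = 2.77171, 3.11724
std_error = 0.08815

# Half-width divided by the standard error gives the implied critical value.
crit_joint = (joint_upper - joint_lower) / (2 * std_error)
crit_point = (point_upper - point_lower) / (2 * std_error)
```

For X1 the implied joint value is about 2.47, against about 1.96 pointwise. Repeating the calculation for the other rows gives similar values, between roughly 2.43 and 2.48.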


### 4.3. Application: the effect of gender on wage

In [111]:
using  StatsModels, StatsBase, Combinatorics

In [89]:
function data_formula(frml::FormulaTerm, data::DataFrame)
    form = apply_schema(frml, schema(frml, data));
    res, pred = modelcols(form, data);

    coef_names = string.(coefnames(form.rhs));

    # print(coef_names, "\n")
    if length(coef_names) < 2
        x_df = DataFrame(coef_names = pred)
    else
        x_df = DataFrame(pred, :auto)
        x_df = rename!(x_df, coef_names)
    end
    return (x_df)
    # return coef_names
end

data_formula (generic function with 1 method)

In [113]:
cps2012 = get_data("cps2012")
n, p = size(cps2012);
size(cps2012)
x_formula1 = @formula(lnw ~ -1 + female + female & (widowed + divorced + separated + nevermarried +
                        hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3))
x1 = data_formula(x_formula1, cps2012);
;

In [125]:
x_formula2 = @formula(lnw ~ -1 + widowed + divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
                we + exp1 + exp2 + exp3)
poly = 2

2

In [161]:
coef_names

15-element Vector{String}:
 "widowed"
 "divorced"
 "separated"
 "nevermarried"
 "hsd08"
 "hsd911"
 "hsg"
 "cg"
 "ad"
 "mw"
 "so"
 "we"
 "exp1"
 "exp2"
 "exp3"

In [114]:
frml = x_formula2
form = apply_schema(frml, schema(frml, data));
res, pred = modelcols(form, data);

coef_names = string.(coefnames(form.rhs));

In [164]:
# fom_1 =     ["widowed", "divorced", "separated", "nevermarried", "hsd08", "hsd911", "hsg", "cg", "ad", "mw", "so",
#             "we", "exp1", "exp2", "exp3"];
frml = x_formula2
data = cps2012
form = apply_schema(frml, schema(frml, data));
res, pred = modelcols(form, data);
coef_names = string.(coefnames(form.rhs));
data = cps2012[:,coef_names];
# sub_data = ones(size(data)[1])
sub_data = Matrix(copy(data))

for i in 1:size(data)[2]
    if i <= (size(data)[2] -1)
        sub_data = hcat(sub_data, Matrix(data[:, i] .* data[:, Not(1:i)]) )
    end
end
;

In [190]:
poly = 2

2

In [209]:
size(sub_data)

(29217, 120)
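
The column count above can be checked by hand: 15 base variables plus all 15 × 14 / 2 = 105 pairwise products give 120 columns. The sketch below builds the same kind of matrix for a plain numeric matrix. The function name `with_pairwise_products` and the toy input are hypothetical, and the sketch assumes the intent of the loop is to append every product of two distinct columns.

```julia
# Append all pairwise products A[:, i] .* A[:, j] with i < j to the columns of A.
function with_pairwise_products(A::AbstractMatrix)
    p = size(A, 2)
    blocks = [A[:, i] .* A[:, (i + 1):p] for i in 1:(p - 1)]
    return hcat(A, blocks...)
end

A = reshape(collect(1.0:45.0), 3, 15)   # hypothetical 3 x 15 matrix
B = with_pairwise_products(A)
size(B)                                  # (3, 120): 15 + 15 * 14 / 2 columns
```

Collecting the blocks first and concatenating once avoids growing the matrix inside the loop, which matters for a data set with 29217 rows.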

In [208]:
# sub_data = ones(size(data)[1])
sub_data = Matrix(copy(data))
for i in 1:size(data)[2]
    if i <= (size(data)[2] -1)
        sub_data = hcat(sub_data, Matrix(data[:, i] .* data[:, Not(poly-1:i)]) )
    end
end

In [193]:
(data[:,1] .* data[:,2]) .* data[:, Not(poly-1:i)]

 Row  divorced  separated  nevermarried  hsd08    hsd911   hsg      cg       ad
      Float64   Float64    Float64       Float64  Float64  Float64  Float64  Float64
   1  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   2  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   3  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   4  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   5  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   6  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   7  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   8  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
   9  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0
  10  0.0       0.0        0.0           0.0      0.0      0.0      0.0      0.0


In [212]:
pension = get_data("pension")
y = pension[:, "tw"];
d = pension[:, "p401"];
z = pension[:, "e401"];
X = pension[:, ["i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown"]]
# x = Matrix(X[:, :])
d = Matrix(d[:, :])
y = Matrix(y[:, :])
z = Matrix(z[:, :])
n = size(x, 1)
p = size(x, 2)
r_summary(rlassoLATET(X, d, y, z))
r_summary(rlassoATET(X, d, y))

[-4454.977853570672, -9223.558412198592, -5277.834766728774, 0.0, 0.0, 25854.22562316646, 108296.14881737166, -4555.615957146337, 0.0, 19607.98405807405, 50784.21125774554, 0.0, 0.0, 0.0, 8404.441042092554, 0.0, 0.0, 0.0, 60009.79465912404, 46326.655580216255]

(9915, 19)


    Estimation and significance tesing of the treatment effect
    Type: LATET
    Bootstrap: none
    
    Coeff        SE   t.value
  15323.2   3645.28   4.20357


[-4050.1778925613617, -9248.63779332838, -5843.539731659136, 0.0, 0.0, 26740.944218515244, 108266.40971808617, -4621.818810182787, 0.0, 20010.66656673543, 53255.67972386175, 0.0, 0.0, 0.0, 7508.6444769095915, 0.0, 0.0, 0.0, 58718.49607815778, 45320.14313429714]

(9915, 19)

    Estimation and significance tesing of the treatment effect
    Type: ATET
    Bootstrap: none
    
    Coeff        SE   t.value
  12628.5   2944.43   4.28893


1×3 Matrix{Float64}:
 12628.5  2944.43  4.28893

In [9]:
x_formula = @formula(lnw ~ -1 + female + female*widowed + female*divorced + female*separated + female*nevermarried +
                    female*hsd08 + female*hsd911 + female*hsg + female*cg + female*ad + female*mw + female*so + female*we + female*exp1 + female*exp2 + female*exp3)
x_dframe = ModelFrame( x_formula, cps2012)
x1 = ModelMatrix(x_dframe)
X = x1.m[:,Not(1:16)];
X = hcat(x1.m[:,1:16], X)
size(X)
fom_1 =     ["widowed", "divorced", "separated", "nevermarried", "hsd08", "hsd911", "hsg", "cg", "ad", "mw", "so",
            "we", "exp1", "exp2", "exp3"];
data = cps2012[:,fom_1];
sub_data = ones(size(data)[1])
for i in 1:size(data)[2]
    if i <= (size(data)[2] -1)
        sub_data = hcat(sub_data, Matrix(data[:, i] .* data[:, Not(1:i)]) )
    end
end
sub_data = sub_data[:,2:end]
size(sub_data)
x = hcat(X, sub_data)
size(x)
filter = var.(eachcol(x)) .!= 0
x = x[:,filter]
print(size(x))
index_gender = [1,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31];
y = cps2012.lnw;
;

(29217, 116)

The estimates of the target parameters, i.e. all coefficients related to gender (including its
interactions with other variables), are calculated and summarized by the following commands:

In [10]:
effects_female = rlassoEffects(x, y, index = index_gender);

In [12]:
r_summary(effects_female);

Estimates and significance testing of the effect of target variables
          Estimate.   Std. Error      t value      Pr(>|t|)

   X1     -0.154923    0.0501624     -3.08843    0.00201216
  X17      0.136095    0.0906626      1.50112      0.133325
  X18      0.136939    0.0221817      6.17353    6.6782e-10
  X19     0.0233028    0.0532118     0.437925      0.661441
  X20      0.186853    0.0199424      9.36966   7.27651e-21
  X21     0.0278103     0.120914         0.23      0.818092
  X22     -0.119335    0.0518797     -2.30023     0.0214354
  X23    -0.0128898    0.0192232    -0.670533      0.502518
  X24     0.0101386    0.0183265     0.553218      0.580114
  X25    -0.0304637    0.0218061     -1.39703      0.162405
  X26   -0.00106344    0.0191918   -0.0554112      0.955811
  X27   -0.00818334    0.0193568    -0.422763      0.672468
  X28   …

Finally, we estimate and plot confidence intervals, first "pointwise" and then the joint confidence intervals.

In [14]:
joint_CI = r_confint(effects_female, 0.95, joint = true);
joint_CI;

              2.5%        97.5%

   X1    -0.295562   -0.0142847
  X17    -0.136261     0.408452
  X18    0.0742004     0.199678
  X19    -0.118061     0.164666
  X20     0.128705     0.245002
  X21    -0.378365     0.433985
  X22    -0.270462    0.0317919
  X23   -0.0656411    0.0398615
  X24   -0.0421792    0.0624564
  X25   -0.0964645    0.0355371
  X26    -0.055223    0.0530961
  X27   -0.0632443    0.0468776
  X28   -0.0662182     0.057766
  X29   -0.0167011    0.0265717
  X30    -0.285341   -0.0336976
  X31    0.0166404    0.0602608


### 4.4. Application: Estimation of the treatment effect in a linear model with many confounding factors

First, we load and prepare the data:

In [40]:
GrowthData = get_data("GrowthData")
y = GrowthData[:, 1];
d = GrowthData[:, 3:3];
X = Matrix(GrowthData[:, Not(1, 2, 3)]);
X_1 = Matrix(GrowthData[:, Not(1, 2)]);

Now we can estimate the effect of the initial GDP level. First, we estimate by OLS:

In [41]:
# Design matrix: intercept, treatment (initial GDP) and controls
X_ols = hcat(ones(length(y)), X_1)
β = pinv(X_ols) * y   # OLS coefficients via the pseudo-inverse

res = y - X_ols * β;
n, k = size(X_ols)

# Homoskedastic variance estimate and standard errors
sigma2_hat = (res' * res) / (n - k)
vcov_beta_hat = sigma2_hat .* inv(X_ols' * X_ols);
se = sqrt.(diag(vcov_beta_hat))

ls_effect = DataFrame(Estimate = β, stderror = se);
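
In matrix form, with $X$ denoting the design matrix including the column of ones, the cell above computes the standard homoskedastic OLS quantities. The pseudo-inverse coincides with $(X'X)^{-1}X'$ when $X$ has full column rank:

$$
\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = \frac{\hat e'\hat e}{n-k}, \qquad \widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X'X)^{-1},
$$

where $\hat e = y - X\hat\beta$ is the residual vector, $n$ is the number of observations and $k$ is the number of columns of $X$. The reported standard errors are the square roots of the diagonal entries of $\widehat{\mathrm{Var}}(\hat\beta)$.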

Second, we estimate the effect by partialling out via Post-Lasso:

In [42]:
lasso_effect = rlassoEffect(X, y, d, method = "partialling out");
r_summary(lasso_effect);

Estimates and significance testing of the effect of target variables
  Row   Estimate.   Std. Error    t value      Pr(>|t|)

    1    -0.04981      0.01394   -3.57317   0.00035 ***
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Third, we estimate the effect by the double selection method:

In [43]:
doublesel_effect = rlassoEffect(X, y, d, method = "double selection");
r_summary(doublesel_effect);

Estimates and significance testing of the effect of target variables
             Estimate.   Std. Error    t value     Pr(>|t|)

  gdpsh465    -0.05001      0.01579   -3.16719   0.00154 **
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We then collect the results in a table:

In [46]:
table = zeros(3,2)
table[1,:] = [round(Matrix(ls_effect)[2,1], digits = 2), round.(Matrix(ls_effect)[2,2], digits = 5)]
table[2,:] = [round(lasso_effect.coefficients, digits =2), round(lasso_effect.se, digits = 5)]
table[3,:] = [round(doublesel_effect.coefficients, digits =2), round(doublesel_effect.se, digits = 5)];
index = ["full reg via ols", "partial reg via post-lasso", "partial reg via double selection"]
pretty_table(hcat(index, table), show_row_number = false, header = [" ", "Estimate", "Std. Error"], tf = tf_simple, nosubheader = true)

                                     Estimate   Std. Error
                  full reg via ols      -0.01      0.02989
        partial reg via post-lasso      -0.05      0.01394
  partial reg via double selection      -0.05      0.01579


## 5. Instrumental Variable Estimation in a High-Dimensional Setting

### 5.2. Application: Economic Development and Institutions.

First, we process the data:

In [55]:
AJR = get_data("AJR")
y = AJR[!,"GDP"]
d = AJR[!,2:2]
z = AJR[!,"logMort"];
x_formula = @formula(GDP ~ -1 + Latitude + Latitude2 + Africa + Asia + 
    Namer + Samer + Latitude*Latitude2 + Latitude*Africa + 
    Latitude*Asia + Latitude*Namer + Latitude*Samer + Latitude2*Africa +
    Latitude2*Asia + Latitude2*Namer + Latitude2*Samer + Africa*Asia 
    + Africa*Namer + Africa*Samer + Asia*Namer + Asia*Samer
    + Namer*Samer
    )
x_formula0 = apply_schema(x_formula, schema(AJR))
x_dframe = ModelFrame( x_formula, AJR)
x = DataFrame(ModelMatrix(x_dframe).m, :auto)
x = rename!(x, string.(coefnames(x_formula0)[2]))
size(x)

(64, 21)

Then we estimate an IV model with selection on the X:

In [48]:
AJR_Xselect  = rlassoIV(x, d, y, z, select_X=true, select_Z=false);
r_summary(AJR_Xselect);

Estimates and Significance Testing of the effect of target variables in the IV regression model
            coeff.       se.   t-value      p-value

  Exprop   0.84503   0.26993   3.13055   0.00174 **
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [51]:
r_confint(AJR_Xselect);

               2.5%     97.5%

  Exprop   0.315981   1.37407


It is interesting to understand what the procedure above is doing. In essence, it partials out $x_i$ from
$y_i$, $d_i$ and $z_i$ using Post-Lasso and applies 2SLS to the residual quantities.
Let us investigate partialling out in more detail in this example. We can first try to use OLS for
partialling out:


In [79]:
rY_1 = lm(hcat(ones(length(y)), Matrix(AJR[:,3:end])), y);
rY = y - predict(rY_1)
rD_1 = lm(hcat(ones(length(y)), Matrix(AJR[:,3:end])), vec(Matrix(d[:,:])));
rD = vec(Matrix(d[:,:])) - predict(rD_1);
rZ_1 = lm(hcat(ones(length(y)), Matrix(AJR[:,3:end])), z);
rZ = z - predict(rZ_1);

In [82]:
rY_1 = lm(@formula(GDP ~ Latitude + Latitude2 + Africa + Asia + Namer + Samer + Latitude*Latitude2 + Latitude*Africa + Latitude*Asia + Latitude*Namer + Latitude*Samer
          + Latitude2*Africa + Latitude2*Asia + Latitude2*Namer + Latitude2*Samer + Africa*Asia + Africa*Namer + Africa*Samer
          + Asia*Namer + Asia*Samer + Namer*Samer), AJR)
rY = y - predict(rY_1)

rD_1 = lm(@formula(Exprop ~ Latitude + Latitude2 + Africa + Asia + Namer + Samer + Latitude*Latitude2 + Latitude*Africa + Latitude*Asia + Latitude*Namer + Latitude*Samer
          + Latitude2*Africa + Latitude2*Asia + Latitude2*Namer + Latitude2*Samer + Africa*Asia + Africa*Namer + Africa*Samer
          + Asia*Namer + Asia*Samer + Namer*Samer), AJR)
rD = vec(Matrix(d[:,:])) - predict(rD_1)

rZ_1 = lm(@formula(logMort ~ Latitude + Latitude2 + Africa + Asia + Namer + Samer + Latitude*Latitude2 + Latitude*Africa + Latitude*Asia + Latitude*Namer + Latitude*Samer
          + Latitude2*Africa + Latitude2*Asia + Latitude2*Namer + Latitude2*Samer + Africa*Asia + Africa*Namer + Africa*Samer
          + Asia*Namer + Asia*Samer + Namer*Samer), AJR)
rZ = z - predict(rZ_1);

In [84]:
ivfit_lm = tsls(rD, rY, rZ, nothing, intercept=false)
DataFrame(Estimate = ivfit_lm["coefficients"][1,2], Std_Error = ivfit_lm["se"])

 Row  Estimate  Std_Error
      Float64   Float64
   1  1.26721   1.73054


We see that the estimates exhibit large standard errors. The imprecision is expected because the dimension
of x is quite large, comparable to the sample size.
Next, we replace the OLS operator by Post-Lasso for partialling out:

In [86]:
x_formula1 = @formula(GDP ~ Latitude + Latitude2 + Africa + Asia + Namer + Samer
    + Latitude*Latitude2 + Latitude*Africa + Latitude*Asia + Latitude*Namer + Latitude*Samer
    + Latitude2*Africa + Latitude2*Asia + Latitude2*Namer + Latitude2*Samer
    + Africa*Asia + Africa*Namer + Africa*Samer
    + Asia*Namer + Asia*Samer
    + Namer*Samer)
x_dframe1 = ModelFrame( x_formula, AJR)
x1_1 = ModelMatrix(x_dframe)
xx = x1_1.m;

In [89]:
rY_1 = rlasso(xx, y);
rY = rY_1["residuals"]
rD_1 = rlasso(xx, d);
rD = rD_1["residuals"]
rZ_1 = rlasso(xx, z);
rZ = rZ_1["residuals"]

ivfit_lasso = tsls(rD, rY, rZ)
DataFrame(Estimate = ivfit_lasso["coefficients"][1,2], Std_Error = ivfit_lasso["se"][1])

 Row  Estimate  Std_Error
      Float64   Float64
   1  0.845027  0.272094
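
The partial-out-then-2SLS logic used in this section can be checked on synthetic data. With a single instrument and OLS as the partialling-out operator, IV on the residuals gives exactly the same coefficient as one-step IV with the controls included. The sketch below uses only the standard library; the data-generating process and the helper `residualize` are assumptions for illustration and do not call `tsls` or `rlassoIV`.

```julia
using LinearAlgebra, Random

Random.seed!(3)
n = 500
x = randn(n, 4)                               # controls
z = x[:, 1] .+ randn(n)                       # instrument
u = randn(n)                                  # unobserved confounder
d = 0.8 .* z .+ x[:, 2] .+ u .+ randn(n)      # endogenous treatment
y = 1.0 .* d .+ x * ones(4) .+ u .+ randn(n)  # outcome, true effect is 1

W = hcat(ones(n), x)                          # intercept plus controls
residualize(v) = v - W * (W \ v)              # OLS partialling out

rY, rD, rZ = residualize(y), residualize(d), residualize(z)

# Just-identified IV on the residuals.
alpha_partial = dot(rZ, rY) / dot(rZ, rD)

# Same model in one step: instruments (z, W) for regressors (d, W).
alpha_full = ((hcat(z, W)' * hcat(d, W)) \ (hcat(z, W)' * y))[1]
```

Replacing the OLS operator in `residualize` with a Lasso or Post-Lasso fit breaks the exact equality but keeps the same structure, which is what the cells above do with `rlasso`.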


### 5.3. Application: Impact of Eminent Domain Decisions on Economic Outcomes.

First, we load the data and construct the matrices with the controls (x), instruments (z), outcome (y),
and treatment variables (d). Here we consider regional GDP as the outcome variable.

In [90]:
EminentDomain = get_data("EminentDomain")
z = EminentDomain["logGDP"]["z"];
x = EminentDomain["logGDP"]["x"];
d = EminentDomain["logGDP"]["d"];
y = EminentDomain["logGDP"]["y"];
x = x[:, (mean(x, dims = 1) .> 0.05)'];
z = z[:, (mean(z, dims = 1) .> 0.05)'];

As mentioned above, y is the economic outcome, the logarithm of the GDP; d is the number of
pro-plaintiff appellate takings decisions in federal circuit court c and year t; x is a matrix of control
variables; and z is the matrix of instruments. Here we consider socio-economic and demographic
characteristics of the judges as instruments.
First, we estimate the effect of the treatment variable by simple OLS and 2SLS using two instruments:

In [91]:
ED_ols = lm(hcat(ones(length(vec(y))), hcat(d, x)), vec(y));
ED_2sls = tsls(d, y, z[:,1:2], x, intercept = false);

Next, we estimate the model with selection on the instruments.


In [92]:
lasso_IV_Z = rlassoIV(x, d, y, z, select_X = false, select_Z = true);

In [93]:
r_summary(lasso_IV_Z);

Estimates and Significance Testing of the effect of target variables in the IV regression model
       coeff.       se.   t-value    p-value

  d1   0.4146   0.29025   1.42842   0.15317
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [96]:
r_confint(lasso_IV_Z);

            2.5%     97.5%

  d1   -0.154276   0.98348


Finally, we do selection on both the x and z variables.

In [97]:
lasso_IV_XZ = rlassoIV(x, d, y, z, select_X = true, select_Z = true);
r_summary(lasso_IV_XZ);

Estimates and Significance Testing of the effect of target variables in the IV regression model
         coeff.       se.    t-value    p-value

  d1   -0.02383   0.12851   -0.18543   0.85289
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


In [98]:
r_confint(lasso_IV_XZ);

            2.5%      97.5%

  d1   -0.275703   0.228033


Finally, we compare all results:

In [100]:
table = zeros(4,2)
table[1,:] = [GLM.coef(ED_ols)[2], stderror(ED_ols)[2]]
table[2,:] = [ED_2sls["coefficients"][1,2], ED_2sls["se"][1]]
table[3,:] = Matrix(r_summary(lasso_IV_Z)[:,2:3]);
table[4, :] = Matrix(r_summary(lasso_IV_XZ)[:, 2:3]);
index = ["ols regression", "IV estimation ", "selection on Z", "selection on X and Z"]
pretty_table(hcat(index, table), show_row_number = false, header = [" ", "Estimate", "Std. Error"], tf = tf_simple, nosubheader = true)

Estimates and Significance Testing of the effect of target variables in the IV regression model
       coeff.       se.   t-value    p-value

  d1   0.4146   0.29025   1.42842   0.15317
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Estimates and Significance Testing of the effect of target variables in the IV regression model
         coeff.       se.    t-value    p-value

  d1   -0.02383   0.12851   -0.18543   0.85289
---
Signif. codes:
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                           Estimate   Std. Error
        ols regression   0.00786473   0.00986593
        IV estimation    -0.0107097    0.0337652
        selection on Z       0.4146      0.29025
  selection on X and Z     -0.02383      0.12851


## 6. Inference on Treatment Effects in a High-Dimensional Setting

### 6.3. Application: 401(k) plan participation.

Again, we start first with the data preparation:

In [213]:
pension = get_data("pension")
y = pension[:, "tw"];
d = pension[:, "p401"];
z = pension[:, "e401"];
X = pension[:, ["i2", "i3", "i4", "i5", "i6", "i7", "a2", "a3", "a4", "a5", "fsize", "hs", "smcol", "col", "marr", "twoearn", "db", "pira", "hown"]];

Now we can compute the estimates of the target treatment effect parameters. For ATE and ATET we
report the effect of eligibility for 401(k).

In [237]:
pension_ate = rlassoATE(X, d, y);
r_summary(pension_ate);


    Estimation and significance tesing of the treatment effect
    Type: ATE
    Bootstrap: none
    
    Coeff        SE   t.value
  10180.1   1930.68    5.2728


In [238]:
pension_atet = rlassoATET(X, d, y);
r_summary(pension_atet);


    Estimation and significance tesing of the treatment effect
    Type: ATET
    Bootstrap: none
    
    Coeff        SE   t.value
  12628.5   2944.43   4.28893


For LATE and LATET we estimate the effect of 401(k) participation (d) with plan eligibility (z) as
instrument.

In [239]:
pension_late = rlassoLATE(X, d, y, z);
r_summary(pension_late);


    Estimation and significance tesing of the treatment effect
    Type: LATE
    Bootstrap: none
    
    Coeff       SE   t.value
  12992.1   2326.9   5.58344


In [240]:
pension_latet = rlassoLATET(X, d, y, z);
r_summary(pension_latet);


    Estimation and significance tesing of the treatment effect
    Type: LATET
    Bootstrap: none
    
    Coeff        SE   t.value
  15323.2   3645.28   4.20357



In [244]:
using PrettyTables
table = zeros(4,2)
table[1,:] = round.(vec(r_summary(pension_ate)[:, 1:2]), digits = 2);
table[2,:] = round.(vec(r_summary(pension_atet)[:, 1:2]), digits = 2);
table[3,:] = round.(vec(r_summary(pension_late)[:, 1:2]), digits = 2);
table[4,:] = round.(vec(r_summary(pension_latet)[:, 1:2]), digits = 2);
index = ["ATE", "ATET ", "LATE", "LATET"];


    Estimation and significance tesing of the treatment effect
    Type: ATE
    Bootstrap: none
    
    Coeff        SE   t.value
  10180.1   1930.68    5.2728

    Estimation and significance tesing of the treatment effect
    Type: ATET
    Bootstrap: none
    
    Coeff        SE   t.value
  12628.5   2944.43   4.28893

    Estimation and significance tesing of the treatment effect
    Type: LATE
    Bootstrap: none
    
    Coeff       SE   t.value
  12992.1   2326.9   5.58344

    Estimation and significance tesing of the treatment effect
    Type: LATET
    Bootstrap: none
    
    Coeff        SE   t.value
  15323.2   3645.28   4.20357


In [245]:
pretty_table(hcat(index, table), show_row_number = false, 
            header = [" ", "Estimate", "Std. Error"], tf = tf_simple, nosubheader = true)

          Estimate   Std. Error
    ATE    10180.1      1930.68
  ATET     12628.5      2944.43
   LATE    12992.1       2326.9
  LATET    15323.2      3645.28


Finally, we estimate a model including all interaction effects:

In [246]:
pension_ate = rlassoATE(X, z, y);
pension_atet = rlassoATET(X, z, y);
pension_late = rlassoLATE(X, d, y, z);
pension_latet = rlassoLATET(X, d, y, z);

In [249]:
table = zeros(4, 2)
table[1,:] = r_summary(pension_ate)[:, 1:2]
table[2,:] = r_summary(pension_atet)[:, 1:2]
table[3,:] = r_summary(pension_late)[:, 1:2]
table[4,:] = r_summary(pension_latet)[:, 1:2];


    Estimation and significance tesing of the treatment effect
    Type: ATE
    Bootstrap: none
    
    Coeff        SE   t.value
  8491.99   1902.92    4.4626

    Estimation and significance tesing of the treatment effect
    Type: ATET
    Bootstrap: none
    
    Coeff        SE   t.value
  10795.3   2568.13   4.20357

    Estimation and significance tesing of the treatment effect
    Type: LATE
    Bootstrap: none
    
    Coeff       SE   t.value
  12992.1   2326.9   5.58344

    Estimation and significance tesing of the treatment effect
    Type: LATET
    Bootstrap: none
    
    Coeff        SE   t.value
  15323.2   3645.28   4.20357


In [250]:
index = ["ATE", "ATET ", "LATE", "LATET"]
pretty_table(hcat(index, table), show_row_number = false, 
            header = [" ", "Estimate", "Std. Error"], tf = tf_simple, nosubheader = true)

          Estimate   Std. Error
    ATE    8491.99      1902.92
  ATET     10795.3      2568.13
   LATE    12992.1       2326.9
  LATET    15323.2      3645.28


## 7. The Lasso Methods for Discovery of Significant Causes amongst Many Potential Causes, with Many Controls


In [84]:
data = get_data("seed_500")
n, p = size(data);
p1 = 20;
X = data[:,2:end]
Y = data[:,1];

In [85]:
r_confint(rlassoEffects(Matrix(X), Y, index = [1:p1;]), joint = true);

               2.5%       97.5%

   V 1      4.50639     5.22251
   V 2     -0.32155    0.312205
   V 3    -0.358732    0.193109
   V 4    -0.260592    0.293738
   V 5    -0.283061    0.282752
   V 6    -0.328685    0.301444
   V 7    -0.232431    0.307122
   V 8   -0.0534616    0.479771
   V 9    -0.193325    0.396997
  V 10    -0.243113    0.269989
  V 11    -0.320854    0.215603
  V 12     -0.31593    0.272461
  V 13    -0.180614    0.383283
  V 14    -0.331885    0.393743
  V 15    -0.329421     0.32057
  V 16    -0.271937    0.337993
  V 17    -0.186196     0.42395
  V 18    -0.374205   0.0518391
  V 19    -0.113184    0.399561
  V 20    -0.221241    0.260961
